Attention

1. Self-Attention

A self-attention module takes a sequence of vectors as input and outputs one vector per input vector, each computed by attending over the whole sequence.

1.1 Vector perspective:

The following shows how to calculate $\boldsymbol{b}^1$, given $[\boldsymbol{a}^1,\boldsymbol{a}^2,\boldsymbol{a}^3,\boldsymbol{a}^4]$. For $\boldsymbol{b}^2,\boldsymbol{b}^3,\boldsymbol{b}^4$, the computation is similar.

Step 1: attention score
Step 2: normalization of the attention scores. Softmax is one option; another option is ReLU.
Step 3: weighted sum of values (a sketch of these three steps follows)
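
A minimal numpy sketch of the three steps for a single output $\boldsymbol{b}^1$, using dot-product attention scores. The concrete sizes, the random weights, and the variable names are illustrative assumptions, not values from these notes:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions): n = 4 input vectors of dimension r = 8,
# query/key dimension 6, value dimension 5.
r, n, q_dim, v_dim = 8, 4, 6, 5
a = [rng.standard_normal(r) for _ in range(n)]   # a^1 ... a^4
Wq = rng.standard_normal((q_dim, r))
Wk = rng.standard_normal((q_dim, r))
Wv = rng.standard_normal((v_dim, r))

# Step 1: attention scores of a^1 against every a^i (dot-product attention)
q1 = Wq @ a[0]
alpha = np.array([q1 @ (Wk @ a_i) for a_i in a])

# Step 2: normalize the scores (softmax here; ReLU is another option)
alpha_prime = np.exp(alpha - alpha.max())
alpha_prime /= alpha_prime.sum()

# Step 3: weighted sum of values gives b^1
b1 = sum(w * (Wv @ a_i) for w, a_i in zip(alpha_prime, a))
print(b1.shape)   # (5,), i.e. the value dimension
```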

1.2 Matrix perspective (parallel computing):

The following shows how to calculate $[\boldsymbol{b}^1,\boldsymbol{b}^2,\boldsymbol{b}^3,\boldsymbol{b}^4]$, given $[\boldsymbol{a}^1,\boldsymbol{a}^2,\boldsymbol{a}^3,\boldsymbol{a}^4]$.

Step 1: generate $Q$, $K$, $V$
Step 2: generate the attention score $A$ and the normalized attention score $A'$
Step 3: weighted sum of values
Summary of the matrix operations (a numpy sketch follows this list):
  • Input matrix $I=[\boldsymbol{a}^1,\boldsymbol{a}^2,\boldsymbol{a}^3,\boldsymbol{a}^4]$ represents a list of vectors for a sentence. Each vector $\boldsymbol{a}^i$ represents a word vector.

  • Output matrix $O=[\boldsymbol{b}^1,\boldsymbol{b}^2,\boldsymbol{b}^3,\boldsymbol{b}^4]$ corresponds to the input matrix.

  • Dimension:

    • $Q = W^q I$, where $I: r\times n$, $W^q: q \times r$, $Q: q \times n$

    • $K = W^k I$, where $I: r\times n$, $W^k: q \times r$, $K: q \times n$

    • $V = W^v I$, where $I: r\times n$, $W^v: v \times r$, $V: v \times n$

    • $A = K^{\top} Q$, where $Q: q\times n$, $K: q \times n$, $A$ (and $A'$): $n \times n$

    • $O = V A'$, where $A$ (and $A'$): $n \times n$, $V: v \times n$, $O: v \times n$

    • In other words: $I: r\times n \rightarrow O: v \times n$

  • Why "self-attention"? Because all the operations happen among the different $\boldsymbol{a}^i$s within $I$.

  • Only the parameters $W^q, W^k, W^v$ need to be estimated in the neural network.
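
The following is a minimal numpy sketch of the matrix form, following the dimensions and equations listed above. The concrete sizes and random weights are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Dimensions as in the summary above (concrete values are assumptions)
r, n, q_dim, v_dim = 8, 4, 6, 5
I = rng.standard_normal((r, n))      # columns are a^1 ... a^4

Wq = rng.standard_normal((q_dim, r))
Wk = rng.standard_normal((q_dim, r))
Wv = rng.standard_normal((v_dim, r))

# Step 1: generate Q, K, V
Q = Wq @ I                           # q x n
K = Wk @ I                           # q x n
V = Wv @ I                           # v x n

# Step 2: attention score A and its column-wise softmax A'
A = K.T @ Q                          # n x n
E = np.exp(A - A.max(axis=0, keepdims=True))
A_prime = E / E.sum(axis=0, keepdims=True)

# Step 3: weighted sum of values
O = V @ A_prime                      # v x n; columns are b^1 ... b^4
print(O.shape)                       # (5, 4), i.e. v x n
```

Because all outputs come from the same three matrix multiplications, every position is computed at once; this is the parallelism the matrix perspective buys.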

2. Multi-Head Self-Attention

This means that multiple sets of self-attention weights, or heads, are learned in parallel, independently of each other. The number of attention heads in an attention layer varies from model to model, but numbers in the range of 12 to 100 are common.

The intuition is that each self-attention head learns a different aspect of language. For example, one head may capture the relationship between the people entities in a sentence, another may focus on the activity of the sentence, and yet another may focus on other properties, such as whether the words rhyme. A minimal code sketch of multi-head attention follows the list below.

  • One word can have multiple meanings; such words are homonyms.

  • Words within a sentence structure can be ambiguous or have what we might call syntactic ambiguity.
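
Below is a minimal sketch of multi-head self-attention in numpy, assuming $h$ independent heads whose outputs are concatenated and then mapped back to the input dimension by an output projection; the head count, per-head dimension, and the projection $W^o$ are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

r, n = 8, 4                       # input dimension and sequence length (assumptions)
h, d = 2, 4                       # number of heads and per-head dimension (assumptions)
I = rng.standard_normal((r, n))   # columns are the input word vectors

def softmax_cols(A):
    """Softmax over each column (numerically stabilized)."""
    E = np.exp(A - A.max(axis=0, keepdims=True))
    return E / E.sum(axis=0, keepdims=True)

heads = []
for _ in range(h):
    # Each head learns its own W^q, W^k, W^v, independently of the other heads
    Wq, Wk, Wv = (rng.standard_normal((d, r)) for _ in range(3))
    Q, K, V = Wq @ I, Wk @ I, Wv @ I
    heads.append(V @ softmax_cols(K.T @ Q))   # d x n output per head

# Concatenate the per-head outputs; a final projection W^o maps back to dimension r
O = np.concatenate(heads, axis=0)             # (h*d) x n
Wo = rng.standard_normal((r, h * d))
print((Wo @ O).shape)                         # (8, 4)
```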
