Attention

1. Self-Attention

A self-attention module takes a sequence of vectors as input and outputs one vector per input vector, each computed by attending over the whole sequence.

1.1 Vector perspective:

The following shows how to calculate $\boldsymbol{b}^1$, given $[\boldsymbol{a}^1,\boldsymbol{a}^2,\boldsymbol{a}^3,\boldsymbol{a}^4]$. For $\boldsymbol{b}^2,\boldsymbol{b}^3,\boldsymbol{b}^4$, the computation is similar.

Step 1: attention score
Step 2: normalization of the attention scores. Softmax is one option; another option is ReLU.
Step 3: weighted sum of values (a sketch of these three steps follows)
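
A minimal numpy sketch of the three steps for a single output $\boldsymbol{b}^1$, using dot-product attention scores. The concrete sizes, the random weights, and the variable names are illustrative assumptions, not values from these notes:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions): n = 4 input vectors of dimension r = 8,
# query/key dimension 6, value dimension 5.
r, n, q_dim, v_dim = 8, 4, 6, 5
a = [rng.standard_normal(r) for _ in range(n)]   # a^1 ... a^4
Wq = rng.standard_normal((q_dim, r))
Wk = rng.standard_normal((q_dim, r))
Wv = rng.standard_normal((v_dim, r))

# Step 1: attention scores of a^1 against every a^i (dot-product attention)
q1 = Wq @ a[0]
alpha = np.array([q1 @ (Wk @ a_i) for a_i in a])

# Step 2: normalize the scores (softmax here; ReLU is another option)
alpha_prime = np.exp(alpha - alpha.max())
alpha_prime /= alpha_prime.sum()

# Step 3: weighted sum of values gives b^1
b1 = sum(w * (Wv @ a_i) for w, a_i in zip(alpha_prime, a))
print(b1.shape)   # (5,), i.e. the value dimension
```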

1.2 Matrix perspective (parallel computing):

The following shows how to calculate $[\boldsymbol{b}^1,\boldsymbol{b}^2,\boldsymbol{b}^3,\boldsymbol{b}^4]$, given $[\boldsymbol{a}^1,\boldsymbol{a}^2,\boldsymbol{a}^3,\boldsymbol{a}^4]$.

Step 1: generate $Q$, $K$, $V$
Step 2: generate the attention score $A$ and the normalized attention score $A'$
Step 3: weighted sum of values
Summary of the matrix operations (a numpy sketch follows this list):
  • Input matrix $I=[\boldsymbol{a}^1,\boldsymbol{a}^2,\boldsymbol{a}^3,\boldsymbol{a}^4]$ represents a list of vectors for a sentence. Each vector $\boldsymbol{a}^i$ represents a word vector.

  • Output matrix $O=[\boldsymbol{b}^1,\boldsymbol{b}^2,\boldsymbol{b}^3,\boldsymbol{b}^4]$ corresponds to the input matrix.

  • Dimension:

    • $Q = W^q I$, where $I: r\times n$, $W^q: q \times r$, $Q: q \times n$

    • $K = W^k I$, where $I: r\times n$, $W^k: q \times r$, $K: q \times n$

    • $V = W^v I$, where $I: r\times n$, $W^v: v \times r$, $V: v \times n$

    • $A = K^{\top} Q$, where $Q: q\times n$, $K: q \times n$, $A$ (and $A'$): $n \times n$

    • $O = V A'$, where $A$ (and $A'$): $n \times n$, $V: v \times n$, $O: v \times n$

    • In other words: $I: r\times n \rightarrow O: v \times n$

  • Why "self-attention"? Because all the operations happen among the different $\boldsymbol{a}^i$s within $I$.

  • Only the parameters $W^q, W^k, W^v$ need to be estimated in the neural network.
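
The following is a minimal numpy sketch of the matrix form, following the dimensions and equations listed above. The concrete sizes and random weights are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Dimensions as in the summary above (concrete values are assumptions)
r, n, q_dim, v_dim = 8, 4, 6, 5
I = rng.standard_normal((r, n))      # columns are a^1 ... a^4

Wq = rng.standard_normal((q_dim, r))
Wk = rng.standard_normal((q_dim, r))
Wv = rng.standard_normal((v_dim, r))

# Step 1: generate Q, K, V
Q = Wq @ I                           # q x n
K = Wk @ I                           # q x n
V = Wv @ I                           # v x n

# Step 2: attention score A and its column-wise softmax A'
A = K.T @ Q                          # n x n
E = np.exp(A - A.max(axis=0, keepdims=True))
A_prime = E / E.sum(axis=0, keepdims=True)

# Step 3: weighted sum of values
O = V @ A_prime                      # v x n; columns are b^1 ... b^4
print(O.shape)                       # (5, 4), i.e. v x n
```

Because all outputs come from the same three matrix multiplications, every position is computed at once; this is the parallelism the matrix perspective buys.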

2. Multi-Head Self-Attention

This means that multiple sets of self-attention weights, or heads, are learned in parallel, independently of each other. The number of attention heads in an attention layer varies from model to model, but numbers in the range of 12 to 100 are common.

The intuition is that each self-attention head learns a different aspect of language. For example, one head may capture the relationship between the people entities in a sentence, another may focus on the activity of the sentence, and yet another may focus on other properties, such as whether the words rhyme. A minimal code sketch of multi-head attention follows the list below.

  • One word can have multiple meanings; such words are homonyms.

  • Words within a sentence structure can be ambiguous or have what we might call syntactic ambiguity.
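
Below is a minimal sketch of multi-head self-attention in numpy, assuming $h$ independent heads whose outputs are concatenated and then mapped back to the input dimension by an output projection; the head count, per-head dimension, and the projection $W^o$ are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

r, n = 8, 4                       # input dimension and sequence length (assumptions)
h, d = 2, 4                       # number of heads and per-head dimension (assumptions)
I = rng.standard_normal((r, n))   # columns are the input word vectors

def softmax_cols(A):
    """Softmax over each column (numerically stabilized)."""
    E = np.exp(A - A.max(axis=0, keepdims=True))
    return E / E.sum(axis=0, keepdims=True)

heads = []
for _ in range(h):
    # Each head learns its own W^q, W^k, W^v, independently of the other heads
    Wq, Wk, Wv = (rng.standard_normal((d, r)) for _ in range(3))
    Q, K, V = Wq @ I, Wk @ I, Wv @ I
    heads.append(V @ softmax_cols(K.T @ Q))   # d x n output per head

# Concatenate the per-head outputs; a final projection W^o maps back to dimension r
O = np.concatenate(heads, axis=0)             # (h*d) x n
Wo = rng.standard_normal((r, h * d))
print((Wo @ O).shape)                         # (8, 4)
```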
