Attention
The input matrix $X$ represents a sentence as a list of vectors; each vector is a word vector (embedding).
The output matrix corresponds to the input matrix: one output vector per input vector, in the same order.
Dimension:
$X \in \mathbb{R}^{n \times d_{\text{model}}}$: $n$ word vectors, each of dimension $d_{\text{model}}$
$W^Q \in \mathbb{R}^{d_{\text{model}} \times d_k}$, $W^K \in \mathbb{R}^{d_{\text{model}} \times d_k}$, $W^V \in \mathbb{R}^{d_{\text{model}} \times d_v}$: learned projection matrices
$Q = XW^Q \in \mathbb{R}^{n \times d_k}$, $K = XW^K \in \mathbb{R}^{n \times d_k}$, $V = XW^V \in \mathbb{R}^{n \times d_v}$: queries, keys, values
$QK^\top \in \mathbb{R}^{n \times n}$: pairwise attention scores between every pair of words
$Z = \operatorname{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V \in \mathbb{R}^{n \times d_v}$: output, one row per input row
In other words: $Z = \operatorname{softmax}\!\left(\frac{XW^Q\,(XW^K)^\top}{\sqrt{d_k}}\right) XW^V$, i.e., the queries, keys, and values are all computed from the same input $X$.
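As a concrete illustration, here is a minimal NumPy sketch of single-head scaled dot-product self-attention following the formulas above. The function names `self_attention` and `softmax` and the example shapes are illustrative choices, not part of these notes.

```python
import numpy as np

def softmax(scores, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    scores = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(scores)
    return exp / exp.sum(axis=axis, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    """Single-head scaled dot-product self-attention.

    X:          (n, d_model)  one row per word vector
    W_Q, W_K:   (d_model, d_k) learned projection matrices
    W_V:        (d_model, d_v)
    returns Z:  (n, d_v)      one output vector per input vector
    """
    Q = X @ W_Q                          # (n, d_k)
    K = X @ W_K                          # (n, d_k)
    V = X @ W_V                          # (n, d_v)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n, n) pairwise attention scores
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V                   # (n, d_v) weighted sums of value vectors

# Illustrative shapes (not from the notes): 5 words, d_model=8, d_k=d_v=4.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))
W_Q, W_K, W_V = (rng.normal(size=(8, 4)) for _ in range(3))
Z = self_attention(X, W_Q, W_K, W_V)
print(Z.shape)  # (5, 4)
```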
Why "self-attention"? Because all the operations happen among different s within .
Only the projection matrices $W^Q$, $W^K$, $W^V$ need to be estimated as parameters in the neural network.
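For a sense of scale (illustrative numbers taken from the original Transformer paper rather than from these notes): with $d_{\text{model}} = 512$ and $d_k = d_v = 64$, a single head has $3 \times 512 \times 64 = 98{,}304$ parameters.

```python
# Parameters per self-attention head: W_Q, W_K, W_V only.
# Illustrative dimensions (original Transformer): d_model=512, d_k=d_v=64.
d_model, d_k, d_v = 512, 64, 64
params_per_head = d_model * d_k + d_model * d_k + d_model * d_v
print(params_per_head)  # 98304
```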
In multi-headed self-attention, multiple sets of self-attention weights, or heads, are learned in parallel, independently of each other. The number of attention heads included in the attention layer varies from model to model, but numbers in the range of 12-100 are common.
The intuition here is that each self-attention head will learn a different aspect of language. For example, one head may capture the relationship between the people entities in the sentence, while another may focus on the activity of the sentence, and yet another may focus on other properties, such as whether the words rhyme.
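A minimal sketch of how the heads run in parallel, reusing the `self_attention` function from the earlier sketch; the head count, shapes, and the output projection `W_O` are illustrative assumptions.

```python
import numpy as np

def multi_head_attention(X, heads, W_O):
    """Run several independent self-attention heads in parallel and
    concatenate their outputs; a final projection W_O maps the
    concatenation back to d_model.

    X:     (n, d_model)
    heads: list of (W_Q, W_K, W_V) tuples, one per head
    W_O:   (h * d_v, d_model)
    """
    head_outputs = [self_attention(X, W_Q, W_K, W_V) for W_Q, W_K, W_V in heads]
    return np.concatenate(head_outputs, axis=-1) @ W_O   # (n, d_model)

# Illustrative setup (not from the notes): 2 heads, d_model=8, d_k=d_v=4.
rng = np.random.default_rng(1)
X = rng.normal(size=(5, 8))
heads = [tuple(rng.normal(size=(8, 4)) for _ in range(3)) for _ in range(2)]
W_O = rng.normal(size=(2 * 4, 8))
print(multi_head_attention(X, heads, W_O).shape)  # (5, 8)
```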
One word can have multiple meanings. These are homonyms.
Words within a sentence structure can be ambiguous or have what we might call syntactic ambiguity.