# Attention

## 1. Self Attention

<figure><img src="https://3144105245-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F4ntaiKloVurWu8wS4ChR%2Fuploads%2Fo61IyqC7FuRje3Tre7jM%2Fimage.png?alt=media&#x26;token=42695e88-3148-444b-bc77-7b1c7b63877e" alt="" width="375"><figcaption><p>input and output for a self-attention module</p></figcaption></figure>

### 1.1 Vector perspective:&#x20;

#### The following shows how to calculate $$\boldsymbol{b}^1$$ given $$\[\boldsymbol{a}^1,\boldsymbol{a}^2, \boldsymbol{a}^3,\boldsymbol{a}^4]$$; the calculations for $$\boldsymbol{b}^2, \boldsymbol{b}^3,\boldsymbol{b}^4$$ are analogous.

<figure><img src="https://3144105245-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F4ntaiKloVurWu8wS4ChR%2Fuploads%2FJSNFpSBoibbsYtG0p82m%2Fimage.png?alt=media&#x26;token=69f7378f-1b9e-4ccd-b694-ac67c73288e7" alt="" width="375"><figcaption><p>Step 1: attention score</p></figcaption></figure>

<figure><img src="https://3144105245-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F4ntaiKloVurWu8wS4ChR%2Fuploads%2Ftcw96aBGE8neXCy5wzJL%2Fimage.png?alt=media&#x26;token=d40d4d51-3c84-42b9-a034-9c409509cc28" alt="" width="375"><figcaption><p>Step 2: normalization of attention scores. Softmax is one option; ReLU is another.</p></figcaption></figure>

<figure><img src="https://3144105245-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F4ntaiKloVurWu8wS4ChR%2Fuploads%2FB3aesDruWaC0vw6AY0YV%2Fimage.png?alt=media&#x26;token=54bfbadd-b8cc-435e-a75b-53fc81f6e21e" alt="" width="375"><figcaption><p>Step 3: weighted sum of values</p></figcaption></figure>
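
As a minimal sketch of the three steps above, the following NumPy code computes $$\boldsymbol{b}^1$$ from four input vectors. The dimensions and the matrices `W_q`, `W_k`, `W_v` here are random placeholders (assumed for illustration), not values from the figures.

```python
import numpy as np

rng = np.random.default_rng(0)
r, q_dim, v_dim = 6, 4, 5            # input dim r, query/key dim q, value dim v (placeholders)

# Four input vectors a^1..a^4, each of dimension r
a = [rng.standard_normal(r) for _ in range(4)]

# Randomly initialized projections standing in for the learned parameters W^q, W^k, W^v
W_q = rng.standard_normal((q_dim, r))
W_k = rng.standard_normal((q_dim, r))
W_v = rng.standard_normal((v_dim, r))

q1 = W_q @ a[0]                      # query for a^1
ks = [W_k @ ai for ai in a]          # keys  k^1..k^4
vs = [W_v @ ai for ai in a]          # values v^1..v^4

# Step 1: attention scores alpha_{1,i} = q^1 . k^i
scores = np.array([q1 @ ki for ki in ks])

# Step 2: normalization (softmax here; a ReLU-based normalization would be another option)
alpha = np.exp(scores) / np.exp(scores).sum()

# Step 3: weighted sum of values gives b^1
b1 = sum(w * vi for w, vi in zip(alpha, vs))
print(b1.shape)                      # (v_dim,)
```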

### 1.2 Matrix perspective: (parallel computing)&#x20;

#### The following shows how to calculate  $$\[\boldsymbol{b}^1,\boldsymbol{b}^2, \boldsymbol{b}^3,\boldsymbol{b}^4]$$, given $$\[\boldsymbol{a}^1,\boldsymbol{a}^2, \boldsymbol{a}^3,\boldsymbol{a}^4]$$.

<figure><img src="https://3144105245-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F4ntaiKloVurWu8wS4ChR%2Fuploads%2FNSjH58pmiKCTirh3WvR8%2Fimage.png?alt=media&#x26;token=45ca0c70-3a93-412d-b170-0be6ec82fc99" alt="" width="375"><figcaption><p>Step 1: Generate Q, K, V</p></figcaption></figure>

<figure><img src="https://3144105245-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F4ntaiKloVurWu8wS4ChR%2Fuploads%2FWjcSludi8VQEYQlUHfwR%2Fimage.png?alt=media&#x26;token=634e03b3-2ad7-4237-9898-4f715ed4a393" alt="" width="375"><figcaption><p>Step 2: Generate attention score A and normalized attention score A'</p></figcaption></figure>

<figure><img src="https://3144105245-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F4ntaiKloVurWu8wS4ChR%2Fuploads%2FMQHUuwOepJicayn3ty3X%2Fimage.png?alt=media&#x26;token=95ec1e3e-2b71-4d4a-8459-d37c243a0148" alt="" width="375"><figcaption><p>Step 3: weighted sum of values</p></figcaption></figure>

<figure><img src="https://3144105245-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F4ntaiKloVurWu8wS4ChR%2Fuploads%2FRbTVOciGqm4gVoGOlqv5%2Fimage.png?alt=media&#x26;token=5b1a8295-9644-4b1e-8df7-aa7bb665ee4a" alt="" width="375"><figcaption><p>Summary of matrix operation</p></figcaption></figure>

* Input matrix $$I=\[\boldsymbol{a}^1,\boldsymbol{a}^2,\boldsymbol{a}^3,\boldsymbol{a}^4]$$ holds the word vectors of a sentence, one column vector $$\boldsymbol{a}^i$$ per word.
* Output matrix $$O=\[\boldsymbol{b}^1,\boldsymbol{b}^2,\boldsymbol{b}^3,\boldsymbol{b}^4]$$ corresponds column-by-column to the input matrix.
* **Dimension**:&#x20;
  * $$I: r\times n$$, $$W^q: q \times r$$,  $$Q: q \times n$$
  * $$I: r\times n$$, $$W^k: q \times r$$, $$K: q \times n$$
  * $$I: r\times n$$, $$W^v: v \times r$$, $$V: v \times n$$
  * $$Q: q\times n$$, $$K: q \times n$$, $$A(A'): n \times n$$
  * $$A(A'): n \times n$$, $$V: v \times n$$, $$O: v \times n$$
  * **In other words:** $$I: r\times n \rightarrow O: v \times n$$
* Why "self-attention"? Because all the operations happen among different $$\boldsymbol{a}^i$$s within $$I$$.
* Only the parameters $$W^q,W^k,W^v$$ need to be learned in the neural network; a sketch of the matrix formulation in code is shown below.
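
A minimal NumPy sketch of the matrix formulation, using the column-vector convention and the dimensions listed above. The sizes `r`, `q`, `v`, `n` are arbitrary placeholders and the weight matrices are random stand-ins for learned parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
r, q_dim, v_dim, n = 6, 4, 5, 4     # input dim, query/key dim, value dim, sequence length

I = rng.standard_normal((r, n))     # I = [a^1, ..., a^n], one column per token

W_q = rng.standard_normal((q_dim, r))
W_k = rng.standard_normal((q_dim, r))
W_v = rng.standard_normal((v_dim, r))

# Step 1: generate Q, K, V (each column corresponds to one token)
Q = W_q @ I                         # q x n
K = W_k @ I                         # q x n
V = W_v @ I                         # v x n

# Step 2: attention scores A = K^T Q, then column-wise softmax gives A'
A = K.T @ Q                                           # n x n
A_prime = np.exp(A) / np.exp(A).sum(axis=0, keepdims=True)

# Step 3: weighted sum of values
O = V @ A_prime                     # v x n, one output column b^i per input column a^i
print(O.shape)                      # (5, 4)
```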

## 2. Multi-Head Self-Attention

<figure><img src="https://3144105245-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F4ntaiKloVurWu8wS4ChR%2Fuploads%2FGYLMpWbGtGzDD2tA21rg%2Fimage.png?alt=media&#x26;token=cc9bb39a-516b-4076-b607-9ea32cc6a0d9" alt="" width="375"><figcaption></figcaption></figure>

<figure><img src="https://3144105245-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F4ntaiKloVurWu8wS4ChR%2Fuploads%2FjgpEvcpaGuaR2KGy3WCu%2Fimage.png?alt=media&#x26;token=c70a429b-8039-42b9-bdf9-b672b2f83531" alt="" width="375"><figcaption></figcaption></figure>

<figure><img src="https://3144105245-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F4ntaiKloVurWu8wS4ChR%2Fuploads%2F6q8LOcF6qTnRpE8mwawE%2Fimage.png?alt=media&#x26;token=c81a1452-c69c-4c0e-9065-14341fe5e09d" alt="" width="375"><figcaption></figcaption></figure>

This means that multiple sets of self-attention weights or heads are learned in parallel, independently of each other. The number of attention heads included in the attention layer varies from model to model, but numbers in the range of 12-100 are common.

**The intuition** here is that each self-attention head learns a different aspect of language. For example, one head may capture the relationship between the people entities in our sentence, while another may focus on the activity described, and yet another on some other property, such as whether the words rhyme. A minimal multi-head sketch follows the list below.

* One word can have multiple meanings. These are homonyms.
* Words within a sentence structure can be ambiguous or have what we might call syntactic ambiguity.
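
A minimal NumPy sketch of multi-head self-attention: each head has its own independently learned $$W^q,W^k,W^v$$, the head outputs are concatenated, and an output projection mixes them. The number of heads, the head dimension, and the output projection `W_o` are placeholder assumptions for illustration, not values from any particular model.

```python
import numpy as np

rng = np.random.default_rng(0)
r, n, num_heads, head_dim = 8, 4, 2, 3   # placeholder sizes, not tied to any real model

I = rng.standard_normal((r, n))          # input columns a^1..a^n

def softmax_cols(x):
    """Column-wise softmax (normalizes each column of attention scores)."""
    e = np.exp(x - x.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)

heads = []
for _ in range(num_heads):
    # Each head has its own independently learned projections (random stand-ins here)
    W_q = rng.standard_normal((head_dim, r))
    W_k = rng.standard_normal((head_dim, r))
    W_v = rng.standard_normal((head_dim, r))
    Q, K, V = W_q @ I, W_k @ I, W_v @ I
    A_prime = softmax_cols(K.T @ Q)      # n x n attention pattern for this head
    heads.append(V @ A_prime)            # head output, head_dim x n

# Concatenate head outputs and mix them with an output projection
concat = np.vstack(heads)                            # (num_heads * head_dim) x n
W_o = rng.standard_normal((r, num_heads * head_dim))
O = W_o @ concat                                     # r x n
print(O.shape)
```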

