
Attention


1. Self Attention

1.1 Vector perspective:

The following shows how to calculate $\boldsymbol{b}^1$, given $[\boldsymbol{a}^1,\boldsymbol{a}^2,\boldsymbol{a}^3,\boldsymbol{a}^4]$. The calculation of $\boldsymbol{b}^2,\boldsymbol{b}^3,\boldsymbol{b}^4$ is similar.
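For concreteness, here is a minimal NumPy sketch of the computation of $\boldsymbol{b}^1$: attention scores between the query $\boldsymbol{q}^1$ and each key $\boldsymbol{k}^i$ (dot-product scores are an assumption here; the notes only say "attention score"), softmax normalization, and a weighted sum of the values. The toy dimensions and the random $W^q, W^k, W^v$ are placeholders for illustration, not values from the source.

```python
import numpy as np

# Toy dimensions (assumed): r = word-vector dim, q = query/key dim,
# v = value dim, n = number of words in the sentence.
r, q_dim, v_dim, n = 6, 4, 5, 4
rng = np.random.default_rng(0)

# Random projections standing in for the learned W^q, W^k, W^v.
Wq = rng.normal(size=(q_dim, r))
Wk = rng.normal(size=(q_dim, r))
Wv = rng.normal(size=(v_dim, r))

# Word vectors a^1 ... a^4.
a = [rng.normal(size=r) for _ in range(n)]

# Query for position 1; keys and values for every position.
q1 = Wq @ a[0]
keys = [Wk @ ai for ai in a]
values = [Wv @ ai for ai in a]

# Step 1: attention scores alpha_{1,i} = q^1 . k^i (dot product assumed).
scores = np.array([q1 @ ki for ki in keys])

# Step 2: normalize the scores (softmax here; ReLU is another option mentioned in the notes).
alpha = np.exp(scores - scores.max())
alpha /= alpha.sum()

# Step 3: b^1 is the attention-weighted sum of the values.
b1 = sum(w * vi for w, vi in zip(alpha, values))
print(b1.shape)  # (5,)
```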

1.2 Matrix perspective: (parallel computing)

The following shows how to calculate $[\boldsymbol{b}^1,\boldsymbol{b}^2,\boldsymbol{b}^3,\boldsymbol{b}^4]$, given $[\boldsymbol{a}^1,\boldsymbol{a}^2,\boldsymbol{a}^3,\boldsymbol{a}^4]$.

  • Input matrix $I=[\boldsymbol{a}^1,\boldsymbol{a}^2,\boldsymbol{a}^3,\boldsymbol{a}^4]$ represents the list of vectors for a sentence; each vector $\boldsymbol{a}^i$ is a word vector.

  • Output matrix $O=[\boldsymbol{b}^1,\boldsymbol{b}^2,\boldsymbol{b}^3,\boldsymbol{b}^4]$ corresponds to the input matrix: each $\boldsymbol{b}^i$ is the output for $\boldsymbol{a}^i$.

  • Dimensions:

    • $I: r\times n$, $W^q: q \times r$, $Q: q \times n$

    • $I: r\times n$, $W^k: q \times r$, $K: q \times n$

    • $I: r\times n$, $W^v: v \times r$, $V: v \times n$

    • $Q: q\times n$, $K: q \times n$, $A\,(A'): n \times n$

    • $A\,(A'): n \times n$, $V: v \times n$, $O: v \times n$

    • In other words: $I: r\times n \rightarrow O: v \times n$

  • Why "self-attention"? Because all the operations happen among different ai\boldsymbol{a}^iais within III.

  • Only the parameters $W^q, W^k, W^v$ need to be estimated in the neural network (a code sketch of these matrix operations follows below).
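Consistent with the dimensions above, the operations are $Q = W^q I$, $K = W^k I$, $V = W^v I$, then an attention-score matrix of size $n \times n$ (one consistent choice is $A = K^\top Q$, so that column $j$ of the normalized $A'$ holds the weights used to form $\boldsymbol{b}^j$), and finally $O = V A'$. Below is a minimal NumPy sketch of these steps; the toy dimension values are assumptions, and the $1/\sqrt{q}$ scaling used in the original Transformer is omitted because the notes do not show it.

```python
import numpy as np

# Toy dimensions (assumed): r = word-vector dim, q = query/key dim, v = value dim, n = sentence length.
r, q_dim, v_dim, n = 6, 4, 5, 4
rng = np.random.default_rng(0)

I_mat = rng.normal(size=(r, n))   # input matrix I, one word vector a^i per column
Wq = rng.normal(size=(q_dim, r))  # learnable query projection W^q
Wk = rng.normal(size=(q_dim, r))  # learnable key projection W^k
Wv = rng.normal(size=(v_dim, r))  # learnable value projection W^v

# Step 1: generate Q, K, V.
Q = Wq @ I_mat                    # q x n
K = Wk @ I_mat                    # q x n
V = Wv @ I_mat                    # v x n

# Step 2: attention score A and normalized attention score A' (softmax over each column).
A = K.T @ Q                       # n x n
E = np.exp(A - A.max(axis=0, keepdims=True))
A_prime = E / E.sum(axis=0, keepdims=True)

# Step 3: weighted sum of the values.
O = V @ A_prime                   # v x n, one output vector b^i per column
print(O.shape)                    # (5, 4)
```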

2. Multi-Head Self-Attention

Multi-head self-attention means that multiple sets of self-attention weights, or heads, are learned in parallel, independently of each other. The number of attention heads in an attention layer varies from model to model, but numbers in the range of 12-100 are common.

The intuition is that each self-attention head will learn a different aspect of language. For example, one head may capture the relationship between the people entities in a sentence, while another head may focus on the activity of the sentence, and yet another may focus on other properties, such as whether the words rhyme.
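As a rough sketch of this idea, the snippet below computes two heads, each with its own independently drawn $W^q, W^k, W^v$, and then concatenates and mixes the head outputs with an output projection $W^o$. The head count, the dimensions, and the final projection are assumptions for illustration, not details taken from these notes.

```python
import numpy as np

# Toy setup (assumed): 2 heads, same per-head dimensions as in the single-head sketch.
r, q_dim, v_dim, n, n_heads = 6, 4, 5, 4, 2
rng = np.random.default_rng(1)

def softmax_cols(A):
    """Softmax applied to each column of A."""
    E = np.exp(A - A.max(axis=0, keepdims=True))
    return E / E.sum(axis=0, keepdims=True)

I_mat = rng.normal(size=(r, n))               # input matrix, one word vector per column

head_outputs = []
for _ in range(n_heads):
    # Each head learns its own W^q, W^k, W^v, independently of the other heads.
    Wq = rng.normal(size=(q_dim, r))
    Wk = rng.normal(size=(q_dim, r))
    Wv = rng.normal(size=(v_dim, r))
    Q, K, V = Wq @ I_mat, Wk @ I_mat, Wv @ I_mat
    A_prime = softmax_cols(K.T @ Q)           # n x n normalized attention for this head
    head_outputs.append(V @ A_prime)          # v x n output of this head

# Concatenate the per-head outputs and mix them with an output projection W^o
# (the final projection is a common design choice, assumed here).
Wo = rng.normal(size=(v_dim, n_heads * v_dim))
O = Wo @ np.vstack(head_outputs)              # v x n
print(O.shape)                                # (5, 4)
```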

  • One word can have multiple meanings. These are homonyms.

  • Words within a sentence can also be structurally ambiguous, or have what we might call syntactic ambiguity.

Figure captions:

  • Input and output of a self-attention module

  • Vector perspective: Step 1, attention score; Step 2, normalization of the attention score (softmax is one option, ReLU is another); Step 3, weighted sum of the values

  • Matrix perspective: Step 1, generate Q, K, V; Step 2, generate the attention score A and the normalized attention score A'; Step 3, weighted sum of the values

  • Summary of the matrix operations