Introduction

This is an introduction to LLMs; for transformer details, see: Transformer.

Terminology:

  • Prompt: the text you feed into the model.

  • Context window: The space or memory that is available to the prompt.

  • Completion: The output of the model.

  • Inference: The act of using the model to generate text.

  • Next-word prediction: the base concept behind a number of different capabilities.

  • Instruction fine-tuning: involves using many prompt-completion examples as the labeled training dataset to continue training the model by updating its weights.

  • In-context learning: providing prompt-completion examples inside the context window during inference; the model's weights are not updated.

    • zero-shot inference, one-shot inference, few-shot inference

    • Greedy decoding: always select the token with the highest probability.

    • Random (-weighted) sampling: select the next token at random, weighted by the model's output probability distribution.

    • Top-p and top-k sampling: two settings that limit the random sampling and increase the chance that the output will be sensible. Top-k restricts sampling to the k most probable tokens, while top-p (nucleus sampling) restricts it to the smallest set of tokens whose cumulative probability exceeds p (see the sketch after this list).

    • Temperature: broadly speaking, the higher the temperature, the higher the randomness; the lower the temperature, the lower the randomness.
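
To make these decoding settings concrete, below is a minimal sketch of greedy decoding, temperature scaling, top-k, and top-p sampling over a toy next-token distribution. It assumes only NumPy; the vocabulary and probabilities are made up for illustration and do not come from a real model.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = np.array(["cake", "donut", "banana", "apple", "bread"])
probs = np.array([0.20, 0.10, 0.02, 0.65, 0.03])  # toy next-token distribution

# Greedy decoding: always take the single most probable token.
greedy = vocab[np.argmax(probs)]

# Temperature: rescale the distribution before sampling.
# Higher temperature -> flatter distribution -> more randomness.
def apply_temperature(p, temperature):
    logits = np.log(p) / temperature
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

# Top-k: keep only the k most probable tokens, renormalize, then sample.
def top_k_sample(p, k):
    keep = np.argsort(p)[-k:]
    masked = np.zeros_like(p)
    masked[keep] = p[keep]
    return rng.choice(vocab, p=masked / masked.sum())

# Top-p (nucleus): keep the smallest set of tokens whose cumulative
# probability exceeds p, renormalize, then sample.
def top_p_sample(p, top_p):
    order = np.argsort(p)[::-1]
    cutoff = np.searchsorted(np.cumsum(p[order]), top_p) + 1
    masked = np.zeros_like(p)
    masked[order[:cutoff]] = p[order[:cutoff]]
    return rng.choice(vocab, p=masked / masked.sum())

print("greedy        :", greedy)  # always "apple"
print("temperature 2 :", rng.choice(vocab, p=apply_temperature(probs, 2.0)))
print("top-k (k=3)   :", top_k_sample(probs, k=3))
print("top-p (p=0.9) :", top_p_sample(probs, top_p=0.9))
```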

Variants of LLMs

  • Encoder-only: BERT, RoBERTa

  • Decoder-only: GPT, BLOOM

  • Encoder-decoder: T5, BART

1) Encoder-only (Autoencoding Models)

Encoder-only models are also known as autoencoding models, and they are pre-trained using masked language modeling. Here, tokens in the input sequence are randomly masked, and the training objective is to predict the masked tokens in order to reconstruct the original sentence. This is also called a denoising objective.
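
As a quick illustration of this denoising objective, the sketch below asks a pre-trained autoencoding model to fill in a masked token. It assumes the Hugging Face transformers library and the public bert-base-uncased checkpoint, neither of which is specified in these notes.

```python
# Fill in a masked token with a pre-trained encoder-only (autoencoding) model.
# Assumes: `pip install transformers` and the public bert-base-uncased checkpoint.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# The model sees the full bidirectional context around [MASK] and predicts
# the most likely original token, i.e. it reconstructs the sentence.
for prediction in fill_mask("The teacher [MASK] the student."):
    print(f"{prediction['token_str']:>10}  score={prediction['score']:.3f}")
```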

Good use cases:

  • Sentiment analysis

  • Named entity recognition

  • Word classification

2) Decoder-only (Autoregressive Models)

Predicting the next token is sometimes called full language modeling by researchers. Decoder-based autoregressive models mask the future of the input sequence: the model can only see the input tokens leading up to the token in question and has no knowledge of the end of the sentence. The model then iterates over the input sequence one token at a time to predict the following token.
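
A minimal sketch of the causal (look-ahead) mask that gives autoregressive models this left-to-right view; it uses only NumPy and a toy sequence length.

```python
import numpy as np

seq_len = 5  # toy sequence length

# Row i shows which positions token i may attend to:
# 1 = visible (itself and earlier tokens), 0 = masked-out future token.
causal_mask = np.tril(np.ones((seq_len, seq_len), dtype=int))
print(causal_mask)
```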

Good use cases:

  • Text generation

  • Other emergent behavior, which depends on model size

3) Encoder-Decoder (Sequence-to-sequence models)

Good use cases:

  • Translation

  • Text summarization

  • Question answering
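
For the translation use case above, here is a minimal sketch with a pre-trained sequence-to-sequence model. It assumes the Hugging Face transformers library and the public t5-small checkpoint, which are not mentioned in these notes.

```python
# English-to-German translation with an encoder-decoder (seq2seq) model.
# Assumes: `pip install transformers sentencepiece` and the t5-small checkpoint.
from transformers import pipeline

translator = pipeline("translation_en_to_de", model="t5-small")
result = translator("The house is wonderful.")
print(result[0]["translation_text"])
```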

Summary:

Generative AI project lifecycle

Computational Challenge

Approximate GPU RAM needed to store 1B parameters

  • 1 parameter = 4 bytes (32-bit float)

  • 1B parameters = 4 × 10^9 bytes = 4GB

  • 4GB @ 32-bit, full precision

Additional GPU RAM needed to train 1B parameters

Training takes much more memory than storing the weights alone: on top of the 4 bytes per parameter for the weights, roughly 8 bytes per parameter go to the Adam optimizer states, 4 bytes per parameter to gradients, and about 8 bytes per parameter to activations and temporary buffers.

Memory needed to train 1B parameters

  • ~24GB @ 32-bit, full precision (about 6× the memory needed just to store the weights; see the quick calculation below)
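
A quick back-of-the-envelope check of these numbers, using the rough per-parameter costs listed above (they are approximations, not exact requirements):

```python
params = 1e9  # 1B parameters

bytes_weights   = 4 * params  # FP32 weights
bytes_optimizer = 8 * params  # Adam keeps two FP32 states per parameter
bytes_gradients = 4 * params  # FP32 gradients
bytes_overhead  = 8 * params  # activations + temporary buffers (rough estimate)

store = bytes_weights
train = bytes_weights + bytes_optimizer + bytes_gradients + bytes_overhead

print(f"store only: {store / 1e9:.0f} GB")  # ~4 GB
print(f"train:      {train / 1e9:.0f} GB")  # ~24 GB, about 6x the weights
```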

Scaling Laws

To maximize model performance, we can usually increase:

  • Dataset size (number of tokens)

  • Model size (number of parameters)

But both of them are constrained by computing budget (GPUs, training time, cost).

The petaflop/s-day ("petaflop per second-day") is a useful measure of compute budget, as it reflects both the hardware and the time required to train the model.

1 petaflop/s-day is the number of floating point operations performed at a rate of one petaFLOP per second, running for an entire day.

Note: one petaFLOP corresponds to one quadrillion floating point operations, so a rate of one petaFLOP per second performs 10^15 operations each second.

  • one quadrillion: 1,000,000,000,000,000 = 10^15

When specifically thinking about training transformers, one petaflop/s-day is approximately equivalent to

  • eight NVIDIA V100 GPUs, operating at full efficiency for one full day.

  • two NVIDIA A100 GPUs give equivalent compute to the eight V100 chips.
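
As a quick check of the arithmetic behind this unit:

```python
petaflop = 1e15                 # one quadrillion floating point operations
seconds_per_day = 24 * 60 * 60  # 86,400 seconds

flops_per_pflops_day = petaflop * seconds_per_day
print(f"1 petaflop/s-day ≈ {flops_per_pflops_day:.2e} floating point operations")
# ≈ 8.64e19 FLOPs, e.g. eight V100s (or two A100s) running flat out for a day.
```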

Training Compute-Optimal Large Language Models (https://arxiv.org/pdf/2203.15556.pdf)

  • Chinchilla

  • Very large models may be over-parameterized and under-trained

  • Smaller models trained on more data could perform as well as large models.

The Chinchilla paper hints that many of the 100-billion-parameter large language models like GPT-3 may actually be over-parameterized, meaning they have more parameters than they need to achieve a good understanding of language, and under-trained, so that they would benefit from seeing more training data.
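
The compute-optimal recipe from the paper works out to roughly 20 training tokens per parameter (Chinchilla itself has 70B parameters and was trained on about 1.4T tokens). The sketch below applies that rule of thumb; treat it as an approximation, not an exact law.

```python
# Chinchilla-style rule of thumb: ~20 training tokens per parameter.
def compute_optimal_tokens(num_params: float, tokens_per_param: float = 20.0) -> float:
    return num_params * tokens_per_param

for params in (1e9, 70e9, 175e9):
    tokens = compute_optimal_tokens(params)
    print(f"{params / 1e9:>5.0f}B params -> ~{tokens / 1e12:.1f}T compute-optimal tokens")
```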

Pre-training for domain adaptation:

  • Legal language

  • Medical language

  • Financial language

BloombergGPT: domain adaptation for finance (https://arxiv.org/pdf/2303.17564.pdf)

  • Training set:

    • Financial (Public and Private): 51%

    • Other (Public): 49%

Reference

Transformer Architecture

  • Attention is All You Need - This paper introduced the Transformer architecture, with its core "self-attention" mechanism, and is the foundation of modern LLMs.

  • BLOOM: BigScience 176B Model - BLOOM is an open-source LLM with 176B parameters, trained in an open and transparent way. In this paper, the authors present a detailed discussion of the dataset and process used to train the model. You can also see a high-level overview of the model here.

  • Vector Space Models - Series of lessons from DeepLearning.AI's Natural Language Processing specialization discussing the basics of vector space models and their use in language modeling.

Pre-training and scaling laws

Model architectures and pre-training objectives

Scaling laws and compute-optimal models
