Introduction

This is an introduction to LLMs; for transformer details, see: Transformer.

Terminology:

  • Prompt: the text you feed into the model.

  • Context window: The space or memory that is available to the prompt.

  • Completion: The output of the model.

  • Inference: The act of using the model to generate text.

  • Next-word prediction: the base concept behind a number of different capabilities.

  • Instruction fine-tuning: involves using many prompt-completion examples as the labeled training dataset to continue training the model by updating its weights.

  • In-context learning: providing prompt-completion examples inside the context window during inference; the model's weights are not updated.

    • zero-shot inference, one-shot inference, few-shot inference

    • Greedy decoding: always select the token with the highest probability.

    • Random (-weighted) sampling: select the next token at random, weighted by the model's output probability distribution.

    • Top-p and top-k sampling: two settings that limit the random sampling and increase the chance that the output will be sensible. Top-k restricts sampling to the k most probable tokens, while top-p (nucleus sampling) restricts it to the smallest set of tokens whose cumulative probability exceeds p (see the sketch after this list).

    • Temperature: broadly speaking, the higher the temperature, the higher the randomness; the lower the temperature, the lower the randomness.
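
To make these decoding settings concrete, below is a minimal sketch of greedy decoding, temperature scaling, top-k, and top-p sampling over a toy next-token distribution. It assumes only NumPy; the vocabulary and probabilities are made up for illustration and do not come from a real model.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = np.array(["cake", "donut", "banana", "apple", "bread"])
probs = np.array([0.20, 0.10, 0.02, 0.65, 0.03])  # toy next-token distribution

# Greedy decoding: always take the single most probable token.
greedy = vocab[np.argmax(probs)]

# Temperature: rescale the distribution before sampling.
# Higher temperature -> flatter distribution -> more randomness.
def apply_temperature(p, temperature):
    logits = np.log(p) / temperature
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

# Top-k: keep only the k most probable tokens, renormalize, then sample.
def top_k_sample(p, k):
    keep = np.argsort(p)[-k:]
    masked = np.zeros_like(p)
    masked[keep] = p[keep]
    return rng.choice(vocab, p=masked / masked.sum())

# Top-p (nucleus): keep the smallest set of tokens whose cumulative
# probability exceeds p, renormalize, then sample.
def top_p_sample(p, top_p):
    order = np.argsort(p)[::-1]
    cutoff = np.searchsorted(np.cumsum(p[order]), top_p) + 1
    masked = np.zeros_like(p)
    masked[order[:cutoff]] = p[order[:cutoff]]
    return rng.choice(vocab, p=masked / masked.sum())

print("greedy        :", greedy)  # always "apple"
print("temperature 2 :", rng.choice(vocab, p=apply_temperature(probs, 2.0)))
print("top-k (k=3)   :", top_k_sample(probs, k=3))
print("top-p (p=0.9) :", top_p_sample(probs, top_p=0.9))
```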

Variants of LLMs

  • Encoder-only: BERT, RoBERTa

  • Decoder-only: GPT, BLOOM

  • Encoder-decoder: T5, BART

1) Encoder-only (Autoencoding Models)

Encoder-only models are also known as autoencoding models, and they are pre-trained using masked language modeling. Here, tokens in the input sequence are randomly masked, and the training objective is to predict the masked tokens in order to reconstruct the original sentence. This is also called a denoising objective.
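
As a quick illustration of this denoising objective, the sketch below asks a pre-trained autoencoding model to fill in a masked token. It assumes the Hugging Face transformers library and the public bert-base-uncased checkpoint, neither of which is specified in these notes.

```python
# Fill in a masked token with a pre-trained encoder-only (autoencoding) model.
# Assumes: `pip install transformers` and the public bert-base-uncased checkpoint.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# The model sees the full bidirectional context around [MASK] and predicts
# the most likely original token, i.e. it reconstructs the sentence.
for prediction in fill_mask("The teacher [MASK] the student."):
    print(f"{prediction['token_str']:>10}  score={prediction['score']:.3f}")
```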

Good use cases:

  • Sentiment analysis

  • Named entity recognition

  • Word classification

2) Decoder-only (Autoregressive Models)

Predicting the next token is sometimes called full language modeling by researchers. Decoder-based autoregressive models mask the future of the input sequence: the model can only see the input tokens leading up to the token in question and has no knowledge of the end of the sentence. The model then iterates over the input sequence one token at a time to predict the following token.
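
A minimal sketch of the causal (look-ahead) mask that gives autoregressive models this left-to-right view; it uses only NumPy and a toy sequence length.

```python
import numpy as np

seq_len = 5  # toy sequence length

# Row i shows which positions token i may attend to:
# 1 = visible (itself and earlier tokens), 0 = masked-out future token.
causal_mask = np.tril(np.ones((seq_len, seq_len), dtype=int))
print(causal_mask)
```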

Good use cases:

  • Text generation

  • Other emergent behavior, which depends on model size

3) Encoder-Decoder (Sequence-to-sequence models)

Good use cases:

  • Translation

  • Text summarization

  • Question answering
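
For the translation use case above, here is a minimal sketch with a pre-trained sequence-to-sequence model. It assumes the Hugging Face transformers library and the public t5-small checkpoint, which are not mentioned in these notes.

```python
# English-to-German translation with an encoder-decoder (seq2seq) model.
# Assumes: `pip install transformers sentencepiece` and the t5-small checkpoint.
from transformers import pipeline

translator = pipeline("translation_en_to_de", model="t5-small")
result = translator("The house is wonderful.")
print(result[0]["translation_text"])
```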

Summary:

Generative AI project lifecycle

Computational Challenge

Approximate GPU RAM needed to store 1B parameters

  • 1 parameter = 4 bytes (32-bit float)

  • 1B parameters = 4 × 10^9 bytes = 4GB

  • 4GB @ 32-bit, full precision

Additional GPU RAM needed to train 1B parameters

Training takes much more memory than storing the weights alone: on top of the 4 bytes per parameter for the weights, roughly 8 bytes per parameter go to the Adam optimizer states, 4 bytes per parameter to gradients, and about 8 bytes per parameter to activations and temporary buffers.

Memory needed to train 1B parameters

  • ~24GB @ 32-bit, full precision (about 6× the memory needed just to store the weights; see the quick calculation below)
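
A quick back-of-the-envelope check of these numbers, using the rough per-parameter costs listed above (they are approximations, not exact requirements):

```python
params = 1e9  # 1B parameters

bytes_weights   = 4 * params  # FP32 weights
bytes_optimizer = 8 * params  # Adam keeps two FP32 states per parameter
bytes_gradients = 4 * params  # FP32 gradients
bytes_overhead  = 8 * params  # activations + temporary buffers (rough estimate)

store = bytes_weights
train = bytes_weights + bytes_optimizer + bytes_gradients + bytes_overhead

print(f"store only: {store / 1e9:.0f} GB")  # ~4 GB
print(f"train:      {train / 1e9:.0f} GB")  # ~24 GB, about 6x the weights
```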

Scaling Laws

To maximize model performance, we can usually increase:

  • Dataset size (number of tokens)

  • Model size (number of parameters)

But both of them are constrained by computing budget (GPUs, training time, cost).

The petaflop/s-day ("petaflop per second-day") is a useful measure of compute budget, as it reflects both the hardware and the time required to train the model.

1 petaflop/s-day is the number of floating point operations performed at a rate of one petaFLOP per second, running for an entire day.

Note: one petaFLOP corresponds to one quadrillion floating point operations, so a rate of one petaFLOP per second performs 10^15 operations each second.

  • one quadrillion: 1,000,000,000,000,000 = 10^15

When specifically thinking about training transformers, one petaflop/s-day is approximately equivalent to

  • eight NVIDIA V100 GPUs, operating at full efficiency for one full day.

  • two NVIDIA A100 GPUs give equivalent compute to the eight V100 chips.
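
As a quick check of the arithmetic behind this unit:

```python
petaflop = 1e15                 # one quadrillion floating point operations
seconds_per_day = 24 * 60 * 60  # 86,400 seconds

flops_per_pflops_day = petaflop * seconds_per_day
print(f"1 petaflop/s-day ≈ {flops_per_pflops_day:.2e} floating point operations")
# ≈ 8.64e19 FLOPs, e.g. eight V100s (or two A100s) running flat out for a day.
```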

Training Compute-Optimal Large Language Models (https://arxiv.org/pdf/2203.15556.pdf)

  • Chinchilla

  • Very large models may be over-parameterized and under-trained

  • Smaller models trained on more data could perform as well as large models.

The Chinchilla paper hints that many of the 100-billion-parameter large language models like GPT-3 may actually be over-parameterized, meaning they have more parameters than they need to achieve a good understanding of language, and under-trained, so that they would benefit from seeing more training data.
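
The compute-optimal recipe from the paper works out to roughly 20 training tokens per parameter (Chinchilla itself has 70B parameters and was trained on about 1.4T tokens). The sketch below applies that rule of thumb; treat it as an approximation, not an exact law.

```python
# Chinchilla-style rule of thumb: ~20 training tokens per parameter.
def compute_optimal_tokens(num_params: float, tokens_per_param: float = 20.0) -> float:
    return num_params * tokens_per_param

for params in (1e9, 70e9, 175e9):
    tokens = compute_optimal_tokens(params)
    print(f"{params / 1e9:>5.0f}B params -> ~{tokens / 1e12:.1f}T compute-optimal tokens")
```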

Pre-training for domain adaptation:

  • Legal language

  • Medical language

  • Financial language

BloombergGPT: domain adaptation for finance (https://arxiv.org/pdf/2303.17564.pdf)

  • Training set:

    • Financial (Public and Private): 51%

    • Other (Public): 49%

Reference

Transformer Architecture

  • Attention is All You Need - This paper introduced the Transformer architecture, with its core "self-attention" mechanism, and is the foundation of modern LLMs.

  • BLOOM: BigScience 176B Model - BLOOM is an open-source LLM with 176B parameters, trained in an open and transparent way. In this paper, the authors present a detailed discussion of the dataset and process used to train the model. You can also see a high-level overview of the model here.

  • Vector Space Models - Series of lessons from DeepLearning.AI's Natural Language Processing specialization discussing the basics of vector space models and their use in language modeling.

Pre-training and scaling laws

Model architectures and pre-training objectives

Scaling laws and compute-optimal models
