
Introduction

This is an introduction to LLMs; for Transformer details, see: Transformer.

Terminology:

  • Prompt: The input text you feed into the model.

  • Context window: The space or memory that is available to the prompt.

  • Completion: The output of the model.

  • Inference: The act of using the model to generate text.

  • Next word prediction is the base concept behind a number of different capabilities.

  • Instruction fine-tuning: involves using many prompt-completion examples as the labeled training dataset to continue training the model by updating its weights.

  • In-context learning: Providing examples inside the context window is called in-context learning; it provides prompt-completion examples at inference time.

    • zero-shot inference, one-shot inference, few-shot inference

  • Decoding / sampling settings (a minimal sketch follows this list):

    • Greedy decoding: always pick the most likely next token.

    • Random (-weighted) sampling: sample the next token according to the model's probability distribution.

    • Top-p and top-k sampling: two settings that limit random sampling and increase the chance that the output will be sensible.

    • Temperature: broadly speaking, the higher the temperature, the higher the randomness; the lower the temperature, the lower the randomness.
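A minimal NumPy sketch of how these decoding settings interact (the tiny vocabulary and logits below are made up purely for illustration):

```python
import numpy as np

# Hypothetical next-token logits over a toy vocabulary.
vocab = ["cake", "donut", "banana", "bread", "apple"]
logits = np.array([4.0, 3.2, 2.5, 1.0, 0.5])

def softmax(x):
    x = x - x.max()              # for numerical stability
    e = np.exp(x)
    return e / e.sum()

def next_token(logits, temperature=1.0, top_k=None, top_p=None, greedy=False, seed=0):
    if greedy:
        return int(np.argmax(logits))            # greedy decoding: always the most likely token
    probs = softmax(logits / temperature)        # higher temperature -> flatter, more random
    order = np.argsort(probs)[::-1]              # token ids sorted by probability, descending
    if top_k is not None:
        order = order[:top_k]                    # top-k: keep only the k most likely tokens
    if top_p is not None:
        cum = np.cumsum(probs[order])
        order = order[: int(np.searchsorted(cum, top_p)) + 1]  # smallest prefix with mass >= top_p
    kept = probs[order] / probs[order].sum()     # renormalize over the kept tokens
    rng = np.random.default_rng(seed)
    return int(rng.choice(order, p=kept))        # random-weighted sampling

print(vocab[next_token(logits, greedy=True)])                  # greedy decoding
print(vocab[next_token(logits, temperature=1.5, top_k=3)])     # more random, limited to top 3 tokens
print(vocab[next_token(logits, temperature=0.7, top_p=0.9)])   # less random, nucleus (top-p) sampling
```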

Variants of LLMs

  • Encoder-only: BERT, RoBERTa

  • Decoder-only: GPT, BLOOM

  • Encoder-decoder: T5, BART

1) Encoder-only (Autoencoding Models)

Encoder-only models are also known as autoencoding models; they are pre-trained using masked language modeling. Tokens in the input sequence are randomly masked, and the training objective is to predict the masked tokens in order to reconstruct the original sentence. This is also called a denoising objective.

Good use case:

  • Sentiment analysis

  • Named entity recognition

  • Word classification
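As a quick illustration of the masked-language-modeling (denoising) objective, here is a hedged sketch using the Hugging Face transformers pipeline; it assumes transformers (plus PyTorch or another backend) is installed, and bert-base-uncased is just one example checkpoint:

```python
from transformers import pipeline

# Encoder-only model: predict the masked token to reconstruct the sentence.
unmasker = pipeline("fill-mask", model="bert-base-uncased")

for prediction in unmasker("The teacher teaches the [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))
```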

2) Decoder-only (Autoregressive Models)

Predicting the next token is sometimes called full language modeling by researchers. Decoder-based autoregressive models mask the input sequence so that the model can only see the input tokens leading up to the token in question; it has no knowledge of the end of the sentence. The model then iterates over the input sequence token by token to predict the following token.

Good use case:

  • Text generation

  • Other emergent behavior, which tends to depend on model size
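And a hedged sketch of autoregressive (decoder-only) generation with the same library, wiring in the sampling settings from the terminology section; gpt2 is just one example checkpoint:

```python
from transformers import pipeline, set_seed

set_seed(42)  # make the sampled completion reproducible

# Decoder-only model: generates the completion one token at a time.
generator = pipeline("text-generation", model="gpt2")

output = generator(
    "Large language models are",
    max_new_tokens=30,
    do_sample=True,      # random-weighted sampling instead of greedy decoding
    temperature=0.7,
    top_k=50,
    top_p=0.9,
)
print(output[0]["generated_text"])
```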

3) Encoder-Decoder (Sequence-to-sequence models)

Sequence-to-sequence models use both the encoder and the decoder of the original Transformer. T5, for example, is pre-trained with span corruption: random spans of input tokens are replaced by sentinel tokens, and the decoder learns to reconstruct the masked spans.

Good use case:

  • Translation

  • Text summarization

  • Question answering
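For the encoder-decoder case, a hedged sketch using a summarization pipeline (t5-small is just one example checkpoint):

```python
from transformers import pipeline

# Sequence-to-sequence model: the encoder reads the whole input,
# the decoder generates the output sequence (here, a summary).
summarizer = pipeline("summarization", model="t5-small")

text = (
    "Large language models are pre-trained on huge text corpora and can then be "
    "adapted to tasks such as translation, text summarization, and question answering."
)
print(summarizer(text, max_length=30, min_length=5)[0]["summary_text"])
```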

Summary:

  • Encoder-only (autoencoding, e.g. BERT, RoBERTa): masked language modeling; best suited to classification tasks such as sentiment analysis and named entity recognition.

  • Decoder-only (autoregressive, e.g. GPT, BLOOM): causal / full language modeling; best suited to text generation.

  • Encoder-decoder (sequence-to-sequence, e.g. T5, BART): best suited to translation, summarization, and question answering.

Generative AI project lifecycle

The lifecycle covers defining the scope of the use case, selecting an existing model (or pre-training your own), adapting and aligning it (prompt engineering, fine-tuning, RLHF) and evaluating it, and finally optimizing the model for deployment and integrating it into applications.

Computational Challenge

Approximate GPU RAM needed to store 1B parameters

  • 1 parameter = 4 bytes (32-bit float)

  • 1B parameters = 4 × 10^9 bytes = 4GB

  • 4GB @ 32-bit, full precision

Additional GPU RAM needed to train 1B parameters

Training needs memory not just for the weights but also for optimizer states (e.g. Adam), gradients, activations, and temporary buffers, which add roughly 20 extra bytes per parameter.

Memory needed to train 1B parameters

  • ~24GB @ 32-bit, full precision (about 6x the memory needed just to store the weights)
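A back-of-the-envelope sketch of these numbers (4 bytes per 32-bit weight, plus roughly 20 extra bytes per parameter during training, as the rule of thumb above):

```python
BYTES_PER_PARAM_WEIGHTS = 4       # one 32-bit float per parameter
EXTRA_BYTES_PER_PARAM_TRAIN = 20  # optimizer states + gradients + activations (rule of thumb)

def gpu_memory_gb(num_params, training=False):
    bytes_per_param = BYTES_PER_PARAM_WEIGHTS
    if training:
        bytes_per_param += EXTRA_BYTES_PER_PARAM_TRAIN
    return num_params * bytes_per_param / 1e9

print(gpu_memory_gb(1e9))                  # ~4 GB just to store 1B parameters
print(gpu_memory_gb(1e9, training=True))   # ~24 GB to train 1B parameters
print(gpu_memory_gb(175e9, training=True)) # ~4200 GB to train a 175B-parameter model
```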

Scaling Laws

To maximize model performance, we can usually increase:

  • Dataset size (number of tokens)

  • Model size (number of parameters)

But both are constrained by the compute budget (GPUs, training time, cost).

Petaflops per second-day ("petaflop/s-day") is a useful measure of compute budget, as it reflects both the hardware and the time required to train the model.

1 petaflop/s-day is the number of floating point operations performed at a rate of one petaFLOP per second, running for an entire day.

Note: one petaFLOP corresponds to one quadrillion floating point operations.

  • one quadrillion: 1,000,000,000,000,000 = 10^15

When specifically thinking about training transformers, one petaflop/s-day is approximately equivalent to

  • eight NVIDIA V100 GPUs, operating at full efficiency for one full day.

  • two NVIDIA A100 GPUs give equivalent compute to the eight V100 chips.
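To make the unit concrete, a quick calculation of how many floating point operations one petaflop/s-day corresponds to:

```python
PETAFLOP = 10**15                 # floating point operations
SECONDS_PER_DAY = 24 * 60 * 60

petaflop_s_day = PETAFLOP * SECONDS_PER_DAY
print(f"1 petaflop/s-day = {petaflop_s_day:.2e} floating point operations")
# 1 petaflop/s-day = 8.64e+19 floating point operations
```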

Chinchilla findings:

  • Very large models may be over-parameterized and under-trained.

  • Smaller models trained on more data could perform as well as larger models.

The Chinchilla paper hints that many of the 100-billion-parameter large language models like GPT-3 may actually be over-parameterized, meaning they have more parameters than they need to achieve a good understanding of language, and under-trained, so that they would benefit from seeing more training data.
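A tiny sketch of the compute-optimal heuristic commonly quoted from the Chinchilla paper, roughly 20 training tokens per parameter (the exact ratio depends on the compute budget, so treat this as an approximation):

```python
TOKENS_PER_PARAM = 20  # rough compute-optimal ratio from the Chinchilla paper

def compute_optimal_tokens(num_params):
    """Approximate number of training tokens for a compute-optimal model."""
    return num_params * TOKENS_PER_PARAM

for params in (1e9, 70e9, 175e9):
    tokens = compute_optimal_tokens(params)
    print(f"{params / 1e9:.0f}B params -> ~{tokens / 1e9:.0f}B training tokens")
# Chinchilla itself: 70B parameters trained on ~1.4T tokens.
```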

Pre-training for domain adaptation:

  • Legal language

  • Medical language

  • Financial language

  • Training set of BloombergGPT (a finance-domain LLM; see Reference):

    • Financial (Public and Private): 51%

    • Other (Public): 49%

Reference

Transformer Architecture

  • Attention is All You Need: the paper that introduced the Transformer architecture, with the core "self-attention" mechanism. This article was the foundation for LLMs.

  • BLOOM: BigScience 176B Model: an open-source LLM with 176B parameters trained in an open and transparent way. In this paper, the authors present a detailed discussion of the dataset and process used to train the model. You can also see a high-level overview of the model here.

  • Vector Space Models: series of lessons from DeepLearning.AI's Natural Language Processing specialization discussing the basics of vector space models and their use in language modeling.

Pre-training and scaling laws

  • Scaling Laws for Neural Language Models (https://arxiv.org/pdf/2001.08361.pdf): empirical study by researchers at OpenAI exploring the scaling laws for large language models.

Model architectures and pre-training objectives

  • What Language Model Architecture and Pretraining Objective Work Best for Zero-Shot Generalization?: examines modeling choices in large pre-trained language models and identifies the optimal approach for zero-shot generalization.

  • HuggingFace Tasks and Model Hub: collection of resources to tackle varying machine learning tasks using the HuggingFace library.

Scaling laws and compute-optimal models

  • LLaMA: Open and Efficient Foundation Language Models: article from Meta AI proposing efficient LLMs (their 13B-parameter model outperforms the 175B-parameter GPT-3 on most benchmarks).

  • Language Models are Few-Shot Learners: investigates the potential of few-shot learning in large language models.

  • Training Compute-Optimal Large Language Models (https://arxiv.org/pdf/2203.15556.pdf): study from DeepMind to evaluate the optimal model size and number of tokens for training LLMs. Also known as the "Chinchilla paper".

  • BloombergGPT: A Large Language Model for Finance (https://arxiv.org/pdf/2303.17564.pdf): an LLM trained specifically for the finance domain, and a good example of trying to follow the Chinchilla scaling laws.