Fine-tuning LLMs
In-context learning has limitations:
It may not work for smaller models
Examples take up space in the context window
Using prompts to fine-tune LLMs with instructions
Instruction prompts frame each task explicitly, for example: "Summarize the following text:", "Translate this sentence to..."
LLM fine-tuning process
The result of this process is a fine-tuned LLM, often called an instruct LLM.
Fine-tuning with instruction prompts is the most common way to fine-tune LLMs today. From this point on, when you see the term fine-tuning, you can assume it means instruction fine-tuning.
For a single task, often only 500-1,000 examples are needed for fine-tuning.
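A minimal sketch of how labeled data is turned into instruction prompt/completion pairs (the template wording and example are illustrative, not a fixed standard):

```python
# Turn raw labeled examples into instruction-style training pairs.
TEMPLATE = "Summarize the following text:\n\n{text}\n\nSummary:"

raw_examples = [
    {"text": "The city council approved the new transit budget after months of debate.",
     "summary": "Council approves transit budget."},
]

train_pairs = [
    {"prompt": TEMPLATE.format(text=ex["text"]), "completion": ex["summary"]}
    for ex in raw_examples
]
print(train_pairs[0]["prompt"])
```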
A limitation of fine-tuning on a single task
Catastrophic forgetting: fine-tuning can significantly increase a model's performance on a specific task, but can lead to a reduction in its ability on other tasks.
How to avoid catastrophic forgetting
First, note that we might not have to: if the model only needs to perform well on the single fine-tuned task, reduced ability on other tasks may be acceptable.
Fine-tune on multiple tasks at the same time
Scaling Instruction-Finetuned Language Models (https://arxiv.org/pdf/2210.11416.pdf)
Consider Parameter-Efficient Fine-Tuning (PEFT)

Trade-offs to weigh when choosing a PEFT method:
Parameter efficiency
Memory efficiency
Model performance
Training speed
Inference costs
PEFT methods fall into three main categories:
Selective: select a subset of the initial LLM parameters to fine-tune
Reparameterization: reparameterize model weights using a low-rank representation, e.g. LoRA
Additive: add trainable layers or parameters to the model, e.g. adapters and soft prompts (prompt tuning)
LoRA represents updates to large weight matrices as the product of two smaller, rank-decomposition matrices, and trains those instead of the full weights. The product of these smaller matrices is then added to the original, frozen weights for inference.
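A minimal PyTorch sketch of the idea (class name, rank, and initialization are illustrative assumptions, not a reference implementation):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update (sketch)."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # freeze the original weights
        d_out, d_in = base.weight.shape
        # Effective weight: W + (alpha / rank) * B @ A.
        # B starts at zero, so training begins from the base model's behavior.
        self.A = nn.Parameter(torch.randn(rank, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(d_out, rank))
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)
```

To see why this is parameter-efficient: a 4096x4096 weight matrix has ~16.8M parameters, while the rank-8 factors A and B together have 2 x 8 x 4096 = 65,536, roughly 0.4% of the original.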
A soft prompt refers to a set of trainable tokens that are added to a prompt. Unlike the tokens that represent language, these tokens can take on any value within the embedding space. The token values may not be interpretable by humans, but are located in the embedding space close to words related to the language prompt or task to be completed.
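A sketch of the same idea in PyTorch (the wrapper design and number of virtual tokens are assumptions for illustration):

```python
import torch
import torch.nn as nn

class SoftPromptEmbedding(nn.Module):
    """Prepends trainable soft-prompt vectors to the token embeddings (sketch)."""

    def __init__(self, token_embedding: nn.Embedding, num_virtual_tokens: int = 20):
        super().__init__()
        self.token_embedding = token_embedding  # frozen, from the base model
        dim = token_embedding.embedding_dim
        # Trainable vectors that can take any value in embedding space;
        # they need not correspond to real vocabulary tokens.
        self.soft_prompt = nn.Parameter(torch.randn(num_virtual_tokens, dim) * 0.01)

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        tok = self.token_embedding(input_ids)  # (batch, seq, dim)
        prompt = self.soft_prompt.unsqueeze(0).expand(tok.size(0), -1, -1)
        return torch.cat([prompt, tok], dim=1)  # virtual tokens go first
```

Only the soft-prompt vectors are trained; the rest of the model stays frozen, so a single base model can be reused across tasks by swapping prompts.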
Evaluation metrics:

ROUGE: used for evaluating text summarization
BLEU score: used for evaluating text translation
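To make ROUGE-1 concrete, here is a simplified pure-Python version (BLEU is analogous but built on n-gram precision with a brevity penalty; real evaluations should use an established library):

```python
from collections import Counter

def rouge_1(reference: str, candidate: str) -> dict:
    """Simplified ROUGE-1: unigram overlap between candidate and reference."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum((ref & cand).values())  # clipped unigram matches
    recall = overlap / max(sum(ref.values()), 1)
    precision = overlap / max(sum(cand.values()), 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-9)
    return {"precision": precision, "recall": recall, "f1": f1}

print(rouge_1("It is cold outside", "It is very cold outside"))
# precision 0.8, recall 1.0, f1 ~0.89
```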
Resources

Scaling Instruction-Finetuned Language Models - Scaling fine-tuning with a focus on task, model size, and chain-of-thought data.
Introducing FLAN: More generalizable Language Models with Instruction Fine-Tuning - This blog (and article) explores instruction fine-tuning, which aims to make language models better at performing NLP tasks with zero-shot inference.
HELM - Holistic Evaluation of Language Models - HELM is a living benchmark to evaluate Language Models more transparently.
General Language Understanding Evaluation (GLUE) benchmark - This paper introduces GLUE, a benchmark for evaluating models on diverse natural language understanding (NLU) tasks and emphasizing the importance of improved general NLU systems.
SuperGLUE - This paper introduces SuperGLUE, a benchmark designed to evaluate the performance of various NLP models on a range of challenging language understanding tasks.
ROUGE: A Package for Automatic Evaluation of Summaries - This paper introduces and evaluates four different measures (ROUGE-N, ROUGE-L, ROUGE-W, and ROUGE-S) in the ROUGE summarization evaluation package, which assess the quality of summaries by comparing them to ideal human-generated summaries.
Measuring Massive Multitask Language Understanding (MMLU) - This paper presents a new test to measure multitask accuracy in text models, highlighting the need for substantial improvements in achieving expert-level accuracy and addressing lopsided performance and low accuracy on socially important subjects.
BigBench-Hard - Beyond the Imitation Game: Quantifying and Extrapolating the Capabilities of Language Models - The paper introduces BIG-bench, a benchmark for evaluating language models on challenging tasks, providing insights on scale, calibration, and social bias.
Scaling Down to Scale Up: A Guide to Parameter-Efficient Fine-Tuning - This paper provides a systematic overview of Parameter-Efficient Fine-tuning (PEFT) Methods in all three categories discussed in the lecture videos.
On the Effectiveness of Parameter-Efficient Fine-Tuning - The paper analyzes sparse fine-tuning methods for pre-trained models in NLP.
LoRA: Low-Rank Adaptation of Large Language Models - This paper proposes a parameter-efficient fine-tuning method that uses low-rank decomposition matrices to reduce the number of trainable parameters needed for fine-tuning language models.
QLoRA: Efficient Finetuning of Quantized LLMs - This paper introduces an efficient method for fine-tuning large language models on a single GPU, based on quantization, achieving impressive results on benchmark tests.
The Power of Scale for Parameter-Efficient Prompt Tuning - The paper explores "prompt tuning," a method for conditioning language models with learned soft prompts, achieving competitive performance compared to full fine-tuning and enabling model reuse for many tasks.