Watch full video from the OpenAI team here.

Optimizing LLMs is still hard, no matter how many frameworks there are.

Main difficulties:

Extracting signal from the noise is hard
Performance can be best abstract and difficult to measure
When to use what optimization

The optimization flow

There are two axis along which we can optimize the models:

Context optimization: what the model needs to know
LLM optimization: How the model needs to act

The path isn’t necessarily linear, you usually start off with prompt engineering, use few-shot examples, then you notice the issue is in knowledge – so you add retrieval, then the format of answering isn’t right – so you do fine-tuning, then you go back and improve retrieval with HyDE and fact-checking and so on…

RAG and fine-tuning solve inherently different problems. If you imagine that the model has short and long term memory-loss, then RAG solves the short and fine-tune solves the long.

RAG – when the model needs specific information to answer questions
Fine-tune – when the model needs to follow a structure, style, format Both approaches are additive, sometimes you might need both for the most optimal performance.

Prompt engineering

Start with:

Writing clear instructions
Split complex tasks into smaller subtasks
Give GPTs time to “think”
- Like ReAct
Test changes systematically

Extend to:

Providing reference text
Use external tools

Good for:

Testing and learning early
With evaluation, can be set up as a baseline for further optimization

Not good for:

Introducing new information
- Models can hallucinate
Reliably replicate a complex style/method
- For example, can’t learn a new programming language
Minimizing token usage
- Providing instructions increases the number of tokens

Retrieval augmented generation

Giving model access to domain specific knowledge. There are many ways to do it. Usually, some documents are embedded into a knowledge base. Afterwards, a user can ask an agent a question and the agent will be able to retrieve relevant documents from the knowledge base and give a relevant answer.

Good for

Introducing new information to the model to update its knowledge

Reducing hallucinations be controlling the content

Give an instruction to only use the knowledge provided to do the generation

Not good for:

Embedding understanding of a broad domain

The model isn’t learning this information, it uses what it has at hand to generate a response

Teaching the model to learn a new language, format or style

This is more of fine-tuning

Reducing token usage

This is gonna use much more tokens

The methods OpenAI team used for a customer and what performance boosts they gave. Ones with the checkmarks made it to production.

Methods that they used:

Basic retrieval using cosine similarity on embeddings
Hypothetical Document Embeddings (HyDE)
- Generate hypothetical document and then use it to retrieve similar real ones
Fine-tuning the embeddings
- Okay in improving accuracy, but
- Turns out to be impractically expensive with time and money
Chunking/ embedding experiments
- Trying different sizes of chunks for embeddings
Reranking to choose better from the retrieved content
- Use cross encoder to compare the query and retrieved result to see if any are more relevant
- Use rules, for example for research more recent articles could be preferred
Classification of domain, for conditional prompts
- Let the model classify what case it is and then based on that add additional prompts
More prompt engineering
Observe where the model is making mistakes and improve there
- Tools for structured data questions
Query expansion
- Users ask a lot of questions in one, you split them, process them in parallel and combine join back to one

RAG is about context optimization

There are two directions to improve with RAG, either you are

Not giving it the right context or

The model doesn’t know which of those context blocks is the right one

Personally, it reminds of the precision and recall problem

Warning about RAG

To prevent hallucinations, often, an instruction of sorts “use only the provided content as a source” is added. However, if the retrieval is poorly made – the LLM can use spurious results from the search which would lead to an incorrect answer.

By adding RAG, you add another axis of where stuff could go wrong. The LLM can make mistakes, and now there is search – which is an unsolved problem.

Evaluation RAG

Ragas score evaluates into two directions

Generation: How well the LLM answered the question

Faithfulness – how factually accurate is the generated answer
- Basically check for hallucinations, how much is LLM not making up?
Answer relevancy – how relevant is the generated answer to the question
- The answer can be factually correct, but totally irrelevant

Retrieval: How relevant the content retrieved is to the question

Context precision – the signal to noise ratio of retrieved context
- Of the information that was retrieved, how much is actually relevant
- Technically we can give it a lot of content, but then the model can have a lost in the middle problem where it forgets things in the middle of the context and starts to hallucinate
- We want to increase this metric, partially because noice wastes tokens (hence money and space for actually relevant information)
- Can we get high quality answers why using less context
Context recall – can it retrieve all the relevant information required to answer a question
- Is there information that was not retrieved, but was actually relevant?
- We want to increase this so that the information retrieved for the LLM is complete
- This metric shows if you need to do reranking, changing embeddings to surface more relevant content

Fine-tuning

Fine-tuning is when a trained model, is continuing the training on a smaller, domain-specific dataset to optimize for the model for a specific task

For an LLM, instead of explaining how you want the model to reply (prompt engineering) we fine-tune a model to output in that format. No need to provide examples.

Benefits:

Improve model’s performance on a specific task
- During fine-tuning the model will see many more examples of how to do a task than wtih prompt engineering or few-shot learning
Improve model efficiency
- Reduce the number of tokens needed to explain a specific task
  - The model knows the pattern of the task, so you don’t need to explain it
- Distill the expertise of a large model into a smaller one
  - Generate examples using a bigger smarter model and using that train a smaller one

Good for

Emphasizing knowledge that already exists in the model

Customizing the structure or tone of responses

Teaching a model very complex instructions

You can just show much more

Bad for

Adding new knowledge to the base model

It doesn’t efficiently learn new information like this

You are better off doing RAG

Quickly iterating on a new use-case

It takes time to create a dataset and time to get the training of the model right

Slow feedback loop

Fine-tuning best practices

Don’t start with fine-tuning: try prompt engineering and few-shot learning

Establish a baseline: use it to compare your fine-tuned models to

Quality over quantity: build small but high quality examples

Datasets are difficult to build, but you also don’t need a HUGE amount of data, the model is pre-trained, what you do need are a few but high quality examples of expected behaviour

Warning about fine-tuning

During fine-tuning the model will learn to reflect the dataset that was used for training. It will reflect the type of language that is in the dataset.

A funny story mentioned was about a team fine-tuning GPT on their Slack messages to capture their tone of writing for blog posts. But when prompted to write a post, GPT said, “sure, I’ll work on that on Monday.”

Fine-tuning worked well, it reflected their writing style – but for Slack, and not their tone for the blog as the engineers expected.

What they should have done is fine-tuned on a small dataset an checked if the model is going in the right direction.

Steps to fine-tuning a model:

Collecting the data, cleaning, validating, formatting it
- This is usually the most difficult part of the ML workflow
- Possible ways to acquire one could be: downloading from online, buy data, hiring labelers or distill it from a bigger model (if it’s allowed)
Training
- if DIY, then you need GPUs, use a framework
- selecting hyperparameters
  - overfitting/ underfitting
- understanding the loss function
  - in this case, the loss function is a proxy for next token prediction
  - improvements in loss might not correlate to with the downstream tasks
Evaluation
- Have humans label how good results are
- Compare the results of two models against each other
- Have a bigger model evaluate the smaller one
Inference
- Save the examples and results from production and go back to improve, ie feedback loop

Fine-tuning + RAG

Combining the two, brings the best of both worlds

Fine-tuning the model to understand the complex instructions
- No more need to explain it, leading to…
Minimizing prompt engineering tokens
- More space for retrieved context
Use RAG to inject relevant knowledge into the context
- Model answers based on retrieved knowledge

Afterwards the team went through an example of how they optimized LLMs for SQL query generation based on the schema and question.

Mansur N

Recent Thoughts

making offline first web app

ai assisted data science

language learning app

gift from the netherlands

Recent Notes

a survey of techniques for maximizing LLM performance

AI and ironies of automation

a survey of techniques for maximizing LLM performance

The optimization flow

Prompt engineering

Retrieval augmented generation

Evaluation RAG

Fine-tuning

Fine-tuning + RAG

Recent Thoughts

making offline first web app

ai assisted data science

language learning app

gift from the netherlands

Recent Notes

a survey of techniques for maximizing LLM performance

AI and ironies of automation

Graph View

Table of Contents