
We Need a Better Set of LLM Evaluations


Before we dive in, if you’d like a briefer on strategies for evaluating LLMs, check out this post.

This post will focus on LLM evaluations and why they are so hard.

Language models like ChatGPT are powerful knowledge and productivity tools. However, if you spend just a small amount of time with these models, you’ll quickly find that they are good at certain tasks and not others. For example, language models are great at Question-and-Answer tasks as well as Retrieval tasks. Why? Most likely because the majority of their training data focused on those kinds of tasks.

What are some tasks that models are often bad at? In general, these models are still learning how to effectively respond to domain-specific business tasks and some open-ended questions. Some models already perform well on open-ended questions, while others don’t. For example, if you ask two different language models how a change in monetary policy affects your daily life, they will come up with two pretty different responses.

Input: If the FED buys bonds, how does that affect my daily life? Give 3 examples.
Model 1 Output:

1. It increases the money supply.

2. It increases the money supply and lowers the discount rate.

3. Increases the money supply, lowers the discount rate and increases the money supply.

This response is repetitive and not very informative.

Model 2 Output: 

Here are 3 examples:

1. Interest Rates: Rates on mortgages, car loans, and credit cards may go up.

2. Inflation: This can make it more expensive to buy basic goods and services like groceries, gas, and other necessities.

3. Employment: Since there’s more money in the economy, the economy may grow, leading to an increase in employment opportunities.

This response is clear and informative.

It’s clear that we need some way to evaluate language models in order to select the best model for our use cases.

Human-generated datasets

One way of doing LLM evaluations is to build a human-generated dataset. Human-generated datasets are great at aligning with human preferences, in part because it can be difficult to specify what exactly makes for a good output (is it accuracy, brevity, creativity, or something else?). Sometimes multiple attributes we care about may be in conflict with each other: how helpful vs. harmless do you want a model to be? It could be helpful for a model to tell you how to hack into your neighbor’s wifi, but that wouldn’t be very harmless. At the same time, the model could be maximally harmless by refusing to discuss politics or current affairs with you, but that might prevent it from helping you write a short and sweet poem about your favorite Presidential candidate. Human judges can often understand the nuance of what is or is not acceptable without having to explicitly describe the balance between attributes.

However, one issue with human datasets is that they are inherently biased toward the annotators who label your data. Humans don’t always agree, and that can make it challenging when you’re trying to set a standard for what good looks like! Some may prefer a certain tone or level of verbosity over others. Some may prefer chocolate to vanilla ice cream, and others may prefer savory sandwich recipes over sweet ones. It is worth noting that humans do agree on some things: repetitiveness is bad, NSFW content is bad, and being vague is bad.

This is why it’s important to build a diverse pool of annotators who will bring a variety of backgrounds and perspectives to the table.

A third problem with human-generated datasets is that they are time-consuming and costly. As a result, they can be difficult and expensive to scale to large volumes. Because of these challenges, researchers and businesses have started looking toward LLM-generated evaluations to augment that process.

So what do these datasets look like?

LLM-generated datasets

So what are some options for LLM-generated evaluations right now? One area that has been researched is how LLMs perform as judges of other LLMs.

LLM evaluations are meant to perform the same kinds of tasks that human-generated evaluations do, often at a larger scale. LLM evaluations may include:

  • Pairwise comparison: the LLM judge is presented with two answers to a prompt, from two different LLMs, and chooses a winner or declares a tie.
  • Single answer grading: the LLM judge scores the quality of a single answer based on its general knowledge.
  • Reference-guided grading: the LLM judge is provided a reference solution for what a good answer looks like. This works well for math problems.

Pairwise comparisons may include assessing whether Assistant A or Assistant B gives the better answer. Single answer grading may include a score on a Likert scale or a simple thumbs up or thumbs down rating on a single input and output. (Source)
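To make pairwise comparison concrete, here is a minimal sketch of what an LLM judge might look like in practice. It assumes the OpenAI Python SDK and an API key in your environment; the judge prompt, model name, and the `pairwise_judge` helper are illustrative assumptions rather than the exact setup used in the research.

```python
# Minimal sketch of a pairwise-comparison LLM judge.
# Assumes the OpenAI Python SDK (openai>=1.0) and OPENAI_API_KEY in the environment.
# The prompt wording and model name are illustrative, not a prescribed setup.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are an impartial judge. Given a user question and two
candidate answers, decide which answer is better, or declare a tie.
Respond with exactly one of: "A", "B", or "tie".

Question: {question}

Answer A: {answer_a}

Answer B: {answer_b}
"""

def pairwise_judge(question: str, answer_a: str, answer_b: str,
                   model: str = "gpt-4") -> str:
    """Ask the judge model to pick a winner between two candidate answers."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # deterministic judging
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(
                question=question, answer_a=answer_a, answer_b=answer_b),
        }],
    )
    return response.choices[0].message.content.strip()
```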

LLMs may be particularly capable of replacing humans during the evaluation stage since many LLMs are already trained on human preferences via RLHF. Because of that, those large models can exhibit strong alignment with human judgments.

However, research has also examined the limitations of using LLMs as judges (source).

The paper explores four kinds of biases that LLM judges exhibit when performing this kind of task:

  • Position Bias
  • Verbosity Bias
  • Self-Enhancement Bias
  • Limited Capability in Math and Reasoning Questions

Position bias refers to the preference that LLM judges give to the first or second position in a pairwise comparison, regardless of the content. A model may consistently prefer position 1 over position 2 because its training data provided more correct answers in position 1 than in position 2 for pairwise comparisons. There is also name bias, where models prefer Assistant A over Assistant B, regardless of content. Claude v1 and GPT-3.5 are consistently biased in this way. Interestingly, GPT-4 is seemingly less biased and performs well regardless of the position or name of the assistant in 60% of cases. This indicates that progress can be made on this bias in future models.
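In practice, one common way to control for position bias is to judge each pair twice with the answer order swapped and only count verdicts that are consistent across both orderings. The sketch below builds on the hypothetical `pairwise_judge` helper above; treating inconsistent verdicts as a tie is just one possible convention.

```python
# Sketch: reduce position bias by judging both orderings of the same pair.
# Builds on the hypothetical pairwise_judge() helper sketched earlier.
def debiased_pairwise_judge(question: str, answer_a: str, answer_b: str) -> str:
    first = pairwise_judge(question, answer_a, answer_b)   # A shown first
    second = pairwise_judge(question, answer_b, answer_a)  # A shown second

    # Map the second verdict back to the original labels.
    swapped = {"A": "B", "B": "A", "tie": "tie"}.get(second, "tie")

    if first == swapped:
        return first  # consistent verdict regardless of position
    return "tie"      # inconsistent verdicts suggest position bias
```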

Verbosity bias refers to when LLM judges prefer longer, more verbose responses that contain no meaningful new information over shorter, accurate responses. In the research, an LLM is given a short answer, like 5 bullet points, and is asked to rewrite those bullets in a more verbose way. The rewritten result is then compared to the original non-verbose bullets by an LLM judge. The judges consistently prefer the more verbose answers, even when they don’t add any meaningful new information. Brevity is something foundation model developers are continuously working on, so it will be interesting to see what progress is made here in the future.

Self-enhancement bias refers to when LLM-Judges prefer answers generated by the same underlying LLM. For example, GPT-4 may be your LLM-Judge of choice. This bias refers to when GPT-4 prefers outputs generated by GPT-4 over outputs from other models. GPT-4 prefers itself with a 10% higher win rate, and Claude v1 prefers itself with a 25% higher win rate. Research here is limited, though, so more will have to be done to form any conclusive answers.

Limited math and reasoning capabilities have been discovered in even the most sophisticated LLMs. LLMs are able to generate correct answers to math and reasoning questions, but when LLM-Judges are asked to select between two answers, they often fail to select the correct one. Research suggests this bias can be mitigated with reference-guided grading. Since these LLMs can generate the correct answer themselves, the judge is first asked to solve the math or reasoning problem on its own. It is then asked to use that generated answer as a reference when deciding which of the two candidate answers is correct. This process improves judge performance on these kinds of tasks.
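As a rough illustration of that reference-guided process, the sketch below (reusing the hypothetical `client` from the earlier example) first asks the judge model to solve the problem on its own, then passes that solution in as a reference when comparing the two candidate answers. The prompts and function name are assumptions for illustration.

```python
# Sketch of reference-guided grading for math and reasoning questions.
# Reuses the OpenAI client defined in the earlier pairwise-judge sketch.
def reference_guided_judge(question: str, answer_a: str, answer_b: str,
                           model: str = "gpt-4") -> str:
    # Step 1: have the judge solve the problem independently.
    reference = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[{"role": "user",
                   "content": f"Solve the following problem step by step:\n{question}"}],
    ).choices[0].message.content

    # Step 2: judge the two candidate answers against the generated reference.
    prompt = (
        "You are an impartial judge. Use the reference solution to decide which "
        'candidate answer is correct. Respond with exactly "A", "B", or "tie".\n\n'
        f"Problem: {question}\n\n"
        f"Reference solution: {reference}\n\n"
        f"Answer A: {answer_a}\n\n"
        f"Answer B: {answer_b}"
    )
    verdict = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[{"role": "user", "content": prompt}],
    )
    return verdict.choices[0].message.content.strip()
```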

Overall, this research found agreement between human judges and GPT-4 in 85% of cases. Surprisingly, this is actually higher than the agreement between two human judges, which is 81%. This leads the researchers to believe that LLM-Judges provide a scalable alternative to human evaluations.

The main upshot is that if LLMs can generate evaluations of similar quality to human raters, LLM-generated evaluation datasets can approximate human preferences. This is exciting because getting enough human data to do LLM evaluations is difficult, time-consuming, and expensive. If we can leverage LLMs to produce large evaluation datasets, we can scale evaluations quickly and cost-effectively.

The question is, is this enough? LLM evaluations are tricky because the methods we have right now are still limited. There are many dimensions of a good text output that matter: accuracy, fluency, clarity, brevity, tone, and many more. AI Leaderboards, like Hugging Face’s Open LLM Leaderboard, and benchmark datasets, like MMLU, only measure a small number of the characteristics that matter. Winning on one benchmark may not generalize to a variety of tasks and may not signal that the same model will win on other benchmarks. Just because one model is good at math doesn’t mean it will be good at dialogue or retrieval. So, if what constitutes a good output for an enterprise is so multifaceted, why would we use limited benchmarks or leaderboards to decide which model to use?

So where does that leave us?

We need a better set of LLM evals

Benchmarks and AI Leaderboards are a great start, but they’re not enough for enterprise-grade production.

Why?

Benchmarks and AI Leaderboards only measure a few of the characteristics that enterprises may care about: truthfulness, knowledge of the world, and style, to name a few. Enterprises also care whether AI models know about their specific products and use cases, and that is something generic benchmarks don’t test.

What’s the problem?

If our goal with these language models is to have them engage in multi-turn, enterprise-specific conversations with humans, maybe we should create a benchmark that measures that outcome directly, instead of by proxy through these other benchmarks.

So where do we go from here?

The best path forward may be generating custom language model evaluation datasets. Not only do you want to test whether models understand your specific product and documentation, you may also want to test the models on your real-world use cases. You may want to build a custom dataset that tests how well models perform on tasks they have consistently failed at in the past, as well as a dataset that focuses on high-value examples from your most important customers.
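As a rough sketch, a custom evaluation dataset can start out as simple as a JSONL file of prompts paired with reference answers and the criteria you care about. The field names and examples below are hypothetical; adapt them to your own product and use cases.

```python
# Hypothetical structure for a custom evaluation dataset, written as JSONL.
# Field names and example content are illustrative only.
import json

examples = [
    {
        "prompt": "A customer asks how to export annotations from our product. Respond helpfully.",
        "reference": "Point them to the export menu and list the supported formats.",
        "criteria": ["accuracy", "brevity", "uses product terminology correctly"],
        "source": "high-value customer support tickets",
    },
    {
        "prompt": "Summarize our refund policy in two sentences.",
        "reference": "State the refund window and what proof of purchase is required.",
        "criteria": ["accuracy", "tone"],
        "source": "tasks the model has consistently failed in the past",
    },
]

with open("custom_eval_set.jsonl", "w") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")
```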

You may also want to build a dataset to fine-tune a small classifier model. This model can act as an enhanced LLM-Judge, customized to your specific needs. In order to fine-tune that small classifier model, you’ll want to build a custom dataset around your unique business use cases.
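Below is a minimal sketch of what fine-tuning such a classifier could look like with Hugging Face transformers, assuming you already have expert-labeled (response, verdict) pairs. The base model, label scheme (0 = bad response, 1 = good response), and example data are illustrative assumptions.

```python
# Sketch: fine-tune a small classifier to act as a custom judge.
# Assumes the Hugging Face transformers and datasets libraries.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "distilbert-base-uncased"  # illustrative small base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Expert-labeled examples: prompt + response text and a human verdict (1 = good, 0 = bad).
raw = [
    {"text": "Q: How do I export labels? A: Use the Export button on the project page.", "label": 1},
    {"text": "Q: How do I export labels? A: I cannot help with that.", "label": 0},
    # ... in practice, many more expert-labeled examples from your own use cases
]

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=256)

train_dataset = Dataset.from_list(raw).map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="custom-judge-classifier",
                           num_train_epochs=3,
                           per_device_train_batch_size=8),
    train_dataset=train_dataset,
)
trainer.train()
```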

Either way, the first step forward is building a custom expert-labeled dataset for your needs.

If you’d like to build a dataset of human- or LLM-generated evaluations of your model, reach out to our team here at HumanSignal.