Why Humans Still Matter in the World of GenAI
Generative AI has been marketed to us as the silver bullet that will change the way that we think about and practice machine learning, and in some ways that’s true. It’s easier than ever before to leverage models trained on huge amounts of data, and out of the box they seem to work well for many use cases, ranging from Natural Language Processing to Image Processing and more. But the idea that these models are perfect all the time and need no further monitoring or tuning from humans isn’t true. In fact, there are many well-known places where LLMs don’t perform well, and they’re known to hallucinate, or make up information. Without careful monitoring and a deep understanding of your task and its goals, it’s difficult to get a real idea of where the model is working and where it isn’t.
Enter a human in the loop. According to the research, “Being meticulous about the data labeling process is important for improving the quality of data, which has a direct impact on the quality of the predictions made by the machine learning models. It can make the difference between having predictions that are accurate 60% to 70% of the time, and getting into that 95% range.” While data annotation for LLMs may look and feel somewhat different than the data annotation of the past, it’s still a crucial step in the machine learning process.
In traditional Machine Learning, a ground truth dataset is human-created and contains the exact correct answers. Even in scenarios where you’re prompting an LLM to give some answer, knowing what those answers should be gives us a much clearer picture of what’s really going on, especially if you’re asking the LLM to do something like NER or classification, which have correct answers. Without ground truth data, how will you be able to compare different versions of a prompt to know if you’re really getting closer to what you’re trying to do? How will you know that your model isn’t letting important things slip through, as in a PII extraction scenario? Hard metrics, like precision and recall, are still important in the world of Generative AI, and there are other metrics that we might also want to consider, such as clarity, relevancy, or faithfulness.
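To make this concrete, here’s a minimal sketch in Python (with made-up documents and a hypothetical `run_prompt` helper standing in for the actual LLM call) of how a small human-labeled ground truth set lets you score prompt versions on a PII extraction task with precision and recall.

```python
# Minimal sketch: scoring prompt versions for PII extraction against a
# human-labeled ground truth set. `run_prompt` is a hypothetical stand-in
# for your actual LLM call, not a real API.

ground_truth = {
    "doc-1": {"jane.doe@example.com", "555-0142"},
    "doc-2": {"John Smith"},
    "doc-3": set(),  # no PII in this document
}

def precision_recall(predictions: dict[str, set[str]]) -> tuple[float, float]:
    """Micro-averaged precision and recall over all documents."""
    tp = fp = fn = 0
    for doc_id, truth in ground_truth.items():
        pred = predictions.get(doc_id, set())
        tp += len(pred & truth)   # PII correctly extracted
        fp += len(pred - truth)   # extracted items that aren't PII
        fn += len(truth - pred)   # PII the model missed
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall

# Compare two prompt versions by scoring each one's outputs the same way:
# preds_v1 = {doc_id: run_prompt(prompt_v1, doc) for doc_id, doc in docs.items()}
# preds_v2 = {doc_id: run_prompt(prompt_v2, doc) for doc_id, doc in docs.items()}
# print(precision_recall(preds_v1), precision_recall(preds_v2))
```

In a PII scenario like this one, recall is the number to watch: it tells you how much sensitive information is slipping through.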
Metrics are Human-Based
While some metrics, like faithfulness, can be calculated by a machine (think string matching or measuring the distance between vectors), others, like precision and recall, can only be calculated when we have a ground truth dataset to compare against. Many machine learning systems rely on benchmark datasets to create metrics about how well the system is doing. Like a ground truth dataset, benchmarks historically have helped us to compare different models working on the same task, because the benchmark dataset is a ground truth dataset in and of itself. The problem is that these benchmark datasets are available to the public on the internet, and as such have become part of the training data for many LLMs. This renders a benchmark nearly useless, as it has been absorbed into the LLM itself. Benchmark datasets also don’t provide a good sense of human preference or alignment for the models. In the world of GenAI, ground truth datasets may look different than the ground truth sets that we’re used to.
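As a rough illustration of the machine-computable side, here’s a sketch of two simple faithfulness proxies: token overlap between an answer and its source, and cosine similarity between embedding vectors. The `embed` call in the usage comment is a placeholder for whatever embedding model you use, not a real API.

```python
import math

def token_overlap(answer: str, source: str) -> float:
    """Fraction of answer tokens that also appear in the source text."""
    answer_tokens = answer.lower().split()
    source_tokens = set(source.lower().split())
    if not answer_tokens:
        return 0.0
    return sum(t in source_tokens for t in answer_tokens) / len(answer_tokens)

def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Usage sketch (embed() is a placeholder for your embedding model):
# score = cosine_similarity(embed(answer), embed(source_passage))
```

Proxies like these are cheap to run on every response, but they can’t tell you whether an answer is actually correct; that still takes ground truth or a human reviewer.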
There are still some cases with Generative AI where it makes sense to use a traditional ground truth dataset. Asking an LLM to perform a task with a right answer, like classification or NER, still benefits from being able to calculate a precision or recall score based on the ground truth data. Especially when we can format the output of an LLM to be exactly what we need, ground truth datasets provide invaluable metrics.
In the case of Generative AI, where we have non-deterministic models at work, generating these sets can be difficult if not impossible, because the outputs of our models are constantly changing. No two answers, even given the same inputs, are guaranteed to be the same. In these cases, it’s important that we evaluate the responses of our models in every iteration, looking for things like faithfulness, clarity, and reliability. By keeping a pulse on how these models behave, we can get other metrics that serve as a proxy for how well our models are performing, such as how often we rate an answer with 5 stars or how often a model or system is unfaithful to its sources.
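Here’s a minimal sketch of how those proxy metrics might be tracked, using made-up review records rather than any particular tool’s format.

```python
from dataclasses import dataclass

@dataclass
class Review:
    """One human spot-check of a model response."""
    stars: int      # 1-5 overall quality rating
    faithful: bool  # did the answer stick to its sources?

def proxy_metrics(reviews: list[Review]) -> dict[str, float]:
    """Aggregate human reviews into trackable proxy metrics."""
    n = len(reviews)
    return {
        "five_star_rate": sum(r.stars == 5 for r in reviews) / n,
        "unfaithful_rate": sum(not r.faithful for r in reviews) / n,
    }

# Example with made-up review data:
reviews = [Review(5, True), Review(5, True), Review(4, True), Review(2, False)]
print(proxy_metrics(reviews))  # {'five_star_rate': 0.5, 'unfaithful_rate': 0.25}
```

Tracked over time, numbers like these make it obvious when a prompt change or model upgrade quietly degrades the system.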
Most of the metrics that drive machine learning projects, like precision and recall, can be calculated using the ground truth data that we create. Other metrics, like answer quality, can also be assessed with these ground truth or ground-truth-proxy sets. But what about the metrics that drive business value? There’s almost no way to mathematically predict which output a human will prefer without actually asking a human first. These cases require human review and “vibe checks” to make sure that we’re achieving what we want to achieve with our models in a way that will drive business value.
We’re All Responsible
Another key reason to keep a human in the loop in the world of Generative AI is to make sure that we’re using our AI tools in a responsible way. With LLMs, we’ll most often use an off-the-shelf model trained by someone else, and it’s easy to declare that it isn’t our responsibility to ensure that those models behave fairly. But that’s not true: we are all responsible for the models that we train or use, and it’s up to us to understand the potential harms they might cause.
There are two main types of harms we need to consider when we look at the ways that our models impact the world – Harms of Allocation, where a person or group is unfairly provided with or restricted from receiving a resource or opportunity, and Harms of Representation, which are the harms that are caused by systems that reinforce the subordination of one group to another based on identity (like race, gender, etc.). Kate Crawford explains this concept further in her 2017 NIPS keynote, The Trouble with Bias. Both of these types of harms can occur when we use Generative AI, because our models are only as good as the data we train them on. Humans are imperfect, and when we historically underrepresent groups of people in our data, or systemically deny people access to resources, our models learn to exhibit the same behavior. Many of the available LLMs today try to put in safeguards against these types of harms, often by systematically not answering questions that appear to be rooted in something “unsafe”, but these systems aren’t perfect, either.
LLMs are not Reasoners
Outside of the harms that our models could cause to the world, we need to be aware of other shortcomings of LLMs. These models are built on statistical pattern matching: they “learn” the patterns and behaviors of our world from the data they’re trained on. Even when an LLM “provides its reasoning”, it is relying on statistical relationships to do so. Models do not “think” and they do not “reason”, and while their explanations can be insightful and interesting, they are not an unequivocal statement of fact.
Additionally, LLMs have their own sets of biases that impact the way they produce responses. Think about a common use case for LLMs: asking the model to pick between two different items to determine which is better. Here are just a few examples of the biases that LLMs have been known to show in this case (a simple consistency check for the first of them follows the list):
- Position Bias: When asked to choose between two items, an LLM often picks whichever one comes first.
- Verbosity Bias: When asked to choose between two items, an LLM often picks whichever one is longer.
- Self-Enhancement Bias: When asked to choose between items, an LLM often favors the answer that it generated itself.
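One lightweight guard against position bias, sketched below with a hypothetical `judge` function standing in for your LLM-as-judge call, is to run each pairwise comparison twice with the candidates swapped and only trust verdicts that agree.

```python
# Sketch of a position-bias consistency check for LLM-as-judge comparisons.
# `judge(prompt, answer_a, answer_b)` is a hypothetical callable that asks an
# LLM which answer is better and returns "A" or "B".

def consistent_verdict(judge, prompt: str, answer_1: str, answer_2: str):
    first = judge(prompt, answer_1, answer_2)   # answer_1 shown in position A
    second = judge(prompt, answer_2, answer_1)  # positions swapped
    if first == "A" and second == "B":
        return "answer_1"  # same winner both ways
    if first == "B" and second == "A":
        return "answer_2"
    return None  # verdict flipped with position; likely position bias
```

The comparisons where the verdict flips are exactly the ones worth routing to a human reviewer.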
Finally, it’s important to address the ever-present problem of hallucinations. While hallucination is a problem that is being actively addressed in the research world, the fact is that models continue to sometimes just make things up. Without a human in the loop, we might never know if our models are actually factually correct, or if they’re making up facts and presenting them in a convincing manner. The spread of misinformation, intentional or otherwise, can not only be harmful to the people in our world but can also harm our business’s credibility.
Humans Can Help Reduce the Problems
While there are many potential missteps that one could make in the world of Generative AI, it remains true that these models are extremely powerful and largely helpful in the world today. The important thing is to be aware of what our models are doing and how we’re using them. Keeping a human in the loop, either by crafting metrics or by manually reviewing a subset of the outputs of the models, can help us ensure that the models are working the way we believe they are, and that their outputs are correct in every sense of the word. Tools like Label Studio make this process easier, allowing us to seamlessly integrate models into the data annotation process so that we can simultaneously create ground truth data for metrics while deeply investigating the real time performance of our models. No matter how you do it, keeping a human in the loop in our Generative AI world can help us ensure that we’re acting in safe, responsible, and helpful ways for our businesses and our world.