Evaluating LLM-Based Chat Systems for Continuous Improvement
This tutorial walks through how to evaluate multi-turn LLM chat systems using Label Studio. Learn how to collect annotations, generate metrics, and identify areas for improvement through human-in-the-loop feedback.
Transcript
Nate:
Thanks, everyone, for joining. I'm really excited about today’s topic—it’s an actionable and highly relevant area for many teams right now. I think it’ll be useful for everyone listening.
Before we dive in, a few quick announcements. First, we are recording this session and will send out the recording afterward to all attendees. Second, we have additional resources to share—including a GitHub repo, a blog post, and more. Lastly, we’ll have a Q&A after the walkthrough. If you have questions, use the Q&A widget or drop them in the chat. We’ll monitor both and answer as many as we can once the main presentation wraps up.
With that, I’d like to introduce our speaker today: Jimmy Whitaker, Data Scientist in Residence here at HumanSignal. He’s a seasoned expert in AI and designed the entire process he’ll walk us through. So without further delay, I’ll hand it over to Jimmy.
Jimmy:
Thanks, Nate. Let’s dive in.
Today we’re going to talk about evaluating LLM-based chat systems for continuous improvement. It’s a long title, but to put it simply: we’re trying to collect actionable insights from virtual assistants—and for virtual assistants—so we can improve them.
The focus is on understanding multi-turn chat interactions. We’ll use Label Studio to annotate these conversations, and then analyze the data to generate metrics and identify improvement areas. Ultimately, we want to build assistants that perform consistently and effectively.
First, let’s define what a multi-turn conversation is. You’ve likely interacted with chatbots or LLMs where you ask a question, get a response, ask a follow-up, and get another response. Each user-assistant exchange is a “turn,” and a multi-turn conversation is made up of several of these.
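To make that concrete, here is roughly what one of these conversations looks like as data, using the OpenAI-style message format we'll work with later in the walkthrough (the content is made up for illustration):

```python
# A multi-turn conversation as an OpenAI-style message array.
# Each user message plus the assistant reply that follows it is one "turn".
conversation = [
    {"role": "user", "content": "Where is my order #1234?"},
    {"role": "assistant", "content": "Order #1234 shipped yesterday and should arrive Friday."},
    {"role": "user", "content": "Can I still change the delivery address?"},
    {"role": "assistant", "content": "Yes, you can update the address until the package is out for delivery."},
]
# Two turns: (user question, assistant reply) x 2.
```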
There are a few key challenges with multi-turn chats. First is topic shifting—users often start on one subject and then veer into another. That makes context tracking harder. Second, context itself can be subtle. If someone asks about a return policy, the assistant needs to know not just that they’re asking about returns, but also what company they’re referencing. Third, users often do unexpected things—maybe asking the assistant to perform an action it can’t do. And finally, there’s hallucination risk. LLMs might generate responses that sound right but are incorrect or misleading.
All of this makes evaluation essential. We want to improve these systems iteratively and ensure they’re working reliably as we add features or handle new use cases.
The way we approach this is through a continuous feedback loop. This is a familiar concept from software engineering, and it applies just as well to machine learning and LLM development. The loop is: evaluate the system, identify areas for improvement, make those changes, put the system back in production, and repeat.
In fact, I’d argue that real development begins when a system is in production. That’s when it starts interacting with real-world data, and that’s the data we need to evaluate and improve upon.
Now let’s talk about how Label Studio fits in. For those unfamiliar, Label Studio is an open-source tool that supports flexible data labeling for a variety of modalities—images, audio, video, and of course, text. Today we’ll use it to evaluate multi-turn chat systems.
Here’s what the workflow looks like:
We create a Label Studio project programmatically using the Python SDK. This gives us flexibility in configuring our labeling interface. Then we format our data—typically starting from OpenAI-style message arrays—and restructure it to match our template. We annotate the turns in each conversation, export those annotations, and analyze the data in a Jupyter notebook.
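As a rough sketch of that first step, assuming the classic `label_studio_sdk` Client interface and a locally running instance; the URL, API key, and the tiny config here are placeholders, and a fuller per-turn config sketch appears a bit later:

```python
from label_studio_sdk import Client

# Connect to a running Label Studio instance (URL and API key are placeholders).
ls = Client(url="http://localhost:8080", api_key="YOUR_API_KEY")

# A minimal labeling config: show the dialogue, ask one question about it.
# The real template in the walkthrough is larger, with questions per turn.
LABEL_CONFIG = """
<View>
  <Paragraphs name="dialogue" value="$dialogue" layout="dialogue"/>
  <Choices name="overall_quality" toName="dialogue" choice="single">
    <Choice value="Good"/>
    <Choice value="Needs improvement"/>
  </Choices>
</View>
"""

project = ls.start_project(
    title="Multi-turn chat evaluation",
    label_config=LABEL_CONFIG,
)

# Import tasks: each task carries one (padded or split) conversation chunk.
project.import_tasks([
    {"data": {"dialogue": [
        {"author": "user", "text": "Where is my order?"},
        {"author": "assistant", "text": "It shipped yesterday."},
    ]}}
])
```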
The core idea is to split the conversation into turns so we can assess each one individually. We’re setting a static configuration in Label Studio, meaning we define a maximum number of turns per conversation. Shorter conversations get padded. For longer ones, we split them into multiple tasks.
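The split-and-pad step is plain Python. A minimal sketch, assuming messages strictly alternate user/assistant and a fixed maximum of five turns per task (the numbers and empty-string padding are illustrative):

```python
MAX_TURNS = 5  # the static maximum of turn slots defined in the labeling template

def to_tasks(messages, max_turns=MAX_TURNS):
    """Split one conversation into fixed-size chunks of turns, padding the last chunk."""
    # Pair each user message with the assistant reply that follows it.
    turns = [
        (messages[i], messages[i + 1])
        for i in range(0, len(messages) - 1, 2)
        if messages[i]["role"] == "user" and messages[i + 1]["role"] == "assistant"
    ]

    tasks = []
    for start in range(0, len(turns), max_turns):
        chunk = turns[start:start + max_turns]
        # Pad short chunks so the static template always has max_turns slots to fill.
        while len(chunk) < max_turns:
            chunk.append((
                {"role": "user", "content": ""},
                {"role": "assistant", "content": ""},
            ))
        tasks.append({
            "turns": [{"user": u["content"], "assistant": a["content"]} for u, a in chunk]
        })
    return tasks

# Example: a 7-turn conversation becomes two tasks (5 turns, then 2 turns padded to 5).
```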
We’ll use a custom UI template in Label Studio that shows the full conversation on one side and lets you answer multiple-choice questions for each turn on the other. These questions focus on things like user intent, whether the assistant addressed that intent, the quality of the response, and any suggested actions.
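A trimmed-down sketch of what such a template might look like, repeating a question block per turn slot; the tag names and choice values here are my own, not the exact ones from the example project:

```python
# One question block per turn slot; the real template repeats this up to the
# maximum turn count and adds questions for response quality and suggested actions.
PER_TURN_CONFIG = """
<View>
  <Paragraphs name="dialogue" value="$dialogue" layout="dialogue"/>

  <Header value="Turn 1"/>
  <Choices name="intent_turn_1" toName="dialogue" choice="single">
    <Choice value="Returns"/>
    <Choice value="Order status"/>
    <Choice value="Product inquiry"/>
    <Choice value="Other"/>
  </Choices>
  <Choices name="addressed_turn_1" toName="dialogue" choice="single">
    <Choice value="Fully addressed"/>
    <Choice value="Partially addressed"/>
    <Choice value="Not addressed"/>
  </Choices>

  <!-- repeat the Turn blocks up to the maximum number of turns -->
</View>
"""
```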
Once annotation is done, we export the data as JSON and analyze it.
In our example, we’re using e-commerce conversations. Each one has three to five turns. After annotating, we parse the results to extract useful metrics (I’ll show a rough parsing sketch right after this list), such as:
- What users are asking about (returns, order status, product inquiries, etc.)
- How well the assistant addressed those topics
- Whether responses were accurate and helpful
- Whether the assistant suggested or confirmed any actions
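Here is that parsing sketch, assuming the standard Label Studio JSON export and the hypothetical per-turn question names from the config sketch above (intent_turn_1, addressed_turn_1, and so on; the file path is a placeholder):

```python
import json
from collections import Counter

# Load the JSON export from Label Studio.
with open("export.json") as f:
    tasks = json.load(f)

intent_counts = Counter()
addressed_counts = Counter()

for task in tasks:
    for annotation in task.get("annotations", []):
        for item in annotation["result"]:
            if item.get("type") != "choices":
                continue
            choice = item["value"]["choices"][0]
            if item["from_name"].startswith("intent_turn_"):
                intent_counts[choice] += 1        # what users are asking about
            elif item["from_name"].startswith("addressed_turn_"):
                addressed_counts[choice] += 1     # how well the assistant handled it

print("Intent distribution:", intent_counts)
print("Addressed distribution:", addressed_counts)
```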
Then we zoom out to the conversation level. We analyze transitions between intents—say, someone starts with a question about an order and ends up asking about a refund. We assess how consistent users are across turns, and whether conversations stay on a single topic or move between several.
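One way to get at those transitions, continuing the parsing sketch above: collect the per-turn intents of each conversation in order, then count consecutive pairs (again assuming the hypothetical intent_turn_N question names):

```python
import json
from collections import Counter

with open("export.json") as f:  # same export file as the previous sketch
    tasks = json.load(f)

transitions = Counter()
for task in tasks:
    for annotation in task.get("annotations", []):
        # Gather intents in turn order from intent_turn_1, intent_turn_2, ...
        intents = {}
        for item in annotation["result"]:
            name = item["from_name"]
            if item.get("type") == "choices" and name.startswith("intent_turn_"):
                intents[int(name.rsplit("_", 1)[1])] = item["value"]["choices"][0]
        ordered = [intents[k] for k in sorted(intents)]
        # Count consecutive intent pairs, e.g. ("Order status", "Returns").
        transitions.update(zip(ordered, ordered[1:]))

print(transitions.most_common(5))
```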
This helps us identify common flows, edge cases, or pain points. For example, if people often ask about refunds after checking an order status, maybe there’s a larger issue in the fulfillment process—not just with the assistant.
We can also look at how often the assistant partially or fully addressed an intent, and where responses were not helpful. This guides what to fix, refine, or flag.
From there, there are many possible actions. We could:
- Use strong conversations as training data for model fine-tuning
- Update the vector database in a RAG system with better source material
- Flag common failure points or hallucinations
- Audit risky actions like issuing refunds or triggering tools
All of this feeds back into that continuous improvement loop.
Now, this is a small example—just three conversations—but you can see how it scales. With more data, the same metrics reveal trends that can significantly shape your assistant's development.
Some people ask whether AI can help with the annotation process. The answer is yes. You can use an LLM to generate predictions and then have humans review or correct those. This saves time and reduces manual effort while still maintaining quality through human-in-the-loop oversight.
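One way to wire that up, as a sketch: have an LLM propose an answer for each question and attach it to the task as a Label Studio prediction, so annotators only review and correct. The question names match the hypothetical config above, and the LLM call is stubbed out:

```python
def llm_guess_intent(user_message: str) -> str:
    """Placeholder for an LLM call that classifies the user's intent."""
    # In practice, call your LLM of choice here and map its answer
    # to one of the choices defined in the labeling config.
    return "Order status"

task = {
    "data": {"dialogue": [
        {"author": "user", "text": "Where is my order?"},
        {"author": "assistant", "text": "It shipped yesterday."},
    ]},
    # Pre-annotations show up in the UI for annotators to accept or fix.
    "predictions": [{
        "model_version": "llm-preannotate-v0",
        "result": [{
            "from_name": "intent_turn_1",   # must match a name in the labeling config
            "to_name": "dialogue",
            "type": "choices",
            "value": {"choices": [llm_guess_intent("Where is my order?")]},
        }],
    }],
}
# project.import_tasks([task])  # same import call as in the earlier sketch
```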
You can also extend this workflow. For example, if you’re using a RAG or agent-based system, you might include the retrieved context or tool calls in the labeling interface. You could ask annotators whether the right tool was used or if the context was appropriate.
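For instance, a RAG-oriented variant of the template might add the retrieved context alongside the dialogue and a question about it, something like this (tag names and data fields are again my own, not a prescribed schema):

```python
RAG_CONFIG_SNIPPET = """
<View>
  <Paragraphs name="dialogue" value="$dialogue" layout="dialogue"/>

  <Header value="Retrieved context"/>
  <Text name="context" value="$retrieved_context"/>
  <Choices name="context_relevant" toName="dialogue" choice="single">
    <Choice value="Relevant"/>
    <Choice value="Partially relevant"/>
    <Choice value="Irrelevant"/>
  </Choices>
</View>
"""
```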
In Label Studio Enterprise, you can also automate some of these evaluation workflows so you don’t need to rerun the notebook each time. That’s another powerful way to make this process more scalable.
To wrap up: Label Studio offers a flexible framework for evaluating multi-turn conversations. You can start with a simple template like the one we showed today, and then customize it to your own use case. The goal isn’t just to collect metrics—it’s to improve your assistant over time.
Here’s a link to the blog post if you want to read more or grab the code examples. And with that, I’ll turn it back over to Nate.
Nate:
Thanks so much, Jimmy. That was great.
We’ve got a few questions. We don’t have time to get to them all, but we’ll answer a couple quickly.
Audience Question:
What are some best practices for designing effective labeling guidelines to ensure high-quality annotations for multi-turn generative chat systems?
Jimmy:
Great question. First, it depends on what stage you’re in. If your assistant is already deployed, you probably have a sense of what users are asking. Start by categorizing those into broad buckets—what are the main topics?
Then, get into the data. There’s really no substitute for looking at real interactions. That will help you design useful labeling questions. Keep your interface simple and quick to annotate, and consider having review layers—maybe a first-pass annotator and then a reviewer. It starts to look a lot like software development: iterative, with review cycles and refinement over time.
Nate:
Next question—were the evaluation questions generated by Label Studio or did you write them yourself?
Jimmy:
I wrote them myself. They were based on the e-commerce conversations I created for the example. I had specific questions I wanted to answer about the assistant’s performance, so I structured the template accordingly. But others can easily start with this and modify it for their own use case.
Nate:
Can you inspect tasks where annotators disagree?
Jimmy:
Yes. Label Studio does track that. In the open source version, you can compare annotations by user. In the Enterprise version, there are reports and visualizations that help identify disagreement rates across annotators or specific questions.
Nate:
Should you use the first batch of manual labels to train a model instead of generating more labels with an LLM?
Jimmy:
It depends. You could train a lightweight classifier or model on your labeled data, then use it to label more examples—just make sure to validate its performance first. You can also use LLMs to generate predictions and have humans review them. The key is quality control—don’t assume automated labels are always right.
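As a sketch of that lightweight-classifier route, assuming you have already pulled (user message, labeled intent) pairs out of your annotations: a simple bag-of-words model with a held-out validation split, using scikit-learn (the data here is toy data):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# (user message, labeled intent) pairs extracted from your annotations.
texts = [
    "Where is my order?",
    "Has my package shipped yet?",
    "I want to return these shoes",
    "How do I get a refund for a damaged item?",
    "Do you have this jacket in blue?",
    "Is this phone compatible with my charger?",
]
labels = ["Order status", "Order status", "Returns", "Returns",
          "Product inquiry", "Product inquiry"]

X_train, X_val, y_train, y_val = train_test_split(
    texts, labels, test_size=0.33, random_state=0
)

clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)

# Validate before trusting the model to label more data.
print("Validation accuracy:", clf.score(X_val, y_val))
# new_labels = clf.predict(unlabeled_texts)  # route low-confidence cases to human review
```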
Nate:
Last question—what LLMs have you found most effective for generating templates?
Jimmy:
I’ve used a few—ChatGPT, Claude, LLaMA 3. They can do a solid job with basic templates, especially for formatting and CSS. But be cautious—sometimes they invent features that don’t exist in Label Studio. I still find myself tweaking and iterating manually.
Nate:
Awesome. That’s all the time we’ve got. Huge thanks again to Jimmy, and thanks to everyone who joined. We’ll send the recording and links shortly. Have a great rest of your day!