
How to Evaluate and Compare LLMs Using Prompts in Label Studio

Tutorials

As large language models become more central to labeling workflows, one question comes up again and again: how do you know if a model is performing well for your specific use case? Whether you're using an LLM to pre-label data, auto-generate annotations, or assist human reviewers, evaluating model outputs is critical to building trust and scaling your system with confidence.

In this blog, we’ll explore two approaches for assessing model quality in Label Studio. The first relies on ground truth data to generate accuracy scores. The second compares model predictions in cases where ground truth is unavailable, helping you identify gaps and make informed decisions about which models to use or fine-tune.

Why LLM Evaluation Needs Structure

Using LLMs to label data offers a major speed boost, but without a clear evaluation method, it's easy to lose track of quality. Manual inspection is time-consuming and subjective. What’s needed is a structured, repeatable way to compare outputs, both against existing annotations and across different models or prompt versions.

Label Studio’s Prompts feature makes this possible. It allows users to:

  • Connect to leading model providers or custom endpoints
  • Build and version prompts directly within the UI
  • Run evaluations on any project, with or without ground truth
  • Compare results side by side
  • Automatically annotate tasks with selected model outputs

With Prompts, you get a complete loop for testing, improving, and operationalizing LLM-based labeling.
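Everything below can be done in the UI, but if you prefer to script parts of the loop, the Label Studio Python SDK exposes the same projects and tasks. A minimal connection sketch, assuming a hosted instance and a placeholder project ID (the client interface has changed between SDK releases, so adjust for your version):

```python
# pip install label-studio-sdk
from label_studio_sdk import Client

# Placeholders: point these at your own instance and API key
LABEL_STUDIO_URL = "https://app.humansignal.com"
API_KEY = "your-api-key"

ls = Client(url=LABEL_STUDIO_URL, api_key=API_KEY)
ls.check_connection()  # fails fast if the URL or key is wrong

# Fetch the project whose tasks you want to evaluate (ID is a placeholder)
project = ls.get_project(1)
print(project.id)
```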

Workflow 1: Evaluate with Ground Truth

When you already have a labeled dataset, evaluating a model is straightforward. In Label Studio, this starts by marking a subset of tasks as ground truth. You can then run a prompt using a selected base model and compare the model's predictions to the ground truth annotations.

Label Studio displays both outputs side by side, highlighting differences. It also calculates an agreement score for each task, as well as a macro-level score across all evaluated tasks. This helps you answer questions like:

  • How close is the model output to human-reviewed labels?
  • Are there specific types of inputs where the model underperforms?
  • What’s the cost-performance tradeoff across different models?

You can also evaluate multiple models or prompt variations within the same interface by creating new prompt versions. Each version is stored with its associated model, instruction, and results, making it easy to track what changes had the biggest impact.
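To make the scoring concrete, here is a rough local sketch of what a per-task and macro agreement calculation looks like for a single-choice classification task. It illustrates the idea behind the scores Label Studio reports rather than its exact implementation, and the `ground_truth` and `predictions` dicts are hypothetical stand-ins for labels exported from a project:

```python
# Hypothetical example data: task_id -> single choice label
ground_truth = {101: "cat", 102: "dog", 103: "dog", 104: "bird"}
predictions  = {101: "cat", 102: "dog", 103: "cat", 104: "bird"}

# Per-task agreement: 1.0 if the predicted label matches ground truth, else 0.0
per_task = {
    task_id: float(predictions.get(task_id) == label)
    for task_id, label in ground_truth.items()
}

# Macro score: average agreement over all evaluated tasks
macro = sum(per_task.values()) / len(per_task)

print(per_task)                          # {101: 1.0, 102: 1.0, 103: 0.0, 104: 1.0}
print(f"macro agreement: {macro:.2f}")   # 0.75
```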

Use Case: Image Captioning with Pre-Labeled Data

In the live workshop, the team walked through a real-world image captioning project. The dataset contained 200 images that needed both classification and descriptive captions. Rather than labeling each image from scratch, the team used Prompts to test two models—GPT-4 and Gemini 1.5 Pro—against a small ground truth set.

The two models produced similar accuracy scores, but Gemini delivered its results at a significantly lower cost. This let participants see firsthand how the tradeoff between performance and price can shape production decisions.

Workflow 2: Compare Without Ground Truth

When ground truth isn't available, Label Studio still gives you a way to evaluate model quality. Instead of comparing model outputs to existing annotations, you compare the predictions of two different models to each other.

The steps look like this (a short code sketch follows the list):

  • Run a prompt using Model A on your dataset
  • Save the predictions as annotations
  • Run a prompt using Model B on the same dataset
  • Save those predictions as well
  • Use Label Studio’s agreement scoring to compare the annotations
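The comparison step boils down to a pairwise agreement rate plus a list of disagreements. The two dicts below are hypothetical exports of each model's predictions; inside Label Studio the scoring runs on the stored annotations, so treat this purely as an illustration of the logic:

```python
# Hypothetical outputs from two models on the same tasks (task_id -> label)
model_a = {1: "billing", 2: "support", 3: "sales", 4: "support"}
model_b = {1: "billing", 2: "support", 3: "support", 4: "sales"}

# Tasks where the two models agree vs. disagree
agree    = [t for t in model_a if model_a[t] == model_b.get(t)]
disagree = [t for t in model_a if model_a[t] != model_b.get(t)]

agreement_rate = len(agree) / len(model_a)
print(f"pairwise agreement: {agreement_rate:.2f}")   # 0.50

# Disagreements are the first candidates for human review or new ground truth
for task_id in disagree:
    print(task_id, model_a[task_id], "vs", model_b[task_id])
```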

This approach helps identify disagreement areas where model behavior diverges. Those tasks often represent edge cases, ambiguous inputs, or gaps in model coverage.

From there, you can prioritize human review on the most problematic tasks or convert them into new ground truth examples. This iterative workflow helps you bootstrap evaluation even in early-stage projects.

Use Case: PII Detection and Summarization

Another example covered in the session was a text-based project involving conversation transcripts. The task was to detect PII and summarize call topics. Two models, Claude Haiku and GPT-4, were run on the same inputs, and their outputs were compared.

By converting predictions into annotations and scoring agreement, the team quickly pinpointed where the models aligned and where they diverged. This helped highlight sensitive cases that warranted closer inspection and potential ground truth creation.
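For span-based tasks like PII detection, exact matching tends to be too strict, because two models can flag the same entity with slightly different boundaries. Here is a small, hypothetical sketch of overlap-based comparison (Label Studio's span agreement metrics are configurable and may be computed differently):

```python
# Hypothetical PII spans from two models: (start, end, label) per transcript
model_a = [(10, 18, "NAME"), (42, 54, "PHONE")]
model_b = [(10, 17, "NAME"), (80, 92, "EMAIL")]

def overlaps(a, b):
    """Two spans match if they share the same label and overlap in position."""
    return a[2] == b[2] and a[0] < b[1] and b[0] < a[1]

matched = [a for a in model_a if any(overlaps(a, b) for b in model_b)]
only_a  = [a for a in model_a if not any(overlaps(a, b) for b in model_b)]
only_b  = [b for b in model_b if not any(overlaps(a, b) for a in model_a)]

print("matched:", matched)   # [(10, 18, 'NAME')]
print("only A:", only_a)     # [(42, 54, 'PHONE')] -> candidates for review
print("only B:", only_b)     # [(80, 92, 'EMAIL')] -> candidates for review
```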

Track Model Cost Alongside Accuracy

When evaluating models, performance isn’t the only metric that matters. Label Studio also tracks inference cost, giving you visibility into what it would take to scale a particular model across your full dataset.

For example, if two models have similar agreement scores but one is three times more expensive, that may influence which model you deploy for production pre-labeling or auto-annotation. Being able to balance quality and cost is essential when scaling human-in-the-loop systems.
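As a back-of-the-envelope illustration with made-up per-task prices, the same agreement score can translate into very different bills once you extrapolate to a full dataset:

```python
# Hypothetical per-task inference costs observed during evaluation
cost_per_task = {"model_a": 0.002, "model_b": 0.006}  # USD
dataset_size = 200_000  # tasks to pre-label in production

for model, cost in cost_per_task.items():
    print(f"{model}: ${cost * dataset_size:,.0f} to label {dataset_size:,} tasks")
# model_a: $400, model_b: $1,200 for the same workload
```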

Flexible Evaluation for Any Data Type

The workflows described above work for both text and image tasks and support a wide range of annotation types, including:

  • Text classification
  • Named entity recognition (NER)
  • Text summarization and extraction
  • Image captioning and classification

Prompts can be used with standard LLMs or connected to custom models hosted in your own environment. You can also customize agreement metrics in the settings, allowing for more advanced evaluations beyond simple accuracy.
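The choice of metric matters most for free-text outputs such as summaries, where exact string equality almost always reads as disagreement. As a purely local illustration of a more lenient metric (not Label Studio's internal implementation), a similarity ratio credits near-identical summaries that an exact-match check would reject:

```python
from difflib import SequenceMatcher

summary_a = "Customer called about a duplicate charge on their June invoice."
summary_b = "Caller reported a duplicate charge on the June invoice."

exact_match = float(summary_a == summary_b)                        # 0.0
similarity  = SequenceMatcher(None, summary_a, summary_b).ratio()  # between 0 and 1, high for near-identical text

print(f"exact match: {exact_match}, similarity: {similarity:.2f}")
```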

Building Confidence, One Prompt at a Time

Evaluating and comparing models shouldn't require writing scripts or manually reviewing every output. Prompts in Label Studio give teams a fast, structured way to test models, adjust prompts, and build better labeling workflows over time.

Whether you're working with ground truth or starting from scratch, you can use these tools to build trust in your model’s predictions and speed up your data pipeline with confidence.

Watch the full webinar below:

