How to build a labeling tool for ASR hypothesis selection

Evaluating automatic speech recognition (ASR) models requires human reviewers to listen to audio clips and choose the best transcript from a list of candidate outputs. Building a custom application for this evaluation loop takes time away from model development. This guide shows you how to build an ASR hypothesis selection interface using Label Studio and an AI coding agent so you can test and deploy an evaluation workflow quickly.

Generate custom labeling interfaces from plain-language requirements using the XML configuration builder skill.

Deploy the generated XML configuration programmatically using the Label Studio SDK.

Present transcript lists alongside a waveform player to speed up annotator evaluation.

Sort annotation queues based on model confidence scores to prioritize difficult audio segments.

The problem

ASR hypothesis selection requires annotators to evaluate an N-best list of candidate transcripts against an underlying audio recording. Reviewers struggle with fatigue when they lack hotkeys for fast playback or when interfaces force them to scroll past messy JSON logs to compare text. Data compliance constraints around biometric voice data often mean you cannot send recordings to a public cloud application. Building an internal evaluation tool from scratch costs engineering weeks that your team should spend improving the core model.

The short answer

With Label Studio as the foundation, an AI coding agent generates the labeling interface itself. The agent uses two tools together. First, it uses the XML labeling config builder skill, which produces optimized Label Studio interface configurations from a plain-language spec. Second, it uses the Label Studio SDK, which wires the config into a real project programmatically. So rather than building a new labeling application from scratch, agents generate the interface from your spec and deploy it into Label Studio in one pass.

Docs: XML config builder skill → https://github.com/HumanSignal/create-xml-labeling-config-skill

Docs: Label Studio SDK/CLI → https://api.labelstud.io/api-reference/introduction/getting-started

Docs: Ranker tag → https://labelstud.io/tags/ranker

Docs: Audio tag → https://labelstud.io/tags/audio.html

Docs: LLM-friendly docs (markdown) → https://labelstud.io/llms.txt

What you're building

An audio player renders a zoomable waveform and supports hotkeys to speed up repeated listening.

A list view displays the candidate transcripts alongside their model identifiers and confidence scores.

A ranker control allows reviewers to drag and drop candidate transcripts to order them by accuracy.

A pairwise comparison mode activates when you need to evaluate exactly two candidate transcripts side by side.

A numeric rating scale captures a subjective quality score for the selected transcript.

A text area captures a free-text rationale from the reviewer explaining why they chose a specific transcript.

How to build it in Label Studio

1. Set up the project

Install and host Label Studio within your own infrastructure to keep sensitive biometric voice data compliant with privacy regulations. Create a task data structure that pairs one audio URL with an array of candidate transcripts and their associated model scores. Add metadata fields for language locale and audio domain so reviewers can filter the queue for specific accents or background noise conditions. Convert audio to a consistent format before import so waveforms stay in sync during playback.
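As a rough sketch, one task for this setup might be assembled as follows. The field names (audio, hyps, locale, domain, top2_margin), the example URL, and the model names are illustrative assumptions, and the id/title/body keys follow the item shape the List tag documents as far as this guide's setup assumes; check them against the List tag docs for your version. The top2_margin field is a hypothetical difficulty proxy you could sort the queue by.

```python
import json


def build_task(utt_id, audio_url, nbest, locale, domain):
    """Pair one audio URL with its N-best list in a single task dict."""
    hyps = [
        {
            # Stable identifier: utterance + model + N-best position (assumed scheme).
            "id": f"{utt_id}:{h['model']}:{i}",
            "title": f"{h['model']} (confidence {h['score']:.2f})",
            "body": h["text"],
        }
        for i, h in enumerate(nbest)
    ]
    # Small gap between the two best scores suggests a harder clip.
    scores = sorted((h["score"] for h in nbest), reverse=True)
    margin = scores[0] - scores[1] if len(scores) > 1 else 1.0
    return {
        "data": {
            "audio": audio_url,
            "hyps": hyps,
            "locale": locale,
            "domain": domain,
            "top2_margin": round(margin, 4),
        }
    }


nbest = [
    {"model": "model-a", "score": 0.91, "text": "turn the lights off"},
    {"model": "model-b", "score": 0.87, "text": "turn the light off"},
    {"model": "model-a", "score": 0.42, "text": "turn the lights of"},
]
task = build_task(
    "utt-0001", "https://example.com/audio/utt-0001.wav", nbest, "en-US", "smart-home"
)
print(json.dumps(task, indent=2))
```

Sorting the queue by top2_margin in ascending order would surface the clips where the models were least decisive first.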

2. Generate the labeling interface with the XML config skill

Instruct your coding agent to build the interface by providing your feature specification to the XML labeling config builder skill. The agent processes your plain-language requirements and outputs a Label Studio XML configuration that maps your audio and text data to the matching interface components. Review the generated configuration to confirm it uses the tags suited to ASR hypothesis selection workflows:

<Audio name="audio" value="$audio" hotkey="space" /> – An Audio tag renders the source recording with a waveform and a play/pause hotkey to speed up repeated listening.

<List name="hyps" value="$hyps" /> – A List tag displays the array of candidate transcripts and model scores.

<Ranker name="select" toName="hyps" /> – A Ranker tag connects to the list and allows reviewers to reorder the hypotheses by accuracy.

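<Pairwise name="compare" toName="hyp_a,hyp_b" /> – A Pairwise tag lets reviewers choose between exactly two candidate transcripts side by side during A/B tests. Its toName takes two comma-separated object names, so each transcript is rendered by its own object tag such as Text rather than by the List (hyp_a and hyp_b are illustrative names).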

<Rating name="quality" toName="hyps" /> – A Rating tag collects a numeric score for the overall quality of the chosen transcript.
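Assembled into a single View, the N-best tags above might look like the sketch below. The TextArea for the rationale is not part of the tag list above; attaching it to the audio object, and the rationale name, are assumptions. The check_config helper is a hypothetical sanity check, not part of Label Studio: it only verifies that the XML parses and that every toName points at a declared name.

```python
import xml.etree.ElementTree as ET

LABEL_CONFIG = """
<View>
  <Audio name="audio" value="$audio" hotkey="space" />
  <List name="hyps" value="$hyps" />
  <Ranker name="select" toName="hyps" />
  <Rating name="quality" toName="hyps" />
  <TextArea name="rationale" toName="audio" rows="3" />
</View>
"""


def check_config(xml_text):
    """Return a list of problems: unparsable XML raises, dangling toName is reported."""
    root = ET.fromstring(xml_text.strip())
    names = {el.get("name") for el in root.iter() if el.get("name")}
    problems = []
    for el in root.iter():
        targets = el.get("toName")
        if targets is None:
            continue
        # toName may hold several comma-separated object names.
        for target in targets.split(","):
            if target.strip() not in names:
                problems.append(
                    f"{el.tag} '{el.get('name')}' points at undeclared object '{target.strip()}'"
                )
    return problems


print(check_config(LABEL_CONFIG))
```

A check like this catches typos in generated configs early, but it does not replace Label Studio's own validation when the project is created.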

3. Wire it into a project with the SDK

The agent uses the Label Studio SDK/CLI to create a new project with the generated XML configuration and to upload your dataset of audio clips. You can instruct the agent to import your existing model confidence scores as pre-annotations so reviewers see the initial system ranking. Run a small batch of tasks to observe how reviewers interact with the interface. If annotators struggle to read long transcripts, direct the agent to regenerate the XML configuration and redeploy the updated interface.
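A minimal sketch of that wiring, assuming the 1.x-style Python client (LabelStudio, projects.create, projects.import_tasks); verify the calls against the SDK reference for the version you install. The connect and deploy helpers, the project title, and the URL and key placeholders are hypothetical. The SDK import lives inside connect so the rest of the code can be exercised without the package or a running server.

```python
def connect(base_url, api_key):
    """Create a client for a self-hosted instance (requires label-studio-sdk)."""
    from label_studio_sdk.client import LabelStudio

    return LabelStudio(base_url=base_url, api_key=api_key)


def deploy(client, title, label_config, tasks):
    """Create a project from the XML config, import the tasks, return the project id."""
    project = client.projects.create(title=title, label_config=label_config)
    client.projects.import_tasks(id=project.id, request=tasks)
    return project.id


# Example call against your own server (placeholders, not real values):
# client = connect("http://localhost:8080", "YOUR_API_KEY")
# project_id = deploy(client, "ASR hypothesis selection", LABEL_CONFIG, [task])
```

Keeping deploy independent of how the client is constructed makes it easy to redeploy after the agent regenerates the configuration.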

4. Set up review and quality workflows

Configure a multi-annotator overlap percentage so the same audio clip reaches multiple reviewers and you can measure consensus on difficult acoustic segments. Set up reviewer queues to isolate instances where annotators disagree on the best candidate transcript. Track the agreement metrics that matter for ASR hypothesis selection, such as binary top-1 agreement for the primary selection or a rank correlation metric (for example, Kendall's tau or Spearman's rho) for full transcript list sorting.
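The two kinds of agreement can be computed offline from a pair of rankings. The sketch below uses Kendall's tau (the tau-a variant, which assumes no ties) as one possible rank correlation; the function names and sample ids are illustrative.

```python
from itertools import combinations


def top1_agreement(rank_a, rank_b):
    """True when both reviewers put the same hypothesis first."""
    return rank_a[0] == rank_b[0]


def kendall_tau(rank_a, rank_b):
    """Kendall tau-a between two full rankings of the same ids (no ties)."""
    if set(rank_a) != set(rank_b) or len(rank_a) < 2:
        raise ValueError("rankings must cover the same two or more ids")
    pos_b = {item: i for i, item in enumerate(rank_b)}
    concordant = discordant = 0
    # Every pair (x, y) here has x ranked above y by reviewer A.
    for x, y in combinations(rank_a, 2):
        if pos_b[x] < pos_b[y]:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)


reviewer_a = ["h1", "h2", "h3"]
reviewer_b = ["h1", "h3", "h2"]
print(top1_agreement(reviewer_a, reviewer_b), kendall_tau(reviewer_a, reviewer_b))
```

Here the reviewers agree on the top choice but swap the last two items, so top-1 agreement holds while tau drops below 1.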

5. Export and integrate

Export your completed evaluations in the default JSON format to preserve the ordered lists and text rationales. Extract the stable hypothesis identifiers and final rank positions to build the preference dataset for your underlying models. Pass this structured output directly into your evaluation harness or into a training pipeline for an automated rescoring model.
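One way to turn an export into preference pairs is sketched below. The task/annotations/result layout follows the default JSON export, but the shape of the ranker value (an id list under value.ranker.rank) is an assumption here; compare it with an annotation exported from your own instance before relying on it. The sample export and helper names are illustrative.

```python
import json
from itertools import combinations

EXPORT = json.loads("""
[
  {
    "id": 101,
    "data": {"audio": "https://example.com/audio/utt-0001.wav"},
    "annotations": [
      {"result": [
        {"from_name": "select", "to_name": "hyps", "type": "ranker",
         "value": {"ranker": {"rank": ["utt-0001:model-b:1",
                                       "utt-0001:model-a:0",
                                       "utt-0001:model-a:2"]}}}
      ]}
    ]
  }
]
""")


def ranked_ids(annotation):
    """Return the ordered hypothesis ids from the first ranker result, if any."""
    for item in annotation.get("result", []):
        if item.get("type") == "ranker":
            ranker = item.get("value", {}).get("ranker", {})
            return list(ranker.get("rank", []))  # assumed key, see note above
    return []


def preference_pairs(export):
    """Expand each ranking into (chosen, rejected) pairs, best-ranked first."""
    pairs = []
    for task in export:
        for annotation in task.get("annotations", []):
            for better, worse in combinations(ranked_ids(annotation), 2):
                pairs.append({"task": task["id"], "chosen": better, "rejected": worse})
    return pairs


pairs = preference_pairs(EXPORT)
print(len(pairs), pairs[0])
```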

Why Label Studio for ASR hypothesis selection

With the native audio tag, you can configure custom hotkeys to reduce annotator fatigue during repeated listening.

With the ranker control, you can transform messy JSON arrays into interactive lists so reviewers avoid scrolling through log files.

With self-hosted deployment options, you can process sensitive biometric voice recordings within your secure infrastructure to maintain data privacy compliance.

With the software development kit, your engineering team can set up and test evaluation workflows quickly instead of spending weeks building web applications from scratch.

Common variations

Evaluating large language model summarization quality relies on the identical list and ranker pattern applied to text blocks instead of audio.

Benchmarking translation model outputs requires the same pairwise interface to choose the most natural translation of a source sentence.

Ranking synthetic text-to-speech outputs uses a similar layout that presents multiple audio clips against a single source transcript.

Next steps

XML labeling config builder skill → https://github.com/HumanSignal/create-xml-labeling-config-skill

Label Studio SDK/CLI → https://api.labelstud.io/api-reference/introduction/getting-started

LLM-friendly docs (markdown) → https://labelstud.io/llms.txt

Manage tasks and data → https://docs.humansignal.com/guide/manage_data

Quality and review workflows → https://docs.humansignal.com/guide/quality.html

GitHub → https://github.com/HumanSignal/label-studio

How do platform terms of service impact ASR audio ingestion?

You cannot rip audio directly from consumer platforms for labeling workflows. YouTube terms of service explicitly forbid downloading media outside of their approved application programming interfaces. For external testing datasets, pull audio from openly licensed repositories like Mozilla Common Voice or LibriSpeech to maintain compliance.

What data privacy constraints apply to voice recordings?

Regulations like the General Data Protection Regulation and the California Consumer Privacy Act classify voiceprints as sensitive biometric data when used for identification. Hosting your labeling infrastructure securely within your own environment avoids exposing this personal data to third-party public clouds. Configure deletion workflows in Label Studio using the data manager to respect user takedown requests.

How do you map candidate transcripts to machine-joinable outputs?

When supplying the Label Studio list tag with candidate transcripts, assign a stable identifier to each item (for example, one that combines the utterance, the model, and the N-best position) rather than relying on positional array indexes. The ranker control outputs the final sorted array using these explicit identifiers, so your downstream preference datasets can be joined back to the original decoder logs by key.
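As an illustration, joining a reviewer's ranking back to decoder logs then becomes a dictionary lookup. The log rows, id scheme, and join_ranking helper below are hypothetical.

```python
def join_ranking(ranked, decoder_log):
    """Attach a 1-based human rank to each decoder log row, keyed by stable id."""
    by_id = {row["id"]: row for row in decoder_log}
    missing = [hyp_id for hyp_id in ranked if hyp_id not in by_id]
    if missing:
        # A miss usually means ids were regenerated or positional indexes leaked in.
        raise KeyError(f"ids not found in decoder log: {missing}")
    return [
        {**by_id[hyp_id], "human_rank": position}
        for position, hyp_id in enumerate(ranked, start=1)
    ]


decoder_log = [
    {"id": "utt-0001:model-a:0", "model": "model-a", "score": 0.91},
    {"id": "utt-0001:model-b:1", "model": "model-b", "score": 0.87},
]
joined = join_ranking(["utt-0001:model-b:1", "utt-0001:model-a:0"], decoder_log)
print(joined[0]["model"], joined[0]["human_rank"])
```

Failing loudly on an unknown id is a deliberate choice in this sketch: a silent skip would hide mismatches between the labeling data and the logs.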

Why do audio waveforms desynchronize during playback?

Waveform rendering can desynchronize when the underlying audio file uses a complex container format or variable bitrate encoding. Convert your MP3 or other compressed source files into standard WAV format before uploading them to the labeling pipeline. This data engineering step helps the visual waveform stay aligned with the audio playback.
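A conversion step along these lines could run before upload. This sketch assumes ffmpeg is installed and on the PATH; the mono, 16 kHz, 16-bit PCM target is a common ASR choice rather than a Label Studio requirement, and the helper names and paths are illustrative.

```python
import shutil
import subprocess
from pathlib import Path


def wav_command(src, dst_dir, sample_rate=16000):
    """Build the ffmpeg argument list that converts src to mono 16-bit PCM WAV."""
    src = Path(src)
    dst = Path(dst_dir) / (src.stem + ".wav")
    return [
        "ffmpeg", "-y",            # overwrite an existing output file
        "-i", str(src),
        "-ac", "1",                # one channel
        "-ar", str(sample_rate),   # resample
        "-c:a", "pcm_s16le",       # uncompressed 16-bit PCM
        str(dst),
    ]


def convert(src, dst_dir):
    """Run the conversion; raises if ffmpeg is missing or the command fails."""
    if shutil.which("ffmpeg") is None:
        raise RuntimeError("ffmpeg not found on PATH")
    Path(dst_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(wav_command(src, dst_dir), check=True)


print(" ".join(wav_command("clips/utt-0001.mp3", "wav")))
```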

When should you use a pairwise control instead of a ranker?

Use the pairwise tag strictly for evaluating exactly two candidate transcripts side by side. If your decoder generates an N-best list with three or more hypotheses, build the interface using a list tag combined with a ranker tag. This setup allows reviewers to drag and drop multiple items into a completely sorted preference order.
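That rule can be encoded when tasks are prepared. In the sketch below, the pairwise layout uses two Text objects because Pairwise compares two named objects; the names hyp_a, hyp_b, hyp_a_id, and hyp_b_id and the pick_layout helper are hypothetical, so validate both configurations in your own instance.

```python
PAIRWISE_CONFIG = """<View>
  <Audio name="audio" value="$audio" hotkey="space" />
  <Pairwise name="compare" toName="hyp_a,hyp_b" />
  <Text name="hyp_a" value="$hyp_a" />
  <Text name="hyp_b" value="$hyp_b" />
</View>"""

RANKER_CONFIG = """<View>
  <Audio name="audio" value="$audio" hotkey="space" />
  <List name="hyps" value="$hyps" />
  <Ranker name="select" toName="hyps" />
</View>"""


def pick_layout(audio_url, hyps):
    """Return (config, task data): pairwise for two hypotheses, ranker for three or more."""
    if len(hyps) < 2:
        raise ValueError("need at least two hypotheses to compare")
    if len(hyps) == 2:
        data = {
            "audio": audio_url,
            "hyp_a": hyps[0]["body"],
            "hyp_b": hyps[1]["body"],
            # Keep the stable ids next to the text so the choice stays joinable.
            "hyp_a_id": hyps[0]["id"],
            "hyp_b_id": hyps[1]["id"],
        }
        return PAIRWISE_CONFIG, data
    return RANKER_CONFIG, {"audio": audio_url, "hyps": hyps}


two = [{"id": "u1:a:0", "body": "turn the lights off"},
       {"id": "u1:b:0", "body": "turn the light off"}]
config, data = pick_layout("https://example.com/audio/u1.wav", two)
print(config is PAIRWISE_CONFIG, sorted(data))
```

Because a project carries one labeling configuration, two-way and N-best comparisons would likely live in separate projects.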
