How to build a labeling tool for voice assistant slot filling and intent

May 27, 2026

Building an interface for voice-assistant slot filling and intent requires combining audio playback with precise text span selection and categorical classification. Annotators need synchronized waveforms and dialogue transcripts to accurately assign intent labels and mark exact slot parameters. Rather than building this complex dual-task environment manually, you can use a coding agent to generate the exact workspace you need. With Label Studio, you can programmatically deploy an interface that handles multi-turn speech utterances, pre-annotated model predictions, and strict data compliance requirements in one pass.

Generate custom annotation interfaces for speech data using an AI coding agent.

Sync audio playback with multi-turn paragraph text to ground slot boundaries accurately.

Pre-load existing machine learning predictions to prioritize uncertain samples and speed up the workflow.

Secure sensitive voice recordings using cloud storage with signed URLs to maintain privacy compliance.

Export nested JSON output directly into natural language understanding training pipelines or analytics warehouses.

The problem

Labeling for voice-assistant slot filling and intent forces teams to manage a dual-task workflow over synchronized audio and text data. Annotators struggle when they have to switch between separate media players and text inputs to assign global intents and mark specific entity spans within dialogue turns. Managing this data at scale also introduces strict compliance constraints, because voice prints are treated as personal identifiers under privacy frameworks like the General Data Protection Regulation and the California Consumer Privacy Act. Building a custom application to handle secure media streaming, precise span offsets, and complex state management can require hundreds of engineering hours, and the resulting tool becomes an expensive maintenance burden that delays model training.

The short answer

With Label Studio as your foundation, you can deploy a customized workspace without writing the front-end code by hand. Instead of building a new labeling application from scratch, a coding agent generates the interface from your spec and deploys it into Label Studio in one pass. The agent uses the XML labeling config builder skill to produce an interface configuration from a plain-language spec, and then uses the Label Studio SDK/CLI to wire that config into a real project programmatically.

Docs:

LLM-friendly docs (markdown) → https://labelstud.io/llms.txt

Audio tag → https://labelstud.io/tags/audio.html

Paragraphs tag → https://labelstud.io/tags/paragraphs.html

Importing predictions → https://labelstud.io/guide/predictions

What you're building

Display an interactive audio waveform that allows annotators to play, pause, and scrub through the voice recording.

Render the dialogue transcript alongside the audio player to anchor slot spans within specific conversational turns.

Provide a single-choice classification control for the annotator to assign the overall utterance intent.

Include a text span picker that allows reviewers to highlight transcript segments and apply specific slot categories.

Surface active learning prediction scores to prioritize highly uncertain assistant responses in the reviewer queue.

Display a star rating control for the annotator to score the acoustic clarity of the raw audio file.

How to build it in Label Studio

1. Set up the project

Start by installing the open-source version of Label Studio, and run it inside your own infrastructure if your voice recordings fall under strict health or privacy compliance constraints. Structure your task data as JSON, where each task contains a reference to an audio file URL and an array of multi-turn dialogue text. Include metadata fields for the recording date, user cohort, and active learning prediction scores to help reviewers filter the data effectively. Before generating the interface, gather your domain ontology files so the agent can pre-load your exact intent categories and slot hierarchies into the configuration.
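As a rough illustration of the task structure described above, the sketch below builds one task and runs a few sanity checks on it. The field names (audio, dialogue, recorded_on, cohort, prediction_score), the URL, and the per-turn author/text/start/end keys are assumptions chosen to line up with the $audio and $dialogue variables used later; adjust them to your own schema and check the Paragraphs tag documentation for the keys your version expects.

```python
import json

# Hypothetical task. "audio" and "dialogue" match the $audio and $dialogue
# variables referenced by the labeling config; all values are placeholders.
task = {
    "data": {
        "audio": "https://example.com/audio/utt-0001.wav",
        "dialogue": [
            {"author": "user", "text": "Book a table for two in Lisbon tonight",
             "start": 0.0, "end": 3.2},
            {"author": "assistant", "text": "Which restaurant would you like?",
             "start": 3.4, "end": 5.1},
        ],
        # Metadata that reviewers can filter on.
        "recorded_on": "2026-05-01",
        "cohort": "beta-testers",
        "prediction_score": 0.41,
    }
}


def validate_task(task):
    """Return a list of problems; an empty list means the task looks usable."""
    problems = []
    data = task.get("data", {})
    audio = data.get("audio")
    if not isinstance(audio, str) or not audio:
        problems.append("data.audio must be a non-empty URL string")
    turns = data.get("dialogue")
    if not isinstance(turns, list) or not turns:
        problems.append("data.dialogue must be a non-empty list of turns")
        return problems
    for i, turn in enumerate(turns):
        if not turn.get("text"):
            problems.append(f"turn {i} has no text")
        if "start" in turn and "end" in turn and turn["start"] >= turn["end"]:
            problems.append(f"turn {i} has start >= end")
    return problems


print(json.dumps(task, indent=2))
```

Running a check like this before import catches empty transcripts and inverted timestamps early, which is cheaper than discovering them in the annotation queue.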

2. Generate the labeling interface with the XML config skill

Pass your detailed feature specification to your coding agent and instruct it to run the XML labeling config builder skill. The skill translates your natural language requirements into a validated Label Studio XML layout designed specifically for speech workflows. This generated configuration binds your visual controls to your media objects using the correct tag relationships for voice-assistant slot filling and intent.

<Audio name="audio" value="$audio" ...>: display the utterance waveform and provide playback that stays in sync with the text transcript.

<Paragraphs name="utt" value="$dialogue" audioUrl="$audio" ...>: render the conversational text with timing alignment so slot annotations are anchored to specific turns.

<Choices name="intent" toName="utt" choice="single" ...>: provide a single-selection control to classify the global intent of the utterance.

<ParagraphLabels name="slots" toName="utt" ...>: define the slot categories that annotators apply to highlighted transcript spans.

<Rating name="clarity" toName="audio" ...>: let annotators score the acoustic clarity of the voice snippet.
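To make the tag relationships concrete, here is one possible full configuration assembled from the tags listed above, wrapped in a small Python check that every toName points at a declared name. The intent and slot values, the maxRating value, and the layout, nameKey, and textKey attributes are illustrative assumptions, not output of the config builder skill.

```python
import xml.etree.ElementTree as ET

# Illustrative config; intent and slot values are placeholders for your ontology.
LABEL_CONFIG = """
<View>
  <Audio name="audio" value="$audio"/>
  <Paragraphs name="utt" value="$dialogue" audioUrl="$audio"
              layout="dialogue" nameKey="author" textKey="text"/>
  <Choices name="intent" toName="utt" choice="single">
    <Choice value="book_restaurant"/>
    <Choice value="play_music"/>
    <Choice value="set_alarm"/>
  </Choices>
  <ParagraphLabels name="slots" toName="utt">
    <Label value="party_size"/>
    <Label value="city"/>
    <Label value="time"/>
  </ParagraphLabels>
  <Rating name="clarity" toName="audio" maxRating="5"/>
</View>
"""


def check_config(xml_text):
    """Parse the config and report any toName that references an undeclared name."""
    root = ET.fromstring(xml_text.strip())
    declared = {el.get("name") for el in root.iter() if el.get("name")}
    problems = []
    for el in root.iter():
        target = el.get("toName")
        if target and target not in declared:
            problems.append(f"<{el.tag}> points at unknown object '{target}'")
    return problems


print(check_config(LABEL_CONFIG))
```

This only checks well-formedness and name references locally; Label Studio performs its own validation when the config is saved to a project.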

3. Wire it into a project with the SDK

Direct your agent to use the Label Studio SDK/CLI to create a new project using the generated XML configuration. The agent can then upload your task JSON files and import pre-computed model predictions to populate the workspace with suggested intent choices and highlighted slot spans. If annotators struggle with the layout during the first batch, you can have the same agent loop iterate on the interface by regenerating the XML and redeploying the updated configuration.
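The project-creation step might look roughly like the following with the Python SDK. This is a sketch that assumes a recent label-studio-sdk release exposing client.projects.create and client.projects.import_tasks; the helper name deploy_project, the base URL, and the API key are placeholders, so confirm the calls against the SDK reference for your installed version.

```python
def deploy_project(client, title, label_config, tasks):
    """Create a project from an XML config and import tasks into it.

    `client` is expected to behave like label_studio_sdk.client.LabelStudio
    (an assumption; verify the method names against your SDK version).
    """
    project = client.projects.create(title=title, label_config=label_config)
    client.projects.import_tasks(id=project.id, request=tasks)
    return project.id


# Usage against a real instance (requires `pip install label-studio-sdk`):
#
#   from label_studio_sdk.client import LabelStudio
#   client = LabelStudio(base_url="http://localhost:8080", api_key="YOUR_API_KEY")
#   project_id = deploy_project(client, "Voice intent and slots", config_xml, tasks)
```

Keeping the client as a parameter lets the agent re-run the same function after each config revision, and makes the step easy to exercise with a stub before pointing it at a live server.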

4. Set up review and quality workflows

Configure a multi-annotator overlap percentage to route ambiguous speech utterances to multiple workers simultaneously. You can establish dedicated reviewer queues to catch disagreements in complex slot boundaries or overlapping intent definitions. For voice-assistant slot filling and intent, you should track task-level agreement for the global classification and measure per-label F1 scores to evaluate exact text span boundaries.
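The two metrics mentioned above can be computed offline from a pair of annotators' outputs. The sketch below uses exact-match agreement for intents and exact-boundary F1 per slot label; the span tuple layout (turn index, start offset, end offset, label) is an assumption made for this example, and stricter or more lenient matching rules are equally valid choices.

```python
def intent_agreement(intents_a, intents_b):
    """Fraction of tasks on which two annotators chose the same intent."""
    if len(intents_a) != len(intents_b) or not intents_a:
        raise ValueError("need two non-empty, equally long intent lists")
    matches = sum(a == b for a, b in zip(intents_a, intents_b))
    return matches / len(intents_a)


def per_label_f1(spans_a, spans_b):
    """Exact-match F1 per slot label between two annotators.

    Each span is a tuple: (turn_index, start_offset, end_offset, label).
    """
    set_a, set_b = set(spans_a), set(spans_b)
    scores = {}
    for label in {span[3] for span in set_a | set_b}:
        a = {s for s in set_a if s[3] == label}
        b = {s for s in set_b if s[3] == label}
        true_positives = len(a & b)
        # F1 = 2*TP / (|A| + |B|); the denominator is never zero here
        # because the label came from at least one of the two sets.
        scores[label] = 2 * true_positives / (len(a) + len(b))
    return scores


annotator_1 = [(0, 17, 20, "party_size"), (0, 24, 30, "city"), (0, 31, 38, "time")]
annotator_2 = [(0, 17, 20, "party_size"), (0, 24, 30, "city"), (0, 30, 38, "time")]
print(per_label_f1(annotator_1, annotator_2))
```

In this example the two annotators agree on party_size and city but disagree by one character on the time span, so time scores 0.0 under exact matching. That sensitivity is the reason to report F1 per label rather than as a single pooled number.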

5. Export and integrate

After annotation concludes, you can export the finalized dataset in the standard JSON format or, where your labeling configuration supports it, use built-in converters for span formats like CoNLL-2003. Downstream consumers read the resulting object to extract the final intent class, the precise start and end character offsets for each slot, and the unique identifier for the source audio. You can then hand this normalized data directly to your natural language understanding training pipeline or your evaluation harness.
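A downstream consumer could flatten each exported task along these lines. The sketch assumes the JSON export shape of a task with data and an annotations list whose result entries carry from_name, type, and value, and it assumes Paragraphs regions store the turn index in start and character offsets in startOffset and endOffset; inspect one real export from your project before relying on these keys.

```python
def extract_record(task, audio_key="audio"):
    """Flatten one exported task into intent, slots, and audio reference.

    Uses the first annotation only; consensus across annotators is out of scope.
    """
    record = {
        "task_id": task["id"],
        "audio": task["data"][audio_key],
        "intent": None,
        "slots": [],
    }
    for region in task["annotations"][0]["result"]:
        value = region["value"]
        if region["type"] == "choices" and region["from_name"] == "intent":
            record["intent"] = value["choices"][0]
        elif region["type"] == "paragraphlabels":
            record["slots"].append({
                "turn": int(value["start"]),      # index of the dialogue turn
                "start": value["startOffset"],    # character offset in that turn
                "end": value["endOffset"],
                "label": value["paragraphlabels"][0],
                "text": value.get("text"),
            })
    return record


# Hand-written sample in the assumed export shape.
exported_task = {
    "id": 101,
    "data": {"audio": "https://example.com/audio/utt-0001.wav"},
    "annotations": [{"result": [
        {"from_name": "intent", "to_name": "utt", "type": "choices",
         "value": {"choices": ["book_restaurant"]}},
        {"from_name": "slots", "to_name": "utt", "type": "paragraphlabels",
         "value": {"start": "0", "end": "0", "startOffset": 24, "endOffset": 30,
                   "text": "Lisbon", "paragraphlabels": ["city"]}},
    ]}],
}
print(extract_record(exported_task))
```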

Why Label Studio for voice-assistant slot filling and intent

Label Studio supports cloud storage integrations with signed URLs, allowing you to stream audio without copying sensitive voice biometric data.

The synchronized interface architecture eliminates tool switching by connecting audio playback directly to transcript span selection.

The active learning prediction integration sorts the data manager queue by model uncertainty to focus human effort on the hardest utterances.

The declarative XML structure allows you to build custom, dual-task interfaces without managing complex front-end span offset logic.

Common variations

Review teams use a similar interface to correct automatic speech recognition transcripts before running downstream natural language models.

Product analysts modify the configuration to evaluate and rate the appropriateness of text responses generated by conversational agents.

Data scientists adapt the audio tags to perform acoustic event detection and diarization on multi-speaker call center recordings.

Machine learning engineers reuse the text and label tags to train named entity recognition models on plain chat logs.

Next steps

XML labeling config builder skill → https://github.com/HumanSignal/create-xml-labeling-config-skill

Label Studio SDK/CLI → https://api.labelstud.io/api-reference/introduction/getting-started

LLM-friendly docs (markdown) → https://labelstud.io/llms.txt

Import tasks and handle CORS → https://labelstud.io/guide/tasks.html

Import pre-annotations and predictions → https://labelstud.io/guide/predictions

Export annotations and formatting → https://labelstud.io/guide/export

GitHub → https://github.com/HumanSignal/label-studio

How do privacy frameworks classify voice recordings for intent labeling?

The General Data Protection Regulation treats voice prints as biometric personal data when they are used to identify a person, and the California Consumer Privacy Act lists voiceprints as biometric information. If your recordings contain personal health information, the Health Insurance Portability and Accountability Act (HIPAA) Safe Harbor rule lists voice prints among the biometric identifiers that must be removed for de-identification. You should implement identity and access controls and build automated deletion pathways so you can respond to data subject access and deletion requests.

What is the most secure way to stream audio files to annotators?

Instead of uploading raw media files directly to the annotation platform, host your audio files in a secure cloud environment like Amazon S3 or Google Cloud Storage. You can then generate temporary presigned URLs to stream the audio directly to the review interface. Because each URL expires after a short window, this architecture limits the opportunity for unauthorized downloads and keeps sensitive biometric data out of your labeling tool database.
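If you generate the links yourself rather than relying on Label Studio's cloud storage integration, a presigned S3 URL can be produced as sketched below. The function takes a boto3 S3 client and calls its generate_presigned_url method; the helper name, the bucket, the key, and the 15-minute default expiry are assumptions for illustration.

```python
def presign_audio(s3_client, bucket, key, expires_in=900):
    """Return a time-limited GET URL for one audio object.

    `s3_client` is expected to be a boto3 S3 client, e.g. boto3.client("s3").
    `expires_in` is in seconds; 900 (15 minutes) is an arbitrary default.
    """
    return s3_client.generate_presigned_url(
        "get_object",
        Params={"Bucket": bucket, "Key": key},
        ExpiresIn=expires_in,
    )


# Usage (requires `pip install boto3` and configured AWS credentials):
#
#   import boto3
#   url = presign_audio(boto3.client("s3"), "my-voice-bucket", "utt-0001.wav")
```

Shorter expiry windows reduce exposure but can break playback for annotators who leave a task open, so the value is a trade-off to tune against your typical session length.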

How do you synchronize audio playback with multi-turn dialogue transcripts?

You link an audio object to a text component using specific XML configurations in the interface layout. By setting the audioUrl attribute on the paragraphs tag, the workspace anchors the conversational text timing to the audio waveform. This synchronization allows reviewers to listen to the exact utterance while highlighting precise character spans for slot categories.

How do you format pre-annotated slot predictions for the review workspace?

Your machine learning backend must output predictions as an array of nested JSON objects containing exact character start and end offsets. When working with multi-turn dialogue, each slot prediction should target the transcript object and the specific turn within it, rather than the audio object. Including a prediction score in the JSON payload allows you to sort the active learning queue by model uncertainty.
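A minimal sketch of such a payload follows. It assumes the control names intent, slots, and utt from the configuration in this guide, and it assumes Paragraphs regions identify the turn with string indices in start and end plus character offsets in startOffset and endOffset; compare it against an annotation exported from your own project before bulk-importing. The model_version string and the helper names are placeholders.

```python
def slot_region(turn_index, turn_text, phrase, label):
    """Build one slot region by locating `phrase` inside a dialogue turn."""
    start = turn_text.find(phrase)
    if start == -1:
        raise ValueError(f"{phrase!r} not found in turn {turn_index}")
    return {
        "from_name": "slots",
        "to_name": "utt",
        "type": "paragraphlabels",
        "value": {
            "start": str(turn_index),   # turn where the span begins
            "end": str(turn_index),     # turn where the span ends
            "startOffset": start,
            "endOffset": start + len(phrase),
            "text": phrase,
            "paragraphlabels": [label],
        },
    }


def build_prediction(intent, regions, score, model_version="nlu-draft"):
    """Wrap an intent choice and slot regions into one prediction object."""
    intent_region = {
        "from_name": "intent",
        "to_name": "utt",
        "type": "choices",
        "value": {"choices": [intent]},
    }
    return {
        "model_version": model_version,
        "score": score,
        "result": [intent_region, *regions],
    }


turn = "Book a table for two in Lisbon tonight"
prediction = build_prediction(
    "book_restaurant",
    [slot_region(0, turn, "Lisbon", "city"), slot_region(0, turn, "tonight", "time")],
    score=0.41,
)
print(prediction)
```

Computing offsets from the transcript text, rather than hard-coding them, avoids off-by-one errors when the model's tokenization differs from the stored string.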

How do you measure annotator agreement for dual-task speech interfaces?

You need to track two distinct metrics, one for intent classification and one for slot filling. Calculate task-level agreement for the global intent choice, and measure per-label F1 scores to evaluate the exact character spans for the slots. A common practice is to route tasks with conflicting intent labels or low span F1 to a dedicated senior reviewer queue for final consensus.