
How to build a labeling tool for translation quality estimation with MQM

Evaluating machine translation requires more than a simple approval rating. Translation quality estimation with Multidimensional Quality Metrics (MQM) requires professional linguists to identify explicit error spans in a text and assign a hierarchical category and a severity score to each span. This multidimensional approach gives engineering teams precise quantitative data for evaluating model performance, but it creates a complicated interface problem for the humans doing the work.

Configure side-by-side text panes to display source and target strings with synchronized scrolling.

Map the multidimensional error typology into a taxonomy control for precise span categorization.

Attach severity ratings and free-text rationale inputs to specific highlighted text regions.

Inject model-generated error predictions as draft annotations to accelerate the linguist review process.

Calculate span-level agreement and hierarchy code overlap to adjudicate professional reviewer disagreements.

The problem

Labeling for translation quality estimation with MQM is difficult because linguists must highlight precise character spans in a target string and map them to a deep taxonomy of error types. The data shape requires rendering side-by-side source and candidate texts while capturing layered metadata for every selected region. Professional translators struggle with disjointed workflows where they must cross-reference external guidelines or switch between multiple forms to log severity and rationale. When processing proprietary logs that contain personal data, strict retention policies make moving data to third-party annotation platforms a compliance risk. Building a custom interface to handle these complex interactions and compliance constraints from scratch costs months of engineering time and delays critical evaluation cycles.

The short answer

Rather than building a new labeling application from scratch, let a coding agent generate the interface from your spec and deploy it into Label Studio in one pass. The agent uses the XML labeling config builder skill to produce an interface configuration from a plain-language specification, then uses the Label Studio SDK/CLI to wire that configuration into a live project programmatically.

Docs: LLM-friendly docs (markdown) → https://labelstud.io/llms.txt

Docs: Text tag → https://labelstud.io/tags/text.html

Docs: Taxonomy tag → https://labelstud.io/tags/taxonomy.html

Docs: Import predictions → https://labelstud.io/guide/predictions

What you're building

Provide read-only text views to display the source text and the candidate translation.

Enable character-level span selection on the target text to mark precise error boundaries.

Present a hierarchical taxonomy tree so linguists can select specific accuracy or style categories.

Attach a region-specific severity picker to classify each error as minor, major, or critical.

Include an editable text area linked to each span to capture the suggested fix.

Display pre-annotated error spans generated by quality estimation models for human confirmation.

Calculate span-level intersection over union (IoU) to measure agreement across multiple reviewers.

How to build it in Label Studio

1. Set up the project

Host Label Studio on your own infrastructure to keep proprietary translation logs secure and maintain compliance with data minimization requirements. Structure each task as a single JSON object containing the source string and the target string. Include metadata fields such as the language pair, the machine translation system version, and the domain context. Pre-load your specific version of the MQM taxonomy file so the interface can render the correct hierarchy.
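As a rough illustration of the task shape described above, the sketch below builds a small batch in Python. The key names (src, tgt, lang_pair, mt_version, domain) and the sample sentences are assumptions chosen for this example; only the src and tgt keys need to match the variables referenced in your labeling configuration.

```python
import json

def make_task(src, tgt, lang_pair, mt_version, domain):
    """Build one task dict; the keys under "data" are illustrative names."""
    return {
        "data": {
            "src": src,                # source string shown read-only
            "tgt": tgt,                # candidate translation to annotate
            "lang_pair": lang_pair,    # e.g. "de-en"
            "mt_version": mt_version,  # MT system version (hypothetical value)
            "domain": domain,          # domain context
        }
    }

rows = [
    ("Der Vertrag endet am 1. Mai.", "The contract end on May 1."),
    ("Bitte bestätigen Sie den Termin.", "Please confirm the appointment."),
]
tasks = [make_task(s, t, "de-en", "mt-v1", "legal") for s, t in rows]

# Serialize the batch as a JSON array that can be saved and imported.
payload = json.dumps(tasks, ensure_ascii=False, indent=2)
print(payload)
```

Keeping the metadata next to the strings in each task means every exported annotation carries the language pair and system version with it.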

2. Generate the labeling interface with the XML config skill

Hand the interface specification to a coding agent equipped with the XML labeling config builder skill. The agent processes your requirements and emits a validated Label Studio XML configuration that structures the annotation workflow. The generated configuration uses control tags that match the data types translation quality estimation with MQM requires:

<Text name="tgt" value="$tgt"> - renders the candidate translation string and normalizes line endings to ensure accurate span offsets.

<Taxonomy name="mqm_cat" toName="tgt" labeling="true"> - applies hierarchical error categories directly to selected text spans.

<Choices name="severity" toName="tgt" perRegion="true" choice="single"> - attaches a distinct severity level to each individual highlighted region.

<TextArea name="note" toName="tgt" perRegion="true" editable="true"> - captures optional free-text rationale or corrections for a specific error span.
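To see how the four tags above might fit together, here is a minimal sketch of a complete configuration held in a Python string and checked for well-formedness. The flex layout, the src pane, and the short category list are assumptions for illustration (a real MQM typology is much larger), and parsing with the standard library only confirms the XML is well formed, not that Label Studio accepts it.

```python
import xml.etree.ElementTree as ET

# Illustrative config: two panes side by side, then the span-level controls.
LABEL_CONFIG = """
<View>
  <View style="display: flex; gap: 1em;">
    <View style="flex: 1;">
      <Header value="Source"/>
      <Text name="src" value="$src"/>
    </View>
    <View style="flex: 1;">
      <Header value="Candidate translation"/>
      <Text name="tgt" value="$tgt"/>
    </View>
  </View>
  <Taxonomy name="mqm_cat" toName="tgt" labeling="true">
    <Choice value="Accuracy">
      <Choice value="Mistranslation"/>
      <Choice value="Omission"/>
    </Choice>
    <Choice value="Style">
      <Choice value="Awkward style"/>
    </Choice>
  </Taxonomy>
  <Choices name="severity" toName="tgt" perRegion="true" choice="single">
    <Choice value="minor"/>
    <Choice value="major"/>
    <Choice value="critical"/>
  </Choices>
  <TextArea name="note" toName="tgt" perRegion="true" editable="true"/>
</View>
""".strip()

# Parse to catch unbalanced tags before sending the config anywhere.
root = ET.fromstring(LABEL_CONFIG)

# Every control should point at the target text pane.
controls = [el for el in root.iter() if el.get("toName")]
control_names = [el.get("name") for el in controls]
print(control_names)
```

A quick parse like this catches unbalanced tags early, before the configuration is applied to a project.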

3. Wire it into a project with the SDK

The agent uses the Label Studio SDK/CLI to create the project workspace and apply the generated configuration. Next, the agent uploads the task batches and imports confidence-scored model predictions as draft annotations. This automated deployment lets you run a small batch of translations, observe how annotators interact with the controls, and ask the agent to regenerate the XML to fix workflow friction.
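The deployment step might look roughly like the sketch below. It assumes a client object that exposes projects.create(title=..., label_config=...) and projects.import_tasks(id=..., request=...), which is how the Label Studio Python SDK 1.x client is commonly used; confirm the method names against the SDK version you have installed. The dry-run stub is a hypothetical stand-in so the flow can be exercised without a server.

```python
from types import SimpleNamespace

def deploy(client, title, label_config, tasks):
    """Create a project and import tasks through an SDK-style client.

    Assumption: `client.projects` offers `create` and `import_tasks`
    with these keyword arguments, as in Label Studio SDK 1.x.
    """
    project = client.projects.create(title=title, label_config=label_config)
    client.projects.import_tasks(id=project.id, request=tasks)
    return project.id

class DryRunProjects:
    """Hypothetical stand-in for the projects client; records calls only."""
    def __init__(self):
        self.calls = []

    def create(self, title, label_config):
        self.calls.append(("create", title))
        return SimpleNamespace(id=1)

    def import_tasks(self, id, request):
        self.calls.append(("import_tasks", id, len(request)))

# Against a real server you would pass an SDK client instead, for example
# one built with label_studio_sdk.LabelStudio(base_url=..., api_key=...).
dry_client = SimpleNamespace(projects=DryRunProjects())
project_id = deploy(
    dry_client,
    "MQM review",
    "<View/>",
    [{"data": {"src": "Hallo", "tgt": "Hello"}}],
)
print(project_id, dry_client.projects.calls)
```

Passing the client in as an argument keeps the deployment logic testable and lets an agent rerun it after regenerating the XML.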

4. Set up review and quality workflows

Assign overlapping annotators to a seed set of translations to calibrate your linguistic team on the taxonomy guidelines. Route these overlapping tasks into a dedicated reviewer queue to adjudicate disagreements on error severity or categorization. Track specific agreement metrics that matter for translation quality estimation with MQM, focusing on span intersection over union and hierarchy-code agreement. Require reviewers to leave explanatory comments when rejecting annotations to build a feedback loop for your translators.
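The two agreement metrics named above can be sketched as follows. This is one plausible definition, not a built-in Label Studio metric: spans are treated as half-open (start, end) character ranges, and hierarchy-code agreement is measured as the share of taxonomy levels two reviewers have in common from the root down.

```python
def span_iou(a, b):
    """Intersection over union of two half-open character spans."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union else 0.0

def path_agreement(p, q):
    """Fraction of hierarchy levels shared from the root down."""
    if not p and not q:
        return 1.0
    shared = 0
    for x, y in zip(p, q):
        if x != y:
            break
        shared += 1
    return shared / max(len(p), len(q))

# Two reviewers mark overlapping spans with the same top-level category.
iou = span_iou((10, 20), (15, 25))
agree = path_agreement(["Accuracy", "Mistranslation"], ["Accuracy", "Omission"])
print(round(iou, 3), agree)  # 0.333 0.5
```

In this example the reviewers overlap on 5 of 15 covered characters and agree on the top-level category only, which is the kind of case worth routing to adjudication.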

5. Export and integrate

Export the finalized annotation data in the default JSON format to preserve the complex relationship between text offsets, taxonomic categories, and severity ratings. Extract the specific start and end character offsets alongside the assigned error codes to calculate final penalty scores. Hand this structured data off to an evaluation harness to benchmark your machine translation models or route it into a training pipeline.
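As a hedged sketch of the scoring step, the code below merges per-region results from one exported task and sums severity penalties. The result layout (items sharing an id, with taxonomy and choices values) is an assumption about how the export might look for this configuration, and the weights of 1, 5, and 10 are illustrative; substitute the weights your MQM scoring model defines.

```python
from collections import defaultdict

# Assumed severity weights; replace with your own scoring model.
SEVERITY_WEIGHTS = {"minor": 1, "major": 5, "critical": 10}

def regions_from_annotation(annotation):
    """Merge result items that share a region id into one record per span."""
    regions = defaultdict(dict)
    for item in annotation["result"]:
        rec = regions[item["id"]]
        value = item["value"]
        rec["start"], rec["end"] = value["start"], value["end"]
        if item["type"] == "taxonomy":
            rec["category"] = value["taxonomy"][0]
        elif item["type"] == "choices":
            rec["severity"] = value["choices"][0]
    return list(regions.values())

def penalty(regions):
    """Sum severity weights; spans without a severity contribute zero."""
    return sum(SEVERITY_WEIGHTS.get(r.get("severity"), 0) for r in regions)

# Hand-written sample mimicking one exported task.
exported_task = {
    "data": {"src": "Der Vertrag endet am 1. Mai.", "tgt": "The contract end on May 1."},
    "annotations": [{"result": [
        {"id": "r1", "from_name": "mqm_cat", "to_name": "tgt", "type": "taxonomy",
         "value": {"start": 13, "end": 16, "taxonomy": [["Fluency", "Grammar"]]}},
        {"id": "r1", "from_name": "severity", "to_name": "tgt", "type": "choices",
         "value": {"start": 13, "end": 16, "choices": ["minor"]}},
        {"id": "r2", "from_name": "mqm_cat", "to_name": "tgt", "type": "taxonomy",
         "value": {"start": 20, "end": 25, "taxonomy": [["Accuracy", "Mistranslation"]]}},
        {"id": "r2", "from_name": "severity", "to_name": "tgt", "type": "choices",
         "value": {"start": 20, "end": 25, "choices": ["major"]}},
    ]}],
}

regions = regions_from_annotation(exported_task["annotations"][0])
score = penalty(regions)
print(regions, score)
```

Grouping by region id first keeps the category and severity of each span together, so the penalty can later be broken down by error type.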

Why Label Studio for translation quality estimation with MQM

On-premises hosting keeps sensitive log data inside your perimeter to satisfy strict personal data retention policies.

The synchronized text display renders source and candidate strings together to eliminate disjointed application switching.

Nested taxonomy controls present the complete error hierarchy directly alongside the text to prevent cross-referencing external guidelines.

Region-specific controls attach severities and rationales to exact character boundaries to capture layered metadata accurately.

Programmatic project configuration replaces months of custom engineering time with an automated, repeatable deployment.

Common variations

Direct Assessment evaluates translations using segment-level scalar scores instead of marking specific error boundaries.

Pairwise preference ranking presents two candidate translations side-by-side so annotators can choose the better output.

Biomedical translation review applies a specialized medical domain taxonomy to evaluate complex scientific terminology.

Language model evaluation adapts the multidimensional framework to audit generative outputs for style and audience appropriateness.

Next steps

XML labeling config builder skill → https://github.com/HumanSignal/create-xml-labeling-config-skill

Label Studio SDK/CLI → https://api.labelstud.io/api-reference/introduction/getting-started

LLM-friendly docs (markdown) → https://labelstud.io/llms.txt

Label Studio taxonomy tag → https://labelstud.io/tags/taxonomy.html

Label Studio predictions → https://labelstud.io/guide/predictions

GitHub → https://github.com/HumanSignal/label-studio

How do you manage data retention policies when labeling proprietary machine translation logs?

You must establish a lawful basis for processing under GDPR Article 6 before ingesting translation content that contains personal data. Keep dataset retention windows short and restrict annotator access to the underlying Data Manager. On-premises deployments keep these proprietary logs within your own perimeter, which supports compliance with strict data minimization mandates.

What is the best way to present the multidimensional quality metrics hierarchy in the review interface?

Configure the taxonomy control tag with the labeling parameter set to true to display the complete error tree directly alongside the candidate text. This setup prevents linguists from switching tabs to reference external guidelines when they assign specific accuracy or style categories. You can also map hotkeys to the most frequent error types to speed up the annotation workflow.

How do you prevent text offset errors when highlighting translation error spans?

You need to normalize all line endings to LF before importing your task JSON. Because the text system counts CRLF as two characters, unnormalized source data will misalign the exact start and end character boundaries on the target text. Maintaining consistent line endings ensures that your exported annotations perfectly match the span predictions generated by your evaluation harness.
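A small sketch shows the size of the drift. The helper name and the sample string are illustrative; the point is that every CRLF before a span shifts its offsets by one character relative to the normalized text.

```python
def normalize_newlines(text):
    """Convert CRLF and bare CR line endings to LF."""
    return text.replace("\r\n", "\n").replace("\r", "\n")

raw = "Line one.\r\nLine tow."
clean = normalize_newlines(raw)

# The same error word sits at different offsets in the two strings.
raw_start = raw.index("tow")
clean_start = clean.index("tow")
print(raw_start, clean_start)  # 16 15
```

Running the same normalization on both the import side and the model side keeps exported offsets and predicted offsets comparable.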

How do you integrate active learning predictions into the span evaluation process?

Connect your custom machine learning backend using the official SDK to serve span-level predictions directly to the workspace. Format these model outputs to match the interface configuration and include a prediction score for each proposed error boundary. Reviewers can then confirm or reject these draft annotations to shift the workflow from manual highlighting to rapid verification.
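For pre-computed predictions, the payload might be assembled as in the sketch below. The from_name and to_name values assume a configuration with a Taxonomy named mqm_cat and a Choices named severity on a text named tgt; the result-item layout and the model name are assumptions to verify against the predictions documentation before importing.

```python
def span_prediction(region_id, start, end, text, category_path, severity):
    """Return the result items for one predicted error span."""
    base = {"id": region_id, "to_name": "tgt"}
    value = {"start": start, "end": end, "text": text}
    return [
        {**base, "from_name": "mqm_cat", "type": "taxonomy",
         "value": {**value, "taxonomy": [category_path]}},
        {**base, "from_name": "severity", "type": "choices",
         "value": {**value, "choices": [severity]}},
    ]

def make_prediction(model_version, score, spans):
    """Bundle all predicted spans for one task into a prediction dict."""
    result = [item for span in spans for item in span_prediction(**span)]
    return {"model_version": model_version, "score": score, "result": result}

prediction = make_prediction(
    "qe-model-v1",  # hypothetical model name
    0.82,
    [{"region_id": "p1", "start": 13, "end": 16, "text": "end",
      "category_path": ["Fluency", "Grammar"], "severity": "minor"}],
)

# Attach the prediction to its task so both are imported together.
task = {
    "data": {"src": "Der Vertrag endet am 1. Mai.", "tgt": "The contract end on May 1."},
    "predictions": [prediction],
}
print(task)
```

Sharing one region id across the taxonomy and severity items is what ties both values to the same highlighted span in this sketch.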

Can annotators assign distinct severity ratings to overlapping text regions?

Yes, you can attach severity levels to individual spans by setting perRegion="true" on your Choices tag. This configuration allows a reviewer to mark a critical terminology error inside a larger segment flagged for a minor stylistic issue. The JSON export preserves the relationship between the character offsets, the taxonomic categories, and the specific severity ratings.
