Ways to do document annotation in Label Studio

April 30, 2026

Extracting accurate training data from documents requires choosing your Label Studio XML configuration before you import a single file. Pick the wrong tag group, and your downstream JSON exports come out in the wrong shape. Getting complex files through a labeling interface depends on matching the markup configuration to your data type. This article covers the XML tags, formatting prerequisites, server settings, and infrastructure limits needed to process plain text, multi-page images, and native PDFs correctly.

TL;DR

Label Studio Community requires converting documents into images and using the <Image valueList> tag, outputting an item_index property to indicate the page number.

Label Studio Enterprise natively renders PDFs up to 100 pages long using the <Pdf> tag to directly generate an output property containing pageIndex and ocrtext.

Image-based multi-page workflows demand external hosting governed by CORS (Access-Control-Allow-Origin) server policies.

Preprocessing in Python prevents offset corruption caused by \r\n line endings and by per-tag text length calculations in HTML documents.

Injecting machine-generated labels into the predictions array of your task payload accelerates manual review pipelines.

Why plain text conversion breaks document ML pipelines

Stripping PDFs to plain text destroys the visual context annotators rely on to spot OCR errors and layout-dependent entities. If a team processes scanned financial statements by extracting only the underlying text via a terminal command, the data appears clean in the interface until a reviewer attempts to map a "final amount due" label. Without visual table borders, spatial proximity to headers, font formatting cues, and stamped approval seals, the reviewer cannot differentiate the final total from a subtotal separated by a single hidden line break. Reviewers end up spending unbillable hours manually opening original files on a second monitor to fix broken entity relationships.

Formatting is data. Plain text extraction eliminates the layout coordinates machine learning models require to execute spatial reasoning. When parsing complex documents, downstream extraction algorithms rely heavily on the relative distance between bounding boxes to group related fields together. The primary engineering challenge becomes building a layout-aware workflow without exceeding the rendering limits of the browser.

The XML divide: how your license dictates your export schema

Your XML configuration determines how the software parses files and dictates the shape of your resulting information. Because the underlying rendering engines vary heavily depending on your software tier, developers need to finalize their technical approach before deploying workspaces.

Depending on your target data type and licensing tier, Label Studio supports three distinct document-annotation paths: document-level native PDF classification, region-level native PDF OCR review, and multi-page image-based document annotation. Choosing the correct path establishes your core data architecture for the project lifecycle.

Community projects pass a list of page image URLs to the interface through a field in the task data. The resulting multi-page JSON export includes an item_index property, starting at 0, that identifies the page on which each box was drawn. Each page is handled as a separate image with its own local coordinate system.
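As a rough illustration of the export shape described above, the sketch below groups drawn regions by page. The sample export is invented, and the exact placement of item_index on each result entry is an assumption; check both against a real export from your own project.

```python
from collections import defaultdict

# Hypothetical, trimmed-down export for one multi-page task.
# The placement of "item_index" on each result entry is an assumption.
sample_export = {
    "annotations": [
        {
            "result": [
                {
                    "item_index": 0,
                    "type": "rectanglelabels",
                    "value": {"x": 10.0, "y": 12.5, "width": 30.0, "height": 5.0,
                              "rectanglelabels": ["Title"]},
                },
                {
                    "item_index": 2,
                    "type": "rectanglelabels",
                    "value": {"x": 55.0, "y": 80.0, "width": 20.0, "height": 4.0,
                              "rectanglelabels": ["Total"]},
                },
            ]
        }
    ]
}


def regions_by_page(task):
    """Group region values by 1-based page number derived from item_index."""
    pages = defaultdict(list)
    for annotation in task.get("annotations", []):
        for region in annotation.get("result", []):
            # item_index starts at 0, so item_index 0 is page 1.
            page_number = region.get("item_index", 0) + 1
            pages[page_number].append(region["value"])
    return dict(pages)


grouped = regions_by_page(sample_export)
```

Converting to a 1-based page number at this boundary keeps the zero-based index out of reports that reviewers read.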

Label Studio Enterprise processes files differently. The Enterprise interface outputs a pageIndex property alongside coordinates tied to the document itself. Older forum threads still repeat the outdated claim that annotation software cannot handle native PDFs. Native rendering requires the <Pdf> tag in your XML configuration. Because the two paths emit different properties, migrating between them mid-project means rebuilding your dataset schemas.

Configuring the multi-page image workaround

Because Label Studio Community does not parse PDFs natively, operators build an image-conversion pipeline to get visible information onto the screen. Translating pages to PNG or JPEG formats requires specific infrastructure decisions to maintain labeling speed.

The multi-page image configuration is built on the <Image> element. Instead of a single value, the tag takes a valueList parameter that references a field in the task data holding an ordered array of hosted page URLs.
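A minimal sketch of that pairing follows. The field name "pages", the tag names, the label values, and the URL pattern are all placeholders chosen for illustration, not names required by Label Studio.

```python
import json
import xml.etree.ElementTree as ET

# Labeling configuration: valueList points at the "pages" field of the task data.
# The names "pdf", "rectangles", "Title", and "Total" are placeholders.
LABEL_CONFIG = """
<View>
  <Image name="pdf" valueList="$pages"/>
  <RectangleLabels name="rectangles" toName="pdf">
    <Label value="Title"/>
    <Label value="Total"/>
  </RectangleLabels>
</View>
"""


def build_task(base_url, doc_id, page_count):
    """Build one task whose "pages" field lists every page image URL in order."""
    pages = [
        f"{base_url}/{doc_id}/page-{number:03d}.png"
        for number in range(1, page_count + 1)
    ]
    return {"data": {"pages": pages}}


# Confirm the configuration is well-formed XML before pasting it into a project.
root = ET.fromstring(LABEL_CONFIG)
task = build_task("https://example.com/docs", "invoice-001", 3)
print(json.dumps(task, indent=2))
```

Zero-padding the page number keeps the URLs in page order when they are sorted as strings, which helps if the list is rebuilt from a bucket listing.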

Because the page images live on external storage, hosting policy matters. Web browsers block cross-origin requests by default to protect end users. Your storage bucket needs a CORS configuration that returns an Access-Control-Allow-Origin header permitting the Label Studio domain to fetch the files. If you skip configuring CORS rules on your AWS S3 bucket, the pages fail to render in the labeling interface, and the only trace is an error in the browser console.
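One way to express such a rule set for S3 is sketched below. The origin and bucket name are placeholders, and the rule values are an assumption about a reasonable read-only policy, not an official recommendation. The boto3 call is wrapped in its own function so that building the rules does not require boto3 or AWS credentials.

```python
def build_cors_configuration(label_studio_origin):
    """Return a CORS configuration that lets one origin read objects via GET."""
    return {
        "CORSRules": [
            {
                "AllowedOrigins": [label_studio_origin],
                "AllowedMethods": ["GET"],
                "AllowedHeaders": ["*"],
                "MaxAgeSeconds": 3600,
            }
        ]
    }


def apply_cors(bucket_name, label_studio_origin):
    """Apply the rules with boto3 (needs boto3 installed and AWS credentials)."""
    import boto3  # imported lazily so the rest of the sketch runs without it

    s3 = boto3.client("s3")
    s3.put_bucket_cors(
        Bucket=bucket_name,
        CORSConfiguration=build_cors_configuration(label_studio_origin),
    )


# Placeholder origin; replace with the domain that serves your Label Studio UI.
cors = build_cors_configuration("https://label-studio.example.com")
```

Listing the exact Label Studio origin instead of a wildcard keeps the bucket readable only from the labeling interface.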

Large documents also put significant pressure on the backend. One uploaded task containing 100 images generates load roughly equivalent to 100 separate tasks. To prevent frontend lag during bounding box drawing, keep the aggregate count under 100,000 for standard server deployments.
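The arithmetic can be checked before import with a small helper like the one below. It treats every page image as one task-equivalent, which is the rough rule stated above, not a measured figure; the project sizes are invented.

```python
PAGE_LOAD_BUDGET = 100_000  # rough ceiling quoted above for standard deployments


def total_page_load(page_counts):
    """Sum pages across documents; each page weighs about as much as one task."""
    return sum(page_counts)


def fits_budget(page_counts, budget=PAGE_LOAD_BUDGET):
    """Return True when the combined page count stays under the budget."""
    return total_page_load(page_counts) < budget


# 800 documents of 100 pages each: about 80,000 task-equivalents.
small_project = [100] * 800
# 1,200 documents of 100 pages each: about 120,000 task-equivalents.
large_project = [100] * 1200
```

A project that fails the check can be split across several projects or trimmed to the pages that actually need labels.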

Setting up native PDF OCR review

If managing external image servers and custom conversion scripts bogs down your pipeline, native parsing bypasses those infrastructure requirements. HumanSignal built Label Studio Enterprise to load documents natively, preserving spatial alignment without secondary hosting.

The native region-level PDF OCR review configuration pairs the Pdf and OcrLabels tags. The viewer supports rotation and zoom and renders files up to 100 pages long. Operating directly on the native file removes the latency introduced by external Python rendering scripts.

The Enterprise configuration fills in region properties automatically for every drawn bounding area. Each exported region records x, y, width, height, rotation, pageIndex, and ocrtext.
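Downstream code can guard against missing properties before training on the export. The sketch below assumes each region arrives as a flat dictionary carrying the seven properties named above; the real nesting of the Enterprise export may differ, and the sample values are invented.

```python
from collections import defaultdict
from dataclasses import dataclass

REQUIRED_FIELDS = ("x", "y", "width", "height", "rotation", "pageIndex", "ocrtext")


@dataclass(frozen=True)
class PdfRegion:
    x: float
    y: float
    width: float
    height: float
    rotation: float
    page_index: int
    ocr_text: str


def parse_region(value):
    """Validate one region dictionary and convert it to a PdfRegion."""
    missing = [name for name in REQUIRED_FIELDS if name not in value]
    if missing:
        raise ValueError(f"region is missing fields: {missing}")
    return PdfRegion(
        x=float(value["x"]),
        y=float(value["y"]),
        width=float(value["width"]),
        height=float(value["height"]),
        rotation=float(value["rotation"]),
        page_index=int(value["pageIndex"]),
        ocr_text=str(value["ocrtext"]),
    )


def text_by_page(values):
    """Collect the OCR text of every region, keyed by its pageIndex."""
    pages = defaultdict(list)
    for value in values:
        region = parse_region(value)
        pages[region.page_index].append(region.ocr_text)
    return dict(pages)


# Invented sample regions in the assumed flat shape.
sample_values = [
    {"x": 12.0, "y": 8.0, "width": 40.0, "height": 3.5, "rotation": 0,
     "pageIndex": 0, "ocrtext": "Invoice 1042"},
    {"x": 60.0, "y": 90.0, "width": 25.0, "height": 3.0, "rotation": 0,
     "pageIndex": 1, "ocrtext": "Total due: 100.00"},
]
pages = text_by_page(sample_values)
```

Failing loudly on a missing property surfaces a schema mismatch at load time instead of inside a training run.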

Highlighting text interactively preserves coordinates that map back to the original source file. The native tool does carry one functional limitation: the source document needs a selectable text layer. Flattened scans lack one. If you import a document without embedded characters, you must run external OCR over the scans first so the application has text to read.

Managing character offsets in plain text and HTML documents

Whether you process plain text or parse raw HTML, browser-based rendering introduces rigid alignment rules. Both formats demand aggressive upstream sanitization to prevent mapping failures during data export.

For text annotation tasks, the \r\n sequence counts as two characters. Windows systems typically end lines with \r\n, while Unix systems use a single \n. If a user highlights a span that crosses or follows a line break, the exported start and end offsets can disagree with the offsets your downstream code computes on the same text.
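The drift is easy to reproduce in plain Python. This sketch uses an invented two-line snippet and shows that offsets computed on the Unix version of a text no longer point at the same characters in the Windows version.

```python
def find_span(text, needle):
    """Return (start, end) character offsets of the first match of needle."""
    start = text.index(needle)
    return start, start + len(needle)


unix_text = "Subtotal: 90.00\nTotal due: 100.00"
windows_text = unix_text.replace("\n", "\r\n")

unix_span = find_span(unix_text, "100.00")
windows_span = find_span(windows_text, "100.00")

# Each \r\n before the span pushes the offsets one character to the right.
drift = windows_span[0] - unix_span[0]

# Applying Unix offsets to the Windows text selects the wrong characters.
misread = windows_text[unix_span[0]:unix_span[1]]
```

With one line break before the span the drift is a single character; in a long document it grows by one for every preceding line break.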

HTML documents trigger similar offset errors. The startOffset and endOffset properties count characters within the individual element that contains each end of the selection. The globalOffsets property, by contrast, counts characters continuously across the parent document. The two counting rules frequently disagree for overlapping elements and heavily nested layout components.

Adding two preprocessing rules to your data pipeline stops these alignment problems before they reach the interface. Implement them in Python with standard regex or parsing libraries:

Normalize all line breaks to a single \n before importing text, so character offsets do not drift.

Strip hidden div elements and styling wrappers from HTML files before loading them into the workspace.
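Both rules can be sketched with the standard library alone. The version below assumes well-formed HTML, treats div, span, and font as styling wrappers to unwrap, and drops attributes from the tags it keeps; those choices are simplifications for illustration, and a production pipeline might prefer a full parsing library.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img", "input", "meta", "link", "wbr"}
WRAPPER_TAGS = {"div", "span", "font"}  # assumed styling-only wrappers


def normalize_line_breaks(text):
    """Convert CRLF and lone CR line endings to a single LF."""
    return text.replace("\r\n", "\n").replace("\r", "\n")


class HtmlFlattener(HTMLParser):
    """Drop hidden elements entirely and unwrap styling wrappers."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.hidden_depth = 0  # greater than 0 while inside a hidden element

    @staticmethod
    def _is_hidden(attrs):
        attr_map = dict(attrs)
        style = (attr_map.get("style") or "").replace(" ", "").lower()
        return "hidden" in attr_map or "display:none" in style

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            if self.hidden_depth == 0:
                self.parts.append(f"<{tag}>")
            return
        if self.hidden_depth or self._is_hidden(attrs):
            self.hidden_depth += 1
            return
        if tag not in WRAPPER_TAGS:
            self.parts.append(f"<{tag}>")  # attributes are dropped

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.hidden_depth:
            self.hidden_depth -= 1
            return
        if tag not in WRAPPER_TAGS:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if self.hidden_depth == 0:
            self.parts.append(data)


def flatten_html(html):
    """Return html with hidden elements removed and wrappers unwrapped."""
    parser = HtmlFlattener()
    parser.feed(normalize_line_breaks(html))
    parser.close()
    return "".join(parser.parts)
```

Running the cleanup once, before import, means the text the annotator sees and the text your downstream code indexes are the same string.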

Accelerating document workflows with pre-annotations

Once the rendering setup is stable, machine-generated proposals can raise throughput. External model outputs are attached to each task through the predictions array of the import payload. Reviewing and correcting proposed regions takes significantly less effort than tracing bounding boxes from scratch. Teams also import read-only model predictions to inspect Named Entity Recognition pipelines visually.
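A hedged sketch of such a payload for a multi-page image task follows. The names "pages", "pdf", and "rectangles" are placeholders that would have to match your own labeling configuration, the model version string is invented, and the use of item_index on predicted regions is an assumption mirroring the export format. Coordinates are percentages of the page size.

```python
import json
import uuid


def build_prediction_task(page_urls, boxes, model_version="layout-model-v1"):
    """Build one import task that carries model proposals in "predictions"."""
    result = []
    for box in boxes:
        result.append(
            {
                "id": uuid.uuid4().hex[:10],
                "from_name": "rectangles",  # placeholder control tag name
                "to_name": "pdf",           # placeholder object tag name
                "type": "rectanglelabels",
                "item_index": box["page"],  # assumed zero-based page index
                "value": {
                    "x": box["x"],
                    "y": box["y"],
                    "width": box["width"],
                    "height": box["height"],
                    "rotation": 0,
                    "rectanglelabels": [box["label"]],
                },
            }
        )
    return {
        "data": {"pages": page_urls},
        "predictions": [{"model_version": model_version, "result": result}],
    }


task = build_prediction_task(
    ["https://example.com/docs/invoice-001/page-001.png",
     "https://example.com/docs/invoice-001/page-002.png"],
    [{"page": 1, "label": "Total", "x": 60.0, "y": 90.0, "width": 25.0, "height": 3.0}],
)
print(json.dumps(task, indent=2))
```

Recording a model version with each batch makes it possible to compare reviewer corrections across model releases later.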

Large Language Models need explicit prompting to return responses in a format your annotation pipeline can load. Python libraries like Pydantic can validate that the JSON returned by OpenAI models matches the labels and fields in your XML configuration, a workflow detailed on the HumanSignal blog. The validation code identifies an invoice total, wraps it in the expected region structure, and pushes the result to Label Studio for human verification. Reviewers accept the proposed region and move to the next document without drawing bounding boxes by hand.
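The validation step can be illustrated without any third-party dependency. The sketch below uses a standard-library dataclass as a stand-in for a Pydantic model, so it is not the workflow from the HumanSignal blog; the label set, field names, and percentage-based coordinates are assumptions for illustration.

```python
import json
from dataclasses import dataclass

ALLOWED_LABELS = {"Total", "Title"}  # hypothetical labels from the XML configuration


@dataclass(frozen=True)
class ProposedRegion:
    label: str
    page: int
    x: float
    y: float
    width: float
    height: float


def parse_model_output(raw_json):
    """Parse and validate one model response; raise ValueError on any problem."""
    payload = json.loads(raw_json)  # JSONDecodeError is a subclass of ValueError
    if not isinstance(payload, dict):
        raise ValueError("expected a JSON object")

    label = payload.get("label")
    if not isinstance(label, str) or label not in ALLOWED_LABELS:
        raise ValueError(f"unknown label: {label!r}")

    page = payload.get("page")
    if isinstance(page, bool) or not isinstance(page, int) or page < 0:
        raise ValueError(f"page must be a non-negative integer, got {page!r}")

    coords = {}
    for name in ("x", "y", "width", "height"):
        number = payload.get(name)
        if isinstance(number, bool) or not isinstance(number, (int, float)):
            raise ValueError(f"{name} must be a number, got {number!r}")
        if not 0 <= number <= 100:
            raise ValueError(f"{name} must be a percentage between 0 and 100")
        coords[name] = float(number)

    if coords["x"] + coords["width"] > 100 or coords["y"] + coords["height"] > 100:
        raise ValueError("region extends past the page edge")

    return ProposedRegion(label=label, page=page, **coords)


region = parse_model_output(
    '{"label": "Total", "page": 1, "x": 60, "y": 90, "width": 25, "height": 3}'
)
```

Rejecting malformed responses here keeps unusable regions out of the review queue, so reviewers only see proposals that can actually be rendered.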

Establishing your data pipeline infrastructure

Formatting is data. Bypassing custom image-conversion scripts and moving directly to HumanSignal's native PDF rendering inside Label Studio Enterprise preserves spatial context without adding fragile server overhead. The reliability of your pipeline is decided before you load a single asset into the application. Configure your labeling interface correctly, and review the exported properties to prepare your dataset for downstream extraction models.

Can Label Studio label native PDFs?

Yes. Label Studio Enterprise renders PDFs natively, up to 100 pages per file, using the <Pdf> tag. Open-source users convert files into images with external libraries to approximate the native workflow. Older documentation and forum posts claiming that PDFs cannot be annotated visually no longer apply to the current Enterprise application.

Why are my multi-page document images failing to load?

Your file hosting server likely lacks the CORS policy that grants the browser access. The labeling interface can only load page images from another origin when the host returns an Access-Control-Allow-Origin header that permits the Label Studio domain. Updating your cloud storage rules to allow that domain resolves the missing images.

Why are my text character offsets misaligned on export?

Line-break sequences such as \r\n count as two characters, so exported offsets can disagree with the positions you see or compute elsewhere. Normalizing line endings across your source text with Python before uploading resolves the misalignment. Leaving invisible HTML layout wrappers in place produces comparable offset errors.

What is the page limit for document annotation?

The Enterprise platform renders up to 100 pages natively per document task. Community users face a different constraint: one uploaded task containing 100 images generates load equivalent to roughly 100 discrete tasks. Going well beyond 100,000 total tasks per project slows response times on standard deployments.

How do I output pageIndex instead of item_index?

The pageIndex property comes from the native <Pdf> tag available in Label Studio Enterprise. Community environments, which are limited to image-list configurations, return item_index instead. Reformatting or regrouping the image collection does not change which property the export contains; that is determined by the tag and the tier.