NLP Autolabeling with Label Studio
In this workshop, the Label Studio team demonstrates how to use LLM-based prompts to automate labeling for NLP projects. Learn how to set up prompts, evaluate performance, and scale annotation efficiently with built-in review workflows.
Transcript
Nate:
Welcome, everyone, to today’s webinar on NLP auto-labeling with Label Studio. This is going to be a hands-on workshop, and we’ll walk through new functionality available in Label Studio Enterprise and Starter Cloud. The same functionality is included in our 14-day free trial, no credit card required, so if you haven’t signed up yet, go ahead and do that now. I’ll drop the link in the chat.
We are recording this session, and you’ll receive a link to the recording in a day or two. Unlike a typical webinar, this is more interactive, so if you have questions or run into issues, drop them in the chat or the Q&A widget. We’ll address them as we go to keep everyone moving smoothly.
With that, I’ll turn it over to Sheree, Product Manager here at HumanSignal and the driving force behind this feature. Also with us is Michaela, our ML Evangelist, who’ll help field questions. Sheree, take it away.
Sheree:
Thanks, Nate. Good morning, afternoon, or evening to everyone. I’m excited to walk you through NLP auto-labeling with Label Studio. If you’ve seen our recent blog post, today’s walkthrough will feel familiar.
We’ll be working with an NLP project that analyzes product reviews, and we’ll go through three key steps: setting up prompts, evaluating prompt versions using ground truth data, and reviewing the output using human-in-the-loop workflows.
Let’s start by looking at some of the challenges teams face when labeling with LLMs. Manual labeling is time-consuming and expensive, especially for projects that require subject matter expertise or internal training. Scaling manual labeling to larger datasets becomes impractical, and traditional automation pipelines require heavy infrastructure.
LLMs, while powerful, have their own hurdles. They’re non-deterministic, meaning you can get different outputs from the same prompt. Their formats can vary, which complicates integration. Setting up batch processing for LLMs also requires engineering work, and generic models may not understand domain-specific terminology. Without feedback loops, it’s hard to gauge how well a prompt will scale.
We’ve built Label Studio’s prompts feature to address all of these challenges. It adds a reliability layer to constrain outputs to the formats you need. You can iterate on prompt versions directly in the UI and run evaluations without needing to build your own infrastructure.
It’s accessible to subject matter experts who may not be familiar with LLM prompting—they can add logic, examples, definitions, and review predictions through a user-friendly interface. And it supports human-in-the-loop workflows so you can monitor performance and improve prompts over time.
Here are just a few use cases our users are tackling with prompts: classifying customer service emails, detecting topics in documents, analyzing chatbot conversations, extracting PII, and summarizing long-form text. One major manufacturing customer saw 93% labeling accuracy relative to manual work, at a fraction of the cost and with 4–5x faster throughput.
Let’s get into the live demo.
Michaela:
The sign-up link is in the chat. Let us know if anyone runs into trouble.
Sheree:
Thanks, Michaela. Once you sign in, you’ll land in Label Studio and see a demo project called "Product Reviews." That’s the dataset we’ll use today. It contains 100 product reviews, 20 of which are already labeled so we can jump straight into evaluation.
Our task is to analyze these reviews for sentiment, identify relevant categories, and extract reasoning. Some tasks include entity labeling too, like identifying emotions, locations, or food-related terms.
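For anyone curious what sits behind a project like this: the labeling configuration roughly combines sentiment choices, category choices, a free-text reasoning field, and a few entity labels. Here’s an illustrative sketch of how you might set up a similar project yourself with the Python SDK. This uses the older label-studio-sdk Client interface, and the URL, API key, and label values are placeholders rather than the exact demo config.

```python
from label_studio_sdk import Client  # older Client interface of label-studio-sdk

# Placeholders: point these at your own instance and API key.
ls = Client(url="https://app.humansignal.com", api_key="YOUR_API_KEY")

# A sketch of a product-review config: sentiment, categories,
# free-text reasoning, and a few entity labels.
label_config = """
<View>
  <Text name="review" value="$text"/>
  <Choices name="sentiment" toName="review" choice="single">
    <Choice value="Positive"/>
    <Choice value="Neutral"/>
    <Choice value="Negative"/>
  </Choices>
  <Choices name="category" toName="review" choice="multiple">
    <Choice value="Quality"/>
    <Choice value="Price"/>
    <Choice value="Shipping"/>
  </Choices>
  <TextArea name="reasoning" toName="review" placeholder="Why?"/>
  <Labels name="entities" toName="review">
    <Label value="Emotion"/>
    <Label value="Location"/>
    <Label value="Food"/>
  </Labels>
</View>
"""

project = ls.start_project(title="Product Reviews (sketch)", label_config=label_config)
print(project.id)
```

You don’t need to run anything like this today; the demo project is already configured. It just shows the shape of the config that Prompts reads your label types from.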
If you were labeling all 100 tasks manually, that could take hours. But since we already have some labeled data, we can use it as a base and automate the rest. To get started, click on the hamburger menu in the top left, then click "Prompts." There you’ll see a demo prompt associated with the product reviews project.
Click into that demo prompt, and you’ll see the onboarding guide walk through how prompts work. On the left, you’ll see your labeled dataset. You can select your base LLM—here, we’re using GPT-4o mini—and then hit “Evaluate” to get predictions from the model.
The predictions will appear as purple labels, and the ground truth annotations show up with a gold star. Once predictions are generated, you’ll see metrics below: accuracy, number of outputs, and inference cost.
Here we got 66% accuracy on our prompt. We had 19 out of 20 outputs complete successfully, and the cost was minimal—less than a cent. If you click into a row, you’ll see both the human annotation and the LLM prediction side by side, including the extracted entities, predicted sentiment, and reasoning.
This helps you quickly assess whether your prompt is working or if it needs improvement.
Michaela:
Sheree, we’ve got a question in the chat: What types of labels does prompts support?
Sheree:
Great question. Prompts supports the core Label Studio labeling types: classification (via the Choices tag), named entity recognition (Labels), pairwise ranking (Pairwise), and rating (Rating). We have a full list of supported tags in our documentation.
Michaela:
Another attendee asks: Why does the LLM sometimes break entities mid-word?
Sheree:
That’s due to how LLMs tokenize inputs—they work with tokens, not words. If you don’t specify that entities should be returned as full words, you might see partial outputs. To fix this, you can refine your prompt by adding clearer instructions like “return only full-word entities.” You can also add descriptions to guide how LLMs interpret each label type.
Let’s say you want to improve your prompt. Once you’re back in the prompt interface, click “Create Prompt” in the top right. You’ll select your project and see the available label types automatically parsed from your config.
You can write a basic prompt to get started—maybe something like “Here’s a review: {{text}}”—and run it. Even with a basic prompt, the system passes the necessary context behind the scenes, so the LLM can still generate outputs. But the results won’t be great.
To improve it, click “Enhance Prompt.” This will generate a more detailed version, including definitions, label expectations, and a few-shot example. The new prompt is saved as a new version so you can compare metrics. In our test, accuracy went up after enhancement.
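To give you a feel for it, an enhanced prompt tends to look something like the sketch below. This is illustrative rather than the exact text the Enhance Prompt feature generates, but note the per-label definitions, the full-word entity instruction we talked about earlier, and the few-shot example:

```
You are labeling product reviews.

For the review below:
1. Sentiment: choose exactly one of Positive, Neutral, or Negative.
2. Categories: list every category that applies (for example Quality, Price, Shipping).
3. Reasoning: write one or two sentences explaining your choices.
4. Entities: return spans for Emotion, Location, and Food terms.
   Return only full-word entities; never split a word partway through.

Example
Review: "The pasta arrived cold, but the delivery driver was friendly."
Sentiment: Negative
Categories: Quality, Shipping
Entities: pasta (Food), friendly (Emotion)

Review: {{text}}
```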
Michaela:
Can users choose their own LLMs, or are they locked to what HumanSignal provides?
Sheree:
You can absolutely bring your own. In the top-right of the prompts page, click “API Keys.” You can remove our default OpenAI key and add your own. You can also connect Azure OpenAI deployments or custom endpoints that follow the OpenAI format.
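“Follows the OpenAI format” just means the endpoint speaks the standard chat completions API. As a rough sketch, if you can reach your model like this with the openai Python client, you can generally register it under API Keys. The base URL and model name here are placeholders, not real HumanSignal values.

```python
from openai import OpenAI

# Placeholders: any gateway or deployment that serves the
# /v1/chat/completions API shape should work the same way.
client = OpenAI(
    base_url="https://your-gateway.example.com/v1",
    api_key="YOUR_KEY",
)

response = client.chat.completions.create(
    model="your-deployed-model",
    messages=[{"role": "user", "content": "Classify this review: great value, slow shipping."}],
)
print(response.choices[0].message.content)
```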
Michaela:
Roughly how far will the $5 in credits go?
Sheree:
It depends on the size and complexity of your tasks. In our example, 20 predictions cost about 0.4 cents (roughly 0.02 cents per prediction), so you could easily run thousands of inferences on GPT-4o mini before hitting that limit. You can monitor costs directly in the UI using the cost KPI.
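As a rough illustration, here’s the back-of-envelope math behind that estimate, using the numbers from this demo. Your per-prediction cost will vary with review length and model choice.

```python
# Numbers taken from the demo run; treat them as ballpark figures.
cents_for_20 = 0.4                                  # about 0.4 cents for 20 predictions
cost_per_prediction_cents = cents_for_20 / 20       # about 0.02 cents each
trial_credits_cents = 500                           # $5 in credits

print(round(trial_credits_cents / cost_per_prediction_cents))  # about 25,000 predictions
```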
Let’s return to our prompt results. If you're happy with the model's performance, you can scale from evaluation to labeling the entire project. Just select “All project tasks” and click “Run.” We just labeled 100 tasks for 12 cents. These outputs are saved as pre-labels, so when annotators come in, they’re not starting from scratch—they can focus on correcting edge cases.
This approach scales well, especially for large datasets. You get a head start on labeling, reduce cost, and still maintain quality through human review.
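And if you prefer to check that progress programmatically rather than in the UI, something along these lines works. Again, this is a sketch against the older label-studio-sdk Client interface; the URL, key, and project ID are placeholders.

```python
from label_studio_sdk import Client  # older Client interface

ls = Client(url="https://app.humansignal.com", api_key="YOUR_API_KEY")
project = ls.get_project(12345)  # hypothetical project ID

# The JSON export carries both model pre-labels ("predictions") and
# human work ("annotations"), so you can see what still needs review.
tasks = project.export_tasks(export_type="JSON")
unreviewed = [t for t in tasks if not t.get("annotations")]
print(f"{len(unreviewed)} of {len(tasks)} tasks still waiting on human review")
```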
Michaela:
This is amazing. Is image data support coming to prompts?
Sheree:
It is! Our team is actively working on supporting image use cases—classification, captioning, and combined image-text tasks. Expect that to roll out in the next month or so. If you're interested, reach out and we’d be happy to walk you through it.
Michaela:
Can’t wait to try that out.
Sheree:
Before we wrap up, if you’d like a written version of this walkthrough, visit the HumanSignal blog. There’s a post called “NLP Labeling with Label Studio Prompts” that steps through everything we covered today—from creating prompts to reviewing results and scaling across your full dataset.
Nate:
Thanks so much, Sheree and Michaela. That wraps up today’s session. We’ll send out the recording and all the links via email shortly. Have a great rest of your day—and happy holidays!