
Get Started with Data Labeling

Data labeling (sometimes referred to as data annotation or dataset development) is an important step in your data science or machine learning practice. By adding meaningful information to your data through labeling, you can improve the accuracy of your models, identify and remove bias, and improve the efficiency of your machine learning and data science operations.

Data Labeling in the ML Pipeline

  1. Before you train a model: Many traditional Machine Learning models, especially those that use supervised or semi-supervised learning, require data that has been labeled with the correct answer so that the model can learn.
  2. Evaluating the model: After the model is trained, it is important to evaluate its performance on the validation and test data. This step allows you to determine how well the model can generalize to new data and to identify any issues or areas for improvement. Labeling this data can look like validating the model output, or you can use test data that was previously labeled.
  3. Fine-tuning the model: Based on the evaluation results, the model may need to be fine-tuned by adjusting the algorithms and parameters or by collecting additional data. This feedback loop can require the addition of new labeled data, and the labels and labeling processes themselves may be fine-tuned. This step may need to be repeated multiple times until the model's performance is satisfactory.
  4. Continuously monitoring and updating: Even after the model is deployed, monitoring its performance and accuracy is important. Without regular updates, models can drift in the effectiveness of their predictions. Preventing this requires collecting new data to test and retrain the model as necessary. New, accurately labeled data is a critical component of this ongoing process.
  5. Large Language Models: LLMs also benefit from data labeling. You can use data labeling to better understand how an LLM is performing on some data given a prompt, or use labeled data to implement an active learning loop to fine-tune your LLM. LLMs also make great labelers themselves. Either way, keeping a human in the loop with labeled data is essential to building robust, ethical systems.
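
To make that last point concrete, below is a minimal sketch of using an LLM to pre-label a movie review. It assumes the openai Python package and an API key in your environment; the model name and prompt are placeholders, and any pre-labels it produces should still be reviewed by a human in the loop.

```python
# A minimal sketch of using an LLM as a pre-labeler, assuming the openai package
# and an OPENAI_API_KEY environment variable. Model name and prompt are illustrative.
from openai import OpenAI

client = OpenAI()

def llm_label(review: str) -> str:
    """Ask the model to pre-label a movie review as Positive or Negative."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # hypothetical choice; any chat model works here
        messages=[
            {"role": "system",
             "content": "Label the movie review as Positive or Negative. "
                        "Reply with exactly one word."},
            {"role": "user", "content": review},
        ],
        temperature=0,
    )
    return response.choices[0].message.content.strip()

print(llm_label("A charming, beautifully shot film with a weak ending."))
```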

Beginning the Labeling Process

Beginning an annotation project can feel overwhelming, and it can be hard to know exactly where to start building out your project.

Step 1: Understand your project

The first step to building out an annotation project is to understand your overarching project and its goals. Having a good idea of what you’re trying to achieve and how you’ll know you succeeded can help guide the rest of the process. Below are some questions to help you think through your project before you begin.

  • What are you trying to accomplish with this annotation project? Are you trying to train a model, or are you hoping to evaluate the performance of another model or system? A good understanding of your goals can help you pick the right questions to ask your annotators, and will lead you to the right data to use for the job.
  • What metrics will you use to prove success? Metrics such as precision, recall, and F1 are good if you’re looking for an in-depth understanding of how well a new model or an existing system is performing (a small sketch of these appears after this list). For LLMs, you might want to use other metrics such as relevancy or faithfulness. If you’re in the early stages of evaluating a model, you might just be looking for a “vibe check” – a broad idea of how well the model is performing without the specificity of clean metrics. Knowing which metrics you’re using will help you better understand what kind of data you need to collect from your annotators.
  • What do you need to accomplish your goals and calculate your metrics? Now that you know what your goal is and how you’ll prove success, you can think about what data you’ll need to get there. You’ll need two types of data: evaluation data, the data you’re going to annotate, and annotation data, the data you collect through the annotation process.
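
To make the metrics question concrete, here is a minimal sketch of computing precision, recall, and F1 with scikit-learn, assuming you already have annotator labels and model predictions side by side; the label values shown are placeholders.

```python
# A minimal sketch of the metrics mentioned above, assuming ground-truth labels
# (e.g. from annotators) and model predictions are available as two lists.
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = ["positive", "negative", "positive", "negative", "positive"]  # annotator labels
y_pred = ["positive", "negative", "negative", "negative", "positive"]  # model output

print("precision:", precision_score(y_true, y_pred, pos_label="positive"))
print("recall:   ", recall_score(y_true, y_pred, pos_label="positive"))
print("f1:       ", f1_score(y_true, y_pred, pos_label="positive"))
```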

Step 2: Collect your data

Once you’ve thought about the overarching goals and needs of your project, you’re ready to gather your data for this project. It’s important that you spend a little time understanding exactly what data you’re using. Why? Understanding your data will help you better understand the expected distribution of labels that you’ll likely get from annotators, so that you can address issues that may arise such as class imbalance. A deep understanding of your data also allows you to know what kinds of questions – and possible answers – your annotators will run into, whether your annotators are humans or a model. Finally, you’ll be able to look for any inconsistencies, flaws, or errors in your data that may impact your modeling.
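
A quick first pass over the raw data can surface many of these issues early. The sketch below assumes the reviews live in a CSV file with a review column (both the file name and the column name are hypothetical) and checks for missing values, duplicates, and length outliers.

```python
# A minimal sketch of a first pass over the raw data, assuming the reviews sit
# in a CSV file with a "review" column (file and column names are hypothetical).
import pandas as pd

df = pd.read_csv("imdb_reviews_sample.csv")

print(df.shape)                           # how many rows and columns?
print(df["review"].isna().sum())          # missing reviews
print(df["review"].duplicated().sum())    # exact duplicates
print(df["review"].str.len().describe())  # length distribution: any extreme outliers?
```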

Understanding your data is also important from a responsible AI perspective. In 2018, Gebru et al. released a paper called Datasheets for Datasets. This paper provides an in-depth look at why it’s important to know what’s in your data so that you can be aware of the potential biases it may be introducing into your model or your understanding of model performance. In the appendix of the paper, Gebru et al. provide an extensive list of questions practitioners should answer to use and distribute their data most responsibly. Academic and industry best practice is trending towards creating datasheets for datasets that answer some or all of these questions.

For the sake of this exploration, we’ll be working from an open and well-known dataset: the IMDB Dataset provided by Andrew Maas (ref). This is a very large dataset, with over 100,000 reviews. For this tutorial, we will use a much smaller sample of only 100 reviews, but this will give you a flavor of how you can organize a large dataset for distribution to your labeling team. Click here to learn more about the full dataset and how we processed it.
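
As a rough illustration, here is one way to draw a small sample like this, assuming the Hugging Face datasets package; the tutorial’s own 100-review sample was prepared separately, so treat this as a sketch rather than the exact processing we used.

```python
# A minimal sketch of drawing a small sample from the IMDB dataset, assuming the
# Hugging Face "datasets" package; file name and sample size are illustrative.
import json
from datasets import load_dataset

imdb = load_dataset("imdb", split="train")         # labeled training reviews
sample = imdb.shuffle(seed=42).select(range(100))  # reproducible 100-review sample

# Label Studio imports JSON tasks with the raw fields nested under "data".
tasks = [{"data": {"text": row["text"]}} for row in sample]
with open("imdb_sample_tasks.json", "w") as f:
    json.dump(tasks, f, indent=2)
```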

Supported Data Types in Label Studio

One of the community’s most-loved features of Label Studio is its ability to handle a multitude of different file types in the same platform. Whether you’re working with text, audio, images, video, or time series data, Label Studio has you covered.

The file type may change depending on your use case and the project you’re working with.

It’s important to understand which file type is best for your goals and how best to format your data to prepare it for labeling.
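
To give a sense of what this looks like in practice, the sketch below shows how a few different data types can be represented as Label Studio tasks; the URLs are placeholders, and the key names only need to match the variables referenced in your labeling configuration.

```python
# A minimal sketch of how different data types look as Label Studio tasks: each
# task nests its raw content under "data", and the key names (text, image, audio)
# correspond to the $variables used in your labeling configuration. URLs are placeholders.
text_task  = {"data": {"text": "One of the best films I have seen this year."}}
image_task = {"data": {"image": "https://example.com/frames/frame_001.jpg"}}
audio_task = {"data": {"audio": "https://example.com/clips/review_001.wav"}}
```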

This tutorial aims to prepare the data for training a sentiment analysis model for movie reviews. Sentiment analysis is one of the most popular use cases for data labeling and machine learning. This falls under a category of machine learning known as Natural Language Processing, or NLP.

Step 3: Build an Annotation Schema

With a deep understanding of your data, you’re in a good position to build out your annotation schema, or the list of questions you’ll ask annotators and the format in which you’ll ask them.

First, you’ll want to outline all the tasks that you’re asking the annotator to do, be it a human or a model. Then, for each task, you’ll want to make sure that you define all key terms and explain all labels or scales that you’ll be using, so that each annotation can be consistent regardless of who the annotator is. If you’re using a model to pre-label your data, you’ll want to make sure you explain how the humans in the loop will vet the answers. If you’re evaluating an LLM, you’ll want to make sure you explain what aspects of the LLM output you’re evaluating and how you’re evaluating them. What does accuracy look like? What does fluency look like? For the humans in the loop, regardless of whether they’re vetting a model output or annotating brand new data, make sure to provide positive and negative examples for each label they can assign, and make sure you address any ambiguities that may arise.
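
As a concrete example, here is a minimal sketch of a sentiment analysis schema expressed as a Label Studio labeling configuration and created through the Python SDK. It assumes the label_studio_sdk package with its older Client interface (newer SDK releases expose a different client); the URL, API key, and task file are placeholders.

```python
# A minimal sketch of a sentiment-analysis annotation schema as a Label Studio
# labeling configuration, created via the Python SDK. Assumes the label_studio_sdk
# package's older Client interface; URL, API key, and task file are placeholders.
from label_studio_sdk import Client

LABEL_CONFIG = """
<View>
  <Text name="review" value="$text"/>
  <Choices name="sentiment" toName="review" choice="single">
    <Choice value="Positive"/>
    <Choice value="Negative"/>
  </Choices>
</View>
"""

ls = Client(url="http://localhost:8080", api_key="YOUR_API_KEY")
project = ls.start_project(title="IMDB Sentiment", label_config=LABEL_CONFIG)
project.import_tasks("imdb_sample_tasks.json")  # the sample file sketched in Step 2
```

The `$text` variable in the configuration points at the `text` key in each task’s `data` object, which is how the schema and the imported data stay linked.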

Now, you’re ready to start building out your Label Studio Project and begin labeling!