The Building Blocks of an Efficient Data Labeling Process
According to Arvind Krishna, senior vice president of cloud and cognitive software at IBM, about 80% of the activities done during AI/ML projects are centered around collecting and preparing data.
Data labeling is crucial to providing AI/ML models with the accurate, consistent, and continuous data they need to better understand the world around us. It is part of the data preparation tedium that determines how well self-driving cars can avoid collisions, security cameras can send alerts for suspicious activity, or digital assistants can recognize your voice, among other things.
Instituting an efficient data labeling process is the key to eliminating inaccuracies in the data fed to algorithms. There are some generally applicable principles that can improve the efficiency and accuracy of your data labeling process.
What Is Data Labeling?
Data labeling refers to a systematic process for creating the annotated datasets that train machine learning models to power data-centric AI solutions.
When data labeling is properly done, it provides context to machine learning models, fast-tracks their learning process, and accelerates the development of reliable AI-powered solutions. The usefulness of data labeling cuts across many fields, from speech recognition to natural language processing and computer vision. For instance, data labeling is how you train AI to differentiate between uttered words to create reliable voice-controlled or text-to-speech solutions.
There are five main data labeling approaches, and the one you choose depends on a number of factors, such as how much data you want to label, how fast you need to label it, and what resources you can apply to the exercise.
- Internal: Using in-house data science experts to label data
- Synthetic: Using computing resources to generate new project data from existing datasets
- Programmatic: Running scripts to automate the data labeling process without the need for human annotators
- Outsourced: Using and managing freelance or ad hoc data professionals to do data labeling, such as on Amazon Mechanical Turk
- Crowdsourced: Using micro-tasking solutions layered on web-based distribution to integrate data labeling tasks into regular web activity, such as the reCAPTCHA project
How to Create an Efficient Data Labeling Process
Data labeling adds relevant metadata that creates reliable connections between the raw data your models are fed and the actions expected in response to that data. Data is crucial to training AI/ML algorithms to carry out their intended functions correctly. However, data by itself doesn’t provide enough context or meaning to equip an algorithm with the information it needs to make the right decision.
Below are some of the key building blocks for developing an efficient data labeling process:
1. Provide the Data Labeling Team with the Right Tooling
Many AI/ML projects rely on third-party data labeling solutions, while others rely on open-source solutions or build in-house data labeling tools. Irrespective of the origins of your data labeling solution, a data labeling toolbox should include annotation tools for the following data types:
- Image
- Video
- Text
- Audio
- Time-series
- Sensor
It is also helpful to have clear and comprehensive documentation, resources, tutorials, and training materials, and the tools should support troubleshooting, reporting, and issue resolution.
When labeling datasets from different sources or in different formats, a data labeling solution that supports multiple file formats makes the job of data labelers easier. A file export feature also makes it easier to transfer labeled datasets to your machine learning environments in supported file formats and extensions.
Beyond feature coverage, your data labeling tools should offer a streamlined, intuitive user interface. The interface in which labeling happens must minimize cognitive load and context switching for human labelers, and it should provide easy access to the dataset, the label set, and the data labeling instructions.
2. Insist on Proper Dataset Collection and Cleaning
The quality of your datasets is crucial to the effectiveness of your data labeling process. When building out a data labeling process, the challenging part comes down to dataset collection: collect too little data, and your model risks overfitting to the small sample; collect unrepresentative data, and it will generalize poorly no matter how much you have.
To determine how much data you need, the one in ten rule, which suggests including at most one predictive variable for every ten events in your dataset, is a good starting point. There’s a long list of other rules of thumb that can guide you in choosing the number of parameters, managing regularization, or deciding on the number of hash buckets, among other things.
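Here is a minimal sketch of that arithmetic; the dataset size and event count are hypothetical:

```python
def max_predictors(num_events: int, events_per_variable: int = 10) -> int:
    """Upper bound on predictive variables under the one in ten rule."""
    return num_events // events_per_variable

# Hypothetical binary classification dataset: 5,000 records,
# 400 of which are positive events.
print(max_predictors(400))  # -> 40: keep the model to roughly 40 predictors or fewer
```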
Collect diverse data across multiple sources to minimize dataset bias. For example, training a model for autonomous flight requires datasets covering a wide range of scenes, including day and nighttime as well as degraded visual conditions.
This diverse data also reduces confusion and shortens the learning journey for your algorithms. Representative sampling helps to better anticipate and manage “known unknowns” and “unknown unknowns” within the knowledge quadrant of your model.
Aside from a good method of data collection, you need a streamlined process for cleaning data to fix or remove incomplete, duplicate, corrupted, incorrect, or wrongly formatted data in the dataset. If you label the wrong data, then you train an inaccurate model.
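As one illustration, here is a minimal pandas sketch of such a cleaning pass; the file name, column names, and label set are all hypothetical:

```python
import pandas as pd

# Hypothetical raw dataset with "text" and "label" columns.
df = pd.read_csv("raw_labels.csv")

# Remove exact duplicates so the same record isn't labeled twice.
df = df.drop_duplicates()

# Drop rows with missing text or labels rather than guessing values.
df = df.dropna(subset=["text", "label"])

# Fix common formatting problems: stray whitespace and inconsistent casing.
df["text"] = df["text"].str.strip()
df["label"] = df["label"].str.strip().str.lower()

# Keep only rows whose labels belong to the agreed label set.
valid_labels = {"positive", "negative", "neutral"}
df = df[df["label"].isin(valid_labels)]

df.to_csv("clean_labels.csv", index=False)
```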
3. Develop Comprehensive Data Annotation Guidelines
Annotation guidelines serve as guardrails for your data scientists, annotators, and domain experts when labeling data. Accurate data annotation makes it easier for ML algorithms to learn the expected features of the dataset, detect problems in the data, and propose solutions.
When labeling data to be used in training AI/ML models, data labelers come across many labeling tasks that are straightforward. But sometimes they encounter tasks that require some nuance. For example, should “Apple” be labeled as a fruit or a company? A comprehensive annotation guideline contains argument schemes that help labelers make the inferential connections needed for consistent labeling across the dataset.
An annotation guideline provides annotators with an explanation of what is required, gives them a mission to work toward, and helps reduce potential mistakes in the data labeling process.
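To make that concrete, here is a hedged sketch of what a single guideline entry might look like if encoded as a simple Python structure; the term, rules, and examples are invented for illustration:

```python
# Hypothetical guideline entry for the ambiguous "Apple" case.
# Every rule and example here is illustrative, not prescriptive.
GUIDELINE_APPLE = {
    "task": "entity labeling",
    "term": "Apple",
    "decision_rules": [
        "Label as COMPANY when the context involves products, stock, "
        "or corporate activity (e.g., 'Apple announced a new phone').",
        "Label as FRUIT when the context involves food, cooking, "
        "or produce (e.g., 'she sliced an apple for the salad').",
    ],
    "fallback": "If the context is ambiguous, flag the item for review "
                "instead of guessing.",
}
```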
4. Create an Accurate Data Tagging Taxonomy
A tag taxonomy refers to a categorization of things or concepts based on discrete, unique sets. In data labeling, your data tagging taxonomy determines how you group labels into layers of abstraction to avoid duplication or overlapping.
A well-thought-out data tagging system helps to keep your datasets relevant and free from duplicates, errors, and noise to increase the accuracy of your data labels. The tagging taxonomy could be broadly divided into flat and hierarchical.
A. Flat data tagging taxonomy
A flat data tagging taxonomy, also known as an unlayered taxonomy, is essentially a list of items, such as species of animals, types of vehicles, or classes of food. Tags like “dog,” “car,” and “France” are flat tags, and there’s no explicit relationship between them. A flat data tagging taxonomy is best for companies with a lower volume of data or companies with clearly segmented departments that handle different types of data classes.
B. Hierarchical data tagging taxonomy
The hierarchical data tagging taxonomy follows an order of abstraction in determining how tags are arranged. In a hierarchical tagging taxonomy, there are parent, child, and sibling tags: “2021.documents.jpeg” creates a tag “jpeg” within a tag “documents” within a tag “2021.” Tags can be generalized by moving up the hierarchy or refined in more detail by moving down it. The hierarchical taxonomy is best suited for companies with large datasets because it allows more specificity and detail when adding new tags.
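A minimal sketch contrasting the two taxonomies as Python data structures, reusing the tag names from the examples above (the “pdf” sibling is added purely for illustration):

```python
# Flat taxonomy: a single unordered collection of tags with no
# explicit relationships between them.
flat_tags = {"dog", "car", "France"}

# Hierarchical taxonomy: nested tags arranged by level of abstraction,
# mirroring the "2021.documents.jpeg" example.
hierarchical_tags = {
    "2021": {
        "documents": {
            "jpeg": {},
            "pdf": {},  # a hypothetical sibling tag of "jpeg"
        },
    },
}

def resolve(taxonomy: dict, dotted_path: str) -> dict:
    """Walk a dotted tag path, e.g. resolve(hierarchical_tags, "2021.documents")."""
    node = taxonomy
    for part in dotted_path.split("."):
        node = node[part]
    return node
```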
5. Develop a Labeler Consensus System
Labeler consensus helps to minimize bias and errors when multiple individuals are labeling data. To institute a labeler consensus system, you’ll first design the consensus parameters that determine when a label is considered accurate. Next, you’ll send the same datasets to multiple annotators at random intervals. You’ll then track the annotation responses for consistency.
For example, if two labelers each make mistakes that cause them to mislabel data 10% of the time, they’ll individually have a 90% label accuracy. However, if you add a consensus system that reconciles their labels, the dataset can reach a label accuracy of 99%, because the incorrect labels from one annotator are counterbalanced by the correct labels from the other (assuming their errors are independent). The remaining 1% of errors occurs only when both labelers mislabel the same objects.
And when both annotators mislabel the same items, the mistakes could provide opportunities to review the annotation guidelines and consider other perspectives to ensure that the labels better align with the ground truth.
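A minimal sketch of that arithmetic plus a simple majority-vote consensus function; the label values are hypothetical:

```python
from collections import Counter
from typing import List, Optional

# Error arithmetic from the example above: two independent labelers,
# each wrong 10% of the time, are both wrong on the same item only
# 0.10 * 0.10 = 1% of the time.
p_error = 0.10
p_both_wrong = p_error * p_error  # 0.01 -> 99% of items get at least one correct label

def consensus_label(labels: List[str]) -> Optional[str]:
    """Return the majority label, or None when there is no majority."""
    winner, count = Counter(labels).most_common(1)[0]
    return winner if count > len(labels) / 2 else None

print(consensus_label(["cat", "cat", "dog"]))  # -> "cat"
print(consensus_label(["cat", "dog"]))         # -> None: escalate to review
```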
6. Design a Label Verification Process
When data labeling is done by humans, it is practically impossible to get 100% accuracy at the start of the data labeling exercise. Creating a label verification process is an iterative exercise to ensure that labels accurately map the data to real-world objects. When many labelers are working together, label verification increases label quality and ensures consistency across the entire dataset.
To include label verification in your data labeling process, you’ll need to include an audit workflow. The audit workflow facilitates collaboration among reviewers to verify the accuracy of the labels and update inaccurate labels where necessary.
An audit workflow for label verification should include features for the following (a minimal data-model sketch follows the list):
- Approving or rejecting submitted labels
- Creating an issue thread for rejected labels
- Filtering reviewed labels by reviewer
- Tracking review progress
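Here is a minimal sketch of how those features might map onto a data model; every name here is hypothetical:

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List

class ReviewStatus(Enum):
    PENDING = "pending"
    APPROVED = "approved"
    REJECTED = "rejected"

@dataclass
class LabelReview:
    """One label under review in a hypothetical audit workflow."""
    label_id: str
    reviewer: str
    status: ReviewStatus = ReviewStatus.PENDING
    issue_thread: List[str] = field(default_factory=list)  # comments on rejected labels

    def reject(self, reason: str) -> None:
        """Reject a label and open an issue-thread entry for it."""
        self.status = ReviewStatus.REJECTED
        self.issue_thread.append(reason)

def by_reviewer(reviews: List[LabelReview], name: str) -> List[LabelReview]:
    """Filter reviewed labels by reviewer."""
    return [r for r in reviews if r.reviewer == name]

def review_progress(reviews: List[LabelReview]) -> float:
    """Share of labels reviewed so far, for tracking progress."""
    done = sum(r.status is not ReviewStatus.PENDING for r in reviews)
    return done / len(reviews) if reviews else 0.0
```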
7. Speed Up Data Labeling with Advanced Labeling Techniques
Another step you can take to make your data labeling process more efficient is to leverage machine learning algorithms to fast-track the labeling itself. Using ML algorithms for data labeling can take two routes: active learning and transfer learning.
Active learning reduces the number of labeling tasks
In active learning, you select the optimal or most important unlabeled data for a human to label. The human-labeled data is then used in a supervised machine learning system to train the algorithm on how to label the rest of the dataset.
When using active learning, you can use diversity sampling and uncertainty sampling to select the data to be labeled by humans:
- Diversity sampling tries to address gaps in your model’s knowledge by giving humans a chance to label datasets that your model doesn’t know how to label.
- Uncertainty sampling helps increase your model’s confidence by having humans label the datasets that your model can only label with low confidence.
Combine both sampling approaches to get a diverse, representative, and concise sample dataset for humans to label.
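A minimal sketch of both sampling strategies, assuming a scikit-learn-style classifier that exposes `predict_proba` and numeric feature matrices:

```python
import numpy as np

def uncertainty_sample(model, X_unlabeled: np.ndarray, k: int) -> np.ndarray:
    """Pick the k items the model is least confident about."""
    probs = model.predict_proba(X_unlabeled)
    confidence = probs.max(axis=1)      # top-class probability per item
    return np.argsort(confidence)[:k]   # lowest confidence first

def diversity_sample(X_unlabeled: np.ndarray, X_labeled: np.ndarray, k: int) -> np.ndarray:
    """Pick the k items farthest from anything already labeled."""
    # Distance from each unlabeled item to its nearest labeled neighbor
    # (brute force; fine for a sketch, not for very large datasets).
    nearest = np.linalg.norm(
        X_unlabeled[:, None, :] - X_labeled[None, :, :], axis=-1
    ).min(axis=1)
    return np.argsort(nearest)[-k:]     # biggest gaps in coverage first

# Send the union of both selections to human labelers each round.
```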
Transfer learning shortens the data labeling process
Transfer learning adapts a model trained for one use case to another. In data labeling, transfer learning means retraining a model that was trained to label an initial dataset so that it can label similar datasets. This requires far fewer human-labeled examples than training a new model from scratch.
In a transfer learning model, it is important that the labeling type remains consistent with the initial training, whether that is classification, object detection, semantic segmentation, or instance segmentation.
One of the benefits of using transfer learning is that your model can leverage its existing learned representations to identify points of confusion and predict potential errors, which can then be fixed during the label verification process.
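A minimal PyTorch sketch of that retraining step, where a generic ImageNet-pretrained backbone stands in for the initially trained model and the new label count is hypothetical:

```python
import torch.nn as nn
from torchvision import models

# Load a model pretrained on the initial task (here, a generic
# ImageNet backbone stands in for the previously trained labeler).
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Freeze the learned representations so only the new head is trained.
for param in model.parameters():
    param.requires_grad = False

# Replace the classification head to match the new, similar label set.
NUM_NEW_LABELS = 5  # hypothetical size of the new label set
model.fc = nn.Linear(model.fc.in_features, NUM_NEW_LABELS)

# Training now only updates model.fc, so far fewer human-labeled
# examples are needed than when training from scratch.
```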
8. Adopt Quality Assurance Practices
Implement quality assurance standards in your data labeling process to enhance the quality of your AI/ML model’s output. Below are some quality assurance activities that could lead to an efficient data labeling process.
Integrate QA into the data labeling pipeline
Integrate QA to assess the quality of labels and spot issues early enough to prevent them. A QA layer should mix three kinds of tasks (a minimal sketch follows the list):
- Audit QA tasks, which look similar to other labeling tasks, to avoid bias
- Targeted QA tasks, to look for and review labels that contain disagreements among annotators
- Random QA tasks, for regular and random sampling of labels to test for quality
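Here is a minimal sketch of mixing those task types into a single queue; every function and parameter name is hypothetical:

```python
import random

def build_qa_queue(labeling_tasks, audit_tasks, disagreement_tasks, spot_rate=0.05):
    """Blend QA tasks into a hypothetical labeling queue.

    audit_tasks become indistinguishable from normal tasks once shuffled,
    disagreement_tasks target labels that annotators disagreed on, and
    a random slice of ordinary tasks is flagged for spot checks.
    """
    queue = list(labeling_tasks) + list(audit_tasks) + list(disagreement_tasks)
    random.shuffle(queue)  # hide which tasks are audits
    spot_checks = random.sample(
        list(labeling_tasks),
        k=min(len(labeling_tasks), max(1, int(len(labeling_tasks) * spot_rate))),
    )
    return queue, spot_checks
```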
Test label accuracy
Include QA activities that are designed to analyze how closely the labels follow the annotation guidelines and match expectations. When creating the annotation guidelines, set benchmarks for accuracy. In testing label accuracy, the QA exercise is focused on matching high-confidence data labels to a subset of new labeling tasks to estimate the accuracy of individual data labelers and to find opportunities for improvement.
Techniques such as Cronbach’s alpha, which tests for correlation and internal consistency within the labeled dataset, can also help assess the overall quality of the labels.
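A minimal NumPy sketch of Cronbach’s alpha over an items-by-annotators score matrix; the ratings below are hypothetical:

```python
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """Cronbach's alpha for an (items x annotators) score matrix.

    Rows are labeled items, columns are annotators; a higher alpha
    means the annotators score items more consistently.
    """
    k = scores.shape[1]                      # number of annotators
    rater_vars = scores.var(axis=0, ddof=1)  # variance per annotator
    total_var = scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - rater_vars.sum() / total_var)

# Hypothetical 1-5 confidence scores from three annotators on four items.
ratings = np.array([
    [4, 5, 4],
    [2, 2, 3],
    [5, 5, 5],
    [3, 3, 2],
])
print(round(cronbach_alpha(ratings), 2))  # -> 0.94: high internal consistency
```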
An Efficient Data Labeling Process Must Manage People and Tools
An efficient data labeling process must address both the people management part and the technological component of the data labeling exercise.
The best data labeling solution is the one that improves your processes, empowers your team, and has enough flexibility to be adaptable to changing needs at scale.
Label Studio is an end-to-end solution designed to support internal data labeling operations. Learn more about how to equip your data team with data labeling tools to build competitive and differentiated ML/AI models at scale.