TimelineLabels Model for Temporal Video Multi-Label Classification in Label Studio
This documentation describes how to use the TimelineLabels model for temporal multi-label classification of video data in Label Studio.
By integrating an LSTM neural network on top of YOLO’s classification capabilities — specifically utilizing features from YOLO’s last layer — the model handles temporal labeling tasks. Users can easily customize neural network parameters directly within the labeling configuration to tailor the model to their specific use cases or use this model as a foundation for further development.
In trainable mode, you’ll begin by annotating a few samples by hand. Each time you click Submit, the model will retrain on the new annotation that you’ve provided. Once the model begins predicting your trained labels on new tasks, it will automatically populate the timeline with the labels that it has predicted. You can validate or change these labels, and updating them will again retrain the model, helping you to iteratively improve.
Tip: If you’re looking for a more advanced approach to temporal classification, check out the VideoMAE model. While we don’t provide an example backend for VideoMAE, you can integrate it as your own ML backend.
Installation and quickstart
Before you begin, you need to install the Label Studio ML backend.
This tutorial uses the YOLO example. See the main README for detailed instructions on setting up the YOLO-models family in Label Studio.
Labeling configuration
```xml
<View>
  <TimelineLabels name="label" toName="video"
    model_trainable="true"
    model_classifier_epochs="1000"
    model_classifier_sequence_size="16"
    model_classifier_hidden_size="32"
    model_classifier_num_layers="1"
    model_classifier_f1_threshold="0.95"
    model_classifier_accuracy_threshold="0.99"
    model_score_threshold="0.5"
  >
    <Label value="Ball touch" background="red"/>
    <Label value="Ball in frame" background="blue"/>
  </TimelineLabels>
  <Video name="video" value="$video" height="700" frameRate="25.0" timelineHeight="200" />
</View>
```
IMPORTANT: You must set the `frameRate` attribute in the `<Video>` tag to the correct value. All your videos should have the same frame rate; otherwise, the submitted annotations will be misaligned with the videos.
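If you are not sure which frame rate a video has, you can check it programmatically, for example with OpenCV (a quick sketch; any tool that reports the video stream's FPS works just as well):

```python
import cv2

cap = cv2.VideoCapture("path/to/video.mp4")
fps = cap.get(cv2.CAP_PROP_FPS)  # e.g. 25.0
cap.release()
print(f"Frame rate: {fps}")
```

Use this value in the `frameRate` attribute of the `<Video>` tag.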
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `model_trainable` | bool | False | Enables the trainable mode, allowing the model to learn from your annotations incrementally. |
| `model_classifier_epochs` | int | 1000 | Number of training epochs for the LSTM neural network. |
| `model_classifier_sequence_size` | int | 16 | Size of the LSTM input sequence in frames. Adjust to capture longer or shorter temporal dependencies; 16 frames is about 0.64 seconds at a 25 fps frame rate. |
| `model_classifier_hidden_size` | int | 32 | Size of the LSTM hidden state. Modify to change the capacity of the LSTM. |
| `model_classifier_num_layers` | int | 1 | Number of LSTM layers. Increase for a deeper LSTM network. |
| `model_classifier_f1_threshold` | float | 0.95 | F1 score threshold for early stopping during training. Set to prevent overfitting. |
| `model_classifier_accuracy_threshold` | float | 1.00 | Accuracy threshold for early stopping during training. Set to prevent overfitting. |
| `model_score_threshold` | float | 0.5 | Minimum confidence threshold for predictions. Labels with confidence below this threshold are discarded. |
| `model_path` | string | None | Path to the custom YOLO model. See the section "Your own custom models" for more details. |
Note: You can customize the neural network parameters directly in the labeling configuration by adjusting the attributes of the `<TimelineLabels>` tag.
Using the model
Simple mode
In the simple mode, the model uses pre-trained YOLO classes to generate predictions without additional training.
- When to Use: Quick setup without the need for custom training. The model starts generating predictions immediately.
- Configuration: Set `model_trainable="false"` in the labeling config (or omit it, as `false` is the default).

Example:

```xml
<View>
  <Video name="video" value="$video" height="700" frameRate="25.0" timelineHeight="200" />
  <TimelineLabels name="label" toName="video" model_trainable="false">
    <Label value="Ball" predicted_values="soccer_ball"/>
    <Label value="tiger_shark" />
  </TimelineLabels>
</View>
```
Trainable mode
The trainable mode enables the model to learn from your annotations incrementally.
It uses the pre-trained YOLO classification model with a custom LSTM neural network on top to capture temporal dependencies in video data. The LSTM model is trained from scratch, so it requires about 10-20 well-annotated videos of roughly 500 frames each (~20 seconds at 25 fps) to start making meaningful predictions.
- When to Use: When you need custom labels or better accuracy than the simple mode provides.
- Configuration: Set `model_trainable="true"` in the labeling config.
- Training Process:
  - Start annotating videos using the `TimelineLabels` tag.
  - After submitting the first annotation, the model begins training.
  - The `partial_fit()` method allows the model to train incrementally with each new annotation.
- Requirements: Approximately 10-20 annotated tasks are needed to achieve reasonable performance.

Note: The `predicted_values` attribute in the `<Label>` tag has no effect in trainable mode.
Example:
```xml
<View>
  <Video name="video" value="$video" height="700" frameRate="25.0" timelineHeight="200" />
  <TimelineLabels name="label" toName="video"
    model_trainable="true"
    model_classifier_epochs="1000"
    model_classifier_sequence_size="16"
    model_classifier_hidden_size="32"
    model_classifier_num_layers="1"
    model_classifier_f1_threshold="0.95"
    model_classifier_accuracy_threshold="0.99"
    model_score_threshold="0.5">
    <Label value="Ball in frame"/>
    <Label value="Ball touch"/>
  </TimelineLabels>
</View>
```
How the trainable model works
The trainable mode uses a custom temporal LSTM classification model. The model is trained incrementally each time an annotation is submitted or updated, and it generates predictions for each frame in the video.
1. Feature extraction with YOLO
- Pre-trained YOLO model: Uses a YOLO classification model (e.g. `yolov8n-cls.pt`) to extract features from video frames.
- Layer modification: The model removes the last classification layer and uses the feature representations from the penultimate layer (see `utils/neural_nets.py::cached_feature_extraction()`); a generic sketch of this pattern is shown below.
- Cached predictions: Caches the intermediate YOLO feature extractions for efficiency and for incremental training on the fly.
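For illustration, the snippet below shows the generic PyTorch pattern for capturing penultimate-layer activations with a forward hook. The real logic lives in `utils/neural_nets.py::cached_feature_extraction()` and may differ in detail; the toy model here is only a stand-in for the YOLO classification network.

```python
import torch
import torch.nn as nn

# Stand-in for a classification backbone whose last layer is the class head.
backbone = nn.Sequential(
    nn.Flatten(),
    nn.Linear(3 * 224 * 224, 256),  # penultimate feature layer
    nn.Linear(256, 1000),           # final classification layer
)

features = {}

def capture_penultimate(module, inputs, output):
    # The input of the final layer equals the penultimate-layer features.
    features["penultimate"] = inputs[0].detach()

backbone[-1].register_forward_hook(capture_penultimate)

with torch.no_grad():
    backbone(torch.randn(1, 3, 224, 224))  # one "frame"

print(features["penultimate"].shape)  # torch.Size([1, 256])
```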
Custom YOLO models for feature extraction
You can load your own YOLO models by following the steps described in the main README. However, the model should have an architecture similar to the `yolov8-cls` models. See `utils/neural_nets.py::cached_feature_extraction()` for more details.
Cache folder
The cache is located in `/app/cache_dir` and stores the intermediate features extracted from the last layer of the YOLO model. It is used to speed up predictions and to enable incremental training on the fly.
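Under the hood this caching is based on joblib's `Memory` (see the developer section below). As a rough sketch of the idea, where the function name and signature are illustrative rather than the backend's actual API:

```python
import numpy as np
from joblib import Memory

# Disk cache living in the backend container (see the cache folder note above).
memory = Memory(location="/app/cache_dir", verbose=0)

@memory.cache
def extract_features(video_path: str, model_name: str) -> np.ndarray:
    # Placeholder for the expensive per-frame YOLO feature extraction.
    # joblib stores the result on disk; a repeated call with the same
    # arguments returns the cached array instead of recomputing it.
    return np.zeros((500, 1280), dtype=np.float32)  # (frames, feature_dim)

features = extract_features("/data/video.mp4", "yolov8n-cls.pt")
```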
2. LSTM Neural Network
- Purpose: Captures temporal dependencies in video data by processing sequences of feature vectors from the last layer of YOLO.
- Architecture (see the sketch below):
  - Input layer: Takes feature vectors from YOLO.
  - Fully connected layer: Reduces dimensionality.
  - Layer normalization and dropout: Improve training stability and prevent overfitting.
  - LSTM layer: Processes sequences to model temporal relationships.
  - Output layer: Generates multi-label predictions for each time step.
- Loss function: Uses binary cross-entropy loss with logits for multi-label classification, plus weight decay for L2 regularization.
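The following is a minimal sketch of that architecture, not the actual `MultiLabelLSTM` implementation from `utils/neural_nets.py`; the layer sizes and names are illustrative:

```python
import torch
import torch.nn as nn

class TimelineLSTMSketch(nn.Module):
    """Multi-label LSTM head on top of per-frame feature vectors."""

    def __init__(self, input_size=1280, hidden_size=32, num_layers=1,
                 num_labels=2, dropout=0.2):
        super().__init__()
        self.reduce = nn.Linear(input_size, hidden_size)  # dimensionality reduction
        self.norm = nn.LayerNorm(hidden_size)
        self.dropout = nn.Dropout(dropout)
        self.lstm = nn.LSTM(hidden_size, hidden_size,
                            num_layers=num_layers, batch_first=True)
        self.head = nn.Linear(hidden_size, num_labels)    # per-frame logits

    def forward(self, x):                  # x: (batch, frames, input_size)
        x = self.dropout(self.norm(self.reduce(x)))
        x, _ = self.lstm(x)
        return self.head(x)                # (batch, frames, num_labels)

model = TimelineLSTMSketch()
criterion = nn.BCEWithLogitsLoss()         # multi-label loss on raw logits
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)

logits = model(torch.randn(4, 16, 1280))   # 4 sequences of 16 frames
loss = criterion(logits, torch.randint(0, 2, (4, 16, 2)).float())
```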
3. Incremental training with `partial_fit()`
- Functionality: Allows the model to update its parameters with each new annotation.
- Process (a simplified sketch follows this list):
  - Extracts features and labels from the annotated video using `utils/converter.py::convert_timelinelabels_to_probs()`.
  - Pre-processes the data into sequences suitable for LSTM input, split into chunks of `model_classifier_sequence_size` frames.
  - Trains the model incrementally, using early stopping based on the F1 score and accuracy thresholds.
- Advantages:
  - Few-shot learning: Capable of learning from a small number of examples.
  - Avoids overfitting: Early stopping based on the F1 score and accuracy prevents the model from overfitting on limited data.
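A simplified sketch of the incremental update, assuming a model, loss, and optimizer like those in the architecture sketch above (the real implementation is `MultiLabelLSTM.partial_fit()` and differs in details such as DataLoader batching and overlap handling):

```python
import torch

def partial_fit_sketch(model, criterion, optimizer, features, labels,
                       sequence_size=16, epochs=1000, f1_threshold=0.95):
    """One incremental update: per-frame features -> per-frame binary labels."""
    # Split the (frames, feature_dim) and (frames, num_labels) tensors into
    # fixed-size chunks that fit the LSTM input.
    chunks = torch.split(features, sequence_size)
    targets = torch.split(labels, sequence_size)

    for _ in range(epochs):
        f1_scores = []
        for x, y in zip(chunks, targets):
            optimizer.zero_grad()
            logits = model(x.unsqueeze(0))            # (1, chunk_len, num_labels)
            loss = criterion(logits, y.unsqueeze(0))
            loss.backward()
            optimizer.step()

            pred = (torch.sigmoid(logits.squeeze(0)) > 0.5).float()
            tp = (pred * y).sum()
            f1 = 2 * tp / (pred.sum() + y.sum() + 1e-8)
            f1_scores.append(f1.item())

        # Early stopping on the *training* F1 score -- there is no separate
        # validation set when updating on a single annotation.
        if sum(f1_scores) / len(f1_scores) >= f1_threshold:
            break
```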
Limitations and considerations
- Not a final production model: While promising, the model is primarily a demo and may require further validation for production use.
- Performance depends on data: Requires sufficient and diverse annotations (at least 10-20 annotated tasks) to start performing well.
- Parameter sensitivity: Adjusting the neural network parameters may significantly impact performance.
- Early stopping on training data: The model uses early stopping based on the F1 score and accuracy computed on the training data, because there is no validation data when updating on a single annotation. This may lead to overfitting.
- YOLO model limitations: Feature extraction relies on a YOLO model pre-trained for object classification, which may not be optimal for all use cases, such as event detection. The YOLO model itself is not fine-tuned; only the LSTM on top of its last layer is trained.
- Label balance: The model may struggle with imbalanced labels. Ensure that the labels are well-distributed in the training data. Consider modifying the loss function (`BCEWithLogitsLoss`) and using positive class weights to address this issue (see the example below).
- Training on all data: Training on the full dataset is not yet implemented, so the model trains only on the last submitted annotation. See `timeline_labels.py::fit()` for more details.
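For example, to compensate for a rare label you could pass a per-label `pos_weight` to PyTorch's `BCEWithLogitsLoss`; this is a suggested customization, not something the backend exposes as a configuration option:

```python
import torch
import torch.nn as nn

# Suppose "Ball touch" is active in ~5% of frames and "Ball in frame" in ~50%.
# Weight each label's positive term roughly by its negative/positive ratio.
pos_weight = torch.tensor([19.0, 1.0])
criterion = nn.BCEWithLogitsLoss(pos_weight=pos_weight)

logits = torch.randn(8, 16, 2)                     # (batch, frames, labels)
targets = torch.randint(0, 2, (8, 16, 2)).float()
loss = criterion(logits, targets)
```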
Example use case: detecting a ball in football videos
Setup
Labeling configuration:
```xml
<View>
  <TimelineLabels name="videoLabels" toName="video">
    <Label value="Ball touch" background="red"/>
    <Label value="Ball in frame" background="blue"/>
  </TimelineLabels>
  <Video name="video" value="$video" height="700" timelineHeight="200" frameRate="25.0" />
</View>
```
Connect the model backend:

1. Create a new project in Label Studio.
2. Navigate to the `yolo` folder of this repository in your terminal.
3. Update your `docker-compose.yml` file.
4. Run `docker compose up` to start the backend.
5. Connect this backend to your Label Studio project under Settings > Model. Make sure that Interactive Preannotations is OFF (this is the default).
Annotation and training
Annotate videos:
- Upload football videos to the project.
- Use the `<TimelineLabels>` control tag to label time intervals where the ball is visible in the frame.
Model training:
- After submitting annotations, the model begins training incrementally.
- Continue annotating until the model starts making accurate predictions.
Review predictions:
- The model suggests labels for unannotated videos.
- Validate and correct predictions to further improve the model.
Adjusting training parameters and resetting the model
If the model is not performing well, consider modifying the LSTM classifier training parameters in the labeling config. These parameters start with the `model_classifier_` prefix.

The model is reset whenever any of the following changes:

- `model_classifier_sequence_size`
- `model_classifier_hidden_size`
- `model_classifier_num_layers`
- Labels are added to or removed from the labeling config

After a reset, you may need to update existing annotations (click Update) for the model to retrain and show improvements.
If you want to modify more parameters, you can do so directly in the code in `utils/neural_nets.py::MultiLabelLSTM`.

If you need to reset the model completely, remove the model file from `/app/models`. See `timeline_labels.py::get_classifier_path()` for the exact path; the file name usually starts with the `timelinelabels-` prefix. A quick cleanup sketch is shown below.
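For example, assuming the default model directory and file prefix described above, a complete reset could look like this:

```python
import glob
import os

# Remove all stored TimelineLabels classifiers from the backend container.
for path in glob.glob("/app/models/timelinelabels-*"):
    os.remove(path)
    print(f"Removed {path}")
```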
Debug
To debug the model, run it with the `LOG_LEVEL=DEBUG` environment variable (see `docker-compose.yml`), then check the logs in the Docker console.
Convert TimelineLabels regions to label arrays and back
There are two main functions that convert TimelineLabels regions to label arrays and back:

- `utils/converter.py::convert_timelinelabels_to_probs()` - converts TimelineLabels regions to label arrays
- `utils/converter.py::convert_probs_to_timelinelabels()` - converts label arrays back to TimelineLabels regions
Each row in the label array corresponds to a frame in the video. The label array is a binary matrix where each column corresponds to a label: if the label is present in a frame, the corresponding cell is set to `1`, otherwise `0`.

For example:

```
[
  [0, 0, 1],
  [0, 1, 0],
  [1, 0, 0]
]
```

This matrix means that frame 1 contains `label3`, frame 2 contains `label2`, and frame 3 contains `label1`.
See `tests/test_timeline_labels.py::test_convert_probs_to_timelinelabels()` for more examples.
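As a rough illustration of the array-to-regions direction (this is not the converter's real code; its actual signature is listed in the developer section below), consecutive frames where a label stays active collapse into a single timeline region:

```python
import numpy as np

def probs_to_regions_sketch(label_array: np.ndarray, labels: list):
    """Collapse per-frame binary indicators into (label, start_frame, end_frame) spans."""
    regions = []
    for col, label in enumerate(labels):
        start = None
        for frame, value in enumerate(label_array[:, col], start=1):  # 1-based frames
            if value and start is None:
                start = frame                               # label switches on
            elif not value and start is not None:
                regions.append((label, start, frame - 1))   # label switches off
                start = None
        if start is not None:                               # active until the last frame
            regions.append((label, start, label_array.shape[0]))
    return regions

array = np.array([[0, 0, 1],
                  [0, 1, 0],
                  [1, 0, 0]])
print(probs_to_regions_sketch(array, ["label1", "label2", "label3"]))
# [('label1', 3, 3), ('label2', 2, 2), ('label3', 1, 1)]
```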
For developers
This guide provides an in-depth look at the architecture and code flow of the TimelineLabels ML backend for Label Studio. It includes class inheritance diagrams and method call flowcharts to help developers understand how the components interact. Additionally, it offers explanations of key methods and classes, highlighting starting points and their roles in the overall workflow.
Class inheritance diagram
The following diagram illustrates the class inheritance hierarchy in the TimelineLabels ML backend.
```mermaid
classDiagram
    class ControlModel
    class TimelineLabelsModel
    class TorchNNModule
    class BaseNN
    class MultiLabelLSTM

    ControlModel <|-- TimelineLabelsModel
    TorchNNModule <|-- BaseNN
    BaseNN <|-- MultiLabelLSTM
```
- `ControlModel`: Base class for control tags in Label Studio.
- `TimelineLabelsModel`: Inherits from `ControlModel` and implements specific functionality for the `<TimelineLabels>` tag.
- `torch.nn.Module`: Base class for all neural network modules in PyTorch.
- `BaseNN`: Custom base class for neural networks; inherits from `torch.nn.Module`.
- `MultiLabelLSTM`: Inherits from `BaseNN`; implements an LSTM neural network for multi-label classification.
Method call flowcharts
Prediction workflow
The following flowchart depicts the method calls during the prediction process.
```mermaid
flowchart TD
    A[TimelineLabelsModel.predict_regions]
    A --> B{Is self.trainable?}
    B -->|Yes| C[create_timelines_trainable]
    B -->|No| D[create_timelines_simple]
    C --> E[cached_feature_extraction]
    C --> F[Load classifier using BaseNN.load_cached_model]
    C --> G[classifier.predict]
    C --> H[convert_probs_to_timelinelabels]
    C --> I[Return predicted regions]
    D --> J[cached_yolo_predict]
    D --> K[Process frame results]
    D --> L[convert_probs_to_timelinelabels]
    D --> M[Return predicted regions]
```
Training workflow
The following flowchart shows the method calls during the training process.
```mermaid
flowchart TD
    N[TimelineLabelsModel.fit]
    N --> O{Event is 'ANNOTATION_CREATED' or 'ANNOTATION_UPDATED'?}
    O -->|Yes| P[Extract task and regions]
    P --> Q[Get model parameters]
    Q --> R[Get video path]
    R --> S[cached_feature_extraction]
    S --> T[Prepare features and labels]
    T --> U[Load or create classifier]
    U --> V[classifier.partial_fit]
    V --> W[classifier.save]
    W --> X[Return True]
    O -->|No| Y[Return False]
```
Code structure and explanations
TimelineLabelsModel class
File: `timeline_labels.py`

The `TimelineLabelsModel` class extends the `ControlModel` base class and implements functionality specific to the `<TimelineLabels>` control tag.
Key methods:
- `is_control_matched(cls, control)`: Class method that checks whether the provided control tag matches the `<TimelineLabels>` tag.
- `create(cls, *args, **kwargs)`: Class method that creates an instance of the model, initializing attributes like `trainable` and `label_map`.
- `predict_regions(self, video_path)`: Main method called during prediction. Determines whether to use the simple or trainable prediction method based on the `trainable` attribute.
- `create_timelines_simple(self, video_path)`: Uses pre-trained YOLO classes for prediction without additional training.
  - Calls `cached_yolo_predict` to get predictions from the YOLO model.
  - Processes frame results to extract probabilities.
  - Converts probabilities to timeline labels.
- `create_timelines_trainable(self, video_path)`: Uses the custom-trained LSTM neural network for prediction.
  - Calls `cached_feature_extraction` to extract features from the video.
  - Loads the trained classifier model.
  - Uses the classifier to predict probabilities.
  - Converts probabilities to timeline labels.
- `fit(self, event, data, **kwargs)`: Called when new annotations are created or updated. Handles the incremental training of the LSTM model.
  - Extracts features and labels from the annotated video.
  - Pre-processes data for LSTM input.
  - Loads or initializes the classifier.
  - Calls `partial_fit` on the classifier to update model parameters.
  - Saves the updated classifier model.
- `get_classifier_path(self, project_id)`: Generates the file path for storing the classifier model based on the project ID and model name.
Neural network classes
BaseNN class
File: `neural_nets.py`

The `BaseNN` class serves as a base class for neural network models, providing common methods for saving, loading, and managing label mappings.
Key methods:
- `set_label_map(self, label_map)`: Stores the label mapping dictionary.
- `get_label_map(self)`: Retrieves the label mapping dictionary.
- `save(self, path)`: Saves the model to the specified path using `torch.save`.
- `load(cls, path)`: Class method that loads a saved model from the specified path.
- `load_cached_model(cls, model_path)`: Loads a cached model if it exists, otherwise returns `None`.
MultiLabelLSTM class
File: `neural_nets.py`

The `MultiLabelLSTM` class inherits from `BaseNN` and implements an LSTM neural network for multi-label classification.
Key methods:
- `__init__(...)`: Initializes the neural network layers and parameters, including input size, hidden layers, dropout, and optimizer settings.
- `forward(self, x)`: Defines the forward pass of the network.
  - Reduces input dimensionality using a fully connected layer.
  - Applies layer normalization and dropout.
  - Passes data through the LSTM layer.
  - Applies a fully connected layer to generate output predictions.
- `preprocess_sequence(self, sequence, labels=None, overlap=2)`: Prepares input sequences and labels for training by splitting and padding them.
- `partial_fit(self, sequence, labels, ...)`: Trains the model incrementally on new data.
  - Pre-processes the input sequence.
  - Creates a DataLoader for batching.
  - Runs the training loop with early stopping based on accuracy and F1 score thresholds.
- `predict(self, sequence)`: Generates predictions for a given input sequence.
  - Splits the sequence into chunks.
  - Passes data through the model in evaluation mode.
  - Concatenates outputs to match the original sequence length.
Starting points and execution flow
Prediction process
1. Prediction request: When a prediction is requested for a video, `TimelineLabelsModel.predict_regions(video_path)` is called.
2. Determine mode:
   - Trainable mode (`self.trainable == True`): Calls `create_timelines_trainable(video_path)`.
     - Extracts features using `cached_feature_extraction`.
     - Loads the trained classifier model using `BaseNN.load_cached_model`.
     - Predicts probabilities with `classifier.predict(yolo_probs)`.
     - Converts probabilities to timeline labels using `convert_probs_to_timelinelabels`.
   - Simple mode (`self.trainable == False`): Calls `create_timelines_simple(video_path)`.
     - Gets YOLO predictions using `cached_yolo_predict`.
     - Processes frame results to extract probabilities.
     - Converts probabilities to timeline labels using `convert_probs_to_timelinelabels`.
3. Return predictions: The method returns a list of predicted regions with labels and timestamps.
Training process
1. Event trigger: The `fit(event, data, **kwargs)` method is called when an annotation event occurs (e.g. `ANNOTATION_CREATED` or `ANNOTATION_UPDATED`).
2. Event handling:
   - Checks if the event is relevant for training.
   - Extracts the task and annotation data.
3. Parameter extraction: Retrieves model parameters from the control tag attributes, such as epochs, sequence size, hidden size, and thresholds.
4. Data preparation:
   - Gets the video path associated with the task.
   - Extracts features from the video using `cached_feature_extraction`.
   - Converts annotations to labels suitable for training.
5. Model loading or initialization:
   - Attempts to load an existing classifier model using `BaseNN.load_cached_model`.
   - If no model exists or parameters have changed, initializes a new `MultiLabelLSTM` model.
6. Training:
   - Calls `classifier.partial_fit(features, labels, ...)` to train the model incrementally.
   - Training includes early stopping based on accuracy and F1 score thresholds to prevent overfitting.
7. Model saving: Saves the trained model to disk using `classifier.save(path)`.
Utilities and helper functions
Cached prediction and feature extraction:

- `cached_yolo_predict(yolo_model, video_path, cache_params)`: Uses joblib's `Memory` to cache YOLO predictions, avoiding redundant computations.
- `cached_feature_extraction(yolo_model, video_path, cache_params)`: Extracts features from the YOLO model by hooking into the penultimate layer and caches the results.

Data conversion functions:

- `convert_probs_to_timelinelabels(probs, label_map, threshold)`: Converts probability outputs to timeline labels suitable for Label Studio.
- `convert_timelinelabels_to_probs(regions, label_map, max_frame)`: Converts annotated regions back into a sequence of probabilities for training.
Conclusion
The TimelineLabels ML backend integrates seamlessly with Label Studio to provide temporal multi-label classification capabilities for video data. The architecture leverages pre-trained YOLO models for feature extraction and enhances them with an LSTM neural network for capturing temporal dependencies.
Understanding the class hierarchies and method flows is crucial for developers looking to extend or modify the backend. By following the starting points and execution flows outlined in this guide, developers can navigate the codebase more effectively and implement custom features or optimizations.
Note: For further development or contributions, please refer to the `README_DEVELOP.md` file, which provides additional guidelines and architectural details.