
Video Frame Classification with Ultralytics YOLOv8 and Label Studio

This walkthrough demonstrates how to classify video frames in Label Studio using YOLOv8 combined with an LSTM. You'll learn how to use both simple prediction mode and trainable mode to generate and refine annotations in real time.

Transcript

We're excited to announce another new way to use YOLOv8 with your data in Label Studio: video frame classification.

This new feature uses YOLOv8 combined with an LSTM layer for temporal classification, and we offer two modes—simple and trainable. In simple mode, you use an existing YOLO and LSTM model to label your data. In trainable mode, you train a new YOLO and LSTM model to classify video frames with custom labels that don’t exist off the shelf.

Let’s dive in.

For either mode, you’ll need to create a new project and connect a YOLO model as the ML backend. I’ve already done that for this demonstration.

Let’s start with simple mode. When you open the labeling interface, you’ll see a Video tag where the video is displayed. One key parameter is frameRate. You need to set this so the model can align its predictions with the correct frames. Next is the TimelineLabels tag. In simple mode, you can either omit the modelTrainable parameter or set it to false—it defaults to false anyway. Just list your labels and you're ready.
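As a sketch, a simple-mode labeling configuration could look like the following. The label names are placeholders for this demo, and the exact attributes supported by your backend version are listed in the YOLO ML backend README:

```xml
<View>
  <!-- frameRate must match the video's actual frame rate so the model's
       per-frame predictions line up with the correct frames -->
  <Video name="video" value="$video" frameRate="25.0" />

  <!-- modelTrainable defaults to false, so it can be omitted in simple mode -->
  <TimelineLabels name="label" toName="video">
    <Label value="ball" />
    <Label value="person" />
  </TimelineLabels>
</View>
```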

When you open a video that hasn’t been annotated, it may take a second to load. That’s because the model is generating predictions for every frame. For longer videos, loading will take more time. Once loaded, you’ll see timeline labels applied—for example, the “ball” label will appear on all frames where a ball is detected.

Now, let’s talk about trainable mode. The Video tag remains the same, but the TimelineLabels tag includes additional parameters. These let you configure model training directly in Label Studio. You can find a full list of those parameters in the README file on GitHub.
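For reference, a trainable-mode configuration might look like the sketch below. The `modelTrainable="true"` attribute is the switch that enables training; any other training parameters you add alongside it are backend-specific, so consult the README on GitHub for the exact names and defaults your version supports:

```xml
<View>
  <Video name="video" value="$video" frameRate="25.0" />

  <!-- modelTrainable="true" tells the backend to fine-tune the model
       on the annotations you submit; the custom label below is a
       placeholder for whatever class you want to teach the model -->
  <TimelineLabels name="label" toName="video" modelTrainable="true">
    <Label value="my_custom_action" />
  </TimelineLabels>
</View>
```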

The labels used in trainable mode aren’t part of any pre-trained model—you need to teach the model what to recognize.

When you open a new task, the model will try to predict but won’t return any labels. That’s because it hasn’t been trained on your custom classes yet. You’ll need to add a few annotations manually. Once you click submit, the model begins fine-tuning on the backend.

Open a second task and repeat the process; again, no predictions appear. After another round of annotation and submission, open a third task. Still no predictions. Once you submit again, the model has been fine-tuned on a small dataset.

Now, when you open the fourth task, the model successfully predicts “ball” in the correct frame. Click submit again, and the model will continue retraining itself. You can also edit or correct predictions, then click submit to help the model learn.

As you go, the model continuously fine-tunes and updates, which shortens the time it takes to reach a fully functional model. We recommend making 10 to 20 annotations in trainable mode before expecting accurate results.

And that’s it. Happy labeling!
