Using Text Prompts for Image Annotation with Grounding DINO and Label Studio

Tutorials December 21, 2023

Introduction

In a previous post, we explored using the Segment Anything Model with Label Studio to exploit zero-shot object identification. This method demonstrated the power of using a foundation image model to automatically create high-quality image masks for image segmentation tasks. However, one drawback of this approach was that annotators still had to interpret the labels and manually select where in each image the objects were located.

In this post, we will explore how we can use the Grounding DINO and Grounding SAM models to automatically perform object identification and image segmentation tasks using text labels, allowing your annotation team to perform image annotation quickly using the custom label text.

What are Grounding DINO and Grounding SAM?

Grounding DINO is a zero-shot object detection model that combines the DINO transformer architecture with grounded pre-training. The result is a model that detects objects and draws bounding boxes around them based on text prompts.

Grounding SAM builds on this framework by passing the bounding boxes that Grounding DINO detects to the Segment Anything Model as prompts, making it possible to refine your object identification with an image segmentation mask rather than a bounding box.

What Does the Grounding DINO ML Backend Provide?

The Grounding DINO ML backend provides an all-in-one package for both the Grounding DINO and Grounding SAM models, with the option to use Mobile SAM, a more efficient but less accurate variant of SAM. The backend ships with a Dockerfile that builds a portable image for both GPU and CPU-only systems, and a docker-compose.yml file to configure and launch the ML backend.

Installing the Grounding DINO ML Backend

Prerequisites

We recommend using Docker to host the Grounding DINO ML Backend and Label Studio for this example. Docker allows you to install the software without any other system requirements and helps to make the installation and maintenance process much more manageable. For desktop or laptop use, the fastest way to get Docker is by installing the official Docker Desktop client for Mac and Windows operating systems or installing Docker using the official package manager for Linux systems.

Grounding DINO is a large, complex foundation model that works best on a GPU. Because many people will test this software on commodity hardware like laptops or desktop computers, the model ships with Mobile SAM as the default object detection backend. The backend will automatically detect if you have a GPU available, using the most appropriate hardware for your system.

Consult the official Docker documentation and Docker Compose documentation for enabling GPU passthrough for your guest containers.

At a minimum, your system should have 16 GB of RAM available, with at least 8 GB allocated to the Docker runtime.

You must also install Git to download the Label Studio ML Backend repository.
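Before moving on, it can help to confirm that both tools are actually on your PATH. The following is a minimal sketch using only the Python standard library; the function name `check_prerequisites` is our own and is not part of Label Studio or Docker.

```python
import shutil


def check_prerequisites(tools=("docker", "git")):
    """Return a mapping of tool name -> True if the executable is on PATH."""
    return {tool: shutil.which(tool) is not None for tool in tools}


if __name__ == "__main__":
    # Print a short report so missing tools are easy to spot.
    for tool, found in check_prerequisites().items():
        print(f"{tool}: {'found' if found else 'MISSING'}")
```

This only checks that the executables exist. It does not verify that the Docker daemon is running or that enough memory is allocated to it.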

Clone the Repository

After you have installed Docker and Git, the next step is to clone the Label Studio ML Backend git repository into your system.

git clone https://github.com/HumanSignal/label-studio-ml-backend.git

Then, change into the Grounding DINO working directory.

cd label-studio-ml-backend/label_studio_ml/examples/grounding_dino

Build the Docker Image

You can now build the Grounding DINO ML Backend for your system.

docker compose build

Depending on your internet connection speed, building the image can take up to 8 minutes. This build process embeds the model weights into the Docker image, which results in an image of roughly 5 to 6 GB. In production usage, it’s best practice to store the model weights separately from the model runtime to allow for updates and checkpointing. For this example, the model weights are built directly into the container.

Verify that the model is built and ready to use.

docker image list

Docker should output a list of available images with an entry similar to this:

REPOSITORY                  TAG       IMAGE ID       CREATED       SIZE
grounding_dino-ml-backend   latest    d8e9c1c8a537   2 minutes ago 5.48GB
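If you want to script this verification step, the listing can be parsed with a few lines of Python. This is a sketch that assumes the default tabular output shown above; the helper names `parse_image_list` and `has_image` are hypothetical.

```python
def parse_image_list(listing):
    """Parse the default tabular output of `docker image list` into dicts."""
    rows = []
    for line in listing.strip().splitlines()[1:]:  # skip the header row
        parts = line.split()
        if len(parts) < 5:
            continue  # not a complete row
        rows.append({
            "repository": parts[0],
            "tag": parts[1],
            "image_id": parts[2],
            # CREATED spans several words ("2 minutes ago"); SIZE is last.
            "created": " ".join(parts[3:-1]),
            "size": parts[-1],
        })
    return rows


def has_image(listing, repository, tag="latest"):
    """Return True if the listing contains repository:tag."""
    return any(
        row["repository"] == repository and row["tag"] == tag
        for row in parse_image_list(listing)
    )
```

To run it against your own system, you could capture the listing with `subprocess.run(["docker", "image", "list"], capture_output=True, text=True).stdout` and pass the result to `has_image(listing, "grounding_dino-ml-backend")`.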

Using the Grounding DINO ML Backend

With the image built, it’s time to build an image segmentation project using Label Studio.

Install Label Studio

First, you need to install Label Studio. For this example, the Grounding DINO ML Backend relies upon enabling local storage serving. To start an instance of Label Studio with this turned on, enter the following commands:

docker image pull heartexlabs/label-studio:latest

docker run \
    -it --rm -p 8080:8080 \
    -v $(pwd)/mydata:/label-studio/data \
    --env LABEL_STUDIO_LOCAL_FILES_SERVING_ENABLED=true \
    --env LABEL_STUDIO_LOCAL_FILES_DOCUMENT_ROOT=/label-studio/data/images \
    heartexlabs/label-studio:latest

This command tells Docker to launch Label Studio, make it available for viewing at http://localhost:8080, store the database and task files on your local hard drive, and enable local file serving. Once Label Studio has started, you can navigate with your browser to http://localhost:8080 where you will find the Label Studio login screen.

Select the “Sign Up” tab, create a new Label Studio user account, and log in to your newly created account.

Set up the Project

Upon your first login, you’ll be greeted by Heidi the Opossum, asking you to create a new project. Select “Create Project”.

In the “Create Project” dialog, you will see three tabs. In the first tab, enter the name of your project (for example, “Grounding DINO”).

Next, go to the “Data Import” tab to upload your image data. Select “Upload Files” and choose any images you wish to use. For this example, we will use images of dogs and possums, but you can use any data you want.

Finally, select Labeling Setup. You’ll be presented with pre-configured labeling templates, but we will import a custom template for this project. Select the “Custom Template” option.

Within the labeling configuration interface, select “Code,” then paste the following configuration into the text field (ensuring you replace all the preexisting code).

<View>
  <Image name="image" value="$image"/>
  <Style>
    .lsf-main-content.lsf-requesting 
    .prompt::before { content: ' loading...'; color: #808080; }
  </Style>
  <View className="prompt">
    <TextArea name="prompt" toName="image" 
              editable="true" rows="2"
              maxSubmissions="1" showSubmitButton="true"/>
  </View>
  <Header>Rectangle Labels</Header>
  <RectangleLabels name="label" toName="image">
    <Label value="dog" background="orange"/>
    <Label value="possum" background="purple"/>
  </RectangleLabels>
  <Header>Brush Labels</Header>
  <BrushLabels name="label2" toName="image">
    <Label value="dog" background="orange"/>
    <Label value="possum" background="purple"/>
  </BrushLabels>
</View>

Note that for this example, we are interested in labeling images with “Dogs” and “Possums.” Feel free to replace these tag values with whatever application you’re interested in and with as many tags as you wish.
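If you have many labels, writing the two label blocks by hand gets repetitive. The sketch below generates the same configuration structure from a dictionary of label names and colors, using Python’s standard `xml.etree.ElementTree` module. The function name `build_label_config` is our own, and the tag and attribute names are copied from the configuration above.

```python
import xml.etree.ElementTree as ET


def build_label_config(labels):
    """Build the labeling config XML from a {label: background color} mapping."""
    view = ET.Element("View")
    ET.SubElement(view, "Image", name="image", value="$image")
    style = ET.SubElement(view, "Style")
    style.text = (
        ".lsf-main-content.lsf-requesting "
        ".prompt::before { content: ' loading...'; color: #808080; }"
    )
    prompt_view = ET.SubElement(view, "View", className="prompt")
    ET.SubElement(
        prompt_view, "TextArea", name="prompt", toName="image",
        editable="true", rows="2", maxSubmissions="1", showSubmitButton="true",
    )
    # Emit the same label set once for rectangles and once for brush masks.
    groups = (
        ("Rectangle Labels", "RectangleLabels", "label"),
        ("Brush Labels", "BrushLabels", "label2"),
    )
    for header, tag, name in groups:
        ET.SubElement(view, "Header").text = header
        group = ET.SubElement(view, tag, name=name, toName="image")
        for value, color in labels.items():
            ET.SubElement(group, "Label", value=value, background=color)
    ET.indent(view)  # pretty-print; requires Python 3.9+
    return ET.tostring(view, encoding="unicode")


if __name__ == "__main__":
    print(build_label_config({"dog": "orange", "possum": "purple"}))
```

Paste the printed XML into the “Code” view in place of the hand-written configuration.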

As you build your custom interface, you will see a live rendering of the view your annotation team will see to the right. When you are satisfied with the interface, select “Save.”

Before starting the ML Backend, you must gather additional information about your Label Studio installation. You’ll first need the API token to access Label Studio. The token is required to download images from the Label Studio instance to the Grounding DINO ML Backend. You can find the token by selecting the user setting icon in the upper right corner of the interface, then selecting “Account & Settings.”

Copy the Access Token from this screen and make a note of it. You will need it to configure the Grounding DINO ML Backend.
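To check that you copied the token correctly, you can use it in a request to the Label Studio API. The sketch below builds such a request with the standard library. It assumes the `Authorization: Token <token>` header scheme and the `/api/projects` endpoint; the function name `build_projects_request` is hypothetical.

```python
import urllib.request


def build_projects_request(host, token):
    """Build an authenticated GET request for the project list."""
    url = host.rstrip("/") + "/api/projects"
    # Assumption: Label Studio accepts the access token in this header format.
    return urllib.request.Request(url, headers={"Authorization": f"Token {token}"})


if __name__ == "__main__":
    # Replace the placeholder with the token you copied from Label Studio.
    request = build_projects_request("http://localhost:8080", "<YOUR_ACCESS_TOKEN_HERE>")
    with urllib.request.urlopen(request, timeout=10) as response:
        print("HTTP status:", response.status)
```

A 200 status suggests the token is valid. An authentication error usually means the token was copied incorrectly.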

You will also need the local IP address of your host machine. You can find this in several ways. On Mac and Windows systems, the easiest way is to look up your IP address in your system settings. You can also use network commands from the command line, like `ip a` or `ifconfig`, to discover the local IP address. It’s important to know the actual address because Label Studio and the Label Studio ML backend treat `localhost` as local to the container, not to the container host. As a result, using the `localhost` name will lead to unexpected and incorrect behavior. On some platforms, the hostname `host.docker.internal` is available as a convenient way to reach services running on the container host.
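If you would rather look up the address programmatically, one common trick is to open a UDP socket toward an external address and read back the local address the operating system selected for it. The sketch below does this with the standard `socket` module. The probe address `8.8.8.8` is an arbitrary choice, no packet is actually sent, and the function name `local_ip` is our own.

```python
import socket


def local_ip(probe_host="8.8.8.8", probe_port=80):
    """Return the local IPv4 address used for outbound traffic."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        # connect() on a UDP socket only selects a route; nothing is transmitted.
        sock.connect((probe_host, probe_port))
        return sock.getsockname()[0]
    except OSError:
        # No usable route (for example, no network): fall back to loopback.
        return "127.0.0.1"
    finally:
        sock.close()


if __name__ == "__main__":
    print(local_ip())
```

If the function falls back to `127.0.0.1`, that value is not suitable for the backend configuration, for the reasons described above.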

Start the Grounding DINO ML Backend

With the project set up and the host and access information about Label Studio available, we can now start the Grounding DINO Backend. Open the `docker-compose.yml` file using your favorite text editor and edit the following lines to include your Label Studio host and API access token.

# Add these variables if you want to access the images stored in Label Studio
- LABEL_STUDIO_HOST=http://<YOUR_HOST_IP_ADDRESS_HERE>:8080
- LABEL_STUDIO_ACCESS_TOKEN=<YOUR_ACCESS_TOKEN_HERE>
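Because a loopback address in `LABEL_STUDIO_HOST` is an easy mistake to make, a small check before saving the file can save some debugging. The following sketch flags the most common problems; the function name `host_problems` and the list of loopback names are our own assumptions.

```python
from urllib.parse import urlparse

# Names that resolve to the container itself rather than the Docker host.
LOOPBACK_NAMES = {"localhost", "127.0.0.1", "::1"}


def host_problems(label_studio_host):
    """Return a list of human-readable problems with a LABEL_STUDIO_HOST value."""
    problems = []
    parsed = urlparse(label_studio_host)
    if parsed.scheme not in ("http", "https"):
        problems.append("URL should start with http:// or https://")
    if parsed.hostname is None:
        problems.append("URL has no hostname")
    elif parsed.hostname in LOOPBACK_NAMES:
        problems.append(
            "loopback address points at the container, not the Docker host"
        )
    return problems


if __name__ == "__main__":
    for candidate in ("http://localhost:8080", "http://192.168.1.20:8080"):
        print(candidate, "->", host_problems(candidate) or "looks OK")
```

The address `192.168.1.20` is only an example; use the IP address you found for your own machine.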

Save the file, and start the backend with the command:

docker compose up
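Once the container is running, you may want to confirm that the backend is reachable before connecting it to Label Studio. The sketch below polls a health endpoint with the standard library. It assumes the backend listens on port 9090 and exposes a `/health` route that returns JSON with a `"status": "UP"` field; check the repository documentation if your version behaves differently. The function name `backend_is_up` is hypothetical.

```python
import http.client
import json
import urllib.request


def backend_is_up(base_url, timeout=5.0):
    """Return True if the ML backend answers its health check with status UP."""
    url = base_url.rstrip("/") + "/health"  # assumed health route
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            payload = json.loads(response.read().decode("utf-8"))
    except (OSError, ValueError, http.client.HTTPException):
        # Connection refused, timeout, HTTP error, or a non-JSON response.
        return False
    return isinstance(payload, dict) and payload.get("status") == "UP"


if __name__ == "__main__":
    print("backend up:", backend_is_up("http://localhost:9090"))
```

Here `localhost` is fine because the script runs on the host itself, not inside a container. The model can take a while to load, so a `False` result right after startup is not necessarily an error.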

Connect the ML Backend

Go back to your browser and the Label Studio instance, and select the menu, “Projects,” and “Grounding DINO” (or whatever name you set for your project). Select “Settings” then “Machine Learning.”

Select “Add Model,” and fill in the details for the “Title” and the “URL”. The URL will be your local IP address and port 9090. On platforms that support it, you can use `http://host.docker.internal:9090` instead. Toggle the switch to enable “Use for interactive preannotations”, then select “Validate and Save.”

Select “Save,” then select the project name in the navigator (right after “Projects,” in this example, “Projects / Grounding DINO”) to return to the task interface.

Bounding Box Labeling with Text Prompts

You are now ready to start annotating your data with interactive text prompts! Select “Label All Tasks” to begin the annotation process.

Make sure that the “Auto-Annotation” toggle at the bottom of the labeling interface is turned on, and that “Auto accept annotation suggestions” at the top of the interface is toggled off.

You can enter a natural-language query in the text field to prompt the Grounding DINO model to make a prediction. In this case, we just enter “possum”.

You’ll know the model is working on making a prediction when you see a spinning progress indicator at the top of your task interface.

After the prediction is made, a suggested bounding box will appear. Select the tag you want to apply from the “Rectangle Label” section, in this case, “possum,” then click the ✔️ button in the interface to apply the prediction. Select “Submit” to apply the label and move to the next task.

One of the tasks in our dataset includes three dogs in the image. Navigate to that task, and enter “dog” into the prompt entry box. Note that when Grounding DINO returns its predictions, it draws three bounding boxes. You can select the “dog” rectangle label and approve each prediction individually.

Image Segmentation with Text Prompts

The Grounding DINO ML backend also includes an implementation of Grounding SAM if you want to label with image segmentation rather than bounding boxes. To use this mode, you must first reconfigure the ML backend. Press <CTRL>-C to terminate the container in the terminal where your ML model is running.

Open the docker-compose.yml file and set USE_SAM=True. Optionally, if your system is resource-constrained, also set USE_MOBILE_SAM=True. Save the file, close it, and restart the model by running:
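To make the effect of these two settings easier to reason about, here is an illustrative sketch of how such boolean environment flags could be interpreted. This is not the backend’s actual code. The accepted truthy spellings, and the assumption that USE_MOBILE_SAM only matters when USE_SAM is enabled, are ours.

```python
import os

# Assumption: spellings treated as "enabled"; the real backend may differ.
TRUTHY = {"1", "true", "yes", "on"}


def flag(name, environ=None):
    """Read a boolean flag from the environment (or a dict, for testing)."""
    environ = os.environ if environ is None else environ
    return environ.get(name, "").strip().lower() in TRUTHY


def annotation_mode(environ=None):
    """Describe the annotation mode implied by USE_SAM and USE_MOBILE_SAM."""
    if not flag("USE_SAM", environ):
        return "bounding boxes"
    if flag("USE_MOBILE_SAM", environ):
        return "segmentation (Mobile SAM)"
    return "segmentation (SAM)"


if __name__ == "__main__":
    print(annotation_mode())
```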

docker compose up

After the container has launched, navigate back to Label Studio. Reload the interface and move on to the next labeling task.

Type your prompt into the text box, select Add, and wait for the segmentation task to complete. If you’re satisfied with the result, select the corresponding brush annotation, then select the ✔️ button to apply it.

What’s Next?

Label Studio plays a critical role in the machine learning pipeline, giving an interface for humans to guide the essential step of labeling and annotating data alongside machine learning systems to speed the process. You can learn more about integrating Label Studio into your machine-learning pipeline in the Label Studio docs. Check out the GitHub repository for complete documentation on the Grounding DINO Backend.

Once a labeling project is finished, you can export the labels using the “Export” interface from the project management home. Masks and annotations are exported as a JSON catalog for your ML and data science pipeline.
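As a starting point for working with the export, the sketch below counts labels per region type and converts a bounding box to pixel coordinates. It assumes the standard Label Studio JSON export layout: a list of tasks, each with `annotations` containing `result` regions, where rectangle coordinates are stored as percentages of the image size. The function names and the file name `export.json` are hypothetical.

```python
import json
from collections import Counter


def count_labels(tasks):
    """Count (region type, label) pairs across all annotations."""
    counts = Counter()
    for task in tasks:
        for annotation in task.get("annotations", []):
            for region in annotation.get("result", []):
                kind = region.get("type")  # e.g. "rectanglelabels"
                # The label list is stored under a key named after the type.
                for label in region.get("value", {}).get(kind, []):
                    counts[(kind, label)] += 1
    return counts


def percent_box_to_pixels(value, original_width, original_height):
    """Convert a percentage-based box to pixel (x, y, width, height)."""
    return (
        value["x"] * original_width / 100,
        value["y"] * original_height / 100,
        value["width"] * original_width / 100,
        value["height"] * original_height / 100,
    )


if __name__ == "__main__":
    with open("export.json", encoding="utf-8") as handle:
        print(count_labels(json.load(handle)))
```

Regions without labels, such as the text prompt itself, are skipped by `count_labels` because their value has no key matching the region type.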

Shivansh Sharma, an active Label Studio Community member, developed the original Grounding DINO Backend. If you have projects you’d like to share or want to collaborate with others in launching your labeling project, join the Label Studio Discourse Community, where we have an entire topic dedicated to automated labeling with Machine Learning Models.

Happy Labeling!