Sync data from external storage

Integrate popular cloud and external storage systems with Label Studio to collect new items uploaded to the buckets, containers, databases, or directories and return the annotation results so that you can use them in your machine learning pipelines.

Set up the following cloud and other storage systems with Label Studio:

  • Amazon S3
  • Google Cloud Storage
  • Microsoft Azure Blob storage
  • Redis database
  • Local storage

Troubleshooting

When working with an external cloud storage connection, keep the following in mind:

  • Label Studio doesn’t import the data stored in the bucket, but instead creates references to the objects. Therefore, you must have full access control on the data to be synced and shown on the labeling screen.
  • Sync operations with external buckets only go one way. A sync either creates tasks from objects in the bucket (Source storage) or pushes annotations to the output bucket (Target storage). Changing something on the bucket side doesn’t guarantee consistency in results.
  • We recommend using a separate bucket folder for each Label Studio project.

For more troubleshooting information, see Troubleshooting Label Studio.

How external storage connections and sync work

You can add source storage connections to sync data from an external source to a Label Studio project, and add target storage connections to sync annotations from Label Studio to external storage. Each source and target storage setup is project-specific. You can connect multiple buckets, containers, databases, or directories as source or target storage for a project.

Source storage

Label Studio does not automatically sync data from source storage. If you upload new data to a connected cloud storage bucket, sync the storage connection using the UI to add the new labeling tasks to Label Studio without restarting. You can also use the API to set up or sync storage connections. See Label Studio API and locate the relevant storage connection type.

Task data synced from cloud storage is not stored in Label Studio. Instead, the data is accessed using a URL. You can also secure access to cloud storage using cloud storage credentials. For details, see Secure access to cloud storage.

Source storage permissions

  • If you enable the “Treat every bucket object as a source file” option, Label Studio backend will only need LIST permissions and won’t download any data from your buckets.

  • If you disable this option in your storage settings, Label Studio backend will require GET permissions to read JSON files and convert them to Label Studio tasks.

When your users access labeling, the backend attempts to resolve URIs (for example, s3://) into URLs (https://). The URLs are returned to the frontend and loaded by the user’s browser. To load these URLs, the browser requires HEAD and GET permissions on your cloud storage. The HEAD request is made first and allows the browser to determine the size of the audio, video, or other files. The browser then makes a GET request to retrieve the file body.

Source storage Sync and URI resolving

Source storage functionality can be divided into two parts:

  • Sync - when Label Studio scans your storage and imports tasks from it.
  • URI resolving - when the Label Studio backend asks the cloud storage to resolve URI links (e.g., s3://bucket/1.jpg) into HTTPS links (https://aws.amazon.com/bucket/1.jpg). This way, users’ browsers are able to load media.
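
As a rough illustration of the first half of URI resolving, the sketch below splits a storage URI into its scheme, bucket, and object key using only the Python standard library. The function name split_storage_uri is hypothetical and is not part of Label Studio. Producing the final HTTPS link is the job of the cloud provider's SDK and is not shown here.

```python
from __future__ import annotations

from urllib.parse import urlparse


def split_storage_uri(uri: str) -> tuple[str, str, str]:
    """Split a storage URI such as s3://bucket/1.jpg into (scheme, bucket, key)."""
    parsed = urlparse(uri)
    if not parsed.scheme or not parsed.netloc:
        raise ValueError(f"not a storage URI: {uri!r}")
    # The parsed path keeps its leading slash; the object key does not include it.
    # Keys that contain '?' or '#' would need extra handling and are ignored here.
    return parsed.scheme, parsed.netloc, parsed.path.lstrip("/")


print(split_storage_uri("s3://bucket/1.jpg"))
```

The same helper works for gs:// and azure-blob:// URIs, because it only relies on the generic scheme://container/key shape.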

Treat every bucket object as a source file

Label Studio Source Storages feature an option called “Treat every bucket object as a source file.” This option enables two different methods of loading tasks into Label Studio.

Off

When disabled, tasks in JSON format can be loaded directly from storage buckets into Label Studio. This approach is particularly helpful when dealing with complex tasks that involve multiple media sources.

On

When enabled, Label Studio automatically lists files from the storage bucket and constructs tasks. This is only possible for simple labeling tasks that involve a single media source (such as an image, text, etc.).
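
The two modes can be pictured with the following minimal sketch. It is not Label Studio's implementation: build_tasks, the objects mapping, and the image data key are assumptions made for illustration, and the real key depends on your labeling configuration.

```python
import json


def build_tasks(objects: dict, bucket: str, treat_as_source_file: bool,
                data_key: str = "image") -> list:
    """Turn bucket objects (object key -> raw bytes) into task dicts."""
    tasks = []
    for key in sorted(objects):
        if treat_as_source_file:
            # On: every object becomes one task that references the object by URI.
            tasks.append({data_key: f"s3://{bucket}/{key}"})
        else:
            # Off: every object is expected to be a JSON file holding one task.
            tasks.append(json.loads(objects[key]))
    return tasks


media = {"1.jpg": b"", "2.jpg": b""}
print(build_tasks(media, "bucket", treat_as_source_file=True))

json_files = {"task_01.json": b'{"image": "s3://bucket/1.jpg", "text": "opossums are awesome"}'}
print(build_tasks(json_files, "bucket", treat_as_source_file=False))
```

Note that the "On" branch never reads the object contents, which mirrors why listing permissions are enough in that mode.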

One Task - One JSON File

If you plan to load JSON tasks from the Source Storage (Treat every bucket object as a source file = No), each JSON file must contain exactly one task as a single JSON object (dict). Otherwise, Label Studio will not load your data properly.

Example with tasks in separate JSON files

task_01.json

{
  "image": "s3://bucket/1.jpg",
  "text": "opossums are awesome"
}

task_02.json

{
  "image": "s3://bucket/2.jpg",
  "text": "cats are awesome"
}

Example with tasks, annotations and predictions in separate JSON files

task_with_predictions_and_annotations_01.json

{
    "data": {
        "image": "s3://bucket/1.jpg",
        "text": "opossums are awesome"
    },
    "annotations": [...],  
    "predictions": [...]
}

task_with_predictions_and_annotations_02.json

{
    "data": {
        "image": "s3://bucket/2.jpg",
        "text": "cats are awesome"
    },
    "annotations": [...],
    "predictions": [...]
}

Python script to split a single JSON file with multiple tasks

Python script to split a single JSON file containing multiple tasks into separate JSON files, each containing one task:

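import json
import sys

# Path to the JSON file that contains a list of tasks.
input_json = sys.argv[1]
with open(input_json) as inp:
    tasks = json.load(inp)

# Write each task to its own file: task_0.json, task_1.json, ...
for i, task in enumerate(tasks):
    with open(f'task_{i}.json', 'w') as f:
        json.dump(task, f)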
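
Before syncing, it can help to confirm that each JSON file really holds exactly one task object. The following sketch is a hypothetical helper (find_invalid_task_files is not part of Label Studio) that reports files whose top-level value is not a single JSON object, or that cannot be parsed at all.

```python
import json
from pathlib import Path


def find_invalid_task_files(directory: str) -> list:
    """Return names of .json files whose top-level value is not one JSON object."""
    invalid = []
    for path in sorted(Path(directory).glob("*.json")):
        try:
            content = json.loads(path.read_text(encoding="utf-8"))
        except json.JSONDecodeError:
            # The file is not valid JSON at all.
            invalid.append(path.name)
            continue
        # A list means several tasks share one file, which the rule above forbids.
        if not isinstance(content, dict):
            invalid.append(path.name)
    return invalid
```

For example, calling find_invalid_task_files(".") in the directory that holds the split files should return an empty list if every file contains one task.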

Target storage

When annotators click Submit or Update while labeling tasks, Label Studio saves annotations in the Label Studio database.

If you configure target storage, annotations are sent to target storage after you click Sync for the configured target storage connection. The target storage receives a JSON-formatted export of each annotation. See Label Studio JSON format of annotated tasks for details about how exported tasks appear in target storage.

You can also delete annotations in target storage when they are deleted in Label Studio. See Set up target storage connection in the Label Studio UI for more details.

Target storage permissions

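To use this type of storage, Label Studio needs PUT permission to write annotations. DELETE permission is optional and is only needed if annotations deleted in Label Studio should also be deleted in target storage.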

Amazon S3

Connect your Amazon S3 bucket to Label Studio to retrieve labeling tasks or store completed annotations.

For details about how Label Studio secures access to cloud storage, see Secure access to cloud storage.

Configure access to your S3 bucket

Before you set up your S3 bucket or buckets with Label Studio, configure access and permissions. These steps assume that you’re using the same AWS role to manage both source and target storage with Label Studio. If you only use S3 for source storage, Label Studio does not need PUT access to the bucket.

  1. Enable programmatic access to your bucket. See the Amazon Boto3 configuration documentation for more on how to set up access to your S3 bucket.

note

A session token is only required in case of temporary security credentials. See the AWS Identity and Access Management documentation on Requesting temporary security credentials.

  2. Assign the following role policy to an account you set up to retrieve source tasks and store annotations in S3, replacing <your_bucket_name> with your bucket name:
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "VisualEditor1",
                "Effect": "Allow",
                "Action": [
                    "s3:ListBucket",
                    "s3:GetObject",
                    "s3:PutObject",
                    "s3:DeleteObject"
                ],
                "Resource": [
                    "arn:aws:s3:::<your_bucket_name>",
                    "arn:aws:s3:::<your_bucket_name>/*"
                ]
            }
        ]
    }

note

"s3:PutObject" is only needed for target storage connections, and "s3:DeleteObject" is only needed for target storage connections in Label Studio Enterprise where you want to allow deleted annotations in Label Studio to also be deleted in the target S3 bucket.

  3. Set up cross-origin resource sharing (CORS) access to your bucket, using a policy that allows GET access from the same host name as your Label Studio deployment. See Configuring cross-origin resource sharing (CORS) in the Amazon S3 User Guide. Use or modify the following example:
    [
        {
            "AllowedHeaders": [
                "*"
            ],
            "AllowedMethods": [
                "GET"
            ],
            "AllowedOrigins": [
                "*"
            ],
            "ExposeHeaders": [
                "x-amz-server-side-encryption",
                "x-amz-request-id",
                "x-amz-id-2"
            ],
            "MaxAgeSeconds": 3000
        }
    ]
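
The role policy and CORS rules from the steps above can also be generated rather than edited by hand. The sketch below is one possible way to do that. The function names and the example origin https://label-studio.example.com are hypothetical. It trims the policy to the actions described in the note above and limits the CORS rules to a single origin instead of "*".

```python
import json


def build_role_policy(bucket: str, include_target: bool = False,
                      allow_delete: bool = False) -> dict:
    """Build the role policy shown above, trimmed to the needed actions."""
    actions = ["s3:ListBucket", "s3:GetObject"]
    if include_target:
        actions.append("s3:PutObject")  # only target storage writes annotations
        if allow_delete:
            actions.append("s3:DeleteObject")  # only for syncing deletions
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "VisualEditor1",
            "Effect": "Allow",
            "Action": actions,
            "Resource": [f"arn:aws:s3:::{bucket}", f"arn:aws:s3:::{bucket}/*"],
        }],
    }


def build_cors_rules(origin: str) -> list:
    """Build the CORS rules shown above, limited to one allowed origin."""
    return [{
        "AllowedHeaders": ["*"],
        "AllowedMethods": ["GET"],
        "AllowedOrigins": [origin],
        "ExposeHeaders": ["x-amz-server-side-encryption", "x-amz-request-id", "x-amz-id-2"],
        "MaxAgeSeconds": 3000,
    }]


print(json.dumps(build_role_policy("my-bucket"), indent=4))
print(json.dumps(build_cors_rules("https://label-studio.example.com"), indent=4))
```

The printed JSON can be pasted into the AWS console in place of the hand-written examples.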

Set up connection in the Label Studio UI

After you configure access to your S3 bucket, do the following to set up Amazon S3 as a data source connection:

  1. Open Label Studio in your web browser.
  2. For a specific project, open Settings > Cloud Storage.
  3. Click Add Source Storage.
  4. In the dialog box that appears, select Amazon S3 as the storage type.
  5. In the Storage Title field, type a name for the storage to appear in the Label Studio UI.
  6. Specify the name of the S3 bucket, and if relevant, the bucket prefix to specify an internal folder or container.
  7. Adjust the remaining parameters:
    • In the File Filter Regex field, specify a regular expression to filter bucket objects. Use .* to collect all objects.
    • In the Region Name field, specify the AWS region name. For example us-east-1.
    • (Optional) In the S3 Endpoint field, specify an S3 endpoint if you want to override the URL created by S3 to access your bucket.
    • In the Access Key ID field, specify the access key ID of the temporary security credentials for an AWS account with access to your S3 bucket.
    • In the Secret Access Key field, specify the secret key of the temporary security credentials for an AWS account with access to your S3 bucket.
    • In the Session Token field, specify a session token of the temporary security credentials for an AWS account with access to your S3 bucket.
    • (Optional) Enable Treat every bucket object as a source file if your bucket contains BLOB storage files such as JPG, MP3, or similar file types. This setting creates a URL for each bucket object to use for labeling. Leave this option disabled if you have multiple JSON files in the bucket with one task per JSON file.
    • (Optional) Enable Recursive scan to perform recursive scans over the bucket contents if you have nested folders in your S3 bucket.
    • Choose whether to disable Use pre-signed URLs.
      • If this option is on, all s3://… links are resolved on the fly and converted to HTTPS URLs.
      • If this option is off, all s3://… objects are preloaded into Label Studio tasks as base64-encoded data. This is not recommended, because the task payload becomes very large and the UI slows down. It also requires GET permissions on your storage.
    • Adjust the counter for how many minutes the pre-signed URLs are valid.
  8. Click Add Storage.

After adding the storage, click Sync to collect tasks from the bucket, or make an API call to sync import storage.
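
To get a feel for the File Filter Regex field, the sketch below applies a pattern to a list of object keys. It assumes the pattern is matched from the start of each key, in the manner of Python's re.match. That matching rule is an assumption and has not been confirmed against Label Studio's behavior. The filter_keys helper is hypothetical.

```python
import re


def filter_keys(keys: list, pattern: str) -> list:
    """Keep the keys that the pattern matches, starting at the beginning of the key."""
    regex = re.compile(pattern)
    return [key for key in keys if regex.match(key)]


keys = ["images/1.jpg", "images/2.png", "notes/readme.txt"]
print(filter_keys(keys, r".*"))                  # keeps every key
print(filter_keys(keys, r".*\.(jpe?g|png)$"))    # keeps only image files
```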

Set up target storage connection in the Label Studio UI

After you configure access to your S3 bucket, do the following to set up Amazon S3 as a target storage connection:

  1. Open Label Studio in your web browser.
  2. For a specific project, open Settings > Cloud Storage.
  3. Click Add Target Storage.
  4. In the dialog box that appears, select Amazon S3 as the storage type.
  5. In the Storage Title field, type a name for the storage to appear in the Label Studio UI.
  6. Specify the name of the S3 bucket, and if relevant, the bucket prefix to specify an internal folder or container.
  7. Adjust the remaining parameters:
    • In the Region Name field, specify the AWS region name. For example us-east-1.
    • (Optional) In the S3 Endpoint field, specify an S3 endpoint if you want to override the URL created by S3 to access your bucket.
    • In the Access Key ID field, specify the access key ID of the temporary security credentials for an AWS account with access to your S3 bucket.
    • In the Secret Access Key field, specify the secret key of the temporary security credentials for an AWS account with access to your S3 bucket.
    • In the Session Token field, specify a session token of the temporary security credentials for an AWS account with access to your S3 bucket.
  8. Click Add Storage.

After adding the storage, click Sync to send annotations to the bucket, or make an API call to sync export storage.

Add storage with the Label Studio API

You can also create a storage connection using the Label Studio API.
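
As a hedged sketch of what such a call might look like, the code below builds (but does not send) a request that creates an S3 source storage connection. The endpoint path /api/storages/s3, the Token authorization scheme, and the payload field names reflect one reading of the Label Studio API reference and are assumptions to verify against the API documentation for your version. The function name, base URL, token, and project ID are placeholders.

```python
import json
import urllib.request


def build_s3_source_request(base_url: str, api_token: str, project_id: int,
                            bucket: str, prefix: str = "") -> urllib.request.Request:
    """Build a POST request for creating an S3 source storage connection."""
    # Field names are assumed from the API reference; check them before use.
    payload = {
        "project": project_id,
        "title": "My S3 source",
        "bucket": bucket,
        "prefix": prefix,
        "regex_filter": ".*",
        "use_blob_urls": True,  # assumed name for "Treat every bucket object as a source file"
        "region_name": "us-east-1",
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/storages/s3",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Token {api_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_s3_source_request("http://localhost:8080", "YOUR_API_TOKEN", 1, "my-bucket")
print(request.get_method(), request.full_url)
```

Passing the request to urllib.request.urlopen would send it. That step is left out so the example can be inspected without a running Label Studio instance.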

Google Cloud Storage

Dynamically import tasks and export annotations to Google Cloud Storage (GCS) buckets in Label Studio. For details about how Label Studio secures access to cloud storage, see Secure access to cloud storage.

Prerequisites

To connect your GCS bucket with Label Studio, set up the following:

  • Enable programmatic access to your bucket. See Cloud Storage Client Libraries in the Google Cloud Storage documentation for how to set up access to your GCS bucket.
  • Set up authentication to your bucket. Your account must have the Service Account Token Creator and Storage Object Viewer roles and storage.buckets.get access permission. See Setting up authentication and IAM permissions for Cloud Storage in the Google Cloud Storage documentation.
  • If you’re using a service account to authorize access to the Google Cloud Platform, make sure to activate it. See gcloud auth activate-service-account in the Google Cloud SDK: Command Line Interface documentation.
  • Set up cross-origin resource sharing (CORS) access to your bucket, using a policy that allows GET access from the same host name as your Label Studio deployment. See Configuring cross-origin resource sharing (CORS) in the Google Cloud User Guide. Use or modify the following example:
    echo '[
       {
          "origin": ["*"],
          "method": ["GET"],
          "responseHeader": ["Content-Type","Access-Control-Allow-Origin"],
          "maxAgeSeconds": 3600
       }
    ]' > cors-config.json

Replace YOUR_BUCKET_NAME with your actual bucket name in the following command to update CORS for your bucket:

gsutil cors set cors-config.json gs://YOUR_BUCKET_NAME

Set up connection in the Label Studio UI

In the Label Studio UI, do the following to set up the connection:

  1. Open Label Studio in your web browser.
  2. For a specific project, open Settings > Cloud Storage.
  3. Click Add Source Storage.
  4. In the dialog box that appears, select Google Cloud Storage as the storage type.
  5. In the Storage Title field, type a name for the storage to appear in the Label Studio UI.
  6. Specify the name of the GCS bucket, and if relevant, the bucket prefix to specify an internal folder or container.
  7. Adjust the remaining optional parameters:
    • In the File Filter Regex field, specify a regular expression to filter bucket objects. Use .* to collect all objects.
    • Enable Treat every bucket object as a source file if your bucket contains BLOB storage files such as JPG, MP3, or similar file types. This setting creates a URL for each bucket object to use for labeling, such as gs://my-gcs-bucket/image.jpg. Leave this option disabled if you have multiple JSON files in the bucket with one task per JSON file.
    • Choose whether to disable Use pre-signed URLs. If your tasks contain gs://… links, they must be pre-signed in order to be displayed in the browser.
    • Adjust the counter for how many minutes the pre-signed URLs are valid.
  8. In the Google Application Credentials field, add a JSON file with the GCS credentials you created to manage authentication for your bucket. You can also use the GOOGLE_APPLICATION_CREDENTIALS environment variable to specify this file. For example:
    export GOOGLE_APPLICATION_CREDENTIALS=json-file-with-GCP-creds-23441-8f8sd99vsd115a.json
  9. Click Add Storage.
  10. Repeat these steps for Target Storage to sync completed data annotations to a bucket.

After adding the storage, click Sync to collect tasks from the bucket, or make an API call to sync import storage.

Add storage with the Label Studio API

You can also create a storage connection using the Label Studio API.

Microsoft Azure Blob storage

Connect your Microsoft Azure Blob storage container with Label Studio. For details about how Label Studio secures access to cloud storage, see Secure access to cloud storage.

Prerequisites

You must set two environment variables in Label Studio to connect to Azure Blob storage:

  • AZURE_BLOB_ACCOUNT_NAME to specify the name of the storage account.
  • AZURE_BLOB_ACCOUNT_KEY to specify the secret key for the storage account.
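
A quick way to confirm that both variables are visible to the process that starts Label Studio is a small check like the one below. The helper name missing_azure_vars is hypothetical. It only reports which of the two variables are unset or empty.

```python
import os

REQUIRED_AZURE_VARS = ("AZURE_BLOB_ACCOUNT_NAME", "AZURE_BLOB_ACCOUNT_KEY")


def missing_azure_vars(env=None) -> list:
    """Return the required Azure variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_AZURE_VARS if not env.get(name)]


missing = missing_azure_vars()
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("Azure Blob storage variables are set.")
```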

Configure the specific Azure Blob container that you want Label Studio to use in the UI. If you run into CORS issues, you usually need to allow GET requests (/GET//Access-Control-Allow-Origin/3600) in the Resource Sharing tab.

Set up connection in the Label Studio UI

In the Label Studio UI, do the following to set up the connection:

  1. Open Label Studio in your web browser.
  2. For a specific project, open Settings > Cloud Storage.
  3. Click Add Source Storage.
  4. In the dialog box that appears, select Microsoft Azure as the storage type.
  5. In the Storage Title field, type a name for the storage to appear in the Label Studio UI.
  6. Specify the name of the Azure Blob container, and if relevant, the container prefix to specify an internal folder or container.
  7. Adjust the remaining optional parameters:
    • In the File Filter Regex field, specify a regular expression to filter bucket objects. Use .* to collect all objects.
    • In the Account Name field, specify the account name for the Azure storage. You can also set this field as an environment variable, AZURE_BLOB_ACCOUNT_NAME.
    • In the Account Key field, specify the secret key to access the storage account. You can also set this field as an environment variable, AZURE_BLOB_ACCOUNT_KEY.
    • Enable Treat every bucket object as a source file if your bucket contains BLOB storage files such as JPG, MP3, or similar file types. This setting creates a URL for each bucket object to use for labeling, for example azure-blob://container-name/image.jpg. Leave this option disabled if you have multiple JSON files in the bucket with one task per JSON file.
    • Choose whether to disable Use pre-signed URLs, or shared access signatures. If your tasks contain azure-blob://… links, they must be pre-signed in order to be displayed in the browser.
    • Adjust the counter for how many minutes the shared access signatures are valid.
  8. Click Add Storage.
  9. Repeat these steps for Target Storage to sync completed data annotations to a container.

After adding the storage, click Sync to collect tasks from the container, or make an API call to sync import storage.

Add storage with the Label Studio API

You can also create a storage connection using the Label Studio API.

Redis database

You can also store your tasks and annotations in a Redis database. You must store the tasks and annotations in different databases. You might want to use a Redis database if you find that relying on a file-based cloud storage connection is slow for your datasets.

Currently, this configuration is only supported if you host the Redis database in the default mode, with the default IP address.

Label Studio does not manage the Redis database for you. See the Redis Quick Start for details about hosting and managing your own Redis database. Because Redis is an in-memory database, data saved in Redis does not persist. To make sure you don’t lose data, set up Redis persistence or use another method to persist the data, such as using Redis in the cloud with Microsoft Azure or Amazon AWS.

Task format for Source Redis Storage

Label Studio only supports string values for Redis databases, which should represent Label Studio tasks in JSON format.

For example:

'ls-task-1': '{"image": "http://example.com/1.jpg"}'
'ls-task-2': '{"image": "http://example.com/2.jpg"}'
...
> redis-cli -n 1
127.0.0.1:6379[1]> SET ls-task-1 '{"image": "http://example.com/1.jpg"}'
OK
127.0.0.1:6379[1]> GET ls-task-1
"{\"image\": \"http://example.com/1.jpg\"}"
127.0.0.1:6379[1]> TYPE ls-task-1
string
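
The key and value layout above can be produced from a list of task dicts with a few lines of Python. This is only a sketch of the serialization step. The to_redis_entries helper and the ls-task- key prefix are illustrative, and writing the entries to the database with a Redis client is left out.

```python
import json


def to_redis_entries(tasks: list, prefix: str = "ls-task-") -> dict:
    """Map tasks to key -> JSON string pairs, numbering the keys from 1."""
    return {f"{prefix}{i}": json.dumps(task) for i, task in enumerate(tasks, start=1)}


entries = to_redis_entries([
    {"image": "http://example.com/1.jpg"},
    {"image": "http://example.com/2.jpg"},
])
for key, value in entries.items():
    print(key, value)
```

Each value is a plain string, which matches the string type reported by the TYPE command in the session above.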

Set up connection in the Label Studio UI

In the Label Studio UI, do the following to set up the connection:

  1. Open Label Studio in your web browser.
  2. For a specific project, open Settings > Cloud Storage.
  3. Click Add Source Storage.
  4. In the dialog box that appears, select Redis Database as the storage type.
  5. Update the Redis configuration parameters:
    • In the Path field, specify the path to the database. Used as the keys prefix, values under this path are scanned for tasks.
    • In the Password field, specify the server password.
    • In the Host field, specify the IP of the server hosting the database, or localhost.
    • In the Port field, specify the port that you can use to access the database.
    • In the File Filter Regex field, specify a regular expression to filter database objects. Use .* to collect all objects.
    • Enable Treat every bucket object as a source file if your database contains files such as JPG, MP3, or similar file types. This setting creates a URL for each database object to use for labeling. Leave this option disabled if you have multiple JSON files in the database, with one task per JSON file.
  6. Click Add Storage.
  7. Repeat these steps for Target Storage to sync completed data annotations to a database.

After adding the storage, click Sync to collect tasks from the database, or make an API call to sync import storage.

Add storage with the Label Studio API

You can also create a storage connection using the Label Studio API.

Local storage

If you have local files that you want to add to Label Studio from a specific directory, you can set up a specific local directory on the machine where Label Studio is running as source or target storage. Label Studio steps through the directory recursively to read tasks.

Prerequisites

Add these variables to your environment setup:

  • LABEL_STUDIO_LOCAL_FILES_SERVING_ENABLED=true
  • LABEL_STUDIO_LOCAL_FILES_DOCUMENT_ROOT=/home/user (or LABEL_STUDIO_LOCAL_FILES_DOCUMENT_ROOT=C:\\data\\media for Windows).

Without these settings, Local storage and URLs in tasks that point to local files won’t work. Keep in mind that serving data from the local file system can be a security risk. See Set environment variables for more about using environment variables.

Set up connection in the Label Studio UI

In the Label Studio UI, do the following to set up the connection:

  1. Open Label Studio in your web browser.
  2. For a specific project, open Settings > Cloud Storage.
  3. Click Add Source Storage.
  4. In the dialog box that appears, select Local Files as the storage type.

  5. In the Storage Title field, type a name for the storage to appear in the Label Studio UI.

  6. Specify an Absolute local path to the directory with your files. The local path must be an absolute path and include the LABEL_STUDIO_LOCAL_FILES_DOCUMENT_ROOT value.

    For example, if LABEL_STUDIO_LOCAL_FILES_DOCUMENT_ROOT=/home/user, then your local path must be /home/user/dataset1. For more about that environment variable, see Run Label Studio on Docker and use local storage.

note

If you are using Windows, ensure that you use backslashes when entering your Absolute local path.

  7. (Optional) In the File Filter Regex field, specify a regular expression to filter bucket objects. Use .* to collect all objects.
  8. (Optional) Toggle Treat every bucket object as a source file.
    • Enable this option if you want to create Label Studio tasks from media files automatically, such as JPG, MP3, or similar file types. Use this option for labeling configurations with one source tag.
    • Disable this option if you want to import tasks in Label Studio JSON format directly from your storage. Use this option for complex labeling configurations with HyperText or multiple source tags.
  9. Click Add Storage.
  10. Repeat these steps for Add Target Storage to use a local file directory for exporting.

After adding the storage, click Sync to collect tasks from the directory, or make an API call to sync import storage.
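
The rule that the Absolute local path must be absolute and must sit inside LABEL_STUDIO_LOCAL_FILES_DOCUMENT_ROOT can be checked before you type the path into the UI. The sketch below is a hypothetical helper for POSIX-style paths. On Windows, pathlib.PureWindowsPath would be the analogous class.

```python
from pathlib import PurePosixPath


def is_valid_local_path(local_path: str, document_root: str) -> bool:
    """Check that local_path is absolute and located inside document_root."""
    path = PurePosixPath(local_path)
    root = PurePosixPath(document_root)
    # Reject relative paths and paths that try to climb out with "..".
    if not path.is_absolute() or ".." in path.parts:
        return False
    try:
        path.relative_to(root)  # raises ValueError when path is outside root
    except ValueError:
        return False
    return True


print(is_valid_local_path("/home/user/dataset1", "/home/user"))
print(is_valid_local_path("/tmp/dataset1", "/home/user"))
```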

Tasks with local storage file references

In cases where your tasks have multiple or complex input sources, such as multiple object tags in the labeling config or a HyperText tag with custom data values, you must prepare tasks manually.

In those cases, repeat all of the steps above to create the local storage, but skip the optional steps. Your Absolute local path must lead to the directory with the files (not tasks) that you want to reference from your tasks. The directory can also contain other directories or files; you specify the ones you need inside each task.

Differences from the instructions above:

  • File Filter Regex: leave empty (because you specify the files inside the tasks).
  • Treat every bucket object as a source file: switch off (because you specify the files inside the tasks).

Click Add Storage, but do not sync the storage afterward (don’t click the Sync Storage button). This avoids automatic task creation from the storage files.

When referencing your files within a task, adhere to the following guidelines:

  • “Absolute local path” must be a sub-directory of LABEL_STUDIO_LOCAL_FILES_DOCUMENT_ROOT (see the Absolute local path step above).
  • All file paths must begin with /data/local-files/?d=.
  • In the following example, the first directory is dataset1. For instance, if you have mixed data types in tasks, including
    • audio files 1.wav, 2.wav within an audio folder and
    • image files 1.jpg, 2.jpg within an images folder, construct the paths as follows:
[{
 "id": 1,
 "data": {
    "audio": "/data/local-files/?d=dataset1/audio/1.wav",
    "image": "/data/local-files/?d=dataset1/images/1.jpg"
  }
},
{
 "id": 2,
 "data": {
    "audio": "/data/local-files/?d=dataset1/audio/2.wav",
    "image": "/data/local-files/?d=dataset1/images/2.jpg"
  }
}]
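
The paths in the example above can be derived from the files' absolute locations instead of being typed by hand. The sketch below assumes a document root of /home/user, as in the earlier example, and uses a hypothetical helper named local_file_url. It does not URL-encode special characters in file names.

```python
import json
from pathlib import PurePosixPath

LOCAL_FILES_PREFIX = "/data/local-files/?d="


def local_file_url(file_path: str, document_root: str) -> str:
    """Build a task URL for a file stored under the document root."""
    # relative_to raises ValueError if the file is outside the document root.
    relative = PurePosixPath(file_path).relative_to(PurePosixPath(document_root))
    return f"{LOCAL_FILES_PREFIX}{relative.as_posix()}"


tasks = [
    {
        "id": i,
        "data": {
            "audio": local_file_url(f"/home/user/dataset1/audio/{i}.wav", "/home/user"),
            "image": local_file_url(f"/home/user/dataset1/images/{i}.jpg", "/home/user"),
        },
    }
    for i in (1, 2)
]
print(json.dumps(tasks, indent=1))
```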

There are several ways to add your custom tasks: the API, the web interface, or another storage connection. The simplest is to use the Import button on the Data Manager page. Drag and drop your JSON file into the window, then click the blue Import button.

Local Storage with Custom Task Format

This video tutorial demonstrates how to set up Local Storage from scratch and import JSON tasks in a complex task format that are linked to the Local Storage files.

Add storage with the Label Studio API

You can also create a storage connection using the Label Studio API.

Set up local storage with Docker

If you’re using Label Studio in Docker, you need to mount the local directory that you want to access as a volume when you start the Docker container. See Run Label Studio on Docker and use local storage.

Troubleshooting cloud storage

For more troubleshooting information, see Troubleshooting Label Studio.