
Sync data from external storage

Integrate popular cloud and external storage systems with Label Studio to collect new items uploaded to the buckets, containers, databases, or directories and return the annotation results so that you can use them in your machine learning pipelines.

Set up the following cloud and other storage systems with Label Studio:

  • Amazon S3
  • Google Cloud Storage
  • Microsoft Azure Blob storage
  • Redis database
  • Local storage

Each source and target storage setup is project-specific. You can connect multiple buckets, containers, databases, or directories as source or target storage for a project. If you upload new data to a connected cloud storage bucket, sync the storage connection to add the new labeling tasks to Label Studio without restarting.

After setting up target storage and performing annotations, manually sync annotations using the Sync button for the configured target storage. Annotations are still stored in the Label Studio database, and the target storage receives a JSON export of each annotation. Annotations are sent to target storage as a one-way export. You can also export or sync using the API.

Secure access to cloud storage using workspaces and cloud storage credentials. For details, see Secure access to cloud storage.

Amazon S3

Connect your Amazon S3 bucket to Label Studio to retrieve labeling tasks or store completed annotations.

For details about how Label Studio secures access to cloud storage, see Secure access to cloud storage.

Configure access to your S3 bucket

Before you set up your S3 bucket or buckets with Label Studio, configure access and permissions. These steps assume that you’re using the same AWS role to manage both source and target storage with Label Studio. If you only use S3 for source storage, Label Studio does not need PUT access to the bucket.

  1. Enable programmatic access to your bucket. See the Amazon Boto3 configuration documentation for more on how to set up access to your S3 bucket.
  2. Get temporary security credentials to grant access to your S3 bucket. See the AWS Identity and Access Management documentation on Requesting temporary security credentials.
  3. Assign the following role policy to an account you set up to retrieve source tasks and store annotations in S3, replacing <your_bucket_name> with your bucket name:
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "VisualEditor1",
                "Effect": "Allow",
                "Action": [
                    "s3:ListBucket",
                    "s3:GetObject",
                    "s3:PutObject"
                ],
                "Resource": [
                    "arn:aws:s3:::<your_bucket_name>",
                    "arn:aws:s3:::<your_bucket_name>/*"
                ]
            }
        ]
    }
  4. Set up cross-origin resource sharing (CORS) access to your bucket, using a policy that allows GET access from the same host name as your Label Studio deployment. See Configuring cross-origin resource sharing (CORS) in the Amazon S3 User Guide. Use or modify the following example:
    [
        {
            "AllowedHeaders": [
                "*"
            ],
            "AllowedMethods": [
                "GET"
            ],
            "AllowedOrigins": [
                "*"
            ],
            "ExposeHeaders": [
                "x-amz-server-side-encryption",
                "x-amz-request-id",
                "x-amz-id-2"
            ],
            "MaxAgeSeconds": 3000
        }
    ]
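The policy and CORS documents in the steps above can also be generated from a script. The following Python sketch is illustrative only: `build_bucket_policy` and `build_cors_rules` are hypothetical helper names, not part of Label Studio or any AWS tool, and `https://label-studio.example.com` is a placeholder origin. It drops s3:PutObject when the bucket is used only for source storage, and it restricts AllowedOrigins to a single host instead of `*`.

```python
import json


def build_bucket_policy(bucket_name, allow_put=True):
    """Return the role policy from step 3 for one bucket.

    allow_put=False omits s3:PutObject, which is not needed for
    source-only storage.
    """
    actions = ["s3:ListBucket", "s3:GetObject"]
    if allow_put:
        actions.append("s3:PutObject")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "VisualEditor1",
                "Effect": "Allow",
                "Action": actions,
                "Resource": [
                    f"arn:aws:s3:::{bucket_name}",
                    f"arn:aws:s3:::{bucket_name}/*",
                ],
            }
        ],
    }


def build_cors_rules(label_studio_origin):
    """Return the CORS rules from step 4, limited to one allowed origin."""
    return [
        {
            "AllowedHeaders": ["*"],
            "AllowedMethods": ["GET"],
            "AllowedOrigins": [label_studio_origin],
            "ExposeHeaders": [
                "x-amz-server-side-encryption",
                "x-amz-request-id",
                "x-amz-id-2",
            ],
            "MaxAgeSeconds": 3000,
        }
    ]


if __name__ == "__main__":
    # Placeholder bucket name and origin; replace with your own values.
    print(json.dumps(build_bucket_policy("my-bucket", allow_put=False), indent=4))
    print(json.dumps(build_cors_rules("https://label-studio.example.com"), indent=4))
```

The printed JSON can be pasted into the AWS console in the same places as the hand-written examples.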

Set up connection in the Label Studio UI

After you configure access to your S3 bucket, do the following to set up Amazon S3 as a data source connection:

  1. Open Label Studio in your web browser.
  2. For a specific project, open Settings > Cloud Storage.
  3. Click Add Source Storage.
  4. In the dialog box that appears, select Amazon S3 as the storage type.
  5. In the Storage Title field, type a name for the storage to appear in the Label Studio UI.
  6. Specify the name of the S3 bucket, and if relevant, the bucket prefix to specify an internal folder or container.
  7. Adjust the remaining parameters:
    • In the File Filter Regex field, specify a regular expression to filter bucket objects. Use .* to collect all objects.
    • In the Region Name field, specify the AWS region name. For example, us-east-1.
    • (Optional) In the S3 Endpoint field, specify an S3 endpoint if you want to override the URL created by S3 to access your bucket.
    • In the Access Key ID field, specify the access key ID of the temporary security credentials for an AWS account with access to your S3 bucket.
    • In the Secret Access Key field, specify the secret key of the temporary security credentials for an AWS account with access to your S3 bucket.
    • In the Session Token field, specify a session token of the temporary security credentials for an AWS account with access to your S3 bucket.
    • (Optional) Enable Treat every bucket object as a source file if your bucket contains BLOB storage files such as JPG, MP3, or similar file types. This setting creates a URL for each bucket object to use for labeling. Leave this option disabled if you have multiple JSON files in the bucket with one task per JSON file.
    • (Optional) Enable Recursive scan to perform recursive scans over the bucket contents if you have nested folders in your S3 bucket.
    • Choose whether to disable Use pre-signed URLs. For example, if you host Label Studio in the same AWS network as your storage buckets, you can disable pre-signed URLs and have direct access to the storage using s3:// links.
    • Adjust the counter for how many minutes the pre-signed URLs are valid.
  8. Click Add Storage.
  9. Repeat these steps for Target Storage to sync completed data annotations to a bucket.

After adding the storage, click Sync to collect tasks from the bucket, or make an API call to sync import storage.
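As a rough illustration of the API route, the Python sketch below builds (but does not send) a sync request using only the standard library. The endpoint paths `/api/storages/<type>/<id>/sync` and `/api/storages/export/<type>/<id>/sync` and the `Token` authorization header are assumptions about the Label Studio API that you should check against the API reference for your version. The host, token, and storage ID are placeholders, and `build_sync_request` is a hypothetical helper name.

```python
import urllib.request


def build_sync_request(base_url, token, storage_type, storage_id, export=False):
    """Build a POST request that asks Label Studio to sync one storage.

    export=False targets source (import) storage,
    export=True targets target (export) storage.
    The URL layout is an assumption, not taken from this page.
    """
    kind = "export/" if export else ""
    url = f"{base_url.rstrip('/')}/api/storages/{kind}{storage_type}/{storage_id}/sync"
    return urllib.request.Request(
        url,
        data=b"",
        method="POST",
        headers={"Authorization": f"Token {token}"},
    )


if __name__ == "__main__":
    # Placeholder host, token, and storage ID.
    req = build_sync_request("http://localhost:8080", "<your_api_token>", "s3", 1)
    print(req.get_method(), req.full_url)
    # To actually send the request: urllib.request.urlopen(req)
```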

Set up an S3 connection with IAM role access

If you want to use a revocable method to grant Label Studio access to your Amazon S3 bucket, use an IAM role and its temporary security credentials instead of an access key ID and secret. This added layer of security is only available in Label Studio Enterprise. For more details about security in Label Studio and Label Studio Enterprise, see Secure Label Studio.

Set up an IAM role in Amazon AWS

Set up an IAM role in Amazon AWS to use with Label Studio.

  1. In the Label Studio UI, open the Organization page to get an External ID to use for the IAM role creation in Amazon AWS. You must be an administrator to view the Organization page.
  2. Follow the Amazon AWS documentation to create an IAM role in your AWS account.
    Make sure to require an external ID and do not require multi-factor authentication when you set up the role. Select an existing permissions policy, or create one that allows programmatic access to the bucket.
  3. Create a trust policy using the external ID. Use the following example:
    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Principal": {
            "AWS": [
              "arn:aws:iam::490065312183:user/rw_bucket"
            ]
          },
          "Action": "sts:AssumeRole",
          "Condition": {
            "StringEquals": {
              "sts:ExternalId": [
                "<YOUR-ORG-ExternalId>"
              ]
            }
          }
        }
      ]
    }
  4. After you create the IAM role, note the Amazon Resource Name (ARN) of the role. You need it to set up the S3 source storage in Label Studio.
  5. Assign role policies to the role to allow it to access your S3 bucket. Replace <your_bucket_name> with your S3 bucket name. Use the following role policy for S3 source storage:
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "VisualEditor1",
                "Effect": "Allow",
                "Action": [
                    "s3:ListBucket",
                    "s3:GetObject"
                ],
                "Resource": [
                    "arn:aws:s3:::<your_bucket_name>",
                    "arn:aws:s3:::<your_bucket_name>/*"
                ]
            }
        ]
    }
    Use the following role policy for S3 target storage:
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "VisualEditor1",
                "Effect": "Allow",
                "Action": [
                    "s3:ListBucket",
                    "s3:PutObject",
                    "s3:GetObject"
                ],
                "Resource": [
                    "arn:aws:s3:::<your_bucket_name>",
                    "arn:aws:s3:::<your_bucket_name>/*"
                ]
            }
        ]
    }

For more details about using an IAM role with an external ID to provide access to a third party (Label Studio), see the Amazon AWS documentation How to use an external ID when granting access to your AWS resources to a third party.
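If you create roles for several organizations or environments, the trust policy from step 3 can be templated. This is a minimal sketch, assuming you pass in the principal ARN shown in the example above and the external ID from your Organization page. `build_trust_policy` is a hypothetical helper, not a Label Studio or AWS function.

```python
import json


def build_trust_policy(principal_arn, external_id):
    """Return a trust policy that lets principal_arn assume the role
    only when the request carries the given external ID."""
    if not external_id or external_id.startswith("<"):
        # Guard against leaving the <YOUR-ORG-ExternalId> placeholder in place.
        raise ValueError("replace the placeholder with your organization's external ID")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": [principal_arn]},
                "Action": "sts:AssumeRole",
                "Condition": {
                    "StringEquals": {"sts:ExternalId": [external_id]}
                },
            }
        ],
    }


if __name__ == "__main__":
    policy = build_trust_policy(
        "arn:aws:iam::490065312183:user/rw_bucket",  # principal from the example above
        "example-external-id",                        # placeholder value
    )
    print(json.dumps(policy, indent=2))
```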

Create the connection to S3 in the Label Studio UI

In the Label Studio UI, do the following to set up the connection:

  1. Open Label Studio in your web browser.
  2. For a specific project, open Settings > Cloud Storage.
  3. Click Add Source Storage.
  4. In the dialog box that appears, select Amazon S3 (IAM role access) as the storage type.
  5. In the Storage Title field, type a name for the storage to appear in the Label Studio UI.
  6. Specify the name of the S3 bucket, and if relevant, the bucket prefix to specify an internal folder or container.
  7. Adjust the remaining parameters:
    • In the File Filter Regex field, specify a regular expression to filter bucket objects. Use .* to collect all objects.
    • In the Region Name field, specify the AWS region name. For example, us-east-1.
    • In the S3 Endpoint field, specify an S3 endpoint if you want to override the URL created by S3 to access your bucket.
    • In the Role ARN field, specify the Amazon Resource Name (ARN) of the IAM role that you created to grant access to Label Studio.
    • In the External ID field, specify the external ID that identifies Label Studio to your AWS account. You can find the external ID on your Organization page.
    • Enable Treat every bucket object as a source file if your bucket contains BLOB storage files such as JPG, MP3, or similar file types. This setting creates a URL for each bucket object to use for labeling. Leave this option disabled if you have multiple JSON files in the bucket with one task per JSON file.
    • Enable Recursive scan to perform recursive scans over the bucket contents if you have nested folders in your S3 bucket.
    • Choose whether to disable Use pre-signed URLs. For example, if you host Label Studio in the same AWS network as your storage buckets, you can disable pre-signed URLs and have direct access to the storage using s3:// links.
    • Adjust the counter for how many minutes the pre-signed URLs are valid.
  8. Click Add Storage.
  9. Repeat these steps for Target Storage to sync completed data annotations to a bucket.

After adding the storage, click Sync to collect tasks from the bucket, or make an API call to sync import storage.
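To get a feel for the File Filter Regex field before syncing, you can try an expression against a few sample object keys. The sketch below assumes the filter is applied from the start of each key (Python `re.match` semantics). That is an assumption about Label Studio's behavior, so verify it with a small test bucket. The keys and the `filter_keys` helper are made up for illustration.

```python
import re


def filter_keys(keys, regex_filter):
    """Keep the object keys that the filter matches from the start of the key."""
    pattern = re.compile(regex_filter)
    return [key for key in keys if pattern.match(key)]


# Hypothetical object keys in a bucket.
keys = ["images/cat.jpg", "images/dog.png", "tasks/batch1.json", "notes.txt"]

print(filter_keys(keys, r".*"))               # every key
print(filter_keys(keys, r".*\.(jpg|png)$"))   # ['images/cat.jpg', 'images/dog.png']
print(filter_keys(keys, r"tasks/.*\.json$"))  # ['tasks/batch1.json']
print(filter_keys(keys, r"jpg"))              # [] because no key starts with "jpg"
```

Under this assumption, a bare extension such as `jpg` selects nothing, which is why the examples lead with `.*`.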

Add storage with the Label Studio API

You can also create a storage connection using the Label Studio API.
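A hedged sketch of what such a call might look like for S3 source storage follows. The endpoint `/api/storages/s3`, the `Token` authorization header, and the payload field names (`project`, `title`, `bucket`, `prefix`, `regex_filter`, `use_blob_urls`, `region_name`) are assumptions that mirror the UI fields described above. Confirm them in the Label Studio API reference before relying on them. The request is built but not sent, and `build_create_s3_storage_request` is a hypothetical helper name.

```python
import json
import urllib.request


def build_create_s3_storage_request(base_url, token, project_id, bucket, title,
                                    prefix="", regex_filter=".*",
                                    use_blob_urls=True, region_name="us-east-1"):
    """Build a POST request that would create an S3 source storage.

    All payload field names are assumptions, not taken from this page.
    """
    payload = {
        "project": project_id,
        "title": title,
        "bucket": bucket,
        "prefix": prefix,
        "regex_filter": regex_filter,
        "use_blob_urls": use_blob_urls,
        "region_name": region_name,
    }
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/api/storages/s3",
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Token {token}",
            "Content-Type": "application/json",
        },
    )


if __name__ == "__main__":
    # Placeholder host, token, project ID, and bucket name.
    req = build_create_s3_storage_request(
        "http://localhost:8080", "<your_api_token>", 1, "my-bucket", "My S3 source"
    )
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
    # To actually send the request: urllib.request.urlopen(req)
```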

Google Cloud Storage

Dynamically import tasks and export annotations to Google Cloud Storage (GCS) buckets in Label Studio. For details about how Label Studio secures access to cloud storage, see Secure access to cloud storage.

Prerequisites

To connect your GCS bucket with Label Studio, set up the following:

Set up connection in the Label Studio UI

In the Label Studio UI, do the following to set up the connection:

  1. Open Label Studio in your web browser.
  2. For a specific project, open Settings > Cloud Storage.
  3. Click Add Source Storage.
  4. In the dialog box that appears, select Google Cloud Storage as the storage type.
  5. In the Storage Title field, type a name for the storage to appear in the Label Studio UI.
  6. Specify the name of the GCS bucket, and if relevant, the bucket prefix to specify an internal folder or container.
  7. Adjust the remaining optional parameters:
    • In the File Filter Regex field, specify a regular expression to filter bucket objects. Use .* to collect all objects.
    • Enable Treat every bucket object as a source file if your bucket contains BLOB storage files such as JPG, MP3, or similar file types. This setting creates a URL for each bucket object to use for labeling, such as gs://my-gcs-bucket/image.jpg. Leave this option disabled if you have multiple JSON files in the bucket with one task per JSON file.
    • Choose whether to disable Use pre-signed URLs. For example, if you host Label Studio in the same network as your storage buckets, you can disable pre-signed URLs and have direct access to the storage.
    • Adjust the counter for how many minutes the pre-signed URLs are valid.
  8. In the Google Application Credentials field, add a JSON file with the GCS credentials you created to manage authentication for your bucket. You can also use the GOOGLE_APPLICATION_CREDENTIALS environment variable to specify this file. For example:
    export GOOGLE_APPLICATION_CREDENTIALS=json-file-with-GCP-creds-23441-8f8sd99vsd115a.json
  9. Click Add Storage.
  10. Repeat these steps for Target Storage to sync completed data annotations to a bucket.

After adding the storage, click Sync to collect tasks from the bucket, or make an API call to sync import storage.
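If you rely on the GOOGLE_APPLICATION_CREDENTIALS environment variable, a quick check that it points at a readable key file can save a failed sync. The sketch below assumes the credentials are a Google service account key, which normally contains the `type`, `project_id`, `private_key`, and `client_email` fields. Other credential types have different fields. `check_gcs_credentials` is a hypothetical helper name.

```python
import json
import os

# Fields expected in a service account key file (assumed credential type).
REQUIRED_KEYS = {"type", "project_id", "private_key", "client_email"}


def check_gcs_credentials(env=None):
    """Return the credentials path if the file parses and has the expected fields."""
    env = os.environ if env is None else env
    path = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not path:
        raise RuntimeError("GOOGLE_APPLICATION_CREDENTIALS is not set")
    with open(path, encoding="utf-8") as handle:
        creds = json.load(handle)
    missing = REQUIRED_KEYS - set(creds)
    if missing:
        raise ValueError(f"credentials file is missing fields: {sorted(missing)}")
    return path


if __name__ == "__main__":
    try:
        print("Using credentials file:", check_gcs_credentials())
    except (RuntimeError, ValueError, OSError) as error:
        print("Credentials check failed:", error)
```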

Add storage with the Label Studio API

You can also create a storage connection using the Label Studio API.

Microsoft Azure Blob storage

Connect your Microsoft Azure Blob storage container with Label Studio. For details about how Label Studio secures access to cloud storage, see Secure access to cloud storage.

Prerequisites

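You must set two environment variables in Label Studio to connect to Azure Blob storage:

  • AZURE_BLOB_ACCOUNT_NAME to specify the name of the storage account.
  • AZURE_BLOB_ACCOUNT_KEY to specify the secret key for the storage account.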

Configure the specific Azure Blob container that you want Label Studio to use in the UI.
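Before starting Label Studio, you can confirm that both variables are present. This small Python sketch uses the variable names AZURE_BLOB_ACCOUNT_NAME and AZURE_BLOB_ACCOUNT_KEY that appear in the field descriptions in this section. `missing_azure_settings` is a hypothetical helper name.

```python
import os

AZURE_VARS = ("AZURE_BLOB_ACCOUNT_NAME", "AZURE_BLOB_ACCOUNT_KEY")


def missing_azure_settings(env=None):
    """Return the names of the Azure variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in AZURE_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_azure_settings()
    if missing:
        print("Set these variables before starting Label Studio:", ", ".join(missing))
    else:
        print("Azure Blob storage variables are set.")
```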

Set up connection in the Label Studio UI

In the Label Studio UI, do the following to set up the connection:

  1. Open Label Studio in your web browser.
  2. For a specific project, open Settings > Cloud Storage.
  3. Click Add Source Storage.
  4. In the dialog box that appears, select Microsoft Azure as the storage type.
  5. In the Storage Title field, type a name for the storage to appear in the Label Studio UI.
  6. Specify the name of the Azure Blob container, and if relevant, the container prefix to specify an internal folder or container.
  7. Adjust the remaining optional parameters:
    • In the File Filter Regex field, specify a regular expression to filter container objects. Use .* to collect all objects.
    • In the Account Name field, specify the account name for the Azure storage. You can also set this field as an environment variable, AZURE_BLOB_ACCOUNT_NAME.
    • In the Account Key field, specify the secret key to access the storage account. You can also set this field as an environment variable, AZURE_BLOB_ACCOUNT_KEY.
    • Enable Treat every bucket object as a source file if your container contains BLOB storage files such as JPG, MP3, or similar file types. This setting creates a URL for each container object to use for labeling, for example azure-blob://container-name/image.jpg. Leave this option disabled if you have multiple JSON files in the container with one task per JSON file.
    • Choose whether to disable Use pre-signed URLs, or shared access signatures. For example, if you host Label Studio in the same network as your storage containers, you can disable pre-signed URLs and have direct access to the storage.
    • Adjust the counter for how many minutes the shared access signatures are valid.
  8. Click Add Storage.
  9. Repeat these steps for Target Storage to sync completed data annotations to a container.

After adding the storage, click Sync to collect tasks from the container, or make an API call to sync import storage.

Add storage with the Label Studio API

You can also create a storage connection using the Label Studio API.

Redis database

You can also store your tasks and annotations in a Redis database. You must store the tasks and annotations in different databases. You might want to use a Redis database if you find that relying on a file-based cloud storage connection is slow for your datasets.

Currently, this configuration is only supported if you host the Redis database in the default mode, with the default IP address.

Label Studio does not manage the Redis database for you. See the Redis Quick Start for details about hosting and managing your own Redis database. Because Redis is an in-memory database, data saved in Redis does not persist. To make sure you don’t lose data, set up Redis persistence or use another method to persist the data, such as using Redis in the cloud with Microsoft Azure or Amazon AWS.

Set up connection in the Label Studio UI

In the Label Studio UI, do the following to set up the connection:

  1. Open Label Studio in your web browser.
  2. For a specific project, open Settings > Cloud Storage.
  3. Click Add Source Storage.
  4. In the dialog box that appears, select Redis Database as the storage type.
  5. Update the Redis configuration parameters:
    • In the Path field, specify the path to the database. The path is used as the key prefix, and values under this path are scanned for tasks.
    • In the Password field, specify the server password.
    • In the Host field, specify the IP of the server hosting the database, or localhost.
    • In the Port field, specify the port that you can use to access the database.
    • In the File Filter Regex field, specify a regular expression to filter database objects. Use .* to collect all objects.
    • Enable Treat every bucket object as a source file if your database contains files such as JPG, MP3, or similar file types. This setting creates a URL for each database object to use for labeling. Leave this option disabled if you have multiple JSON files in the database, with one task per JSON file.
  6. Click Add Storage.
  7. Repeat these steps for Target Storage to sync completed data annotations to a database.

After adding the storage, click Sync to collect tasks from the database, or make an API call to sync import storage.
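To illustrate how tasks might be laid out for the Path setting, the sketch below writes each task as a JSON string under a key that starts with the path prefix. The key naming scheme (`<path>/task-<n>.json`) and the `store_tasks` helper are hypothetical, and storing one JSON task per value is an assumption. The example uses an in-memory stand-in so it runs without a server, but any client with a redis-py style `set(key, value)` method would work.

```python
import json


def store_tasks(client, path, tasks):
    """Write each task as a JSON string under a key that starts with `path`.

    `client` is any object with a redis-py style set(key, value) method.
    Returns the list of keys written.
    """
    keys = []
    for index, task in enumerate(tasks):
        key = f"{path.rstrip('/')}/task-{index}.json"  # hypothetical key scheme
        client.set(key, json.dumps(task))
        keys.append(key)
    return keys


class InMemoryClient:
    """Stand-in for a Redis client so the sketch runs without a server."""

    def __init__(self):
        self.data = {}

    def set(self, key, value):
        self.data[key] = value


if __name__ == "__main__":
    client = InMemoryClient()
    # With redis-py installed, a real client could be created with, for example:
    #   client = redis.Redis(host="localhost", port=6379, db=1, password="...")
    # The db number here is a placeholder.
    written = store_tasks(client, "project1/tasks", [{"data": {"text": "hello"}}])
    print(written)
```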

Add storage with the Label Studio API

You can also create a storage connection using the Label Studio API.

Local storage

If you have local files that you want to add to Label Studio from a specific directory, you can set up a specific local directory on the machine where Label Studio is running as source or target storage. Label Studio steps through the directory recursively to read tasks.

Tasks with local storage file references

In cases where your tasks have multiple or complex input sources, such as multiple object tags in the labeling config or a HyperText tag with custom data values, you must prepare tasks manually.

In those cases, you can add local storage without syncing (to avoid automatic task creation from storage files) and specify the local files in your data values. For example, to specify multiple data types in the Label Studio JSON format, specifically an audio file 1.wav and an image file 1.jpg:

{
 "data": {
    "audio": "/data/local-files/?d=dataset1/1.wav",
    "image": "/data/local-files/?d=dataset1/1.jpg"
  }
}
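When you prepare many such tasks, the `/data/local-files/?d=` references can be generated from absolute file paths. The sketch below assumes that the `d` parameter is the file path relative to the LABEL_STUDIO_LOCAL_FILES_DOCUMENT_ROOT directory, which is consistent with the example above when the document root is /home/user. The helper name `local_file_url` is hypothetical, and URL-encoding the path is a precaution for file names with spaces rather than a documented requirement.

```python
import json
from pathlib import PurePosixPath
from urllib.parse import quote


def local_file_url(document_root, absolute_path):
    """Return the /data/local-files/?d=... reference for a file under document_root.

    Raises ValueError if absolute_path is not inside document_root.
    """
    relative = PurePosixPath(absolute_path).relative_to(PurePosixPath(document_root))
    return "/data/local-files/?d=" + quote(str(relative))


if __name__ == "__main__":
    root = "/home/user"  # example value of LABEL_STUDIO_LOCAL_FILES_DOCUMENT_ROOT
    task = {
        "data": {
            "audio": local_file_url(root, "/home/user/dataset1/1.wav"),
            "image": local_file_url(root, "/home/user/dataset1/1.jpg"),
        }
    }
    print(json.dumps(task, indent=2))
```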

Prerequisites

Add these variables to your environment setup:

Without these settings, Local storage and URLs in tasks that point to local files won’t work. Keep in mind that serving data from the local file system can be a security risk. See Set environment variables for more about using environment variables.

Set up connection in the Label Studio UI

In the Label Studio UI, do the following to set up the connection:

  1. Open Label Studio in your web browser.
  2. For a specific project, open Settings > Cloud Storage.
  3. Click Add Source Storage.
  4. In the dialog box that appears, select Local Files as the storage type.
  5. In the Storage Title field, type a name for the storage to appear in the Label Studio UI.
  6. Specify an Absolute local path to the directory with your files. The local path must be an absolute path and include the LABEL_STUDIO_LOCAL_FILES_DOCUMENT_ROOT value.
    For example, if LABEL_STUDIO_LOCAL_FILES_DOCUMENT_ROOT=/home/user, then your local path must be /home/user/dataset1. For more about that environment variable, see Run Label Studio on Docker and use local storage.
  7. (Optional) In the File Filter Regex field, specify a regular expression to filter the files in the directory. Use .* to collect all files.
  8. (Optional) Toggle Treat every bucket object as a source file.
    • Enable this option if you want to create Label Studio tasks from media files automatically, such as JPG, MP3, or similar file types. Use this option for labeling configurations with one source tag.
    • Disable this option if you want to import tasks in Label Studio JSON format directly from your storage. Use this option for complex labeling configurations with HyperText or multiple source tags.
  9. Click Add Storage.
  10. Repeat these steps for Add Target Storage to use a local file directory for exporting.

After adding the storage, click Sync to collect tasks from the directory, or make an API call to sync import storage.
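The path rule from step 6 can be checked before you enter the value in the UI. This is a minimal sketch, assuming POSIX-style paths and that LABEL_STUDIO_LOCAL_FILES_DOCUMENT_ROOT is set in the same environment. `validate_local_storage_path` is a hypothetical helper, and Label Studio performs its own validation when you add the storage.

```python
import os
from pathlib import PurePosixPath


def validate_local_storage_path(local_path, env=None):
    """Return local_path if it is absolute and inside the document root."""
    env = os.environ if env is None else env
    root = env.get("LABEL_STUDIO_LOCAL_FILES_DOCUMENT_ROOT")
    if not root:
        raise RuntimeError("LABEL_STUDIO_LOCAL_FILES_DOCUMENT_ROOT is not set")
    path = PurePosixPath(local_path)
    if not path.is_absolute():
        raise ValueError(f"{local_path} is not an absolute path")
    if ".." in path.parts:
        # Pure paths are not resolved, so reject parent-directory segments outright.
        raise ValueError(f"{local_path} must not contain '..'")
    if PurePosixPath(root) not in (path, *path.parents):
        raise ValueError(f"{local_path} is not inside {root}")
    return str(path)


if __name__ == "__main__":
    env = {"LABEL_STUDIO_LOCAL_FILES_DOCUMENT_ROOT": "/home/user"}
    print(validate_local_storage_path("/home/user/dataset1", env))
```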

Add storage with the Label Studio API

You can also create a storage connection using the Label Studio API.

Set up local storage with Docker

If you’re using Label Studio in Docker, you need to mount the local directory that you want to access as a volume when you start the Docker container. See Run Label Studio on Docker and use local storage.

Troubleshoot CORS and access problems

Troubleshoot some common problems when using cloud or external storage with Label Studio.

I can’t see the data in my tasks

Check your web browser console for errors.

Tasks or annotations do not sync

If you’re pressing the Sync button but tasks do not sync, or you can’t see the new tasks in the Data Manager, check the following:

Tasks don’t load the way I expect

If the tasks sync to Label Studio but don’t appear the way that you expect, maybe with URLs instead of images or with one task where you expect to see many, check the following: