Google Space Datasets x Label Studio

Integrations February 6, 2024

What is Space Datasets?

Space Datasets is a powerful new open-source storage framework designed to streamline the machine learning data lifecycle. This innovative framework aims to unify a wide range of use cases under a single storage solution. Key features include:

Ground Truth Database: Space stores data in open-source formats, either locally or in the cloud. It supports data ingestion, batch reading, random querying, data mutation, and version control, all of which are essential for managing your ground truth data.
OLAP Database: Integrate with SQL engines to rapidly analyze structured data (like annotations). Space borrows techniques from established data warehouses and data lakes to manage data and accelerate operations.
Data Processing Integration: Transform your data using popular ML data processing frameworks. Efficiently store processed results as materialized views that can be refreshed incrementally, streamlining the data preparation process.
Training Framework Integration: Space eliminates the need for additional conversion or RPC interfaces. Stored data can be consumed directly by ML frameworks such as TensorFlow or PyTorch, or easily converted into other popular dataset formats (TFDS, Ray, Hugging Face).

Space Datasets is a natural fit for storing annotations generated from Label Studio. Here's how it simplifies the workflow:

Write or update your annotations directly within Space Datasets using its straightforward APIs. Need to revisit previous versions? Version control functionality makes it easy.
Effortlessly transform annotations, along with associated data like images or videos, into the training-ready format required by your chosen training framework. Space Datasets creates a streamlined pipeline from annotation to training.

How Can Space Datasets Integrate with Label Studio?

Once you've exported your annotations from Label Studio, you can seamlessly integrate them into a Space dataset. Given Label Studio's support for various export formats, we will focus on the versatile JSON format for this example.

Step 1: Prepare Annotations

Space Datasets leverages Apache Arrow, an efficient columnar in-memory data format. Here's how to convert your Label Studio JSON annotations into Arrow format:

import json

import pandas as pd
import pyarrow as pa

# The exported Label Studio annotation file.
json_path = "your-label-studio-annotations.json"

# Preprocess before loading to Space.
# Drop empty fields in JSON. It is impossible to infer types for them.
with open(json_path) as f:
  labels_json = json.load(f)

# Fields that were empty in this export; adjust both lists for your own data.
EMPTY_ANNOTATION_FIELDS = [
  "draft_created_at", "prediction", "import_id", "last_action",
  "parent_prediction", "parent_annotation", "last_created_by",
]
EMPTY_ENTRY_FIELDS = [
  "drafts", "predictions", "meta", "last_comment_updated_at",
  "comment_authors",
]

for entry in labels_json:
  for annotation in entry["annotations"]:
    for field in EMPTY_ANNOTATION_FIELDS:
      # pop() with a default does not fail when the field is absent.
      annotation.pop(field, None)
  for field in EMPTY_ENTRY_FIELDS:
    entry.pop(field, None)

# Convert the JSON array to a PyArrow table `labels`.
labels = pa.Table.from_pandas(pd.json_normalize(labels_json))

During this conversion, feel free to remove unused JSON fields or retain them by providing an explicit schema. The resulting Arrow table labels will be loaded into your Space dataset.
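
The list of fields to remove does not have to be written by hand. The sketch below, which uses only the standard library, is one possible way to find top-level fields that are empty in every exported task and drop them before the Arrow conversion. The helper names (`is_empty`, `always_empty_keys`, `drop_keys`) and the sample `tasks` list are illustrative assumptions, not part of Space or Label Studio.

```python
from __future__ import annotations

from typing import Any


def is_empty(value: Any) -> bool:
    """True for values that carry no type information: None, [] or {}."""
    return value is None or value == [] or value == {}


def always_empty_keys(records: list[dict]) -> set[str]:
    """Return the keys whose value is empty in every record that has them."""
    seen: set[str] = set()
    non_empty: set[str] = set()
    for record in records:
        for key, value in record.items():
            seen.add(key)
            if not is_empty(value):
                non_empty.add(key)
    return seen - non_empty


def drop_keys(records: list[dict], keys: set[str]) -> list[dict]:
    """Return copies of the records without the given top-level keys."""
    return [{k: v for k, v in r.items() if k not in keys} for r in records]


# Hypothetical two-task export, trimmed to a few fields.
tasks = [
    {"id": 1, "drafts": [], "meta": {}, "data": {"image": "a.jpg"}},
    {"id": 2, "drafts": [], "meta": {}, "data": {"image": "b.jpg"}},
]
empty = always_empty_keys(tasks)
cleaned = drop_keys(tasks, empty)
print(sorted(empty))  # ['drafts', 'meta']
```

The same helpers can be applied to each task's `annotations` list to prune the nested fields. A field that is empty in only some tasks is kept, because the non-empty values still allow its type to be inferred.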

Step 2: Create a Space Dataset and Load Data

# Check all fields.
labels.schema.names
# >>>
# ['id', 'annotations', 'inner_id', 'total_annotations', 'cancelled_annotations',
# 'total_predictions', 'comment_count', 'unresolved_comment_count', 'project', 'updated_by',
# 'data.image']

Provide the following inputs when creating the Space dataset:

Location: The path (local or cloud-mapped) where data and metadata will be stored.
Schema: The Arrow schema derived from your annotations.
Primary keys: Space requires primary keys for data operations.
Record fields: Space stores data in Parquet files by default; use this field to indicate fields to store in separate ArrayRecord files (often used for bulky, unstructured data).

from space import Dataset

label_ds_location = "/space/labelstudio/label_ds"

# Create an empty dataset using the schema of `labels`.
label_ds = Dataset.create(label_ds_location, labels.schema,
  primary_keys=["id"], record_fields=[])
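
Because `id` is used as the primary key, it can be worth confirming that every exported task has a unique, non-null `id` before loading any data. This post does not describe how Space treats duplicate keys on append, so the following is only a cautious pre-check on the raw JSON records. The `check_primary_key` helper and the sample `tasks` are hypothetical and not part of the Space API.

```python
from collections import Counter


def check_primary_key(records, key="id"):
    """Return (missing_count, duplicated_values) for `key` across records."""
    values = [record.get(key) for record in records]
    # Records where the key is absent or null.
    missing = sum(value is None for value in values)
    # Values that occur more than once.
    counts = Counter(value for value in values if value is not None)
    duplicates = sorted(value for value, n in counts.items() if n > 1)
    return missing, duplicates


# Hypothetical records: one duplicated id and one record without an id.
tasks = [{"id": 7}, {"id": 8}, {"id": 8}, {"inner_id": 4}]
missing, duplicates = check_primary_key(tasks)
print(missing, duplicates)  # 1 [8]
```

If either result is non-zero or non-empty, fix the export (or choose a different key) before appending to the dataset.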

Now, you can perform various operations on your data within the dataset:

import pyarrow.compute as pc

# Append `labels` into the dataset.
label_ds.local().append(labels)
# Tag this version.
label_ds.add_tag("after_append")

# Check all `id`s we have, and delete 2 rows.
label_ds.local().read_all(fields=["id"])
label_ds.local().delete((pc.field("id") == 8) | (pc.field("id") == 9))
label_ds.add_tag("after_delete")

# Read an old version.
label_ds.local().read_all(version="after_append", fields=["id"])

With your annotation data now in a Space Dataset, you have a wealth of possibilities:

Analyze with SQL: Use SQL engines like DuckDB to dive deep into your annotations and extract valuable insights.
Transform: Prepare your annotations for training by combining them with other data sources like images. Space integrates with Ray transform and efficiently stores results as Materialized Views, supporting incremental processing.
Streamline Training: Feed your training-ready data directly from Space into your ML frameworks. Alternatively, use Space’s built-in conversion capabilities to transform the data into popular ML dataset formats (like TensorFlow, Ray, or Hugging Face datasets).
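
As a rough illustration of the transform step, the sketch below flattens one exported task into rows of pixel-coordinate bounding boxes. It assumes an object-detection project whose results have type `rectanglelabels`, with `x`, `y`, `width`, and `height` stored as percentages of the image size, and it ignores box rotation. The `to_pixel_boxes` function and the sample task are hypothetical; the code is plain Python and does not use the Space or Ray APIs.

```python
def to_pixel_boxes(task):
    """Flatten a task into (image, label, x_min, y_min, x_max, y_max) rows."""
    rows = []
    image = task["data"]["image"]
    for annotation in task.get("annotations", []):
        for item in annotation.get("result", []):
            # Skip results that are not rectangle labels.
            if item.get("type") != "rectanglelabels":
                continue
            value = item["value"]
            width, height = item["original_width"], item["original_height"]
            # Convert percentages of the image size into pixels.
            x_min = value["x"] * width / 100.0
            y_min = value["y"] * height / 100.0
            x_max = x_min + value["width"] * width / 100.0
            y_max = y_min + value["height"] * height / 100.0
            for label in value["rectanglelabels"]:
                rows.append((image, label, x_min, y_min, x_max, y_max))
    return rows


# Hypothetical task with a single box on a 600x400 image.
task = {
    "id": 1,
    "data": {"image": "a.jpg"},
    "annotations": [{
        "result": [{
            "type": "rectanglelabels",
            "original_width": 600,
            "original_height": 400,
            "value": {"x": 10.0, "y": 20.0, "width": 50.0, "height": 25.0,
                      "rotation": 0, "rectanglelabels": ["Car"]},
        }],
    }],
}
print(to_pixel_boxes(task))  # [('a.jpg', 'Car', 60.0, 80.0, 360.0, 180.0)]
```

A function of this shape is the kind of per-record logic that could be moved into a transform pipeline once the annotations are stored in Space.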

What’s Next?

Space Datasets is an exciting new project undergoing rapid development. We are eagerly anticipating the release of version 0.1, which aims to:

Ensure a reliable foundation for your data workflows.
Guarantee the quality and robustness of the framework.
Provide performance benchmarks so you can make informed decisions for your projects.

We envision deeper integration between Space Datasets and Label Studio in the future. Possibilities include:

Annotations completed in Label Studio could be written to Space Datasets automatically, providing a frictionless user experience.
Data stored in Space Datasets could be loaded, visualized, and modified directly in Label Studio, streamlining annotation and data management.

These enhanced integrations will significantly elevate the user experience, offering a more streamlined and efficient workflow for managing machine learning data.