
Appendix: About the Data

The dataset used for this tutorial was derived from the “Large Movie Review Dataset,” a collection of 100,000 movie reviews. The original dataset is split into training data and testing data, each containing 25,000 reviews labeled with “positive” or “negative” sentiment. The training data also includes an additional 50,000 unlabeled reviews. Each review is stored as an individual text file, with metadata encoded in the file name, the directory structure, and additional sidecar files.

To prepare the data for this tutorial, we wrote a script that walked the directory structure and captured each review and its metadata as a row. The rows were written in randomized batches, with row ranges corresponding to:

  • 0 - 25,000: Labeled training data, with positive and negative sentiment mixed.
  • 25,001 - 75,000: Unlabeled training data.
  • 75,001 - 100,000: Labeled testing data, with positive and negative sentiment mixed.

These batches were also written out as separate files for convenience. Finally, the first 100 rows of each batch were written out as separate files to support faster loading for a streamlined learning experience.
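The preparation steps described above can be sketched roughly as follows. This is not the script we actually ran, only a minimal illustration. It assumes a directory layout of `<root>/<split>/<label>/<id>_<rating>.txt` with `pos`, `neg`, and `unsup` label folders; the column names, output file names, and the functions `collect_batch`, `write_csv`, and `export_dataset` are all hypothetical.

```python
import csv
import random
from pathlib import Path

# Assumed layout: <root>/<split>/<label>/<id>_<rating>.txt
# Batch names and folder names below are illustrative assumptions.
BATCHES = [
    ("train_labeled", "train", ["pos", "neg"]),
    ("train_unlabeled", "train", ["unsup"]),
    ("test_labeled", "test", ["pos", "neg"]),
]
LABELS = {"pos": "positive", "neg": "negative", "unsup": ""}
FIELDS = ["id", "split", "rating", "sentiment", "review"]


def collect_batch(root, split, folders, seed=0):
    """Walk the given label folders and return shuffled rows for one batch."""
    rows = []
    for folder in folders:
        for path in sorted(Path(root, split, folder).glob("*.txt")):
            # Metadata is read from the file name, e.g. "123_8.txt".
            review_id, _, rating = path.stem.partition("_")
            rows.append({
                "id": review_id,
                "split": split,
                "rating": rating,
                "sentiment": LABELS[folder],
                "review": path.read_text(encoding="utf-8"),
            })
    random.Random(seed).shuffle(rows)  # mix positive and negative rows
    return rows


def write_csv(path, rows):
    """Write rows to a CSV file with a fixed column order."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)


def export_dataset(root, out_dir, sample_size=100):
    """Write one combined file, one file per batch, and a small sample per batch."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    all_rows = []
    for name, split, folders in BATCHES:
        rows = collect_batch(root, split, folders)
        write_csv(out / f"{name}.csv", rows)
        write_csv(out / f"{name}_sample.csv", rows[:sample_size])
        all_rows.extend(rows)
    write_csv(out / "all.csv", all_rows)
    return all_rows
```

Calling `export_dataset("path/to/reviews", "out")` would produce the combined file, the three batch files, and a 100-row sample of each batch. Shuffling happens within each batch, so the batch order in the combined file stays fixed.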

We noticed a small bug in the data and updated the file to drop the empty “Sentiment” column. This allows the dataset to work with Label Studio.
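A fix like this can be sketched in a few lines of Python. The snippet below is only an illustration, assuming the data is a CSV file with a header row; the function name `drop_empty_column` and the file names in the usage note are hypothetical. It removes the column only when every value in it is blank, so a populated “Sentiment” column is left untouched.

```python
import csv


def drop_empty_column(src, dst, column="Sentiment"):
    """Copy a CSV from src to dst, removing `column` only if all its values are blank.

    Returns True if the column was dropped, False otherwise.
    """
    with open(src, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        fields = list(reader.fieldnames or [])
        rows = list(reader)

    # A missing cell is read as None, so treat None and whitespace as blank.
    is_empty = column in fields and all(
        not (row[column] or "").strip() for row in rows
    )
    if is_empty:
        fields.remove(column)

    with open(dst, "w", newline="", encoding="utf-8") as f:
        # extrasaction="ignore" skips the dropped column still present in each row.
        writer = csv.DictWriter(f, fieldnames=fields, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(rows)
    return is_empty
```

For example, `drop_empty_column("unlabeled.csv", "unlabeled_fixed.csv")` would write a copy without the blank column and return `True`.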

Our thanks to Andrew Maas for providing this free dataset from their research.