Leveraging Great Expectations with Label Studio to Test Your Data
This webinar tutorial, presented by Superconductive Head of Product Eugene Mandel and moderated by Heartex Head of Open Source Community Michael Ludden, showcases how to use Superconductive's Great Expectations in concert with Label Studio to test your data.
Transcript
Welcome to another Label Studio partner webinar. Today’s session features Eugene Mandel, Head of Product at Superconductive, the creators of the open source data testing framework Great Expectations.
Before we begin, a quick note: if you'd like to submit questions during the webinar, please join our Label Studio community Slack using the link in the video description or on the screen. There will be a Q&A with Eugene at the end.
A few housekeeping items: we’re launching a monthly Label Studio newsletter with blog posts, tutorials, release info, and featured community content. You can subscribe at labelstudio.substack.com. We also have another community conversation webinar coming soon, as well as a new blog post about NLP labeling best practices by our documentation lead and CTO—check our Slack announcements for details.
Now, let’s dive into Great Expectations with Eugene Mandel.
Eugene is Head of Product at Superconductive and has long worked with data across startups. Before Superconductive, he helped build customer support bots, where most of the real work was building pipelines and labeling workflows with humans in the loop. That experience affirmed a belief: better data leads to better algorithms.
Great Expectations is an open source library for testing data pipelines. It was started in 2018 by Abe Gong and James Campbell, who now lead Superconductive. The project is the most popular open source tool for data validation, with thousands of GitHub stars and an active contributor community.
So, what problems does Great Expectations solve? It tackles two types of data quality issues: pipeline risks (e.g. outages or missing files) and data risks (e.g. drift, outliers, or misunderstood inputs). For example, a real estate dataset might contain outliers that throw off a model—data that isn’t “bad,” but needs to be interpreted correctly.
Great Expectations addresses these problems using expectations—structured statements about data. For example, "I expect values in this column to be between 1 and 6" can be encoded in code, configuration, or plain language. Expectations can validate whether data meets the criteria, and failed validations include rich output showing exactly what went wrong.
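As a rough sketch of what that looks like in code (the file and column names here are hypothetical, not from the demo), the "between 1 and 6" statement becomes a single method call on a Great Expectations dataset:

```python
import great_expectations as ge

# Wrap a CSV in a Great Expectations dataset; "ratings.csv" and the
# "stars" column are hypothetical examples.
df = ge.read_csv("ratings.csv")

# Encode the expectation: values in "stars" should fall between 1 and 6.
result = df.expect_column_values_to_be_between("stars", min_value=1, max_value=6)

# The validation result includes a success flag and, on failure,
# a sample of the unexpected values.
print(result)
```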
There are expectations for checking column existence, value ranges, uniqueness, null values, regex matches, statistical properties, and even distribution shape. As an open standard, Great Expectations also supports domain-specific expectations—like checking for valid FDA codes or geographic formats—which are contributed by the broader community.
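A few of those built-in checks, again with hypothetical column names, look like this:

```python
import great_expectations as ge

# A hypothetical user table used only to illustrate built-in expectations.
users = ge.read_csv("users.csv")

users.expect_column_to_exist("user_id")                       # column existence
users.expect_column_values_to_be_unique("user_id")            # uniqueness
users.expect_column_values_to_not_be_null("email")            # null values
users.expect_column_values_to_match_regex("email", r".+@.+")  # regex match
users.expect_column_mean_to_be_between("age", 18, 90)         # statistical property
```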
Expectations come from either domain expertise (e.g. knowing what a valid temperature is) or data profiling (e.g. analyzing 1,000 hourly log files to infer characteristics). A profiler can automatically generate expectations, which can be reviewed and refined with expert knowledge.
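For the profiling route, recent releases ship a UserConfigurableProfiler that can bootstrap a suite from one representative batch. A sketch, with a hypothetical file name; the profiler's import path and options vary across Great Expectations versions:

```python
import great_expectations as ge
from great_expectations.profile.user_configurable_profiler import (
    UserConfigurableProfiler,
)

# Profile one representative hourly log file to infer candidate expectations.
batch = ge.read_csv("logs/2021-06-01-00.csv")
suite = UserConfigurableProfiler(profile_dataset=batch).build_suite()

# Review the generated expectations, then prune or tighten them with
# domain knowledge before saving the suite.
for expectation in suite.expectations:
    print(expectation.expectation_type, expectation.kwargs)
```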
Once expectations are defined, they’re used to validate incoming data batches. Validation results can be rendered into HTML data docs—a feature the community loves because it helps teams communicate data issues clearly. Instead of explaining bad data in long emails, you can just share a data doc showing the failed expectation and example rows.
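In a project initialized with `great_expectations init`, that validate-then-share loop might look roughly like the following. The datasource and suite names are hypothetical, and the exact DataContext calls differ between Great Expectations versions:

```python
from great_expectations.data_context import DataContext

# Load the project configuration created by `great_expectations init`.
context = DataContext()

# Validate a new batch of data against a previously saved suite.
batch = context.get_batch(
    batch_kwargs={"datasource": "my_pandas_datasource", "path": "data/new_batch.csv"},
    expectation_suite_name="my_suite",
)
results = context.run_validation_operator(
    "action_list_operator", assets_to_validate=[batch]
)

# Render the results into HTML data docs and open them in a browser,
# so the failed expectation can be shared instead of a long email.
context.build_data_docs()
context.open_data_docs()
```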
This aligns with a key principle: your tests are your docs, and your docs are your tests. When tests generate documentation automatically, you keep docs up-to-date without extra effort.
Beyond the expectations themselves, the library includes tools for configuration, validation storage, and pipeline integration across different compute engines like Pandas, Spark, and SQLAlchemy.
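To illustrate that portability across engines, here is a sketch using the older Dataset-style API and assuming pyspark is installed: define a suite once against Pandas, then evaluate the identical suite with Spark.

```python
import pandas as pd
from pyspark.sql import SparkSession
from great_expectations.dataset import PandasDataset, SparkDFDataset

# A tiny illustrative table, built once and loaded into both engines.
spark = SparkSession.builder.master("local[1]").getOrCreate()
pdf = pd.DataFrame({"stars": [1, 2, 5, 6]})

# Define the suite with the Pandas-backed dataset...
ge_pandas = PandasDataset(pdf)
ge_pandas.expect_column_values_to_be_between("stars", min_value=1, max_value=6)
suite = ge_pandas.get_expectation_suite()

# ...then run the identical suite with the Spark-backed dataset.
ge_spark = SparkDFDataset(spark.createDataFrame(pdf))
print(ge_spark.validate(expectation_suite=suite))
```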
You can get started on GitHub, explore the documentation with tutorials and integrations, and join the Great Expectations Slack community to collaborate with other users and contributors.
Finally, Eugene demonstrated using Jupyter notebooks to create and validate expectations on a real dataset. For example, if you expect passenger counts to be positive integers, you can define and test that in a few lines of code. If the data violates the rule, the system shows exactly what went wrong.
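A rough reconstruction of that kind of check, with a hypothetical sample file standing in for the dataset from the demo:

```python
import great_expectations as ge

# Hypothetical batch of taxi trip data with a "passenger_count" column.
trips = ge.read_csv("yellow_tripdata_sample.csv")

# Passenger counts should be integers of at least 1.
trips.expect_column_values_to_be_of_type("passenger_count", "int64")
result = trips.expect_column_values_to_be_between("passenger_count", min_value=1)

# A failed validation reports how many rows broke the rule and shows examples.
print(result)
```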
Great Expectations is especially powerful for validating inputs and outputs of machine learning models. It can prevent garbage-in/garbage-out scenarios by flagging problematic data before models use it. This is particularly useful in real-time or production ML environments.
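One way to wire such a check in front of a model, sketched with hypothetical names (validate_inputs, model.predict, the alerting step) rather than any official integration:

```python
import pandas as pd
from great_expectations.dataset import PandasDataset

def validate_inputs(frame: pd.DataFrame, suite) -> bool:
    """Return True only if the incoming batch satisfies the expectation suite."""
    results = PandasDataset(frame).validate(expectation_suite=suite)
    return bool(results.success)

# Hypothetical guard in an inference path: only score batches that pass.
# if validate_inputs(incoming_frame, suite):
#     predictions = model.predict(incoming_frame)
# else:
#     ...  # quarantine the batch and alert the team
```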
During the Q&A, Eugene discussed upcoming plans for a SaaS version with a more user-friendly interface for non-coders, feedback loops for community-driven expectations, and ongoing collaboration between Label Studio and Great Expectations communities.