Introduction

In this part, we:

  • Define what data consistency tests are.
  • Identify data consistency issues and create tests for them.
  • Commit a new version with the issues solved to the platform.

Data consistency tests

The next category of tests we explore is data consistency.

What are data consistency tests?

While data integrity tests focus on datasets individually, consistency tests are associated with the relationship between two datasets, namely the training and validation sets.

Ultimately, the training and validation sets should be consistent. Otherwise, there is no point in training the model on a dataset that looks nothing like the one used to validate it. Consistency tests help ensure that these datasets are sufficiently aligned while maintaining the separation between them.

Identify data consistency issues

Let’s explore some of the data consistency tests.

Click on “Consistency” to view all consistency tests.

“Consistency” is under “Create tests” on the left sidebar.

Again, at a glance, we can already spot some consistency issues.

Actionable insights

Notice there are:

  • Data leakage between the training and validation sets.
  • Feature drift.
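Feature drift, for instance, can also be quantified outside the platform. The sketch below is a minimal, hypothetical implementation of the Population Stability Index (PSI) for a single numeric column — the function name and thresholds are illustrative, not part of the platform's API:

```python
import numpy as np

def psi(train_col, val_col, bins=10):
    """Population Stability Index between two numeric samples.

    Rule of thumb: PSI < 0.1 means stable, 0.1-0.25 moderate drift,
    > 0.25 severe drift.
    """
    # Bin edges come from the training data, so both splits are
    # compared on the same grid
    edges = np.histogram_bin_edges(train_col, bins=bins)
    train_pct = np.histogram(train_col, bins=edges)[0] / len(train_col)
    val_pct = np.histogram(val_col, bins=edges)[0] / len(val_col)
    # Clip to a small epsilon to avoid division by zero and log(0)
    eps = 1e-6
    train_pct = np.clip(train_pct, eps, None)
    val_pct = np.clip(val_pct, eps, None)
    return float(np.sum((val_pct - train_pct) * np.log(val_pct / train_pct)))
```

A column drawn from the same distribution in both splits yields a PSI near zero, while a shifted distribution pushes it well above the 0.25 "severe" mark.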

Create data consistency tests

Leakage is probably the most critical issue among the ones identified. If rows from the training set have contaminated the validation set, all the aggregate metrics we saw for the model are probably erroneously optimistic.

Create a no data leakage test.

Create a test that fails when there is leakage between the training and validation sets. You should set the appropriate threshold.

It is easy to identify the indices of the leaked rows by inspecting the test report. The solution is then to drop the leaked rows from one of the datasets and retrain the model.
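As a rough sketch of that workflow in pandas (the frames below are toy, hypothetical data; on the platform the leaked indices come from the test report instead):

```python
import pandas as pd

# Toy, hypothetical training and validation frames
train_df = pd.DataFrame({"age": [25, 31, 42],
                         "income": [40, 55, 80],
                         "label": [0, 1, 1]})
val_df = pd.DataFrame({"age": [31, 42, 19],
                       "income": [55, 80, 30],
                       "label": [1, 1, 0]})

# Flag validation rows that also appear, on all columns, in the training set
merged = val_df.merge(train_df.drop_duplicates(), how="left", indicator=True)
leaked_indices = val_df.index[(merged["_merge"] == "both").to_numpy()]

# Drop the leaked rows from the validation set; the model is then retrained
clean_val_df = val_df.drop(index=leaked_indices)
print(list(leaked_indices))  # [0, 1]
```

Dropping from the validation set keeps the training data (and any model already trained on it) intact; dropping from the training set instead is equally valid but forces a retrain either way.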

Create more data consistency tests.

You are encouraged to create tests even when everything looks just right.

For example, even though there was no label drift between the training and validation sets, creating such a test puts a guardrail in place to ensure all future dataset versions also do not suffer from this problem.
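Outside the platform, the same guardrail could be sketched as a simple distribution comparison. The helper below (hypothetical name and threshold, not the platform's API) uses total variation distance between the label frequencies of the two splits:

```python
import pandas as pd

def label_drift(train_labels, val_labels, threshold=0.05):
    """Total variation distance between the label distributions of two splits.

    Returns (drifted, distance); drifted is True when distance > threshold.
    """
    p = pd.Series(train_labels).value_counts(normalize=True)
    q = pd.Series(val_labels).value_counts(normalize=True)
    # Align on the union of labels so classes missing from one split count as 0
    labels = p.index.union(q.index)
    distance = 0.5 * (p.reindex(labels, fill_value=0)
                      - q.reindex(labels, fill_value=0)).abs().sum()
    return bool(distance > threshold), float(distance)
```

For now this check passes (distance near zero), and it will start failing if a future dataset version shifts the label balance.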

For now, this test is passing, but if something changes in the future, it won’t go unnoticed.

Commit a new version

Again, we are at a point where we can solve the failing issues and add another commit to our project.

New Colab notebook

Here is a new Colab notebook where we solve some of the data consistency tests and commit the new version to the platform.

Alternatively, if you don’t want to solve the issues just yet, feel free to continue the tutorial and commit the solution to multiple issues in one go in the next parts.

Until now, we have focused only on the data component of improving our model. In the next part of the tutorial, we will take a deeper look at the model itself.