In this part, we:
- Define what data integrity tests are.
- Identify data integrity issues and create tests for them.
- Use supporting information to diagnose the issues.
- Commit a new version with the issues solved to the platform.
Data integrity tests
Let’s start with the first category of tests: data integrity.
What are data integrity tests?
Data integrity tests relate to the quality of the training and validation datasets.
By creating such tests, we not only identify data quality issues that affect our model’s performance but also make sure the same problems do not resurface in future datasets.
Identify data integrity issues
We can now explore the data integrity tests for the training set.
Click on “Integrity” to view all integrity tests.
“Integrity” is in the left sidebar, under “Create tests.” Then select “Training” in the upper left corner to display the information for the training set.
At a glance, we can already spot multiple issues with our training data.
Notice there are:
- duplicate rows
- many missing values
- quasi-constant features
- and more.
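To make these issue types concrete, here is a minimal sketch of how the first three could be detected by hand. It is not how the platform computes them: the toy rows, the function names, and the 0.8 cutoff for "quasi-constant" are all assumptions chosen for illustration, and the code uses only the Python standard library.

```python
from collections import Counter

# Hypothetical toy training set; None marks a missing value.
rows = [
    {"age": 34, "country": "US", "plan": "basic"},
    {"age": 34, "country": "US", "plan": "basic"},  # exact duplicate of the first row
    {"age": None, "country": "US", "plan": "pro"},
    {"age": 51, "country": "US", "plan": "basic"},
    {"age": 29, "country": "BR", "plan": None},
]


def count_duplicate_rows(rows):
    """Count rows that exactly repeat an earlier row."""
    counts = Counter(tuple(sorted(row.items())) for row in rows)
    return sum(n - 1 for n in counts.values())


def missing_counts(rows):
    """Count missing (None) values per column."""
    return {col: sum(row[col] is None for row in rows) for col in rows[0]}


def quasi_constant_columns(rows, min_share=0.8):
    """List columns where one value covers at least min_share of the rows.

    The 0.8 default is an arbitrary cutoff for this toy example.
    """
    flagged = []
    for col in rows[0]:
        top_count = Counter(row[col] for row in rows).most_common(1)[0][1]
        if top_count / len(rows) >= min_share:
            flagged.append(col)
    return flagged


print(count_duplicate_rows(rows))    # 1
print(missing_counts(rows))          # {'age': 1, 'country': 0, 'plan': 1}
print(quasi_constant_columns(rows))  # ['country']
```

On a real dataset, a dataframe library would be the more natural tool; the point here is only what each issue means.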
All of these are likely affecting our model’s performance.
Create data integrity tests
We can start creating our first tests to ensure such problems don’t go unnoticed.
Click on “Create test” under the duplicate rows issue.
A modal will appear asking you for more information.
Among other things, the modal asks us for a threshold.
Every test has a threshold, which is the condition that must be satisfied so that its status is considered passing.
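Conceptually, a threshold pairs a measured value with a comparison. The sketch below illustrates that idea only; `evaluate_test` and its parameters are hypothetical names, not part of the platform’s API.

```python
import operator

# Comparison operators a threshold might use (an assumed set, for illustration).
COMPARISONS = {
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
    "==": operator.eq,
}


def evaluate_test(measured, comparison, threshold):
    """Return 'passing' if the measured value satisfies the threshold condition."""
    satisfied = COMPARISONS[comparison](measured, threshold)
    return "passing" if satisfied else "failing"


# A "no duplicate rows" style test: the duplicate count must be <= 0.
print(evaluate_test(measured=12, comparison="<=", threshold=0))  # failing
print(evaluate_test(measured=0, comparison="<=", threshold=0))   # passing
```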
After clicking on “Create test,” the test is created. If we go back to the project’s home page from the left sidebar, we see a card for the new test, signaling that it is failing.
Diagnose data integrity issues
Identifying an issue is the first step towards abandoning guesswork. The second step is diagnosing why it is happening. That’s the role of the test report page.
Click on the newly created “No duplicate rows” test.
Clicking on the test card opens the test overview. Its purpose is to support quick diagnostics of the issue.
The test overview page has two parts.
The left-hand part provides supporting information to help us figure out how to solve the issue. In this case, we see the duplicate rows and the number of times each row appears in the training set. However, each test has an overview page tailored to help with its specific diagnostics, so other test overviews will show different information.
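The same kind of information (each duplicated row and how many times it appears) can be reproduced locally. This is a minimal sketch under stated assumptions: the rows are made up, and `duplicate_row_report` is a hypothetical helper, not something the platform exposes.

```python
from collections import Counter


def duplicate_row_report(rows):
    """Return (row, count) pairs for rows that appear more than once,
    ordered from most to least frequent."""
    counts = Counter(tuple(sorted(row.items())) for row in rows)
    return [(dict(key), n) for key, n in counts.most_common() if n > 1]


# Hypothetical training rows.
training_rows = [
    {"id": 1, "label": "churn"},
    {"id": 2, "label": "stay"},
    {"id": 1, "label": "churn"},
    {"id": 3, "label": "stay"},
    {"id": 2, "label": "stay"},
    {"id": 1, "label": "churn"},
]

for row, count in duplicate_row_report(training_rows):
    print(count, row)
```

Seeing which rows repeat, and how often, can help confirm or rule out a hypothesis such as a data pipeline step running more than once.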
The right-hand part contains the activity log. It registers every relevant change to the test. For instance, it will log if someone alters the test’s threshold. Additionally, it supports comments so that everyone from the same workspace can collaborate on diagnosing the test.
Leave a comment.
You might have a hypothesis about why there are duplicate rows in the training set. Maybe it was an issue with the data pipeline. Leave a comment on the test to let the rest of the team know.
Finally, as we add new dataset versions to the project, the new test statuses get logged. This ensures that the full picture of a test is always available.
Create more data integrity tests.
Duplicate rows are not the only data integrity issue, and the training set is not the only dataset affected by such issues. Feel free to spend some time creating more data integrity tests and exploring each test report.
Commit a new version
With more data integrity tests created, we can start solving the failing tests and committing new versions to the platform.
In general, the solutions to the problems identified can vary and require a new iteration of model training. The failing tests here are no different.
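As one illustration of what a fix can look like, the sketch below removes exact duplicate rows while keeping the first occurrence of each. It is an assumed, simplified example and not the code from the tutorial’s notebook; `drop_duplicate_rows` is a hypothetical helper. After a change like this, the model would still need to be retrained on the cleaned data.

```python
def drop_duplicate_rows(rows):
    """Return the rows with exact duplicates removed, keeping first occurrences."""
    seen = set()
    unique_rows = []
    for row in rows:
        key = tuple(sorted(row.items()))  # hashable fingerprint of the row
        if key not in seen:
            seen.add(key)
            unique_rows.append(row)
    return unique_rows


# Hypothetical rows with one repeated record.
raw_rows = [
    {"id": 1, "label": "churn"},
    {"id": 2, "label": "stay"},
    {"id": 1, "label": "churn"},
]

clean_rows = drop_duplicate_rows(raw_rows)
print(len(raw_rows), "->", len(clean_rows))  # 3 -> 2
```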
New Colab notebook
Here is a new Colab notebook where we resolve all the failing data integrity tests and commit the new version to the platform.
Alternatively, if you don’t want to solve the issues just yet, feel free to continue the tutorial and commit the solution to multiple issues in one go in the next parts.
Notice that all our data integrity tests are now passing! Some aggregate metrics also improved slightly.
In the next part of the tutorial, we explore data consistency tests!