Data integrity goals
Let’s start with the first category of goals: data integrity.
What are data integrity goals?
Data integrity goals relate to the quality of the training and validation datasets.
By creating such goals, we not only identify data quality issues that influence our model’s performance but also make sure the same problems do not resurface in future datasets.
We can now explore the data integrity goals for the training set.
Browsing data integrity goals
Return to the “Create” goals page on the data integrity tab. Then select “Training” in the upper-left corner to display the information for the training set.
At a glance, we can already spot multiple issues with our training data.
Notice there are:
- duplicate rows;
- many missing values;
- quasi-constant features;
- and more.
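To get a feel for what these checks measure, here is a minimal pandas sketch that counts the same kinds of issues locally. It is not how the platform computes them: the function name `summarize_integrity_issues`, the 0.95 cutoff for "quasi-constant", and the toy data are all assumptions made for illustration.

```python
import pandas as pd


def summarize_integrity_issues(df: pd.DataFrame, quasi_constant_ratio: float = 0.95) -> dict:
    """Count a few common data integrity issues in a non-empty DataFrame.

    `quasi_constant_ratio` is a hypothetical cutoff: a column is flagged when
    its most frequent value covers at least this fraction of the rows.
    """
    # Rows that exactly repeat an earlier row.
    duplicate_rows = int(df.duplicated().sum())
    # Total number of empty cells (NaN/None) across all columns.
    missing_values = int(df.isna().sum().sum())
    # Columns dominated by a single value (missing values count as a value).
    quasi_constant = [
        col
        for col in df.columns
        if df[col].value_counts(normalize=True, dropna=False).iloc[0] >= quasi_constant_ratio
    ]
    return {
        "duplicate_rows": duplicate_rows,
        "missing_values": missing_values,
        "quasi_constant_features": quasi_constant,
    }


# Toy training set, made up for this example.
training_set = pd.DataFrame(
    {
        "age": [25, 25, 31, None],
        "country": ["US", "US", "US", "US"],
    }
)
summary = summarize_integrity_issues(training_set)
print(summary)
```

On this toy data, the summary reports one duplicate row, one missing value, and `country` as a quasi-constant feature.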
These issues are all likely affecting our model’s performance.
Creating a data integrity goal
We can start creating our first goals to ensure such problems don’t go unnoticed.
Creating a no duplicate rows goal
Create a no duplicate rows goal. To do so, click on “Create” under the duplicate rows issue. A modal will appear asking you for more information.
Among other things, the modal asks us for a threshold.
Every goal has a threshold, which is the condition that must be satisfied so that its status is considered passing.
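Conceptually, a threshold turns a measurement into a pass/fail status. The sketch below illustrates the idea for a no duplicate rows goal, assuming the threshold is an upper bound on the number of duplicate rows. The function `no_duplicate_rows_goal_passes` and its `max_duplicates` parameter are hypothetical names, not part of the platform.

```python
import pandas as pd


def no_duplicate_rows_goal_passes(df: pd.DataFrame, max_duplicates: int = 0) -> bool:
    """Return True when the duplicate-row count is within the threshold.

    `max_duplicates` plays the role of the goal's threshold (an assumption
    about its meaning; the platform may define it differently).
    """
    duplicate_count = int(df.duplicated().sum())
    return duplicate_count <= max_duplicates


# Toy data with exactly one duplicated row.
training_set = pd.DataFrame({"a": [1, 1, 2], "b": ["x", "x", "y"]})

strict_status = no_duplicate_rows_goal_passes(training_set)  # threshold of 0
lenient_status = no_duplicate_rows_goal_passes(training_set, max_duplicates=1)
print(strict_status, lenient_status)
```

With a threshold of zero the goal fails on this data, while allowing one duplicate makes it pass.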
After clicking on “Create goal,” the goal is created and we are redirected to the project’s goals page. There, we see a card for the goal we have just created, signaling that it is failing.
Diagnosing data integrity issues
Identifying an issue is the first step towards abandoning guesswork. The second step is diagnosing why it is happening. That’s the role of the goal report page.
Click on the newly created “No duplicate rows in the training set” goal.
Clicking on the goal card opens the goal report. Its purpose is to support quick diagnostics of the issue.
The goal report page has two parts.
The left-hand part provides supporting information to help us figure out how to fix the failing goal. In this case, we see the duplicate rows and the number of times each row appears in the training set. Each goal has a report tailored to its specific diagnostics, so other goal reports will show different information.
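A similar view of duplicate rows and their counts can be reproduced locally with pandas. This is only a sketch: `duplicate_row_counts` is a hypothetical helper, and it assumes the dataset has no column already named `count`.

```python
import pandas as pd


def duplicate_row_counts(df: pd.DataFrame) -> pd.DataFrame:
    """List each row that appears more than once, with its number of appearances."""
    counts = (
        # Group on every column so identical rows fall into the same group;
        # dropna=False keeps rows that contain missing values.
        df.groupby(list(df.columns), dropna=False)
        .size()
        .reset_index(name="count")
    )
    # Keep only the rows that occur at least twice.
    return counts[counts["count"] > 1].reset_index(drop=True)


# Toy data: the row (1, "x") appears twice.
training_set = pd.DataFrame({"a": [1, 1, 2], "b": ["x", "x", "y"]})
report = duplicate_row_counts(training_set)
print(report)
```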
The right-hand part contains the activity log, which registers every relevant change to the goal. For instance, it logs when someone alters the goal’s threshold (which can be done from the goal metadata). It also supports comments, so that everyone in the same workspace can collaborate on diagnosing the goal.
Leave a comment
You might have a hypothesis about why there are duplicate rows in the training set. Maybe it was an issue with the data pipeline. Leave a comment on the goal to let the rest of the team know.
Finally, as we add new dataset versions to the project, the new goal statuses get logged. This ensures that the full picture of a goal is always available.
Create more data integrity goals
Duplicate rows are not the only data integrity issue, and the training set is not the only dataset affected by such issues. Feel free to spend some time creating more data integrity goals and exploring each goal report.
Committing a new version
With more data integrity goals created, we are at a good point to start fixing the failing goals and committing new versions to the platform.
Solving failing goals
In general, the solutions to the problems identified vary and usually require a new iteration of model training. The failing goals here are no different. Here is a new Colab notebook where we fix all the failing data integrity goals and commit the new version to the platform.
Alternatively, if you don’t want to solve the issues just yet, feel free to continue the tutorial and commit the solution to multiple issues in one go in the next parts.
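Whichever route you choose, the kind of fix involved can be sketched locally with pandas. The example below is an illustration under stated assumptions, not the notebook’s actual code: it drops exact duplicate rows and fills missing numeric values with the column median, which is only one of several reasonable strategies. The helper name `clean_training_set` and the toy data are hypothetical.

```python
import pandas as pd


def clean_training_set(df: pd.DataFrame) -> pd.DataFrame:
    """Drop exact duplicate rows and impute missing numeric values.

    Median imputation is an assumption made for this sketch; the right
    strategy depends on the dataset and the model.
    """
    # Remove rows that exactly repeat an earlier row.
    cleaned = df.drop_duplicates().reset_index(drop=True)
    # Fill gaps in numeric columns with each column's median.
    numeric_cols = cleaned.select_dtypes(include="number").columns
    cleaned[numeric_cols] = cleaned[numeric_cols].fillna(cleaned[numeric_cols].median())
    return cleaned


# Toy training set with one duplicate row and one missing value.
training_set = pd.DataFrame(
    {
        "age": [25, 25, 31, None],
        "country": ["US", "US", "US", "US"],
    }
)
cleaned = clean_training_set(training_set)
print(cleaned)
```

After changes like these, the model would typically be retrained on the cleaned data before a new version is committed.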
Notice that all our data integrity goals are now passing! On top of that, some aggregate metrics improved slightly.
In the next part of the tutorial, we explore data consistency goals!