Data consistency goals
The next category of goals we explore is data consistency.
What are data consistency goals?
While data integrity goals focus on datasets individually, consistency goals are associated with the relationship between two datasets, namely the training and validation sets.
Ultimately, the training and validation sets should be consistent. Otherwise, there is no point in training the model on a dataset that looks nothing like the one used to validate it. Consistency goals help ensure that these datasets are sufficiently aligned while remaining properly separated.
Let’s explore some of the data consistency goals.
Browsing data consistency goals
Navigate to the data consistency tab on the goal creation page.
Again, at a glance, we can already spot some consistency issues.
Notice there are:
- data leakage between training and validation sets;
- feature drift.
Creating a data consistency goal
Leakage is probably the most critical issue among the ones identified. If rows from the training set contaminate the validation set, the model is evaluated on data it has already seen, so the aggregate metrics we saw for the model are probably overly optimistic.
Create a no data leakage goal
Create a goal that fails when there is leakage between the training and validation sets. Set the threshold according to how much leakage you are willing to tolerate; for a strict check, that means no leaked rows at all.
The goal report lists the indices of the leaked rows, so they are easy to identify. The fix is then to drop the leaked rows from one of the datasets and re-train the model.
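If you want to reproduce this check locally before committing, the idea can be sketched with pandas. This is a minimal illustration, not the platform's own implementation: the helper name `find_leaked_rows`, the toy columns, and the choice to treat a row as leaked only when it matches a training row on every shared column are all assumptions.

```python
import pandas as pd


def find_leaked_rows(train_df: pd.DataFrame, val_df: pd.DataFrame) -> list:
    """Return the validation-set index labels of rows that also appear in training.

    Hypothetical helper: a row counts as leaked when it matches a training
    row exactly on every column the two datasets share.
    """
    shared_cols = [col for col in val_df.columns if col in train_df.columns]
    # Deduplicate training rows so the left merge keeps one output row per validation row.
    train_unique = train_df[shared_cols].drop_duplicates()
    merged = val_df[shared_cols].merge(
        train_unique, on=shared_cols, how="left", indicator=True
    )
    # "_merge" is "both" when the validation row was also found in training.
    leaked_mask = (merged["_merge"] == "both").to_numpy()
    return val_df.index[leaked_mask].tolist()


# Toy data (made up for illustration).
train = pd.DataFrame(
    {"age": [25, 32, 47, 51], "plan": ["a", "b", "a", "c"], "churn": [0, 1, 0, 1]}
)
val = pd.DataFrame(
    {"age": [32, 29, 51], "plan": ["b", "a", "c"], "churn": [1, 0, 1]},
    index=[10, 11, 12],
)

leaked_indices = find_leaked_rows(train, val)
# Drop the leaked rows from the validation set before re-training and re-evaluating.
clean_val = val.drop(index=leaked_indices)
```

Here the rows are dropped from the validation set, which keeps the training data intact; dropping them from the training set instead is equally valid and depends on which dataset you can better afford to shrink.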
Create more data consistency goals
You are encouraged to create goals even when everything looks just right.
For example, even though there is no label drift between the training and validation sets, creating such a goal puts a guardrail in place so that future dataset versions are checked for this problem as well.
For now, this goal is passing, but if something changes in the future, it won’t go unnoticed.
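To make the label drift guardrail concrete, here is a rough sketch of how such a check could be computed by hand. It is an assumption-laden illustration: the platform's actual drift metric is not specified here, so the sketch uses the total variation distance between the two label distributions, and both the helper name `label_distribution_gap` and the `0.1` threshold are hypothetical.

```python
import pandas as pd


def label_distribution_gap(train_labels: pd.Series, val_labels: pd.Series) -> float:
    """Total variation distance between two label distributions (0 = identical, 1 = disjoint).

    Hypothetical helper; the platform may measure label drift differently.
    """
    train_freq = train_labels.value_counts(normalize=True)
    val_freq = val_labels.value_counts(normalize=True)
    # Align on the union of classes; a class missing from one set counts as frequency 0.
    diff = train_freq.sub(val_freq, fill_value=0.0).abs()
    return 0.5 * float(diff.sum())


# Toy labels (made up for illustration): 75% / 25% in training, 50% / 50% in validation.
train_labels = pd.Series([0, 0, 0, 1])
val_labels = pd.Series([0, 1])

gap = label_distribution_gap(train_labels, val_labels)

# Assumed threshold, chosen only for this example.
MAX_LABEL_GAP = 0.1
goal_passes = gap <= MAX_LABEL_GAP
```

Running a check like this on every new dataset version mirrors what the goal does automatically: it stays silent while the distributions match and flags the version as soon as they diverge beyond the threshold.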
Committing a new version
Solving failing goals
Again, we are at a point where we can fix the failing goals and add another commit to our project. Here is a new Colab notebook where we solve some of the data consistency goals and commit the new version to the platform.
Alternatively, if you don’t want to solve the issues just yet, feel free to continue the tutorial and commit fixes for several issues at once in a later part.
So far, we have focused only on the data component of improving our model. In the next part of the tutorial, we will take a closer look at the model itself.