In the previous parts of the tutorial, we’ve been exploring the Report, which contains powerful tools for model validation and debugging. Now, we will step back and return to the task page to explore another aspect of error analysis: testing.
Returning to the Project page
Please return to the project page to continue the tutorial.
Let’s briefly talk about testing.
Test-driven development is common practice in software engineering. In ML, a closely related field, tests are not nearly as common as they should be.
Testing in ML (if done at all) usually consists of a single engineer writing a script to check a few cases that came up during a hasty error analysis. However, thorough testing goes a long way toward ensuring model quality, helping practitioners catch mistakes proactively rather than retroactively.
The test suite is at your disposal to guarantee you are systematically moving in the right direction and that the issues identified in this iteration of error analysis won't haunt your future model versions.
In this part of the tutorial, we will create a metric test. The goal is to assert that every future model we commit to the platform has a better recall for the class Urgent than our initial version.
To check out the other testing possibilities, refer to the testing page.
In the project page, notice that there is a block with all the tests created. If you have been following the tutorial, yours will not display any tests yet.
To create your first test, click on the Create a test button in the upper right corner of the Tests section.
You will be redirected to the test creation page. The first thing you’ll see, at the top of the page, is the test category to select. For now, for NLP data, we offer Metric, Confidence, and Invariance tests.
If you would like to use other testing frameworks, feel free to reach out so that we can accommodate your needs!
Your first test will be a metric test for our urgent event classification model.
Metric tests are extremely powerful. The idea is to assert that the model performance, measured by an aggregate metric, is above a specified threshold for a certain user group.
Let’s create a test that asserts the recall for future models is above 0.8 for the class Urgent.
First, select Metric on the Category panel on the Test page. After selecting it, the configuration section should appear below it.
Now, we can define the test’s configuration:
- Metric: the aggregate metric of choice. In our case, we will select the recall, which was the problematic metric we identified in the previous parts of the tutorial;
- Pass threshold: the value that needs to be surpassed for a test to be considered successful. In our case, since we want the recall for the class Urgent to be above 0.8 and we will use a data cohort with only samples from this class, we will set the threshold to 0.4 (i.e., half). This is because the test asserts that the overall recall, averaged over our two classes, is above the threshold; since the cohort contains no samples from the other class, that class contributes zero to the average.
Finally, in the Data section, we can define the data cohort over which the aggregate metric is computed. Let’s select the rows for which the label is equal to Urgent using the filter over the data table.
After clicking on Create, we are all set!
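To make the threshold arithmetic concrete, here is a minimal sketch of the logic behind this test. The data, class names, and the macro-averaging behavior are assumptions for illustration; the platform computes the metric for you, so this is only a mental model, not the platform's implementation.

```python
# Hypothetical cohort: rows filtered so every true label is "Urgent".
y_true = ["Urgent"] * 10
y_pred = ["Urgent"] * 9 + ["Not urgent"]  # 9 of 10 urgent samples caught

def class_recall(y_true, y_pred, cls):
    """Recall for one class: true positives / actual positives."""
    preds_for_cls = [p for t, p in zip(y_true, y_pred) if t == cls]
    if not preds_for_cls:
        return 0.0  # a class absent from the cohort contributes 0
    return sum(p == cls for p in preds_for_cls) / len(preds_for_cls)

classes = ["Urgent", "Not urgent"]
# Overall recall, averaged over both classes.
overall_recall = sum(class_recall(y_true, y_pred, c) for c in classes) / len(classes)
print(overall_recall)  # 0.45 — Urgent recall is 0.9, Not urgent contributes 0

print(overall_recall > 0.4)  # True: the test passes
```

Because the cohort contains only Urgent samples, the per-class target of 0.8 halves to an overall threshold of 0.4.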
To run the test, hover over the tests for the current model table and click on Run test.
By adding a test to a project, we make sure that every new model we include is going to be tested, allowing us to assert that the same problems won’t happen again.
It is also important that, in the process of fixing our model’s problems, we don’t regress on other fronts. For that, we need a second test. Create one that asserts the model's F1 for the samples from the class Not urgent also stays above 0.9 (which means that the overall threshold should be 0.45).
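The same halving logic applies to this F1 test. The sketch below, again with hypothetical data and an assumed macro average, shows why a per-class F1 target of 0.9 becomes an overall threshold of 0.45.

```python
# Hypothetical cohort: rows filtered so every true label is "Not urgent".
y_true = ["Not urgent"] * 10
y_pred = ["Not urgent"] * 9 + ["Urgent"]

def class_f1(y_true, y_pred, cls):
    """F1 for one class from precision and recall; 0 if undefined."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / sum(p == cls for p in y_pred)
    recall = tp / sum(t == cls for t in y_true)
    return 2 * precision * recall / (precision + recall)

classes = ["Urgent", "Not urgent"]
# Overall F1, averaged over both classes.
overall_f1 = sum(class_f1(y_true, y_pred, c) for c in classes) / len(classes)
print(round(overall_f1, 3))  # 0.474 — Not urgent F1 ≈ 0.947, Urgent contributes 0

print(overall_f1 > 0.45)  # True: the test passes
```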
In the next part of the tutorial, we will solve some of our model’s issues and commit the new version to the platform.