We want models that perform well not only on whole datasets but also on every edge case they might encounter out in the wild. The problem is that when we strive for that goal, it is easy to be overwhelmed by the number of possibilities and suffer from analysis paralysis.
A better and more realistic approach is to improve the model’s performance one data slice at a time.
The Data distribution section of the error analysis panel helps us identify our model’s most common mistakes. A good strategy is then to focus on improving model performance on these error classes iteratively over the next rounds of ML development.
In the Data distribution section, the error analysis panel is divided into two parts.
On the left-hand side, you see the labels for your task. In our urgent event binary classifier, these are the Not urgent and Urgent classes. Right beneath each label, we see its performance, measured by aggregate metrics per class.
Low recall for the Urgent class
Note that the model’s performance is not uniform across the two classes. The Urgent class has a particularly low recall. For this application, recall is likely the metric we should care most about. After all, we are in a high-stakes situation, and the worst mistake our model can make is classifying an urgent message as Not urgent (i.e., a false negative).
In our urgent event classifier’s case, a quick inspection of the training set reveals that the non-uniform model performance across the classes is a symptom of an unbalanced dataset.
Using aggregate metrics per class is particularly important in such scenarios, where the model performance on the majority class might distort some of the overall metrics.
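To see why this matters, here is a minimal sketch (with toy, made-up labels and predictions, not our actual dataset) of how overall accuracy can hide a weak minority class:

```python
from collections import Counter

def per_class_recall(y_true, y_pred):
    """Recall per class: correct predictions / total samples of that class."""
    hits, totals = Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        hits[t] += (t == p)
    return {c: hits[c] / totals[c] for c in totals}

# Toy imbalanced set: 8 "Not urgent" samples, only 2 "Urgent" ones.
y_true = ["Not urgent"] * 8 + ["Urgent"] * 2
y_pred = ["Not urgent"] * 8 + ["Not urgent", "Urgent"]  # one urgent missed

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)                         # 0.9 -- looks fine overall
print(per_class_recall(y_true, y_pred)) # but Urgent recall is only 0.5
```

Overall accuracy is 90%, yet the model misses half of the urgent messages, which is exactly the failure the per-class metrics surface.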
After diagnosing that the model’s performance issue stems from an unbalanced training set, one possible course of action is rebalancing it. To do so, one can collect more real-world data or use synthetic data generated by Openlayer. In this particular case, we decided not to rebalance the training set just yet. As we will soon see, there are other, more pressing issues with our model. If you are interested in the different ways of dealing with class imbalance, feel free to check out our blog post series on the topic.
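For intuition, the simplest form of rebalancing is random oversampling, sketched below with hypothetical row dictionaries (real rebalancing, such as synthetic data generation, is considerably more involved):

```python
import random

def oversample_minority(rows, label_key="label", seed=0):
    """Naively duplicate minority-class rows until every class
    matches the majority-class count. A sketch only."""
    random.seed(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(row[label_key], []).append(row)
    target = max(len(members) for members in by_class.values())
    balanced = []
    for members in by_class.values():
        balanced.extend(members)
        # Sample (with replacement) enough extra copies to reach the target.
        balanced.extend(random.choices(members, k=target - len(members)))
    return balanced

rows = [{"label": "Not urgent"}] * 3 + [{"label": "Urgent"}]
print(len(oversample_minority(rows)))  # 6: three rows of each class
```

Note that oversampling duplicates information rather than adding it, which is one reason collecting more real data or generating synthetic data is often preferable.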
On the right-hand side of the error analysis panel, we initially see the different error classes. Alternatively, if we click on Show success classes, we can see the different success classes. Each of these views presents a piece of the confusion matrix in a flattened display.
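Conceptually, flattening a confusion matrix means splitting its cells into the diagonal (success classes) and the off-diagonal cells (error classes). A minimal sketch, using made-up predictions:

```python
from collections import Counter

def flatten_confusion(y_true, y_pred):
    """Split (predicted, actual) pairs into success classes
    (diagonal cells) and error classes (off-diagonal cells)."""
    successes, errors = Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        cell = (p, t)  # (predicted, actual)
        (successes if t == p else errors)[cell] += 1
    return successes, errors

successes, errors = flatten_confusion(
    ["Urgent", "Urgent", "Not urgent"],
    ["Not urgent", "Urgent", "Not urgent"],
)
print(dict(errors))  # one "predicted Not urgent, actually Urgent" error
```

Each key in these counters corresponds to one of the classes shown in the panel.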
Can you filter the dataset to show only the samples our model predicted as Not urgent but whose label was Urgent?
To do so, we need to click on that error class. Once it is selected in the error analysis panel, the data slice shown below corresponds to the dataset rows where the model is making that type of mistake.
Large error class
Our urgent event classification model has an error class with nearly six times as many rows as the other. Our model often mistakenly predicts that a message is not urgent when in fact it is. This is a critical error that definitely needs improvement.
Particularly common error classes might deserve special attention: after all, if we manage to diagnose their root cause, we can significantly boost the model’s performance.
Now that we have identified a good slice of data to focus on, we can move on to understanding its cause. In the next part of the tutorial, we will explore explainability and how we can leverage its power to get to the root cause of our model’s mistakes.