Tagging and suggested tags

Conducting error cohort analysis

Aggregate metrics, such as F1 and accuracy, can be very misleading. They can lead us to think that we have a good model when, in fact, we cannot be so sure.

The 97% accuracy we obtained, as an aggregate metric, summarizes the performance of our model across our whole validation set. It is a useful first metric to look at, but it doesn’t convey the complete story of how our model behaves.

For example, how does our model perform for different groups of messages? What’s the performance for messages about the weather? What about earthquakes? And for Urgent messages?

What we will most likely find out is that the performance of our model is not uniform across different cohorts of the data. Furthermore, we may even encounter some data pockets with low accuracies and specific failure modes.
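To make the gap between an aggregate metric and per-cohort metrics concrete, here is a minimal sketch in plain Python. The rows, cohort names, and helper functions are made up for illustration; this is not how Openlayer computes its metrics, only the underlying idea.

```python
from collections import defaultdict

# Hypothetical validation rows: (cohort, true label, predicted label).
rows = [
    ("weather", "Not urgent", "Not urgent"),
    ("weather", "Not urgent", "Not urgent"),
    ("weather", "Urgent", "Urgent"),
    ("weather", "Not urgent", "Not urgent"),
    ("earthquake", "Urgent", "Urgent"),
    ("earthquake", "Not urgent", "Urgent"),  # the model misses this one
]


def accuracy(pairs):
    """Fraction of (label, prediction) pairs that match."""
    pairs = list(pairs)
    return sum(label == pred for label, pred in pairs) / len(pairs)


def accuracy_by_cohort(rows):
    """Group rows by cohort and compute the accuracy within each group."""
    groups = defaultdict(list)
    for cohort, label, pred in rows:
        groups[cohort].append((label, pred))
    return {cohort: accuracy(pairs) for cohort, pairs in groups.items()}


overall = accuracy((label, pred) for _, label, pred in rows)
per_cohort = accuracy_by_cohort(rows)

print(f"overall: {overall:.2f}")
for cohort, acc in per_cohort.items():
    print(f"{cohort}: {acc:.2f}")
```

In this toy data, the overall accuracy looks reasonable while the earthquake cohort is right only half of the time, which is exactly the kind of data pocket an aggregate metric hides.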

The image below illustrates what is often the case with model performance.

Analyzing different cohorts of the data is critical to building trust in your model and to avoiding surprises from failure modes that only show up after your model is deployed to production.

In this part of the tutorial, we will conduct error cohort analysis to understand how our model performs for different message groups.

The key functionality that allows analyzing multiple data cohorts is tagging. For a comprehensive reference on the importance of tagging, check out Andrew Ng’s online course on ML in production.


The first step required to conduct error cohort analysis is being able to easily query our dataset so that we can access the data cohorts we are interested in exploring further.

This can be done with the Filter and Search bar, right below the Error analysis panel.

For example, let’s filter the data to only look at the dataset rows from messages that contain the word “earthquake.” To do so, we can simply type “earthquake” in the filter bar and press Enter. Now, below the filter bar, we only see the data that satisfies our query.
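Outside of the UI, the same kind of query can be sketched in a few lines of Python. The rows and the `filter_rows` helper below are hypothetical, and matching without regard to case is an assumption of this sketch, not a documented behavior of the filter bar.

```python
# Hypothetical rows; the field names are made up for this sketch.
dataset = [
    {"text": "Earthquake destroyed our house, we need shelter", "label": "Urgent"},
    {"text": "Heavy rain expected tomorrow", "label": "Not urgent"},
    {"text": "How much energy does an earthquake release?", "label": "Not urgent"},
]


def filter_rows(rows, query):
    """Keep the rows whose text contains the query, ignoring case."""
    needle = query.lower()
    return [row for row in rows if needle in row["text"].lower()]


matches = filter_rows(dataset, "earthquake")
for row in matches:
    print(row["label"], "|", row["text"])
```

Only the rows that mention earthquakes survive the filter, regardless of their label.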


Combining flexible tagging with easy filtering results in endless possibilities to conduct repeatable and precise data cohort analysis.

Now that we have filtered the data to only see what we want, let’s create a tag for it.

In the upper-left corner of the data table, click on the first checkbox. This will select all of the rows that match the filter.

With our rows selected, click on Options, which will show you some of the actions you can take on the selected rows. In our case, we’d like to create a new tag, so click on Tag rows. Let’s name that data cohort about_earthquakes and press Enter.

Voilà! All of the data samples are now tagged! This is the data cohort we are going to focus on.

You might have noticed that on the Error analysis panel, there is a section called Tags. If you click on it, you will see all the tags you already created. In our case, you will only see our newly created about_earthquakes tag.

Every time you need to have a look or need to show this data cohort to someone, you can simply click on it and the data below will be filtered according to the query used to generate it. This is a great way to document patterns.
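Conceptually, a tag behaves like a named, reusable selection of rows. The sketch below models that idea with a plain dictionary; the `tags` store and the `tag_rows` and `rows_with_tag` helpers are hypothetical and do not reflect how Openlayer stores tags internally.

```python
# Hypothetical rows; ids and texts are made up for this sketch.
dataset = [
    {"id": 0, "text": "Earthquake destroyed our house, we need shelter"},
    {"id": 1, "text": "Heavy rain expected tomorrow"},
    {"id": 2, "text": "How much energy does an earthquake release?"},
]

tags = {}  # tag name -> set of row ids


def tag_rows(name, rows):
    """Attach a tag to every row in `rows` by recording its id."""
    tags.setdefault(name, set()).update(row["id"] for row in rows)


def rows_with_tag(name):
    """Return the rows that carry the tag, in dataset order."""
    ids = tags.get(name, set())
    return [row for row in dataset if row["id"] in ids]


# Select the rows of interest once, then tag them for later reuse.
selected = [row for row in dataset if "earthquake" in row["text"].lower()]
tag_rows("about_earthquakes", selected)

cohort = rows_with_tag("about_earthquakes")
print([row["id"] for row in cohort])
```

Once the tag exists, retrieving the cohort no longer requires retyping the query, which is what makes the analysis repeatable.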

Filtering with a tag

Clear the filters in the filter bar. Then, click on the about_earthquakes tag to see what happens to the data rows shown below the Error analysis panel.

Now, back to the Tags section. Did you notice something interesting?

Right below our newly created tag, you can see our model’s performance for that specific data cohort.

The model performance for that data cohort is pretty good, and the cohort consists mainly of samples from the Urgent class.


Performance for messages about earthquakes

Notice that there is a row with a Not urgent message that our model is misclassifying. In that row, the user simply wanted to know the energy released in an earthquake, but our model classified it as Urgent, mostly because of the word “earthquake” (you can use local explainability to figure this out).


Actionable insights

In some data groups, the model performs much better than in others. Error cohort analysis not only aids in identifying possible model biases, but also allows practitioners to quickly identify data pockets with specific failure modes. Furthermore, as a diagnostic, it makes it possible to know exactly what kind of additional data is needed to boost the model’s performance. In this case, it is clear we need to either use synthetic data or collect more Not urgent messages that mention earthquakes to augment our training set.


The key to systematic model improvement is to focus on improving the model performance one data slice at a time.

Filtering and tagging empower practitioners to analyze flexible data cohorts.

Additionally, Openlayer automatically presents slices that might be worth focusing on next. These are the suggested tags.

In the Tags section of the Error analysis panel, you will find some suggested tags. We generate them for you because we think they point to data samples you might want to take a closer look at.

Each suggested tag has a different meaning, and if you click on Create, we will automatically tag all the samples that satisfy its criteria.

For example, one of the most powerful suggested tags is the potential_mislabel tag. This shows data that might have been mislabeled, because (among other things) the model is making mistakes with low uncertainty, that is, it is confidently disagreeing with the label. This is not a guarantee that the points are mislabeled, but it might be worth double-checking them, as training models with mislabeled data will likely hinder their performance.
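One signal mentioned above, mistakes made with low uncertainty, can be sketched as a simple rule: flag the rows where the prediction disagrees with the label and the model is very confident. The exact criteria Openlayer uses are not described here, so the `confidence` field, the `potential_mislabels` helper, and the 0.9 threshold are all assumptions made for illustration.

```python
# Hypothetical rows: `confidence` is the model's probability for `pred`.
rows = [
    {"id": 0, "label": "Urgent", "pred": "Urgent", "confidence": 0.98},
    {"id": 1, "label": "Not urgent", "pred": "Urgent", "confidence": 0.97},
    {"id": 2, "label": "Urgent", "pred": "Not urgent", "confidence": 0.55},
]


def potential_mislabels(rows, min_confidence=0.9):
    """Flag rows where the model confidently disagrees with the label."""
    return [
        row
        for row in rows
        if row["pred"] != row["label"] and row["confidence"] >= min_confidence
    ]


flagged = potential_mislabels(rows)
print([row["id"] for row in flagged])
```

Row 2 is also a mistake, but the model is unsure about it, so under this rule it looks more like a hard example than a labeling error. Flagged rows are candidates for manual review, not confirmed mislabels.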

Using a suggested tag

Click on Create in one of the suggested tags. Once you create a tag, it will show up in the Tags section with the associated aggregate metrics.

Now that you are familiar with filtering and tagging, let’s move on to the next part of the tutorial, where we explore tests in the context of ML.