Tagging and suggested tags

Conducting error cohort analysis

Aggregate metrics, such as F1 and accuracy, can be very misleading: they may suggest we have a good model when, in fact, we cannot be so sure.

The F1 we obtained, as an aggregate metric, summarizes the performance of our model across our whole validation set. It is a useful first metric to look at, but it doesn’t convey the complete story of how our model behaves.

For example, how does our model perform for different user groups? What’s the performance for male users? What about for female users? And for users aged between 25 and 35?

What we will most likely find out is that the performance of our model is not uniform across different cohorts of the data. Furthermore, we may even encounter some data pockets with low accuracies and specific failure modes.
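To make this concrete, here is a minimal sketch of the difference between an aggregate metric and per-cohort metrics, using a small hypothetical validation set (the column names and values are illustrative, not Openlayer's API):

```python
import pandas as pd
from sklearn.metrics import f1_score

# Hypothetical validation set: true labels, model predictions, and a Gender feature
df = pd.DataFrame({
    "Gender": ["Male", "Male", "Female", "Female", "Male", "Female"],
    "label":  [1, 0, 1, 1, 0, 0],
    "pred":   [1, 0, 0, 1, 0, 1],
})

# The aggregate F1 summarizes the whole set in a single number...
overall = f1_score(df["label"], df["pred"])

# ...but computing it per cohort can reveal very different behavior
per_cohort = {
    gender: f1_score(group["label"], group["pred"])
    for gender, group in df.groupby("Gender")
}
print(overall, per_cohort)
```

In this toy example, the aggregate F1 looks reasonable even though one cohort performs far worse than the other, which is exactly the situation cohort analysis is designed to surface.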

The image below illustrates what is often the case with model performance.

Analyzing different cohorts of the data is critical to building trust in your model and to avoiding the surprise of discovering failure modes only after your model is deployed to production.

In this part of the tutorial, we will conduct error cohort analysis to understand how our model performs for different user groups. More specifically, we will dive deeper into the issue with genders we diagnosed in the previous section.

The key functionality that allows analyzing multiple data cohorts is tagging. For a comprehensive reference on the importance of tagging, check out Andrew Ng’s online course on ML in production.


The first step required to conduct error cohort analysis is being able to easily query our dataset so that we can access the data cohorts we are interested in exploring further.

This can be done by adding filters in the upper right corner of the data table.

For example, let’s filter the data to only look at the dataset rows from female users.

First, we select the feature we are interested in, which is Gender. Then, we select the relationship we want, in this case, equal. Finally, we type the value we are interested in and click on Apply filter.

Now, below the filter bar, we only see the data that satisfies our query.
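If it helps to think of the filter in code, it is equivalent to a simple boolean mask over a dataframe (a hypothetical illustration, not Openlayer's internals):

```python
import pandas as pd

# A hypothetical slice of the data table
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Female"],
    "Age": [30, 28, 41],
})

# Equivalent of the UI filter "Gender equal Female"
female_rows = df[df["Gender"] == "Female"]
print(len(female_rows))
```

The UI simply lets you build these masks (feature, relationship, value) without writing any code.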


Combining flexible tagging with easy filtering results in endless possibilities to conduct repeatable and precise data cohort analysis.

Now that we filtered to only see the data for female users, let’s create a tag for them.

In the upper left corner of the data table, click on the first checkbox. This will select all of the filtered rows.

With our rows selected, click on Options, which will show you some of the actions you can take on the selected rows. In our case, we'd like to create a new tag, so click on Tag rows. Let's name that data cohort female_users and press Enter.

Voilà! All of the data samples are now tagged! This is the user group we are going to focus on.
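Conceptually, a tag is a named, saved selection of rows that you can re-apply at any time. A rough sketch of the idea (a simplified analogy, not how Openlayer stores tags):

```python
import pandas as pd

df = pd.DataFrame({"Gender": ["Male", "Female", "Female"]})

# A tag behaves like a saved set of row indices under a name
tags = {}
tags["female_users"] = df.index[df["Gender"] == "Female"]

# Re-applying the tag later just re-selects those rows
cohort = df.loc[tags["female_users"]]
print(len(cohort))
```

The difference from a plain filter is persistence: the tag survives after you clear the filter bar, so the cohort can be revisited or shared later.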

You might have noticed that on the Error analysis panel, there is a tab called Tags. If you click on it, you will see all the tags you already created. In our case, you will only see our newly created female_users tag.

Whenever you need to revisit this data cohort or show it to someone, you can simply click on the tag, and the data below will be filtered according to the query used to generate it. This is a great way to document patterns.

Filter with a tag

Clear the filters in the filter bar. Then, click on the female_users tag to see what happens to the data rows shown below the error analysis panel.

Back on the Tags tab, did you notice something interesting?

Right below our newly created tag, you can see our model’s performance for that specific data cohort.

The model performance for that user group is pretty bad! For comparison, let's look at what's happening for the male users.

Inspecting different data cohorts

With what you've learned so far, can you check the model's F1 for male users? Hint: you need to use the filter bar and create a new tag.


Performance for female users

Our model performs very differently depending on the user's gender. The F1 for male users is equal to 0.75. On the other hand, for female users, the F1 is equal to 0.51. Our model is biased, and this is the behavior behind the largest error class, as we've seen in the previous section.


Actionable insights

The model bias in this case, as confirmed in the previous section, is a symptom of the female gender being underrepresented in the training data. Error cohort analysis not only aids in identifying possible model biases, but also allows practitioners to quickly identify data pockets with specific failure modes. Furthermore, as a diagnostic, it is possible to know exactly the kind of additional data needed to boost the model’s performance. In this case, it is clear we need to either use synthetic data or collect more data for female users to augment our training set.


The key to systematic model improvement is to focus on improving the model performance one data slice at a time.

Filtering and tagging empower practitioners to analyze flexible data cohorts.

Additionally, Openlayer automatically surfaces slices that might be worth focusing on next. These are the suggested tags.

In the Tags tab of the Error analysis panel, you will find some suggested tags that we created for you, pointing to data samples that might deserve a closer look.

Each suggested tag has a different meaning, and if you click on Create, we will tag all the samples that satisfy its criteria.

Among the most powerful suggested tags are the error slices. To construct them, we automatically find data pockets where the model has particularly high error rates. For example, notice that the data cohort where Gender is equal to Female has an error rate of almost 50%. This is a strong hint pointing in the direction of the model's bias we already diagnosed in the previous sections of the tutorial.
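A naive version of error-slice search can be sketched as computing the error rate per value of each categorical feature and sorting to find the worst pockets (a simplified illustration with made-up data, not Openlayer's actual algorithm):

```python
import pandas as pd

# Hypothetical predictions on a validation set
df = pd.DataFrame({
    "Gender": ["Male", "Male", "Female", "Female", "Female", "Male"],
    "label":  [1, 0, 1, 1, 0, 0],
    "pred":   [1, 0, 0, 1, 1, 0],
})
df["error"] = (df["label"] != df["pred"]).astype(int)

# Error rate per feature value, worst slices first
slices = df.groupby("Gender")["error"].mean().sort_values(ascending=False)
print(slices)
```

In practice the search also considers combinations of features and filters out slices that are too small to be statistically meaningful, but the core idea is the same: rank data pockets by error rate and surface the worst ones.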

Using a suggested tag

Click on Create in one of the suggested tags. Once you create a tag, it will show on the Tags tab with the associated aggregate metrics.

Now that you are familiar with filtering and tagging, let’s move on to the next part of the tutorial, where we explore tests in the context of ML.