Tests are a central component of Openlayer. This topic guide covers what you need to know about monitoring tests.

New to Openlayer tests?

If this is your first contact with Openlayer tests, we encourage you to take a look at the (Development) Tests topic guide first. Here, we highlight the main differences between development and monitoring tests.

Introduction to monitoring tests

Model health affects business outcomes. In production, it fluctuates and requires corrective action from time to time. In this context, the central questions that practitioners struggle with are:

  • How to measure model health in production?
  • How to track model health over time, since it fluctuates?
  • How to know when corrective actions are needed?

Model health is a broad term that calls for many pieces of information: aggregate metrics, drift measurements, data checks, and so on. Stitched together, these pieces give practitioners a comprehensive view of model health in production.

Monitoring tests

As defined in the (Development) Tests topic guide, Openlayer tests cover various aspects of model health. Thus, the test statuses at a point in time are a snapshot of model health.

By creating tests that span the different test types for production data, you are composing a comprehensive view of your model’s health.

Model health over time

We mentioned that a key characteristic of monitoring is that model health fluctuates in production. A single snapshot of model health is therefore not enough: the time component must also be factored in. This is where the idea of an evaluation window comes in, a topic we explore in a separate topic guide (Evaluation and delay windows).

Monitoring tests are evaluated at a regular cadence, so test statuses can change after each evaluation. The ups and downs of model health are thus captured by the changes in test statuses.
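
To make the idea of a regular evaluation cadence concrete, here is a minimal sketch in plain Python. It is not Openlayer's API: the `Row` type, the `evaluate_over_time` function, the status labels, and the threshold are all hypothetical names and values chosen for illustration. It only shows how re-evaluating the same test on successive windows of production data yields a history of statuses.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import fmean
from typing import Callable, List, Sequence, Tuple


@dataclass
class Row:
    """One production record (hypothetical shape)."""
    timestamp: datetime
    value: float  # e.g. 1.0 if the prediction was correct, else 0.0


def evaluate_over_time(
    rows: Sequence[Row],
    start: datetime,
    end: datetime,
    cadence: timedelta,
    window: timedelta,
    metric: Callable[[List[float]], float],
    threshold: float,
) -> List[Tuple[datetime, str]]:
    """Evaluate one test at every tick and return (evaluation_time, status) pairs."""
    history = []
    tick = start
    while tick <= end:
        # Only rows inside the evaluation window (tick - window, tick] count.
        in_window = [r.value for r in rows if tick - window < r.timestamp <= tick]
        if not in_window:
            status = "skipped"  # no data arrived in this window
        else:
            status = "passing" if metric(in_window) >= threshold else "failing"
        history.append((tick, status))
        tick += cadence
    return history


# Illustrative data: three correct predictions, then three incorrect ones.
t0 = datetime(2024, 1, 1)
rows = [Row(t0 + timedelta(hours=h), v)
        for h, v in zip(range(1, 7), [1.0, 1.0, 1.0, 0.0, 0.0, 0.0])]

history = evaluate_over_time(
    rows,
    start=t0 + timedelta(hours=3),
    end=t0 + timedelta(hours=6),
    cadence=timedelta(hours=3),
    window=timedelta(hours=3),
    metric=fmean,
    threshold=0.8,
)
statuses = [status for _, status in history]
```

In this sketch the first evaluation sees only correct predictions and the second only incorrect ones, so the status changes between the two evaluations. That change is the signal the text describes.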


The final component of the monitoring problem is alerting, which signals when corrective action might be needed. To complete the overall picture, it is enough to mention for now that Openlayer’s notification functionality for monitoring projects is tailored to exactly this situation.

Types of tests

As mentioned in the previous section, in the context of monitoring, the different test types cover complementary aspects of model health.

Differences between development and monitoring tests

Each Openlayer test falls into one of the following types: integrity, consistency, performance, fairness, and robustness. In this guide, we explain which test types are available for monitoring and how their interpretation changes compared to development (introduced in the (Development) Tests topic guide).

  • Integrity tests for monitoring are very similar to integrity tests for development. After all, data quality can (and should) also be tested in production. The only difference is that while in development the tests are defined for the training or validation sets, in monitoring, the tests run on production data.
  • Consistency tests also exist for monitoring. However, they are interpreted differently. In development, consistency tests measure the consistency between the training and validation sets. In monitoring, the consistency tests measure the consistency between a reference dataset and production data. This is where drift tests live, for example. The reference dataset is usually a representative sample of the training set used to train the model.
  • Performance tests for monitoring are identical to the ones for development. Note, though, that ground truths are needed to compute most model performance metrics, and in monitoring they usually arrive later than the model predictions. Setting delay windows is therefore often important for performance tests; for details, refer to the Evaluation and delay windows topic guide.
  • Fairness tests defined on metrics, such as equal opportunity and demographic parity, apply to monitoring. On the other hand, fairness tests that rely on running synthetic data samples through the model do not make sense for monitoring purposes.
  • Robustness tests do not apply to monitoring. In development, robustness tests exist to anticipate edge cases that can be encountered in production by the model. Since monitoring focuses on a different problem, the idea of robustness does not translate directly to monitoring.
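
As an illustration of the monitoring interpretation of consistency tests, the sketch below compares one feature in a reference dataset against a batch of production data. It uses the two-sample Kolmogorov-Smirnov statistic, which is one common drift measure. The source does not say which measures Openlayer uses, so treat this choice, the function names, the sample values, and the 0.3 threshold as assumptions made for the example.

```python
from bisect import bisect_right
from typing import Sequence


def ks_statistic(reference: Sequence[float], production: Sequence[float]) -> float:
    """Largest gap between the two empirical CDFs (0 = identical, 1 = disjoint)."""
    if not reference or not production:
        raise ValueError("both samples must be non-empty")
    ref, prod = sorted(reference), sorted(production)
    gap = 0.0
    # The gap between two step functions can only change at observed values.
    for x in set(ref) | set(prod):
        ref_cdf = bisect_right(ref, x) / len(ref)
        prod_cdf = bisect_right(prod, x) / len(prod)
        gap = max(gap, abs(ref_cdf - prod_cdf))
    return gap


def drift_test_status(
    reference: Sequence[float],
    production: Sequence[float],
    max_gap: float = 0.3,  # arbitrary threshold, chosen for illustration
) -> str:
    """Hypothetical drift test: passes while the KS statistic stays within max_gap."""
    return "passing" if ks_statistic(reference, production) <= max_gap else "failing"


# Made-up values for a single numeric feature.
reference_ages = [22, 25, 31, 38, 44, 52, 60]  # sample of the training set
stable_batch = [23, 27, 30, 39, 45, 50, 61]    # production data, similar distribution
shifted_batch = [58, 61, 63, 66, 70, 72, 75]   # production data, shifted upward

stable_status = drift_test_status(reference_ages, stable_batch)
shifted_status = drift_test_status(reference_ages, shifted_batch)
```

Here the stable batch stays close to the reference distribution and the test passes, while the shifted batch produces a large gap and the test fails. In practice the threshold and the choice of drift measure depend on the feature and the use case.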