Tests are a central component of Openlayer. This guide covers everything you need to know about them.

It answers the following questions:

  • What are Openlayer tests, and why do they exist?
  • What are the different test types? Is there a hierarchy between them?
  • How should teams create Openlayer tests?

Introduction to tests

Practitioners must have high expectations for their models and data to build trustworthy AI/ML. Openlayer tests materialize these expectations.

If we used plain English to describe them, some test examples would be:

  • Expect less than 3% of values missing for a feature from the training set.
  • Ensure that there is no leakage between training and validation sets.
  • Assert that the model has an average semantic similarity of more than 0.9 for critical queries.
  • Make sure that the model is robust to typos and paraphrases.

The main purpose of Openlayer tests is to set guardrails around models and data. As you iterate on model and dataset versions, the artifacts generated get evaluated against a set of well-defined criteria. By doing so, you ensure that you are systematically making progress and avoiding regressions in your quest toward high-quality models.

However, having guardrails in place is not the only benefit offered by testing.

The test creation pages, for example, can help you identify, at a glance, issues that might be affecting your model and data. In addition to helping you decide it is time to set up a test, these insights increase your understanding of the system you are building.

The test overview page provides insights that can help you get to the root cause of a failed test. For example, in the screenshot below, the histograms point to why the performance for a given subpopulation is below expectations: there is a significant distribution disparity between the training and validation sets. Explainability, a pillar of trustworthy ML, is also available on the test overview page.

Finally, collaboration is an important aspect of development. Tests direct the team’s attention to what matters and foster collaboration with the comments and activity log. It becomes easier to debug issues when there’s a log that provides a clear timeline of contributions and changes made throughout the development process.

Now, it is time to dive deeper and understand the different test types.

Test types

If we take a closer look at the test examples presented in the previous section, it is possible to note that some tests rely on a single dataset, while others use two datasets, and some even need the model as well.

This motivates the existence of different test types. Each test type is designed to cover a distinct aspect of quality and performance. Let’s look at this idea more closely.


On Openlayer, tests are organized into five categories: integrity, consistency, performance, fairness, and robustness.


Integrity tests are defined based on quantities computed from individual datasets. Usually, they are related to data quality.

For example, the number of missing values on a dataset is a typical integrity test.
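To make the idea concrete, here is a minimal sketch of a missing-values check written in plain Python. It is not how Openlayer computes the test; the function names, the sample column, and the 3% threshold (borrowed from the plain-English example earlier in this guide) are assumptions made for illustration.

```python
from typing import Optional, Sequence


def missing_fraction(values: Sequence[Optional[float]]) -> float:
    """Return the fraction of entries that are missing (None or NaN)."""
    if len(values) == 0:
        return 0.0
    # NaN is the only float that is not equal to itself.
    missing = sum(1 for v in values if v is None or v != v)
    return missing / len(values)


def passes_missing_values_test(values, max_fraction=0.03):
    """Pass when strictly less than max_fraction of the values are missing."""
    return missing_fraction(values) < max_fraction


# Hypothetical feature column from a training set.
ages = [34.0, None, 51.0, 29.0, float("nan"), 42.0, 38.0, 45.0, 27.0, 60.0]
fraction = missing_fraction(ages)              # 2 of 10 values are missing
test_passes = passes_missing_values_test(ages)  # False: 20% is above 3%
```

Because the check depends on a single dataset and nothing else, it falls squarely in the integrity category.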


Consistency tests are associated with the relationship between two datasets, namely the training and validation sets.

Ultimately, the training and validation sets should be consistent. Otherwise, there is no point in training the model on a dataset that looks nothing like the one used to validate it. Consistency tests help ensure that these datasets are sufficiently aligned while keeping the separation needed.

An example of a consistency test is the test that asserts there is no leakage between training and validation sets.
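One simple way to think about a leakage check is as a search for rows that appear in both datasets. The sketch below assumes leakage means exact duplicate rows and uses hypothetical sample data; a real check may use a looser notion of similarity.

```python
def find_leaked_rows(training_rows, validation_rows):
    """Return the validation rows that also appear, exactly, in the training set."""
    training_set = {tuple(row) for row in training_rows}
    return [row for row in validation_rows if tuple(row) in training_set]


# Hypothetical (text, label) rows for a sentiment classifier.
training = [
    ("great product", 1),
    ("terrible service", 0),
    ("would buy again", 1),
]
validation = [
    ("not worth it", 0),
    ("great product", 1),  # also present in the training set
]

leaked = find_leaked_rows(training, validation)
test_passes = len(leaked) == 0  # False: one row is shared
```

Note that the check needs both datasets at once, which is what distinguishes it from an integrity test.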


Performance tests set guardrails on the model’s performance on the whole validation set or specific subpopulations. They usually rely on aggregate metrics, which vary according to the task type (refer to the Aggregate metrics guide for details).

For instance, the test that sets an accuracy threshold for a specific subpopulation within the validation set is a performance test.
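As a rough sketch, such a test filters the validation set down to the subpopulation, computes the metric, and compares it with a threshold. The row layout, the age-based filter, and the 0.8 threshold below are hypothetical.

```python
def subpopulation_accuracy(rows, predicate):
    """Accuracy over the rows for which predicate(row) is true."""
    subset = [row for row in rows if predicate(row)]
    if not subset:
        raise ValueError("no rows match the subpopulation filter")
    correct = sum(1 for row in subset if row["label"] == row["prediction"])
    return correct / len(subset)


# Hypothetical validation rows with model predictions attached.
validation = [
    {"age": 25, "label": 1, "prediction": 1},
    {"age": 31, "label": 0, "prediction": 0},
    {"age": 67, "label": 1, "prediction": 0},
    {"age": 72, "label": 0, "prediction": 0},
    {"age": 70, "label": 1, "prediction": 0},
]

# Only one of the three rows with age >= 65 is predicted correctly.
seniors_accuracy = subpopulation_accuracy(validation, lambda row: row["age"] >= 65)
test_passes = seniors_accuracy >= 0.8  # False
```

In this toy data the overall accuracy is 60%, but the subpopulation filter reveals that every error is concentrated in one group.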


Fairness tests assess the model using different fairness metrics, which also change depending on the task type. Examples include equal opportunity and demographic parity for tabular models, and invariance to gender pronoun usage for language models.
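To illustrate one of these metrics, the sketch below computes a demographic parity difference: the gap between the highest and lowest rate of positive predictions across groups. The binary predictions, the group labels, and the 0.1 tolerance are assumptions chosen for the example.

```python
def positive_rate(predictions):
    """Fraction of predictions equal to 1."""
    return sum(predictions) / len(predictions)


def demographic_parity_difference(predictions, groups):
    """Largest gap in positive-prediction rate between any two groups."""
    by_group = {}
    for prediction, group in zip(predictions, groups):
        by_group.setdefault(group, []).append(prediction)
    rates = [positive_rate(group_preds) for group_preds in by_group.values()]
    return max(rates) - min(rates)


# Hypothetical binary predictions and the sensitive group of each row.
predictions = [1, 0, 1, 1, 0, 0, 1, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

gap = demographic_parity_difference(predictions, groups)  # 0.75 - 0.25 = 0.5
test_passes = gap <= 0.1  # False
```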


Robustness tests are all about asserting robustness to different data perturbation recipes. The idea is that it is better to try to anticipate the edge cases that a model will invariably encounter in production. By doing so, corrective action can be taken before deployment.

For example, a robustness test can assert that a language model is robust to typos and paraphrases, or that a tabular model is invariant to certain feature value changes.
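A minimal sketch of the typo case: perturb each input, run the model on both versions, and measure how often the prediction stays the same. The keyword-based `toy_model`, the character-swap perturbation, and the sample texts are hypothetical stand-ins, not Openlayer's perturbation recipes.

```python
import random


def add_typo(text, rng):
    """Swap two adjacent characters at a random position."""
    if len(text) < 2:
        return text
    i = rng.randrange(len(text) - 1)
    chars = list(text)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def prediction_stability(model, texts, perturb):
    """Fraction of inputs whose prediction is unchanged after perturbation."""
    unchanged = sum(1 for text in texts if model(text) == model(perturb(text)))
    return unchanged / len(texts)


def toy_model(text):
    """Stand-in classifier: 'positive' only when the word 'good' appears."""
    return "positive" if "good" in text.lower() else "negative"


rng = random.Random(0)  # seeded so the perturbations are reproducible
texts = ["good value", "really good", "bad experience", "not for me"]
stability = prediction_stability(toy_model, texts, lambda t: add_typo(t, rng))
test_passes = stability >= 0.9
```

A brittle model like this one flips its prediction whenever the typo lands inside the keyword, which is the kind of edge case a robustness test is meant to surface before deployment.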

Most categories above are valid for all the task types supported by Openlayer: from tabular regression to LLMs. However, the tests within the categories for each task type differ. Each task type has its idiosyncrasies, and the tests are customized to address prevalent challenges specific to it.

For example, for LLMs, an important issue is ensuring that the model output follows a specific format (JSON, for example). This fits under the umbrella of integrity tests because it uses a single dataset. On the other hand, such a test doesn’t apply to tabular classification or tabular regression, and is, thus, replaced with specific tests such as correlated features, feature value ranges, etc.
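As an illustration of the output-format case, the sketch below measures how many model outputs parse as a JSON object containing a required key. The sample outputs, the `answer` key, and the helper names are hypothetical.

```python
import json


def is_valid_json_object(output, required_keys=()):
    """True when output parses as a JSON object containing all required keys."""
    try:
        parsed = json.loads(output)
    except (json.JSONDecodeError, TypeError):
        return False
    return isinstance(parsed, dict) and all(key in parsed for key in required_keys)


def valid_json_rate(outputs, required_keys=()):
    """Fraction of outputs that satisfy the format requirement."""
    valid = sum(1 for output in outputs if is_valid_json_object(output, required_keys))
    return valid / len(outputs)


# Hypothetical LLM outputs for four prompts.
outputs = [
    '{"answer": "Paris"}',   # valid
    "The answer is Paris.",  # not JSON
    '{"answer": "Rome"',     # truncated JSON
    '["Paris"]',             # valid JSON, but not an object
]

rate = valid_json_rate(outputs, required_keys=("answer",))  # 1 of 4 is valid
test_passes = rate == 1.0  # False
```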

Test hierarchy

The different test types cover complementary aspects of quality and performance. Additionally, they form a hierarchy.

We recommend starting with integrity tests, which are usually the easiest to create and pass, and then moving on to consistency tests and the successive types, ending with robustness.

The reason for the suggested order is that some tests are tightly interrelated. For example, a high number of missing values in the training set (an integrity test) may be the culprit behind model bias (a fairness test). Solving the issue at its root can produce positive ripple effects downstream.

Test creation process

At this point, we have explored what Openlayer tests are and how they are organized. Furthermore, we know that at their core, tests define expectations on the model and data.

A natural question that arises next is: where do these expectations come from for a concrete project? In other words, how does one create the tests for their models and datasets?

In practice, tests can have many origin stories. For example, a performance test could have been created even before model development started, since a team could already have envisioned the target performance they wanted to achieve. Alternatively, a consistency test can be created well into model development, after the team notices that the culprit for poor performance was data drift between the training and validation sets.

Even though each AI/ML project is unique, in this guide, we explore some of the common paths for test creation.

Openlayer suggestions

When models and datasets are onboarded to Openlayer, a series of analyses are run to find potentially important insights. As a consequence, the Openlayer platform can automatically suggest a series of tests that might be interesting for the use case at hand.

These suggested tests can give teams a head start on the development process. Below is a screenshot of the list of suggested tests for a sample text classification project.

Notice that of the 12 suggested tests in this example, 2 were passing and 10 were failing.

The suggestion of tests that are passing illustrates a key point:

The suggested tests serve not only to surface immediate issues affecting the current data and model but also to set up tests on quantities deemed important.

This stems from the idea that you should create tests even when everything looks just right.

Take label drift, for example. Even though there was no label drift between the training and validation sets, creating such a test puts a guardrail in place to ensure all future dataset versions also do not suffer from this problem.

For now, this test is passing, but if something changes in the future, it won’t go unnoticed.
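A label drift check can be sketched as a comparison between the label distributions of the two datasets. The example below uses the total variation distance as the drift measure, which is our choice for illustration and not necessarily the metric Openlayer uses; the labels and the 0.05 threshold are also hypothetical.

```python
from collections import Counter


def label_distribution(labels):
    """Map each label to its relative frequency."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}


def total_variation_distance(labels_a, labels_b):
    """Half the sum of absolute differences between the two label distributions."""
    p = label_distribution(labels_a)
    q = label_distribution(labels_b)
    all_labels = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in all_labels)


# Hypothetical labels: both sets are 30% spam and 70% ham.
training_labels = ["spam"] * 30 + ["ham"] * 70
validation_labels = ["spam"] * 6 + ["ham"] * 14

drift = total_variation_distance(training_labels, validation_labels)
test_passes = drift <= 0.05  # True today; fails if a future version drifts
```

The value of keeping a passing check like this around is that the same comparison runs again on every new dataset version.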

If the user deems all (or part) of the suggestions useful, these tests are created in bulk, saving some manual discovery work.

Regression tests

The AI/ML development process is inherently iterative, and teams are bound to encounter issues as they iterate on model and dataset versions. Tests are often created as a response to such issues.

From this perspective, Openlayer tests resemble regression tests from software engineering:

When an issue is identified, a test is created to safeguard against future regressions, ensuring that the same problem does not haunt the future versions of models and datasets.

This is probably the most common origin story for tests, and a simplified version of this process is followed in our Development Tutorial.

A corollary of this process is that the number of tests in a project continuously increases as the team iterates. This is natural: it is very difficult to foresee every test a project needs from the beginning, and the set of tests is expected to adapt as the project matures.

Domain expertise

Tests also commonly emerge from domain expertise.

For example, consider a fraud classification model that takes in the transaction amount as one of the features. If it is known that all transaction amounts must be positive numbers, a test that asserts that the feature values are within the expected range can be created.

As the team iterates on dataset versions, issues in the data pipeline, for example, may inadvertently introduce negative values or other anomalies into the transaction amount feature. If this happens, the expectation is violated for some rows, which can negatively affect model performance. With the test in place, the issue is detected and solved early.
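The transaction amount expectation can be sketched as a range check that reports the offending rows. The function name, the sample amounts, and the choice to treat zero as invalid (since amounts must be positive) are assumptions for this example.

```python
def out_of_range_rows(values, lower, upper=float("inf")):
    """Return (row_index, value) pairs where value is not in (lower, upper]."""
    return [
        (index, value)
        for index, value in enumerate(values)
        if not (lower < value <= upper)
    ]


# Hypothetical transaction amounts from a new dataset version.
transaction_amounts = [12.50, 980.00, -45.10, 3.99, 0.0]

# Amounts must be strictly positive, so the lower bound 0.0 is exclusive.
violations = out_of_range_rows(transaction_amounts, lower=0.0)
test_passes = len(violations) == 0  # False: rows 2 and 4 violate the range
```

Returning the row indices, rather than only a pass or fail result, makes it easier to trace the anomaly back to the pipeline step that introduced it.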

AI/ML projects are often multi-disciplinary. Therefore, the engineers and data scientists developing the models may not always have all the context needed from experts in the domain they are working on. Tests can serve as an alignment mechanism in such scenarios. Domain experts can collaborate with developers to provide the extra context needed, ensuring that all stakeholders are on the same page and facilitating better-informed model development.