To set up monitoring and continuously publish production data to the Openlayer platform, use one of our client libraries or our REST API.

The flow usually goes as follows:

1. Create or load an inference pipeline
2. Publish production data
3. (Optional) Upload a reference dataset
4. (Optional) Update the ground truths for previously published production data

Check out our examples gallery

If you’d like to see multiple examples of the process above, refer to our examples gallery GitHub repository.

In this guide, we will use the openlayer Python client library to illustrate this process.

Prerequisites

To follow along with this guide, you’ll need:

1. Create or load an inference pipeline

To load an existing project, you can run:

import openlayer
from openlayer.tasks import TaskType

client = openlayer.OpenlayerClient("YOUR_API_KEY_HERE")

project = client.load_project(name="Fraud classification")

With the project object retrieved, you can create a new inference pipeline, which is where everything related to monitoring happens:

# To create a new pipeline
inference_pipeline = project.create_inference_pipeline(
    name="Your inference pipeline name"
)

# To load an existing pipeline specifying the name
inference_pipeline = project.load_inference_pipeline(
    name="Your inference pipeline name"
)

2. Publish production data

There are two ways to publish production data to Openlayer:

  • Stream: data is published one row at a time.
  • Batch: data is published in batches.

The choice of method depends on your use case. If your model predictions are available one at a time, streaming can be a good option. If you want to accumulate your model predictions before publishing them, batch publishing is a good option.

In both cases, your data should have a column with timestamps (in UNIX seconds format) and a column with inference ids. The names of these columns are specified in the config dictionary. If these columns are not present, they will be created with defaults. Note that inference ids are particularly important if you wish to update ground truths later, as described in step 4.
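As a minimal sketch of preparing these two columns yourself, the snippet below stamps a row with a UNIX timestamp in seconds and a unique inference id before it is published. The helper name `with_monitoring_fields` is hypothetical and not part of the openlayer library; the column names `timestamp` and `inference_id` are only examples and must match the names you specify in your config.

```python
import time
import uuid


def with_monitoring_fields(row, timestamp=None, inference_id=None):
    """Return a copy of `row` with timestamp and inference id columns set.

    Hypothetical helper, not part of the openlayer client. Values already
    present in `row` are kept; missing ones are filled in.
    """
    stamped = dict(row)  # copy so the caller's row is not mutated
    # UNIX time in whole seconds, matching the timestamp format described above
    stamped.setdefault(
        "timestamp", int(timestamp if timestamp is not None else time.time())
    )
    # A random UUID gives each row a unique id that can be referenced later
    stamped.setdefault("inference_id", inference_id or str(uuid.uuid4()))
    return stamped


row = with_monitoring_fields({"feature_1": 0.1, "feature_2": 0.2, "output": 1})
```

Generating the inference id on your side, rather than relying on the defaults, means you can store it alongside the prediction and use it later to attach a ground truth.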

Individual rows of production data are published to an inference pipeline with the stream_data method.

After preparing a config for monitoring data, the stream can be published to Openlayer with:

inference_pipeline.stream_data(
    stream_data=data,
    stream_config=config,
)

where data is a dictionary (or a list of dictionaries) with the individual rows of production data. For example:

data = {
    "timestamp": 1631971200,
    "inference_id": "1aef4",
    "feature_1": 0.1,
    "feature_2": 0.2,
    "feature_3": 0.3,
    "output": 1
}
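When predictions arrive one at a time but you would rather send several rows per call, one option is to collect rows in a small buffer and publish them as a list of dictionaries. The sketch below assumes a hypothetical `RowBuffer` class (not part of the openlayer library) and uses a plain list as a stand-in for the publish call so that it runs without an API key.

```python
class RowBuffer:
    """Collect rows and hand them to `publish` in groups of `max_rows`.

    Hypothetical helper, not part of the openlayer client. `publish` is any
    callable that accepts a list of row dictionaries.
    """

    def __init__(self, publish, max_rows=100):
        self.publish = publish
        self.max_rows = max_rows
        self.rows = []

    def add(self, row):
        self.rows.append(row)
        if len(self.rows) >= self.max_rows:
            self.flush()

    def flush(self):
        if self.rows:
            self.publish(self.rows)
            self.rows = []  # start a fresh list; the published one stays intact


published = []  # stand-in sink that records each published group of rows
buffer = RowBuffer(publish=published.append, max_rows=2)
for i in range(5):
    buffer.add({"inference_id": f"id-{i}", "feature_1": i / 10, "output": 1})
buffer.flush()  # send whatever is left over
```

In real use, `publish` could be a function that wraps the call shown above, for example `lambda rows: inference_pipeline.stream_data(stream_data=rows, stream_config=config)`.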

3. Upload a reference dataset

A reference dataset is optional, but encouraged if you want to monitor drift. Ideally, the reference dataset is a representative sample of the training set used by the deployed model.
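One simple way to obtain such a sample is to draw rows uniformly at random from the training set. The sketch below is an illustration only: `sample_reference_rows` is a hypothetical helper, not part of the openlayer library, and the training rows are synthetic.

```python
import random


def sample_reference_rows(training_rows, n, seed=0):
    """Return `n` rows drawn uniformly at random without replacement.

    Hypothetical helper, not part of the openlayer client.
    """
    rng = random.Random(seed)  # fixed seed makes the sample reproducible
    n = min(n, len(training_rows))  # never ask for more rows than exist
    return rng.sample(training_rows, n)


# Synthetic training rows, used only to make the sketch runnable
training_rows = [{"feature_1": i / 100, "label": i % 2} for i in range(1000)]
reference_rows = sample_reference_rows(training_rows, n=200)
```

A uniform sample is only as representative as the training set itself; for heavily imbalanced classes, a stratified sample may be a better fit.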

Reference datasets are uploaded to an inference pipeline with the upload_reference_dataframe or upload_reference_dataset methods. The former is used if the dataset is loaded into memory as a pandas dataframe, while the latter is used if the dataset is saved to disk as a CSV file.

Here, we show the use of upload_reference_dataframe; the process is similar for upload_reference_dataset.

A reference dataset is uploaded with:

inference_pipeline.upload_reference_dataframe(
    dataset_df=df,
    dataset_config=config,
)

where config is a Python dictionary with information about the dataset. The items in this dictionary depend on the task type. Refer to the How to write dataset config guides for details.

4. Update the ground truths for previously published production data

Updating the ground truths for previously published production data is optional but encouraged if you want to monitor performance metrics that need ground truths, such as accuracy and F1 score.

To update ground truths for previously published data, you can run:

inference_pipeline.update_data(
    df=df,
    ground_truth_column_name="labels",
    inference_id_column_name="inference_id",
)

where df is a pandas DataFrame with exactly two columns: one for the ground truths (called labels in this case) and one for the inference ids (named inference_id here).
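If your labels live in wider records, you need to reduce them to these two columns first. The sketch below shows one way to do that with plain Python; `ground_truth_records` is a hypothetical helper, not part of the openlayer library, and the sample rows are made up for illustration.

```python
def ground_truth_records(rows, label_key="labels", id_key="inference_id"):
    """Keep only the inference id and ground truth of each row.

    Hypothetical helper, not part of the openlayer client. Rows that have
    no label yet are skipped.
    """
    return [
        {id_key: row[id_key], label_key: row[label_key]}
        for row in rows
        if row.get(label_key) is not None
    ]


labeled_rows = [
    {"inference_id": "1aef4", "feature_1": 0.1, "labels": 1},
    {"inference_id": "2bc91", "feature_1": 0.7, "labels": None},  # not labeled yet
    {"inference_id": "3d0e2", "feature_1": 0.4, "labels": 0},
]
records = ground_truth_records(labeled_rows)
```

Passing `records` to `pandas.DataFrame(records)` yields a DataFrame with only the inference id and ground truth columns, which can then be used as `df`.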