A reference dataset is a representative sample of the data your model was trained on (or any other dataset you want to use as a baseline). Openlayer uses it in data drift tests, which compare the distribution of your live data against the reference distribution.
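To make the idea of comparing distributions concrete, here is a minimal sketch of one common drift measure, the Population Stability Index (PSI), computed with NumPy on synthetic data. It is an illustration only: the function name `population_stability_index` and the sample data are assumptions made for this example, and nothing here describes how Openlayer computes its own drift tests.

```python
import numpy as np


def population_stability_index(reference, live, n_bins=10):
    """Compare two 1-D numeric samples; larger values mean they differ more."""
    reference = np.asarray(reference, dtype=float)
    live = np.asarray(live, dtype=float)

    # Bin edges come from the reference sample so both samples share the same bins.
    edges = np.histogram_bin_edges(reference, bins=n_bins)
    # Clip live values into the reference range so outliers land in the edge bins.
    live = np.clip(live, edges[0], edges[-1])

    ref_counts, _ = np.histogram(reference, bins=edges)
    live_counts, _ = np.histogram(live, bins=edges)

    # Convert counts to proportions; the small floor avoids log(0) for empty bins.
    eps = 1e-6
    ref_pct = np.clip(ref_counts / ref_counts.sum(), eps, None)
    live_pct = np.clip(live_counts / live_counts.sum(), eps, None)

    return float(np.sum((live_pct - ref_pct) * np.log(live_pct / ref_pct)))


# Synthetic (made-up) data: a reference "Age" column and two live samples.
rng = np.random.default_rng(0)
reference_age = rng.normal(40, 10, size=5_000)
same_age = rng.normal(40, 10, size=5_000)      # same distribution as the reference
shifted_age = rng.normal(50, 10, size=5_000)   # mean shifted by 10 years

psi_same = population_stability_index(reference_age, same_age)
psi_shifted = population_stability_index(reference_age, shifted_age)
print(f"PSI, no drift: {psi_same:.4f}")
print(f"PSI, shifted:  {psi_shifted:.4f}")
```

The shifted sample should score noticeably higher than the unshifted one, which is the kind of signal a drift test looks for when it compares live data against a reference dataset.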

How to upload a reference dataset

You can upload a reference dataset to your inference pipeline with the Python SDK.


1

Load your dataset into a DataFrame

Your dataset should be a pandas DataFrame whose column names match the ones you will declare in the dataset config. Here’s a minimal example with a single row:
Python
import pandas as pd

df = pd.DataFrame(
    {
        "CreditScore": [600],
        "Geography": ["France"],
        "Gender": ["Male"],
        "Age": [40],
        "Tenure": [5],
        "Balance": [100000],
        "NumOfProducts": [1],
        "HasCrCard": [1],
        "IsActiveMember": [1],
        "EstimatedSalary": [50000],
        "AggregateRate": [0.5],
        "Year": [2020],
        "Exited": [0],
    }
)
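In practice, a reference dataset has many rows rather than one. As a rough sketch, assuming your training data is already available as a DataFrame, you could draw a reproducible random sample from it with pandas. The `training_df` contents and the sample size below are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for your full training set; in practice you would
# load your own data, for example with pd.read_csv on your own file.
training_df = pd.DataFrame(
    {
        "CreditScore": [600, 720, 580, 690, 810, 640],
        "Geography": ["France", "Spain", "Germany", "France", "Spain", "Germany"],
        "Age": [40, 35, 52, 29, 44, 61],
        "Exited": [0, 0, 1, 0, 0, 1],
    }
)

# Draw a reproducible random sample to serve as the reference dataset.
# random_state fixes the seed so the same rows are selected on every run.
reference_df = training_df.sample(n=4, random_state=42).reset_index(drop=True)
print(reference_df.shape)
```

A sample drawn this way keeps all the original columns, so it can be described by the same dataset config as the full training set.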

2

Define the dataset configuration

The dataset config is a dictionary containing information that helps Openlayer understand your data. For example, the dataset above comes from a tabular classification task, so its config specifies the feature names, the class names, the label column, and which features are categorical:
Python
from openlayer.types.inference_pipelines import data_stream_params

# You can replace with `ConfigTabularRegressionData`, `ConfigTextClassificationData`
# or `ConfigTabularLlmData`, according to your task type
config = data_stream_params.ConfigTabularClassificationData(
    categorical_feature_names=["Gender", "Geography"],
    class_names=["Retained", "Exited"],  # indexed by label value: 0 -> "Retained", 1 -> "Exited"
    feature_names=[
        "CreditScore",
        "Geography",
        "Gender",
        "Age",
        "Tenure",
        "Balance",
        "NumOfProducts",
        "HasCrCard",
        "IsActiveMember",
        "EstimatedSalary",
        "AggregateRate",
        "Year",
    ],
    label_column_name="Exited",
)
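Because the config refers to columns by name, a typo in a feature name or label column is easy to miss. The helper below is a hypothetical sanity check (it is not part of the Openlayer SDK) that lists any configured column names absent from the DataFrame. It takes plain lists so it can be used with `df.columns` and the same values passed to the config.

```python
def find_missing_columns(df_columns, feature_names, label_column_name):
    """Return the configured column names that are absent from the DataFrame."""
    expected = list(feature_names) + [label_column_name]
    available = set(df_columns)
    return [name for name in expected if name not in available]


# Example: "Gender" is declared as a feature but is not a column in the data.
missing = find_missing_columns(
    df_columns=["CreditScore", "Geography", "Exited"],
    feature_names=["CreditScore", "Geography", "Gender"],
    label_column_name="Exited",
)
print(missing)  # ['Gender']
```

An empty list means every feature and the label column were found in the data.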
3

Upload the dataset to Openlayer

Now, you can upload your reference dataset alongside its config to Openlayer:
Python
from openlayer import Openlayer
from openlayer.lib import data

data.upload_reference_dataframe(
    client=Openlayer(api_key="YOUR_OPENLAYER_API_KEY_HERE"),
    inference_pipeline_id="YOUR_INFERENCE_PIPELINE_ID_HERE",
    dataset_df=df,
    config=config,
)
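Hardcoding credentials in a script makes them easy to leak. One option, sketched below, is to read them from environment variables before uploading. The helper `read_upload_settings` is hypothetical, and the variable names `OPENLAYER_API_KEY` and `OPENLAYER_INFERENCE_PIPELINE_ID` are assumptions chosen for this example, so adjust them to whatever your environment uses.

```python
import os


def read_upload_settings(env=None):
    """Read the API key and pipeline ID from environment variables.

    The variable names are assumptions made for this sketch.
    """
    env = os.environ if env is None else env
    settings = {}
    for key in ("OPENLAYER_API_KEY", "OPENLAYER_INFERENCE_PIPELINE_ID"):
        value = env.get(key)
        if not value:
            raise RuntimeError(f"Set the {key} environment variable before uploading.")
        settings[key] = value
    return settings


# Demonstration with a plain dict standing in for os.environ.
example_env = {
    "OPENLAYER_API_KEY": "example-key",
    "OPENLAYER_INFERENCE_PIPELINE_ID": "example-pipeline-id",
}
settings = read_upload_settings(example_env)
print(sorted(settings))
```

The returned values can then be passed as `api_key` and `inference_pipeline_id` in the upload call above, in place of the placeholder strings.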