A reference dataset is usually a representative sample of the training data used by the model. It is required to monitor data drift, because its distribution serves as the baseline against which the distribution of your published data is compared.
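
To make the idea of comparing distributions concrete, here is a minimal sketch that scores how far one numeric feature has moved away from its reference distribution. It uses the Population Stability Index purely as an illustration. The source does not say which drift measures Openlayer computes, and the function name, the bin count, and the synthetic ages are all assumptions.

```python
import numpy as np


def population_stability_index(reference, production, bins=10):
    """Illustrative drift score: 0 means identical binned distributions."""
    reference = np.asarray(reference, dtype=float)
    production = np.asarray(production, dtype=float)

    # Bin edges come from the reference data, which acts as the baseline.
    edges = np.histogram_bin_edges(reference, bins=bins)
    ref_counts, _ = np.histogram(reference, bins=edges)

    # Production values outside the reference range fall into the edge bins.
    clipped = np.clip(production, edges[0], edges[-1])
    prod_counts, _ = np.histogram(clipped, bins=edges)

    # A small floor avoids division by zero and log(0) for empty bins.
    eps = 1e-6
    ref_frac = np.clip(ref_counts / ref_counts.sum(), eps, None)
    prod_frac = np.clip(prod_counts / prod_counts.sum(), eps, None)
    return float(np.sum((prod_frac - ref_frac) * np.log(prod_frac / ref_frac)))


# Synthetic example values (not taken from any real dataset).
rng = np.random.default_rng(0)
reference_ages = rng.normal(40, 10, size=1000)
shifted_ages = rng.normal(55, 10, size=1000)

psi_same = population_stability_index(reference_ages, reference_ages)
psi_shifted = population_stability_index(reference_ages, shifted_ages)
print(f"PSI vs. itself: {psi_same:.3f}, PSI vs. shifted data: {psi_shifted:.3f}")
```

A larger score means the published data looks less like the reference data for that feature.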

How to upload a reference dataset

You can upload a reference dataset to your inference pipeline on Openlayer with the Python SDK.


1. Load your dataset as a pandas DataFrame

Let’s say that your reference dataset looks like the one below. For simplicity, we show a single row.

Python
import pandas as pd

df = pd.DataFrame(
    {
        "CreditScore": [600],
        "Geography": ["France"],
        "Gender": ["Male"],
        "Age": [40],
        "Tenure": [5],
        "Balance": [100000],
        "NumOfProducts": [1],
        "HasCrCard": [1],
        "IsActiveMember": [1],
        "EstimatedSalary": [50000],
        "AggregateRate": [0.5],
        "Year": [2020],
        "Exited": [0],
    }
)
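
In practice, the reference dataset is usually drawn from a larger training set rather than typed by hand. The sketch below shows one way to take a reproducible random sample with pandas. The `training_df` contents, the sample size, and the seed are hypothetical values chosen for illustration.

```python
import pandas as pd

# Hypothetical training data; in a real project this would be loaded from
# wherever your training set is stored.
training_df = pd.DataFrame(
    {
        "CreditScore": [600, 720, 580, 810, 650, 700],
        "Geography": ["France", "Spain", "Germany", "France", "Spain", "France"],
        "Exited": [0, 0, 1, 0, 1, 0],
    }
)

# Draw a fixed-size random sample. A fixed random_state makes the sample
# reproducible, so the same reference dataset can be rebuilt later.
reference_df = training_df.sample(n=4, random_state=42).reset_index(drop=True)

print(reference_df)
```

Whether a plain random sample is representative enough depends on your data. For a heavily imbalanced label, a stratified sample may be a better choice.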

2. Prepare the dataset configuration

The dataset config is a dictionary containing information that helps Openlayer understand your data.

For example, the dataset above is from a tabular classification task, so its dataset config includes the feature names, the categorical feature names, the class names, and the label column name:

Python
from openlayer.types.inference_pipelines import data_stream_params

# Replace this with `ConfigTabularRegressionData`, `ConfigTextClassificationData`,
# or `ConfigTabularLlmData`, according to your task type
config = data_stream_params.ConfigTabularClassificationData(
    categorical_feature_names=["Gender", "Geography"],
    class_names=["Retained", "Exited"],
    feature_names=[
        "CreditScore",
        "Geography",
        "Gender",
        "Age",
        "Tenure",
        "Balance",
        "NumOfProducts",
        "HasCrCard",
        "IsActiveMember",
        "EstimatedSalary",
        "AggregateRate",
        "Year",
    ],
    label_column_name="Exited",
)
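
Because the config refers to columns by name, a typo in a feature or label name is easy to introduce. The following is a minimal sketch of a local sanity check to run before uploading. It works on a plain dictionary that uses the same keys as the config above. The helper name `find_config_problems` is hypothetical and is not part of the Openlayer SDK.

```python
import pandas as pd


def find_config_problems(df, config):
    """Return a list of human-readable mismatches between config and DataFrame."""
    problems = []
    feature_names = config.get("feature_names", [])

    # Every declared feature should exist as a DataFrame column.
    for name in feature_names:
        if name not in df.columns:
            problems.append(f"feature column missing from DataFrame: {name}")

    # Every categorical feature should also be listed as a feature.
    for name in config.get("categorical_feature_names", []):
        if name not in feature_names:
            problems.append(f"categorical feature not in feature_names: {name}")

    # The label column, if declared, should exist as a DataFrame column.
    label = config.get("label_column_name")
    if label is not None and label not in df.columns:
        problems.append(f"label column missing from DataFrame: {label}")

    return problems


# Small illustrative example using a subset of the columns shown earlier.
sample_df = pd.DataFrame({"Age": [40], "Gender": ["Male"], "Exited": [0]})
sample_config = {
    "feature_names": ["Age", "Gender"],
    "categorical_feature_names": ["Gender"],
    "label_column_name": "Exited",
}

print(find_config_problems(sample_df, sample_config))
```

An empty list means that every name in the config matches a column in the DataFrame.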

3. Upload to Openlayer

Now, you can upload your reference dataset alongside its config to Openlayer:

Python
from openlayer import Openlayer
from openlayer.lib import data

data.upload_reference_dataframe(
    client=Openlayer(api_key="YOUR_OPENLAYER_API_KEY_HERE"),
    inference_pipeline_id="YOUR_INFERENCE_PIPELINE_ID_HERE",
    dataset_df=df,
    config=config,
)
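
Hardcoding credentials in a script makes them easy to leak. A common alternative is to read them from environment variables, as in the minimal sketch below. The helper `read_required_env` and the variable names `OPENLAYER_API_KEY` and `OPENLAYER_INFERENCE_PIPELINE_ID` are assumptions made for illustration, so adjust them to whatever your environment defines.

```python
import os


def read_required_env(name):
    """Return the value of an environment variable, or fail with a clear error."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


# Hypothetical usage with the upload call shown above:
#
#   data.upload_reference_dataframe(
#       client=Openlayer(api_key=read_required_env("OPENLAYER_API_KEY")),
#       inference_pipeline_id=read_required_env("OPENLAYER_INFERENCE_PIPELINE_ID"),
#       dataset_df=df,
#       config=config,
#   )
```

Failing early with an explicit message is usually easier to debug than an authentication error returned by the upload call.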