A reference dataset is an optional component of a monitoring setup. However, it is required if you want to use any of the drift tests (e.g., feature drift or label drift).

Ideally, the reference dataset is a representative sample of the training set used by the deployed model.
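
One way to obtain such a sample, assuming the training set fits in memory as a pandas dataframe, is to sample each label class at the same rate so that the class proportions of the training set are preserved. The sketch below is only an illustration: the helper sample_reference and the column names age and churned are hypothetical and not part of Openlayer's SDK.

```python
import pandas as pd


def sample_reference(training_df, label_col, frac, seed=42):
    """Draw a stratified sample: the same fraction of rows from every label class."""
    return (
        training_df.groupby(label_col)
        .sample(frac=frac, random_state=seed)
        .reset_index(drop=True)
    )


# Toy training set (made-up data): 80 rows of class 0 and 20 rows of class 1.
training_df = pd.DataFrame(
    {
        "age": list(range(100)),
        "churned": [0] * 80 + [1] * 20,
    }
)

# Keep 10% of each class, so the 80/20 split carries over to the sample.
df = sample_reference(training_df, label_col="churned", frac=0.1)
```

Fixing the random seed keeps the sample reproducible, which helps if the reference dataset ever needs to be regenerated.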

Reference datasets are uploaded to an inference pipeline with the upload_reference_dataframe or upload_reference_dataset methods from Openlayer’s Python SDK. The former is used if the dataset is loaded into memory as a pandas dataframe, while the latter is used if the dataset is saved to disk as a CSV file.

Here, we show the use of upload_reference_dataframe, but the process is similar for upload_reference_dataset.

A reference dataset is uploaded with:

Python
inference_pipeline.upload_reference_dataframe(
    dataset_df=df,
    dataset_config=config,
)

where config is a Python dictionary with information about the dataset. The items in this dictionary depend on the task type. Refer to the Dataset config guides for details.
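
As a rough illustration, a config for a tabular classification task might resemble the sketch below. The key names (classNames, labelColumnName, featureNames, categoricalFeatureNames, predictionsColumnName) and all column names are assumptions made for this example; the Dataset config guides remain the authoritative source. The helpers missing_columns and upload_reference_csv are hypothetical, and the file_path parameter name passed to upload_reference_dataset is assumed rather than confirmed on this page.

```python
import pandas as pd

# Toy reference data; the column names are made up for illustration.
df = pd.DataFrame(
    {
        "age": [34, 51, 27],
        "plan": ["basic", "pro", "basic"],
        "churned": [0, 1, 0],
        "prediction": [0, 1, 1],
    }
)

# Assumed keys for a tabular classification task (check the Dataset config guides).
config = {
    "classNames": ["retained", "churned"],
    "labelColumnName": "churned",
    "featureNames": ["age", "plan"],
    "categoricalFeatureNames": ["plan"],
    "predictionsColumnName": "prediction",
}


def missing_columns(dataset_df, dataset_config):
    """Return the columns the config refers to that the dataframe lacks."""
    referenced = set(dataset_config["featureNames"])
    referenced.add(dataset_config["labelColumnName"])
    referenced.add(dataset_config["predictionsColumnName"])
    return sorted(referenced - set(dataset_df.columns))


def upload_reference_csv(inference_pipeline, dataset_df, dataset_config, csv_path):
    """Hypothetical helper: save the dataframe to disk, then upload the CSV file."""
    dataset_df.to_csv(csv_path, index=False)
    inference_pipeline.upload_reference_dataset(
        file_path=csv_path,  # assumed parameter name
        dataset_config=dataset_config,
    )
```

Running missing_columns(df, config) before uploading is a cheap way to catch a config that refers to a column the dataframe does not contain.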