This guide explains how to use datasets available in your Databricks environment with Openlayer, working from a Databricks IPython kernel.

Convert a Spark dataframe to a pandas dataframe

The Databricks IPython kernel is an environment for interacting with a Spark cluster. Therefore, the only assumption this guide makes is that your datasets can be read as Spark dataframes.

Openlayer currently accepts datasets in two formats: pandas dataframes and CSV files. Consequently, the first step is to ensure that the data you wish to use is in one of these formats.

Databricks uses Delta Lake for tables by default. To read a table and convert it to a pandas dataframe, you can use the code below:

import pandas as pd

# Read as a Spark df
spark_df = spark.read.table("<catalog_name>.<schema_name>.<table_name>")

# Convert to a pandas df
pandas_df = spark_df.toPandas()

Alternatively, if your dataset is saved as a file in your Databricks environment, you can read it and convert it to a pandas dataframe with the following code:

import pandas as pd

# Read as a Spark df
spark_df = (
    spark.read
        .format("parquet")  # Change according to your file format
        .option("header", "true")
        .option("inferSchema", "true")
        .load("/databricks-datasets/path/to/your/dataset/file")
)

# Convert to a pandas df
pandas_df = spark_df.toPandas()
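
Since Openlayer also accepts CSV files, you can alternatively write the resulting pandas dataframe to a CSV file and upload that instead. The sketch below uses an illustrative DBFS path; adjust it to wherever you want the file to live.

# Optionally, write the pandas df to a CSV file (the path below is illustrative)
pandas_df.to_csv("/dbfs/tmp/dataset.csv", index=False)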

Upload to Openlayer

With the dataset as a pandas dataframe, you can use the add_dataframe method from Openlayer's Python API to upload it to the platform. From here, follow our How to upload datasets and models for development guide, and refer to the API reference for details on the add_dataframe method.
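
As a rough sketch of what that upload can look like, the snippet below assumes you have an Openlayer API key and a tabular classification project; the project name, task type, and config keys shown are illustrative assumptions, so check the API reference for the exact parameters add_dataframe expects for your task type.

import openlayer
from openlayer.tasks import TaskType

# Authenticate with your Openlayer API key (placeholder value)
client = openlayer.OpenlayerClient("YOUR_API_KEY_HERE")

# Create or load the project the dataset belongs to
# (project name and task type below are illustrative)
project = client.create_or_load_project(
    name="My Databricks project",
    task_type=TaskType.TabularClassification,
)

# Upload the pandas dataframe from the previous step
# (config keys are illustrative; see the API reference for the required schema)
project.add_dataframe(
    dataset_df=pandas_df,
    dataset_config={
        "classNames": ["0", "1"],
        "labelColumnName": "label",
        "label": "validation",
    },
)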