This integration guide explains how to use datasets available in a Databricks IPython kernel with Openlayer.
Converting a Spark dataframe to a pandas dataframe
The Databricks IPython kernel is an environment used to interact with a Spark cluster. Therefore, the only assumption made by this integration guide is that your datasets can be read as Spark dataframes.
Openlayer currently accepts datasets in two formats: pandas dataframes and CSV files. Consequently, the first step is to ensure that the data you wish to use is in one of these formats.
Databricks uses Delta Lake for tables by default. Therefore, to read a table and convert it to a pandas dataframe, you can use the code below:
```python
import pandas as pd

# Read as a Spark df
spark_df = spark.read.table("<catalog_name>.<schema_name>.<table_name>")

# Convert to a pandas df
pandas_df = spark_df.toPandas()
```
Alternatively, if your dataset is saved in your Databricks environment as a file, you can read it and convert it to a pandas dataframe using the code:
```python
import pandas as pd

# Read as a Spark df
spark_df = (
    spark.read
    .format("parquet")  # Change according to your file format
    .option("header", "true")
    .option("inferSchema", "true")
    .load("/databricks-datasets/path/to/your/dataset/file")
)

# Convert to a pandas df
pandas_df = spark_df.toPandas()
```
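Since Openlayer also accepts CSV files, a dataframe obtained either way can alternatively be written to disk as a CSV. A minimal sketch, using a small in-memory dataframe standing in for the result of `spark_df.toPandas()` (the column names and file path are illustrative):

```python
import os
import tempfile

import pandas as pd

# A small dataframe standing in for the result of spark_df.toPandas()
pandas_df = pd.DataFrame({"feature": [1, 2, 3], "label": ["a", "b", "c"]})

# Write it as a CSV file, omitting the pandas index column
csv_path = os.path.join(tempfile.gettempdir(), "dataset.csv")
pandas_df.to_csv(csv_path, index=False)

# Reading the file back yields the same data
round_trip = pd.read_csv(csv_path)
```

Passing `index=False` keeps the pandas row index out of the file, so the CSV contains only the dataset's own columns.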
Uploading to Openlayer
With the dataset as a pandas dataframe, you can use Openlayer’s Python API to upload it to the platform. Follow our how-to guide on the upload process, and refer to the API reference for details on the add_dataframe method.
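Putting the steps together, an upload might look like the sketch below. The client entry points and config fields shown (`OpenlayerClient`, `load_project`, and the `dataset_config` keys) are assumptions based on typical usage of the client, not a definitive implementation — consult the API reference for the exact signatures.

```python
import openlayer  # Openlayer's Python client

# Placeholder credentials and project name -- replace with your own
client = openlayer.OpenlayerClient("YOUR_API_KEY")
project = client.load_project(name="My project")

# Upload the pandas dataframe produced in the previous step.
# The config keys below are illustrative; see the API reference
# for the fields required by your task type.
project.add_dataframe(
    dataset_df=pandas_df,
    dataset_config={
        "label": "training",  # which split this dataframe represents
    },
)
```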