The Databricks IPython kernel is an environment used to interact with a Spark cluster. Therefore, the only assumption made by this integration guide is that your datasets can be read as Spark dataframes.

Openlayer currently accepts datasets in two formats: pandas dataframes and CSV files. Consequently, the first step is to ensure that the data you wish to use is in one of these formats.

Databricks stores tables in Delta Lake format by default. To read a table and convert it to a pandas dataframe, you can use the code below:
```python
import pandas as pd

# Read as a Spark df
spark_df = spark.read.table("<catalog_name>.<schema_name>.<table_name>")

# Convert to a pandas df
pandas_df = spark_df.toPandas()
```
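Since Openlayer also accepts CSV files, you can write the pandas dataframe to a CSV file instead. Below is a minimal sketch using pandas' `to_csv` method; the output path is a placeholder you should adapt to your workspace:

```python
# Write the pandas df to a CSV file
# (the path below is a placeholder -- adjust it to your environment)
pandas_df.to_csv("/tmp/my_dataset.csv", index=False)
```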
Alternatively, if your dataset is saved in your Databricks environment as a file, you can read it and convert it to a pandas dataframe with the following code:
```python
import pandas as pd

# Read as a Spark df
spark_df = (
    spark.read
    .format("parquet")  # Change according to your file format
    .option("header", "true")
    .option("inferSchema", "true")
    .load("/databricks-datasets/path/to/your/dataset/file")
)

# Convert to a pandas df
pandas_df = spark_df.toPandas()
```
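Keep in mind that `toPandas()` collects the entire dataset into the driver's memory. If your dataset is large, consider downsampling it in Spark before converting. Below is a minimal sketch; the 10% fraction and the seed are arbitrary example values:

```python
# Downsample in Spark before converting, to avoid exhausting driver memory
# (the fraction and seed below are arbitrary examples -- tune them to your needs)
sampled_df = spark_df.sample(fraction=0.1, seed=42)
pandas_df = sampled_df.toPandas()
```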