This guide explains how to use datasets stored in an Amazon S3 bucket with Openlayer.

Openlayer currently accepts datasets in two formats: pandas dataframes and CSV files. Consequently, the first step is to ensure that the data you wish to use is in one of these formats.

Pull a dataset from S3 into a pandas dataframe

This is the recommended option if your dataset fits in memory as a pandas dataframe. To retrieve your data from S3 and load it into a pandas dataframe, use the following code:

import boto3
import pandas as pd

# The AWS profile that has access to the S3 bucket
AWS_PROFILE = "your_profile"

# Information about the location of the dataset in the S3 bucket
S3_BUCKET = "bucket_name"
S3_KEY = "path/to/dataset.csv"


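# Create a session for the chosen profile and an S3 client from it
session = boto3.session.Session(
    profile_name=AWS_PROFILE
)
s3 = session.client("s3")

# Fetch the object; the "Body" of the response is a file-like stream
s3_data = s3.get_object(
    Bucket=S3_BUCKET,
    Key=S3_KEY
)

# Parse the CSV contents directly from the response body
df = pd.read_csv(s3_data["Body"])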

With the dataset loaded as a pandas dataframe, you can upload it to the platform using the add_dataframe method of Openlayer’s Python API. For a walkthrough, follow our How to upload datasets and models for development guide; for details on the add_dataframe method, refer to the API reference.
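
The code in this section takes the bucket name and the object key as two separate values. If your dataset location is instead written as a single s3:// URI, it can be split into those two parts first. The following is a minimal sketch using only string operations; split_s3_uri is a hypothetical helper name, not part of boto3 or Openlayer.

```python
from typing import Tuple


def split_s3_uri(uri: str) -> Tuple[str, str]:
    """Split an "s3://bucket/key" URI into (bucket, key).

    Hypothetical helper for illustration; not part of boto3 or Openlayer.
    """
    prefix = "s3://"
    if not uri.startswith(prefix):
        raise ValueError(f"Not an S3 URI: {uri!r}")

    # Everything up to the first "/" is the bucket; the rest is the key
    bucket, _, key = uri[len(prefix):].partition("/")
    if not bucket or not key:
        raise ValueError(f"S3 URI must contain a bucket and a key: {uri!r}")
    return bucket, key


S3_BUCKET, S3_KEY = split_s3_uri("s3://bucket_name/path/to/dataset.csv")
```

The resulting S3_BUCKET and S3_KEY can then be passed as the Bucket and Key arguments in the same way as the hard-coded values.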

Pull a dataset from S3 into a CSV file

This is the recommended option if you prefer to save your dataset to disk instead of loading it into memory as in the previous section. To retrieve your data from S3 and save it to disk, use the following code:

import boto3

# The AWS profile that has access to the S3 bucket
AWS_PROFILE = "your_profile"

# Information about the location of the dataset in the S3 bucket
S3_BUCKET = "bucket_name"
S3_KEY = "path/to/dataset.csv"

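# Local path where the downloaded file will be written
OUTPUT_FILE = "dataset.csv"

# Create a session for the chosen profile and an S3 client from it
session = boto3.session.Session(
    profile_name=AWS_PROFILE
)
s3 = session.client("s3")

# Download the object straight to disk without loading it into memory
s3.download_file(
    Bucket=S3_BUCKET,
    Key=S3_KEY,
    Filename=OUTPUT_FILE
)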

Upload to Openlayer

With the dataset saved as a CSV file, you can upload it to the platform using the add_dataset method of Openlayer’s Python API. For a walkthrough, follow our How to upload datasets and models for development guide; for details on the add_dataset method, refer to the API reference.
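
Before uploading, it can be worth confirming that the download produced a readable CSV with the columns you expect. The following is a minimal sketch using only the standard library, so the file is streamed row by row rather than loaded into memory. It assumes the file is UTF-8 encoded and that its first row is a header; summarize_csv is a hypothetical helper name, not part of Openlayer's API.

```python
import csv
from typing import List, Tuple


def summarize_csv(path: str) -> Tuple[List[str], int]:
    """Return (column_names, data_row_count) for the CSV file at `path`.

    Hypothetical helper for illustration. Assumes UTF-8 encoding and
    that the first row of the file is a header.
    """
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader, None)
        if header is None:
            raise ValueError(f"{path} is empty")
        # Count the remaining rows one at a time, skipping blank lines
        row_count = sum(1 for row in reader if row)
    return header, row_count
```

For example, calling summarize_csv(OUTPUT_FILE) after the download returns the column names and the number of data rows, which you can compare against what you expect the dataset to contain.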