Creating your first project

It's time to give your models and datasets a new home: Openlayer

Now you are ready for the fun part!

In this tutorial, we will explore the problem of churn classification using Openlayer.

Let’s say that we have an online platform with lots of active users. We know for a fact that some users love our platform and intend to continue using it indefinitely. However, after some time, other users leave our platform never to come back, i.e., they churn.

The idea is that by observing some of the users’ characteristics, such as age, gender, and geography, we can train an ML model that predicts whether a given user will be retained or will exit. This binary classifier can be quite useful for different teams inside our organization. Hopefully, if our model is good enough, we can take specific actions in time to retain users who were likely to churn and keep enjoying a healthy growth rate.

As a data scientist or ML engineer, it’s all in your hands now.

Let’s train a model to see what happens.

Training the model

To make your life easier, here is the link to a Colab notebook where you have everything you’ll need to follow this tutorial.

We took the liberty of writing all the code that loads the dataset, applies one-hot encoding to the categorical features, splits the dataset into training and validation sets, and trains a gradient boosting classifier (our model of choice). We added comments in the notebook to guide you through this process.
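
For reference, a highly condensed sketch of those setup cells might look like the following. The file path, column names, and the use of pd.get_dummies are assumptions for illustration; the notebook itself (which fits and keeps its own one-hot encoders so they can be reused at prediction time) is the source of truth.

import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Illustrative sketch only -- the file path and column names are assumptions
data = pd.read_csv("churn_dataset.csv")

label_column_name = "Exited"
class_names = ["Retained", "Exited"]
categorical_feature_names = ["Geography", "Gender"]
feature_names = [col for col in data.columns if col != label_column_name]

# Split into training and validation sets
training_set, validation_set = train_test_split(data, test_size=0.2, random_state=42)

# One-hot encode the categorical features (the notebook uses its own helper and
# keeps the fitted encoders so the same transformation can be reused later)
x_train = pd.get_dummies(training_set[feature_names], columns=categorical_feature_names)
y_train = training_set[label_column_name]

# Train the gradient boosting classifier
sklearn_model = GradientBoostingClassifier()
sklearn_model.fit(x_train, y_train)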

Running the notebook cells

Please run the notebook cells up to the point where we evaluate the model’s performance on the validation set. How is our model doing? Do you see the accuracy and F1?
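
That evaluation cell boils down to something like the sketch below, where x_val and y_val stand for the encoded validation features and labels the notebook produces (those names are assumptions; use whatever the notebook defines).

from sklearn.metrics import accuracy_score, f1_score

# x_val / y_val: encoded validation features and labels from the notebook's cells
y_pred = sklearn_model.predict(x_val)
print("Accuracy:", accuracy_score(y_val, y_pred))
print("F1:", f1_score(y_val, y_pred))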

Despite their popularity, aggregate metrics, such as accuracy, can be very misleading. They are good first metrics to look at, but they do little to help answer questions such as:

  • How does our model perform for different user groups? For example, what’s the performance for users aged 25-35? What about for users from different countries?
  • Are there common errors our model is making that could be easily fixed if we had a little bit more data?
  • Are there biases hidden in our model?
  • Why is our model predicting a user will churn? Is it doing something reasonable or simply over-indexing to certain features?

The list of questions we can ask is virtually infinite, and staring at the accuracy won’t get us very far. Furthermore, from a business perspective, the answers to these questions can be very relevant, so you need to be confident that your model is coherent enough to answer them.
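
To get a feel for what answering even the first question takes by hand, here is a small sketch of a single slice. It assumes the model's predicted classes have been added to validation_set as a prediction column, which is an illustration-only assumption; now imagine repeating this for every segment, error pattern, and bias check.

from sklearn.metrics import accuracy_score

# Assumes validation_set has an "Age" column and a "prediction" column holding
# the model's predicted classes (both assumptions for illustration)
segment = validation_set[validation_set["Age"].between(25, 35)]
print("Accuracy (ages 25-35):", accuracy_score(segment["Exited"], segment["prediction"]))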

The only way to start getting the answers we need before we ship the churn model is by systematically conducting error analysis.

The first step is giving the model and the validation set a new home: the Openlayer platform. To create a project and upload models & datasets to Openlayer, you are going to use our Python API. You will be modifying the notebook we provided to call the API and auto-magically load and deploy the dataset and the model.

Instantiating the client

First of all, when you call our API, it is critical that we know who is calling us, so that we can upload the model and dataset to the correct Openlayer account.

Therefore, before interacting with the Openlayer platform, you need to instantiate the client with your API key.

Instantiating the client

Create a new cell in the notebook we provided, right after the model evaluation part. In that cell, we will instantiate the Openlayer client, replacing YOUR_API_KEY_HERE with your API key.

import openlayer

client = openlayer.OpenlayerClient('YOUR_API_KEY_HERE')

If you don’t know what your API key is, or if you get a ModuleNotFoundError when trying to import openlayer, check out the installation part of the tutorial and verify that openlayer is installed successfully.

Creating a project

Now, it's time to create your first project.

A project is a logical unit that bundles models and datasets together. It is inside a project that you are able to deeply evaluate how a model behaves on a particular dataset, track their versions, create tests, and much more. In summary, that's where the bulk of error analysis happens!

To create a new project on the platform to organize our exploration of churn prediction, you can use the client's create_project method. As arguments, you need to specify the name of the project, the type of ML task (in this case, tabular classification), and optionally a short description.

Creating a project

Create a new cell in the notebook we provided, right after the client instantiation cell. In that cell, we will create a new project on the platform by calling the client's create_project method.

from openlayer.tasks import TaskType

project = client.create_project(name="Churn prediction",
                                task_type=TaskType.TabularClassification,
                                description="Evaluation of ML approaches to predict churn")

🚧

Note about project names

The project names need to be unique within a user's account. Therefore, if you try to create another project named "Churn prediction" using the same API key, you will receive an error.

In the future, if you'd like to retrieve the "Churn prediction" project to upload new models and datasets, you will likely use the client's load_project method, passing "Churn prediction" as the name argument. Refer to the API reference for further details.
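
For instance, retrieving the project later would look roughly like this:

# Retrieve an existing project by name instead of creating it again
project = client.load_project(name="Churn prediction")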

Uploading a dataset

Now that the project is created, we can start populating it with models and datasets. We are going to first upload our validation set to the project.

To upload datasets to Openlayer, there are two main methods at our disposal: add_dataset and add_dataframe. Both do essentially the same thing, but you would use add_dataset if your validation set is saved as a CSV file and add_dataframe if you have already loaded your dataset as a pandas dataframe. Refer to the API reference for all the details.

In our example, the validation set is already loaded on the notebook as a single pandas dataframe, so we will use the latter method.

Uploading a dataset

Create a new cell in the notebook we provided, right after the project creation. In that cell, we will upload the validation set to the project by calling the project's add_dataframe method.

dataset = project.add_dataframe(
    df=validation_set,
    commit_message='churn validation set for October',
    class_names=class_names,
    label_column_name='Exited',
    feature_names=feature_names,
    categorical_feature_names=categorical_feature_names,
)

For a complete description of the arguments as well as the other optional arguments you can pass to the add_dataframe method, check our API reference page.
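
For reference, if your validation set had been saved as a CSV file instead, the equivalent add_dataset call would look roughly like the sketch below. We are assuming here that its arguments mirror add_dataframe's, with a file path in place of the dataframe; check the API reference for the exact signature.

# Hypothetical sketch -- argument names assumed to mirror add_dataframe
dataset = project.add_dataset(
    file_path="validation_set.csv",
    commit_message='churn validation set for October',
    class_names=class_names,
    label_column_name='Exited',
    feature_names=feature_names,
    categorical_feature_names=categorical_feature_names,
)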

Uploading a model

Finally, let’s briefly talk about uploading the model.

The gradient boosting classifier we trained in the notebook is a scikit-learn model. Currently, we support models from the following frameworks:

🛠️

Reach out

Frameworks we currently support: TensorFlow, scikit-learn, PyTorch, HuggingFace, FastText, Rasa, and XGBoost.
Let us know if you use a different framework!

To be able to upload our model to Openlayer, we first need to package it into a predict_proba function that adheres to the following signature:

import numpy as np


def predict_proba(model, input_features: np.ndarray, **kwargs):
    # Optional pre-processing of input_features
    preds = model.predict_proba(input_features)
    # Optional re-weighting of preds
    return preds

That is, the function needs to receive the model object and the model’s inputs as arguments, and it should output an array-like of class probabilities.

For scikit-learn models, this is basically a wrapper around the predict_proba method available for most models.

Therefore, in our case, the predict_proba function simply looks like this:

import numpy as np
import pandas as pd


def predict_proba(model, input_features: np.ndarray, col_names: list, one_hot_encoder, encoders):
    # Pre-processing the categorical features
    df = pd.DataFrame(input_features, columns=col_names)
    encoded_df = one_hot_encoder(df, encoders)

    # Getting the model's predictions
    preds = model.predict_proba(encoded_df.to_numpy())

    return preds
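
Before uploading, it can be worth sanity-checking the wrapper locally on a handful of rows. A minimal sketch, assuming the notebook's variable names (validation_set, feature_names, data_encode_one_hot, and encoders):

# Run the wrapper on a few validation rows and check the output shape
sample = validation_set[feature_names].head(5).to_numpy()
probs = predict_proba(sklearn_model, sample, feature_names, data_encode_one_hot, encoders)
print(probs)  # expect a (5, 2) array-like of class probabilities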

Now that we have our model’s predict_proba function, we are ready to upload it to our project. The model upload is done with the project's add_model method.

Uploading a model

Create two new cells in the notebook we provided, right after the dataset upload. First, define the model's predict_proba function, as we did above. Then, in the second cell, we will upload the gradient boosting classifier to the project by calling the project's add_model method.

from openlayer.models import ModelType

model = project.add_model(
    function=predict_proba, 
    model=sklearn_model,
    model_type=ModelType.sklearn,
    class_names=class_names,
    name='Churn Classifier',
    commit_message='this is my churn classification model',
    feature_names=feature_names,
    train_sample_df=training_set[:3000],
    train_sample_label_column_name='Exited',
    categorical_feature_names=categorical_feature_names,
    requirements_txt_file='requirements.txt',
    col_names=feature_names,
    one_hot_encoder=data_encode_one_hot,
    encoders=encoders,
)

For a complete reference on the add_model method, check our API reference page.

Verifying the upload

After following the previous steps, if you log in to Openlayer, you should be able to see your newly created project with the model and the dataset that you just uploaded!

If the project is there with your model and dataset, you are good to move on to the next part of the tutorial!

Something went wrong with the upload?

If you encountered errors while running the previous steps, here are some common issues worth double-checking:

  • check that you have installed the most recent version of openlayer. You can find out which version you have installed by opening your shell and typing:
$ pip show openlayer
  • verify that you imported ModelType and TaskType and that you are passing the correct model type and task type as arguments;
  • verify that you are passing all other arguments correctly, as in the code samples we provided.

If you need a more comprehensive reference on the API methods, feel free to check out our API reference page.