Creating your first project

It's time to give your models and datasets a new home: Openlayer

Now you are ready for the fun part!

In this tutorial, we will explore an urgent event classifier using Openlayer.

Let’s say that we are developing a personal assistant AI, which receives transcribed user inquiries and answers them accordingly. We want to develop an ML model that triages those messages. More specifically, we want a model that classifies the messages as being either Urgent or Not urgent.

For example, messages such as “Is it going to be sunny on Saturday?” and “I’m feeling happy today!” should be classified as Not urgent and receive a polite response from the personal assistant. On the other hand, messages such as “I need help, someone is breaking into my house” and “There was an earthquake, what should I do?” should be classified as Urgent and possibly routed to the authorities so that assistance can be provided.

The idea is that by putting such a model in the front line of user inquiry triaging, we can quickly separate critical messages from the rest so that special measures can be taken.

As a data scientist or ML engineer, it’s all in your hands now.

Let’s train a model to see what happens.

Training the model

To make your life easier, here is the link to a Colab notebook where you have everything you’ll need to follow this tutorial.

Our initial dataset is constructed from a few open-source datasets available on Kaggle, such as the Multilingual Disaster Response Messages, the Yahoo! Answers Topic Classification, and a few other sentiment analysis datasets.

We took the liberty of writing all the code that loads the dataset, tokenizes the messages, splits the dataset into training and validation sets, and trains a gradient boosting classifier (which is our model of choice). We added comments to the notebook to guide you through this process.
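
If it helps to have a mental model of what that notebook code does before running it, here is a rough sketch. The file name, the column names, and the TF-IDF tokenizer are assumptions for illustration only; the actual notebook code differs.

import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical CSV with a "text" column (the message) and a "label" column
# (0 = Not urgent, 1 = Urgent)
df = pd.read_csv("urgent_events.csv")

train_set, val_set = train_test_split(df, test_size=0.2, random_state=42)

# Tokenize the messages (TF-IDF here) and train the gradient boosting classifier
sklearn_model = make_pipeline(TfidfVectorizer(), GradientBoostingClassifier())
sklearn_model.fit(train_set["text"], train_set["label"])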

Running the notebook cells

Please run the notebook cells up to the point where we evaluate the model’s performance on the validation set. How is our model doing? Do you see the accuracy and the F1 score?
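
If you want to recompute those aggregate metrics by hand, something along these lines works. The sklearn_model and val_set variables, as well as the column names, are assumed from the notebook.

from sklearn.metrics import accuracy_score, f1_score

# Recompute the aggregate metrics on the validation set
val_preds = sklearn_model.predict(val_set["text"])
print("Accuracy:", accuracy_score(val_set["label"], val_preds))
print("F1:", f1_score(val_set["label"], val_preds, average="macro"))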

Despite their popularity, aggregate metrics, such as accuracy, can be very misleading. They are a good first metric to look at, but they do little to help answer questions such as:

  • How does our model perform for different groups of data?
  • Are there common errors our model is making that could be easily fixed if we had a little bit more data?
  • Are there biases hidden in our model?
  • Why is our model making predictions like this? Is it doing something reasonable or simply over-indexing to certain tokens and stopwords?

The list of questions we can ask is virtually infinite, and staring at the accuracy won’t get us very far. Furthermore, from a business perspective, the answers to these questions can be highly relevant, so you need to be confident enough in your model to answer them.
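
To make the first of those questions concrete, here is a toy, by-hand slice check (again assuming the sklearn_model and val_set variables and column names from the notebook). Doing this manually for every group, error pattern, and bias hypothesis quickly becomes unmanageable.

# Slice the validation set by message length and check accuracy per slice
for name, subset in [
    ("short messages", val_set[val_set["text"].str.len() < 50]),
    ("long messages", val_set[val_set["text"].str.len() >= 50]),
]:
    acc = (sklearn_model.predict(subset["text"]) == subset["label"]).mean()
    print(f"Accuracy on {name}: {acc:.3f}")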

The only way to start getting the answers we need before we ship the urgent event classifier is by systematically conducting error analysis.

The first step is giving the model and the validation set a new home: the Openlayer platform. To create a project and upload models & datasets to Openlayer, you are going to use our Python API. You will be modifying the notebook we provided to call the API and auto-magically upload the dataset and the model.

Instantiating the client

First of all, when you call our API, it is critical that we know who is calling us, so that we can upload the model and dataset to the correct Openlayer account.

Therefore, before interacting with the Openlayer platform, you need to instantiate the client with your API key.

👍

Instantiating the client

Create a new cell in the notebook we provided, right after the model evaluation part. In that cell, we will instantiate the Openlayer client; replace 'YOUR_API_KEY_HERE' with your API key.

import openlayer

client = openlayer.OpenlayerClient('YOUR_API_KEY_HERE')

If you don’t know what your API key is, or if you get a ModuleNotFoundError when trying to import openlayer, check out the installation part of the tutorial and verify that openlayer is installed correctly.

Creating a project

Now, it's time to create your first project.

A project is a logical unit that bundles models and datasets together. It is inside a project that you are able to deeply evaluate how a model behaves on a particular dataset, track their versions, create tests, and much more. In summary, that's where the bulk of error analysis happens!

To create a new project on the platform to organize our exploration of urgent event classification, you can use the client's create_project method. As arguments, you need to specify the name of the project, the type of ML task (in this case, text classification), and optionally a short description.

Creating a project

Create a new cell in the notebook we provided, right after the client instantiation cell. In that cell, we will create a new project on the platform by calling the client's create_project method.

from openlayer.tasks import TaskType

project = client.create_project(
    name="Urgent event classification",
    task_type=TaskType.TextClassification,
    description="Evaluation of ML approaches to classify messages",
)

🚧

Note about project names

Project names need to be unique within a user's account. Therefore, if you try to create another project named "Urgent event classification" using the same API key, you will receive an error.
In the future, if you'd like to retrieve the "Urgent event classification" project to upload new models and datasets, you can use the client's load_project method, passing "Urgent event classification" as the name argument. Refer to the API reference for further details.
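
For example, loading the project back later would look like this:

# Retrieve the existing project instead of creating it again
project = client.load_project(name="Urgent event classification")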

Uploading a dataset

Now that the project is created, we can start populating it with models and datasets. We are going to first upload our validation set to the project.

To upload datasets to Openlayer, there are two main methods at our disposal: add_dataset and add_dataframe. Both do essentially the same thing, but you would use add_dataset if your validation set is saved as a CSV file and add_dataframe if your dataset is already loaded as a pandas dataframe. Refer to the API reference for all the details.

In our example, the validation set is already loaded in the notebook as a single pandas dataframe, so we will use the latter method.

Uploading a dataset

Create a new cell in the notebook we provided, right after the project creation. In that cell, we will upload the validation set to the project by calling the project's add_dataframe method.

dataset = project.add_dataframe(
    df=val_set,
    class_names=["Not urgent", "Urgent"],
    label_column_name="label",
    text_column_name="text",
    commit_message="First commit!"  
)

For a complete description of the arguments as well as the other optional arguments you can pass to the add_dataframe method, check our API reference page.
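
As a side note, if your validation set were saved as a CSV file instead, the add_dataset call would look roughly like the sketch below. The file_path argument name is an assumption here; check the API reference for the exact signature.

# Rough sketch of the CSV-based alternative -- argument names may differ,
# so double-check the API reference
dataset = project.add_dataset(
    file_path="validation_set.csv",
    class_names=["Not urgent", "Urgent"],
    label_column_name="label",
    text_column_name="text",
    commit_message="First commit!",
)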

Uploading a model

Finally, let’s briefly talk about uploading the model.

The gradient boosting classifier we trained in the notebook is a scikit-learn model, and scikit-learn is one of the frameworks we currently support; the full list is in the note below.

🛠️

Reach out

Frameworks we currently support: Tensorflow, Scikit-learn, PyTorch, HuggingFace, FastText, Rasa, and XGBoost.

Let us know if you use a different framework!

To be able to upload our model to Openlayer, we first need to package it into a predict_proba function that adheres to the following signature:

def predict_proba(model, text_list: List[str], **kwargs):
    # Optional pre-processing of text_list
    preds = model.predict_proba(text_list)
    # Optional re-weighting of preds
    return preds

That is, the function receives the trained model and the model’s input as arguments, and it should output an array-like of class probabilities.

For scikit-learn models, this is basically a wrapper around the predict_proba method available on most estimators.

Therefore, in our case, the predict function simply looks like this:

from typing import List

def predict_proba(model, text_list: List[str]):
    # Getting the model's predictions
    preds = model.predict_proba(text_list)

    return preds
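
Before uploading, it is worth sanity-checking the function locally. The snippet below assumes that sklearn_model accepts raw text (for example, because it is a pipeline that bundles the tokenizer); it should print class probabilities.

# Quick local sanity check before uploading
sample_messages = [
    "I need help, someone is breaking into my house",
    "Is it going to be sunny on Saturday?",
]
print(predict_proba(sklearn_model, sample_messages))
# Expected: an array of shape (2, 2) with class probabilities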

Now that we have our model’s predict function, we are ready to upload it to our project. The model upload is done with the project's add_model method.

Uploading a model

Create two new cells in the notebook we provided, right after the dataset upload. First, define the model's predict function, as we did above. Then, in the second cell, we will upload the gradient boosting classifier to the project by calling the project's add_model method.

from openlayer.models import ModelType

model = project.add_model(
    function=predict_proba, 
    model=sklearn_model,
    model_type=ModelType.sklearn,
    class_names=["Not urgent", "Urgent"],
    name='Gradient boosting classifier',
    commit_message='First commit!',
    requirements_txt_file='requirements.txt'
)

For a complete reference on the add_model method, check our API reference page.

Verifying the upload

After following the previous steps, if you log in to Openlayer, you should be able to see your newly created project with the model and the dataset that you just uploaded!

If both are there, you are good to move on to the next part of the tutorial!

Something went wrong with the upload?

If you encountered errors while running the previous steps, here are some common issues worth double-checking:

  • check whether you have the most recent version of openlayer installed. You can find out which version you have by opening your shell and typing:
$ pip show openlayer
  • verify that you imported ModelType and TaskType and that you are passing the correct model type and task type as arguments;
  • verify that you are passing all the other arguments correctly, as in the code samples we provided.

If you need a more comprehensive reference on the API methods, feel free to check out our API reference page.