Now you are ready for the fun part!
In this tutorial, we will explore an urgent event classifier using Openlayer.
Let’s say that we are developing a personal assistant AI, which receives transcribed user inquiries and answers them accordingly. We want to develop an ML model that triages those messages. More specifically, we want a model that classifies the messages as either Urgent or Not urgent.
For example, messages such as “Is it going to be sunny on Saturday?” and “I’m feeling happy today!” should be classified as Not urgent and receive a polite response from the personal assistant. On the other hand, messages such as “I need help, someone is breaking into my house” and “There was an earthquake, what should I do?” should be classified as Urgent and possibly routed to the authorities so that assistance can be provided.
The idea is that by putting such a model in the front line of user inquiry triaging, we can quickly separate critical messages from the rest so that special measures can be taken.
As a data scientist or ML engineer, it’s all in your hands now.
Let’s train a model to see what happens.
To make your life easier, here is the link to a Colab notebook where you have everything you’ll need to follow this tutorial.
Our initial dataset is constructed from a few open-source datasets available on Kaggle, such as the Multilingual Disaster Response Messages, the Yahoo! Answers Topic Classification, and a few other sentiment analysis datasets.
We took the liberty of writing all the code that loads the dataset, tokenizes the messages, splits the dataset into training and validation sets, and trains a gradient boosting classifier (which is our model of choice). We added comments on the notebook to guide you throughout this process.
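As a rough sketch of what the notebook does (not its exact code), the pipeline looks something like the following. The column names, toy messages, and hyperparameters here are illustrative assumptions:

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Toy stand-in for the Kaggle-derived dataset (columns assumed: "text", "label")
df = pd.DataFrame({
    "text": [
        "Is it going to be sunny on Saturday?",
        "I'm feeling happy today!",
        "I need help, someone is breaking into my house",
        "There was an earthquake, what should I do?",
    ] * 10,
    "label": [0, 0, 1, 1] * 10,  # 0 = Not urgent, 1 = Urgent
})

# Tokenize/vectorize the messages
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(df["text"]).toarray()

# Split into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(
    X, df["label"], test_size=0.25, random_state=42
)

# Train the gradient boosting classifier
model = GradientBoostingClassifier(random_state=42)
model.fit(X_train, y_train)
print(model.score(X_val, y_val))
```

The notebook itself handles the real dataset loading and cleaning; the shape of the steps is what matters here.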
Running the notebook cells
Please run the notebook cells up to the point where we evaluate the model’s performance on the validation set. How is our model doing? Do you see the accuracy and the F1 score?
Despite their popularity, aggregate metrics, such as accuracy, can be very misleading. They are a good first metric to look at, but they help little to answer questions such as:
- How does our model perform for different groups of data?
- Are there common errors our model is making that could be easily fixed if we had a little bit more data?
- Are there biases hidden in our model?
- Why is our model making predictions like this? Is it doing something reasonable or simply over-indexing to certain tokens and stopwords?
The list of questions we can ask is virtually infinite and staring at the accuracy won’t get us very far. Furthermore, notice that from a business perspective, the answers to these questions might be very relevant, so you need to be confident that your model is coherent enough to answer them.
The only way to start getting the answers we need before we ship the urgent event classifier is by systematically conducting error analysis.
The first step is giving the model and the validation set a new home: the Openlayer platform. To create a project and upload models & datasets to Openlayer, you are going to use our Python API. You will be modifying the notebook we provided to call the API and auto-magically load and deploy the dataset and the model.
First of all, when you call our API, it is critical that we know who is calling us, so that we can upload the model and dataset to the correct Openlayer account.
Therefore, before interacting with the Openlayer platform, you need to instantiate the client with your API key.
Instantiating the client
Create a new cell on the notebook we provided, right after the model evaluation part. On that cell, we will instantiate the Openlayer client, replacing 'YOUR_API_KEY_HERE' with your API key:

```python
import openlayer

client = openlayer.OpenlayerClient('YOUR_API_KEY_HERE')
```
If you don’t know what your API key is, or if you get a ModuleNotFoundError when trying to import openlayer, check out the installation part of the tutorial and verify that openlayer is successfully installed.
Now, it's time to create your first project.
A project is a logical unit that bundles models and datasets together. It is inside a project that you are able to deeply evaluate how a model behaves on a particular dataset, track their versions, create tests, and much more. In summary, that's where the bulk of error analysis happens!
To create a new project on the platform to organize our exploration of urgent event classification, you can use the client's create_project method. As arguments, you need to specify the name of the project, the type of ML task (in this case, text classification), and optionally a short description.
Creating a project
Create a new cell on the notebook we provided, right after the client instantiation cell. On that cell, we will create a new project on the platform by calling the client's create_project method:

```python
from openlayer.tasks import TaskType

project = client.create_project(
    name="Urgent event classification",
    task_type=TaskType.TextClassification,
    description="Evaluation of ML approaches to classify messages"
)
```
Note about project names
Project names need to be unique within a user's account. Therefore, if you try to create another project named "Urgent event classification" using the same API key, you will receive an error.
In the future, if you'd like to retrieve the "Urgent event classification" project to upload new models and datasets, you can use the client's load_project method, passing "Urgent event classification" as the name argument. Refer to the API reference for further details.
Now that the project is created, we can start populating it with models and datasets. We are going to first upload our validation set to the project.
To upload datasets to Openlayer, there are mainly two methods at our disposal: add_dataset and add_dataframe. Both do essentially the same thing, but you would use add_dataset if your validation set is saved as a CSV file and add_dataframe if you have already loaded your dataset as a pandas dataframe. Refer to the API reference for all the details.
In our example, the validation set is already loaded on the notebook as a single pandas dataframe, so we will use the latter method.
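For reference, here is a minimal illustration of the layout such a dataframe has in this tutorial: a text column with the raw messages and a label column with the class index (0 = "Not urgent", 1 = "Urgent"). The toy rows are made up; your actual val_set comes from the notebook:

```python
import pandas as pd

# Minimal stand-in for the validation set: one text column and one
# integer label column (0 = Not urgent, 1 = Urgent)
val_set = pd.DataFrame({
    "text": [
        "Is it going to be sunny on Saturday?",
        "I need help, someone is breaking into my house",
    ],
    "label": [0, 1],
})

print(val_set.shape)
```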
Uploading a dataset
Create a new cell on the notebook we provided, right after the project creation. On that cell, we will upload the validation set to the project by calling the project's add_dataframe method:

```python
dataset = project.add_dataframe(
    df=val_set,
    class_names=["Not urgent", "Urgent"],
    label_column_name="label",
    text_column_name="text",
    commit_message="First commit!"
)
```
For a complete description of the arguments as well as the other optional arguments you can pass to the add_dataframe method, check our API reference page.
Finally, let’s briefly talk about uploading the model.
The gradient boosting classifier we trained on the notebook is a scikit-learn model. Currently, we support models from the following frameworks: TensorFlow, scikit-learn, PyTorch, Hugging Face, fastText, Rasa, and XGBoost.
Let us know if you use a different framework!
To be able to upload our model to Openlayer, we first need to package it into a predict_proba function that adheres to the following signature:
```python
def predict_proba(model, text_list: List[str], **kwargs):
    # Optional pre-processing of text_list
    preds = model.predict_proba(text_list)
    # Optional re-weighting of preds
    return preds
```
I.e., the function needs to receive the actual trained model and the model’s input as arguments, and it should output an array-like of class probabilities. For scikit-learn models, this is basically a wrapper around the predict_proba method available for most models.
Therefore, in our case, the predict_proba function simply looks like this:

```python
from typing import List

def predict_proba(model, text_list: List[str]):
    # Getting the model's predictions
    preds = model.predict_proba(text_list)
    return preds
```
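To sanity-check the wrapper before uploading, you can call it locally on a couple of messages. The pipeline and toy training data below are a hypothetical stand-in for the notebook's trained model; the point is only that the wrapped model accepts raw text and returns one row of class probabilities per message:

```python
from typing import List

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

def predict_proba(model, text_list: List[str]):
    # Thin wrapper around scikit-learn's predict_proba
    return model.predict_proba(text_list)

# Stand-in for the notebook's trained model: a pipeline that maps raw text
# straight to class probabilities (toy training data, for illustration only)
sklearn_model = make_pipeline(
    TfidfVectorizer(),
    GradientBoostingClassifier(random_state=0),
)
sklearn_model.fit(
    [
        "is it going to be sunny on saturday",
        "i am feeling happy today",
        "someone is breaking into my house",
        "there was an earthquake what should i do",
    ],
    [0, 0, 1, 1],
)

probs = predict_proba(sklearn_model, ["will it rain", "help there is a fire"])
print(probs.shape)  # one row per message, one column per class
```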
Now that we have our model’s predict function, we are ready to upload it to our project. The model upload is done with the project's add_model method.
Uploading a model
Create two new cells on the notebook we provided, right after the dataset upload. First, define the model's predict_proba function, like we did above. Then, on the second cell, we will upload the gradient boosting classifier to the project by calling the project's add_model method:

```python
from openlayer.models import ModelType

model = project.add_model(
    function=predict_proba,
    model=sklearn_model,
    model_type=ModelType.sklearn,
    class_names=["Not urgent", "Urgent"],
    name='Gradient boosting classifier',
    commit_message='First commit!',
    requirements_txt_file='requirements.txt'
)
```
For a complete reference on the add_model method, check our API reference page.
After following the previous steps, if you log in to Openlayer, you should be able to see your newly created project with the model and the dataset that you just uploaded!
If both are there, you are good to move on to the next part of the tutorial!
If you encountered errors while running the previous steps, here are some common issues worth double-checking:
- check if you installed the most recent version of openlayer. You can find out which version you have installed by opening your shell and typing:

```shell
$ pip show openlayer
```

- verify that you imported TaskType and that you are passing the correct model type and task type as arguments;
- verify that you are passing all other arguments correctly, as in the code samples we provided.
If you need a more comprehensive reference on the API methods, feel free to check out our API reference page.