This guide explains the model evaluation metrics currently available throughout the Openlayer platform.

Classification metrics

| Metric | Description | Comments |
| --- | --- | --- |
| Accuracy | The classification accuracy, defined as the ratio of the number of correctly classified samples to the total number of samples. | |
| Precision per class | The precision score for each class, given by TP / (TP + FP). | |
| Recall per class | The recall score for each class, given by TP / (TP + FN). | |
| F1 per class | The F1 score for each class, given by 2 × (Precision × Recall) / (Precision + Recall). | |
| Precision | For binary classification, the precision considering class 1 as “positive.” For multiclass classification, the macro-average of the per-class precision scores, i.e., treating all classes equally. | |
| Recall | For binary classification, the recall considering class 1 as “positive.” For multiclass classification, the macro-average of the per-class recall scores, i.e., treating all classes equally. | |
| F1 | For binary classification, the F1 considering class 1 as “positive.” For multiclass classification, the macro-average of the per-class F1 scores, i.e., treating all classes equally. | |
| ROC AUC | The macro-average of the area under the receiver operating characteristic curve for each class, i.e., treating all classes equally. For multiclass classification tasks, uses the one-versus-one configuration. | The ROC AUC is available only if the class probabilities are uploaded with the model. This is done by specifying a predictionScoresColumnName in the dataset config (a rough sketch appears after the definitions below). Refer to the How to write dataset config guides for details. |
| False positive rate | Given by FP / (FP + TN). | The false positive rate is available only for binary classification tasks. |
| Geometric mean | The geometric mean of the precision and the recall. | |
| Log loss | A measure of the dissimilarity between the predicted probabilities and the true distribution. Also known as cross-entropy loss or, in the binary classification case, binary cross-entropy. | The log loss is available only if the class probabilities are uploaded with the model. This is done by specifying a predictionScoresColumnName in the dataset config. Refer to the How to write dataset config guides for details. |

Where:

  • TP: true positive.
  • TN: true negative.
  • FP: false positive.
  • FN: false negative.
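
For concreteness, the sketch below reproduces these metrics with scikit-learn. It is illustrative rather than Openlayer code: the toy arrays and variable names are placeholders, but the macro averages and the one-versus-one ROC AUC follow the definitions in the table above.

```python
# Illustrative only: reproducing the classification metrics above with scikit-learn.
# The toy arrays are placeholders, not Openlayer data.
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    log_loss,
    precision_recall_fscore_support,
    roc_auc_score,
)

# Toy multiclass example (3 classes); y_scores holds the per-class probabilities.
y_true = np.array([0, 1, 2, 2, 1, 0])
y_pred = np.array([0, 1, 2, 1, 1, 0])
y_scores = np.array([
    [0.8, 0.1, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.2, 0.7],
    [0.2, 0.5, 0.3],
    [0.1, 0.8, 0.1],
    [0.6, 0.3, 0.1],
])

accuracy = accuracy_score(y_true, y_pred)  # correctly classified / total

# Per-class precision, recall, and F1 (average=None keeps one value per class).
prec_per_class, rec_per_class, f1_per_class, _ = precision_recall_fscore_support(
    y_true, y_pred, average=None, zero_division=0
)

# Macro averages treat all classes equally, matching the table.
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0
)

# ROC AUC (one-versus-one, macro) and log loss both require the class
# probabilities -- the reason predictionScoresColumnName must be set.
roc_auc = roc_auc_score(y_true, y_scores, multi_class="ovo", average="macro")
cross_entropy = log_loss(y_true, y_scores)

# Geometric mean of precision and recall, as defined in the table.
geometric_mean = np.sqrt(precision * recall)

print(accuracy, precision, recall, f1, roc_auc, cross_entropy, geometric_mean)
```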

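The ROC AUC and log loss rows above rely on uploaded class probabilities. As a rough illustration, a dataset config pointing Openlayer at a probabilities column might look like the sketch below; only predictionScoresColumnName comes from the table, and everything else is a placeholder. The How to write dataset config guides describe the actual schema.

```python
# Rough sketch of a dataset config that includes class probabilities.
# Only predictionScoresColumnName is taken from the table above; the column
# name and the remaining fields are placeholders -- see the dataset config
# guides for the authoritative schema.
dataset_config = {
    # Column holding the per-class probabilities, e.g. [0.1, 0.7, 0.2] per row.
    "predictionScoresColumnName": "prediction_scores",
    # ... remaining fields (labels, class names, etc.) as described in the guides.
}
```
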
LLM metrics

| Metric | Description | Comments |
| --- | --- | --- |
| Mean BLEU | Bilingual Evaluation Understudy score. Available precisions range from unigram to 4-gram (BLEU-1, 2, 3, and 4). | |
| Mean edit distance | The minimum number of single-character insertions, deletions, or substitutions required to transform one string into another, serving as a measure of their similarity. | |
| Mean exact match | Assesses whether two strings are identical in every aspect. | |
| Mean JSON score | Measures how close the output is to a valid JSON. | |
| Mean quasi-exact match | Assesses whether two strings are similar, allowing partial matches and variations. | |
| Mean semantic similarity | Assesses the similarity in meaning between sentences by measuring their closeness in semantic space. | |
| Mean, max, and total number of tokens | Statistics on the number of tokens. | The tokenColumnName must be specified in the dataset config (see the sketch below the table). |
| Mean and max latency | Statistics on the response latency. | The latencyColumnName must be specified in the dataset config. |
| Context relevancy* | Measures how relevant the retrieved context is given the question. | Applies to RAG problems. The contextColumnName must be specified in the dataset config. |
| Answer relevancy* | Measures how relevant the answer (output) is given the question. | Applies to RAG problems. The questionColumnName must be specified in the dataset config. |
| Correctness* | Correctness of the answer. | Applies to RAG problems. The questionColumnName must be specified in the dataset config. |
| Harmfulness* | Harmfulness of the answer. | Applies to RAG problems. The questionColumnName must be specified in the dataset config. |
| Coherence* | Coherence of the answer. | Applies to RAG problems. The questionColumnName must be specified in the dataset config. |
| Conciseness* | Conciseness of the answer. | Applies to RAG problems. The questionColumnName must be specified in the dataset config. |
| Maliciousness* | Maliciousness of the answer. | Applies to RAG problems. The questionColumnName must be specified in the dataset config. |
| Context recall* | Measures the ability of the retriever to retrieve all the context necessary to answer the question. | Applies to RAG problems. The groundTruthColumnName and contextColumnName must be specified in the dataset config. |

*To access these metrics, you must have a valid OpenAI API key and specify it in the Openlayer platform. To compute them, we run the first 10 rows of your data through OpenAI’s GPT-3.5 Turbo model.
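
Several rows above require specific columns to be declared in the dataset config. A rough sketch of what that can look like follows; the keys are the column-name fields mentioned in the table, while the values are placeholders for your own column names. The How to write dataset config guides remain the reference.

```python
# Rough sketch of an LLM dataset config declaring the columns referenced above.
# The keys are the fields named in the table; the values are placeholders for
# whatever your dataset actually calls these columns.
llm_dataset_config = {
    "tokenColumnName": "num_tokens",          # enables the token-count metrics
    "latencyColumnName": "latency_ms",        # enables the latency metrics
    "questionColumnName": "question",         # RAG: answer relevancy, correctness, ...
    "contextColumnName": "context",           # RAG: context relevancy, context recall
    "groundTruthColumnName": "ground_truth",  # RAG: context recall
    # ... remaining fields as described in the dataset config guides.
}
```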

Regression metrics

| Metric | Description | Comments |
| --- | --- | --- |
| Mean squared error (MSE) | The average of the squared differences between the predicted values and the true values. | |
| Root mean squared error (RMSE) | The square root of the MSE. | |
| Mean absolute error (MAE) | The average of the absolute differences between the predicted values and the true values. | |
| R-squared | Also known as the coefficient of determination. Quantifies the proportion of the variance in the dependent variable that is predictable from the independent variables. | |
| Mean absolute percentage error (MAPE) | The average of the absolute percentage differences between the predicted values and the true values. | |
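
As with the classification metrics, these follow their standard definitions. The sketch below reproduces them with NumPy and scikit-learn for reference; it is illustrative, not Openlayer code, and the toy arrays are placeholders.

```python
# Illustrative only: the regression metrics above with NumPy / scikit-learn.
import numpy as np
from sklearn.metrics import (
    mean_absolute_error,
    mean_absolute_percentage_error,
    mean_squared_error,
    r2_score,
)

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

mse = mean_squared_error(y_true, y_pred)                # mean of squared errors
rmse = np.sqrt(mse)                                     # square root of the MSE
mae = mean_absolute_error(y_true, y_pred)               # mean of absolute errors
r2 = r2_score(y_true, y_pred)                           # coefficient of determination
mape = mean_absolute_percentage_error(y_true, y_pred)   # mean of |error| / |true value|

print(mse, rmse, mae, r2, mape)
```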