Definition

The F1 score test measures the harmonic mean of precision and recall, calculated as:
2 × (Precision × Recall) / (Precision + Recall)
For binary classification, it considers class 1 as “positive.” For multiclass classification, it uses the macro-average of the F1 score for each class, treating all classes equally.
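
For reference, here is a minimal sketch of the same calculation using scikit-learn (shown only as an illustration; the platform computes this metric for you, and the labels below are made up):

from sklearn.metrics import f1_score

y_true = [0, 1, 1, 0, 1]  # ground truth class labels
y_pred = [0, 1, 0, 0, 1]  # predicted class labels

# Binary classification: class 1 is treated as "positive"
binary_f1 = f1_score(y_true, y_pred)  # average="binary" is scikit-learn's default

# Multiclass classification: macro-average of the per-class F1 scores
macro_f1 = f1_score(y_true, y_pred, average="macro")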

Taxonomy

  • Task types: Tabular classification, text classification.

Why it matters

  • The F1 score balances precision and recall in a single number, making it well suited to situations where false positives and false negatives carry roughly equal cost.
  • It is particularly useful for imbalanced datasets, where accuracy alone can be misleading.
  • Higher F1 scores indicate better model performance, with 1.0 representing perfect precision and recall.

Required columns

To compute this metric, your dataset must contain the following columns:
  • Predictions: The predicted class labels from your classification model
  • Ground truths: The actual/true class labels
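
As a rough illustration of the expected shape, here is a sketch with pandas. The column names below are hypothetical; use the names configured for your own dataset:

import pandas as pd

# Hypothetical column names for illustration only.
df = pd.DataFrame({
    "predictions": [0, 1, 1, 0],    # predicted class labels
    "ground_truths": [0, 1, 0, 0],  # actual class labels
})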

Test configuration examples

If you are writing a tests.json, here is a valid configuration for the F1 score test:
[
  {
    "name": "F1 score above 0.8",
    "description": "Ensure that the F1 score is above 0.8",
    "type": "performance",
    "subtype": "metricThreshold",
    "thresholds": [
      {
        "insightName": "metrics",
        "insightParameters": null,
        "measurement": "f1",
        "operator": ">",
        "value": 0.8
      }
    ],
    "subpopulationFilters": null,
    "mode": "development",
    "usesValidationDataset": true,
    "usesTrainingDataset": false,
    "usesMlModel": true,
    "syncId": "b4dee7dc-4f15-48ca-a282-63e2c04e0689"
  }
]
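
To make the threshold semantics concrete, here is a rough sketch of how such a check could be evaluated locally. This is illustrative only, not the platform's implementation, and the labels and operator mapping are assumptions:

import json
import operator

from sklearn.metrics import f1_score

# Map the config's operator strings to Python comparisons (assumed subset).
OPERATORS = {">": operator.gt, ">=": operator.ge, "<": operator.lt, "<=": operator.le}

with open("tests.json") as f:
    tests = json.load(f)

y_true = [0, 1, 1, 0, 1]  # ground truth labels from the validation set (made up)
y_pred = [0, 1, 0, 0, 1]  # model predictions (made up)

for test in tests:
    for threshold in test["thresholds"]:
        if threshold["measurement"] == "f1":
            # Binary labels here, so class 1 is "positive"; use average="macro" for multiclass.
            score = f1_score(y_true, y_pred)
            passed = OPERATORS[threshold["operator"]](score, threshold["value"])
            print(f"{test['name']}: {'passed' if passed else 'failed'} (f1={score:.3f})")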