Definition

The F1 score test measures the harmonic mean of precision and recall, calculated as:
2 × (Precision × Recall) / (Precision + Recall)
For binary classification, it considers class 1 as “positive.” For multiclass classification, it uses the macro-average of the F1 score for each class, treating all classes equally.
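
For reference, here is a minimal sketch of the same calculation using scikit-learn (shown only as an illustration; the platform computes this metric for you, and the labels below are made up):

from sklearn.metrics import f1_score

y_true = [0, 1, 1, 0, 1]  # ground truth class labels
y_pred = [0, 1, 0, 0, 1]  # predicted class labels

# Binary classification: class 1 is treated as "positive"
binary_f1 = f1_score(y_true, y_pred)  # average="binary" is scikit-learn's default

# Multiclass classification: macro-average of the per-class F1 scores
macro_f1 = f1_score(y_true, y_pred, average="macro")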

Taxonomy

  • Task types: Tabular classification, text classification.

Why it matters

  • The F1 score balances precision and recall in a single number, making it well suited to situations where false positives and false negatives carry roughly equal cost.
  • It is particularly useful for imbalanced datasets, where accuracy alone can be misleading.
  • Higher F1 scores indicate better model performance, with 1.0 representing perfect precision and recall.

Required columns

To compute this metric, your dataset must contain the following columns:
  • Predictions: The predicted class labels from your classification model
  • Ground truths: The actual/true class labels
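
As a rough illustration of the expected shape, here is a sketch with pandas. The column names below are hypothetical; use the names configured for your own dataset:

import pandas as pd

# Hypothetical column names for illustration only.
df = pd.DataFrame({
    "predictions": [0, 1, 1, 0],    # predicted class labels
    "ground_truths": [0, 1, 0, 0],  # actual class labels
})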

Test configuration examples

If you are writing a tests.json, here is a valid configuration for the F1 score test:
[
  {
    "name": "F1 score above 0.8",
    "description": "Ensure that the F1 score is above 0.8",
    "type": "performance",
    "subtype": "metricThreshold",
    "thresholds": [
      {
        "insightName": "metrics",
        "insightParameters": null,
        "measurement": "f1",
        "operator": ">",
        "value": 0.8
      }
    ],
    "subpopulationFilters": null,
    "mode": "development",
    "usesValidationDataset": true,
    "usesTrainingDataset": false,
    "usesMlModel": true,
    "syncId": "b4dee7dc-4f15-48ca-a282-63e2c04e0689"
  }
]
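
To make the threshold semantics concrete, here is a rough sketch of how such a check could be evaluated locally. This is illustrative only, not the platform's implementation, and the labels and operator mapping are assumptions:

import json
import operator

from sklearn.metrics import f1_score

# Map the config's operator strings to Python comparisons (assumed subset).
OPERATORS = {">": operator.gt, ">=": operator.ge, "<": operator.lt, "<=": operator.le}

with open("tests.json") as f:
    tests = json.load(f)

y_true = [0, 1, 1, 0, 1]  # ground truth labels from the validation set (made up)
y_pred = [0, 1, 0, 0, 1]  # model predictions (made up)

for test in tests:
    for threshold in test["thresholds"]:
        if threshold["measurement"] == "f1":
            # Binary labels here, so class 1 is "positive"; use average="macro" for multiclass.
            score = f1_score(y_true, y_pred)
            passed = OPERATORS[threshold["operator"]](score, threshold["value"])
            print(f"{test['name']}: {'passed' if passed else 'failed'} (f1={score:.3f})")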