Sentence length

Definition

The sentence length test checks that sentences in your text data fall within specified length bounds. You can set thresholds on the minimum sentence length, the maximum sentence length, or both. The test analyzes the individual sentences within your text columns and measures each one's length in characters, so you can detect overly long or overly short sentences that may indicate data quality issues or generation problems.
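To make the two measurements concrete, here is a minimal Python sketch of how minimum and maximum sentence length could be computed. It is an illustration only: the helper names `sentence_lengths` and `sentence_length_stats` are hypothetical, the regex-based splitter is a naive assumption rather than the tokenizer the test actually uses, and pooling sentences across all rows is also an assumption.

```python
import re


def sentence_lengths(text: str) -> list[int]:
    """Return the character length of each sentence in `text`.

    Assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
    The real test may segment sentences differently.
    """
    parts = re.split(r"(?<=[.!?])\s+", text)
    return [len(p.strip()) for p in parts if p.strip()]


def sentence_length_stats(rows: list[str]) -> dict[str, int]:
    """Pool the sentences from every row and report the shortest and longest."""
    lengths = [n for row in rows for n in sentence_lengths(row)]
    if not lengths:
        raise ValueError("no sentences found")
    return {
        "minSentenceLength": min(lengths),
        "maxSentenceLength": max(lengths),
    }


rows = ["Hi. This is a longer sentence!", "Short one?"]
stats = sentence_length_stats(rows)
print(stats)
```

With these rows, the fragment "Hi." would fail a threshold such as `minSentenceLength >= 10`, while a threshold of `maxSentenceLength <= 200` would pass.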

Taxonomy

  • Task types: LLM, text classification.
  • Availability: and .

Why it matters

  • Text quality assurance: Ensures generated or processed text keeps sentence lengths appropriate for readability.
  • Model output validation: Prevents LLMs from generating extremely long run-on sentences or incomplete fragments.
  • Data consistency: Maintains uniform text formatting and structure across your dataset.
  • User experience: Ensures text outputs are readable and well-formatted for end users.
  • Detection of generation issues: Identifies when models produce malformed or truncated text.

Test configuration examples

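If you are writing a tests.json, here are a few valid configurations for the sentence length test. The // comments are annotations for this page only; standard JSON does not allow comments, so remove them in your actual file: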
[
  {
    "name": "Sentences not too long",
    "description": "Ensures no sentences exceed 200 characters to maintain readability",
    "type": "integrity",
    "subtype": "sentenceLength",
    "thresholds": [
      {
        "insightName": "sentenceLength",
        "measurement": "maxSentenceLength",
        "operator": "<=",
        "value": 200  // Maximum 200 characters per sentence
      }
    ],
    "subpopulationFilters": null,
    "mode": "development",
    "usesValidationDataset": true, // Apply test to the validation set
    "usesTrainingDataset": false,
    "usesMlModel": false,
    "syncId": "b4dee7dc-4f15-48ca-a282-63e2c04e0689" // Some unique id
  },
  {
    "name": "Sentences have minimum content",
    "description": "Ensures sentences are at least 10 characters to avoid fragments",
    "type": "integrity",
    "subtype": "sentenceLength",
    "thresholds": [
      {
        "insightName": "sentenceLength",
        "measurement": "minSentenceLength",
        "operator": ">=",
        "value": 10  // Minimum 10 characters per sentence
      }
    ],
    "subpopulationFilters": null,
    "mode": "development",
    "usesValidationDataset": true,
    "usesTrainingDataset": false,
    "usesMlModel": false,
    "syncId": "96622fba-ea00-4e42-8f42-5e8f5f60805f" // Some unique id
  },
  {
    "name": "Balanced sentence lengths",
    "description": "Ensures sentences are between 20 and 150 characters for optimal readability",
    "type": "integrity",
    "subtype": "sentenceLength",
    "thresholds": [
      {
        "insightName": "sentenceLength",
        "measurement": "minSentenceLength",
        "operator": ">=",
        "value": 20
      },
      {
        "insightName": "sentenceLength",
        "measurement": "maxSentenceLength",
        "operator": "<=",
        "value": 150
      }
    ],
    "subpopulationFilters": null,
    "mode": "development",
    "usesValidationDataset": true,
    "usesTrainingDataset": false,
    "usesMlModel": false,
    "syncId": "a1b2c3d4-e5f6-4789-a012-b3c4d5e6f708" // Some unique id
  }
]
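If you prefer to generate these entries rather than write them by hand, the following Python sketch builds a configuration with the same fields as the examples above and serializes it as comment-free JSON. The helper `sentence_length_test` is hypothetical, and using a freshly generated UUID for `syncId` is an assumption based on the sample values shown here.

```python
import json
import uuid


def sentence_length_test(name, description, min_chars=None, max_chars=None):
    """Build one sentence length test entry (hypothetical helper).

    Field names and values mirror the examples above; nothing here calls
    a real API.
    """
    thresholds = []
    if min_chars is not None:
        thresholds.append({
            "insightName": "sentenceLength",
            "measurement": "minSentenceLength",
            "operator": ">=",
            "value": min_chars,
        })
    if max_chars is not None:
        thresholds.append({
            "insightName": "sentenceLength",
            "measurement": "maxSentenceLength",
            "operator": "<=",
            "value": max_chars,
        })
    if not thresholds:
        raise ValueError("set min_chars, max_chars, or both")
    return {
        "name": name,
        "description": description,
        "type": "integrity",
        "subtype": "sentenceLength",
        "thresholds": thresholds,
        "subpopulationFilters": None,
        "mode": "development",
        "usesValidationDataset": True,
        "usesTrainingDataset": False,
        "usesMlModel": False,
        "syncId": str(uuid.uuid4()),  # assumption: any unique UUID is accepted
    }


tests = [
    sentence_length_test(
        "Balanced sentence lengths",
        "Ensures sentences are between 20 and 150 characters",
        min_chars=20,
        max_chars=150,
    )
]

# json.dumps emits strict JSON, so the output contains no // comments.
payload = json.dumps(tests, indent=2)
print(payload)
```

Writing `payload` to a file named tests.json gives you a starting point that you can then edit by hand.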