How to generate textual synthetic data

To create a synthetic dataset, first, go to the corresponding project page.

🚧

Don't have a project yet?

All synthetic datasets live inside a project. If you haven’t created a project yet, create one first. In case you missed it, here is a tutorial about it.

Once on the project page, open the Run report by clicking on Open report.

Then, in the upper right corner, click on Generate data.

On the data generation page, you can select the augmentation category you are interested in.

Augmentation category

What is the augmentation category?

Synthetic data from the augmentation category is generated by applying perturbations to a subset of the original dataset.

Configuration and data sections

Once the Augmentation category is selected, you can choose from various augmentation types. The augmentation types at Openlayer are mostly based on CheckList, a testing methodology for NLP models proposed by Ribeiro et al. Some of the available augmentation types are:

  • Counterfactuals: perturb the tokens with the intention of changing the model’s original prediction;
  • Add typos: inject typos by swapping neighboring characters;
  • Paraphrase: paraphrase by swapping in proximal embeddings;
  • Change locations: replace city and country names with others;
  • and more.

The remaining augmentation types, along with their descriptions, are listed on the platform.
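To make two of these perturbations concrete, here is a minimal Python sketch of "Add typos" and "Change locations". It is only an illustration of the ideas described above, not Openlayer's implementation: the function names, the `rate` parameter, and the `LOCATION_SWAPS` table are all hypothetical.

```python
import random
import re
from typing import Dict, Optional

def add_typos(text: str, rate: float = 0.1, seed: Optional[int] = None) -> str:
    """Swap neighboring letters, each eligible pair with probability `rate`."""
    rng = random.Random(seed)
    chars = list(text)
    i = 0
    while i < len(chars) - 1:
        both_letters = chars[i].isalpha() and chars[i + 1].isalpha()
        if both_letters and rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # skip the swapped pair so each character moves at most once
        else:
            i += 1
    return "".join(chars)

# Hypothetical lookup table; a real system would use a much larger gazetteer.
LOCATION_SWAPS: Dict[str, str] = {"Paris": "Lima", "France": "Peru"}

def change_locations(text: str, swaps: Dict[str, str] = LOCATION_SWAPS) -> str:
    """Replace each known city or country name with its mapped alternative."""
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, swaps)) + r")\b")
    return pattern.sub(lambda match: swaps[match.group(1)], text)

if __name__ == "__main__":
    print(add_typos("The service was great", rate=0.3, seed=0))
    print(change_locations("I flew from Paris, France."))
```

Passing a fixed `seed` keeps the typo injection reproducible, which helps when the same perturbed rows need to be regenerated later.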

Finally, the Data section, near the bottom of the page, lets you filter the input data to be perturbed. Once the displayed data is to your liking, scroll all the way down and click on Generate.

Results

Once generated, the synthetic dataset should appear under Synthetic datasets on the project page.

Open the newly created dataset. You can download the data as a CSV file. The synthetic dataset can also be used for testing purposes. In our case, we perturbed some of the original rows by adding typos to the sentences.
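As one way to use such a download for testing, the sketch below checks whether a model's prediction stays the same on each original sentence and its perturbed counterpart (an invariance check). The column names `original_text` and `perturbed_text`, the inline sample data, and the `toy_model` classifier are assumptions made for illustration; the actual CSV layout may differ.

```python
import csv
import io
from typing import Callable, Dict, List

# Hypothetical export: the real column names and contents may differ.
SAMPLE_CSV = """original_text,perturbed_text,label
The service was great,The servcie was great,positive
I want a refund,I wnat a refund,negative
"""

def load_rows(csv_text: str) -> List[Dict[str, str]]:
    """Parse CSV text into a list of row dictionaries keyed by column name."""
    return list(csv.DictReader(io.StringIO(csv_text)))

def toy_model(text: str) -> str:
    """Stand-in classifier used only to make the example runnable."""
    return "negative" if "refund" in text.lower() else "positive"

def invariance_failures(
    rows: List[Dict[str, str]], predict: Callable[[str], str]
) -> List[Dict[str, str]]:
    """Return the rows whose prediction changes after the perturbation."""
    return [
        row for row in rows
        if predict(row["original_text"]) != predict(row["perturbed_text"])
    ]

if __name__ == "__main__":
    rows = load_rows(SAMPLE_CSV)
    failures = invariance_failures(rows, toy_model)
    print(f"{len(failures)} of {len(rows)} rows changed prediction")
```

To check a real model, replace `toy_model` with its prediction function and read the downloaded file instead of `SAMPLE_CSV`.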

Template category

What is the template category?

Synthetic data from the template category is generated from a specified template string.

Configuration and template string sections

Once the Template category is selected, it is possible to define the number of samples to generate and the template string.

For example, let’s say that we want to investigate how the model’s predictions vary with respect to the location mentioned in a sentence. A template string lets us quickly generate synthetic samples for this.
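As a rough illustration of the idea, the Python sketch below expands a template with a `{location}` placeholder into a fixed number of samples. The placeholder syntax, the `fill_template` function, and the example values are assumptions made for this sketch; the platform's own template syntax may differ.

```python
import itertools
import string
from typing import Dict, List, Optional

def fill_template(
    template: str,
    values: Dict[str, List[str]],
    n_samples: Optional[int] = None,
) -> List[str]:
    """Expand `template` over every combination of placeholder values.

    If `n_samples` is given, only the first `n_samples` results are returned.
    """
    # Collect the placeholder names in order of first appearance.
    parsed = string.Formatter().parse(template)
    fields = list(dict.fromkeys(name for _, name, _, _ in parsed if name))
    combos = itertools.product(*(values[field] for field in fields))
    samples = (template.format(**dict(zip(fields, combo))) for combo in combos)
    return list(itertools.islice(samples, n_samples))

if __name__ == "__main__":
    sentences = fill_template(
        "I am flying to {location} next week.",
        {"location": ["Paris", "Tokyo", "Nairobi"]},
        n_samples=2,
    )
    for sentence in sentences:
        print(sentence)
```

Because only the location changes between samples, any difference in the model's predictions across them can be attributed to the location itself.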

Results

Once generated, the synthetic dataset should appear under Synthetic datasets on the project page.

Open the newly created dataset. You can download the data as a CSV file. The synthetic dataset can also be used for testing purposes.