How to generate tabular synthetic data

To create a synthetic dataset, first go to the corresponding project page.

🚧

Don't have a project yet?

All synthetic datasets live inside a project. If you haven’t created a project yet, create one first. In case you missed it, here is a tutorial about it.

Once on the project page, click Open report to open the run report.

Inside the run report, apply any filters you would like. The filtered data serves as the basis for generating the synthetic samples.

Once you have selected all the filters of interest, click Generate data in the upper-right corner.

On the data generation page, you can define the augmentation type you are interested in.

Lookalike data

Select Lookalike as the augmentation type.

What is lookalike data?

A pre-trained Generative Adversarial Network (GAN) is used to generate lookalike data. The GAN was trained to generate synthetic samples that look similar to the data provided as its input.

Configuration and data sections

Define the number of synthetic samples to be generated. In the case below, we generate 50 synthetic samples in total.

The Data section, near the bottom of the page, lets you filter the input data even further. Once the displayed data is to your liking, scroll all the way down and click Generate.

Results

Once generated, the synthetic dataset should appear under Synthetic datasets on the project page.

Open the newly created dataset. You can download the data as a CSV file. The synthetic dataset can also be used for testing purposes.
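After downloading, a quick sanity check of the file can be useful, for example confirming that the row count matches the number of samples you requested. The snippet below is a minimal sketch using only Python's standard library. The helper name `summarize_csv` and the file name `synthetic_dataset.csv` are hypothetical, and the sketch assumes the download is a plain comma-separated file with a header row.

```python
import csv


def summarize_csv(lines):
    """Return (row_count, column_names) for CSV content with a header row.

    `lines` can be an open text file or any iterable of text lines.
    """
    reader = csv.DictReader(lines)
    rows = list(reader)
    # fieldnames is None when the input is completely empty
    columns = list(reader.fieldnames or [])
    return len(rows), columns


# Hypothetical usage with a downloaded file (the file name is an assumption):
#
#     with open("synthetic_dataset.csv", newline="", encoding="utf-8") as f:
#         n_rows, columns = summarize_csv(f)
#     print(n_rows, columns)
```

If you asked for 50 synthetic samples, the returned row count would be expected to be 50, assuming one sample per row.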

Counterfactual data

Select Counterfactuals as the augmentation type.

What is counterfactual data?

Counterfactual samples are data samples that change your model’s predictions via perturbations to certain feature values.

Configuration and data sections

To generate counterfactuals, define all the parameters of interest.

Understanding the parameters:

  • Desired class: the label we will try to make our model output;
  • Feature to vary: the feature that will be perturbed as we try to flip the model’s predictions. This feature is varied while the other features are kept constant, to see whether the model changes its prediction to the desired class;
  • Number of samples per row: the number of perturbations for each row selected.
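To make the three parameters concrete, here is a minimal sketch of a single-feature counterfactual search in plain Python. It is an illustration of the general idea, not the platform's actual implementation. The function `find_counterfactuals`, the toy model `toy_predict`, the candidate values, and the reading of "number of samples per row" as a cap on the counterfactuals returned per row are all assumptions made for this example.

```python
def find_counterfactuals(row, predict, feature, candidate_values,
                         desired_class, samples_per_row):
    """Vary one feature of `row`, keeping the others constant.

    Returns up to `samples_per_row` perturbed copies of `row` for which
    `predict` outputs `desired_class`. The original row is not modified.
    """
    found = []
    for value in candidate_values:
        if value == row[feature]:
            continue  # not a perturbation
        perturbed = dict(row)       # copy, so all other features stay fixed
        perturbed[feature] = value  # perturb only the feature to vary
        if predict(perturbed) == desired_class:
            found.append(perturbed)
            if len(found) == samples_per_row:
                break
    return found


def toy_predict(row):
    """Hypothetical stand-in for a trained model."""
    return "approved" if row["income"] >= 50_000 else "rejected"


row = {"age": 35, "income": 30_000}
results = find_counterfactuals(
    row,
    toy_predict,
    feature="income",                                # feature to vary
    candidate_values=range(30_000, 80_001, 10_000),  # assumed search grid
    desired_class="approved",                        # desired class
    samples_per_row=2,                               # samples per row
)
print(results)
```

In this toy setup, the original row is predicted as "rejected", and the search returns two perturbed rows (incomes of 50,000 and 60,000) that the toy model labels "approved", with `age` left unchanged.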

The Data section, near the bottom of the page, lets you filter the input data even further. Once the displayed data is to your liking, scroll all the way down and click Generate.

Results

Once generated, the synthetic dataset should appear under Synthetic datasets on the project page.

Open the newly created dataset. You can download the data as a CSV file. The synthetic dataset can also be used for testing purposes.