Using GANs to generate synthetic tabular data to improve supervised learning

Question

One approach I see people trying is using GANs to generate synthetic tabular data for supervised learning, including as a way to oversample the minority class in binary classification.
To me, creating synthetic data seems a bit dangerous.
In practice, every experiment I have seen that uses GANs to generate new training data has failed.
Is there a theoretical reason behind this?

noe · Answer

GANs have many known problems. The main ones are:

Lack of convergence.
Vanishing gradients when the discriminator is "too good", which causes the generator to stagnate.
Mode collapse: the generated samples have very low diversity, with the generator producing the same few values over and over.
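
To make the vanishing-gradient point concrete, here is a small numeric sketch. It assumes the discriminator output is a sigmoid of a logit a, and compares the gradient of the original minimax generator loss, log(1 - D(G(z))), with the commonly used non-saturating variant, -log D(G(z)). The function names are made up for illustration.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def grad_saturating(a):
    # d/da log(1 - sigmoid(a)) = -sigmoid(a)
    return -sigmoid(a)

def grad_non_saturating(a):
    # d/da [-log sigmoid(a)] = -(1 - sigmoid(a))
    return -(1.0 - sigmoid(a))

# A very negative logit means the discriminator confidently rejects the fake sample.
for a in (0.0, -2.0, -5.0, -10.0):
    print(f"logit={a:6.1f}  D={sigmoid(a):.5f}  "
          f"saturating={grad_saturating(a):+.5f}  "
          f"non_saturating={grad_non_saturating(a):+.5f}")
```

As the discriminator gets more confident (logit going to minus infinity), the gradient of the minimax loss goes to zero, so the generator receives almost no learning signal, while the non-saturating variant keeps a gradient close to -1.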

GANs for image generation have been studied extensively. Other domains, like speech filtering, have also been studied, though less extensively. In others, like text generation, GANs have not been very successful. For tabular data generation via GANs, published work is scarce: medGAN, VeeGAN, ehrGAN, TableGAN, CTGAN.
I think that one of the main problems preventing us from devising better GANs in non-image domains is evaluation. With images, you can eyeball the results and quickly determine whether they are of good quality and diverse. In other domains, however, it is not easy to evaluate either the quality or the diversity of the generated data.
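
As a rough illustration of what a non-visual check could look like, here is a minimal sketch based on nearest-neighbour distances. It is only one possible heuristic, not an established standard: the distance from synthetic rows to real rows is used as a proxy for quality, and the distance from real rows to synthetic rows as a proxy for diversity. The function names and the toy data are hypothetical, and all features are assumed to be numeric and on comparable scales.

```python
import math
import random

def nearest_distance(point, others):
    # Euclidean distance from `point` to its closest row in `others`.
    return min(math.dist(point, o) for o in others)

def mean_nearest_distance(sources, targets):
    # Average distance from each source row to its nearest target row.
    return sum(nearest_distance(s, targets) for s in sources) / len(sources)

random.seed(0)

# Toy "real" data with two modes, around (0, 0) and (5, 5).
real = ([(random.gauss(0, 0.1), random.gauss(0, 0.1)) for _ in range(50)]
        + [(random.gauss(5, 0.1), random.gauss(5, 0.1)) for _ in range(50)])

# Toy "synthetic" data from a collapsed generator that only covers one mode.
collapsed = [(random.gauss(0, 0.1), random.gauss(0, 0.1)) for _ in range(100)]

fidelity = mean_nearest_distance(collapsed, real)  # synthetic -> real
coverage = mean_nearest_distance(real, collapsed)  # real -> synthetic

print(f"synthetic -> real: {fidelity:.3f}")
print(f"real -> synthetic: {coverage:.3f}")
```

In this toy case the synthetic rows all look plausible (small synthetic-to-real distance), yet half of the real data has no synthetic counterpart nearby (large real-to-synthetic distance), which is the signature of mode collapse that eyeballing catches easily with images.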
I think most people nowadays stick to classical oversampling methods to generate tabular data.
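
For reference, the idea behind one classical oversampling method, SMOTE, can be sketched in a few lines: pick a minority row, pick one of its nearest minority neighbours, and interpolate between the two. This is a simplified illustration with a made-up function name (`smote_like`), assuming purely numeric features; for real work, an existing implementation such as the one in imbalanced-learn is a better choice.

```python
import math
import random

def smote_like(minority, n_new, k=3, seed=0):
    # Simplified SMOTE-style oversampling for numeric rows (tuples of floats).
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority neighbours of the chosen row, excluding the row itself.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        # New row at a random position on the segment between base and neighbour.
        t = rng.random()
        synthetic.append(tuple(b + t * (n - b) for b, n in zip(base, neighbour)))
    return synthetic

minority = [(1.0, 2.0), (1.2, 1.9), (0.8, 2.1), (1.1, 2.3)]
new_rows = smote_like(minority, n_new=6)
for row in new_rows:
    print(row)
```

Because every new row lies on a segment between two existing minority rows, the method cannot produce values far outside the observed data, which is part of why it is easier to trust than a generator that has to be trained and evaluated.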
