Data Science
Asked by Vivek Maskara on August 20, 2021
For one of the projects where we are working as third-party contractors, we need a way for the company to share some datasets that can be used for data science. It is not possible for the company to share the real data, as that would be a privacy issue.
We are exploring ways for the company either to share the data while maintaining privacy, or to generate fake data that matches the statistics and demographics of the actual data.
We are currently looking at a couple of options. Is there any other approach to synthetic data generation that resembles the actual data in terms of demographics and statistics? Or, failing that, what would be the best way to access the real data without violating privacy?
We solved this problem by using NER. Using spaCy or a similar library, entities can be detected and replaced with xxx. This makes identification of company names, currencies, etc. difficult or impossible.
After this step, synthetic data generation techniques such as data multiplication, paraphrasing, or NLG can be applied. A minimal masking sketch follows.
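Here is a minimal sketch of the masking idea with spaCy. The model name, the entity labels to mask, and the sample sentence are assumptions for illustration; adjust them to the data at hand.

```python
# Minimal sketch: NER-based masking with spaCy.
# Assumes the small English model is installed:
#   python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")

def mask_entities(text, labels=frozenset({"ORG", "PERSON", "MONEY", "GPE"})):
    """Replace detected entities with 'xxx'."""
    doc = nlp(text)
    masked = text
    # Replace from the end so earlier character offsets stay valid.
    for ent in reversed(doc.ents):
        if ent.label_ in labels:
            masked = masked[:ent.start_char] + "xxx" + masked[ent.end_char:]
    return masked

# Hypothetical example sentence; output depends on the model's detections.
print(mask_entities("Acme Corp paid $1.2 million to John Smith in London."))
```

Replacing spans from right to left is the key detail: it keeps the character offsets of the remaining entities valid while the string shrinks.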
Answered by Sandeep Bhutani on August 20, 2021
If you are trying to hide the actual data values, one standard way to make private data available publicly is to process the dataset through PCA or a similar algorithm. Also use one-hot encoding or embeddings for categorical/text data, and rename the columns. Reverse engineering this data exactly would be very difficult, maybe even impossible. There may be ways to get similar data back, but you can minimize even that by performing a second processing step.
After this process, the data is not quite the same, but is usually similar enough to the original dataset to be useful for most use cases.
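A hedged sketch of this pipeline with scikit-learn is below. The example frame, column names, and component count are hypothetical, and `sparse_output` requires scikit-learn 1.2+ (older versions use `sparse=False`).

```python
# Sketch: obfuscate a dataset via one-hot encoding + scaling + PCA,
# then publish only the anonymously named component scores.
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical data standing in for the real dataset.
df = pd.DataFrame({
    "age": [34, 51, 29, 42],
    "income": [58000, 92000, 41000, 77000],
    "segment": ["a", "b", "a", "c"],
})

# One-hot encode categoricals so no raw text survives.
encoded = OneHotEncoder(sparse_output=False).fit_transform(df[["segment"]])
numeric = StandardScaler().fit_transform(df[["age", "income"]])
features = np.hstack([numeric, encoded])

# Project onto principal components; the fitted loadings stay private,
# since they would let a recipient invert the transform.
scores = PCA(n_components=3).fit_transform(features)
public = pd.DataFrame(scores, columns=[f"f{i}" for i in range(scores.shape[1])])
print(public)
```

Note that whoever holds the fitted PCA object can call `inverse_transform`, so only the scores, not the fitted transformer, should leave the company.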
Answered by Donald S on August 20, 2021
Whichever method you choose is fine, but if you wish to mitigate inference attacks, something like differential privacy is required for either approach.
Formally speaking, differential privacy provides some of the strongest guarantees against reverse engineering. Specifically, it promises that any attacker, regardless of attack methodology or available computing power, will be unable to conclude with certainty whether or not any individual has contributed data to a dataset. This is because the results of differentially-private methods are ambiguous up to the addition or removal of the input contributions of any individual. In essence, every individual gets deniability about their participation (or non-participation) in the input.
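For reference (this formalism is from the standard differential privacy literature, not from the answer above): a randomized mechanism $M$ is $\varepsilon$-differentially private if, for all pairs of datasets $D$ and $D'$ differing in one individual's data, and for all sets of outputs $S$,

$$\Pr[M(D) \in S] \le e^{\varepsilon} \cdot \Pr[M(D') \in S].$$

The closer $\varepsilon$ is to zero, the less any single individual's presence can shift the output distribution, which is exactly the deniability described above.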
The problem with synthetic data is that it is generated from a model fit to real data. This means the model parameters are aggregate functions of the real data. This is problematic because it is often possible to make inferences from aggregates of data, or from estimates thereof (this is the motivation for differential privacy in the first place), and the parameters of the generative model can often be estimated from the synthetic data. I am happy to give an example of such an attack if there is interest; a toy illustration of the underlying issue follows. Further, note that this reasoning also implies that white-box exchange of the model is at least as risky, and comes with additional concerns, such as whether the network has memorized training data. A straightforward mitigation is to apply differential privacy when building the generative model for the synthetic data.
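As a toy illustration of why aggregates leak (a simple differencing attack with hypothetical numbers, not the specific attack alluded to above): an adversary who knows an average before and after one person's record is added can recover that person's value exactly.

```python
# Differencing attack: two honest releases of a mean pinpoint one record.
salaries = [58000, 92000, 41000]
mean_before = sum(salaries) / len(salaries)

salaries_after = salaries + [77000]  # the target's record is added
mean_after = sum(salaries_after) / len(salaries_after)

# total_after - total_before = the target's exact salary
recovered = mean_after * len(salaries_after) - mean_before * len(salaries)
print(recovered)  # 77000.0
```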
With regard to privacy budgets, one can interpret the budget (often denoted $\varepsilon$ and also called the privacy loss) as the amount of information (say, in bits) that can be inferred about any individual by an adversary with access to the differentially private results. Perhaps surprisingly, it can, and ideally should, be much less than 1. If there will be future releases that reference the same individuals, then one has to worry about how much individual information can be inferred from the aggregate collection of releases. There is a straightforward composition theorem (see, e.g., Sect 3.5) that follows directly from the definition of differential privacy. It states that the aggregate privacy loss is at most the sum of the privacy losses of the constituent releases. In other words, it is additive in the worst case; a small sketch follows. It may also be helpful to know that when the inputs are disjoint, the aggregate loss is the maximum of the individual losses rather than their sum.
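Here is a minimal sketch of a differentially private count via the Laplace mechanism, showing sequential composition; the data and $\varepsilon$ values are illustrative, not recommendations.

```python
# Laplace mechanism for a counting query. A count has sensitivity 1
# (adding or removing one person changes it by at most 1), so Laplace
# noise with scale 1/epsilon gives epsilon-differential privacy.
import numpy as np

rng = np.random.default_rng(0)

def dp_count(values, predicate, epsilon):
    true_count = sum(predicate(v) for v in values)
    return true_count + rng.laplace(scale=1.0 / epsilon)

ages = [34, 51, 29, 42, 38]  # hypothetical data

# Two releases over the same individuals: by sequential composition,
# the total privacy loss is at most 0.1 + 0.1 = 0.2.
over_30 = dp_count(ages, lambda a: a > 30, epsilon=0.1)
over_40 = dp_count(ages, lambda a: a > 40, epsilon=0.1)
print(over_30, over_40)
```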
Answered by Alfred Rossi on August 20, 2021