Data Science
Asked by Vivek Maskara on August 20, 2021
For one of the projects where we are working as third-party contractors, we need a way for the company to share some datasets that can be used for data science. It is not possible for the company to share the real data, as that would be a privacy issue.
We are exploring ways for the company either to share the data while maintaining privacy, or to generate fake data that matches the statistics and demographics of the actual data.
We are currently looking at a couple of options. Is there any other approach to synthetic data generation that resembles the actual data in terms of demographics and statistics? Or, failing that, what would be the best way to access the real data without violating privacy?
We solved this problem by using NER. Using spaCy or a similar library, entities can be detected and replaced with xxx. This makes identification of company names, currencies, etc. difficult or impossible.
After this step, synthetic data generation techniques such as data multiplication, paraphrasing, or NLG can be applied. A minimal masking sketch follows.
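Here is a minimal sketch of the masking idea with spaCy. The model name, the entity labels to mask, and the sample sentence are assumptions for illustration; adjust them to the data at hand.

```python
# Minimal sketch: NER-based masking with spaCy.
# Assumes the small English model is installed:
#   python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")

def mask_entities(text, labels=frozenset({"ORG", "PERSON", "MONEY", "GPE"})):
    """Replace detected entities with 'xxx'."""
    doc = nlp(text)
    masked = text
    # Replace from the end so earlier character offsets stay valid.
    for ent in reversed(doc.ents):
        if ent.label_ in labels:
            masked = masked[:ent.start_char] + "xxx" + masked[ent.end_char:]
    return masked

# Hypothetical example sentence; output depends on the model's detections.
print(mask_entities("Acme Corp paid $1.2 million to John Smith in London."))
```

Replacing spans from right to left is the key detail: it keeps the character offsets of the remaining entities valid while the string shrinks.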
Answered by Sandeep Bhutani on August 20, 2021
If you are trying to hide the actual data values, one standard way to make private data available publicly is to process the dataset through PCA or a similar algorithm. Also use one-hot encoding or embeddings for categorical/text data, and rename the columns. Reverse engineering this data exactly would be very difficult, maybe even impossible. There may be ways to get similar data back, but you can minimize even that by performing a second processing step.
After this process, the data is not quite the same, but is usually similar enough to the original dataset to be useful for most use cases.
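A hedged sketch of this pipeline with scikit-learn is below. The example frame, column names, and component count are hypothetical, and `sparse_output` requires scikit-learn 1.2+ (older versions use `sparse=False`).

```python
# Sketch: obfuscate a dataset via one-hot encoding + scaling + PCA,
# then publish only the anonymously named component scores.
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical data standing in for the real dataset.
df = pd.DataFrame({
    "age": [34, 51, 29, 42],
    "income": [58000, 92000, 41000, 77000],
    "segment": ["a", "b", "a", "c"],
})

# One-hot encode categoricals so no raw text survives.
encoded = OneHotEncoder(sparse_output=False).fit_transform(df[["segment"]])
numeric = StandardScaler().fit_transform(df[["age", "income"]])
features = np.hstack([numeric, encoded])

# Project onto principal components; the fitted loadings stay private,
# since they would let a recipient invert the transform.
scores = PCA(n_components=3).fit_transform(features)
public = pd.DataFrame(scores, columns=[f"f{i}" for i in range(scores.shape[1])])
print(public)
```

Note that whoever holds the fitted PCA object can call `inverse_transform`, so only the scores, not the fitted transformer, should leave the company.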
Answered by Donald S on August 20, 2021
Whichever method you choose is fine, but if you wish to mitigate inference attacks, something like differential privacy is required for either approach.
Formally speaking, differential privacy provides some of the strongest guarantees against reverse engineering. Specifically, it promises that any attacker, regardless of attack methodology or available computing power, will be unable to conclude with certainty whether or not any individual has contributed data to a dataset. This is because the results of differentially-private methods are ambiguous up to the addition or removal of the input contributions of any individual. In essence, every individual gets deniability about their participation (or non-participation) in the input.
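For reference (this formalism is from the standard differential privacy literature, not from the answer above): a randomized mechanism $M$ is $\varepsilon$-differentially private if, for all pairs of datasets $D$ and $D'$ differing in one individual's data, and for all sets of outputs $S$,

$$\Pr[M(D) \in S] \le e^{\varepsilon} \cdot \Pr[M(D') \in S].$$

The closer $\varepsilon$ is to zero, the less any single individual's presence can shift the output distribution, which is exactly the deniability described above.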
The problem with synthetic data is that it is generated from a model fit to real data. This means the model parameters are aggregate functions of the real data. This is problematic because it is often possible to make inferences from aggregates of data, or from estimates thereof (this is the motivation for differential privacy in the first place), and the parameters of the generative model can often be estimated from the synthetic data. I am happy to give an example of such an attack if there is interest; a toy illustration of the underlying issue follows. Further, note that this reasoning also implies that white-box exchange of the model is at least as risky, and comes with additional concerns, such as whether the network has memorized training data. A straightforward mitigation is to apply differential privacy when building the generative model for the synthetic data.
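As a toy illustration of why aggregates leak (a simple differencing attack with hypothetical numbers, not the specific attack alluded to above): an adversary who knows an average before and after one person's record is added can recover that person's value exactly.

```python
# Differencing attack: two honest releases of a mean pinpoint one record.
salaries = [58000, 92000, 41000]
mean_before = sum(salaries) / len(salaries)

salaries_after = salaries + [77000]  # the target's record is added
mean_after = sum(salaries_after) / len(salaries_after)

# total_after - total_before = the target's exact salary
recovered = mean_after * len(salaries_after) - mean_before * len(salaries)
print(recovered)  # 77000.0
```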
With regard to privacy budgets, one can interpret the budget (often denoted $\varepsilon$ and also called the privacy loss) as the amount of information (say, in bits) that can be inferred about any individual by an adversary with access to the differentially private results. Perhaps surprisingly, it can, and ideally should, be much less than 1. If there will be future releases that reference the same individuals, then one has to worry about how much individual information can be inferred from the aggregate collection of releases. There is a straightforward composition theorem (see, e.g., Sect 3.5) that follows directly from the definition of differential privacy. It states that the aggregate privacy loss is at most the sum of the privacy losses of the constituent releases. In other words, it is additive in the worst case; a small sketch follows. It may also be helpful to know that when the inputs are disjoint, the aggregate loss is the maximum of the individual losses rather than their sum.
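Here is a minimal sketch of a differentially private count via the Laplace mechanism, showing sequential composition; the data and $\varepsilon$ values are illustrative, not recommendations.

```python
# Laplace mechanism for a counting query. A count has sensitivity 1
# (adding or removing one person changes it by at most 1), so Laplace
# noise with scale 1/epsilon gives epsilon-differential privacy.
import numpy as np

rng = np.random.default_rng(0)

def dp_count(values, predicate, epsilon):
    true_count = sum(predicate(v) for v in values)
    return true_count + rng.laplace(scale=1.0 / epsilon)

ages = [34, 51, 29, 42, 38]  # hypothetical data

# Two releases over the same individuals: by sequential composition,
# the total privacy loss is at most 0.1 + 0.1 = 0.2.
over_30 = dp_count(ages, lambda a: a > 30, epsilon=0.1)
over_40 = dp_count(ages, lambda a: a > 40, epsilon=0.1)
print(over_30, over_40)
```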
Answered by Alfred Rossi on August 20, 2021