Privacy and Machine Learning: Concerns and Possible Solutions

Machine learning models are becoming an increasingly integral part of the global healthcare infrastructure. They have led to improvements in computer vision, predictive genomics, and palliative care, among other fields, and their performance has often turned out to be better than that of human experts. But very few people in the industry are aware of the new and unique security threats that come with these algorithms.

In 2003, Nissim and Dinur published a paper titled “Revealing information while preserving privacy” where they revealed two surprising facts:

“It is impossible to publish information from a private statistical database without revealing some amount of private information.”

“The entire database can be revealed by publishing the results of a surprisingly small number of queries.”

The implications were huge.

They showed that there exists a tradeoff between the privacy and usability of statistical databases. To someone working with sensitive medical data, this is deeply troubling news. This revelation entails that when we publish results from algorithms, machine learning or otherwise, that were run on these databases, we are always at risk of allowing data-reconstruction and leaking private information. Always.
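To make the risk concrete, here is a minimal sketch of a differencing attack. It is a much simpler cousin of the reconstruction result in the paper, not the paper's own construction. The toy `records` table and the `count_positive` query interface are hypothetical and exist only for this illustration.

```python
# Hypothetical toy database: one sensitive 0/1 attribute per patient.
records = {"alice": 1, "bob": 0, "carol": 1, "dave": 0}


def count_positive(names):
    """Aggregate-only query: how many of the named patients are positive."""
    return sum(records[name] for name in names)


def differencing_attack(target):
    """Recover one person's value by subtracting two aggregate answers."""
    everyone = list(records)
    everyone_but_target = [name for name in everyone if name != target]
    return count_positive(everyone) - count_positive(everyone_but_target)


# Repeating the attack for every name reconstructs the whole table
# using nothing but exact aggregate answers.
recovered = {name: differencing_attack(name) for name in records}
```

Each individual answer looks harmless, yet two exact answers are enough to expose one person, which is why exact aggregate answers cannot be published indefinitely.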

Attack Vectors on Training Data

Here are some ways the privacy of training data can be compromised:

  1. Tracing attacks: You can figure out that a specific person is in a sensitive dataset with only API access to a model trained on the dataset.
  2. Linkage attacks: You can find the real identities of anonymized participants of a dataset.
  3. Model theft: You can steal the model parameters and architecture design by using API access and reconstruct the training set.
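As a rough illustration of the linkage attack from the list above, the sketch below joins an "anonymized" release to a public table on shared quasi-identifiers. All rows, names, and the choice of quasi-identifier columns are made up for the example.

```python
# Hypothetical "anonymized" release: names removed, quasi-identifiers kept.
anonymized = [
    {"zip": "02138", "birth_year": 1965, "sex": "F", "diagnosis": "diabetes"},
    {"zip": "02139", "birth_year": 1980, "sex": "M", "diagnosis": "asthma"},
]

# Hypothetical public auxiliary data (for example, a voter roll).
public = [
    {"name": "Pat Example", "zip": "02138", "birth_year": 1965, "sex": "F"},
    {"name": "Sam Sample", "zip": "02139", "birth_year": 1980, "sex": "M"},
    {"name": "Lee Other", "zip": "02139", "birth_year": 1991, "sex": "M"},
]

QUASI_IDENTIFIERS = ("zip", "birth_year", "sex")


def link(anonymized_rows, public_rows):
    """Map names to sensitive values wherever the quasi-identifiers match uniquely."""
    index = {}
    for row in public_rows:
        key = tuple(row[column] for column in QUASI_IDENTIFIERS)
        index.setdefault(key, []).append(row["name"])

    matches = {}
    for row in anonymized_rows:
        key = tuple(row[column] for column in QUASI_IDENTIFIERS)
        names = index.get(key, [])
        if len(names) == 1:  # only a unique match re-identifies someone
            matches[names[0]] = row["diagnosis"]
    return matches


reidentified = link(anonymized, public)
```

Removing the name column did nothing here, because the remaining columns were already unique to one person in the public table.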

There are other methods which attack the model that do not directly affect privacy:

  1. Training set poisoning: You can ‘poison’ correct training data with false training data, and the model’s accuracy on the targeted label degrades very quickly.
  2. Adversarial examples: You can add tailored noise to images to make the model misclassify them.

Some popular examples where privacy attacks happened:

  1. In a 2008 paper, Homer et al. showed a tracing attack on a genomic dataset. This revelation led to the National Institutes of Health (NIH) changing its policy to protect against such attacks.
  2. In another 2008 paper, Narayanan & Shmatikov found the real identities of anonymized users in a Netflix dataset using a linkage attack. This privacy breach immediately affected thousands of people.

Differential Privacy

Differential privacy was born out of the need to tackle this threat to user privacy in aggregate database settings and limit indirect leakage of data. For an end-user, the core value proposition of differential privacy is this: a database with differential privacy guarantees that anything we can discover in the database with your data, we would’ve discovered without your data. This statement means there is nothing we can discover that can be uniquely attributed to you.

With intuition out of the way, let’s dive into the formal definition and pick it apart because we’ll need to understand it before we see how this technique can help us do privacy-preserving data analytics.

A randomized mechanism M: D → R with domain D and range R satisfies (ε,δ)-differential privacy if for any two adjacent inputs d, d′ ∈ D and for any subset of outputs S ⊆ R it holds that

Pr[M(d) ∈ S] ≤ e^ε · Pr[M(d′) ∈ S] + δ

The first thing to notice is that it’s a mechanism. Most formal frameworks for privacy try to model the attacker and the user, not the mechanism itself. The fact that differential privacy allows us to get guarantees on the privacy of an algorithm, independent of the attacker or user, is one of its strengths.

The inequality above says the following: imagine two separate datasets, one with your data and one without it. The information we get from the dataset without your data should be multiplicatively close (within a factor of e^ε, plus a small additive slack δ) to the information we get from the dataset with your data.
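One way to see the definition at work is randomized response, a classic mechanism that is small enough to check by hand. The sketch below is an illustration chosen for this post, not something taken from the papers cited here. Each "dataset" is a single private bit, so the two adjacent inputs are simply 0 and 1, and the function names are made up for the example.

```python
import math
import random


def truth_probability(epsilon):
    """Probability of reporting the true bit, calibrated to epsilon."""
    return math.exp(epsilon) / (1.0 + math.exp(epsilon))


def randomized_response(true_bit, epsilon, rng=random):
    """Report the true bit with probability e^eps / (1 + e^eps), else flip it."""
    if rng.random() < truth_probability(epsilon):
        return true_bit
    return 1 - true_bit


def output_probability(output, true_bit, epsilon):
    """Exact Pr[M(true_bit) = output] for the mechanism above."""
    p = truth_probability(epsilon)
    return p if output == true_bit else 1.0 - p


def satisfies_dp(epsilon, delta=0.0):
    """Check Pr[M(d) in S] <= e^eps * Pr[M(d') in S] + delta.

    Only the singleton output sets {0} and {1} are checked; the empty set
    and the full set {0, 1} satisfy the inequality trivially.
    """
    bound = math.exp(epsilon)
    for output in (0, 1):
        for d, d_prime in ((0, 1), (1, 0)):
            lhs = output_probability(output, d, epsilon)
            rhs = bound * output_probability(output, d_prime, epsilon) + delta
            if lhs > rhs + 1e-12:  # small tolerance for floating point
                return False
    return True
```

For this mechanism the ratio of the two output probabilities is exactly e^ε, so the inequality holds with δ = 0 and is tight.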

Differential privacy is a powerful technique because it has a compelling composition property. The composition theorem says that for sequential queries on the database, the privacy losses of the individual queries can simply be added to get the total privacy loss. This property is useful in practice because it allows us to easily craft an effective “accountant” for the privacy loss of any learning algorithm to estimate the privacy of the training data and the robustness of the learning.
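The additive rule described above is enough to build the simplest possible accountant. The class below is a minimal sketch that uses basic sequential composition only; the class and method names are hypothetical, and real accountants use tighter bounds than plain addition.

```python
class BasicCompositionAccountant:
    """Tracks total (epsilon, delta) by adding up the cost of each query."""

    def __init__(self):
        self.epsilon = 0.0
        self.delta = 0.0
        self.num_queries = 0

    def spend(self, epsilon, delta=0.0):
        """Record one (epsilon, delta)-differentially private query."""
        if epsilon < 0 or delta < 0:
            raise ValueError("privacy parameters must be non-negative")
        self.epsilon += epsilon
        self.delta += delta
        self.num_queries += 1

    def total(self):
        """Return the accumulated (epsilon, delta) privacy loss."""
        return self.epsilon, self.delta


# Ten queries, each (0.1, 1e-6)-differentially private.
accountant = BasicCompositionAccountant()
for _ in range(10):
    accountant.spend(0.1, 1e-6)
total_epsilon, total_delta = accountant.total()
```

Under basic composition the ten queries together cost roughly ε = 1.0 and δ = 1e-5, so the budget grows linearly with the number of queries.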

Differential Privacy with Deep Learning

Deep learning allows us to turn meaning into vectors and geometric spaces and then incrementally learn complex geometric transformations that map one space to another. This simple but powerful technique has led to significant improvements in predictive analytics.

The problem with this technique is that, like other machine learning algorithms, it is susceptible to memorizing person-specific information from the training data instead of learning general traits from it. This is undesirable because it means the learning has not been robust, and a malicious actor can extract personal and sensitive information from private training data using just public API access to the analytics engine. A paper by Carlini et al. goes so far as to design a simple metric that measures the memorization of sensitive data by neural networks.

The general steps for adding differential privacy to any learning algorithm are as follows:

  1. Initialize learning parameters randomly.
  2. Take a random sample.
  3. Compute gradient on that random sample.
  4. Clip the gradient.
  5. Add noise.
  6. Take a descent step using the noisy gradient.
  7. Compute the overall privacy cost using a privacy accountant.

This list is an overly simplified version of applying differential privacy to mini-batch stochastic gradient descent (SGD), which is one of the more common optimization techniques used in training neural networks. Like other learning algorithms, plain SGD is prone to learning private information from the dataset. To stop this, we can use the above guidelines to implement a differentially private SGD (DP-SGD).
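The steps above can be sketched in plain Python for a tiny linear model. This is an illustrative toy, not a TensorFlow implementation and not the code from any paper. It assumes a squared-error loss and a fixed-size random batch, and it leaves out step 7, so no privacy guarantee is computed. All function and parameter names are made up for the example.

```python
import math
import random


def clip(vector, max_norm):
    """Scale the vector down so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(v * v for v in vector))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [v * scale for v in vector]


def per_example_gradient(weights, features, label):
    """Gradient of 0.5 * (w . x - y)^2 with respect to w."""
    error = sum(w * x for w, x in zip(weights, features)) - label
    return [error * x for x in features]


def dp_sgd(data, dim, steps=200, batch_size=8, learning_rate=0.1,
           clip_norm=1.0, noise_multiplier=1.1, seed=0):
    """Toy DP-SGD loop; `data` is a list of (features, label) pairs."""
    rng = random.Random(seed)
    weights = [rng.gauss(0.0, 0.1) for _ in range(dim)]      # 1. random init
    for _ in range(steps):
        batch = rng.sample(data, batch_size)                 # 2. random sample
        summed = [0.0] * dim
        for features, label in batch:
            grad = per_example_gradient(weights, features, label)  # 3. gradient
            grad = clip(grad, clip_norm)                     # 4. clip per example
            summed = [s + g for s, g in zip(summed, grad)]
        sigma = noise_multiplier * clip_norm
        noisy = [s + rng.gauss(0.0, sigma) for s in summed]  # 5. add noise
        weights = [w - learning_rate * n / batch_size        # 6. descent step
                   for w, n in zip(weights, noisy)]
    return weights                                           # 7. accounting omitted
```

Clipping each example's gradient before summing bounds how much any single record can move the update, and the Gaussian noise is scaled to that bound. These two pieces are what an accountant later reasons about.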

Here is an example of a basic implementation of DP-SGD in TensorFlow, Google’s deep learning framework.

 

The Privacy Accountant

In the 2016 paper titled “Deep Learning with Differential Privacy,” the researchers introduced a method for estimating the privacy loss, which was the primary contribution of that paper. They called it the PrivacyAccountant (the technique behind it is the paper’s moments accountant), and it keeps track of the privacy spending over the course of training. It analyzes the moments of the random variable that measures the privacy loss. That random variable depends on the random noise added to the algorithm.

Saying that a mechanism M is (ε,δ)-differentially private is equivalent to obtaining a bound on M’s privacy loss random variable. The PrivacyAccountant allows us to estimate this very bound for any mechanism’s privacy loss random variable.

While bounding the privacy loss is not new work, the method used in the paper gives a tighter bound and hence a better estimate of the privacy loss. Using the PrivacyAccountant, the DP-SGD algorithm described above is (O(qε√T), δ)-differentially private, where q is the fraction of the dataset sampled in each step and T is the number of training steps.
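To get a feel for why a √T bound matters, the sketch below compares two growth rates. One simply adds a per-step loss of about qε over T steps, as basic composition would. The other follows the √T scaling of the bound above. The constants hidden by the O(·) notation are set to 1, which is an assumption made purely for illustration, so the numbers show scaling only and are not real privacy guarantees.

```python
import math


def naive_epsilon(q, epsilon, steps):
    """Growth under plain addition: about q * eps per step, added up over T steps."""
    return q * epsilon * steps


def sqrt_epsilon(q, epsilon, steps):
    """Growth following the q * eps * sqrt(T) scaling (constants dropped)."""
    return q * epsilon * math.sqrt(steps)


# Hypothetical training run: 1% of the data per step, 10,000 steps.
q, epsilon, steps = 0.01, 1.0, 10_000
linear_growth = naive_epsilon(q, epsilon, steps)
sqrt_growth = sqrt_epsilon(q, epsilon, steps)
improvement = linear_growth / sqrt_growth
```

The gap between the two is a factor of √T, so the longer the training run, the more a tighter accountant is worth.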

Here is an example from the paper, where the DP-SGD was applied to the PCA algorithm, and the dataset used was MNIST.

The intersection between differential privacy and learning theory is still an active area of research and has found some fascinating and useful connections with game theory. As machine learning algorithms become more prevalent in our healthcare systems, we’ll experience different attacks and challenges on the security front. Differential privacy will come to be one of the cornerstones of privacy-preserving data analysis, and its importance will only continue to grow in the future.

References

  1. Ian Goodfellow on adversarial examples – http://www.iangoodfellow.com/slides/2017-10-19-BayLearn.pdf
  2. Deep learning with differential privacy – https://dl.acm.org/citation.cfm?id=2978318
  3. Revealing information while preserving privacy – https://dl.acm.org/citation.cfm?id=773173
  4. Tracing attack by Homer et al. – http://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1000167
  5. De-anonymization attack on the Netflix dataset by Narayanan & Shmatikov – https://www.cs.utexas.edu/~shmat/shmat_oak08netflix.pdf
  6. Algorithmic foundations of differential privacy by Dwork & Roth – https://www.cis.upenn.edu/~aaroth/Papers/privacybook.pdf

About Arjun Bahuguna

Arjun Bahuguna is a security researcher and blockchain developer. At Saama, he's helping speed up clinical trials with blockchain-based solutions. At Next Tech Lab, he's building tools for blockchain network measurement, and analyzing security of machine learning models. His interests include applied cryptography, statistical learning, privacy law, and tech policy.

