Differential privacy in practice

Uber open-sources their differential privacy project

Using personal data is not evil as such. When we order a cab, we certainly want our driver to know where to pick us up, and we want to be sure we are invoiced for the correct journey. Whether the intermediary company, which is supposed to just connect us with our driver, should have all these details is at least debatable.

Uber has a reputation for exploiting their customers’ data, even abusing it to an extent that might be regarded as criminal when they threatened critical journalists. But even without bad intentions, it can be harmful to potentially let everyone know where your customers are at all times.

Thus it makes perfect sense for companies like Uber to develop a framework for differential privacy: to make the data somewhat available, but to blur it with stochastics so that it becomes very unlikely that individual people can be singled out.
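To make the idea of blurring with stochastics concrete, here is a minimal sketch of the Laplace mechanism, a standard textbook building block of differential privacy. It is not Uber's implementation; the names `laplace_noise` and `private_count`, the sample trip records, and the value of `epsilon` are all assumptions made for illustration.

```python
import random

def laplace_noise(scale, rng):
    # The difference of two independent exponential draws with mean `scale`
    # follows a Laplace distribution centered at 0 with that scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def private_count(records, predicate, epsilon, rng=None):
    """Return a noisy count of the records that match `predicate`.

    Adding or removing one person changes a count by at most 1
    (sensitivity 1), so the noise scale is 1 / epsilon.
    Smaller epsilon means more noise and stronger privacy.
    """
    rng = rng or random.Random()
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon, rng)

# Hypothetical example data: 30 airport pickups, 70 downtown pickups.
trips = [{"zone": "airport"}] * 30 + [{"zone": "downtown"}] * 70
noisy = private_count(trips, lambda t: t["zone"] == "airport", epsilon=0.5)
print(round(noisy, 1))  # close to 30, but different on every run
```

Any single answer is off by a few trips, which hides whether one particular customer is in the data, while the average over many such answers still converges to the true count.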

That Uber opens their project for differential privacy is a great example of how to see protecting your customers not as a necessary evil, but as a way to use data privacy as public proof that you actually listen, learn, and act. It is a nice way of rebuilding trust.

Read Uber’s blog post:

https://medium.com/uber-security-privacy/differential-privacy-open-source-7892c82c42b6

Homomorphic Encryption

Privacy-preserving computation

With homomorphic encryption data can remain fully secret while the information in it can still be used to do meaningful and valuable analyses – without intruding on people’s privacy.

Data, even when encrypted, usually has to be decrypted before any analysis can be carried out. Hashing is no way out either: hashed data, for example, is not considered private. Think of personally identifiable data such as an email address. If the email address is hashed and other data is then linked to the hash, all of it can be deanonymized: all you need to do is calculate the hashes of many email addresses and look for a match. Thus using hashes, anchor hashes, etc. does not make a process privacy compliant, e.g. under the GDPR.
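The matching attack described above takes only a few lines. The following is a minimal sketch, assuming unsalted SHA-256 hashes of normalized email addresses; the function names `pseudonymize` and `reidentify` and the example addresses are made up for illustration.

```python
import hashlib

def pseudonymize(email):
    # Typical "anonymization": hash the normalized email address.
    return hashlib.sha256(email.strip().lower().encode("utf-8")).hexdigest()

def reidentify(hashed_ids, candidate_emails):
    """Map hashes back to emails by hashing a list of candidate addresses."""
    lookup = {pseudonymize(e): e for e in candidate_emails}
    return {h: lookup[h] for h in hashed_ids if h in lookup}

# A data set that stores behavior under hashed emails (hypothetical data).
stored = {pseudonymize("alice@example.com"): {"visits": 12}}

# An attacker with any list of real addresses recovers who is who.
candidates = ["bob@example.com", "alice@example.com", "carol@example.com"]
for h, email in reidentify(stored, candidates).items():
    print(email, stored[h])
```

Because the hash function is public and deterministic, anyone holding a list of plausible addresses can rebuild the link between person and data.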

Most privacy intrusions don’t take place in the form of data leaks or security breaches. Informational self-determination is undermined at a basic level by the data analytic practices of many companies and government institutions. The fact that millions of users delete their Facebook apps from their phones shows that privacy preservation at the level of business models is a very sensitive topic, worth consideration beyond mere compliance with legal regulations.

On the blockchain, where all data remains visible in the open, it is particularly important to keep the data itself private.

Recently, a general concept of privacy-preserving computation has been summarized under the term differential privacy. Its requirements go far beyond what most data-handling companies until recently considered private.

Homomorphic encryption (HE) presents a new paradigm for data storage, queries, analytics, and computation. Modern HE schemes basically add just enough random noise during encryption that the actual values remain clouded, while computations carried out on the ciphertexts still decrypt to the correct result. The procedure not only has the advantage of making analytics on encrypted data possible, hence offering truly differential privacy. Unlike common cryptography based on prime numbers, the lattice mathematics behind most current HE schemes is also believed to be resistant to attacks with quantum computers.

With HE, data can be encrypted before it is stored, e.g. on a cloud server, and be queried and analyzed without ever having to be decrypted.

First, the results can remain fully or partially encrypted, too. It is possible, for example, to include data points in an aggregate, such as an average calculated over a set of individuals, without disclosing the individual behind each data point.
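Computing an average over encrypted values can be sketched with a toy version of the Paillier cryptosystem, a textbook additively homomorphic scheme. This is a swapped-in illustration chosen for brevity, not the scheme used in our project: Paillier relies on prime factorization, so it is not one of the quantum-resistant lattice schemes, and the fixed small primes below are an assumption that is far too weak for real use. All function names are hypothetical.

```python
import math
import secrets

# Toy parameters: two well-known Mersenne primes. Not secure, demo only.
P = 2**31 - 1
Q = 2**61 - 1

def keygen(p=P, q=Q):
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    mu = pow(lam, -1, n)  # modular inverse; valid for generator g = n + 1
    return n, (lam, mu)   # public key n, private key (lam, mu)

def encrypt(n, m):
    n2 = n * n
    while True:  # pick a random blinding factor r that is coprime to n
        r = secrets.randbelow(n - 1) + 1
        if math.gcd(r, n) == 1:
            break
    # With g = n + 1, g^m mod n^2 equals 1 + m*n mod n^2.
    return ((1 + m * n) % n2) * pow(r, n, n2) % n2

def add_encrypted(n, c1, c2):
    # Multiplying ciphertexts adds the underlying plaintexts.
    return c1 * c2 % (n * n)

def decrypt(n, priv, c):
    lam, mu = priv
    x = pow(c, lam, n * n)
    return (x - 1) // n * mu % n

# Each individual encrypts their own value (hypothetical trip lengths in km).
n, priv = keygen()
trip_km = [12, 7, 30, 5]
ciphertexts = [encrypt(n, d) for d in trip_km]

# The server sums the ciphertexts without seeing any single value.
total = ciphertexts[0]
for c in ciphertexts[1:]:
    total = add_encrypted(n, total, c)

# Only the key holder decrypts, and only the aggregate.
average = decrypt(n, priv, total) / len(trip_km)
print(average)  # 13.5
```

The server in this sketch only ever handles ciphertexts; the individual trip lengths are never decrypted, only their sum.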

A second important application is comparing data. To calculate the individual risk of carrying a genetic disease, for example, the similarity of a patient’s genome to that of people who show the symptoms is measured. Usually this requires sending the patient’s DNA, fully sequenced and unencrypted, to the genome lab where it is analyzed. Even if the lab diligently erases the data after the results are derived, the lab personnel see the results. And of course, the data might be leaked. With HE, the patient’s DNA data is encrypted before it is sent to the lab. The lab calculates the similarities, but the results remain encrypted and are never revealed until the patient (or their doctor) decrypts them locally on their computer. Still, the patient’s data could even be added to the group’s aggregate, hence adding to medical knowledge, without any disclosure of the patient’s details.

The third way to use HE is in zero-knowledge proofs (ZKP). A zero-knowledge proof makes it possible to get confirmation on certain predefined questions. In age verification, for example, it might be necessary to prove legal age, but the exact birth date is of no concern to the second party. With HE, the age can be calculated over the encrypted birth date, giving a correct and mathematically proven answer to the comparison of whether the age is greater than the required threshold.

In a first project, we built a proof-of-concept for HE on location-based data. Our HE scrambles the data before it is stored on a blockchain (in our proof-of-concept the IOTA DLT). Any kind of geo-computation such as geo-fencing can be done without revealing any other details of the data.

We now generalize the approach: first, to apply HE for streaming data to other databases and blockchain technologies; second, to implement APIs that drive analytics with the data in a fully integrated differential privacy framework called tyuya core (Working title – suggestions welcome!).

Being a statistician and social researcher by vocation, my favorite next application of HE will be using personal data from social networks and other media. I believe we as an industry can come up with a better way of serving content to people than data intrusion. I am excited to start exploring that.