The changing paradigm of Big Data

“The 2010s was the decade of Big Data.
The 2020s will be the decade of privacy.”

Big Data has changed people; now people are changing Big Data.

  • People are not Luddites. People love what online and the cloud can offer.
  • Connected data is more than just convenience:  health, genome-driven precision medicine, AI and autonomous systems, industrial internet, …
  • But not at all costs:
  • No one wants to feel exploited
  • The threats are real. It is not just an annoyance. More than just money will be lost.
  • Cloud services often mean “lock-in” for businesses – giving away trade secrets will not work for everyone.
  • And policy is changing, too. GDPR is just the beginning.

Incorporating privacy

A stack of technologies has recently emerged that finally does the trick:

  • Homomorphic encryption and zero-knowledge proofs for privacy-preserving computation
  • Federated learning to train models in a decentralized way for privacy-preserving AI
  • True peer-to-peer networks as data infrastructure without a single point of failure.

Technology is ready

  • Researchers at Microsoft, Intel, IBM, and others have defined standards and opened homomorphic encryption for broad application.
  • Zero-knowledge proofs have been used in production for years in cryptocurrencies such as Zcash and Monero.
  • Federated learning has been pushed to production readiness by the AI and chatbot wave.
  • The blockchain offers truly secure and trustworthy peer-to-peer interactions without intermediaries.


Consumer DNA data and privacy: The case for homomorphic encryption

Respecting the autonomy of participants in genetic studies is necessary for long term sustainability of the genomics ecosystem.

Yaniv Erlich

Personally identifiable information is data that, even if it does not link immediately to the person who generated it, can be traced back and de-anonymized. The more dimensions the data spans, the more variability it shows, the more likely it is to find an exact match to one single person.
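
To make the point about dimensions concrete, here is a minimal sketch that counts how many records become unique as more attributes are taken into account. The records and the attribute choice (zip code, birth year, gender) are invented for illustration and do not come from any real data set.

```python
from collections import Counter

# Hypothetical toy records: (zip_code, birth_year, gender).
records = [
    ("10115", 1980, "f"),
    ("10115", 1980, "m"),
    ("10115", 1975, "f"),
    ("10117", 1980, "f"),
    ("10117", 1980, "f"),
    ("10117", 1990, "m"),
]

def unique_fraction(rows, num_dims):
    """Fraction of rows whose first `num_dims` attributes match no other row."""
    keys = [row[:num_dims] for row in rows]
    counts = Counter(keys)
    return sum(1 for key in keys if counts[key] == 1) / len(rows)

# The more attributes we look at, the more people can be singled out.
for dims in (1, 2, 3):
    print(dims, "attribute(s):", unique_fraction(records, dims))
```

In this toy set nobody is unique by zip code alone, a third of the records are unique once the birth year is added, and two thirds are unique with all three attributes.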

Genome data, the sequence of base pairs in a person’s DNA, is probably the most personal of all today’s data. It is not only unique to every human, even more so than fingerprints, but it also ties together families, closer or more remote relatives, and even ethnic communities.

Yaniv Erlich and his research team have put together a good summary of what DNA databases and ancestry-related research entail in terms of informational self-determination. The conclusion: We need privacy-preserving mechanisms like differential privacy and homomorphic encryption.

http://science.sciencemag.org/content/early/2018/10/10/science.aau4832

Teenagers have no choice to opt out

Almost all teenagers are hooked up to mobile services. The smartphone is the single device of communication, dominating the lives of 13- to 17-year-olds.

Parents and teachers might bemoan the digitalized lives of our children. The reality, however, is that for adolescents who have to find their place in society, define their identity, and build their network of relationships, backing off from mobile communication and opting out of the platforms of social exchange is not an option at all.

The pervasiveness of mobile services in the lives of young people makes it impossible to neglect privacy. The platforms will have to do much more than just secure access via password or multi-factor authentication. The data as such has to be made inaccessible to abusers.

Differential privacy regimes with strong cryptography, using homomorphic encryption and zero-knowledge proofs, lead the way.

A lot of useful data on how teenagers use mobile services can be found in this recent report by Common Sense Media:

https://www.commonsensemedia.org/research/social-media-social-life-2018

More data leaks …

Krebs on Security reports another massive data leak at mSpy, the infamous service that lets its users spy on other people’s mobile phones.

This is disturbing on two levels: first, it is hard to believe that such an invasive business model can be legal; second, even in the case of full consent, how can a company dealing with highly sensitive personal data be so irresponsible?

Read the story on the Krebs on Security blog:

https://krebsonsecurity.com/2018/09/for-2nd-time-in-3-years-mobile-spyware-maker-mspy-leaks-millions-of-sensitive-records/

Differential privacy in practice

Uber opens its differential privacy project

Using personal data is not evil as such. In the case of a cab, we certainly want our driver to know where to pick us up, and we want to be sure we are invoiced for the correct journey. However, whether the intermediary company, which is supposed to just connect us with our driver, should have all the details is at least debatable.

Uber has a reputation for exploiting its customers’ data, even abusing it to an extent that might be regarded as criminal when it threatened critical journalists. But even without bad intentions, it can be harmful to let potentially everyone know where your customers are at all times.

Thus it makes perfect sense for companies like Uber to develop a framework for differential privacy: the data remains available in some form, but random noise blurs it so that it becomes very unlikely that individual people can be singled out.
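
How the blurring works can be sketched with the textbook Laplace mechanism for a counting query. This is a generic illustration and not Uber's implementation; the function name `private_count`, the pickup zones, and the chosen epsilon are assumptions made up for this example.

```python
import random

def private_count(values, predicate, epsilon, rng=random):
    """Count the matching values and add Laplace noise.

    Adding or removing one person changes a count by at most 1
    (sensitivity 1), so the noise scale is 1 / epsilon.
    """
    true_count = sum(1 for value in values if predicate(value))
    scale = 1.0 / epsilon
    # The difference of two exponential draws with mean `scale`
    # is a Laplace(0, scale) sample.
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return true_count + noise

# Hypothetical pickup zones of six trips.
pickup_zones = ["A", "B", "A", "C", "A", "B"]

# A smaller epsilon means more noise and stronger privacy.
print(private_count(pickup_zones, lambda zone: zone == "A", epsilon=0.5))
```

A single answer is deliberately imprecise, while averages over many people stay useful. That is the trade-off differential privacy makes explicit through epsilon.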

That Uber opens its differential privacy project is a great example of how to treat protecting your customers not as a necessary evil, but as public proof that you actually listen, learn, and act. It is a nice way of rebuilding trust.

Read Uber’s blog post:

https://medium.com/uber-security-privacy/differential-privacy-open-source-7892c82c42b6

Homomorphic Encryption

Privacy-preserving computation

With homomorphic encryption, data can remain fully secret while the information in it can still be used for meaningful and valuable analyses – without intruding on people’s privacy.

Data, even when encrypted, usually has to be decrypted first before any analysis can be carried out. Hashed data, for example, is not considered private. Think of personally identifiable data such as an email address. If the email address is hashed and other data is then linked to the hash, all of it can be de-anonymized: all you need to do is calculate the hashes of many email addresses and look for a match. Thus using hashes, anchor hashes, etc. does not make a process privacy compliant, e.g. under the GDPR.
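
The matching attack on hashed email addresses fits in a few lines. The following is a minimal sketch; the addresses, the linked attributes, and the choice of unsalted SHA-256 are assumptions for illustration only.

```python
import hashlib

def hash_email(email):
    """Unsalted SHA-256 of a normalized email address, hex-encoded."""
    return hashlib.sha256(email.strip().lower().encode("utf-8")).hexdigest()

# "Pseudonymized" data set: hash -> linked attribute (hypothetical values).
pseudonymized = {
    hash_email("alice@example.com"): "diagnosis: A",
    hash_email("bob@example.com"): "diagnosis: B",
}

# The attacker's candidate addresses, e.g. from a leaked list or guesses.
candidates = ["carol@example.com", "alice@example.com", "dave@example.com"]

def reidentify(table, candidate_emails):
    """Hash every candidate and look for matches in the pseudonymized table."""
    lookup = {hash_email(email): email for email in candidate_emails}
    return {lookup[h]: attr for h, attr in table.items() if h in lookup}

print(reidentify(pseudonymized, candidates))
```

Every candidate address that appears in the data set is re-linked to its record. The hash only slows the attacker down by the cost of computing one hash per guess.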

Most privacy intrusions don’t take place in the form of data leaks or security breaches. Informational self-determination is undermined at the most basic level by the data analytics practices of many companies and government institutions. The fact that millions of users delete their Facebook apps from their phones shows that privacy preservation at the level of business models is a very sensitive topic, well worth consideration beyond mere compliance with legal regulations.

On the blockchain, where all data remains visible, in the open, it is particularly important to keep the data itself private.

Recently, a general concept of privacy-preserving analysis has been formalized under the term differential privacy: the result of an analysis should reveal almost nothing about whether any single individual is in the data. Its requirements go far beyond what most data-handling companies until recently considered private.

Homomorphic encryption (HE) presents a new paradigm for data storage, queries, analytics, and computation. HE basically adds just enough random noise to the data that its actual values remain obscured. The procedure makes analytics on encrypted data possible, which supports truly differential privacy. In addition, unlike common cryptography based on prime numbers, the lattice-based mathematics behind modern HE schemes is believed to be resistant to attacks with quantum computers.

With HE, data can be encrypted before it is stored, e.g. on a cloud server, and can be queried and analyzed without ever decrypting it.

First, the results can remain fully or partially encrypted, too. It is possible, for example, to include data points in an aggregate, such as an average calculated over a set of individuals, without disclosing the single individual behind the data.
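
An encrypted aggregate can be sketched with the Paillier cryptosystem, a classic additively homomorphic scheme. This is a toy illustration, not our proof-of-concept code: Paillier is based on prime numbers rather than lattices, and the tiny primes, the helper names, and the sample ages are assumptions chosen for readability, far too small for real use.

```python
import math
import secrets
from functools import reduce

def keygen(p=1009, q=1013):
    """Toy Paillier key pair; real keys use primes of 1024 bits or more."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)
    mu = pow(lam, -1, n)  # modular inverse, needs Python 3.8+
    return n, (lam, mu)

def encrypt(n, message):
    """Encrypt with generator g = n + 1, so g**m mod n**2 equals 1 + m*n."""
    while True:
        r = secrets.randbelow(n - 1) + 1
        if math.gcd(r, n) == 1:
            break
    return ((1 + message * n) * pow(r, n, n * n)) % (n * n)

def add_encrypted(n, c1, c2):
    """Multiplying ciphertexts adds the underlying plaintexts."""
    return (c1 * c2) % (n * n)

def decrypt(n, private_key, ciphertext):
    lam, mu = private_key
    x = pow(ciphertext, lam, n * n)
    return ((x - 1) // n * mu) % n

n, private_key = keygen()
ages = [34, 29, 41, 52]                      # each person encrypts locally
ciphertexts = [encrypt(n, age) for age in ages]

# The server only ever sees ciphertexts and combines them blindly.
encrypted_total = reduce(lambda a, b: add_encrypted(n, a, b), ciphertexts)

# Only the key holder can decrypt, and only the aggregate.
total = decrypt(n, private_key, encrypted_total)
print(total / len(ages))
```

The server never learns a single age. It passes on one ciphertext whose decryption is the sum, from which the key holder derives the average.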

A second important application is comparing data. E.g. to calculate the individual risk of carrying a genetic disease, the similarity of a patient’s genome to that of people who show the symptoms is measured. Usually this requires sending a patient’s DNA, fully sequenced and unencrypted, to the genome lab where it is analyzed. Even if the data is diligently erased by the lab after the results are derived, the lab personnel see the results. And of course, the data might be leaked. With HE, the patient’s DNA data is encrypted before it is sent to the lab. The lab calculates the similarities, but the results remain encrypted and are never revealed until the patient (or their doctor) decrypts them locally on their computer. At the same time, the patient’s data could even be added to the group’s aggregate, hence adding to medical knowledge, without any disclosure of the patient’s details.
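
The lab scenario can be sketched as a weighted score computed on encrypted genotypes. The sketch again uses toy Paillier encryption as a stand-in for a production HE library; the genotype vector, the risk weights, and the function names are hypothetical, and real genome similarity measures are considerably more involved.

```python
import math
import secrets

P, Q = 1009, 1013                     # toy primes, far too small for real use
N = P * Q
N2 = N * N
LAM = (P - 1) * (Q - 1) // math.gcd(P - 1, Q - 1)
MU = pow(LAM, -1, N)                  # private; needs Python 3.8+

def encrypt(message):
    """Paillier encryption with generator g = N + 1."""
    while True:
        r = secrets.randbelow(N - 1) + 1
        if math.gcd(r, N) == 1:
            break
    return ((1 + message * N) * pow(r, N, N2)) % N2

def decrypt(ciphertext):
    """Done by the patient only, who holds LAM and MU."""
    return ((pow(ciphertext, LAM, N2) - 1) // N * MU) % N

def encrypted_score(encrypted_genotype, weights):
    """Lab side: compute Enc(sum(w_i * x_i)) without seeing any x_i.

    Raising a ciphertext to the power w multiplies its plaintext by w;
    multiplying ciphertexts adds their plaintexts.
    """
    result = 1                        # 1 is a valid encryption of 0
    for ciphertext, weight in zip(encrypted_genotype, weights):
        result = (result * pow(ciphertext, weight, N2)) % N2
    return result

# Patient side: risk-allele counts (0, 1, or 2) at four hypothetical sites.
genotype = [0, 1, 2, 1]
encrypted_genotype = [encrypt(count) for count in genotype]

# Lab side: plaintext integer risk weights derived from the reference group.
risk_weights = [3, 0, 2, 5]
encrypted_result = encrypted_score(encrypted_genotype, risk_weights)

# Back at the patient: decrypt the score locally.
print(decrypt(encrypted_result))
```

The lab contributes its model and the patient contributes the data, and only the patient ever sees the resulting score in the clear.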

The third way to use HE is in zero-knowledge proofs (ZKP). In a zero-knowledge proof, it is possible to get confirmation of certain predefined questions. E.g. in age verification, it might be necessary to prove legal age; however, the exact birth date is of no concern to the second party. With HE, the age can be calculated over the encrypted birth date, and the answer to the comparison of whether the age is greater than the required threshold can be given correctly and with mathematical proof.

In a first project, we built a proof-of-concept for HE on location-based data. Our HE scrambles the data before it is stored on a blockchain (in our proof-of-concept the IOTA DLT). Any kind of geo-computation such as geo-fencing can be done without revealing any other details of the data.

We are now generalizing the approach: first, to apply HE for streaming data to other databases and blockchain technologies; second, to implement APIs to drive analytics with the data in a fully integrated differential privacy framework called tyuya core (Working title – suggestions welcome!).

Being a statistician and social researcher by vocation, my favorite next application of HE will be using personal data from social networks and other media. I believe we as an industry can come up with a better way of serving content to people than by data intrusion. I am excited to start exploring that.