The changing paradigm of Big Data

“The 2010s was the decade of Big Data.
The 2020s will be the decade of privacy.”

Big Data has changed people; now people are changing Big Data.

  • People are not Luddites. People love what online services and the cloud can offer.
  • Connected data is more than just convenience: health, genome-driven precision medicine, AI and autonomous systems, the industrial internet, …
  • But not at any cost:
      • No one wants to feel exploited.
      • The threats are real. They are more than an annoyance, and more than money will be lost.
      • Cloud services often mean “lock-in” for businesses; handing over your trade secrets simply does not work for everyone.
  • And policy is changing, too. GDPR is just the beginning.

Incorporating privacy

A stack of technologies has now matured that finally makes privacy-preserving data processing practical:

  • Homomorphic encryption and zero-knowledge proofs for privacy-preserving computation
  • Federated learning to train models in a decentralized way, for privacy-preserving AI
  • True peer-to-peer networks as data infrastructure without a single point of failure

Technology is ready

  • Researchers at Microsoft, Intel, IBM, and others have defined standards and opened homomorphic encryption up for broad application.
  • Zero-knowledge proofs have been running in production in cryptocurrencies such as Zcash and Monero for years (a toy example follows below).
  • Federated learning has been pushed toward production readiness by the wave of AI assistants and chatbots.
  • Blockchains offer secure, trustworthy peer-to-peer interactions without intermediaries.
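
To make the zero-knowledge idea a little more concrete, here is a minimal, non-interactive Schnorr-style proof sketch in Python (using the Fiat-Shamir heuristic): the prover convinces anyone that they know the secret exponent x behind a public value y = g^x mod p without revealing x. The tiny parameters and the hash-to-challenge shortcut are illustrative assumptions only; this is not how Zcash or Monero construct their proofs.

```python
# Toy, non-interactive Schnorr proof of knowledge of a discrete logarithm
# (Fiat-Shamir heuristic). Tiny illustrative parameters: NOT secure, and
# not the proof systems actually used by Zcash or Monero.
import hashlib
import secrets

p = 2**127 - 1        # a Mersenne prime, far too small for real use
g = 3                 # public generator

def prove(x, y):
    """Prover: show knowledge of x with y = g^x mod p, without revealing x."""
    r = secrets.randbelow(p - 1)                          # random nonce
    t = pow(g, r, p)                                      # commitment
    c = int.from_bytes(hashlib.sha256(f"{g}|{y}|{t}".encode()).digest(), "big")
    s = (r + c * x) % (p - 1)                             # response
    return t, s

def verify(y, t, s):
    """Verifier: recompute the challenge and check g^s == t * y^c (mod p)."""
    c = int.from_bytes(hashlib.sha256(f"{g}|{y}|{t}".encode()).digest(), "big")
    return pow(g, s, p) == (t * pow(y, c, p)) % p

x = secrets.randbelow(p - 1)      # the prover's secret
y = pow(g, x, p)                  # the public value derived from it
t, s = prove(x, y)
print(verify(y, t, s))            # True; yet x itself was never transmitted
```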


Consumer DNA data and privacy: The case for homomorphic encryption

Respecting the autonomy of participants in genetic studies is necessary for long term sustainability of the genomics ecosystem.

Yaniv Erlich

Personally identifiable information is data that, even if it does not immediately link to the person who generated it, can be traced back and de-anonymized. The more dimensions the data spans and the more variability it shows, the more likely it becomes that it matches exactly one single person.
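
A tiny, invented example makes that point concrete: even a handful of “harmless” attributes combine into fingerprints. The records and column choices below are made up for illustration.

```python
# Toy illustration: the more quasi-identifier columns you combine, the more
# records become unique, and therefore re-identifiable. Invented data.
from collections import Counter

records = [
    # (zip code, birth year, sex)
    ("10115", 1984, "f"),
    ("10115", 1984, "m"),
    ("10117", 1984, "f"),
    ("10117", 1991, "m"),
    ("10119", 1991, "f"),
    ("10119", 1975, "f"),
]

def unique_fraction(rows, dims):
    """Fraction of rows whose combination of the chosen columns is unique."""
    counts = Counter(tuple(r[i] for i in dims) for r in rows)
    return sum(counts[tuple(r[i] for i in dims)] == 1 for r in rows) / len(rows)

for dims in [(0,), (0, 1), (0, 1, 2)]:
    print(dims, unique_fraction(records, dims))
# Each added column raises the share of one-of-a-kind combinations:
# from 0.0 (zip alone) to about 0.67 (zip + year) to 1.0 (zip + year + sex).
```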

Genome data, the sequence of base pairs in a person’s DNA, is probably the most personal of all of today’s data. It is unique to every human (even more so than fingerprints), and it also ties together families, close and distant relatives, and even ethnic communities.

Yaniv Erlich and his research team have put together a good summary of what DNA databases and ancestry-related research entail in terms of informational self-determination. The conclusion: we need privacy-preserving mechanisms like differential privacy and homomorphic encryption.

http://science.sciencemag.org/content/early/2018/10/10/science.aau4832
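
One of the mechanisms the paper calls for, differential privacy, can be sketched in a few lines: instead of releasing an exact count from a genetic database (say, how many participants carry a certain variant), calibrated Laplace noise is added so that no single participant measurably changes the published answer. The numbers and variable names are illustrative, not taken from the paper.

```python
# Minimal sketch of the Laplace mechanism for differential privacy.
# Releasing a noisy count of variant carriers, instead of the exact count,
# bounds how much any single participant can influence the published number.
import numpy as np

def private_count(true_count, epsilon):
    """Counting queries have sensitivity 1 (one participant changes the count
    by at most 1), so Laplace noise with scale 1/epsilon gives an
    epsilon-differentially private release."""
    return true_count + np.random.laplace(loc=0.0, scale=1.0 / epsilon)

carriers = 1342                                  # exact carrier count (made up)
print(private_count(carriers, epsilon=0.5))      # e.g. 1338.6: useful, yet deniable
```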

Privacy-preserving deep learning

Paper by Ryffel et al.

Abstract

We detail a new framework for privacy preserving deep learning and discuss its assets. The framework puts a premium on ownership and secure processing of data and introduces a valuable representation based on chains of commands and tensors. This abstraction allows one to implement complex privacy preserving constructs such as Federated Learning, Secure Multiparty Computation, and Differential Privacy while still exposing a familiar deep learning API to the end-user. We report early results on the Boston Housing and Pima Indian Diabetes datasets. While the privacy features apart from Differential Privacy do not impact the prediction accuracy, the current implementation of the framework introduces a significant overhead in performance, which will be addressed at a later stage of the development. We believe this work is an important milestone introducing the first reliable, general framework for privacy preserving deep learning.

https://arxiv.org/abs/1811.04017
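
The framework described here grew into PySyft, and its API has changed a lot since, so instead of quoting it, here is a small, library-agnostic sketch of the federated learning idea it builds on: every data owner trains locally, and only model parameters, never raw data, travel to the coordinator for averaging. All data, names, and hyperparameters below are invented for illustration.

```python
# Library-agnostic sketch of federated averaging (FedAvg): raw data never
# leaves its owner; only model parameters are shared and averaged.
# All data, names, and hyperparameters are invented for illustration.
import numpy as np

rng = np.random.default_rng(0)

def make_client_data(n=200, true_w=(2.0, -1.0)):
    """Each 'client' holds its own private (X, y); it is never uploaded."""
    X = rng.normal(size=(n, 2))
    y = X @ np.array(true_w) + rng.normal(scale=0.1, size=n)
    return X, y

def local_update(w, X, y, lr=0.1, epochs=5):
    """A client refines the global model on its own data and returns only weights."""
    for _ in range(epochs):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w = w - lr * grad
    return w

clients = [make_client_data() for _ in range(3)]
w_global = np.zeros(2)

for _round in range(10):
    local_weights = [local_update(w_global, X, y) for X, y in clients]
    w_global = np.mean(local_weights, axis=0)    # the only thing the server ever sees

print(w_global)    # converges toward [2, -1] without any client revealing its data
```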

Training data for natural language processing

Paper by Gerlach & Font-Clos

Abstract

The use of Project Gutenberg (PG) as a text corpus has been extremely popular in statistical analysis of language for more than 25 years. However, in contrast to other major linguistic datasets of similar importance, no consensual full version of PG exists to date. In fact, most PG studies so far either consider only a small number of manually selected books, leading to potential biased subsets, or employ vastly different pre-processing strategies (often specified in insufficient details), raising concerns regarding the reproducibility of published results. In order to address these shortcomings, here we present the Standardized Project Gutenberg Corpus (SPGC), an open science approach to a curated version of the complete PG data containing more than 50,000 books and more than 3×10⁹ word-tokens. Using different sources of annotated metadata, we not only provide a broad characterization of the content of PG, but also show different examples highlighting the potential of SPGC for investigating language variability across time, subjects, and authors. We publish our methodology in detail, the code to download and process the data, as well as the obtained corpus itself on 3 different levels of granularity (raw text, timeseries of word tokens, and counts of words). In this way, we provide a reproducible, pre-processed, full-size version of Project Gutenberg as a new scientific resource for corpus linguistics, natural language processing, and information retrieval.

https://arxiv.org/abs/1812.08092
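
As a minimal illustration of the coarsest of the three granularity levels (counts of words), the sketch below tokenizes a single, locally downloaded Project Gutenberg text and tallies word frequencies. The file name and the crude regex tokenizer are assumptions for illustration; they are not the preprocessing pipeline the authors publish.

```python
# Minimal sketch: one Project Gutenberg text -> word counts, the coarsest of
# the three SPGC granularity levels. The file name and the crude lowercase
# regex tokenizer are assumptions, not the authors' published pipeline.
import re
from collections import Counter

with open("pg2600.txt", encoding="utf-8") as f:   # any downloaded PG ebook
    text = f.read()

tokens = re.findall(r"[a-z']+", text.lower())     # crude word tokenizer
counts = Counter(tokens)

print(len(tokens), "tokens,", len(counts), "distinct words")
print(counts.most_common(10))                     # Zipf's law in action
```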

Removing information is information, too.

Readers of Neal Stephenson’s Cryptonomicon know it: if you want to keep something secret, simply erasing the information you want to hide is not enough.

By being forced to blur military facilities, satellite imagery services like Google Maps unintentionally reveal the very information they are supposed to keep secret: just look for the odd clouds covering an otherwise sunny landscape and you will find everything that was meant to stay hidden.

Read the full story here:

https://fas.org/blogs/security/2018/12/widespread-blurring-of-satellite-images-reveals-secret-facilities/

More efficient handling of noise in homomorphic encryption

Paper by Wang et al.

Dealing with noise is what slows down homomorphic encryption: ciphertexts have to be refreshed continuously to keep the noise from drowning out the encrypted data.

Abstract

Achieving both simplicity and efficiency in fully homomorphic encryption (FHE) schemes is important for practical applications. In the simple FHE scheme proposed by Ducas and Micciancio (DM), ciphertexts are refreshed after each homomorphic operation. And ciphertext refreshing has become a major bottleneck for the overall efficiency of the scheme. In this paper, we propose a more efficient FHE scheme with fewer ciphertext refreshings. Based on the DM scheme and another simple FHE scheme proposed by Gentry, Sahai, and Waters (GSW), ciphertext matrix operations and ciphertext vector additions are both applied in our scheme. Compared with the DM scheme, one more homomorphic NOT AND (NAND) operation can be performed on ciphertexts before ciphertext refreshing. Results show that, under the same security parameters, the computational cost of our scheme is obviously lower than that of GSW and DM schemes for a depth-2 binary circuit with NAND gates. And the error rate of our scheme is kept at a sufficiently low level.

https://www.hindawi.com/journals/scn/2018/8706940/
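
Why refreshing dominates the cost can be shown with a toy model (not the DM or GSW scheme itself): each homomorphic gate consumes part of a ciphertext’s noise budget, and once the budget is exhausted the ciphertext must be refreshed, which is far more expensive than the gate itself. Every extra gate that can be squeezed in before a refresh therefore pays off. All numbers below are invented for illustration.

```python
# Toy model of why ciphertext refreshing dominates FHE cost. All numbers are
# invented for illustration; this is not the DM or GSW scheme.

NOISE_BUDGET = 60          # "bits" of noise headroom in a fresh ciphertext
NOISE_PER_GATE = 7         # noise consumed by one homomorphic NAND
REFRESH_COST_MS = 150.0    # bootstrapping (refreshing) is orders of magnitude
GATE_COST_MS = 0.5         # slower than evaluating the gate itself

def circuit_cost_ms(num_gates, gates_per_refresh):
    """Total time when a ciphertext is refreshed every `gates_per_refresh` gates."""
    assert gates_per_refresh * NOISE_PER_GATE <= NOISE_BUDGET, "noise would overflow"
    refreshes = num_gates // gates_per_refresh
    return num_gates * GATE_COST_MS + refreshes * REFRESH_COST_MS

# Refreshing after every gate versus squeezing in one extra gate per refresh,
# which is the kind of improvement the paper claims over the DM scheme:
print(circuit_cost_ms(1000, 1))   # 150500.0 ms
print(circuit_cost_ms(1000, 2))   #  75500.0 ms, roughly half the total cost
```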

Microsoft releases powerful homomorphic encryption as open source

Microsoft has published its “Simple Encrypted Arithmetic Library” SEAL. The code is written in standard C++ without external dependencies.

Microsoft has led international efforts to standardize homomorphic encryption. Standardization and a larger base of developers actually working with this new paradigm of privacy-preserving computation are key to turning homomorphic encryption from experimental academic research into applications that can be practically implemented for real use cases.

Read Microsoft’s announcement here:

The SEAL repo on GitHub:

https://github.com/Microsoft/SEAL
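
SEAL itself is C++ and implements lattice-based schemes, so rather than quote its API from memory, here is a toy Paillier-style sketch in Python that shows the core idea SEAL makes practical: computing on data while it stays encrypted. The hard-coded key material is deliberately tiny; this is a conceptual illustration, not production cryptography and not SEAL’s interface.

```python
# Toy Paillier cryptosystem: additively homomorphic encryption in a few lines.
# Tiny hard-coded primes, no padding, no security: purely to show that
# E(a) * E(b) mod n^2 decrypts to a + b without ever exposing a or b.
import math
import secrets

p, q = 293, 433                       # toy primes; real keys use ~1024-bit primes
n, n2 = p * q, (p * q) ** 2
g = n + 1                             # standard simple choice of generator
lam = math.lcm(p - 1, q - 1)          # private key
mu = pow((pow(g, lam, n2) - 1) // n, -1, n)

def encrypt(m):
    r = secrets.randbelow(n - 1) + 1                  # fresh randomness per message
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(c):
    return ((pow(c, lam, n2) - 1) // n * mu) % n

a, b = 17, 25
c = (encrypt(a) * encrypt(b)) % n2    # the addition happens on ciphertexts
print(decrypt(c))                     # 42, computed without decrypting a or b
```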