Respecting the autonomy of participants in genetic studies is necessary for long term sustainability of the genomics ecosystem.
Personally identifiable information is data that, even if it does not link immediately to the person who generated it, can be traced back and de-anonymized. The more dimensions the data spans, the more variability it shows, the more likely it is to find an exact match to one single person.
Genome data, the sequence of base pairs in a person’s DNA is probably the most personal of all today’s data. It is unique not only to every human -more even than fingerprints- but it also ties together families, closer or more remote relatives, and even ethnic communities.
Yaniv Erlich and his research team has put together a good summary of what DNA databases and ancestry-related research entails in terms of informational self-determination. The conclusion: We need privacy-preserving mechanisms like differential privacy and homomorphic encryption.
We detail a new framework for privacy preserving deep learning and discuss its assets. The framework puts a premium on ownership and secure processing of data and introduces a valuable representation based on chains of commands and tensors. This abstraction allows one to implement complex privacy preserving constructs such as Federated Learning, Secure Multiparty Computation, and Differential Privacy while still exposing a familiar deep learning API to the end-user. We report early results on the Boston Housing and Pima Indian Diabetes datasets. While the privacy features apart from Differential Privacy do not impact the prediction accuracy, the current implementation of the framework introduces a significant overhead in performance, which will be addressed at a later stage of the development. We believe this work is an important milestone introducing the first reliable, general framework for privacy preserving deep learning.
The use of Project Gutenberg (PG) as a text corpus has been extremely popular in statistical analysis of language for more than 25 years. However, in contrast to other major linguistic datasets of similar importance, no consensual full version of PG exists to date. In fact, most PG studies so far either consider only a small number of manually selected books, leading to potential biased subsets, or employ vastly different pre-processing strategies (often specified in insufficient details), raising concerns regarding the reproducibility of published results. In order to address these shortcomings, here we present the Standardized Project Gutenberg Corpus (SPGC), an open science approach to a curated version of the complete PG data containing more than 50,000 books and more than 3×109 word-tokens. Using different sources of annotated metadata, we not only provide a broad characterization of the content of PG, but also show different examples highlighting the potential of SPGC for investigating language variability across time, subjects, and authors. We publish our methodology in detail, the code to download and process the data, as well as the obtained corpus itself on 3 different levels of granularity (raw text, timeseries of word tokens, and counts of words). In this way, we provide a reproducible, pre-processed, full-size version of Project Gutenberg as a new scientific resource for corpus linguistics, natural language processing, and information retrieval.
Readers of Neal Stephenson’s Cryptonomicon know it: If you try to keep things secret, better not just erase the news you want to keep to your self.
By being forced to blurr military facilities, satellite data services like Google Maps are made to unintentionally reveal the very information they are supposed to keep secret: Just keep looking for the weird clouds covering the otherwise sunny landscape and you will find all that should be hidden.
Dealing with noise is what slows down homomorphic encryption because the encryption has continuously to be refreshed.
Achieving both simplicity and efficiency in fully homomorphic encryption (FHE) schemes is important for practical applications. In the simple FHE scheme proposed by Ducas and Micciancio (DM), ciphertexts are refreshed after each homomorphic operation. And ciphertext refreshing has become a major bottleneck for the overall efficiency of the scheme. In this paper, we propose a more efficient FHE scheme with fewer ciphertext refreshings. Based on the DM scheme and another simple FHE scheme proposed by Gentry, Sahai, and Waters (GSW), ciphertext matrix operations and ciphertext vector additions are both applied in our scheme. Compared with the DM scheme, one more homomorphic NOT AND (NAND) operation can be performed on ciphertexts before ciphertext refreshing. Results show that, under the same security parameters, the computational cost of our scheme is obviously lower than that of GSW and DM schemes for a depth-2 binary circuit with NAND gates. And the error rate of our scheme is kept at a sufficiently low level.
Microsoft has published their “Simple Encryoted Arithmetic Library” SEAL. The code is written in standard c++ without external dependencies.
Microsoft has lead international efforts to standardize homomorphic encryption. Standardization and a larger base of developers actually working on this new paradigm of provacy-preserving computation are key to turn homomorphic encryption from an experimental, academic research into applications that can practically be implemented for real use cases.