Why privacy-preserving analytics are crucial for the data economy
Today it is hard, if not impossible, for most organizations to benefit from artificial intelligence (AI), especially when it comes to training and applying machine learning models: these require substantial investments of money, time, data, and expertise, all of which are scarce resources. Scarcity should lead to trade and exchange of goods in the marketplace, but for AI models no vivid marketplace exists yet.
When a model makes predictions locally, its inner workings can easily be inferred and the model replicated. Today, monetizing a model therefore means that either the owner is limited to selling the model once, or the interested party always has to send over its valuable data. Both options have severe disadvantages, as the following two scenarios illustrate.
The economic problem of a one-off sale (scenario I)
A machine learning practitioner has built a unique and accurate model leveraging a mixture of public and private data, perhaps after months of refinement. Right now, her only option is to sell the model once, since handing over the model means the receiver can use it without any restrictions. Because she cannot rent out the model for predictions, she loses potential income and has little incentive to improve the model further.
The security and privacy problem of sharing the data (scenario II)
A doctor would like to offer his clients a digital skin disease diagnosis service: they simply send in a photo of their skin condition. If the doctor opted for the services of a third party providing remote AI, he would have to forward his clients' pictures to the model owner each time. The model owner would then have unrestricted access to all the data, potentially breaching data privacy and security.
A case study
A machine learning model owner (Alice) has trained a skin cancer detection model on public picture datasets and wants to sell predictions from her model to a doctor (Bob). Bob wants to introduce an automated skin cancer prediction service for his patients. Currently, this is done either by Alice sending the model to Bob, in which case Bob gains access to all the work Alice has put into the model (scenario I), or by Bob continuously sending his clients' data to Alice (scenario II).
Illustration scenario I
Illustration scenario II
Both options imply massive risks, and the parties' interests are not always aligned. Advances in the fields of cryptography and machine learning have enabled a novel way to solve these pain points: neither Alice nor Bob has to share any sensitive information, because the model predictions are made on encrypted data while the model itself also stays encrypted.
Well-researched and established cryptographic protocols such as secure multi-party computation and zero-knowledge proofs can be combined with machine learning to augment the privacy of the latter. Major technology companies and financial institutions are already working on solutions that encompass these technologies (JPMorgan, Facebook/Udacity). Decentriq's offering allows individuals and organizations to apply their models to external data without breaching data privacy or releasing their precious model weights. Our privacy-preserving machine learning product suite, Intelligent Cryptographic Membrane (ICM), covers all privacy-preserving analytics, from simple statistics to deep learning. By combining open-source code (e.g. OpenMined) with Decentriq's own cryptography and machine learning expertise, we provide an easy-to-use tool that can be seamlessly integrated into the existing workflows of data scientists and data managers via Python APIs based on popular machine learning frameworks such as PyTorch and TensorFlow.
In the example above, both Alice and Bob install an ICM instance on their local machines, which takes care of all the connections between the parties in the background. Then, with a few simple commands, Alice queries Bob's dataset and starts predicting its labels. All of this takes place while both the model and the data stay encrypted throughout the entire transfer and communication.
In the background, ICM takes care of two critical parts required for a privacy-preserving prediction pipeline:
- It connects the two (or more) machines that hold the datasets and the model, and takes care of distributing the work between them.
- Following the secure multi-party computation protocol, it applies the technique of secret sharing to encrypt the information exchanged between the two parties. Essentially, the technique splits the data in such a way that the plaintext can only be recovered by combining all parts. Any machine learning operation is then carried out collaboratively on this split data, and at the end the results are combined again at the point of origin. Hence, only the original owner of the information is able to derive meaningful results from any operation.
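The secret-sharing idea can be sketched in a few lines of plain Python. The snippet below is a simplified illustration of two-party additive secret sharing over the integers modulo a prime, not Decentriq's actual implementation:

```python
import secrets

# Additive secret sharing modulo a large prime: each party alone
# sees only a uniformly random-looking share; the secret can be
# recovered only by summing all shares.
P = 2**61 - 1  # a Mersenne prime used as the modulus

def share(secret, n_parties=2):
    """Split `secret` into n additive shares modulo P."""
    shares = [secrets.randbelow(P) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % P)
    return shares

def reconstruct(shares):
    """Combine all shares to recover the secret."""
    return sum(shares) % P

alice_share, bob_share = share(42)
assert reconstruct([alice_share, bob_share]) == 42

# Addition works share-wise: each party adds its local shares of
# x and y without ever seeing the other party's values.
x_shares = share(10)
y_shares = share(32)
sum_shares = [(a + b) % P for a, b in zip(x_shares, y_shares)]
assert reconstruct(sum_shares) == 42
```

Because linear operations can be computed directly on the shares, large parts of a machine learning model (matrix multiplications, convolutions) can be evaluated without any party seeing the plaintext.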
ICM in pictures
Step I: Splitting the data and the model into encrypted pieces via PyTorch
Step II: Collaboratively performing operations on the shared data, generating shares of the results
Step III: Bob adds the shares of the results together to obtain the predictions on the images
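These three steps can be sketched with toy numbers. The code below is an illustrative simplification: it multiplies a single secret-shared model weight with a single secret-shared input using a Beaver multiplication triple, a standard building block of secure multi-party computation. The triple is produced here by a simulated trusted dealer, whereas real deployments generate it interactively:

```python
import secrets

P = 2**61 - 1  # prime modulus for the arithmetic shares

def share(v, n=2):
    s = [secrets.randbelow(P) for _ in range(n - 1)]
    return s + [(v - sum(s)) % P]

def reconstruct(shares):
    return sum(shares) % P

def beaver_triple():
    """Trusted-dealer triple (a, b, c) with c = a*b, handed out as shares."""
    a, b = secrets.randbelow(P), secrets.randbelow(P)
    return share(a), share(b), share(a * b % P)

def mul_shares(x_sh, y_sh):
    """Multiply two secret-shared values using a Beaver triple."""
    a_sh, b_sh, c_sh = beaver_triple()
    # The parties open only the masked values d = x - a and e = y - b,
    # which reveal nothing about x or y on their own.
    d = reconstruct([(x_sh[i] - a_sh[i]) % P for i in range(2)])
    e = reconstruct([(y_sh[i] - b_sh[i]) % P for i in range(2)])
    # x*y = c + d*b + e*a + d*e, computed share-wise; the public
    # constant d*e is added by one party only.
    z_sh = [(c_sh[i] + d * b_sh[i] + e * a_sh[i]) % P for i in range(2)]
    z_sh[0] = (z_sh[0] + d * e) % P
    return z_sh

# Step I: Alice shares her (toy) model weight, Bob shares his input.
w_sh = share(3)   # Alice's model parameter
x_sh = share(14)  # Bob's input feature

# Step II: the parties jointly compute shares of the product w*x.
pred_sh = mul_shares(w_sh, x_sh)

# Step III: Bob combines the result shares to learn the prediction.
print(reconstruct(pred_sh))  # 42
```

A real model evaluation repeats this multiplication across full weight and activation tensors, but the privacy guarantee is the same: each party ever only sees random-looking shares and masked values.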
While the need for advanced data analytics and machine learning is increasing in all industries, insourcing the development of these models is a costly process. Outsourcing the development, however, is hard because:
- Most of the time the data is too sensitive to be shared continuously.
- Purchasing a one-off model doesn't yield the expected results, since the advantages of today's AI models lie in their continuous improvement.
Therefore, we are introducing ICM, which allows any business to benefit from the vast global AI talent pool while keeping its most sensitive data private and on-premise, all with the goal of enabling privacy-preserving analytics that contribute to a sustainable data economy.
Experiment with our demo
To better demonstrate how such a solution works in practice, we created a demo that implements a toy model on the well-known CIFAR dataset and then makes privacy-preserving predictions on a slice of that dataset, which is sent to two virtual remote machines.
The demo is accessible to everyone on Google Colab (here). For any questions or feedback, please contact us at firstname.lastname@example.org