An Open Framework for Secure and Private AI
Like any other industry, AI is constrained by the supply chain that feeds it. For AI, that supply chain consists of data, compute, and the talented scientists who build it all. The most limiting of these is data, because the most valuable datasets are private and therefore very difficult to acquire. Our desire for privacy could severely constrain progress on some of the most important and personal scientific endeavors, such as medical research, if those endeavors require access to data about people. We therefore need methods for training machine learning models on private data without compromising either the personal data or the model itself. To this end, many groups across academia and industry are working on techniques such as differential privacy, federated learning, and multi-party computation to provide secure and private AI models.
Federated Learning
If I’m a cancer researcher, the first thing I would do is ask for a copy of personal data about a large group of patients. Not only is this invasive of their privacy, it also means that each of those patients is no longer in full control of their data: once I have a copy, how can they know for sure I haven’t sent it elsewhere? Copying the data, however, is no longer necessary thanks to a technique called federated learning.
With federated learning, machine learning models are transmitted to where the data lives, such as phones, trusted data centers, or browsers, instead of sending user data to some centralized location. This helps protect users’ data because it never leaves the device it lives on. Instead, models are trained on individual devices, the resulting updates are aggregated over many such devices, and a global model update is sent back to all of them. Federated learning is already being used by Google and Apple to train machine learning models on millions of phones. However, since the model itself is sent to these devices, it can still be stolen and analyzed to reveal private information.
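To make the idea concrete, here is a minimal sketch of federated averaging written in plain PyTorch. The helper names (`local_update`, `federated_average`) and the toy random data are purely illustrative, not any production API; a real deployment would handle communication, stragglers, and secure aggregation.

```python
import copy
import torch
from torch import nn

# Toy "devices": each client holds its own private data that never leaves it.
def make_client_data(n_clients=3, n_samples=32, n_features=10):
    return [
        (torch.randn(n_samples, n_features), torch.randn(n_samples, 1))
        for _ in range(n_clients)
    ]

def local_update(global_model, data, epochs=1, lr=0.01):
    """Train a copy of the global model on one client's local data."""
    model = copy.deepcopy(global_model)
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    x, y = data
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        opt.step()
    return model.state_dict()

def federated_average(state_dicts):
    """Average the clients' model updates into a new global state."""
    avg = copy.deepcopy(state_dicts[0])
    for key in avg:
        for sd in state_dicts[1:]:
            avg[key] += sd[key]
        avg[key] /= len(state_dicts)
    return avg

global_model = nn.Linear(10, 1)
clients = make_client_data()

for round_num in range(5):
    # Each round: send the model to the data, train locally, aggregate.
    updates = [local_update(global_model, data) for data in clients]
    global_model.load_state_dict(federated_average(updates))
```

Only model parameters cross the network in this sketch; the raw `(x, y)` pairs stay on each client.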
Differential Privacy
Thanks to federated learning we can train models without ever seeing the raw data, but I still see my model at the end. Assuming my AI model is reasonably powerful, how do I know it didn’t simply memorize the data during training? Differential privacy lets us measure how much private information our model is revealing or leaking. We can add noise to the model’s operations to hide private information that would otherwise leak, and even tune the noise to keep the leakage below some threshold. Adding noise comes at a cost in model performance, but in exchange, information about our users is kept secret.
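The following is a rough sketch of that noise-adding idea in plain PyTorch, in the spirit of DP-SGD: clip the gradients to bound how much any update can reveal, then add Gaussian noise before the optimizer step. The `clip_norm` and `noise_multiplier` values are illustrative, and a real implementation would clip per-example gradients and track the privacy budget with a dedicated library (e.g. Opacus) rather than this simplified batch-level version.

```python
import torch
from torch import nn

model = nn.Linear(10, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

clip_norm = 1.0         # bound on gradient sensitivity (illustrative value)
noise_multiplier = 1.1  # more noise -> stronger privacy, lower accuracy

x, y = torch.randn(32, 10), torch.randn(32, 1)  # stand-in private data

for step in range(100):
    opt.zero_grad()
    loss_fn(model(x), y).backward()

    # Clip the gradient norm so no single update can encode too much detail...
    torch.nn.utils.clip_grad_norm_(model.parameters(), clip_norm)

    # ...then add calibrated Gaussian noise before applying the update.
    for p in model.parameters():
        p.grad += torch.randn_like(p.grad) * noise_multiplier * clip_norm
    opt.step()
```

Raising `noise_multiplier` hides more information about individual records, which is exactly the performance-for-privacy trade-off described above.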
Multi-Party Computation
With differential privacy we can at least measure the risk of revealing private information, but malicious agents could still steal our models or snoop on the gradients, which reveal information about the underlying data. To keep our models secure, we can use a technique called multi-party computation. In this paradigm, data is split up and shared across multiple devices. Models are trained on the individual pieces and then recombined into a final result. Since no single location has access to all the data about any one user, information about individuals is effectively hidden in our model updates. In this way, we can prevent models from leaking private information even if the model updates are compromised.
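A toy example helps show how this splitting works. The snippet below illustrates additive secret sharing, a common building block of multi-party computation; the field size `Q` and the helper names are illustrative choices, and real protocols add machinery for multiplication, malicious security, and fixed-point encoding of model weights.

```python
import random

Q = 2**31 - 1  # a large prime modulus; illustrative choice

def share(secret, n_parties=3):
    """Split an integer into n additive shares that sum to it mod Q."""
    shares = [random.randrange(Q) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % Q)
    return shares

def reconstruct(shares):
    return sum(shares) % Q

# Each party holds one share and learns nothing about the secret on its own.
a_shares = share(5)
b_shares = share(12)

# Parties can add their shares locally; the combined result decodes to a + b.
sum_shares = [(a + b) % Q for a, b in zip(a_shares, b_shares)]
assert reconstruct(sum_shares) == 17
```

Because any single share is just a uniformly random number, an attacker who compromises one party (or intercepts one set of model-update shares) learns nothing about the underlying values.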
PySyft: A Framework for Secure and Private AI
PySyft is an open-source framework developed by the OpenMined community that combines these tools for building secure and private machine learning models. The idea behind PySyft is to extend the APIs of popular deep learning frameworks such as PyTorch and TensorFlow so that data scientists and machine learning engineers can immediately begin building privacy-preserving applications without having to learn a new deep learning framework. In this way, we can accelerate the adoption of federated learning and other privacy-preserving tools.
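To give a flavor of what this extension looks like, here is an illustrative snippet in the spirit of the early (0.2-era) PySyft API, where hooked PyTorch tensors can be sent to remote workers and operated on by pointer. The exact names and workflow have changed across PySyft versions, so treat this as a sketch of the idea rather than the current API.

```python
import torch
import syft as sy

# Hook PyTorch so tensors gain methods like .send() and .get()
# (reflects the early 0.2-era PySyft API; newer versions differ).
hook = sy.TorchHook(torch)

# A simulated remote data owner.
bob = sy.VirtualWorker(hook, id="bob")

# These tensors now live on Bob's worker; we only hold pointers to them.
x = torch.tensor([1.0, 2.0, 3.0]).send(bob)
y = torch.tensor([1.0, 1.0, 1.0]).send(bob)

# Operations on pointers are executed remotely, where the data lives.
z = x + y

# Only when we explicitly request the result does it come back to us.
print(z.get())  # tensor([2., 3., 4.])
```

The point is that the code reads like ordinary PyTorch, which is what lets existing practitioners pick up these privacy-preserving workflows without starting from scratch.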
The OpenMined community is working to unlock the potential of training AI models on all available data, not just the small fraction that is easily accessible. If you’d like to work with us on PySyft or other projects, check out our group page on GitHub.