Introduction to Differential Privacy Concepts
Editor’s note: Veena is a speaker for ODSC West 2022 this October 31st to November 3rd. Be sure to check out her talk, “Introduction to Differential Privacy Concepts,” there!
Differential Privacy (DP) is a framework that brings mathematical rigor to the privacy-preserving analysis of datasets containing personal information. Informally, DP requires that the outcome of an analysis remain essentially unchanged when any one individual’s information changes, thereby protecting individuals from adversaries who try to learn information specific to them. It is studied in the context of the collection, analysis, and release of aggregate statistics ranging from simple statistical estimations to machine learning.
Motivation — So why do we care about differential privacy?
Over the last decade, the ability to collect and store personal data has exploded. Collected at scale from financial and medical services, online surveys, and social media activity such as liking pages, this data has incredible potential for good in many domains.
At the same time, the large-scale collection and use of individual-level data raises privacy concerns. In a recent survey, over 70 percent of U.S. citizens reported being worried about sharing personal information online. Sensitive data can be exploited for mass surveillance, social engineering, or identity theft. Data anonymization (de-identification) has been the main paradigm used in research and elsewhere to share data while preserving people’s privacy. However, many of the existing approaches do not sufficiently protect individuals’ data as indicated by the following examples.
In 2019, a team from Imperial College showed that individuals in anonymized data could often be re-identified, even from incomplete datasets. Over 99 percent of the sample were correctly re-identified using only 15 attributes such as age, gender, and marital status.
“While there might be a lot of people who are in their thirties, male, and living in New York City, far fewer of them were also born on 5 January, are driving a red sports car, and live with two kids (both girls) and one dog.”
Another study, using census data, voter registration data, and hospital level data, found that 87% of the population in the United States had reported characteristics that likely made them unique based only on {5-digit ZIP, gender, date of birth}. About half of the U.S. population are likely to be uniquely identified by only {place, gender, date of birth}, where place is the city, town, or municipality of residence.
The issue here is the mosaic effect (“Sorry, your data can still be identified even if it’s anonymized”): individual pieces of data, when released independently, may not reveal sensitive information but, when combined, can be used to derive personal information. High-dimensional, high-resolution data is essentially unique, while lower-dimensional, lower-resolution data, though more private, is less useful.
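To make the mosaic effect concrete, here is a minimal sketch that counts how many records become unique as more attributes are combined. The eight records, the attribute names, and the helper unique_fraction are all invented for illustration; they are not taken from the studies cited above.

```python
from collections import Counter

# Hypothetical toy records, invented purely for illustration.
records = [
    {"zip": "10001", "gender": "M", "birth_year": 1988, "pet": "dog"},
    {"zip": "10001", "gender": "M", "birth_year": 1988, "pet": "cat"},
    {"zip": "10001", "gender": "F", "birth_year": 1990, "pet": "dog"},
    {"zip": "10001", "gender": "F", "birth_year": 1990, "pet": "dog"},
    {"zip": "10002", "gender": "M", "birth_year": 1975, "pet": "none"},
    {"zip": "10002", "gender": "M", "birth_year": 1988, "pet": "none"},
    {"zip": "10002", "gender": "F", "birth_year": 1975, "pet": "cat"},
    {"zip": "10002", "gender": "F", "birth_year": 1975, "pet": "dog"},
]

def unique_fraction(rows, attributes):
    """Fraction of rows whose combination of `attributes` appears exactly once."""
    keys = [tuple(row[a] for a in attributes) for row in rows]
    counts = Counter(keys)
    return sum(1 for key in keys if counts[key] == 1) / len(rows)

# Each attribute alone reveals little; combined, they single people out.
for attrs in [("zip", "gender"),
              ("zip", "gender", "birth_year"),
              ("zip", "gender", "birth_year", "pet")]:
    print(attrs, unique_fraction(records, attrs))
```

In this toy data no record is unique on ZIP code and gender alone, but most become unique once birth year and one more attribute are added, which is the same pattern the studies above report at population scale.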
Defining Differential Privacy
Differential privacy is a strong form of privacy protection with a mathematical definition. It is not a specific process like data anonymization, but a property of a process. For example, it is possible to prove that a specific algorithm “satisfies” differential privacy.
Informally, differential privacy guarantees the following: for each individual who contributes data for analysis, the output of a differentially private analysis will be approximately the same whether or not that individual’s data is included. A differentially private analysis is often called a mechanism (denoted by ℳ below).
Informal Definition of Differential Privacy (Source: NIST DP Blog)
In the above figure, “A” is computed without Joe’s data while “B” is computed with Joe’s data. Differential privacy says that the two answers should be indistinguishable.
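As a rough sketch of the with-and-without-Joe comparison, the snippet below answers a counting query twice using the Laplace mechanism, a standard way to achieve differential privacy that this article does not itself name. The ages, the query, the value ε = 0.5, and the helper names laplace_noise and private_count are all assumptions made for illustration.

```python
import random

def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale`
    # follows a Laplace distribution centered at 0 with that scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def private_count(values, predicate, epsilon, rng):
    """Noisy count of values satisfying `predicate`.

    Adding or removing one person changes a count by at most 1
    (sensitivity 1), so the Laplace noise scale is 1 / epsilon.
    """
    true_count = sum(1 for v in values if predicate(v))
    return true_count + laplace_noise(1.0 / epsilon, rng)

def is_over_40(age):
    return age > 40

rng = random.Random(0)  # fixed seed only to make the sketch repeatable
ages_without_joe = [34, 45, 29, 61, 52, 38]
ages_with_joe = ages_without_joe + [47]  # hypothetical: Joe is 47

answer_a = private_count(ages_without_joe, is_over_40, epsilon=0.5, rng=rng)
answer_b = private_count(ages_with_joe, is_over_40, epsilon=0.5, rng=rng)
print(f"A (without Joe): {answer_a:.2f}")
print(f"B (with Joe):    {answer_b:.2f}")
```

The true counts are 3 and 4, but because the noise is typically larger than the difference Joe makes, an observer who sees only one noisy answer cannot confidently tell which dataset produced it.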
This intuitive idea is made precise by the privacy parameter ε, also known as the privacy loss or privacy budget, which measures the strength of the privacy guarantee. A lower value of ε implies that the results are more indistinguishable and, hence, that the individual’s data is more protected.
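The role of ε can be checked numerically. The sketch below assumes a counting query answered with Laplace noise of scale 1/ε (an illustrative choice, not something prescribed by this article) and compares how likely each output is when the true count is 3 versus 4. The function names and the grid of outputs are hypothetical.

```python
import math

def laplace_pdf(x, mean, scale):
    """Density of the Laplace distribution at x."""
    return math.exp(-abs(x - mean) / scale) / (2.0 * scale)

def worst_case_ratio(count_a, count_b, epsilon, outputs):
    """Largest density ratio between two neighboring counts over `outputs`."""
    scale = 1.0 / epsilon  # sensitivity 1 for a counting query
    return max(
        laplace_pdf(y, count_b, scale) / laplace_pdf(y, count_a, scale)
        for y in outputs
    )

outputs = [i / 10.0 for i in range(-200, 201)]  # grid from -20.0 to 20.0
for epsilon in (0.1, 0.5, 1.0, 2.0):
    ratio = worst_case_ratio(3, 4, epsilon, outputs)
    print(f"epsilon={epsilon}: worst ratio={ratio:.3f}, "
          f"bound exp(epsilon)={math.exp(epsilon):.3f}")
```

For every ε the worst-case ratio stays within exp(ε), so a smaller ε pushes the ratio toward 1 and makes the two cases harder to tell apart, at the cost of noisier answers.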
Formal Definition of Differential Privacy (Source: NIST DP Blog)
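For reference, the standard statement of ε-differential privacy, which we assume is what the figure captioned above depicts, can be written as follows. The symbols D, D′, and S are notation introduced here: D and D′ are any two datasets that differ in one individual’s data, and S is any set of possible outputs of the mechanism ℳ.

```latex
\Pr[\mathcal{M}(D) \in S] \;\le\; e^{\varepsilon} \cdot \Pr[\mathcal{M}(D') \in S]
```

Because the inequality must hold with D and D′ swapped as well, the two output distributions are within a factor of e^ε of each other in both directions; when ε is small, e^ε is close to 1.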
Note that privacy in DP means the logical security of data. It does not refer to traditional data security concerns such as access control, theft, or hacking. Rather, it addresses the risk that an adversary using legitimate methods can correlate data from multiple databases. See Common Misconceptions About Differential Privacy for more details.
Differential Privacy provides a robust concept of privacy through a mathematical framework for quantifying and managing privacy risks. It is an emerging topic with growing interest as an approach for satisfying legal requirements for privacy protection of personal information. Differential privacy can be viewed as a technical solution for protecting individual privacy to meet legal or policy requirements for disclosure limitation while analyzing and sharing personal data. Tools for differentially private analysis are now in the early stages of implementation and are in use across a variety of academic, industry, and government settings.
To learn more, attend my talk at ODSC West 2022 on differential privacy, “Introduction to Differential Privacy Concepts,” where we’ll be using examples and some mathematical formalism to introduce differential privacy concepts and definitions, how differentially private analyses are constructed, and how these can be used in practice.
About the Author/ODSC West 2022 Speaker:
Veena Mendiratta is an Adjunct Professor at Northwestern University, and a technology advisor, speaker, and mentor. She recently retired from Nokia Bell Labs where she was an applied researcher in the area of network reliability and analytics.