Differential privacy in practice (easy version)

This article explains how to apply differential privacy in practical scenarios: counting unique users, counting things, summing or averaging numbers, and releasing several statistics at once. It also discusses how to avoid common pitfalls, such as the importance of clamping data.

Part of a series on differential privacy. You might want to start with the previous articles below!

  1. Why differential privacy is awesome presents a non-technical explanation of the definition.
  2. Differential privacy in (a bit) more detail introduces the formal definition, with very little math.
  3. Differential privacy in practice (easy version) (this article) explains how to make simple statistics differentially private.
  4. Almost differential privacy describes how to publish private histograms without knowing the categories in advance.
  5. Local vs. global differential privacy presents the two main models of differential privacy, depending on who the attacker is.
  6. The privacy loss random variable explains the real meaning of (ε,δ)-differential privacy.

 


Previously, we saw that differential privacy was pretty awesome, and we looked at the formal definition. Now, how do we obtain it in practice? Let's start with the basics.

Counting unique users

Suppose you have a database, and you want to publish how many people in there satisfy a given condition. Say, how many have green eyes? Even if you have many people in your database, you can't just publish the true answer. Let's take a moment to understand why.

With differential privacy, we assume that the attacker knows almost all elements. They only have uncertainty about their target. Say they want to know whether their target has green eyes. If you output the real number k, they can compare it with the number of people with green eyes among the people they know. If it's k−1, then the target has green eyes. If it's k, then the target does not.

So, what do we do? We compute the exact answer, and we add noise. This noise will come from a probability distribution called the Laplace distribution. This distribution has a parameter, its scale, which determines how "flat" it is. It looks like this:

 

[Graph: a Laplace distribution with scale 1/ln(3), centered on 0]

 

So, to get ε-differential privacy, we pick a random value according to Laplace(1/ε), and we add this noise to the real value. Why does it work? Let's look at the distribution of the number we return, depending on whether the true count is k=1000 (blue line, the target doesn't have green eyes) or k=1001 (yellow line, the target has green eyes).

[Graph: two Laplace distributions of scale 1/ε, centered on k=1000 (blue) and k=1001 (yellow)]

Let's say the real number is k=1001, and after adding noise, we published 1003. Let's put ourselves in the attacker's shoes. What's the likelihood that the original number was 1001 vs. 1000? The hypothesis "k=1001" is a bit more likely: generating a noise of 2 is more likely than a noise of 3. How much more likely? It turns out that the ratio between these likelihoods is… e^ε! So the bound on the ratio of probabilities required by differential privacy is satisfied.

This works no matter what the output is: the ratio will always be between e^ε and e^(−ε). If you want to double-check, you can either verify it on the graph, or do the math.
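To make this concrete, here's a minimal sketch in Python (using numpy; the function names are mine, not from any particular DP library). It checks the likelihood ratio from the example above, then shows the counting mechanism itself:

```python
import numpy as np

def laplace_pdf(x, scale):
    """Density of the Laplace distribution centered on 0."""
    return np.exp(-abs(x) / scale) / (2 * scale)

epsilon = np.log(3)
scale = 1 / epsilon

# The attacker sees 1003. How much more likely is a true count of 1001
# (noise of 2) than a true count of 1000 (noise of 3)?
ratio = laplace_pdf(1003 - 1001, scale) / laplace_pdf(1003 - 1000, scale)
print(ratio, np.exp(epsilon))  # both print 3 (up to float rounding)

def private_count(true_count, epsilon):
    """epsilon-DP count, assuming each user contributes at most 1."""
    return true_count + np.random.laplace(scale=1.0 / epsilon)

print(private_count(1001, epsilon))
```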

Counting things

OK, so counting unique users was pretty easy. Counting things must also be straightforward, right? Let's say you have a database of suggestions that people sent to your company using a feedback form. You want to publish the number of suggestions you received on a given day. Meanwhile, the attacker wants to get an idea of how many complaints their target published.

What's different from the previous scenario? Can't we just add noise picked from Laplace(1/ε) and get ε-differential privacy? There's a catch: what if someone sent more than one complaint during one day? Let's say someone was super unhappy and sent five complaints. The other 1000 customers sent one complaint each. The influence of this one disgruntled customer will be larger than before. The two distributions now look like this:

[Graph: the two output distributions, now centered 5 apart instead of 1]

The difference between the curves is much larger than before. Their ratio is at most e^(5ε), so using noise of scale 1/ε only gives 5ε-differential privacy. To fix this, we need to add more noise. How much more? It depends on the maximum contribution of one individual user. If the maximum number of complaints in one day is 5, you must add 5 times the amount of noise. In this example, using Laplace(5/ε) would give you ε-differential privacy.

[Graph: the same two distributions with noise of scale 5/ε]
Note that you can't fully automate this process: you need to know what the largest contribution can be. A human, with some knowledge of the process, must make a judgment call. In our case, this could be "users won't post more than five complaints per day".

What happens if that judgment call is wrong, and a user later decides to post 10 complaints in one day? To preserve the desired level of privacy, you need to clamp all values to the estimated maximum. In other words, for this outlier user, you would only count 5 complaints in the non-noisy sum.

This process can introduce unexpected bias in the data. So, be careful when estimating the largest contribution! If clamping only happens very rarely, you should be fine.
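Here's what this looks like in code: a sketch, assuming the raw data is a list of user IDs with one entry per complaint, and that the cap of 5 is our judgment call.

```python
from collections import Counter
import numpy as np

MAX_PER_USER = 5  # judgment call: "users won't post more than 5 complaints/day"

def private_complaint_count(complaint_user_ids, epsilon):
    """epsilon-DP count of complaints received in one day.

    Each user's contribution is clamped to MAX_PER_USER, so the maximum
    influence of one person is 5, and the noise is scaled accordingly:
    Laplace(5/epsilon).
    """
    per_user = Counter(complaint_user_ids)
    clamped_total = sum(min(c, MAX_PER_USER) for c in per_user.values())
    return clamped_total + np.random.laplace(scale=MAX_PER_USER / epsilon)
```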

Summing or averaging numbers

Let's say each of your users gives your customer service a rating, between -10 and 10. You want to release the average rating. Computing an average is pretty much the same as computing a sum: add all ratings, then divide by the number of users¹. So, what do we do to the sum to achieve differential privacy?

Among all rating options, we only have to consider the worst possible case. How far can the two noise curves be from each other? If the values are all between -10 and 10, the greatest possible difference is 10−(−10)=20. It happens when the attacker tries to determine whether a user voted -10 or 10.

Like in the previous example, you have to add noise of Laplace(20/ε) to get ε-differential privacy. And just as before, you need to check that each value is between your theoretical minimum and maximum. If you find an anomalous value, e.g. lower than the minimum, you need to clamp it to the minimum before adding it to the sum.
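As a sketch (same caveats as before: illustrative names, not a vetted library):

```python
import numpy as np

def private_sum(ratings, epsilon, low=-10, high=10):
    """epsilon-DP sum of ratings, each clamped to [low, high].

    The worst case is a rating changing from low to high, so the
    maximum influence of one user is high - low = 20, and the noise
    is Laplace(20/epsilon).
    """
    clamped = [min(max(r, low), high) for r in ratings]
    return sum(clamped) + np.random.laplace(scale=(high - low) / epsilon)
```

For the average, you would divide this noisy sum by the number of users (see the first footnote about whether that count itself needs noise).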

In some cases, estimating these minimum or maximum values is difficult. For example, if you're computing the average salary in a large group of people, how should you estimate the upper salary limit? I don't see that problem as a usability flaw of differential privacy. Rather, it suggests that averages are not a meaningful metric in the presence of outliers. Removing these outliers is a good idea for both accuracy and privacy :-)

Releasing many things at once

OK. What if you don't want to release only one statistic, but many of them? Can you add noise to each of them and be fine? Well… it depends. The main question you have to ask yourself is: what is the maximum influence of one individual? There are two distinct possibilities.

Statistics are about different people

Suppose you want to release the number of users you have depending on their age ranges: 20-29, 30-39, 40-49, etc.

Each user will have an influence on at most one of these categories: either someone is in a given age range, or they're in another one. This situation often appears when you're trying to release a histogram:

[Graph: histogram of user counts per age range]

When you're in this case, you can safely add Laplace noise of scale 1/ε to each bucket count. There is no problematic interaction between buckets. Releasing the entire histogram is still ε-differentially private.

[Graph: the same histogram with Laplace(1/ε) noise added to each bucket]

Pretty easy, right? Note that this histogram looks a bit weird: the counts are not integers, and one count ended up being negative! We can make it a bit less suspicious by rounding all counts and replacing all negative numbers by 0.

[Graph: the rounded histogram, with negative counts replaced by 0]

This type of post-processing is allowed, thanks to a nifty property of differential privacy. If you take differentially private data, and make it go through a fixed transformation, you still get differential privacy². Convenient!
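Put together, a histogram release with post-processing might look like this (a sketch; the bucket names are assumed to be known in advance, a point we'll come back to below):

```python
import numpy as np

def private_histogram(bucket_counts, epsilon):
    """Add Laplace(1/epsilon) noise to each bucket count.

    Each user appears in at most one bucket, so releasing all the
    noisy counts together is still epsilon-DP.
    """
    return {b: c + np.random.laplace(scale=1.0 / epsilon)
            for b, c in bucket_counts.items()}

def tidy(noisy_counts):
    """Post-processing: round, and clip negatives to 0. Still epsilon-DP."""
    return {b: max(0, round(c)) for b, c in noisy_counts.items()}

counts = {"20-29": 412, "30-39": 327, "40-49": 198}
print(tidy(private_histogram(counts, epsilon=np.log(3))))
```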

One person is in multiple statistics

OK, what if you're releasing multiple statistics, but this time, they might all be about the same user? Let's say that you want to publish how many of your users…

  • are younger than 35;
  • are using an iOS device;
  • are colorblind;
  • have started using your app less than a month ago.

The same user could be in all those categories! In this scenario, you can't add Laplace noise of scale 1/ε to each count and get ε-differential privacy. Instead, you have to consider each count as a separate data release. Thus, if you have C different counts, you have to add Laplace noise of scale C/ε to each of them. Each independent release will be ε/C-differentially private. And we can now use the composition property of differential privacy! This allows us to conclude that the entire release is ε-differentially private.

This works for any kind of statistics, not just unique counts. Want to release several pieces of information? Count the maximum influence of one single person, and "split" your ε between each data release. This ε is your "privacy budget": you choose ε_1, …, ε_C whose sum is ε, and you release the i-th statistic with ε_i-differential privacy. This solution is more flexible than simply using ε/C-differential privacy on each statistic. If one statistic is more important, or more sensitive to noise, you can allocate more budget to it. The larger the budget portion, the lower the noise. Of course, you will have to add more noise to the other values.
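A sketch of this budget-splitting approach (the category names and budget values are made up for illustration):

```python
import numpy as np

def private_release(true_counts, budgets):
    """Release several counts that may all involve the same user.

    `budgets` assigns epsilon_i to each statistic; by composition,
    the whole release is (sum of budgets)-differentially private.
    Each count has sensitivity 1, so each gets Laplace(1/epsilon_i) noise.
    """
    assert true_counts.keys() == budgets.keys()
    return {name: count + np.random.laplace(scale=1.0 / budgets[name])
            for name, count in true_counts.items()}

# Total budget epsilon = 1.0, with more budget (less noise) for the first count:
budgets = {"under_35": 0.5, "ios": 0.2, "colorblind": 0.2, "new_user": 0.1}
counts = {"under_35": 1540, "ios": 893, "colorblind": 61, "new_user": 202}
print(private_release(counts, budgets))
```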

Traps to avoid

The mechanisms we saw today are pretty straightforward. But anonymization is full of traps: there are still a number of things that can go wrong.

  • When summing or averaging numbers, clamping them to the minimum and maximum is essential. Otherwise, all guarantees fly out the window.
  • Pay attention to how you implement this clamping in practice: special values like NaN can lead to surprising behavior (see the sketch after this list).
  • Of course, you're not allowed to cheat. You have to choose your privacy strategy in advance, then apply it, and release the noisy data. You can't look at the result, decide it's not accurate enough, and regenerate the noise: that would skew the randomness, and you would lose the privacy guarantees.
  • When releasing histograms, it's important to choose each category in advance. If you have to look at your data to know what your categories are, you have to use more subtle methods.
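To illustrate the NaN trap from the second point: in Python, comparisons with NaN are always false, so `min`/`max` keep whichever argument happens to come first, and a NaN can sail straight through a naive clamp.

```python
import math

x = float("nan")
print(max(x, -10))           # nan  -- the "clamp" silently fails
print(max(-10, x))           # -10  -- argument order changes the result!
print(min(max(x, -10), 10))  # nan

def clamp(x, low, high):
    """Clamp that handles NaN explicitly (here, mapped to the lower bound)."""
    if math.isnan(x):
        return low
    return min(max(x, low), high)
```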

The last point in the list above, about choosing histogram categories in advance, is often a problem in practice. For example, say that you want to count how many times each word was used in customer complaints. You can't make a definite word list: people could use words that you didn't predict, or make typos. In that case, the technique I described for histograms doesn't work.

Why doesn't it work? How to fix this? All this, and more, in the next article!


  1. If you want to be extra pedantic, you might also want to add noise to your total number of users. That depends on the flavor of definition that you choose. I'm not going to that level of detail here, and you probably shouldn't either. 

  2. If you're wondering why this is true, here's a super short proof. If there were a post-processing function that could break the differential privacy property, the attacker could simply run it themselves, and use it to distinguish between two outcomes. But that's impossible, because differential privacy forbids it :-) 
