Differential privacy in practice (easy version)

This article explains how to apply differential privacy in practical scenarios: counting unique users, counting things, summing or averaging numbers, and releasing several statistics at once. It also discusses how to avoid common pitfalls, such as the importance of clamping data.

Part of a series on differential privacy. You might want to start with the previous articles below!

  1. Why differential privacy is awesome presents a non-technical explanation of the definition.
  2. Differential privacy in (a bit) more detail introduces the formal definition, with very little math.
  3. Differential privacy in practice (easy version) (this article) explains how to make simple statistics differentially private.
  4. Almost differential privacy describes how to publish private histograms without knowing the categories in advance.
  5. Local vs. global differential privacy presents the two main models of differential privacy, depending on who the attacker is.
  6. The privacy loss random variable explains the real meaning of (ε,δ)-differential privacy.

 


Previously, we saw that differential privacy was pretty awesome, and we looked at the formal definition. Now, how do we obtain it in practice? Let's start with the basics.

Counting unique users

Suppose you have a database, and you want to publish how many people in there satisfy a given condition. Say, how many have green eyes? Even if you have many people in your database, you can't just publish the true answer. Let's take a moment to understand why.

With differential privacy, we assume that the attacker knows almost all elements. They only have uncertainty about their target. Say they want to know whether their target has green eyes. If you output the real number k, they can compare it with the number of people with green eyes among the people they know. If it's k−1, then the target has green eyes. If it's k, then the target does not.

So, what do we do? We compute the exact answer, and we add noise. This noise will come from a probability distribution called the Laplace distribution. This distribution has a parameter, its scale, which determines how "flat" it is. It looks like this:

 

[Graph: a Laplace distribution with scale 1/ln(3), centered on 0]

 

So, to get ε-differential privacy, we pick a random value according to Laplace(1/ε), and we add this noise to the real value. Why does it work? Let's look at the distribution of the number we return, depending on whether the true count is k=1000 (blue line, the target doesn't have green eyes) or k=1001 (yellow line, the target has green eyes).

[Graph: two Laplace distributions of scale 1/ε, centered on k=1000 (blue) and k=1001 (yellow)]

Let's say the real number is k=1001, and after adding noise, we published 1003. Let's put ourselves in the attacker's shoes. What's the likelihood that the original number was 1001 vs. 1000? The hypothesis "k=1001" is a bit more likely: generating a noise of 2 is more likely than a noise of 3. How much more likely? It turns out that the ratio between these likelihoods is… e^ε! So the bound on the ratio of probabilities required by differential privacy is satisfied.

This works no matter what the output is: the ratio will always be between e^ε and e^(−ε). If you want to double-check, you can either verify it on the graph, or do the math.
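To make this concrete, here's a minimal sketch in Python (using numpy; the function names are mine, not from any particular DP library). It checks the likelihood ratio from the example above, then shows the counting mechanism itself:

```python
import numpy as np

def laplace_pdf(x, scale):
    """Density of the Laplace distribution centered on 0."""
    return np.exp(-abs(x) / scale) / (2 * scale)

epsilon = np.log(3)
scale = 1 / epsilon

# The attacker sees 1003. How much more likely is a true count of 1001
# (noise of 2) than a true count of 1000 (noise of 3)?
ratio = laplace_pdf(1003 - 1001, scale) / laplace_pdf(1003 - 1000, scale)
print(ratio, np.exp(epsilon))  # both print 3 (up to float rounding)

def private_count(true_count, epsilon):
    """epsilon-DP count, assuming each user contributes at most 1."""
    return true_count + np.random.laplace(scale=1.0 / epsilon)

print(private_count(1001, epsilon))
```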

Counting things

OK, so counting unique users was pretty easy. Counting things must also be straightforward, right? Let's say you have a database of suggestions that people sent to your company using a feedback form. You want to publish the number of suggestions you received on a given day. Meanwhile, the attacker wants to get an idea of how many complaints their target published.

What's different from the previous scenario? Can't we just add noise picked from Laplace(1/ε) and get ε-differential privacy? There's a catch: what if someone sent more than one complaint during one day? Let's say someone was super unhappy and sent five complaints. The other 1000 customers sent one complaint each. The influence of this one disgruntled customer will be larger than before. The two distributions now look like this:

[Graph: the two output distributions, now centered 5 apart instead of 1]

The difference between the curves is much larger than before. Their ratio is at most e^(5ε), so using noise of scale 1/ε only gives 5ε-differential privacy. To fix this, we need to add more noise. How much more? It depends on the maximum contribution of one individual user. If the maximum number of complaints in one day is 5, you must add 5 times the amount of noise. In this example, using Laplace(5/ε) would give you ε-differential privacy.

[Graph: the same two distributions with noise of scale 5/ε]
Note that you can't fully automate this process: you need to know what the largest contribution can be. A human, with some knowledge of the process, must make a judgment call. In our case, this could be "users won't post more than five complaints per day".

What happens if that judgment call is wrong, and a user later decides to post 10 complaints in one day? To preserve the desired level of privacy, you need to clamp all values to the estimated maximum. In other words, for this outlier user, you would only count 5 complaints in the non-noisy sum.

This process can introduce unexpected bias in the data. So, be careful when estimating the largest contribution! If clamping only happens very rarely, you should be fine.
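Here's what this looks like in code: a sketch, assuming the raw data is a list of user IDs with one entry per complaint, and that the cap of 5 is our judgment call.

```python
from collections import Counter
import numpy as np

MAX_PER_USER = 5  # judgment call: "users won't post more than 5 complaints/day"

def private_complaint_count(complaint_user_ids, epsilon):
    """epsilon-DP count of complaints received in one day.

    Each user's contribution is clamped to MAX_PER_USER, so the maximum
    influence of one person is 5, and the noise is scaled accordingly:
    Laplace(5/epsilon).
    """
    per_user = Counter(complaint_user_ids)
    clamped_total = sum(min(c, MAX_PER_USER) for c in per_user.values())
    return clamped_total + np.random.laplace(scale=MAX_PER_USER / epsilon)
```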

Summing or averaging numbers

Let's say each of your users gives your customer service a rating, between -10 and 10. You want to release the average rating. Computing an average is pretty much the same as computing a sum: add all ratings, then divide by the number of users¹. So, what do we do to the sum to achieve differential privacy?

Among all rating options, we only have to consider the worst possible case. How far can the two noise curves be from each other? If the values are all between -10 and 10, the greatest possible difference is 10−(−10)=20. It happens when the attacker tries to determine whether a user voted -10 or 10.

Like in the previous example, you have to add noise of Laplace(20/ε) to get ε-differential privacy. And just as before, you need to check that each value is between your theoretical minimum and maximum. If you find an anomalous value, e.g. lower than the minimum, you need to clamp it to the minimum before adding it to the sum.
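As a sketch (same caveats as before: illustrative names, not a vetted library):

```python
import numpy as np

def private_sum(ratings, epsilon, low=-10, high=10):
    """epsilon-DP sum of ratings, each clamped to [low, high].

    The worst case is a rating changing from low to high, so the
    maximum influence of one user is high - low = 20, and the noise
    is Laplace(20/epsilon).
    """
    clamped = [min(max(r, low), high) for r in ratings]
    return sum(clamped) + np.random.laplace(scale=(high - low) / epsilon)
```

For the average, you would divide this noisy sum by the number of users (see the first footnote about whether that count itself needs noise).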

In some cases, estimating these minimum or maximum values is difficult. For example, if you're computing the average salary in a large group of people, how should you estimate the upper salary limit? I don't see that problem as a usability flaw of differential privacy. Rather, it suggests that averages are not a meaningful metric in the presence of outliers. Removing these outliers is a good idea for both accuracy and privacy :-)

Releasing many things at once

OK. What if you don't want to release only one statistic, but many of them? Can you add noise to each of them and be fine? Well… it depends. The main question you have to ask yourself is: what is the maximum influence of one individual? There are two distinct possibilities.

Statistics are about different people

Suppose you want to release the number of users you have depending on their age ranges: 20-29, 30-39, 40-49, etc.

Each user will have an influence on at most one of these categories: either someone is in a given age range, or they're in another one. This situation often appears when you're trying to release a histogram:

[Graph: histogram of user counts per age range]

When you're in this case, you can safely add Laplace noise of scale 1/ε to each bucket count. There is no problematic interaction between buckets. Releasing the entire histogram is still ε-differentially private.

[Graph: the same histogram with Laplace(1/ε) noise added to each bucket]

Pretty easy, right? Note that this histogram looks a bit weird: the counts are not integers, and one count ended up being negative! We can make it a bit less suspicious by rounding all counts and replacing all negative numbers by 0.

[Graph: the rounded histogram, with negative counts replaced by 0]

This type of post-processing is allowed, thanks to a nifty property of differential privacy. If you take differentially private data, and make it go through a fixed transformation, you still get differential privacy². Convenient!
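Put together, a histogram release with post-processing might look like this (a sketch; the bucket names are assumed to be known in advance, a point we'll come back to below):

```python
import numpy as np

def private_histogram(bucket_counts, epsilon):
    """Add Laplace(1/epsilon) noise to each bucket count.

    Each user appears in at most one bucket, so releasing all the
    noisy counts together is still epsilon-DP.
    """
    return {b: c + np.random.laplace(scale=1.0 / epsilon)
            for b, c in bucket_counts.items()}

def tidy(noisy_counts):
    """Post-processing: round, and clip negatives to 0. Still epsilon-DP."""
    return {b: max(0, round(c)) for b, c in noisy_counts.items()}

counts = {"20-29": 412, "30-39": 327, "40-49": 198}
print(tidy(private_histogram(counts, epsilon=np.log(3))))
```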

One person is in multiple statistics

OK, what if you're releasing multiple statistics, but this time, they might all be about the same user? Let's say that you want to publish how many of your users…

  • are younger than 35;
  • are using an iOS device;
  • are colorblind;
  • have started using your app less than a month ago.

The same user could be in all those categories! In this scenario, you can't add Laplace noise of scale 1/ε to each count and get ε-differential privacy. Instead, you have to consider each count as a separate data release. Thus, if you have C different counts, you have to add Laplace noise of scale C/ε to each of them. Each independent release will be ε/C-differentially private. And we can now use the composition property of differential privacy! This allows us to conclude that the entire release is ε-differentially private.

This works for any kind of statistics, not just unique counts. Want to release several pieces of information? Count the maximum influence of one single person, and "split" your ε between each data release. This ε is your "privacy budget": you choose ε_1, …, ε_C whose sum is ε, and you release the i-th statistic with ε_i-differential privacy. This solution is more flexible than simply using ε/C-differential privacy on each statistic. If one statistic is more important, or more sensitive to noise, you can allocate more budget to it. The larger the budget portion, the lower the noise. Of course, you will have to add more noise to the other values.
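A sketch of this budget-splitting approach (the category names and budget values are made up for illustration):

```python
import numpy as np

def private_release(true_counts, budgets):
    """Release several counts that may all involve the same user.

    `budgets` assigns epsilon_i to each statistic; by composition,
    the whole release is (sum of budgets)-differentially private.
    Each count has sensitivity 1, so each gets Laplace(1/epsilon_i) noise.
    """
    assert true_counts.keys() == budgets.keys()
    return {name: count + np.random.laplace(scale=1.0 / budgets[name])
            for name, count in true_counts.items()}

# Total budget epsilon = 1.0, with more budget (less noise) for the first count:
budgets = {"under_35": 0.5, "ios": 0.2, "colorblind": 0.2, "new_user": 0.1}
counts = {"under_35": 1540, "ios": 893, "colorblind": 61, "new_user": 202}
print(private_release(counts, budgets))
```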

Traps to avoid

The mechanisms we saw today are pretty straightforward. But anonymization is full of traps: there are still a number of things that can go wrong.

  • When summing or averaging numbers, clamping them to the minimum and maximum is essential. Otherwise, all guarantees fly out the window.
  • Pay attention to how you implement this clamping in practice: special values like NaN can lead to surprising behavior (see the sketch after this list).
  • Of course, you're not allowed to cheat. You have to choose your privacy strategy in advance, then apply it, and release the noisy data. You can't look at the result, decide it's not accurate enough, and regenerate the noise: that would skew the randomness, and you would lose the privacy guarantees.
  • When releasing histograms, it's important to choose each category in advance. If you have to look at your data to know what your categories are, you have to use more subtle methods.
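To illustrate the NaN trap from the second point: in Python, comparisons with NaN are always false, so `min`/`max` keep whichever argument happens to come first, and a NaN can sail straight through a naive clamp.

```python
import math

x = float("nan")
print(max(x, -10))           # nan  -- the "clamp" silently fails
print(max(-10, x))           # -10  -- argument order changes the result!
print(min(max(x, -10), 10))  # nan

def clamp(x, low, high):
    """Clamp that handles NaN explicitly (here, mapped to the lower bound)."""
    if math.isnan(x):
        return low
    return min(max(x, low), high)
```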

The last point in the list above, about choosing histogram categories in advance, is often a problem in practice. For example, say that you want to count how many times each word was used in customer complaints. You can't make a definite word list: people could use words that you didn't predict, or make typos. In that case, the technique I described for histograms doesn't work.

Why doesn't it work? How to fix this? All this, and more, in the next article!


  1. If you want to be extra pedantic, you might also want to add noise to your total number of users. That depends on the flavor of definition that you choose. I'm not going to that level of detail here, and you probably shouldn't either. 

  2. If you're wondering why this is true, here's a super short proof. If there were a post-processing function that could break the differential privacy property, the attacker could simply run it themselves, and use it to distinguish between two outcomes. But that's impossible, because differential privacy forbids it :-) 
