Differential privacy in (a bit) more detail


Part of a series on differential privacy. In case you need some more reading material!

  1. Why differential privacy is awesome presents a non-technical explanation of the definition.
  2. Differential privacy in (a bit) more detail (this article) introduces the formal definition, with very little math.
  3. Differential privacy in practice (easy version) explains how to make simple statistics differentially private.
  4. Almost differential privacy describes how to publish private histograms without knowing the categories in advance.
  5. Local vs. global differential privacy presents the two main models of differential privacy, depending on who the attacker is.
  6. The privacy loss random variable explains the real meaning of (ε,δ)-differential privacy.

 


As I mentioned in the previous article, differential privacy is pretty awesome. If I did a good job, you're now wondering what the real definition looks like. So in this post, I will go into a bit more detail about what differential privacy actually means, and why it works so well. There will be some math! But I promise I will explain all the concepts I use, and give lots of intuition.

The definition

We saw that a process satisfies differential privacy if its output is basically the same when you change the data of one individual. And by "basically the same", we meant "the probabilities are close".

 

 

Let's now translate that into a formal definition.

A process A is ε-differentially private if for all databases D1 and D2 which differ in only one individual:

P[A(D1)=O] ≤ e^ε · P[A(D2)=O]

… and this must be true for all possible outputs O. Let's unpack this.

P[A(D1)=O] is the probability that when you run the process A on the database D1, the output is O. This process is probabilistic: if you run it several times, it might give you different answers. A typical process might be: "count the people with blue eyes, add some random number to this count, and return this sum". Since the random number changes every time you run the process, the results will vary.
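
For concreteness, here is a minimal Python sketch of such a process. The Laplace noise (with scale 1/ε) is my own choice for illustration; the definition above doesn't prescribe any particular noise distribution.

```python
import random

def noisy_blue_eyes_count(database, epsilon=1.0):
    """Count the people with blue eyes, add some random noise, return the sum."""
    true_count = sum(1 for person in database if person.get("eye_color") == "blue")
    # A Laplace(0, 1/epsilon) sample, built as the difference of two
    # exponential samples. This is one common choice, not the only one.
    scale = 1.0 / epsilon
    noise = random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)
    return true_count + noise

people = [{"eye_color": "blue"}, {"eye_color": "brown"}, {"eye_color": "blue"}]
# Two runs on the same database will typically return different outputs.
print(noisy_blue_eyes_count(people), noisy_blue_eyes_count(people))
```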

e^ε is the exponential function applied to the parameter ε > 0. If ε is very close to 0, then e^ε is very close to 1, so the probabilities are very similar. The bigger ε is, the more the probabilities can differ.

Of course, the definition is symmetrical: you can replace D1 by D2 and vice-versa, and the two databases will still differ in only one individual. So we could replace it by:

e^(-ε) · P[A(D2)=O] ≤ P[A(D1)=O] ≤ e^ε · P[A(D2)=O]

Thus, this formula means that the output of the process is similar if you change or remove the data of one person. The degree of similarity depends on ε: the smaller it is, the more similar the outputs are.

What does this similarity have to do with privacy? First, I'll explain this with an intuitive example. Then, I'll formalize this idea with a more generic interpretation.

A simple example: randomized response

Suppose you want to do a survey to know how many people are illegal drug users. If you naively go out and ask people whether they're using illegal drugs, many will lie to you. So you devise the following mechanism. The participants no longer directly answer the question "have you consumed illegal drugs in the past week?". Instead, each of them will flip a coin, without showing it to you.

  • On heads, the participant tells the truth (Yes or No).
  • On tails, they flip a second coin. If the second coin lands on heads, they answer Yes. Otherwise, they answer No.

How is this better for survey respondents? They can now answer Yes without revealing that they're doing something illegal. When someone answers Yes, you can't know their true answer for sure. They could be actually doing drugs, but they might also have answered at random.
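
Here is a minimal Python sketch of this coin-flipping mechanism, seen from one respondent's side (the function and variable names are mine, for illustration):

```python
import random

def randomized_response(true_answer):
    """Answer the survey using the coin-flipping mechanism.

    true_answer: the respondent's real answer, "Yes" or "No".
    """
    if random.random() < 0.5:
        # First coin lands on heads: tell the truth.
        return true_answer
    # First coin lands on tails: answer with a second coin flip.
    return "Yes" if random.random() < 0.5 else "No"
```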

Let's compute the probabilities of each answer for a drug user.

  • With probability 50%, they will say the truth and answer Yes.
  • With probability 50%, they will answer at random.
    • They then have another 50% chance to answer Yes, so 25% chance in total.
    • Similarly, in total, they have a 25% chance to answer No.

All in all, we get a 75% chance to answer Yes and a 25% chance to answer No. For someone who is not doing drugs, the probabilities are reversed: 25% chance to answer Yes and 75% to answer No. Using the notations from earlier:

  • P[A(Yes)=Yes] = 0.75, P[A(Yes)=No] = 0.25
  • P[A(No)=Yes] = 0.25, P[A(No)=No] = 0.75

Now, 0.75 is three times bigger than 0.25. So if we choose ε such that e^ε = 3 (that's ε ≈ 1.1), this process is ε-differentially private. So this plausible deniability translates nicely into the language of differential privacy.
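
We can double-check this value of ε directly from the probability table above; here is a quick Python sketch (the worst-case ratio between the two rows, over all outputs, is exactly what the definition constrains):

```python
import math

# P[A(true_answer) = reported_answer] for the mechanism above.
p = {("Yes", "Yes"): 0.75, ("Yes", "No"): 0.25,
     ("No", "Yes"): 0.25,  ("No", "No"): 0.75}

# The smallest epsilon satisfying the definition is the log of the
# worst-case ratio between the two rows, over all possible outputs.
worst_ratio = max(max(p[("Yes", o)] / p[("No", o)],
                      p[("No", o)] / p[("Yes", o)]) for o in ("Yes", "No"))
print(math.log(worst_ratio))   # about 1.0986, i.e. ln(3) ≈ 1.1
```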

Of course, with a differentially private process like this one, you're getting some noise into your data. But if you have enough answers, with high probability, the noise will cancel itself out. Suppose you have 1000 answers in total: 400 of them are Yes and 600 are No. About 50% of all 1000 answers are random, so you can remove 250 answers from each count. In total, you get 150 Yes answers out of 500 non-random answers, so about 30% of Yes overall.
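
This debiasing step is easy to write down in code; here is a small sketch using the same numbers as above (the helper name is mine):

```python
def estimate_true_yes_rate(yes_count, total_count):
    """Estimate the true proportion of Yes answers from randomized responses.

    In expectation, half of all answers are random, and half of those
    random answers are Yes: we subtract total_count/4 random Yes answers
    and divide by the total_count/2 expected truthful answers.
    """
    random_yes = total_count / 4
    truthful_answers = total_count / 2
    return (yes_count - random_yes) / truthful_answers

print(estimate_true_yes_rate(400, 1000))   # 0.3, i.e. about 30% of Yes
```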

What if you want more privacy? Instead of having the participants say the truth with probability 50%, you can have them tell the truth 25% of the time. What if you want less noise instead, at the cost of less protection? Have them tell the truth 75% of the time. Finding out ε and quantifying the noise for each option is left as an exercise for the reader =)

A generalization: quantifying the attacker's knowledge

Let's forget about the previous example and consider a more generic scenario. In line with the previous article, we will describe this scenario from the attacker's perspective. We have a mechanism A which is ε-differentially private. We run it on some database D, and release the output A(D) to an attacker. Then, the attacker tries to figure out whether someone (their target) is in D.

Under differential privacy, the attacker can't gain a lot of information about their target. And this is true even if this attacker has a lot of knowledge about the dataset. Let's take the strongest attacker we can think of: they know the whole database, except for their target. This attacker has to determine which database is the real one, between two options: one with their target in it (let's call it Din), the other without (Dout)¹.

So, in the attacker's model of the world, the actual database D can be either Din or Dout. They might have an initial suspicion that their target is in the database. This suspicion is represented by a probability, P[D=Din]. This probability can be anything between 0 and 1. Say, 0.9 if the attacker's suspicion is strong, 0.01 if they think it's very unlikely, 0.5 if they have no idea… Similarly, their suspicion that their target is not in the dataset is also a probability, P[D=Dout]. Since there are only two options, P[D=Dout] = 1 − P[D=Din].

Now, suppose the attacker sees that the mechanism returns output O. How much information did the attacker gain? This is captured by looking at how much their suspicion changed after seeing this output. In mathematical terms, we have to compare P[D=Din] with the updated suspicion P[D=Din | A(D)=O]. This updated suspicion is the attacker's model of the world after seeing O.

With differential privacy, the updated probability is never too far from the initial suspicion. And we can quantify this phenomenon exactly. For example, with ε = 1.1, here is what the upper and lower bounds look like.

[Figure: upper and lower bounds on the attacker's updated suspicion, as a function of their initial suspicion, for ε = 1.1]

The black line is what happens if the attacker didn't get their suspicion updated at all. The blue lines are the lower and upper bounds on the updated suspicion: it can be anywhere between the two. We can visualize the example mentioned in the previous article: for an initial suspicion of 50%, the updated suspicion is approximately between 25% and 75%.
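
As a sanity check, here is a small Python sketch of those bounds (the formulas come from combining Bayes' rule with the ε-differential privacy inequality; the function name is mine):

```python
import math

def suspicion_bounds(prior, epsilon):
    """Lower and upper bounds on the attacker's updated suspicion
    P[D = Din | A(D) = O], given their initial suspicion P[D = Din] = prior."""
    e = math.exp(epsilon)
    lower = prior / (prior + e * (1 - prior))
    upper = (e * prior) / (e * prior + (1 - prior))
    return lower, upper

print(suspicion_bounds(0.5, 1.1))   # approximately (0.25, 0.75)
print(suspicion_bounds(0.1, 5.0))   # upper bound of about 0.94
```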

How do we prove that these bounds hold? We'll need a result from probability theory (Bayes' rule), and some basic arithmetic manipulation. I reproduced the proof as simply as I could, but you still don't have to read it to follow the rest of this article.

What does this look like for various values of ε? We can draw a generalization of this graph with pretty colors:

[Figure: bounds on the updated suspicion as a function of the initial suspicion, for various values of ε]

For larger values of ε, this gets scary quite fast. Let's say you're using ε = 5. Then, an attacker can go from a small suspicion (say, 10%) to a very high degree of certainty (94%).

What about composition?

In the previous section, I formalized two claims I made in my last article. First, I explained what it means to quantify the attacker's information gain. Second, I picked an attacker with full background knowledge: if the attacker knows less to start with, the bounds we showed still hold.

What about the third claim? I said that differential privacy was composable. Suppose that two algorithms A and B are ε-differentially private. We want to prove that publishing the result of both is 2ε-differentially private. Let's call C the algorithm which combines A and B: C(D) = (A(D), B(D)). The output of this algorithm will be a pair of outputs: O = (OA, OB).

The insight is that the two algorithms are independent. They each have their own randomness, so the result of one does not impact the result of the other. This allows us to simply write:

P[C(D1)=O] = P[A(D1)=OA] · P[B(D1)=OB]
           ≤ e^(2ε) · P[A(D2)=OA] · P[B(D2)=OB]
           = e^(2ε) · P[C(D2)=O]

so C is 2ε-differentially private.
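
To make this concrete, here is a hedged numerical check of the composed bound, reusing the randomized response mechanism from earlier as both A and B (each respondent's answer plays the role of the database; the variable names are mine):

```python
import math
from itertools import product

# Output probabilities of the randomized response mechanism:
# P[(true answer) -> (reported answer)].
p = {("Yes", "Yes"): 0.75, ("Yes", "No"): 0.25,
     ("No", "Yes"): 0.25,  ("No", "No"): 0.75}
epsilon = math.log(3)

# Run the mechanism twice independently, and check the 2*epsilon bound
# for every possible pair of outputs (OA, OB).
for oa, ob in product(("Yes", "No"), repeat=2):
    p1 = p[("Yes", oa)] * p[("Yes", ob)]   # P[C(D1) = (OA, OB)], D1 = "Yes"
    p2 = p[("No", oa)] * p[("No", ob)]     # P[C(D2) = (OA, OB)], D2 = "No"
    assert p1 <= math.exp(2 * epsilon) * p2 + 1e-12
    assert p2 <= math.exp(2 * epsilon) * p1 + 1e-12

print("the 2ε bound holds for every possible pair of outputs")
```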

Future steps

I hope that I convinced you that differential privacy can be an excellent way to protect your data (if your εε is low). Now, if everything is going according to my master plan, you should be like… "This is awesome! I want to use it everywhere! How do I do that?"

Initially, I planned to answer this question in this post (insofar as it can be answered). But as I started writing it, I realized three things.

  • There are many different answers depending on what task you want to do.
  • There are many classical mistakes you can make when trying to use differential privacy. I would need to explain them to make sure you don't fall for them.
  • This post is pretty long already.

So, you guessed it, I'll keep that for the next article.

Thanks to Chao Li for introducing me to the Bayesian interpretation of differential privacy, and to a3nm, Armavica, immae, and p4bl0 for their helpful comments on drafts of this article (as well as previous ones).


  1. This can mean that Dout is the same as Din with one fewer user. This can also mean that Dout is the same as Din, except one user has been changed into some arbitrary other user. This distinction doesn't change anything in the reasoning, so we can simply forget about it. 
