Detecting Causal Relationships with Bayesian Networks in Python

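Although machine learning techniques can achieve good predictive performance, extracting causal relationships with the target variable is not straightforward. In other words: which variables have a direct causal effect on the target variable?

One branch of machine learning, Bayesian probabilistic graphical models, also known as Bayesian networks (BN), can be used to identify these causal factors.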

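Before we dive into the technical details of causal models, let us review some terminology: "correlation" and "association".

Note that correlation or association is not the same as causation. In other words, an observed relationship between two variables does not necessarily mean that one causes the other.

Technically, correlation refers to a linear relationship between two variables, whereas association refers to any relationship between two (or more) variables. Causation means that one variable (often called the predictor or independent variable) causes another (often called the outcome or dependent variable).

Next, I will briefly describe correlation and association with examples.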

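1.1. Correlation

The Pearson correlation coefficient is the most commonly used correlation coefficient. Its strength is denoted by r and ranges from -1 to 1.

When working with correlation, there are three possible outcomes:

  • Positive correlation: the two variables move in the same direction.
  • Negative correlation: an increase in one variable is associated with a decrease in the other.
  • No correlation: there is no (linear) relationship between the two variables.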
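The three outcomes can be made concrete with a small calculation. The following is a minimal sketch that computes Pearson's r from its definition using only the standard library; the helper name `pearson_r` and all data values are made up for illustration and do not come from the article.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (covariance up to a factor 1/n).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical toy data, chosen only to show the three cases.
x = [1, 2, 3, 4, 5, 6]
y_pos = [2, 4, 5, 4, 6, 8]   # tends to rise with x
y_neg = [9, 7, 6, 6, 3, 1]   # tends to fall as x rises
y_none = [4, 1, 2, 2, 1, 4]  # symmetric pattern, no linear trend

r_pos = pearson_r(x, y_pos)
r_neg = pearson_r(x, y_neg)
r_none = pearson_r(x, y_none)

print(f"positive: r = {r_pos:.3f}")
print(f"negative: r = {r_neg:.3f}")
print(f"none:     r = {r_none:.3f}")
```

Note that `y_none` is clearly related to `x` (it is U-shaped), yet r is zero: the coefficient only captures linear relationships.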

An example of a positive correlation is shown in Figure 1, which plots chocolate consumption against the number of Nobel laureates per country.

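Figure 1: Correlation between chocolate consumption and the number of Nobel laureates

Chocolate consumption could lead to more Nobel laureates. Or, the other way around, more Nobel laureates could lead to higher chocolate consumption. Despite the strong correlation, it is more plausible that unobserved variables, such as socioeconomic status or the quality of the education system, drive both chocolate consumption and the number of Nobel laureates.

In other words, we still do not know whether the relationship is causal. This does not mean that correlation is useless; it simply serves a different purpose.

Correlation by itself does not imply causation, because a statistical relationship does not uniquely determine a causal one.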

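1.2. Association

When we talk about association, we mean that certain values of one variable tend to co-occur with certain values of another variable.

From a statistical point of view, there are many measures of association, such as the chi-square test, the Fisher exact test, and the hypergeometric test. They are typically used when one or both variables are ordinal or nominal.

Note: correlation is a technical term whereas association is not, so its meaning is not always used consistently in statistics. It is therefore good practice to state explicitly what is meant whenever these terms are used.

To illustrate, I will use the hypergeometric test on the Titanic dataset to demonstrate whether two variables are associated.

The Titanic dataset is used in many machine learning examples, and it is well known that sex (female) is a good predictor of survival. Let me demonstrate how to compute the association between survival and being female.

First, install the bnlearn library and load only the Titanic dataset.

Question: what is the probability that a female passenger survived?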

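Null hypothesis: there is no relationship between survival and sex.

The hypergeometric test uses the hypergeometric distribution, a discrete probability distribution, to assess statistical significance. In this example, N is the population size (891), K is the number of success states in the population (342), n is the sample size, i.e. the number of draws (314), and x is the number of successes in the sample (233).

$$P(X \geq x) = 1 - \sum_{i=0}^{x-1} \frac{\binom{K}{i}\binom{N-K}{n-i}}{\binom{N}{n}}$$

Equation 1: Testing the association between survival and being female with the hypergeometric test

At the chosen significance level α, we can reject the null hypothesis and conclude that there is a statistically significant association between survival and being female.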
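The test above can be reproduced from the four counts alone. The sketch below is a standard-library illustration rather than the article's own code: the helper `hypergeom_upper_tail` is a hypothetical name, and it sums the upper tail of the hypergeometric distribution directly with exact integer arithmetic.

```python
from fractions import Fraction
from math import comb

def hypergeom_upper_tail(N, K, n, x):
    """P(X >= x) when drawing n items without replacement from a
    population of N items that contains K successes."""
    total = comb(N, n)  # number of ways to draw any n items
    upper = min(K, n)   # largest possible number of successes
    favourable = sum(
        comb(K, i) * comb(N - K, n - i) for i in range(x, upper + 1)
    )
    return Fraction(favourable, total)

# Counts quoted in the text: passengers, survivors, women, surviving women.
N, K, n, x = 891, 342, 314, 233

p_value = float(hypergeom_upper_tail(N, K, n, x))
share_female_survived = x / n

print(f"P(X >= {x}) = {p_value:.3e}")
print(f"Share of women who survived: {share_female_survived:.3f}")
```

The second printed value also answers the earlier question directly: it is the fraction of women in the dataset who survived. If SciPy is available, `scipy.stats.hypergeom.sf(x - 1, N, K, n)` should give the same tail probability.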

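Note that association by itself does not imply causation. We need to distinguish between marginal association and conditional association. The latter is the key building block of causal inference.

2. Causality

What is causality?

Causality means that one (independent) variable causes another (dependent) variable, and was formulated by Reichenbach (1956) as follows: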

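If two random variables X and Y are statistically dependent ($X \not\perp Y$), then either (a) X causes Y, (b) Y causes X, or (c) there exists a third variable Z that causes both X and Y. Furthermore, X and Y become independent given Z, i.e. $X \perp Y \mid Z$.

This definition is incorporated into Bayesian graphical models.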
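Case (c) of this principle can be illustrated with a small simulation. The sketch below is an assumption-laden toy example, not part of the original article: a binary common cause Z drives two binary variables X and Y through independent noise (the 20% flip probability, the sample size, and the helper names are all made up). X and Y should then be correlated marginally but approximately uncorrelated within each value of Z.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def noisy_copy(z, flip_prob=0.2):
    """Return z, flipped to 1 - z with probability flip_prob."""
    return z if random.random() > flip_prob else 1 - z

random.seed(0)  # fixed seed so the run is repeatable

samples = []
for _ in range(20000):
    z = random.randint(0, 1)  # common cause
    x = noisy_copy(z)         # Z -> X
    y = noisy_copy(z)         # Z -> Y (no direct X-Y edge)
    samples.append((x, y, z))

# Marginal dependence between X and Y.
r_marginal = pearson_r([s[0] for s in samples], [s[1] for s in samples])

# Dependence between X and Y within each stratum of Z.
r_given_z = {}
for value in (0, 1):
    stratum = [s for s in samples if s[2] == value]
    r_given_z[value] = pearson_r(
        [s[0] for s in stratum], [s[1] for s in stratum]
    )

print(f"corr(X, Y)         = {r_marginal:.3f}")
print(f"corr(X, Y | Z = 0) = {r_given_z[0]:.3f}")
print(f"corr(X, Y | Z = 1) = {r_given_z[1]:.3f}")
```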

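Bayesian graphical models are also known as Bayesian networks, Bayesian belief networks, Bayes Nets, causal probabilistic networks, and influence diagrams. These are different names for the same technique.

To determine causality, we can use Bayesian networks (BN).

Let us start with the graph and visualize the statistical dependencies between the three variables described by Reichenbach (see Figure 2). Nodes correspond to variables, and directed edges (arrows) represent dependencies or conditional distributions.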

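Figure 2: Directed acyclic graphs (DAGs) encode conditional independencies. (a, b, c) are equivalence classes. (a, b) cascade, (c) common parent, and (d) a special class with a V-structure

Four graphs can be created: (a, b) cascade, (c) common parent, and (d) V-structure. These graphs form the building blocks of Bayesian networks.

But how can we tell what causes what?

The conceptual idea for determining the direction of causality, i.e. which node influences which, is to hold one node constant and then observe the effect.

As an example, take DAG (a) in Figure 2, which describes that Z is caused by X, and Y is caused by Z. If we now hold Z constant and this model is correct, Y should not change. Every Bayesian network can be described by these four graphs, and with probability theory (see the section below) we can glue the parts together.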
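The idea of holding a node constant can be sketched with a simulated cascade. This is a toy model under stated assumptions, not the article's code: the chain X -> Z -> Y, the binary variables, the 10% noise level, and the function names are all chosen for illustration. The sketch compares how much Y differs between X = 1 and X = 0, first when Z is free to follow X and then when Z is held at a fixed value.

```python
import random

random.seed(1)  # fixed seed so the run is repeatable

def flip(value, prob=0.1):
    """Return value, flipped to 1 - value with probability prob."""
    return 1 - value if random.random() < prob else value

def simulate(n, fix_z=None):
    """Sample (x, y) pairs from the chain X -> Z -> Y.
    If fix_z is given, Z is held at that value instead of following X."""
    rows = []
    for _ in range(n):
        x = random.randint(0, 1)
        z = flip(x) if fix_z is None else fix_z
        y = flip(z)
        rows.append((x, y))
    return rows

def effect_of_x_on_y(rows):
    """Difference in the mean of Y between X = 1 and X = 0."""
    y_when_x1 = [y for x, y in rows if x == 1]
    y_when_x0 = [y for x, y in rows if x == 0]
    return sum(y_when_x1) / len(y_when_x1) - sum(y_when_x0) / len(y_when_x0)

observed = effect_of_x_on_y(simulate(20000))
z_held = effect_of_x_on_y(simulate(20000, fix_z=1))

print(f"Y difference with Z free:      {observed:.3f}")
print(f"Y difference with Z held at 1: {z_held:.3f}")
```

In this toy chain the difference should be large when Z is free and close to zero when Z is held constant, which is the pattern one would expect if the cascade model is correct.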

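Note that a Bayesian network is a Directed Acyclic Graph (DAG), and in this setting the DAG is given a causal interpretation. Directed means that every edge has a direction, and acyclic means that there are no (feedback) loops.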
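The acyclicity requirement is easy to check programmatically. The following sketch encodes the four building-block graphs as edge lists and tests each with Kahn's topological-sort algorithm. The node labels X, Y, Z and the direction chosen for cascade (b) are assumptions, since the figure itself is not reproduced here, and `is_acyclic` is a hypothetical helper name.

```python
from collections import defaultdict, deque

def is_acyclic(edges):
    """Return True if the directed graph given as (parent, child) pairs
    has no directed cycle (Kahn's algorithm)."""
    children = defaultdict(list)
    indegree = defaultdict(int)
    nodes = set()
    for parent, child in edges:
        children[parent].append(child)
        indegree[child] += 1
        nodes.update((parent, child))

    # Start from nodes without parents and peel the graph layer by layer.
    queue = deque(node for node in nodes if indegree[node] == 0)
    visited = 0
    while queue:
        node = queue.popleft()
        visited += 1
        for child in children[node]:
            indegree[child] -= 1
            if indegree[child] == 0:
                queue.append(child)
    # Every node is reached exactly when there is no cycle.
    return visited == len(nodes)

# Assumed encodings of the four structures over X, Y, Z.
structures = {
    "(a) cascade": [("X", "Z"), ("Z", "Y")],
    "(b) cascade": [("Y", "Z"), ("Z", "X")],
    "(c) common parent": [("Z", "X"), ("Z", "Y")],
    "(d) V-structure": [("X", "Z"), ("Y", "Z")],
}
feedback_loop = [("X", "Z"), ("Z", "Y"), ("Y", "X")]  # not a valid BN

for name, edges in structures.items():
    print(name, "is a DAG:", is_acyclic(edges))
print("feedback loop is a DAG:", is_acyclic(feedback_loop))
```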

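2.1. Probability theory

Probability theory, or more specifically Bayes' theorem (Bayes' rule), forms the foundation of Bayesian networks.

Bayes' rule is used to update model information and is expressed mathematically as follows, where Z denotes the hypothesis and X the evidence:

$$P(Z \mid X) = \frac{P(X \mid Z)\,P(Z)}{P(X)}$$

The equation consists of four parts:

  • The posterior probability is the probability that Z occurs given X.
  • The conditional probability, or likelihood, is the probability of the evidence given that the hypothesis is true. This can be derived from the data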
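As a worked instance of Bayes' rule, the sketch below reuses the Titanic counts quoted earlier (891 passengers, 342 survivors, 314 women, 233 surviving women). Treating "survived" as the hypothesis and "female" as the evidence is a choice made here for illustration; the variable names are likewise hypothetical.

```python
from fractions import Fraction

# Counts quoted in the text.
total, survived, female, female_survived = 891, 342, 314, 233

prior = Fraction(survived, total)                 # P(survived)
likelihood = Fraction(female_survived, survived)  # P(female | survived)
marginal = Fraction(female, total)                # P(female)

# Bayes' rule: posterior = likelihood * prior / marginal.
posterior = likelihood * prior / marginal         # P(survived | female)

print(f"prior      P(survived)          = {float(prior):.3f}")
print(f"likelihood P(female | survived) = {float(likelihood):.3f}")
print(f"marginal   P(female)            = {float(marginal):.3f}")
print(f"posterior  P(survived | female) = {float(posterior):.3f}")
```

Because exact fractions are used, the posterior reduces to 233/314, the same value obtained by counting surviving women directly, which is a useful sanity check on the formula.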