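K2 is a classic algorithm for Bayesian network structure learning. In essence it combines greedy hill-climbing search with a Bayesian scoring function. Using the Bayes Net Toolbox (BNT), this post explains how the algorithm works and then describes an improvement to K2 that borrows the incremental idea from Yasin A, Leray P. iMMPC: a local search approach for incremental Bayesian network structure learning[C]// International Symposium on Intelligent Data Analysis. Springer Berlin Heidelberg, 2011:401-412. The goal is to make the algorithm noticeably more efficient on large amounts of data.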
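The idea is simple. First, use K2 to learn a base structure, and while learning, save the search path, that is, every decision the algorithm makes. My change is that I keep not only the best choice at each step but also several runner-up choices (four in total, counting the best one). These saved choices then become the search space for the next round of learning. Note the assumption here, which is also the weakness of this approach: if an earlier decision turns out not to be optimal on the new data, the optimal one is assumed to be among the other high-scoring choices. The algorithm therefore discards the low-scoring models, which shrinks the search space and speeds up the search.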
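The first pass described above can be sketched roughly as follows. This is a minimal Python illustration, not the toolbox code: the names `k2_log_score`, `k2_with_candidates`, `trace`, and `top_k` are hypothetical, the data is assumed to be fully observed and discrete, and the score is the Cooper-Herskovits K2 metric in log form.

```python
from collections import Counter
from math import lgamma


def k2_log_score(data, child, parents, arity):
    """Log K2 (Cooper-Herskovits) score of `child` given `parents`.

    data  : list of rows, each row a tuple of ints (row[v] in range(arity[v]))
    arity : number of states of each variable
    """
    r = arity[child]
    counts = Counter()  # (parent configuration, child value) -> N_ijk
    totals = Counter()  # parent configuration -> N_ij
    for row in data:
        cfg = tuple(row[p] for p in parents)
        counts[(cfg, row[child])] += 1
        totals[cfg] += 1
    score = 0.0
    for cfg, n_ij in totals.items():
        # log[(r-1)! / (N_ij + r - 1)!] + sum_k log(N_ijk!)
        score += lgamma(r) - lgamma(n_ij + r)
        for k in range(r):
            score += lgamma(counts[(cfg, k)] + 1)
    return score


def k2_with_candidates(data, order, arity, max_fan_in=2, top_k=4):
    """Greedy K2 search that also records the top_k candidates of every step.

    Returns (parents, trace); trace[v][s] lists the best candidate parents
    considered for node v at step s, best first (an assumed bookkeeping format).
    """
    parents = {v: [] for v in order}
    trace = {v: [] for v in order}
    for pos, v in enumerate(order):
        best = k2_log_score(data, v, parents[v], arity)
        while len(parents[v]) < max_fan_in:
            # Only nodes earlier in the ordering may become parents.
            preds = [u for u in order[:pos] if u not in parents[v]]
            scored = sorted(
                ((k2_log_score(data, v, parents[v] + [u], arity), u) for u in preds),
                reverse=True,
            )
            if not scored:
                break
            trace[v].append([u for _, u in scored[:top_k]])
            top_score, top_u = scored[0]
            if top_score <= best:  # no candidate improves the score
                break
            parents[v].append(top_u)
            best = top_score
    return parents, trace
```

The default `top_k=4` mirrors the four saved choices mentioned above. Candidates are recorded even for a step whose best candidate is rejected, so that a later pass can reconsider them.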
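As the figure below shows, the left side is the first run of the algorithm, in which every step saves 4 candidate choices. When new data arrives, the incremental algorithm on the right is used instead, and the space searched at each step is much smaller (only 4 choices).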
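The incremental pass on the right could look something like the sketch below. It is only an illustration under assumptions: `incremental_k2` is a hypothetical name, `trace` is assumed to map each node to the candidate lists saved per step during the first run, and `score(node, parents)` stands for whatever scoring function is evaluated on the updated data.

```python
def incremental_k2(score, order, trace, max_fan_in=2):
    """Redo the greedy parent search, looking only at saved candidates.

    score : callable (node, list_of_parents) -> float, computed on the new data
    trace : dict node -> list of candidate lists, one list per search step
    """
    parents = {v: [] for v in order}
    for v in order:
        best = score(v, parents[v])
        for candidates in trace.get(v, []):
            if len(parents[v]) >= max_fan_in:
                break
            # The search space at this step is just the saved candidates.
            options = [u for u in candidates if u not in parents[v]]
            if not options:
                break
            top_score, top_u = max((score(v, parents[v] + [u]), u) for u in options)
            if top_score <= best:  # none of the saved candidates helps
                break
            parents[v].append(top_u)
            best = top_score
    return parents
```

Each step evaluates at most as many candidates as were saved (4 in this post) rather than every predecessor in the ordering. The cost of that saving is the assumption stated earlier: a parent that was never among the saved candidates cannot be recovered by this pass.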
Without further ado, here is the code:
function dag = learn_struct_K2(data, ns, order, varargin)
% LEARN_STRUCT_K2 Greedily learn the best structure compatible with a fixed node ordering
% best_dag = learn_struct_K2(data, node_sizes, order, ...)
%
% data(i,m) = value of node i in case m (can be a cell array).
% node_sizes(i) is the size of node i.
% order(i) is the i'th node in the topological ordering.
%
% The following optional arguments can be specified in the form of name/value pairs:
% [default value in brackets]
%
% max_fan_in - this is the largest number of parents we allow per node [N]
% scoring_fn - 'bayesian' or 'bic' [ 'bayesian' ]
% Currently, only networks with all tabular nodes support Bayesian scoring.
% type - type{i} is the type of CPD to use for node i, where the type is a string
% of the form 'tabular', 'noisy_or', 'gaussian', etc. [ all cells contain 'tabular' ]
% params - params{i} contains optional arguments passed to the CPD constructor for node i,
% or [] if none. [ all cells contain {'prior', 1}, meaning use uniform Dirichlet priors ]
% discrete - the list of discrete nodes [ 1:N ]
% clamped - clamped(i,m) = 1 if node i is clamped in case m [ zeros(N, ncases) ]
% verbose - 'yes' means display output while running [ 'no' ]
%
% e.g., dag = learn_struct_K2(data, ns, order, 'scoring_fn', 'bic', 'params', [])
%
% To be backwards compatible with BNT2, you can also specify arguments as follows
% dag = learn_struct_K2(data, node_sizes, order, max_fan_in)
%
% This algorithm is described in
% - Cooper and Herskovits, "A Bayesian method for the induction of probabilistic
% networks from data", Machine Learning Journal 9:308--347, 1992
[n ncases] = size(data);
% set default params
type = cell