The Improved Iterative Scaling Algorithm for Machine-Learning Text Classification, with a Java Implementation

Original article. Latest recommended article published on 2023-07-14 08:57:23

License: CC 4.0 BY-SA

Article tags: #machine learning #natural language processing #nlp #text classification


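This article introduces the Improved Iterative Scaling (IIS) algorithm as used for text classification in machine learning. It explains the mathematical theory behind the algorithm, including the maximum-likelihood principle and the steps of the algorithm. The practical part discusses implementation details and the training and testing of the model, and provides a download link for the code and data.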


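Mathematical theory of the IIS algorithm

Background

The IIS algorithm is mainly used to compute maximum-likelihood parameter estimates. This article is a walkthrough of Adam Berger's description of the algorithm (the IIS Algorithm). We start from the following probabilistic model.

$p_{\Lambda}(y|x)\equiv \frac{1}{Z_{\Lambda}(x)}\exp\left(\sum_{i=1}^{n}\lambda_{i}f_{i}(x,y)\right)$

where $Z_{\Lambda}(x) = \sum_{y}\exp\left(\sum_{i=1}^{n}\lambda_{i}f_{i}(x,y)\right)$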

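Notation:

$p_{\Lambda}(y|x)$: the probability that the output label is y given that the input document is x. (In Berger's article this is framed as a sentence-probability problem in language modeling; here it is used for text classification.) y ranges over the labels in your model: for an input document we compute the probability of every label, and the label with the highest probability is the prediction.

$\lambda_{i}$: the weights that IIS training produces. i ranges from 1 to n, where n is the total number of features in the training dataset.

$f_{i}(x,y)$: the feature function. Here I define it to be 1 if word[i] occurs in document x and word[i] also belongs to label y, and 0 otherwise.

$f_{i}(x,y)=\left\{\begin{matrix}1, & \mathrm{word}[i] \in \mathrm{document}\;x\;\mathrm{and}\;\mathrm{word}[i] \in \mathrm{label}[y]\\ 0, & \mathrm{otherwise} \end{matrix}\right.$

$Z_{\Lambda}(x)$: the normalization term. It makes the probabilities over all y sum to 1, so each one lies between 0 and 1.

$\Lambda$: the full weight vector, $\Lambda \equiv \left\{ \lambda_{1},\lambda_{2},\cdots,\lambda_{n} \right\}$.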
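To make the model concrete, here is a minimal Java sketch of how $p_{\Lambda}(y|x)$ could be evaluated for one document with the word-based feature function defined above. The class name `MaxentProbability`, the method `conditional`, and the use of plain `Set<String>` containers are assumptions made for illustration; they are not taken from the author's code.

```java
import java.util.List;
import java.util.Set;

/** Hypothetical helper (not the author's class): evaluates p_Lambda(y|x). */
final class MaxentProbability {

    /**
     * @param docWords    the words present in document x
     * @param wordInLabel wordInLabel.get(y) = the words that belong to label y
     * @param vocabulary  vocabulary.get(i) = the word behind feature i
     * @param lambda      lambda[i] = weight of feature i
     * @return p[y] = p_Lambda(y|x) for every label y
     */
    static double[] conditional(Set<String> docWords,
                                List<Set<String>> wordInLabel,
                                List<String> vocabulary,
                                double[] lambda) {
        int labels = wordInLabel.size();
        double[] score = new double[labels];
        double max = Double.NEGATIVE_INFINITY;
        for (int y = 0; y < labels; y++) {
            for (int i = 0; i < vocabulary.size(); i++) {
                String word = vocabulary.get(i);
                // f_i(x,y) = 1 iff the word is in the document and in label y
                if (docWords.contains(word) && wordInLabel.get(y).contains(word)) {
                    score[y] += lambda[i];
                }
            }
            max = Math.max(max, score[y]);
        }
        double z = 0.0;
        double[] p = new double[labels];
        for (int y = 0; y < labels; y++) {
            // Subtracting the largest score avoids overflow in exp().
            p[y] = Math.exp(score[y] - max);
            z += p[y];
        }
        for (int y = 0; y < labels; y++) {
            p[y] /= z; // division by Z_Lambda(x)
        }
        return p;
    }
}
```

With all weights at 0 every label gets the same probability, which is why initializing $\lambda_{i}=0$ is a neutral starting point.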

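The Improved Iterative Scaling algorithm

Maximum likelihood

First we construct the log-likelihood.

$L_{\widetilde{p}}(\Lambda)\equiv\sum_{x,y}\widetilde{p}(x,y) \log p_{\Lambda}(y|x)$

Here $\widetilde{p}(x,y)$ is the empirical probability that the pair <x,y> occurs in the dataset, usually $\frac{1}{N}$ for a dataset of N pairs. I will not go through the derivation in detail; it can be found in Berger's paper, and I only state the results here. Because we want the maximum likelihood, we need to maximize this expression, which means differentiating it. Since $\Lambda$ has n components, this requires n partial derivatives, one for each $\lambda_{i}$.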

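Optimizing this expression directly is fairly cumbersome, so Berger's article proceeds iteratively, a small step at a time. The idea is that it is enough to find a $\Delta$ for which the following holds:

$L_{\widetilde{p}}(\Lambda+\Delta)-L_{\widetilde{p}}(\Lambda)\geq 0$

Moreover, $L_{\widetilde{p}}(\Lambda+\Delta)-L_{\widetilde{p}}(\Lambda)\geq \sum_{x,y}\widetilde{p}(x,y)\sum_{i}\delta_{i}f_{i}(x,y)+1-\sum_{x}\widetilde{p}(x)\sum_{y}p_{\Lambda}(y|x)\exp\left(\sum_{i}\delta_{i}f_{i}(x,y)\right)$

So it suffices to make the right-hand side of this inequality greater than 0, and the unknowns are now the $\delta_{i}$. The paper then relaxes this bound once more, to a bound $\beta(\Delta)$ whose partial derivative with respect to $\delta_{i}$ involves only that one $\delta_{i}$, so each $\delta_{i}$ can be solved for separately:

$\frac{\partial \beta (\Delta) }{\partial \delta_{i}}=\sum_{x,y}\widetilde{p}(x,y)f_{i}(x,y)-\sum_{x}\widetilde{p}(x)\sum_{y}p_{\Lambda}(y|x)f_{i}(x,y)e^{\delta_{i}f^{\#}(x,y)}$

$f^{\#}(x,y)$ counts the features that are active in the pair <x,y>. In the text classification model it is the number of distinct features (note: distinct kinds, not total occurrences).

$f^{\#}(x,y) \equiv \sum_{i}f_{i}(x,y)$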

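Because $f^{\#}(x,y)$ may differ from pair to pair, solving for $\delta_{i}$ may in general require Newton's method. However, two papers (I have forgotten their titles) show that it is enough to substitute the largest $f^{\#}(x,y)$, which is what we want for ease of computation.

So here we choose:

$M=\max_{x,y} f^{\#}(x,y)$, the maximum over all x and y.

This finally gives a closed-form expression for $\delta_{i}$:

$\delta_{i}=\frac{1}{M}\ln\frac{\sum_{x,y}f_{i}(x,y)\widetilde{p}(x,y)}{\sum_{x}\widetilde{p}(x)\sum_{y}p_{\Lambda}(y|x)f_{i}(x,y)}$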
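For readers who want the intermediate step, here is a short sketch of how this closed form follows from the partial derivative, under the assumption that $f^{\#}(x,y)$ is replaced by the constant $M$ everywhere. Setting the derivative to zero gives

$\sum_{x,y}\widetilde{p}(x,y)f_{i}(x,y)-e^{M\delta_{i}}\sum_{x}\widetilde{p}(x)\sum_{y}p_{\Lambda}(y|x)f_{i}(x,y)=0$

$e^{M\delta_{i}}=\frac{\sum_{x,y}\widetilde{p}(x,y)f_{i}(x,y)}{\sum_{x}\widetilde{p}(x)\sum_{y}p_{\Lambda}(y|x)f_{i}(x,y)}$

and taking the logarithm of both sides and dividing by $M$ yields the expression for $\delta_{i}$. As a numeric illustration with made-up values: if the numerator is 0.30, the denominator is 0.15 and $M=2$, then $\delta_{i}=\frac{1}{2}\ln 2\approx 0.347$. The step is exact when $f^{\#}(x,y)=M$ for every pair; otherwise it is the simplification adopted in this article.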

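Steps of the IIS algorithm

Step 1: choose arbitrary initial values for the $\lambda_{i}$; usually each one is initialized to 0.

Step 2: repeat the following operations until convergence, which is usually fast.

-----Solve for $\delta_{i}$ using the final expression derived above

-----Set $\lambda_{i} \leftarrow \lambda_{i} + \delta_{i}$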
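The two steps can be sketched in Java as follows. This is a minimal illustration rather than the author's implementation: the class `IisSketch`, the dense array layout `f[d][y][i]` (1 if feature i is active for training document d paired with label y), and the choice to leave a weight untouched when either expectation is 0 are all assumptions. It also assumes every training document is distinct, so that $\widetilde{p}(x)=\frac{1}{N}$.

```java
/** Minimal IIS sketch over dense 0/1 feature arrays (illustrative names only). */
final class IisSketch {

    /** Returns p_Lambda(y|x) for one document, where fd[y][i] = f_i(x, y). */
    static double[] conditional(int[][] fd, double[] lambda) {
        double[] p = new double[fd.length];
        double max = Double.NEGATIVE_INFINITY;
        for (int y = 0; y < fd.length; y++) {
            for (int i = 0; i < lambda.length; i++) {
                p[y] += lambda[i] * fd[y][i];
            }
            max = Math.max(max, p[y]);
        }
        double z = 0.0;
        for (int y = 0; y < p.length; y++) {
            p[y] = Math.exp(p[y] - max); // shifted for numerical stability
            z += p[y];
        }
        for (int y = 0; y < p.length; y++) {
            p[y] /= z;
        }
        return p;
    }

    /** Runs a fixed number of IIS updates and returns the weights lambda. */
    static double[] train(int[][][] f, int[] trueLabel, int iterations) {
        int nDocs = f.length;
        int nFeat = f[0][0].length;
        double[] empirical = new double[nFeat]; // numerator, independent of lambda
        int m = 0;                              // M = max over (x, y) of f#(x, y)
        for (int d = 0; d < nDocs; d++) {
            for (int y = 0; y < f[d].length; y++) {
                int active = 0;
                for (int i = 0; i < nFeat; i++) {
                    active += f[d][y][i];
                    if (y == trueLabel[d]) {
                        empirical[i] += (double) f[d][y][i] / nDocs;
                    }
                }
                m = Math.max(m, active);
            }
        }
        double[] lambda = new double[nFeat];    // step 1: start from zero
        for (int t = 0; t < iterations; t++) {  // step 2: repeat the update
            double[] model = new double[nFeat]; // denominator under current lambda
            for (int d = 0; d < nDocs; d++) {
                double[] p = conditional(f[d], lambda);
                for (int y = 0; y < p.length; y++) {
                    for (int i = 0; i < nFeat; i++) {
                        model[i] += p[y] * f[d][y][i] / nDocs;
                    }
                }
            }
            for (int i = 0; i < nFeat; i++) {
                // Assumption: skip features whose ratio would be 0 or undefined.
                if (empirical[i] > 0 && model[i] > 0) {
                    lambda[i] += Math.log(empirical[i] / model[i]) / m;
                }
            }
        }
        return lambda;
    }
}
```

A fixed iteration count mirrors the `times` parameter of the author's `trainData`; a convergence test on the size of the largest $\delta_{i}$ would be an alternative stopping rule.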

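Implementation, model training and testing

The code is too long to show in full here, so I will explain part of the core code and make the complete code and data available for download.

First, the definition of the feature function.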
/**
 * Feature function.
 * wordInLabel[c] is a map that records whether a word also occurs under label c.
 * Note: a word can belong to many labels.
 * allFeature is a list containing all the words in the training dataset.
 * @param i index of the word, i.e. word[i] in allFeature
 * @param d index of the document, i.e. doc[d]
 * @param c index of the label paired with this document
 * @return 1 if word[i] occurs in doc[d] and also belongs to label c, otherwise 0
 */
private int featureFun(int i, int d, int c) {
    if (doc[d].getDoc().containsKey(allFeature.get(i)) && wordInLabel[c].containsKey(allFeature.get(i)))
        return 1;
    else
        return 0;
}
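doc is a document structure that I defined. It contains a map<word,value> holding the document itself: the key is a word and the value is the number of times that word occurs in the document. The other member of the structure is label, the label of the document. It is mainly used during training and is not needed during testing.

Next comes the core code of the training stage, which repeatedly solves for $\delta_{i}$ and then updates $\lambda_{i}$.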
    /**
     * Train the data.
     * getExponentialProb(x,y) is used to calculate p(y|x).
     * lambda[] holds the final weight coefficients.
     * @param times the number of iterations, chosen by the caller
     */
    private void trainData(int times){
        lambda = new double[allFeature.size()];
        double[] delta = new double[allFeature.size()];
        for(int t=0;t …

[The rest of this listing is cut off in the source.]

In the expression for $\delta_{i}$, the numerator clearly does not depend on $\lambda$, so I compute it ahead of time in a separate method and store it. This is the usual idea of trading space for time.
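As an illustration of that precomputation, the sketch below computes the numerator $\sum_{x,y}\widetilde{p}(x,y)f_{i}(x,y)$ once for every feature. It relies on an assumption about how `wordInLabel` is built: if the word set of each label is collected from the training documents carrying that label, then for a training pair the feature is 1 exactly when the word occurs in the document, so the numerator reduces to the fraction of training documents containing word i. The class and method names are hypothetical.

```java
import java.util.List;
import java.util.Set;

/** Hypothetical helper: the lambda-independent numerator of the delta update. */
final class EmpiricalExpectation {

    /**
     * @param trainingDocs one word set per training document
     * @param vocabulary   vocabulary.get(i) = the word behind feature i
     * @return expectation[i] = share of training documents that contain word i
     */
    static double[] compute(List<Set<String>> trainingDocs, List<String> vocabulary) {
        double[] expectation = new double[vocabulary.size()];
        for (int i = 0; i < vocabulary.size(); i++) {
            int documentFrequency = 0;
            for (Set<String> doc : trainingDocs) {
                if (doc.contains(vocabulary.get(i))) {
                    documentFrequency++;
                }
            }
            // p~(x,y) = 1/N for each of the N training pairs
            expectation[i] = (double) documentFrequency / trainingDocs.size();
        }
        return expectation;
    }
}
```

The result is computed once before the training loop and reused in every iteration; only the denominator has to be recomputed as the weights change.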
   
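In the testing stage, I predict the probability of each label, compare them, and take the label with the highest probability. The test code comes first, followed by the results.

/**
 * Get the error rate of our model.
 * Remember to run the startUp function first.
 */
public void testModel(){
    int ein = 0;
    for(int d=0;d …

[The rest of this listing is cut off in the source.]

The test data comes from Reuters-21578, using four of its categories: acq, coffee, fuel and housing. Best is the label predicted by the algorithm, i.e. the one with the highest probability. True is the actual label of the test document.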

 
  
   
   
    coefficient vector size:1728
acq[0.3559] coffee[0.3012] fuel[0.2457] housing[0.0972]     Best:acq   True:acq
acq[0.2893] coffee[0.2815] fuel[0.2438] housing[0.1853]     Best:acq   True:acq
acq[0.2704] coffee[0.2562] fuel[0.2565] housing[0.2170]     Best:acq   True:acq
acq[0.3340] coffee[0.2252] fuel[0.2659] housing[0.1749]     Best:acq   True:acq
acq[0.2990] coffee[0.2590] fuel[0.2466] housing[0.1955]     Best:acq   True:acq
acq[0.2904] coffee[0.2416] fuel[0.2544] housing[0.2136]     Best:acq   True:acq
acq[0.2978] coffee[0.2767] fuel[0.2406] housing[0.1848]     Best:acq   True:acq
acq[0.3844] coffee[0.2638] fuel[0.2282] housing[0.1236]     Best:acq   True:acq
acq[0.2855] coffee[0.3156] fuel[0.2547] housing[0.1442]     Best:coffee   True:coffee
acq[0.2486] coffee[0.2780] fuel[0.2744] housing[0.1989]     Best:coffee   True:coffee
acq[0.2407] coffee[0.2871] fuel[0.2596] housing[0.2126]     Best:coffee   True:coffee
acq[0.2788] coffee[0.2859] fuel[0.2580] housing[0.1773]     Best:coffee   True:coffee
acq[0.2518] coffee[0.2501] fuel[0.2518] housing[0.2464]     Best:acq   True:coffee
acq[0.2850] coffee[0.2646] fuel[0.2532] housing[0.1972]     Best:acq   True:coffee
acq[0.2348] coffee[0.2692] fuel[0.2735] housing[0.2224]     Best:fuel   True:coffee
acq[0.2382] coffee[0.2678] fuel[0.2727] housing[0.2214]     Best:fuel   True:coffee
acq[0.2548] coffee[0.2632] fuel[0.2859] housing[0.1962]     Best:fuel   True:fuel
acq[0.2733] coffee[0.2613] fuel[0.2541] housing[0.2113]     Best:acq   True:fuel
acq[0.2844] coffee[0.2530] fuel[0.2727] housing[0.1899]     Best:acq   True:fuel
acq[0.2421] coffee[0.2345] fuel[0.3366] housing[0.1867]     Best:fuel   True:fuel
acq[0.2579] coffee[0.2519] fuel[0.3093] housing[0.1809]     Best:fuel   True:fuel
acq[0.2435] coffee[0.2466] fuel[0.3049] housing[0.2050]     Best:fuel   True:fuel
acq[0.2871] coffee[0.3343] fuel[0.2452] housing[0.1334]     Best:coffee   True:fuel
acq[0.2738] coffee[0.2191] fuel[0.3157] housing[0.1914]     Best:fuel   True:fuel
acq[0.2450] coffee[0.2397] fuel[0.2450] housing[0.2702]     Best:housing   True:housing
acq[0.2202] coffee[0.2054] fuel[0.2335] housing[0.3409]     Best:housing   True:housing
acq[0.2350] coffee[0.2372] fuel[0.2393] housing[0.2885]     Best:housing   True:housing
acq[0.2340] coffee[0.2361] fuel[0.2382] housing[0.2917]     Best:housing   True:housing
acq[0.3156] coffee[0.2897] fuel[0.2219] housing[0.1728]     Best:acq   True:housing
acq[0.2340] coffee[0.2238] fuel[0.2475] housing[0.2947]     Best:housing   True:housing
acq[0.2431] coffee[0.2383] fuel[0.2481] housing[0.2706]     Best:housing   True:housing
acq[0.2525] coffee[0.2460] fuel[0.2737] housing[0.2278]     Best:fuel   True:housing
error rate is:0.28125
accuracy: 71.875%
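The prediction and scoring step behind this listing can be sketched as below. The class `Evaluation` and its methods are illustrative names, not the author's `testModel`. Ties go to the first label, which matches rows such as the one where acq and fuel are both 0.2518 and Best is acq. In the listing, 9 of the 32 predictions differ from the true label, which gives the reported error rate of 9/32 = 0.28125.

```java
/** Hypothetical helper: picks the most probable label and computes the error rate. */
final class Evaluation {

    /** Index of the largest probability; on ties the first index wins. */
    static int argMax(double[] p) {
        int best = 0;
        for (int y = 1; y < p.length; y++) {
            if (p[y] > p[best]) {
                best = y;
            }
        }
        return best;
    }

    /** Fraction of documents whose predicted label differs from the true label. */
    static double errorRate(double[][] probabilities, int[] trueLabel) {
        int errors = 0;
        for (int d = 0; d < probabilities.length; d++) {
            if (argMax(probabilities[d]) != trueLabel[d]) {
                errors++;
            }
        }
        return (double) errors / probabilities.length;
    }
}
```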

    
   
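I also ran another experiment and found that binary classification gives better results. That is, first predict

acq vs. not acq

coffee vs. not coffee

fuel vs. not fuel

housing vs. not housing

and then combine these four results. The accuracy obtained this way is higher.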
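The article does not say how the four binary results are combined. One common choice, shown here purely as an assumption, is to take the label whose binary model assigns the highest probability to its positive class. The class `OneVsRestCombiner` and the use of `ToDoubleFunction` as a stand-in for a trained binary model are hypothetical.

```java
import java.util.List;
import java.util.function.ToDoubleFunction;

/** Hypothetical combiner for "label k vs. not label k" binary models. */
final class OneVsRestCombiner {

    /**
     * @param positiveScorers positiveScorers.get(k) returns the k-th binary
     *                        model's probability that the document has label k
     * @param document        the document to classify
     * @return index of the label with the highest positive-class probability
     */
    static <D> int predict(List<ToDoubleFunction<D>> positiveScorers, D document) {
        int best = 0;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (int k = 0; k < positiveScorers.size(); k++) {
            double score = positiveScorers.get(k).applyAsDouble(document);
            if (score > bestScore) {
                bestScore = score;
                best = k;
            }
        }
        return best;
    }
}
```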
    
   
 
   
  
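Finally, I also compared the IIS model with the GIS model that is already implemented in maxent. GIS does converge a little more slowly than IIS, but each GIS iteration runs faster. I suspect this is because my implementation is not well written and is therefore slow. If you are interested, try the maxent package of OpenNLP.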
   
  
 

 
 
 
Code and data download

I could not find a place to upload attachments, so I have put everything on Baidu Cloud for download. The link is below.
 

 http://pan.baidu.com/s/1bnf7BZL
 
 
 

 
 
 

Thank you for reading. If you have any questions, I am happy to discuss them.