4-23


Went to the book building tonight; got caught in heavy rain on the way back. Bought a copy of "jQuery in Action", because I had heard the js library is good, and the book seems to be well reviewed too. Then I spotted "Visualizing Data". My boss has me doing some visualization work, which is rather annoying. Still, visualization really is important: the interface is the first thing a user sees. It is also something of an art; since data are high-dimensional, showing as much information as possible in two or three dimensions can itself be viewed as a kind of dimensionality reduction.

 

Today I want to run LDA on the ACL data. It has been a long time since I looked at anything on Topic Models, and I can barely read my old code anymore, so let me write it down while I still remember.

 

The core is really just one function, Sampling. Its parameters and working variables:

ALPHA  hyperparameter of the document-topic distribution; usually 50/T

BETA  hyperparameter of the topic-word distribution; usually .01

W  number of distinct tokens in the corpus; preprocessing may ignore case and drop stop words and low-frequency words

T  number of topics

D  number of documents

NN  number of sampling iterations

OUTPUT  controls progress output

n  total number of tokens

z  topic of each token; length n; used during sampling

d  d[0...n-1]; d[i] is the document the i-th token belongs to

w  w[0...n-1]; w[i] is the vocabulary index of the i-th token (0 --> W-1)

wp [W*T]  the word-topic count matrix, laid out row-major:

                     topic0 topic1 ... topic T-1
               w0
               w1
                ...
               w W-1

 wp[i,j] (stored at wp[i*T + j]) is the number of occurrences of word wi currently assigned to topic j

 dp [D*T]  the document-topic count matrix, laid out row-major:

                   topic0 topic1 .... topic T-1
               d0
               d1
                ...
               d D-1

 dp[i,j] (stored at dp[i*T + j]) is the number of tokens in document i currently assigned to topic j

 ztot[T]  ztot[i] is the number of tokens whose topic is topic i

 order  the random visiting order for Gibbs Sampling

 probs[T]  scratch array holding the unnormalized probability of each topic when sampling a new assignment (see the small indexing sketch below)
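To make the flattened layout concrete, here is a minimal sketch of my own (the toy corpus and the class name are invented for illustration) showing how d, w, wp and dp line up:

    public class LdaLayoutDemo {
        public static void main(String[] args) {
            // Toy corpus: doc 0 = "apple banana apple", doc 1 = "banana cherry"
            // Vocabulary: 0=apple, 1=banana, 2=cherry  =>  W=3; take T=2 topics.
            int W = 3, T = 2, D = 2, n = 5;
            int[] d = {0, 0, 0, 1, 1}; // d[i]: document of token i
            int[] w = {0, 1, 0, 1, 2}; // w[i]: vocabulary index of token i
            int[] z = {0, 1, 0, 1, 1}; // z[i]: current topic of token i

            int[] wp = new int[W * T];   // word-topic counts, row-major
            int[] dp = new int[D * T];   // document-topic counts, row-major
            int[] ztot = new int[T];     // tokens per topic
            for (int i = 0; i < n; i++) {
                wp[w[i] * T + z[i]]++;   // wp[wi, t] is stored at wp[wi*T + t]
                dp[d[i] * T + z[i]]++;   // dp[di, t] is stored at dp[di*T + t]
                ztot[z[i]]++;
            }
            // wp[0*T + 0] == 2: both "apple" tokens currently sit in topic 0.
            System.out.println("wp[apple, topic0] = " + wp[0 * T + 0]);
        }
    }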

 

The sampling formula is Eq. 79 of "Parameter estimation for text analysis" (read as a proportionality rather than an equality), or equivalently the one in "Probabilistic Topic Models":

$$p(z_i = t \mid \mathbf{z}_{-i}, \mathbf{w}) \;\propto\; \frac{C(w_i, t) + \beta}{\sum_{w} C(w, t) + W\beta} \cdot \frac{C(d_i, t) + \alpha}{\sum_{t'} C(d_i, t') + T\alpha}$$

where every count C excludes the current token i. The denominator of the second factor equals (N_{d_i} - 1) + T*alpha for every candidate topic t, so it is constant across topics and need not be computed at all.

(CSDN's blog editor is too clunky to typeset formulas in.)
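Rewritten in terms of the count arrays defined above (this restatement is mine, but it matches the loop in the code below exactly), the weight computed for each candidate topic j is:

$$\mathrm{probs}[j] = \frac{wp[w_i,\, j] + \beta}{ztot[j] + W\beta}\,\bigl(dp[d_i,\, j] + \alpha\bigr)$$

with all counts taken after the current token's own assignment has been removed; the constant document-length denominator has been dropped.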

The code below performs the sampling according to p(zi | z-i, w).

What does that mean?

Say with 3 topics the computed values of p(zi | z-i, w) are .3, .8 and .9. Note these are not probabilities, because they have not been normalized.

Then totprob is .3 + .8 + .9 = 2.0.

r is a random number in [0, 2.0): if r falls in [0, .3), the first topic is taken;

if r falls in [.3, 1.1), the second;

if r falls in [1.1, 2.0), the third.

That implements the sampling.

                 for (j = 0; j < T; j++) {
                    probs[j] = ((double) wp[wioffset + j] + (double) BETA) /
                               ((double) ztot[j] + (double) WBETA) *
                               ((double) dp[dioffset + j] + (double) ALPHA);
                    totprob += probs[j];
                }

                r = (double) totprob * rnd.nextDouble();
                max = probs[0];
                topic = 0;
                while (r > max) {
                    topic++;
                    max += probs[topic];
                }

                z[i] = topic; // assign current word token i to the sampled topic
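As a standalone illustration (a minimal sketch of my own, not part of the original program; class and method names are hypothetical), the same cumulative draw can be packaged and checked against the .3/.8/.9 example above:

    import java.util.Random;

    public class RouletteDemo {
        // Draw an index with probability proportional to the unnormalized weights.
        static int sample(double[] probs, Random rnd) {
            double totprob = 0;
            for (double p : probs) totprob += p;
            double r = totprob * rnd.nextDouble(); // uniform in [0, totprob)
            double max = probs[0];
            int topic = 0;
            while (r > max) {   // walk the cumulative sums until r is covered
                topic++;
                max += probs[topic];
            }
            return topic;
        }

        public static void main(String[] args) {
            double[] probs = {0.3, 0.8, 0.9};
            int[] hits = new int[3];
            Random rnd = new Random(0);
            for (int s = 0; s < 100000; s++) hits[sample(probs, rnd)]++;
            // Expect frequencies near .15, .40, .45 (= .3/2.0, .8/2.0, .9/2.0).
            for (int t = 0; t < 3; t++)
                System.out.printf("topic %d: %.3f%n", t, hits[t] / 100000.0);
        }
    }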

          

    /**
     * Gibbs Sampling for LDA model
     * @param ALPHA double hyperparameter on the Dirichlet prior for the document-topic distributions (theta)
     * @param BETA double hyperparameter on the Dirichlet prior for the topic-word distributions (phi). Appropriate values depend on the number of topics and the vocabulary size; for most applications good results are obtained with ALPHA = 50 / T and BETA = 200 / W
     * @param W int vocabulary size
     * @param T int the number of topics
     * @param D int number of documents (in fact not used in the body)
     * @param NN int the number of iterations the Gibbs sampler runs
     * @param OUTPUT int screen output level: 0 = no output, 1 = iteration number only, 2 = all output
     * @param n int the number of tokens in the whole corpus
     * @param z int[] topic assignment per token; in/out (input when startcond == 1)
     * @param d int[] document index per token; see also z
     * @param w int[] word index per token; see also z
     * @param wp int[] word-topic count matrix; wp[w*T+t] is the number of tokens of the wth vocabulary entry assigned to topic t
     * @param dp int[] document-topic count matrix; dp[d*T+t] is the number of tokens in document d assigned to topic t
     * @param ztot int[] ztot[t] = the number of tokens assigned to topic t
     * @param order int[] scratch array for the random order update sequence
     * @param probs double[] scratch array of length T for the unnormalized topic probabilities of the current token
     * @param startcond int if startcond == 1, z, d and w must hold the initial state of the variables (resume from a saved state)
     * @param saveInterval int save intermediate results every saveInterval iterations
     * @param outDir String directory that SaveParameters writes results to
     */
    public static void Sampling(double ALPHA, double BETA, int W, int T, int D,
                                int NN, int OUTPUT, int n, int[] z, int[] d,
                                int[] w, int[] wp, int[] dp, int[] ztot,
                                int[] order, double[] probs, int startcond,int saveInterval,String outDir) {
        int wi, di, i, ii, j, topic, rp, temp, iter, wioffset, dioffset;
        double totprob, WBETA, r, max;
        Random rnd = new Random(System.currentTimeMillis());
        if (startcond == 1) {
            /* start from previously saved state */
            for (i = 0; i < n; i++) {
                wi = w[i];
                di = d[i];
                topic = z[i];
                wp[wi * T + topic]++; // increment wp count matrix
                dp[di * T + topic]++; // increment dp count matrix
                ztot[topic]++; // increment ztot matrix
            }
        }

        if (startcond == 0) {
            /* random initialization */
            if (OUTPUT==2) System.out.print( "Starting Random initialization\n" );
            for (i = 0; i < n; i++) {
                wi = w[i];
                di = d[i];
                // pick a random topic 0..T-1
                //topic = (int) ( (double) randomMT() * (double) T / (double) (4294967296.0 + 1.0) );
                topic = rnd.nextInt(T);
                z[i] = topic; // assign this word token to this topic
                wp[wi * T + topic]++; // increment wp count matrix
                dp[di * T + topic]++; // increment dp count matrix
                ztot[topic]++; // increment ztot matrix
            }
        }

        if (OUTPUT==2) System.out.print( "Determining random order update sequence\n" );

        for (i = 0; i < n; i++)
            order[i] = i; // fill with increasing series
        for (i = 0; i < (n - 1); i++) {
            // pick a random integer between i and n-1
            //rp = i + (int) ((double) (n-i) * (double) randomMT() / (double) (4294967296.0 + 1.0));
            rp = i + rnd.nextInt(n - i);
            // switch contents on position i and position rp
            temp = order[rp];
            order[rp] = order[i];
            order[i] = temp;
        }

        //for (i=0; i<n; i++) mexPrintf( "i=%3d order[i]=%3d/n" , i , order[ i ] );
        WBETA = (double) (W * BETA);
        for (iter = 0; iter < NN; iter++) {
            if (OUTPUT >= 1) {
                if ((iter % 10)==0) System.out.print( new java.util.Date()+"\tIteration "+iter+" of "+NN+"\n");
                //if ((iter % 10)==0) mexEvalString("drawnow;");
            }
            for (ii = 0; ii < n; ii++) {
                i = order[ii]; // current word token to assess

                wi = w[i]; // current word index
                di = d[i]; // current document index
                topic = z[i]; // current topic assignment to word token
                ztot[topic]--; // subtract this from counts

                wioffset = wi * T;
                dioffset = di * T;

                wp[wioffset + topic]--;
                dp[dioffset + topic]--;

                //mexPrintf( "(1) Working on ii=%d i=%d wi=%d di=%d topic=%d wp=%d dp=%d/n" , ii , i , wi , di , topic , wp[wi+topic*W] , dp[wi+topic*D] );

                totprob = (double) 0;
                for (j = 0; j < T; j++) {
                    probs[j] = ((double) wp[wioffset + j] + (double) BETA) /
                               ((double) ztot[j] + (double) WBETA) *
                               ((double) dp[dioffset + j] + (double) ALPHA);
                    totprob += probs[j];
                }

                // sample a topic from the distribution
                //r = (double) totprob * (double) randomMT() / (double) 4294967296.0;
                r = (double) totprob * rnd.nextDouble();
                max = probs[0];
                topic = 0;
                while (r > max) {
                    topic++;
                    max += probs[topic];
                }

                z[i] = topic; // assign current word token i to the sampled topic
                wp[wioffset + topic]++; // and update counts
                dp[dioffset + topic]++;
                ztot[topic]++;

                //mexPrintf( "(2) Working on ii=%d i=%d wi=%d di=%d topic=%d wp=%d dp=%d/n" , ii , i , wi , di , topic , wp[wi+topic*W] , dp[wi+topic*D] );
            }
            if ((iter + 1) % saveInterval == 0) {
                // save intermediate state; the save itself should not depend on OUTPUT
                if (OUTPUT >= 1)
                    System.out.println("Saving result of the " + (iter + 1) + "th iteration.");
                SaveParameters(NN, z, wp, dp, outDir);
            }
        }
        if(NN%saveInterval!=0){
            SaveParameters(NN,z,wp,dp,outDir);
        }     
       
       
    }
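For completeness, here is a minimal driver sketch, assuming it lives in the same class as Sampling (the toy corpus, parameter values and the "out" directory are mine; SaveParameters is the author's helper and must exist for this to compile). After sampling, a point estimate of the topic-word distribution phi follows from the counts:

    public static void main(String[] args) {
        // Toy corpus as a flat token stream (see the variable glossary above).
        int[] d = {0, 0, 0, 1, 1};   // token -> document
        int[] w = {0, 1, 0, 1, 2};   // token -> word id (0..W-1)
        int W = 3, T = 2, D = 2, n = w.length;
        int NN = 200, OUTPUT = 1, saveInterval = 100;
        double ALPHA = 50.0 / T, BETA = 0.01;

        int[] z = new int[n];
        int[] wp = new int[W * T], dp = new int[D * T], ztot = new int[T];
        int[] order = new int[n];
        double[] probs = new double[T];

        Sampling(ALPHA, BETA, W, T, D, NN, OUTPUT, n, z, d, w,
                 wp, dp, ztot, order, probs, 0, saveInterval, "out");

        // Point estimate of phi from the final counts:
        // phi[wi][t] = (wp[wi*T+t] + BETA) / (ztot[t] + W*BETA)
        double[][] phi = new double[W][T];
        for (int wi = 0; wi < W; wi++)
            for (int t = 0; t < T; t++)
                phi[wi][t] = (wp[wi * T + t] + BETA) / (ztot[t] + W * BETA);
        System.out.println("phi[0][0] = " + phi[0][0]);
    }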
