Went to the book mall this evening and got caught in heavy rain on the way back. Bought a copy of "jQuery in Action", since I'd heard the js library is good and the book seems well reviewed too. Then I spotted "Visualizing Data". My boss has me doing some visualization work, which is a hassle. Still, visualization really does matter: the interface is the first thing a user sees. It is also something of an art. Since data are high-dimensional, showing as much information as possible in two or three dimensions can itself be viewed as dimensionality reduction.
Today I wanted to run LDA on the ACL data. It has been a long time since I touched Topic Models, and I can barely read my old code anymore, so let me write it down while I still remember.
The whole thing really comes down to one function, Sampling. Its parameters and working arrays are:
ALPHA: hyperparameter of the document-topic distribution; usually 50/T
BETA: hyperparameter of the topic-word distribution; usually 0.01
W: number of distinct tokens in the corpus (preprocessing can lowercase everything and drop stopwords and low-frequency words)
T: number of topics
D: number of documents
NN: number of sampling iterations
OUTPUT: verbosity, for watching progress
n: total number of tokens
z: topic of each token; length n, updated during sampling
d: d[0...n-1], d[i] is the document token i belongs to
w: w[0...n-1], w[i] is the vocabulary index of token i (0 to W-1)
wp: int[W*T], a W x T count matrix stored row by row (rows w0...w(W-1), columns topic0...topic(T-1)); wp[i*T + j] is the number of tokens of word wi currently assigned to topic j
dp: int[D*T], a D x T count matrix in the same layout (rows d0...d(D-1)); dp[i*T + j] is the number of tokens in document i currently assigned to topic j
ztot: ztot[T], ztot[i] is the number of tokens currently assigned to topic i
order: the random visiting order for Gibbs sampling
probs: probs[T], scratch array used to pick the topic during sampling
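To make the flattened layout concrete, here is a tiny sketch (the helper name incrementCounts is mine, not from the code) of how one token's assignment touches all three count structures; the real Sampling function below inlines exactly these three increments:

// wp and dp are flattened row-major: row index * T + column index.
static void incrementCounts(int wi, int di, int topic, int T,
                            int[] wp, int[] dp, int[] ztot) {
    wp[wi * T + topic]++;  // one more token of word wi under this topic
    dp[di * T + topic]++;  // one more token of this topic in document di
    ztot[topic]++;         // one more token under this topic overall
}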
The sampling formula is Eq. 79 of "Parameter estimation for text analysis" (except it is a proportionality, not an equality); it also appears in "Probabilistic Topic Models":

p(z_i = t \mid z_{-i}, w) \propto \frac{C_{w_i t} + \beta}{\sum_{w} C_{w t} + W\beta} \cdot \frac{C_{d_i t} + \alpha}{\sum_{t'} C_{d_i t'} + T\alpha}

The denominator of the second factor is the same for every candidate topic t, so it never needs to be computed at all. (CSDN's blog editor is too lousy to typeset formulas.)
The code below samples z_i according to p(z_i | z_{-i}, w). What does that mean? Say with 3 topics the computed values are 0.3, 0.8 and 0.9. Note these are not probabilities, because they have not been normalized. Then totprob is 0.3 + 0.8 + 0.9 = 2.0, and r is a random number in [0, 2.0): if r falls in [0, 0.3) we take the first topic, in [0.3, 1.1) the second, and in [1.1, 2.0) the third. The following lines implement this sampling:
for (j = 0; j < T; j++) {
    // unnormalized p(z_i = j | z_-i, w); the constant denominator of the second factor is dropped
    probs[j] = ((double) wp[wioffset + j] + (double) BETA) /
               ((double) ztot[j] + (double) WBETA) *
               ((double) dp[dioffset + j] + (double) ALPHA);
    totprob += probs[j];
}
r = (double) totprob * rnd.nextDouble(); // r is uniform in [0, totprob)
max = probs[0]; // running cumulative sum of probs
topic = 0;
while (r > max) { // walk the cumulative sums until r falls inside a topic's slice
    topic++;
    max += probs[topic];
}
z[i] = topic; // assign current word token i to the sampled topic
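As a sanity check, here is a small self-contained demo (class and variable names are mine) that runs this same cumulative-sum walk a million times on the 0.3/0.8/0.9 example; the empirical frequencies should come out near 0.15, 0.40 and 0.45:

import java.util.Random;

public class RouletteDemo {
    public static void main(String[] args) {
        double[] probs = {0.3, 0.8, 0.9}; // unnormalized weights from the example
        double totprob = 0.3 + 0.8 + 0.9; // = 2.0
        Random rnd = new Random();
        int samples = 1000000;
        int[] counts = new int[probs.length];
        for (int s = 0; s < samples; s++) {
            double r = totprob * rnd.nextDouble(); // uniform in [0, 2.0)
            int topic = 0;
            double max = probs[0];
            while (r > max) { // same walk as in Sampling
                topic++;
                max += probs[topic];
            }
            counts[topic]++;
        }
        for (int t = 0; t < probs.length; t++) {
            System.out.printf("topic %d: %.3f%n", t, (double) counts[t] / samples);
        }
    }
}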
/**
* Gibbs Sampling for LDA model
* @param ALPHA double ALPHA and BETA are the hyperparameters on the Dirichlet priors for the topic distributions (theta) and the topic-word distributions (phi) respectively
* @param BETA double Appropriate values for ALPHA and BETA depend on the number of topics and the number of words in vocabulary. For most applications, good results can be obtained by setting ALPHA = 50 / T and BETA = 200 / W
* @param W int vocabulary size
* @param T int the number of topics
* @param D int number of documents (in fact not used in this method)
* @param NN int determines the number of iterations to run the Gibbs sampler
* @param OUTPUT int determines the screen output by the sampler: 0 = no output, 1 = show the iteration number only, 2 = show all output
* @param n int the number of words in the whole corpus
* @param z int[] topic assignment of each token (output; also input when startcond == 1)
* @param d int[] document index of each token (input; see z)
* @param w int[] vocabulary index of each token (input; see z)
* @param wp int[] is the word-topic count matrix, where C[w,t] is the number of words from the wth entry in the vocabulary assigned to topic t
* @param dp int[] C[D*T] represents the topic-document count matrix, where C[d,t] is the number of words assigned to topic t in document d
* @param ztot int[] ztot[t]=the number of words assigned to topic t
* @param order int[] scratch array of length n holding the random token update order
* @param probs double[] scratch array of length T for the unnormalized topic probabilities of the current token
* @param startcond int if startcond == 1, z, d and w must hold a previously saved state to resume from; if 0, topics are initialized randomly
* @param saveInterval int intermediate results are saved every saveInterval iterations
* @param outDir String directory the intermediate results are written to
*/
public static void Sampling(double ALPHA, double BETA, int W, int T, int D,
int NN, int OUTPUT, int n, int[] z, int[] d,
int[] w, int[] wp, int[] dp, int[] ztot,
int[] order, double[] probs, int startcond,int saveInterval,String outDir) {
int wi, di, i, ii, j, topic, rp, temp, iter, wioffset, dioffset;
double totprob, WBETA, r, max;
Random rnd = new Random(System.currentTimeMillis());
if (startcond == 1) {
/* start from previously saved state */
for (i = 0; i < n; i++) {
wi = w[i];
di = d[i];
topic = z[i];
wp[wi * T + topic]++; // increment wp count matrix
dp[di * T + topic]++; // increment dp count matrix
ztot[topic]++; // increment ztot matrix
}
}
if (startcond == 0) {
/* random initialization */
if (OUTPUT==2) System.out.print( "Starting Random initialization\n" );
for (i = 0; i < n; i++) {
wi = w[i];
di = d[i];
// pick a random topic 0..T-1
//topic = (int) ( (double) randomMT() * (double) T / (double) (4294967296.0 + 1.0) );
topic = rnd.nextInt(T);
z[i] = topic; // assign this word token to this topic
wp[wi * T + topic]++; // increment wp count matrix
dp[di * T + topic]++; // increment dp count matrix
ztot[topic]++; // increment ztot matrix
}
}
if (OUTPUT==2) System.out.print( "Determining random order update sequence\n" );
for (i = 0; i < n; i++)
order[i] = i; // fill with increasing series
for (i = 0; i < (n - 1); i++) {
// pick a random integer between i and n-1
//rp = i + (int) ((double) (n-i) * (double) randomMT() / (double) (4294967296.0 + 1.0));
rp = i + rnd.nextInt(n - i);
// switch contents on position i and position rp
temp = order[rp];
order[rp] = order[i];
order[i] = temp;
}
//for (i=0; i<n; i++) mexPrintf( "i=%3d order[i]=%3d/n" , i , order[ i ] );
WBETA = (double) (W * BETA);
for (iter = 0; iter < NN; iter++) {
if (OUTPUT >= 1) {
if ((iter % 10)==0) System.out.print( new java.util.Date()+"\tIteration "+iter+" of "+NN+"\n");
//if ((iter % 10)==0) mexEvalString("drawnow;");
}
for (ii = 0; ii < n; ii++) {
i = order[ii]; // current word token to assess
wi = w[i]; // current word index
di = d[i]; // current document index
topic = z[i]; // current topic assignment to word token
ztot[topic]--; // subtract this from counts
wioffset = wi * T;
dioffset = di * T;
wp[wioffset + topic]--;
dp[dioffset + topic]--;
//mexPrintf( "(1) Working on ii=%d i=%d wi=%d di=%d topic=%d wp=%d dp=%d/n" , ii , i , wi , di , topic , wp[wi+topic*W] , dp[wi+topic*D] );
totprob = (double) 0;
for (j = 0; j < T; j++) {
    // unnormalized p(z_i = j | z_-i, w)
    probs[j] = ((double) wp[wioffset + j] + (double) BETA) /
               ((double) ztot[j] + (double) WBETA) *
               ((double) dp[dioffset + j] + (double) ALPHA);
    totprob += probs[j];
}
// sample a topic from the distribution
//r = (double) totprob * (double) randomMT() / (double) 4294967296.0;
r = (double) totprob * rnd.nextDouble(); // r is uniform in [0, totprob)
max = probs[0]; // running cumulative sum of probs
topic = 0;
while (r > max) { // walk the cumulative sums until r falls inside a topic's slice
    topic++;
    max += probs[topic];
}
z[i] = topic; // assign current word token i to the sampled topic
wp[wioffset + topic]++; // and update counts
dp[dioffset + topic]++;
ztot[topic]++;
//mexPrintf( "(2) Working on ii=%d i=%d wi=%d di=%d topic=%d wp=%d dp=%d/n" , ii , i , wi , di , topic , wp[wi+topic*W] , dp[wi+topic*D] );
}
if((iter+1)%saveInterval==0){
    // save intermediate state
    if(OUTPUT>=1){
        System.out.println("Saving result of the "+(iter+1)+"th iteration.");
    }
    SaveParameters(NN,z,wp,dp,outDir);
}
}
if(NN%saveInterval!=0){
SaveParameters(NN,z,wp,dp,outDir);
}
}
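For reference, a minimal sketch of how I would call Sampling on a made-up toy corpus from inside the same class; the array sizes follow the definitions above, and the NN/saveInterval/outDir values are arbitrary. The phi/theta read-out at the end is the standard smoothed estimate, not something Sampling itself computes:

int W = 4, T = 2, D = 2;           // toy sizes
int n = 7;                         // total tokens
int[] w = {0, 1, 2, 0, 2, 3, 3};   // vocabulary index of each token
int[] d = {0, 0, 0, 0, 1, 1, 1};   // document index of each token
int[] z = new int[n];              // filled in by the sampler
int[] wp = new int[W * T];         // word-topic counts
int[] dp = new int[D * T];         // document-topic counts
int[] ztot = new int[T];           // tokens per topic
int[] order = new int[n];          // random update order
double[] probs = new double[T];    // scratch array
double ALPHA = 50.0 / T, BETA = 0.01;
Sampling(ALPHA, BETA, W, T, D, 200 /*NN*/, 1 /*OUTPUT*/, n,
         z, d, w, wp, dp, ztot, order, probs,
         0 /*startcond*/, 50 /*saveInterval*/, "out");
// Standard smoothed estimates from the final counts:
// phi[t][v]    = (wp[v * T + t] + BETA)   / (ztot[t] + W * BETA)
// theta[dd][t] = (dp[dd * T + t] + ALPHA) / (tokens in doc dd + T * ALPHA)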