Consider two sparse vectors, V1 and V2, written as index:value pairs:
V1 = {1:3, 2:2, 3:1, 5:0}, V2 = {1:3, 3:1, 4:2, 5:0}
In table form (positions not listed for a vector are 0):
Position   1   2   3   4   5
V1         3   2   1   0   0
V2         3   0   1   2   0
Substituting V1 and V2 into the correlation coefficient formula and expanding gives:
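Read off from the implementation further down in this post, the expanded form can be written as follows. The symbols x_i and y_i (the values of V1 and V2 at position i) and the bars for the means are notation chosen here for illustration; n is the count of positions described in the next paragraph.

```latex
r = \frac{n \sum_i x_i y_i \;-\; n^2\, \bar{x}\, \bar{y}}
         {\sqrt{n \sum_i x_i^2 - n^2\, \bar{x}^2}\;\sqrt{n \sum_i y_i^2 - n^2\, \bar{y}^2}},
\qquad
\bar{x} = \frac{1}{n} \sum_i x_i, \quad
\bar{y} = \frac{1}{n} \sum_i y_i
```

Since n times the mean equals the sum of the values, this is the usual computational form of the Pearson correlation coefficient, with n restricted to the positions where at least one vector is non-zero.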
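n (n = 4): The table shows that both V1 and V2 are 0 at position 5, so that position can be ignored. V1 is 0 at position 4, but V2 has a value there, so position 4 cannot be ignored. The same holds for position 2, where V2 is 0 but V1 is not. Therefore n = 4.
Mean of V1: (3+2+1)/4 = 1.5 (sum the non-zero values of V1, then divide by n)
Mean of V2: (3+1+2)/4 = 1.5
V1*V2: 3*3 + 2*0 + 1*1 + 0*2 = 10 (multiply the values at corresponding positions, then sum the products)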
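As a quick check of the arithmetic above, here is a minimal sketch that uses plain arrays instead of Mahout. The class name PearsonSketch and its method are hypothetical and not part of the original code, and the sketch assumes both arrays have the same length.

```java
// Dependency-free sketch; PearsonSketch is a hypothetical name used only for illustration.
class PearsonSketch {

    // Correlation where n counts the positions at which at least one vector is non-zero.
    // Assumes a and b have the same length.
    static double correlation(double[] a, double[] b) {
        int n = 0;
        double sumA = 0, sumB = 0, dot = 0, squareSumA = 0, squareSumB = 0;
        for (int i = 0; i < a.length; i++) {
            if (a[i] != 0 || b[i] != 0) {
                n++; // positions where both values are 0 are ignored
            }
            sumA += a[i];
            sumB += b[i];
            dot += a[i] * b[i];
            squareSumA += a[i] * a[i];
            squareSumB += b[i] * b[i];
        }
        double meanA = sumA / n;
        double meanB = sumB / n;
        double numerator = n * dot - (double) n * n * meanA * meanB;
        double denominator = Math.sqrt(n * squareSumA - (double) n * n * meanA * meanA)
                * Math.sqrt(n * squareSumB - (double) n * n * meanB * meanB);
        return numerator / denominator;
    }

    public static void main(String[] args) {
        // Positions 1..5 of V1 and V2 from the table
        double[] v1 = {3, 2, 1, 0, 0};
        double[] v2 = {3, 0, 1, 2, 0};
        System.out.println(correlation(v1, v2));
    }
}
```

For the example vectors, n = 4, both means are 1.5, the dot product is 10, and both sums of squares are 14, so the numerator is 4*10 - 16*2.25 = 4 and the denominator is sqrt(20)*sqrt(20) = 20, giving a coefficient of about 0.2.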
Implementation:
package fuse.hang;

import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.Vector.Element;

public class Correlation {

    public static void main(String[] args) {
        // Create vectors V1 and V2
        Vector v1 = new RandomAccessSparseVector(1000);
        v1.set(1, 3);
        v1.set(2, 2);
        v1.set(3, 1);
        v1.set(4, 0);
        v1.set(5, 0);
        Vector v2 = new RandomAccessSparseVector(1000);
        v2.set(1, 3);
        v2.set(2, 0);
        v2.set(3, 1);
        v2.set(4, 2);
        v2.set(5, 0);
        correlation(v1, v2);
    }

    public static void correlation(Vector v1, Vector v2) {
        if (v1 == null || v2 == null) return;
        // Sum of the products of values at corresponding positions
        double dot = v1.dot(v2);
        System.out.println("dot : " + dot);
        double commonCount = 0;
        double v1SquareSum = 0;
        double v2SquareSum = 0;
        for (Element e : v1.nonZeroes()) {
            v1SquareSum += e.get() * e.get();
            // Count positions where both vectors are non-zero
            // (!= 0 rather than > 0, so negative values are counted too)
            if (v2.get(e.index()) != 0) {
                commonCount++;
            }
        }
        for (Element e : v2.nonZeroes()) {
            v2SquareSum += e.get() * e.get();
        }
        // n: positions where at least one vector is non-zero
        double sumCount = v1.getNumNonZeroElements() + v2.getNumNonZeroElements() - commonCount;
        System.out.println("sumCount: " + sumCount);
        double averageV1 = v1.zSum() / sumCount;
        System.out.println("averageV1: " + averageV1);
        double averageV2 = v2.zSum() / sumCount;
        System.out.println("averageV2: " + averageV2);
        System.out.println("v1SquareSum: " + v1SquareSum);
        System.out.println("v2SquareSum: " + v2SquareSum);
        double numerator = sumCount * dot - sumCount * sumCount * averageV1 * averageV2;
        double denominator =
                Math.sqrt(sumCount * v1SquareSum - sumCount * sumCount * averageV1 * averageV1)
                * Math.sqrt(sumCount * v2SquareSum - sumCount * sumCount * averageV2 * averageV2);
        System.out.println("correlation coefficient: " + numerator / denominator);
    }
}