In earlier experiments I used the chi-square statistic to extract the most informative words from the corpus, varying the count between 1000, 2000, and 3000.
I then found that most algorithms do need a certain number of features: within a certain range (say 1000~3000), the more words extracted, the better the classifier performs, until performance peaks and then starts to decline.
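For reference, this chi-square ranking step can be reproduced with scikit-learn. The snippet below is only a minimal sketch over a toy corpus; the documents, labels, and the value of k are illustrative assumptions, not the actual experimental pipeline.
////////////////////////////////////////////////////////////////////////////////////////////////////
# Sketch: chi-square unigram ranking with scikit-learn (illustrative only).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

# Toy corpus; the real experiments ran over the full review corpus
docs = [
    "the plot was gripping and the acting superb",
    "a dull predictable film with wooden acting",
    "wonderful performances loved every minute",
    "boring script hated the ending",
]
labels = [1, 0, 1, 0]  # 1 = pos, 0 = neg

vectorizer = CountVectorizer()      # bag-of-words unigram counts
X = vectorizer.fit_transform(docs)

# Score each unigram against the class labels and keep the top k
selector = SelectKBest(chi2, k=5)   # k was 1000/2000/3000 in the experiments
selector.fit(X, labels)

vocab = vectorizer.get_feature_names_out()
best_words = [vocab[i] for i in selector.get_support(indices=True)]
print(best_words)
////////////////////////////////////////////////////////////////////////////////////////////////////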
Only the KNN algorithm behaved differently: its performance tended to drop as the number of features grew. Since its performance was not good to begin with, I did not give it much more attention.
Later, wanting to improve performance, I decided to start with the existing features. My supervisor suggested reviewing the extracted unigrams to see what was actually in there. A quick scan showed that many of them were nouns with no emotional color at all; such words seemed useless for sentiment analysis and might even be harmful. So I decided to try purifying the features, and my supervisor recommended using a sentiment lexicon as a reference.
The filter is simple: if an extracted feature appears in the sentiment lexicon, keep it; otherwise, discard it.
////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
# Load the extracted features (extract more than you ultimately need here,
# e.g. the top 8000 best_words)
with open("my_features.txt", "r") as inpfile:
    my_features = [line.strip() for line in inpfile]

# Load the sentiment lexicon; a set makes the membership test fast
with open("lexicon.txt", "r") as inpfile:
    lexicon = set(line.strip() for line in inpfile)

# Keep only the features that also appear in the lexicon
shared_words = [w for w in my_features if w in lexicon]

# Write the result to a file (1662 shared features in this example)
with open("shared_words.txt", "w") as f1:
    for item in shared_words:
        f1.write(item + "\n")
////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
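To get from this shared word list to something Weka can load, each document still has to be encoded over those features. Below is a minimal sketch that writes a boolean term-presence ARFF file; the tokenization, file names, and encoding are assumptions for illustration, not necessarily the exact setup used here.
////////////////////////////////////////////////////////////////////////////////////////////////////
# Sketch: write a Weka ARFF file with boolean term-presence features.
# Illustrative only; a count- or tf-idf-based encoding would also work.
def write_arff(docs, labels, features, path):
    with open(path, "w") as f:
        f.write("@relation sentiment\n\n")
        for w in features:
            f.write("@attribute '%s' {0,1}\n" % w)
        f.write("@attribute class {pos,neg}\n\n@data\n")
        for doc, label in zip(docs, labels):
            tokens = set(doc.lower().split())  # naive whitespace tokenization
            row = ["1" if w in tokens else "0" for w in features]
            f.write(",".join(row + [label]) + "\n")

# Hypothetical usage:
# features = [line.strip() for line in open("shared_words.txt")]
# write_arff(train_docs, train_labels, features, "train.arff")
////////////////////////////////////////////////////////////////////////////////////////////////////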
After this filtering I fed the data back into Weka, and the results are quite interesting.
Top 1662 features

Classifiers                            | Cross-validation (10)
NaiveBayesMultinomial                  | 81.21%
NaiveBayes                             | 67.80%
IBk (KNN)                              | 65.92%
J48 (Decision Tree)                    | 64.34%
Logistic                               | 79.71%
SMO (SVM)                              | 79.39%
MultilayerPerceptron (Neural Network)  |
Voting (NB+NBM+J48)                    |
Voting (NBM+Logistic+SMO)              |
1662 features shared with a 6800-word sentiment dict

Classifiers                            | Cross-validation (10)
NaiveBayesMultinomial                  | 71.82%
NaiveBayes                             | 70.06%
IBk (KNN)                              | 68.31%
J48 (Decision Tree)                    | 59.35%
Logistic                               | 72.64%
SMO (SVM)                              | 71.74%
MultilayerPerceptron (Neural Network)  |
Voting (NB+NBM+J48)                    | 71.58%
Looking at overall accuracy, most classifiers got worse; NB and KNN improved a little (67.80% → 70.06% and 65.92% → 68.31%), though not by much.
A more detailed look is more revealing, however:
Top 1662 features
NBM

=== Detailed Accuracy By Class ===

              TP Rate  FP Rate  Precision  Recall  F-Measure  MCC    ROC Area  PRC Area  Class
              0.862    0.238    0.784      0.862   0.821      0.627  0.900     0.899     pos
              0.762    0.138    0.847      0.762   0.802      0.627  0.900     0.898     neg
Weighted Avg. 0.812    0.188    0.815      0.812   0.812      0.627  0.900     0.898

=== Confusion Matrix ===

    a    b   <-- classified as
 4309  691 |    a = pos
 1188 3812 |    b = neg
1662 features shared with sentiment dict
NBM

=== Detailed Accuracy By Class ===

              TP Rate  FP Rate  Precision  Recall  F-Measure  MCC    ROC Area  PRC Area  Class
              0.895    0.459    0.661      0.895   0.761      0.467  0.818     0.800     pos
              0.541    0.105    0.838      0.541   0.658      0.467  0.818     0.798     neg
Weighted Avg. 0.718    0.282    0.750      0.718   0.709      0.467  0.818     0.799

=== Confusion Matrix ===

    a    b   <-- classified as
 4477  523 |    a = pos
 2295 2705 |    b = neg
With the purified features, NBM's recall on the pos class is about 3 percentage points higher than with the plain top-1662 unigrams (0.895 vs 0.862), but the error rate on the neg class rises sharply (recall drops from 0.762 to 0.541), which drags the overall accuracy down.
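These per-class numbers follow directly from the confusion matrices; a quick sanity check in Python, with the entries copied from the two NBM matrices above:
////////////////////////////////////////////////////////////////////////////////////////////////////
# Recompute per-class recall and overall accuracy from a 2x2 confusion matrix
def summarize(tp, fn, fp, tn):
    # actual pos row = (tp, fn); actual neg row = (fp, tn)
    pos_recall = tp / (tp + fn)
    neg_recall = tn / (fp + tn)
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    return pos_recall, neg_recall, accuracy

print(summarize(4309, 691, 1188, 3812))  # top 1662:  (0.862, 0.762, 0.812)
print(summarize(4477, 523, 2295, 2705))  # purified:  (0.895, 0.541, 0.718)
////////////////////////////////////////////////////////////////////////////////////////////////////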
Now look at SMO (SVM):
Top 1662 features
SMO

=== Detailed Accuracy By Class ===

              TP Rate  FP Rate  Precision  Recall  F-Measure  MCC    ROC Area  PRC Area  Class
              0.777    0.189    0.804      0.777   0.790      0.588  0.794     0.736     pos
              0.811    0.223    0.784      0.811   0.797      0.588  0.794     0.731     neg
Weighted Avg. 0.794    0.206    0.794      0.794   0.794      0.588  0.794     0.733

=== Confusion Matrix ===

    a    b   <-- classified as
 3885 1115 |    a = pos
  946 4054 |    b = neg
1662 features shared with sentiment dict
SMO

=== Detailed Accuracy By Class ===

              TP Rate  FP Rate  Precision  Recall  F-Measure  MCC    ROC Area  PRC Area  Class
              0.594    0.159    0.789      0.594   0.678      0.449  0.717     0.671     pos
              0.841    0.406    0.674      0.841   0.748      0.449  0.717     0.647     neg
Weighted Avg. 0.717    0.283    0.732      0.717   0.713      0.449  0.717     0.659

=== Confusion Matrix ===

    a    b   <-- classified as
 2970 2030 |    a = pos
  796 4204 |    b = neg
With the purified features, SMO's recall on the neg class reaches 84%, about 3 percentage points above the 81% obtained with the plain top-1662 words; but pos accuracy is now very low, so overall performance drops.
The other algorithms show a similar pattern: the predictions become noticeably more biased toward one class.
How to select features clearly deserves more careful thought.
To be continued....