In earlier experiments I used the chi-square statistic to extract the most informative words from the corpus, varying the count between 1000, 2000, and 3000.
I then found that most algorithms do need a certain number of features: within a certain range (say 1000~3000), the more words extracted, the better the classifier performs, until performance peaks and then starts to decline.
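For reference, this chi-square ranking step can be reproduced with scikit-learn. The snippet below is only a minimal sketch over a toy corpus; the documents, labels, and the value of k are illustrative assumptions, not the actual experimental pipeline.
////////////////////////////////////////////////////////////////////////////////////////////////////
# Sketch: chi-square unigram ranking with scikit-learn (illustrative only).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

# Toy corpus; the real experiments ran over the full review corpus
docs = [
    "the plot was gripping and the acting superb",
    "a dull predictable film with wooden acting",
    "wonderful performances loved every minute",
    "boring script hated the ending",
]
labels = [1, 0, 1, 0]  # 1 = pos, 0 = neg

vectorizer = CountVectorizer()      # bag-of-words unigram counts
X = vectorizer.fit_transform(docs)

# Score each unigram against the class labels and keep the top k
selector = SelectKBest(chi2, k=5)   # k was 1000/2000/3000 in the experiments
selector.fit(X, labels)

vocab = vectorizer.get_feature_names_out()
best_words = [vocab[i] for i in selector.get_support(indices=True)]
print(best_words)
////////////////////////////////////////////////////////////////////////////////////////////////////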
Only the KNN algorithm behaved differently: its performance tended to drop as the number of features grew. Since its performance was not good to begin with, I did not give it much more attention.
Later, wanting to improve performance, I decided to start with the existing features. My supervisor suggested reviewing the extracted unigrams to see what was actually in there. A quick scan showed that many of them were nouns with no emotional color at all; such words seemed useless for sentiment analysis and might even be harmful. So I decided to try purifying the features, and my supervisor recommended using a sentiment lexicon as a reference.
The filter is simple: if an extracted feature appears in the sentiment lexicon, keep it; otherwise, discard it.
////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
# Load the extracted features (extract more than you ultimately need here,
# e.g. the top 8000 best_words)
with open("my_features.txt", "r") as inpfile:
    my_features = [line.strip() for line in inpfile]

# Load the sentiment lexicon; a set makes the membership test fast
with open("lexicon.txt", "r") as inpfile:
    lexicon = set(line.strip() for line in inpfile)

# Keep only the features that also appear in the lexicon
shared_words = [w for w in my_features if w in lexicon]

# Write the result to a file (1662 shared features in this example)
with open("shared_words.txt", "w") as f1:
    for item in shared_words:
        f1.write(item + "\n")
////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
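To get from this shared word list to something Weka can load, each document still has to be encoded over those features. Below is a minimal sketch that writes a boolean term-presence ARFF file; the tokenization, file names, and encoding are assumptions for illustration, not necessarily the exact setup used here.
////////////////////////////////////////////////////////////////////////////////////////////////////
# Sketch: write a Weka ARFF file with boolean term-presence features.
# Illustrative only; a count- or tf-idf-based encoding would also work.
def write_arff(docs, labels, features, path):
    with open(path, "w") as f:
        f.write("@relation sentiment\n\n")
        for w in features:
            f.write("@attribute '%s' {0,1}\n" % w)
        f.write("@attribute class {pos,neg}\n\n@data\n")
        for doc, label in zip(docs, labels):
            tokens = set(doc.lower().split())  # naive whitespace tokenization
            row = ["1" if w in tokens else "0" for w in features]
            f.write(",".join(row + [label]) + "\n")

# Hypothetical usage:
# features = [line.strip() for line in open("shared_words.txt")]
# write_arff(train_docs, train_labels, features, "train.arff")
////////////////////////////////////////////////////////////////////////////////////////////////////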
After this filtering I fed the data back into Weka, and the results are quite interesting.
Top 1662 features

Classifiers                            | Cross-validation (10)
NaiveBayesMultinomial                  | 81.21%
NaiveBayes                             | 67.80%
IBk (KNN)                              | 65.92%
J48 (Decision Tree)                    | 64.34%
Logistic                               | 79.71%
SMO (SVM)                              | 79.39%
MultilayerPerceptron (Neural Network)  |
Voting (NB+NBM+J48)                    |
Voting (NBM+Logistic+SMO)              |
1662 features shared with a 6800-word sentiment dict

Classifiers                            | Cross-validation (10)
NaiveBayesMultinomial                  | 71.82%
NaiveBayes                             | 70.06%
IBk (KNN)                              | 68.31%
J48 (Decision Tree)                    | 59.35%
Logistic                               | 72.64%
SMO (SVM)                              | 71.74%
MultilayerPerceptron (Neural Network)  |
Voting (NB+NBM+J48)                    | 71.58%
Looking at overall accuracy, most classifiers got worse; NB and KNN improved a little (67.80% → 70.06% and 65.92% → 68.31%), though not by much.
A more detailed look is more revealing, however:
Top 1662 features
NBM

=== Detailed Accuracy By Class ===

              TP Rate  FP Rate  Precision  Recall  F-Measure  MCC    ROC Area  PRC Area  Class
              0.862    0.238    0.784      0.862   0.821      0.627  0.900     0.899     pos
              0.762    0.138    0.847      0.762   0.802      0.627  0.900     0.898     neg
Weighted Avg. 0.812    0.188    0.815      0.812   0.812      0.627  0.900     0.898

=== Confusion Matrix ===

    a    b   <-- classified as
 4309  691 |    a = pos
 1188 3812 |    b = neg
1662 features shared with sentiment dict
NBM

=== Detailed Accuracy By Class ===

              TP Rate  FP Rate  Precision  Recall  F-Measure  MCC    ROC Area  PRC Area  Class
              0.895    0.459    0.661      0.895   0.761      0.467  0.818     0.800     pos
              0.541    0.105    0.838      0.541   0.658      0.467  0.818     0.798     neg
Weighted Avg. 0.718    0.282    0.750      0.718   0.709      0.467  0.818     0.799

=== Confusion Matrix ===

    a    b   <-- classified as
 4477  523 |    a = pos
 2295 2705 |    b = neg
With the purified features, NBM's recall on the pos class is about 3 percentage points higher than with the plain top-1662 unigrams (0.895 vs 0.862), but the error rate on the neg class rises sharply (recall drops from 0.762 to 0.541), which drags the overall accuracy down.
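These per-class numbers follow directly from the confusion matrices; a quick sanity check in Python, with the entries copied from the two NBM matrices above:
////////////////////////////////////////////////////////////////////////////////////////////////////
# Recompute per-class recall and overall accuracy from a 2x2 confusion matrix
def summarize(tp, fn, fp, tn):
    # actual pos row = (tp, fn); actual neg row = (fp, tn)
    pos_recall = tp / (tp + fn)
    neg_recall = tn / (fp + tn)
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    return pos_recall, neg_recall, accuracy

print(summarize(4309, 691, 1188, 3812))  # top 1662:  (0.862, 0.762, 0.812)
print(summarize(4477, 523, 2295, 2705))  # purified:  (0.895, 0.541, 0.718)
////////////////////////////////////////////////////////////////////////////////////////////////////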
Now look at SMO (SVM):
Top 1662 features
SMO

=== Detailed Accuracy By Class ===

              TP Rate  FP Rate  Precision  Recall  F-Measure  MCC    ROC Area  PRC Area  Class
              0.777    0.189    0.804      0.777   0.790      0.588  0.794     0.736     pos
              0.811    0.223    0.784      0.811   0.797      0.588  0.794     0.731     neg
Weighted Avg. 0.794    0.206    0.794      0.794   0.794      0.588  0.794     0.733

=== Confusion Matrix ===

    a    b   <-- classified as
 3885 1115 |    a = pos
  946 4054 |    b = neg
1662 features shared with sentiment dict
SMO

=== Detailed Accuracy By Class ===

              TP Rate  FP Rate  Precision  Recall  F-Measure  MCC    ROC Area  PRC Area  Class
              0.594    0.159    0.789      0.594   0.678      0.449  0.717     0.671     pos
              0.841    0.406    0.674      0.841   0.748      0.449  0.717     0.647     neg
Weighted Avg. 0.717    0.283    0.732      0.717   0.713      0.449  0.717     0.659

=== Confusion Matrix ===

    a    b   <-- classified as
 2970 2030 |    a = pos
  796 4204 |    b = neg
With the purified features, SMO's recall on the neg class reaches 84%, about 3 percentage points above the 81% obtained with the plain top-1662 words; but pos accuracy is now very low, so overall performance drops.
The other algorithms show a similar pattern: the predictions become noticeably more biased toward one class.
How to select features clearly deserves more careful thought.
To be continued....