Text is usually processed at the word level after sentences are segmented into words. If the raw corpus contains hundreds of thousands of distinct Chinese words, the resulting feature space becomes extremely high-dimensional. To discard uninformative text and keep only the words relevant to the task, feature words must be selected from the vocabulary. Commonly used methods include information gain (IG), document frequency (DF), the chi-square test (CHI), and mutual information (MI).
I. Information Gain (IG)
In a classification system, consider a random variable X that takes n possible values x_1, x_2, …, x_n. The information entropy of X is

H(X) = -\sum_{i=1}^{n} p(x_i) \log p(x_i)
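As a quick worked example: for a binary classification problem with two equally likely classes, the class entropy is maximal,

H(C) = -\tfrac{1}{2}\log\tfrac{1}{2} - \tfrac{1}{2}\log\tfrac{1}{2} = \log 2 \approx 0.693 \text{ (nats)},

while a completely skewed class distribution has entropy 0.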
Information gain is computed per feature: for a feature (term) t, compute how much information the classification system carries with and without t; the difference between the two is the information gain contributed by that feature.
Accordingly, the information gain of a term T with respect to the class variable C is defined as

IG(T) = H(C) - H(C|T)

where the conditional entropy H(C|T) is

H(C|T) = P(t)\,H(C|t) + P(\bar{t})\,H(C|\bar{t}) = -P(t)\sum_{i=1}^{n} P(c_i|t)\log P(c_i|t) - P(\bar{t})\sum_{i=1}^{n} P(c_i|\bar{t})\log P(c_i|\bar{t})
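A quick sanity check with hypothetical counts: suppose there are 50 positive and 50 negative documents, and a term t occurs in every positive document and in no negative document. Then H(C) = \log 2; every document containing t is positive and every document without t is negative, so H(C|t) = H(C|\bar{t}) = 0 and

IG(t) = H(C) - H(C|T) = \log 2 - \big(0.5 \cdot 0 + 0.5 \cdot 0\big) = \log 2,

the maximum possible value, matching the intuition that t perfectly separates the two classes.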
The steps for selecting feature words by information gain are:
1. Count the number of documents in the positive and negative classes, N1 and N2;
2. For each word, count the number of positive documents that contain it (A), the number of negative documents that contain it (B), the number of positive documents that do not contain it (C), and the number of negative documents that do not contain it (D);
3. Compute the information entropy H(C);
4. Compute the information gain of each word w;
5. Sort the words by information gain in descending order and take the top k.
The code is as follows:
# Load the data. Each line of data.txt is expected to be "<category>,<space-separated words>".
data_file = "D:\workspace\project\\NLPcase\\textFeatureExtraction\\data\\data.txt"

def load_data(path):
    dataset = []
    cate_dict = {}
    for line in open(path, encoding='utf-8'):
        line = line.strip().split(',')
        cate = line[0]
        if cate not in cate_dict:
            cate_dict[cate] = 1
        else:
            cate_dict[cate] += 1
        dataset.append([line[0], [wd for wd in line[1].split(' ') if 'nbsp' not in wd]])
    return cate_dict, dataset

cate_dict, dataset = load_data(data_file)

# For every word, count how many documents of each category contain it (per-category document frequency)
def collect_df():
    worddf_dict = {}
    for data in dataset:
        category = data[0]
        for wd in set(data[1]):
            if wd not in worddf_dict:
                worddf_dict[wd] = category
            else:
                worddf_dict[wd] += '@' + category
    for word, word_category in worddf_dict.items():
        cate_count = {}
        for cate in word_category.split("@"):
            if cate not in cate_count:  # bug fix: the original tested membership in the string word_category, which raised KeyError
                cate_count[cate] = 1
            else:
                cate_count[cate] += 1
        worddf_dict[word] = cate_count
    return worddf_dict

# Based on the word-category score dictionary, take the top words of each category and merge them into the feature set
def select_best(feature_num, word_dict, cate_nums):
    cate_worddict = {}
    features = []
    for word, scores in word_dict.items():
        for cate, word_score in scores.items():
            if cate not in cate_worddict:
                cate_worddict[cate] = {}
            cate_worddict[cate][word] = word_score  # bug fix: the original skipped the first word seen for each category
    # enlarge the per-category quota, since the categories' top words may overlap
    top_num = int(feature_num / cate_nums) + 100
    for cate, words in cate_worddict.items():
        words = sorted(words.items(), key=lambda asd: asd[1], reverse=True)[:top_num]
        top_words = [item[0] for item in words]
        features += top_words
    return list(set(features))[:feature_num]  # deduplicate first, then truncate to feature_num
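To make the expected input concrete, here is a short usage sketch; the example line, the word '天气', and the counts in the comments are hypothetical, chosen only to match the format that load_data parses (a category label, a comma, then space-separated tokens):

cate_dict, dataset = load_data(data_file)  # a line "0,今天 天气 不错" yields ['0', ['今天', '天气', '不错']]
print(cate_dict)                           # e.g. {'0': 1200, '1': 800}: number of documents per category
worddf_dict = collect_df()
print(worddf_dict.get('天气'))             # e.g. {'0': 35, '1': 12}: per-category document frequency of one word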
# Feature extraction based on information gain; A/B/C/D are computed per word and the result is a global feature set
from textFeatureExtraction.loadData import load_data, collect_df
import math

data_file = "D:\workspace\project\\NLPcase\\textFeatureExtraction\\data\\data.txt"
cate_dict, dataset = load_data(data_file)
worddf_dict = collect_df()
cate_nums = len(cate_dict)

def IG(feature_num):
    N = sum(cate_dict.values())
    ig_dict = {}
    # Follow the IG formula directly; for each word, A/B/C/D form a 2x2 contingency table per category.
    # Categories are assumed to be labelled with the consecutive integer strings '0', '1', ...
    for word, word_cate in worddf_dict.items():
        HC = 0.0    # entropy of the original class distribution, H(C)
        HTC = 0.0   # contribution of documents that contain the word
        HT_C = 0.0  # contribution of documents that do not contain the word
        for cate in range(cate_nums):
            cate = str(cate)
            N1 = cate_dict[cate]
            hc = -(N1 / N) * math.log(N1 / N)
            A = word_cate.get(cate, 0)                                                # this class, word present
            B = sum([word_cate[key] for key in word_cate.keys() if key != cate])      # other classes, word present
            C = cate_dict[cate] - A                                                   # this class, word absent
            D = N - cate_dict[cate] - B                                               # other classes, word absent
            # zero counts occur, so apply add-one smoothing to the conditional probabilities
            p_t = (A + B) / N
            p_not_t = (C + D) / N
            p_c_given_t = (A + 1) / (A + B + cate_nums)
            p_c_given_not_t = (C + 1) / (C + D + cate_nums)
            h_t_ci = p_t * p_c_given_t * math.log(p_c_given_t)
            h_t_not_ci = p_not_t * p_c_given_not_t * math.log(p_c_given_not_t)
            # accumulate over all categories
            HC += hc
            HTC += h_t_ci
            HT_C += h_t_not_ci
        ig_score = HC + HTC + HT_C  # equals H(C) - H(C|T): the conditional terms are already negative via the log
        ig_dict[word] = ig_score
    ig_dict = sorted(ig_dict.items(), key=lambda asd: asd[1], reverse=True)[:feature_num]
    features = [item[0] for item in ig_dict]
    return features
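A quick hypothetical call (the feature count 1000 is arbitrary):

features = IG(1000)  # the 1000 words with the highest information gain
print(len(features), features[:10])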
II. Chi-Square Test (CHI)
The basic idea of the chi-square test is to measure how far the observed values deviate from the theoretical (expected) values. In practice, one first assumes that the two variables are independent (the null hypothesis) and computes the deviation of the observed values from the expected values. If the deviation is small enough, it is attributed to natural sampling error and the null hypothesis holds; if the deviation exceeds a certain level, the null hypothesis is rejected. For a term t and a category c, with A, B, C, D the document counts defined in the steps below and N = A + B + C + D, the statistic is

\chi^2(t, c) = \frac{N\,(AD - BC)^2}{(A + C)(A + B)(B + D)(C + D)}
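To make the formula concrete, a minimal sketch with hypothetical counts (the helper chi_square and the numbers are illustrative, not part of the original code):

# chi-square score for a single (word, category) 2x2 contingency table
def chi_square(A, B, C, D):
    N = A + B + C + D
    return (N * (A * D - B * C) ** 2) / ((A + C) * (A + B) * (B + D) * (C + D))

# a word concentrated in one class gets a high score
print(chi_square(A=40, B=5, C=10, D=45))  # approximately 49.5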
The steps for chi-square feature selection are:
1. Count the number of documents in the positive and negative classes, N1 and N2;
2. For each word, count the number of positive documents that contain it (A), the number of negative documents that contain it (B), the number of positive documents that do not contain it (C), and the number of negative documents that do not contain it (D);
3. Compute the chi-square statistic for each word;
4. Sort the words by chi-square value in descending order and take the top k.
The code is as follows:
# Chi-square test; produces per-category (local) feature scores
from textFeatureExtraction.loadData import load_data, collect_df, select_best

data_file = "D:\workspace\project\\NLPcase\\textFeatureExtraction\\data\\data.txt"
cate_dict, dataset = load_data(data_file)
worddf_dict = collect_df()
cate_nums = len(cate_dict)

def CHI(feature_num):
    N = sum(cate_dict.values())
    chi_dict = {}
    for word, word_cate in worddf_dict.items():
        data = {}
        for cate in range(cate_nums):
            cate = str(cate)
            A = word_cate.get(cate, 0)
            B = sum([word_cate[key] for key in word_cate.keys() if key != cate])
            C = cate_dict[cate] - A
            D = N - cate_dict[cate] - B
            # bug fix: the last factor of the denominator is (C + D), not (B + C)
            chi_score = (N * (A * D - B * C) ** 2) / ((A + C) * (A + B) * (B + D) * (C + D))
            data[cate] = chi_score
        chi_dict[word] = data
    feature = select_best(feature_num, chi_dict, cate_nums)
    return feature
III. Mutual Information (MI)
Mutual information measures the degree of association between two variables, here between a word t and the class variable c. The (pointwise) formula is

MI(t, c) = \log \frac{P(t, c)}{P(t)\,P(c)} \approx \log \frac{A \cdot N}{(A + B)(A + C)}

where A, B, C, D and N are the same document counts as above.
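A quick sanity check: if a word appears only in documents of class c, then P(t, c) = P(t), so MI(t, c) = \log \frac{1}{P(c)} = -\log P(c), which is large precisely when the class is rare; such a word is a strong indicator of c.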
The code is as follows:
# Feature extraction based on mutual information; the resulting scores are local to each category
from textFeatureExtraction.loadData import load_data, collect_df, select_best
import math

data_file = "D:\workspace\project\\NLPcase\\textFeatureExtraction\\data\\data.txt"
cate_dict, dataset = load_data(data_file)
worddf_dict = collect_df()
cate_nums = len(cate_dict)

def IM(feature_num):
    N = sum(cate_dict.values())
    mi_dict = {}
    for word, word_cate in worddf_dict.items():
        data = {}
        for cate in range(cate_nums):
            cate = str(cate)  # bug fix: the counts in word_cate are keyed by category strings, so A was always 0
            A = word_cate.get(cate, 0)
            B = sum([word_cate[key] for key in word_cate.keys() if key != cate])
            C = cate_dict[cate] - A
            D = N - cate_dict[cate] - B
            # add-one smoothing so the log is defined when A == 0;
            # p_t_c estimates the joint probability P(t, c), consistent with the formula above
            p_t_c = (A + 1) / (N + cate_nums)
            p_c = (A + C) / N
            p_t = (A + B) / N
            mi_score = p_t_c * math.log(p_t_c / (p_c * p_t))  # pointwise MI weighted by the joint probability
            data[cate] = mi_score
        mi_dict[word] = data
    feature = select_best(feature_num, mi_dict, cate_nums)
    return feature
IV. Document Frequency (DF)
The number of documents that contain a word (its document frequency) is used directly as the feature score.
The code is as follows:
# Feature extraction based on word document frequency
from textFeatureExtraction.loadData import load_data

data_file = "D:\workspace\project\\NLPcase\\textFeatureExtraction\\data\\data.txt"
cate_dict, dataset = load_data(data_file)

# Count the number of documents each word appears in; this is a global, category-independent feature
def DF(feature_num):
    df_dict = {}
    for data in dataset:
        for word in set(data[1]):
            if word not in df_dict:
                df_dict[word] = 1
            else:
                df_dict[word] += 1
    df_dict = sorted(df_dict.items(), key=lambda asd: asd[1], reverse=True)[:feature_num]
    return [item[0] for item in df_dict]  # return only the words, consistent with the other extractors
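Assuming the four extractor functions above can be imported into one script (the module layout follows the snippets above, and the feature count 2000 is arbitrary), they can be compared side by side:

# select 2000 feature words with each method and inspect the results
for name, extractor in [('DF', DF), ('IG', IG), ('CHI', CHI), ('MI', IM)]:
    features = extractor(2000)
    print(name, len(features), features[:10])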
V. References
https://www.jianshu.com/p/167283ab011f
https://blog.youkuaiyun.com/snowdroptulip/article/details/78770088
https://www.cnblogs.com/chenying99/p/5018196.html
https://github.com/liuhuanyong/TextFeatureExtraction/blob/master/feature_extract.py
https://blog.youkuaiyun.com/zhixiongzhao/article/details/72852841