Text Feature Extraction

This article walks through several commonly used text feature selection methods, namely information gain (IG), the chi-square test (CHI), mutual information (MI), and document frequency (DF), and discusses how each is applied and computed in text processing.


Text is usually processed at the word level after sentence segmentation. If the raw corpus contains hundreds of thousands of distinct Chinese words, the resulting feature space is extremely high-dimensional, so unnecessary terms need to be pruned and only the words relevant to the task kept as features. Commonly used selection methods include information gain (IG), document frequency (DF), the chi-square test (CHI), and mutual information (MI).
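To make the goal concrete, here is a minimal sketch (the feature list and the sample document are invented purely for illustration) of how a selected feature-word list turns a segmented document into a low-dimensional vector:

features = ["price", "goal", "match", "stock"]             # hypothetical selected feature words
document = ["the", "match", "ended", "with", "a", "goal"]  # a segmented document as a word list
vector = [1 if wd in set(document) else 0 for wd in features]
print(vector)  # [0, 1, 1, 0]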

1. Information Gain (IG)

In a system, suppose a variable X takes n possible values x1, x2, …, xn. The information entropy of X is then

H(X) = -Σ_{i=1..n} P(x_i)·log P(x_i)

Information gain is computed per feature: for a feature t, measure how much information the classification system carries with the feature and without it; the difference between the two is the information gain brought by that feature.

Its information gain is therefore defined as

IG(t) = H(C) - H(C|t)

where the conditional entropy H(C|t) is

H(C|t) = -P(t)·Σ_i P(c_i|t)·log P(c_i|t) - P(t̄)·Σ_i P(c_i|t̄)·log P(c_i|t̄)
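Combining the two expressions (this is only the definition above with the minus sign distributed, not an additional assumption), the per-word quantity that the code below accumulates category by category is

IG(t) = -\sum_i P(c_i)\log P(c_i) + P(t)\sum_i P(c_i \mid t)\log P(c_i \mid t) + P(\bar{t})\sum_i P(c_i \mid \bar{t})\log P(c_i \mid \bar{t})

In the IG() function further below, the variables HC, HTC, and HT_C correspond to the three terms respectively.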

The steps for selecting feature words by information gain are:

1. Count the number of documents in each class (N1 and N2 in the binary case);

2. For each word, count the number of documents of the class that contain it (A), the number of documents of the other classes that contain it (B), the number of documents of the class that do not contain it (C), and the number of documents of the other classes that do not contain it (D) — see the contingency table after this list;

3. Compute the class entropy H(C);

4. Compute the information gain of each word w;

5. Sort by information gain and keep the top k words.
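For reference, the A, B, C, D counts from step 2 form the following contingency table for a word t and a class c; the same four counts are reused by the chi-square and mutual-information code further below, and N = A + B + C + D is the total number of documents.

                        class c    other classes
  contains word t          A             B
  does not contain t       C             D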

The code is as follows:

# Load the data
data_file = r"D:\workspace\project\NLPcase\textFeatureExtraction\data\data.txt"
def load_data(path):
    dataset = []
    cate_dict = {}
    for line in open(path, encoding='utf-8'):
        line = line.strip().split(',')
        cate = line[0]
        # Count how many documents belong to each category
        if cate not in cate_dict:
            cate_dict[cate] = 1
        else:
            cate_dict[cate] += 1
        # Keep the category label and the word list, dropping 'nbsp' tokens left over from HTML
        dataset.append([line[0], [wd for wd in line[1].split(' ') if 'nbsp' not in wd]])
    return cate_dict, dataset
cate_dict, dataset = load_data(data_file)
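# Note on the expected input format (an assumption inferred from the parsing above, not stated
# explicitly in the post): one document per line, as "<category label>,<space-separated segmented words>",
# with the category labels being the strings "0", "1", ..., since the IG/CHI/MI code below iterates
# over range(cate_nums); tokens containing 'nbsp' are HTML residue and are filtered out.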
# For every word, count the number of documents of each category that contain it (per-category document frequency)
def collect_df():
    worddf_dict = {}
    for data in dataset:
        category = data[0]
        for wd in set(data[1]):              # set() so a word is counted at most once per document
            if wd not in worddf_dict:
                worddf_dict[wd] = category
            else:
                worddf_dict[wd] += '@' + category
    # Convert the '@'-joined category string of each word into a {category: document count} dictionary
    for word, word_category in worddf_dict.items():
        count_dict = {}
        for cate in word_category.split("@"):
            if cate not in count_dict:
                count_dict[cate] = 1
            else:
                count_dict[cate] += 1
        worddf_dict[word] = count_dict
    return worddf_dict
# Based on the word-to-category relevance dictionary, take the top-scoring words of each category and merge them into the feature list
def select_best(feature_num, word_dict, cate_nums):
    cate_worddict = {}
    features = []
    # Regroup the scores by category: cate_worddict[cate][word] = score
    for word, scores in word_dict.items():
        for cate, word_score in scores.items():
            if cate not in cate_worddict:
                cate_worddict[cate] = {}
            cate_worddict[cate][word] = word_score
    # Allocate a few extra words per category, since the per-category top lists may overlap
    top_num = int(feature_num / cate_nums) + 100

    for cate, words in cate_worddict.items():
        words = sorted(words.items(), key=lambda asd: asd[1], reverse=True)[:top_num]
        top_words = [item[0] for item in words]
        features += top_words

    # Deduplicate first, then truncate to the requested number of features
    return list(set(features))[:feature_num]
# Feature extraction based on information gain; the A/B/C/D counts yield a global (category-independent) score
from textFeatureExtraction.loadData import load_data, collect_df
import math
data_file = r"D:\workspace\project\NLPcase\textFeatureExtraction\data\data.txt"
cate_dict, dataset = load_data(data_file)
worddf_dict = collect_df()
cate_nums = len(cate_dict)
def IG(feature_num):
    N = sum(cate_dict.values())          # total number of documents
    ig_dict = {}
    # Follow the formula directly; for each word, the A/B/C/D counts can be read as a 2x2 contingency table
    for word, word_cate in worddf_dict.items():
        HC = 0.0    # entropy of the original class distribution, H(C)
        HTC = 0.0   # sum over classes of P(t) * P(c|t) * log P(c|t)
        HT_C = 0.0  # sum over classes of P(~t) * P(c|~t) * log P(c|~t)
        for cate in range(cate_nums):
            cate = str(cate)             # category labels are assumed to be the strings "0".."cate_nums-1"
            N1 = cate_dict[cate]
            hc = -(N1 / N) * math.log(N1 / N)
            A = word_cate.get(cate, 0)                                             # docs of this class containing the word
            B = sum([word_cate[key] for key in word_cate.keys() if key != cate])   # docs of other classes containing it
            C = cate_dict[cate] - A                                                # docs of this class without the word
            D = N - cate_dict[cate] - B                                            # docs of other classes without the word
            # A or C may be zero, which would make the logarithm undefined, so apply add-one smoothing
            p_t = (A + B) / N
            p_not_t = (C + D) / N
            p_c_t = (A + 1) / (A + B + cate_nums)          # P(c|t)
            p_c_not_t = (C + 1) / (C + D + cate_nums)      # P(c|~t)
            h_t_ci = p_t * p_c_t * math.log(p_c_t)
            h_t_not_ci = p_not_t * p_c_not_t * math.log(p_c_not_t)
            # Accumulate over all categories
            HC += hc
            HTC += h_t_ci
            HT_C += h_t_not_ci
        ig_score = HC + HTC + HT_C       # IG(t) = H(C) - H(C|t)
        ig_dict[word] = ig_score
    ig_dict = sorted(ig_dict.items(), key=lambda asd: asd[1], reverse=True)[:feature_num]
    features = [item[0] for item in ig_dict]
    return features
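# Usage sketch (assumes data.txt exists at data_file and that the labels are the strings "0".."cate_nums-1"):
#   ig_features = IG(1000)   # the 1000 words with the highest information gain
#   print(ig_features[:20])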

2. Chi-Square Test (CHI)

The basic idea is to judge the association by how far the observed counts deviate from the counts expected under independence. One first assumes that the word and the category are in fact independent (the null hypothesis) and measures the deviation between the observed and the expected values: if the deviation is small enough, it is treated as ordinary sampling error and the null hypothesis stands; if the deviation is large enough, the null hypothesis is rejected. Using the A, B, C, D counts defined above, the statistic is

χ²(t, c) = N·(A·D − B·C)² / ((A+B)·(A+C)·(B+D)·(C+D))

The steps for chi-square feature selection are:

1. Count the number of documents in each class (N1 and N2 in the binary case);

2. For each word, count A, B, C, and D as defined above;

3. Compute the chi-square score of each word (a small worked example follows this list);

4. Sort by chi-square score and keep the top words.
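As a quick sanity check of the formula, here is a worked example with invented counts for a single (word, class) pair:

A, B, C, D = 40, 10, 60, 90      # hypothetical contingency counts, for illustration only
N = A + B + C + D                # 200 documents in total
chi = (N * (A * D - B * C) ** 2) / ((A + B) * (A + C) * (B + D) * (C + D))
print(chi)                       # 24.0 -- the larger the score, the stronger the word-class association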

The code is as follows:

# Chi-square test; the scores are per-category (local) features
from textFeatureExtraction.loadData import load_data, collect_df, select_best
import math
data_file = r"D:\workspace\project\NLPcase\textFeatureExtraction\data\data.txt"
cate_dict, dataset = load_data(data_file)
worddf_dict = collect_df()
cate_nums = len(cate_dict)
def CHI(feature_num):
    N = sum(cate_dict.values())
    chi_dict = {}
    for word, word_cate in worddf_dict.items():
        data = {}
        for cate in range(cate_nums):
            cate = str(cate)             # category labels are assumed to be the strings "0".."cate_nums-1"
            A = word_cate.get(cate, 0)
            B = sum([word_cate[key] for key in word_cate.keys() if key != cate])
            C = cate_dict[cate] - A
            D = N - cate_dict[cate] - B
            # chi2(t, c) = N * (AD - BC)^2 / ((A+B)(A+C)(B+D)(C+D)); guard against a zero denominator
            denom = (A + B) * (A + C) * (B + D) * (C + D)
            chi_score = (N * (A * D - B * C) ** 2) / denom if denom else 0.0
            data[cate] = chi_score
        chi_dict[word] = data
    feature = select_best(feature_num, chi_dict, cate_nums)
    return feature

3. Mutual Information (MI)

Mutual information measures the degree of association between two variables, here the association between a word and a category. Pointwise, it is defined as

MI(t, c) = log( P(t, c) / (P(t)·P(c)) ) = log( P(c|t) / P(c) )

The code is as follows:

# Feature extraction based on mutual information; the resulting scores are per-category (local) features
from textFeatureExtraction.loadData import load_data, collect_df, select_best
import math
data_file = r"D:\workspace\project\NLPcase\textFeatureExtraction\data\data.txt"
cate_dict, dataset = load_data(data_file)
worddf_dict = collect_df()
cate_nums = len(cate_dict)
def IM(feature_num):
    N = sum(cate_dict.values())
    mi_dict = {}
    for word, word_cate in worddf_dict.items():
        data = {}
        for cate in range(cate_nums):
            cate = str(cate)             # category labels are assumed to be the strings "0".."cate_nums-1"
            A = word_cate.get(cate, 0)
            B = sum([word_cate[key] for key in word_cate.keys() if key != cate])
            C = cate_dict[cate] - A
            D = N - cate_dict[cate] - B
            # Add-one smoothing so that the logarithm stays defined when A is 0
            p_c_t = (A + 1) / (A + B + cate_nums)   # P(c|t)
            p_c = (A + C) / N                       # P(c)
            # Pointwise mutual information log(P(t,c)/(P(t)P(c))) = log(P(c|t)/P(c)), weighted by P(c|t)
            mi_score = p_c_t * math.log(p_c_t / p_c)
            data[cate] = mi_score
        mi_dict[word] = data
    feature = select_best(feature_num, mi_dict, cate_nums)
    return feature

4. Document Frequency (DF)

The number of documents that contain a word is used directly as the selection criterion: the words that appear in the most documents are kept as features.

The code is as follows:

# Select text features based on word document frequency
from textFeatureExtraction.loadData import load_data
data_file = r"D:\workspace\project\NLPcase\textFeatureExtraction\data\data.txt"
cate_dict, dataset = load_data(data_file)
# Count, for every word, the number of documents it appears in; this is a category-independent, global feature
def DF(feature_num):
    df_dict = {}
    for data in dataset:
        for word in set(data[1]):          # set() so a word is counted at most once per document
            if word not in df_dict:
                df_dict[word] = 1
            else:
                df_dict[word] += 1
    # Sort by document frequency and keep the top feature_num words
    top = sorted(df_dict.items(), key=lambda asd: asd[1], reverse=True)[:feature_num]
    return [item[0] for item in top]
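A short usage sketch tying the pieces together (it assumes the functions above are available in the same session, that data.txt exists at the path used in the listings, and that the feature count 1000 is an arbitrary choice):

chi_features = CHI(1000)   # per-category (local) features from the chi-square test
mi_features = IM(1000)     # per-category (local) features from mutual information
df_features = DF(1000)     # global features from document frequency
print(len(chi_features), len(mi_features), len(df_features))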

5. References

https://www.jianshu.com/p/167283ab011f

https://blog.youkuaiyun.com/snowdroptulip/article/details/78770088

https://www.cnblogs.com/chenying99/p/5018196.html

https://github.com/liuhuanyong/TextFeatureExtraction/blob/master/feature_extract.py

https://blog.youkuaiyun.com/zhixiongzhao/article/details/72852841

 
