Level 3: Feature Extraction from Text Data

Task for This Level

In the previous level we learned how to standardize a dataset; standardization mainly targets numeric data. Raw text, however, cannot be used as training data directly: it must first be converted into feature vectors through feature extraction. In this level you will learn the basic operations for extracting features from text data.
Background

Text analysis is a major application area for machine learning algorithms. Raw text cannot be fed to these algorithms directly, because most learning algorithms expect numeric feature vectors of fixed length rather than text of variable length.
To address this, sklearn provides utilities that extract numeric features from text content in the most common ways, namely:

    Tokenizing: split each sentence into tokens and assign an integer id to each token, typically using whitespace and punctuation as separators.
    Counting: count how often each token occurs in a document.
    Normalizing and weighting: down-weight tokens that occur in most samples/documents.

In text feature extraction, features and samples are defined as follows:

    The frequency of each token is treated as a feature.
    The vector of all token frequencies for a given document is treated as a sample.

The whole corpus can therefore be viewed as a matrix, with one row per document and one column per token. The process of converting a collection of documents into numeric feature vectors is called vectorization. This overall strategy is known as the bag-of-words model: a document is described by its word counts, while the relative positions of the words within the document are ignored entirely.

Text feature extraction methods in sklearn

The CountVectorizer module implements tokenization and counting. It accepts many parameters, all of which have default values.

Common parameters:

input : format of the input
tokenizer : custom tokenizer to use
stop_words : stop words to remove; e.g. with stop_words="english", the built-in English stop word list is used
max_df : upper bound on document frequency; a float in [0, 1] is interpreted as a proportion of documents, an integer as an absolute document count
min_df : lower bound on document frequency, with the same float/integer semantics
max_features : maximum number of features to keep
vocabulary : a fixed vocabulary, i.e. a mapping from tokens to feature indices
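A small sketch of how min_df and max_df interact (the three-document corpus is invented for illustration; the pruning behavior follows sklearn's documented semantics):

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "apple banana apple",
    "banana cherry",
    "cherry banana apple",
]

# min_df=2: keep only tokens appearing in at least 2 documents
# max_df=2: drop tokens appearing in more than 2 documents
vec = CountVectorizer(min_df=2, max_df=2)
vec.fit(corpus)

# "banana" occurs in all 3 documents, so max_df removes it
print(sorted(vec.vocabulary_))      # ['apple', 'cherry']
print('banana' in vec.stop_words_)  # True
```

Terms pruned by max_df, min_df, or max_features are collected in the stop_words_ attribute, which is where "banana" ends up here.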

Attributes:

vocabulary_ : dict mapping tokens to feature indices

    # get the vocabulary mapping
    vocab = vectorizer.vocabulary_
    # it is a dict; look up the column index of the feature 'document'
    vectorizer.vocabulary_.get('document')

stop_words_ : set of terms that were ignored because of max_df, min_df, or max_features

Methods:

fit(raw_documents[, y]) : learn the token-to-feature-index mapping from the raw documents
transform(raw_documents) : transform a collection of raw documents into a feature matrix
fit_transform(raw_documents[, y]) : fit followed by transform; returns the feature matrix

build_analyzer() : returns a callable that performs preprocessing and tokenization
For example, it can be used to tokenize "This is a text document to analyze." and verify the result.

get_feature_names() : returns the feature name (the token) corresponding to each feature index

Applying the fitted vectorizer to new test data:

Note: transform returns a sparse matrix. To represent the data more conveniently, call toarray() to convert it to a NumPy array; remember to perform the same conversion in the programming tasks below.
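A minimal sketch of applying a fitted vectorizer to unseen documents (the training strings are invented for illustration); tokens absent from the training vocabulary are simply ignored:

```python
from sklearn.feature_extraction.text import CountVectorizer

train = ["machine learning is fun", "text mining is fun"]
vectorizer = CountVectorizer()
vectorizer.fit(train)
# vocabulary (alphabetical): fun, is, learning, machine, mining, text

# transform returns a scipy sparse matrix; toarray() densifies it
sparse = vectorizer.transform(["learning text mining rocks"])
dense = sparse.toarray()
print(dense)  # [[0 0 1 0 1 1]]  -- "rocks" is out-of-vocabulary and dropped
```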

In a text corpus, some words are very common (e.g. English "the", "a", "is") but carry little useful information. If raw counts are fed directly to a classifier, these frequent words can drown out rarer but more informative ones, and performance suffers. To re-weight the feature counts, a TF-IDF transform is usually applied.
The idea behind TF-IDF is that a term appearing frequently in one document but rarely in the rest of the corpus discriminates well between classes and is therefore useful for classification. TF-IDF is the product TF × IDF: TF (term frequency) is the number of times a given term appears in a document, and IDF (inverse document frequency) grows as fewer documents contain the term, so a high IDF indicates strong discriminative power.

The TfidfVectorizer module in sklearn implements the TF-IDF idea. Its parameters, attributes, and methods are similar to those of CountVectorizer, so refer to CountVectorizer for how to call it.
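A short sketch with an invented corpus, illustrating that TfidfVectorizer follows the same fit/transform interface and, with the default norm="l2", returns rows that are unit-length tf-idf vectors:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats are friends",
]

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(corpus).toarray()

# one row per document, one column per token
print(X.shape)  # (3, 12)
# with the default norm="l2", every document vector has unit length
print(np.linalg.norm(X, axis=1))
```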
Programming Requirements

In this level, use CountVectorizer and TfidfVectorizer respectively to extract features from news text data and verify that the extraction is correct.

Dataset:

fetch_20newsgroups("./", subset='train', categories=categories) loads the news data under the given directory; subset='train' selects the training split of the dataset, and categories specifies the news categories. See the official fetch_20newsgroups documentation for details of the call. X is a list of strings containing 857 news documents.

This level consists of the following subtasks:

1. Use CountVectorizer to extract feature vectors from the news data; return the vocabulary size and the tokenization of a test string

The code block to complete is as follows:

    def transfer2CountVector():
        '''
        Use CountVectorizer to extract feature vectors; return the vocabulary
        size and the tokenization of the test string test_str
        Returns:
        vocab_len - scalar, size of the vocabulary
        tokenizer_list - list, result of tokenizing test_str
        '''
        vocab_len = 0

        test_str = "what's your favorite programming language?"
        tokenizer_list = []

        #   Add your implementation here   #
        # ********** Begin *********#


        # ********** End **********#
        return vocab_len,tokenizer_list

2. Use TfidfVectorizer to build the token-to-feature mapping for the news data, using the built-in English stop word list as stop words and a minimum document frequency of 2. Then apply the fitted vectorizer to new test data.

    def transfer2TfidfVector():
        '''
        Use TfidfVectorizer to extract feature vectors and apply the fitted
        vectorizer to new test data
        TfidfVectorizer() parameter settings:
        min_df = 2, stop_words="english"
        test_data - the raw data to transform
        Returns:
        transfer_test_data - 2-D ndarray
        '''
        test_data = ['Once again, to not believe in God is different than saying....... where is your evidence for that "god is" is meaningful at some level?\n   Benedikt\n']
        transfer_test_data = None
        #   Add your implementation here   #
        # ********** Begin *********#
        # ********** End **********#
        return transfer_test_data

Testing Notes

The test data for this level comes from the file ./step5/testTextFeatureExtraction.py. The platform compares your functions' return values with the correct values; you can move on to the next level only when all values are computed correctly.

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer


categories = [
    'alt.atheism',
    'talk.religion.misc',
]

# Load the news data from the given directory; it contains 857 documents
data = fetch_20newsgroups("./step5/",subset='train', categories=categories)
X = data.data

def transfer2CountVector():
    '''
    Use CountVectorizer to extract feature vectors; return the vocabulary
    size and the tokenization of the test string test_str

    Returns:
    vocab_len - scalar, size of the vocabulary
    tokenizer_list - list, result of tokenizing test_str
    '''

    vocab_len = 0

    test_str = "what's your favorite programming language?"
    tokenizer_list = []

    #   Add your implementation here   #
    # ********** Begin *********#
    # fit on the corpus to build the vocabulary
    vectorizer = CountVectorizer()
    vectorizer.fit(X)
    vocab_len = len(vectorizer.vocabulary_)

    # tokenize the test string with the vectorizer's analyzer
    analyze = vectorizer.build_analyzer()
    tokenizer_list = analyze(test_str)
    
    # ********** End **********#

    return vocab_len,tokenizer_list

def transfer2TfidfVector():
    '''
    Use TfidfVectorizer to extract feature vectors and apply the fitted
    vectorizer to new test data

    TfidfVectorizer() parameter settings:
    min_df = 2, stop_words="english"

    test_data - the raw data to transform

    Returns:
    transfer_test_data - 2-D ndarray
    '''

    test_data = ['Once again, to not believe in God is different than saying\n>I BELIEVE that God does not exist. I still maintain the position, even\n>after reading the FAQs, that strong atheism requires faith.\n>\n \nNo it in the way it is usually used. In my view, you are saying here that\ndriving a car requires faith that the car drives.\n \nFor me it is a conclusion, and I have no more faith in it than I have in the\npremises and the argument used.\n \n \n>But first let me say the following.\n>We might have a language problem here - in regards to "faith" and\n>"existence". I, as a Christian, maintain that God does not exist.\n>To exist means to have being in space and time. God does not HAVE\n>being - God IS Being. Kierkegaard once said that God does not\n>exist, He is eternal. With this said, I feel it\'s rather pointless\n>to debate the so called "existence" of God - and that is not what\n>I\'m doing here. I believe that God is the source and ground of\n>being. When you say that "god does not exist", I also accept this\n>statement - but we obviously mean two different things by it. However,\n>in what follows I will use the phrase "the existence of God" in it\'s\n>\'usual sense\' - and this is the sense that I think you are using it.\n>I would like a clarification upon what you mean by "the existence of\n>God".\n>\n \nNo, that\'s a word game. The term god is used in a different way usually.\nWhen you use a different definition it is your thing, but until it is\ncommonly accepted you would have to say the way I define god is ... 
and\nthat does not exist, it is existence itself, so I say it does not exist.\n \nInterestingly, there are those who say that "existence exists" is one of\nthe indubitable statements possible.\n \nFurther, saying god is existence is either a waste of time, existence is\nalready used and there is no need to replace it by god, or you are implying\nmore with it, in which case your definition and your argument so far\nare incomplete, making it a fallacy.\n \n \n(Deletion)\n>One can never prove that God does or does not exist. When you say\n>that you believe God does not exist, and that this is an opinion\n>"based upon observation", I will have to ask "what observtions are\n>you refering to?" There are NO observations - pro or con - that\n>are valid here in establishing a POSITIVE belief.\n(Deletion)\n \nWhere does that follow? Aren\'t observations based on the assumption\nthat something exists?\n \nAnd wouldn\'t you say there is a level of definition that the assumption\n"god is" is meaningful. If not, I would reject that concept anyway.\n \nSo, where is your evidence for that "god is" is meaningful at some level?\n   Benedikt\n']
    transfer_test_data = None

    #   Add your implementation here   #
    # ********** Begin *********#
    # fit TF-IDF on the training corpus, then transform the new documents
    tfidf_vector = TfidfVectorizer(min_df=2, stop_words="english")
    tfidf_vector.fit(X)
    transfer_test_data = tfidf_vector.transform(test_data).toarray()
    # ********** End **********#

    return transfer_test_data
