Chapter 3: Feature Engineering for Text

Bag of X: Turning Natural Text into Flat Vectors

  • Bag of words
    Bag of words is a simple way to represent a document: all of the words in the corpus form a dictionary (vocabulary), and a document is represented as a flat vector of word counts over that dictionary (equivalently, the sum of the one-hot vectors of the words it contains).
    Bag of words simply splits text into individual words and ignores word order, so it is only suitable for coarse-grained text tasks such as document classification or information retrieval.
    Note that the transpose of the bag-of-words matrix is a "bag of documents": each word in the dictionary is represented by the vector of its occurrence counts across the documents.
  • Bag of n-grams
    Bag of n-grams partially mitigates the fact that bag of words captures no semantics, because it preserves some local word order; it treats the symptom rather than the cause, though.
    An n-gram is a sequence of n words. For example:
    Example sentence: "Emma knocked on the door." Its bags of n-grams are:
    bag of bigrams (n = 2): {emma knocked, knocked on, on the, the door};
    bag of trigrams (n = 3): {emma knocked on, knocked on the, on the door}.
    In practice, n is rarely larger than 3.
    A larger n yields a larger feature set and more information, but it also raises the cost of model fitting and of storage. A minimal sketch of both representations follows this list.
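
Here is a minimal sketch (not from the original text) of both representations using scikit-learn's CountVectorizer; the two-document toy corpus is invented for illustration:

from sklearn.feature_extraction.text import CountVectorizer

corpus = ["Emma knocked on the door.",
          "No answer came from the door."]

# Bag of words: one count per vocabulary word, word order discarded.
bow = CountVectorizer()
X_bow = bow.fit_transform(corpus)             # document-term count matrix
print(bow.get_feature_names_out())
print(X_bow.toarray())
print(X_bow.T.toarray())                      # transpose = "bag of documents" view

# Bag of bigrams: ngram_range=(2, 2) counts adjacent word pairs instead.
bigrams = CountVectorizer(ngram_range=(2, 2))
X_bi = bigrams.fit_transform(corpus)
print(bigrams.get_feature_names_out())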

Filtering for cleaner features

Many of the words or word pairs in the features produced by bag-of-X carry no real meaning for the text task; they are noise and should be removed. Several methods for cleaning features are described below:

  • Stopwords
    Remove the stopwords in the feature set according to a stopword list.
    For example, in document classification and information retrieval, prepositions, pronouns, and articles contribute nothing to the task, so a stopword list can be used to drop these words from the feature set.
    A stopword list can be obtained through Python's NLTK:
import nltk
nltk.download('stopwords')

Stopword lists can also be found on the web.
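
Here is a minimal sketch (not from the original text) of applying NLTK's list; the token list is invented for illustration:

from nltk.corpus import stopwords

stop = set(stopwords.words('english'))
tokens = ['emma', 'knocked', 'on', 'the', 'door']
filtered = [t for t in tokens if t not in stop]
print(filtered)   # expected: ['emma', 'knocked', 'door']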

  • Frequency-based filtering
    A typical stopword list can remove general stopwords from the feature set, but it does not cover corpus-specific common words.
    Corpus-specific common words can be detected by counting, for each word, how many documents it appears in (its document frequency). This is frequency-based filtering.
    In practice, frequency-based filtering is often combined with a stopword list so that corpus-specific common words are removed as well.
  • Rare words (possibly misspellings or obscure words)
    When a word appears in only a handful of documents (one or two), it is likely noise rather than useful information.
    Such rare words are not only useless for prediction, they also cause unnecessary computation.
    If 60% or more of the dictionary consists of rare words, the word counts form a heavy-tailed distribution; as discussed in Chapter 2, heavy-tailed distributions hurt prediction.
    In a text task, rare words can be identified and removed through word-count statistics (the number of documents a word appears in). The number of rare words found can even be added as an extra feature, but this additional feature is only available after a full pass over all the documents (we first have to know which words are rare). A small document-frequency sketch follows the stemming example below.
  • Stemming
    In a text task, multiple inflections of the same word (e.g. flower and flowers) may all end up in the dictionary as separate entries. Stemming addresses this problem in NLP tasks.
    Stemming is an NLP task that tries to chop each word down to its basic linguistic word stem form.
    Although stemming avoids treating different inflections of the same word as different words, it has drawbacks: for example, a stemmer may reduce both new and news to new, which is clearly wrong. For this reason, stemming is not always used in practice.
    There are many stemming tools; the Porter stemmer is a common one for English:
>>> import nltk
>>> stemmer = nltk.stem.porter.PorterStemmer()
>>> stemmer.stem('flowers')
u'flower'
>>> stemmer.stem('zeroes')
u'zero'
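
To make frequency-based and rare-word filtering concrete, here is a minimal sketch (not from the original text) that computes document frequencies with the standard library; the toy corpus and the thresholds are invented for illustration:

from collections import Counter

docs = [['emma', 'knocked', 'on', 'the', 'door'],
        ['the', 'door', 'stayed', 'shut'],
        ['emma', 'waited', 'by', 'the', 'door']]

# Document frequency: in how many documents does each word appear at least once?
doc_freq = Counter(word for doc in docs for word in set(doc))

# Keep words that are neither rare (fewer than 2 documents) nor
# corpus-specific common words (present in every document).
kept = sorted(w for w, df in doc_freq.items() if 2 <= df < len(docs))
print(kept)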

Atoms of Meaning: From Words to n-Grams to Phrases

  • Parsing and Tokenization
    When the text arrives as a web page, email, log file, etc., that structure must first be parsed to extract the body text, and only then can the body be split into individual words. The first step is parsing; the second is tokenization.
    Note that, in practice, tokenization is usually applied to sentences rather than whole documents, so a document is first split into sentences (or paragraphs) and then tokenized; word2vec and n-gram pipelines both work this way (a small sketch follows this list).
    Note that, if an English document contains non-ASCII characters, make sure the tokenizer can handle that encoding.
  • Collocation extraction for phrase detection
    1) Collocation: in NLP, the concept of a useful phrase is called a "collocation". A collocation is not necessarily a consecutive sequence; in "Emma knocked on the door", knock ... door forms a collocation. A collocation is more meaningful than the sum of its parts: strong tea means more than strong + tea. In a bag of n-grams, not every n-gram counts as a meaningful collocation.
    2) Collocation extraction methods fall into two broad classes:
    Way 1: predefined collocations.
    Given a fixed list of phrases and idiomatic sayings, we look through the text for any match. This kind of collocation extraction is computationally expensive, but it works. If the corpus is very domain specific and contains esoteric lingo, this might be the preferred method, but the list would require a lot of manual curation and would need to be constantly updated for evolving corpora.
    Way 2: statistical collocation extraction methods.
    2.1 Frequency-based methods
    A simple hack is to look at the most frequently occurring n-grams (by document count) in the corpus. The problem with this approach is that the most frequently occurring ones may not be the most useful ones.
    2.2 Hypothesis testing for collocation extraction
    The key idea is to ask whether two words appear together more often than they would by chance.


    2.3 There is another statistical approach based on pointwise mutual information, but it is very sensitive to rare words, which are always present in real-world text corpora. Hence, it is not commonly used.
    All of these methods start from a filtered list of candidate phrases. Candidate phrases are mostly obtained as n-grams, generally with n <= 3;
    longer phrases are mainly obtained through chunking, described in the next section. A small NLTK sketch of the statistical methods follows below.
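
Here is a minimal sketch (not from the original text) of splitting a document into sentences and then tokenizing each sentence with NLTK; the example document is invented, and depending on your NLTK version you may first need to download the 'punkt' (or 'punkt_tab') resource:

import nltk
nltk.download('punkt')
from nltk.tokenize import sent_tokenize, word_tokenize

document = "Got a letter in the mail last week. The doctor is moving to Arizona."
sentences = sent_tokenize(document)                 # split the document into sentences first
tokens = [word_tokenize(s) for s in sentences]      # then tokenize each sentence into words
print(tokens)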
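
And here is a minimal sketch (not from the original text) of the statistical collocation extraction methods above, using NLTK's collocation utilities; the toy text and the frequency threshold are invented for illustration:

import re
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

text = ("Emma knocked on the door. She knocked on the door again, "
        "but nobody answered, so she knocked on the door once more.")
tokens = re.findall(r"[a-z]+", text.lower())   # crude tokenization for the example

finder = BigramCollocationFinder.from_words(tokens)
finder.apply_freq_filter(2)                    # drop candidate bigrams seen fewer than 2 times

measures = BigramAssocMeasures()
print(finder.nbest(measures.raw_freq, 3))          # 2.1 frequency-based ranking
print(finder.nbest(measures.likelihood_ratio, 3))  # 2.2 hypothesis testing (likelihood ratio)
print(finder.nbest(measures.pmi, 3))               # 2.3 pointwise mutual information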

Chunking and part-of-speech tagging

Chunking uses part-of-speech information to group a sentence's words into longer candidate phrases (for example, noun phrases), and can therefore be used for collocation extraction.

The models that map words to parts of speech are generally language specific. Several open source Python libraries, such as NLTK, spaCy, and TextBlob, have multiple language models available.

After a model has tagged each word with its part of speech, we can use the library's functions to obtain chunks (part-of-speech groupings).

Code example:

>>> import pandas as pd
>>> import json
>>> f = open('data/yelp/v6/yelp_academic_dataset_review.json')
>>> js = []
>>> for i in range(10):
...     js.append(json.loads(f.readline()))
>>> f.close()
>>> review_df = pd.DataFrame(js)
>>> import spacy
>>> nlp = spacy.load('en')   # on spaCy 3.x, load a named model such as 'en_core_web_sm'
>>> doc_df = review_df['text'].apply(nlp)
>>> for token in doc_df[4]:
...     print(token.text, token.pos_, token.tag_)
Got VERB VBP
a DET DT
letter NOUN NN
in ADP IN
the DET DT
mail NOUN NN
>>> print([chunk for chunk in doc_df[4].noun_chunks])
[a letter, the mail, Dr. Goldberg, Arizona, a new position, June, He, I,
a new doctor, NYC, you, a date]

Both n-grams and collocation extraction add some of the original document's structure back into the flat vectors. In the next section we will look at other ways of adding structure back into a flat vector, as well as at tf-idf.
