Week 1: text preprocessing -> feature extraction

License: CC 4.0 BY-SA

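This post covers preprocessing techniques for text classification, such as tokenization and normalization, and explains how to turn text into feature vectors, including the bag-of-words model, its problems, and how n-grams bring word order back into account.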

This week, we focus on text classification:

text preprocessing
- Tokenization
- Normalization
transforming tokens into features / text to text vector

Tip: text classification can be used for sentiment analysis.

text preprocessing

Tokenization

How we process text depends on what we treat it as.
Text can be viewed as a sequence of:

characters
words
phrases and named entities
sentences
paragraphs

Here, we treat text as a sequence of words, because a word is a meaningful sequence of characters.
Therefore, we need to extract all the words from a sentence. This process is called tokenization. So where are the boundaries between words?
Here, we mainly talk about English.
In English, we can split a sentence on spaces or punctuation.
Three tokenizers are built into the Python NLTK library:

whitespace tokenizer (WhitespaceTokenizer)
punctuation tokenizer (WordPunctTokenizer)
Treebank word tokenizer (TreebankWordTokenizer)

examples
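
The original examples did not survive on this page, so here is a minimal sketch of the first two strategies using only the standard library. It is an approximation, not NLTK itself: the function names and the sample sentence are assumptions made for illustration. The Treebank tokenizer relies on English-specific rules (for contractions, for instance) and is not reproduced here.

```python
import re

def whitespace_tokenize(text):
    # Split on runs of whitespace only; punctuation stays glued to words.
    return text.split()

def punctuation_tokenize(text):
    # Keep runs of word characters and runs of punctuation as separate tokens.
    return re.findall(r"\w+|[^\w\s]+", text)

sentence = "This is Andrew's text, isn't it?"  # illustrative sentence

print(whitespace_tokenize(sentence))
# ['This', 'is', "Andrew's", 'text,', "isn't", 'it?']

print(punctuation_tokenize(sentence))
# ['This', 'is', 'Andrew', "'", 's', 'text', ',', 'isn', "'", 't', 'it', '?']
```

Neither result is ideal: the whitespace version treats "it?" and "it" as different tokens, while the punctuation version breaks "isn't" into pieces that carry little meaning on their own.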

Normalization

stemming
lemmatization

examples
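
The original examples are missing here as well, so the following is a toy sketch of the difference between the two approaches. The stemmer implements only the plural-suffix rules from step 1a of Porter's algorithm, not the full stemmer, and the lemmatizer is a hypothetical three-entry dictionary standing in for a real lexical database such as WordNet.

```python
def stem_step_1a(word):
    # Step 1a suffix rules of Porter's stemmer; the first (longest) match wins.
    rules = [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]
    for suffix, replacement in rules:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word

# Hypothetical mini-dictionary; a real lemmatizer looks words up in a lexicon.
TOY_LEMMAS = {"feet": "foot", "wolves": "wolf", "cats": "cat"}

def lemmatize(word):
    # Return the dictionary form if known, otherwise leave the word unchanged.
    return TOY_LEMMAS.get(word, word)

words = ["feet", "cats", "wolves", "caresses", "ponies"]

print([stem_step_1a(w) for w in words])
# ['feet', 'cat', 'wolve', 'caress', 'poni']

print([lemmatize(w) for w in words])
# ['foot', 'cat', 'wolf', 'caresses', 'ponies']
```

Stemming chops suffixes by rule, so it can miss irregular forms ("feet") and produce non-words ("poni"). Lemmatization returns real dictionary forms, but only for words it knows.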

transforming tokens into features / text to text vector

bag of words

Count the occurrences of each token in our text.

problems:

we lose word order
counters are not normalized
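
A small sketch of both problems, using made-up sentences and a hypothetical helper `bag_of_words`; the vocabulary is simply the sorted set of tokens seen.

```python
from collections import Counter

def bag_of_words(tokens, vocabulary):
    # One counter per vocabulary term, in vocabulary order.
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]

doc_a = "dog bites man".split()
doc_b = "man bites dog".split()
doc_c = "dog bites man dog bites man".split()  # doc_a repeated twice

vocabulary = sorted(set(doc_a) | set(doc_b) | set(doc_c))
print(vocabulary)                       # ['bites', 'dog', 'man']

# Problem 1: different word order, identical vectors.
print(bag_of_words(doc_a, vocabulary))  # [1, 1, 1]
print(bag_of_words(doc_b, vocabulary))  # [1, 1, 1]

# Problem 2: raw counts grow with document length.
print(bag_of_words(doc_c, vocabulary))  # [2, 2, 2]
```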

So, to preserve some word order, we also count token pairs, triplets, etc. These are called n-grams.
The drawback is that there are now too many features.
To reduce them, we remove some n-grams based on their document frequency (df), the number of documents in our corpus that contain them: we drop n-grams whose df is too high or too low.
After that, all remaining features appear in a moderate number of documents. Next, we look at the values in the feature columns, that is, the term frequency.
Finally, to be more precise, we can use the actual df value of each n-gram, not just the fact that its df is medium.
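
The pipeline described above can be sketched as follows. The corpus, the thresholds `MIN_DF` and `MAX_DF`, and all function names are assumptions chosen for illustration. Term frequency is computed here as the n-gram's count divided by the number of n-grams in the document, which is only one of several common definitions.

```python
from collections import Counter

def ngrams(tokens, n_max=2):
    # All 1-grams up to n_max-grams, joined with spaces.
    grams = []
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

def document_frequency(docs_ngrams):
    # Number of documents in which each n-gram appears at least once.
    df = Counter()
    for grams in docs_ngrams:
        df.update(set(grams))
    return df

def term_frequencies(grams, features):
    # Count of each kept feature, normalized by the document's n-gram count.
    counts = Counter(grams)
    total = len(grams)
    return [counts[f] / total for f in features]

corpus = ["good movie", "not a good movie", "did not like", "i like it", "good one"]
docs = [ngrams(text.split()) for text in corpus]
df = document_frequency(docs)

MIN_DF, MAX_DF = 2, 2  # hypothetical thresholds: drop too-rare and too-common n-grams
features = sorted(g for g, count in df.items() if MIN_DF <= count <= MAX_DF)
print(features)  # ['good movie', 'like', 'movie', 'not']

# Feature vector for "not a good movie" (7 n-grams in total).
print(term_frequencies(docs[1], features))
```

With these thresholds, "good" (df = 3) is dropped as too common, and n-grams such as "did not" (df = 1) are dropped as too rare.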