Lecture 7: Text Mining Notes

Basics of text mining

Process of text mining:

  1. Text Pre-processing
  2. Text Transformation
  3. Feature Selection
  4. Data Mining
  5. Evaluation
  6. Applications
  • Text representation
    • Set of Words
    • Bag of Words
    • Vector Space Model
    • Topic Models
    • Word Embedding
Text mining tasks

• Classification
  – Document categorization
  – Sentiment analysis
• Clustering Analysis
  – Text clustering
• Natural Language Processing Tasks

Applications of text mining
  • Sentiment Analysis
  • Financial Market Prediction
  • Recommendation
Challenges in text mining

– Data is not well-organized
– Natural language is ambiguous
– Annotated training examples are expensive to acquire

Text preprocessing


  1. Tokenization:
    Break a stream of text into meaningful units (tokens).
  2. Normalization:
    Convert all text to the same case (upper or lower), remove numbers, remove punctuation.
  • Stemming/Lemmatization
    • Reduce inflected or derived words to the root form
    • Plurals, adverbs, inflected word forms: ladies => lady, referring => refer, forgotten => forget
    • Solutions (for English):
      – Porter Stemmer: patterns of vowel-consonant sequences
      – Krovetz Stemmer: morphological rules
    • Risk: may lose the precise meaning of the word, e.g. ground => grind
  • Normalization - Stopwords
    • Remove words that carry little content on their own (e.g. "the", "a", "of")
    • Risk: may break the original meaning and structure of the text

A minimal end-to-end sketch of these preprocessing steps is given below.
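The sketch uses only the Python standard library; the tokenizer, stopword list, and suffix-stripping rules are simplified stand-ins for illustration, not the actual Porter or Krovetz algorithms:

```python
import re

# Tiny illustrative stopword list; real systems use much larger ones.
STOPWORDS = {"the", "a", "an", "of", "to", "is", "are", "and", "in", "on"}

def tokenize(text):
    """Break a stream of text into word tokens (drops numbers and punctuation)."""
    return re.findall(r"[A-Za-z]+", text)

def normalize(tokens):
    """Convert all tokens to the same (lower) case."""
    return [t.lower() for t in tokens]

def crude_stem(token):
    """Toy suffix stripping, standing in for a real stemmer."""
    if token.endswith("ies"):
        return token[:-3] + "y"           # ladies -> lady
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]  # referring -> referr (stems need not be words)
    return token

def preprocess(text):
    tokens = normalize(tokenize(text))
    return [crude_stem(t) for t in tokens if t not in STOPWORDS]

print(preprocess("The ladies are referring to the 3 forgotten books."))
# ['lady', 'referr', 'forgotten', 'book']
```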

Text representation

  • Set of Words & Bag of Words (a small sketch follows this list)
  • Vector Space Model
    – Term Frequency - Inverse Document Frequency (TF-IDF)
  • Topic Models
    – Latent Dirichlet Allocation
    – ……
  • Word Embedding
    – Word2vec
    – ……
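To make the first two representations concrete, here is a small sketch (the two example documents are made up):

```python
from collections import Counter

docs = ["the cat sat on the mat", "the dog sat"]

# Set of Words: records only whether each term occurs (binary presence).
sets_of_words = [set(d.split()) for d in docs]

# Bag of Words: additionally records how often each term occurs.
bags_of_words = [Counter(d.split()) for d in docs]

print(sets_of_words[0])  # {'the', 'cat', 'sat', 'on', 'mat'} (order may vary)
print(bags_of_words[0])  # Counter({'the': 2, 'cat': 1, 'sat': 1, 'on': 1, 'mat': 1})
```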
VECTOR SPACE MODEL

Represent texts by vectors

  • Each dimension corresponds to a meaningful unit (e.g. a term)
    • Orthogonal:
      – Linearly independent basis vectors
      – No ambiguity
  • Element of each vector is the weight (importance) of the unit
    • Two basic heuristics to assign weights:
      • TF (Term Frequency): within-document frequency
      • IDF (Inverse Document Frequency)

TF (Term Frequency)
Idea: a term is more important if it occurs more frequently in a document.

  • Raw TF: $tf(t,d)=c(t,d)$, the frequency count of term t in doc d. Not accurate by itself, since it is affected by document length.
  • Normalize by the total number of words in the document:
    $tf(t,d)=\frac{c(t,d)}{\sum_t c(t,d)}$
  • Normalize by the most frequent word in the document:
    $tf(t,d)=\alpha+(1-\alpha)\frac{c(t,d)}{\max_t c(t,d)}, \ \text{if}\ c(t,d)>0$

A small numeric comparison of these variants follows.
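A minimal sketch comparing the three TF variants (the document and the value of α are chosen arbitrarily for illustration):

```python
from collections import Counter

doc = "the cat sat on the mat the cat".split()
counts = Counter(doc)             # c(t, d) for each term t
total = sum(counts.values())      # total number of words in d
max_count = max(counts.values())  # count of the most frequent term
alpha = 0.5                       # smoothing constant (illustrative value)

for t, c in counts.items():
    length_norm = c / total                         # normalized by document length
    max_norm = alpha + (1 - alpha) * c / max_count  # normalized by most frequent term
    print(f"{t:>4}: raw={c}, length-norm={length_norm:.3f}, max-norm={max_norm:.3f}")
```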

IDF (Inverse Document Frequency)
Idea: a term is more discriminative if it occurs in fewer documents.
$IDF(t)=1+\log(\frac{N}{df(t)})$, where N is the total number of documents and df(t) is the number of documents containing term t.
TF-IDF
$w(t,d)=TF(t,d)\times IDF(t)$
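Combining the two, a minimal sketch over a toy three-document corpus (raw TF with the IDF formula above; the documents are made up):

```python
import math
from collections import Counter

docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "stock prices fell sharply".split(),
]
N = len(docs)

# df(t): number of documents that contain term t
df = Counter(t for d in docs for t in set(d))

def tf_idf(term, doc):
    tf = doc.count(term)              # raw TF: c(t, d)
    idf = 1 + math.log(N / df[term])  # IDF(t) = 1 + log(N / df(t))
    return tf * idf

# "cat" occurs in 2 of 3 docs; "stock" in only 1, so "stock" weighs more.
print(round(tf_idf("cat", docs[0]), 3))    # 1 * (1 + log(3/2)) ≈ 1.405
print(round(tf_idf("stock", docs[2]), 3))  # 1 * (1 + log(3/1)) ≈ 2.099
```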

The vector space model produces sparse, high-dimensional matrices; the two models below mainly produce low-dimensional, dense representations.

TOPIC MODELS