Lecture 7: Text Mining Notes

Basics of text mining

Process of text mining:

  1. Text Pre-processing
  2. Text Transformation
  3. Feature Selection
  4. Data Mining
  5. Evaluation
  6. Applications
  • Text representation
    • Set of Words
    • Bag of Words
    • Vector Space Model
    • Topic Models
    • Word Embedding
Text mining tasks

• Classification
  – Document categorization
  – Sentiment analysis
• Clustering Analysis
  – Text clustering
• Natural Language Processing Tasks

Applications of text mining
  • Sentiment Analysis
  • Financial Market Prediction
  • Recommendation
Challenges in text mining

– Data is not well-organized
– Natural language is ambiguous
– Annotated training examples are expensive to acquire

Text preprocessing


  1. Tokenization:
    Break a stream of text into meaningful units (tokens).
  2. Normalization:
    Convert all text to the same case (upper or lower), remove numbers, remove punctuation.
  • Stemming/Lemmatization
    • Reduce inflected or derived words to the root form
    • Plurals, adverbs, inflected word forms: ladies => lady, referring => refer, forgotten => forget
    • Solutions (for English):
      – Porter Stemmer: patterns of vowel-consonant sequences
      – Krovetz Stemmer: morphological rules
    • Risk: may lose the precise meaning of the word, e.g. ground => grind
  • Normalization - Stopwords
    • Remove words that carry little content on their own (e.g. "the", "a", "of")
    • Risk: may break the original meaning and structure of the text

A minimal end-to-end sketch of these preprocessing steps is given below.
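The sketch uses only the Python standard library; the tokenizer, stopword list, and suffix-stripping rules are simplified stand-ins for illustration, not the actual Porter or Krovetz algorithms:

```python
import re

# Tiny illustrative stopword list; real systems use much larger ones.
STOPWORDS = {"the", "a", "an", "of", "to", "is", "are", "and", "in", "on"}

def tokenize(text):
    """Break a stream of text into word tokens (drops numbers and punctuation)."""
    return re.findall(r"[A-Za-z]+", text)

def normalize(tokens):
    """Convert all tokens to the same (lower) case."""
    return [t.lower() for t in tokens]

def crude_stem(token):
    """Toy suffix stripping, standing in for a real stemmer."""
    if token.endswith("ies"):
        return token[:-3] + "y"           # ladies -> lady
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]  # referring -> referr (stems need not be words)
    return token

def preprocess(text):
    tokens = normalize(tokenize(text))
    return [crude_stem(t) for t in tokens if t not in STOPWORDS]

print(preprocess("The ladies are referring to the 3 forgotten books."))
# ['lady', 'referr', 'forgotten', 'book']
```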

Text representation

  • Set of Words & Bag of Words (a small sketch follows this list)
  • Vector Space Model
    – Term Frequency - Inverse Document Frequency (TF-IDF)
  • Topic Models
    – Latent Dirichlet Allocation
    – ……
  • Word Embedding
    – Word2vec
    – ……
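To make the first two representations concrete, here is a small sketch (the two example documents are made up):

```python
from collections import Counter

docs = ["the cat sat on the mat", "the dog sat"]

# Set of Words: records only whether each term occurs (binary presence).
sets_of_words = [set(d.split()) for d in docs]

# Bag of Words: additionally records how often each term occurs.
bags_of_words = [Counter(d.split()) for d in docs]

print(sets_of_words[0])  # {'the', 'cat', 'sat', 'on', 'mat'} (order may vary)
print(bags_of_words[0])  # Counter({'the': 2, 'cat': 1, 'sat': 1, 'on': 1, 'mat': 1})
```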
VECTOR SPACE MODEL

Represent texts by vectors

  • Each dimension corresponds to a meaningful unit (e.g. a term)
    • Orthogonal:
      – Linearly independent basis vectors
      – No ambiguity
  • Element of each vector is the weight (importance) of the unit
    • Two basic heuristics to assign weights:
      • TF (Term Frequency): within-document frequency
      • IDF (Inverse Document Frequency)

TF (Term Frequency)
Idea: a term is more important if it occurs more frequently in a document.

  • Raw TF: $tf(t,d)=c(t,d)$, the frequency count of term t in doc d. Not accurate by itself, since it is affected by document length.
  • Normalize by the total number of words in the document:
    $tf(t,d)=\frac{c(t,d)}{\sum_t c(t,d)}$
  • Normalize by the most frequent word in the document:
    $tf(t,d)=\alpha+(1-\alpha)\frac{c(t,d)}{\max_t c(t,d)}, \ \text{if}\ c(t,d)>0$

A small numeric comparison of these variants follows.
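A minimal sketch comparing the three TF variants (the document and the value of α are chosen arbitrarily for illustration):

```python
from collections import Counter

doc = "the cat sat on the mat the cat".split()
counts = Counter(doc)             # c(t, d) for each term t
total = sum(counts.values())      # total number of words in d
max_count = max(counts.values())  # count of the most frequent term
alpha = 0.5                       # smoothing constant (illustrative value)

for t, c in counts.items():
    length_norm = c / total                         # normalized by document length
    max_norm = alpha + (1 - alpha) * c / max_count  # normalized by most frequent term
    print(f"{t:>4}: raw={c}, length-norm={length_norm:.3f}, max-norm={max_norm:.3f}")
```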

IDF (Inverse Document Frequency)
Idea: a term is more discriminative if it occurs in fewer documents.
$IDF(t)=1+\log(\frac{N}{df(t)})$, where N is the total number of documents and df(t) is the number of documents containing term t.
TF-IDF
$w(t,d)=TF(t,d)\times IDF(t)$
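Combining the two, a minimal sketch over a toy three-document corpus (raw TF with the IDF formula above; the documents are made up):

```python
import math
from collections import Counter

docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "stock prices fell sharply".split(),
]
N = len(docs)

# df(t): number of documents that contain term t
df = Counter(t for d in docs for t in set(d))

def tf_idf(term, doc):
    tf = doc.count(term)              # raw TF: c(t, d)
    idf = 1 + math.log(N / df[term])  # IDF(t) = 1 + log(N / df(t))
    return tf * idf

# "cat" occurs in 2 of 3 docs; "stock" in only 1, so "stock" weighs more.
print(round(tf_idf("cat", docs[0]), 3))    # 1 * (1 + log(3/2)) ≈ 1.405
print(round(tf_idf("stock", docs[2]), 3))  # 1 * (1 + log(3/1)) ≈ 2.099
```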

The vector space model produces sparse, high-dimensional matrices; the two models below mainly produce low-dimensional, dense representations.

TOPIC MODELS