Text Representation: the Set-of-Words (SOW), Bag-of-Words (BOW), and TF-IDF Models

License: CC 4.0 BY-SA

Tags: #nlp #text-representation #tf-idf #bow #sow

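This article introduces three key methods of text representation: the set-of-words model, the bag-of-words model, and term frequency-inverse document frequency (TF-IDF). It explains in mathematical terms how each method turns text into a numeric feature vector that a computer can process.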

When reposting, please credit the source: http://blog.youkuaiyun.com/Recall_Tomorrow/article/details/79488639
You are welcome to look at the simple implementations of these models…
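Consider a corpus of $m$ documents, $\mathcal C=\{doc_1, doc_2,\cdots,doc_m\}$, and merge all of its tokens into a single lexicon $\mathcal L_{\mathcal C}$. For any document $doc_i$, $i\in\{1,\dots,m\}$, let $\mathcal W_i$ be its tokenization result (assumed to have already gone through preprocessing such as NER, stopword removal, and lemmatization). The text representation of $doc_i$ is then a vector $\mathbf V_i$ with $|\mathbf V_i|=len(\mathcal L_{\mathcal C})$, that is, one component per lexicon entry.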
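
As a minimal sketch of this setup, the snippet below builds a lexicon from a toy corpus. The names `tokenized_docs` and `build_lexicon` are hypothetical, the documents are assumed to be tokenized and preprocessed already, and sorting the lexicon is only one possible way to fix the index $j$ of each token.

```python
# Toy corpus: each inner list plays the role of W_i for one document.
# Tokenization and preprocessing are assumed to have happened already.
tokenized_docs = [
    ["cat", "sat", "mat"],
    ["dog", "sat", "log", "dog"],
]


def build_lexicon(docs):
    """Collect every distinct token in the corpus into a sorted list.

    Sorting is an illustrative choice that gives each token a stable
    position j; any fixed ordering would work equally well.
    """
    return sorted({token for doc in docs for token in doc})


lexicon = build_lexicon(tokenized_docs)
print(lexicon)  # ['cat', 'dog', 'log', 'mat', 'sat']
```

Every document vector built against this lexicon has length `len(lexicon)`, which is 5 here.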

Set-of-Words Model (Set of Words)

For the token list $\mathcal W_i$ of document $doc_i$: if the $j$-th token of the lexicon, $\mathcal L_{\mathcal C}^{(j)}$, appears in $\mathcal W_i$, then the corresponding vector component $\mathbf V_{ij}$ is 1; otherwise it is 0. That is,

$\mathbf V_{ij}=\left\{\begin{array}{lr}1,&\mathcal L_{\mathcal C}^{(j)}\in \mathcal W_i\\ 0, &\text{otherwise}\end{array}\right.,\quad i\in\{1,\dots,m\},\ j\in\{1,\dots,len(\mathcal L_{\mathcal C})\}$
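
The set-of-words rule can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the author's implementation: `set_of_words` is a hypothetical helper, and the small `lexicon` list is made-up example data.

```python
def set_of_words(tokens, lexicon):
    """Return the 0/1 vector V_i for one tokenized document.

    Component j is 1 if the j-th lexicon token occurs in `tokens`,
    and 0 otherwise. How often it occurs is ignored.
    """
    present = set(tokens)  # set lookup keeps each membership test cheap
    return [1 if term in present else 0 for term in lexicon]


# Hypothetical lexicon and document, for illustration only.
lexicon = ["cat", "dog", "log", "mat", "sat"]
sow_vector = set_of_words(["dog", "sat", "log", "dog"], lexicon)
print(sow_vector)  # [0, 1, 1, 0, 1]
```

Note that "dog" occurs twice in the document but still contributes only a 1, which is exactly the information the set-of-words model discards.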

Bag-of-Words Model (Bag of Words)

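For the token list $\mathcal W_i$ of document $doc_i$: if the $j$-th token of the lexicon, $\mathcal L_{\mathcal C}^{(j)}$, appears in $\mathcal W_i$, then the corresponding vector component $\mathbf V_{ij}$ is that token's term frequency in the document, $freq_i(\mathcal L_{\mathcal C}^{(j)})$; otherwise it is 0. That is,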

$\mathbf V_{ij}=\left\{\begin{array}{lr}freq_i(\mathcal L_{\mathcal C}^{(j)}),&\mathcal L_{\mathcal C}^{(j)}\in \mathcal W_i\\ 0, &\text{otherwise}\end{array}\right.,\quad i\in\{1,\dots,m\},\ j\in\{1,\dots,len(\mathcal L_{\mathcal C})\}$
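
A minimal sketch of the bag-of-words rule follows. It assumes that $freq_i$ means the raw occurrence count of a token in the document (a normalized frequency would be another reading), and `bag_of_words` and the example `lexicon` are hypothetical names introduced only for illustration.

```python
from collections import Counter


def bag_of_words(tokens, lexicon):
    """Return the count vector V_i for one tokenized document.

    Component j is the number of times the j-th lexicon token occurs in
    `tokens` (assumption: freq_i is a raw count). Counter returns 0 for
    tokens it has never seen, which covers the "otherwise" branch.
    """
    counts = Counter(tokens)
    return [counts[term] for term in lexicon]


# Hypothetical lexicon and document, for illustration only.
lexicon = ["cat", "dog", "log", "mat", "sat"]
bow_vector = bag_of_words(["dog", "sat", "log", "dog"], lexicon)
print(bow_vector)  # [0, 2, 1, 0, 1]
```

Compared with the set-of-words vector for the same document, the only difference is the component for "dog", which is now 2 instead of 1.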