NLP: Text Representation

Carrie_Lei

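Text representation is a key step in natural language processing (NLP): it converts text into a numeric form that machine learning models can work with. Different representations suit different tasks, such as text classification, sentiment analysis, and machine translation. The common methods are outlined below.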

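1. Bag of Words (BoW)

Definition: Represents a text as a vector of occurrence counts, one per word in the vocabulary. Word order and grammatical structure are ignored.
Pros: Simple and easy to understand; suitable for basic text classification tasks.

Cons: Produces high-dimensional sparse matrices and cannot capture word order or context.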

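from sklearn.feature_extraction.text import CountVectorizer

# Two toy documents to vectorize
documents = ["This is a sample.", "This is another example."]
vectorizer = CountVectorizer()
# Learn the vocabulary and build the sparse document-term count matrix
X = vectorizer.fit_transform(documents)
# Terms in column order; the default tokenizer lowercases text and drops one-character tokens such as "a"
print(vectorizer.get_feature_names_out())
# Dense view of the counts: one row per document, one column per term
print(X.toarray())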
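
To make the "ignores word order" limitation concrete, here is a minimal pure-Python sketch. It is not how scikit-learn implements `CountVectorizer`; the helper `bag_of_words`, the fixed three-word vocabulary, and the whitespace tokenization are all assumptions made for illustration.

```python
from collections import Counter

def bag_of_words(text, vocabulary):
    """Count how often each vocabulary word occurs in text (order is discarded)."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

vocabulary = ["bites", "dog", "man"]  # hypothetical fixed vocabulary
v1 = bag_of_words("dog bites man", vocabulary)
v2 = bag_of_words("man bites dog", vocabulary)
print(v1, v2)    # both are [1, 1, 1]
print(v1 == v2)  # True: the two sentences are indistinguishable
```

Two sentences with opposite meanings map to the same vector, which is exactly the information a count-based representation throws away.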

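2. TF-IDF (Term Frequency-Inverse Document Frequency)

Definition: Builds on the bag-of-words model by weighting each word's count with its term frequency (TF) and inverse document frequency (IDF), so the weight reflects how important the word is to a document relative to the whole corpus.
Pros: Accounts for word importance: words that appear in almost every document are down-weighted, while rare, distinctive words are up-weighted.

Cons: Still ignores word order and context.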

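from sklearn.feature_extraction.text import TfidfVectorizer

# Two toy documents to vectorize
documents = ["This is a sample.", "This is another example."]
vectorizer = TfidfVectorizer()
# Learn the vocabulary and IDF weights, then build the sparse TF-IDF matrix
X = vectorizer.fit_transform(documents)
# Terms in column order
print(vectorizer.get_feature_names_out())
# One row per document; by default each row is L2-normalized
print(X.toarray())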
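
The weighting itself can be sketched by hand. The version below assumes the smoothed formula idf(t) = ln((1 + n) / (1 + df(t))) + 1 followed by L2 normalization of each row, which is what scikit-learn documents as its default behaviour. The function name `tfidf` and the whitespace tokenization are illustrative assumptions, so the numbers will not match `TfidfVectorizer` on text that contains punctuation or one-character words.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return (vocabulary, rows) using smoothed IDF and L2-normalized rows."""
    tokenized = [doc.lower().split() for doc in docs]
    vocabulary = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocabulary}
    rows = []
    for toks in tokenized:
        tf = Counter(toks)
        weights = [tf[t] * idf[t] for t in vocabulary]
        # L2-normalize the row; guard against an all-zero row.
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        rows.append([w / norm for w in weights])
    return vocabulary, rows

vocabulary, rows = tfidf(["this is sample", "this is another example"])
for term, weight in zip(vocabulary, rows[0]):
    print(f"{term:8s} {weight:.3f}")
```

In the first document, "sample" occurs in only one of the two documents and therefore gets a larger weight than "this" or "is", which occur in both.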

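3. Word Embeddings

Definition: Represents each word as a dense vector that captures its semantics, learned from the contexts in which the word appears. Common methods include Word2Vec, GloVe, and FastText.
Pros: Captures semantic relationships between words; the vectors can be fed into downstream deep learning models.

Cons: Requires a pretrained model or a large corpus for training. Each word also gets a single fixed vector, regardless of the sentence it appears in.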

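from gensim.models import Word2Vec

# Word2Vec expects pre-tokenized sentences (lists of tokens)
sentences = [["this", "is", "a", "sample"], ["this", "is", "another", "example"]]
# sg=0 selects CBOW (sg=1 would select skip-gram); min_count=1 keeps every word of this tiny corpus
model = Word2Vec(sentences, vector_size=50, window=5, min_count=1, sg=0)
# Look up the 50-dimensional vector learned for "sample"
vector = model.wv['sample']
print(vector)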
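
Semantic relatedness between word vectors is usually measured with cosine similarity. The sketch below is a minimal pure-Python version; the three-dimensional vectors in `toy_vectors` are hypothetical values invented for illustration, not output from any trained model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 3-dimensional vectors, invented purely for illustration;
# real Word2Vec vectors come from training and have many more dimensions.
toy_vectors = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.2, 0.1],
    "car": [0.0, 0.1, 0.9],
}
print(cosine_similarity(toy_vectors["cat"], toy_vectors["dog"]))  # close to 1
print(cosine_similarity(toy_vectors["cat"], toy_vectors["car"]))  # close to 0
```

With gensim, `model.wv.similarity('sample', 'example')` computes the same quantity on trained vectors, although a two-sentence corpus is far too small for the result to be meaningful.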

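4. Contextual Word Embeddings

Definition: Word vectors are generated dynamically from the surrounding context, so the same word can receive different vectors in different sentences. Common models include BERT, GPT, and ELMo.
Pros: Captures what a word means in each context, handling polysemy and context dependence better than static embeddings.

Cons: High compute requirements and complex models.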

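import torch
from transformers import BertTokenizer, BertModel

# Load the pretrained tokenizer and encoder (downloaded on first use)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

text = "Text representation using BERT."
# Tokenize into PyTorch tensors (input_ids, attention_mask, ...)
inputs = tokenizer(text, return_tensors='pt')
# Inference only, so skip gradient tracking
with torch.no_grad():
    outputs = model(**inputs)
# One 768-dimensional vector per token: shape (1, sequence_length, 768)
last_hidden_state = outputs.last_hidden_state
print(last_hidden_state)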
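
`last_hidden_state` holds one vector per token, so a further step is needed when a single fixed-size vector per text is wanted. One common option is mean pooling over the non-padding tokens. The sketch below shows the idea on plain Python lists; the helper `mean_pool` and the two-dimensional toy vectors are assumptions for illustration, and in the snippet above `inputs['attention_mask']` would play the role of the mask.

```python
def mean_pool(token_vectors, attention_mask):
    """Average token vectors where the mask is 1, ignoring padding positions."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]

# Hypothetical 2-dimensional token vectors for one sentence; the last row is padding.
token_vectors = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
attention_mask = [1, 1, 0]
print(mean_pool(token_vectors, attention_mask))  # [2.0, 3.0]
```

Masking matters when sentences are batched: without it, padding vectors would drag the average away from the real tokens.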

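5. Sentence Embeddings

Definition: Represents an entire sentence as a single vector that captures its overall meaning. Common methods include InferSent and Sentence-BERT.
Pros: Well suited to sentence-level tasks such as sentence similarity and text matching.

Cons: Requires training on sentence-level objectives or relying on a pretrained model.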

from sentence_transformers import SentenceTransformer

# Load a pretrained sentence encoder (downloaded on first use)
model = SentenceTransformer('paraphrase-MiniLM-L6-v2')
sentences = ["This is a sample sentence.", "This is another example sentence."]
# encode() returns one fixed-size vector per sentence
embeddings = model.encode(sentences)
print(embeddings)
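
For text matching, sentence vectors are typically compared against a query vector and ranked. The following is a minimal sketch of that ranking step in pure Python; `normalize`, `rank_by_similarity`, and the three-dimensional vectors are hypothetical and stand in for real embeddings such as those returned by `model.encode`.

```python
import math

def normalize(vec):
    """Scale a vector to unit length so that a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def rank_by_similarity(query_vec, candidate_vecs):
    """Return candidate indices sorted from most to least similar to the query."""
    q = normalize(query_vec)
    scores = [sum(a * b for a, b in zip(q, normalize(c))) for c in candidate_vecs]
    return sorted(range(len(candidate_vecs)), key=lambda i: scores[i], reverse=True)

# Hypothetical 3-dimensional sentence embeddings, invented for illustration.
query = [0.2, 0.9, 0.1]
candidates = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.2], [0.0, 0.1, 0.9]]
print(rank_by_similarity(query, candidates))  # [1, 0, 2]
```

Normalizing once and then using dot products is a design choice that pays off when the same candidates are compared against many queries. The sentence-transformers package also ships `util.cos_sim` for computing cosine scores directly on real embeddings.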

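6. Topic Modeling

Definition: Represents a text through the latent topics it contains. A common method is LDA (Latent Dirichlet Allocation).
Pros: Uncovers latent topics in a corpus; suitable for topic analysis of text.

Cons: Choosing model parameters such as the number of topics is difficult, and training may require a large amount of data.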

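from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

documents = ["This is a sample document.", "This document is another example."]
# LDA works on raw term counts, so vectorize with CountVectorizer first
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(documents)
# Fit a 2-topic model; random_state makes the result reproducible
lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(X)
# Topic-word weights, shape (n_topics, n_vocabulary_terms)
topics = lda.components_
print(topics)
# Per-document topic proportions: the vector that represents each text
print(lda.transform(X))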
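
The rows of `lda.components_` are unnormalized topic-word weights, so they are easier to read once each row is turned into a probability distribution and sorted. The sketch below does this in pure Python; `top_words` is a hypothetical helper, and the `weights` matrix is invented for illustration rather than taken from a fitted model.

```python
def top_words(topic_word_weights, feature_names, n_top=3):
    """For each topic, normalize weights to probabilities and return the top words."""
    result = []
    for weights in topic_word_weights:
        total = sum(weights)
        probs = [w / total for w in weights]
        ranked = sorted(zip(feature_names, probs), key=lambda p: p[1], reverse=True)
        result.append(ranked[:n_top])
    return result

# Hypothetical weights shaped like lda.components_ (topics x vocabulary terms).
feature_names = ["another", "document", "example", "is", "sample", "this"]
weights = [
    [0.5, 2.0, 0.5, 1.0, 1.5, 1.0],
    [1.5, 0.6, 1.4, 1.0, 0.5, 1.0],
]
for k, words in enumerate(top_words(weights, feature_names, n_top=2)):
    print(k, [(w, round(p, 2)) for w, p in words])
```

With a fitted model, `lda.components_` and `vectorizer.get_feature_names_out()` could be passed in place of the toy inputs.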