TensorFlow Word Embeddings: A Deep Dive into Word2Vec, GloVe, and FastText

一碗黄焖鸡三碗米饭

Published 2025-03-21 09:54:29

Category: AI Frontiers and Practice. Tags: tensorflow, word2vec, artificial intelligence, python, deep learning, nlp

Article link: https://blog.youkuaiyun.com/sjdgehi/article/details/146413300

1. Introduction to Word Embeddings

2. Word2Vec

2.1 How Word2Vec Works

2.2 Implementing Word2Vec in TensorFlow

2.3 Word2Vec Summary

3. GloVe (Global Vectors for Word Representation)

3.1 How GloVe Works

3.2 Implementing GloVe in TensorFlow

3.3 GloVe Summary

4. FastText

4.1 How FastText Works

4.2 Implementing FastText in TensorFlow

4.3 FastText Summary

5. Comparing Word2Vec, GloVe, and FastText

6. Summary

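In natural language processing (NLP), word embeddings are a technique for converting words into vector representations that a computer can work with. With these vectors, a model can capture semantic relationships between words. TensorFlow is a widely used deep learning framework that provides a rich set of tools for implementing word embedding models. This article takes a close look at three common word embedding techniques, Word2Vec, GloVe, and FastText, and implements each of them in TensorFlow.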

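1. Introduction to Word Embeddings

A word embedding represents each word as a fixed-dimensional vector, so that words with similar meanings lie close together in the vector space. Common word embedding techniques include:

Word2Vec: learns word vectors by predicting context words.
GloVe: learns word vectors from global word co-occurrence statistics.
FastText: also models the subwords (character n-grams) inside each word, which lets it build vectors for rare and out-of-vocabulary words.

Each technique has its own training method, strengths and weaknesses, and typical use cases.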
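
To make "close together in the vector space" concrete, here is a minimal sketch that compares vectors with cosine similarity, a common choice of closeness measure for embeddings. The three-dimensional vectors and the `cosine_similarity` helper are invented for illustration; real embeddings are learned from data and usually have tens to hundreds of dimensions.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; values near 1.0 mean similar direction."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical toy embeddings, made up for this example only.
toy_vectors = {
    "king":  [0.90, 0.80, 0.10],
    "queen": [0.85, 0.90, 0.15],
    "apple": [0.10, 0.20, 0.95],
}

sim_king_queen = cosine_similarity(toy_vectors["king"], toy_vectors["queen"])
sim_king_apple = cosine_similarity(toy_vectors["king"], toy_vectors["apple"])

print(f"king vs queen: {sim_king_queen:.3f}")
print(f"king vs apple: {sim_king_apple:.3f}")
```

With these made-up numbers, "king" scores much higher against "queen" than against "apple", which is the kind of relationship a trained embedding is expected to show.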

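2. Word2Vec

2.1 How Word2Vec Works

Word2Vec is a neural-network-based algorithm: it trains a shallow neural network to learn relationships between words. It has two common training schemes:

Skip-Gram: given a center word, predict the context words around it.
CBOW (Continuous Bag of Words): given the surrounding context words, predict the center word.

Skip-Gram tends to work well on small corpora and represents rare words better, whereas CBOW trains faster and is generally a better fit for large corpora.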
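
The difference between the two schemes is easiest to see in the training examples each one produces. The sketch below is plain Python and independent of TensorFlow; the helper names `skipgram_pairs` and `cbow_examples` are hypothetical and chosen for this illustration.

```python
def skipgram_pairs(tokens, window_size):
    """Return (center, context) pairs: one training example per context word."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window_size)
        end = min(len(tokens), i + window_size + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def cbow_examples(tokens, window_size):
    """Return (context_words, center) examples: the whole window predicts one word."""
    examples = []
    for i, center in enumerate(tokens):
        start = max(0, i - window_size)
        end = min(len(tokens), i + window_size + 1)
        context = [tokens[j] for j in range(start, end) if j != i]
        examples.append((context, center))
    return examples

tokens = "i love machine learning".split()
sg = skipgram_pairs(tokens, window_size=1)
cb = cbow_examples(tokens, window_size=1)

for center, context in sg:
    print("skip-gram:", center, "->", context)
for context, center in cb:
    print("cbow:", context, "->", center)
```

Skip-Gram yields one example per (center, context) pair, so each occurrence of a rare word contributes several updates. CBOW combines the whole window into a single example per position, which means fewer examples to process for the same text.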

2.2 Implementing Word2Vec in TensorFlow

Word2Vec can be implemented in TensorFlow with the following steps:

import tensorflow as tf
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

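# Input corpus
sentences = ["I love machine learning", "Deep learning is amazing", "I love coding with TensorFlow"]

# Fit a Tokenizer to build the vocabulary (word_index starts at 1)
tokenizer = Tokenizer()
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index

# Convert each sentence to a sequence of integer ids, padded with 0 to equal length
sequences = tokenizer.texts_to_sequences(sentences)
sequences = pad_sequences(sequences, padding='post')

# Build Skip-Gram training pairs (center word, context word)
window_size = 2  # context window size on each side
X_train, y_train = [], []

for sequence in sequences:
    for i in range(len(sequence)):
        if sequence[i] == 0:  # skip padding positions
            continue
        for j in range(i - window_size, i + window_size + 1):
            # keep in-range neighbours that are neither the center nor padding
            if 0 <= j < len(sequence) and j != i and sequence[j] != 0:
                X_train.append(sequence[i])
                y_train.append(sequence[j])

X_train = np.array(X_train)
y_train = np.array(y_train)

# Build the Word2Vec model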
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=len(word_index)