Word2Vec and Sent2Vec

This post takes a close look at how Word2Vec and Sent2Vec work and how they are implemented, covering the CBOW and Skip-gram models and how they produce vector representations for words and sentences. By reading the source code and deriving the formulas, it explains the weight updates and loss function used during training. Sent2Vec builds sentence representations on top of Word2Vec.

I spent a good amount of time working through the Word2Vec and Sent2Vec code and deriving the formulas; it took a while, but things are much clearer now.

Source code reference: https://github.com/klb3713/sentence2vec

Conceptually there are two phases. First, Word2Vec is run to obtain the word vectors and the network weights. Then Sent2Vec is run on top of those existing word vectors and weights.

Word2Vec

  • Prediction target

The overall goal is: take word vectors as input (for CBOW, the sum of several context words' vectors; for skip-gram, a single word's vector), apply the weights, and predict the target word.

First, a Huffman tree is built, so that each word is represented by a binary code such as [1,0,1,1]. "Predicting the word" above then turns into predicting this code, i.e. a hierarchical softmax. Moreover, the more frequent a word, the shorter its code and the fewer weights its updates touch, which speeds up training.

Next, suppose the longest code has N bits; then there are at most N binary outputs. With word vectors of dimension M, the weight space for one word is an N×M matrix (one M-dimensional weight vector per node on its path), which is what l2a = model.syn1[word.point] extracts in the code: word.point holds the indices of the inner nodes along the word's path.
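To make the encoding concrete, here is a minimal Python sketch of Huffman code assignment. This is my own illustration rather than the repo's exact construction: the helper build_huffman, its dict-based tree nodes, and the toy frequencies are all assumptions.

```python
import heapq
import itertools

def build_huffman(freqs):
    """Return {word: (code, point)}: `code` is the list of branch bits on the
    word's path, `point` the indices of the inner nodes along that path
    (the rows of syn1 that the word's updates would touch)."""
    tiebreak = itertools.count()      # keeps heap comparisons away from the dict nodes
    heap = [(cnt, next(tiebreak), {"word": w}) for w, cnt in freqs.items()]
    heapq.heapify(heap)
    inner_id = itertools.count()      # inner nodes get their own index space
    while len(heap) > 1:
        c1, _, left = heapq.heappop(heap)
        c2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (c1 + c2, next(tiebreak),
                              {"id": next(inner_id), "left": left, "right": right}))
    codes = {}
    def walk(node, code, point):
        if "word" in node:            # leaf: record the finished path
            codes[node["word"]] = (code, point)
        else:
            walk(node["left"], code + [0], point + [node["id"]])
            walk(node["right"], code + [1], point + [node["id"]])
    walk(heap[0][2], [], [])
    return codes

for word, (code, point) in build_huffman({"the": 50, "cat": 10, "sat": 8, "mat": 3}).items():
    print(word, code, point)
```

Running it shows the most frequent word receiving the shortest code, which is exactly why frequent words touch fewer rows of syn1.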

  • CBOW

CBOW works as follows: for the word currently being processed, take all the words within the window before and after it, average their word vectors dimension by dimension, and use the result as the input (l1 in the code).

Then compute the output fa = 1. / (1. + exp(-dot(l1, l2a.T))) and the gradient ga = (1. - word.code - fa) * alpha.

Since we need to learn both the weights W and the word vectors X, there are two updates (a full training step is sketched below).
To learn W: model.syn1[word.point] += outer(ga, l1) (in other words, syn1 represents the W layer).
To learn the word vectors: neu1e += dot(ga, l2a), then model.syn0[word2.index] += neu1e for each context word word2 (in other words, syn0 represents the X layer, i.e. our word vectors).
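Putting the snippets above together, here is a minimal numpy sketch of one CBOW training step under hierarchical softmax. Variable names follow the post; the toy dimensions, indices, and random initialization are assumptions, not the repo's actual setup.

```python
import numpy as np

rng = np.random.default_rng(0)
M = 4                                    # word-vector dimension
vocab_size, inner_nodes = 5, 4
syn0 = (rng.random((vocab_size, M)) - 0.5) / M   # X: the word vectors
syn1 = np.zeros((inner_nodes, M))                # W: hierarchical-softmax weights
alpha = 0.025                            # learning rate

# Target word: Huffman code bits and the inner nodes on its path (toy values).
code = np.array([1, 0, 1])
point = np.array([0, 2, 3])
context = [1, 4]                         # indices of the window words

l1 = syn0[context].mean(axis=0)          # average the context vectors
l2a = syn1[point]                        # (codelen, M) weights on the path
fa = 1.0 / (1.0 + np.exp(-np.dot(l1, l2a.T)))    # sigmoid output per path node
ga = (1.0 - code - fa) * alpha           # gradient times learning rate

syn1[point] += np.outer(ga, l1)          # learn W
neu1e = np.dot(ga, l2a)                  # error propagated back to the input
for w2 in context:
    syn0[w2] += neu1e                    # learn the word vectors X
```

Note that neu1e is accumulated over all path nodes before being added to every context word's vector, matching the order of operations quoted above.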

Why does the gradient ga take this form? A short derivation follows.
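Each inner node $j$ on the Huffman path is a binary classifier. Writing $d_j$ for the code bit, $t_j = 1 - d_j$ for the label, and $\theta_j$ for the corresponding row of syn1 (notation chosen here for the derivation, not taken from the repo), the per-node log-likelihood and its gradient are:

```latex
\begin{aligned}
\mathcal{L}_j &= t_j \log f_j + (1 - t_j)\log(1 - f_j),
  \qquad f_j = \sigma(\theta_j^{\top} l_1),\quad t_j = 1 - d_j,\\
\frac{\partial \mathcal{L}_j}{\partial (\theta_j^{\top} l_1)}
  &= t_j - f_j \;=\; 1 - d_j - f_j .
\end{aligned}
```

Gradient ascent on $\mathcal{L}_j$ with learning rate $\alpha$ therefore uses exactly ga = (1. - word.code - fa) * alpha; multiplying it by l1 yields the syn1 update, and multiplying it by the rows of l2a yields neu1e.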
