I've always had doubts about this. In most of the examples online, people just create the model directly and that's it, for example here: https://www.jianshu.com/p/5f04e97d1b27
The example it gives:
from gensim.models import word2vec
import time
start = time.perf_counter()  # time.clock() was removed in Python 3.8
model = word2vec.Word2Vec(train_content, size=200)  # gensim < 4.0 API; 4.x renamed size to vector_size
end = time.perf_counter()
print('Running time: %s Seconds' % (end - start))
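Since this post times two separate stages the same way, a small reusable timer keeps that boilerplate in one place. A stdlib-only sketch (the `timed` helper name is mine, not from any of the linked posts):

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label):
    # perf_counter() is a monotonic clock suited for measuring elapsed time
    start = time.perf_counter()
    yield
    elapsed = time.perf_counter() - start
    print('%s: %.3f seconds' % (label, elapsed))

# usage: wrap any stage you want to measure
with timed('sleep demo'):
    time.sleep(0.1)
```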
This puzzled me, because when I used Doc2Vec before, I had to call train() separately. But the examples online basically all skip that train step, for example these:
https://zhuanlan.zhihu.com/p/141136987
https://blog.youkuaiyun.com/baimafujinji/article/details/77836142
https://blog.youkuaiyun.com/ljz2016/article/details/103767689
https://www.cnblogs.com/hziwei/p/13533888.html
Honestly, it feels like people don't write their blog posts very carefully! According to the official documentation: https://radimrehurek.com/gensim/models/word2vec.html#usage-examples
the train step should be there! Here is an example of my own:
import gensim
import pickle
import datetime
print('Start reading the corpus')
sentences=gensim.models.word2vec.LineSentence('SymTxt_Doc2Vec.txt')  # the corpus file
print('Start building the model')
time_1=datetime.datetime.now()
model=gensim.models.word2vec.Word2Vec(sentences, hs=1, vector_size=256)
time_2=datetime.datetime.now()
print("Total elapsed time for building the model (s): "+str((time_2-time_1).total_seconds()))
test=model.wv['invoke-direct']  # a word from my corpus
print(test)
print('Start Training')
time_1=datetime.datetime.now()
model.train(corpus_iterable=sentences,total_examples=model.corpus_count,epochs=100)
time_2=datetime.datetime.now()
print("Total elapsed time for training (s): "+str((time_2-time_1).total_seconds()))
with open('Word2Vec_Model.pkl','wb') as f_model:  # a with-block ensures the file is closed
    pickle.dump(model, f_model, protocol=4)
test=model.wv['invoke-direct']  # note that training takes longer than building the model, and this word's embedding changed after training
print(test)
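To quantify how much an embedding moved during the extra training, you could compare the before/after vectors with cosine similarity. A stdlib-only sketch with made-up placeholder vectors (not real gensim output):

```python
import math

def cosine_similarity(a, b):
    # cosine of the angle between two vectors; 1.0 means identical direction
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# placeholder vectors standing in for model.wv['invoke-direct'] before/after train()
before = [0.1, 0.3, -0.2, 0.5]
after = [0.2, 0.25, -0.1, 0.6]
print(cosine_similarity(before, after))
```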
I have to complain: so many of the examples online are just unreliable!!
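One more detail worth checking when saving the model with pickle: that the file round-trips correctly on reload. A stdlib sketch with a plain dict standing in for the gensim model object:

```python
import pickle

# a plain dict stands in for the gensim model object
model_stub = {'invoke-direct': [0.1, 0.2, 0.3]}

with open('Word2Vec_Model.pkl', 'wb') as f_model:
    pickle.dump(model_stub, f_model, protocol=4)

with open('Word2Vec_Model.pkl', 'rb') as f_model:
    restored = pickle.load(f_model)

print(restored == model_stub)  # True if the round-trip preserved the object
```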