If the tokenized corpus file is too large, consider feeding it in piece by piece with an iterator.
If the corpus is stored in a database, an iterator can feed it in row by row.
If the corpus sits in a DataFrame, an iterator can likewise yield it one row at a time (a sketch for the file and DataFrame cases follows the example's output below).
The example below only demonstrates the usage; the corpus was picked at random and is far too small to be realistic:
import gensim
import pandas as pd
text = [["双方", "要", "持续", "深化", "政治", "互信", "和", "利益", "融合"],
        ["在", "涉及", "彼此", "核心", "利益", "和", "重大", "关切", "问题", "上", "相互", "尊重"],
        ["在", "共同", "发展", "的", "道路", "上", "相互", "支持"]]
class sentences:  # an iterable that yields one tokenized sentence at a time
    def __iter__(self):
        for line in text:
            yield line
word2vec = gensim.models.word2vec.Word2Vec(sentences(), size=10, window=10, min_count=1, sg=1, hs=1, iter=10, workers=25)
print("pd.Series(word2vec.most_similar('政治')):\n", pd.Series(word2vec.most_similar('政治')))  # words most similar to "政治"
print("word2vec.wv['政治']:\n", word2vec.wv['政治'])  # the word vector for "政治"
pd.Series(word2vec.most_similar('政治')):
0 (彼此, 0.5328612923622131)
1 (的, 0.5152028799057007)
2 (支持, 0.4853699505329132)
3 (核心, 0.3838141858577728)
4 (互信, 0.3804806172847748)
5 (双方, 0.2851719260215759)
6 (相互, 0.2486085295677185)
7 (深化, 0.2200365960597992)
8 (持续, 0.1906088888645172)
9 (要, 0.13010087609291077)
dtype: object
word2vec.wv['政治']:
[-0.04362014 -0.04979561 -0.04996823 -0.02305548 0.01126006 -0.04688745
0.02839105 0.01941208 0.02390135 -0.04904057]
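For the large-file and DataFrame cases mentioned at the top, the iterator streams the data instead of holding it all in a list. A minimal sketch, assuming the corpus file is already segmented with space-separated tokens and that the DataFrame has a 'tokens' column whose cells are token lists (the path, column name, and class names are hypothetical):

import pandas as pd

class FileSentences:  # hypothetical: stream a large segmented file line by line
    def __init__(self, path):
        self.path = path
    def __iter__(self):
        with open(self.path, encoding='utf-8') as f:
            for line in f:
                yield line.split()  # tokens on each line are separated by spaces

class DataFrameSentences:  # hypothetical: yield one DataFrame row at a time
    def __init__(self, df, column='tokens'):
        self.df = df
        self.column = column
    def __iter__(self):
        for tokens in self.df[self.column]:
            yield tokens  # each cell is assumed to already hold a list of tokens

Either object can be passed to Word2Vec exactly like sentences() above; because __iter__ is defined on a class rather than as a one-shot generator, gensim can iterate over the corpus multiple times without loading it all into memory.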
When the corpus is this small, the example above can also be written directly without the iterator class:
import gensim
import pandas as pd
text = [["双方", "要", "持续", "深化", "政治", "互信", "和", "利益", "融合"],
        ["在", "涉及", "彼此", "核心", "利益", "和", "重大", "关切", "问题", "上", "相互", "尊重"],
        ["在", "共同", "发展", "的", "道路", "上", "相互", "支持"]]
word2vec = gensim.models.word2vec.Word2Vec(text, size=10, window=10, min_count=1, sg=1, hs=1, iter=10, workers=25)
print("pd.Series(word2vec.most_similar('政治')):\n", pd.Series(word2vec.most_similar('政治')))  # words whose vectors are most similar to "政治"
print("word2vec.wv['政治']:\n", word2vec.wv['政治'])  # the word vector for "政治"
In practice you rarely hand-write lists like these; more often the corpus is read from a txt file, as shown below. Here test.txt is a corpus that has already been word-segmented, i.e. the original text after segmentation with the tokens separated by single spaces (see the blogs in the reference links at the bottom); a sketch of how such a file can be produced follows this code.
import gensim
sentences = gensim.models.word2vec.LineSentence('./test.txt')  # test.txt: segmented corpus, one sentence per line, tokens separated by spaces
word2vec = gensim.models.word2vec.Word2Vec(sentences, size=10, window=10, min_count=1, sg=1, hs=1, iter=10, workers=25)
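If you do not already have a segmented file, here is a minimal sketch of producing test.txt, reusing the toy token lists from the earlier examples (in practice the tokens would come from a word segmenter):

# Write the segmented corpus to test.txt: one sentence per line,
# tokens joined by single spaces, which is the format LineSentence expects.
text = [["双方", "要", "持续", "深化", "政治", "互信", "和", "利益", "融合"],
        ["在", "涉及", "彼此", "核心", "利益", "和", "重大", "关切", "问题", "上", "相互", "尊重"],
        ["在", "共同", "发展", "的", "道路", "上", "相互", "支持"]]
with open('./test.txt', 'w', encoding='utf-8') as f:
    for tokens in text:
        f.write(' '.join(tokens) + '\n')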
The parameters of gensim.models.word2vec.Word2Vec are documented at length in the reference links; the ones that most need attention are:
1. the corpus (sentences);
2. the word-vector dimensionality (size);
3. the minimum word frequency in the corpus (min_count; words rarer than this get no vector);
4. the context window size (window);
5. the choice between skip-gram and CBOW (sg);
6. the choice between negative sampling and hierarchical softmax (hs / negative);
7. the maximum number of iterations of stochastic gradient descent over the corpus (iter);
8. workers, the number of parallel training threads.
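For reference, the call used throughout this post maps onto those parameters as follows (a sketch using the gensim 3.x parameter names that the examples rely on; note that gensim 4.x renamed size to vector_size and iter to epochs, and similarity queries there must go through word2vec.wv.most_similar):

# The same call as above, with each parameter labelled (gensim 3.x names).
word2vec = gensim.models.word2vec.Word2Vec(
    sentences,    # 1. the corpus: any re-iterable of token lists, or a LineSentence
    size=10,      # 2. dimensionality of the word vectors
    min_count=1,  # 3. words with frequency below this get no vector
    window=10,    # 4. context window size
    sg=1,         # 5. 1 = skip-gram, 0 = CBOW
    hs=1,         # 6. 1 = hierarchical softmax; hs=0 with negative > 0 = negative sampling
    iter=10,      # 7. number of training iterations (epochs) over the corpus
    workers=25,   # 8. number of parallel worker threads
)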
To save and reload a trained model, continuing the example above:
import gensim
sentences = gensim.models.word2vec.LineSentence('./test.txt')  # test.txt: segmented corpus, tokens separated by spaces
word2vec = gensim.models.word2vec.Word2Vec(sentences, size=10, window=10, min_count=1, sg=1, hs=1, iter=10, workers=25)
# save the model
word2vec.save('word2vec_model')
# load the model
model = gensim.models.word2vec.Word2Vec.load('word2vec_model')
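If only the trained vectors are needed rather than the full model, they can also be exported in the standard word2vec format; a minimal sketch (the file name vectors.txt is an arbitrary choice):

# Export just the word vectors in plain-text word2vec format.
model.wv.save_word2vec_format('vectors.txt', binary=False)
# Load them back as KeyedVectors: they can be queried (wv['政治'],
# wv.most_similar('政治')) but not trained any further.
wv = gensim.models.KeyedVectors.load_word2vec_format('vectors.txt', binary=False)
print(wv.most_similar('政治'))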
Reference links:
https://www.cnblogs.com/pinard/p/7278324.html
https://spaces.ac.cn/archives/4304/comment-page-1
https://www.cnblogs.com/gylph/p/9178444.html
https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.Word2VecVocab