Definition
word2vec is a method for mapping words into a vector space. In that space, relationships between words, including their contextual relationships, are captured to some degree by the geometry of the vectors: words that occur in similar contexts end up with similar vectors.
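To make this concrete, here is a small sketch using the third-party gensim library (not the implementation walked through below; the parameter names assume gensim 4.x). The toy corpus only shows the shape of the API; with a large real corpus the learned vectors support similarity and analogy queries such as the well-known king - man + woman ≈ queen.

from gensim.models import Word2Vec

# Toy corpus: each sentence is a list of tokens; a real corpus would be far larger.
sentences = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["a", "man", "and", "a", "woman", "walk"],
]

model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)  # sg=0 selects CBOW

# Nearby vectors should correspond to related words.
print(model.wv.most_similar("king", topn=3))

# With enough data, analogy queries like king - man + woman ~ queen also work.
print(model.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))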
Methods
There are two architectures for learning the word vectors: CBOW (Continuous Bag-of-Words) and Skip-gram.
The figure below shows the network structure of CBOW.
In the figure, x1, x2, …, xC are the individual words of the context vector in the source code; the size of this context window is sampled at random for each target word.
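As a rough illustration of that structure, the following sketch computes a CBOW forward pass: the hidden layer is the average of the context word vectors, and the output layer scores every vocabulary word. This snippet is illustrative only; the names syn0 and syn1 mirror the variables in the excerpt further below, but it is not taken from the repository, and it uses a full softmax where the repository uses hierarchical softmax or negative sampling.

import numpy as np

vocab_size, dim = 10000, 100
syn0 = np.random.uniform(-0.5 / dim, 0.5 / dim, (vocab_size, dim))  # input -> hidden weights (one row per word)
syn1 = np.zeros((vocab_size, dim))                                  # hidden -> output weights

context_indices = [3, 17, 42, 8]                # indices of x1..xC around the target word
h = np.mean(syn0[context_indices], axis=0)      # hidden layer: average of the context vectors
scores = syn1.dot(h)                            # one score per vocabulary word
probs = np.exp(scores - scores.max())
probs /= probs.sum()                            # softmax over the whole vocabulary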
Source code walkthrough
We use one open-source implementation as the reference: Word2vec GitHub code
Training flow:
- Load the training file and build the vocabulary
- Initialize the neural network and the Huffman tree
- Train with multiple worker processes:
  - Walk through the corpus line by line, turning each line into a vector of word indices
  - For each word, build its context according to the window size (see the sketch after this list)
  - Feed the (context, target) pairs into the network and update the weights
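The random window mentioned above works roughly like this. The sketch below is illustrative rather than copied from the repository, although the names win, sent and sent_pos match the excerpt that follows:

import random

def context_for(sent, sent_pos, win):
    # Sample an effective window size in [1, win], as word2vec does, then take
    # up to that many word indices on each side of the target position.
    current_win = random.randint(1, win)
    start = max(sent_pos - current_win, 0)
    end = min(sent_pos + current_win + 1, len(sent))
    return sent[start:sent_pos] + sent[sent_pos + 1:end]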
Core training code:
# Excerpt from the implementation (Python 2). Vocab, UnigramTable, init_net,
# __init_process, train_process and save are defined elsewhere in the same file.
import time
from multiprocessing import Pool, Value

def train(fi, fo, cbow, neg, dim, alpha, win, min_count, num_processes, binary):
    # Read train file to init vocab
    vocab = Vocab(fi, min_count)

    # Init net
    syn0, syn1 = init_net(dim, len(vocab))

    global_word_count = Value('i', 0)
    table = None
    if neg > 0:  # the default value of neg is 5
        print 'Initializing unigram table'
        table = UnigramTable(vocab)
    else:  # no negative sampling: use hierarchical softmax instead
        print 'Initializing Huffman tree'
        vocab.encode_huffman()

    # Begin training using num_processes workers
    t0 = time.time()
    pool = Pool(processes=num_processes, initializer=__init_process,
                initargs=(vocab, syn0, syn1, table, cbow, neg, dim, alpha,
                          win, num_processes, global_word_count, fi))
    pool.map(train_process, range(num_processes))
    t1 = time.time()
    print
    print 'Completed training. Training took', (t1 - t0) / 60, 'minutes'

    # Save model to file
    save(vocab, syn0, fo, binary)
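# --- Aside (illustrative sketch, not part of the repository's code) -----------
# The UnigramTable built in train() is word2vec's negative-sampling table: in
# the original word2vec, a word is drawn with probability proportional to
# count(word) ** 0.75, which damps the dominance of very frequent words.
# A minimal version of the same idea:
import numpy as np

class SimpleUnigramTable(object):
    def __init__(self, counts, table_size=1000000):
        probs = np.asarray(counts, dtype=np.float64) ** 0.75
        probs /= probs.sum()
        # Pre-fill a large table so that picking a slot uniformly at random
        # reproduces the smoothed unigram distribution.
        self.table = np.random.choice(len(counts), size=table_size, p=probs)

    def sample(self, k):
        # Return k random word indices to use as negative examples.
        return self.table[np.random.randint(0, len(self.table), size=k)]
# ------------------------------------------------------------------------------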
def train_process(pid):
    # Set fi to point to the right chunk of the training file: because training
    # runs in multiple processes, each worker is assigned its own byte range
    # of the corpus based on its process id.
    start = vocab.bytes / num_processes * pid
    end = vocab.bytes if pid == num_processes - 1 else vocab.bytes / num_processes * (pid + 1)
    fi.seek(start)
    #print 'Worker %d beginning training at %d, ending at %d' % (pid, start, end)

    alpha = starting_alpha  # learning rate

    word_count = 0
    last_word_count = 0

    # Walk through this worker's chunk of the file
    while fi.tell() < end:
        line = fi.readline().strip()

        # Skip blank lines
        if not line:
            continue

        # Turn the sentence into a list of word indices, wrapped in
        # begin-of-line / end-of-line markers
        sent = vocab.indices(['<bol>'] + line.split() + ['<eol>'])

        # Iterate over every word in the sentence
        for