GloVe Study Notes

Licensed under CC 4.0 BY-SA

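This post walks through the training pipeline for GloVe word vectors: preprocessing the corpus, building the co-occurrence matrix, and training the vectors. It also shows the concrete commands run by demo.sh.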


GitHub repository

https://github.com/stanfordnlp/GloVe/tree/master/src

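GloVe is a model with a similar purpose to word2vec. It combines sentence-level (local context) information with global corpus statistics, aiming to capture both semantics and syntax better. Below we look at GloVe purely from a usage perspective.

Goal of the model: represent words as vectors such that the vectors carry as much semantic and syntactic information as possible.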

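Input: a corpus
Output: word vectors
Method overview: first build a word-word co-occurrence matrix from the corpus, then learn the word vectors from that matrix with the GloVe model.
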
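For orientation, the second step can be written out explicitly. The objective below is the one given in the GloVe paper rather than anything stated in this post. Here $X_{ij}$ is the co-occurrence count of words $i$ and $j$, $w_i$ and $\tilde{w}_j$ are the word and context vectors, $b_i$ and $\tilde{b}_j$ are biases, and $V$ is the vocabulary size. The cutoff $x_{\max}$ is what the `-x-max` flag of the glove tool sets; $\alpha = 0.75$ is the value used in the paper.

```latex
J = \sum_{i,j=1}^{V} f(X_{ij})
    \left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2,
\qquad
f(x) =
\begin{cases}
  (x / x_{\max})^{\alpha} & \text{if } x < x_{\max}, \\
  1 & \text{otherwise.}
\end{cases}
```

The weighting function $f$ keeps very frequent pairs from dominating the loss and gives zero weight to pairs that never co-occur.
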
To train your own GloVe vectors, first you'll need to prepare your corpus as a single text file with all words separated by a single space. If your corpus has multiple documents, simply concatenate documents together with a single space. If your documents are particularly short, it's possible that padding the gap between documents with e.g. 5 "dummy" words will produce better vectors. Once you create your corpus, you can train GloVe vectors using the following 4 tools. An example is included in demo.sh, which you can modify as necessary.
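The corpus preparation described above can be sketched with standard Unix tools. This is a minimal illustration, not part of the GloVe package: the `docs/` directory and the file names are hypothetical and are created here only so that the snippet runs on its own.

```shell
#!/bin/sh
# Sketch: merge several documents into one space-separated corpus file.
# The docs/ directory and its contents are made up for this example.
set -e

workdir=$(mktemp -d)
mkdir "$workdir/docs"
printf 'the cat sat\non the mat\n' > "$workdir/docs/a.txt"
printf 'the dog   barked\n' > "$workdir/docs/b.txt"

# Concatenate the documents, turn every run of whitespace (newlines included)
# into a single space, then trim the leading and trailing space.
cat "$workdir"/docs/*.txt \
  | tr -s '[:space:]' ' ' \
  | sed -e 's/^ //' -e 's/ $//' > "$workdir/corpus.txt"

cat "$workdir/corpus.txt"
```

For real text you would still run a tokenizer first; this sketch only normalizes whitespace.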

The four main tools in this package are:

1) vocab_count

This tool requires an input corpus that already consists of whitespace-separated tokens, so run something like the Stanford Tokenizer on raw text first. It constructs unigram counts from the corpus and optionally thresholds the resulting vocabulary by total vocabulary size or by minimum frequency count.

2) cooccur

Constructs word-word cooccurrence statistics from a corpus. The user should supply a vocabulary file, as produced by vocab_count, and may specify a variety of parameters, as described by running ./build/cooccur.

3) shuffle

Shuffles the binary file of cooccurrence statistics produced by cooccur. For large files, the file is automatically split into chunks, each of which is shuffled and stored on disk before being merged and shuffled together. The user may specify a number of parameters, as described by running ./build/shuffle.

4) glove

Train the GloVe model on the specified cooccurrence data, which typically will be the output of the shuffle tool. The user should supply a vocabulary file, as given by vocab_count, and may specify a number of other parameters, which are described by running ./build/glove.
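To make the first two tools more concrete, here is a toy sketch of the kind of statistics they compute, written with `sort`, `uniq` and `awk`. It is not the real implementation. The corpus, the `min_count` and `window` values, the symmetric window and the 1/distance weighting are assumptions chosen for illustration, and dropping out-of-vocabulary tokens before windowing is a simplification of this sketch.

```shell
#!/bin/sh
# Toy illustration of unigram counts and co-occurrence counts.
# NOT the real vocab_count / cooccur tools; all values below are made up.
set -e

workdir=$(mktemp -d)
printf 'the cat sat on the mat the cat ran\n' > "$workdir/corpus.txt"
min_count=2   # plays the role of -min-count
window=2      # plays the role of -window-size

# Unigram counts: one token per line, count, keep words seen at least
# min_count times, sort by descending count, print "word count".
vocab=$(tr ' ' '\n' < "$workdir/corpus.txt" | sort | uniq -c \
  | awk -v m="$min_count" '$1 >= m { print $2, $1 }' \
  | sort -k2,2nr -k1,1)
printf '%s\n' "$vocab" > "$workdir/vocab.txt"

# Co-occurrence counts: keep only in-vocabulary tokens, then for each token
# add 1/distance for every neighbour up to "window" positions to its left,
# recording the pair in both directions (symmetric context).
cooc=$(awk -v w="$window" '
  NR == FNR { in_vocab[$1] = 1; next }     # first file: the vocabulary
  {
    n = 0
    for (i = 1; i <= NF; i++) if ($i in in_vocab) seq[++n] = $i
    for (i = 2; i <= n; i++)
      for (d = 1; d <= w && d < i; d++) {
        x[seq[i] " " seq[i - d]] += 1 / d
        x[seq[i - d] " " seq[i]] += 1 / d
      }
  }
  END { for (k in x) printf "%s %.2f\n", k, x[k] }
' "$workdir/vocab.txt" "$workdir/corpus.txt" | sort)

printf '%s\n' "$vocab"
printf '%s\n' "$cooc"
```

The real cooccur tool writes binary records rather than text, which is why its output is passed through shuffle before training.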
After downloading the source code, one change is needed in the Makefile. Change
CFLAGS = -lm -pthread -Ofast -march=native -funroll-loops -Wno-unused-result
to
CFLAGS = -lm -pthread -O2 -march=native -funroll-loops -Wno-unused-result
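The CFLAGS change above can also be applied with `sed` instead of a text editor. The sketch below works on a throwaway file so that it runs on its own; in a real checkout you would point `makefile` at the Makefile in the GloVe source tree, which is an assumption about your local layout.

```shell
#!/bin/sh
# Sketch: swap -Ofast for -O2 in a CFLAGS line.
# A temporary file stands in for the real Makefile here.
set -e

makefile=$(mktemp)
printf 'CFLAGS = -lm -pthread -Ofast -march=native -funroll-loops -Wno-unused-result\n' > "$makefile"

# Portable in-place edit: write the result to a new file, then replace the original.
sed 's/-Ofast/-O2/' "$makefile" > "$makefile.new" && mv "$makefile.new" "$makefile"

cat "$makefile"
```

Writing to a new file and moving it back avoids `sed -i`, whose syntax differs between GNU and BSD sed.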
The concrete commands run by demo.sh:

./vocab_count -min-count $VOCAB_MIN_COUNT -verbose $VERBOSE < $CORPUS > $VOCAB_FILE
./cooccur -memory $MEMORY -vocab-file $VOCAB_FILE -verbose $VERBOSE -window-size $WINDOW_SIZE < $CORPUS > $COOCCURRENCE_FILE
./shuffle -memory $MEMORY -verbose $VERBOSE < $COOCCURRENCE_FILE > $COOCCURRENCE_SHUF_FILE
./glove -save-file $SAVE_FILE -threads $NUM_THREADS -input-file $COOCCURRENCE_SHUF_FILE -x-max $X_MAX -iter $MAX_ITER -vector-size $VECTOR_SIZE -binary $BINARY -vocab-file $VOCAB_FILE -verbose $VERBOSE
octave -nodisplay -nodesktop -nojvm -nosplash < ./eval/read_and_evaluate.m 1>&2
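The commands above rely on shell variables that must be set before they run. The values below are the ones I recall from the upstream demo.sh and should be treated as assumptions; check them against your copy of the script and adjust them to your corpus and machine. The snippet only prints the first command with the variables expanded, so it runs without the GloVe binaries.

```shell
#!/bin/sh
# Assumed settings for the pipeline; verify against your own demo.sh.
CORPUS=text8                                  # input corpus file
VOCAB_FILE=vocab.txt                          # written by vocab_count
COOCCURRENCE_FILE=cooccurrence.bin            # written by cooccur
COOCCURRENCE_SHUF_FILE=cooccurrence.shuf.bin  # written by shuffle
SAVE_FILE=vectors                             # output name prefix for glove
VERBOSE=2
MEMORY=4.0          # soft memory limit in GB, per the tools' help text
VOCAB_MIN_COUNT=5   # drop words seen fewer than 5 times
VECTOR_SIZE=50      # dimensionality of the word vectors
MAX_ITER=15         # training iterations
WINDOW_SIZE=15      # context window size
BINARY=2            # 0: text output, 1: binary, 2: both, per glove's help text
NUM_THREADS=8
X_MAX=10            # cutoff in the weighting function

# Dry run: show the first command with the variables filled in.
dry_run="./vocab_count -min-count $VOCAB_MIN_COUNT -verbose $VERBOSE < $CORPUS > $VOCAB_FILE"
echo "$dry_run"
```

Running each tool without arguments, as the tool descriptions suggest, prints the authoritative list of flags and defaults.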