Word2Vec Notes

  1. Find a Word2Vec tool, try it out, and see how it performs
    • Word2Vec(Google):
      • Captures many linguistic regularities
        For example, the vector operation vector('Paris') - vector('France') + vector('Italy') results in a vector that is very close to vector('Rome') (reproduced in the analogy sketch after this list)
      • From words to phrases and beyond
        Example: a single vector representing the phrase 'san francisco'
      • Word cosine distance (also shown in the analogy sketch after this list)
      • Word clustering
        Deriving word classes from huge data sets by performing K-means clustering on top of the word vectors. The output is a vocabulary file with words and their corresponding class IDs (see the clustering sketch after this list).
    • Performance (the parameter sketch after this list shows how these settings map onto Gensim arguments):
      • Architecture:
        • Skip-Gram: slower, better for infrequent words
        • CBOW: fast
      • The training algorithm:
        • hierarchical softmax: better for infrequent words
        • negative sampling: better for frequent words, better with low dimensional vectors
      • Sub-sampling of frequent words: can improve both accuracy and speed for large data sets (useful values are in range 1e-3 to 1e-5)
      • Dimensionality of the word vectors: usually more is better, but not always
      • Context (window) size:
        • skip-gram: around 10
        • CBOW: around 5
    • Obtaining training data (URLs for the bolded data sets are given on the reference site; a preprocessing sketch follows this list)
      • First billion characters from Wikipedia (use the pre-processing Perl script from the bottom of Matt Mahoney's page)
      • Latest Wikipedia dump: use the same script as above to obtain clean text. Should be more than 3 billion words.
      • WMT11 site: text data for several languages (duplicate sentences should be removed before training the models)
      • Dataset from the "One Billion Word Language Modeling Benchmark": almost 1B words, already pre-processed text.
      • UMBC webbase corpus: around 3 billion words (more info here). Needs further processing (mainly tokenization).
      • Text data for more languages can be obtained at statmt.org and from the Polyglot project (tested it myself, works well).
    • In short, Google's word2vec site has plenty more worth exploring
    • Factors affecting word-vector quality
      • Quantity and quality of the training data
      • Dimensionality of the word vectors
      • The training algorithm
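
The analogy and cosine-distance operations above can be reproduced with Gensim's Word2Vec API. This is a minimal sketch on placeholder data, so the printed neighbours will be meaningless; with vectors trained on one of the corpora listed above, the Paris/France/Italy analogy behaves as described.

```python
from gensim.models import Word2Vec

# Placeholder corpus: replace with an iterable of tokenized sentences from a real data set.
corpus = [["paris", "is", "the", "capital", "of", "france"],
          ["rome", "is", "the", "capital", "of", "italy"]]
model = Word2Vec(corpus, vector_size=100, window=5, min_count=1, sg=1)

# vector('paris') - vector('france') + vector('italy') should land near vector('rome')
# once the model is trained on enough text.
print(model.wv.most_similar(positive=["paris", "italy"], negative=["france"], topn=3))

# Word cosine distance: similarity() returns the cosine similarity between two words.
print(model.wv.similarity("paris", "rome"))
```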
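
The word-clustering bullet (K-means on top of the word vectors) can be sketched with scikit-learn. The model path `word2vec.model`, the output file name, and the choice of 500 classes are all assumptions for illustration, not part of the original notes.

```python
from gensim.models import Word2Vec
from sklearn.cluster import KMeans

# Hypothetical path: assumes a Word2Vec model already trained on a large corpus.
model = Word2Vec.load("word2vec.model")

# K-means clustering on top of the word vectors; 500 classes is an arbitrary choice.
kmeans = KMeans(n_clusters=500, random_state=0).fit(model.wv.vectors)

# Write a vocabulary file with each word and its corresponding class ID,
# mirroring the output described above.
with open("classes.txt", "w") as f:
    for word, class_id in zip(model.wv.index_to_key, kmeans.labels_):
        f.write(f"{word} {class_id}\n")
```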
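
As a rough guide to the performance settings and quality factors listed above, this sketch shows how they map onto Gensim's Word2Vec constructor arguments. The toy sentences are placeholders and the specific values are illustrative, not recommendations.

```python
from gensim.models import Word2Vec

# Toy placeholder corpus; real training needs one of the large data sets listed above.
sentences = [["this", "is", "a", "toy", "corpus"],
             ["word2vec", "needs", "far", "more", "text", "than", "this"]]

model = Word2Vec(
    sentences,
    sg=1,             # 1 = skip-gram (slower, better for infrequent words); 0 = CBOW (faster)
    hs=0,             # 1 = hierarchical softmax (better for infrequent words)
    negative=5,       # > 0 = negative sampling (better for frequent words and low dimensions)
    sample=1e-4,      # sub-sampling of frequent words; useful range is 1e-3 to 1e-5
    vector_size=300,  # dimensionality of the word vectors; more is usually better, but not always
    window=10,        # context size: around 10 for skip-gram, around 5 for CBOW
    min_count=1,      # kept at 1 only so the toy corpus is not pruned away
)
```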
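
Most of the corpora above need cleaning and tokenization before training. A minimal preprocessing sketch, assuming a cleaned plain-text file with one sentence per line (for example the output of Matt Mahoney's pre-processing script); the file name `corpus.txt` is hypothetical.

```python
from gensim.utils import simple_preprocess
from gensim.models.word2vec import LineSentence
from gensim.models import Word2Vec

# simple_preprocess handles basic lowercasing and tokenization for raw text.
print(simple_preprocess("The UMBC webbase corpus needs further processing, mainly tokenization."))

# LineSentence streams a cleaned plain-text file (one sentence per line) without
# loading it all into memory; "corpus.txt" is a hypothetical file name.
sentences = LineSentence("corpus.txt")
model = Word2Vec(sentences, vector_size=200, window=5, min_count=5, workers=4)
```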
### Related Research Papers on word2vec

word2vec is a technique for producing word vectors, proposed by Google and widely used in natural language processing. Its core idea is to use a neural-network model to capture relationships between words and map them into a low-dimensional continuous space[^1]. The following papers are closely related to word2vec, along with their contributions:

#### 1. **Original Word2Vec Paper**

- Paper: *Distributed Representations of Words and Phrases and their Compositionality*
- This paper introduces the core algorithms of the skip-gram and CBOW (Continuous Bag-of-Words) models as well as the negative-sampling method. It shows how to train word vectors efficiently on large data sets and discusses how these vectors perform on semantic-similarity and syntactic tasks[^3].

#### 2. **Skip-Gram Negative Sampling from an Information-Theoretic Perspective**

- This work examines an information-theoretic interpretation of the skip-gram negative-sampling objective, explaining why optimizing it effectively captures co-occurrence statistics between words.

#### 3. **Extensions to Word Embeddings**

- *GloVe: Global Vectors for Word Representation*: GloVe combines matrix-factorization techniques with co-occurrence statistics, offering a word-embedding approach different from word2vec. Although not a direct extension of word2vec, it provides another perspective on learning word vectors.

#### 4. **Applications Beyond Text Data**

- The ideas behind word2vec have also been generalized to other kinds of structured data. For example, recommender systems use item2vec to model relationships between items, and node2vec to learn features of graph nodes[^2].

```python
import gensim
from gensim.models import Word2Vec

# Example code snippet showing how to train a simple Word2Vec model.
sentences = [["cat", "say", "meow"], ["dog", "bark"]]
model = Word2Vec(sentences, min_count=1)
print(model.wv["cat"])
```

The snippet above shows a minimal word2vec training workflow; the Gensim library makes it quick to compute word vectors from text data.