Differences between BERT and Word2Vec/GloVe

https://www.quora.com/Are-encoder-representations-BERT-considered-embeddings/answer/Wenxiang-Jiao


Of course, BERT can be considered an embedding generator.

From Word2Vec and GloVe, to Context2Vec, to ELMo, and then to BERT, the approaches to learning word embeddings have evolved from order-free to contextualized, and finally to deeply contextualized.

Word2Vec and GloVe utilize the co-occurrence of a target word and its context words, as defined by a context window, but the order of words in a sentence is not taken into account.
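To make the order-free point concrete, here is a minimal sketch (plain Python; the function name and window size are my own, not from the original answer) of how window-based (target, context) pairs are extracted. Shuffling the words inside a window would produce exactly the same set of pairs, which is all the information Word2Vec and GloVe train on:

```python
def window_pairs(tokens, window=2):
    """Yield (target, context) co-occurrence pairs within a fixed window.

    Only co-occurrence is recorded; the position of the context word
    relative to the target is discarded, so word order does not matter.
    """
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield (target, tokens[j])

pairs = list(window_pairs("the cat sat on the mat".split()))
# e.g. ('cat', 'the'), ('cat', 'sat'), ('cat', 'on'), ...
```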

Context2Vec takes the sequential relationships between words into account. Each sentence is modeled by a bidirectional RNN, and each target word obtains a contextual embedding from the hidden states of the RNN, which capture information from the words before and after it.
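A rough PyTorch sketch of that idea (this is not the exact Context2Vec architecture, which runs separate left-to-right and right-to-left LSTMs over the context and adds an MLP on top; all sizes here are hypothetical):

```python
import torch
import torch.nn as nn

vocab_size, emb_dim, hidden = 10_000, 128, 256    # hypothetical sizes

embed = nn.Embedding(vocab_size, emb_dim)
# One bidirectional LSTM over the sentence: the state at position i
# mixes information from the words before and after position i.
birnn = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)

token_ids = torch.randint(0, vocab_size, (1, 7))  # a toy 7-token sentence
states, _ = birnn(embed(token_ids))               # (1, 7, 2 * hidden)
contextual_vec = states[0, 3]                     # contextual embedding of token 4
```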

ELMo is quite similar to Context2Vec. The main difference is that ELMo uses language modelling to train the word embeddings, whereas Context2Vec adopts the Word2Vec fashion of building a mapping between a target word and its context words. ELMo is also a little deeper than Context2Vec.
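The difference in objectives can be sketched as a toy bidirectional language-model loss in PyTorch (ELMo itself uses character-CNN inputs, two stacked LSTM layers, and a tied softmax; everything below is a simplified illustration with made-up sizes):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab, emb, hid = 10_000, 128, 256               # hypothetical sizes
embed = nn.Embedding(vocab, emb)
lstm = nn.LSTM(emb, hid, batch_first=True, bidirectional=True)
proj = nn.Linear(hid, vocab)                     # shared softmax projection

ids = torch.randint(0, vocab, (1, 7))
states, _ = lstm(embed(ids))
fwd, bwd = states[..., :hid], states[..., hid:]

# The forward LM predicts the next token, the backward LM the previous
# one; the biLM loss is simply the sum of the two cross-entropies.
loss = (
    F.cross_entropy(proj(fwd[:, :-1]).flatten(0, 1), ids[:, 1:].flatten())
    + F.cross_entropy(proj(bwd[:, 1:]).flatten(0, 1), ids[:, :-1].flatten())
)
loss.backward()
```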

BERT benefits from the invention of the Transformer. Though ELMo has proven effective at extracting context-dependent embeddings, the BERT paper argues that ELMo captures context from only two directions (a shallow concatenation of left-to-right and right-to-left RNNs). BERT adopts the encoder of the Transformer, which is composed of self-attention networks, so it can capture context from all positions at once (every token attends to every other token).
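In practice, pulling such fully contextual embeddings out of a pre-trained BERT takes only a few lines with the Hugging Face transformers library (the library and model name are my choice for illustration, not part of the original answer):

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("The bank raised the interest rate.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One vector per (sub)word token, conditioned on the whole sentence:
# "bank" here gets a different vector than in "the river bank".
token_embeddings = outputs.last_hidden_state     # shape (1, seq_len, 768)
```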

So, let’s get back to the question. The conclusion is:

  • Encoder representations can be considered embeddings, e.g., those produced by ELMo and BERT;
  • The way to incorporate encoder representations into other NLP tasks is different, compared to Word2Vec, GloVe, and Context2Vec. Unlike Word2Vec, which exports its trained word embeddings to other models as a better initialization, ELMo or BERT must become part of the downstream model in order to produce context-dependent embeddings on the fly (see the sketch after this list).
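To make the contrast concrete, here is a sketch of the two integration styles (PyTorch + transformers; `pretrained_vectors` and the classifier are hypothetical, not from the original answer):

```python
import torch.nn as nn
from transformers import AutoModel

# Word2Vec/GloVe style: export trained vectors once, then hand them to
# the downstream model as a (possibly trainable) static lookup table.
# `pretrained_vectors` would be a (vocab, dim) tensor loaded from a
# trained Word2Vec/GloVe model, hence commented out here:
# static_embed = nn.Embedding.from_pretrained(pretrained_vectors, freeze=False)

# ELMo/BERT style: the encoder itself becomes a submodule of the task
# model, so embeddings are recomputed from context on every forward pass.
class Classifier(nn.Module):
    def __init__(self, num_labels):
        super().__init__()
        self.encoder = AutoModel.from_pretrained("bert-base-uncased")
        self.head = nn.Linear(self.encoder.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        return self.head(out.last_hidden_state[:, 0])  # [CLS] position
```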