bert-as-service & BERT FAQ

This post walks through using bert-as-service: encoding sentences, choosing a pooling strategy, and handling tokenization. It points out that the choice of pooling layer affects the quality of the sentence representation, and that out-of-vocabulary words are handled by BERT's subword tokenization. For sentence-similarity computation, the author recommends focusing on the ranking rather than the absolute cosine similarity. It also covers how to serve a custom fine-tuned BERT model.

bert-as-service GitHub:
https://github.com/hanxiao/bert-as-service#q-the-cosine-similarity-of-two-sentence-vectors-is-unreasonably-high-eg-always--08-whats-wrong

BERT interview questions:
https://mp.weixin.qq.com/s/E60wUHkHo-Gj3wb9Denuag

bert-as-service uses BERT as a sentence encoder and hosts it as a service via ZeroMQ, allowing you to map sentences into fixed-length representations in just two lines of code.
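
For reference, here is a minimal client-side sketch of those "two lines", assuming a bert-serving server has already been started locally (e.g. with bert-serving-start -model_dir ...) on the default ports:

```python
# Minimal client-side sketch: two lines to get fixed-length sentence vectors.
# Assumes a bert-serving server is already running on the default ports.
from bert_serving.client import BertClient

bc = BertClient()                                      # connect to the local server
vecs = bc.encode(['First do it', 'then do it right'])  # ndarray, e.g. (2, 768) for BERT-Base
print(vecs.shape)
```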

1. Pooling

1) What is pooling for?
Pooling is required to get a fixed-length representation of a sentence. In the default strategy REDUCE_MEAN, I take the second-to-last hidden layer of all of the tokens in the sentence and do average pooling.
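
As an illustration (not the library's actual code path), the REDUCE_MEAN idea boils down to the following numpy sketch, where the tensor shapes and the layer index are assumptions:

```python
# Sketch of REDUCE_MEAN over the second-to-last layer (shapes are illustrative).
import numpy as np

# hypothetical encoder output for one sentence: [num_layers, seq_len, hidden_size]
all_layers = np.random.randn(12, 25, 768)

second_to_last = all_layers[-2]              # [seq_len, hidden_size]
sentence_vec = second_to_last.mean(axis=0)   # average pooling over tokens -> [hidden_size]
print(sentence_vec.shape)                    # (768,)
```
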
Q: So which layer and which pooling strategy is the best?
A: It depends. Keep in mind that different BERT layers capture different information. To see that more clearly, here is a visualization on the UCI-News Aggregator Dataset, where I randomly sample 20K news titles, get sentence encodings from different layers and with different pooling strategies, and finally reduce them to 2D via PCA (one can of course do t-SNE as well, but that's not my point). There are only four classes in the data, illustrated in red, blue, yellow and green. To reproduce the result, please run example7.py.

[Figures: 2D PCA projections of the 20K news-title embeddings, by pooling layer and pooling strategy]

2) Which layer should the sentence embedding be pooled from?

Intuitively, pooling_layer=-1 is close to the training output, so it may be biased towards the training targets. If you don't fine-tune the model, this could lead to a bad representation. pooling_layer=-12 is close to the word embeddings and may preserve the very original word information (with no fancy self-attention, etc.). On the other hand, you may achieve much the same performance by simply using word embeddings alone. That said, anything in between [-1, -12] is a trade-off.
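
If you want to experiment with this trade-off, the layer and strategy can be set when starting the server. Below is a sketch using the programmatic server API and the -pooling_layer / -pooling_strategy flags; the model path is a placeholder:

```python
# Sketch: start the server with an explicit pooling layer and strategy.
from bert_serving.server import BertServer
from bert_serving.server.helper import get_args_parser

args = get_args_parser().parse_args([
    '-model_dir', '/path/to/uncased_L-12_H-768_A-12',  # placeholder path
    '-pooling_layer', '-2',                            # second-to-last layer (the default)
    '-pooling_strategy', 'REDUCE_MEAN',                # average over token vectors
    '-num_worker', '1',
])
server = BertServer(args)
server.start()
```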

2. Tokenizer

1) Subword tokenization: why is my (English) word tokenized into ##something?
Because your word is out-of-vocabulary (OOV). The tokenizer from Google uses a greedy longest-match-first algorithm to perform tokenization using the given vocabulary.

For example:

input = “unaffable”
tokenizer_output = [“un”, “##aff”, “##able”]
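
To make the greedy longest-match-first idea concrete, here is a small self-contained sketch of WordPiece-style tokenization over a toy vocabulary (an illustration, not the actual Google tokenizer code):

```python
# Toy WordPiece-style tokenizer: greedily take the longest vocab match at each position.
def wordpiece_tokenize(word, vocab, unk_token='[UNK]'):
    pieces, start = [], 0
    while start < len(word):
        end, cur_piece = len(word), None
        # shrink the window from the right until we hit a vocabulary entry
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = '##' + piece          # non-initial pieces carry the ## prefix
            if piece in vocab:
                cur_piece = piece
                break
            end -= 1
        if cur_piece is None:                 # no prefix matched at all -> unknown word
            return [unk_token]
        pieces.append(cur_piece)
        start = end
    return pieces

toy_vocab = {'un', '##aff', '##able'}
print(wordpiece_tokenize('unaffable', toy_vocab))   # ['un', '##aff', '##able']
```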

2)Do I need to do segmentation for Chinese?
No. If you are using the pretrained Chinese BERT released by Google, you don't need word segmentation, as this Chinese BERT is a character-based model. It won't recognize words/phrases even if you intentionally add spaces in between. For the Chinese BERT, the word embeddings are actually character embeddings.
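
A small sketch of that point (the sentences are made up, and a server started with -model_dir pointing at the Google Chinese BERT is assumed):

```python
# Sketch: raw Chinese text goes in directly; manual word segmentation adds nothing,
# because the Chinese model tokenizes character by character anyway.
from bert_serving.client import BertClient

bc = BertClient()
vecs = bc.encode(['今天天气不错', '今天 天气 不错'])   # both end up as the same character sequence
print(vecs.shape)
```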

3. Sentence similarity

Q: The cosine similarity of two sentence vectors is unreasonably high (e.g. always > 0.8), what's wrong?
A: A decent representation for a downstream task doesn't mean that it will be meaningful in terms of cosine distance, since cosine distance is a linear space where all dimensions are weighted equally. If you want to use cosine distance anyway, then please focus on the rank, not the absolute value. Namely, do not use:

if cosine(A, B) > 0.9, then A and B are similar
Please consider the following instead:

if cosine(A, B) > cosine(A, C), then A is more similar to B than C.
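
A sketch of the rank-based comparison (the sentences are made up and a running server is assumed):

```python
# Sketch: compare candidates by relative cosine similarity, not an absolute threshold.
import numpy as np
from bert_serving.client import BertClient

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

bc = BertClient()
a, b, c = bc.encode([
    'how do I reset my password',      # A: the query
    'password reset instructions',     # B: a related candidate
    'today is a sunny day',            # C: an unrelated candidate
])

if cosine(a, b) > cosine(a, c):
    print('A is more similar to B than to C')
```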

The graph below illustrates the pairwise similarity of 3,000 Chinese sentences randomly sampled from the web (character length < 25). We compute cosine similarity based on the sentence vectors and ROUGE-L based on the raw text. The diagonal (self-correlation) is removed for the sake of clarity. As one can see, there is some positive correlation between these two metrics.

4. Fine-tuned BERT used in bert-as-service

Q: Can I use my own fine-tuned BERT model?
A: Yes. In fact, this is suggested. Make sure you have the following three items in model_dir:

A TensorFlow checkpoint (bert_model.ckpt) containing the pre-trained weights (which is actually 3 files).
A vocab file (vocab.txt) to map WordPiece to word id.
A config file (bert_config.json) which specifies the hyperparameters of the model.
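
As a quick sanity check before starting the server, you can verify that the directory actually contains those three items; a minimal sketch (the path is a placeholder):

```python
# Sketch: check that model_dir holds the three required items before serving it.
import glob
import os

model_dir = '/path/to/my_finetuned_bert'   # placeholder
assert os.path.isfile(os.path.join(model_dir, 'vocab.txt')), 'missing vocab.txt'
assert os.path.isfile(os.path.join(model_dir, 'bert_config.json')), 'missing bert_config.json'
# the "checkpoint" is actually several files: bert_model.ckpt.index, .meta, .data-*
assert glob.glob(os.path.join(model_dir, 'bert_model.ckpt*')), 'missing bert_model.ckpt.*'
```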
