[paper] Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context
(venv2.7) mi@mi-OptiPlex-7060:~/shenhao/study/transformer-xl/tf$ bash scripts/enwik8_base_gpu.sh train_data
Producing dataset...
building vocab with min_freq=0, max_size=None
final vocab size 204 from 204 unique tokens
Saving dataset...
Converting train set...
processing batch 0
processing batch 500
processing batch 1000
processing batch 1500
processing batch 2000
processing batch 2500
processing batch 3000
processing batch 3500
processing batch 4000
processing batch 4500
processing batch 5000
processing batch 5500
processing batch 6000
processing batch 6500
processing batch 7000
Done writing train.bsz-24.tlen-512.tfrecords. batches: 7242
Converting valid set...
processing batch 0
Done writing valid.bsz-24.tlen-512.tfrecords. batches: 403
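For reference, the TFRecords above pack the raw character stream into fixed [batch_size, tgt_len] segments (24 x 512 here). A minimal sketch of where the batch count comes from, assuming the stream is simply split into bsz parallel streams and cut into tgt_len-length chunks (my own simplification, not the repo's data_utils.py):

```python
def approx_num_batches(num_chars, bsz=24, tgt_len=512):
    # Each batch consumes bsz * tgt_len characters, so the number of
    # training batches is roughly the stream length divided by that.
    return num_chars // (bsz * tgt_len)

# enwik8's roughly 90M-character training split gives a count on the same
# order as the 7242 batches reported in the log above:
print(approx_num_batches(90000000))  # -> 7324
```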
Paper notes: Transformer-XL, by IndexFziQ on Zhihu: https://zhuanlan.zhihu.com/p/70745925
u and v are learnable parameters, and they are the key to this part. When computing self-attention, since the query vector is the same at every query position, the attention bias toward different words should stay the same regardless of the query position ???
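To make that concrete, here is a minimal NumPy sketch of the four-term relative-attention score from the Transformer-XL paper (terms (a) through (d)). The array names and shapes are illustrative assumptions, and the relative-shift indexing of R_{i-j} is omitted for brevity; this is not the repo's actual implementation:

```python
import numpy as np

def rel_attn_score(E_q, E_k, R, W_q, W_kE, W_kR, u, v):
    """Pre-softmax relative attention scores, following the paper's decomposition:
      (a) content-content     : (W_q e_i)^T (W_kE e_j)
      (b) content-position    : (W_q e_i)^T (W_kR r_{i-j})
      (c) global content bias : u^T (W_kE e_j)        # u replaces the query: same for every i
      (d) global position bias: v^T (W_kR r_{i-j})    # v replaces the query: same for every i
    Illustrative shapes: E_q [qlen, d], E_k [klen, d], R [klen, d], W_* [d, d], u and v [d].
    """
    q = E_q.dot(W_q.T)            # transformed queries,       [qlen, d]
    k = E_k.dot(W_kE.T)           # content keys,              [klen, d]
    r = R.dot(W_kR.T)             # relative-position keys,    [klen, d]
    a = q.dot(k.T)                # (a)                        [qlen, klen]
    b = q.dot(r.T)                # (b)                        [qlen, klen]
    c = k.dot(u)[None, :]         # (c) one row, broadcast to every query position
    d = r.dot(v)[None, :]         # (d) one row, broadcast to every query position
    return a + b + c + d          # (c) and (d) do not depend on the query position
```

Terms (c) and (d) are exactly the u and v biases: because they contain no query-dependent factor, the bias they add toward each key is identical for all query positions, which is the intuition questioned above.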



https://zhuanlan.zhihu.com/p/56027916
https://ai.googleblog.com/2019/01/transformer-xl-unleashing-potential-of.html
[Paper] Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context

Is there no masking for next-sentence prediction?
1. How long are the corpus texts? What about very short ones, since our own corpora are all quite short?
2. Was next-sentence prediction simply not done?

This post takes a close look at Transformer-XL, an attentive language model that goes beyond the fixed-length-context limitation. It records the training process, including dataset construction and the key learned parameters, and discusses the implementation details of next-sentence prediction and how the model applies to short corpora.