- paper: Language Modeling with Gated Convolutional Networks (Dauphin et al., 2017)
- CNN beats LSTM in language modeling?
Model
* Word embedding
* Hidden layers: h_l(X) = (X ∗ W + b) ⨂ σ(X ∗ V + c)
    * ⨂: element-wise product
    * W, V ∈ ℝ^(k×m×n)
* Note that the beginning of the sequence is zero-padded with k/2 elements
* The linear path of the gate helps alleviate the vanishing gradient problem (it can be seen as a multiplicative skip connection); a model sketch follows this list
* Adaptive softmax
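As a rough reference for the model bullets above, here is a minimal sketch of one gated convolutional block and a tiny GCNN-style language model, assuming PyTorch. The class names (`GatedConvBlock`, `TinyGCNNLM`), the layer sizes, the additive residual, the left padding of k−1 positions, and the adaptive-softmax cutoffs are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedConvBlock(nn.Module):
    """One gated convolutional layer: h_l(X) = (X * W + b) ⊗ σ(X * V + c)."""
    def __init__(self, channels, kernel_size):
        super().__init__()
        self.kernel_size = kernel_size
        # Two 1-D convolutions: one for the linear path, one for the gate.
        self.conv = nn.Conv1d(channels, channels, kernel_size)
        self.gate = nn.Conv1d(channels, channels, kernel_size)

    def forward(self, x):                       # x: (batch, channels, time)
        # Left-pad with k-1 zeros so no position sees future tokens
        # (a common causal-convolution choice; the notes above say k/2).
        h = F.pad(x, (self.kernel_size - 1, 0))
        out = self.conv(h) * torch.sigmoid(self.gate(h))   # GLU
        # The gate acts like a multiplicative skip; the additive residual
        # below is an extra illustrative choice for a deep stack.
        return out + x

class TinyGCNNLM(nn.Module):
    def __init__(self, vocab_size, embed_dim=256, kernel_size=4,
                 num_blocks=4, cutoffs=(2000, 10000)):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.blocks = nn.ModuleList(
            [GatedConvBlock(embed_dim, kernel_size) for _ in range(num_blocks)])
        # Adaptive softmax: cheaper output layer for large vocabularies.
        self.adaptive_softmax = nn.AdaptiveLogSoftmaxWithLoss(
            embed_dim, vocab_size, cutoffs=list(cutoffs))

    def forward(self, tokens, targets):         # both: (batch, time)
        x = self.embed(tokens).transpose(1, 2)  # -> (batch, embed_dim, time)
        for block in self.blocks:
            x = block(x)
        x = x.transpose(1, 2)                   # -> (batch, time, embed_dim)
        x = x.reshape(-1, x.size(-1))           # -> (batch*time, embed_dim)
        return self.adaptive_softmax(x, targets.reshape(-1))

# Hypothetical usage: predict the next token at every position.
model = TinyGCNNLM(vocab_size=20000)
tokens = torch.randint(0, 20000, (8, 32))       # (batch, time)
targets = torch.randint(0, 20000, (8, 32))
print(model(tokens, targets).loss)              # mean negative log-likelihood
```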
Experiment
- Datasets: Google Billion Word (GBW), WikiText-103
- Optimization: Nesterov’s momentum, gradient clipping (to 0.1), weight normalization
- Achieves stable, fast convergence even with a large learning rate such as 1
- Hyper-parameter search
- Results
    - Speed: GCNN-22 vs. LSTM-2048 (units): higher throughput and better responsiveness (latency)
    - Complexity: GCNN-22 vs. LSTM-2048 (units): fewer parameters and fewer FLOPs per token
    - Gating mechanism: GLU > (GTU (LSTM-style unit) ≈ ReLU) > Tanh (see the sketch after this list)
    - Non-linear modeling: GLU > Linear > Bilinear
    - Network depth: the deeper, the better
    - Context: a larger context size gives lower test perplexity, but with diminishing returns
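For reference on the gating comparison above, a minimal sketch of the four activation variants, assuming PyTorch; `a` and `b` are hypothetical names for the two branch pre-activations (X ∗ W + b and X ∗ V + c), and the function names are illustrative.

```python
import torch

def glu(a, b):        # Gated Linear Unit: linear path scaled by a sigmoid gate
    return a * torch.sigmoid(b)

def gtu(a, b):        # Gated Tanh Unit (the LSTM-style gate)
    return torch.tanh(a) * torch.sigmoid(b)

def relu_unit(a, b):  # ReLU baseline (no separate gate branch)
    return torch.relu(a)

def tanh_unit(a, b):  # Tanh baseline (bounded, no gate)
    return torch.tanh(a)
```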
This post summarizes a language model built with CNNs and the experiments showing that it outperforms a traditional LSTM model on several metrics. The two models are compared on the Google Billion Word and WikiText-103 datasets; the CNN model not only converges faster but also has fewer parameters and runs more efficiently.