- paper
- CNN beats LSTM in language modeling?
Model
* Word embedding
* Hidden layers:
  h_l(X) = (X ∗ W + b) ⊗ σ(X ∗ V + c)
  * ⊗: element-wise product
  * W, V ∈ ℝ^{k×m×n}
  * Note that the beginning of the sequence is zero-padded by k/2 so the convolution cannot see future tokens
  * The linear path through the gate alleviates the vanishing gradient problem (it can be seen as a multiplicative skip connection); a minimal code sketch follows this list
* Adaptive softmax (efficient output layer for large vocabularies)
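
A minimal PyTorch sketch of one gated convolutional (GLU) hidden layer, assuming a (batch, channels, time) layout; the class and parameter names are illustrative and not the authors' code. Left-only zero-padding (k − 1 zeros) is the standard way to keep the convolution causal.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedConvBlock(nn.Module):
    """One hidden layer h_l(X) = (X ∗ W + b) ⊗ σ(X ∗ V + c) as a causal 1-D convolution."""

    def __init__(self, in_channels: int, out_channels: int, kernel_size: int):
        super().__init__()
        self.kernel_size = kernel_size
        # A single convolution produces both the linear part (X ∗ W + b) and
        # the gate pre-activation (X ∗ V + c), stacked along the channel axis.
        self.conv = nn.Conv1d(in_channels, 2 * out_channels, kernel_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time); pad only the left so the kernel
        # never sees future tokens.
        x = F.pad(x, (self.kernel_size - 1, 0))
        x = self.conv(x)
        # F.glu splits the channels in half and returns a ⊗ σ(b),
        # i.e. the gated linear unit from the equation above.
        return F.glu(x, dim=1)

# Example: 8 sequences, 128 channels, 50 time steps.
block = GatedConvBlock(in_channels=128, out_channels=128, kernel_size=4)
out = block(torch.randn(8, 128, 50))   # shape: (8, 128, 50)
```

For the output layer, PyTorch's `torch.nn.AdaptiveLogSoftmaxWithLoss` provides an adaptive softmax, which avoids computing the full softmax over a large vocabulary.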
Experiment
- Datasets: Google Billion Word (GBW), WikiText-103
- Optimization: Nesterov's momentum, gradient clipping (norm clipped to 0.1), weight normalization (see the training-step sketch after this list)
- Stable and fast convergence can be achieved even with a large learning rate such as 1
- Hyper-parameter search
- Result
- Speed: GCNN-22 vs. LSTM (2048 units): better throughput and responsiveness
- Complexity: GCNN-22 vs. LSTM (2048 units): fewer parameters and fewer FLOPs/token
- Gating mechanism: GLU > (GTU (LSTM-style gating) ≈ ReLU) > Tanh
- Non-linear modeling: GLU > Linear > Bilinear
- Network depth: the deeper, the better
- Context: a larger context size yields lower test perplexity, but with diminishing returns
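
A minimal sketch of the optimization recipe noted above (Nesterov's momentum, gradient-norm clipping at 0.1, weight normalization, large learning rate); the model, momentum value, and loss are placeholders for illustration, not taken from the paper.

```python
import torch
import torch.nn as nn

# Placeholder model: weight normalization applied to a single layer for illustration.
model = nn.utils.weight_norm(nn.Linear(512, 512))
loss_fn = nn.CrossEntropyLoss()

# Nesterov momentum with a large learning rate (e.g. 1), as noted above;
# the momentum value here is illustrative.
optimizer = torch.optim.SGD(model.parameters(), lr=1.0, momentum=0.99, nesterov=True)

def train_step(inputs: torch.Tensor, targets: torch.Tensor) -> float:
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()
    # Clip the global gradient norm to 0.1 before the parameter update.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.1)
    optimizer.step()
    return loss.item()

# Example step with random data (batch of 32, 512 output classes).
print(train_step(torch.randn(32, 512), torch.randint(0, 512, (32,))))
```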