Chinese Word Segmentation
Task
Chinese word segmentation is the task of splitting Chinese text (a sequence of Chinese characters) into words.
Example:
'上海浦东开发与建设同步' → ['上海', '浦东', '开发', '与', '建设', '同步']
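One common way to frame this task, used by many CRF-style taggers, is per-character sequence labeling. The sketch below assumes the B/M/E/S tag scheme (begin, middle, end, single), which the text above does not specify. The helper names `words_to_bmes` and `bmes_to_words` are hypothetical, and individual systems in the list below may use other schemes or non-tagging approaches.

```python
def words_to_bmes(words):
    """Convert a list of words into one B/M/E/S tag per character."""
    tags = []
    for word in words:
        if len(word) == 1:
            tags.append("S")  # single-character word
        else:
            # first character, any middle characters, last character
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


def bmes_to_words(text, tags):
    """Recover words from text and its per-character B/M/E/S tags."""
    words, start = [], 0
    for i, tag in enumerate(tags):
        if tag in ("E", "S"):  # a word ends at this character
            words.append(text[start:i + 1])
            start = i + 1
    if start < len(text):  # tolerate a tag sequence that ends mid-word
        words.append(text[start:])
    return words


words = ["上海", "浦东", "开发", "与", "建设", "同步"]
text = "".join(words)
tags = words_to_bmes(words)
print("".join(tags))              # BEBEBESBEBE
print(bmes_to_words(text, tags))  # round-trips to the original word list
```

With this framing, a segmenter only has to predict one tag per character, and the word list is recovered deterministically from the tags.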
Systems
♠ marks systems that use character unigrams as input.
♣ marks systems that use character bigrams as input.
- Huang et al. (2019): BERT + model compression + multi-criteria learning ♠
- Yang et al. (2018): Lattice LSTM-CRF + BPE subword embeddings ♠♣
- Ma et al. (2018): BiLSTM-CRF + hyperparameter search ♠♣
- Yang et al. (2017): Transition-based + beam search + rich pretraining ♠♣
- Zhou et al. (2017): Greedy search + word context ♠
- Chen et al. (2017): BiLSTM-CRF + adversarial loss ♠♣
- Cai et al. (2017): Greedy search + span representation ♠
- Kurita et al. (2017): Transition-based + joint model ♠
- Liu et al. (2016): Neural semi-CRF ♠
- Cai and Zhao (2016): Greedy search ♠
- Chen et al. (2015a): Gated Recursive NN ♠♣
- Chen et al. (2015b): BiLSTM-CRF ♠♣
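To make the ♠ and ♣ markers concrete, here is a minimal sketch of what character unigram and character bigram inputs could look like for a sentence. The function names and the `</s>` padding symbol are assumptions for illustration. The listed systems differ in how they pad, window, and embed these units.

```python
def char_unigrams(text):
    """One input unit per character."""
    return list(text)


def char_bigrams(text, pad="</s>"):
    """One input unit per position: the character and its right neighbor.

    The last position is paired with a padding symbol (an assumed convention).
    """
    padded = list(text) + [pad]
    return [padded[i] + padded[i + 1] for i in range(len(text))]


sentence = "上海浦东"
print(char_unigrams(sentence))  # ['上', '海', '浦', '东']
print(char_bigrams(sentence))   # ['上海', '海浦', '浦东', '东</s>']
```

A system marked ♠♣ would typically look up an embedding for each unit in both sequences and combine them per position, for example by concatenation.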
Evaluation
Metrics
F1-score
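For segmentation, F1 is usually computed at the word level: a predicted word counts as correct only if both of its boundaries match a gold word. The sketch below assumes that convention. The function names and the imperfect prediction are made up for illustration and are not output from any system in this document. Note that it returns a fraction, whereas the table below reports percentages.

```python
def to_spans(words):
    """Map a word list to a set of (start, end) character offsets."""
    spans, start = set(), 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans


def segmentation_prf(gold_words, pred_words):
    """Word-level precision, recall, and F1 for one sentence."""
    gold, pred = to_spans(gold_words), to_spans(pred_words)
    correct = len(gold & pred)  # words with both boundaries right
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


gold = ["上海", "浦东", "开发", "与", "建设", "同步"]
# Hypothetical prediction that wrongly merges '开发' and '与'.
pred = ["上海", "浦东", "开发与", "建设", "同步"]
p, r, f1 = segmentation_prf(gold, pred)
print(f"P={p:.3f} R={r:.3f} F1={f1:.3f}")  # P=0.800 R=0.667 F1=0.727
```

Corpus-level scores are normally obtained by summing the correct, predicted, and gold word counts over all sentences before computing the ratios, rather than by averaging per-sentence F1.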
Dataset
Chinese Treebank 6
| Model | F1 | Paper / Source | Code |
|---|---|---|---|
| Huang et al. (2019) | 97.6 | Toward Fast and Accurate Neural Chinese Word Segmentation with Multi-Criteria Learning | |
| Ma et al. (2018) | 96.7 | State-of-the-art Chinese Word Segmentation with Bi-LSTMs | |
| Yang et al. (2018) | 96.3 | Subword Encoding in Lattice LSTM for Chinese Word Segmentation | Github |
| Yang et al. (2017) | 96.2 | Neural Word Segmentation with |

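This article introduces the Chinese word segmentation task and lists a range of systems, including methods based on BERT, LSTM-CRF, and other models, and discusses the F1-score evaluation metric. It also mentions a series of Chinese word segmentation datasets, such as Chinese Treebank 6 and 7, AS, CityU, PKU, and MSR.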





