Recent Papers and Datasets on Chinese Word Segmentation

License: CC 4.0 BY-SA

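This article introduces the Chinese word segmentation task, lists several systems, including methods based on BERT, LSTM-CRF, and other models, and discusses the F1-score evaluation metric. It also covers a number of Chinese word segmentation datasets, such as Chinese Treebank 6 and 7, AS, CityU, PKU, and MSR.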

Chinese Word Segmentation

Chinese word segmentation is the task of splitting Chinese text (a sequence of Chinese characters) into words.

Example:

'上海浦东开发与建设同步' → ['上海', '浦东', '开发', '与', '建设', '同步']

♠ marks systems that use character unigrams as input.
♣ marks systems that use character bigrams as input.
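
To make the two input types concrete, here is a minimal sketch of how character unigrams and bigrams can be extracted from a sentence. The function names `char_unigrams` and `char_bigrams` are hypothetical, chosen for illustration. Real systems usually map these units to embeddings and may pad sentence boundaries, and those details differ from paper to paper.

```python
def char_unigrams(text):
    """Return each character of the text as its own unit."""
    return list(text)


def char_bigrams(text):
    """Return every pair of adjacent characters, in order.

    No boundary padding is applied here; that is an assumption of this
    sketch, not a convention taken from any particular system.
    """
    return [text[i:i + 2] for i in range(len(text) - 1)]


sentence = "上海浦东"
print(char_unigrams(sentence))  # ['上', '海', '浦', '东']
print(char_bigrams(sentence))   # ['上海', '海浦', '浦东']
```

A sentence of n characters yields n unigrams and n - 1 bigrams under this scheme.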

F1-score
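
Segmentation F1 is conventionally computed at the word level: a predicted word counts as correct only if both of its boundaries match a word in the reference segmentation. The sketch below shows one way to compute it for a single sentence. The helper names `to_spans` and `segmentation_f1` are hypothetical, and published results come from each benchmark's own scoring setup, which may differ in details such as corpus-level aggregation.

```python
def to_spans(words):
    """Convert a list of words into a set of (start, end) character spans."""
    spans = set()
    start = 0
    for word in words:
        end = start + len(word)
        spans.add((start, end))
        start = end
    return spans


def segmentation_f1(gold_words, pred_words):
    """Word-level F1 between a gold and a predicted segmentation.

    Both segmentations are assumed to cover the same character sequence.
    """
    gold = to_spans(gold_words)
    pred = to_spans(pred_words)
    correct = len(gold & pred)
    if correct == 0:
        return 0.0
    precision = correct / len(pred)
    recall = correct / len(gold)
    return 2 * precision * recall / (precision + recall)


gold = ['上海', '浦东', '开发', '与', '建设', '同步']
pred = ['上海', '浦东', '开发', '与建设', '同步']
# 4 of 5 predicted words and 4 of 6 gold words match, so F1 = 8/11.
print(round(segmentation_f1(gold, pred), 4))  # 0.7273
```

Scores in the table are reported as percentages, so a value such as 97.6 corresponds to 0.976 on this scale.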

Model	F1	Paper / Source	Code
Huang et al. (2019)	97.6	Toward Fast and Accurate Neural Chinese Word Segmentation with Multi-Criteria Learning
Ma et al. (2018)	96.7	State-of-the-art Chinese Word Segmentation with Bi-LSTMs
Yang et al. (2018)	96.3	Subword Encoding in Lattice LSTM for Chinese Word Segmentation	GitHub
Yang et al. (2017)	96.2	Neural Word Segmentation with