Recent Papers and Datasets on Chinese Word Segmentation

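This article introduces the Chinese word segmentation task, lists a range of systems, including approaches based on BERT and LSTM-CRF models, and discusses the F1-score evaluation metric. It also mentions several Chinese word segmentation datasets: Chinese Treebank 6 and 7, AS, CityU, PKU, and MSR.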


Chinese Word Segmentation

Task

Chinese word segmentation is the task of splitting Chinese text (a sequence of Chinese characters) into words.

Example:

'上海浦东开发与建设同步' → ['上海', '浦东', '开发', '与', '建设', '同步']
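
The example above can also be viewed as a per-character labeling problem. The sketch below is only an illustration and is not taken from any of the systems listed here: it assumes the widely used BMES tag scheme (B = begin, M = middle, E = end, S = single-character word), and the function names `words_to_bmes` and `bmes_to_words` are hypothetical.

```python
def words_to_bmes(words):
    """Convert a list of words into one BMES tag per character."""
    tags = []
    for word in words:
        if not word:
            continue  # skip empty strings defensively
        if len(word) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


def bmes_to_words(text, tags):
    """Rebuild the word list from raw text and its BMES tags."""
    words, start = [], 0
    for i, tag in enumerate(tags):
        if tag in ("E", "S"):  # a word ends at this character
            words.append(text[start:i + 1])
            start = i + 1
    if start < len(text):  # trailing characters without a closing tag
        words.append(text[start:])
    return words


words = ['上海', '浦东', '开发', '与', '建设', '同步']
text = "".join(words)
tags = words_to_bmes(words)
print("".join(tags))
print(bmes_to_words(text, tags))
```

Under this view, a segmenter only has to predict one tag per character, and the word list is recovered deterministically from the tags.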

Systems

♠ marks systems that use character unigrams as input.
♣ marks systems that use character bigrams as input.

  • Huang et al. (2019): BERT + model compression + multi-criteria learning ♠
  • Yang et al. (2018): Lattice LSTM-CRF + BPE subword embeddings ♠♣
  • Ma et al. (2018): BiLSTM-CRF + hyperparameter search ♠♣
  • Yang et al. (2017): Transition-based + beam search + rich pretraining ♠♣
  • Zhou et al. (2017): Greedy search + word context ♠
  • Chen et al. (2017): BiLSTM-CRF + adversarial loss ♠♣
  • Cai et al. (2017): Greedy search + span representation ♠
  • Kurita et al. (2017): Transition-based + joint model ♠
  • Liu et al. (2016): Neural semi-CRF ♠
  • Cai and Zhao (2016): Greedy search ♠
  • Chen et al. (2015a): Gated recursive NN ♠♣
  • Chen et al. (2015b): BiLSTM-CRF ♠♣
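
To make the ♠ / ♣ distinction concrete, here is a minimal sketch of what character unigram and character bigram inputs look like for a sentence. The helper names are hypothetical, and the sketch assumes the simplest form of bigram extraction with no boundary padding; individual systems may pad sentence edges or window the bigrams differently.

```python
def char_unigrams(text):
    """Each character on its own: the ♠ style of input."""
    return list(text)


def char_bigrams(text):
    """Each pair of adjacent characters: the ♣ style of input.

    Assumption: no start/end padding symbol is added.
    """
    return [text[i:i + 2] for i in range(len(text) - 1)]


sentence = "上海浦东"
print(char_unigrams(sentence))
print(char_bigrams(sentence))
```

In a neural system, each unigram or bigram would typically be mapped to an embedding vector before being fed to the model.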

Evaluation

Metrics

F1-score
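
As a rough illustration of how a word-level F1-score can be computed for segmentation, the sketch below treats each word as a character span and counts a predicted word as correct only when both of its boundaries match a gold word. This is an assumption about the scoring convention, not the official scorer of any dataset listed here, and the function names are hypothetical. With P = correct / predicted words and R = correct / gold words, F1 = 2PR / (P + R).

```python
def to_spans(words):
    """Map a word list to a set of (start, end) character offsets."""
    spans, start = set(), 0
    for word in words:
        end = start + len(word)
        spans.add((start, end))
        start = end
    return spans


def segmentation_prf(gold_words, pred_words):
    """Return word-level (precision, recall, F1) for one sentence."""
    gold, pred = to_spans(gold_words), to_spans(pred_words)
    correct = len(gold & pred)  # words with both boundaries right
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


gold = ['上海', '浦东', '开发', '与', '建设', '同步']
pred = ['上海', '浦东', '开发', '与建设', '同步']  # one merge error
p, r, f1 = segmentation_prf(gold, pred)
print(f"P={p:.3f} R={r:.3f} F1={f1:.3f}")
```

Here 4 of the 5 predicted words and 4 of the 6 gold words match, so P = 0.8, R ≈ 0.667, and F1 = 8/11 ≈ 0.727. The scores in the tables below are on a 0 to 100 scale, so this value would appear as about 72.7.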

Dataset

Chinese Treebank 6
Model | F1 | Paper / Source Code
Huang et al. (2019) | 97.6 | Toward Fast and Accurate Neural Chinese Word Segmentation with Multi-Criteria Learning
Ma et al. (2018) | 96.7 | State-of-the-art Chinese Word Segmentation with Bi-LSTMs
Yang et al. (2018) | 96.3 | Subword Encoding in Lattice LSTM for Chinese Word Segmentation / GitHub
Yang et al. (2017) | 96.2 | Neural Word Segmentation with