# spaCy v3.0.0: Chinese Word Segmentation in Specialized Domains
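spaCy 3.0.0 was officially released this year. Fortunately, the five new transformer-based pipelines it ships include a pretrained Chinese model (zh_core_web_trf). Its reported scores on Chinese are also quite strong: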
PACKAGE | LANGUAGE | TRANSFORMER | TAGGER | PARSER | NER |
---|---|---|---|---|---|
en_core_web_trf | English | roberta-base | 97.8 | 95.2 | 89.9 |
de_dep_news_trf | German | bert-base-german-cased | 99.0 | 95.8 | - |
es_dep_news_trf | Spanish | bert-base-spanish-wwm-cased | 98.2 | 94.6 | - |
fr_dep_news_trf | French | camembert-base | 95.7 | 94.4 | - |
zh_core_web_trf | Chinese | bert-base-chinese | 92.5 | 76.6 | 75.4 |
Three other pretrained models are also available for Chinese:
- zh_core_web_sm-3.0.0: small
- zh_core