[Research] | 101-Level Explanation | Corpora vs. Datasets

Repost · last modified 2023-03-06 18:07:42

Licensed under CC 4.0 BY-SA

First published 2023-03-01 20:27:48

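This article discusses how corpora and datasets differ in linguistic research. Both contain linguistic production and provide annotations, but a corpus is a representative sample of actual language production, with a broad context and a general purpose, whereas a dataset targets a specific linguistic phenomenon in a restricted context, with annotations tied directly to a specific research question.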

Both contain linguistic production.
Both usually provide further information about the production in the form of annotations.
These annotations can be linguistic in nature, but may also reveal meta-information about the language producer or the context in which the production took place.

A corpus is a representative sample of actual language production within a meaningful context and with a general purpose.
A dataset is a representative sample of a specific linguistic phenomenon in a restricted context and with annotations that relate to a specific research question.
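
To make the contrast concrete, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the `Production` class, the `build_dataset` helper, the annotation keys, and the example sentences are all hypothetical and do not come from any real corpus or library. It also assumes, for simplicity, that the dataset is carved out of the corpus, which is only one of several ways a dataset might be built.

```python
from dataclasses import dataclass, field


@dataclass
class Production:
    """One piece of linguistic production plus its annotations (hypothetical type)."""
    text: str
    annotations: dict = field(default_factory=dict)


# Corpus-style sample: whole productions kept in context, with
# general-purpose annotations about the producer and the setting.
corpus = [
    Production("I would of gone, but it rained all day.",
               {"speaker_age": 34, "register": "informal", "medium": "chat"}),
    Production("The committee has approved the proposal.",
               {"speaker_age": 51, "register": "formal", "medium": "email"}),
]


def build_dataset(corpus, phrase, label_key):
    """Keep only productions that contain `phrase` and attach a single
    annotation tied to one research question (hypothetical helper)."""
    return [
        Production(p.text, {label_key: True})
        for p in corpus
        if phrase in p.text
    ]


# Dataset-style sample: restricted to one phenomenon, with one task-specific label.
dataset = build_dataset(corpus, "would of", "nonstandard_modal")
```

In this sketch the corpus entries keep broad, reusable metadata, while the dataset entries keep only the one label needed for the question being asked.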