Corpora and datasets are similar.
- Both contain linguistic production.
- Both usually provide further information about the production in the form of annotations.
- These annotations can be linguistic in nature, but they may also reveal meta-information about the language producer or the context in which the production took place.
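The shared structure described in these points can be pictured as a small data model. The following is only an illustrative sketch: the class `Production`, its field names, the helper `annotation_kinds`, and the sample values are all hypothetical and not taken from any particular corpus format or library.

```python
from dataclasses import dataclass, field


@dataclass
class Production:
    """One piece of linguistic production plus its annotations (hypothetical model)."""
    text: str
    # Annotations that are linguistic in nature, e.g. a tokenization.
    linguistic: dict = field(default_factory=dict)
    # Meta-information about the language producer or the context of production.
    meta: dict = field(default_factory=dict)


def annotation_kinds(production):
    """Return which kinds of annotation are present on a production."""
    kinds = []
    if production.linguistic:
        kinds.append("linguistic")
    if production.meta:
        kinds.append("meta")
    return kinds


# Made-up sample entry, for illustration only.
example = Production(
    text="She gave him the book",
    linguistic={"tokens": ["She", "gave", "him", "the", "book"]},
    meta={"producer": "speaker_01", "context": "informal conversation"},
)
```

Under this sketch, both a corpus and a dataset would simply be collections of such annotated productions; they differ in what is sampled and why, not in this basic shape.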
Two minimal definitions.
- A corpus is a representative sample of actual language production within a meaningful context and with a general purpose.
- A dataset is a representative sample of a specific linguistic phenomenon in a restricted context and with annotations that relate to a specific research question.
Summary
| | Prototypical corpus | Prototypical dataset |
|---|---|---|
| Language | unrestricted production | specific phenomenon |
| Context | wide | restricted |
| Purpose | general | research question |
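This article examines how corpora and datasets differ in linguistic research. Both contain linguistic production and provide annotations, but a corpus is a representative sample of actual language production with a wide context and a general purpose, whereas a dataset targets a specific linguistic phenomenon in a restricted context, with annotations tied directly to a specific research question.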