EVA: A Chinese Open-Domain Dialogue Pre-trained Model


License: CC 4.0 BY-SA


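This article introduces EVA, a 2.8-billion-parameter Chinese dialogue model pre-trained on WDC-Dialogue, a dataset built from Chinese social media. EVA is reported to outperform CDial-GPT and CPM, and the model details and experimental results indicate strong performance in multi-domain dialogue generation.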

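At the time of its release, EVA was described as the largest Chinese open-domain dialogue pre-trained model. It has 2.8 billion parameters and is pre-trained on WDC-Dialogue, a dataset containing 1.4 billion context-response pairs across multiple domains. Experiments show that EVA outperforms other Chinese pre-trained dialogue models on both automatic and human evaluation metrics.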

Official site: BAAI Open Source Platform (wudaoai.cn)

GitHub: BAAI-WuDao/EVA

Paper link: https://arxiv.org/abs/2108.01547.

2 Dataset

We construct a dataset named WDC-Dialogue from Chinese social media to train EVA. Specifically, conversations are gathered from various sources, and a rigorous data cleaning pipeline is designed to ensure the quality of WDC-Dialogue. We mainly focus on three categories of textual interaction data: reposts on social media, comments and replies on various online forums, and online question-and-answer (Q&A) exchanges. Each round of these textual interactions yields a dialogue session via well-designed parsing rules. The following table shows statistics of the filtered WDC-Dialogue dataset and other Chinese dialogue datasets.
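To make the notion of a context-response pair concrete, here is a minimal sketch of how one parsed dialogue session could be expanded into training pairs. The source does not describe the actual parsing rules or pairing scheme used for WDC-Dialogue, so the function name `session_to_pairs`, the `max_context_turns` parameter, and the sliding-window strategy are all assumptions made for illustration only.

```python
from typing import List, Tuple

# Hypothetical helper: not taken from the EVA codebase or paper.
def session_to_pairs(
    session: List[str], max_context_turns: int = 3
) -> List[Tuple[List[str], str]]:
    """Expand one dialogue session into (context, response) pairs.

    Every utterance after the first is treated as a response, and the
    utterances immediately before it (up to max_context_turns) form
    its context.
    """
    pairs = []
    for i in range(1, len(session)):
        start = max(0, i - max_context_turns)
        context = session[start:i]
        pairs.append((context, session[i]))
    return pairs


# Example: a three-utterance session (e.g. a post and two replies).
example_session = ["original post", "first reply", "second reply"]
for context, response in session_to_pairs(example_session):
    print(context, "->", response)
```

Under this assumed scheme, a session with n utterances yields n - 1 pairs, and a session with a single utterance yields none.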

