Approximately One Million Datasets and One Benchmark Framework for Natural Language


weixin_26729375


Original link: https://medium.com/swlh/approximately-one-million-datasets-and-one-benchmark-framework-for-your-natural-language-e4541fa217b2


In this article, I will show how to retrieve close to one million public text or PDF documents. Some of these documents are raw text, some are clean text, and some include categorical labelling. I will also introduce KILT, a benchmark framework for natural language models.


List of Lists of Public NLP Datasets

The following are non-exhaustive lists of lists of NLP datasets:

Raw text

Awesome-Public-Datasets;
Project Gutenberg: File Repository;
Project Gutenberg: Top 100 EBooks as of 8/15/2020;
Google Books API for Python;
Google Books Ngram Viewer;
Google datasets;
textacy datasets;
Kaggle datasets;
fast.ai datasets;
UCI Machine Learning Repository datasets;
pyquora: A Python module to fetch and parse data from Quora;
Zillow: Real Estate and Mortgage Data;
readthedocs.org.
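Raw text from sources such as Project Gutenberg usually needs a cleaning pass before it is useful to a model. The following is a minimal sketch of one such pass, assuming the plain-text file carries the `*** START OF ... PROJECT GUTENBERG EBOOK` and `*** END OF ...` marker lines that Gutenberg files commonly use. The function name and the sample text are made up for illustration, and the patterns are deliberately loose because the marker wording varies between files.

```python
import re

# Marker lines that Project Gutenberg plain-text files commonly use to fence
# off the book body. The exact wording varies between files, so these
# patterns are an assumption, not an official specification.
START_RE = re.compile(
    r"^\*\*\*\s*START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$", re.MULTILINE
)
END_RE = re.compile(
    r"^\*\*\*\s*END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$", re.MULTILINE
)


def strip_gutenberg_boilerplate(raw: str) -> str:
    """Return the text between the START and END markers.

    If either marker is missing (or they are out of order), the input is
    returned with only surrounding whitespace removed.
    """
    start = START_RE.search(raw)
    end = END_RE.search(raw)
    if not start or not end or end.start() <= start.end():
        return raw.strip()
    return raw[start.end():end.start()].strip()


# Made-up sample standing in for a downloaded file.
sample = (
    "The Project Gutenberg eBook of Example\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "Call me Ishmael.\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "License text follows.\n"
)
clean = strip_gutenberg_boilerplate(sample)
print(clean)
```

Falling back to the unmodified text when the markers are absent keeps the helper safe to run over a mixed collection in which only some files come from Gutenberg.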

Federal

Bias

StereoSet is a dataset of 17,000 sentences that measure model preferences across gender, race, religion, and profession. StereoSet is used to measure bias in NLP models.
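To make "model preference" concrete, here is a minimal sketch of a stereotype score: the percentage of sentence pairs for which a model scores the stereotypical sentence above the anti-stereotypical one. It is an illustration of the idea rather than StereoSet's own evaluation code; the function name and the log-probabilities are hypothetical.

```python
from typing import List, Tuple


def stereotype_score(pairs: List[Tuple[float, float]]) -> float:
    """Percentage of pairs where the stereotypical sentence scores higher.

    Each pair is (score_stereotype, score_anti_stereotype), for example
    model log-probabilities. A value of 50.0 means no measured preference
    in either direction.
    """
    if not pairs:
        raise ValueError("need at least one scored pair")
    preferred = sum(1 for stereo, anti in pairs if stereo > anti)
    return 100.0 * preferred / len(pairs)


# Hypothetical log-probabilities for four sentence pairs (made-up numbers).
scores = [(-10.2, -11.0), (-8.5, -8.1), (-12.0, -12.4), (-9.9, -9.3)]
ss = stereotype_score(scores)
print(f"stereotype score: {ss:.1f}")
```

A score far above or below 50 would suggest that the model systematically favors one kind of association over the other.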

COVID-19, Medical and NIH Raw Text

What to Do If You Are Sick;
https://www.cdc.gov/coronavirus/2019-ncov/downloads/10Things.pdf;
COVID-19 Open Research Dataset (CORD-19). CORD-19 is a resource of over 200,000 scholarly articles, including over 97,000 with full text, about COVID-19, SARS-CoV-2, and related coronaviruses;
CDC Guidance Documents;
Biosafety in Microbiological and Biomedical Laboratories;
National Institutes of Health (NIH) Funding: FY1995-FY2021 (search NIH PDF);
WebMD text.
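Because only part of the COVID-19 research corpus comes with full text, a first step is often to filter the metadata down to those articles. The sketch below assumes a `metadata.csv` file whose columns include `cord_uid`, `pdf_json_files`, and `pmc_json_files`; check the header of the release you download, since the column names are an assumption here. The sample rows are made up.

```python
import csv
import io

# Columns assumed to hold paths to full-text parses. Verify them against the
# header of the metadata file in the release you actually download.
FULL_TEXT_COLUMNS = ("pdf_json_files", "pmc_json_files")


def rows_with_full_text(csv_file):
    """Yield metadata rows that point to at least one full-text file."""
    for row in csv.DictReader(csv_file):
        if any((row.get(col) or "").strip() for col in FULL_TEXT_COLUMNS):
            yield row


# Tiny in-memory stand-in for metadata.csv (made-up rows).
sample = io.StringIO(
    "cord_uid,title,pdf_json_files,pmc_json_files\n"
    "a1,Paper with PDF parse,document_parses/pdf_json/x.json,\n"
    "b2,Abstract only,,\n"
    "c3,Paper with PMC parse,,document_parses/pmc_json/y.xml.json\n"
)
full_text_ids = [row["cord_uid"] for row in rows_with_full_text(sample)]
print(full_text_ids)
```

For a real file, pass `open("metadata.csv", newline="", encoding="utf-8")` in place of the in-memory sample. The generator reads one row at a time, so the whole file never has to fit in memory.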

Non-English Natural Language

Specialized NLP datasets


The Ultimate Meta-Dataset

Google Dataset Search: Finding Millions of Datasets on the Web

Google Dataset Search was released to the public in January 2020 [1].

Instead of grepping or web scraping for a dataset of interest, you can filter many candidate PDFs, Word documents, images, sound files, structured data, and somebody-already-created-it-for-you datasets with Google Dataset Search.

Benchmarking

KILT

Challenging problems such as open-domain question answering, fact checking, slot filling and entity linking require access to large, external knowledge sources. While some models do well on individual tasks, developing general models is difficult as each task might require computationally expensive indexing of custom knowledge sources, in addition to dedicated infrastructure [2].

KILT (Knowledge Intensive Language Tasks) is a benchmark for natural language models. It covers a wide range of knowledge-intensive tasks.

Admittedly, it is "specialized" for Natural Language Processing (NLP) models.

However, the Turing test, the widely accepted test for AGI (Artificial General Intelligence), is itself natural language-based [3].

… solving knowledge-intensive tasks requires (even for humans) access to a large body of information [].


KILT uses 5.9 million Wikipedia pages as its knowledge base [2,4].
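Each benchmark example can then point back to pages in that knowledge base as supporting evidence. The following is a minimal sketch, assuming the JSON-lines layout described in the KILT repository: an `id`, an `input`, and an `output` list whose entries may carry an `answer` and a `provenance` list with a `wikipedia_id`. The helper names, the sample record, and its ids are hypothetical, and the hit check is a simplification rather than the benchmark's official metric.

```python
import json


def gold_wikipedia_ids(record: dict) -> set:
    """Collect the Wikipedia page ids listed as provenance in one record."""
    ids = set()
    for output in record.get("output", []):
        for prov in output.get("provenance", []):
            if "wikipedia_id" in prov:
                ids.add(str(prov["wikipedia_id"]))
    return ids


def page_hit(record: dict, retrieved_ids: list) -> bool:
    """True if any retrieved page id matches a gold provenance page."""
    return bool(gold_wikipedia_ids(record) & {str(i) for i in retrieved_ids})


# One made-up line in the assumed JSON-lines layout (ids are illustrative).
line = json.dumps({
    "id": "q-0001",
    "input": "Who wrote Moby-Dick?",
    "output": [{
        "answer": "Herman Melville",
        "provenance": [{"wikipedia_id": "12345", "title": "Moby-Dick"}],
    }],
})
record = json.loads(line)
print(gold_wikipedia_ids(record))
print(page_hit(record, ["999", "12345"]))
```

Keeping provenance as page ids from one shared snapshot is what would let a single retrieval index serve every task, instead of a custom index per task.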

By starting with a large corpus and continuing to feed more text into KILT, the researchers at Facebook hope that KILT will serve as a benchmark for any NLP model in any domain.

Being able to benchmark any domain is a lofty goal. Below are domain-specific NLP tasks:


Business-specific entities, like artifacts, events, and actors;
Relationships between entities;
Business processes;
Meta-knowledge: knowledge about what knowledge you have.

Summary

You were presented with 33 lists of datasets.

fast.ai probably has the datasets most commonly used by researchers.

Kaggle has datasets of text, Q&A, structured data, audio, and 2-D and 3-D images.

You were presented with a formal NLP benchmark framework: KILT.

Finally, you were introduced to an awesome dataset search engine: Google Dataset Search.

Compiling lists of datasets has helped me. I hope it helps you.

Translated from: https://medium.com/swlh/approximately-one-million-datasets-and-one-benchmark-framework-for-your-natural-language-e4541fa217b2