A Curated Overview of Public LegalAI Datasets (continuously updated…)

诸神缄默不语: index of my CSDN blog posts

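1. Legal judgment prediction

Chinese:

  1. CAIL2018
    Criminal law
    1. Original papers: CAIL2018: A Large-Scale Legal Dataset for Judgment Prediction
      Overview of CAIL2018: Legal Judgment Prediction Competition
    2. Data download: https://cail.oss-cn-qingdao.aliyuncs.com/CAIL2018_ALL_DATA.zip (for a detailed description of the data beyond the papers above, see thunlp/CAIL: Chinese AI & Law Challenge)
    3. Task: (classification) predict the applicable law articles, the charges, and the prison term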
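As a rough illustration of the three classification targets, the sketch below parses one record and pulls out the law articles, the charges, and the prison term. The field names (`fact`, `meta`, `relevant_articles`, `accusation`, `term_of_imprisonment`) and the unit of `imprisonment` (months) reflect my understanding of the released JSON-lines files and should be checked against the downloaded data. The sample record is made up for illustration and is not real CAIL2018 data.

```python
import json

# A made-up record in the assumed JSON-lines layout (not real CAIL2018 data).
SAMPLE_LINE = json.dumps(
    {
        "fact": "(case description text)",
        "meta": {
            "relevant_articles": [264],
            "accusation": ["盗窃"],
            "term_of_imprisonment": {
                "death_penalty": False,
                "life_imprisonment": False,
                "imprisonment": 6,  # assumed to be in months
            },
        },
    },
    ensure_ascii=False,
)


def extract_labels(line: str) -> dict:
    """Turn one JSON line into the three prediction targets."""
    meta = json.loads(line)["meta"]
    term = meta["term_of_imprisonment"]
    # Death penalty and life imprisonment are flags; otherwise use the numeric term.
    if term["death_penalty"]:
        term_label = "death"
    elif term["life_imprisonment"]:
        term_label = "life"
    else:
        term_label = term["imprisonment"]
    return {
        "articles": [int(a) for a in meta["relevant_articles"]],
        "charges": list(meta["accusation"]),
        "term": term_label,
    }


labels = extract_labels(SAMPLE_LINE)
print(labels)
```

Articles and charges are lists because a single case can involve several of each, so those two targets are naturally multi-label.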

2. General corpora

Multilingual:

  1. MultiLegalPile
    1. Original paper: (2023) MultiLegalPile: A 689GB Multilingual Legal Corpus
    2. Data download: https://huggingface.co/datasets/joelito/Multi_Legal_Pile
    3. Data included in the project:
      1. https://huggingface.co/datasets/joelito/eurlex_resources
      2. https://huggingface.co/datasets/joelito/legal-mc4
      3. Pile of Law
  2. LeXFiles
    1. Original paper: (2023 ACL) LeXFiles and LegalLAMA: Facilitating English Multinational Legal Language Model Development

Spanish:

  1. Spanish Legal Domain Corpora
    1. Original paper: (2021) Spanish Legalese Language Model and Corpora
    2. Data download: Spanish Legal Domain Corpora | Zenodo

English:

  1. CaseHOLD
    English Harvard Law case corpus (1965-2021)
    1. Original paper: (2021 ICAIL) When does pretraining help?: assessing self-supervised learning for law and the CaseHOLD dataset of 53,000+ legal holdings
  2. Pile of Law
    1. Original paper: (2022 NeurIPS) Pile of Law: Learning Responsible Data Filtering from the Law and a 256GB Open-Source Legal Dataset
    2. Data download: https://huggingface.co/datasets/pile-of-law/pile-of-law
  3. (Multinational) LeXFiles and LegalLAMA
    1. Original paper: (2023 ACL) LeXFiles and LegalLAMA: Facilitating English Multinational Legal Language Model Development
    2. LeXFiles is a collection of corpora; LegalLAMA is a benchmark for evaluating model performance (modeled on LAMA)
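    3. Both are hosted on the Hugging Face Hub and can be loaded with the datasets library:
      from datasets import load_dataset

      # LeXFiles corpus, EU legislation subset
      lex_files = load_dataset('lexlms/lex_files', name='eu-legislation')

      # LegalLAMA benchmark, contract sections task
      legal_lama = load_dataset('lexlms/legal_lama', name='contract_sections')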

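Chinese:

  1. Legal consultation data from 华律网 (Hualv) and the corpus used in the accompanying thesis; published alongside the paper: 法律咨询文本分类系统设计与研究 (Design and research of legal consultation text classification system)
    The legal consultation data and corpus of the thesis from China law network. Replication Data for: Design and research of legal consultation text classification system. - Data Driven Innovation Research Competition for University of China

Portuguese:

  1. https://github.com/alfaneo-ai/brazilian-legal-text-dataset (Brazil)

3. Other aggregated projects

Multilingual: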

  1. LexGLUE
    coastalcph/lex-glue: LexGLUE: A Benchmark Dataset for Legal Language Understanding in English
    1. Original paper: (2021) LexGLUE: A Benchmark Dataset for Legal Language Understanding in English
  2. LEXTREME
    1. Original paper: (2023) LEXTREME: A Multi-Lingual and Multi-Task Benchmark for the Legal Domain
    2. Data download: https://huggingface.co/datasets/joelito/lextreme

Not yet sorted through:

  1. https://github.com/neelguha/legal-ml-datasets

4. Reasoning

  1. legalbench
    1. Original paper: (2022) LegalBench: Prototyping a Collaborative Benchmark for Legal Reasoning
    2. Data download: https://github.com/HazyResearch/legalbench

English:

  1. SARA: roughly speaking, reasoning about whether a given statute applies to a given case (covering 9 sections of the US tax code)
    1. Original paper: (2020) A Dataset for Statutory Reasoning in Tax Law Entailment and Question Answering

5. NLU

  1. SemEval 2023 Task 6: LegalEval - Understanding Legal Texts
    1. Tasks: rhetorical role labeling, named entity recognition, explainable court judgment prediction
  2. MAUD
    1. Original paper: (2023) MAUD: An Expert-Annotated Legal NLP Dataset for Merger Agreement Understanding
    2. Data download: https://drive.google.com/drive/folders/1RujOK2FZKdFSCJ15tqdyd42g8WLsYagj

6. NLG

1 QA

Chinese:

  1. JEC-QA
    Dataset built from China's National Judicial Examination
    https://jecqa.thunlp.org/
    1. Original paper: (2020 AAAI) JEC-QA: A Legal-Domain Question Answering Dataset
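Because exam questions of this kind can have more than one correct option, a natural way to score a model is exact match on the full set of chosen options. The sketch below shows that scoring rule only. The question ids, the `gold` and `predicted` dictionaries, and the option letters are all made up for illustration and do not come from the JEC-QA files, whose actual layout should be checked on the official site.

```python
def exact_match_accuracy(gold: dict, predicted: dict) -> float:
    """Fraction of questions whose predicted option set equals the gold set.

    A partially correct answer (e.g. picking only "B" when the gold answer
    is "B" and "D") counts as wrong under this rule.
    """
    if not gold:
        raise ValueError("gold must contain at least one question")
    hits = sum(
        1
        for qid, answers in gold.items()
        if set(predicted.get(qid, [])) == set(answers)
    )
    return hits / len(gold)


# Hypothetical gold answers and model predictions, keyed by question id.
gold = {"q1": ["A"], "q2": ["B", "D"], "q3": ["C"]}
predicted = {"q1": ["A"], "q2": ["B"], "q3": ["C"]}

accuracy = exact_match_accuracy(gold, predicted)
print(f"exact-match accuracy: {accuracy:.3f}")
```

Here `q2` is only partially answered, so two of the three questions count as correct.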

Vietnamese:

  1. (Traffic law) (2017 KSE) Question analysis for Vietnamese legal question answering

2 Text summarization

English:

  1. BillSum
    1. Original paper: (2019 WS) BillSum: A Corpus for Automatic Summarization of US Legislation
    2. Data download: billsum · Datasets at Hugging Face
  2. VerbCL (one-sentence summarization / highlight extraction based on the case citation graph)
    1. Original paper: (2021 CIKM) VerbCL: A Dataset of Verbatim Quotes for Highlight Extraction in Case Law
    2. Data download: https://uvaauas.figshare.com/articles/dataset/VerbCL_Dataset/14798878/1

Multilingual:

  1. EUR-Lex-Sum (24 official European languages)
    1. Original paper: (2022 EMNLP) EUR-Lex-Sum: A Multi- and Cross-lingual Dataset for Long-form Summarization in the Legal Domain
    2. Data download: dennlinger/eur-lex-sum · Datasets at Hugging Face
  2. Multi-LexSum
    1. Original paper: (2022) Multi-LexSum: Real-World Summaries of Civil Rights Lawsuits at Multiple Granularities
    2. Dataset website: https://multilexsum.github.io/

7. Information extraction

1 Named entity recognition

Portuguese (Brazil):

  1. CDJUR-BR
    1. Original paper: (2023) CDJUR-BR – A Golden Collection of Legal Document from Brazilian Justice with Fine-Grained Named Entities

2 Sentence boundary detection (sentence splitting)

Multilingual:

  1. MultiLegalSBD (English, Spanish, German, Italian, Portuguese, French)
    1. Original paper: (2023 ICAIL) MultiLegalSBD: A Multilingual Legal Sentence Boundary Detection Dataset
    2. Data download: https://huggingface.co/datasets/rcds/MultiLegalSBD
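To see why legal text needs a dedicated sentence boundary dataset, here is a deliberately naive regex splitter. It is not the method from the MultiLegalSBD paper, just a baseline I wrote for illustration, and the example sentence is made up. It shows how a citation abbreviation such as "v." produces a spurious split.

```python
import re

# Naive rule: a sentence ends at '.', '?' or '!' when followed by
# whitespace and an uppercase letter.
NAIVE_BOUNDARY = re.compile(r"(?<=[.?!])\s+(?=[A-Z])")


def naive_split(text: str) -> list:
    """Split text into sentences using the naive punctuation rule above."""
    return [piece for piece in NAIVE_BOUNDARY.split(text) if piece]


# Made-up example with a case citation; a human reader sees two sentences.
text = "The court cited Smith v. Jones. The appeal was dismissed."
sentences = naive_split(text)

for sentence in sentences:
    print(repr(sentence))
```

The splitter returns three pieces instead of two, because it treats the period in "v." as a sentence end. Annotated legal data makes it possible to train and evaluate models that avoid this kind of error.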

3 Argument mining

  1. English
    1. mining-legal-arguments
      1. Original paper: (2023) Mining Legal Arguments in Court Decisions
      2. Download: trusthlt/mining-legal-arguments: Mining Legal Arguments in Court Decisions - Data and software

4 Event extraction

  1. Chinese
    1. DLEE
      1. Original paper: (2024 Neural Computing and Applications) DLEE: a dataset for Chinese document-level legal event extraction

8. Automated contract review

  1. English
    1. (2021 NeurIPS) CUAD: An Expert-Annotated NLP Dataset for Legal Contract Review
      https://github.com/TheAtticusProject/cuad
      https://huggingface.co/datasets/theatticusproject/cuad-qa
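As far as I know, cuad-qa frames contract review as SQuAD-style extractive QA, where each record has a `context`, a `question`, and `answers` with parallel `text` and `answer_start` lists. The sketch below assumes that layout and checks that every answer span lines up with the contract text. The record itself (contract sentence and question wording) is made up for illustration and is not taken from CUAD.

```python
def extract_spans(record: dict) -> list:
    """Return (start, end, text) for each answer, verifying character offsets."""
    context = record["context"]
    answers = record["answers"]
    spans = []
    for text, start in zip(answers["text"], answers["answer_start"]):
        end = start + len(text)
        # The offset must point at exactly the annotated text.
        if context[start:end] != text:
            raise ValueError(f"answer span misaligned at offset {start}")
        spans.append((start, end, text))
    return spans


# Hypothetical record in the assumed SQuAD-style layout.
context = "This Agreement shall be governed by the laws of the State of New York."
clause = "the laws of the State of New York"
record = {
    "context": context,
    "question": "Which law governs this agreement?",
    "answers": {"text": [clause], "answer_start": [context.index(clause)]},
}

spans = extract_spans(record)
print(spans)
```

A record with empty `text` and `answer_start` lists simply yields no spans, which is how a clause type that is absent from the contract would be represented in this layout.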

9. Other tasks

Structured data:

  1. DiscoveringTheRationaleOfDecisions (for extracting the rationale behind decisions; I have not yet looked at what exactly it does)
    1. Original paper: (2021 ICAIL) Discovering the Rationale of Decisions: Experiments on Aligning Learning and Reasoning
    2. Data download: see the official GitHub project: CorSteging/DiscoveringTheRationaleOfDecisions: Discovering the Rationale of Decisions

  1. GENTLE (out-of-domain evaluation for English, including legal documents)
    1. Original paper: (2023 ACL) GENTLE: A Genre-Diverse Multilayer Challenge Set for English NLP and Linguistic Evaluation
    2. Download: gucorpling/gentle: Repository for the GENTLE corpus

10. Fairness

Multilingual:

  1. FairLex
    1. Original paper: (2022 ACL) FairLex: A Multilingual Benchmark for Evaluating Fairness in Legal Text Processing
    2. Data download: coastalcph/fairlex · Datasets at Hugging Face
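Fairness benchmarks of this kind typically compare model performance across groups defined by a protected attribute. The sketch below computes per-group accuracy, the worst-group score, and the spread across groups. It is a simplified illustration that uses accuracy rather than whatever metric a given benchmark prescribes, and the labels, predictions, and group names are made up rather than taken from FairLex.

```python
from collections import defaultdict
from statistics import pstdev


def groupwise_accuracy(y_true, y_pred, groups) -> dict:
    """Accuracy computed separately for each group value."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        total[group] += 1
        correct[group] += int(truth == pred)
    return {group: correct[group] / total[group] for group in total}


# Made-up labels, predictions, and group membership for eight examples.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 0, 1, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

per_group = groupwise_accuracy(y_true, y_pred, groups)
worst_group = min(per_group.values())
# Population standard deviation of the group scores as a simple disparity measure.
disparity = pstdev(per_group.values())

print(per_group, worst_group, disparity)
```

A model can look fine on average while one group scores much lower, which is why the worst-group score and the spread are reported alongside the overall number.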