LlaSMol LLM for Chemistry with a Large-Scale, Comprehensive, High-Quality Instruction Tuning Dataset


This article introduces LlaSMol, an LLM tuned on SMolInstruct, a large-scale, comprehensive, high-quality chemistry instruction dataset. LlaSMol surpasses GPT-4 on 14 chemistry tasks and approaches the performance of task-specific SoTA models. The study also examines the impact of the number of trained parameters.

This post is part of the LLM article series; it is a translated summary of the paper "LlaSMol: Advancing Large Language Models for Chemistry with a Large-Scale, Comprehensive, High-Quality Instruction Tuning Dataset".

Summary of Main Content

  1. Background and problem: Chemistry is central to drug discovery, materials science, and related fields, yet large language models (LLMs) such as GPT-4 perform poorly on chemistry tasks. Prior work shows their performance falls far below that of task-specific models, in particular because they understand the SMILES molecular representation poorly.
  2. Dataset construction: The paper proposes SMolInstruct, a dataset covering 14 chemistry tasks (e.g., name conversion, property prediction, chemical reactions) with over 3 million samples. Data are drawn from multiple sources such as PubChem and MoleculeNet, and pass strict quality control (e.g., filtering invalid SMILES and correcting erroneous information).
  3. Model development: Open-source LLMs are fine-tuned on SMolInstruct to form the LlaSMol model series. Experiments show that Mistral is the most effective base model, and that LlaSMol outperforms GPT-4 and Claude 3 Opus on many tasks.
  4. Key findings:
    • Canonical SMILES improves model performance, and using SMILES is more effective than SELFIES.
    • Multi-task training facilitates knowledge sharing, though the tasks remain relatively independent.
    • Fine-tuning only 0.58% of the parameters lets LlaSMol approach task-specific model performance, indicating substantial headroom.
  5. Limitations and future directions: Evaluation of the molecule captioning task is not accurate enough, and the model's generalization ability has not been studied in depth; future work will optimize the training process and extend the application scenarios.
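The quality-control step described above (filtering invalid SMILES) can be sketched as a simple filter. A real pipeline would parse each string with a cheminformatics toolkit such as RDKit (`Chem.MolFromSmiles`); the self-contained stand-in below only checks bracket balance and paired ring-closure digits, which already catches many truncation and extraction errors. Note it is a heuristic: bracket atoms with digits (e.g., isotopes) and `%nn` ring closures would need the full parser.

```python
from collections import Counter

def looks_like_valid_smiles(s: str) -> bool:
    """Cheap structural sanity check for a SMILES string.

    A real pipeline would parse with RDKit; this stand-in only
    verifies balanced brackets and paired ring-closure digits.
    """
    if not s:
        return False
    # Parentheses and square brackets must nest correctly.
    stack = []
    pairs = {")": "(", "]": "["}
    for ch in s:
        if ch in "([":
            stack.append(ch)
        elif ch in ")]":
            if not stack or stack.pop() != pairs[ch]:
                return False
    if stack:
        return False
    # Each ring-closure digit must be opened and closed (even count).
    digits = Counter(ch for ch in s if ch.isdigit())
    return all(n % 2 == 0 for n in digits.values())

samples = ["CCO", "c1ccccc1", "C1CC", "CC(C"]
filtered = [s for s in samples if looks_like_valid_smiles(s)]
print(filtered)  # ['CCO', 'c1ccccc1']
```

Ethanol (`CCO`) and benzene (`c1ccccc1`) pass, while the truncated ring `C1CC` and the unbalanced `CC(C` are rejected.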

Innovations

  1. Large-scale, high-quality dataset: SMolInstruct
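The 0.58% trainable-parameter figure reported in the key findings is characteristic of LoRA-style parameter-efficient fine-tuning. The back-of-envelope sketch below shows how such a setup yields a fraction of that order of magnitude; the dimensions (a generic 7B-class decoder, rank 16, attention projections only) are illustrative assumptions, not LlaSMol's actual configuration.

```python
# Rough arithmetic for the trainable fraction under LoRA-style tuning.
# All dimensions are illustrative (a generic 7B-class decoder),
# not the actual LlaSMol configuration.

def lora_params(shapes, rank):
    # Each adapted (d, k) weight gains two low-rank factors: d*r + r*k.
    return sum(rank * (d + k) for d, k in shapes)

hidden, layers, rank = 4096, 32, 16
# Adapt the four attention projections (q, k, v, o) in every layer.
adapted = [(hidden, hidden)] * 4 * layers
trainable = lora_params(adapted, rank)

total = 7_000_000_000  # nominal 7B base model
fraction = trainable / total
print(f"trainable: {trainable:,} ({fraction:.2%} of {total:,})")
# ≈ 0.24% with these numbers
```

Varying the rank or adapting additional matrices (e.g., the MLP projections) moves the fraction up or down, which is how a figure such as 0.58% arises in practice.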