A Survey of Backdoor Attacks and Defenses on Large Language Models

本文是LLM系列文章,针对《A Survey of Backdoor Attacks and Defenses on Large Language Models: Implications for Security Measures》的翻译。

摘要

大型语言模型 (LLM) 弥合了人类语言理解和复杂问题解决之间的差距,在多项 NLP 任务上实现了最先进的性能,特别是在少样本和零样本设置中。尽管 LMM 的功效显而易见,但由于计算资源的限制,用户必须使用开源语言模型或将整个训练过程外包给第三方平台。然而,研究表明,语言模型容易受到潜在安全漏洞的影响,特别是在后门攻击中。后门攻击旨在通过毒害训练样本或模型权重,将目标漏洞引入到语言模型中,从而使攻击者能够通过恶意触发器操纵模型响应。虽然现有的后门攻击调查提供了全面的概述,但缺乏对专门针对LLM的后门攻击的深入研究。为了弥补这一差距并掌握该领域的最新趋势,本文通过重点关注微调方法,提出了一种关于 LLM 后门攻击的新颖视角。具体来说,我们系统地将后门攻击分为三类:全参数微调、参数高效微调和无微调攻击。基于大量评论的见解,我们还讨论了未来后门攻击研究的关键问题,例如进一步探索不需要微调的攻击算法,或开发更隐蔽的攻击算法。

1 引言

2 大型语言模型后门攻击的背景

3 大型语言模型的后门攻击

4 后门攻击的应用

5 关于防御后门攻击的简要讨论

<

这一段讲的是什么:Abstract—A recent trojan attack on deep neural network (DNN) models is one insidious variant of data poisoning attacks. Trojan attacks exploit an effective backdoor created in a DNN model by leveraging the difficulty in interpretability of the learned model to misclassify any inputs signed with the attacker’s chosen trojan trigger. Since the trojan trigger is a secret guarded and exploited by the attacker, detecting such trojan inputs is a challenge, especially at run-time when models are in active operation. This work builds STRong Intentional Perturbation (STRIP) based run-time trojan attack detection system and focuses on vision system. We intentionally perturb the incoming input, for instance by superimposing various image patterns, and observe the randomness of predicted classes for perturbed inputs from a given deployed model—malicious or benign. A low entropy in predicted classes violates the input-dependence property of a benign model and implies the presence of a malicious input—a characteristic of a trojaned input. The high efficacy of our method is validated through case studies on three popular and contrasting datasets: MNIST, CIFAR10 and GTSRB. We achieve an overall false acceptance rate (FAR) of less than 1%, given a preset false rejection rate (FRR) of 1%, for different types of triggers. Using CIFAR10 and GTSRB, we have empirically achieved result of 0% for both FRR and FAR. We have also evaluated STRIP robustness against a number of trojan attack variants and adaptive attacks. Index Terms—Trojan attack, Backdoor attack
07-24
评论
成就一亿技术人!
拼手气红包6.0元
还能输入1000个字符
 
红包 添加红包
表情包 插入表情
 条评论被折叠 查看
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包

打赏作者

UnknownBody

你的鼓励将是我创作的最大动力

¥1 ¥2 ¥4 ¥6 ¥10 ¥20
扫码支付:¥1
获取中
扫码支付

您的余额不足,请更换扫码支付或充值

打赏作者

实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值