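This post introduces the paper "LLM-Enhanced Static Analysis for Precise Identification of Vulnerable OSS Versions".
Paper link: https://arxiv.org/pdf/2408.07321
Source code: https://anonymous.4open.science/r/Vercation_A
The paper proposes Vercation (Vulnerable version identification), a method that combines static analysis with LLMs to determine which versions of open-source software (mainly programs written in C/C++) are affected by a vulnerability. Its treatment of how to integrate an LLM, and of which information to use as evidence during static analysis, offers useful ideas for later research.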
Contributions of the paper:
The main contributions, as stated by the authors, are as follows:
• We present Vercation, a novel, automated vulnerable version identification approach powered by LLM. Previous efforts heavily relied on predefined patterns of static analysis tools, while Vercation leverages the capability of LLM for the vulnerability comprehension task, through the use of multi-strategy universal prompt engineering.
• Vercation presents a solution based on expanded AST to address the semantic-level clone detection challenge, which was not considered in existing techniques.
• We meticulously assembled a comprehensive dataset that included 10,476 unique vulnerability-version pairs derived from 1,013 versions encompassing 74 CVEs. This extensive dataset was curated across 11 OSS projects and underwent meticulous labeling achieved through a combination of PoC input validation and rigorous manual verification.
• We have implemented a prototype of our approach and assessed its performance using our dataset, achieving an F1 score of 92.4%, outperforming SOTAs by 9% to 48%. More importantly, by applying our approach, we have detected 134 incorrect vulnerable OSS versions in NVD reports.
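The method rests on a model of the OSS version timeline:
Commits are a comprehensive record of OSS development and serve as checkpoints for tracking how the code evolves over time. They let developers revisit the state of the code at a specific date or time.
The paper focuses on identifying the vulnerable versions of an OSS vulnerability. During OSS development, some commits introduce vulnerabilities; these are called vulnerability-introducing commits (vic), and the vulnerabilities are later fixed in patch commits (pc). Vulnerable version analysis aims to pinpoint the vic, which makes it possible to assess which versions are affected by the vulnerability.
We assume that an initial function is modified by the vic and becomes the vulnerable function Fv. Subsequent commits, such as feature-adding commits and refactoring commits, may then further modify the function code, denoted Fur. Eventually, software analysts discover the vulnerability and patch the code in the pc, producing the final function Fp. The figure gives an illustrative timeline: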
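The workflow of the method is shown in the figure:
Version identification consists of three phases: vulnerable code extraction (P1), code change detection (P2), and vulnerable version range delineation (P3).
In P1, Vercation combines program slicing with an LLM to precisely identify and extract the program statements related to the vulnerability. Specifically, the patch code is used as the slicing criterion to extract dangerous flows. To reduce the risk of including statements unrelated to the vulnerability (false positives), the authors use prompt construction, few-shot, and chain-of-thought (CoT) strategies so that the LLM can reason about the vulnerability from the extracted dangerous flows and accurately extract the vulnerability-related statements.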
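As a rough illustration of how few-shot examples and a chain-of-thought instruction might be combined into a single prompt, here is a minimal sketch. The names `FewShotExample` and `build_prompt`, the example fields, and the prompt wording are all assumptions made for illustration; they are not the prompts used in the paper, and no LLM is called.

```python
from dataclasses import dataclass


@dataclass
class FewShotExample:
    """One worked example shown to the model (hypothetical structure)."""
    sliced_code: str            # statements obtained by slicing from the patch
    reasoning: str              # step-by-step explanation of the vulnerability
    vulnerable_statements: str  # statements judged relevant to the vulnerability


def build_prompt(examples, sliced_code):
    """Assemble a few-shot + chain-of-thought prompt as plain text."""
    parts = [
        "You are analyzing a C/C++ function for a known vulnerability.",
        "Think step by step about how the vulnerability is triggered,",
        "then list only the statements that are relevant to it.",
        "",
    ]
    # Few-shot part: each example shows code, reasoning, and the answer.
    for i, ex in enumerate(examples, start=1):
        parts += [
            f"Example {i}",
            "Sliced code:", ex.sliced_code,
            "Reasoning:", ex.reasoning,
            "Vulnerable statements:", ex.vulnerable_statements,
            "",
        ]
    # Chain-of-thought part: end on "Reasoning:" so the model explains first.
    parts += ["Now analyze this code.", "Sliced code:", sliced_code, "Reasoning:"]
    return "\n".join(parts)
```

Ending the prompt on "Reasoning:" is one simple way to ask the model to explain before answering; the paper's actual prompt design may differ.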
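In P2, Vercation traces back through the historical commits to collect earlier modifications of the vulnerable statements. For each statement before and after a modification, Vercation expands the functions in the statement, generates the AST, and normalizes it. The ASTs before and after the modification are then compared for semantic similarity using an in-order traversal algorithm, which determines whether the commit is the one that first introduced the vulnerable statement.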
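The idea of normalizing ASTs before comparing them can be sketched as follows. This is only an analogy: the paper works on C/C++ code, whereas this sketch uses Python's built-in `ast` module on Python snippets as a stand-in, does not expand called functions, and uses an exact match on the normalized tree rather than a similarity score. The names `normalized_dump` and `same_structure` are hypothetical.

```python
import ast


class _Normalizer(ast.NodeTransformer):
    """Rename variables to placeholders in order of first appearance."""

    def __init__(self):
        self.names = {}

    def _placeholder(self, name):
        return self.names.setdefault(name, f"v{len(self.names)}")

    def visit_Name(self, node):
        return ast.Name(id=self._placeholder(node.id), ctx=node.ctx)

    def visit_Call(self, node):
        # Keep a plain callee name: calling a different function is a real change.
        node.args = [self.visit(a) for a in node.args]
        node.keywords = [self.visit(k) for k in node.keywords]
        if not isinstance(node.func, ast.Name):
            node.func = self.visit(node.func)
        return node


def normalized_dump(source):
    """Parse source, normalize variable names, and return a text form of the AST."""
    tree = _Normalizer().visit(ast.parse(source))
    return ast.dump(tree)  # line/column attributes are omitted by default


def same_structure(before, after):
    """True if the two snippets differ only by variable renaming."""
    return normalized_dump(before) == normalized_dump(after)
```

With this sketch, `same_structure("total = a + b", "s = x + y")` treats a pure renaming as unchanged, while a changed operator or callee is reported as a difference.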
In P3, Vercation identifies the versions affected by the vulnerability based on the CVE's pc and the corresponding vic.
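Once the vic and pc are known, the affected range can be pictured as every release cut at or after the vic and before the pc. The sketch below assumes a linear commit history and a hypothetical mapping from version names to the commit each release was cut from; `affected_versions` is an illustrative name, not the paper's implementation. A real repository has a branching history, so a reachability query (for example `git tag --contains <commit>`) would be needed instead of list positions.

```python
def affected_versions(commits, releases, vic, pc):
    """Return versions whose release commit lies in the range [vic, pc).

    commits  : commit ids in chronological order (linear history assumed)
    releases : mapping of version name -> commit id the release was cut from
    vic, pc  : vulnerability-introducing commit and patch commit
    """
    position = {commit: i for i, commit in enumerate(commits)}
    start, end = position[vic], position[pc]
    hits = [v for v, c in releases.items() if start <= position[c] < end]
    # Report versions in release order.
    return sorted(hits, key=lambda v: position[releases[v]])
```

A release cut exactly at the pc already contains the fix, which is why the upper bound is exclusive.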
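I will not go through the details of the method here; the paper describes them concisely and clearly. On dataset design (the dataset is provided with the open-source code), the authors describe their process as follows:
Validation and labeling. We apply the following process to systematically label the ground-truth dataset:
1) The fixed version can be determined from the official release that contains the patch commit. We confirm whether the versions after the fixed version are vulnerable by checking whether they contain the patch code.
2) For the remaining versions, we obtained the public PoC of each vulnerability from Git issues or reference OSS sites.
3) To validate the PoC on each corresponding OSS version, we set up the dependency environment for each OSS version and compiled the library.
4) While testing the PoC input for each vulnerability, we analyze the trigger conditions and dangerous behavior of the vulnerability and mark the statements that trigger it. For versions on which the PoC cannot trigger the vulnerability (because some PoCs only work on certain versions), we check whether the version contains those statements in order to label it.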
The method is then evaluated through four research questions and their answers:
RQ1. Is Vercation more effective than the state-of-the-art vulnerable version identification methods?
Vercation demonstrates superior performance in vulnerable version identification on our ground-truth dataset, achieving an F1 score of 92.4%. This significantly outperforms existing methods, including NVD reports, SZZ algorithms, and V0Finder. Compared to the best-performing baseline V-SZZ, Vercation shows a 9.2% improvement. These results highlight Vercation’s effectiveness in addressing limitations of current approaches, providing a more accurate tool for identifying vulnerable versions in open-source software.
RQ2. How effective is Vercation powered by LLMs in vulnerability comprehension?
Harnessed by the power of LLMs, Vercation can generate the vulnerability logic and extract vulnerable codes from dangerous flows more precisely than traditional static analysis tools. The prompt strategy combining Few-shot and CoT techniques performs the best.
RQ3. How accurate is our proposed semantic-level code clone detection method?
Refactoring commits occupy a significant proportion of the code change history. The AST-based similarity comparison in Vercation performs better at code change detection than purely syntactic or purely semantic analysis alone, and it significantly improves the precision of vulnerability-introducing commit identification, enhancing the overall effectiveness of Vercation in vulnerable version detection.
RQ4. What are the effects of applying Vercation in the real world?
Applying Vercation to real-world CVEs revealed significant practical impact. It identified 134 CVEs (38.1%) with incomplete or inaccurate CPE information in NVD reports. Analysis of the top 10 OSS projects showed substantial delays in vulnerability resolution, with some cases lasting up to ten years. This demonstrates Vercation’s potential to enhance vulnerability management in real-world scenarios by providing more accurate version information and insights into vulnerability lifecycles.
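For readers unfamiliar with the F1 score reported under RQ1, the sketch below shows how precision, recall, and F1 could be computed over sets of (CVE, version) pairs. The pair representation and the function name `precision_recall_f1` are assumptions for illustration; the paper's evaluation scripts may be organized differently.

```python
def precision_recall_f1(predicted, actual):
    """Compute precision, recall, and F1 for predicted vs. ground-truth pairs."""
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

Each pair states that a given version is vulnerable to a given CVE, so a method that reports too many versions loses precision and one that reports too few loses recall.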
Of course, the method also has some limitations:
Limitations of Joern Parser. Vercation utilizes the Joern parser to generate the code property graph, which contains the abstract syntax tree, control flow graph, and data dependency graph. However, we encountered some challenges with certain C/C++ constructs that Joern did not handle well, which impeded its ability to track data flow correctly and caused it to miss essential constructors. Additionally, Joern may face difficulties in outputting data dependencies related to pointers, particularly in languages like C/C++ where pointers are heavily used. This limitation could potentially affect the efficacy of our approach.
Generalizability of the Dataset. Our study primarily focused on C/C++ open-source software projects. While these are widely used languages, the effectiveness of Vercation on other programming languages or closed-source software remains untested. The vulnerability patterns and code structures in other languages might differ, potentially affecting the performance of our approach. Future work could involve extending Vercation to support a broader range of programming languages and software types.
LLM Variability and Reproducibility. As LLMs continue to evolve rapidly, the performance of Vercation might change with newer versions of these models. This could affect the long-term consistency of our results. Additionally, LLMs may have biases or limitations in their training data that could influence their ability to understand certain types of vulnerabilities or code structures. To mitigate this, we have provided detailed information about the LLM versions and prompts used in our study, but future research may need to account for potential variations in LLM capabilities and performance as these models continue to develop.
Limitation and Future Work. Vercation operates under certain assumptions that may limit its applicability. Firstly, it is tailored for C/C++ projects, yet its methodology can be applied to other programming languages. With an accurate exploration of common vulnerability types in diverse languages, Vercation could identify vulnerable versions in those languages. Secondly, Vercation exclusively addresses vulnerabilities within functions, thus omitting inter-functional vulnerabilities. However, we are contemplating the integration of inter-functional data-flow information into the vulnerability signatures to tackle such scenarios. In the evaluation results presented in section 5.2, Vercation demonstrated notable precision (93%) and recall (98%). There exists room for further enhancement in both precision and recall. In semantic similarity comparisons, the utilization of heavyweight program analysis techniques, such as symbolic execution, could enhance the understanding of code semantics, thereby reducing false negatives.
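Finally, a short summary. The paper's way of applying LLMs is a useful reference, and its experimental design and dataset selection are comprehensive and rigorous. Although the static analysis it uses targets source code, C/C++ source in particular, the paper analyzes the method's possible problems and the reasons for its better results in a detailed and reasonable way. Overall, as an approach that combines LLMs with static analysis, it is worth reading and learning from, and it is also a useful reference for applying LLMs in other settings, such as binary code.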