[OpenAI Codex] Evaluating Large Language Models Trained on Code

Paper link: https://arxiv.org/abs/2107.03374

Introduction

Codex is a GPT model fine-tuned on public code from GitHub to write Python code. (Copilot, a distinct descendant of Codex, supports multiple programming languages.) Because GPT-3's training data contains little code, its coding performance is weak. To obtain an LLM that is more proficient at coding, the authors fine-tuned GPT solely on code. Codex can therefore be seen as a procedure for producing a set of GPT parameters specialized for coding, rather than a brand-new model.

The main function of Codex: given the docstring (a natural-language description) of a function, generate the function in Python. OpenAI collected a dataset of 164 programming problems with unit tests, covering language comprehension, algorithms, and simple mathematics; the difficulty is comparable to simple software interview questions. The dataset, called HumanEval, is available at https://github.com/openai/human-eval.
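For illustration, a problem in the style of HumanEval (this particular pairing of prompt and tests is a simplified sketch, not copied verbatim from the dataset): the model sees only the prompt and must complete the function body, which is then run against held-out unit tests.

```python
# Prompt given to the model: signature plus docstring.
prompt = '''def incr_list(l: list) -> list:
    """Return the list with all elements incremented by 1.
    >>> incr_list([1, 2, 3])
    [2, 3, 4]
    """
'''

# Hidden unit tests used to score the model's completion.
def check(candidate):
    assert candidate([1, 2, 3]) == [2, 3, 4]
    assert candidate([]) == []
```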

The performance (pass rate) as a function of model scale is shown below:

[Figure: pass rate vs. model size for Codex and Codex-S, with mean-logp and oracle reranking]

where Codex-S is a model further fine-tuned on the docstring-to-code generation task. "Mean logp reranking" means generating 100 code candidates and picking the one with the highest mean log-probability; "oracle reranking" means the optimal pick, which is equivalent to pass@100.
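A minimal sketch of mean-logp reranking, assuming the model returns per-token log-probabilities alongside each sample (the data layout here is hypothetical):

```python
import numpy as np

def rerank_by_mean_logp(samples):
    """Pick the candidate with the highest mean token log-probability.

    samples: list of (code, token_logprobs) pairs, where token_logprobs is
    the sequence of per-token log-probabilities the model assigned to code.
    """
    scores = [np.mean(logprobs) for _, logprobs in samples]
    return samples[int(np.argmax(scores))][0]

# Usage: generate 100 candidates, then submit only the top-ranked one.
# best = rerank_by_mean_logp(generate_candidates(prompt, n=100))
```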

Evaluation Framework

Metric

Generated natural-language content is usually evaluated with substring-match metrics such as BLEU and ROUGE. However, these do not work well for generated code, because code has a very low tolerance for mistakes: "similar" is not acceptable. Fortunately, code offers a more straightforward evaluation than other natural language generation tasks: by simply running the generated function against unit tests, we know whether it is correct. The quantitative metric is pass@k, defined as
$$\text{pass}@k := \mathbb{E}_{\text{Problems}}\left[1-\frac{\binom{n-c}{k}}{\binom{n}{k}}\right]$$
where for each problem Codex generates $n$ samples ($n \ge k$), of which $c$ are correct. The fraction is the probability that a random pick of $k$ out of the $n$ generations contains no correct one. This estimator is more stable than generating exactly $k$ samples and checking whether any passes. Note that computing the binomial coefficients directly may overflow; a trick is to compute the product of fractions step by step.
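A numerically stable implementation of this estimator for a single problem (the paper gives an equivalent numpy version), using the identity $\binom{n-c}{k}/\binom{n}{k} = \prod_{i=n-c+1}^{n}(1 - k/i)$:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimator of pass@k for one problem.

    n: total samples generated, c: number of correct samples, k: k in pass@k.
    """
    if n - c < k:
        # Fewer than k incorrect samples: any size-k subset contains a correct one.
        return 1.0
    # Product form of 1 - C(n-c, k) / C(n, k); avoids huge intermediate values.
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))
```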

Hand-Written Evaluation Set

All problems in HumanEval, including the function signature, docstring, body, and several unit tests, are hand-written to avoid data leakage.

Sandbox for Execution

Generated functions are executed in a sandbox to guard against harmful snippets.
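The paper's sandbox is far more hardened (isolated containers with network access disabled); as a toy sketch of the idea, one can run each candidate together with its tests in a separate process under a timeout:

```python
import subprocess
import sys
import tempfile

def run_in_subprocess(candidate: str, tests: str, timeout: float = 3.0) -> bool:
    """Toy stand-in for a sandbox: run candidate + tests in a fresh process.

    Returns True iff the script exits cleanly within the timeout. A real
    sandbox must also block network, filesystem, and resource abuse.
    """
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(candidate + "\n\n" + tests)
        path = f.name
    try:
        proc = subprocess.run([sys.executable, path],
                              capture_output=True, timeout=timeout)
        return proc.returncode == 0
    except subprocess.TimeoutExpired:
        return False  # treat infinite loops as failures
```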

Code Fine-Tuning

Data Collection

159 GB of Python code from GitHub. Note that the authors did not clarify the licenses of this code, which suggests some of it may have been collected without authorization.

Methods

Compared with training from scratch, fine-tuning from GPT-3 did not yield better final performance, but it converged more quickly.

By adding special tokens representing runs of whitespace of different lengths, code can be represented with about 30% fewer tokens.
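A rough sketch of the idea (the token names here are made up; the paper does not detail the exact extension of the GPT-3 BPE vocabulary): collapse each run of leading spaces into a single special token.

```python
import re

def compress_whitespace(line: str) -> str:
    """Replace a run of 2+ leading spaces with one special token.

    Token names like "<|space4|>" are hypothetical, for illustration only.
    """
    m = re.match(r"^( {2,})", line)
    if m:
        n = len(m.group(1))
        return f"<|space{n}|>" + line[n:]
    return line

src = "def f(x):\n    if x > 0:\n        return x\n    return -x"
print("\n".join(compress_whitespace(l) for l in src.split("\n")))
# Each 4- or 8-space indent becomes a single token instead of several.
```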

Supervised Fine-Tuning

Methods

The Python corpus collected from GitHub is massive and covers many different tasks, which leaves a gap between the training data and the HumanEval evaluation setting (generating a function from a docstring). To build a training set better aligned with HumanEval, the authors collected problems from competitive programming sites and continuous integration repositories, filtering out ambiguous or overly difficult problems: any problem for which Codex could not produce a correct answer within 100 samples was discarded, as sketched below. Further fine-tuning on this dataset yields Codex-S.
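A sketch of that filtering rule (`model.generate` and `passes_tests` are hypothetical helpers standing in for the actual pipeline):

```python
def keep_problem(problem, model, n: int = 100) -> bool:
    """Keep a problem only if the model solves it at least once in n samples.

    Problems the model never solves are assumed to be ambiguous or too hard,
    so they are dropped from the Codex-S fine-tuning set.
    """
    for _ in range(n):
        candidate = model.generate(problem.prompt)   # hypothetical API
        if problem.passes_tests(candidate):          # hypothetical API
            return True
    return False
```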

Results

To obtain the best performance, the larger the number of samples ($k$), the higher the optimal sampling temperature.

[Figure: pass rate vs. sampling temperature for different numbers of samples]
Codex-S shows significantly better performance than Codex.
[Figures: pass rates of Codex-S vs. Codex across model sizes]

Docstring Generation

By reordering the training data from "function signature - docstring - implementation" to "function signature - implementation - docstring", the model learns to generate a docstring from a function implementation. The model fine-tuned on this dataset is called Codex-D. Due to the lack of ground truth, generated docstrings are evaluated by hand-grading.
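A hypothetical illustration of the reordering (the exact delimiters used in Codex-D training data are not reproduced here):

```python
signature = "def add(a: int, b: int) -> int:"
docstring = '    """Return the sum of a and b."""'
body = "    return a + b"

# Codex training order: signature, docstring, implementation.
codex_example = "\n".join([signature, docstring, body])

# Codex-D training order: signature, implementation, then the docstring,
# so the model learns to write the docstring conditioned on the code.
codex_d_example = "\n".join([signature, body, docstring])
```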

Limitations

  • Not sample-efficient: even with 159 GB of Python code as training data, Codex's performance is not satisfactory.
  • Codex struggles with long docstrings and with implementing functions that involve many operations or variables.

Broader Impacts and Hazard Analysis

Over-reliance; misalignment (the model is able to do the task, but doesn't); bias; labor-market effects; security; environment; law (the model sometimes generates code identical to its training data, regardless of the license).

OpenAI Codex Model Overview

Codex is a derivative of GPT-3 released by OpenAI, fine-tuned on code data. It can generate code from a function name and comment, generate test cases, and supports multiple programming languages. Its parameter scale ranges widely, from as small as 12M to as large as 12B, making it one of the most powerful pretrained models for programming languages.

Codex also has strong contextual understanding: it knows a large number of existing libraries, APIs, and modules, and can adjust its suggestions flexibly based on the developer's prompt. For example, when a version number is specified, Codex can prioritize content related to that particular version.

```html
<!-- Use A-Frame version 1.2.0 to create a 3D website -->
<script src="https://aframe.io/releases/1.2.0/aframe.min.js"></script>
```

In addition, the Codex technical documentation emphasizes support for best practices: when generating code, Codex tends to attach detailed comments or provide links to external references, helping developers better understand and apply the generated content.

Usage

Method 1: through the API. OpenAI provides a RESTful interface to access Codex. The basic request flow:

1. Register and obtain a personal API key;
2. Build an HTTP request header carrying the authentication information;
3. Send a request containing the input text to the server for processing.

Below is a simple Python example showing how to call Codex to generate a JavaScript function definition:

```python
import openai

openai.api_key = 'your_api_key_here'

response = openai.Completion.create(
    engine="code-davinci-002",
    prompt="Write a function that reverses an array.\n\nfunction reverseArray(input) {\n",
    temperature=0,
    max_tokens=64,
)
print(response.choices[0].text.strip())
```

Note that the engine name code-davinci-002 belongs to the latest generation of the improved Codex series.

Method 2: integration into an IDE or editor plugin. Many modern IDEs, such as Visual Studio Code and JetBrains IntelliJ IDEA Ultimate Edition, offer official or community-maintained extensions for seamless access to such AI coding assistants. After installation, configure your account as guided by the interface to get real-time coding assistance.

Technical Documentation

For deeper exploration of the technical details, see:

- Official documentation: https://platform.openai.com/docs/models/codex — model comparisons, pricing, and other practical tips.
- GitHub repositories: open-source projects demonstrate concrete usage scenarios.