[OpenAI Codex] Evaluating Large Language Models Trained on Code

Paper link: https://arxiv.org/abs/2107.03374

Introduction

Codex is a GPT-based model fine-tuned on public code from GitHub to write Python code. (Copilot, a distinct version of Codex, supports multiple programming languages.) Because GPT-3's training data contains relatively little code, its coding performance is poor. To obtain an LLM that is “more proficient” at coding, the authors fine-tuned GPT solely on code. So Codex might be considered a procedure for producing a set of GPT parameters specialized for coding, rather than a “brand new” model.

The main function of Codex is: given the docstring of a function (which can be seen as an annotation or natural-language description), Codex generates the function in Python. OpenAI collected a dataset of 164 programming problems with unit tests, covering language comprehension, algorithms, and simple mathematics; the difficulty is comparable to simple software interview questions. The dataset, called HumanEval, can be found at https://github.com/openai/human-eval.

The performance (pass rate) as a function of model scale is shown below:

[Figure: pass rate vs. number of model parameters for GPT-3, Codex, and Codex-S, with mean-logp and oracle reranking]

where Codex-S is a model further fine-tuned on the docstring-to-code generation task. “Mean logp reranking” means generating 100 code candidates and picking the one with the highest mean token log-probability; “oracle reranking” means an optimal pick, which is equivalent to pass@100.
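A minimal sketch of mean-logp reranking (not the authors' code; the candidates and their per-token log-probabilities are assumed to come from the sampling API):

```python
import numpy as np

# Sketch of mean-logp reranking: average each candidate's per-token
# log-probabilities and keep the candidate with the highest mean.
def rerank_mean_logp(candidates: list, token_logps: list) -> str:
    scores = [float(np.mean(lp)) for lp in token_logps]
    return candidates[int(np.argmax(scores))]
```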

Evaluation Framework

Metric

Usually, generated content is evaluated with substring-match-based metrics like BLEU and ROUGE. However, these clearly do not work well for evaluating generated code, because code has very low tolerance for mistakes: “similar” is not acceptable. In fact, we have a more straightforward way to evaluate generated code than other natural-language generations: by simply running the generated function against unit tests, we know whether it is correct. The quantitative metric is pass@k, with the formula
$$\text{pass@}k := \mathbb{E}_{\text{Problems}}\left[1-\frac{\binom{n-c}{k}}{\binom{n}{k}}\right]$$
where for each problem Codex generates $n$ samples ($n \ge k$), of which $c$ are correct. The fraction is the probability that, picking $k$ of the $n$ samples, none of the picked $k$ is correct. This estimator is more stable than simply generating $k$ samples and checking whether any passes. Notice that directly computing the binomial coefficients may lead to overflow; a trick is to compute the product of fractions step by step.
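Concretely, using the identity $\binom{n-c}{k}/\binom{n}{k} = \prod_{i=n-c+1}^{n}\left(1-\frac{k}{i}\right)$, the paper gives a numerically stable estimator along these lines:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimator of pass@k for a single problem.

    n: total number of samples, c: number of correct samples, k: k in pass@k.
    Expands binom(n-c, k) / binom(n, k) as a product of fractions to avoid
    the overflow caused by computing the binomial coefficients directly.
    """
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))
```

The final pass@k is then the mean of `pass_at_k` over all problems in the benchmark.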

Hand-Written Evaluation Set

All problems in HumanEval, including function signature, docstring, body, and several unit tests, are hand-written to avoid data leakage: a model trained on GitHub could otherwise have already seen the solutions.
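For illustration, a problem in this format looks roughly like the following (paraphrased from the first HumanEval problem; the exact prompt text differs). The model is given the signature and docstring and must complete the body, which is then checked against the unit tests:

```python
# Prompt given to the model: signature + docstring; the model completes the body.
def has_close_elements(numbers: list, threshold: float) -> bool:
    """Check if any two numbers in the list are closer to each other
    than the given threshold."""
    # A correct completion that the unit tests would accept:
    return any(abs(a - b) < threshold
               for i, a in enumerate(numbers) for b in numbers[i + 1:])

# Hidden unit tests used to score the completion:
def check(candidate):
    assert candidate([1.0, 2.0, 3.9, 4.0, 5.0, 2.2], 0.3) is True
    assert candidate([1.0, 2.0, 3.0], 0.05) is False

check(has_close_elements)
```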

Sandbox for Execution

Generated functions are executed in a sandbox to guard against harmful snippets (e.g., code that deletes files or opens network connections).
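The paper's sandbox is built on gVisor containers with networking disabled; the following is only a rough local approximation, isolating the candidate in a separate process with a wall-clock timeout:

```python
import os
import subprocess
import sys
import tempfile

# Rough approximation of sandboxed evaluation (not the paper's gVisor setup):
# write candidate + tests to a temp file, run it in a subprocess, and treat a
# clean exit as passing.
def run_candidate(code: str, test: str, timeout: float = 3.0) -> bool:
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code + "\n\n" + test)
        path = f.name
    try:
        proc = subprocess.run([sys.executable, path],
                              capture_output=True, timeout=timeout)
        return proc.returncode == 0  # tests pass iff the script exits cleanly
    except subprocess.TimeoutExpired:
        return False  # infinite loops count as failures
    finally:
        os.unlink(path)
```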

Code Fine-Tuning

Data Collection

159 GB of Python code from GitHub. Notice that the authors did not clarify the licenses of this code, which implies they may have collected some code without asking for authorization.

Methods

Compared with training from scratch, fine-tuning from GPT-3 did not yield better final performance, but it converged more quickly.

By adding special tokens that represent runs of whitespace of different lengths, the number of tokens needed to represent code is reduced by about 30%.
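A toy illustration of the idea (the `<|spaceN|>` token names are hypothetical, not the actual Codex vocabulary): encoding each run of leading spaces as a single token instead of one token per space.

```python
# Toy illustration (hypothetical token names, not the Codex tokenizer):
# encode each run of leading spaces as one <|spaceN|> token.
def indent_tokens(line: str, special_whitespace: bool) -> list:
    n = len(line) - len(line.lstrip(" "))
    if special_whitespace and n > 0:
        return [f"<|space{n}|>"]  # one token for the whole run
    return [" "] * n              # one token per space

code = "def f(x):\n        if x > 0:\n                return x"
for flag in (False, True):
    total = sum(len(indent_tokens(l, flag)) for l in code.splitlines())
    print(f"special_whitespace={flag}: {total} indentation tokens")
# Prints 24 indentation tokens without special tokens, 2 with them.
```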

Supervised Fine-Tuning

Methods

The Python code dataset collected from GitHub is massive and covers many different tasks, which places a gap between the training data and the HumanEval evaluation set (generating a function from a docstring). To collect a training dataset that aligns better with HumanEval, the authors gathered problems from competitive programming sites and from continuous integration, and filtered out ambiguous or too-difficult problems: any problem for which Codex could not generate a correct answer in 100 samples was discarded, as sketched below. Further fine-tuning on this dataset yields Codex-S.
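A minimal sketch of that filtering criterion, where `sample_solution` and `run_tests` are hypothetical stand-ins for sampling a completion from Codex and executing a problem's unit tests:

```python
# Keep a problem only if at least one of n sampled solutions passes its tests;
# otherwise treat it as ambiguous or too difficult and discard it.
def keep_problem(problem, sample_solution, run_tests, n: int = 100) -> bool:
    return any(run_tests(problem, sample_solution(problem)) for _ in range(n))
```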

Results

To obtain the best performance, the larger the number of samples $k$, the higher the optimal sampling temperature.

[Figure: pass rate vs. sampling temperature for different numbers of samples $k$]
Codex-S shows significantly better performance than Codex.
[Figures: pass rate of Codex-S vs. Codex across model sizes]

Docstring Generation

By reordering the dataset from “function signature - docstring - implementation” to “function signature - implementation - docstring”, the model learns to generate a docstring from a function's implementation. The model fine-tuned on this dataset is called Codex-D. Due to the lack of ground truth, generated docstrings are evaluated by hand grading.
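A minimal sketch of how such a training example might be assembled; the delimiters here are assumptions, not the paper's exact format, and only the ordering is the point:

```python
# Assemble a Codex-D training example: signature, then implementation,
# then docstring, so the model learns to produce the docstring last.
def make_codex_d_example(signature: str, body: str, docstring: str) -> str:
    return (
        f"{signature}\n"
        f"{body}\n"
        f'    """{docstring}"""\n'
    )

print(make_codex_d_example(
    "def add(a, b):",
    "    return a + b",
    "Return the sum of a and b.",
))
```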

Limitations

  • Not sample-efficient: even with 159 GB of Python code in the training set, the performance of Codex is not satisfactory.
  • Codex struggles with long docstrings and with implementing functions that involve many operations or variables.

Broader Impacts and Hazard Analysis

Over-reliance; misalignment (the model is capable of producing correct code but doesn't); bias; labor-market effects; security; environmental cost; legal issues (the model sometimes generates code identical to its training data, regardless of that code's license).
