[Microsoft] A Comparative Study of DSL Code Generation

A Comparative Study of DSL Code Generation: Fine-Tuning vs. Optimized Retrieval Augmentation

What is a DSL

Domain Specific Languages (DSLs) are custom computer languages designed and optimized for specific applications. Examples of DSLs include SQL and industry-specific languages for formalizing API calls, often using formats like JSON or YAML to represent API sequences.
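For a concrete feel, here is a hypothetical NL-to-DSL pair. The JSON-like schema and API names below are my own invention for illustration; the paper's actual DSL is proprietary.

```python
# Hypothetical example of an NL instruction and the API-sequence "flow" it maps to.
# Schema and API names are invented for illustration only, not the paper's DSL.
nl_prompt = ("When I receive an email with an attachment, "
             "save it to OneDrive and notify me on Teams.")

dsl_flow = [
    {"api": "Email.OnNewAttachment", "params": {"folder": "Inbox"}},
    {"api": "OneDrive.SaveFile", "params": {"path": "/Attachments"}},
    {"api": "Teams.PostMessage", "params": {"channel": "me", "text": "Attachment saved"}},
]
```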

General pre-trained models are therefore hard to apply directly to this task, and they are greatly affected by hallucination: the syntax, names, and APIs of a DSL are all custom, partly private, and ever-changing. Fine-tuning alone is thus not an ideal way to handle new APIs, but RAG techniques seem to fit this problem well.

Related Work

Code Generation

Existing popular code generation LLMs, like Copilot and Code Llama, are all trained on public programming languages such as C++ and Python, but it is hard or impossible for their training data to cover DSLs. To utilize them, techniques like fine-tuning or few-shot prompt engineering are needed.

Tool Integration

Some LLMs have learned to call external tools like web search or calculators. However, these pre-trained models are limited to a small set of well-documented tools and are hard to adapt to new, private APIs.

Task Orchestration

This was an unfamiliar task that I had never heard of. After searching the Internet, it appears to mean something like "automated arranging, coordinating, and managing of tasks". In the context of this paper, I think the LLM in orchestration is used to divide the user instruction into several ordered sub-tasks, which is why the authors put orchestration on par with reasoning.

Summary

In summary, a DSL can be regarded both as a kind of programming language and as a set of external tools, so DSL generation can draw on Code Generation and Tool Integration. Additionally, the process of understanding the instruction and dividing it into sub-tasks to decide which APIs to call resembles Task Orchestration.

Methodology


Fine-Tuning Base Model

The backbone of this system is a fine-tuned NL2DSL generation model. It is fine-tuned with a LoRA-based approach on 67k NL-DSL pairs, starting from OpenAI's Codex base model. To improve performance, the authors apply data augmentation and add synthesized data to the training set. They found it challenging to predict the parameter keys due to limitations of the data generation process, because the synthesized samples do not cover a sufficient variety of parameters.
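Since the paper gives no training details beyond "LoRA-based", here is a minimal sketch of what LoRA fine-tuning looks like with Hugging Face PEFT. The backbone name is a stand-in: the actual Codex base model is not publicly available for this kind of tuning.

```python
# Minimal LoRA fine-tuning setup with Hugging Face PEFT (a sketch, not the paper's code).
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("codellama/CodeLlama-7b-hf")  # stand-in backbone

lora_cfg = LoraConfig(
    r=16,                                 # rank of the low-rank update matrices
    lora_alpha=32,                        # scaling factor for the updates
    target_modules=["q_proj", "v_proj"],  # attach adapters to attention projections
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # only the adapter weights are trainable
# ...then train on the 67k (NL, DSL) pairs with a standard causal-LM objective.
```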

TST-based Example Retrieval

The authors fine-tuned a pre-trained BERT model with a modified loss function to produce embeddings for retrieval. TST (Target Similarity Tuning) is a loss formulation that aims to minimize the difference between the cosine similarity of natural language embeddings and a pre-defined metric of program similarity:
$$L_{TST}(\theta) = \mathbb{E}_{(i,j)\sim\mathcal{D}}\left[f_\theta(u_i, u_j) - S(p_i, p_j)\right]^2$$
Here, $S$ is the Jaccard score of the lists of API function names, and $f_\theta$ is the cosine similarity between the embeddings of utterances $u_i$ and $u_j$.

The authors mention that they collect positive and negative pairs based on the similarity of embeddings produced by a pre-trained Transformer model, but they do not mention how these pairs are used in the fine-tuning.
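To make the objective concrete, here is a minimal PyTorch sketch of the TST loss, assuming `encoder` is the fine-tuned BERT sentence encoder and each program is represented by its set of API function names (both assumptions of mine, not details from the paper):

```python
import torch
import torch.nn.functional as F

def jaccard(apis_i: set, apis_j: set) -> float:
    """S(p_i, p_j): Jaccard score over the sets of API function names."""
    if not apis_i and not apis_j:
        return 1.0
    return len(apis_i & apis_j) / len(apis_i | apis_j)

def tst_loss(encoder, utterances_i, utterances_j, apis_i, apis_j):
    """Squared error between the embedding cosine similarity f_theta(u_i, u_j)
    and the program similarity S(p_i, p_j), averaged over the batch."""
    e_i = encoder(utterances_i)  # (batch, dim) sentence embeddings
    e_j = encoder(utterances_j)
    f = F.cosine_similarity(e_i, e_j, dim=-1)          # f_theta(u_i, u_j)
    s = torch.tensor([jaccard(a, b) for a, b in zip(apis_i, apis_j)],
                     device=f.device)                  # S(p_i, p_j)
    return ((f - s) ** 2).mean()                       # approximates E[(f - S)^2]
```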

Prompt Grounding

“Grounding” is a concept proposed by Microsoft (I think). Basically, it means providing additional ground-truth information in the prompt. In this system, they provide API metadata in the prompt, including the function descriptions and all parameter keys. Two methods are proposed; a sketch of both follows the list below.

  • Providing the metadata of all APIs that appear in the retrieved few-shot examples (Regular FD).
  • Based on the NL query, retrieving APIs whose metadata descriptions are semantically similar to it; they name this Semantic Function Definition (SFD). It is useful for prompts where no few-shot examples are retrieved.
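A rough sketch of how the two grounding variants might assemble function definitions for the prompt; the helper names and data layouts are my assumptions, not the paper's implementation:

```python
import numpy as np

def regular_fd(fewshot_examples, api_metadata):
    """Regular FD: include metadata for every API appearing in the retrieved few-shots.
    `fewshot_examples` is a list of {"nl": ..., "flow": ..., "apis": [...]} dicts
    (assumed layout)."""
    names = {api for ex in fewshot_examples for api in ex["apis"]}
    return [api_metadata[n] for n in sorted(names) if n in api_metadata]

def semantic_fd(query_emb, desc_embs, api_metadata, k=5):
    """SFD: retrieve the k APIs whose description embeddings are most similar
    to the NL query embedding. `desc_embs` maps API name -> unit-normalized
    description embedding (assumed layout)."""
    scored = sorted(desc_embs.items(),
                    key=lambda kv: -float(np.dot(query_emb, kv[1])))  # cosine, if normalized
    return [api_metadata[name] for name, _ in scored[:k]]
```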

Experiments

The backbone agent model used in the experiments is GPT-4 (16k token limit).

Dataset Generation

The train and test sets consist of 67k and 1k samples, respectively. These samples are (prompt, flow) pairs, with the workflows created by users across a large set of APIs. After removing Personally Identifiable Information (PII), 700 publicly available APIs are left. The corresponding NL prompts are generated using GPT-4.

Metrics

  • Average Similarity
    The authors reduce the flow to a sequence of APIs, ignoring the parameters (which I actually do not think is reasonable), and then compute the Longest Common Subsequence match between the ground-truth and predicted sequences. The similarity is $LCSS(A,B)/\max(|A|, |B|)$, where $A, B$ are the reduced API sequences (see the sketch after this list). Hallucinations and parser failures lead to the sample being discarded and assigned a similarity score of 0.
  • Unparsed rate
    The rate of syntactic errors, $|\text{Flow}_\text{unparsed}| / |\text{Flow}_\text{total}|$.
  • Hallucination rate
    Hallucinated API name rate: $|\text{Flow}_\text{h}| / |\text{Flow}_\text{parsed}|$, where $\text{Flow}_\text{h}$ counts parsed flows containing a made-up API name.
    Hallucinated API parameter key rate: the same ratio, with $\text{Flow}_\text{h}$ counting parsed flows containing a made-up parameter key.
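A small sketch of how these metrics could be computed from API-name sequences, following the definitions above (my own re-implementation, not the authors' evaluation code):

```python
def lcs_len(a, b):
    """Length of the Longest Common Subsequence via standard dynamic programming."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

def avg_similarity(gt_apis, pred_apis):
    """LCSS(A, B) / max(|A|, |B|) over reduced API sequences.
    Unparsed or hallucinated flows are scored 0 before reaching here."""
    if not gt_apis or not pred_apis:
        return 0.0
    return lcs_len(gt_apis, pred_apis) / max(len(gt_apis), len(pred_apis))

def rate(numerator_flows, denominator_flows):
    """Generic rate, e.g. unparsed rate = rate(unparsed, total);
    hallucination rates = rate(hallucinated, parsed)."""
    return len(numerator_flows) / len(denominator_flows) if denominator_flows else 0.0
```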

Results

The test set contains 1000 NL-DSL pairs, where 864 are in-domain samples and 136 are out-of-domain samples. The authors investigate the impact of each ablation with in-domain samples and evaluate the generalizability with out-of-domain samples.

An unusual aspect of the results is that they only display the difference from the baseline while keeping the actual baseline values secret. Is this a confidentiality requirement, or is it just because the results did not meet expectations?

Number of few-shot examples

More few-shot examples improve the performance by reducing the number of made-up API names as well as the number of made-up API parameter keys.

TST vs Pre-trained Model


The addition predominantly helps reduce the hallucination rate for API names and parameters, which indicates that adding tool descriptions (as is done in planning tasks) along with few-shot code samples helps improve the reliability of plan generation.

Regular Function Definition vs Semantic Function Definitions

Simply adding semantically similar API metadata for a query is not useful for DSL generation.

Out of Domain APIs

When samples are not present in the train set, grounding with RAG context can support the LLM in improving code quality. The role of few-shot examples in informing the syntax of the output code cannot be substituted by just adding function definitions. Since it is hard to obtain examples for unseen APIs, alternative ways to reduce syntactic errors need to be found.

My thoughts

The average similarity metric lacks practical significance, because it does not consider the correctness of parameters: even 100% similarity cannot guarantee that the generated flow is correct. To be more practical, I think it is necessary to evaluate whether the generated flow actually works, just like in other code generation tasks.

When I discussed this with an industry expert, he said his team would treat the NL2DSL task as Named Entity Recognition (NER), where they would try to recognize the API names, parameter names, and key values from the prompt. This can greatly decrease the probability of hallucination, but it requires the user to provide more information in the prompt. In fact, we could guide users to clarify their intent, asking them to select and re-rank retrieved keywords, instead of generating the flow end to end.
