Dataherald数据集构建：微调训练数据的高效生成-优快云博客

Dataherald数据集构建：微调训练数据的高效生成

【免费下载链接】dataherald 项目地址: https://gitcode.com/GitHub_Trending/da/dataherald

痛点：为什么需要高质量的微调数据集？

在企业级自然语言转SQL（NL-to-SQL）场景中，通用大语言模型往往缺乏特定业务领域的上下文理解能力。你可能会遇到这样的困境：

模型生成的SQL语法正确但语义错误
缺乏对业务表结构和关系的深度理解
无法正确处理企业特有的数据模式和业务逻辑
响应时间过长影响用户体验

Dataherald通过智能化的数据集构建流程，解决了这些痛点，让微调训练数据的生成变得高效且精准。

Dataherald微调数据生成架构

mermaid

核心组件：Golden SQL系统

Golden SQL是Dataherald微调数据的核心构建块，每个Golden SQL包含：

字段	描述	示例
`prompt_text`	自然语言问题	"查询2023年销售额最高的产品"
`sql`	对应的SQL语句	"SELECT product_name, SUM(sales) FROM sales WHERE year=2023 GROUP BY product_name ORDER BY SUM(sales) DESC LIMIT 1"
`db_connection_id`	数据库连接标识	"656e52cb4d1fda50cae7b939"
`metadata`	附加元数据	{"confidence": 0.95, "verified_by": "admin"}

智能数据格式化流程

Dataherald采用多层次的上下文增强策略：

1. 表结构信息提取

def format_table(self, table: TableDescription) -> str:
    table_representation = ""
    table_representation += table.table_schema + "\n"
    
    # 添加表描述
    if table.description is not None:
        table_representation += f"Table `{table.table_name}`: {table.description}\n"
    
    # 添加列描述和分类信息
    for column in table.columns:
        if column.description is not None:
            table_representation += f"Column `{column.name}`: {column.description}\n"
    
    # 添加分类列信息
    columns_information = self.format_columns(table)
    if columns_information:
        table_representation += "/* Categorical Columns:\n"
        table_representation += columns_information
        table_representation += "*/\n"
    
    # 添加样本数据
    sample_rows = table.examples
    table_representation += "/* Sample rows:\n"
    for item in sample_rows:
        for key, value in item.items():
            table_representation += f"{key}: {value}, "
        table_representation += "*/\n"
    
    return table_representation

2. 语义相似度排序

基于嵌入向量的表排序算法：

def sort_tables(self, tables, table_embeddings, prompt):
    tables_with_similarity = []
    prompt_embedding = self.embedding.embed_query(prompt)
    
    similarities = np.dot(table_embeddings, prompt_embedding) / (
        np.linalg.norm(table_embeddings) * np.linalg.norm(prompt_embedding)
    )
    
    for i in range(len(tables)):
        tables_with_similarity.append((tables[i], similarities[i]))
    
    tables_with_similarity.sort(key=lambda x: x[1], reverse=True)
    return [table[0] for table in tables_with_similarity]

微调数据集构建实战

步骤1：数据库连接与扫描

首先建立数据库连接并扫描表结构：

curl -X 'POST' \
  'http://localhost/api/v1/database-connections' \
  -H 'Content-Type: application/json' \
  -d '{
    "alias": "production_db",
    "use_ssh": false,
    "connection_uri": "postgresql://user:password@localhost:5432/mydb"
  }'

步骤2：收集Golden SQL

通过API添加已验证的问答对：

curl -X 'POST' \
  'http://localhost/api/v1/golden-sqls' \
  -H 'Content-Type: application/json' \
  -d '{
    "db_connection_id": "656e52cb4d1fda50cae7b939",
    "prompt_text": "查询每个部门的员工数量",
    "sql": "SELECT department, COUNT(*) as employee_count FROM employees GROUP BY department"
  }'

步骤3：启动微调任务

curl -X 'POST' \
  'http://localhost/api/v1/finetunings' \
  -H 'Content-Type: application/json' \
  -d '{
    "db_connection_id": "656e52cb4d1fda50cae7b939",
    "alias": "sales_model_v1",
    "golden_sqls": ["gsql_1", "gsql_2", "gsql_3"],
    "base_llm": {
      "model_name": "gpt-3.5-turbo"
    }
  }'

数据质量保障机制

令牌数验证

确保每个训练样本不超过模型上下文窗口：

def count_tokens(self, messages: dict) -> int:
    prompt = ""
    for message in messages["messages"]:
        prompt += message["content"]
    return len(self.encoding.encode(prompt))

# 检查令牌数限制
if number_of_tokens > OPENAI_FINETUNING_MODELS_WINDOW_SIZES[model_name]:
    raise ValueError("令牌数超出限制")

自动化验证流程

mermaid

性能优化策略

1. 批量处理优化

采用并行处理Golden SQL，显著提升数据集生成速度：

for index, golden_sql_id in enumerate(self.fine_tuning_model.golden_sqls):
    logger.info(f"处理Golden SQL {index + 1}/{总数}")
    # 并行处理逻辑

2. 内存管理

临时文件处理和自动清理机制：

finetuning_dataset_path = f"tmp/{str(uuid.uuid4())}.jsonl"
# 数据处理...
with open(finetuning_dataset_path, "a") as outfile:
    for messages in results:
        json.dump(messages, outfile)
        outfile.write("\n")
# 文件上传后自动清理
os.remove(finetuning_dataset_path)

企业级应用场景

场景1：电商数据分析

数据特征：

多表关联查询（订单、用户、商品）
复杂的业务逻辑（促销、折扣、会员等级）
实时性要求高

微调效果：

查询准确率提升40%
响应时间减少60%
支持复杂业务问答

场景2：金融风控系统

特殊要求：

严格的合规性检查
敏感数据过滤
审计日志记录

实现方案：

# 添加风控特定指令
instructions = {
    "never_include": ["ssn", "credit_card"],
    "always_filter": ["active_status = true"],
    "audit_logging": True
}

最佳实践指南

1. Golden SQL质量标准

质量等级	特征	数量建议
优秀	问题清晰，SQL优化，覆盖主要业务场景	100-200条
良好	问题明确，SQL正确，覆盖常见场景	50-100条
基础	简单问答，基础SQL操作	20-50条

2. 数据分布策略

mermaid

3. 持续优化循环

收集业务问题 → 生成Golden SQL → 模型微调 → 
部署验证 → 监控性能 → 收集反馈 → 优化数据集

技术挑战与解决方案

挑战1：上下文长度限制

解决方案：

智能表选择算法
动态上下文裁剪
分层注意力机制

挑战2：数据一致性

验证机制：

自动化SQL语法检查
业务逻辑验证
执行结果对比

挑战3：模型泛化能力

增强策略：

多样化问题表述
多数据库类型支持
增量学习机制

未来发展方向

1. 自动化数据增强

基于LLM的问题重述
SQL等价变换
负样本生成

2. 多模态数据集

结合图表描述
自然语言解释
可视化问答对

3. 实时学习系统

用户反馈收集
自动质量评估
动态模型更新

总结

Dataherald的微调数据集构建系统通过智能化的Golden SQL管理、多层次的上下文增强和严格的质量控制，为企业级NL-to-SQL应用提供了高效、可靠的数据 foundation。无论是电商、金融还是制造业，都能通过这套系统快速构建领域专用的智能查询能力。

关键收获：

✅ Golden SQL是微调成功的核心
✅ 智能上下文选择大幅提升效果
✅ 自动化流程确保数据质量
✅ 持续优化实现业务价值最大化

通过Dataherald，企业可以快速将自然语言查询能力集成到现有系统中，真正实现"用自然语言对话数据"的愿景。

【免费下载链接】dataherald 项目地址: https://gitcode.com/GitHub_Trending/da/dataherald

创作声明：本文部分内容由AI辅助生成（AIGC），仅供参考