Synthetic Data Generation with General-Purpose LLMs

Licensed under CC 4.0 BY-SA.

This is an original article by the blog author and may not be reproduced without the author's permission.

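Generating synthetic data with large language models (LLMs) offers a powerful solution to a common problem: the availability of high-quality, diverse, and privacy-compliant data. It can be used in many scenarios, such as training a data science machine learning model (SVMs, decision trees, KNN), fine-tuning a different GPT model on the data, solving the cold-start problem, building compelling demos/apps with realistic data, scenario testing, and so on.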

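There are several key drivers that may lead you to use synthetic data:

Human data may carry privacy restrictions and/or identifiable information that we do not want to use.
Synthetic data can be much more structured than real data and is therefore easier to manipulate.
In domains where data is sparse, or where certain categories of data are sparse, we may want to augment the data.
When dealing with imbalanced datasets, or datasets that lack diversity, we may want to create data to improve the richness of our datasets.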

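Unlike traditional data augmentation or manual data creation methods, LLMs can generate rich, nuanced, and contextually relevant datasets, which can significantly increase their usefulness to businesses and developers.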

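We split this tutorial into two parts. In this guide, we will cover the following agenda:

CSV with a structured prompt
CSV with a Python program
Multi-table CSV with a Python program
Simply creating textual data
In Part 2, we will deal with imbalanced or non-diverse textual data and look at prompting strategies for getting better textual data.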

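The last two approaches are especially useful for creating synthetic data to fine-tune another GPT model. For example, higher-quality data produced by gpt-4o can be used to fine-tune the cheaper and faster gpt-3.5-turbo, improving performance while reducing costs.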

Getting set up
%pip install openai
%pip install pandas
%pip install scikit-learn
%pip install matplotlib

from openai import OpenAI
import os
import re
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import json
import matplotlib

client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY", "<your OpenAI API key if not set as env var>"))

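1. CSV with a structured prompt

Here we create data in the simplest way. You can quickly generate data by addressing three key points: telling the model the format of the data (CSV), the schema, and useful information about how the columns relate to each other (the LLM can infer this from the column names, but giving it a hand improves performance).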
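The three key points above (format, schema, column relationships) can also be assembled programmatically, which helps when the same prompt shape is reused for different tables. The following is only a minimal sketch: `build_csv_prompt` and its parameters are hypothetical names introduced for illustration and are not part of the original tutorial or of any library.

```python
def build_csv_prompt(n_rows, subject, fields, hints):
    """Assemble a data-generation prompt from the format, schema, and column hints.

    Hypothetical helper for illustration only.
    """
    # Schema: one bullet per field, in the same style as the hand-written prompt.
    field_lines = "\n".join(f" - {name}" for name in fields)
    # Useful information: free-text hints about how the columns relate.
    hint_text = " ".join(hints)
    return (
        f"Create a CSV file with {n_rows} rows of {subject}.\n"
        "Each row should include the following fields:\n"
        f"{field_lines}\n\n"
        f"{hint_text} Also only respond with the CSV."
    )


prompt = build_csv_prompt(
    n_rows=10,
    subject="housing data",
    fields=[
        "id (incrementing integer starting at 1)",
        "house size (m^2)",
        "house price",
        "location",
        "number of bedrooms",
    ],
    hints=[
        "Make sure that the numbers make sense",
        "(i.e. more rooms is usually bigger size, more expensive locations increase price).",
    ],
)
print(prompt)
```

The resulting string can be passed as the user message in the same way as the hand-written `question` in this section.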

datagen_model = "gpt-4o-mini"
question = """
Create a CSV file with 10 rows of housing data.
Each row should include the following fields:
 - id (incrementing integer starting at 1)
 - house size (m^2)
 - house price
 - location
 - number of bedrooms

Make sure that the numbers make sense (i.e. more rooms is usually bigger size, more expensive locations increase price. more size is usually higher price etc. make sure all the numbers make sense). Also only respond with the CSV.
"""

response = client.chat.completions.create(
  model=datagen_model,
  messages=[
    {"role": "system", "content": "You are a helpful assistant designed to generate synthetic data."},
    {"role": "user", "content": question}
  ]
)
res = response.choices[0].message.content
print(res)

id,house_size_m2,house_price,location,number_of_bedrooms
1,50,150000,Suburban,2
2,75,250000,City Center,3
3,100,350000,Suburban,4
4,120,450000,Suburban,4
5,80,300000,City Center,3
6,90,400000,City Center,3
7,150,600000,Premium Area,5
8,200,750000,Premium Area,5
9,55,180000,Suburban,2
10,300,950000,Premium Area,6
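Because the model replies with plain CSV text, it is usually convenient to load that text into a pandas DataFrame before using it. The sketch below is one possible way to do this; `csv_text_to_dataframe` is a hypothetical helper name, and the fence-stripping step is an assumption that the model may occasionally wrap its answer in a Markdown fence despite being told to respond with only the CSV. The sample string stands in for `res` so the example runs without an API call.

```python
import io

import pandas as pd

FENCE = "`" * 3  # Markdown code-fence marker


def csv_text_to_dataframe(text):
    """Parse CSV text returned by the model, tolerating an optional Markdown fence."""
    lines = text.strip().splitlines()
    # Drop fence lines if the model added them anyway.
    lines = [line for line in lines if not line.strip().startswith(FENCE)]
    return pd.read_csv(io.StringIO("\n".join(lines)))


# Stand-in for the model response (first rows of the output above).
sample = """id,house_size_m2,house_price,location,number_of_bedrooms
1,50,150000,Suburban,2
2,75,250000,City Center,3
3,100,350000,Suburban,4"""

df = csv_text_to_dataframe(sample)
print(df.dtypes)
print(df)
```

Inspecting `df.dtypes` is a cheap first check that numeric columns were actually generated as numbers rather than as text.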

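2. CSV with a Python program

The issue with generating data directly is that the amount of data we can generate is limited by the context window. Instead, we can ask the LLM to write a Python program that generates the synthetic data. This lets us scale to much more data, and inspecting the Python program also shows us how the data was generated.

We can then edit the Python program as needed, while it gives us a good starting point.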

question = """
Create a Python program to generate 100 rows of housing data.
I want you to at the end of it output a pandas dataframe with 100 rows of data.
Each row should include the following fields:
 - id (incrementing integer starting at 1)
 - house size (m^2)
 - house price
 - location
 - number of bedrooms

Make sure that the numbers make sense (i.e. more rooms is usually bigger size, more expensive locations increase price. more size is usually higher price etc. make sure all the numbers make sense).
"""

response = client.chat.completions.create(
  model=datagen_model,
  messages=[
    {"role": "system", "content": "You are a helpful assistant designed to generate synthetic data."},
    {"role": "user", "content": question}
  ]
)
res = response.choices[0].message.content
print(res)

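Certainly! Below is a Python program that generates synthetic housing data according to your specifications. We will create a pandas DataFrame with the defined fields and characteristics.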
 
import pandas as pd
import random

def generate_housing_data(num_rows):
    data = []

    locations = [
        ('City Center', 10000, 150),  # (location name, base price per m², base size)
        ('Suburban', 8000, 100),
        ('Country Side', 5000, 80),
        ('Coastal Region', 12000, 110),
        ('Urban Neighborhood', 9000, 130)
    ]

    for i in range(1, num_rows + 1):
        # Randomly pick a location
        location, base_price_per_m2, base_size = random.choice(locations)

        # Generate the number of bedrooms (1 to 5)
        number_of_bedrooms = random.randint(1, 5)

        # Calculate house size based on the number of bedrooms
        house_size = base_size + (10 * number_of_bedrooms) + random.randint(-5, 15)  # Adding some noise

        # Calculate house price based on house size and location
        house_price = base_price_per_m2 * house_size + random.randint(-5000, 10000)  # Adding some noise

        # Append the generated data to the list
        data.append({
            'id': i,
            'house_size_m2': house_size,
            'house_price': house_price,
            'location': location,
            'number_of_bedrooms': number_of_bedrooms
        })

    # Create a pandas DataFrame
    df = pd.DataFrame(data)
    return df

# Generate 100 rows of housing data
housing_data_df = generate_housing_data(100)

# Show the result
print(housing_data_df)

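### Explanation:

The generate_housing_data function creates synthetic housing data for a specified number of rows (num_rows).
We define different locations with corresponding base prices per square meter and average house sizes.
For each house, we randomly select a location and a number of bedrooms, then calculate house size and price to ensure a sensible correlation between the values.
Finally, we create a pandas DataFrame from the generated data and return it.

You can run this program in your Python environment, and it will output a DataFrame containing 100 rows of synthetic housing data.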

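We need to make sure this output is parsed appropriately, because the Python code is often surrounded by explanatory text. We can also explicitly ask the model to state every assumption it made about the data it generates; in this case, however, it told us automatically.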
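One way to separate the code from the surrounding text is to collect only the lines inside Markdown code fences. The sketch below assumes the model wraps its program in a fenced block, optionally tagged `python` or `py`, which is common but not guaranteed; `extract_python_code` is a hypothetical helper name. The built-in `compile` is used only as a syntax check, and any generated program should still be reviewed before it is executed.

```python
FENCE = "`" * 3  # Markdown code-fence marker


def extract_python_code(text):
    """Return the Python code found in fenced blocks of `text`, or None if there is none."""
    blocks, current = [], []
    inside, keep = False, False
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith(FENCE):
            if not inside:
                # Opening fence: keep untagged blocks and blocks tagged as Python.
                tag = stripped[len(FENCE):].strip().lower()
                keep = tag in ("", "python", "py")
                inside, current = True, []
            else:
                # Closing fence: store the block if it was a Python one.
                if keep:
                    blocks.append("\n".join(current))
                inside = False
        elif inside:
            current.append(line)
    return "\n\n".join(blocks) if blocks else None


# Stand-in for a model response with text before and after the code.
sample_response = (
    "Certainly! Below is a Python program.\n\n"
    + FENCE + "python\n"
    + "import random\n\ndef roll():\n    return random.randint(1, 6)\n"
    + FENCE + "\n\n### Explanation:\nThe roll function returns a random integer.\n"
)

code = extract_python_code(sample_response)
compile(code, "<llm-output>", "exec")  # raises SyntaxError if the code does not parse
print(code)
```

Returning `None` when no fenced block is found makes it easy to detect a reply that needs a retry or a stricter prompt.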

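3. Multi-table CSV with a Python program

For more complex relationships, however, we need to make sure we specify more characteristics.

To create multiple datasets that relate to each other (for example housing, location, house type), we need to specify the format, the schema, and useful information as before. However, more useful information is now needed to get good performance. The details are case-specific, but the things worth describing include how the datasets relate to each other, how large the datasets should be relative to one another, making sure foreign keys and primary keys are generated correctly, and ideally using previously generated datasets to populate new ones so that the actual data values match where necessary.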
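The relationships described above can be sketched by building the parent tables first and drawing every foreign key from them, so the keys match by construction. This is an illustrative sketch only, not the program the model returns: the table contents, the column names `location_id`, `house_type_id`, `price_per_m2`, and `base_size_m2`, and the fixed seed are all assumptions made for the example.

```python
import random

import pandas as pd

random.seed(0)  # fixed seed so the sketch is reproducible

# Parent tables are created first (all values are made up for illustration).
locations = pd.DataFrame({
    "id": [1, 2, 3],
    "country": ["Country A", "Country A", "Country B"],
    "city": ["City X", "City Y", "City Z"],
    "price_per_m2": [10000, 8000, 5000],
})
house_types = pd.DataFrame({
    "id": [1, 2, 3],
    "house_type": ["Apartment", "Townhouse", "Detached"],
    "base_size_m2": [60, 110, 160],
})

location_rows = locations.to_dict("records")
type_rows = house_types.to_dict("records")

# The child table draws its foreign keys from the parent tables.
rows = []
for house_id in range(1, 101):
    loc = random.choice(location_rows)
    htype = random.choice(type_rows)
    bedrooms = random.randint(1, 5)
    size = htype["base_size_m2"] + 10 * bedrooms + random.randint(-5, 15)
    price = loc["price_per_m2"] * size + random.randint(-5000, 10000)
    rows.append({
        "id": house_id,
        "house_size_m2": size,
        "house_price": price,
        "number_of_bedrooms": bedrooms,
        "location_id": loc["id"],
        "house_type_id": htype["id"],
    })
housing = pd.DataFrame(rows)

# Aggregate columns on house_types are derived from the housing table,
# so the actual values agree across tables.
stats = housing.groupby("house_type_id")["house_price"].agg(
    average_price="mean", number_of_houses="count"
)
house_types = house_types.merge(stats, left_on="id", right_index=True, how="left")

# Referential integrity: every foreign key must exist in its parent table.
assert housing["location_id"].isin(locations["id"]).all()
assert housing["house_type_id"].isin(house_types["id"]).all()
print(house_types)
```

The same `isin` checks can be run against the dataframes produced by an LLM-generated program to confirm that its foreign keys line up.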

question = """
Create a Python program to generate 3 different pandas dataframes.

1. Housing data
I want 100 rows. Each row should include the following fields:
 - id (incrementing integer starting at 1)
 - house size (m^2)
 - house price
 - location
 - number of bedrooms
 - house type
 + any relevant foreign keys

2. Location
Each row should include the following fields:
 - id (incrementing integer starting at 1)
 - country
 - city
 - population
 - area (m^2)
 + any relevant foreign keys

 3. House types
 - id (incrementing integer starting at 1)
 - house type
 - average house type price
 - number of houses
 + any relevant foreign keys

Make sure that the numbers make sense (i.e. more rooms is usually bigger size, more expensive locations increase price. more size is usually higher price etc. make sure all the numbers make sense).
Make sure that the dataframe generally follow common sense checks, e.g. the size of the dataframes make sense in comparison with one another.
Make sure the foreign keys match up and you can use previously generated dataframes when creating each consecutive dataframes.
You can use the previously generated dataframe to generate the next dataframe.
"""

response = client.chat.completions.create(
  model=datagen_model,
  messages=[
    {"role": "system", "content": "You are a helpful assistant designed to generate synthetic data."},
    {"role": "user", "content": question}
  ]
)
res = response.choices[0].message.content
print(res)