(Link to the original dataset 📚)
Bitext - Customer Service Tagged Training Dataset for LLM-based Virtual Assistants
(A tagged customer service training dataset for training LLM-based virtual assistants)
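Overview
This hybrid synthetic dataset is designed for fine-tuning large language models such as GPT, Mistral, and OpenELM. It was generated with our NLP/NLG technology and our automated Data Labeling (DAL) tools. The goal is to demonstrate how verticalization/domain adaptation for the customer support sector can be achieved easily with our two-step approach to LLM fine-tuning. For example, if you are [ACME Company], you can create your own customized LLM by first training a fine-tuned model on this dataset, and then fine-tuning it further with a small amount of your own data. An overview of this approach can be found at: From General-Purpose LLMs to Verticalized Enterprise Models
The dataset has the following specifications: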
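- Use case: intent detection
- Vertical: customer service
- 27 intents assigned to 10 categories
- 26,872 question/answer pairs, roughly 1,000 per intent
- 30 entity (slot) types
- 12 different types of language generation tags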
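The categories and intents were selected from Bitext's collection of 20 vertical-specific datasets and cover the intents that are common to all 20 verticals. The verticals are:
- Automotive, Retail Banking, Education, Events & Ticketing, Field Services, Healthcare, Hospitality, Insurance, Legal Services, Manufacturing, Media Streaming, Mortgages & Loans, Moving & Storage, Real Estate/Construction, Restaurant & Bar Chains, Retail/E-commerce, Telecommunications, Travel, Utilities, Wealth Management
For a full list of verticals and their intents, see https://www.bitext.com/chatbot-verticals/.
The question/answer pairs were generated with a hybrid methodology: natural texts serve as source text, NLP technology extracts seeds from those texts, and NLG technology expands the seed texts. Every step of the process is curated by computational linguists.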
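Dataset Token Count
The dataset contains a large amount of text in its "instruction" and "response" columns. After processing and tokenizing the dataset, we identified a total of 3.57 million tokens. This rich set of tokens is essential for training advanced LLMs for conversational AI, generative AI, and question answering (Q&A) models.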
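The text does not say which tokenizer produced the 3.57 million figure. As a rough illustration only, the sketch below counts whitespace-separated tokens over the two text columns. The whitespace split is an assumption standing in for the real tokenizer, and `sample_rows` holds made-up rows rather than dataset content, so the result will not match the published count.

```python
from typing import Iterable, Mapping


def count_tokens(
    rows: Iterable[Mapping[str, str]],
    columns: tuple[str, ...] = ("instruction", "response"),
) -> int:
    """Sum whitespace-separated tokens over the given text columns."""
    total = 0
    for row in rows:
        for column in columns:
            # Assumption: a plain whitespace split approximates tokenization.
            total += len(row.get(column, "").split())
    return total


# Hypothetical sample rows, not taken from the dataset.
sample_rows = [
    {"instruction": "can u cancel my order", "response": "Sure, I can help with that."},
    {"instruction": "activate SIM", "response": "Your SIM is now active."},
]

print(count_tokens(sample_rows))
```

Swapping in a model-specific tokenizer would only require replacing the `split()` call.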
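Fields of the Dataset
Each entry in the dataset contains the following fields:
- flags: tags (explained in the "Language Generation Tags" section below)
- instruction: a user request from the customer service domain
- category: the high-level semantic category of the intent
- intent: the intent corresponding to the user instruction
- response: an example of the expected response from the virtual assistant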
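To make the field list concrete, here is a minimal sketch that models one entry as a dataclass and reads entries from CSV text. It assumes the data is available as a CSV file whose header row uses the five field names above. The `Entry` class, the `read_entries` helper, and the sample row are all hypothetical and written only for illustration.

```python
import csv
import io
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Entry:
    flags: str        # language generation tags, e.g. "BQ"
    instruction: str  # user request
    category: str     # high-level semantic category
    intent: str       # intent matching the instruction
    response: str     # example assistant response


def read_entries(csv_text: str) -> list[Entry]:
    """Parse CSV text with a header row into Entry objects."""
    names = [f.name for f in fields(Entry)]
    reader = csv.DictReader(io.StringIO(csv_text))
    return [Entry(**{name: row[name] for name in names}) for row in reader]


# Hypothetical sample row, not copied from the dataset.
SAMPLE_CSV = (
    "flags,instruction,category,intent,response\n"
    'BQ,can u cancel my order,ORDER,cancel_order,'
    '"Sure, I can help you cancel order {{Order Number}}."\n'
)

entries = read_entries(SAMPLE_CSV)
print(entries[0].intent, entries[0].category)
```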
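Categories and Intents
The categories and intents covered by the dataset are:
(Names in uppercase are intent categories, and lowercase names joined by underscores are the intents within each category. Green marks newly added intents or categories, and strikethrough marks removed ones.)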
- ACCOUNT: create_account, delete_account, edit_account, switch_account
- CANCELLATION_FEE: check_cancellation_fee
- DELIVERY: delivery_options
- FEEDBACK: complaint, review
- INVOICE: check_invoice, get_invoice
- NEWSLETTER: newsletter_subscription
- ORDER: cancel_order, change_order, place_order
- PAYMENT: check_payment_methods, payment_issue
- REFUND: check_refund_policy, track_refund
- SHIPPING_ADDRESS: change_shipping_address, set_up_shipping_address
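The category list above can be expressed as a lookup table, which is handy for checking that an entry's `category` and `intent` fields agree. The sketch below includes only the categories and intents listed in this section. The names `CATEGORY_INTENTS`, `INTENT_TO_CATEGORY`, and `category_of` are hypothetical helpers, not part of the dataset.

```python
# Categories and intents exactly as listed in this section.
CATEGORY_INTENTS = {
    "ACCOUNT": ["create_account", "delete_account", "edit_account", "switch_account"],
    "CANCELLATION_FEE": ["check_cancellation_fee"],
    "DELIVERY": ["delivery_options"],
    "FEEDBACK": ["complaint", "review"],
    "INVOICE": ["check_invoice", "get_invoice"],
    "NEWSLETTER": ["newsletter_subscription"],
    "ORDER": ["cancel_order", "change_order", "place_order"],
    "PAYMENT": ["check_payment_methods", "payment_issue"],
    "REFUND": ["check_refund_policy", "track_refund"],
    "SHIPPING_ADDRESS": ["change_shipping_address", "set_up_shipping_address"],
}

# Inverted index: intent name -> category name.
INTENT_TO_CATEGORY = {
    intent: category
    for category, intents in CATEGORY_INTENTS.items()
    for intent in intents
}


def category_of(intent: str) -> str:
    """Return the category for an intent, or raise ValueError if unlisted."""
    try:
        return INTENT_TO_CATEGORY[intent]
    except KeyError:
        raise ValueError(f"intent not in the list above: {intent!r}") from None


print(category_of("track_refund"))
```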
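Entities
The entities (slots) in the dataset are:
(The placeholders below appear in the responses, and their actual values are meant to be filled in from a database lookup.)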
- {{Order Number}}, typically present in:
- Intents: cancel_order, change_order, change_shipping_address, check_invoice, check_refund_policy, complaint, delivery_options, delivery_period, get_invoice, get_refund, place_order, track_order, track_refund
- {{Invoice Number}}, typically present in:
- Intents: check_invoice, get_invoice
- {{Online Order Interaction}}, typically present in:
- Intents: cancel_order, change_order, check_refund_policy, delivery_period, get_refund, review, track_order, track_refund
- {{Online Payment Interaction}}, typically present in:
- Intents: cancel_order, check_payment_methods
- {{Online Navigation Step}}, typically present in:
- Intents: complaint, delivery_options
- {{Online Customer Support Channel}}, typically present in:
- Intents: check_refund_policy, complaint, contact_human_agent, delete_account, delivery_options, edit_account, get_refund, payment_issue, registration_problems, switch_account
- {{Profile}}, typically present in:
- Intent: switch_account
- {{Profile Type}}, typically present in:
- Intent: switch_account
- {{Settings}}, typically present in:
- Intents: cancel_order, change_order, change_shipping_address, check_cancellation_fee, check_invoice, check_payment_methods, contact_human_agent, delete_account, delivery_options, edit_account, get_invoice, newsletter_subscription, payment_issue, place_order, recover_password, registration_problems, set_up_shipping_address, switch_account, track_order, track_refund
- {{Online Company Portal Info}}, typically present in:
- Intents: cancel_order, edit_account
- {{Date}}, typically present in:
- Intents: check_invoice, check_refund_policy, get_refund, track_order, track_refund
- {{Date Range}}, typically present in:
- Intents: check_cancellation_fee, check_invoice, get_invoice
- {{Shipping Cut-off Time}}, typically present in:
- Intent: delivery_options
- {{Delivery City}}, typically present in:
- Intent: delivery_options
- {{Delivery Country}}, typically present in:
- Intents: check_payment_methods, check_refund_policy, delivery_options, review, switch_account
- {{Salutation}}, typically present in:
- Intents: cancel_order, check_payment_methods, check_refund_policy, create_account, delete_account, delivery_options, get_refund, recover_password, review, set_up_shipping_address, switch_account, track_refund
- {{Client First Name}}, typically present in:
- Intents: check_invoice, get_invoice
- {{Client Last Name}}, typically present in:
- Intents: check_invoice, create_account, get_invoice
- {{Customer Support Phone Number}}, typically present in:
- Intents: change_shipping_address, contact_customer_service, contact_human_agent, payment_issue
- {{Customer Support Email}}, typically present in:
- Intents: cancel_order, change_shipping_address, check_invoice, check_refund_policy, complaint, contact_customer_service, contact_human_agent, get_invoice, get_refund, newsletter_subscription, payment_issue, recover_password, registration_problems, review, set_up_shipping_address, switch_account
- {{Live Chat Support}}, typically present in:
- Intents: check_refund_policy, complaint, contact_human_agent, delete_account, delivery_options, edit_account, get_refund, payment_issue, recover_password, registration_problems, review, set_up_shipping_address, switch_account, track_order
- {{Website URL}}, typically present in:
- Intents: check_payment_methods, check_refund_policy, complaint, contact_customer_service, contact_human_agent, create_account, delete_account, delivery_options, get_refund, newsletter_subscription, payment_issue, place_order, recover_password, registration_problems, review, switch_account
- {{Upgrade Account}}, typically present in:
- Intents: create_account, edit_account, switch_account
- {{Account Type}}, typically present in:
- Intents: cancel_order, change_order, change_shipping_address, check_cancellation_fee, check_invoice, check_payment_methods, check_refund_policy, complaint, contact_customer_service, contact_human_agent, create_account, delete_account, delivery_options, delivery_period, edit_account, get_invoice, get_refund, newsletter_subscription, payment_issue, place_order, recover_password, registration_problems, review, set_up_shipping_address, switch_account, track_order, track_refund
- {{Account Category}}, typically present in:
- Intents: cancel_order, change_order, change_shipping_address, check_cancellation_fee, check_invoice, check_payment_methods, check_refund_policy, complaint, contact_customer_service, contact_human_agent, create_account, delete_account, delivery_options, delivery_period, edit_account, get_invoice, get_refund, newsletter_subscription, payment_issue, place_order, recover_password, registration_problems, review, set_up_shipping_address, switch_account, track_order, track_refund
- {{Account Change}}, typically present in:
- Intent: switch_account
- {{Program}}, typically present in:
- Intent: place_order
- {{Refund Amount}}, typically present in:
- Intent: track_refund
- {{Money Amount}}, typically present in:
- Intents: check_refund_policy, complaint, get_refund, track_refund
- {{Store Location}}, typically present in:
- Intents: complaint, delivery_options, place_order
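The entities above appear in responses as `{{Name}}` placeholders that are later replaced with real values. A minimal sketch of that substitution step follows. The `fill_placeholders` function, the template sentence, and the `values` dictionary (standing in for a database lookup) are all hypothetical. Placeholders with no known value are left untouched so that missing data stays visible.

```python
import re

# Matches {{Entity Name}} and captures the name between the braces.
PLACEHOLDER = re.compile(r"\{\{([^{}]+)\}\}")


def fill_placeholders(template: str, values: dict[str, str]) -> str:
    """Replace each {{Name}} with values[Name]; keep unknown placeholders."""

    def substitute(match: re.Match) -> str:
        name = match.group(1).strip()
        return values.get(name, match.group(0))

    return PLACEHOLDER.sub(substitute, template)


# Hypothetical template and values, used only for illustration.
template = "Your order {{Order Number}} will be refunded {{Refund Amount}}."
values = {"Order Number": "A-1001"}

print(fill_placeholders(template, values))
```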
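Language Generation Tags
The dataset contains tags that reflect how language varies across different linguistic phenomena, such as colloquial or offensive language. For example, if an utterance for the intent "cancel_order" carries the "COLLOQUIAL" tag, it expresses an informal language variant such as "can u cancel my order" (here the initial "can" is not capitalized, "u" replaces "you", and there is no final punctuation).
These tags indicate the type of language variation that an entry expresses. Attached to each entry, they let conversational designers tailor training datasets to user profiles that use language differently. With these tags, many different datasets can be derived, making the resulting assistant more accurate and robust. A bot that sells sneakers should mainly target a younger population that uses more colloquial language, while a traditional retail banking bot should be able to handle more formal or polite language. The dataset also reflects linguistic phenomena that real-life virtual assistants commonly encounter, such as spelling mistakes, run-on words, and punctuation errors.
The dataset contains tagging for all relevant linguistic phenomena, which can be used to customize the dataset for different user profiles.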
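One way to tailor a training set to a user profile is to filter entries by their tags. The sketch below assumes that the `flags` field is a string of single-letter tag codes (the letters listed in the tag sections of this document), for example "BQ" for a basic, colloquial utterance. That encoding is an assumption, and the `has_tags` helper and the sample entries are hypothetical.

```python
def has_tags(flags: str, required: str = "", excluded: str = "") -> bool:
    """True if flags contains every required letter and no excluded letter."""
    present = set(flags)
    return set(required) <= present and not (set(excluded) & present)


# Hypothetical entries; flags are assumed to be concatenated tag letters.
entries = [
    {"flags": "B", "instruction": "I need to cancel my order"},
    {"flags": "BQ", "instruction": "can u cancel my order"},
    {"flags": "BW", "instruction": "I want to talk to a fxxxing agent"},
]

# Profile 1: a casual audience, keep only colloquial utterances (Q).
colloquial = [e for e in entries if has_tags(e["flags"], required="Q")]

# Profile 2: a formal audience, drop offensive utterances (W).
no_offensive = [e for e in entries if has_tags(e["flags"], excluded="W")]

print(len(colloquial), len(no_offensive))
```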
Tags for Lexical variation
(The original English wording of each tag is kept below to guard against possible mistranslation.)
M - Morphological variation: inflectional and derivational: "is my SIM card active", "is my SIM card activated"
(Morphological variation: "active" becomes "activated")
L - Semantic variations: synonyms, use of hyphens, compounding…: "what's my billing date", "what's my anniversary date"
(Semantic variation: "billing date" is replaced by the synonym "anniversary date")
Tags for Syntactic structure variation
B - Basic syntactic structure: "activate my SIM card", "I need to activate my SIM card"
(Plain imperative or declarative sentence)
I - Interrogative structure: "can you activate my SIM card?", "how do I activate my SIM card?"
(Interrogative sentence)
C - Coordinated syntactic structure: "I have a new SIM card, what do I need to do to activate it?"
(Coordinated sentence; here a statement is coordinated with a question)
N - Negation: "I do not want this item, where to cancel my order?"
(Negative sentence)
Tags for language register variations
P - Politeness variation: "could you help me activate my SIM card, please?"
(Polite request)
Q - Colloquial variation: "can u activ8 my SIM?"
(Colloquial phrasing: "u" replaces "you", and "activ8" replaces "activate" by sound)
W - Offensive language: "I want to talk to a fxxxing agent"
(Offensive language; the speaker swears)
Tags for stylistic variations
K - Keyword mode: "activate SIM", "new SIM"
(Keywords only, without a complete sentence)
E - Use of abbreviations: "I'm / I am interested in getting a new SIM"
(Abbreviation: the contraction "I'm" is used for "I am")
Z - Errors and Typos: spelling issues, wrong punctuation…: "how can i activaet my card"
(Errors and typos: "activaet" is a misspelling of "activate")
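The twelve tags in use can be collected into a legend for decoding an entry's `flags` value into readable names. As a sketch, the code below again assumes that `flags` is a string of single-letter codes. `TAG_NAMES` and `describe_flags` are hypothetical helper names.

```python
# Legend for the twelve tags in use, as listed in the sections above.
TAG_NAMES = {
    "M": "Morphological variation",
    "L": "Semantic variations",
    "B": "Basic syntactic structure",
    "I": "Interrogative structure",
    "C": "Coordinated syntactic structure",
    "N": "Negation",
    "P": "Politeness variation",
    "Q": "Colloquial variation",
    "W": "Offensive language",
    "K": "Keyword mode",
    "E": "Use of abbreviations",
    "Z": "Errors and Typos",
}


def describe_flags(flags: str) -> list[str]:
    """Translate each tag letter into its name; flag unknown letters."""
    return [TAG_NAMES.get(letter, f"unknown tag {letter!r}") for letter in flags]


print(describe_flags("BQZ"))
```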
Other tags not in use in this Dataset
D - Indirect speech: "ask my agent to activate my SIM card"
G - Regional variations: US English vs UK English: "truck" vs "lorry"; France French vs Canadian French: "tchatter" vs "clavarder"
R - Respect structures, language-dependent variations: English: "may" vs "can…"; French: "tu" vs "vous…"; Spanish: "tú" vs "usted…"
Y - Code switching: "activer ma SIM card"