explain extended

This article walks through how MySQL optimizes queries and how to use explain extended together with show warnings to see that optimization. Concrete examples show how to read the explain extended output to understand how a SQL statement will be executed and what the feature contributes to query analysis.


The extended option of EXPLAIN provides extra query-optimization information on top of a plain EXPLAIN; this information can then be retrieved with MySQL's SHOW WARNINGS command. Below is the simplest possible example.
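
If you want to follow along, a minimal setup might look like the following. The article never shows the definition of the account table; the two columns id and name are inferred from the SHOW WARNINGS output further down, so treat this as an illustrative sketch only.

CREATE TABLE account (
  id   INT NOT NULL PRIMARY KEY,   -- column inferred from the rewritten SELECT below
  name VARCHAR(50)                 -- column inferred from the rewritten SELECT below
);
INSERT INTO account VALUES (1, 'test');   -- one row, consistent with rows: 1 in the EXPLAIN output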

First, run EXPLAIN on the statement you want to analyze, adding the extended option:

MySQL> explain extended select * from account\G;

*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: account
type: ALL
possible_keys: NULL
key: NULL
key_len: NULL
ref: NULL
rows: 1
filtered: 100.00
Extra:
1 row in set, 1 warning (0.00 sec)

Next, run SHOW WARNINGS:

MySQL> show warnings\G;

*************************** 1. row ***************************
Level: Note
Code: 1003
Message: select dbunit.account.id AS id,dbunit.account.name AS name from dbunit.account
1 row in set (0.00 sec)
From the SHOW WARNINGS output we can see that the original select * has been rewritten by MySQL into

select dbunit.account.id AS id, dbunit.account.name AS name

Besides telling us what MySQL's query optimizer can do, explain extended also tells us what it cannot do. In the MySQL Performance Blog article "Extended EXPLAIN", the author used explain extended + show warnings to find cases that the MySQL query optimizer is unable to optimize.

From the output of

EXPLAIN extended SELECT * FROM sbtest WHERE id>5 AND id>6 AND c='a' AND pad=c

we learn that MySQL's query optimizer cannot merge the two conditions id>5 and id>6 into a single id>6.
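
You can check this on your own server with the same pair of commands. This is only a sketch, assuming a sysbench-style sbtest table with an integer primary key id and character columns c and pad, as in the original article:

-- Ask the optimizer for its rewritten form of the statement
EXPLAIN EXTENDED
SELECT * FROM sbtest WHERE id > 5 AND id > 6 AND c = 'a' AND pad = c;

-- If both range conditions still appear in the Message, the optimizer
-- did not merge id > 5 and id > 6 into a single id > 6
SHOW WARNINGS\G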

In the third example of the MySQL Performance Blog's "Extended EXPLAIN" article, and in Jingshi's (静室) article on the extended option of explain, both authors experiment further with explain extended. The conclusion we can draw from these two articles is that the explain extended output shows how the SQL will actually be executed, which is very helpful when analyzing SQL.

The following is excerpted from Jingshi's article on the extended option of explain.

The code and analysis below are quoted from that article.

MySQL> explain extended select * from t where a in (select b from i);
+----+--------------------+-------+------+
| id | select_type        | table | type |
+----+--------------------+-------+------+
|  1 | PRIMARY            | t     | ALL  |
|  2 | DEPENDENT SUBQUERY | i     | ALL  |
+----+--------------------+-------+------+
2 rows in set, 1 warning (0.01 sec)
The subquery appears to have nothing to do with the outer query, so why does MySQL report it as DEPENDENT SUBQUERY, i.e. a subquery that depends on the outer query? The explain extended result reveals the reason.

MySQL>show warnings\G

  1. row

Level: Note
Code: 1003
Message: select test.t.a AS a,test.t.b AS b,test.t.c AS c
from test.t where
(test.t.a,
(select 1 AS Not_used from test.i
where ((test.t.a) = test.i.b)))
1 row in set (0.00 sec)
Here MySQL has rewritten the SQL, applying its IN optimization: the IN subquery is turned into an EXISTS-style subquery that is correlated with the outer query, which is why EXPLAIN reports it as DEPENDENT SUBQUERY (a plain-SQL sketch of the equivalent form follows the excerpt below).

The code and analysis above are quoted from Jingshi's article on the extended option of explain.
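
For intuition, the rewritten Message above corresponds roughly to the hand-written query below. This is only an illustrative sketch assuming the tables t(a, b, c) and i(b) from the excerpt; it is not output produced by MySQL:

-- The correlation t.a = i.b is what makes the inner query a DEPENDENT SUBQUERY
SELECT t.a, t.b, t.c
FROM t
WHERE EXISTS (SELECT 1 FROM i WHERE t.a = i.b);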

One thing to note, however, is that the "optimized" statement obtained from EXPLAIN extended + show warnings may still not be the SQL that is ultimately executed; in other words, what explain extended shows is not enough to describe the final result of MySQL's optimization of the query. The second example in the same MySQL Performance Blog "Extended EXPLAIN" article illustrates this situation.

MySQL> EXPLAIN extended SELECT t1.id,t2.pad FROM sbtest t1, sbtest t2 WHERE t1.id=5 AND t2.k=t1.k;
+----+-------------+-------+-------+---------------+---------+---------+-------+-------+-------+
| id | select_type | TABLE | type  | possible_keys | KEY     | key_len | ref   | rows  | Extra |
+----+-------------+-------+-------+---------------+---------+---------+-------+-------+-------+
|  1 | SIMPLE      | t1    | const | PRIMARY,k     | PRIMARY | 4       | const |     1 |       |
|  1 | SIMPLE      | t2    | ref   | k             | k       | 4       | const | 55561 |       |
+----+-------------+-------+-------+---------------+---------+---------+-------+-------+-------+
2 rows IN SET, 1 warning (0.00 sec)
MySQL> SHOW warnings\G
*************************** 1. row ***************************
Level: Note
Code: 1003
Message: SELECT test.t1.id AS id,test.t2.pad AS pad FROM test.sbtest t1
JOIN test.sbtest t2 WHERE ((test.t2.k = test.t1.k) AND (test.t1.id = 5))
1 row IN SET (0.00 sec)
From the EXPLAIN result we can see that table t1 is accessed with the "const" type; in other words, MySQL will first use t1.id=5 to find the value of t1.k, and then use that value to look up rows in t2. Clearly, this optimization does not show up anywhere in the SHOW WARNINGS output above.
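
When explain extended is not enough, the optimizer trace can show the plan decisions in more detail. The sketch below assumes MySQL 5.6 or later, where the optimizer trace is available; it is one possible way to dig deeper, not something covered in the articles cited here:

-- Enable the trace, run the statement, then read the trace back
SET optimizer_trace = 'enabled=on';
SELECT t1.id, t2.pad FROM sbtest t1, sbtest t2 WHERE t1.id = 5 AND t2.k = t1.k;
SELECT TRACE FROM INFORMATION_SCHEMA.OPTIMIZER_TRACE\G
SET optimizer_trace = 'enabled=off';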

Summary

To close, quoting Jingshi once more from the article on the extended option of explain: "from the explain extended output, we can see how the SQL will be executed, which is very helpful when analyzing SQL".
