Log File Sync / Log File Parallel Write (Single Instance)

This article takes a close look at the Log File Sync and Log File Parallel Write wait events in Oracle, analyses why they occur (overly frequent commits, an oversized log buffer, too few redo log files, and so on), and gives the corresponding diagnostics and fixes, including batching commits, adjusting the redo log configuration, and tuning system I/O and CPU usage.

1 Wait Event Overview

When a user session finishes a transaction (for example an INSERT) and issues COMMIT or ROLLBACK, the Oracle background process LGWR must write the contents of the redo log buffer to the online redo log files; only once that write completes does the session receive "Commit complete" / "Rollback complete". The time the session spends waiting for this flush is recorded as the Log File Sync wait event.

Log File Parallel Write is the wait event that covers LGWR's own work, from the moment it starts writing to the online redo log files until the write finishes.
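Both events can be read from the dynamic performance views. A minimal sketch, assuming the standard v$session_event / v$system_event columns, that lists the accumulated waits (note that log file parallel write is charged to the LGWR session, not to the user session):

-- waits accumulated by one session; :sid is a placeholder for the session of interest
select event, total_waits, time_waited_micro / 1e6 as seconds_waited
from   v$session_event
where  sid = :sid
and    event in ('log file sync', 'log file parallel write');

-- instance-wide totals for the same two events
select event, total_waits, time_waited_micro / 1e6 as seconds_waited
from   v$system_event
where  event in ('log file sync', 'log file parallel write');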

2 Reproducing the Wait Event

2.1 A Single Commit

With a single insert and commit, the wait event cannot be observed.

2.2 Repeated Commits

The two blocks below run the same 100,000-row insert loop and differ only in how the commit is issued: the first forces an immediate, synchronous redo write on every row (COMMIT WORK WRITE WAIT IMMEDIATE), while the second relies on a plain COMMIT.

drop table sync;
create table sync (x int);
begin
    for i in 1 .. 100000
    loop
        insert into sync values (i);
        commit work write wait immediate;
    end loop;
end;
/


drop table sync;
create table sync (x int);
begin
    for i in 1 .. 100000
    loop
        insert into sync values (i);
        commit;
    end loop;
end;
/

Related initialization parameters: commit_write, commit_logging, commit_wait. They control whether a COMMIT waits for LGWR and whether the redo write is issued immediately or batched.
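A minimal sketch of setting them at session level (COMMIT_LOGGING/COMMIT_WAIT are the 11g+ pair; COMMIT_WRITE is the older 10gR2 parameter they replaced). Asynchronous commits remove the log file sync wait but risk losing the last few committed transactions on instance failure:

alter session set commit_logging = immediate;  -- default: redo written at each commit
alter session set commit_wait    = wait;       -- default: session waits for LGWR

alter session set commit_logging = batch;      -- let LGWR group the redo writes
alter session set commit_wait    = nowait;     -- return from COMMIT without waiting

-- pre-11g (deprecated) equivalent:
-- alter session set commit_write = 'BATCH,NOWAIT';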

The next block simply generates a large volume of redo by repeatedly doubling a copy of dba_objects:

drop table baipx;
create table baipx as select * from dba_objects;
begin
    for i in 1 .. 100
    loop
        insert into baipx select * from baipx;
        commit;
    end loop;
end;
/
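To see how much redo the session generated by such a run, the session's own statistics can be sampled before and after; a minimal sketch using v$mystat joined to v$statname:

select n.name, s.value
from   v$mystat  s
       join v$statname n on n.statistic# = s.statistic#
where  n.name in ('redo size', 'user commits');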


2.3 Other Related Wait Events

log file sequential read — waits while a process (for example the archiver or recovery) reads blocks back from the redo log files.
log file parallel write — LGWR's own write to the online redo log members.

3 Causes

Database level

  1. Commits or rollbacks are issued too frequently -> check the instance statistics (the user commits and user rollbacks counters; see the query sketch after this list)
  2. The log buffer is too large
  3. Too few redo log files
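For item 1, the commit and rollback rates come straight from the instance statistics; a minimal sketch against v$sysstat (sample it twice and compare the values to get a rate):

select name, value
from   v$sysstat
where  name in ('user commits', 'user rollbacks');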

System level
Use the log file parallel write wait event as a reference to tell the two cases apart (see the comparison query after this list):

  1. Slow I/O makes LGWR slow to write the redo (log file sync ≈ log file parallel write)
  2. CPU usage is too high, so sessions wait to be scheduled even after the write finishes (log file sync > log file parallel write)
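The rule above can be applied with a single comparison of the two events' average wait times; a minimal sketch against v$system_event:

select event,
       total_waits,
       round(time_waited_micro / 1000 / nullif(total_waits, 0), 2) as avg_wait_ms
from   v$system_event
where  event in ('log file sync', 'log file parallel write');

-- avg_wait_ms of 'log file sync' close to 'log file parallel write' -> suspect slow I/O
-- avg_wait_ms of 'log file sync' much larger                        -> suspect CPU starvation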

4 Solutions

Address whichever of the causes above applies.
Database level

  1. Commits or rollbacks are issued too frequently -> batch the commits (see the sketch after this list)
  2. The log buffer is too large -> reduce log_buffer
  3. Too few redo log files -> add redo log groups/members
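For item 1, the usual fix is to commit once per batch instead of once per row. A minimal sketch of the earlier test loop reworked to commit every 1,000 rows (the batch size is only an illustration):

begin
    for i in 1 .. 100000
    loop
        insert into sync values (i);
        if mod(i, 1000) = 0 then
            commit;     -- one log file sync per 1,000 rows instead of one per row
        end if;
    end loop;
    commit;             -- flush the remainder
end;
/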

System level

  1. Slow I/O makes LGWR slow to write the redo -> put the online redo logs on faster storage (see the sketch after this list)
  2. CPU usage is too high -> reduce CPU contention on the server so LGWR and the waiting sessions get scheduled promptly
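For the I/O case, one practical step is to move the online redo logs to the fastest storage available, for example by adding new log groups there and dropping the old ones once they are inactive; a minimal sketch with hypothetical paths and sizes:

-- paths and sizes are hypothetical, for illustration only
alter database add logfile group 4 ('/fastdisk/redo04.log') size 1g;
alter database add logfile group 5 ('/fastdisk/redo05.log') size 1g;

alter system switch logfile;                 -- cycle out of the old groups
-- alter database drop logfile group 1;      -- once the group is INACTIVE and archived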

5 Further Reading

log file switch (log switch waits)
log file sync (log file synchronization)
Oracle log file sync internals explained
Diagnosing log file sync caused by frequent commits
