PyTorch Learning Notes 6: logging

Felix_M.

Last modified 2025-03-06 22:24:56


First published 2025-02-20 22:48:02

Article link: https://blog.youkuaiyun.com/mmmmfh/article/details/145747308


logging

[Official documentation]https://docs.python.org/zh-cn/3/library/logging.html
[Concise introduction]https://blog.youkuaiyun.com/lwgkzl/article/details/110853262
[Models]https://zhuanlan.zhihu.com/p/24355532201
[Format strings]https://blog.youkuaiyun.com/yizhuanlu9607/article/details/89530982

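I. Overview

logging is part of the Python standard library: it is a built-in module that can be imported without installing anything. It provides a general-purpose logging system with several severity levels. In model training it is used to record the training process, monitor model performance, and debug code, which makes experiments easier to reproduce and problems easier to analyze.
It is mainly used to record and output the following:
1. Hyperparameters
2. Training loss
3. Validation and test results
4. Model checkpoint information
5. Time and resource usage
6. Exceptions and warnings
When training a model, the training process is usually recorded with the logging module, or with wandb or tensorboard.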

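II. Basic logging tutorial

1. Logging to a file

Create a logger with logger = logging.getLogger(__name__), then call its logging methods. The official documentation linked above has tables on when to use logging and which logger method to choose.
The logger methods are named after the level of the events they track. From lowest to highest severity they are debug(), info(), warning(), error(), and critical().
In general, a logger only tracks events at or above its configured severity level; the default level is WARNING.
Output can go to one of two places: the console, or a specified file.

A concrete example: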

import logging
logger = logging.getLogger(__name__)
logging.basicConfig(filename='example.log', encoding='utf-8', level=logging.DEBUG)  # the encoding argument requires Python 3.9+
logger.debug('This message should go to the log file')
logger.info('so should this')
logger.warning('and this, too')
logger.error('And non-ASCII stuff')

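Opening example.log shows the logged messages, one per line, in the default LEVEL:logger_name:message format.
Because the level is set to DEBUG, all four messages are written to the file.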
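To make the level threshold concrete, here is a minimal sketch. The logger name "level_demo" and the in-memory buffer are assumptions made for this illustration only; the buffer stands in for the console or a log file so the result can be inspected.

```python
import io
import logging

# Hypothetical logger name, used only for this sketch.
demo_logger = logging.getLogger("level_demo")
demo_logger.propagate = False  # keep the demo isolated from the root logger

buffer = io.StringIO()  # stands in for the console or a log file
handler = logging.StreamHandler(buffer)
handler.setFormatter(logging.Formatter("%(levelname)s:%(message)s"))
demo_logger.addHandler(handler)

# With the threshold at WARNING, INFO messages are discarded.
demo_logger.setLevel(logging.WARNING)
demo_logger.info("dropped: INFO is below WARNING")
demo_logger.warning("kept: WARNING meets the threshold")

# Lowering the threshold to DEBUG lets everything through.
demo_logger.setLevel(logging.DEBUG)
demo_logger.debug("kept: the threshold is now DEBUG")

captured = buffer.getvalue().splitlines()
print(captured)
```

Only the second and third messages reach the buffer, which matches the rule that a logger tracks events at or above its configured level.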

2. Logging variable data

Use a format string as the message and pass the variables as additional arguments.

import logging
a = 'nini'
b = 'mimi'
logging.warning('%s before the %s', a, b)

Output:
WARNING:root:nini before the mimi
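One practical reason to pass variables as arguments rather than pre-formatting the string is that logging only builds the final message if the record will actually be emitted. The sketch below illustrates this under stated assumptions: the Expensive class and the logger name "lazy_demo" are hypothetical and exist only to count string conversions.

```python
import logging


class Expensive:
    """Hypothetical object that counts how often it is converted to a string."""

    def __init__(self):
        self.str_calls = 0

    def __str__(self):
        self.str_calls += 1
        return "expensive"


lazy_logger = logging.getLogger("lazy_demo")
lazy_logger.propagate = False
lazy_logger.addHandler(logging.NullHandler())
lazy_logger.setLevel(logging.WARNING)  # DEBUG messages will be discarded

obj = Expensive()

# %-style arguments: the message is below the threshold, so it is never formatted.
lazy_logger.debug("value: %s", obj)
calls_after_percent = obj.str_calls

# f-string: Python formats the string before logging is even called.
lazy_logger.debug(f"value: {obj}")
calls_after_fstring = obj.str_calls

print(calls_after_percent, calls_after_fstring)
```

Both styles produce the same text when the message is emitted; the difference only matters for messages that are filtered out and expensive to format.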

3. Changing the message format

import logging
logger = logging.getLogger(__name__)
logging.basicConfig(format='%(levelname)s:%(message)s', level=logging.DEBUG)
logger.debug('This message should go to the log file')
logger.info('so should this')
logger.warning('and this, too')

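Output:
DEBUG:This message should go to the log file
INFO:so should this
WARNING:and this, too

4. Showing the date and time in messages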

import logging
logger = logging.getLogger(__name__)
logging.basicConfig(format='%(asctime)s %(message)s')
logging.warning('this is a haha')

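The output line starts with a timestamp in the default form YYYY-MM-DD HH:MM:SS,mmm, followed by the message.
As shown, messages logged with logger.info or logger.warning go straight to the console (stderr by default when basicConfig is given no filename), so there is no need to print them separately.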
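The timestamp layout can be changed with the datefmt argument, which accepts the same directives as time.strftime. The following is a small sketch; the logger name "time_demo" and the in-memory stream are assumptions for illustration, used so the formatted line can be checked without touching the root logger.

```python
import io
import logging

stream = io.StringIO()  # in-memory stand-in for the console
handler = logging.StreamHandler(stream)
handler.setFormatter(
    logging.Formatter(
        fmt="%(asctime)s - %(levelname)s - %(message)s",
        datefmt="%Y-%m-%d %H:%M:%S",  # drops the default millisecond suffix
    )
)

time_logger = logging.getLogger("time_demo")  # hypothetical name
time_logger.propagate = False
time_logger.setLevel(logging.WARNING)
time_logger.addHandler(handler)

time_logger.warning("this is a haha")
line = stream.getvalue().strip()
print(line)
```

The same fmt and datefmt values can be passed to logging.basicConfig as format= and datefmt= when configuring the root logger instead.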

III. Usage in model training

1. logger.info(): logging general messages

import logging
import argparse
parser = argparse.ArgumentParser()
parser.add_argument('--batch_size', type=int, default=200, help='batch size')
parser.add_argument('--epochs', type=int, default=10, help='number of training epochs')
args = parser.parse_args()
# Configure the logger
logger = logging.getLogger(__name__)
logging.basicConfig(format='%(asctime)s - %(levelname)s - %(message)s', level=logging.DEBUG)
# Log the hyperparameters. The default threshold is WARNING, so the level is
# lowered to DEBUG above to make this INFO message appear.
logger.info(f"Train with batch_size:{args.batch_size}, epochs:{args.epochs}")

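With the default arguments, the output is one line containing the timestamp, the level INFO, and the message "Train with batch_size:200, epochs:10".
Purpose: record general information so the training process is easy to follow.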
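When a script has many arguments, listing each one by hand in the message gets tedious. One possible approach, sketched below, is to iterate over vars(args). The helper format_hparams and the logger name "hparams_demo" are hypothetical names introduced for this example, and parse_args([]) is used so the sketch runs with the defaults regardless of the command line.

```python
import argparse
import logging

parser = argparse.ArgumentParser()
parser.add_argument("--batch_size", type=int, default=200, help="batch size")
parser.add_argument("--epochs", type=int, default=10, help="number of training epochs")
args = parser.parse_args([])  # empty list: use the defaults instead of sys.argv


def format_hparams(namespace):
    """Render every parsed argument as 'name=value', sorted by name."""
    return ", ".join(f"{key}={value}" for key, value in sorted(vars(namespace).items()))


hp_logger = logging.getLogger("hparams_demo")  # hypothetical name
hp_logger.propagate = False
hp_logger.setLevel(logging.INFO)
hp_logger.addHandler(logging.StreamHandler())  # writes to stderr

summary = format_hparams(args)
hp_logger.info("Hyperparameters: %s", summary)
```

Sorting the names keeps the line stable between runs, which makes log files easier to compare.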

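2. logger.warning(): logging warnings

if loss.item() > 10:
    logger.warning(f"Loss is too high: {loss.item():.3f}")

Example output:
2025-02-16 12:35:20 - WARNING - Loss is too high: 12.340
Purpose: flag abnormal situations, such as an unusually high loss during training, to make debugging easier.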

3. logger.error(): logging errors

try:
    output = model(input)
except Exception as e:
    logger.error(f"Model forward pass failed: {e}")

Purpose: record errors that occur while the code is running.
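logger.error with the exception message records what went wrong but not where. Inside an except block, logger.exception logs at ERROR level and appends the full traceback. The sketch below uses a hypothetical failing_forward function as a stand-in for a model call, and an in-memory stream so the result can be inspected.

```python
import io
import logging

err_stream = io.StringIO()  # in-memory stand-in for a log file
err_logger = logging.getLogger("error_demo")  # hypothetical name
err_logger.propagate = False
err_logger.setLevel(logging.ERROR)
err_logger.addHandler(logging.StreamHandler(err_stream))


def failing_forward(x):
    """Hypothetical stand-in for model(input) that always fails."""
    raise ValueError("shape mismatch")


try:
    failing_forward(1)
except Exception:
    # Logs at ERROR level and appends the current traceback.
    err_logger.exception("Model forward pass failed")

error_text = err_stream.getvalue()
print(error_text)
```

Passing exc_info=True to logger.error has the same effect if you prefer to keep the method name explicit.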

4. Logging the training loss

for epoch in range(args.epochs):
    train_loss = train_one_epoch(model, train_loader, optimizer)
    logger.info(f"Epoch {epoch}: Train Loss = {train_loss:.4f}")

Purpose: record the loss for each epoch to check whether training is converging.

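5. Logging the validation loss

val_loss = validate(model, val_loader)
logger.info(f"Validation Loss: {val_loss:.4f}")

Purpose: record how the model performs on the validation set, which helps when selecting the best model.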

6. Logging GPU memory usage

import torch
gpu_memory = torch.cuda.memory_allocated() / (1024 ** 3)  # convert bytes to GB
logger.info(f"GPU Memory Used:{gpu_memory:.2f}GB")

Purpose: monitor GPU memory usage to help prevent CUDA OOM (out of memory) errors.
[memory_allocated()]https://pytorch.org/docs/2.6/generated/torch.cuda.memory_allocated.html#torch.cuda.memory_allocated

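7. Logging model checkpoints

if val_loss < best_val_loss:
    best_val_loss = val_loss
    torch.save(model.state_dict(), "best_model.pt")
    logger.info("Best model saved at best_model.pt")

Purpose: record when the best model was saved, so it is clear that the saved checkpoint is the best one so far.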

8. Logging training time

import time
start_time = time.time()
train_one_epoch(model, train_loader, optimizer)
end_time = time.time()

epoch_time = end_time - start_time
logger.info(f"Epoch completed in {epoch_time:.2f} seconds")

Purpose: record training time to assess compute cost.
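If several stages need timing (training, validation, checkpointing), the start/stop bookkeeping can be wrapped in a context manager. This is one possible sketch, not the only way to do it: log_duration and the logger name "timer_demo" are hypothetical, and time.sleep stands in for the real work.

```python
import contextlib
import io
import logging
import time


@contextlib.contextmanager
def log_duration(logger, label):
    """Log how long the enclosed block took, even if it raises."""
    start = time.perf_counter()  # monotonic clock, suited to measuring intervals
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        logger.info("%s completed in %.2f seconds", label, elapsed)


timer_stream = io.StringIO()  # in-memory stand-in for the console
timer_logger = logging.getLogger("timer_demo")  # hypothetical name
timer_logger.propagate = False
timer_logger.setLevel(logging.INFO)
timer_logger.addHandler(logging.StreamHandler(timer_stream))

with log_duration(timer_logger, "Epoch 0"):
    time.sleep(0.01)  # placeholder for train_one_epoch(...)

timer_output = timer_stream.getvalue().strip()
print(timer_output)
```

The try/finally means the duration is still logged when the wrapped code fails, which is useful when diagnosing runs that crash partway through an epoch.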

9. Logging gradient information

for name, param in model.named_parameters():
    if param.grad is not None:
        logger.info(f"{name} gradient norm: {param.grad.norm().item():.4f}")

Purpose: check whether gradients look normal and detect vanishing or exploding gradients.
[model.named_parameters()]https://blog.youkuaiyun.com/u013250861/article/details/124567826

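10. Saving logs to a file

logging.basicConfig(
    filename="training.log",
    level=logging.INFO,
    format="%(asctime)s - %(levelname)s - %(message)s"
)

Purpose: log messages are written to training.log automatically, which helps with long-term monitoring and reproducing experiments.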
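Two points are worth knowing here. First, logging.basicConfig does nothing if the root logger already has handlers, unless force=True is passed (Python 3.8+), so calling it a second time with a filename will not redirect output. Second, with filename set, messages go only to the file. A common alternative is to attach both a file handler and a console handler to a named logger. The sketch below assumes a hypothetical build_logger helper and writes to a temporary directory so it can run anywhere.

```python
import logging
import os
import sys
import tempfile


def build_logger(name, log_path):
    """Return a logger that writes each message to both a file and stdout."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # avoid duplicate lines via the root logger

    formatter = logging.Formatter("%(asctime)s - %(levelname)s - %(message)s")
    file_handler = logging.FileHandler(log_path, encoding="utf-8")
    console_handler = logging.StreamHandler(sys.stdout)
    for handler in (file_handler, console_handler):
        handler.setFormatter(formatter)
        logger.addHandler(handler)
    return logger


# A temporary directory is used here only so the sketch is self-contained.
log_dir = tempfile.mkdtemp()
log_path = os.path.join(log_dir, "training.log")

train_logger = build_logger("train_demo", log_path)
train_logger.info("Epoch 0: Train Loss = 0.1234")

with open(log_path, encoding="utf-8") as f:
    file_text = f.read()
```

In a real training script, log_path would typically point into the experiment's output directory so the log sits next to the checkpoints.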

IV. Usage in a ViT project

import logging
logger = logging.getLogger(__name__)
logging.basicConfig(format='%(asctime)s - %(levelname)s - %(name)s - %(message)s',
                    datefmt='%m/%d/%Y %H:%M:%S',
                    level=logging.INFO if args.local_rank in [-1, 0] else logging.WARNING)
logger.warning("Process rank: %s, device: %s, n_gpu: %s, distributed training: %s, 16-bits training: %s",
               args.local_rank, args.device, args.n_gpu, bool(args.local_rank != -1), args.fp16)
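The level expression in the snippet above keeps INFO output on a single process and quiets the rest, so a multi-GPU run does not print every message once per process. The sketch below isolates that rule in a hypothetical level_for_rank helper. It follows the convention visible in the snippet, where a local_rank of -1 means non-distributed training and 0 is the main process.

```python
import logging


def level_for_rank(local_rank):
    """Pick a log level: INFO for the main or only process, WARNING otherwise."""
    # -1: not distributed; 0: main process of a distributed run.
    return logging.INFO if local_rank in (-1, 0) else logging.WARNING


for rank in (-1, 0, 1, 3):
    print(rank, logging.getLevelName(level_for_rank(rank)))
```

Note that the logger.warning call in the original snippet is emitted on every rank, since WARNING passes both thresholds; that is how each process still reports its own rank and device.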