# Transformers Internals for Large Models (Part 2): the `TrainingArguments` class, /src/transformers/training_args.py

The class is defined as a plain `@dataclass`; here is the (truncated) beginning of its definition and docstring:

```python
# TODO: `TrainingArguments` users rely on it being fully mutable. In the future see if we can narrow this to a few keys: https://github.com/huggingface/transformers/pull/25903
@dataclass
class TrainingArguments:
    """
    TrainingArguments is the subset of the arguments we use in our example scripts **which relate to the training loop
    itself**.

    Using [`HfArgumentParser`] we can turn this class into
    [argparse](https://docs.python.org/3/library/argparse#module-argparse) arguments that can be specified on the
    command line.

    Parameters:
        output_dir (`str`):
            The output directory where the model predictions and checkpoints will be written.
        overwrite_output_dir (`bool`, *optional*, defaults to `False`):
            If `True`, overwrite the content of the output directory. Use this to continue training if `output_dir`
            points to a checkpoint directory.
        ...
    """
```
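Since the docstring points at [`HfArgumentParser`], here is a minimal sketch of how the dataclass becomes command-line flags (the script name and flag values are illustrative):

```python
# Minimal sketch: expose TrainingArguments as argparse-style command-line flags.
from transformers import HfArgumentParser, TrainingArguments

parser = HfArgumentParser(TrainingArguments)
# Invoked as, e.g.: python train.py --output_dir ./results --overwrite_output_dir
(training_args,) = parser.parse_args_into_dataclasses()
print(training_args.output_dir)  # -> ./results
```

Each dataclass field becomes one `--flag`, with its type and default taken from the field definition and its help text from the field's metadata.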
### Problem

A multi-GPU `accelerate launch` run of seed-vc's `train_v2.py` fails at startup with the following output:

```text
/home/ustc/anaconda3/lib/python3.12/site-packages/transformers/utils/hub.py:105: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.
  warnings.warn(
The following values were not passed to `accelerate launch` and had defaults used instead:
        More than one GPU was found, enabling multi-GPU training.
        If this was unintended please pass in `--num_processes=1`.
        `--num_machines` was set to a value of `1`
        `--mixed_precision` was set to a value of `'no'`
        `--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
/home/ustc/anaconda3/lib/python3.12/site-packages/torch/nn/utils/weight_norm.py:134: FutureWarning: `torch.nn.utils.weight_norm` is deprecated in favor of `torch.nn.utils.parametrizations.weight_norm`.
  WeightNorm.apply(module, name, dim)
/home/ustc/anaconda3/lib/python3.12/site-packages/transformers/utils/hub.py:105: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.
  warnings.warn(
[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/ustc/anaconda3/lib/python3.12/site-packages/hydra/_internal/instantiate/_instantiate2.py", line 92, in _call_target
[rank0]:     return _target_(*args, **kwargs)
[rank0]:   File "/home/ustc/桌面/seed-vc-main-v2/modules/astral_quantization/default_model.py", line 22, in __init__
[rank0]:     self.tokenizer = WhisperProcessor.from_pretrained(
[rank0]:   File "/home/ustc/anaconda3/lib/python3.12/site-packages/transformers/processing_utils.py", line 1079, in from_pretrained
[rank0]:     args = cls._get_arguments_from_pretrained(pretrained_model_name_or_path, **kwargs)
[rank0]:   File "/home/ustc/anaconda3/lib/python3.12/site-packages/transformers/processing_utils.py", line 1143, in _get_arguments_from_pretrained
[rank0]:     args.append(attribute_class.from_pretrained(pretrained_model_name_or_path, **kwargs))
[rank0]:   File "/home/ustc/anaconda3/lib/python3.12/site-packages/transformers/feature_extraction_utils.py", line 384, in from_pretrained
[rank0]:     feature_extractor_dict, kwargs = cls.get_feature_extractor_dict(pretrained_model_name_or_path, **kwargs)
[rank0]:   File "/home/ustc/anaconda3/lib/python3.12/site-packages/transformers/feature_extraction_utils.py", line 510, in get_feature_extractor_dict
[rank0]:     resolved_feature_extractor_file = cached_file(
[rank0]:   File "/home/ustc/anaconda3/lib/python3.12/site-packages/transformers/utils/hub.py", line 266, in cached_file
[rank0]:     file = cached_files(path_or_repo_id=path_or_repo_id, filenames=[filename], **kwargs)
[rank0]:   File "/home/ustc/anaconda3/lib/python3.12/site-packages/transformers/utils/hub.py", line 381, in cached_files
[rank0]:     raise OSError(
[rank0]: OSError: ./checkpoints/hf_cache does not appear to have a file named preprocessor_config.json. Checkout 'https://huggingface.co/./checkpoints/hf_cache/tree/main' for available files.

[rank0]: The above exception was the direct cause of the following exception:

[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/ustc/桌面/seed-vc-main-v2/train_v2.py", line 346, in <module>
[rank0]:     main(args)
[rank0]:   File "/home/ustc/桌面/seed-vc-main-v2/train_v2.py", line 315, in main
[rank0]:     trainer = Trainer(
[rank0]:   File "/home/ustc/桌面/seed-vc-main-v2/train_v2.py", line 76, in __init__
[rank0]:     self._init_models(train_cfm=train_cfm, train_ar=train_ar)
[rank0]:   File "/home/ustc/桌面/seed-vc-main-v2/train_v2.py", line 107, in _init_models
[rank0]:     self._init_main_model(train_cfm=train_cfm, train_ar=train_ar)
[rank0]:   File "/home/ustc/桌面/seed-vc-main-v2/train_v2.py", line 117, in _init_main_model
[rank0]:     self.model = hydra.utils.instantiate(cfg).to(self.device)
[rank0]:   File "/home/ustc/anaconda3/lib/python3.12/site-packages/hydra/_internal/instantiate/_instantiate2.py", line 226, in instantiate
[rank0]:     return instantiate_node(
[rank0]:   File "/home/ustc/anaconda3/lib/python3.12/site-packages/hydra/_internal/instantiate/_instantiate2.py", line 342, in instantiate_node
[rank0]:     value = instantiate_node(
[rank0]:   File "/home/ustc/anaconda3/lib/python3.12/site-packages/hydra/_internal/instantiate/_instantiate2.py", line 347, in instantiate_node
[rank0]:     return _call_target(_target_, partial, args, kwargs, full_key)
[rank0]:   File "/home/ustc/anaconda3/lib/python3.12/site-packages/hydra/_internal/instantiate/_instantiate2.py", line 97, in _call_target
[rank0]:     raise InstantiationException(msg) from e
[rank0]: hydra.errors.InstantiationException: Error in call to target 'modules.astral_quantization.default_model.AstralQuantizer':
[rank0]: OSError("./checkpoints/hf_cache does not appear to have a file named preprocessor_config.json. Checkout 'https://huggingface.co/./checkpoints/hf_cache/tree/main' for available files.")
[rank0]: full_key: content_extractor_narrow
[rank0]:[W514 21:30:32.208971596 ProcessGroupNCCL.cpp:1168] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator())
W0514 21:30:32.529000 130300934612480 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 3146516 closing signal SIGTERM
W0514 21:30:32.529000 130300934612480 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 3146517 closing signal SIGTERM
W0514 21:30:32.530000 130300934612480 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 3146518 closing signal SIGTERM
E0514 21:30:32.758000 130300934612480 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 0 (pid: 3146515) of binary: /home/ustc/anaconda3/bin/python
Traceback (most recent call last):
  File "/home/ustc/anaconda3/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/ustc/anaconda3/lib/python3.12/site-packages/accelerate/commands/accelerate_cli.py", line 50, in main
    args.func(args)
  File "/home/ustc/anaconda3/lib/python3.12/site-packages/accelerate/commands/launch.py", line 1204, in launch_command
    multi_gpu_launcher(args)
  File "/home/ustc/anaconda3/lib/python3.12/site-packages/accelerate/commands/launch.py", line 825, in multi_gpu_launcher
    distrib_run.run(args)
  File "/home/ustc/anaconda3/lib/python3.12/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/home/ustc/anaconda3/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/ustc/anaconda3/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
train_v2.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-05-14_21:30:32
  host      : ustc-SYS-740GP-TNRT
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 3146515)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
```

Analyze this problem.
### Solution

#### The deprecated `TRANSFORMERS_CACHE`

When `TRANSFORMERS_CACHE` is flagged as deprecated, switch to the `HF_HOME` environment variable to define the cache path. Setting it redirects the cache directory used by the Hugging Face libraries[^1]:

```bash
export HF_HOME=/path/to/new/cache/directory
```

The benefit is that `HF_HOME` not only covers what `TRANSFORMERS_CACHE` used to provide, but also centralizes the storage location of the other Hugging Face dependencies, reducing redundancy[^2].

#### Multi-GPU training configuration

If a multi-GPU run warns that required options were not passed and defaults were used, make sure every relevant option is defined explicitly (or run `accelerate config` once). For example, when running a script in a distributed environment, explicitly declare the learning-rate schedule, the gradient accumulation steps, and the other key hyperparameters that affect performance. Here is a simple example of doing this:

```python
import os

from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,
    fp16=True,  # enable mixed-precision compute if the hardware supports it
    dataloader_num_workers=4,
    local_rank=int(os.environ.get("LOCAL_RANK", -1)),  # rank of this process within its node
)
trainer = Trainer(model=model, args=training_args)  # plus datasets, collator, etc.
```

Pay particular attention to `local_rank`: it identifies each process's relative ID within its node, which is essential for initializing the process group correctly[^2]. (When launching with `accelerate` or `torchrun`, `LOCAL_RANK` is set in the environment automatically.)

On checkpoint recovery: the `--resume_from_checkpoint` command-line flag resumes training from where it was interrupted, and a separate switch can restrict loading to the model weights only, leaving out the optimizer state and other auxiliary structures; see the sketch after this section.

#### Handling the missing `preprocessor_config.json`

When a specific resource such as `./checkpoints/hf_cache/preprocessor_config.json` cannot be found, the cause is usually an interrupted download or a version mismatch that left the local copy incomplete. Remove the damaged copy first, then pull a fresh one[^2]. The two commands below do the cleanup and the re-download (the second assumes the `huggingface_hub` CLI is available; replace `<your-model-id>` with the repo that provides the processor):

```bash
rm -rf ./checkpoints/hf_cache/*
pip install --upgrade huggingface_hub && huggingface-cli download <your-model-id> --local-dir ./checkpoints/hf_cache
```

These fixes take both compatibility and stability into account and should cover most common scenarios.
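For the checkpoint-resumption point above, a minimal sketch that continues the `Trainer` example (the `trainer` variable comes from that snippet, and the `./results/checkpoint-500` path is illustrative):

```python
# Resume from the most recent `checkpoint-*` directory inside `output_dir`.
trainer.train(resume_from_checkpoint=True)

# Or name a specific checkpoint directory explicitly (illustrative path):
# trainer.train(resume_from_checkpoint="./results/checkpoint-500")
```

Passing `True` makes the `Trainer` look for the latest checkpoint under `output_dir`; passing a string restores model weights, optimizer, and scheduler state from that exact directory.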