- Reference: https://github.com/liucongg/ChatGLM-Finetuning
- Server configuration: 8× A100 80GB
- Download the chatglm2-6b model from Hugging Face; a convenient mirror:
https://hf-mirror.com/THUDM/chatglm2-6b
- Install the dependencies first, then run the following command:
CUDA_VISIBLE_DEVICES=4,5,6,7 deepspeed --master_port 21400 train.py --train_path data/spo_0.json --model_name_or_path chatglm2-6b/ --per_device_train_batch_size 1 --max_len 1560 --max_src_len 1024 --learning_rate 1e-4 --weight_decay 0.1 --num_train_epochs 2 --gradient_accumulation_steps 4 --warmup_ratio 0.1 --mode glm2 --lora_dim 16 --lora_alpha 64 --lora_dropout 0.1 --lora_module_name "query_key_value,dense_h_to_4h,dense_4h_to_h,dense" --seed 1234 --ds_file ds_zero2_no_offload.json --gradient_checkpointing --show_loss_step 10 --output_dir ./output-glm2
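For reference, the `--lora_*` flags in this command map roughly onto a `peft` `LoraConfig` like the sketch below. This is an illustration only; how `train.py` actually constructs the adapter is defined in the repo.

```python
from peft import LoraConfig, TaskType

# Mirrors --lora_dim 16 --lora_alpha 64 --lora_dropout 0.1
# and --lora_module_name "query_key_value,dense_h_to_4h,dense_4h_to_h,dense"
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,               # --lora_dim: rank of the low-rank update
    lora_alpha=64,      # --lora_alpha: scaling factor (effective scale = alpha / r)
    lora_dropout=0.1,   # --lora_dropout
    target_modules=["query_key_value", "dense_h_to_4h", "dense_4h_to_h", "dense"],
)
```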
- Problems and fixes
- Problem 1: port conflict, caused by binding a server socket to a privileged port (<1024) without administrator rights.
- Fix 1: changed the originally specified port from 520 to 21400.
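Before launching, a candidate `--master_port` can be sanity-checked from Python with nothing but the standard library (a small throwaway helper, not part of the repo):

```python
import socket

def port_is_free(port, host="127.0.0.1"):
    """Return True if a TCP socket can be bound to (host, port).
    Binding raises OSError both when the port is already in use and
    when it is privileged (<1024) and the process lacks root."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        try:
            sock.bind((host, port))
            return True
        except OSError:
            return False

print(port_is_free(21400))
```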
- Problem 2: ImportError: cannot import name 'log' from 'torch.distributed.elastic.agent.server.api'
- Fix 2: run
pip install deepspeed --upgrade
- Problem 3: AttributeError: 'ChatGLMTokenizer' object has no attribute 'tokenizer'
- Fix 3: downgrading transformers gets training running again:
pip uninstall transformers
pip install -i https://pypi.tuna.tsinghua.edu.cn/simple transformers==4.33.2
Alternatively, fetch an updated tokenization_chatglm.py from the Hugging Face repo (both approaches are worth trying).
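To confirm which transformers version is actually active in the environment, a quick check with the standard library (a throwaway helper, not part of the repo; 4.33.2 is simply the version that worked here):

```python
from importlib.metadata import PackageNotFoundError, version

def check_transformers(expected="4.33.2"):
    """Report the installed transformers version against a known-good one."""
    try:
        installed = version("transformers")
    except PackageNotFoundError:
        return "transformers is not installed"
    if installed == expected:
        return f"transformers {installed} (matches the known-good version)"
    return f"transformers {installed} (known-good here was {expected})"

print(check_transformers())
```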
- Log of the command-line run
[2024-11-08 17:06:02,050] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
/data/user23262833/.conda/envs/chatglm/lib/python3.8/site-packages/transformers/utils/generic.py:311: FutureWarning: `torch.utils._pytree._register_pytree_node` is deprecated. Please use `torch.utils._pytree.register_pytree_node` instead.
torch.utils._pytree._register_pytree_node(
2024-11-08 17:06:03.949823: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F AVX512_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-11-08 17:06:04.070586: I tensorflow/core/util/port.cc:104] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-11-08 17:06:04.621454: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda-11.8/lib64:/usr/local/cuda-11.8/lib64:
2024-11-08 17:06:04.621553: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda-11.8/lib64:/usr/local/cuda-11.8/lib64:
2024-11-08 17:06:04.621563: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
/data/user23262833/.conda/envs/chatglm/lib/python3.8/site-packages/transformers/utils/generic.py:311: FutureWarning: `torch.utils._pytree._register_pytree_node` is deprecated. Please use `torch.utils._pytree.register_pytree_node` instead.
torch.utils._pytree._register_pytree_node(
[2024-11-08 17:06:05,017] [WARNING] [runner.py:215:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
Detected VISIBLE_DEVICES=4,5,6,7: setting --include=localhost:4,5,6,7
[2024-11-08 17:06:05,017] [INFO] [runner.py:607:main] cmd = /data/user23262833/.conda/envs/chatglm/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbNCwgNSwgNiwgN119 --master_addr=127.0.0.1 --master_port=21400 --enable_each_rank_log=None train.py --train_path data/spo_0.json --model_name_or_path chatglm2-6b/ --per_device_train_batch_size 1 --max_len 1560 --max_src_len 1024 --learning_rate 1e-4 --weight_decay 0.1 --num_train_epochs 2 --gradient_accumulation_steps 4 --warmup_ratio 0.1 --mode glm2 --lora_dim 16 --lora_alpha 64 --lora_dropout 0.1 --lora_module_name query_key_value,dense_h_to_4h,dense_4h_to_h,dense --seed 1234 --ds_file ds_zero2_no_offload.json --gradient_checkpointing --show_loss_step 10 --output_dir ./output-glm2
[2024-11-08 17:06:06,382] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
/data/user23262833/.conda/envs/chatglm/lib/python3.8/site-packages/transformers/utils/generic.py:311: FutureWarning: `torch.utils._pytree._register_pytree_node` is deprecated. Please use `torch.utils._pytree.register_pytree_node` instead.
torch.utils._pytree._register_pytree_node(
2024-11-08 17:06:08.125986: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F AVX512_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-11-08 17:06:08.243387: I tensorflow/core/util/port.cc:104] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-11-08 17:06:08.784327: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda-11.8/lib64:/usr/local/cuda-11.8/lib64:
2024-11-08 17:06:08.784423: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda-11.8/lib64:/usr/local/cuda-11.8/lib64:
2024-11-08 17:06:08.784431: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
/data/user23262833/.conda/envs/chatglm/lib/python3.8/site-packages/transformers/utils/generic.py:311: FutureWarning: `torch.utils._pytree._register_pytree_node` is deprecated. Please use `torch.utils._pytree.register_pytree_node` instead.
torch.utils._pytree._register_pytree_node(
[2024-11-08 17:06:09,168] [INFO] [launch.py:146:main] WORLD INFO DICT: {'localhost': [4, 5, 6, 7]}
[2024-11-08 17:06:09,168] [INFO] [launch.py:152:main] nnodes=1, num_local_procs=4, node_rank=0
[2024-11-08 17:06:09,168] [INFO] [launch.py:163:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3]})
[2024-11-08 17:06:09,168] [INFO] [launch.py:164:main] dist_world_size=4
[2024-11-08 17:06:09,168] [INFO] [launch.py:168:main] Setting CUDA_VISIBLE_DEVICES=4,5,6,7
[2024-11-08 17:06:09,187] [INFO] [launch.py:256:main] process 3443023 spawned with command: ['/data/user23262833/.conda/envs/chatglm/bin/python', '-u', 'train.py', '--local_rank=0', '--train_path', 'data/spo_0.json', '--model_name_or_path', 'chatglm2-6b/', '--per_device_train_batch_size', '1', '--max_len', '1560', '--max_src_len', '1024', '--learning_rate', '1e-4', '--weight_decay', '0.1', '--num_train_epochs', '2', '--gradient_accumulation_steps', '4', '--warmup_ratio', '0.1', '--mode', 'glm2', '--lora_dim', '16', '--lora_alpha', '64', '--lora_dropout', '0.1', '--lora_module_name', 'query_key_value,dense_h_to_4h,dense_4h_to_h,dense', '--seed', '1234', '--ds_file', 'ds_zero2_no_offload.json', '--gradient_checkpointing', '--show_loss_step', '10', '--output_dir', './output-glm2']
[2024-11-08 17:06:09,206] [INFO] [launch.py:256:main] process 3443024 spawned with command: ['/data/user23262833/.conda/envs/chatglm/bin/python', '-u', 'train.py', '--local_rank=1', '--train_path', 'data/spo_0.json', '--model_name_or_path', 'chatglm2-6b/', '--per_device_train_batch_size', '1', '--max_len', '1560', '--max_src_len', '1024', '--learning_rate', '1e-4', '--weight_decay', '0.1', '--num_train_epochs', '2', '--gradient_accumulation_steps', '4', '--warmup_ratio', '0.1', '--mode', 'glm2', '--lora_dim', '16', '--lora_alpha', '64', '--lora_dropout', '0.1', '--lora_module_name', 'query_key_value,dense_h_to_4h,dense_4h_to_h,dense', '--seed', '1234', '--ds_file', 'ds_zero2_no_offload.json', '--gradient_checkpointing', '--show_loss_step', '10', '--output_dir', './output-glm2']
[2024-11-08 17:06:09,227] [INFO] [launch.py:256:main] process 3443025 spawned with command: ['/data/user23262833/.conda/envs/chatglm/bin/python', '-u', 'train.py', '--local_rank=2', '--train_path', 'data/spo_0.json', '--model_name_or_path', 'chatglm2-6b/', '--per_device_train_batch_size', '1', '--max_len', '1560', '--max_src_len', '1024', '--learning_rate', '1e-4', '--weight_decay', '0.1', '--num_train_epochs', '2', '--gradient_accumulation_steps', '4', '--warmup_ratio', '0.1', '--mode', 'glm2', '--lora_dim', '16', '--lora_alpha', '64', '--lora_dropout', '0.1', '--lora_module_name', 'query_key_value,dense_h_to_4h,dense_4h_to_h,dense', '--seed', '1234', '--ds_file', 'ds_zero2_no_offload.json', '--gradient_checkpointing', '--show_loss_step', '10', '--output_dir', './output-glm2']
[2024-11-08 17:06:09,238] [INFO] [launch.py:256:main] process 3443026 spawned with command: ['/data/user23262833/.conda/envs/chatglm/bin/python', '-u', 'train.py', '--local_rank=3', '--train_path', 'data/spo_0.json', '--model_name_or_path', 'chatglm2-6b/', '--per_device_train_batch_size', '1', '--max_len', '1560', '--max_src_len', '1024', '--learning_rate', '1e-4', '--weight_decay', '0.1', '--num_train_epochs', '2', '--gradient_accumulation_steps', '4', '--warmup_ratio', '0.1', '--mode', 'glm2', '--lora_dim', '16', '--lora_alpha', '64', '--lora_dropout', '0.1', '--lora_module_name', 'query_key_value,dense_h_to_4h,dense_4h_to_h,dense', '--seed', '1234', '--ds_file', 'ds_zero2_no_offload.json', '--gradient_checkpointing', '--show_loss_step', '10', '--output_dir', './output-glm2']
[2024-11-08 17:06:10,688] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-11-08 17:06:10,852] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-11-08 17:06:10,894] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-11-08 17:06:10,926] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)
/data/user23262833/.conda/envs/chatglm/lib/python3.8/site-packages/transformers/utils/generic.py:311: FutureWarning: `torch.utils._pytree._register_pytree_node` is deprecated. Please use `torch.utils._pytree.register_pytree_node` instead.
torch.utils._pytree._register_pytree_node(
2024-11-08 17:06:12.439215: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F AVX512_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
/data/user23262833/.conda/envs/chatglm/lib/python3.8/site-packages/transformers/utils/generic.py:311: FutureWarning: `torch.utils._pytree._register_pytree_node` is deprecated. Please use `torch.utils._pytree.register_pytree_node` instead.
torch.utils._pytree._register_pytree_node(
2024-11-08 17:06:12.522203: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F AVX512_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
/data/user23262833/.conda/envs/chatglm/lib/python3.8/site-packages/transformers/utils/generic.py:311: FutureWarning: `torch.utils._pytree._register_pytree_node` is deprecated. Please use `torch.utils._pytree.register_pytree_node` instead.
torch.utils._pytree._register_pytree_node(
2024-11-08 17:06:12.572146: I tensorflow/core/util/port.cc:104] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
/data/user23262833/.conda/envs/chatglm/lib/python3.8/site-packages/transformers/utils/generic.py:311: FutureWarning: `torch.utils._pytree._register_pytree_node` is deprecated. Please use `torch.utils._pytree.register_pytree_node` instead.
torch.utils._pytree._register_pytree_node(
2024-11-08 17:06:12.601861: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F AVX512_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-11-08 17:06:12.629245: I tensorflow/core/util/port.cc:104] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-11-08 17:06:12.641090: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F AVX512_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-11-08 17:06:12.716181: I tensorflow/core/util/port.cc:104] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-11-08 17:06:12.747069: I tensorflow/core/util/port.cc:104] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-11-08 17:06:13.120827: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda-11.8/lib64:/usr/local/cuda-11.8/lib64:
2024-11-08 17:06:13.120924: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda-11.8/lib64:/usr/local/cuda-11.8/lib64:
2024-11-08 17:06:13.120932: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
2024-11-08 17:06:13.185514: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda-11.8/lib64:/usr/local/cuda-11.8/lib64:
2024-11-08 17:06:13.185607: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda-11.8/lib64:/usr/local/cuda-11.8/lib64:
2024-11-08 17:06:13.185615: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
2024-11-08 17:06:13.292935: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda-11.8/lib64:/usr/local/cuda-11.8/lib64:
2024-11-08 17:06:13.293026: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda-11.8/lib64:/usr/local/cuda-11.8/lib64:
2024-11-08 17:06:13.293034: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
2024-11-08 17:06:13.343458: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda-11.8/lib64:/usr/local/cuda-11.8/lib64:
2024-11-08 17:06:13.343553: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda-11.8/lib64:/usr/local/cuda-11.8/lib64:
2024-11-08 17:06:13.343562: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
/data/user23262833/.conda/envs/chatglm/lib/python3.8/site-packages/transformers/utils/generic.py:311: FutureWarning: `torch.utils._pytree._register_pytree_node` is deprecated. Please use `torch.utils._pytree.register_pytree_node` instead.
torch.utils._pytree._register_pytree_node(
[2024-11-08 17:06:13,547] [INFO] [comm.py:652:init_distributed] cdb=None
[2024-11-08 17:06:13,547] [INFO] [comm.py:683:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
/data/user23262833/.conda/envs/chatglm/lib/python3.8/site-packages/transformers/utils/generic.py:311: FutureWarning: `torch.utils._pytree._register_pytree_node` is deprecated. Please use `torch.utils._pytree.register_pytree_node` instead.
torch.utils._pytree._register_pytree_node(
/data/user23262833/.conda/envs/chatglm/lib/python3.8/site-packages/transformers/utils/generic.py:311: FutureWarning: `torch.utils._pytree._register_pytree_node` is deprecated. Please use `torch.utils._pytree.register_pytree_node` instead.
torch.utils._pytree._register_pytree_node(
/data/user23262833/.conda/envs/chatglm/lib/python3.8/site-packages/transformers/utils/generic.py:311: FutureWarning: `torch.utils._pytree._register_pytree_node` is deprecated. Please use `torch.utils._pytree.register_pytree_node` instead.
torch.utils._pytree._register_pytree_node(
[2024-11-08 17:06:13,960] [INFO] [comm.py:652:init_distributed] cdb=None
[2024-11-08 17:06:14,088] [INFO] [comm.py:652:init_distributed] cdb=None
[2024-11-08 17:06:14,089] [INFO] [comm.py:652:init_distributed] cdb=None
tokenizer.pad_token: <unk>
tokenizer.eos_token: </s>
Loading checkpoint shards: 0%| | 0/7 [00:00<?, ?it/s]/data/user23262833/.conda/envs/chatglm/lib/python3.8/site-packages/transformers/modeling_utils.py:488: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md
return torch.load(checkpoint_file, map_location=map_location)
/data/user23262833/.conda/envs/chatglm/lib/python3.8/site-packages/transformers/modeling_utils.py:488: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md
return torch.load(checkpoint_file, map_location=map_location)
Loading checkpoint shards: 0%| | 0/7 [00:00<?, ?it/s]/data/user23262833/.conda/envs/chatglm/lib/python3.8/site-packages/transformers/modeling_utils.py:488: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md
return torch.load(checkpoint_file, map_location=map_location)
Loading checkpoint shards: 0%| | 0/7 [00:00<?, ?it/s]/data/user23262833/.conda/envs/chatglm/lib/python3.8/site-packages/transformers/modeling_utils.py:488: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md
return torch.load(checkpoint_file, map_location=map_location)
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████| 7/7 [00:09<00:00, 1.34s/it]
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████| 7/7 [00:10<00:00, 1.43s/it]
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████| 7/7 [00:10<00:00, 1.44s/it]
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████| 7/7 [00:10<00:00, 1.44s/it]
the number of skipping data is 0
len(train_dataloader) = 361
len(train_dataset) = 1441
num_training_steps = 182
num_warmup_steps = 18
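The numbers in the log are mutually consistent: 1441 samples sharded across 4 ranks give 361 batches per rank, and with gradient accumulation of 4 over 2 epochs that yields 182 optimizer steps, 10% of which (18) are warmup. Assuming ceiling-style rounding by the distributed sampler and the accumulation loop, the arithmetic works out as:

```python
import math

dataset_len = 1441      # len(train_dataset)
num_gpus = 4            # CUDA_VISIBLE_DEVICES=4,5,6,7
per_device_batch = 1    # --per_device_train_batch_size
grad_accum = 4          # --gradient_accumulation_steps
epochs = 2              # --num_train_epochs
warmup_ratio = 0.1      # --warmup_ratio

# DistributedSampler pads so every rank sees the same number of batches
steps_per_epoch = math.ceil(dataset_len / (num_gpus * per_device_batch))
updates_per_epoch = math.ceil(steps_per_epoch / grad_accum)
num_training_steps = updates_per_epoch * epochs
num_warmup_steps = int(warmup_ratio * num_training_steps)

print(steps_per_epoch, num_training_steps, num_warmup_steps)  # 361 182 18
```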
base_model.model.transformer.encoder.layers.0.self_attention.query_key_value.lora_A.default.weight
base_model.model.transformer.encoder.layers.0.self_attention.query_key_value.lora_B.default.weight
base_model.model.transformer.encoder.layers.0.self_attention.dense.lora_A.default.weight
base_model.model.transformer.encoder.layers.0.self_attention.dense.lora_B.default.weight
base_model.model.transformer.encoder.layers.0.mlp.dense_h_to_4h.lora_A.default.weight
base_model.model.transformer.encoder.layers.0.mlp.dense_h_to_4h.lora_B.default.weight
base_model.model.transformer.encoder.layers.0.mlp.dense_4h_to_h.lora_A.default.weight
base_model.model.transformer.encoder.layers.0.mlp.dense_4h_to_h.lora_B.default.weight
base_model.model.transformer.encoder.layers.1.self_attention.query_key_value.lora_A.default.weight
base_model.model.transformer.encoder.layers.1.self_attention.query_key_value.lora_B.default.weight
base_model.model.transformer.encoder.layers.1.self_attention.dense.lora_A.default.weight
base_model.model.transformer.encoder.layers.1.self_attention.dense.lora_B.default.weight
base_model.model.transformer.encoder.layers.1.mlp.dense_h_to_4h.lora_A.default.weight
base_model.model.transformer.encoder.layers.1.mlp.dense_h_to_4h.lora_B.default.weight
base_model.model.transformer.encoder.layers.1.mlp.dense_4h_to_h.lora_A.default.weight
base_model.model.transformer.encoder.layers.1.mlp.dense_4h_to_h.lora_B.default.weight
base_model.model.transformer.encoder.layers.2.self_attention.query_key_value.lora_A.default.weight
base_model.model.transformer.encoder.layers.2.self_attention.query_key_value.lora_B.default.weight
base_model.model.transformer.encoder.layers.2.self_attention.dense.lora_A.default.weight
base_model.model.transformer.encoder.layers.2.self_attention.dense.lora_B.default.weight
base_model.model.transformer.encoder.layers.2.mlp.dense_h_to_4h.lora_A.default.weight
base_model.model.transformer.encoder.layers.2.mlp.dense_h_to_4h.lora_B.default.weight
base_model.model.transformer.encoder.layers.2.mlp.dense_4h_to_h.lora_A.default.weight
base_model.model.transformer.encoder.layers.2.mlp.dense_4h_to_h.lora_B.default.weight
base_model.model.transformer.encoder.layers.3.self_attention.query_key_value.lora_A.default.weight
base_model.model.transformer.encoder.layers.3.self_attention.query_key_value.lora_B.default.weight
base_model.model.transformer.encoder.layers.3.self_attention.dense.lora_A.default.weight
base_model.model.transformer.encoder.layers.3.self_attention.dense.lora_B.default.weight
base_model.model.transformer.encoder.layers.3.mlp.dense_h_to_4h.lora_A.default.weight
base_model.model.transformer.encoder.layers.3.mlp.dense_h_to_4h.lora_B.default.weight
base_model.model.transformer.encoder.layers.3.mlp.dense_4h_to_h.lora_A.default.weight
base_model.model.transformer.encoder.layers.3.mlp.dense_4h_to_h.lora_B.default.weight
base_model.model.transformer.encoder.layers.4.self_attention.query_key_value.lora_A.default.weight
base_model.model.transformer.encoder.layers.4.self_attention.query_key_value.lora_B.default.weight
base_model.model.transformer.encoder.layers.4.self_attention.dense.lora_A.default.weight
base_model.model.transformer.encoder.layers.4.self_attention.dense.lora_B.default.weight
base_model.model.transformer.encoder.layers.4.mlp.dense_h_to_4h.lora_A.default.weight
base_model.model.transformer.encoder.layers.4.mlp.dense_h_to_4h.lora_B.default.weight
base_model.model.transformer.encoder.layers.4.mlp.dense_4h_to_h.lora_A.default.weight
base_model.model.transformer.encoder.layers.4.mlp.dense_4h_to_h.lora_B.default.weight
base_model.model.transformer.encoder.layers.5.self_attention.query_key_value.lora_A.default.weight
base_model.model.transformer.encoder.layers.5.self_attention.query_key_value.lora_B.default.weight
base_model.model.transformer.encoder.layers.5.self_attention.dense.lora_A.default.weight
base_model.model.transformer.encoder.layers.5.self_attention.dense.lora_B.default.weight
base_model.model.transformer.encoder.layers.5.mlp.dense_h_to_4h.lora_A.default.weight
base_model.model.transformer.encoder.layers.5.mlp.dense_h_to_4h.lora_B.default.weight
base_model.model.transformer.encoder.layers.5.mlp.dense_4h_to_h.lora_A.default.weight
base_model.model.transformer.encoder.layers.5.mlp.dense_4h_to_h.lora_B.default.weight
base_model.model.transformer.encoder.layers.6.self_attention.query_key_value.lora_A.default.weight
base_model.model.transformer.encoder.layers.6.self_attention.query_key_value.lora_B.default.weight
base_model.model.transformer.encoder.layers.6.self_attention.dense.lora_A.default.weight
base_model.model.transformer.encoder.layers.6.self_attention.dense.lora_B.default.weight
base_model.model.transformer.encoder.layers.6.mlp.dense_h_to_4h.lora_A.default.weight
base_model.model.transformer.encoder.layers.6.mlp.dense_h_to_4h.lora_B.default.weight
base_model.model.transformer.encoder.layers.6.mlp.dense_4h_to_h.lora_A.default.weight
base_model.model.transformer.encoder.layers.6.mlp.dense_4h_to_h.lora_B.default.weight
base_model.model.transformer.encoder.layers.7.self_attention.query_key_value.lora_A.default.weight
base_model.model.transformer.encoder.layers.7.self_attention.query_key_value.lora_B.default.weight
base_model.model.transformer.encoder.layers.7.self_attention.dense.lora_A.default.weight
base_model.model.transformer.encoder.layers.7.self_attention.dense.lora_B.default.weight
base_model.model.transformer.encoder.layers.7.mlp.dense_h_to_4h.lora_A.default.weight
base_model.model.transformer.encoder.layers.7.mlp.dense_h_to_4h.lora_B.default.weight
base_model.model.transformer.encoder.layers.7.mlp.dense_4h_to_h.lora_A.default.weight
base_model.model.transformer.encoder.layers.7.mlp.dense_4h_to_h.lora_B.default.weight
base_model.model.transformer.encoder.layers.8.self_attention.query_key_value.lora_A.default.weight
base_model.model.transformer.encoder.layers.8.self_attention.query_key_value.lora_B.default.weight
base_model.model.transformer.encoder.layers.8.self_attention.dense.lora_A.default.weight
base_model.model.transformer.encoder.layers.8.self_attention.dense.lora_B.default.weight
base_model.model.transformer.encoder.layers.8.mlp.dense_h_to_4h.lora_A.default.weight
base_model.model.transformer.encoder.layers.8.mlp.dense_h_to_4h.lora_B.default.weight
base_model.model.transformer.encoder.layers.8.mlp.dense_4h_to_h.lora_A.default.weight
base_model.model.transformer.encoder.layers.8.mlp.dense_4h_to_h.lora_B.default.weight
base_model.model.transformer.encoder.layers.9.self_attention.query_key_value.lora_A.default.weight
base_model.model.transformer.encoder.layers.9.self_attention.query_key_value.lora_B.default.weight
base_model.model.transformer.encoder.layers.9.self_attention.dense.lora_A.default.weight
base_model.model.transformer.encoder.layers.9.self_attention.dense.lora_B.default.weight
base_model.model.transformer.encoder.layers.9.mlp.dense_h_to_4h.lora_A.default.weight
base_model.model.transformer.encoder.layers.9.mlp.dense_h_to_4h.lora_B.default.weight
base_model.model.transformer.encoder.layers.9.mlp.dense_4h_to_h.lora_A.default.weight
base_model.model.transformer.encoder.layers.9.mlp.dense_4h_to_h.lora_B.default.weight
base_model.model.transformer.encoder.layers.10.self_attention.query_key_value.lora_A.default.weight
base_model.model.transformer.encoder.layers.10.self_attention.query_key_value.lora_B.default.weight
base_model.model.transformer.encoder.layers.10.self_attention.dense.lora_A.default.weight
base_model.model.transformer.encoder.layers.10.self_attention.dense.lora_B.default.weight
base_model.model.transformer.encoder.layers.10.mlp.dense_h_to_4h.lora_A.default.weight
base_model.model.transformer.encoder.layers.10.mlp.dense_h_to_4h.lora_B.default.weight
base_model.model.transformer.encoder.layers.10.mlp.dense_4h_to_h.lora_A.default.weight
base_model.model.transformer.encoder.layers.10.mlp.dense_4h_to_h.lora_B.default.weight
base_model.model.transformer.encoder.layers.11.self_attention.query_key_value.lora_A.default.weight
base_model.model.transformer.encoder.layers.11.self_attention.query_key_value.lora_B.default.weight
base_model.model.transformer.encoder.layers.11.self_attention.dense.lora_A.default.weight
base_model.model.transformer.encoder.layers.11.self_attention.dense.lora_B.default.weight
base_model.model.transformer.encoder.layers.11.mlp.dense_h_to_4h.lora_A.default.weight
base_model.model.transformer.encoder.layers.11.mlp.dense_h_to_4h.lora_B.default.weight
base_model.model.transformer.encoder.layers.11.mlp.dense_4h_to_h.lora_A.default.weight
base_model.model.transformer.encoder.layers.11.mlp.dense_4h_to_h.lora_B.default.weight
base_model.model.transformer.encoder.layers.12.self_attention.query_key_value.lora_A.default.weight
base_model.model.transformer.encoder.layers.12.self_attention.query_key_value.lora_B.default.weight
base_model.model.transformer.encoder.layers.12.self_attention.dense.lora_A.default.weight
base_model.model.transformer.encoder.layers.12.self_attention.dense.lora_B.default.weight
base_model.model.transformer.encoder.layers.12.mlp.dense_h_to_4h.lora_A.default.weight
base_model.model.transformer.encoder.layers.12.mlp.dense_h_to_4h.lora_B.default.weight
base_model.model.transformer.encoder.layers.12.mlp.dense_4h_to_h.lora_A.default.weight
base_model.model.transformer.encoder.layers.12.mlp.dense_4h_to_h.lora_B.default.weight
base_model.model.transformer.encoder.layers.13.self_attention.query_key_value.lora_A.default.weight
base_model.model.transformer.encoder.layers.13.self_attention.query_key_value.lora_B.default.weight
base_model.model.transformer.encoder.layers.13.self_attention.dense.lora_A.default.weight
base_model.model.transformer.encoder.layers.13.self_attention.dense.lora_B.default.weight
base_model.model.transformer.encoder.layers.13.mlp.dense_h_to_4h.lora_A.default.weight
base_model.model.transformer.encoder.layers.13.mlp.dense_h_to_4h.lora_B.default.weight
base_model.model.transformer.encoder.layers.13.mlp.dense_4h_to_h.lora_A.default.weight
base_model.model.transformer.encoder.layers.13.mlp.dense_4h_to_h.lora_B.default.weight
base_model.model.transformer.encoder.layers.14.self_attention.query_key_value.lora_A.default.weight
base_model.model.transformer.encoder.layers.14.self_attention.query_key_value.lora_B.default.weight
base_model.model.transformer.encoder.layers.14.self_attention.dense.lora_A.default.weight
base_model.model.transformer.encoder.layers.14.self_attention.dense.lora_B.default.weight
base_model.model.transformer.encoder.layers.14.mlp.dense_h_to_4h.lora_A.default.weight
base_model.model.transformer.encoder.layers.14.mlp.dense_h_to_4h.lora_B.default.weight
base_model.model.transformer.encoder.layers.14.mlp.dense_4h_to_h.lora_A.default.weight
base_model.model.transformer.encoder.layers.14.mlp.dense_4h_to_h.lora_B.default.weight
base_model.model.transformer.encoder.layers.15.self_attention.query_key_value.lora_A.default.weight
base_model.model.transformer.encoder.layers.15.self_attention.query_key_value.lora_B.default.weight
base_model.model.transformer.encoder.layers.15.self_attention.dense.lora_A.default.weight
base_model.model.transformer.encoder.layers.15.self_attention.dense.lora_B.default.weight
base_model.model.transformer.encoder.layers.15.mlp.dense_h_to_4h.lora_A.default.weight
base_model.model.transformer.encoder.layers.15.mlp.dense_h_to_4h.lora_B.default.weight
base_model.model.transformer.encoder.layers.15.mlp.dense_4h_to_h.lora_A.default.weight
base_model.model.transformer.encoder.layers.15.mlp.dense_4h_to_h.lora_B.default.weight
base_model.model.transformer.encoder.layers.16.self_attention.query_key_value.lora_A.default.weight
base_model.model.transformer.encoder.layers.16.self_attention.query_key_value.lora_B.default.weight
base_model.model.transformer.encoder.layers.16.self_attention.dense.lora_A.default.weight
base_model.model.transformer.encoder.layers.16.self_attention.dense.lora_B.default.weight
base_model.model.transformer.encoder.layers.16.mlp.dense_h_to_4h.lora_A.default.weight
base_model.model.transformer.encoder.layers.16.mlp.dense_h_to_4h.lora_B.default.weight
base_model.model.transformer.encoder.layers.16.mlp.dense_4h_to_h.lora_A.default.weight
base_model.model.transformer.encoder.layers.16.mlp.dense_4h_to_h.lora_B.default.weight
base_model.model.transformer.encoder.layers.17.self_attention.query_key_value.lora_A.default.weight
base_model.model.transformer.encoder.layers.17.self_attention.query_key_value.lora_B.default.weight
base_model.model.transformer.encoder.layers.17.self_attention.dense.lora_A.default.weight
base_model.model.transformer.encoder.layers.17.self_attention.dense.lora_B.default.weight
base_model.model.transformer.encoder.layers.17.mlp.dense_h_to_4h.lora_A.default.weight
base_model.model.transformer.encoder.layers.17.mlp.dense_h_to_4h.lora_B.default.weight
base_model.model.transformer.encoder.layers.17.mlp.dense_4h_to_h.lora_A.default.weight
base_model.model.transformer.encoder.layers.17.mlp.dense_4h_to_h.lora_B.default.weight
base_model.model.transformer.encoder.layers.18.self_attention.query_key_value.lora_A.default.weight
base_model.model.transformer.encoder.layers.18.self_attention.query_key_value.lora_B.default.weight
base_model.model.transformer.encoder.layers.18.self_attention.dense.lora_A.default.weight
base_model.model.transformer.encoder.layers.18.self_attention.dense.lora_B.default.weight
base_model.model.transformer.encoder.layers.18.mlp.dense_h_to_4h.lora_A.default.weight
base_model.model.transformer.encoder.layers.18.mlp.dense_h_to_4h.lora_B.default.weight
base_model.model.transformer.encoder.layers.18.mlp.dense_4h_to_h.lora_A.default.weight
base_model.model.transformer.encoder.layers.18.mlp.dense_4h_to_h.lora_B.default.weight
base_model.model.transformer.encoder.layers.19.self_attention.query_key_value.lora_A.default.weight
base_model.model.transformer.encoder.layers.19.self_attention.query_key_value.lora_B.default.weight
base_model.model.transformer.encoder.layers.19.self_attention.dense.lora_A.default.weight
base_model.model.transformer.encoder.layers.19.self_attention.dense.lora_B.default.weight
base_model.model.transformer.encoder.layers.19.mlp.dense_h_to_4h.lora_A.default.weight
base_model.model.transformer.encoder.layers.19.mlp.dense_h_to_4h.lora_B.default.weight
base_model.model.transformer.encoder.layers.19.mlp.dense_4h_to_h.lora_A.default.weight
base_model.model.transformer.encoder.layers.19.mlp.dense_4h_to_h.lora_B.default.weight
base_model.model.transformer.encoder.layers.20.self_attention.query_key_value.lora_A.default.weight
base_model.model.transformer.encoder.layers.20.self_attention.query_key_value.lora_B.default.weight
base_model.model.transformer.encoder.layers.20.self_attention.dense.lora_A.default.weight
base_model.model.transformer.encoder.layers.20.self_attention.dense.lora_B.default.weight
base_model.model.transformer.encoder.layers.20.mlp.dense_h_to_4h.lora_A.default.weight
base_model.model.transformer.encoder.layers.20.mlp.dense_h_to_4h.lora_B.default.weight
base_model.model.transformer.encoder.layers.20.mlp.dense_4h_to_h.lora_A.default.weight
base_model.model.transformer.encoder.layers.20.mlp.dense_4h_to_h.lora_B.default.weight
base_model.model.transformer.encoder.layers.21.self_attention.query_key_value.lora_A.default.weight
base_model.model.transformer.encoder.layers.21.self_attention.query_key_value.lora_B.default.weight
base_model.model.transformer.encoder.layers.21.self_attention.dense.lora_A.default.weight
base_model.model.transformer.encoder.layers.21.self_attention.dense.lora_B.default.weight
base_model.model.transformer.encoder.layers.21.mlp.dense_h_to_4h.lora_A.default.weight
base_model.model.transformer.encoder.layers.21.mlp.dense_h_to_4h.lora_B.default.weight
base_model.model.transformer.encoder.layers.21.mlp.dense_4h_to_h.lora_A.default.weight
base_model.model.transformer.encoder.layers.21.mlp.dense_4h_to_h.lora_B.default.weight
base_model.model.transformer.encoder.layers.22.self_attention.query_key_value.lora_A.default.weight
base_model.model.transformer.encoder.layers.22.self_attention.query_key_value.lora_B.default.weight
base_model.model.transformer.encoder.layers.22.self_attention.dense.lora_A.default.weight
base_model.model.transformer.encoder.layers.22.self_attention.dense.lora_B.default.weight
base_model.model.transformer.encoder.layers.22.mlp.dense_h_to_4h.lora_A.default.weight
base_model.model.transformer.encoder.layers.22.mlp.dense_h_to_4h.lora_B.default.weight
base_model.model.transformer.encoder.layers.22.mlp.dense_4h_to_h.lora_A.default.weight
base_model.model.transformer.encoder.layers.22.mlp.dense_4h_to_h.lora_B.default.weight
base_model.model.transformer.encoder.layers.23.self_attention.query_key_value.lora_A.default.weight
base_model.model.transformer.encoder.layers.23.self_attention.query_key_value.lora_B.default.weight
base_model.model.transformer.encoder.layers.23.self_attention.dense.lora_A.default.weight
base_model.model.transformer.encoder.layers.23.self_attention.dense.lora_B.default.weight
base_model.model.transformer.encoder.layers.23.mlp.dense_h_to_4h.lora_A.default.weight
base_model.model.transformer.encoder.layers.23.mlp.dense_h_to_4h.lora_B.default.weight
base_model.model.transformer.encoder.layers.23.mlp.dense_4h_to_h.lora_A.default.weight
base_model.model.transformer.encoder.layers.23.mlp.dense_4h_to_h.lora_B.default.weight
base_model.model.transformer.encoder.layers.24.self_attention.query_key_value.lora_A.default.weight
base_model.model.transformer.encoder.layers.24.self_attention.query_key_value.lora_B.default.weight
base_model.model.transformer.encoder.layers.24.self_attention.dense.lora_A.default.weight
base_model.model.transformer.encoder.layers.24.self_attention.dense.lora_B.default.weight
base_model.model.transformer.encoder.layers.24.mlp.dense_h_to_4h.lora_A.default.weight
base_model.model.transformer.encoder.layers.24.mlp.dense_h_to_4h.lora_B.default.weight
base_model.model.transformer.encoder.layers.24.mlp.dense_4h_to_h.lora_A.default.weight
base_model.model.transformer.encoder.layers.24.mlp.dense_4h_to_h.lora_B.default.weight
base_model.model.transformer.encoder.layers.25.self_attention.query_key_value.lora_A.default.weight
base_model.model.transformer.encoder.layers.25.self_attention.query_key_value.lora_B.default.weight
base_model.model.transformer.encoder.layers.25.self_attention.dense.lora_A.default.weight
base_model.model.transformer.encoder.layers.25.self_attention.dense.lora_B.default.weight
base_model.model.transformer.encoder.layers.25.mlp.dense_h_to_4h.lora_A.default.weight
base_model.model.transformer.encoder.layers.25.mlp.dense_h_to_4h.lora_B.default.weight
base_model.model.transformer.encoder.layers.25.mlp.dense_4h_to_h.lora_A.default.weight
base_model.model.transformer.encoder.layers.25.mlp.dense_4h_to_h.lora_B.default.weight
base_model.model.transformer.encoder.layers.26.self_attention.query_key_value.lora_A.default.weight
base_model.model.transformer.encoder.layers.26.self_attention.query_key_value.lora_B.default.weight
base_model.model.transformer.encoder.layers.26.self_attention.dense.lora_A.default.weight
base_model.model.transformer.encoder.layers.26.self_attention.dense.lora_B.default.weight
base_model.model.transformer.encoder.layers.26.mlp.dense_h_to_4h.lora_A.default.weight
base_model.model.transformer.encoder.layers.26.mlp.dense_h_to_4h.lora_B.default.weight
base_model.model.transformer.encoder.layers.26.mlp.dense_4h_to_h.lora_A.default.weight
base_model.model.transformer.encoder.layers.26.mlp.dense_4h_to_h.lora_B.default.weight
base_model.model.transformer.encoder.layers.27.self_attention.query_key_value.lora_A.default.weight
base_model.model.transformer.encoder.layers.27.self_attention.query_key_value.lora_B.default.weight
base_model.model.transformer.encoder.layers.27.self_attention.dense.lora_A.default.weight
base_model.model.transformer.encoder.layers.27.self_attention.dense.lora_B.default.weight
base_model.model.transformer.encoder.layers.27.mlp.dense_h_to_4h.lora_A.default.weight
base_model.model.transformer.encoder.layers.27.mlp.dense_h_to_4h.lora_B.default.weight
base_model.model.transformer.encoder.layers.27.mlp.dense_4h_to_h.lora_A.default.weight
base_model.model.transformer.encoder.layers.27.mlp.dense_4h_to_h.lora_B.default.weight
trainable params: 29646848 || all params: 6273230848 || trainable%: 0.47259297032647635
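The `trainable params: 29646848` figure above can be reproduced directly from the LoRA settings in the launch command (`--lora_dim 16` on the four target modules across 28 layers). A quick sanity-check sketch, assuming ChatGLM2-6B's published dimensions (hidden size 4096, multi-query attention giving a 4608-wide `query_key_value` output, SwiGLU MLP with ffn size 13696):

```python
# Reproduce "trainable params: 29646848" from the LoRA config used above.
# Model dimensions are assumed from ChatGLM2-6B's config.json:
#   hidden_size=4096, multi-query K/V adds 2*256 to the QKV output width,
#   ffn_hidden_size=13696 (SwiGLU doubles the dense_h_to_4h output width).
r = 16             # --lora_dim
n_layers = 28
hidden = 4096
qkv_out = hidden + 2 * 256   # 4608: query + 2 shared K/V head groups
ffn = 13696

def lora_params(in_f, out_f, r=r):
    # Each LoRA pair adds an A matrix (r x in) and a B matrix (out x r).
    return r * in_f + out_f * r

per_layer = (
    lora_params(hidden, qkv_out)     # self_attention.query_key_value
    + lora_params(hidden, hidden)    # self_attention.dense
    + lora_params(hidden, 2 * ffn)   # mlp.dense_h_to_4h (SwiGLU: 2x width)
    + lora_params(ffn, hidden)       # mlp.dense_4h_to_h
)
total = per_layer * n_layers
print(total)                               # 29646848, matching the log
print(f"{total / 6273230848 * 100:.4f}%")  # 0.4726% trainable
```

The per-layer total (1,058,816) times 28 layers gives exactly the 29,646,848 trainable parameters reported, i.e. about 0.47% of the 6.27B total — which is why LoRA fine-tuning fits comfortably on the A100s here.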
[2024-11-08 17:07:29,871] [INFO] [logging.py:129:log_dist] [Rank 0] DeepSpeed info: version=0.15.3, git-hash=unknown, git-branch=unknown
[2024-11-08 17:07:29,872] [INFO] [comm.py:677:init_distributed] Distributed backend already initialized
[2024-11-08 17:07:29,872] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 4
the number of skipping data is 0
base_model.model.transformer.encoder.layers.0.self_attention.query_key_value.lora_A.default.weight
base_model.model.transformer.encoder.layers.0.self_attention.query_key_value.lora_B.default.weight
base_model.model.transformer.encoder.layers.0.self_attention.dense.lora_A.default.weight
base_model.model.transformer.encoder.layers.0.self_attention.dense.lora_B.default.weight
base_model.model.transformer.encoder.layers.0.mlp.dense_h_to_4h.lora_A.default.weight
base_model.model.transformer.encoder.layers.0.mlp.dense_h_to_4h.lora_B.default.weight
base_model.model.transformer.encoder.layers.0.mlp.dense_4h_to_h.lora_A.default.weight
base_model.model.transformer.encoder.layers.0.mlp.dense_4h_to_h.lora_B.default.weight
base_model.model.transformer.encoder.layers.1.self_attention.query_key_value.lora_A.default.weight
base_model.model.transformer.encoder.layers.1.self_attention.query_key_value.lora_B.default.weight
base_model.model.transformer.encoder.layers.1.self_attention.dense.lora_A.default.weight
base_model.model.transformer.encoder.layers.1.self_attention.dense.lora_B.default.weight
base_model.model.transformer.encoder.layers.1.mlp.dense_h_to_4h.lora_A.default.weight
base_model.model.transformer.encoder.layers.1.mlp.dense_h_to_4h.lora_B.default.weight
base_model.model.transformer.encoder.layers.1.mlp.dense_4h_to_h.lora_A.default.weight
base_model.model.transformer.encoder.layers.1.mlp.dense_4h_to_h.lora_B.default.weight
base_model.model.transformer.encoder.layers.2.self_attention.query_key_value.lora_A.default.weight
base_model.model.transformer.encoder.layers.2.self_attention.query_key_value.lora_B.default.weight
base_model.model.transformer.encoder.layers.2.self_attention.dense.lora_A.default.weight
base_model.model.transformer.encoder.layers.2.self_attention.dense.lora_B.default.weight
base_model.model.transformer.encoder.layers.2.mlp.dense_h_to_4h.lora_A.default.weight
base_model.model.transformer.encoder.layers.2.mlp.dense_h_to_4h.lora_B.default.weight
base_model.model.transformer.encoder.layers.2.mlp.dense_4h_to_h.lora_A.default.weight
base_model.model.transformer.encoder.layers.2.mlp.dense_4h_to_h.lora_B.default.weight
base_model.model.transformer.encoder.layers.3.self_attention.query_key_value.lora_A.default.weight
base_model.model.transformer.encoder.layers.3.self_attention.query_key_value.lora_B.default.weight
base_model.model.transformer.encoder.layers.3.self_attention.dense.lora_A.default.weight
base_model.model.transformer.encoder.layers.3.self_attention.dense.lora_B.default.weight
base_model.model.transformer.encoder.layers.3.mlp.dense_h_to_4h.lora_A.default.weight
base_model.model.transformer.encoder.layers.3.mlp.dense_h_to_4h.lora_B.default.weight
base_model.model.transformer.encoder.layers.3.mlp.dense_4h_to_h.lora_A.default.weight
base_model.model.transformer.encoder.layers.3.mlp.dense_4h_to_h.lora_B.default.weight
base_model.model.transformer.encoder.layers.4.self_attention.query_key_value.lora_A.default.weight
base_model.model.transformer.encoder.layers.4.self_attention.query_key_value.lora_B.default.weight
base_model.model.transformer.encoder.layers.4.self_attention.dense.lora_A.default.weight
base_model.model.transformer.encoder.layers.4.self_attention.dense.lora_B.default.weight
base_model.model.transformer.encoder.layers.4.mlp.dense_h_to_4h.lora_A.default.weight
base_model.model.transformer.encoder.layers.4.mlp.dense_h_to_4h.lora_B.default.weight
base_model.model.transformer.encoder.layers.4.mlp.dense_4h_to_h.lora_A.default.weight
base_model.model.transformer.encoder.layers.4.mlp.dense_4h_to_h.lora_B.default.weight
base_model.model.transformer.encoder.layers.5.self_attention.query_key_value.lora_A.default.weight
base_model.model.transformer.encoder.layers.5.self_attention.query_key_value.lora_B.default.weight
base_model.model.transformer.encoder.layers.5.self_attention.dense.lora_A.default.weight
base_model.model.transformer.encoder.layers.5.self_attention.dense.lora_B.default.weight
base_model.model.transformer.encoder.layers.5.mlp.dense_h_to_4h.lora_A.default.weight
base_model.model.transformer.encoder.layers.5.mlp.dense_h_to_4h.lora_B.default.weight
base_model.model.transformer.encoder.layers.5.mlp.dense_4h_to_h.lora_A.default.weight
base_model.model.transformer.encoder.layers.5.mlp.dense_4h_to_h.lora_B.default.weight
base_model.model.transformer.encoder.layers.6.self_attention.query_key_value.lora_A.default.weight
base_model.model.transformer.encoder.layers.6.self_attention.query_key_value.lora_B.default.weight
base_model.model.transformer.encoder.layers.6.self_attention.dense.lora_A.default.weight
base_model.model.transformer.encoder.layers.6.self_attention.dense.lora_B.default.weight
base_model.model.transformer.encoder.layers.6.mlp.dense_h_to_4h.lora_A.default.weight
base_model.model.transformer.encoder.layers.6.mlp.dense_h_to_4h.lora_B.default.weight
base_model.model.transformer.encoder.layers.6.mlp.dense_4h_to_h.lora_A.default.weight
base_model.model.transformer.encoder.layers.6.mlp.dense_4h_to_h.lora_B.default.weight
base_model.model.transformer.encoder.layers.7.self_attention.query_key_value.lora_A.default.weight
base_model.model.transformer.encoder.layers.7.self_attention.query_key_value.lora_B.default.weight
base_model.model.transformer.encoder.layers.7.self_attention.dense.lora_A.default.weight
base_model.model.transformer.encoder.layers.7.self_attention.dense.lora_B.default.weight
base_model.model.transformer.encoder.layers.7.mlp.dense_h_to_4h.lora_A.default.weight
base_model.model.transformer.encoder.layers.7.mlp.dense_h_to_4h.lora_B.default.weight
base_model.model.transformer.encoder.layers.7.mlp.dense_4h_to_h.lora_A.default.weight
base_model.model.transformer.encoder.layers.7.mlp.dense_4h_to_h.lora_B.default.weight
base_model.model.transformer.encoder.layers.8.self_attention.query_key_value.lora_A.default.weight
base_model.model.transformer.encoder.layers.8.self_attention.query_key_value.lora_B.default.weight
base_model.model.transformer.encoder.layers.8.self_attention.dense.lora_A.default.weight
base_model.model.transformer.encoder.layers.8.self_attention.dense.lora_B.default.weight
base_model.model.transformer.encoder.layers.8.mlp.dense_h_to_4h.lora_A.default.weight
base_model.model.transformer.encoder.layers.8.mlp.dense_h_to_4h.lora_B.default.weight
base_model.model.transformer.encoder.layers.8.mlp.dense_4h_to_h.lora_A.default.weight
base_model.model.transformer.encoder.layers.8.mlp.dense_4h_to_h.lora_B.default.weight
base_model.model.transformer.encoder.layers.9.self_attention.query_key_value.lora_A.default.weight
base_model.model.transformer.encoder.layers.9.self_attention.query_key_value.lora_B.default.weight
base_model.model.transformer.encoder.layers.9.self_attention.dense.lora_A.default.weight
base_model.model.transformer.encoder.layers.9.self_attention.dense.lora_B.default.weight
base_model.model.transformer.encoder.layers.9.mlp.dense_h_to_4h.lora_A.default.weight
base_model.model.transformer.encoder.layers.9.mlp.dense_h_to_4h.lora_B.default.weight
base_model.model.transformer.encoder.layers.9.mlp.dense_4h_to_h.lora_A.default.weight
base_model.model.transformer.encoder.layers.9.mlp.dense_4h_to_h.lora_B.default.weight
base_model.model.transformer.encoder.layers.10.self_attention.query_key_value.lora_A.default.weight
base_model.model.transformer.encoder.layers.10.self_attention.query_key_value.lora_B.default.weight
base_model.model.transformer.encoder.layers.10.self_attention.dense.lora_A.default.weight
base_model.model.transformer.encoder.layers.10.self_attention.dense.lora_B.default.weight
base_model.model.transformer.encoder.layers.10.mlp.dense_h_to_4h.lora_A.default.weight
base_model.model.transformer.encoder.layers.10.mlp.dense_h_to_4h.lora_B.default.weight
base_model.model.transformer.encoder.layers.10.mlp.dense_4h_to_h.lora_A.default.weight
base_model.model.transformer.encoder.layers.10.mlp.dense_4h_to_h.lora_B.default.weight
base_model.model.transformer.encoder.layers.11.self_attention.query_key_value.lora_A.default.weight
base_model.model.transformer.encoder.layers.11.self_attention.query_key_value.lora_B.default.weight
base_model.model.transformer.encoder.layers.11.self_attention.dense.lora_A.default.weight
base_model.model.transformer.encoder.layers.11.self_attention.dense.lora_B.default.weight
base_model.model.transformer.encoder.layers.11.mlp.dense_h_to_4h.lora_A.default.weight
base_model.model.transformer.encoder.layers.11.mlp.dense_h_to_4h.lora_B.default.weight
base_model.model.transformer.encoder.layers.11.mlp.dense_4h_to_h.lora_A.default.weight
base_model.model.transformer.encoder.layers.11.mlp.dense_4h_to_h.lora_B.default.weight
base_model.model.transformer.encoder.layers.12.self_attention.query_key_value.lora_A.default.weight
base_model.model.transformer.encoder.layers.12.self_attention.query_key_value.lora_B.default.weight
base_model.model.transformer.encoder.layers.12.self_attention.dense.lora_A.default.weight
base_model.model.transformer.encoder.layers.12.self_attention.dense.lora_B.default.weight
base_model.model.transformer.encoder.layers.12.mlp.dense_h_to_4h.lora_A.default.weight
base_model.model.transformer.encoder.layers.12.mlp.dense_h_to_4h.lora_B.default.weight
base_model.model.transformer.encoder.layers.12.mlp.dense_4h_to_h.lora_A.default.weight
base_model.model.transformer.encoder.layers.12.mlp.dense_4h_to_h.lora_B.default.weight
base_model.model.transformer.encoder.layers.13.self_attention.query_key_value.lora_A.default.weight
base_model.model.transformer.encoder.layers.13.self_attention.query_key_value.lora_B.default.weight
base_model.model.transformer.encoder.layers.13.self_attention.dense.lora_A.default.weight
base_model.model.transformer.encoder.layers.13.self_attention.dense.lora_B.default.weight
base_model.model.transformer.encoder.layers.13.mlp.dense_h_to_4h.lora_A.default.weight
base_model.model.transformer.encoder.layers.13.mlp.dense_h_to_4h.lora_B.default.weight
base_model.model.transformer.encoder.layers.13.mlp.dense_4h_to_h.lora_A.default.weight
base_model.model.transformer.encoder.layers.13.mlp.dense_4h_to_h.lora_B.default.weight
(layers 14-27 repeat the same pattern: in every layer, query_key_value, dense, dense_h_to_4h and dense_4h_to_h each get a lora_A.default.weight and a lora_B.default.weight; the 112 identical-form lines are omitted here)
trainable params: 29646848 || all params: 6273230848 || trainable%: 0.47259297032647635
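The `trainable params` summary above is printed by PEFT for the LoRA setup (`lora_dim 16` on the four target modules listed in `--lora_module_name`). The percentage is simple arithmetic over the two counts in the log line:

```python
# Reproduce the "trainable%" figure from the two raw counts in the log.
trainable_params = 29_646_848     # LoRA A/B matrices only
all_params = 6_273_230_848        # full ChatGLM2-6B plus the LoRA adapters
trainable_pct = 100 * trainable_params / all_params
print(trainable_pct)              # ~0.4726, matching the log line
```

So LoRA trains well under 1% of the model's parameters, which is why a ZeRO-2 setup without offload fits comfortably on these GPUs.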
[2024-11-08 17:07:31,295] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 4
[2024-11-08 17:07:34,800] [INFO] [logging.py:129:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
Using /data/user23262833/.cache/torch_extensions/py38_cu121 as PyTorch extensions root...
Creating extension directory /data/user23262833/.cache/torch_extensions/py38_cu121/fused_adam...
Using /data/user23262833/.cache/torch_extensions/py38_cu121 as PyTorch extensions root...
Using /data/user23262833/.cache/torch_extensions/py38_cu121 as PyTorch extensions root...
Using /data/user23262833/.cache/torch_extensions/py38_cu121 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /data/user23262833/.cache/torch_extensions/py38_cu121/fused_adam/build.ninja...
/data/user23262833/.conda/envs/chatglm/lib/python3.8/site-packages/torch/utils/cpp_extension.py:1965: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
warnings.warn(
Building extension module fused_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/3] /usr/local/cuda-11.8/bin/nvcc --generate-dependencies-with-compile --dependency-output multi_tensor_adam.cuda.o.d -DTORCH_EXTENSION_NAME=fused_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/data/user23262833/.conda/envs/chatglm/lib/python3.8/site-packages/deepspeed/ops/csrc/includes -I/data/user23262833/.conda/envs/chatglm/lib/python3.8/site-packages/deepspeed/ops/csrc/adam -isystem /data/user23262833/.conda/envs/chatglm/lib/python3.8/site-packages/torch/include -isystem /data/user23262833/.conda/envs/chatglm/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /data/user23262833/.conda/envs/chatglm/lib/python3.8/site-packages/torch/include/TH -isystem /data/user23262833/.conda/envs/chatglm/lib/python3.8/site-packages/torch/include/THC -isystem /usr/local/cuda-11.8/include -isystem /data/user23262833/.conda/envs/chatglm/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 --compiler-options '-fPIC' -O3 -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -lineinfo --use_fast_math -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_80,code=compute_80 -DBF16_AVAILABLE -U__CUDA_NO_BFLOAT16_OPERATORS__ -U__CUDA_NO_BFLOAT162_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ -std=c++17 -c /data/user23262833/.conda/envs/chatglm/lib/python3.8/site-packages/deepspeed/ops/csrc/adam/multi_tensor_adam.cu -o multi_tensor_adam.cuda.o
[2/3] c++ -MMD -MF fused_adam_frontend.o.d -DTORCH_EXTENSION_NAME=fused_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/data/user23262833/.conda/envs/chatglm/lib/python3.8/site-packages/deepspeed/ops/csrc/includes -I/data/user23262833/.conda/envs/chatglm/lib/python3.8/site-packages/deepspeed/ops/csrc/adam -isystem /data/user23262833/.conda/envs/chatglm/lib/python3.8/site-packages/torch/include -isystem /data/user23262833/.conda/envs/chatglm/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /data/user23262833/.conda/envs/chatglm/lib/python3.8/site-packages/torch/include/TH -isystem /data/user23262833/.conda/envs/chatglm/lib/python3.8/site-packages/torch/include/THC -isystem /usr/local/cuda-11.8/include -isystem /data/user23262833/.conda/envs/chatglm/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -O3 -std=c++17 -g -Wno-reorder -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -DBF16_AVAILABLE -c /data/user23262833/.conda/envs/chatglm/lib/python3.8/site-packages/deepspeed/ops/csrc/adam/fused_adam_frontend.cpp -o fused_adam_frontend.o
[3/3] c++ fused_adam_frontend.o multi_tensor_adam.cuda.o -shared -L/data/user23262833/.conda/envs/chatglm/lib/python3.8/site-packages/torch/lib -lc10 -lc10_cuda -ltorch_cpu -ltorch_cuda -ltorch -ltorch_python -L/usr/local/cuda-11.8/lib64 -lcudart -o fused_adam.so
Loading extension module fused_adam...
Time to load fused_adam op: 22.472938776016235 seconds
Loading extension module fused_adam...
Loading extension module fused_adam...
Time to load fused_adam op: 22.532176971435547 seconds
Time to load fused_adam op: 22.533612728118896 seconds
[2024-11-08 17:07:57,341] [INFO] [logging.py:129:log_dist] [Rank 0] Using DeepSpeed Optimizer param name adamw as basic optimizer
[2024-11-08 17:07:57,341] [INFO] [logging.py:129:log_dist] [Rank 0] Removing param_group that has no 'params' in the basic Optimizer
Loading extension module fused_adam...
Time to load fused_adam op: 22.533676862716675 seconds
[2024-11-08 17:07:57,402] [INFO] [logging.py:129:log_dist] [Rank 0] DeepSpeed Basic Optimizer = FusedAdam
[2024-11-08 17:07:57,402] [INFO] [utils.py:59:is_zero_supported_optimizer] Checking ZeRO support for optimizer=FusedAdam type=<class 'deepspeed.ops.adam.fused_adam.FusedAdam'>
[2024-11-08 17:07:57,402] [INFO] [logging.py:129:log_dist] [Rank 0] Creating torch.float16 ZeRO stage 2 optimizer
[2024-11-08 17:07:57,402] [INFO] [stage_1_and_2.py:149:__init__] Reduce bucket size 500000000
[2024-11-08 17:07:57,402] [INFO] [stage_1_and_2.py:150:__init__] Allgather bucket size 500000000
[2024-11-08 17:07:57,402] [INFO] [stage_1_and_2.py:151:__init__] CPU Offload: False
[2024-11-08 17:07:57,402] [INFO] [stage_1_and_2.py:152:__init__] Round robin gradient partitioning: False
[2024-11-08 17:07:57,429] [WARNING] [lr_schedules.py:671:__init__] Using unknown warmup_type: cosine. The increasing function is set to default (log)
[2024-11-08 17:07:57,429] [WARNING] [lr_schedules.py:683:get_lr] Attempting to get learning rate from scheduler before it has started
0%| | 0/361 [00:00<?, ?batch/s][2024-11-08 17:07:57,469] [WARNING] [lr_schedules.py:671:__init__] Using unknown warmup_type: cosine. The increasing function is set to default (log)
[2024-11-08 17:07:57,469] [WARNING] [lr_schedules.py:683:get_lr] Attempting to get learning rate from scheduler before it has started
0%| | 0/361 [00:00<?, ?batch/s][2024-11-08 17:07:57,520] [WARNING] [lr_schedules.py:671:__init__] Using unknown warmup_type: cosine. The increasing function is set to default (log)
[2024-11-08 17:07:57,520] [WARNING] [lr_schedules.py:683:get_lr] Attempting to get learning rate from scheduler before it has started
0%| | 0/361 [00:00<?, ?batch/s][2024-11-08 17:07:57,647] [INFO] [utils.py:781:see_memory_usage] Before initializing optimizer states
[2024-11-08 17:07:57,648] [INFO] [utils.py:782:see_memory_usage] MA 11.74 GB Max_MA 11.75 GB CA 11.79 GB Max_CA 12 GB
[2024-11-08 17:07:57,648] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 71.46 GB, percent = 7.1%
[2024-11-08 17:07:57,774] [INFO] [utils.py:781:see_memory_usage] After initializing optimizer states
[2024-11-08 17:07:57,775] [INFO] [utils.py:782:see_memory_usage] MA 11.74 GB Max_MA 11.77 GB CA 11.81 GB Max_CA 12 GB
[2024-11-08 17:07:57,775] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 71.47 GB, percent = 7.1%
[2024-11-08 17:07:57,775] [INFO] [stage_1_and_2.py:544:__init__] optimizer state initialized
/data/user23262833/.conda/envs/chatglm/lib/python3.8/site-packages/torch/_dynamo/eval_frame.py:600: UserWarning: torch.utils.checkpoint: the use_reentrant parameter should be passed explicitly. In version 2.4 we will raise an exception if use_reentrant is not passed. use_reentrant=False is recommended, but if you need to preserve the current default behavior, you can pass use_reentrant=True. Refer to docs for more details on the differences between the two variants.
return fn(*args, **kwargs)
(the same use_reentrant warning is printed by the other ranks; repeats omitted)
[2024-11-08 17:07:57,905] [INFO] [utils.py:781:see_memory_usage] After initializing ZeRO optimizer
[2024-11-08 17:07:57,905] [INFO] [utils.py:782:see_memory_usage] MA 11.74 GB Max_MA 11.74 GB CA 11.81 GB Max_CA 12 GB
[2024-11-08 17:07:57,905] [INFO] [utils.py:789:see_memory_usage] CPU Virtual Memory: used = 71.73 GB, percent = 7.1%
[2024-11-08 17:07:57,907] [INFO] [logging.py:129:log_dist] [Rank 0] DeepSpeed Final Optimizer = DeepSpeedZeroOptimizer
[2024-11-08 17:07:57,907] [WARNING] [lr_schedules.py:671:__init__] Using unknown warmup_type: cosine. The increasing function is set to default (log)
[2024-11-08 17:07:57,907] [WARNING] [lr_schedules.py:683:get_lr] Attempting to get learning rate from scheduler before it has started
[2024-11-08 17:07:57,907] [INFO] [logging.py:129:log_dist] [Rank 0] DeepSpeed using configured LR scheduler = WarmupDecayLR
[2024-11-08 17:07:57,907] [INFO] [logging.py:129:log_dist] [Rank 0] DeepSpeed LR Scheduler = <deepspeed.runtime.lr_schedules.WarmupDecayLR object at 0x7f3f7f36bfd0>
[2024-11-08 17:07:57,907] [INFO] [logging.py:129:log_dist] [Rank 0] step=0, skipped=0, lr=[1e-05], mom=[(0.9, 0.95)]
[2024-11-08 17:07:57,910] [INFO] [config.py:999:print] DeepSpeedEngine configuration:
[2024-11-08 17:07:57,910] [INFO] [config.py:1003:print] activation_checkpointing_config {
"partition_activations": false,
"contiguous_memory_optimization": false,
"cpu_checkpointing": false,
"number_checkpoints": null,
"synchronize_checkpoint_boundary": false,
"profile": false
}
[2024-11-08 17:07:57,910] [INFO] [config.py:1003:print] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True, 'use_gds': False}
[2024-11-08 17:07:57,910] [INFO] [config.py:1003:print] amp_enabled .................. False
[2024-11-08 17:07:57,910] [INFO] [config.py:1003:print] amp_params ................... False
[2024-11-08 17:07:57,910] [INFO] [config.py:1003:print] autotuning_config ............ {
"enabled": false,
"start_step": null,
"end_step": null,
"metric_path": null,
"arg_mappings": null,
"metric": "throughput",
"model_info": null,
"results_dir": "autotuning_results",
"exps_dir": "autotuning_exps",
"overwrite": true,
"fast": true,
"start_profile_step": 3,
"end_profile_step": 5,
"tuner_type": "gridsearch",
"tuner_early_stopping": 5,
"tuner_num_trials": 50,
"model_info_path": null,
"mp_size": 1,
"max_train_batch_size": null,
"min_train_batch_size": 1,
"max_train_micro_batch_size_per_gpu": 1.024000e+03,
"min_train_micro_batch_size_per_gpu": 1,
"num_tuning_micro_batch_sizes": 3
}
[2024-11-08 17:07:57,910] [INFO] [config.py:1003:print] bfloat16_enabled ............. False
[2024-11-08 17:07:57,910] [INFO] [config.py:1003:print] bfloat16_immediate_grad_update False
[2024-11-08 17:07:57,910] [INFO] [config.py:1003:print] checkpoint_parallel_write_pipeline False
[2024-11-08 17:07:57,910] [INFO] [config.py:1003:print] checkpoint_tag_validation_enabled True
[2024-11-08 17:07:57,910] [INFO] [config.py:1003:print] checkpoint_tag_validation_fail False
[2024-11-08 17:07:57,910] [INFO] [config.py:1003:print] comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x7f3f7f36b970>
[2024-11-08 17:07:57,910] [INFO] [config.py:1003:print] communication_data_type ...... None
[2024-11-08 17:07:57,911] [INFO] [config.py:1003:print] compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
[2024-11-08 17:07:57,911] [INFO] [config.py:1003:print] curriculum_enabled_legacy .... False
[2024-11-08 17:07:57,911] [INFO] [config.py:1003:print] curriculum_params_legacy ..... False
[2024-11-08 17:07:57,911] [INFO] [config.py:1003:print] data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
[2024-11-08 17:07:57,911] [INFO] [config.py:1003:print] data_efficiency_enabled ...... False
[2024-11-08 17:07:57,911] [INFO] [config.py:1003:print] dataloader_drop_last ......... False
[2024-11-08 17:07:57,911] [INFO] [config.py:1003:print] disable_allgather ............ False
[2024-11-08 17:07:57,911] [INFO] [config.py:1003:print] dump_state ................... False
[2024-11-08 17:07:57,911] [INFO] [config.py:1003:print] dynamic_loss_scale_args ...... {'init_scale': 65536, 'scale_window': 100, 'delayed_shift': 2, 'consecutive_hysteresis': False, 'min_scale': 1}
[2024-11-08 17:07:57,911] [INFO] [config.py:1003:print] eigenvalue_enabled ........... False
[2024-11-08 17:07:57,911] [INFO] [config.py:1003:print] eigenvalue_gas_boundary_resolution 1
[2024-11-08 17:07:57,911] [INFO] [config.py:1003:print] eigenvalue_layer_name ........ bert.encoder.layer
[2024-11-08 17:07:57,911] [INFO] [config.py:1003:print] eigenvalue_layer_num ......... 0
[2024-11-08 17:07:57,911] [INFO] [config.py:1003:print] eigenvalue_max_iter .......... 100
[2024-11-08 17:07:57,911] [INFO] [config.py:1003:print] eigenvalue_stability ......... 1e-06
[2024-11-08 17:07:57,911] [INFO] [config.py:1003:print] eigenvalue_tol ............... 0.01
[2024-11-08 17:07:57,911] [INFO] [config.py:1003:print] eigenvalue_verbose ........... False
[2024-11-08 17:07:57,911] [INFO] [config.py:1003:print] elasticity_enabled ........... False
[2024-11-08 17:07:57,911] [INFO] [config.py:1003:print] flops_profiler_config ........ {
"enabled": false,
"recompute_fwd_factor": 0.0,
"profile_step": 1,
"module_depth": -1,
"top_modules": 1,
"detailed": true,
"output_file": null
}
[2024-11-08 17:07:57,911] [INFO] [config.py:1003:print] fp16_auto_cast ............... False
[2024-11-08 17:07:57,911] [INFO] [config.py:1003:print] fp16_enabled ................. True
[2024-11-08 17:07:57,911] [INFO] [config.py:1003:print] fp16_master_weights_and_gradients False
[2024-11-08 17:07:57,911] [INFO] [config.py:1003:print] global_rank .................. 0
[2024-11-08 17:07:57,911] [INFO] [config.py:1003:print] grad_accum_dtype ............. None
[2024-11-08 17:07:57,911] [INFO] [config.py:1003:print] gradient_accumulation_steps .. 4
[2024-11-08 17:07:57,911] [INFO] [config.py:1003:print] gradient_clipping ............ 1.0
[2024-11-08 17:07:57,911] [INFO] [config.py:1003:print] gradient_predivide_factor .... 1.0
[2024-11-08 17:07:57,911] [INFO] [config.py:1003:print] graph_harvesting ............. False
[2024-11-08 17:07:57,911] [INFO] [config.py:1003:print] hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8
[2024-11-08 17:07:57,911] [INFO] [config.py:1003:print] initial_dynamic_scale ........ 65536
[2024-11-08 17:07:57,911] [INFO] [config.py:1003:print] load_universal_checkpoint .... False
[2024-11-08 17:07:57,911] [INFO] [config.py:1003:print] loss_scale ................... 0
[2024-11-08 17:07:57,911] [INFO] [config.py:1003:print] memory_breakdown ............. False
[2024-11-08 17:07:57,911] [INFO] [config.py:1003:print] mics_hierarchial_params_gather False
[2024-11-08 17:07:57,911] [INFO] [config.py:1003:print] mics_shard_size .............. -1
[2024-11-08 17:07:57,911] [INFO] [config.py:1003:print] monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') comet=CometConfig(enabled=False, samples_log_interval=100, project=None, workspace=None, api_key=None, experiment_name=None, experiment_key=None, online=None, mode=None) wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName')
[2024-11-08 17:07:57,911] [INFO] [config.py:1003:print] nebula_config ................ {
"enabled": false,
"persistent_storage_path": null,
"persistent_time_interval": 100,
"num_of_version_in_retention": 2,
"enable_nebula_load": true,
"load_path": null
}
[2024-11-08 17:07:57,911] [INFO] [config.py:1003:print] optimizer_legacy_fusion ...... False
[2024-11-08 17:07:57,911] [INFO] [config.py:1003:print] optimizer_name ............... adamw
[2024-11-08 17:07:57,911] [INFO] [config.py:1003:print] optimizer_params ............. {'lr': 0.0001, 'betas': (0.9, 0.95), 'eps': 1e-08, 'weight_decay': 0.1}
[2024-11-08 17:07:57,911] [INFO] [config.py:1003:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0, 'pipe_partitioned': True, 'grad_partitioned': True}
[2024-11-08 17:07:57,912] [INFO] [config.py:1003:print] pld_enabled .................. False
[2024-11-08 17:07:57,912] [INFO] [config.py:1003:print] pld_params ................... False
[2024-11-08 17:07:57,912] [INFO] [config.py:1003:print] prescale_gradients ........... False
[2024-11-08 17:07:57,912] [INFO] [config.py:1003:print] scheduler_name ............... WarmupDecayLR
[2024-11-08 17:07:57,912] [INFO] [config.py:1003:print] scheduler_params ............. {'last_batch_iteration': -1, 'total_num_steps': 182, 'warmup_min_lr': 1e-05, 'warmup_max_lr': 0.0001, 'warmup_num_steps': 18, 'warmup_type': 'cosine'}
[2024-11-08 17:07:57,912] [INFO] [config.py:1003:print] seq_parallel_communication_data_type torch.float32
[2024-11-08 17:07:57,912] [INFO] [config.py:1003:print] sparse_attention ............. None
[2024-11-08 17:07:57,912] [INFO] [config.py:1003:print] sparse_gradients_enabled ..... False
[2024-11-08 17:07:57,912] [INFO] [config.py:1003:print] steps_per_print .............. 1
[2024-11-08 17:07:57,912] [INFO] [config.py:1003:print] timers_config ................ enabled=True synchronized=True
[2024-11-08 17:07:57,912] [INFO] [config.py:1003:print] train_batch_size ............. 16
[2024-11-08 17:07:57,912] [INFO] [config.py:1003:print] train_micro_batch_size_per_gpu 1
[2024-11-08 17:07:57,912] [INFO] [config.py:1003:print] use_data_before_expert_parallel_ False
[2024-11-08 17:07:57,912] [INFO] [config.py:1003:print] use_node_local_storage ....... False
[2024-11-08 17:07:57,912] [INFO] [config.py:1003:print] wall_clock_breakdown ......... False
[2024-11-08 17:07:57,912] [INFO] [config.py:1003:print] weight_quantization_config ... None
[2024-11-08 17:07:57,912] [INFO] [config.py:1003:print] world_size ................... 4
[2024-11-08 17:07:57,912] [INFO] [config.py:1003:print] zero_allow_untested_optimizer False
[2024-11-08 17:07:57,912] [INFO] [config.py:1003:print] zero_config .................. stage=2 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=500000000 use_multi_rank_bucket_allreduce=True allgather_partitions=True allgather_bucket_size=500000000 overlap_comm=False load_from_fp32_weights=True elastic_checkpoint=False offload_param=DeepSpeedZeroOffloadParamConfig(device='none', nvme_path=None, buffer_count=5, buffer_size=100000000, max_in_cpu=1000000000, pin_memory=False) offload_optimizer=DeepSpeedZeroOffloadOptimizerConfig(device='none', nvme_path=None, buffer_count=4, pin_memory=False, pipeline_read=False, pipeline_write=False, fast_init=False, ratio=1.0) sub_group_size=1000000000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=50000000 param_persistence_threshold=100000 model_persistence_threshold=9223372036854775807 max_live_parameters=1000000000 max_reuse_distance=1000000000 gather_16bit_weights_on_model_save=False use_all_reduce_for_fetch_params=False stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True pipeline_loading_checkpoint=False override_module_apply=True
[2024-11-08 17:07:57,912] [INFO] [config.py:1003:print] zero_enabled ................. True
[2024-11-08 17:07:57,912] [INFO] [config.py:1003:print] zero_force_ds_cpu_optimizer .. True
[2024-11-08 17:07:57,912] [INFO] [config.py:1003:print] zero_optimization_stage ...... 2
[2024-11-08 17:07:57,912] [INFO] [config.py:989:print_user_config] json = {
"train_batch_size": 16,
"train_micro_batch_size_per_gpu": 1,
"steps_per_print": 1,
"zero_optimization": {
"stage": 2,
"offload_param": {
"device": "auto"
},
"offload_optimizer": {
"device": "auto"
}
},
"bf16": {
"enabled": false
},
"fp16": {
"enabled": true,
"loss_scale": 0,
"loss_scale_window": 100
},
"gradient_clipping": 1.0,
"prescale_gradients": false,
"wall_clock_breakdown": false,
"scheduler": {
"type": "WarmupDecayLR",
"params": {
"last_batch_iteration": -1,
"total_num_steps": 182,
"warmup_min_lr": 1e-05,
"warmup_max_lr": 0.0001,
"warmup_num_steps": 18,
"warmup_type": "cosine"
}
},
"optimizer": {
"type": "AdamW",
"params": {
"lr": 0.0001,
"betas": [0.9, 0.95],
"eps": 1e-08,
"weight_decay": 0.1
}
},
"gradient_accumulation_steps": 4
}
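The batch and scheduler numbers in the dump above fit together; the sketch below is my reconstruction from the logged values, not code taken from the repo. Note also that this DeepSpeed version does not recognize `"warmup_type": "cosine"` and falls back to the default log-shaped warmup, which is what the repeated `Using unknown warmup_type: cosine` warnings earlier in the log are about.

```python
import math

# Assumed relationships, reconstructed from the config dump (not from the repo's code).
micro_bs, world_size, grad_accum = 1, 4, 4   # --per_device_train_batch_size, 4 GPUs, --gradient_accumulation_steps
train_batch_size = micro_bs * world_size * grad_accum
print(train_batch_size)                      # 16, as in "train_batch_size" above

micro_batches_per_epoch, epochs = 361, 2     # "Total Micro Batches 361", --num_train_epochs 2
total_num_steps = math.ceil(micro_batches_per_epoch / grad_accum) * epochs
print(total_num_steps)                       # 182, as in scheduler_params

warmup_num_steps = int(0.1 * total_num_steps)  # --warmup_ratio 0.1
print(warmup_num_steps)                        # 18
```

Also worth noticing: the user config requests `"device": "auto"` for both offload blocks, but since the ds_file is `ds_zero2_no_offload.json` on A100s with ample memory, the effective config resolves to `CPU Offload: False`.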
Beginning of Epoch 1/2, Total Micro Batches 361
0%| | 0/361 [00:00<?, ?batch/s]/data/user23262833/.conda/envs/chatglm/lib/python3.8/site-packages/torch/_dynamo/eval_frame.py:600: UserWarning: torch.utils.checkpoint: the use_reentrant parameter should be passed explicitly. In version 2.4 we will raise an exception if use_reentrant is not passed. use_reentrant=False is recommended, but if you need to preserve the current default behavior, you can pass use_reentrant=True. Refer to docs for more details on the differences between the two variants.
return fn(*args, **kwargs)
/data/user23262833/.conda/envs/chatglm/lib/python3.8/site-packages/torch/utils/checkpoint.py:295: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
with torch.enable_grad(), device_autocast_ctx, torch.cpu.amp.autocast(**ctx.cpu_autocast_kwargs):
  0%|▎                                                                                                                             | 1/361 [00:01<07:24,  1.23s/batch]
1%|█ | 4/361 [00:02<02:29, 2.39batch/s][2024-11-08 17:07:59,606] [INFO] [logging.py:129:log_dist] [Rank 0] step=1, skipped=0, lr=[1e-05], mom=[(0.9, 0.95)]
2%|█▊ | 7/361 [00:02<01:23, 4.23batch/s][2024-11-08 17:08:00,245] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1
2%|██ | 8/361 [00:02<01:14, 4.74batch/s][2024-11-08 17:08:00,245] [INFO] [logging.py:129:log_dist] [Rank 0] step=2, skipped=1, lr=[1e-05], mom=[(0.9, 0.95)]
3%|██▋ | 11/361 [00:03<01:01, 5.73batch/s][2024-11-08 17:08:00,888] [INFO] [logging.py:129:log_dist] [Rank 0] step=3, skipped=1, lr=[3.1583121991131835e-05], mom=[(0.9, 0.95)]
3%|██▉ | 12/361 [00:03<01:00, 5.80batch/s][2024-11-08 17:08:00,889] [INFO] [timer.py:264:stop] epoch=0/micro_step=12/global_step=3, RunningAvgSamplesPerSec=24.94289685931983, CurrSamplesPerSec=24.942857975123964, MemAllocated=11.85GB, MaxMemAllocated=13.15GB
4%|███▋ | 15/361 [00:03<00:57, 6.06batch/s][2024-11-08 17:08:01,537] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768
[2024-11-08 17:08:01,537] [INFO] [logging.py:129:log_dist] [Rank 0] step=4, skipped=2, lr=[3.1583121991131835e-05], mom=[(0.9, 0.95)]
4%|███▉ | 16/361 [00:04<00:56, 6.07batch/s][2024-11-08 17:08:01,537] [INFO] [timer.py:264:stop] epoch=0/micro_step=16/global_step=4, RunningAvgSamplesPerSec=24.835147120484145, CurrSamplesPerSec=24.72828608625941, MemAllocated=11.84GB, MaxMemAllocated=13.15GB
5%|████▋ | 19/361 [00:04<00:54, 6.26batch/s][2024-11-08 17:08:02,172] [INFO] [logging.py:129:log_dist] [Rank 0] step=5, skipped=2, lr=[4.42084390044341e-05], mom=[(0.9, 0.95)]
6%|████▉ | 20/361 [00:04<00:55, 6.19batch/s][2024-11-08 17:08:02,172] [INFO] [timer.py:264:stop] epoch=0/micro_step=20/global_step=5, RunningAvgSamplesPerSec=24.9752069765532, CurrSamplesPerSec=25.260080148523734, MemAllocated=11.85GB, MaxMemAllocated=13.23GB
6%|█████▋ | 23/361 [00:05<00:53, 6.33batch/s][2024-11-08 17:08:02,806] [INFO] [logging.py:129:log_dist] [Rank 0] step=6, skipped=2, lr=[5.3166243982263665e-05], mom=[(0.9, 0.95)]
7%|█████▉ | 24/361 [00:05<00:54, 6.20batch/s][2024-11-08 17:08:02,807] [INFO] [timer.py:264:stop] epoch=0/micro_step=24/global_step=6, RunningAvgSamplesPerSec=25.054683771665495, CurrSamplesPerSec=25.296138375681792, MemAllocated=11.87GB, MaxMemAllocated=13.23GB
7%|██████▋ | 27/361 [00:05<00:54, 6.12batch/s][2024-11-08 17:08:03,479] [INFO] [logging.py:129:log_dist] [Rank 0] step=7, skipped=2, lr=[6.011445732659027e-05], mom=[(0.9, 0.95)]
8%|██████▉ | 28/361 [00:06<00:54, 6.14batch/s][2024-11-08 17:08:03,479] [INFO] [timer.py:264:stop] epoch=0/micro_step=28/global_step=7, RunningAvgSamplesPerSec=24.80362778508688, CurrSamplesPerSec=23.84774262275905, MemAllocated=11.86GB, MaxMemAllocated=13.23GB
9%|███████▋ | 31/361 [00:06<00:59, 5.54batch/s][2024-11-08 17:08:04,209] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 32768, reducing to 16384
[2024-11-08 17:08:04,210] [INFO] [logging.py:129:log_dist] [Rank 0] step=8, skipped=3, lr=[6.011445732659027e-05], mom=[(0.9, 0.95)]
9%|███████▉ | 32/361 [00:06<00:56, 5.78batch/s][2024-11-08 17:08:04,210] [INFO] [timer.py:264:stop] epoch=0/micro_step=32/global_step=8, RunningAvgSamplesPerSec=24.27660825454828, CurrSamplesPerSec=21.945156691266938, MemAllocated=11.86GB, MaxMemAllocated=13.23GB
10%|████████▋ | 35/361 [00:07<00:55, 5.90batch/s][2024-11-08 17:08:04,875] [INFO] [logging.py:129:log_dist] [Rank 0] step=9, skipped=3, lr=[6.579156099556593e-05], mom=[(0.9, 0.95)]
10%|████████▉ | 36/361 [00:07<00:54, 6.01batch/s][2024-11-08 17:08:04,876] [INFO] [timer.py:264:stop] epoch=0/micro_step=36/global_step=9, RunningAvgSamplesPerSec=24.250528090425334, CurrSamplesPerSec=24.095180026910562, MemAllocated=11.86GB, MaxMemAllocated=13.51GB
11%|█████████▋ | 39/361 [00:07<00:50, 6.33batch/s][2024-11-08 17:08:05,539] [INFO] [logging.py:129:log_dist] [Rank 0] step=10, skipped=3, lr=[7.059148375517367e-05], mom=[(0.9, 0.95)]
11%|█████████▉ | 40/361 [00:08<00:55, 5.83batch/s][2024-11-08 17:08:05,539] [INFO] [timer.py:264:stop] epoch=0/micro_step=40/global_step=10, RunningAvgSamplesPerSec=24.24051542399338, CurrSamplesPerSec=24.17062108622753, MemAllocated=11.92GB, MaxMemAllocated=13.54GB
Epoch: 0, step: 40, global_step:10, loss: 2.80465087890625
step: 40-10-10
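The `OVERFLOW!` lines above show fp16 dynamic loss scaling at work: the first overflow only consumes hysteresis (the scale stays at 65536), and later overflows halve the scale (65536 → 32768 → 16384), each time skipping the optimizer step. A toy sketch of that policy follows; it is an illustration matching the log, not DeepSpeed's actual `loss_scaler.py`:

```python
class ToyLossScaler:
    """Toy dynamic fp16 loss scaler mirroring the behavior seen in the log."""
    def __init__(self, scale=65536, hysteresis=2, scale_window=100):
        self.scale = scale
        self.hysteresis = hysteresis
        self.scale_window = scale_window  # "loss_scale_window": 100 in the config
        self.clean_steps = 0

    def step(self, overflow):
        if overflow:
            self.clean_steps = 0
            if self.hysteresis > 1:
                self.hysteresis -= 1      # "Reducing hysteresis to 1"
            else:
                self.scale //= 2          # 65536 -> 32768 -> 16384
            return False                  # skip this optimizer step
        self.clean_steps += 1
        if self.clean_steps >= self.scale_window:
            self.scale *= 2               # grow back after a window of clean steps
            self.clean_steps = 0
        return True

scaler = ToyLossScaler()
scaler.step(True)   # 1st overflow: hysteresis 2 -> 1, scale stays 65536
scaler.step(True)   # 2nd overflow: 65536 -> 32768
scaler.step(True)   # 3rd overflow: 32768 -> 16384
print(scaler.scale)
```

Skipped steps also explain the `skipped=3` counter in the later log lines: those three overflowed steps never updated the weights.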
12%|██████████▉ | 44/361 [00:08<00:50, 6.27batch/s][2024-11-08 17:08:06,154] [INFO] [logging.py:129:log_dist] [Rank 0] step=11, skipped=3, lr=[7.47493659733955e-05], mom=[(0.9, 0.95)]
[2024-11-08 17:08:06,154] [INFO] [timer.py:264:stop] epoch=0/micro_step=44/global_step=11, RunningAvgSamplesPerSec=24.434034837465752, CurrSamplesPerSec=26.100970947922384, MemAllocated=11.85GB, MaxMemAllocated=13.54GB
13%|███████████▋ | 47/361 [00:09<00:49, 6.28batch/s][2024-11-08 17:08:06,804] [INFO] [logging.py:129:log_dist] [Rank 0] step=12, skipped=3, lr=[7.84168780088682e-05], mom=[(0.9, 0.95)]
13%|███████████▉ | 48/361 [00:09<00:51, 6.09batch/s][2024-11-08 17:08:06,804] [INFO] [timer.py:264:stop] epoch=0/micro_step=48/global_step=12, RunningAvgSamplesPerSec=24.460144252051272, CurrSamplesPerSec=24.697626213107018, MemAllocated=11.86GB, MaxMemAllocated=13.54GB
14%|████████████▋ | 51/361 [00:09<00:50, 6.20batch/s][2024-11-08 17:08:07,458] [INFO] [logging.py:129:log_dist] [Rank 0] step=13, skipped=3, lr=[8.169757931772212e-05], mom=[(0.9, 0.95)]
14%|████████████▉ | 52/361 [00:09<00:50, 6.13batch/s][2024-11-08 17:08:07,458] [INFO] [timer.py:264:stop] epoch=0/micro_step=52/global_step=13, RunningAvgSamplesPerSec=24.464291806234066, CurrSamplesPerSec=24.50580730623016, MemAllocated=11.84GB, MaxMemAllocated=13.54GB
15%|█████████████▋ | 55/361 [00:10<00:48, 6.25batch/s][2024-11-08 17:08:08,102] [INFO] [logging.py:129:log_dist] [Rank 0] step=14, skipped=3, lr=[8.466533464502744e-05], mom=[(0.9, 0.95)]
16%|█████████████▉ | 56/361 [00:10<00:49, 6.11batch/s][2024-11-08 17:08:08,103] [INFO] [timer.py:264:stop] epoch=0/micro_step=56/global_step=14, RunningAvgSamplesPerSec=24.49999975661437, CurrSamplesPerSec=24.899740446823003, MemAllocated=11.87GB, MaxMemAllocated=13.54GB
16%|██████████████▋ | 59/361 [00:11<00:48, 6.25batch/s][2024-11-08 17:08:08,748] [INFO] [logging.py:129:log_dist] [Rank 0] step=15, skipped=3, lr=[8.737468298669776e-05], mom=[(0.9, 0.95)]
17%|██████████████▉ | 60/361 [00:11<00:49, 6.10batch/s][2024-11-08 17:08:08,748] [INFO] [timer.py:264:stop] epoch=0/micro_step=60/global_step=15, RunningAvgSamplesPerSec=24.5267633792578, CurrSamplesPerSec=24.85250970584881, MemAllocated=11.84GB, MaxMemAllocated=13.54GB
17%|███████████████▋ | 63/361 [00:11<00:47, 6.21batch/s][2024-11-08 17:08:09,395] [INFO] [logging.py:129:log_dist] [Rank 0] step=16, skipped=3, lr=[8.986704185746869e-05], mom=[(0.9, 0.95)]
18%|███████████████▉ | 64/361 [00:11<00:48, 6.11batch/s][2024-11-08 17:08:09,395] [INFO] [timer.py:264:stop] epoch=0/micro_step=64/global_step=16, RunningAvgSamplesPerSec=24.54535939232108, CurrSamplesPerSec=24.78966078230782, MemAllocated=11.87GB, MaxMemAllocated=13.54GB
19%|████████████████▋ | 67/361 [00:12<00:47, 6.25batch/s][2024-11-08 17:08:10,038] [INFO] [logging.py:129:log_dist] [Rank 0] step=17, skipped=3, lr=[9.21746057463055e-05], mom=[(0.9, 0.95)]
19%|████████████████▉ | 68/361 [00:12<00:47, 6.14batch/s][2024-11-08 17:08:10,038] [INFO] [timer.py:264:stop] epoch=0/micro_step=68/global_step=17, RunningAvgSamplesPerSec=24.572142005731475, CurrSamplesPerSec=24.95329187573621, MemAllocated=11.84GB, MaxMemAllocated=13.54GB
20%|█████████████████▋ | 71/361 [00:12<00:46, 6.27batch/s][2024-11-08 17:08:10,680] [INFO] [logging.py:129:log_dist] [Rank 0] step=18, skipped=3, lr=[9.432289633102436e-05], mom=[(0.9, 0.95)]
20%|█████████████████▉ | 72/361 [00:13<00:46, 6.15batch/s][2024-11-08 17:08:10,680] [INFO] [timer.py:264:stop] epoch=0/micro_step=72/global_step=18, RunningAvgSamplesPerSec=24.59836095205259, CurrSamplesPerSec=24.9984295693146, MemAllocated=11.85GB, MaxMemAllocated=13.54GB
21%|██████████████████▋ | 75/361 [00:13<00:45, 6.25batch/s][2024-11-08 17:08:11,324] [INFO] [logging.py:129:log_dist] [Rank 0] step=19, skipped=3, lr=[9.633248796452733e-05], mom=[(0.9, 0.95)]
21%|██████████████████▉ | 76/361 [00:13<00:46, 6.15batch/s][2024-11-08 17:08:11,324] [INFO] [timer.py:264:stop] epoch=0/micro_step=76/global_step=19, RunningAvgSamplesPerSec=24.61640502613241, CurrSamplesPerSec=24.9087144346569, MemAllocated=11.84GB, MaxMemAllocated=13.54GB
22%|███████████████████▋ | 79/361 [00:14<00:44, 6.27batch/s][2024-11-08 17:08:11,966] [INFO] [logging.py:129:log_dist] [Rank 0] step=20, skipped=3, lr=[9.822020913692442e-05], mom=[(0.9, 0.95)]
22%|███████████████████▉ | 80/361 [00:14<00:45, 6.15batch/s][2024-11-08 17:08:11,967] [INFO] [timer.py:264:stop] epoch=0/micro_step=80/global_step=20, RunningAvgSamplesPerSec=24.635272610247558, CurrSamplesPerSec=24.960466185853033, MemAllocated=11.84GB, MaxMemAllocated=13.54GB
Epoch: 0, step: 80, global_step:20, loss: 1.0513336181640625
step: 80-20-20
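The periodic `Epoch: ..., step: ..., global_step: ...` lines are consistent with `step` counting micro batches and `global_step` counting optimizer steps (`step / gradient_accumulation_steps`); with `--show_loss_step 10`, a line is printed every 10 global steps. Assuming that reading:

```python
# Reproduce the "step: N-M-M" lines, assuming the third field equals global_step.
gradient_accumulation_steps = 4   # from the launch command
show_loss_step = 10               # --show_loss_step 10

for micro_step in (40, 80, 120, 160, 200, 240):
    global_step = micro_step // gradient_accumulation_steps
    if global_step % show_loss_step == 0:
        print(f"step: {micro_step}-{global_step}-{global_step}")
```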
23%|████████████████████▋ | 83/361 [00:14<00:45, 6.15batch/s][2024-11-08 17:08:12,619] [INFO] [logging.py:129:log_dist] [Rank 0] step=21, skipped=3, lr=[0.0001], mom=[(0.9, 0.95)]
23%|████████████████████▉ | 84/361 [00:15<00:45, 6.07batch/s][2024-11-08 17:08:12,620] [INFO] [timer.py:264:stop] epoch=0/micro_step=84/global_step=21, RunningAvgSamplesPerSec=24.632095273455004, CurrSamplesPerSec=24.575005291526246, MemAllocated=11.87GB, MaxMemAllocated=13.54GB
24%|█████████████████████▋ | 87/361 [00:15<00:43, 6.26batch/s][2024-11-08 17:08:13,258] [INFO] [logging.py:129:log_dist] [Rank 0] step=22, skipped=3, lr=[0.0001], mom=[(0.9, 0.95)]
24%|█████████████████████▉ | 88/361 [00:15<00:44, 6.14batch/s][2024-11-08 17:08:13,259] [INFO] [timer.py:264:stop] epoch=0/micro_step=88/global_step=22, RunningAvgSamplesPerSec=24.65548972625532, CurrSamplesPerSec=25.108543306074118, MemAllocated=11.89GB, MaxMemAllocated=13.54GB
25%|██████████████████████▋ | 91/361 [00:16<00:43, 6.25batch/s][2024-11-08 17:08:13,903] [INFO] [logging.py:129:log_dist] [Rank 0] step=23, skipped=3, lr=[9.945121951219513e-05], mom=[(0.9, 0.95)]
25%|██████████████████████▉ | 92/361 [00:16<00:43, 6.13batch/s][2024-11-08 17:08:13,904] [INFO] [timer.py:264:stop] epoch=0/micro_step=92/global_step=23, RunningAvgSamplesPerSec=24.66540866536722, CurrSamplesPerSec=24.865438368177724, MemAllocated=11.84GB, MaxMemAllocated=13.54GB
26%|███████████████████████▋ | 95/361 [00:16<00:42, 6.19batch/s][2024-11-08 17:08:14,553] [INFO] [logging.py:129:log_dist] [Rank 0] step=24, skipped=3, lr=[9.890243902439024e-05], mom=[(0.9, 0.95)]
27%|███████████████████████▉ | 96/361 [00:17<00:43, 6.10batch/s][2024-11-08 17:08:14,553] [INFO] [timer.py:264:stop] epoch=0/micro_step=96/global_step=24, RunningAvgSamplesPerSec=24.666616667201023, CurrSamplesPerSec=24.691973961041388, MemAllocated=11.84GB, MaxMemAllocated=13.54GB
27%|████████████████████████▋ | 99/361 [00:17<00:42, 6.23batch/s][2024-11-08 17:08:15,196] [INFO] [logging.py:129:log_dist] [Rank 0] step=25, skipped=3, lr=[9.835365853658537e-05], mom=[(0.9, 0.95)]
28%|████████████████████████▋ | 100/361 [00:17<00:42, 6.17batch/s][2024-11-08 17:08:15,196] [INFO] [timer.py:264:stop] epoch=0/micro_step=100/global_step=25, RunningAvgSamplesPerSec=24.67917941894309, CurrSamplesPerSec=24.958795214794428, MemAllocated=11.84GB, MaxMemAllocated=13.54GB
29%|█████████████████████████▍ | 103/361 [00:18<00:41, 6.24batch/s][2024-11-08 17:08:15,843] [INFO] [logging.py:129:log_dist] [Rank 0] step=26, skipped=3, lr=[9.78048780487805e-05], mom=[(0.9, 0.95)]
29%|█████████████████████████▋ | 104/361 [00:18<00:42, 6.12batch/s][2024-11-08 17:08:15,843] [INFO] [timer.py:264:stop] epoch=0/micro_step=104/global_step=26, RunningAvgSamplesPerSec=24.683684910610012, CurrSamplesPerSec=24.787728770005838, MemAllocated=11.84GB, MaxMemAllocated=13.54GB
30%|██████████████████████████▍ | 107/361 [00:18<00:41, 6.13batch/s][2024-11-08 17:08:16,498] [INFO] [logging.py:129:log_dist] [Rank 0] step=27, skipped=3, lr=[9.72560975609756e-05], mom=[(0.9, 0.95)]
30%|██████████████████████████▋ | 108/361 [00:18<00:41, 6.13batch/s][2024-11-08 17:08:16,498] [INFO] [timer.py:264:stop] epoch=0/micro_step=108/global_step=27, RunningAvgSamplesPerSec=24.676375494969918, CurrSamplesPerSec=24.502201525772378, MemAllocated=11.84GB, MaxMemAllocated=13.54GB
31%|███████████████████████████▌ | 112/361 [00:19<00:40, 6.16batch/s][2024-11-08 17:08:17,161] [INFO] [logging.py:129:log_dist] [Rank 0] step=28, skipped=3, lr=[9.670731707317073e-05], mom=[(0.9, 0.95)]
31%|███████████████████████████▌ | 112/361 [00:19<00:40, 6.18batch/s][2024-11-08 17:08:17,161] [INFO] [timer.py:264:stop] epoch=0/micro_step=112/global_step=28, RunningAvgSamplesPerSec=24.657827236779745, CurrSamplesPerSec=24.20297931342735, MemAllocated=11.84GB, MaxMemAllocated=13.54GB
32%|████████████████████████████▎ | 115/361 [00:20<00:38, 6.34batch/s][2024-11-08 17:08:17,791] [INFO] [logging.py:129:log_dist] [Rank 0] step=29, skipped=3, lr=[9.615853658536586e-05], mom=[(0.9, 0.95)]
32%|████████████████████████████▌ | 116/361 [00:20<00:39, 6.23batch/s][2024-11-08 17:08:17,791] [INFO] [timer.py:264:stop] epoch=0/micro_step=116/global_step=29, RunningAvgSamplesPerSec=24.686717221270342, CurrSamplesPerSec=25.46232454849932, MemAllocated=11.84GB, MaxMemAllocated=13.54GB
33%|█████████████████████████████▎ | 119/361 [00:20<00:38, 6.28batch/s][2024-11-08 17:08:18,475] [INFO] [logging.py:129:log_dist] [Rank 0] step=30, skipped=3, lr=[9.560975609756097e-05], mom=[(0.9, 0.95)]
33%|█████████████████████████████▌ | 120/361 [00:20<00:41, 5.74batch/s][2024-11-08 17:08:18,475] [INFO] [timer.py:264:stop] epoch=0/micro_step=120/global_step=30, RunningAvgSamplesPerSec=24.63974983126007, CurrSamplesPerSec=23.435849408164277, MemAllocated=11.93GB, MaxMemAllocated=13.59GB
Epoch: 0, step: 120, global_step:30, loss: 0.63695068359375
step: 120-30-30
34%|██████████████████████████████▎ | 123/361 [00:21<00:39, 6.02batch/s][2024-11-08 17:08:19,114] [INFO] [logging.py:129:log_dist] [Rank 0] step=31, skipped=3, lr=[9.50609756097561e-05], mom=[(0.9, 0.95)]
34%|██████████████████████████████▌ | 124/361 [00:21<00:38, 6.13batch/s][2024-11-08 17:08:19,115] [INFO] [timer.py:264:stop] epoch=0/micro_step=124/global_step=31, RunningAvgSamplesPerSec=24.655700739928054, CurrSamplesPerSec=25.11082632182666, MemAllocated=11.84GB, MaxMemAllocated=13.59GB
35%|███████████████████████████████▎ | 127/361 [00:22<00:37, 6.31batch/s][2024-11-08 17:08:19,748] [INFO] [logging.py:129:log_dist] [Rank 0] step=32, skipped=3, lr=[9.451219512195122e-05], mom=[(0.9, 0.95)]
35%|███████████████████████████████▌ | 128/361 [00:22<00:37, 6.23batch/s][2024-11-08 17:08:19,748] [INFO] [timer.py:264:stop] epoch=0/micro_step=128/global_step=32, RunningAvgSamplesPerSec=24.677341273427523, CurrSamplesPerSec=25.321833134118396, MemAllocated=11.85GB, MaxMemAllocated=13.59GB
36%|████████████████████████████████▎ | 131/361 [00:22<00:36, 6.26batch/s][2024-11-08 17:08:20,402] [INFO] [logging.py:129:log_dist] [Rank 0] step=33, skipped=3, lr=[9.396341463414635e-05], mom=[(0.9, 0.95)]
37%|████████████████████████████████▌ | 132/361 [00:22<00:37, 6.04batch/s][2024-11-08 17:08:20,403] [INFO] [timer.py:264:stop] epoch=0/micro_step=132/global_step=33, RunningAvgSamplesPerSec=24.67190802034226, CurrSamplesPerSec=24.509978092803205, MemAllocated=11.86GB, MaxMemAllocated=13.59GB
37%|█████████████████████████████████▎ | 135/361 [00:23<00:36, 6.20batch/s][2024-11-08 17:08:21,046] [INFO] [logging.py:129:log_dist] [Rank 0] step=34, skipped=3, lr=[9.341463414634147e-05], mom=[(0.9, 0.95)]
38%|█████████████████████████████████▌ | 136/361 [00:23<00:36, 6.11batch/s][2024-11-08 17:08:21,046] [INFO] [timer.py:264:stop] epoch=0/micro_step=136/global_step=34, RunningAvgSamplesPerSec=24.679909740831736, CurrSamplesPerSec=24.93052477432655, MemAllocated=11.84GB, MaxMemAllocated=13.59GB
39%|██████████████████████████████████▎ | 139/361 [00:23<00:35, 6.23batch/s][2024-11-08 17:08:21,691] [INFO] [logging.py:129:log_dist] [Rank 0] step=35, skipped=3, lr=[9.28658536585366e-05], mom=[(0.9, 0.95)]
39%|██████████████████████████████████▌ | 140/361 [00:24<00:36, 6.12batch/s][2024-11-08 17:08:21,692] [INFO] [timer.py:264:stop] epoch=0/micro_step=140/global_step=35, RunningAvgSamplesPerSec=24.685173654028997, CurrSamplesPerSec=24.854774012752404, MemAllocated=11.86GB, MaxMemAllocated=13.59GB
40%|███████████████████████████████████▎ | 143/361 [00:24<00:35, 6.22batch/s][2024-11-08 17:08:22,339] [INFO] [logging.py:129:log_dist] [Rank 0] step=36, skipped=3, lr=[9.231707317073171e-05], mom=[(0.9, 0.95)]
40%|███████████████████████████████████▌ | 144/361 [00:24<00:35, 6.10batch/s][2024-11-08 17:08:22,340] [INFO] [timer.py:264:stop] epoch=0/micro_step=144/global_step=36, RunningAvgSamplesPerSec=24.687244609245255, CurrSamplesPerSec=24.755743308295603, MemAllocated=11.87GB, MaxMemAllocated=13.59GB
41%|████████████████████████████████████▏ | 147/361 [00:25<00:34, 6.22batch/s][2024-11-08 17:08:22,987] [INFO] [logging.py:129:log_dist] [Rank 0] step=37, skipped=3, lr=[9.176829268292684e-05], mom=[(0.9, 0.95)]
41%|████████████████████████████████████▍ | 148/361 [00:25<00:34, 6.10batch/s][2024-11-08 17:08:22,987] [INFO] [timer.py:264:stop] epoch=0/micro_step=148/global_step=37, RunningAvgSamplesPerSec=24.689844795127897, CurrSamplesPerSec=24.778539811911255, MemAllocated=11.86GB, MaxMemAllocated=13.59GB
42%|█████████████████████████████████████▏ | 151/361 [00:25<00:33, 6.23batch/s][2024-11-08 17:08:23,631] [INFO] [logging.py:129:log_dist] [Rank 0] step=38, skipped=3, lr=[9.121951219512196e-05], mom=[(0.9, 0.95)]
42%|█████████████████████████████████████▍ | 152/361 [00:26<00:34, 6.12batch/s][2024-11-08 17:08:23,632] [INFO] [timer.py:264:stop] epoch=0/micro_step=152/global_step=38, RunningAvgSamplesPerSec=24.695252617229894, CurrSamplesPerSec=24.885991653558662, MemAllocated=11.85GB, MaxMemAllocated=13.59GB
43%|██████████████████████████████████████▏ | 155/361 [00:26<00:34, 5.94batch/s][2024-11-08 17:08:24,304] [INFO] [logging.py:129:log_dist] [Rank 0] step=39, skipped=3, lr=[9.067073170731708e-05], mom=[(0.9, 0.95)]
43%|██████████████████████████████████████▍ | 156/361 [00:26<00:34, 6.02batch/s][2024-11-08 17:08:24,305] [INFO] [timer.py:264:stop] epoch=0/micro_step=156/global_step=39, RunningAvgSamplesPerSec=24.671247584507885, CurrSamplesPerSec=23.837061050469533, MemAllocated=11.87GB, MaxMemAllocated=13.59GB
44%|███████████████████████████████████████▏ | 159/361 [00:27<00:32, 6.22batch/s][2024-11-08 17:08:24,942] [INFO] [logging.py:129:log_dist] [Rank 0] step=40, skipped=3, lr=[9.01219512195122e-05], mom=[(0.9, 0.95)]
44%|███████████████████████████████████████▍ | 160/361 [00:27<00:32, 6.16batch/s][2024-11-08 17:08:24,942] [INFO] [timer.py:264:stop] epoch=0/micro_step=160/global_step=40, RunningAvgSamplesPerSec=24.683715313497338, CurrSamplesPerSec=25.154009215600396, MemAllocated=11.89GB, MaxMemAllocated=13.59GB
44%|███████████████████████████████████████▍ | 160/361 [00:27<00:32, 6.18batch/s]Epoch: 0, step: 160, global_step:40, loss: 0.5588729858398438
step: 160-40-40
45%|████████████████████████████████████████▏ | 163/361 [00:28<00:36, 5.37batch/s][2024-11-08 17:08:25,703] [INFO] [logging.py:129:log_dist] [Rank 0] step=41, skipped=3, lr=[8.957317073170733e-05], mom=[(0.9, 0.95)]
45%|████████████████████████████████████████▍ | 164/361 [00:28<00:35, 5.62batch/s][2024-11-08 17:08:25,704] [INFO] [timer.py:264:stop] epoch=0/micro_step=164/global_step=41, RunningAvgSamplesPerSec=24.57558946839633, CurrSamplesPerSec=21.068545958778678, MemAllocated=11.86GB, MaxMemAllocated=13.59GB
46%|█████████████████████████████████████████▏ | 167/361 [00:28<00:31, 6.13batch/s][2024-11-08 17:08:26,334] [INFO] [logging.py:129:log_dist] [Rank 0] step=42, skipped=3, lr=[8.902439024390244e-05], mom=[(0.9, 0.95)]
47%|█████████████████████████████████████████▍ | 168/361 [00:28<00:31, 6.15batch/s][2024-11-08 17:08:26,334] [INFO] [timer.py:264:stop] epoch=0/micro_step=168/global_step=42, RunningAvgSamplesPerSec=24.596622265090193, CurrSamplesPerSec=25.445911656995555, MemAllocated=11.86GB, MaxMemAllocated=13.59GB
47%|██████████████████████████████████████████▏ | 171/361 [00:29<00:29, 6.36batch/s][2024-11-08 17:08:26,964] [INFO] [logging.py:129:log_dist] [Rank 0] step=43, skipped=3, lr=[8.847560975609757e-05], mom=[(0.9, 0.95)]
48%|██████████████████████████████████████████▍ | 172/361 [00:29<00:30, 6.25batch/s][2024-11-08 17:08:26,964] [INFO] [timer.py:264:stop] epoch=0/micro_step=172/global_step=43, RunningAvgSamplesPerSec=24.61707810732763, CurrSamplesPerSec=25.46413125926745, MemAllocated=11.87GB, MaxMemAllocated=13.59GB
48%|███████████████████████████████████████████▏ | 175/361 [00:30<00:35, 5.23batch/s][2024-11-08 17:08:27,705] [INFO] [logging.py:129:log_dist] [Rank 0] step=44, skipped=3, lr=[8.792682926829269e-05], mom=[(0.9, 0.95)]
49%|███████████████████████████████████████████▍ | 176/361 [00:30<00:33, 5.49batch/s][2024-11-08 17:08:27,706] [INFO] [timer.py:264:stop] epoch=0/micro_step=176/global_step=44, RunningAvgSamplesPerSec=24.53609952878578, CurrSamplesPerSec=21.620143253720826, MemAllocated=11.89GB, MaxMemAllocated=13.59GB
50%|████████████████████████████████████████████▏ | 179/361 [00:30<00:29, 6.10batch/s][2024-11-08 17:08:28,337] [INFO] [logging.py:129:log_dist] [Rank 0] step=45, skipped=3, lr=[8.73780487804878e-05], mom=[(0.9, 0.95)]
50%|████████████████████████████████████████████▍ | 180/361 [00:30<00:29, 6.07batch/s][2024-11-08 17:08:28,337] [INFO] [timer.py:264:stop] epoch=0/micro_step=180/global_step=45, RunningAvgSamplesPerSec=24.55543687106703, CurrSamplesPerSec=25.396030819918465, MemAllocated=11.86GB, MaxMemAllocated=13.59GB
51%|█████████████████████████████████████████████ | 183/361 [00:31<00:28, 6.27batch/s][2024-11-08 17:08:29,047] [INFO] [logging.py:129:log_dist] [Rank 0] step=46, skipped=3, lr=[8.682926829268293e-05], mom=[(0.9, 0.95)]
51%|█████████████████████████████████████████████▎ | 184/361 [00:31<00:32, 5.41batch/s][2024-11-08 17:08:29,048] [INFO] [timer.py:264:stop] epoch=0/micro_step=184/global_step=46, RunningAvgSamplesPerSec=24.506817641612027, CurrSamplesPerSec=22.584002276220073, MemAllocated=11.86GB, MaxMemAllocated=13.59GB
52%|██████████████████████████████████████████████ | 187/361 [00:32<00:40, 4.31batch/s][2024-11-08 17:08:29,922] [INFO] [logging.py:129:log_dist] [Rank 0] step=47, skipped=3, lr=[8.628048780487805e-05], mom=[(0.9, 0.95)]
52%|██████████████████████████████████████████████▎ | 188/361 [00:32<00:37, 4.60batch/s][2024-11-08 17:08:29,922] [INFO] [timer.py:264:stop] epoch=0/micro_step=188/global_step=47, RunningAvgSamplesPerSec=24.324470367465565, CurrSamplesPerSec=18.325019562140483, MemAllocated=11.88GB, MaxMemAllocated=14.2GB
53%|███████████████████████████████████████████████▎ | 192/361 [00:33<00:28, 5.85batch/s][2024-11-08 17:08:30,539] [INFO] [logging.py:129:log_dist] [Rank 0] step=48, skipped=3, lr=[8.573170731707317e-05], mom=[(0.9, 0.95)]
53%|███████████████████████████████████████████████▎ | 192/361 [00:33<00:28, 5.86batch/s][2024-11-08 17:08:30,540] [INFO] [timer.py:264:stop] epoch=0/micro_step=192/global_step=48, RunningAvgSamplesPerSec=24.35841899375584, CurrSamplesPerSec=25.990715432612774, MemAllocated=11.84GB, MaxMemAllocated=14.2GB
54%|████████████████████████████████████████████████ | 195/361 [00:33<00:26, 6.23batch/s][2024-11-08 17:08:31,166] [INFO] [logging.py:129:log_dist] [Rank 0] step=49, skipped=3, lr=[8.518292682926829e-05], mom=[(0.9, 0.95)]
54%|████████████████████████████████████████████████▎ | 196/361 [00:33<00:26, 6.19batch/s][2024-11-08 17:08:31,167] [INFO] [timer.py:264:stop] epoch=0/micro_step=196/global_step=49, RunningAvgSamplesPerSec=24.38357435651873, CurrSamplesPerSec=25.599646544880166, MemAllocated=11.84GB, MaxMemAllocated=14.2GB
55%|█████████████████████████████████████████████████ | 199/361 [00:34<00:25, 6.27batch/s][2024-11-08 17:08:31,978] [INFO] [logging.py:129:log_dist] [Rank 0] step=50, skipped=3, lr=[8.463414634146342e-05], mom=[(0.9, 0.95)]
55%|█████████████████████████████████████████████████▎ | 200/361 [00:34<00:34, 4.68batch/s][2024-11-08 17:08:31,978] [INFO] [timer.py:264:stop] epoch=0/micro_step=200/global_step=50, RunningAvgSamplesPerSec=24.265309913210697, CurrSamplesPerSec=19.76067713356112, MemAllocated=11.84GB, MaxMemAllocated=14.2GB
Epoch: 0, step: 200, global_step:50, loss: 0.4884662628173828
step: 200-50-50
56%|██████████████████████████████████████████████████ | 203/361 [00:35<00:29, 5.32batch/s][2024-11-08 17:08:32,639] [INFO] [logging.py:129:log_dist] [Rank 0] step=51, skipped=3, lr=[8.408536585365853e-05], mom=[(0.9, 0.95)]
57%|██████████████████████████████████████████████████▎ | 204/361 [00:35<00:28, 5.60batch/s][2024-11-08 17:08:32,639] [INFO] [timer.py:264:stop] epoch=0/micro_step=204/global_step=51, RunningAvgSamplesPerSec=24.265671759177305, CurrSamplesPerSec=24.28301621164498, MemAllocated=11.84GB, MaxMemAllocated=14.2GB
57%|███████████████████████████████████████████████████ | 207/361 [00:35<00:24, 6.22batch/s][2024-11-08 17:08:33,259] [INFO] [logging.py:129:log_dist] [Rank 0] step=52, skipped=3, lr=[8.353658536585366e-05], mom=[(0.9, 0.95)]
58%|███████████████████████████████████████████████████▎ | 208/361 [00:35<00:24, 6.19batch/s][2024-11-08 17:08:33,259] [INFO] [timer.py:264:stop] epoch=0/micro_step=208/global_step=52, RunningAvgSamplesPerSec=24.29584166028783, CurrSamplesPerSec=25.871988334456812, MemAllocated=11.86GB, MaxMemAllocated=14.2GB
58%|████████████████████████████████████████████████████ | 211/361 [00:36<00:25, 5.88batch/s][2024-11-08 17:08:33,935] [INFO] [logging.py:129:log_dist] [Rank 0] step=53, skipped=3, lr=[8.298780487804878e-05], mom=[(0.9, 0.95)]
59%|████████████████████████████████████████████████████▎ | 212/361 [00:36<00:25, 5.90batch/s][2024-11-08 17:08:33,936] [INFO] [timer.py:264:stop] epoch=0/micro_step=212/global_step=53, RunningAvgSamplesPerSec=24.284062201957138, CurrSamplesPerSec=23.70927273835748, MemAllocated=11.86GB, MaxMemAllocated=14.2GB
60%|█████████████████████████████████████████████████████ | 215/361 [00:36<00:26, 5.48batch/s][2024-11-08 17:08:34,678] [INFO] [logging.py:129:log_dist] [Rank 0] step=54, skipped=3, lr=[8.243902439024391e-05], mom=[(0.9, 0.95)]
60%|█████████████████████████████████████████████████████▎ | 216/361 [00:37<00:25, 5.63batch/s][2024-11-08 17:08:34,679] [INFO] [timer.py:264:stop] epoch=0/micro_step=216/global_step=54, RunningAvgSamplesPerSec=24.22564018204901, CurrSamplesPerSec=21.57809235210358, MemAllocated=11.84GB, MaxMemAllocated=14.2GB
61%|█████████████████████████████████████████████████████▉ | 219/361 [00:37<00:25, 5.65batch/s][2024-11-08 17:08:35,388] [INFO] [logging.py:129:log_dist] [Rank 0] step=55, skipped=3, lr=[8.189024390243903e-05], mom=[(0.9, 0.95)]
61%|██████████████████████████████████████████████████████▏ | 220/361 [00:37<00:24, 5.67batch/s][2024-11-08 17:08:35,388] [INFO] [timer.py:264:stop] epoch=0/micro_step=220/global_step=55, RunningAvgSamplesPerSec=24.192839683025944, CurrSamplesPerSec=22.601526600078866, MemAllocated=11.84GB, MaxMemAllocated=14.2GB
62%|██████████████████████████████████████████████████████▉ | 223/361 [00:38<00:22, 6.18batch/s][2024-11-08 17:08:36,015] [INFO] [logging.py:129:log_dist] [Rank 0] step=56, skipped=3, lr=[8.134146341463416e-05], mom=[(0.9, 0.95)]
62%|███████████████████████████████████████████████████████▏ | 224/361 [00:38<00:22, 6.18batch/s][2024-11-08 17:08:36,015] [INFO] [timer.py:264:stop] epoch=0/micro_step=224/global_step=56, RunningAvgSamplesPerSec=24.217078252276735, CurrSamplesPerSec=25.575080976046234, MemAllocated=11.87GB, MaxMemAllocated=14.2GB
63%|███████████████████████████████████████████████████████▉ | 227/361 [00:38<00:21, 6.34batch/s][2024-11-08 17:08:36,650] [INFO] [logging.py:129:log_dist] [Rank 0] step=57, skipped=3, lr=[8.079268292682927e-05], mom=[(0.9, 0.95)]
63%|████████████████████████████████████████████████████████▏ | 228/361 [00:39<00:21, 6.21batch/s][2024-11-08 17:08:36,651] [INFO] [timer.py:264:stop] epoch=0/micro_step=228/global_step=57, RunningAvgSamplesPerSec=24.235047266709465, CurrSamplesPerSec=25.246585989011606, MemAllocated=11.85GB, MaxMemAllocated=14.2GB
64%|████████████████████████████████████████████████████████▉ | 231/361 [00:39<00:20, 6.24batch/s][2024-11-08 17:08:37,299] [INFO] [logging.py:129:log_dist] [Rank 0] step=58, skipped=3, lr=[8.02439024390244e-05], mom=[(0.9, 0.95)]
64%|█████████████████████████████████████████████████████████▏ | 232/361 [00:39<00:21, 6.10batch/s][2024-11-08 17:08:37,299] [INFO] [timer.py:264:stop] epoch=0/micro_step=232/global_step=58, RunningAvgSamplesPerSec=24.244018627263653, CurrSamplesPerSec=24.747846536180898, MemAllocated=11.85GB, MaxMemAllocated=14.2GB
65%|█████████████████████████████████████████████████████████▉ | 235/361 [00:40<00:20, 6.22batch/s][2024-11-08 17:08:37,944] [INFO] [logging.py:129:log_dist] [Rank 0] step=59, skipped=3, lr=[7.969512195121952e-05], mom=[(0.9, 0.95)]
65%|██████████████████████████████████████████████████████████▏ | 236/361 [00:40<00:20, 6.12batch/s][2024-11-08 17:08:37,944] [INFO] [timer.py:264:stop] epoch=0/micro_step=236/global_step=59, RunningAvgSamplesPerSec=24.254728013704472, CurrSamplesPerSec=24.869898375842865, MemAllocated=11.88GB, MaxMemAllocated=14.2GB
66%|██████████████████████████████████████████████████████████▉ | 239/361 [00:40<00:19, 6.23batch/s][2024-11-08 17:08:38,590] [INFO] [logging.py:129:log_dist] [Rank 0] step=60, skipped=3, lr=[7.914634146341464e-05], mom=[(0.9, 0.95)]
66%|███████████████████████████████████████████████████████████▏ | 240/361 [00:41<00:19, 6.11batch/s][2024-11-08 17:08:38,590] [INFO] [timer.py:264:stop] epoch=0/micro_step=240/global_step=60, RunningAvgSamplesPerSec=24.26434072788457, CurrSamplesPerSec=24.82511303049506, MemAllocated=11.87GB, MaxMemAllocated=14.2GB
Epoch: 0, step: 240, global_step:60, loss: 0.4170543670654297
step: 240-60-60
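A back-of-envelope check on the progress bar: at the roughly 6.2 micro batches/s shown above, one epoch of 361 micro batches should take about a minute, which matches the bar's elapsed/remaining estimates:

```python
# Rough epoch-time estimate from the observed throughput (an approximation).
micro_batches_per_epoch = 361   # "Total Micro Batches 361"
batches_per_sec = 6.2           # typical rate from the progress bar

eta_seconds = micro_batches_per_epoch / batches_per_sec
print(round(eta_seconds))       # ~58 s per epoch
```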
67%|███████████████████████████████████████████████████████████▉ | 243/361 [00:41<00:18, 6.27batch/s][2024-11-08 17:08:39,231] [INFO] [logging.py:129:log_dist] [Rank 0] step=61, skipped=3, lr=[7.859756097560976e-05], mom=[(0.9, 0.95)]
68%|████████████████████████████████████████████████████████████▏ | 244/361 [00:41<00:19, 6.15batch/s][2024-11-08 17:08:39,232] [INFO] [timer.py:264:stop] epoch=0/micro_step=244/global_step=61, RunningAvgSamplesPerSec=24.276778290208945, CurrSamplesPerSec=25.020602599411824, MemAllocated=11.84GB, MaxMemAllocated=14.2GB
68%|████████████████████████████████████████████████████████████▉ | 247/361 [00:42<00:18, 6.23batch/s][2024-11-08 17:08:39,877] [INFO] [logging.py:129:log_dist] [Rank 0] step=62, skipped=3, lr=[7.804878048780489e-05], mom=[(0.9, 0.95)]
69%|█████████████████████████████████████████████████████████████▏ | 248/361 [00:42<00:18, 6.13batch/s][2024-11-08 17:08:39,878] [INFO] [timer.py:264:stop] epoch=0/micro_step=248/global_step=62, RunningAvgSamplesPerSec=24.285927834409847, CurrSamplesPerSec=24.838197056688, MemAllocated=11.88GB, MaxMemAllocated=14.2GB
70%|█████████████████████████████████████████████████████████████▉ | 251/361 [00:42<00:17, 6.23batch/s][2024-11-08 17:08:40,527] [INFO] [logging.py:129:log_dist] [Rank 0] step=63, skipped=3, lr=[7.75e-05], mom=[(0.9, 0.95)]
70%|██████████████████████████████████████████████████████████████▏ | 252/361 [00:42<00:17, 6.20batch/s][2024-11-08 17:08:40,527] [INFO] [timer.py:264:stop] epoch=0/micro_step=252/global_step=63, RunningAvgSamplesPerSec=24.292672281080844, CurrSamplesPerSec=24.704272280923412, MemAllocated=11.84GB, MaxMemAllocated=14.2GB
71%|██████████████████████████████████████████████████████████████▊ | 255/361 [00:43<00:17, 6.20batch/s][2024-11-08 17:08:41,319] [INFO] [logging.py:129:log_dist] [Rank 0] step=64, skipped=3, lr=[7.695121951219513e-05], mom=[(0.9, 0.95)]
71%|███████████████████████████████████████████████████████████████ | 256/361 [00:43<00:21, 4.83batch/s][2024-11-08 17:08:41,320] [INFO] [timer.py:264:stop] epoch=0/micro_step=256/global_step=64, RunningAvgSamplesPerSec=24.214169828122238, CurrSamplesPerSec=20.2269398328965, MemAllocated=11.85GB, MaxMemAllocated=14.2GB
72%|███████████████████████████████████████████████████████████████▊ | 259/361 [00:44<00:17, 5.86batch/s][2024-11-08 17:08:41,934] [INFO] [logging.py:129:log_dist] [Rank 0] step=65, skipped=3, lr=[7.640243902439025e-05], mom=[(0.9, 0.95)]
72%|████████████████████████████████████████████████████████████████ | 260/361 [00:44<00:16, 5.97batch/s][2024-11-08 17:08:41,934] [INFO] [timer.py:264:stop] epoch=0/micro_step=260/global_step=65, RunningAvgSamplesPerSec=24.242053426183123, CurrSamplesPerSec=26.105854769282715, MemAllocated=11.86GB, MaxMemAllocated=14.2GB
73%|████████████████████████████████████████████████████████████████▊ | 263/361 [00:44<00:15, 6.23batch/s][2024-11-08 17:08:42,568] [INFO] [logging.py:129:log_dist] [Rank 0] step=66, skipped=3, lr=[7.585365853658536e-05], mom=[(0.9, 0.95)]
73%|█████████████████████████████████████████████████████████████████ | 264/361 [00:45<00:15, 6.15batch/s][2024-11-08 17:08:42,568] [INFO] [timer.py:264:stop] epoch=0/micro_step=264/global_step=66, RunningAvgSamplesPerSec=24.25788093862369, CurrSamplesPerSec=25.298427024206166, MemAllocated=11.85GB, MaxMemAllocated=14.2GB
74%|█████████████████████████████████████████████████████████████████▊ | 267/361 [00:45<00:20, 4.68batch/s][2024-11-08 17:08:43,426] [INFO] [logging.py:129:log_dist] [Rank 0] step=67, skipped=3, lr=[7.530487804878049e-05], mom=[(0.9, 0.95)]
74%|██████████████████████████████████████████████████████████████████ | 268/361 [00:45<00:18, 5.07batch/s][2024-11-08 17:08:43,427] [INFO] [timer.py:264:stop] epoch=0/micro_step=268/global_step=67, RunningAvgSamplesPerSec=24.146962869406238, CurrSamplesPerSec=18.680366123113842, MemAllocated=11.88GB, MaxMemAllocated=14.2GB
75%|██████████████████████████████████████████████████████████████████▊ | 271/361 [00:46<00:15, 5.77batch/s][2024-11-08 17:08:44,059] [INFO] [logging.py:129:log_dist] [Rank 0] step=68, skipped=3, lr=[7.475609756097562e-05], mom=[(0.9, 0.95)]
75%|███████████████████████████████████████████████████████████████████ | 272/361 [00:46<00:15, 5.91batch/s][2024-11-08 17:08:44,060] [INFO] [timer.py:264:stop] epoch=0/micro_step=272/global_step=68, RunningAvgSamplesPerSec=24.164145085155738, CurrSamplesPerSec=25.335943513637066, MemAllocated=11.87GB, MaxMemAllocated=14.2GB
76%|████████████████████████████████████████████████████████████████████ | 276/361 [00:47<00:14, 5.89batch/s][2024-11-08 17:08:44,716] [INFO] [logging.py:129:log_dist] [Rank 0] step=69, skipped=3, lr=[7.420731707317073e-05], mom=[(0.9, 0.95)]
76%|████████████████████████████████████████████████████████████████████ | 276/361 [00:47<00:14, 5.89batch/s][2024-11-08 17:08:44,717] [INFO] [timer.py:264:stop] epoch=0/micro_step=276/global_step=69, RunningAvgSamplesPerSec=24.167977622087502, CurrSamplesPerSec=24.423604154884174, MemAllocated=11.89GB, MaxMemAllocated=14.2GB
77%|████████████████████████████████████████████████████████████████████▊ | 279/361 [00:47<00:13, 6.18batch/s][2024-11-08 17:08:45,355] [INFO] [logging.py:129:log_dist] [Rank 0] step=70, skipped=3, lr=[7.365853658536585e-05], mom=[(0.9, 0.95)]
78%|█████████████████████████████████████████████████████████████████████ | 280/361 [00:47<00:13, 6.20batch/s][2024-11-08 17:08:45,356] [INFO] [timer.py:264:stop] epoch=0/micro_step=280/global_step=70, RunningAvgSamplesPerSec=24.181127760827227, CurrSamplesPerSec=25.095980058902583, MemAllocated=11.85GB, MaxMemAllocated=14.2GB
Epoch: 0, step: 280, global_step:70, loss: 0.5200233459472656
step: 280-70-70
78%|█████████████████████████████████████████████████████████████████████▊ | 283/361 [00:48<00:12, 6.29batch/s][2024-11-08 17:08:45,994] [INFO] [logging.py:129:log_dist] [Rank 0] step=71, skipped=3, lr=[7.310975609756098e-05], mom=[(0.9, 0.95)]
79%|██████████████████████████████████████████████████████████████████████ | 284/361 [00:48<00:12, 6.21batch/s][2024-11-08 17:08:45,995] [INFO] [timer.py:264:stop] epoch=0/micro_step=284/global_step=71, RunningAvgSamplesPerSec=24.194437327990286, CurrSamplesPerSec=25.135157276674356, MemAllocated=11.89GB, MaxMemAllocated=14.2GB
80%|██████████████████████████████████████████████████████████████████████▊ | 287/361 [00:48<00:11, 6.25batch/s][2024-11-08 17:08:46,640] [INFO] [logging.py:129:log_dist] [Rank 0] step=72, skipped=3, lr=[7.256097560975609e-05], mom=[(0.9, 0.95)]
80%|███████████████████████████████████████████████████████████████████████ | 288/361 [00:49<00:11, 6.15batch/s][2024-11-08 17:08:46,640] [INFO] [timer.py:264:stop] epoch=0/micro_step=288/global_step=72, RunningAvgSamplesPerSec=24.203680960088075, CurrSamplesPerSec=24.858972356027664, MemAllocated=11.88GB, MaxMemAllocated=14.2GB
81%|███████████████████████████████████████████████████████████████████████▋ | 291/361 [00:49<00:11, 6.23batch/s][2024-11-08 17:08:47,289] [INFO] [logging.py:129:log_dist] [Rank 0] step=73, skipped=3, lr=[7.201219512195122e-05], mom=[(0.9, 0.95)]
81%|███████████████████████████████████████████████████████████████████████▉ | 292/361 [00:49<00:11, 6.08batch/s][2024-11-08 17:08:47,289] [INFO] [timer.py:264:stop] epoch=0/micro_step=292/global_step=73, RunningAvgSamplesPerSec=24.210649595873104, CurrSamplesPerSec=24.708592778834852, MemAllocated=11.86GB, MaxMemAllocated=14.2GB
82%|████████████████████████████████████████████████████████████████████████▋ | 295/361 [00:50<00:10, 6.22batch/s][2024-11-08 17:08:47,935] [INFO] [logging.py:129:log_dist] [Rank 0] step=74, skipped=3, lr=[7.146341463414634e-05], mom=[(0.9, 0.95)]
82%|████████████████████████████████████████████████████████████████████████▉ | 296/361 [00:50<00:10, 6.11batch/s][2024-11-08 17:08:47,935] [INFO] [timer.py:264:stop] epoch=0/micro_step=296/global_step=74, RunningAvgSamplesPerSec=24.21927870718334, CurrSamplesPerSec=24.848037531477832, MemAllocated=11.87GB, MaxMemAllocated=14.2GB
83%|█████████████████████████████████████████████████████████████████████████▋ | 299/361 [00:50<00:09, 6.24batch/s][2024-11-08 17:08:48,579] [INFO] [logging.py:129:log_dist] [Rank 0] step=75, skipped=3, lr=[7.091463414634147e-05], mom=[(0.9, 0.95)]
83%|█████████████████████████████████████████████████████████████████████████▉ | 300/361 [00:51<00:09, 6.12batch/s][2024-11-08 17:08:48,579] [INFO] [timer.py:264:stop] epoch=0/micro_step=300/global_step=75, RunningAvgSamplesPerSec=24.228320984735063, CurrSamplesPerSec=24.897560310107867, MemAllocated=11.86GB, MaxMemAllocated=14.2GB
84%|██████████████████████████████████████████████████████████████████████████▋ | 303/361 [00:51<00:09, 6.05batch/s][2024-11-08 17:08:49,234] [INFO] [logging.py:129:log_dist] [Rank 0] step=76, skipped=3, lr=[7.03658536585366e-05], mom=[(0.9, 0.95)]
84%|██████████████████████████████████████████████████████████████████████████▉ | 304/361 [00:51<00:09, 6.07batch/s][2024-11-08 17:08:49,234] [INFO] [timer.py:264:stop] epoch=0/micro_step=304/global_step=76, RunningAvgSamplesPerSec=24.23187036991251, CurrSamplesPerSec=24.493777263344555, MemAllocated=11.85GB, MaxMemAllocated=14.2GB
85%|███████████████████████████████████████████████████████████████████████████▋ | 307/361 [00:52<00:08, 6.10batch/s][2024-11-08 17:08:49,901] [INFO] [logging.py:129:log_dist] [Rank 0] step=77, skipped=3, lr=[6.981707317073172e-05], mom=[(0.9, 0.95)]
85%|███████████████████████████████████████████████████████████████████████████▉ | 308/361 [00:52<00:08, 6.15batch/s][2024-11-08 17:08:49,902] [INFO] [timer.py:264:stop] epoch=0/micro_step=308/global_step=77, RunningAvgSamplesPerSec=24.2291999369994, CurrSamplesPerSec=24.033171902684792, MemAllocated=11.84GB, MaxMemAllocated=14.2GB
86%|████████████████████████████████████████████████████████████████████████████▋ | 311/361 [00:52<00:08, 5.97batch/s][2024-11-08 17:08:50,585] [INFO] [logging.py:129:log_dist] [Rank 0] step=78, skipped=3, lr=[6.926829268292683e-05], mom=[(0.9, 0.95)]
86%|████████████████████████████████████████████████████████████████████████████▉ | 312/361 [00:53<00:08, 6.06batch/s][2024-11-08 17:08:50,586] [INFO] [timer.py:264:stop] epoch=0/micro_step=312/global_step=78, RunningAvgSamplesPerSec=24.218644523668612, CurrSamplesPerSec=23.452335992702185, MemAllocated=11.84GB, MaxMemAllocated=14.2GB
87%|█████████████████████████████████████████████████████████████████████████████▋ | 315/361 [00:53<00:07, 6.35batch/s][2024-11-08 17:08:51,213] [INFO] [logging.py:129:log_dist] [Rank 0] step=79, skipped=3, lr=[6.871951219512196e-05], mom=[(0.9, 0.95)]
88%|█████████████████████████████████████████████████████████████████████████████▉ | 316/361 [00:53<00:07, 6.20batch/s][2024-11-08 17:08:51,214] [INFO] [timer.py:264:stop] epoch=0/micro_step=316/global_step=79, RunningAvgSamplesPerSec=24.234843698916755, CurrSamplesPerSec=25.53274362341541, MemAllocated=11.86GB, MaxMemAllocated=14.2GB
88%|██████████████████████████████████████████████████████████████████████████████▋ | 319/361 [00:54<00:06, 6.32batch/s][2024-11-08 17:08:51,859] [INFO] [logging.py:129:log_dist] [Rank 0] step=80, skipped=3, lr=[6.817073170731708e-05], mom=[(0.9, 0.95)]
89%|██████████████████████████████████████████████████████████████████████████████▉ | 320/361 [00:54<00:06, 6.13batch/s][2024-11-08 17:08:51,859] [INFO] [timer.py:264:stop] epoch=0/micro_step=320/global_step=80, RunningAvgSamplesPerSec=24.242568842686993, CurrSamplesPerSec=24.852528113184572, MemAllocated=11.84GB, MaxMemAllocated=14.2GB
Epoch: 0, step: 320, global_step:80, loss: 0.4399711608886719
step: 320-80-80
89%|███████████████████████████████████████████████████████████████████████████████▋ | 323/361 [00:54<00:06, 6.29batch/s][2024-11-08 17:08:52,495] [INFO] [logging.py:129:log_dist] [Rank 0] step=81, skipped=3, lr=[6.76219512195122e-05], mom=[(0.9, 0.95)]
90%|███████████████████████████████████████████████████████████████████████████████▉ | 324/361 [00:54<00:05, 6.21batch/s][2024-11-08 17:08:52,495] [INFO] [timer.py:264:stop] epoch=0/micro_step=324/global_step=81, RunningAvgSamplesPerSec=24.254721410318105, CurrSamplesPerSec=25.241648071599638, MemAllocated=11.84GB, MaxMemAllocated=14.2GB
91%|████████████████████████████████████████████████████████████████████████████████▌ | 327/361 [00:55<00:06, 5.16batch/s][2024-11-08 17:08:53,261] [INFO] [logging.py:129:log_dist] [Rank 0] step=82, skipped=3, lr=[6.707317073170732e-05], mom=[(0.9, 0.95)]
91%|████████████████████████████████████████████████████████████████████████████████▊ | 328/361 [00:55<00:06, 5.29batch/s][2024-11-08 17:08:53,261] [INFO] [timer.py:264:stop] epoch=0/micro_step=328/global_step=82, RunningAvgSamplesPerSec=24.206790449709434, CurrSamplesPerSec=20.938004834655874, MemAllocated=11.9GB, MaxMemAllocated=14.2GB
92%|█████████████████████████████████████████████████████████████████████████████████▌ | 331/361 [00:56<00:05, 5.83batch/s][2024-11-08 17:08:53,925] [INFO] [logging.py:129:log_dist] [Rank 0] step=83, skipped=3, lr=[6.652439024390245e-05], mom=[(0.9, 0.95)]
92%|█████████████████████████████████████████████████████████████████████████████████▊ | 332/361 [00:56<00:04, 5.94batch/s][2024-11-08 17:08:53,925] [INFO] [timer.py:264:stop] epoch=0/micro_step=332/global_step=83, RunningAvgSamplesPerSec=24.20615787157799, CurrSamplesPerSec=24.155622048063705, MemAllocated=11.87GB, MaxMemAllocated=14.2GB
93%|██████████████████████████████████████████████████████████████████████████████████▌ | 335/361 [00:56<00:04, 5.99batch/s][2024-11-08 17:08:54,586] [INFO] [logging.py:129:log_dist] [Rank 0] step=84, skipped=3, lr=[6.597560975609756e-05], mom=[(0.9, 0.95)]
93%|██████████████████████████████████████████████████████████████████████████████████▊ | 336/361 [00:57<00:04, 6.06batch/s][2024-11-08 17:08:54,587] [INFO] [timer.py:264:stop] epoch=0/micro_step=336/global_step=84, RunningAvgSamplesPerSec=24.206624369159524, CurrSamplesPerSec=24.244433742932937, MemAllocated=11.84GB, MaxMemAllocated=14.2GB
94%|███████████████████████████████████████████████████████████████████████████████████▌ | 339/361 [00:57<00:03, 6.22batch/s][2024-11-08 17:08:55,265] [INFO] [logging.py:129:log_dist] [Rank 0] step=85, skipped=3, lr=[6.542682926829269e-05], mom=[(0.9, 0.95)]
94%|███████████████████████████████████████████████████████████████████████████████████▊ | 340/361 [00:57<00:03, 5.70batch/s][2024-11-08 17:08:55,265] [INFO] [timer.py:264:stop] epoch=0/micro_step=340/global_step=85, RunningAvgSamplesPerSec=24.19957376091189, CurrSamplesPerSec=23.635039835322225, MemAllocated=11.92GB, MaxMemAllocated=14.2GB
95%|████████████████████████████████████████████████████████████████████████████████████▌ | 343/361 [00:58<00:02, 6.25batch/s][2024-11-08 17:08:55,881] [INFO] [logging.py:129:log_dist] [Rank 0] step=86, skipped=3, lr=[6.487804878048781e-05], mom=[(0.9, 0.95)]
95%|████████████████████████████████████████████████████████████████████████████████████▊ | 344/361 [00:58<00:02, 6.26batch/s][2024-11-08 17:08:55,882] [INFO] [timer.py:264:stop] epoch=0/micro_step=344/global_step=86, RunningAvgSamplesPerSec=24.219952048904265, CurrSamplesPerSec=26.039940593444225, MemAllocated=11.84GB, MaxMemAllocated=14.2GB
96%|█████████████████████████████████████████████████████████████████████████████████████▌ | 347/361 [00:58<00:02, 6.40batch/s][2024-11-08 17:08:56,512] [INFO] [logging.py:129:log_dist] [Rank 0] step=87, skipped=3, lr=[6.432926829268292e-05], mom=[(0.9, 0.95)]
96%|█████████████████████████████████████████████████████████████████████████████████████▊ | 348/361 [00:59<00:02, 6.23batch/s][2024-11-08 17:08:56,512] [INFO] [timer.py:264:stop] epoch=0/micro_step=348/global_step=87, RunningAvgSamplesPerSec=24.23366289933785, CurrSamplesPerSec=25.44351907413834, MemAllocated=11.87GB, MaxMemAllocated=14.2GB
97%|██████████████████████████████████████████████████████████████████████████████████████▌ | 351/361 [00:59<00:02, 4.86batch/s][2024-11-08 17:08:57,300] [INFO] [logging.py:129:log_dist] [Rank 0] step=88, skipped=3, lr=[6.378048780487805e-05], mom=[(0.9, 0.95)]
98%|██████████████████████████████████████████████████████████████████████████████████████▊ | 352/361 [00:59<00:01, 5.21batch/s][2024-11-08 17:08:57,301] [INFO] [timer.py:264:stop] epoch=0/micro_step=352/global_step=88, RunningAvgSamplesPerSec=24.17973149697895, CurrSamplesPerSec=20.333337893081097, MemAllocated=11.85GB, MaxMemAllocated=14.2GB
98%|███████████████████████████████████████████████████████████████████████████████████████▌ | 355/361 [01:00<00:01, 6.00batch/s][2024-11-08 17:08:57,921] [INFO] [logging.py:129:log_dist] [Rank 0] step=89, skipped=3, lr=[6.323170731707318e-05], mom=[(0.9, 0.95)]
99%|███████████████████████████████████████████████████████████████████████████████████████▊ | 356/361 [01:00<00:00, 6.07batch/s][2024-11-08 17:08:57,922] [INFO] [timer.py:264:stop] epoch=0/micro_step=356/global_step=89, RunningAvgSamplesPerSec=24.19756159737619, CurrSamplesPerSec=25.835941859469933, MemAllocated=11.84GB, MaxMemAllocated=14.2GB
99%|████████████████████████████████████████████████████████████████████████████████████████▌| 359/361 [01:00<00:00, 6.30batch/s][2024-11-08 17:08:58,556] [INFO] [logging.py:129:log_dist] [Rank 0] step=90, skipped=3, lr=[6.26829268292683e-05], mom=[(0.9, 0.95)]
100%|████████████████████████████████████████████████████████████████████████████████████████▊| 360/361 [01:01<00:00, 6.19batch/s][2024-11-08 17:08:58,556] [INFO] [timer.py:264:stop] epoch=0/micro_step=360/global_step=90, RunningAvgSamplesPerSec=24.20938413604395, CurrSamplesPerSec=25.284091656936724, MemAllocated=11.84GB, MaxMemAllocated=14.2GB
Epoch: 0, step: 360, global_step:90, loss: 0.3734106540679932
step: 360-90-90
100%|█████████████████████████████████████████████████████████████████████████████████████████| 361/361 [01:00<00:00, 5.94batch/s]
0%| | 0/361 [00:00<?, ?batch/s]Beginning of Epoch 2/2, Total Micro Batches 361
1%|▊ | 3/361 [00:00<01:07, 5.31batch/s][2024-11-08 17:08:59,306] [INFO] [logging.py:129:log_dist] [Rank 0] step=91, skipped=3, lr=[6.213414634146341e-05], mom=[(0.9, 0.95)]
1%|▊ | 3/361 [00:00<01:07, 5.29batch/s][2024-11-08 17:08:59,306] [INFO] [timer.py:264:stop] epoch=0/micro_step=364/global_step=91, RunningAvgSamplesPerSec=24.21735028560466, CurrSamplesPerSec=24.939474628516038, MemAllocated=11.84GB, MaxMemAllocated=14.2GB
2%|█▌ | 6/361 [00:01<00:57, 6.22batch/s][2024-11-08 17:08:59,931] [INFO] [logging.py:129:log_dist] [Rank 0] step=92, skipped=3, lr=[6.158536585365854e-05], mom=[(0.9, 0.95)]
2%|█▊ | 7/361 [00:01<00:58, 6.10batch/s][2024-11-08 17:08:59,931] [INFO] [timer.py:264:stop] epoch=0/micro_step=368/global_step=92, RunningAvgSamplesPerSec=24.232647561483713, CurrSamplesPerSec=25.67607310135441, MemAllocated=11.84GB, MaxMemAllocated=14.2GB
3%|██▋ | 11/361 [00:01<00:56, 6.18batch/s][2024-11-08 17:09:00,573] [INFO] [logging.py:129:log_dist] [Rank 0] step=93, skipped=3, lr=[6.103658536585367e-05], mom=[(0.9, 0.95)]
3%|██▋ | 11/361 [00:01<00:56, 6.20batch/s][2024-11-08 17:09:00,573] [INFO] [timer.py:264:stop] epoch=0/micro_step=372/global_step=93, RunningAvgSamplesPerSec=24.240892982577318, CurrSamplesPerSec=25.00664550180473, MemAllocated=11.86GB, MaxMemAllocated=14.2GB
4%|███▍ | 14/361 [00:02<00:54, 6.31batch/s][2024-11-08 17:09:01,210] [INFO] [logging.py:129:log_dist] [Rank 0] step=94, skipped=3, lr=[6.0487804878048785e-05], mom=[(0.9, 0.95)]
4%|███▋ | 15/361 [00:02<00:55, 6.21batch/s][2024-11-08 17:09:01,210] [INFO] [timer.py:264:stop] epoch=0/micro_step=376/global_step=94, RunningAvgSamplesPerSec=24.25060444204992, CurrSamplesPerSec=25.16811246438491, MemAllocated=11.84GB, MaxMemAllocated=14.2GB
5%|████▍ | 18/361 [00:02<00:54, 6.28batch/s][2024-11-08 17:09:01,851] [INFO] [logging.py:129:log_dist] [Rank 0] step=95, skipped=3, lr=[5.993902439024391e-05], mom=[(0.9, 0.95)]
5%|████▋ | 19/361 [00:03<00:55, 6.20batch/s][2024-11-08 17:09:01,851] [INFO] [timer.py:264:stop] epoch=0/micro_step=380/global_step=95, RunningAvgSamplesPerSec=24.258815075841476, CurrSamplesPerSec=25.038703806193816, MemAllocated=11.84GB, MaxMemAllocated=14.2GB
6%|█████▍ | 22/361 [00:03<00:53, 6.29batch/s][2024-11-08 17:09:02,491] [INFO] [logging.py:129:log_dist] [Rank 0] step=96, skipped=3, lr=[5.939024390243903e-05], mom=[(0.9, 0.95)]
6%|█████▋ | 23/361 [00:03<00:54, 6.17batch/s][2024-11-08 17:09:02,491] [INFO] [timer.py:264:stop] epoch=0/micro_step=384/global_step=96, RunningAvgSamplesPerSec=24.267062521632727, CurrSamplesPerSec=25.059348136263193, MemAllocated=11.84GB, MaxMemAllocated=14.2GB
7%|██████▋ | 27/361 [00:04<00:54, 6.08batch/s][2024-11-08 17:09:03,164] [INFO] [logging.py:129:log_dist] [Rank 0] step=97, skipped=3, lr=[5.8841463414634155e-05], mom=[(0.9, 0.95)]
7%|██████▋ | 27/361 [00:04<00:55, 6.06batch/s][2024-11-08 17:09:03,164] [INFO] [timer.py:264:stop] epoch=0/micro_step=388/global_step=97, RunningAvgSamplesPerSec=24.262539087469598, CurrSamplesPerSec=23.844700661267062, MemAllocated=11.85GB, MaxMemAllocated=14.2GB
8%|███████▍ | 30/361 [00:05<01:03, 5.19batch/s][2024-11-08 17:09:03,897] [INFO] [logging.py:129:log_dist] [Rank 0] step=98, skipped=3, lr=[5.8292682926829274e-05], mom=[(0.9, 0.95)]
9%|███████▋ | 31/361 [00:05<01:00, 5.50batch/s][2024-11-08 17:09:03,898] [INFO] [timer.py:264:stop] epoch=0/micro_step=392/global_step=98, RunningAvgSamplesPerSec=24.234790286853816, CurrSamplesPerSec=21.859691604893605, MemAllocated=11.84GB, MaxMemAllocated=14.2GB
9%|████████▍ | 34/361 [00:05<00:57, 5.65batch/s][2024-11-08 17:09:04,562] [INFO] [logging.py:129:log_dist] [Rank 0] step=99, skipped=3, lr=[5.774390243902439e-05], mom=[(0.9, 0.95)]
10%|████████▋ | 35/361 [00:05<00:55, 5.83batch/s][2024-11-08 17:09:04,562] [INFO] [timer.py:264:stop] epoch=0/micro_step=396/global_step=99, RunningAvgSamplesPerSec=24.233669621007838, CurrSamplesPerSec=24.12652975156865, MemAllocated=11.86GB, MaxMemAllocated=14.2GB
11%|█████████▍ | 38/361 [00:06<00:51, 6.30batch/s][2024-11-08 17:09:05,179] [INFO] [logging.py:129:log_dist] [Rank 0] step=100, skipped=3, lr=[5.719512195121952e-05], mom=[(0.9, 0.95)]
11%|█████████▋ | 39/361 [00:06<00:51, 6.27batch/s][2024-11-08 17:09:05,180] [INFO] [timer.py:264:stop] epoch=0/micro_step=400/global_step=100, RunningAvgSamplesPerSec=24.250400845913706, CurrSamplesPerSec=25.990977151040383, MemAllocated=11.84GB, MaxMemAllocated=14.2GB
11%|█████████▋ | 39/361 [00:06<00:51, 6.24batch/s]Epoch: 1, step: 40, global_step:100, loss: 0.3425929069519043
step: 40-100-100
12%|██████████▋ | 43/361 [00:07<00:51, 6.21batch/s][2024-11-08 17:09:05,843] [INFO] [logging.py:129:log_dist] [Rank 0] step=101, skipped=3, lr=[5.664634146341464e-05], mom=[(0.9, 0.95)]
[2024-11-08 17:09:05,843] [INFO] [timer.py:264:stop] epoch=0/micro_step=404/global_step=101, RunningAvgSamplesPerSec=24.24988322235398, CurrSamplesPerSec=24.199226483510095, MemAllocated=11.88GB, MaxMemAllocated=14.2GB
13%|███████████▋ | 47/361 [00:07<00:50, 6.25batch/s][2024-11-08 17:09:06,472] [INFO] [logging.py:129:log_dist] [Rank 0] step=102, skipped=3, lr=[5.609756097560976e-05], mom=[(0.9, 0.95)]
13%|███████████▋ | 47/361 [00:07<00:50, 6.25batch/s][2024-11-08 17:09:06,473] [INFO] [timer.py:264:stop] epoch=0/micro_step=408/global_step=102, RunningAvgSamplesPerSec=24.261550989446256, CurrSamplesPerSec=25.474976910460285, MemAllocated=11.87GB, MaxMemAllocated=14.2GB
14%|████████████▍ | 50/361 [00:08<00:50, 6.21batch/s][2024-11-08 17:09:07,119] [INFO] [logging.py:129:log_dist] [Rank 0] step=103, skipped=3, lr=[5.554878048780488e-05], mom=[(0.9, 0.95)]
14%|████████████▋ | 51/361 [00:08<00:49, 6.20batch/s][2024-11-08 17:09:07,119] [INFO] [timer.py:264:stop] epoch=0/micro_step=412/global_step=103, RunningAvgSamplesPerSec=24.266849867070878, CurrSamplesPerSec=24.80864897201143, MemAllocated=11.87GB, MaxMemAllocated=14.2GB
15%|█████████████▍ | 54/361 [00:08<00:48, 6.29batch/s][2024-11-08 17:09:07,760] [INFO] [logging.py:129:log_dist] [Rank 0] step=104, skipped=3, lr=[5.5e-05], mom=[(0.9, 0.95)]
15%|█████████████▋ | 55/361 [00:09<00:49, 6.19batch/s][2024-11-08 17:09:07,761] [INFO] [timer.py:264:stop] epoch=0/micro_step=416/global_step=104, RunningAvgSamplesPerSec=24.274027638026443, CurrSamplesPerSec=25.02148884611155, MemAllocated=11.86GB, MaxMemAllocated=14.2GB
16%|██████████████▍ | 58/361 [00:09<00:48, 6.27batch/s][2024-11-08 17:09:08,402] [INFO] [logging.py:129:log_dist] [Rank 0] step=105, skipped=3, lr=[5.445121951219512e-05], mom=[(0.9, 0.95)]
16%|██████████████▋ | 59/361 [00:09<00:48, 6.18batch/s][2024-11-08 17:09:08,403] [INFO] [timer.py:264:stop] epoch=0/micro_step=420/global_step=105, RunningAvgSamplesPerSec=24.28081496226798, CurrSamplesPerSec=24.993606855032088, MemAllocated=11.88GB, MaxMemAllocated=14.2GB
17%|███████████████▍ | 62/361 [00:10<00:47, 6.27batch/s][2024-11-08 17:09:09,044] [INFO] [logging.py:129:log_dist] [Rank 0] step=106, skipped=3, lr=[5.390243902439025e-05], mom=[(0.9, 0.95)]
17%|███████████████▋ | 63/361 [00:10<00:48, 6.15batch/s][2024-11-08 17:09:09,045] [INFO] [timer.py:264:stop] epoch=0/micro_step=424/global_step=106, RunningAvgSamplesPerSec=24.287404115547645, CurrSamplesPerSec=24.9857529809237, MemAllocated=11.86GB, MaxMemAllocated=14.2GB
18%|████████████████▍ | 66/361 [00:10<00:46, 6.28batch/s][2024-11-08 17:09:09,685] [INFO] [logging.py:129:log_dist] [Rank 0] step=107, skipped=3, lr=[5.3353658536585366e-05], mom=[(0.9, 0.95)]
19%|████████████████▋ | 67/361 [00:10<00:47, 6.16batch/s][2024-11-08 17:09:09,686] [INFO] [timer.py:264:stop] epoch=0/micro_step=428/global_step=107, RunningAvgSamplesPerSec=24.29428785426089, CurrSamplesPerSec=25.032110031314314, MemAllocated=11.84GB, MaxMemAllocated=14.2GB
19%|█████████████████▍ | 70/361 [00:11<00:46, 6.26batch/s][2024-11-08 17:09:10,329] [INFO] [logging.py:129:log_dist] [Rank 0] step=108, skipped=3, lr=[5.2804878048780485e-05], mom=[(0.9, 0.95)]
20%|█████████████████▋ | 71/361 [00:11<00:47, 6.15batch/s][2024-11-08 17:09:10,329] [INFO] [timer.py:264:stop] epoch=0/micro_step=432/global_step=108, RunningAvgSamplesPerSec=24.300098406737824, CurrSamplesPerSec=24.926033739218198, MemAllocated=11.84GB, MaxMemAllocated=14.2GB
20%|██████████████████▍ | 74/361 [00:12<00:45, 6.24batch/s][2024-11-08 17:09:10,975] [INFO] [logging.py:129:log_dist] [Rank 0] step=109, skipped=3, lr=[5.225609756097561e-05], mom=[(0.9, 0.95)]
21%|██████████████████▋ | 75/361 [00:12<00:46, 6.12batch/s][2024-11-08 17:09:10,976] [INFO] [timer.py:264:stop] epoch=0/micro_step=436/global_step=109, RunningAvgSamplesPerSec=24.304825445035657, CurrSamplesPerSec=24.81650200745077, MemAllocated=11.85GB, MaxMemAllocated=14.2GB
22%|███████████████████▍ | 78/361 [00:12<00:45, 6.25batch/s][2024-11-08 17:09:11,619] [INFO] [logging.py:129:log_dist] [Rank 0] step=110, skipped=3, lr=[5.1707317073170736e-05], mom=[(0.9, 0.95)]
22%|███████████████████▋ | 79/361 [00:12<00:45, 6.13batch/s][2024-11-08 17:09:11,620] [INFO] [timer.py:264:stop] epoch=0/micro_step=440/global_step=110, RunningAvgSamplesPerSec=24.310298914961287, CurrSamplesPerSec=24.910517405842256, MemAllocated=11.89GB, MaxMemAllocated=14.2GB
22%|███████████████████▋ | 79/361 [00:12<00:45, 6.16batch/s]Epoch: 1, step: 80, global_step:110, loss: 0.33206987380981445
step: 80-110-110
23%|████████████████████▍ | 82/361 [00:13<00:44, 6.26batch/s][2024-11-08 17:09:12,270] [INFO] [logging.py:129:log_dist] [Rank 0] step=111, skipped=3, lr=[5.1158536585365855e-05], mom=[(0.9, 0.95)]
23%|████████████████████▋ | 83/361 [00:13<00:45, 6.08batch/s][2024-11-08 17:09:12,271] [INFO] [timer.py:264:stop] epoch=0/micro_step=444/global_step=111, RunningAvgSamplesPerSec=24.313334906610493, CurrSamplesPerSec=24.64570841594564, MemAllocated=11.86GB, MaxMemAllocated=14.2GB
24%|█████████████████████▍ | 86/361 [00:14<00:44, 6.24batch/s][2024-11-08 17:09:12,912] [INFO] [logging.py:129:log_dist] [Rank 0] step=112, skipped=3, lr=[5.060975609756098e-05], mom=[(0.9, 0.95)]
24%|█████████████████████▋ | 87/361 [00:14<00:44, 6.13batch/s][2024-11-08 17:09:12,912] [INFO] [timer.py:264:stop] epoch=0/micro_step=448/global_step=112, RunningAvgSamplesPerSec=24.3195073282069, CurrSamplesPerSec=25.01158511151755, MemAllocated=11.84GB, MaxMemAllocated=14.2GB
25%|██████████████████████▍ | 90/361 [00:14<00:43, 6.26batch/s][2024-11-08 17:09:13,554] [INFO] [logging.py:129:log_dist] [Rank 0] step=113, skipped=3, lr=[5.00609756097561e-05], mom=[(0.9, 0.95)]
25%|██████████████████████▋ | 91/361 [00:14<00:43, 6.15batch/s][2024-11-08 17:09:13,554] [INFO] [timer.py:264:stop] epoch=0/micro_step=452/global_step=113, RunningAvgSamplesPerSec=24.325366401316995, CurrSamplesPerSec=24.98752990654296, MemAllocated=11.85GB, MaxMemAllocated=14.2GB
26%|███████████████████████▍ | 94/361 [00:15<00:42, 6.23batch/s][2024-11-08 17:09:14,200] [INFO] [logging.py:129:log_dist] [Rank 0] step=114, skipped=3, lr=[4.951219512195122e-05], mom=[(0.9, 0.95)]
26%|███████████████████████▋ | 95/361 [00:15<00:43, 6.13batch/s][2024-11-08 17:09:14,200] [INFO] [timer.py:264:stop] epoch=0/micro_step=456/global_step=114, RunningAvgSamplesPerSec=24.32984398929755, CurrSamplesPerSec=24.837277784384476, MemAllocated=11.86GB, MaxMemAllocated=14.2GB
27%|████████████████████████▍ | 98/361 [00:15<00:41, 6.26batch/s][2024-11-08 17:09:14,845] [INFO] [logging.py:129:log_dist] [Rank 0] step=115, skipped=3, lr=[4.8963414634146345e-05], mom=[(0.9, 0.95)]
27%|████████████████████████▋ | 99/361 [00:16<00:42, 6.11batch/s][2024-11-08 17:09:14,846] [INFO] [timer.py:264:stop] epoch=0/micro_step=460/global_step=115, RunningAvgSamplesPerSec=24.334400185708084, CurrSamplesPerSec=24.855685374653106, MemAllocated=11.88GB, MaxMemAllocated=14.2GB
29%|█████████████████████████▍ | 103/361 [00:16<00:42, 6.14batch/s][2024-11-08 17:09:15,488] [INFO] [logging.py:129:log_dist] [Rank 0] step=116, skipped=3, lr=[4.8414634146341464e-05], mom=[(0.9, 0.95)]
29%|█████████████████████████▍ | 103/361 [00:16<00:41, 6.17batch/s][2024-11-08 17:09:15,489] [INFO] [timer.py:264:stop] epoch=0/micro_step=464/global_step=116, RunningAvgSamplesPerSec=24.339541405287804, CurrSamplesPerSec=24.93479507310005, MemAllocated=11.86GB, MaxMemAllocated=14.2GB
29%|██████████████████████████▏ | 106/361 [00:17<00:42, 6.05batch/s][2024-11-08 17:09:16,146] [INFO] [logging.py:129:log_dist] [Rank 0] step=117, skipped=3, lr=[4.786585365853658e-05], mom=[(0.9, 0.95)]
30%|██████████████████████████▍ | 107/361 [00:17<00:41, 6.07batch/s][2024-11-08 17:09:16,146] [INFO] [timer.py:264:stop] epoch=0/micro_step=468/global_step=117, RunningAvgSamplesPerSec=24.34023006463361, CurrSamplesPerSec=24.41895623481717, MemAllocated=11.87GB, MaxMemAllocated=14.2GB
30%|███████████████████████████ | 110/361 [00:17<00:41, 6.00batch/s][2024-11-08 17:09:16,809] [INFO] [logging.py:129:log_dist] [Rank 0] step=118, skipped=3, lr=[4.731707317073171e-05], mom=[(0.9, 0.95)]
31%|███████████████████████████▎ | 111/361 [00:18<00:41, 6.08batch/s][2024-11-08 17:09:16,809] [INFO] [timer.py:264:stop] epoch=0/micro_step=472/global_step=118, RunningAvgSamplesPerSec=24.33894305594973, CurrSamplesPerSec=24.19180280098901, MemAllocated=11.85GB, MaxMemAllocated=14.2GB
32%|████████████████████████████ | 114/361 [00:18<00:38, 6.36batch/s][2024-11-08 17:09:17,434] [INFO] [logging.py:129:log_dist] [Rank 0] step=119, skipped=3, lr=[4.676829268292683e-05], mom=[(0.9, 0.95)]
32%|████████████████████████████▎ | 115/361 [00:18<00:39, 6.27batch/s][2024-11-08 17:09:17,434] [INFO] [timer.py:264:stop] epoch=0/micro_step=476/global_step=119, RunningAvgSamplesPerSec=24.349623753308766, CurrSamplesPerSec=25.655567675889692, MemAllocated=11.84GB, MaxMemAllocated=14.2GB
33%|█████████████████████████████▎ | 119/361 [00:19<00:39, 6.19batch/s][2024-11-08 17:09:18,073] [INFO] [logging.py:129:log_dist] [Rank 0] step=120, skipped=3, lr=[4.6219512195121954e-05], mom=[(0.9, 0.95)]
33%|█████████████████████████████▎ | 119/361 [00:19<00:38, 6.22batch/s][2024-11-08 17:09:18,073] [INFO] [timer.py:264:stop] epoch=0/micro_step=480/global_step=120, RunningAvgSamplesPerSec=24.355890582882218, CurrSamplesPerSec=25.1120290649763, MemAllocated=11.84GB, MaxMemAllocated=14.2GB
33%|█████████████████████████████▎ | 119/361 [00:19<00:39, 6.18batch/s]Epoch: 1, step: 120, global_step:120, loss: 0.3332340240478516
step: 120-120-120
[2024-11-08 17:09:18,769] [INFO] [logging.py:129:log_dist] [Rank 0] step=121, skipped=3, lr=[4.567073170731708e-05], mom=[(0.9, 0.95)]
[2024-11-08 17:09:18,769] [INFO] [timer.py:264:stop] epoch=0/micro_step=484/global_step=121, RunningAvgSamplesPerSec=24.34434225002701, CurrSamplesPerSec=23.0544206668251, MemAllocated=11.86GB, MaxMemAllocated=14.2GB
[2024-11-08 17:09:19,388] [INFO] [logging.py:129:log_dist] [Rank 0] step=122, skipped=3, lr=[4.51219512195122e-05], mom=[(0.9, 0.95)]
[2024-11-08 17:09:19,388] [INFO] [timer.py:264:stop] epoch=0/micro_step=488/global_step=122, RunningAvgSamplesPerSec=24.356688417151517, CurrSamplesPerSec=25.920994649024326, MemAllocated=11.85GB, MaxMemAllocated=14.2GB
[2024-11-08 17:09:20,032] [INFO] [logging.py:129:log_dist] [Rank 0] step=123, skipped=3, lr=[4.457317073170732e-05], mom=[(0.9, 0.95)]
[2024-11-08 17:09:20,032] [INFO] [timer.py:264:stop] epoch=0/micro_step=492/global_step=123, RunningAvgSamplesPerSec=24.36121333953055, CurrSamplesPerSec=24.916649460703848, MemAllocated=11.84GB, MaxMemAllocated=14.2GB
[2024-11-08 17:09:20,677] [INFO] [logging.py:129:log_dist] [Rank 0] step=124, skipped=3, lr=[4.4024390243902443e-05], mom=[(0.9, 0.95)]
[2024-11-08 17:09:20,677] [INFO] [timer.py:264:stop] epoch=0/micro_step=496/global_step=124, RunningAvgSamplesPerSec=24.365263426911657, CurrSamplesPerSec=24.865429154941804, MemAllocated=11.84GB, MaxMemAllocated=14.2GB
[2024-11-08 17:09:21,317] [INFO] [logging.py:129:log_dist] [Rank 0] step=125, skipped=3, lr=[4.347560975609756e-05], mom=[(0.9, 0.95)]
[2024-11-08 17:09:21,317] [INFO] [timer.py:264:stop] epoch=0/micro_step=500/global_step=125, RunningAvgSamplesPerSec=24.370829918643214, CurrSamplesPerSec=25.069533229889178, MemAllocated=11.85GB, MaxMemAllocated=14.2GB
[2024-11-08 17:09:21,957] [INFO] [logging.py:129:log_dist] [Rank 0] step=126, skipped=3, lr=[4.292682926829268e-05], mom=[(0.9, 0.95)]
[2024-11-08 17:09:21,958] [INFO] [timer.py:264:stop] epoch=0/micro_step=504/global_step=126, RunningAvgSamplesPerSec=24.37606868962739, CurrSamplesPerSec=25.038040535753993, MemAllocated=11.86GB, MaxMemAllocated=14.2GB
[2024-11-08 17:09:22,596] [INFO] [logging.py:129:log_dist] [Rank 0] step=127, skipped=3, lr=[4.237804878048781e-05], mom=[(0.9, 0.95)]
[2024-11-08 17:09:22,597] [INFO] [timer.py:264:stop] epoch=0/micro_step=508/global_step=127, RunningAvgSamplesPerSec=24.381787656166015, CurrSamplesPerSec=25.112320371664058, MemAllocated=11.88GB, MaxMemAllocated=14.2GB
[2024-11-08 17:09:23,233] [INFO] [logging.py:129:log_dist] [Rank 0] step=128, skipped=3, lr=[4.1829268292682926e-05], mom=[(0.9, 0.95)]
[2024-11-08 17:09:23,233] [INFO] [timer.py:264:stop] epoch=0/micro_step=512/global_step=128, RunningAvgSamplesPerSec=24.388079296053398, CurrSamplesPerSec=25.200917591621373, MemAllocated=11.88GB, MaxMemAllocated=14.2GB
[2024-11-08 17:09:23,909] [INFO] [logging.py:129:log_dist] [Rank 0] step=129, skipped=3, lr=[4.1280487804878045e-05], mom=[(0.9, 0.95)]
[2024-11-08 17:09:23,909] [INFO] [timer.py:264:stop] epoch=0/micro_step=516/global_step=129, RunningAvgSamplesPerSec=24.38285777439429, CurrSamplesPerSec=23.742330094673854, MemAllocated=11.84GB, MaxMemAllocated=14.2GB
[2024-11-08 17:09:24,534] [INFO] [logging.py:129:log_dist] [Rank 0] step=130, skipped=3, lr=[4.073170731707317e-05], mom=[(0.9, 0.95)]
[2024-11-08 17:09:24,535] [INFO] [timer.py:264:stop] epoch=0/micro_step=520/global_step=130, RunningAvgSamplesPerSec=24.39220042349669, CurrSamplesPerSec=25.639845179495538, MemAllocated=11.87GB, MaxMemAllocated=14.2GB
Epoch: 1, step: 160, global_step:130, loss: 0.3119508743286133
step: 160-130-130
[2024-11-08 17:09:25,297] [INFO] [logging.py:129:log_dist] [Rank 0] step=131, skipped=3, lr=[4.01829268292683e-05], mom=[(0.9, 0.95)]
[2024-11-08 17:09:25,297] [INFO] [timer.py:264:stop] epoch=0/micro_step=524/global_step=131, RunningAvgSamplesPerSec=24.36206905954809, CurrSamplesPerSec=21.0359083233331, MemAllocated=11.85GB, MaxMemAllocated=14.2GB
[2024-11-08 17:09:25,919] [INFO] [logging.py:129:log_dist] [Rank 0] step=132, skipped=3, lr=[3.9634146341463416e-05], mom=[(0.9, 0.95)]
[2024-11-08 17:09:25,919] [INFO] [timer.py:264:stop] epoch=0/micro_step=528/global_step=132, RunningAvgSamplesPerSec=24.37242314281195, CurrSamplesPerSec=25.786136647580207, MemAllocated=11.84GB, MaxMemAllocated=14.2GB
[2024-11-08 17:09:26,543] [INFO] [logging.py:129:log_dist] [Rank 0] step=133, skipped=3, lr=[3.908536585365854e-05], mom=[(0.9, 0.95)]
[2024-11-08 17:09:26,543] [INFO] [timer.py:264:stop] epoch=0/micro_step=532/global_step=133, RunningAvgSamplesPerSec=24.382078241087473, CurrSamplesPerSec=25.705873612457136, MemAllocated=11.87GB, MaxMemAllocated=14.2GB
[2024-11-08 17:09:27,285] [INFO] [logging.py:129:log_dist] [Rank 0] step=134, skipped=3, lr=[3.853658536585366e-05], mom=[(0.9, 0.95)]
[2024-11-08 17:09:27,285] [INFO] [timer.py:264:stop] epoch=0/micro_step=536/global_step=134, RunningAvgSamplesPerSec=24.358438706242126, CurrSamplesPerSec=21.613291611269435, MemAllocated=11.84GB, MaxMemAllocated=14.2GB
[2024-11-08 17:09:27,901] [INFO] [logging.py:129:log_dist] [Rank 0] step=135, skipped=3, lr=[3.798780487804878e-05], mom=[(0.9, 0.95)]
[2024-11-08 17:09:27,902] [INFO] [timer.py:264:stop] epoch=0/micro_step=540/global_step=135, RunningAvgSamplesPerSec=24.37022302567894, CurrSamplesPerSec=26.03262723823164, MemAllocated=11.87GB, MaxMemAllocated=14.2GB
[2024-11-08 17:09:28,536] [INFO] [logging.py:129:log_dist] [Rank 0] step=136, skipped=3, lr=[3.7439024390243906e-05], mom=[(0.9, 0.95)]
[2024-11-08 17:09:28,536] [INFO] [timer.py:264:stop] epoch=0/micro_step=544/global_step=136, RunningAvgSamplesPerSec=24.37680635590186, CurrSamplesPerSec=25.28522531285095, MemAllocated=11.84GB, MaxMemAllocated=14.2GB
[2024-11-08 17:09:29,475] [INFO] [logging.py:129:log_dist] [Rank 0] step=137, skipped=3, lr=[3.6890243902439025e-05], mom=[(0.9, 0.95)]
[2024-11-08 17:09:29,475] [INFO] [timer.py:264:stop] epoch=0/micro_step=548/global_step=137, RunningAvgSamplesPerSec=24.299777630277898, CurrSamplesPerSec=17.07127644311658, MemAllocated=12.0GB, MaxMemAllocated=14.2GB
[2024-11-08 17:09:30,113] [INFO] [logging.py:129:log_dist] [Rank 0] step=138, skipped=3, lr=[3.634146341463415e-05], mom=[(0.9, 0.95)]
[2024-11-08 17:09:30,114] [INFO] [timer.py:264:stop] epoch=0/micro_step=552/global_step=138, RunningAvgSamplesPerSec=24.305641645529935, CurrSamplesPerSec=25.124100460031105, MemAllocated=11.86GB, MaxMemAllocated=14.2GB
[2024-11-08 17:09:30,730] [INFO] [logging.py:129:log_dist] [Rank 0] step=139, skipped=3, lr=[3.579268292682927e-05], mom=[(0.9, 0.95)]
[2024-11-08 17:09:30,730] [INFO] [timer.py:264:stop] epoch=0/micro_step=556/global_step=139, RunningAvgSamplesPerSec=24.31734943352585, CurrSamplesPerSec=26.02200796836039, MemAllocated=11.88GB, MaxMemAllocated=14.2GB
[2024-11-08 17:09:31,359] [INFO] [logging.py:129:log_dist] [Rank 0] step=140, skipped=3, lr=[3.5243902439024395e-05], mom=[(0.9, 0.95)]
[2024-11-08 17:09:31,359] [INFO] [timer.py:264:stop] epoch=0/micro_step=560/global_step=140, RunningAvgSamplesPerSec=24.32555430972514, CurrSamplesPerSec=25.504457577123457, MemAllocated=11.86GB, MaxMemAllocated=14.2GB
Epoch: 1, step: 200, global_step:140, loss: 0.2866189360618591
step: 200-140-140
[2024-11-08 17:09:32,203] [INFO] [logging.py:129:log_dist] [Rank 0] step=141, skipped=3, lr=[3.4695121951219514e-05], mom=[(0.9, 0.95)]
[2024-11-08 17:09:32,203] [INFO] [timer.py:264:stop] epoch=0/micro_step=564/global_step=141, RunningAvgSamplesPerSec=24.276544168884904, CurrSamplesPerSec=18.9951613481569, MemAllocated=11.85GB, MaxMemAllocated=14.2GB
[2024-11-08 17:09:32,808] [INFO] [logging.py:129:log_dist] [Rank 0] step=142, skipped=3, lr=[3.414634146341464e-05], mom=[(0.9, 0.95)]
[2024-11-08 17:09:32,808] [INFO] [timer.py:264:stop] epoch=0/micro_step=568/global_step=142, RunningAvgSamplesPerSec=24.291193738368143, CurrSamplesPerSec=26.51522146097617, MemAllocated=11.85GB, MaxMemAllocated=14.2GB
[2024-11-08 17:09:33,471] [INFO] [logging.py:129:log_dist] [Rank 0] step=143, skipped=3, lr=[3.359756097560976e-05], mom=[(0.9, 0.95)]
[2024-11-08 17:09:33,471] [INFO] [timer.py:264:stop] epoch=0/micro_step=572/global_step=143, RunningAvgSamplesPerSec=24.290575454983173, CurrSamplesPerSec=24.204288713998576, MemAllocated=11.91GB, MaxMemAllocated=14.2GB
[2024-11-08 17:09:34,212] [INFO] [logging.py:129:log_dist] [Rank 0] step=144, skipped=3, lr=[3.304878048780488e-05], mom=[(0.9, 0.95)]
[2024-11-08 17:09:34,212] [INFO] [timer.py:264:stop] epoch=0/micro_step=576/global_step=144, RunningAvgSamplesPerSec=24.26971671088242, CurrSamplesPerSec=21.648501141996267, MemAllocated=11.84GB, MaxMemAllocated=14.2GB
[2024-11-08 17:09:34,908] [INFO] [logging.py:129:log_dist] [Rank 0] step=145, skipped=3, lr=[3.2500000000000004e-05], mom=[(0.9, 0.95)]
[2024-11-08 17:09:34,909] [INFO] [timer.py:264:stop] epoch=0/micro_step=580/global_step=145, RunningAvgSamplesPerSec=24.260472922667606, CurrSamplesPerSec=23.01564601052541, MemAllocated=11.85GB, MaxMemAllocated=14.2GB
[2024-11-08 17:09:35,544] [INFO] [logging.py:129:log_dist] [Rank 0] step=146, skipped=3, lr=[3.195121951219512e-05], mom=[(0.9, 0.95)]
[2024-11-08 17:09:35,544] [INFO] [timer.py:264:stop] epoch=0/micro_step=584/global_step=146, RunningAvgSamplesPerSec=24.26706213525647, CurrSamplesPerSec=25.247621297408262, MemAllocated=11.84GB, MaxMemAllocated=14.2GB
[2024-11-08 17:09:36,168] [INFO] [logging.py:129:log_dist] [Rank 0] step=147, skipped=3, lr=[3.140243902439024e-05], mom=[(0.9, 0.95)]
[2024-11-08 17:09:36,168] [INFO] [timer.py:264:stop] epoch=0/micro_step=588/global_step=147, RunningAvgSamplesPerSec=24.276421995665114, CurrSamplesPerSec=25.704012745978105, MemAllocated=11.86GB, MaxMemAllocated=14.2GB
[2024-11-08 17:09:36,803] [INFO] [logging.py:129:log_dist] [Rank 0] step=148, skipped=3, lr=[3.085365853658537e-05], mom=[(0.9, 0.95)]
[2024-11-08 17:09:36,803] [INFO] [timer.py:264:stop] epoch=0/micro_step=592/global_step=148, RunningAvgSamplesPerSec=24.282862648235923, CurrSamplesPerSec=25.254338616719473, MemAllocated=11.86GB, MaxMemAllocated=14.2GB
[2024-11-08 17:09:37,442] [INFO] [logging.py:129:log_dist] [Rank 0] step=149, skipped=3, lr=[3.0304878048780494e-05], mom=[(0.9, 0.95)]
[2024-11-08 17:09:37,443] [INFO] [timer.py:264:stop] epoch=0/micro_step=596/global_step=149, RunningAvgSamplesPerSec=24.288224762006244, CurrSamplesPerSec=25.097312781743437, MemAllocated=11.86GB, MaxMemAllocated=14.2GB
[2024-11-08 17:09:38,077] [INFO] [logging.py:129:log_dist] [Rank 0] step=150, skipped=3, lr=[2.9756097560975613e-05], mom=[(0.9, 0.95)]
[2024-11-08 17:09:38,078] [INFO] [timer.py:264:stop] epoch=0/micro_step=600/global_step=150, RunningAvgSamplesPerSec=24.29453239997263, CurrSamplesPerSec=25.258768111503425, MemAllocated=11.84GB, MaxMemAllocated=14.2GB
Epoch: 1, step: 240, global_step:150, loss: 0.256827712059021
step: 240-150-150
[2024-11-08 17:09:38,713] [INFO] [logging.py:129:log_dist] [Rank 0] step=151, skipped=3, lr=[2.920731707317073e-05], mom=[(0.9, 0.95)]
[2024-11-08 17:09:38,713] [INFO] [timer.py:264:stop] epoch=0/micro_step=604/global_step=151, RunningAvgSamplesPerSec=24.300763130404146, CurrSamplesPerSec=25.259500173322063, MemAllocated=11.84GB, MaxMemAllocated=14.2GB
[2024-11-08 17:09:39,356] [INFO] [logging.py:129:log_dist] [Rank 0] step=152, skipped=3, lr=[2.8658536585365857e-05], mom=[(0.9, 0.95)]
[2024-11-08 17:09:39,356] [INFO] [timer.py:264:stop] epoch=0/micro_step=608/global_step=152, RunningAvgSamplesPerSec=24.305097730505043, CurrSamplesPerSec=24.968666469159295, MemAllocated=11.86GB, MaxMemAllocated=14.2GB
[2024-11-08 17:09:40,003] [INFO] [logging.py:129:log_dist] [Rank 0] step=153, skipped=3, lr=[2.8109756097560976e-05], mom=[(0.9, 0.95)]
[2024-11-08 17:09:40,004] [INFO] [timer.py:264:stop] epoch=0/micro_step=612/global_step=153, RunningAvgSamplesPerSec=24.308179665892204, CurrSamplesPerSec=24.779454741280084, MemAllocated=11.85GB, MaxMemAllocated=14.2GB
[2024-11-08 17:09:40,642] [INFO] [logging.py:129:log_dist] [Rank 0] step=154, skipped=3, lr=[2.7560975609756102e-05], mom=[(0.9, 0.95)]
[2024-11-08 17:09:40,642] [INFO] [timer.py:264:stop] epoch=0/micro_step=616/global_step=154, RunningAvgSamplesPerSec=24.313398121360198, CurrSamplesPerSec=25.127919842986085, MemAllocated=11.86GB, MaxMemAllocated=14.2GB
[2024-11-08 17:09:41,404] [INFO] [logging.py:129:log_dist] [Rank 0] step=155, skipped=3, lr=[2.701219512195122e-05], mom=[(0.9, 0.95)]
[2024-11-08 17:09:41,404] [INFO] [timer.py:264:stop] epoch=0/micro_step=620/global_step=155, RunningAvgSamplesPerSec=24.288679936323394, CurrSamplesPerSec=21.037682233215396, MemAllocated=11.87GB, MaxMemAllocated=14.2GB
[2024-11-08 17:09:42,020] [INFO] [logging.py:129:log_dist] [Rank 0] step=156, skipped=3, lr=[2.646341463414634e-05], mom=[(0.9, 0.95)]
[2024-11-08 17:09:42,021] [INFO] [timer.py:264:stop] epoch=0/micro_step=624/global_step=156, RunningAvgSamplesPerSec=24.2992077455268, CurrSamplesPerSec=26.02507576455483, MemAllocated=11.85GB, MaxMemAllocated=14.2GB
[2024-11-08 17:09:42,878] [INFO] [logging.py:129:log_dist] [Rank 0] step=157, skipped=3, lr=[2.5914634146341466e-05], mom=[(0.9, 0.95)]
[2024-11-08 17:09:42,878] [INFO] [timer.py:264:stop] epoch=0/micro_step=628/global_step=157, RunningAvgSamplesPerSec=24.252316455372384, CurrSamplesPerSec=18.69615059676673, MemAllocated=11.93GB, MaxMemAllocated=14.2GB
[2024-11-08 17:09:43,506] [INFO] [logging.py:129:log_dist] [Rank 0] step=158, skipped=3, lr=[2.5365853658536585e-05], mom=[(0.9, 0.95)]
[2024-11-08 17:09:43,506] [INFO] [timer.py:264:stop] epoch=0/micro_step=632/global_step=158, RunningAvgSamplesPerSec=24.260190055508122, CurrSamplesPerSec=25.54564111718296, MemAllocated=11.84GB, MaxMemAllocated=14.2GB
[2024-11-08 17:09:44,135] [INFO] [logging.py:129:log_dist] [Rank 0] step=159, skipped=3, lr=[2.481707317073171e-05], mom=[(0.9, 0.95)]
[2024-11-08 17:09:44,135] [INFO] [timer.py:264:stop] epoch=0/micro_step=636/global_step=159, RunningAvgSamplesPerSec=24.267658092300014, CurrSamplesPerSec=25.491775921864143, MemAllocated=11.87GB, MaxMemAllocated=14.2GB
[2024-11-08 17:09:44,798] [INFO] [logging.py:129:log_dist] [Rank 0] step=160, skipped=3, lr=[2.426829268292683e-05], mom=[(0.9, 0.95)]
[2024-11-08 17:09:44,799] [INFO] [timer.py:264:stop] epoch=0/micro_step=640/global_step=160, RunningAvgSamplesPerSec=24.26710791946863, CurrSamplesPerSec=24.18100254733031, MemAllocated=11.88GB, MaxMemAllocated=14.2GB
Epoch: 1, step: 280, global_step:160, loss: 0.31913166046142577
step: 280-160-160
[2024-11-08 17:09:45,425] [INFO] [logging.py:129:log_dist] [Rank 0] step=161, skipped=3, lr=[2.3719512195121952e-05], mom=[(0.9, 0.95)]
[2024-11-08 17:09:45,425] [INFO] [timer.py:264:stop] epoch=0/micro_step=644/global_step=161, RunningAvgSamplesPerSec=24.275283506956484, CurrSamplesPerSec=25.640070490582143, MemAllocated=11.84GB, MaxMemAllocated=14.2GB
[2024-11-08 17:09:46,059] [INFO] [logging.py:129:log_dist] [Rank 0] step=162, skipped=3, lr=[2.3170731707317075e-05], mom=[(0.9, 0.95)]
[2024-11-08 17:09:46,059] [INFO] [timer.py:264:stop] epoch=0/micro_step=648/global_step=162, RunningAvgSamplesPerSec=24.281444324267373, CurrSamplesPerSec=25.30242361586033, MemAllocated=11.86GB, MaxMemAllocated=14.2GB
[2024-11-08 17:09:46,695] [INFO] [logging.py:129:log_dist] [Rank 0] step=163, skipped=3, lr=[2.2621951219512197e-05], mom=[(0.9, 0.95)]
[2024-11-08 17:09:46,696] [INFO] [timer.py:264:stop] epoch=0/micro_step=652/global_step=163, RunningAvgSamplesPerSec=24.287026786020355, CurrSamplesPerSec=25.214505068431176, MemAllocated=11.84GB, MaxMemAllocated=14.2GB
[2024-11-08 17:09:47,333] [INFO] [logging.py:129:log_dist] [Rank 0] step=164, skipped=3, lr=[2.207317073170732e-05], mom=[(0.9, 0.95)]
[2024-11-08 17:09:47,334] [INFO] [timer.py:264:stop] epoch=0/micro_step=656/global_step=164, RunningAvgSamplesPerSec=24.29212369486296, CurrSamplesPerSec=25.141560562142864, MemAllocated=11.85GB, MaxMemAllocated=14.2GB
[2024-11-08 17:09:47,970] [INFO] [logging.py:129:log_dist] [Rank 0] step=165, skipped=3, lr=[2.152439024390244e-05], mom=[(0.9, 0.95)]
[2024-11-08 17:09:47,971] [INFO] [timer.py:264:stop] epoch=0/micro_step=660/global_step=165, RunningAvgSamplesPerSec=24.297442587978647, CurrSamplesPerSec=25.190946985236238, MemAllocated=11.89GB, MaxMemAllocated=14.2GB
[2024-11-08 17:09:48,625] [INFO] [logging.py:129:log_dist] [Rank 0] step=166, skipped=3, lr=[2.0975609756097564e-05], mom=[(0.9, 0.95)]
[2024-11-08 17:09:48,625] [INFO] [timer.py:264:stop] epoch=0/micro_step=664/global_step=166, RunningAvgSamplesPerSec=24.29875489767388, CurrSamplesPerSec=24.514535359915264, MemAllocated=11.86GB, MaxMemAllocated=14.2GB
[2024-11-08 17:09:49,288] [INFO] [logging.py:129:log_dist] [Rank 0] step=167, skipped=3, lr=[2.0426829268292683e-05], mom=[(0.9, 0.95)]
[2024-11-08 17:09:49,288] [INFO] [timer.py:264:stop] epoch=0/micro_step=668/global_step=167, RunningAvgSamplesPerSec=24.298116555178645, CurrSamplesPerSec=24.19384364153257, MemAllocated=11.86GB, MaxMemAllocated=14.2GB
[2024-11-08 17:09:49,965] [INFO] [logging.py:129:log_dist] [Rank 0] step=168, skipped=3, lr=[1.9878048780487806e-05], mom=[(0.9, 0.95)]
[2024-11-08 17:09:49,965] [INFO] [timer.py:264:stop] epoch=0/micro_step=672/global_step=168, RunningAvgSamplesPerSec=24.29440668270727, CurrSamplesPerSec=23.697375869335257, MemAllocated=11.85GB, MaxMemAllocated=14.2GB
[2024-11-08 17:09:50,576] [INFO] [logging.py:129:log_dist] [Rank 0] step=169, skipped=3, lr=[1.9329268292682928e-05], mom=[(0.9, 0.95)]
[2024-11-08 17:09:50,576] [INFO] [timer.py:264:stop] epoch=0/micro_step=676/global_step=169, RunningAvgSamplesPerSec=24.30522197155159, CurrSamplesPerSec=26.244640855893042, MemAllocated=11.84GB, MaxMemAllocated=14.2GB
[2024-11-08 17:09:51,208] [INFO] [logging.py:129:log_dist] [Rank 0] step=170, skipped=3, lr=[1.878048780487805e-05], mom=[(0.9, 0.95)]
[2024-11-08 17:09:51,209] [INFO] [timer.py:264:stop] epoch=0/micro_step=680/global_step=170, RunningAvgSamplesPerSec=24.311304964991994, CurrSamplesPerSec=25.37170081760081, MemAllocated=11.87GB, MaxMemAllocated=14.2GB
Epoch: 1, step: 320, global_step:170, loss: 0.25384347438812255
step: 320-170-170
[2024-11-08 17:09:51,854] [INFO] [logging.py:129:log_dist] [Rank 0] step=171, skipped=3, lr=[1.8231707317073173e-05], mom=[(0.9, 0.95)]
[2024-11-08 17:09:51,854] [INFO] [timer.py:264:stop] epoch=0/micro_step=684/global_step=171, RunningAvgSamplesPerSec=24.314519615143027, CurrSamplesPerSec=24.86688493090245, MemAllocated=11.85GB, MaxMemAllocated=14.2GB
[2024-11-08 17:09:52,602] [INFO] [logging.py:129:log_dist] [Rank 0] step=172, skipped=3, lr=[1.7682926829268292e-05], mom=[(0.9, 0.95)]
[2024-11-08 17:09:52,602] [INFO] [timer.py:264:stop] epoch=0/micro_step=688/global_step=172, RunningAvgSamplesPerSec=24.29530136140045, CurrSamplesPerSec=21.432375040552873, MemAllocated=11.96GB, MaxMemAllocated=14.2GB
[2024-11-08 17:09:53,285] [INFO] [logging.py:129:log_dist] [Rank 0] step=173, skipped=3, lr=[1.7134146341463418e-05], mom=[(0.9, 0.95)]
[2024-11-08 17:09:53,286] [INFO] [timer.py:264:stop] epoch=0/micro_step=692/global_step=173, RunningAvgSamplesPerSec=24.290359544051466, CurrSamplesPerSec=23.47846054741365, MemAllocated=11.87GB, MaxMemAllocated=14.2GB
[2024-11-08 17:09:53,939] [INFO] [logging.py:129:log_dist] [Rank 0] step=174, skipped=3, lr=[1.6585365853658537e-05], mom=[(0.9, 0.95)]
[2024-11-08 17:09:53,940] [INFO] [timer.py:264:stop] epoch=0/micro_step=696/global_step=174, RunningAvgSamplesPerSec=24.29168877972303, CurrSamplesPerSec=24.521110120857482, MemAllocated=11.84GB, MaxMemAllocated=14.2GB
[2024-11-08 17:09:54,561] [INFO] [logging.py:129:log_dist] [Rank 0] step=175, skipped=3, lr=[1.603658536585366e-05], mom=[(0.9, 0.95)]
[2024-11-08 17:09:54,561] [INFO] [timer.py:264:stop] epoch=0/micro_step=700/global_step=175, RunningAvgSamplesPerSec=24.29996514861986, CurrSamplesPerSec=25.812588797242434, MemAllocated=11.88GB, MaxMemAllocated=14.2GB
[2024-11-08 17:09:55,224] [INFO] [logging.py:129:log_dist] [Rank 0] step=176, skipped=3, lr=[1.5487804878048782e-05], mom=[(0.9, 0.95)]
[2024-11-08 17:09:55,224] [INFO] [timer.py:264:stop] epoch=0/micro_step=704/global_step=176, RunningAvgSamplesPerSec=24.299338244198974, CurrSamplesPerSec=24.191331886681077, MemAllocated=11.86GB, MaxMemAllocated=14.2GB
[2024-11-08 17:09:55,838] [INFO] [logging.py:129:log_dist] [Rank 0] step=177, skipped=3, lr=[1.4939024390243904e-05], mom=[(0.9, 0.95)]
[2024-11-08 17:09:55,838] [INFO] [timer.py:264:stop] epoch=0/micro_step=708/global_step=177, RunningAvgSamplesPerSec=24.30910850707453, CurrSamplesPerSec=26.137710382287874, MemAllocated=11.84GB, MaxMemAllocated=14.2GB
[2024-11-08 17:09:56,624] [INFO] [logging.py:129:log_dist] [Rank 0] step=178, skipped=3, lr=[1.4390243902439027e-05], mom=[(0.9, 0.95)]
[2024-11-08 17:09:56,625] [INFO] [timer.py:264:stop] epoch=0/micro_step=712/global_step=178, RunningAvgSamplesPerSec=24.28260899514373, CurrSamplesPerSec=20.392356713284403, MemAllocated=11.93GB, MaxMemAllocated=14.2GB
[2024-11-08 17:09:57,236] [INFO] [logging.py:129:log_dist] [Rank 0] step=179, skipped=3, lr=[1.3841463414634147e-05], mom=[(0.9, 0.95)]
[2024-11-08 17:09:57,237] [INFO] [timer.py:264:stop] epoch=0/micro_step=716/global_step=179, RunningAvgSamplesPerSec=24.292699698956497, CurrSamplesPerSec=26.209555385812912, MemAllocated=11.84GB, MaxMemAllocated=14.2GB
[2024-11-08 17:09:57,859] [INFO] [logging.py:129:log_dist] [Rank 0] step=180, skipped=3, lr=[1.329268292682927e-05], mom=[(0.9, 0.95)]
[2024-11-08 17:09:57,859] [INFO] [timer.py:264:stop] epoch=0/micro_step=720/global_step=180, RunningAvgSamplesPerSec=24.300508226461744, CurrSamplesPerSec=25.766424596691824, MemAllocated=11.84GB, MaxMemAllocated=14.2GB
Epoch: 1, step: 360, global_step:180, loss: 0.2547494947910309
step: 360-180-180
100%|█████████████████████████████████████████████████████████████████████████████████████████| 361/361 [00:59<00:00, 6.07batch/s]
[2024-11-08 17:10:00,482] [INFO] [launch.py:351:main] Process 3443025 exits successfully.
[2024-11-08 17:10:00,482] [INFO] [launch.py:351:main] Process 3443026 exits successfully.
[2024-11-08 17:10:00,482] [INFO] [launch.py:351:main] Process 3443024 exits successfully.
[2024-11-08 17:10:00,482] [INFO] [launch.py:351:main] Process 3443023 exits successfully.
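The run ends cleanly: all four DeepSpeed worker processes exit with success. The `log_dist` and `timer` lines in the output above are regular enough to post-process if you want to chart learning rate or throughput per step. Below is a minimal sketch of such a parser; the `parse_deepspeed_log` helper and its regexes are my own illustration, not part of the ChatGLM-Finetuning repo or DeepSpeed itself:

```python
import re

# Matches DeepSpeed's log_dist lines: "... step=180, skipped=3, lr=[1.32e-05] ..."
LR_RE = re.compile(r"step=(\d+).*?lr=\[([0-9.e-]+)\]")
# Matches DeepSpeed's timer lines: "... global_step=180, RunningAvgSamplesPerSec=24.30 ..."
SPS_RE = re.compile(r"global_step=(\d+),\s*RunningAvgSamplesPerSec=([0-9.]+)")

def parse_deepspeed_log(lines):
    """Collect {global_step: {"lr": ..., "samples_per_sec": ...}} from raw log lines."""
    stats = {}
    for line in lines:
        m = LR_RE.search(line)
        if m:
            stats.setdefault(int(m.group(1)), {})["lr"] = float(m.group(2))
        m = SPS_RE.search(line)
        if m:
            stats.setdefault(int(m.group(1)), {})["samples_per_sec"] = float(m.group(2))
    return stats

# Two lines copied from the run above as sample input
sample = [
    "[2024-11-08 17:09:57,859] [INFO] [logging.py:129:log_dist] [Rank 0] "
    "step=180, skipped=3, lr=[1.329268292682927e-05], mom=[(0.9, 0.95)]",
    "[2024-11-08 17:09:57,859] [INFO] [timer.py:264:stop] "
    "epoch=0/micro_step=720/global_step=180, RunningAvgSamplesPerSec=24.300508226461744, "
    "CurrSamplesPerSec=25.766424596691824, MemAllocated=11.84GB, MaxMemAllocated=14.2GB",
]
print(parse_deepspeed_log(sample))
```

Feeding the captured stdout through this (e.g. `parse_deepspeed_log(open("train.log"))`) gives a per-step table that is easy to plot with matplotlib or dump to CSV.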