PYTORCH_CUDA_ALLOC_CONF="expandable_segments:True,garbage_collection_threshold:0.8,max_split_size_mb:128" deepspeed --num_gpus 6 finetune.py /mnt/data/zhaoshukuo/try/GLM-Z1-32B-0414/finetune /mnt/data/zhaoshukuo/try/GLM-Z1-32B-0414/cache/models--zai-org--GLM-Z1-32B-0414/snapshots/8eb2858992c1f749e2a6d4075455decc2484722d configs/lora.yaml yes --deepspeed ds_config.json
[2025-10-22 11:29:08,447] [WARNING] [runner.py:232:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2025-10-22 11:29:08,448] [INFO] [runner.py:630:main] cmd = /home/zhaoshukuo/miniconda3/envs/glm-z1/bin/python3.10 -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgMywgNCwgNV19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None --log_level=info finetune.py /mnt/data/zhaoshukuo/try/GLM-Z1-32B-0414/finetune /mnt/data/zhaoshukuo/try/GLM-Z1-32B-0414/cache/models--zai-org--GLM-Z1-32B-0414/snapshots/8eb2858992c1f749e2a6d4075455decc2484722d configs/lora.yaml yes --deepspeed ds_config.json
[2025-10-22 11:29:12,873] [INFO] [launch.py:162:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3, 4, 5]}
[2025-10-22 11:29:12,874] [INFO] [launch.py:168:main] nnodes=1, num_local_procs=6, node_rank=0
[2025-10-22 11:29:12,874] [INFO] [launch.py:179:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3, 4, 5]})
[2025-10-22 11:29:12,874] [INFO] [launch.py:180:main] dist_world_size=6
[2025-10-22 11:29:12,874] [INFO] [launch.py:184:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3,4,5
[2025-10-22 11:29:12,875] [INFO] [launch.py:272:main] process 3734091 spawned with command: ['/home/zhaoshukuo/miniconda3/envs/glm-z1/bin/python3.10', '-u', 'finetune.py', '--local_rank=0', '/mnt/data/zhaoshukuo/try/GLM-Z1-32B-0414/finetune', '/mnt/data/zhaoshukuo/try/GLM-Z1-32B-0414/cache/models--zai-org--GLM-Z1-32B-0414/snapshots/8eb2858992c1f749e2a6d4075455decc2484722d', 'configs/lora.yaml', 'yes', '--deepspeed', 'ds_config.json']
[2025-10-22 11:29:12,875] [INFO] [launch.py:272:main] process 3734092 spawned with command: ['/home/zhaoshukuo/miniconda3/envs/glm-z1/bin/python3.10', '-u', 'finetune.py', '--local_rank=1', '/mnt/data/zhaoshukuo/try/GLM-Z1-32B-0414/finetune', '/mnt/data/zhaoshukuo/try/GLM-Z1-32B-0414/cache/models--zai-org--GLM-Z1-32B-0414/snapshots/8eb2858992c1f749e2a6d4075455decc2484722d', 'configs/lora.yaml', 'yes', '--deepspeed', 'ds_config.json']
[2025-10-22 11:29:12,876] [INFO] [launch.py:272:main] process 3734093 spawned with command: ['/home/zhaoshukuo/miniconda3/envs/glm-z1/bin/python3.10', '-u', 'finetune.py', '--local_rank=2', '/mnt/data/zhaoshukuo/try/GLM-Z1-32B-0414/finetune', '/mnt/data/zhaoshukuo/try/GLM-Z1-32B-0414/cache/models--zai-org--GLM-Z1-32B-0414/snapshots/8eb2858992c1f749e2a6d4075455decc2484722d', 'configs/lora.yaml', 'yes', '--deepspeed', 'ds_config.json']
[2025-10-22 11:29:12,877] [INFO] [launch.py:272:main] process 3734094 spawned with command: ['/home/zhaoshukuo/miniconda3/envs/glm-z1/bin/python3.10', '-u', 'finetune.py', '--local_rank=3', '/mnt/data/zhaoshukuo/try/GLM-Z1-32B-0414/finetune', '/mnt/data/zhaoshukuo/try/GLM-Z1-32B-0414/cache/models--zai-org--GLM-Z1-32B-0414/snapshots/8eb2858992c1f749e2a6d4075455decc2484722d', 'configs/lora.yaml', 'yes', '--deepspeed', 'ds_config.json']
[2025-10-22 11:29:12,878] [INFO] [launch.py:272:main] process 3734095 spawned with command: ['/home/zhaoshukuo/miniconda3/envs/glm-z1/bin/python3.10', '-u', 'finetune.py', '--local_rank=4', '/mnt/data/zhaoshukuo/try/GLM-Z1-32B-0414/finetune', '/mnt/data/zhaoshukuo/try/GLM-Z1-32B-0414/cache/models--zai-org--GLM-Z1-32B-0414/snapshots/8eb2858992c1f749e2a6d4075455decc2484722d', 'configs/lora.yaml', 'yes', '--deepspeed', 'ds_config.json']
[2025-10-22 11:29:12,879] [INFO] [launch.py:272:main] process 3734096 spawned with command: ['/home/zhaoshukuo/miniconda3/envs/glm-z1/bin/python3.10', '-u', 'finetune.py', '--local_rank=5', '/mnt/data/zhaoshukuo/try/GLM-Z1-32B-0414/finetune', '/mnt/data/zhaoshukuo/try/GLM-Z1-32B-0414/cache/models--zai-org--GLM-Z1-32B-0414/snapshots/8eb2858992c1f749e2a6d4075455decc2484722d', 'configs/lora.yaml', 'yes', '--deepspeed', 'ds_config.json']
usage: finetune.py [-h] [--local_rank LOCAL_RANK]
finetune.py: error: unrecognized arguments: /mnt/data/zhaoshukuo/try/GLM-Z1-32B-0414/finetune /mnt/data/zhaoshukuo/try/GLM-Z1-32B-0414/cache/models--zai-org--GLM-Z1-32B-0414/snapshots/8eb2858992c1f749e2a6d4075455decc2484722d configs/lora.yaml yes --deepspeed ds_config.json
usage: finetune.py [-h] [--local_rank LOCAL_RANK]
usage: finetune.py [-h] [--local_rank LOCAL_RANK]
finetune.py: error: unrecognized arguments: /mnt/data/zhaoshukuo/try/GLM-Z1-32B-0414/finetune /mnt/data/zhaoshukuo/try/GLM-Z1-32B-0414/cache/models--zai-org--GLM-Z1-32B-0414/snapshots/8eb2858992c1f749e2a6d4075455decc2484722d configs/lora.yaml yes --deepspeed ds_config.json
finetune.py: error: unrecognized arguments: /mnt/data/zhaoshukuo/try/GLM-Z1-32B-0414/finetune /mnt/data/zhaoshukuo/try/GLM-Z1-32B-0414/cache/models--zai-org--GLM-Z1-32B-0414/snapshots/8eb2858992c1f749e2a6d4075455decc2484722d configs/lora.yaml yes --deepspeed ds_config.json
usage: finetune.py [-h] [--local_rank LOCAL_RANK]
finetune.py: error: unrecognized arguments: /mnt/data/zhaoshukuo/try/GLM-Z1-32B-0414/finetune /mnt/data/zhaoshukuo/try/GLM-Z1-32B-0414/cache/models--zai-org--GLM-Z1-32B-0414/snapshots/8eb2858992c1f749e2a6d4075455decc2484722d configs/lora.yaml yes --deepspeed ds_config.json
usage: finetune.py [-h] [--local_rank LOCAL_RANK]
finetune.py: error: unrecognized arguments: /mnt/data/zhaoshukuo/try/GLM-Z1-32B-0414/finetune /mnt/data/zhaoshukuo/try/GLM-Z1-32B-0414/cache/models--zai-org--GLM-Z1-32B-0414/snapshots/8eb2858992c1f749e2a6d4075455decc2484722d configs/lora.yaml yes --deepspeed ds_config.json
usage: finetune.py [-h] [--local_rank LOCAL_RANK]
finetune.py: error: unrecognized arguments: /mnt/data/zhaoshukuo/try/GLM-Z1-32B-0414/finetune /mnt/data/zhaoshukuo/try/GLM-Z1-32B-0414/cache/models--zai-org--GLM-Z1-32B-0414/snapshots/8eb2858992c1f749e2a6d4075455decc2484722d configs/lora.yaml yes --deepspeed ds_config.json
[2025-10-22 11:29:13,880] [INFO] [launch.py:335:sigkill_handler] Killing subprocess 3734091
[2025-10-22 11:29:13,920] [INFO] [launch.py:335:sigkill_handler] Killing subprocess 3734092
[2025-10-22 11:29:13,949] [INFO] [launch.py:335:sigkill_handler] Killing subprocess 3734093
[2025-10-22 11:29:13,978] [INFO] [launch.py:335:sigkill_handler] Killing subprocess 3734094
[2025-10-22 11:29:14,007] [INFO] [launch.py:335:sigkill_handler] Killing subprocess 3734095
[2025-10-22 11:29:14,007] [INFO] [launch.py:335:sigkill_handler] Killing subprocess 3734096
[2025-10-22 11:29:14,036] [ERROR] [launch.py:341:sigkill_handler] ['/home/zhaoshukuo/miniconda3/envs/glm-z1/bin/python3.10', '-u', 'finetune.py', '--local_rank=5', '/mnt/data/zhaoshukuo/try/GLM-Z1-32B-0414/finetune', '/mnt/data/zhaoshukuo/try/GLM-Z1-32B-0414/cache/models--zai-org--GLM-Z1-32B-0414/snapshots/8eb2858992c1f749e2a6d4075455decc2484722d', 'configs/lora.yaml', 'yes', '--deepspeed', 'ds_config.json'] exits with return code = 2
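
The usage line above suggests that the parser reached inside `finetune.py` only declares `--local_rank`, so every positional argument and `--deepspeed ds_config.json` is rejected, and each rank exits with code 2. The sketch below reproduces that behaviour with the standard `argparse` module. It is an illustration only: `build_parser` and the shortened argument values are hypothetical, and the log does not show how `finetune.py` actually defines its command line.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical stand-in for a parser that only knows --local_rank,
    # matching the "usage: finetune.py [-h] [--local_rank LOCAL_RANK]" line.
    parser = argparse.ArgumentParser(prog="finetune.py")
    parser.add_argument("--local_rank", type=int, default=-1)
    return parser


# Shortened placeholders for the arguments the launcher passed to each rank.
argv = [
    "--local_rank=0",
    "data_dir",
    "model_dir",
    "configs/lora.yaml",
    "yes",
    "--deepspeed",
    "ds_config.json",
]

parser = build_parser()

# Strict parsing: unknown arguments trigger parser.error(), which prints the
# usage message to stderr and raises SystemExit with code 2.
try:
    parser.parse_args(argv)
    strict_exit_code = None
except SystemExit as exc:
    strict_exit_code = exc.code

# Tolerant parsing: known options are parsed, everything else is returned
# untouched so that another parser can handle it.
known, leftover = parser.parse_known_args(argv)

print("strict exit code:", strict_exit_code)
print("local_rank:", known.local_rank)
print("leftover:", leftover)
```

If a strict `parse_args()` call of this kind runs before the script's own argument handling, switching it to `parse_known_args()` or declaring the missing arguments is one possible direction. Which applies depends on the actual contents of `finetune.py`.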