Anaconda Usage

This article walks through basic conda environment management: installing and removing packages, exporting the environment you are currently working in, and recreating it later from the exported file.

Install a package with conda (matplotlib, for example):

conda install matplotlib
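
conda install also accepts a version pin and the -n flag, so you can target a specific release or an environment other than the active one; the environment name myenv below is only a placeholder:

conda install -n myenv matplotlib=3.8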

Remove a package with conda (matplotlib, for example):

conda remove matplotlib
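
The -n flag works for removal as well, letting you uninstall a package from a named environment without activating it first (myenv is again a placeholder):

conda remove -n myenv matplotlib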

Export the environment that is currently active in the terminal:

conda env export > environment.txt
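
The exported file is YAML (conda reads it as YAML regardless of the extension, so it is commonly named environment.yml). If the export should stay portable across platforms, one option is --from-history, which records only the packages you explicitly asked for rather than every resolved dependency:

conda env export --from-history > environment.yml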

Recreate the environment from the exported file:

conda env create -f environment.txt
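
The new environment takes its name from the name: field inside the file; you can override that with -n and then activate it (restored_env is a placeholder name):

conda env create -f environment.txt -n restored_env
conda activate restored_env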
