Hang File Generator

介绍Oracle数据库Hang时自动生成hanganalyze&systemstatetrace文件的工具,该工具帮助用户快速定位Oracle数据库Hang的问题。

摘要生成于 C知道 ,由 DeepSeek-R1 满血版支持, 前往体验 >

Oracle数据库Hang时自动生成hanganalyze & systemstate trace文件的工具。
HANGFG User Guide [ID 362094.1]
/home/wiseatc/.local/lib/python3.11/site-packages/jieba/_compat.py:18: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81. import pkg_resources W0703 16:30:36.069853 3914856 torch/distributed/run.py:766] W0703 16:30:36.069853 3914856 torch/distributed/run.py:766] ***************************************** W0703 16:30:36.069853 3914856 torch/distributed/run.py:766] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. W0703 16:30:36.069853 3914856 torch/distributed/run.py:766] ***************************************** /home/wiseatc/.local/lib/python3.11/site-packages/jieba/_compat.py:18: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81. import pkg_resources [INFO|tokenization_utils_base.py:2048] 2025-07-03 16:30:43,321 >> loading file tokenizer.model [INFO|tokenization_utils_base.py:2048] 2025-07-03 16:30:43,322 >> loading file tokenizer.json [INFO|tokenization_utils_base.py:2048] 2025-07-03 16:30:43,322 >> loading file added_tokens.json [INFO|tokenization_utils_base.py:2048] 2025-07-03 16:30:43,322 >> loading file special_tokens_map.json [INFO|tokenization_utils_base.py:2048] 2025-07-03 16:30:43,322 >> loading file tokenizer_config.json [INFO|tokenization_utils_base.py:2048] 2025-07-03 16:30:43,322 >> loading file chat_template.jinja /home/wiseatc/.local/lib/python3.11/site-packages/jieba/_compat.py:18: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81. import pkg_resources /home/wiseatc/.local/lib/python3.11/site-packages/jieba/_compat.py:18: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81. import pkg_resources /home/wiseatc/.local/lib/python3.11/site-packages/jieba/_compat.py:18: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81. import pkg_resources [INFO|tokenization_utils_base.py:2313] 2025-07-03 16:30:43,904 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. [INFO|configuration_utils.py:697] 2025-07-03 16:30:43,913 >> loading configuration file /mnt/data1/models/1.5B/config.json [INFO|configuration_utils.py:771] 2025-07-03 16:30:43,919 >> Model config Qwen2Config { "_name_or_path": "/mnt/data1/models/1.5B", "architectures": [ "Qwen2ForCausalLM" ], "attention_dropout": 0.0, "bos_token_id": 151643, "eos_token_id": 151643, "hidden_act": "silu", "hidden_size": 1536, "initializer_range": 0.02, "intermediate_size": 8960, "max_position_embeddings": 131072, "max_window_layers": 21, "model_type": "qwen2", "num_attention_heads": 12, "num_hidden_layers": 28, "num_key_value_heads": 2, "rms_norm_eps": 1e-06, "rope_scaling": null, "rope_theta": 10000, "sliding_window": 4096, "tie_word_embeddings": false, "torch_dtype": "bfloat16", "transformers_version": "4.49.0", "use_cache": true, "use_mrope": false, "use_sliding_window": false, "vocab_size": 151936 } [INFO|tokenization_utils_base.py:2048] 2025-07-03 16:30:43,920 >> loading file tokenizer.model [INFO|tokenization_utils_base.py:2048] 2025-07-03 16:30:43,920 >> loading file tokenizer.json [INFO|tokenization_utils_base.py:2048] 2025-07-03 16:30:43,920 >> loading file added_tokens.json [INFO|tokenization_utils_base.py:2048] 2025-07-03 16:30:43,920 >> loading file special_tokens_map.json [INFO|tokenization_utils_base.py:2048] 2025-07-03 16:30:43,920 >> loading file tokenizer_config.json [INFO|tokenization_utils_base.py:2048] 2025-07-03 16:30:43,920 >> loading file chat_template.jinja [INFO|tokenization_utils_base.py:2313] 2025-07-03 16:30:44,493 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. /usr/local/lib/python3.11/dist-packages/torch/distributed/distributed_c10d.py:4631: UserWarning: No device id is provided via `init_process_group` or `barrier `. Using the current device set by the user. warnings.warn( # warn only once [rank1]:[W703 16:30:45.102845887 ProcessGroupNCCL.cpp:4718] [PG ID 0 PG GUID 0 Rank 1] using GPU 1 as device used by this process is currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. You can pecify device_id in init_process_group() to force use of a particular device. /usr/local/lib/python3.11/dist-packages/torch/distributed/distributed_c10d.py:4631: UserWarning: No device id is provided via `init_process_group` or `barrier `. Using the current device set by the user. warnings.warn( # warn only once [rank2]:[W703 16:30:45.126706430 ProcessGroupNCCL.cpp:4718] [PG ID 0 PG GUID 0 Rank 2] using GPU 2 as device used by this process is currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. You can pecify device_id in init_process_group() to force use of a particular device. /usr/local/lib/python3.11/dist-packages/torch/distributed/distributed_c10d.py:4631: UserWarning: No device id is provided via `init_process_group` or `barrier `. Using the current device set by the user. warnings.warn( # warn only once [rank3]:[W703 16:30:45.136836682 ProcessGroupNCCL.cpp:4718] [PG ID 0 PG GUID 0 Rank 3] using GPU 3 as device used by this process is currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. You can pecify device_id in init_process_group() to force use of a particular device. Setting num_proc from 16 back to 1 for the train split to disable multiprocessing as it only contains one shard. Generating train split: 0 examples [00:00, ? examples/s] Generating train split: 120 examples [00:00, 6525.39 examples/s] Converting format of dataset (num_proc=16): 0%| | 0/120 [00:00<?, ? examples/s] Converting format of dataset (num_proc=16): 0%| | 0/120 [00:00<?, ? examples/s] Converting format of dataset (num_proc=16): 0%| | 0/120 [00:00<?, ? examples/s] /usr/local/lib/python3.11/dist-packages/torch/distributed/distributed_c10d.py:4631: UserWarning: No device id is provided via `init_process_group` or `barrier `. Using the current device set by the user. warnings.warn( # warn only once [rank0]:[W703 16:31:05.679961201 ProcessGroupNCCL.cpp:4718] [PG ID 0 PG GUID 0 Rank 0] using GPU 0 as device used by this process is currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. You can pecify device_id in init_process_group() to force use of a particular device. [rank0]: multiprocess.pool.RemoteTraceback: [rank0]: """ [rank0]: Traceback (most recent call last): [rank0]: File "/home/wiseatc/.local/lib/python3.11/site-packages/multiprocess/pool.py", line 125, in worker [rank0]: result = (True, func(*args, **kwds)) [rank0]: ^^^^^^^^^^^^^^^^^^^ [rank0]: File "/home/wiseatc/.local/lib/python3.11/site-packages/datasets/utils/py_utils.py", line 688, in _write_generator_to_queue [rank0]: for i, result in enumerate(func(**kwargs)): [rank0]: File "/home/wiseatc/.local/lib/python3.11/site-packages/datasets/arrow_dataset.py", line 3501, in _map_single [rank0]: for i, example in iter_outputs(shard_iterable): [rank0]: File "/home/wiseatc/.local/lib/python3.11/site-packages/datasets/arrow_dataset.py", line 3475, in iter_outputs [rank0]: yield i, apply_function(example, i, offset=offset) [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank0]: File "/home/wiseatc/.local/lib/python3.11/site-packages/datasets/arrow_dataset.py", line 3398, in apply_function [rank0]: processed_inputs = function(*fn_args, *additional_args, **fn_kwargs) [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank0]: File "/home/wiseatc/LLaMA-Factory/src/llamafactory/data/converter.py", line 94, in __call__ [rank0]: if self.dataset_attr.prompt and example[self.dataset_attr.prompt]: [rank0]: ~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank0]: File "/home/wiseatc/.local/lib/python3.11/site-packages/datasets/formatting/formatting.py", line 278, in __getitem__ [rank0]: value = self.data[key] [rank0]: ~~~~~~~~~^^^^^ [rank0]: KeyError: 'instruction' [rank0]: """ [rank0]: The above exception was the direct cause of the following exception: [rank0]: Traceback (most recent call last): [rank0]: File "/home/wiseatc/LLaMA-Factory/src/llamafactory/launcher.py", line 23, in <module> [rank0]: launch() [rank0]: File "/home/wiseatc/LLaMA-Factory/src/llamafactory/launcher.py", line 19, in launch [rank0]: run_exp() [rank0]: File "/home/wiseatc/LLaMA-Factory/src/llamafactory/train/tuner.py", line 110, in run_exp [rank0]: _training_function(config={"args": args, "callbacks": callbacks}) [rank0]: File "/home/wiseatc/LLaMA-Factory/src/llamafactory/train/tuner.py", line 72, in _training_function [rank0]: run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks) [rank0]: File "/home/wiseatc/LLaMA-Factory/src/llamafactory/train/sft/workflow.py", line 51, in run_sft [rank0]: dataset_module = get_dataset(template, model_args, data_args, training_args, stage="sft", **tokenizer_module) [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank0]: File "/home/wiseatc/LLaMA-Factory/src/llamafactory/data/loader.py", line 304, in get_dataset [rank0]: dataset = _get_merged_dataset(data_args.dataset, model_args, data_args, training_args, stage) [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank0]: File "/home/wiseatc/LLaMA-Factory/src/llamafactory/data/loader.py", line 182, in _get_merged_dataset [rank0]: datasets[dataset_name] = _load_single_dataset(dataset_attr, model_args, data_args, training_args) [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank0]: File "/home/wiseatc/LLaMA-Factory/src/llamafactory/data/loader.py", line 162, in _load_single_dataset [rank0]: return align_dataset(dataset, dataset_attr, data_args, training_args) [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank0]: File "/home/wiseatc/LLaMA-Factory/src/llamafactory/data/converter.py", line 279, in align_dataset [rank0]: return dataset.map( [rank0]: ^^^^^^^^^^^^ [rank0]: File "/home/wiseatc/.local/lib/python3.11/site-packages/datasets/arrow_dataset.py", line 557, in wrapper [rank0]: out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs) [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank0]: File "/home/wiseatc/.local/lib/python3.11/site-packages/datasets/arrow_dataset.py", line 3171, in map [rank0]: for rank, done, content in iflatmap_unordered( [rank0]: File "/home/wiseatc/.local/lib/python3.11/site-packages/datasets/utils/py_utils.py", line 728, in iflatmap_unordered [rank0]: [async_result.get(timeout=0.05) for async_result in async_results] [rank0]: File "/home/wiseatc/.local/lib/python3.11/site-packages/datasets/utils/py_utils.py", line 728, in <listcomp> [rank0]: [async_result.get(timeout=0.05) for async_result in async_results] [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank0]: File "/home/wiseatc/.local/lib/python3.11/site-packages/multiprocess/pool.py", line 774, in get [rank0]: raise self._value [rank0]: KeyError: 'instruction' [rank0]:[W703 16:31:06.912491219 ProcessGroupNCCL.cpp:1479] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator()) W0703 16:31:07.960560 3914856 torch/distributed/elastic/multiprocessing/api.py:900] Sending process 3914916 closing signal SIGTERM W0703 16:31:07.961188 3914856 torch/distributed/elastic/multiprocessing/api.py:900] Sending process 3914917 closing signal SIGTERM W0703 16:31:07.961536 3914856 torch/distributed/elastic/multiprocessing/api.py:900] Sending process 3914918 closing signal SIGTERM E0703 16:31:08.371267 3914856 torch/distributed/elastic/multiprocessing/api.py:874] failed (exitcode: 1) local_rank: 0 (pid: 3914915) of binary: /usr/bin/python3.11 Traceback (most recent call last): File "/usr/local/bin/torchrun", line 8, in <module> sys.exit(main()) ^^^^^^ File "/usr/local/lib/python3.11/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper return f(*args, **kwargs) ^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.11/dist-packages/torch/distributed/run.py", line 892, in main run(args) File "/usr/local/lib/python3.11/dist-packages/torch/distributed/run.py", line 883, in run elastic_launch( File "/usr/local/lib/python3.11/dist-packages/torch/distributed/launcher/api.py", line 139, in __call__ return launch_agent(self._config, self._entrypoint, list(args)) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.11/dist-packages/torch/distributed/launcher/api.py", line 270, in launch_agent raise ChildFailedError( torch.distributed.elastic.multiprocessing.errors.ChildFailedError: ============================================================ /home/wiseatc/LLaMA-Factory/src/llamafactory/launcher.py FAILED ------------------------------------------------------------ Failures: <NO_OTHER_FAILURES> ------------------------------------------------------------ Root Cause (first observed failure): [0]: time : 2025-07-03_16:31:07 host : wiseatc-Super-Server rank : 0 (local_rank: 0) exitcode : 1 (pid: 3914915) error_file: <N/A> traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html ============================================================ Traceback (most recent call last): File "/home/wiseatc/.local/bin/llamafactory-cli", line 8, in <module> sys.exit(main()) ^^^^^^ File "/home/wiseatc/LLaMA-Factory/src/llamafactory/cli.py", line 130, in main process = subprocess.run( ^^^^^^^^^^^^^^^ File "/usr/lib/python3.11/subprocess.py", line 569, in run raise CalledProcessError(retcode, process.args, subprocess.CalledProcessError: Command '['torchrun', '--nnodes', '1', '--node_rank', '0', '--nproc_per_node', '4', '--master_addr', '127.0.0.1', '--master_port', '41919', '/home/wiseatc/LLaMA-Factory/src/llamafactory/launcher.py', 'saves/DeepSeek-R1-1.5B-Distill/lora/train_2025-07-03-16-29-46/training_args.yaml']' returned non-zero exit status 1.
07-04
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值