关于LightGCN的实践——AttributeError: ‘train_thread‘ object has no attribute ‘data‘

在大规模数据中运行GCN时遇到'AttributeError: 'train_thread‘ object has no attribute ‘data‘'的问题。经过排查,发现并非服务器内存或GPU资源问题,而是可能与batch_size及embedding size有关。降低embedding size导致性能下降,提示需要平衡模型复杂度与资源占用。考虑的解决方案包括增加维度、多GPU训练和优化数据处理。初步测试显示,三层64D模型在小数据集上的效果优于四层32D模型。

摘要生成于 C知道 ,由 DeepSeek-R1 满血版支持, 前往体验 >

问题来源:GCN代码在数据量较大的时候出现的问题。本文粉丝可见

就在刚刚过去的2h内发生了两次这个错误,首先猜测是数据量逐渐上来的原因,下面溯源。

具体错误如下:

Training @ 2020-10-15 09:40:09.802495s
Traceback (most recent call last):
  File "/data1/xulm1/LightGCN/video_update_rec_items.py", line 662, in <module>
    _, batch_loss, batch_mf_loss, batch_emb_loss, batch_reg_loss = train_cur.data
AttributeError: 'train_thread' object has no attribute 'data'
/data1/xulm1/LightGCN/utilitys.py:592: RuntimeWarning: divide by zero encountered in power
  d_inv = np.power(rowsum, -1).flatten()
/data1/xulm1/LightGCN/utilitys.py:560: RuntimeWarning: divide by zero encountered in
/home/shuo/VLA/openpi/.venv/lib/python3.11/site-packages/tyro/_parsers.py:332: UserWarning: The field `model.action-expert-variant` is annotated with type `typing.Literal['dummy', 'gemma_300m', 'gemma_2b', 'gemma_2b_lora']`, but the default value `gemma_300m_lora` has type `<class 'str'>`. We'll try to handle this gracefully, but it may cause unexpected behavior. warnings.warn(message) 19:07:30.004 [I] Running on: shuo-hp (10287:train.py:195) INFO:2025-05-12 19:07:30,228:jax._src.xla_bridge:945: Unable to initialize backend 'rocm': module 'jaxlib.xla_extension' has no attribute 'GpuAllocatorConfig' 19:07:30.228 [I] Unable to initialize backend 'rocm': module 'jaxlib.xla_extension' has no attribute 'GpuAllocatorConfig' (10287:xla_bridge.py:945) INFO:2025-05-12 19:07:30,228:jax._src.xla_bridge:945: Unable to initialize backend 'tpu': INTERNAL: Failed to open libtpu.so: libtpu.so: cannot open shared object file: No such file or directory 19:07:30.228 [I] Unable to initialize backend 'tpu': INTERNAL: Failed to open libtpu.so: libtpu.so: cannot open shared object file: No such file or directory (10287:xla_bridge.py:945) 19:07:30.500 [I] Wiped checkpoint directory /home/shuo/VLA/openpi/checkpoints/pi0_ours_aloha/your_experiment_name (10287:checkpoints.py:25) 19:07:30.500 [I] Created BasePyTreeCheckpointHandler: pytree_metadata_options=PyTreeMetadataOptions(support_rich_types=False), array_metadata_store=None (10287:base_pytree_checkpoint_handler.py:332) 19:07:30.500 [I] Created BasePyTreeCheckpointHandler: pytree_metadata_options=PyTreeMetadataOptions(support_rich_types=False), array_metadata_store=None (10287:base_pytree_checkpoint_handler.py:332) 19:07:30.500 [I] [thread=MainThread] Failed to get flag value for EXPERIMENTAL_ORBAX_USE_DISTRIBUTED_PROCESS_ID. (10287:multihost.py:375) 19:07:30.500 [I] [process=0][thread=MainThread] CheckpointManager init: checkpointers=None, item_names=None, item_handlers={'assets': <openpi.training.checkpoints.CallbackHandler object at 0x72e5cae0ff50>, 'train_state': <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x72e5cafa0e90>, 'params': <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x72e5cafa05d0>}, handler_registry=None (10287:checkpoint_manager.py:622) 19:07:30.501 [I] Deferred registration for item: "assets". Adding handler `<openpi.training.checkpoints.CallbackHandler object at 0x72e5cae0ff50>` for item "assets" and save args `<class 'openpi.training.checkpoints.CallbackSave'>` and restore args `<class 'openpi.training.checkpoints.CallbackRestore'>` to `_handler_registry`. (10287:composite_checkpoint_handler.py:239) 19:07:30.501 [I] Deferred registration for item: "train_state". Adding handler `<orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x72e5cafa0e90>` for item "train_state" and save args `<class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeSaveArgs'>` and restore args `<class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeRestoreArgs'>` to `_handler_registry`. (10287:composite_checkpoint_handler.py:239) 19:07:30.501 [I] Deferred registration for item: "params". Adding handler `<orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x72e5cafa05d0>` for item "params" and save args `<class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeSaveArgs'>` and restore args `<class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeRestoreArgs'>` to `_handler_registry`. (10287:composite_checkpoint_handler.py:239) 19:07:30.501 [I] Deferred registration for item: "metrics". Adding handler `<orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x72e5cad7fd10>` for item "metrics" and save args `<class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonSaveArgs'>` and restore args `<class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonRestoreArgs'>` to `_handler_registry`. (10287:composite_checkpoint_handler.py:239) 19:07:30.501 [I] Initialized registry DefaultCheckpointHandlerRegistry({('assets', <class 'openpi.training.checkpoints.CallbackSave'>): <openpi.training.checkpoints.CallbackHandler object at 0x72e5cae0ff50>, ('assets', <class 'openpi.training.checkpoints.CallbackRestore'>): <openpi.training.checkpoints.CallbackHandler object at 0x72e5cae0ff50>, ('train_state', <class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeSaveArgs'>): <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x72e5cafa0e90>, ('train_state', <class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeRestoreArgs'>): <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x72e5cafa0e90>, ('params', <class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeSaveArgs'>): <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x72e5cafa05d0>, ('params', <class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeRestoreArgs'>): <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x72e5cafa05d0>, ('metrics', <class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonSaveArgs'>): <orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x72e5cad7fd10>, ('metrics', <class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonRestoreArgs'>): <orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x72e5cad7fd10>}). (10287:composite_checkpoint_handler.py:508) 19:07:30.501 [I] orbax-checkpoint version: 0.11.1 (10287:abstract_checkpointer.py:35) 19:07:30.501 [I] [process=0][thread=MainThread] Using barrier_sync_fn: <function get_barrier_sync_fn.<locals>.<lambda> at 0x72e5cacb85e0> timeout: 7200 secs and primary_host=0 for async checkpoint writes (10287:async_checkpointer.py:80) 19:07:30.501 [I] Found 0 checkpoint steps in /home/shuo/VLA/openpi/checkpoints/pi0_ours_aloha/your_experiment_name (10287:checkpoint_manager.py:1528) 19:07:30.501 [I] Saving root metadata (10287:checkpoint_manager.py:1569) 19:07:30.501 [I] [process=0][thread=MainThread] Skipping global process sync, barrier name: CheckpointManager:save_metadata (10287:multihost.py:293) 19:07:30.501 [I] [process=0][thread=MainThread] CheckpointManager created, primary_host=0, CheckpointManagerOptions=CheckpointManagerOptions(save_interval_steps=1, max_to_keep=1, keep_time_interval=None, keep_period=5000, should_keep_fn=None, best_fn=None, best_mode='max', keep_checkpoints_without_metrics=True, step_prefix=None, step_format_fixed_length=None, step_name_format=None, create=False, cleanup_tmp_directories=False, save_on_steps=frozenset(), single_host_load_and_broadcast=False, todelete_subdir=None, enable_background_delete=False, read_only=False, enable_async_checkpointing=True, async_options=AsyncOptions(timeout_secs=7200, barrier_sync_fn=None, post_finalization_callback=None, create_directories_asynchronously=False), multiprocessing_options=MultiprocessingOptions(primary_host=0, active_processes=None, barrier_sync_key_prefix=None), should_save_fn=None, file_options=FileOptions(path_permission_mode=None), save_root_metadata=True, temporary_path_class=None, save_decision_policy=None), root_directory=/home/shuo/VLA/openpi/checkpoints/pi0_ours_aloha/your_experiment_name: <orbax.checkpoint.checkpoint_manager.CheckpointManager object at 0x72e5cadffd10> (10287:checkpoint_manager.py:797) 19:07:30.553 [I] Loaded norm stats from s3://openpi-assets/checkpoints/pi0_base/assets/trossen (10287:config.py:166) Returning existing local_dir `/home/shuo/VLA/lerobot/aloha-real-data` as remote repo cannot be accessed in `snapshot_download` (None). 19:07:30.553 [W] Returning existing local_dir `/home/shuo/VLA/lerobot/aloha-real-data` as remote repo cannot be accessed in `snapshot_download` (None). (10287:_snapshot_download.py:213) Returning existing local_dir `/home/shuo/VLA/lerobot/aloha-real-data` as remote repo cannot be accessed in `snapshot_download` (None). 19:07:30.554 [W] Returning existing local_dir `/home/shuo/VLA/lerobot/aloha-real-data` as remote repo cannot be accessed in `snapshot_download` (None). (10287:_snapshot_download.py:213) Returning existing local_dir `/home/shuo/VLA/lerobot/aloha-real-data` as remote repo cannot be accessed in `snapshot_download` (None). 19:07:30.555 [W] Returning existing local_dir `/home/shuo/VLA/lerobot/aloha-real-data` as remote repo cannot be accessed in `snapshot_download` (None). (10287:_snapshot_download.py:213) Traceback (most recent call last): File "/home/shuo/VLA/openpi/scripts/train.py", line 273, in <module> main(_config.cli()) File "/home/shuo/VLA/openpi/scripts/train.py", line 226, in main batch = next(data_iter) ^^^^^^^^^^^^^^^ File "/home/shuo/VLA/openpi/src/openpi/training/data_loader.py", line 177, in __iter__ for batch in self._data_loader: File "/home/shuo/VLA/openpi/src/openpi/training/data_loader.py", line 257, in __iter__ batch = next(data_iter) ^^^^^^^^^^^^^^^ File "/home/shuo/VLA/openpi/.venv/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 708, in __next__ data = self._next_data() ^^^^^^^^^^^^^^^^^ File "/home/shuo/VLA/openpi/.venv/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 1480, in _next_data return self._process_data(data) ^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/shuo/VLA/openpi/.venv/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 1505, in _process_data data.reraise() File "/home/shuo/VLA/openpi/.venv/lib/python3.11/site-packages/torch/_utils.py", line 733, in reraise raise exception KeyError: Caught KeyError in DataLoader worker process 0. Original Traceback (most recent call last): File "/home/shuo/VLA/openpi/.venv/lib/python3.11/site-packages/torch/utils/data/_utils/worker.py", line 349, in _worker_loop data = fetcher.fetch(index) # type: ignore[possibly-undefined] ^^^^^^^^^^^^^^^^^^^^ File "/home/shuo/VLA/openpi/.venv/lib/python3.11/site-packages/torch/utils/data/_utils/fetch.py", line 52, in fetch data = [self.dataset[idx] for idx in possibly_batched_index] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/shuo/VLA/openpi/.venv/lib/python3.11/site-packages/torch/utils/data/_utils/fetch.py", line 52, in <listcomp> data = [self.dataset[idx] for idx in possibly_batched_index] ~~~~~~~~~~~~^^^^^ File "/home/shuo/VLA/openpi/src/openpi/training/data_loader.py", line 47, in __getitem__ return self._transform(self._dataset[index]) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/shuo/VLA/openpi/src/openpi/transforms.py", line 70, in __call__ data = transform(data) ^^^^^^^^^^^^^^^ File "/home/shuo/VLA/openpi/src/openpi/transforms.py", line 101, in __call__ return jax.tree.map(lambda k: flat_item[k], self.structure) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/shuo/VLA/openpi/.venv/lib/python3.11/site-packages/jax/_src/tree.py", line 155, in map return tree_util.tree_map(f, tree, *rest, is_leaf=is_leaf) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/shuo/VLA/openpi/.venv/lib/python3.11/site-packages/jax/_src/tree_util.py", line 358, in tree_map return treedef.unflatten(f(*xs) for xs in zip(*all_leaves)) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/shuo/VLA/openpi/.venv/lib/python3.11/site-packages/jax/_src/tree_util.py", line 358, in <genexpr> return treedef.unflatten(f(*xs) for xs in zip(*all_leaves)) ^^^^^^ File "/home/shuo/VLA/openpi/src/openpi/transforms.py", line 101, in <lambda> return jax.tree.map(lambda k: flat_item[k], self.structure) ~~~~~~~~~^^^ KeyError: 'observation.images.cam_low'
05-13
import qlib from qlib.constant import REG_CN from qlib.data import D provider_uri=r"E:\aaa___qlib_project\data_bin1" qlib.init(provider_uri=provider_uri,region=REG_CN) D.instruments('601000.SH') instruments = ["601000.SH"] data = D.features(instruments, fields=['$open', '$high', '$low', '$close', '$vol'], start_time='2024-10-25', end_time='2025-7-17') print(data) 执行出错 %runfile E:/aaa___qlib_project/prepare_for_data_1.py --wdir [46124:MainThread](2025-07-23 00:22:25,403) INFO - qlib.Initialization - [config.py:420] - default_conf: client. [46124:MainThread](2025-07-23 00:22:25,405) INFO - qlib.Initialization - [__init__.py:74] - qlib successfully initialized based on client settings. [46124:MainThread](2025-07-23 00:22:25,406) INFO - qlib.Initialization - [__init__.py:76] - data_path={'__DEFAULT_FREQ': WindowsPath('E:/aaa___qlib_project/data_bin1')} --------------------------------------------------------------------------- TypeError Traceback (most recent call last) File D:\anaconda\envs\qlib_liang_hua_touzi\Lib\site-packages\qlib\data\data.py:1186, in BaseProvider.features(self, instruments, fields, start_time, end_time, freq, disk_cache, inst_processors) 1185 try: -> 1186 return DatasetD.dataset( 1187 instruments, fields, start_time, end_time, freq, disk_cache, inst_processors=inst_processors 1188 ) 1189 except TypeError: TypeError: LocalDatasetProvider.dataset() got multiple values for argument 'inst_processors' During handling of the above exception, another exception occurred: AttributeError Traceback (most recent call last) File e:\aaa___qlib_project\prepare_for_data_1.py:28 26 D.instruments('601000.SH') 27 instruments = ["601000.SH"] ---> 28 data = D.features(instruments, fields=['$open', '$high', '$low', '$close', '$vol'], start_time='2024-10-25', end_time='2025-7-17') 29 print(data) File D:\anaconda\envs\qlib_liang_hua_touzi\Lib\site-packages\qlib\data\data.py:1190, in BaseProvider.features(self, instruments, fields, start_time, end_time, freq, disk_cache, inst_processors) 1186 return DatasetD.dataset( 1187 instruments, fields, start_time, end_time, freq, disk_cache, inst_processors=inst_processors 1188 ) 1189 except TypeError: -> 1190 return DatasetD.dataset(instruments, fields, start_time, end_time, freq, inst_processors=inst_processors) File D:\anaconda\envs\qlib_liang_hua_touzi\Lib\site-packages\qlib\data\data.py:923, in LocalDatasetProvider.dataset(self, instruments, fields, start_time, end_time, freq, inst_processors) 921 start_time = cal[0] 922 end_time = cal[-1] --> 923 data = self.dataset_processor( 924 instruments_d, column_names, start_time, end_time, freq, inst_processors=inst_processors 925 ) 927 return data File D:\anaconda\envs\qlib_liang_hua_touzi\Lib\site-packages\qlib\data\data.py:577, in DatasetProvider.dataset_processor(instruments_d, column_names, start_time, end_time, freq, inst_processors) 567 inst_l.append(inst) 568 task_l.append( 569 delayed(DatasetProvider.inst_calculator)( 570 inst, start_time, end_time, freq, normalize_column_names, spans, C, inst_processors 571 ) 572 ) 574 data = dict( 575 zip( 576 inst_l, --> 577 ParallelExt(n_jobs=workers, backend=C.joblib_backend, maxtasksperchild=C.maxtasksperchild)(task_l), 578 ) 579 ) 581 new_data = dict() 582 for inst in sorted(data.keys()): File D:\anaconda\envs\qlib_liang_hua_touzi\Lib\site-packages\qlib\utils\paral.py:24, in ParallelExt.__init__(self, *args, **kwargs) 22 super(ParallelExt, self).__init__(*args, **kwargs) 23 if isinstance(self._backend, MultiprocessingBackend): ---> 24 self._backend_args["maxtasksperchild"] = maxtasksperchild AttributeError: 'ParallelExt' object has no attribute '_backend_args'
最新发布
07-24
评论 2
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包

打赏作者

小李飞刀李寻欢

您的欣赏将是我奋斗路上的动力!

¥1 ¥2 ¥4 ¥6 ¥10 ¥20
扫码支付:¥1
获取中
扫码支付

您的余额不足,请更换扫码支付或充值

打赏作者

实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值