/home/shuo/VLA/openpi/.venv/lib/python3.11/site-packages/tyro/_parsers.py:332: UserWarning: The field `model.action-expert-variant` is annotated with type `typing.Literal['dummy', 'gemma_300m', 'gemma_2b', 'gemma_2b_lora']`, but the default value `gemma_300m_lora` has type `<class 'str'>`. We'll try to handle this gracefully, but it may cause unexpected behavior.
  warnings.warn(message)
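The `UserWarning` above is tyro reporting that the default `gemma_300m_lora` is not one of the options listed in the `Literal` annotation of `model.action-expert-variant`. The run does not stop here; it continues and fails later in the data loader. The minimal sketch below reproduces only the membership check behind the warning, using the option list printed in the warning and the standard-library `typing.get_args`. `ActionExpertVariant` and `is_valid_variant` are hypothetical names, and tyro's real validation is more involved than this.

```python
from typing import Literal, get_args

# Hypothetical alias: the options are copied from the tyro warning in the log.
ActionExpertVariant = Literal["dummy", "gemma_300m", "gemma_2b", "gemma_2b_lora"]


def is_valid_variant(value: str) -> bool:
    """Return True if `value` is one of the values allowed by the Literal."""
    return value in get_args(ActionExpertVariant)


print(is_valid_variant("gemma_300m"))       # True
print(is_valid_variant("gemma_300m_lora"))  # False: the default named in the warning
```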
19:07:30.004 [I] Running on: shuo-hp (10287:train.py:195)
INFO:2025-05-12 19:07:30,228:jax._src.xla_bridge:945: Unable to initialize backend 'rocm': module 'jaxlib.xla_extension' has no attribute 'GpuAllocatorConfig'
19:07:30.228 [I] Unable to initialize backend 'rocm': module 'jaxlib.xla_extension' has no attribute 'GpuAllocatorConfig' (10287:xla_bridge.py:945)
INFO:2025-05-12 19:07:30,228:jax._src.xla_bridge:945: Unable to initialize backend 'tpu': INTERNAL: Failed to open libtpu.so: libtpu.so: cannot open shared object file: No such file or directory
19:07:30.228 [I] Unable to initialize backend 'tpu': INTERNAL: Failed to open libtpu.so: libtpu.so: cannot open shared object file: No such file or directory (10287:xla_bridge.py:945)
19:07:30.500 [I] Wiped checkpoint directory /home/shuo/VLA/openpi/checkpoints/pi0_ours_aloha/your_experiment_name (10287:checkpoints.py:25)
19:07:30.500 [I] Created BasePyTreeCheckpointHandler: pytree_metadata_options=PyTreeMetadataOptions(support_rich_types=False), array_metadata_store=None (10287:base_pytree_checkpoint_handler.py:332)
19:07:30.500 [I] Created BasePyTreeCheckpointHandler: pytree_metadata_options=PyTreeMetadataOptions(support_rich_types=False), array_metadata_store=None (10287:base_pytree_checkpoint_handler.py:332)
19:07:30.500 [I] [thread=MainThread] Failed to get flag value for EXPERIMENTAL_ORBAX_USE_DISTRIBUTED_PROCESS_ID. (10287:multihost.py:375)
19:07:30.500 [I] [process=0][thread=MainThread] CheckpointManager init: checkpointers=None, item_names=None, item_handlers={'assets': <openpi.training.checkpoints.CallbackHandler object at 0x72e5cae0ff50>, 'train_state': <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x72e5cafa0e90>, 'params': <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x72e5cafa05d0>}, handler_registry=None (10287:checkpoint_manager.py:622)
19:07:30.501 [I] Deferred registration for item: "assets". Adding handler `<openpi.training.checkpoints.CallbackHandler object at 0x72e5cae0ff50>` for item "assets" and save args `<class 'openpi.training.checkpoints.CallbackSave'>` and restore args `<class 'openpi.training.checkpoints.CallbackRestore'>` to `_handler_registry`. (10287:composite_checkpoint_handler.py:239)
19:07:30.501 [I] Deferred registration for item: "train_state". Adding handler `<orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x72e5cafa0e90>` for item "train_state" and save args `<class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeSaveArgs'>` and restore args `<class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeRestoreArgs'>` to `_handler_registry`. (10287:composite_checkpoint_handler.py:239)
19:07:30.501 [I] Deferred registration for item: "params". Adding handler `<orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x72e5cafa05d0>` for item "params" and save args `<class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeSaveArgs'>` and restore args `<class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeRestoreArgs'>` to `_handler_registry`. (10287:composite_checkpoint_handler.py:239)
19:07:30.501 [I] Deferred registration for item: "metrics". Adding handler `<orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x72e5cad7fd10>` for item "metrics" and save args `<class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonSaveArgs'>` and restore args `<class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonRestoreArgs'>` to `_handler_registry`. (10287:composite_checkpoint_handler.py:239)
19:07:30.501 [I] Initialized registry DefaultCheckpointHandlerRegistry({('assets', <class 'openpi.training.checkpoints.CallbackSave'>): <openpi.training.checkpoints.CallbackHandler object at 0x72e5cae0ff50>, ('assets', <class 'openpi.training.checkpoints.CallbackRestore'>): <openpi.training.checkpoints.CallbackHandler object at 0x72e5cae0ff50>, ('train_state', <class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeSaveArgs'>): <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x72e5cafa0e90>, ('train_state', <class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeRestoreArgs'>): <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x72e5cafa0e90>, ('params', <class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeSaveArgs'>): <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x72e5cafa05d0>, ('params', <class 'orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeRestoreArgs'>): <orbax.checkpoint._src.handlers.pytree_checkpoint_handler.PyTreeCheckpointHandler object at 0x72e5cafa05d0>, ('metrics', <class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonSaveArgs'>): <orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x72e5cad7fd10>, ('metrics', <class 'orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonRestoreArgs'>): <orbax.checkpoint._src.handlers.json_checkpoint_handler.JsonCheckpointHandler object at 0x72e5cad7fd10>}). (10287:composite_checkpoint_handler.py:508)
19:07:30.501 [I] orbax-checkpoint version: 0.11.1 (10287:abstract_checkpointer.py:35)
19:07:30.501 [I] [process=0][thread=MainThread] Using barrier_sync_fn: <function get_barrier_sync_fn.<locals>.<lambda> at 0x72e5cacb85e0> timeout: 7200 secs and primary_host=0 for async checkpoint writes (10287:async_checkpointer.py:80)
19:07:30.501 [I] Found 0 checkpoint steps in /home/shuo/VLA/openpi/checkpoints/pi0_ours_aloha/your_experiment_name (10287:checkpoint_manager.py:1528)
19:07:30.501 [I] Saving root metadata (10287:checkpoint_manager.py:1569)
19:07:30.501 [I] [process=0][thread=MainThread] Skipping global process sync, barrier name: CheckpointManager:save_metadata (10287:multihost.py:293)
19:07:30.501 [I] [process=0][thread=MainThread] CheckpointManager created, primary_host=0, CheckpointManagerOptions=CheckpointManagerOptions(save_interval_steps=1, max_to_keep=1, keep_time_interval=None, keep_period=5000, should_keep_fn=None, best_fn=None, best_mode='max', keep_checkpoints_without_metrics=True, step_prefix=None, step_format_fixed_length=None, step_name_format=None, create=False, cleanup_tmp_directories=False, save_on_steps=frozenset(), single_host_load_and_broadcast=False, todelete_subdir=None, enable_background_delete=False, read_only=False, enable_async_checkpointing=True, async_options=AsyncOptions(timeout_secs=7200, barrier_sync_fn=None, post_finalization_callback=None, create_directories_asynchronously=False), multiprocessing_options=MultiprocessingOptions(primary_host=0, active_processes=None, barrier_sync_key_prefix=None), should_save_fn=None, file_options=FileOptions(path_permission_mode=None), save_root_metadata=True, temporary_path_class=None, save_decision_policy=None), root_directory=/home/shuo/VLA/openpi/checkpoints/pi0_ours_aloha/your_experiment_name: <orbax.checkpoint.checkpoint_manager.CheckpointManager object at 0x72e5cadffd10> (10287:checkpoint_manager.py:797)
19:07:30.553 [I] Loaded norm stats from s3://openpi-assets/checkpoints/pi0_base/assets/trossen (10287:config.py:166)
Returning existing local_dir `/home/shuo/VLA/lerobot/aloha-real-data` as remote repo cannot be accessed in `snapshot_download` (None).
19:07:30.553 [W] Returning existing local_dir `/home/shuo/VLA/lerobot/aloha-real-data` as remote repo cannot be accessed in `snapshot_download` (None). (10287:_snapshot_download.py:213)
Returning existing local_dir `/home/shuo/VLA/lerobot/aloha-real-data` as remote repo cannot be accessed in `snapshot_download` (None).
19:07:30.554 [W] Returning existing local_dir `/home/shuo/VLA/lerobot/aloha-real-data` as remote repo cannot be accessed in `snapshot_download` (None). (10287:_snapshot_download.py:213)
Returning existing local_dir `/home/shuo/VLA/lerobot/aloha-real-data` as remote repo cannot be accessed in `snapshot_download` (None).
19:07:30.555 [W] Returning existing local_dir `/home/shuo/VLA/lerobot/aloha-real-data` as remote repo cannot be accessed in `snapshot_download` (None). (10287:_snapshot_download.py:213)
Traceback (most recent call last):
  File "/home/shuo/VLA/openpi/scripts/train.py", line 273, in <module>
    main(_config.cli())
  File "/home/shuo/VLA/openpi/scripts/train.py", line 226, in main
    batch = next(data_iter)
            ^^^^^^^^^^^^^^^
  File "/home/shuo/VLA/openpi/src/openpi/training/data_loader.py", line 177, in __iter__
    for batch in self._data_loader:
  File "/home/shuo/VLA/openpi/src/openpi/training/data_loader.py", line 257, in __iter__
    batch = next(data_iter)
            ^^^^^^^^^^^^^^^
  File "/home/shuo/VLA/openpi/.venv/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 708, in __next__
    data = self._next_data()
           ^^^^^^^^^^^^^^^^^
  File "/home/shuo/VLA/openpi/.venv/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 1480, in _next_data
    return self._process_data(data)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/shuo/VLA/openpi/.venv/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 1505, in _process_data
    data.reraise()
  File "/home/shuo/VLA/openpi/.venv/lib/python3.11/site-packages/torch/_utils.py", line 733, in reraise
    raise exception
KeyError: Caught KeyError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/shuo/VLA/openpi/.venv/lib/python3.11/site-packages/torch/utils/data/_utils/worker.py", line 349, in _worker_loop
    data = fetcher.fetch(index) # type: ignore[possibly-undefined]
           ^^^^^^^^^^^^^^^^^^^^
  File "/home/shuo/VLA/openpi/.venv/lib/python3.11/site-packages/torch/utils/data/_utils/fetch.py", line 52, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/shuo/VLA/openpi/.venv/lib/python3.11/site-packages/torch/utils/data/_utils/fetch.py", line 52, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
            ~~~~~~~~~~~~^^^^^
  File "/home/shuo/VLA/openpi/src/openpi/training/data_loader.py", line 47, in __getitem__
    return self._transform(self._dataset[index])
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/shuo/VLA/openpi/src/openpi/transforms.py", line 70, in __call__
    data = transform(data)
           ^^^^^^^^^^^^^^^
  File "/home/shuo/VLA/openpi/src/openpi/transforms.py", line 101, in __call__
    return jax.tree.map(lambda k: flat_item[k], self.structure)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/shuo/VLA/openpi/.venv/lib/python3.11/site-packages/jax/_src/tree.py", line 155, in map
    return tree_util.tree_map(f, tree, *rest, is_leaf=is_leaf)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/shuo/VLA/openpi/.venv/lib/python3.11/site-packages/jax/_src/tree_util.py", line 358, in tree_map
    return treedef.unflatten(f(*xs) for xs in zip(*all_leaves))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/shuo/VLA/openpi/.venv/lib/python3.11/site-packages/jax/_src/tree_util.py", line 358, in <genexpr>
    return treedef.unflatten(f(*xs) for xs in zip(*all_leaves))
                             ^^^^^^
  File "/home/shuo/VLA/openpi/src/openpi/transforms.py", line 101, in <lambda>
    return jax.tree.map(lambda k: flat_item[k], self.structure)
                                  ~~~~~~~~~^^^
KeyError: 'observation.images.cam_low'
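The run fails on the first `next(data_iter)`: a DataLoader worker applies the transform at `transforms.py` line 101 (which appears to be the repack step), and that transform looks up every source key named in its structure with `flat_item[k]`. The sample loaded from `/home/shuo/VLA/lerobot/aloha-real-data` has no `observation.images.cam_low` entry, so the lookup raises `KeyError`. The minimal sketch below mimics that lookup in plain Python so that missing keys can be listed before training starts. The helper names `flatten` and `missing_repack_keys`, the `/` separator, and the example `structure` and `sample` are hypothetical stand-ins, not openpi's API. In practice `sample` would be one item of the LeRobot dataset and `structure` the mapping given to the repack transform in the training config.

```python
from collections.abc import Mapping
from typing import Any


def flatten(tree: Mapping[str, Any], sep: str = "/", prefix: str = "") -> dict[str, Any]:
    """Flatten a nested dict into {"a/b": leaf} form (the "/" separator is an assumption)."""
    flat: dict[str, Any] = {}
    for key, value in tree.items():
        path = f"{prefix}{sep}{key}" if prefix else key
        if isinstance(value, Mapping):
            flat.update(flatten(value, sep, path))
        else:
            flat[path] = value
    return flat


def missing_repack_keys(structure: Mapping[str, Any], sample: Mapping[str, Any]) -> list[str]:
    """Return the source keys named in `structure` that `sample` does not provide."""
    flat_sample = flatten(sample)
    # Each leaf of the structure is the name of a key to pull out of the sample.
    wanted = flatten(structure).values()
    return [key for key in wanted if key not in flat_sample]


# Hypothetical mapping and sample: the sample lacks the low camera, as in the log.
structure = {
    "images": {
        "cam_high": "observation.images.cam_high",
        "cam_low": "observation.images.cam_low",
    },
    "state": "observation.state",
    "actions": "action",
}
sample = {
    "observation.images.cam_high": "<image>",
    "observation.state": [0.0] * 14,
    "action": [0.0] * 14,
}

print(missing_repack_keys(structure, sample))  # ['observation.images.cam_low']
```

If the list is non-empty, the mapping names a camera or field that the dataset does not contain, for example because the dataset was recorded without that camera or converted with different key names. Making the two agree should clear this `KeyError`.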