(base) root@74fb9740dd84:/workspace/data/CH4/01.train# dp train input.json
2025-11-25 05:58:04.052073: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2025-11-25 05:58:04.057367: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:467] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
E0000 00:00:1764050284.064012 710 cuda_dnn.cc:8579] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1764050284.066338 710 cuda_blas.cc:1407] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
W0000 00:00:1764050284.071566 710 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1764050284.071581 710 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1764050284.071583 710 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1764050284.071584 710 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
2025-11-25 05:58:04.073369: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX_VNNI AVX_VNNI_INT8 AVX_NE_CONVERT, in other operations, rebuild TensorFlow with the appropriate compiler flags.
To get the best performance, it is recommended to adjust the number of threads by setting the environment variables OMP_NUM_THREADS, DP_INTRA_OP_PARALLELISM_THREADS, and DP_INTER_OP_PARALLELISM_THREADS. See https://deepmd.rtfd.io/parallelism/ for more information.
DeePMD-kit: Successfully load libcudart.so.12
[2025-11-25 05:58:08,701] DEEPMD INFO Calculate neighbor statistics... (add --skip-neighbor-stat to skip this step)
[2025-11-25 05:58:08,737] DEEPMD INFO If you encounter the error 'an illegal memory access was encountered', this may be due to a TensorFlow issue. To avoid this, set the environment variable DP_INFER_BATCH_SIZE to a smaller value than the last adjusted batch size. The environment variable DP_INFER_BATCH_SIZE controls the inference batch size (nframes * natoms).
[2025-11-25 05:58:09,883] DEEPMD INFO Neighbor statistics: training data with minimal neighbor distance: 1.042950
[2025-11-25 05:58:09,883] DEEPMD INFO Neighbor statistics: training data with maximum neighbor size: [4 1] (cutoff radius: 6.000000)
[2025-11-25 05:58:09,902] DEEPMD INFO     _____               _____   __  __  _____           _     _  _
[2025-11-25 05:58:09,902] DEEPMD INFO    |  __ \             |  __ \ |  \/  ||  __ \         | |   (_)| |
[2025-11-25 05:58:09,902] DEEPMD INFO    | |  | |  ___   ___ | |__) || \  / || |  | | ______ | | __ _ | |_
[2025-11-25 05:58:09,902] DEEPMD INFO    | |  | | / _ \ / _ \|  ___/ | |\/| || |  | ||______|| |/ /| || __|
[2025-11-25 05:58:09,902] DEEPMD INFO    | |__| ||  __/|  __/| |     | |  | || |__| |        |   < | || |_
[2025-11-25 05:58:09,902] DEEPMD INFO    |_____/  \___| \___||_|     |_|  |_||_____/         |_|\_\|_| \__|
[2025-11-25 05:58:09,902] DEEPMD INFO Please read and cite:
[2025-11-25 05:58:09,902] DEEPMD INFO Wang, Zhang, Han and E, Comput.Phys.Comm. 228, 178-184 (2018)
[2025-11-25 05:58:09,902] DEEPMD INFO Zeng et al, J. Chem. Phys., 159, 054801 (2023)
[2025-11-25 05:58:09,902] DEEPMD INFO Zeng et al, J. Chem. Theory Comput., 21, 4375-4385 (2025)
[2025-11-25 05:58:09,902] DEEPMD INFO See https://deepmd.rtfd.io/credits/ for details.
[2025-11-25 05:58:09,902] DEEPMD INFO ---------------------------------------------------------------------------------------
[2025-11-25 05:58:09,902] DEEPMD INFO installed to: /opt/deepmd-kit/lib/python3.12/site-packages/deepmd
[2025-11-25 05:58:09,902] DEEPMD INFO source:
[2025-11-25 05:58:09,902] DEEPMD INFO source branch: HEAD
[2025-11-25 05:58:09,902] DEEPMD INFO source commit: eeadafb
[2025-11-25 05:58:09,902] DEEPMD INFO source commit at: 2025-11-05 14:55:36 +0100
[2025-11-25 05:58:09,902] DEEPMD INFO use float prec: double
[2025-11-25 05:58:09,902] DEEPMD INFO build variant: cuda
[2025-11-25 05:58:09,902] DEEPMD INFO Backend: TensorFlow
[2025-11-25 05:58:09,902] DEEPMD INFO TF ver: unknown
[2025-11-25 05:58:09,902] DEEPMD INFO build with TF ver: 2.19.1
[2025-11-25 05:58:09,902] DEEPMD INFO build with TF inc: /opt/deepmd-kit/lib/python3.12/site-packages/tensorflow/include/
[2025-11-25 05:58:09,902] DEEPMD INFO /opt/deepmd-kit/include
[2025-11-25 05:58:09,902] DEEPMD INFO build with TF lib:
[2025-11-25 05:58:09,902] DEEPMD INFO running on: 74fb9740dd84
[2025-11-25 05:58:09,902] DEEPMD INFO computing device: gpu:0
[2025-11-25 05:58:09,902] DEEPMD INFO CUDA_VISIBLE_DEVICES: unset
[2025-11-25 05:58:09,902] DEEPMD INFO Count of visible GPUs: 1
[2025-11-25 05:58:09,902] DEEPMD INFO num_intra_threads: 0
[2025-11-25 05:58:09,902] DEEPMD INFO num_inter_threads: 0
[2025-11-25 05:58:09,902] DEEPMD INFO ---------------------------------------------------------------------------------------
[2025-11-25 05:58:09,930] DEEPMD INFO ---Summary of DataSystem: training -----------------------------------------------
[2025-11-25 05:58:09,930] DEEPMD INFO found 1 system(s):
[2025-11-25 05:58:09,930] DEEPMD INFO system natoms bch_sz n_bch prob pbc
[2025-11-25 05:58:09,930] DEEPMD INFO ../00.data/training_data 5 7 22 1.000e+00 T
[2025-11-25 05:58:09,930] DEEPMD INFO --------------------------------------------------------------------------------------
[2025-11-25 05:58:09,953] DEEPMD INFO ---Summary of DataSystem: validation -----------------------------------------------
[2025-11-25 05:58:09,953] DEEPMD INFO found 1 system(s):
[2025-11-25 05:58:09,953] DEEPMD INFO system natoms bch_sz n_bch prob pbc
[2025-11-25 05:58:09,953] DEEPMD INFO ../00.data/validation_data 5 7 5 1.000e+00 T
[2025-11-25 05:58:09,953] DEEPMD INFO --------------------------------------------------------------------------------------
[2025-11-25 05:58:09,953] DEEPMD INFO training without frame parameter
[2025-11-25 05:58:09,953] DEEPMD INFO data stating... (this step may take long time)
[2025-11-25 05:58:10,081] DEEPMD INFO built lr
[2025-11-25 05:58:10,427] DEEPMD INFO built network
[2025-11-25 05:58:10,890] DEEPMD INFO built training
[2025-11-25 05:58:10,891] DEEPMD WARNING To get the best performance, it is recommended to adjust the number of threads by setting the environment variables OMP_NUM_THREADS, DP_INTRA_OP_PARALLELISM_THREADS, and DP_INTER_OP_PARALLELISM_THREADS. See https://deepmd.rtfd.io/parallelism/ for more information.
[2025-11-25 05:58:10,911] DEEPMD INFO initialize model from scratch
[2025-11-25 05:58:11,269] DEEPMD INFO start training at lr 1.00e-03 (== 1.00e-03), decay_step 5000, decay_rate 0.950006, final lr will be 3.51e-08
[2025-11-25 05:58:11,683] DEEPMD INFO batch 0: trn: rmse = 1.13e+01, rmse_e = 7.04e-01, rmse_f = 3.56e-01, lr = 1.00e-03
[2025-11-25 05:58:11,683] DEEPMD INFO batch 0: val: rmse = 1.42e+01, rmse_e = 7.06e-01, rmse_f = 4.50e-01
[2025-11-25 05:58:37,056] DEEPMD INFO batch 1000: trn: rmse = 4.89e+00, rmse_e = 3.31e-01, rmse_f = 1.55e-01, lr = 1.00e-03
[2025-11-25 05:58:37,056] DEEPMD INFO batch 1000: val: rmse = 4.17e+00, rmse_e = 3.32e-01, rmse_f = 1.32e-01
[2025-11-25 05:58:37,057] DEEPMD INFO batch 1000: total wall time = 25.79 s
[2025-11-25 05:59:01,438] DEEPMD INFO batch 2000: trn: rmse = 4.16e+00, rmse_e = 1.98e-02, rmse_f = 1.31e-01, lr = 1.00e-03
[2025-11-25 05:59:01,438] DEEPMD INFO batch 2000: val: rmse = 3.85e+00, rmse_e = 2.04e-02, rmse_f = 1.22e-01
[2025-11-25 05:59:01,438] DEEPMD INFO batch 2000: total wall time = 24.38 s
[2025-11-25 05:59:22,714] DEEPMD INFO batch 3000: trn: rmse = 4.73e+00, rmse_e = 7.67e-02, rmse_f = 1.50e-01, lr = 1.00e-03
[2025-11-25 05:59:22,714] DEEPMD INFO batch 3000: val: rmse = 3.76e+00, rmse_e = 7.63e-02, rmse_f = 1.19e-01
[2025-11-25 05:59:22,714] DEEPMD INFO batch 3000: total wall time = 21.28 s
[2025-11-25 05:59:46,195] DEEPMD INFO batch 4000: trn: rmse = 5.26e+00, rmse_e = 2.37e-02, rmse_f = 1.66e-01, lr = 1.00e-03
[2025-11-25 05:59:46,195] DEEPMD INFO batch 4000: val: rmse = 3.79e+00, rmse_e = 2.42e-02, rmse_f = 1.20e-01
[2025-11-25 05:59:46,195] DEEPMD INFO batch 4000: total wall time = 23.48 s
[2025-11-25 06:00:09,875] DEEPMD INFO batch 5000: trn: rmse = 4.22e+00, rmse_e = 4.11e-02, rmse_f = 1.37e-01, lr = 9.50e-04
[2025-11-25 06:00:09,876] DEEPMD INFO batch 5000: val: rmse = 4.09e+00, rmse_e = 4.09e-02, rmse_f = 1.33e-01
[2025-11-25 06:00:09,876] DEEPMD INFO batch 5000: total wall time = 23.68 s
[2025-11-25 06:00:33,788] DEEPMD INFO batch 6000: trn: rmse = 3.64e+00, rmse_e = 2.27e-02, rmse_f = 1.18e-01, lr = 9.50e-04
[2025-11-25 06:00:33,788] DEEPMD INFO batch 6000: val: rmse = 3.24e+00, rmse_e = 2.27e-02, rmse_f = 1.05e-01
[2025-11-25 06:00:33,789] DEEPMD INFO batch 6000: total wall time = 23.91 s
[2025-11-25 06:00:55,546] DEEPMD INFO batch 7000: trn: rmse = 1.47e+01, rmse_e = 6.75e+00, rmse_f = 4.60e-01, lr = 9.50e-04
[2025-11-25 06:00:55,547] DEEPMD INFO batch 7000: val: rmse = 1.39e+01, rmse_e = 6.75e+00, rmse_f = 4.33e-01
[2025-11-25 06:00:55,547] DEEPMD INFO batch 7000: total wall time = 21.76 s
[2025-11-25 06:01:19,881] DEEPMD INFO batch 8000: trn: rmse = 1.53e+01, rmse_e = 6.75e+00, rmse_f = 4.80e-01, lr = 9.50e-04
[2025-11-25 06:01:19,881] DEEPMD INFO batch 8000: val: rmse = 1.51e+01, rmse_e = 6.75e+00, rmse_f = 4.73e-01
[2025-11-25 06:01:19,881] DEEPMD INFO batch 8000: total wall time = 24.33 s
[2025-11-25 06:01:44,351] DEEPMD INFO batch 9000: trn: rmse = 1.26e+01, rmse_e = 6.74e+00, rmse_f = 3.87e-01, lr = 9.50e-04
[2025-11-25 06:01:44,351] DEEPMD INFO batch 9000: val: rmse = 1.54e+01, rmse_e = 6.75e+00, rmse_f = 4.82e-01
[2025-11-25 06:01:44,351] DEEPMD INFO batch 9000: total wall time = 24.47 s
[2025-11-25 06:02:06,170] DEEPMD INFO batch 10000: trn: rmse = 1.55e+01, rmse_e = 6.75e+00, rmse_f = 4.87e-01, lr = 9.03e-04
[2025-11-25 06:02:06,171] DEEPMD INFO batch 10000: val: rmse = 1.35e+01, rmse_e = 6.75e+00, rmse_f = 4.14e-01
[2025-11-25 06:02:06,171] DEEPMD INFO batch 10000: total wall time = 21.82 s
[2025-11-25 06:02:06,286] DEEPMD INFO saved checkpoint model.ckpt
[2025-11-25 06:02:30,824] DEEPMD INFO batch 11000: trn: rmse = 1.18e+01, rmse_e = 6.74e+00, rmse_f = 3.55e-01, lr = 9.03e-04
[2025-11-25 06:02:30,825] DEEPMD INFO batch 11000: val: rmse = 1.37e+01, rmse_e = 6.75e+00, rmse_f = 4.25e-01
[2025-11-25 06:02:30,825] DEEPMD INFO batch 11000: total wall time = 24.65 s
[2025-11-25 06:02:55,313] DEEPMD INFO batch 12000: trn: rmse = 1.50e+01, rmse_e = 6.75e+00, rmse_f = 4.70e-01, lr = 9.03e-04
[2025-11-25 06:02:55,313] DEEPMD INFO batch 12000: val: rmse = 1.43e+01, rmse_e = 6.75e+00, rmse_f = 4.45e-01
[2025-11-25 06:02:55,313] DEEPMD INFO batch 12000: total wall time = 24.49 s
[2025-11-25 06:03:19,394] DEEPMD INFO batch 13000: trn: rmse = 1.41e+01, rmse_e = 6.75e+00, rmse_f = 4.38e-01, lr = 9.03e-04
[2025-11-25 06:03:19,394] DEEPMD INFO batch 13000: val: rmse = 1.52e+01, rmse_e = 6.75e+00, rmse_f = 4.77e-01
[2025-11-25 06:03:19,394] DEEPMD INFO batch 13000: total wall time = 24.08 s
[2025-11-25 06:03:41,154] DEEPMD INFO batch 14000: trn: rmse = 1.45e+01, rmse_e = 6.75e+00, rmse_f = 4.51e-01, lr = 9.03e-04
[2025-11-25 06:03:41,154] DEEPMD INFO batch 14000: val: rmse = 1.54e+01, rmse_e = 6.75e+00, rmse_f = 4.83e-01
[2025-11-25 06:03:41,155] DEEPMD INFO batch 14000: total wall time = 21.76 s
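
The log above can be checked mechanically. The following is a minimal sketch, not part of DeePMD-kit: the helper names `parse_log`, `staircase_lr`, and `find_jumps` are hypothetical, and the line format is assumed to match the `batch N: trn:` / `batch N: val:` lines shown above. The staircase form of the learning-rate schedule is an inference from the logged values (lr stays at 1.00e-03 through batch 4000 and drops to 9.50e-04 at batch 5000), using only `start_lr`, `decay_step`, and `decay_rate` as printed in the "start training" line.

```python
import re

# Matches e.g. "... batch 1000: trn: rmse = 4.89e+00, rmse_e = 3.31e-01, ..."
LINE_RE = re.compile(r"batch\s+(\d+):\s+(trn|val):\s+(.*)$")
# Matches "name = value" pairs such as "rmse_e = 3.31e-01"
PAIR_RE = re.compile(r"(\w+)\s*=\s*([-+0-9.eE]+)")


def parse_log(lines):
    """Return {(batch, kind): {metric: value}} for every trn/val line."""
    records = {}
    for line in lines:
        m = LINE_RE.search(line)
        if m is None:
            continue  # startup messages, wall-time lines, and so on
        batch, kind, rest = int(m.group(1)), m.group(2), m.group(3)
        records[(batch, kind)] = {k: float(v) for k, v in PAIR_RE.findall(rest)}
    return records


def staircase_lr(step, start_lr, decay_steps, decay_rate):
    """Learning rate that is multiplied by decay_rate every decay_steps steps.

    Assumption: the staircase form is inferred from the logged lr values.
    """
    return start_lr * decay_rate ** (step // decay_steps)


def find_jumps(records, metric="rmse_e", kind="val", factor=10.0):
    """Return batches where `metric` grew by more than `factor` since the previous entry."""
    batches = sorted(b for (b, k) in records if k == kind)
    jumps = []
    for prev, cur in zip(batches, batches[1:]):
        if records[(cur, kind)][metric] > factor * records[(prev, kind)][metric]:
            jumps.append(cur)
    return jumps


# Three lines copied from the log above.
SAMPLE = [
    "[2025-11-25 06:00:09,875] DEEPMD INFO batch 5000: trn: rmse = 4.22e+00, rmse_e = 4.11e-02, rmse_f = 1.37e-01, lr = 9.50e-04",
    "[2025-11-25 06:00:33,788] DEEPMD INFO batch 6000: val: rmse = 3.24e+00, rmse_e = 2.27e-02, rmse_f = 1.05e-01",
    "[2025-11-25 06:00:55,547] DEEPMD INFO batch 7000: val: rmse = 1.39e+01, rmse_e = 6.75e+00, rmse_f = 4.33e-01",
]

if __name__ == "__main__":
    records = parse_log(SAMPLE)
    print("jumps in val rmse_e:", find_jumps(records))
    print("expected lr at 5000: %.2e" % staircase_lr(5000, 1e-3, 5000, 0.950006))
```

On the three sample lines, `find_jumps` flags batch 7000, where the validation `rmse_e` rises from 2.27e-02 to 6.75e+00 and then stays near 6.75 for the rest of the log. The sketch only locates the jump; it does not say what caused it. The 10x threshold is an arbitrary choice for illustration.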