LLaVA-OneVision-1.5 is a recently open-sourced multimodal large-model training framework from EvolvingLMMs-Lab (reference 1). It ships with datasets, models, and training recipes, which makes it very helpful for training multimodal large models, so I am sharing my walkthrough here; corrections are welcome.
The environment is as follows:
Ubuntu 22.04, conda 23.7.4, CUDA 12.6, Python 3.11; kernel: Linux ubuntu-workstation 6.8.0-54-generic #56~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Sat Feb 8 11:41:24 UTC 2 x86_64 x86_64 x86_64 GNU/Linux; CPU: Intel(R) Core(TM) i9-14900K; a single RTX 4090 GPU.
The virtual environment, finetunning, was initially configured with:
torch 2.9.1+cu126, transformer_engine 2.10 (both later downgraded; see Error 6 below).
The steps below follow the repository's README.md. The README recommends running inside Docker; this walkthrough does not use Docker.
1 Download a pre-trained model or merge one manually
You have two options to get started with LLaVA-OneVision-1.5-stage-0:
Option 1: Download pre-trained model from Hugging Face
Download our LLaVA-OneVision-1.5-4B-stage0 model directly from Hugging Face.
Option 2: Merge initial weights yourself
Alternatively, you can merge the initial weights from the original ViT and LLM:
python ds/merge_model.py \
--vit_path DeepGlint-AI/rice-vit-large-patch14-560 \
--llm_path Qwen/Qwen3-4B-Instruct-2507 \
--output LLaVA-OneVision-1.5-4B-stage0
Note: When merging weights, the adapter component will be initialized with default values.
This step initializes the model. You can either download it directly from Hugging Face, or merge the two base models yourself (with the adapter initialized to default values) and export the result to the target directory LLaVA-OneVision-1.5-4B-stage0. The model looks roughly like this:
(Pdb) self.model
LLaVAOneVision1_5_Model(
  (visual): RiceTransformerPretrainedModel(
    (patch_embed): RicePatchEmbed(
      (proj): Conv2d(3, 1024, kernel_size=(14, 14), stride=(14, 14), bias=False)
    )
    (rotary_pos_emb): RiceRotaryEmbedding()
    (pre_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
    (blocks): ModuleList(
      (0-23): 24 x RiceBlock(
        (norm1): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (norm2): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (attn): RiceSdpaAttention(
          (qkv): Linear(in_features=1024, out_features=3072, bias=True)
          (proj): Linear(in_features=1024, out_features=1024, bias=True)
        )
        (mlp): RiceMlp(
          (fc1): Linear(in_features=1024, out_features=4096, bias=True)
          (act): GELUActivation()
          (fc2): Linear(in_features=4096, out_features=1024, bias=True)
        )
      )
    )
    (merger): RicePatchMerger(
      (ln_q): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
      (mlp): Sequential(
        (0): Linear(in_features=4096, out_features=4096, bias=True)
        (1): GELU(approximate='none')
        (2): Linear(in_features=4096, out_features=2560, bias=True)
      )
    )
  )
  (language_model): LLaVAOneVision1_5_TextModel(
    (embed_tokens): Embedding(151936, 2560)
    (layers): ModuleList(
      (0-35): 36 x LLaVAOneVision1_5_DecoderLayer(
        (self_attn): LLaVAOneVision1_5_SdpaAttention(
          (q_proj): Linear(in_features=2560, out_features=4096, bias=False)
          (k_proj): Linear(in_features=2560, out_features=1024, bias=False)
          (v_proj): Linear(in_features=2560, out_features=1024, bias=False)
          (o_proj): Linear(in_features=4096, out_features=2560, bias=False)
          (q_norm): LLaVAOneVision1_5_RMSNorm((128,), eps=1e-06)
          (k_norm): LLaVAOneVision1_5_RMSNorm((128,), eps=1e-06)
        )
        (mlp): LLaVAOneVision1_5_MLP(
          (gate_proj): Linear(in_features=2560, out_features=9728, bias=False)
          (up_proj): Linear(in_features=2560, out_features=9728, bias=False)
          (down_proj): Linear(in_features=9728, out_features=2560, bias=False)
          (act_fn): SiLUActivation()
        )
        (input_layernorm): LLaVAOneVision1_5_RMSNorm((2560,), eps=1e-06)
        (post_attention_layernorm): LLaVAOneVision1_5_RMSNorm((2560,), eps=1e-06)
      )
    )
    (norm): LLaVAOneVision1_5_RMSNorm((2560,), eps=1e-06)
    (rotary_emb): LLaVAOneVision1_5_RotaryEmbedding()
  )
)
The model consists of two sub-models: the vision model (rice-vit) and the language model (Qwen). I tried the manual merge, but during validation one of the vision blocks produced NaN outputs, so the vision-model check failed. I have not found the cause yet, and therefore used the downloaded model instead.
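To help localize this kind of failure, here is a minimal sketch of how one might find the first vision block that emits NaN, using forward hooks. It assumes the merged checkpoint loads via Transformers with trust_remote_code and that the attribute path model.visual.blocks matches the module dump above; the loading path is my assumption, not the repo's own debugging tool.
import torch
from transformers import AutoModel

# Assumption: the merged checkpoint is loadable this way (custom classes need trust_remote_code).
model = AutoModel.from_pretrained(
    "LLaVA-OneVision-1.5-4B-stage0",
    trust_remote_code=True,
    torch_dtype=torch.float32,
)

def make_hook(name):
    # Print the first module whose output contains NaN.
    def hook(module, inputs, output):
        out = output[0] if isinstance(output, tuple) else output
        if torch.isnan(out).any():
            print(f"NaN first appears in {name}")
    return hook

for i, block in enumerate(model.visual.blocks):
    block.register_forward_hook(make_hook(f"visual.blocks.{i}"))
# Then run one forward pass on a sample image and watch which block prints first.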
2 Stage-1 alignment training on image-text pairs
(1) Dataset
The dataset consists of image-text pairs, one caption per image, migrated from other datasets. The LLaVA-558K-Webdataset contains several archives, pretrain-000000.tar through pretrain-000004.tar, along with their idx files. Unpacking pretrain-000000.tar shows grouped images and a matching JSON file per group:
-r--r--r-- 1 ubuntu ubuntu 33378 Sep 18 17:31 ps_00022075.img0.jpg
-r--r--r-- 1 ubuntu ubuntu 31665 Sep 18 17:31 ps_00022075.img10.jpg
-r--r--r-- 1 ubuntu ubuntu 47355 Sep 18 17:31 ps_00022075.img11.jpg
-r--r--r-- 1 ubuntu ubuntu 36372 Sep 18 17:31 ps_00022075.img12.jpg
-r--r--r-- 1 ubuntu ubuntu 35609 Sep 18 17:31 ps_00022075.img13.jpg
-r--r--r-- 1 ubuntu ubuntu 37153 Sep 18 17:31 ps_00022075.img14.jpg
-r--r--r-- 1 ubuntu ubuntu 19173 Sep 18 17:31 ps_00022075.img15.jpg
-r--r--r-- 1 ubuntu ubuntu 42421 Sep 18 17:31 ps_00022075.img16.jpg
-r--r--r-- 1 ubuntu ubuntu 29117 Sep 18 17:31 ps_00022075.img17.jpg
-r--r--r-- 1 ubuntu ubuntu 13242 Sep 18 17:31 ps_00022075.img18.jpg
-r--r--r-- 1 ubuntu ubuntu 16562 Sep 18 17:31 ps_00022075.img1.jpg
-r--r--r-- 1 ubuntu ubuntu 22301 Sep 18 17:31 ps_00022075.img2.jpg
-r--r--r-- 1 ubuntu ubuntu 47226 Sep 18 17:31 ps_00022075.img3.jpg
-r--r--r-- 1 ubuntu ubuntu 27445 Sep 18 17:31 ps_00022075.img4.jpg
-r--r--r-- 1 ubuntu ubuntu 41999 Sep 18 17:31 ps_00022075.img5.jpg
-r--r--r-- 1 ubuntu ubuntu 16453 Sep 18 17:31 ps_00022075.img6.jpg
-r--r--r-- 1 ubuntu ubuntu 20248 Sep 18 17:31 ps_00022075.img7.jpg
-r--r--r-- 1 ubuntu ubuntu 35984 Sep 18 17:31 ps_00022075.img8.jpg
-r--r--r-- 1 ubuntu ubuntu 42419 Sep 18 17:31 ps_00022075.img9.jpg
-r--r--r-- 1 ubuntu ubuntu 3779 Sep 18 17:31 ps_00022075.json
Each group contains a set of image files and one JSON file. The JSON file has three keys: prompts, captions, and images. prompts holds the instruction asking for a description, captions holds the answering description, and images holds the file paths. Each key's value is a list, and entries correspond to the group's images in order. Taking one image as an example:

Its prompt is: Write a terse but informative summary of the picture.
Its caption is: the most cat litter and feeding basket gift basket
Rather terse indeed.
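To inspect the samples yourself, here is a small sketch using only the standard library, assuming the tar layout shown above (the archive path is illustrative):
import json
import tarfile

# Print the first prompt/caption pair from a sample in the archive.
with tarfile.open("LLaVA-558K-Webdataset/pretrain-000000.tar") as tar:
    for member in tar.getmembers():
        if not member.name.endswith(".json"):
            continue
        sample = json.load(tar.extractfile(member))
        # Each key holds a list; entries line up with the sample's images in order.
        print(member.name, "|", sample["prompts"][0], "->", sample["captions"][0])
        break  # drop this break to walk the whole archive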
Before training, install the required packages with pip install -r requirements.txt, then launch:
AIAK_TRAINING_PATH=../LLaVA-OneVision-1.5-main \
DATA_PATH=LLaVA-558K-Webdataset \
TOKENIZER_PATH=LLaVA-OneVision-1.5-4B-stage0 \
CHECKPOINT_PATH=LLaVA-OneVision-1.5-4B-stage0_mcore_tp1_pp1 \
bash examples/llava_ov_1_5/quick_start/stage_1_alignment_llava_ov_4b.sh
(2) Errors during execution
Error 1:
ModuleNotFoundError: No module named 'transformer_engine'
For installing transformer_engine, see reference 4.
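For reference, NVIDIA publishes Transformer Engine on PyPI; if I recall its documentation correctly, a typical install is
pip install transformer_engine[pytorch]
though the installed version has to match torch and CUDA, which comes up again under Error 6 below.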
Error 2:
Rerunning the same launch command now fails with:
NameError: name 'ApexFusedRMSNorm' is not defined
Searching for ApexFusedRMSNorm shows it is imported as
from apex.normalization.fused_layer_norm import FusedRMSNorm as ApexFusedRMSNorm
so the apex package is missing. Installing it with pip install apex and rerunning still errors:
File "/home/ubuntu/anaconda3/envs/finetunning/lib/python3.11/site-packages/apex/__init__.py", line 13, in <module>
from pyramid.session import UnencryptedCookieSessionFactoryConfig
ImportError: cannot import name 'UnencryptedCookieSessionFactoryConfig' from 'pyramid.session' (unknown location)
This is the wrong apex: the PyPI package named apex is an unrelated project (hence the pyramid import above). Per reference 2, install NVIDIA apex from source instead:
git clone https://github.com/NVIDIA/apex.git
cd apex
python setup.py install
With that, the error goes away and stage-1 training proceeds.
Error 3:
File "/home/ubuntu/anaconda3/envs/finetunning/lib/python3.11/site-packages/torch/cuda/__init__.py", line 567, in set_device
    torch._C._cuda_setDevice(device)
torch.AcceleratorError: CUDA error: invalid device ordinal
GPU device may be out of range, do you have enough GPUs?
This machine has only one 4090, so the only valid device ordinal is 0; the script must be requesting more GPUs than exist. Checking stage_1_alignment_llava_ov_4b.sh, there is a line
GPUS_PER_NODE=8
which assigns 8 GPUs per node, far beyond this setup. Change it to 1 (a quick sanity check for the visible GPU count is shown below) and rerun.
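A one-liner to confirm how many GPUs torch can actually see; GPUS_PER_NODE must not exceed this number:
python -c "import torch; print(torch.cuda.device_count())"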
Error 4:
/usr/include/pybind11/detail/common.h:215:10: fatal error: Python.h: No such file or directory
  215 | #include <Python.h>
Per reference 3, the directory containing Python.h must be added to the compiler's search path; but rather than a system path like /usr/local/python3.10, use the virtual environment's include directory, /home/ubuntu/anaconda3/envs/finetunning/include/python3.11, then rerun.
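The correct include directory for whatever environment is active can be located with sysconfig; exporting CPATH is one common way to make it visible to the compiler (adjust the path to your own env):
python -c "import sysconfig; print(sysconfig.get_paths()['include'])"
export CPATH=/home/ubuntu/anaconda3/envs/finetunning/include/python3.11:$CPATH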
Error 5:
ModuleNotFoundError: No module named 'fused_layer_norm_cuda' when instantiating LocalNorm
apex was built without its C++ and CUDA extensions. Rebuild it with:
python setup.py install --cpp_ext --cuda_ext
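After rebuilding, a quick sanity check that the compiled extension is importable (fused_layer_norm_cuda is the module the error complained about, and the second import is the one the training code uses):
python -c "import fused_layer_norm_cuda; from apex.normalization.fused_layer_norm import FusedRMSNorm; print('apex extensions OK')"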
Error 6:
WARNING:DotProductAttention:flash-attn may provide important feature support or performance improvement. Please install flash-attn >= 2.1.1, <= 2.8.3 by pip3 install flash-attn==<version>.
WARNING:megatron.core.utils:No dot product attention backend is available for the provided inputs. Please run with NVTE_DEBUG=1 NVTE_DEBUG_LEVEL=2 to find out the reasons for disabling all backends.
[rank0]: ValueError: No dot product attention backend is available for the provided inputs. Please run with NVTE_DEBUG=1 NVTE_DEBUG_LEVEL=2 to find out the reasons for disabling all backends.
Install flash-attn with pip install flash-attn --no-build-isolation; alternatively, install a prebuilt wheel or build from source. Prebuilt wheels require an exactly matching environment: the torch, CUDA, and Python versions must all line up.
At this point my environment had torch 2.9.0 and transformer_engine / transformer_engine_torch 2.10.0. The current flash_attn 2 release is 2.8.1; its prebuilt wheel flash_attn-2.8.1+cu12torch2.9cxx11abiTRUE-cp312-cp312-linux_x86_64.whl supports torch 2.9.0 but requires Python 3.12. Downgrading instead also solves the problem, provided the CUDA, cuDNN, torch, flash_attn, and transformer_engine versions all match. The adjusted versions are as follows:
cuda 12.6
torch 2.7.1+cu126
flash_attn 2.8.1 (installed from the wheel file flash_attn-2.8.1+cu12torch2.7cxx11abiTRUE-cp311-cp311-linux_x86_64.whl)
transformer_engine 2.7.0
transformer_engine_cu12 2.7.0
transformer_engine_torch 2.7.0
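A short script to confirm the installed versions line up, assuming both packages expose __version__ as recent releases do (torch.version.cuda shows the CUDA build torch was compiled against):
import torch
import flash_attn
import transformer_engine

print("torch:", torch.__version__, "| built for CUDA:", torch.version.cuda)
print("flash_attn:", flash_attn.__version__)
print("transformer_engine:", transformer_engine.__version__)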
Rerunning, this stage now trains normally:
[2025-12-23 19:35:20] iteration 41/ 2500 | consumed samples: 328 | elapsed time per iteration (ms): 12330.9 | throughput (token/sec/GPU): 3925.3 | learning rate: 9.994915E-05 | global batch size: 8 | lm loss: 4.042104E+00 | loss scale: 1.0 | grad norm: 1.191 | num zeros: 0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2025-12-23 19:35:33] iteration 42/ 2500 | consumed samples: 336 | elapsed time per iteration (ms): 12731.7 | throughput (token/sec/GPU): 3978.4 | learning rate: 9.994629E-05 | global batch size: 8 | lm loss: 4.256773E+00 | loss scale: 1.0 | grad norm: 0.940 | num zeros: 0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2025-12-23 19:35:46] iteration 43/ 2500 | consumed samples: 344 | elapsed time per iteration (ms): 13025.3 | throughput (token/sec/GPU): 3899.3 | learning rate: 9.994335E-05 | global batch size: 8 | lm loss: 4.183313E+00 | loss scale: 1.0 | grad norm: 0.834 | num zeros: 0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2025-12-23 19:35:58] iteration 44/ 2500 | consumed samples: 352 | elapsed time per iteration (ms): 12038.4 | throughput (token/sec/GPU): 3876.6 | learning rate: 9.994033E-05 | global batch size: 8 | lm loss: 4.031400E+00 | loss scale: 1.0 | grad norm: 0.714 | num zeros: 0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2025-12-23 19:36:11] iteration 45/ 2500 | consumed samples: 360 | elapsed time per iteration (ms): 12927.5 | throughput (token/sec/GPU): 3798.3 | learning rate: 9.993723E-05 | global batch size: 8 | lm loss: 4.208323E+00 | loss scale: 1.0 | grad norm: 0.541 | num zeros: 0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2025-12-23 19:36:24] iteration 46/ 2500 | consumed samples: 368 | elapsed time per iteration (ms): 12839.2 | throughput (token/sec/GPU): 3844.2 | learning rate: 9.993405E-05 | global batch size: 8 | lm loss: 4.360492E+00 | loss scale: 1.0 | grad norm: 2.003 | num zeros: 0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2025-12-23 19:36:37] iteration 47/ 2500 | consumed samples: 376 | elapsed time per iteration (ms): 12916.7 | throughput (token/sec/GPU): 3762.7 | learning rate: 9.993080E-05 | global batch size: 8 | lm loss: 4.059428E+00 | loss scale: 1.0 | grad norm: 1.276 | num zeros: 0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2025-12-23 19:36:49] iteration 48/ 2500 | consumed samples: 384 | elapsed time per iteration (ms): 12855.7 | throughput (token/sec/GPU): 3811.3 | learning rate: 9.992746E-05 | global batch size: 8 | lm loss: 3.937055E+00 | loss scale: 1.0 | grad norm: 0.701 | num zeros: 0 | number of skipped iterations: 0 | number of nan iterations: 0 |
[2025-12-23 19:37:02] iteration 49/ 2500 | consumed samples: 392 | elapsed time per iteration (ms): 12480.5 | throughput (token/sec/GPU): 3861.7 | learning rate: 9.992405E-05 | global batch size: 8 | lm loss: 4.157461E+00 | loss scale: 1.0 | grad norm: 0.605 | num zeros: 0 | number of skipped iterations: 0 | number of nan iterations: 0 |
References:
1 https://github.com/EvolvingLMMs-Lab/LLaVA-OneVision-1.5
2 https://zhuanlan.zhihu.com/p/654726878
3 https://blog.youkuaiyun.com/weixin_40511249/article/details/109136597
4 https://blog.youkuaiyun.com/robator/article/details/156082990?spm=1011.2415.3001.5331
5 https://blog.youkuaiyun.com/li_jiaoyang/article/details/117431876