DeepSeek: Janus Pro, a Unified Multimodal Model

测试游记

Article link: https://blog.youkuaiyun.com/weixin_37786060/article/details/146432578

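1. Background

Janus Pro, released by DeepSeek (深度求索), is a unified multimodal model. It aims to integrate input and output across modalities (text and images in the current release, per the summary below) so that cross-modal understanding and generation tasks can be handled more efficiently and intelligently in one model.

As a key part of DeepSeek's multimodal work, Janus Pro reflects the company's accumulated expertise and innovation in artificial general intelligence (AGI) research.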

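Core features

Unified multimodal architecture
Janus Pro uses one model architecture to process text and images jointly instead of relying on separate per-modality branch models. This reduces the loss of information between modalities and improves efficiency and accuracy on cross-modal tasks such as visual question answering and caption generation.
Cross-modal cooperation and knowledge sharing
Cross-modal alignment maps data from different modalities into a shared semantic space so that knowledge can be shared. For example, text descriptions and image features reinforce each other in that space, which improves the model's understanding of complex scenes.
Efficient inference and scalability
Janus Pro is described as building on an MoE (Mixture of Experts) architecture that allocates compute dynamically during training and inference, activating specific expert modules for different tasks. This preserves model capacity while improving computational efficiency.
High-resolution images and long context
Janus Pro is described as using a new visual encoder to handle higher-resolution images (for example 4K-class) and very long text inputs (for example hundreds of thousands of tokens), which would suit complex document analysis and long-video understanding.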
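The shared semantic space described above can be illustrated with a toy example. This is a minimal sketch, not Janus Pro code: the three-dimensional vectors and the function names `cosine_similarity` and `best_caption` are invented for illustration, and real models use learned embeddings with hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def best_caption(image_vec, caption_vecs):
    """Return the caption whose embedding lies closest to the image embedding."""
    return max(
        caption_vecs,
        key=lambda text: cosine_similarity(image_vec, caption_vecs[text]),
    )

# Toy embeddings (hypothetical values): once both modalities live in one
# space, matching an image to text reduces to a nearest-neighbour lookup.
image_vec = [0.9, 0.1, 0.0]
caption_vecs = {
    "a dog on grass": [1.0, 0.0, 0.0],
    "a city at night": [0.0, 1.0, 0.0],
}
print(best_caption(image_vec, caption_vecs))
```

Alignment training, such as the contrastive objectives mentioned later, is what pushes matching image and text vectors close together so that this kind of lookup works.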

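Key technical advances

Cross-modal alignment
Contrastive learning and self-supervised strategies strengthen semantic alignment across modalities, for example the association between image regions and text descriptions.
Dynamic routing
The dynamic routing mechanism in an MoE architecture automatically selects the expert networks most relevant to the input, which improves task performance.
High-resolution visual processing
Patch-wise (tiled) encoding and local attention address the compute bottleneck that conventional vision models hit when image resolution is very high.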
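To make the dynamic routing idea concrete, here is a generic top-k gating sketch in plain Python. It illustrates how MoE routing works in general and is not taken from the Janus Pro code base. The experts are hypothetical scalar functions, and `route_top_k` and `moe_forward` are names chosen for this example.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(gate_scores, k=2):
    """Select the k experts with the highest gate probability.

    Returns (expert_index, weight) pairs whose weights are renormalised
    to sum to 1 over the selected experts.
    """
    probs = softmax(gate_scores)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in top)
    return [(i, probs[i] / total) for i in top]

def moe_forward(x, experts, gate_scores, k=2):
    """Weighted sum of the outputs of the selected experts only."""
    return sum(w * experts[i](x) for i, w in route_top_k(gate_scores, k))

# Hypothetical experts: only two of the three are evaluated per input.
experts = [lambda x: x + 1, lambda x: 2 * x, lambda x: -x]
gate_scores = [2.0, 2.0, -5.0]  # in a real model these come from a learned gate
print(route_top_k(gate_scores))
print(moe_forward(3.0, experts, gate_scores))
```

Because unselected experts are never called, compute per input scales with k rather than with the total number of experts, which is the efficiency argument made above.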

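Application scenarios

Intelligent assistants
Multi-turn dialogue that mixes text and images, for example generating tailored suggestions from a picture the user provides.
Content generation
Cross-modal generation for illustrated writing, video script drafting, advertising design, and similar work.
Education and healthcare
Parsing complex charts and medical images and explaining them in text informed by domain knowledge.
Industry and research
Processing multimodal data (for example sensor data plus text reports) for fault diagnosis or scientific analysis.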

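Performance

Janus Pro is described as matching or outperforming leading models such as GPT-4V and Gemini on established multimodal benchmarks such as VQA-v2, TextVQA, and COCO-Caption, although no scores are quoted here.
It is also said to do well on long-document understanding and detailed question answering over high-resolution images, tasks that call for deep understanding of complex scenes.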

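Positioning and significance

Janus Pro is a key step for DeepSeek toward multimodal AGI. Its unified architecture and efficient design lower the barrier to building multimodal applications and lay the groundwork for adding further modalities such as 3D and audio. By open-sourcing some of its models and APIs, DeepSeek further promotes the adoption of multimodal technology and the ecosystem around it.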

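Summary

1. Janus Pro is the second generation of DeepSeek's Janus model series.

2. Janus Pro is a unified multimodal large model that supports multimodal input and output (currently images).

3. The latest Janus Pro release open-sources two sizes, 1B and 7B. No online API is available yet, so the model can only be used through local deployment (an API may come later).

4. Janus Pro continues DeepSeek's usual approach: high performance, good cost-effectiveness, and fully open source.

5. Conventional multimodal large models cannot handle both multimodal input and multimodal output in a single model, which wastes a great deal of computation when they are used to build embodied intelligence.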

2. Local deployment

GPU memory requirements

Task	Janus Pro 1B	Janus Pro 7B
Image understanding	5 GB (RTX 3060)	15 GB (RTX 4080)
Image generation	14 GB (RTX 4080)	40 GB (2 × RTX 3090/4090)
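The table can be turned into a quick lookup when deciding which checkpoint to download. This is a small sketch that copies the figures from the table above; the task keys and the function name `largest_model_that_fits` are choices made for this example.

```python
# GPU memory requirements in GB, copied from the table above.
VRAM_GB = {
    ("understanding", "1B"): 5,
    ("understanding", "7B"): 15,
    ("generation", "1B"): 14,
    ("generation", "7B"): 40,
}

def largest_model_that_fits(task, available_gb):
    """Return "7B" or "1B" for the largest model that fits, or None if neither does."""
    for size in ("7B", "1B"):
        if VRAM_GB[(task, size)] <= available_gb:
            return size
    return None

print(largest_model_that_fits("understanding", 16))  # 16 GB card, image understanding
print(largest_model_that_fits("generation", 24))     # 24 GB card, image generation
```

The 40 GB figure for 7B image generation exceeds any single card listed in the table, which is why that entry is shown with two GPUs.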

Download the code

git clone https://github.com/deepseek-ai/Janus.git
cd Janus

Set up the environment

conda create --name janus python=3.10
conda activate janus 
pip install -e .

Download the model with ModelScope (魔搭)

https://www.modelscope.cn/models/deepseek-ai/Janus-Pro-1B/files

pip install modelscope
mkdir Janus-Pro-1B
modelscope download --model deepseek-ai/Janus-Pro-1B --local_dir ./Janus-Pro-1B
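After the download finishes, it can help to confirm that the directory is complete before loading the model. The sketch below uses only the standard library. The file names in `EXPECTED_FILES` are an assumption for illustration and should be adjusted to match the file listing on the ModelScope page linked above; `missing_files` is a hypothetical helper name.

```python
from pathlib import Path

# Assumed file names for illustration; check them against the ModelScope listing.
EXPECTED_FILES = ("config.json", "tokenizer.json")

def missing_files(model_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in expected if not (root / name).is_file()]

# Prints the missing names, or "none" when every expected file is present.
print("missing:", missing_files("./Janus-Pro-1B") or "none")
```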

3. Trying it out

import torch
from transformers import AutoModelForCausalLM
from janus.models import MultiModalityCausalLM, VLChatProcessor
from janus.utils.io import load_pil_images

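# Path to the downloaded model directory
model_path = "./Janus-Pro-1B"

# Load the VLChatProcessor (image preprocessing plus chat formatting)
vl_chat_processor: VLChatProcessor = VLChatProcessor.from_pretrained(model_path)

# Reuse the tokenizer bundled with the processor
tokenizer = vl_chat_processor.tokenizer

# Load the model and try to move it to the GPU in bfloat16.
# Without CUDA, .cuda() raises; the error is printed and the run
# continues on the CPU, as in the log below.
try:
    vl_gpt: MultiModalityCausalLM = AutoModelForCausalLM.from_pretrained(
        model_path, trust_remote_code=True
    )
    vl_gpt = vl_gpt.to(torch.bfloat16).cuda().eval()
except Exception as e:
    print(e)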

image = "./pic/1.png"
question = "请解释下这张图片上的内容。"
conversation = [
    {
        "role": "<|User|>",
        "content": f"<image_placeholder>\n{question}",
        "images": [image],
    },
    {"role": "<|Assistant|>", "content": ""},
]

pil_images = load_pil_images(conversation)
prepare_inputs = vl_chat_processor(
    conversations=conversation, images=pil_images, force_batchify=True
).to(vl_gpt.device)

# run the image encoder to get the image embeddings
inputs_embeds = vl_gpt.prepare_inputs_embeds(**prepare_inputs)

# run the model to get the response
outputs = vl_gpt.language_model.generate(
    inputs_embeds=inputs_embeds,
    attention_mask=prepare_inputs.attention_mask,
    pad_token_id=tokenizer.eos_token_id,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
    max_new_tokens=512,
    do_sample=False,
    use_cache=True,
)
answer = tokenizer.decode(outputs[0].cpu().tolist(), skip_special_tokens=True)
print(f"{prepare_inputs['sft_format'][0]}", answer)

image = "./pic/2.png"
question = "请将这个数学公式转换成 LaTeX 代码。"
conversation = [
    {
        "role": "<|User|>",
        "content": f"<image_placeholder>\n{question}",
        "images": [image],
    },
    {"role": "<|Assistant|>", "content": ""},
]
pil_images = load_pil_images(conversation)
prepare_inputs = vl_chat_processor(
    conversations=conversation, images=pil_images, force_batchify=True
).to(vl_gpt.device)
# run image encoder to get the image embeddings
inputs_embeds = vl_gpt.prepare_inputs_embeds(**prepare_inputs)
# run the model to get the response
outputs = vl_gpt.language_model.generate(
    inputs_embeds=inputs_embeds,
    attention_mask=prepare_inputs.attention_mask,
    pad_token_id=tokenizer.eos_token_id,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
    max_new_tokens=512,
    do_sample=False,
    use_cache=True,
)
answer = tokenizer.decode(outputs[0].cpu().tolist(), skip_special_tokens=True)
print(f"{prepare_inputs['sft_format'][0]}", answer)
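The two test cases above differ only in the image path and the question, so the conversation structure can be built by one helper. This sketch covers only that step and runs without the model; `build_conversation` is a name chosen here, while the roles and the `<image_placeholder>` tag are copied from the script above.

```python
def build_conversation(image_path, question):
    """Build the two-turn conversation list passed to VLChatProcessor above."""
    return [
        {
            "role": "<|User|>",
            "content": f"<image_placeholder>\n{question}",
            "images": [image_path],
        },
        # Empty assistant turn: the model fills this in during generation.
        {"role": "<|Assistant|>", "content": ""},
    ]

# The same two inputs as in the script above.
cases = [
    ("./pic/1.png", "请解释下这张图片上的内容。"),
    ("./pic/2.png", "请将这个数学公式转换成 LaTeX 代码。"),
]
for image_path, question in cases:
    conversation = build_conversation(image_path, question)
    print(conversation[0]["images"], repr(conversation[0]["content"]))
```

Each returned list can then be passed to `load_pil_images` and `vl_chat_processor` exactly as the script does.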

Test images: ./pic/1.png and ./pic/2.png (not reproduced here).

Output:

/opt/homebrew/Caskroom/miniconda/base/envs/janus/bin/python /Users/zhongxin/github/Janus/test.py 
Python version is above 3.10, patching the collections module.
/opt/homebrew/Caskroom/miniconda/base/envs/janus/lib/python3.10/site-packages/transformers/models/auto/image_processing_auto.py:594: FutureWarning: The image_processor_class argument is deprecated and will be removed in v4.42. Please use `slow_image_processor_class`, or `fast_image_processor_class` instead
  warnings.warn(
Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.48, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message.
Some kwargs in processor config are unused and will not have any effect: image_tag, num_image_tokens, add_special_token, ignore_id, sft_format, mask_prompt. 
Torch not compiled with CUDA enabled
You are a helpful language and vision assistant. You are able to understand the visual content that the user provides, and assist the user with a variety of tasks using natural language.

<|User|>: <image_placeholder>
请解释下这张图片上的内容。

<|Assistant|>: 这是一张对比图，分为左右两部分，每部分都有一只柴犬形象。

左边：
- 图片上方有一行黑色粗体文字：“Decoupling Visual Encoding”。
- 柴犬站立着，肌肉发达，看起来非常强壮。

右边：
- 图片上方也有一行黑色粗体文字：“Single Visual Encoder”。
- 柴犬坐在地上，表情悲伤，看起来很沮丧。

整体风格：
- 图片采用了幽默的网络迷因风格，使用了两只柴犬来表达不同的概念。
- 左侧的柴犬代表“Decoupling Visual Encoding”，右侧的柴犬代表“Single Visual Encoder”。
- 文字使用的是简单的黑色粗体字体，没有特殊装饰。

图片中的文字：
- “Decoupling Visual Encoding”
- “Single Visual Encoder”

这些文字位于图片的顶部，用黑色粗体字体显示。
You are a helpful language and vision assistant. You are able to understand the visual content that the user provides, and assist the user with a variety of tasks using natural language.

<|User|>: <image_placeholder>
请将这个数学公式转换成 LaTeX 代码。

<|Assistant|>: \( A_n = \alpha _0 \left[ \begin{array}{l} 1 + \frac{3}{4} \sum_{k=1}^{n} \begin{pmatrix} \frac{4}{9} \\ k \end{pmatrix} \right] \end{array]

在 LaTeX 中，这个公式可以表示为：

\[ A_n = \alpha_0 \left[ \begin{array}{l} 1 + \frac{3}{4} \sum_{k=1}^{n} \begin{pmatrix} \frac{4}{9} \\ k \end{pmatrix} \right] \]

Process finished with exit code 0
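The LaTeX in the log above opens an `array` environment, and neither version of the formula closes it with a well-formed `\end{array}`. Model-generated LaTeX is worth checking before use. Below is a minimal checker for environment nesting only, not a full LaTeX parser; `unclosed_environments` is a name chosen for this example.

```python
import re

# Matches \begin{name} and \end{name}; malformed closers such as \end{array] do not match.
ENV_TOKEN = re.compile(r"\\(begin|end)\{([A-Za-z*]+)\}")

def unclosed_environments(latex):
    r"""Return the names of \begin{...} environments that are never closed.

    Raises ValueError when an \end{...} does not match the innermost open environment.
    """
    stack = []
    for kind, name in ENV_TOKEN.findall(latex):
        if kind == "begin":
            stack.append(name)
        elif not stack or stack[-1] != name:
            raise ValueError(f"unexpected \\end{{{name}}}")
        else:
            stack.pop()
    return stack

# The second formula from the log above, copied verbatim.
model_output = (
    r"\[ A_n = \alpha_0 \left[ \begin{array}{l} 1 + \frac{3}{4} "
    r"\sum_{k=1}^{n} \begin{pmatrix} \frac{4}{9} \\ k \end{pmatrix} \right] \]"
)
print(unclosed_environments(model_output))
```

An empty list means every environment is closed. A non-empty list names the environments to fix before compiling.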