Seed持续领跑，国产开源模型异军突起丨多模态模型8月最新榜单揭晓

最新推荐文章于 2025-12-01 23:02:05 发布

原创最新推荐文章于 2025-12-01 23:02:05 发布 · 1k 阅读

13 ·

CC 4.0 BY-SA版权

文章标签：

#开源 #大模型评测 #大模型 #司南评测 #多模态模型

榜单专栏收录该内容

9 篇文章

订阅专栏

多模态大模型正在成为通往通用人工智能（AGI）的关键引擎。它打破了文本、图像、语音、视频等多模态信息之间的边界，让 AI 不再局限于单一任务，而是能够整合多源信息，实现更立体的世界认知。这种能力不仅推动了感知与理解的提升，更让模型在推理、决策与协作等层面展现出高阶智能，让 AI 能够像人一样理解世界、解决问题，带来更流畅、更可信赖的体验。随着模型基础能力的持续增强，多模态智能正在渗透到越来越多的行业和场景中，推动智能新时代的加速到来。

司南持续关注大模型的发展动态，基于闭源评测基准，近期针对国内外主流多模态模型进行了全面评测，现公布司南多模态模型 8 月评测榜单。

本期榜单共包含 71 个多模态模型，其中包含：4 个国内 API 模型，20 个海外 API 模型以及 47 个开源多模态模型。需要提示的是：由于榜单规则，为提升闭源评测集榜单的时效性与先进性，我们对部分旧模型进行了移除，并测试了他们的最新版本。

综合榜单解读

整体性能排名

在本次评测中，多模态大模型的整体表现差异再次被拉大，平均得分从 17.60（最低）到 61.05 （最高）。性能排名最高的模型是 Seed1.6-vision-250815 ，而最低的则是 LLaVA-v1.5-7B。
在整体性能上，闭源模型 Seed1.6-vision-250815 在 通用感知、空间感知和信息图表理解 三个维度上领跑所有模型，表现优异。但在 视觉质量感知、多模态推理 和 多模态创作 这些维度上仍有提升空间，尤其在视觉质量感知上，该模型位列榜单中游，说明对图像质量属性、美学价值及主观视觉效果的评估能力仍为一个待突破的方向。
在开源模型中，GLM-4.5V 凭借其在通用感知、空间感知等方面的出众能力，占据了榜单次名。而新近发布的 GPT-5-20250807 多方面能力均衡，并在多模态创作、多模态推理 方面占据上风，展现出其独到之处，位列榜单第 3。
在以上三者之后，GPT-5-mini-20250807, InternVL3.5-241B-A28B 这 2 个模型分别占据了榜单的第 4、5 名。

深层洞察

模型放慢迭代脚步，开源模型能堪大任

相比 5 月榜单，模型整体迭代速度有所减慢，最好模型的性能相比上期只增长了不到 2 分，说明由于数据量饱和或整体架构的瓶颈，目前多模态大模型整体迭代已进入瓶颈期，整体性能阶段性的提升被按下了暂停键。
开源模型正迎头赶上，不断拉近与闭源模型领头羊的距离。相比原先的开源模型领头羊 Qwen2.5-VL-72B，GLM-4.5V 提升了近 10 分，与头名的差距从上期的十余分拉近到本期的近一分；同时 InternVL3.5-241B-A28B也在原先基础上不断迭代，进入 Top5 的行列，这说明了开源多模态大模型通过数据与后训练方面的不断创新，已经逐渐突破了固有的护城河，正在接近多模态大模型整体的能力上限。

国产模型持续发力，对海外模型实现一定压制，但在部分方面仍有差距

本次榜单中前 5 名中，国产模型占据 3 席，其中最领先的开源模型和闭源模型都为国产模型，这进一步体现了国产模型的性能优势与出众的技术力。
在多模态创作等领域，GPT-5-20250807 等海外模型仍具有一定优势，同时整体的性能与榜单头名的差距也从上期的 6 分拉近到了本期的不到 2 分，说明海外模型在部分能力上仍具有不可替代的优势，整体性能迭代势头不可小觑。

隔代模型迭代快速，性能展现显著提升，感知与推理能力协同并进

关于开源模型，InternVL3.5-241B-A28B 相比 InternVL3-78B 在信息图像感知、多模态创作等维度均有不同程度的进步，GLM-4.5V 相比上一代更是展现出不小的提升。
关于闭源模型，领头羊 Seed1.6-vision-250815 相比 Seed1.5-VL 整体性能有小幅提升，GPT-5-20250807 则是相比 GPT-4.1-20250414 和 ChatGPT-4o-latest 展现出全方面的性能提升，在多模态推理方面尤为显著。

榜单规则说明

本榜单基于闭源测试数据，对不同多模态模型在 通用感知、空间感知、视觉质量感知、信息图形理解、多模态推理、多模态创作 六大能力维度上的表现进行了评测，并基于归一化分数计算模型的平均得分进行排序。闭源评测基准拥有较为丰富的题目类型，包含单选、多选、填空、开放性创作等，且为中英文双语，可以同时考验模型的多语言理解能力。

子维度能力解读

通用感知

在通用感知能力方面，闭源模型 Seed1.6-vision-250815 排名第一。但需注意的是，在通用感知测试中存在涉及人物识别的问题，而部分闭源 / 开源模型会对此类问题产生拒答 (其中 GPT-4.1-20250414、Claude4-Opus、InternVL3-38B 等模型均在不同程度上有此类现象)。下方的性能展示图片也标注了每个模型的拒答比率。

视觉质量感知

在视觉质量感知方面，在前 5 名中只有 GPT-5-20250807 为本次榜单新近上榜的，展现出目前比较领先的多模态模型（如Seed1.6-vision-250815，InternVL3.5-241B-A28B等）在这一方面仍存在一定欠缺。值得注意的是，小参数量模型，如 SAIL-VL-2B 等，在视觉质量感知上也展现了突出的性能。

空间感知

在空间感知方面，当前多模态大模型的能力相较上期榜单实现了一定的性能提升（Top5 全为新进模型），模型取得的最高分数首次突破 60 分，但整体表现仍有待加强。闭源模型在空间感知上实现了一定的性能突破，而开源模型相较上期进步相对有限。

信息图形感知

在信息图形感知方面，Seed1.6-vision-250815 在评测中取得了最佳的成绩，而令人惊讶的是，GPT-5-mini-20250807 比 GPT-5-20250807 在这一维度取得了更佳的成绩。整体上，闭源 API 模型在信息图形理解方面的性能处于领先的地位，占据了前 10 名中的 8 席。开源模型则相较上期榜单在这一方面有一定性能提升，但仍凸显不足。

多模态推理

在多模态推理方面，Gemini-5-20250807 在评测中取得了最佳的成绩，且较第 2 名（Gemini-2.5-Pro，也即上一期的第 1 名）有较大的领先优势，展现出多模态大模型在推理方面快速迭代成效显著。整体上，闭源 API 模型在多模态推理方面的性能处于领先的地位，InternVL3.5-241B-A28B 及 GLM-4.5V 也展现了和闭源模型近似的性能，强势进入前 10 名。整体上，当前大部分多模态大模型的多模态推理分数仍存在一定不足，尚未触碰到 60 分的及格线。

多模态创作

在多模态创作方面，GPT-5-20250807 及其小量级模型 GPT-5-mini-20250807 在评测中取得了前两名的成绩。整体上，闭源 API 模型在多模态创作方面的性能处于领先的地位，占据了前 10 名中的 8 席。开源多模态模型在这一方面仍有一定缺陷，尚待补足。

评测题目示例

通用感知

问题1（考察计数、幻觉）

图中戴帽子的女士是右边的第几个人？

答案

问题2（考察地点识别）

图中的景点是什么，位于哪个城市？

答案

少林寺，河南郑州登封市

视觉质量感知

问题1（考察底层视觉）

What is the most serious distortion in this image?

A. The overall blur.

B. The overexposure on the wall.

C. The noise on the wall.

D. The underexposure on the human face.

答案

问题2（考察图像质量评估）

What factor has the most negative impact on the visual quality of this image?

A. Gaussian white noise

B. Impulse noise

C. Multiplicative noise

答案

空间感知

问题1（考察3D空间感知）

Consider the real-world 3D locations and orientations of the objects. Which side of the white SUV is facing the 24 hr parking?

A. front

B. left

C. back

D. right

答案

问题2（考察抽象空间感知）

In which of the four options is this cube shown from a different perspective?

A. A

B. B

C. C

D. D

E. E

答案

信息图形感知

问题1 （考察中文 OCR ）

根据图上的信息，哪位民警接待了这位老百姓？

答案

易焕农

问题2 （考察图表理解）

On which benchmarks does Qwen-VL-MAX perform the best?

答案

['CCBench','MME','MMVet']

多模态推理

问题1（考察函数）

若集合 \left\{(x, y) | y = x + t(x^2 - x), 0 \leq t \leq 1, 1 \leq x \leq 2 \right\} 表示的图形中，两点之间最大距离为 d，面积为 S，则（）

A. d = 3， S < 1

B. d = 3 ，S > 1

C. d = \sqrt{10} ，S < 1

D. d = \sqrt{10} ，S > 1

答案

问题2（考察谜题推断）

从所给的四个选项中，选择最合适的一个填入问号处，使之呈现一定的规律性

答案

多模态创作

问题（考察软件工程图像解释说明）

Assume you are an experienced software development engineer, skilled in analyzing various diagrams in software engineering.
This is a software engineering diagram that you encounter in your work.
Please follow the requirements below to draft a document to ensure that even a complete beginner with no prior knowledge of software engineering can understand the type of diagram, the business process, and the division of work after reading it.
1. Ensure the document is concise and clear.
2. The document should clearly specify the type of diagram, its purpose, and the stage in software engineering where it is typically used.

参考答案（由 GPT-4o 创作)

This is a Swimlane Activity Diagram, a type of UML (Unified Modeling Language) diagram used to visualize processes or workflows. The swimlanes organize the workflow into distinct lanes, each representing a specific actor or system involved in the process. Here, the lanes represent the Patient, Hospital Management System, and Hospital Admin.

The purpose of this diagram is to depict the discharge process in a hospital management system. It visually illustrates how different actors interact with the system to complete the discharge process, covering all necessary steps, responsibilities, and decision points.

Swimlane activity diagrams are primarily used in the requirement gathering and system design phases of software engineering. They help:
    • Define workflows and interactions between systems and actors.
    • Clarify roles and responsibilities for different entities.
    • Identify decision points and ensure process consistency.
REFERENCE_ANSWER_BY_GPT4O. # Document: Understanding the Swimlane Activity Diagram for Hospital Management System

## 1. **Type of Diagram**
This is a **Swimlane Activity Diagram**, a type of activity diagram used in software engineering. It visually represents the flow of activities in a process and divides responsibilities among different entities (or "swimlanes").

---

## 2. **Purpose of the Diagram**
The purpose of this Swimlane Activity Diagram is to:
- Illustrate the **business process** of discharging a patient in a hospital management system.
- Clearly define the **roles and responsibilities** of the three entities involved: the **Patient**, the **Hospital Management System**, and the **Hospital Admin**.
- Show the **sequence of actions** and how tasks are distributed among these entities.

This diagram helps stakeholders (e.g., developers, business analysts, and hospital staff) understand the workflow and ensure that the system is designed to meet the hospital's operational needs.

---

## 3. **Stage in Software Engineering**
This diagram is typically used during the **Analysis and Design Phase** of the Software Development Life Cycle (SDLC). It helps:
- Analyze the current workflow.
- Design the system's functionality and interactions.
- Communicate the process to both technical and non-technical stakeholders.

---

## 4. **Business Process Overview**
The diagram represents the process of discharging a patient from a hospital. Below is a step-by-step explanation of the workflow:

### **Entities Involved**
1. **Patient**: The individual receiving medical care.
2. **Hospital Management System**: The software system managing hospital operations.
3. **Hospital Admin**: The staff responsible for administrative tasks.

### **Steps in the Process**
1. **Start**: The process begins when the patient interacts with the system.
2. **Display List of Services**: The system shows available services to the patient.
3. **Request to Discharge Patient**: The patient requests to be discharged.
4. **Display Patient Information**: The system retrieves and displays the patient's details.
5. **Admission Cost**: The hospital admin checks the admission cost.
6. **Approve or Confirm the Discharge**: The admin approves or confirms the discharge request.
7. **Calculate and Display Payment Information**: The system calculates the total payment due and displays it.
8. **Check for Admission Payment**: The patient reviews the payment details.
9. **Pays the Bill**: The patient makes the payment.
10. **Update Patient Status & Release Payment Information**: The system updates the patient's status and confirms the payment.
11. **Save the Transaction to Archives**: The system saves the payment details for record-keeping.
12. **Receive Payment and Release Receipt**: The patient receives a receipt for the payment.
13. **Get Incentives**: The hospital admin receives any applicable incentives.
14. **End**: The process concludes.

---

## 5. **Division of Work**
The diagram divides the workflow into three "swimlanes," each representing a specific entity's responsibilities:
- **Patient**: Initiates the discharge process, reviews payment details, and makes the payment.
- **Hospital Management System**: Automates tasks such as displaying information, calculating payments, updating records, and saving transactions.
- **Hospital Admin**: Handles administrative tasks like confirming discharge and checking admission costs.

---

## 6. **Conclusion**
This Swimlane Activity Diagram is a powerful tool for visualizing and understanding the discharge process in a hospital management system. By clearly defining roles and responsibilities, it ensures that the system is designed to streamline operations and improve efficiency.

注意事项

多模态模型闭源评测基准及对应的榜单每季度更新一次，于每季度最后一个月的月初发布 (3 / 6 / 9 / 12 月)
新模型厂商若想参与该季度榜单评测，可在每季度的第 2 个月（2 / 5 / 8 / 11 月）的 25 日前 写邮件至 opencompass@pjlab.org.cn 申请评测
仅支持开源或对外提供 API 服务商业化模型
若为开源模型参与榜单评测，请提供 Hugging Face 或 ModelScope 模型权重及链接；API 模型参与评测，提供 API 网址链接，并写明模型版本
若为 API 模型，推荐提供 OpenAI 标准接口 API，非 OpenAI 标准接口需要在 VLMEvalKit 算法仓库中提 PR 以支持该模型
为保证和真实用户获得完全一样的模型使用体验，对于 API 模型，司南均通过常规采购渠道获取，不接受厂商提供的 key