DAM-3B-Video: Configuration and Performance Study

Original article. First published 2025-06-24 14:54:51; last modified 2025-07-03 16:22:44.

License: CC 4.0 BY-SA

Tags: #language models

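Model overview

DAM-3B (Describe Anything 3B) is a multimodal large language model from NVIDIA, designed to generate detailed descriptions of specific regions in images and videos. Users mark the target region with points, bounding boxes, scribbles, or masks, and receive a precise, context-aware text description. The core innovations of DAM-3B are the "focal prompt" and the "localized vision backbone". The focal prompt combines the full image with a high-resolution crop of the target region, so fine detail is preserved while the overall context is retained. The localized vision backbone embeds the image and mask inputs and uses gated cross-attention to fuse global and local features, which are then passed to the large language model to generate the description.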

One notable property of the model is that, with only 3B parameters, it achieves good inference and …

Model page and downloads

Model link: https://huggingface.co/nvidia/DAM-3B-Video

Code link: https://github.com/NVlabs/describe-anything

Setup steps:

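Download the code (use git clone, with Git LFS installed beforehand)
Download the model
Download the SAM 2.1 model configuration and weights to the path the code expects
1. The parameter files are available at https://huggingface.co/facebook/sam2.1-hiera-large/tree/main
2. The weights can also be downloaded with this command: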
```
wget "https://dl.fbaipublicfiles.com/segment_anything_2/092824/sam2.1_hiera_large.pt"
```
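On a server, change the Gradio environment variable following the linked method (ordinary users are not allowed to create folders in the server's root directory, but Gradio by default creates its folder under /tmp in the root directory)
The demo then runs successfully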
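The two server-side steps above (fetching the SAM 2.1 checkpoint and redirecting Gradio's temporary folder) can be sketched in Python. This is a minimal sketch under stated assumptions: the linked method for the Gradio variable is not preserved in this article, so the sketch assumes it refers to `GRADIO_TEMP_DIR`, the variable Gradio documents for its temporary directory. The folder name `gradio_tmp`, the helper names, and the destination directory are hypothetical; the checkpoint must end up wherever the repository code expects it.

```python
import os
import urllib.request
from pathlib import Path

# URL copied from the wget command above.
SAM_URL = "https://dl.fbaipublicfiles.com/segment_anything_2/092824/sam2.1_hiera_large.pt"


def build_demo_env(base_dir):
    """Return a copy of the environment with Gradio's temp dir redirected.

    Assumption: GRADIO_TEMP_DIR is the variable the original link referred to.
    "gradio_tmp" is an arbitrary folder name chosen for this sketch.
    """
    tmp_dir = Path(base_dir).expanduser() / "gradio_tmp"
    tmp_dir.mkdir(parents=True, exist_ok=True)
    env = dict(os.environ)
    env["GRADIO_TEMP_DIR"] = str(tmp_dir)
    return env


def fetch_checkpoint(dest_dir, url=SAM_URL):
    """Download the SAM 2.1 checkpoint into dest_dir unless it already exists."""
    dest = Path(dest_dir).expanduser() / url.rsplit("/", 1)[-1]
    if not dest.exists():
        dest.parent.mkdir(parents=True, exist_ok=True)
        urllib.request.urlretrieve(url, str(dest))
    return dest
```

The returned environment can be passed to a child process, for example `subprocess.run(["python", "demo_video.py", "--model-path", model_dir], env=build_demo_env(work_dir))`, where `model_dir` and `work_dir` are placeholders for local paths. Exporting the same variable in the shell before launching the demo should have the same effect.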

Command to run the demo:

python demo_video.py --model-path /home/ocean/workspace/DAM-3B-Video

Demo interface

Before running, extract the first frame of the video and circle the target region (the circled region acts as a mask).
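To make "the circled region acts as a mask" concrete, the sketch below rasterizes a closed outline into a binary mask using the standard ray-casting point-in-polygon test. It is only an illustration in pure Python: the function names are hypothetical, and it is not how the demo itself builds masks (the SAM 2.1 weights downloaded during setup suggest the demo relies on SAM for that).

```python
def point_in_polygon(x, y, polygon):
    """Ray casting: toggle 'inside' for each edge crossed by a ray going right from (x, y)."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def polygon_to_mask(polygon, height, width):
    """Return a height x width grid of 0/1; 1 where the pixel centre lies inside the outline."""
    return [
        [1 if point_in_polygon(col + 0.5, row + 0.5, polygon) else 0
         for col in range(width)]
        for row in range(height)
    ]


# A square outline drawn on a 6 x 6 frame, given as (x, y) vertices.
outline = [(1, 1), (4, 1), (4, 4), (1, 4)]
mask = polygon_to_mask(outline, height=6, width=6)
```

Pixels with value 1 mark the region the model is asked to describe; everything else is treated as context.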

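Other demo files can also be used to extract an arbitrary frame.

Each time the prompt is modified and saved, clear the cache or restart the program.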

Experimental results

Due to limitations of the CSDN site, the local videos could not be loaded into this table.

Scenario	Video	Prompt	Circled region	Output
Car theft		Given the video in the form of a sequence of frames above, describe the object in the masked region in the video in detail.	Full frame	A silver sedan is parked on a cobblestone surface, its sleek body reflecting the ambient light. The car is positioned with its front end slightly angled towards the viewer, showcasing its modern design and aerodynamic shape. The vehicle's windows are tinted, providing a subtle contrast to the metallic sheen of the body. As the sequence progresses, the car remains stationary, its presence exuding a sense of calm and stability. The wheels are aligned straight, indicating a precise parking job. The car's side mirrors and door handles are clearly visible, adding to its polished appearance. Throughout the sequence, the car maintains its poised stance, embodying a blend of elegance and functionality.
			Specific region (here, the person)	A sleek silver car, characterized by its compact and modern design, is prominently featured. The vehicle's smooth, metallic surface reflects light, accentuating its aerodynamic contours. The car is stationary, parked on a cobblestone pathway, which adds a textured contrast to its polished exterior. Its windows are tinted, providing a sense of privacy and sophistication. The car's front grille and headlights are distinct, giving it a sharp, focused appearance. Throughout the sequence, the car remains in place, suggesting a moment of pause or readiness to move. The overall impression is one of elegance and functionality, with the car poised as if waiting for its next journey.
		Given the video in the form of a sequence of frames above, briefly describe the object in the masked region in the video.	Full frame	A silver sedan is parked on a cobblestone surface, its sleek body reflecting the ambient light. The car is positioned with its front end slightly angled towards the viewer, showcasing its modern design and aerodynamic shape. The vehicle's windows are tinted, providing a subtle contrast to the metallic sheen of the body. As the sequence progresses, the car remains stationary, its presence exuding a sense of calm and stability. The wheels, with their intricate alloy design, are aligned perfectly with the cobblestone, indicating a precise parking job. The car's side mirrors and door handles are clearly visible, adding to its polished appearance. Throughout the sequence, the car maintains its poised stance, embodying a blend of elegance and functionality.
			Specific region (here, the left side of the frame)	A silver car, sleek and modern in design, is positioned on a cobblestone surface. Its smooth, metallic body reflects light subtly, highlighting its aerodynamic contours. The car is stationary, with its front end slightly angled, suggesting a parked position. The vehicle's windows are tinted, adding a touch of sophistication to its appearance. The car's wheels are aligned straight, indicating stability and readiness for movement. Throughout the sequence, the car remains in place, exuding a sense of calm and control, as if waiting for its next journey.
		Describe this video in detail. Is there any car theft in the video? Is there anyone in the picture pulling the car door?	Full frame	The video does not show any car theft or car door being pulled. It is a continuous shot of a silver car parked on a cobblestone driveway, with a person walking nearby. The person is wearing a red shirt and black shorts, and they are walking towards the car. The car remains stationary throughout the sequence, and there is no indication of any suspicious activity or interaction with the car.
Package theft		Given the video in the form of a sequence of frames above, briefly describe the object in the masked region in the video.	Full frame	A person wearing a dark hooded jacket and light-colored pants is captured in motion, walking with a steady and purposeful stride. The individual appears to be moving forward, with their arms slightly bent at the elbows, suggesting a natural walking posture. The hood of the jacket is up, partially obscuring the head, and the pants are loose-fitting, allowing for easy movement. The person's gait is smooth and consistent, indicating a casual pace. As they progress, their body remains aligned, with the legs moving in a rhythmic pattern, maintaining balance and direction. The overall demeanor suggests a sense of focus and determination, as they continue their journey without any visible interaction with other objects or individuals.
Package theft			Specific region (here, the center of the frame)	A small, fluffy dog with a thick, curly coat of light gray fur is seen trotting energetically across a concrete pathway. Its ears are perked up, and its tail is held high, indicating excitement and alertness. The dog moves with a lively gait, its paws making soft, rhythmic sounds as it progresses forward. Its head is slightly lowered, suggesting a focused and determined demeanor. As it continues, the dog occasionally turns its head, as if responding to sounds or movements in its environment, maintaining a steady pace throughout the sequence.
		Describe this video in detail. Is there any package theft in the video? Is there anyone in the video picking up the package?	Full frame	The video does not show any package theft or theft of any other items. There is no indication of someone picking up a package in the provided sequence. The focus of the video is on the movement of a person walking past a house, with the house's exterior wall and a stone-patterned wall visible in the background. The person is wearing a hoodie and appears to be walking towards the house, with their back facing the camera. The sequence captures the person's movement as they walk past the house, providing a clear view of the house's exterior and the stone wall.
Climbing over a wall		Given the video in the form of a sequence of frames above, briefly describe the object in the masked region in the video.	Full frame	A person, dressed in a dark, heavy coat, is seen walking steadily across a snow-covered surface. The individual moves with a purposeful stride, their arms slightly bent at the elbows, suggesting a brisk pace. The coat appears to be thick, providing warmth and protection against the cold. As they progress, their legs move in a rhythmic pattern, indicating a consistent and determined walk. The person's head is slightly tilted forward, possibly focused on their path ahead. The movement is fluid, with each step taken confidently, maintaining a straight trajectory. The overall demeanor suggests a sense of urgency or determination, as they navigate the snowy terrain with ease.
Climbing over a wall		Given the video in the form of a sequence of frames above, describe this video in detail. Are there any abnormal events in the video?	Full frame	No, there are no abnormal events in the sequence. The video shows a consistent scene of a house with a porch, featuring a large window with red trim. The window is divided into three sections, with the middle section being slightly wider than the other two. The porch is supported by a red column on the right side, which is also part of the house's exterior. The house appears to be well-maintained, with clean lines and a classic design. The window and porch are the main focus, and there is no movement or change in the scene throughout the sequence.
Smoke/fire		Given the video in the form of a sequence of frames above, briefly describe the object in the masked region in the video.	Full frame	A large, flat, and smooth surface, likely a concrete or stone floor, extends across the scene, characterized by its uniform texture and light gray color. The surface appears to be slightly elevated, forming a continuous, unbroken expanse. As the sequence progresses, the floor maintains its consistent appearance, with subtle variations in shading that suggest minor imperfections or wear. The edges of the floor are clean and well-defined, indicating a well-maintained structure. Throughout the sequence, the floor remains stationary, providing a stable and unchanging backdrop to the surrounding environment. Its presence is dominant, suggesting it is a significant feature within the space, possibly serving as a foundation or walkway.
Holding a gun		Given the video in the form of a sequence of frames above, briefly describe the object in the masked region in the video.	Full frame	Video too long; the run failed with "CUDA out of memory"
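The out-of-memory failure in the last row was not investigated further in this article. One common mitigation for long videos is to pass the model only a fixed number of evenly spaced frames. The sketch below shows that index selection in plain Python; the function name and the frame budget are hypothetical, and whether this is enough to fit DAM-3B-Video in memory for that video is untested here.

```python
def sample_frame_indices(total_frames, max_frames):
    """Pick at most max_frames evenly spaced frame indices, always keeping first and last."""
    if total_frames <= 0 or max_frames <= 0:
        return []
    if total_frames <= max_frames:
        # Short clip: keep every frame.
        return list(range(total_frames))
    if max_frames == 1:
        return [0]
    # Integer arithmetic keeps the result deterministic (no float rounding).
    last = total_frames - 1
    return [i * last // (max_frames - 1) for i in range(max_frames)]


# Example: reduce a 100-frame clip to 5 frames.
indices = sample_frame_indices(100, 5)
```

The selected indices can then be used to read only those frames from the video before they are handed to the model.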

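Preliminary conclusions

The model's built-in prompt (Given the video in the form of a sequence of frames above, describe the object in the masked region in the video in detail.) produces a lot of flowery, literary description that is of no practical use; the prompt needs to be adjusted in later work.
Once a specific region is circled, the model only recognizes changes within that region (for example, circling the center of the frame in the package-theft video only picks up the dog); this can be useful for recognition in specialized scenarios.
The model's output shows the following characteristics:
1. It first describes the object's static features in detail, then the object's motion in the video.
2. It describes one object in detail (for example the ground in the smoke/fire video, or the car in the car-theft video) but often ignores the motion of other objects in the frame and the global context.
3. While it can capture an object's motion, it struggles to capture the meaning behind that motion (for example, in the wall-climbing video it only recognizes walking and a lowered head, not the act of climbing over the wall).
For prompts that ask the model to attend to global changes in the video and to behavioral intent (such as "Is there any car theft in the video? Is there anyone in the picture pulling the car door?"), the model cannot accurately identify intent or judge global features.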