Deployment Notes for Time-R1, the New Version of the TimeZero Model


Deployment Preparation

Code download

Link: code download address

Pretrained model weights download

Note: this deployment uses the 7B-parameter weights.
Link: weights download address

Packages that may need to be downloaded separately

Link: flash-attn dependency package download address
Link: FlashInfer dependency package download address

Environment
OS: Ubuntu 20.04
Anaconda version: 4.5.11
conda virtual environment:
	Python version: 3.10.0
# List the existing virtual environments
conda env list
# Create a new virtual environment
conda create -n timer1 python==3.10

Deployment Steps

Installing from the official requirements.txt runs pip into dependency version conflicts. If the goal is only to run demo.py, follow the steps below; deployed this way, the demo runs successfully.

# 1. Activate the virtual environment created above (packages below are installed from the Tsinghua mirror)
conda activate timer1
# 2. Install torch and related packages
pip install torch==2.6.0 torchaudio==2.6.0 torchvision==0.21.0 torchdata torchtext tqdm==4.67.1 -i https://pypi.tuna.tsinghua.edu.cn/simple
# 3. Install gradio and related packages
pip install gradio==4.44.1 ffmpeg==1.4 google==3.0.0 json5==0.9.14 jupyter==1.0.0 Markdown==3.4.4 -i https://pypi.tuna.tsinghua.edu.cn/simple
# 4. Install the transformers and openai packages
pip install openai==1.75.0 qwen-vl-utils==0.0.10 vllm==0.8.4 transformers==4.51.1 -i https://pypi.tuna.tsinghua.edu.cn/simple
# 5. Install decord
pip install decord==0.6.0 -i https://pypi.tuna.tsinghua.edu.cn/simple
# 6. flash-attn must be downloaded separately (a proxy is likely needed); the wheel was downloaded locally here and installed with the command below
pip install flash_attn-2.7.1.post1+cu12torch2.6cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
# 7. FlashInfer must be downloaded separately (a proxy is likely needed)
# Option 1: install directly from the online wheel index
pip install flashinfer-python -i https://flashinfer.ai/whl/cu124/torch2.6
# Option 2: download the wheel locally and install it; given the dependencies installed above, this is the recommended version
pip install flashinfer_python-0.2.2.post1+cu124torch2.6-cp38-abi3-linux_x86_64.whl
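
Once the two wheels are installed, a quick sanity check can be run inside the timer1 environment. The snippet below is a minimal sketch of my own (it is not part of the official repository); it only confirms that this torch build sees the GPU and that the locally downloaded flash-attn wheel imports against it. The FlashInfer version itself is checked in the Pitfalls section below.

# Sanity check (my addition): run with `python` inside the timer1 environment
from importlib.metadata import version

import torch
import flash_attn  # the import fails loudly if the wheel does not match this torch/CUDA build

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("flash-attn:", version("flash_attn"))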

Pitfalls

When installing flashinfer-python, versions 0.2.3 and above do not work with the dependency set above; vLLM disables the FlashInfer sampler and prints the following:

INFO 09-09 17:31:44 [topk_topp_sampler.py:44] Currently, FlashInfer top-p & top-k sampling sampler is disabled because FlashInfer>=v0.2.3 is not backward compatible. Falling back to the PyTorch-native implementation of top-p & top-k sampling.
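
To catch this before launching vLLM, the installed FlashInfer version can be checked up front. This is a sketch of my own, assuming the packaging package is available (it is pulled in by the dependencies installed above):

# Version guard (my addition): vLLM 0.8.4 only keeps the FlashInfer top-p/top-k sampler
# when flashinfer-python is below 0.2.3
from importlib.metadata import version
from packaging.version import Version

fi = Version(version("flashinfer-python"))
assert fi < Version("0.2.3"), f"flashinfer-python {fi} is too new for this setup; use 0.2.2.post1"
print("flashinfer-python", fi, "will be used by vLLM's sampler")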

Deployment Run Results

Official demo.py

INFO 09-09 17:55:30 [__init__.py:239] Automatically detected platform cuda.
INFO 09-09 17:55:44 [config.py:689] This model supports multiple tasks: {'generate', 'reward', 'classify', 'embed', 'score'}. Defaulting to 'generate'.
INFO 09-09 17:55:44 [config.py:1901] Chunked prefill is enabled with max_num_batched_tokens=8192.
INFO 09-09 17:55:46 [core.py:61] Initializing a V1 LLM engine (v0.8.4) with config: model='./Time-R1-7B', speculative_config=None, tokenizer='./Time-R1-7B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=7808, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='auto', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=./Time-R1-7B, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, disable_mm_preprocessor_cache=True, mm_processor_kwargs=None, pooler_config=None, compilation_config={"level":3,"custom_ops":["none"],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"use_inductor":true,"compile_sizes":[],"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":512}
WARNING 09-09 17:55:47 [utils.py:2444] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7f18b04fd990>
INFO 09-09 17:55:50 [parallel_state.py:959] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
WARNING 09-09 17:55:50 [interface.py:310] Using 'pin_memory=False' as WSL is detected. This may slow down the performance.
INFO 09-09 17:55:50 [cuda.py:221] Using Flash Attention backend on V1 engine.
Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
INFO 09-09 17:55:54 [gpu_model_runner.py:1276] Starting to load model ./Time-R1-7B...
INFO 09-09 17:55:55 [config.py:3466] cudagraph sizes specified by model runner [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 264, 272, 280, 288, 296, 304, 312, 320, 328, 336, 344, 352, 360, 368, 376, 384, 392, 400, 408, 416, 424, 432, 440, 448, 456, 464, 472, 480, 488, 496, 504, 512] is overridden by config [512, 384, 256, 128, 4, 2, 1, 392, 264, 136, 8, 400, 272, 144, 16, 408, 280, 152, 24, 416, 288, 160, 32, 424, 296, 168, 40, 432, 304, 176, 48, 440, 312, 184, 56, 448, 320, 192, 64, 456, 328, 200, 72, 464, 336, 208, 80, 472, 344, 216, 88, 120, 480, 352, 248, 224, 96, 488, 504, 360, 232, 104, 496, 368, 240, 112, 376]
INFO 09-09 17:55:55 [topk_topp_sampler.py:59] Using FlashInfer for top-p & top-k sampling.
Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  25% Completed | 1/4 [02:01<06:03, 121.05s/it]
Loading safetensors checkpoint shards:  50% Completed | 2/4 [04:02<04:02, 121.02s/it]
Loading safetensors checkpoint shards:  75% Completed | 3/4 [06:02<02:00, 120.82s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [06:44<00:00, 89.66s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [06:44<00:00, 101.13s/it]

INFO 09-09 18:02:40 [loader.py:458] Loading weights took 404.63 seconds
INFO 09-09 18:02:40 [gpu_model_runner.py:1291] Model loading took 15.6271 GiB and 405.408518 seconds
INFO 09-09 18:02:42 [gpu_model_runner.py:1560] Encoder cache will be initialized with a budget of 16384 tokens, and profiled with 1 video items of the maximum feature size.
INFO 09-09 18:02:58 [backends.py:416] Using cache directory: /root/.cache/vllm/torch_compile_cache/4f31b0e175/rank_0_0 for vLLM's torch.compile
INFO 09-09 18:02:58 [backends.py:426] Dynamo bytecode transform time: 7.55 s
INFO 09-09 18:02:59 [backends.py:115] Directly load the compiled graph for shape None from the cache
INFO 09-09 18:03:05 [monitor.py:33] torch.compile takes 7.55 s in total
INFO 09-09 18:03:07 [kv_cache_utils.py:634] GPU KV cache size: 53,952 tokens
INFO 09-09 18:03:07 [kv_cache_utils.py:637] Maximum concurrency for 7,808 tokens per request: 6.91x
INFO 09-09 18:03:47 [gpu_model_runner.py:1626] Graph capturing finished in 39 secs, took 0.49 GiB
INFO 09-09 18:03:47 [core.py:163] init engine (profile, create kv cache, warmup model) took 66.65 seconds
INFO 09-09 18:03:49 [core_client.py:435] Core engine process 0 ready.
qwen-vl-utils using decord to read video.
Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
Processed prompts: 100%|█████████████████████████████████████| 1/1 [00:02<00:00,  2.75s/it, est. speed input: 1209.30 toks/s, output: 20.35 toks/s]
<think>
The event "person sitting down in a chair" occurs when the person enters the frame and sits on the chair, preparing to engage with the items on the table.
</think>
<answer>
2.00 to 7.00</answer><|im_end|> [2.0, 7.0]

--- Timing Summary ---
Total program execution time: 7.21 seconds
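
The [2.0, 7.0] printed at the end of the generation line is the time span recovered from the <answer> block. As a rough illustration of how such a span can be turned into start/end seconds (my own sketch, not the parsing code demo.py actually uses):

import re

def parse_answer(text: str):
    # Extract "<answer> START to END </answer>" and return [START, END] as floats
    m = re.search(r"<answer>\s*([\d.]+)\s*to\s*([\d.]+)\s*</answer>", text)
    return None if m is None else [float(m.group(1)), float(m.group(2))]

print(parse_answer("<answer>\n2.00 to 7.00</answer>"))  # -> [2.0, 7.0]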

To run a local test case, modify demo.py in the places below; the changes are marked by the comments in the code:

def get_args():
    parser = argparse.ArgumentParser(
        description="Evaluation for training-free video temporal grounding (Single GPU Version)"
    )
    # NOTE: change this to the local directory holding the model weights
    parser.add_argument(
        "--model_base", type=str, default="your file path"
    )
    parser.add_argument("--batch_size", type=int, default=1, help="Batch size")
    parser.add_argument(
        "--output_dir",
        type=str,
        default="logs/demo",
        help="Directory to save checkpoints",
    )
    parser.add_argument(
        "--device", type=str, default="cuda:0", help="GPU device to use"
    )
    parser.add_argument(
        "--pipeline_parallel_size", type=int, default=1, help="GPU nodes"
    )
    # Path of the test video
    parser.add_argument(
        "--video_path", type=str, default="your video"
    )
    # The query describing the content to locate in the video
    parser.add_argument(
        "--query", type=str, default="person sitting down in a chair."
    )
    parser.add_argument("--max_new_tokens", type=int, default=128)
    parser.add_argument(
        "--total_pixels", type=int, default=3584 * 28 * 28, help="total_pixels"
    )
    return parser.parse_args()
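
With the defaults edited as above, the demo can be launched with python demo.py. Since get_args also exposes everything as command-line flags, another option is to leave the defaults untouched and pass the values at run time, for example (the paths below are placeholders for your own files): python demo.py --model_base ./Time-R1-7B --video_path /path/to/your/video.mp4 --query "person sitting down in a chair."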