Build a Voice Assistant Quickly with HuggingFace 1-Click Model Deployment

Zhuopu Cloud (卓普云)

Published 2025-06-18 15:51:36

License: CC 4.0 BY-SA

Category: Tutorials
Tags: Artificial Intelligence, Huggingface, AIGC

Article link: https://blog.youkuaiyun.com/DO_Community/article/details/148742318

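1-Click Models is a new offering launched jointly by DigitalOcean and Hugging Face. It is designed to make top open-source large language models (LLMs) easy to call on the most powerful GPUs in the cloud, so users can work with the best models directly, without complex configuration.

This tutorial walks you through building a voice-enabled personal assistant application based on the Gradio framework with FastAPI integration. You will learn how to:

Call a 1-Click Model GPU Droplet
Implement voice interaction with speech recognition and speech synthesis
Extend functionality through the API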

1-Click HuggingFace Models and GPU Droplets

Available models

1-Click Models currently offers the following LLM options (grouped by model family):

Meta: Llama-3.1 (8B/70B/405B parameters)
Qwen: Qwen2.5-7B-Instruct
Gemma: Gemma-2-9b/27b-it
Mixtral: 8x7B/8x22B-Instruct
Hermes: Llama-3.1-8B/70B/405B and Mixtral-8x7B-DPO

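Deployment steps

Create a GPU Droplet instance by following the official documentation or our earlier blog posts.
For a companion video tutorial with complete step-by-step guidance, search Bilibili for “如何使用DigitalOcean GPU Droplets的一键部署Huggingface模型”, and check the resources page in the DigitalOcean control panel for the latest model list and pricing.
If your team has heavy usage or plans to run GPU Droplet cloud servers for a long period, discounted pricing can be arranged; for details, contact Zhuopu Cloud (卓普云), DigitalOcean's exclusive strategic partner in China.

Once the GPU Droplet instance has started, move on to the interaction setup in the next step.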

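Interacting with the 1-Click Model deployment

Connecting to the 1-Click Model deployment is straightforward if we want to interact with it from the same machine. “When connecting to the HUGS Droplet, the initial SSH message will display a Bearer Token, which is required to send requests to the public IP of the deployed HUGS Droplet. You can then send requests to the Messages API either via localhost (if you are connected inside the HUGS Droplet) or via its public IP.” (source). So if we want to access the Droplet from another machine, we need to obtain this Bearer Token. Connect to your machine over SSH to get a copy of the token, and save it for later use.

If we only want to send requests to the inference endpoint from the GPU Droplet itself, things are simpler still: the token is already stored in an environment variable there.

Once the Bearer Token variable is set on the machine we choose to use, we can start running inference with the model. There are currently two ways to do this: cURL and Python. The endpoint runs automatically on port 8080, so by default we can send requests to our own machine. If we are using a different machine, change the localhost value below to the Droplet's IPv4 address.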

cURL

curl http://localhost:8080/v1/chat/completions \
    -X POST \
    -d '{"messages":[{"role":"user","content":"What is Deep Learning?"}],"temperature":0.7,"top_p":0.95,"max_tokens":128}' \
    -H 'Content-Type: application/json' \
    -H "Authorization: Bearer $BEARER_TOKEN"

This command asks the model “What is Deep Learning?” and returns a response in the following format:

{
  "object": "chat.completion",
  "id": "",
  "created": 1731532721,
  "model": "hfhugs/Meta-Llama-3.1-8B-Instruct",
  "system_fingerprint": "2.3.1-dev0-sha-169178b",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "**Deep Learning: A Subfield of Machine Learning**\n=====================================================\nDeep learning is a subfield of machine learning that focuses on the use of artificial neural networks to analyze and interpret data. It is inspired by the structure and function of the human brain and is particularly well-suited for tasks such as image and speech recognition, natural language processing, and data classification.\n\n**Key Characteristics of Deep Learning:**\n1. **Artificial Neural Networks**: Deep learning models are composed of multiple layers of interconnected nodes or \"neurons\" that process and transform inputs into outputs.\n2. **Non-Linear Transformations**: Each layer applies a non-linear"
      },
      "logprobs": null,
      "finish_reason": "length"
    }
  ],
  "usage": {
    "prompt_tokens": 40,
    "completion_tokens": 128,
    "total_tokens": 168
  }
}

This response can then be plugged into a variety of web development applications as needed.
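As a rough illustration of how an application might consume that JSON, the sketch below pulls out the reply text, checks whether the answer was cut off by the token limit, and reads the token usage. The helper name extract_reply and the shortened sample content are assumptions made for this example; only the field names come from the response shown above.

```python
import json

# Trimmed copy of the response format shown above; the content string is
# shortened here for readability.
SAMPLE_RESPONSE = """
{
  "object": "chat.completion",
  "model": "hfhugs/Meta-Llama-3.1-8B-Instruct",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Deep learning is a subfield of machine learning."
      },
      "logprobs": null,
      "finish_reason": "length"
    }
  ],
  "usage": {"prompt_tokens": 40, "completion_tokens": 128, "total_tokens": 168}
}
"""


def extract_reply(raw_json: str) -> dict:
    """Hypothetical helper: keep only the fields a web app is likely to need."""
    data = json.loads(raw_json)
    choice = data["choices"][0]
    return {
        "text": choice["message"]["content"],
        # finish_reason == "length" means generation stopped at max_tokens
        "truncated": choice["finish_reason"] == "length",
        "total_tokens": data["usage"]["total_tokens"],
    }


reply = extract_reply(SAMPLE_RESPONSE)
print(reply["text"])
```

In the sample response above, finish_reason is "length", which is why the answer stops mid-sentence; raising max_tokens in the request is the usual remedy.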

Python

The model can also be accessed from Python through either the Hugging Face Hub or the OpenAI package. In this demo we use the Hugging Face Hub reference code.

import os
from huggingface_hub import InferenceClient
client = InferenceClient(base_url="http://localhost:8080", api_key=os.getenv("BEARER_TOKEN"))
chat_completion = client.chat.completions.create(
    messages=[
        {"role":"user","content":"What is Deep Learning?"},
    ],
    temperature=0.7,
    top_p=0.95,
    max_tokens=128,
)

This returns a formatted response as a ChatCompletionOutput object.

ChatCompletionOutput(choices=[ChatCompletionOutputComplete(finish_reason='length', index=0, message=ChatCompletionOutputMessage(role='assistant', content='**Deep Learning: An Overview**\nDeep Learning is a subset of Machine Learning that involves the use of Artificial Neural Networks (ANNs) with multiple layers to analyze and interpret data. These networks are inspired by the structure and function of the human brain, with each layer processing the input data in a hierarchical manner.\n\n**Key Characteristics:**\n1.  **Multiple Layers:** Deep Learning models typically have 2 or more hidden layers, allowing them to learn complex patterns and relationships in the data.\n2.  **Neural Networks:** Deep Learning models are based on artificial neural networks, which are composed of interconnected nodes (neurons) that process', tool_calls=None), logprobs=None)], created=1731532948, id='', model='hfhugs/Meta-Llama-3.1-8B-Instruct', system_fingerprint='2.3.1-dev0-sha-169178b', usage=ChatCompletionOutputUsage(completion_tokens=128, prompt_tokens=40, total_tokens=168))

We can print just the output content as follows:

print(chat_completion.choices[0].message.content)
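The Python example above assumes the script runs on the Droplet itself. When calling from a different machine, the same request can be pointed at the Droplet's IPv4 address, as noted earlier. The following is a minimal sketch using only the Python standard library. The address 203.0.113.10 is a documentation placeholder, the fallback token "example-token" exists only so the snippet runs without a real token, and the helper names build_chat_request and send_chat_request are hypothetical.

```python
import json
import os
import urllib.request


def build_chat_request(host: str, prompt: str, token: str,
                       max_tokens: int = 128) -> urllib.request.Request:
    """Build (but do not send) a chat completion request for the given host."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.7,
        "top_p": 0.95,
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=f"http://{host}:8080/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


def send_chat_request(request: urllib.request.Request) -> str:
    """Send the request and return only the assistant's reply text."""
    with urllib.request.urlopen(request, timeout=60) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Placeholder address: replace with your Droplet's public IPv4 address.
request = build_chat_request(
    host="203.0.113.10",
    prompt="What is Deep Learning?",
    token=os.getenv("BEARER_TOKEN", "example-token"),
)
print(request.full_url)
# send_chat_request(request) performs the actual network call.
```

Building the request separately from sending it makes it easy to inspect the URL and headers before any traffic leaves the machine.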

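Building a voice-enabled personal assistant

To get the most out of this powerful new tool, we developed a new personal assistant application that runs on top of these models. The application is fully voice-enabled: it can listen to input and read output aloud. To do this, the demo uses Whisper to transcribe audio input into text, or accepts plain text directly, and passes it to the LLM powered by the 1-Click GPU Droplet to generate a text response. We then use Coqui-AI's XTTS2 model to convert that text into intelligible audio output. Note that the software uses voice cloning to generate the output audio, so the user receives speech output that approximates their own speaking voice.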
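The flow described above (speech to text, text to LLM reply, reply to speech) can be sketched as three interchangeable stages. This is an illustrative outline only, not the demo's actual structure: the function assistant_turn and the stand-in stages are hypothetical, and in the demo itself speech synthesis is triggered by a separate button rather than chained automatically.

```python
from typing import Callable, Optional, Tuple


def assistant_turn(
    transcribe: Callable[[str], str],
    generate: Callable[[str], str],
    synthesize: Callable[[str], str],
    text: Optional[str] = None,
    audio_path: Optional[str] = None,
) -> Tuple[str, str, str]:
    """Run one turn and return (user_text, reply_text, reply_audio_path)."""
    if text is None:
        if audio_path is None:
            raise ValueError("provide either text or audio_path")
        text = transcribe(audio_path)   # speech -> text (Whisper in the demo)
    reply = generate(text)              # text -> text (LLM on the GPU Droplet)
    reply_audio = synthesize(reply)     # text -> speech (XTTS2 in the demo)
    return text, reply, reply_audio


# Stand-in stages so the sketch runs without a GPU or any model downloads.
def fake_transcribe(path: str) -> str:
    return "what is deep learning"


def fake_generate(prompt: str) -> str:
    return f"You asked: {prompt}"


def fake_synthesize(reply: str) -> str:
    return "reply.wav"


spoken = assistant_turn(fake_transcribe, fake_generate, fake_synthesize,
                        audio_path="question.wav")
typed = assistant_turn(fake_transcribe, fake_generate, fake_synthesize,
                       text="hello")
print(spoken)
print(typed)
```

Typed input simply skips the transcription stage, which mirrors how the application accepts either audio or plain text.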

The full application code is as follows:

import os

import gradio as gr
import scipy.io.wavfile as wavfile
import torch
from huggingface_hub import InferenceClient
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from TTS.api import TTS

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id_w = "openai/whisper-large-v3"
model_w = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id_w, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model_w.to(device)

processor = AutoProcessor.from_pretrained(model_id_w)

pipe_w = pipeline(
    "automatic-speech-recognition",
    model=model_w,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
)

client = InferenceClient(base_url="http://localhost:8080", api_key=os.getenv("BEARER_TOKEN"))

# Alternative multilingual TTS model (left disabled):
# tts = TTS("tts_models/multilingual/multi-dataset/bark", gpu=True)

# Load XTTS v2.0.2, used here for voice cloning
tts = TTS(model_name="xtts_v2.0.2", gpu=True)

with gr.Blocks() as demo:
    chatbot = gr.Chatbot(type="messages")
    with gr.Row():
        msg = gr.Textbox(label = 'Prompt')
        audi = gr.Audio(label = 'Transcribe audio')
    with gr.Row():
        submit = gr.Button('Submit')
        submit_audio = gr.Button('Submit Audio')
        read_audio = gr.Button('Transcribe Text to Audio')
        clear = gr.ClearButton([msg, chatbot])
    with gr.Row():
        token_val = gr.Slider(label = 'Max new tokens', value = 512, minimum = 128, maximum = 1024, step = 8, interactive=True)
        temperature_ = gr.Slider(label = 'Temperature', value = .7, minimum = 0, maximum =1, step = .1, interactive=True)
        top_p_ = gr.Slider(label = 'Top P', value = .95, minimum = 0, maximum =1, step = .05, interactive=True)

    def respond(message, chat_history, token_val, temperature_, top_p_):
        bot_message = client.chat.completions.create(messages=[{"role":"user","content":f"{message}"},],temperature=temperature_,top_p=top_p_,max_tokens=token_val,).choices[0]['message']['content']
        chat_history.append({"role": "user", "content": message})
        chat_history.append({"role": "assistant", "content": bot_message})
        # tts.tts_to_file(bot_message, speaker_wav="output.wav", language="en", file_path="output.wav")
        return "", chat_history, #"output.wav"

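    def respond_audio(audi, chat_history, token_val, temperature_, top_p_):
        # gr.Audio yields (sample_rate, samples); write the file with the
        # actual sample rate instead of assuming 44100 Hz
        sample_rate, samples = audi
        wavfile.write("output.wav", sample_rate, samples)
        result = pipe_w('output.wav')
        message = result["text"]
        print(message)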
        bot_message = client.chat.completions.create(messages=[{"role":"user","content":f"{message}"},],temperature=temperature_,top_p=top_p_,max_tokens=token_val,).choices[0]['message']['content']
        chat_history.append({"role": "user", "content": message})
        chat_history.append({"role": "assistant", "content": bot_message})
        # tts.tts_to_file(bot_message, speaker_wav="output.wav", language="en", file_path="output2.wav")
        # tts.tts_to_file(bot_message,
                # file_path="output.wav",
                # speaker_wav="output.wav",
                # language="en")
        return "", chat_history, #"output.wav"

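    def read_text(chat_history):
        # Speak the latest message, cloning the voice from output.wav.
        # output.wav must already exist (e.g. from a prior 'Submit Audio');
        # the synthesized speech then overwrites that same file.
        tts.tts_to_file(chat_history[-1]['content'],
                file_path="output.wav",
                speaker_wav="output.wav",
                language="en")
        return 'output.wav'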

    msg.submit(respond, [msg, chatbot, token_val, temperature_, top_p_], [msg, chatbot])
    submit.click(respond, [msg, chatbot, token_val, temperature_, top_p_], [msg, chatbot])
    submit_audio.click(respond_audio, [audi, chatbot, token_val, temperature_, top_p_], [msg, chatbot])
    read_audio.click(read_text, [chatbot], [audi])

demo.launch(share = True)
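In the code above, respond and respond_audio send only the latest message to the model, so each reply is generated without the earlier turns. If multi-turn context is wanted, one possible extension (not part of the original demo) is to rebuild the message list from the chat history before each request. The helper build_messages and its max_history parameter are hypothetical names used for illustration.

```python
from typing import Dict, List


def build_messages(chat_history: List[Dict], message: str,
                   max_history: int = 10) -> List[Dict[str, str]]:
    """Hypothetical helper: recent turns followed by the new user message."""
    # Guard against max_history == 0, since a slice of [-0:] would keep everything.
    recent = chat_history[-max_history:] if max_history > 0 else []
    # Keep only role/content; chat history entries may carry extra keys.
    messages = [{"role": turn["role"], "content": turn["content"]} for turn in recent]
    messages.append({"role": "user", "content": message})
    return messages


history = [
    {"role": "user", "content": "What is Deep Learning?"},
    {"role": "assistant", "content": "A subfield of machine learning."},
]
messages = build_messages(history, "Give me one example.")
print(messages)
```

The resulting list could be passed as the messages argument of client.chat.completions.create in place of the single-message list. Capping the history keeps the prompt from growing without bound over a long session.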

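Taken together, this integrated system lets us take full advantage of the speed and availability of cloud GPUs, and serves as a personal assistant for all kinds of tasks. We have been using it in place of popular closed-source tools such as Gemini and ChatGPT, and have been impressed with the results.

Setting up and running the demo

To install the required packages on your GPU Droplet, paste the following into the terminal: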

pip install gradio tts huggingface_hub transformers datasets scipy torch torchaudio

To run the demo, paste the code above into a blank Python file on your 1-Click Model-enabled cloud GPU (we arbitrarily call it app.py), then run it with the following command:

python3 app.py

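Closing thoughts

The work that went into the personal assistant application in this tutorial has already proven useful in our daily lives, and we hope others will find some use for it too. In addition, the newly launched 1-Click Model GPU Droplets offer a very interesting alternative to enterprise-grade LLM software. Although the cost is high for a single user, we can think of many use cases (especially running the largest open-source LLMs) that could reasonably justify the expense. Our latest offering provides the largest Mixtral and LLaMA models currently available, so this is a good opportunity to test how these models perform against the best competing models.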