Deploying AI TTS Locally, Even Better Than Commercial Solutions! (F5-TTS Local Deployment Tutorial)

Article link: https://blog.youkuaiyun.com/a543768773/article/details/147088769

1️⃣ Environment Setup

# 1. Install Git
sudo apt-get update
sudo apt-get install git

# 2. Install FFmpeg
sudo apt-get install ffmpeg

# 3. Install Anaconda (recommended for isolating the Python environment)
wget https://repo.anaconda.com/archive/Anaconda3-2023.07-1-Linux-x86_64.sh
bash Anaconda3-2023.07-1-Linux-x86_64.sh
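Before moving on, it can help to confirm that the tools installed above are actually reachable from the shell. The following is a minimal sketch using only the Python standard library; the helper name `missing_tools` and the `REQUIRED` list are illustrative assumptions, not part of the project.

```python
import shutil

def missing_tools(tools):
    """Return the names in `tools` that cannot be found on PATH."""
    return [name for name in tools if shutil.which(name) is None]

# Tools installed in the steps above (assumed names of the executables).
# `conda` is usually only found after opening a new shell post-install.
REQUIRED = ["git", "ffmpeg", "conda"]

missing = missing_tools(REQUIRED)
if missing:
    print("Missing from PATH:", ", ".join(missing))
else:
    print("All required tools were found.")
```

If `conda` is reported missing right after installation, opening a new terminal is usually enough for the updated PATH to take effect.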

2️⃣ Installing and Running F5-TTS

# 1. Clone the project
git clone https://github.com/F5-TTS/F5-TTS.git
cd F5-TTS

# 2. Create and activate a virtual environment
conda create -n f5tts python=3.10 -y
conda activate f5tts

# 3. Install the dependencies
pip install -r requirements.txt

# 4. Start the application
python app.py

Now open http://localhost:5000 in a browser to see the TTS interface.
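If the page does not load, a quick first check is whether anything is listening on the port at all. This is a small sketch with the standard `socket` module; the function name `is_port_open` is made up for illustration, and port 5000 is taken from the URL above.

```python
import socket

def is_port_open(host="localhost", port=5000, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name resolution failures
        return False
```

Calling `is_port_open()` after `python app.py` has started should return `True`; a `False` result suggests the server crashed on startup or is bound to a different port.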

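3️⃣ Generating Emotional Speech in the Web Interface

Enter the text you want to convert (English, Chinese, and other languages are supported)
Choose an emotion (for example Happy, Sad, or Angry)
Click the "Generate" button
Download the generated .wav file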
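After downloading, you may want to confirm that the file is a valid WAV and see how long it is. The sketch below uses the standard `wave` module, which only reads PCM-encoded files; whether the server writes PCM is an assumption here, and `wav_info` is an illustrative name.

```python
import wave

def wav_info(path):
    """Return (channels, sample_rate_hz, duration_seconds) for a PCM WAV file."""
    with wave.open(path, "rb") as wf:
        frames = wf.getnframes()
        rate = wf.getframerate()
        return wf.getnchannels(), rate, frames / float(rate)
```

For example, `wav_info("output.wav")` returns a tuple of channel count, sample rate, and duration. A `wave.Error` means the file is not a PCM WAV, for instance an error page saved with a .wav extension.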

4️⃣ Reading the Core Model Invocation Code

Open app.py to find the core TTS synthesis flow. The relevant excerpt is:

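# Flask imports and app object, implied by the route decorator below
from flask import Flask, request, send_file

from inference.infer import generate_audio

app = Flask(__name__)

@app.route("/api/generate", methods=["POST"])
def generate():
    # Read the text and the emotion label from the JSON request body
    input_text = request.json["text"]
    emotion = request.json["emotion"]

    # Synthesize the audio and return the resulting file as a download
    output_path = generate_audio(text=input_text, emotion=emotion)
    return send_file(output_path, as_attachment=True)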
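The route above indexes the JSON body directly, so a request without "text" or "emotion" would surface as an unhandled `KeyError`. One possible hardening is a small validation helper called before `generate_audio`. This is only a sketch: the `ALLOWED_EMOTIONS` set and the `max_chars` limit are assumptions for illustration, not values taken from the project.

```python
# Assumed set of supported emotion labels (not confirmed by the project)
ALLOWED_EMOTIONS = {"neutral", "happy", "sad", "angry"}

def validate_payload(payload, allowed=ALLOWED_EMOTIONS, max_chars=1000):
    """Return a cleaned (text, emotion) pair or raise ValueError."""
    if not isinstance(payload, dict):
        raise ValueError("request body must be a JSON object")
    text = payload.get("text")
    if not isinstance(text, str) or not text.strip():
        raise ValueError("'text' must be a non-empty string")
    if len(text) > max_chars:
        raise ValueError(f"'text' is longer than {max_chars} characters")
    # Fall back to a neutral voice when no emotion is given
    emotion = str(payload.get("emotion", "neutral")).lower()
    if emotion not in allowed:
        raise ValueError(f"unsupported emotion: {emotion}")
    return text.strip(), emotion
```

Inside the route, the `ValueError` could be caught and turned into a 400 response so that clients get a readable message instead of a server error.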

Digging further into inference/infer.py, the core logic is:

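def generate_audio(text, emotion):
    # Convert the input text to model tokens
    tokens = tokenizer(text)
    # Look up the embedding for the requested emotion
    emotion_embedding = get_emotion_embedding(emotion)
    # Run the TTS model on the tokens, conditioned on the emotion
    audio = tts_model.infer(tokens, emotion_embedding)
    # Write the waveform to disk and return the file path
    save_wav(audio, "output.wav")
    return "output.wav"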
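Note that the function above always writes to the same "output.wav", so two overlapping requests could overwrite each other's result. One way around this is to generate a fresh path per call. The helper below is a hypothetical sketch; the name `unique_output_path` and the `outputs` directory are assumptions, not part of the project.

```python
import hashlib
import os
import uuid

def unique_output_path(text, emotion, out_dir="outputs"):
    """Build a collision-free .wav path for one synthesis request."""
    os.makedirs(out_dir, exist_ok=True)
    # A short hash of the text keeps the name readable and filesystem-safe
    digest = hashlib.sha1(text.encode("utf-8")).hexdigest()[:8]
    # A random suffix separates repeated requests for the same text
    suffix = uuid.uuid4().hex[:8]
    return os.path.join(out_dir, f"{emotion}_{digest}_{suffix}.wav")
```

`generate_audio` could then pass this path to `save_wav` and return it instead of the fixed file name.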

5️⃣ 🔧 Batch Text + Multi-Emotion Synthesis Script (Custom Extension)

You can wrap the API or the model interface in a Python script to synthesize speech in batches:

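import requests

# Sample Chinese input sentences for the TTS engine
TEXTS = [
    "欢迎来到我们的节目，今天我们聊聊AI与人类的未来。",
    "我今天感到非常兴奋，因为我们迎来了突破！",
    "这个故事有些伤感，但值得我们深思。"
]

EMOTIONS = ["neutral", "happy", "sad"]

def synthesize(text, emotion):
    # Post one text/emotion pair to the local API
    response = requests.post("http://localhost:5000/api/generate", json={
        "text": text,
        "emotion": emotion
    }, timeout=120)
    # Stop on an HTTP error instead of saving an error page as a .wav file
    response.raise_for_status()
    # Name the file after the emotion and the first six characters of the text
    filename = f"output_{emotion}_{text[:6]}.wav"
    with open(filename, 'wb') as f:
        f.write(response.content)
    print(f"[✔] Generated: {filename}")

# Synthesize every text with every emotion
for text in TEXTS:
    for emo in EMOTIONS:
        synthesize(text, emo)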

Running the script produces one .wav file per text/emotion pair, nine files for the lists above.
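Naming files by the first characters of the text works for the samples here, but two texts that begin the same way would overwrite each other. As an alternative sketch, the job list can be planned up front with index-based names; `plan_jobs` and this naming scheme are illustrative assumptions and differ from the script's own file names.

```python
from itertools import product

def plan_jobs(texts, emotions):
    """Return one (text, emotion, filename) tuple per text/emotion pair."""
    jobs = []
    # enumerate gives each text a stable 1-based index used in the file name
    for (idx, text), emotion in product(enumerate(texts, start=1), emotions):
        jobs.append((text, emotion, f"output_{idx:03d}_{emotion}.wav"))
    return jobs
```

Each tuple can then be fed to a synthesis function, and the plain ASCII file names also avoid trouble on filesystems or tools that handle non-ASCII names poorly.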

6️⃣ 🎙️ Podcast Assembly + Automatic Output Naming (Enhancement)

We can use pydub to stitch the generated clips into a podcast automatically:

from pydub import AudioSegment

# Load the generated wav files (named by the batch script: emotion + first six characters of the text)
segments = [
    AudioSegment.from_wav("output_happy_欢迎来到我们.wav"),
    AudioSegment.from_wav("output_sad_这个故事有些.wav")
]

# Use one second of silence as the transition between segments
transition = AudioSegment.silent(duration=1000)

# Concatenate the segments
final_podcast = segments[0]
for seg in segments[1:]:
    final_podcast += transition + seg

# Save the result
final_podcast.export("my_podcast_with_emotions.wav", format="wav")
print("Podcast saved: my_podcast_with_emotions.wav")
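If installing pydub is not an option, the same join-with-silence idea can be sketched with the standard `wave` module instead. This is a swapped-in technique, not the pydub approach above, and it assumes all inputs are PCM WAV files that share the same channel count, sample width, and sample rate. The function name `concat_wavs` is illustrative.

```python
import wave

def concat_wavs(paths, out_path, gap_ms=1000):
    """Join PCM WAV files with `gap_ms` of silence between them.

    Returns the total number of frames written.
    """
    params = None
    chunks = []
    for path in paths:
        with wave.open(path, "rb") as wf:
            fmt = (wf.getnchannels(), wf.getsampwidth(), wf.getframerate())
            if params is None:
                params = fmt
            elif fmt != params:
                raise ValueError(f"{path} has format {fmt}, expected {params}")
            chunks.append(wf.readframes(wf.getnframes()))
    if params is None:
        raise ValueError("no input files given")

    channels, width, rate = params
    gap_frames = rate * gap_ms // 1000
    # 8-bit WAV samples are unsigned, so silence is 0x80; wider samples use 0x00
    silence_byte = b"\x80" if width == 1 else b"\x00"
    silence = silence_byte * (gap_frames * channels * width)
    data = silence.join(chunks)

    with wave.open(out_path, "wb") as out:
        out.setnchannels(channels)
        out.setsampwidth(width)
        out.setframerate(rate)
        out.writeframes(data)
    return len(data) // (channels * width)
```

Unlike pydub, this version does no resampling or format conversion, which is why mismatched inputs are rejected rather than silently mixed.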

✨ Bonus: Generate a Podcast Clip from the Command Line

python generate_podcast.py --text "你好，今天是个好天气。" --emotion happy --output podcast_happy.wav

argparse handles the CLI parameters, which makes the script easy to use in automated pipelines:

import argparse
parser = argparse.ArgumentParser()
parser.add_argument("--text", type=str, required=True)
parser.add_argument("--emotion", type=str, default="neutral")
parser.add_argument("--output", type=str, default="output.wav")
args = parser.parse_args()

# Sending the request is omitted here...
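For reference, here is one way the omitted request step might be filled in. It is a sketch under assumptions: it reuses the /api/generate endpoint from the app.py excerpt, uses the standard `urllib.request` module in place of `requests` so it has no third-party dependency, and the function names `build_parser`, `synthesize_to_file`, and `main` are made up for illustration.

```python
import argparse
import json
import urllib.request

# Endpoint taken from the app.py excerpt; adjust if your server differs
API_URL = "http://localhost:5000/api/generate"

def build_parser():
    """Create the parser with the same three options as the snippet above."""
    parser = argparse.ArgumentParser(description="Generate one speech clip.")
    parser.add_argument("--text", type=str, required=True)
    parser.add_argument("--emotion", type=str, default="neutral")
    parser.add_argument("--output", type=str, default="output.wav")
    return parser

def synthesize_to_file(text, emotion, output, url=API_URL, timeout=120):
    """POST the text to the API and write the returned audio bytes to `output`."""
    body = json.dumps({"text": text, "emotion": emotion}).encode("utf-8")
    req = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    # urlopen raises urllib.error.HTTPError on 4xx/5xx responses
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        audio = resp.read()
    with open(output, "wb") as f:
        f.write(audio)
    return output

def main(argv=None):
    """Parse the arguments (sys.argv by default) and run one synthesis."""
    args = build_parser().parse_args(argv)
    path = synthesize_to_file(args.text, args.emotion, args.output)
    print(f"Saved {path}")
```

To use this as generate_podcast.py, call `main()` at the bottom of the file; passing an explicit list such as `main(["--text", "hello"])` is convenient for testing.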