Dolphin Speech Recognition (with original model deployment code, model export, and model quantization source code)

Latest recommended article published 2025-12-09 16:47:37


CC 4.0 BY-SA license

Article tags:

#speech-recognition #artificial-intelligence #algorithms


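DolphinASR Project

This is an engineering project built on the Dolphin speech recognition model. It implements model inference, ONNX export, and quantization, recognizes speech from audio files, and provides a toolchain for model optimization and deployment.

Project structure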

├── configs/            # Configuration files
├── Dolphin/           # Dolphin ASR core model
├── models/            # Pretrained models and sample audio
│   └── ASR/
│       └── dolphin-base/  # Base model
├── model_classes/     # Model class definitions
├── model_utils/      # Utility functions
└── outputs/          # Exported models and logs

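Main modules

1. Speech recognition inference (run.py)

Basic speech recognition, with the following features:

Loads a pretrained model
Recognizes speech in audio files
Supports language and region settings
Supports timestamp prediction

Usage example: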

import dolphin

# Load the audio and the model
waveform = dolphin.load_audio("test.wav", sr=16000)
model = dolphin.load_model("base", "models/ASR/dolphin-base/", "cpu")

# Run recognition
result = model(waveform, lang_sym="zh", region_sym="CN", predict_time=True)
print(result)

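2. ONNX model export (export.py)

Exports the PyTorch model to ONNX format, with the following features:

Exports the model to ONNX format
Supports dynamic batch size
Supports exporting the feature-extraction frontend
Logs the export process

Usage example: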

python export.py
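
Dynamic batch size in an ONNX export is usually declared through the `dynamic_axes` argument of `torch.onnx.export`. The sketch below only builds that mapping. The tensor names are the ones printed in the sample log at the end of this article; the axis labels, the helper name `build_dynamic_axes`, and the assumption that axis 1 is the time axis are illustrative and not confirmed by export.py.

```python
def build_dynamic_axes(input_names, output_names):
    """Mark axis 0 as a dynamic batch axis for every tensor, and axis 1 as a
    dynamic time axis for tensors that are not length vectors.
    Assumed layout: [batch, time, dim] for features, [batch] for lengths."""
    axes = {}
    for name in list(input_names) + list(output_names):
        if name.endswith("lengths") or name.endswith("lens"):
            axes[name] = {0: "batch"}
        else:
            axes[name] = {0: "batch", 1: "time"}
    return axes


dynamic_axes = build_dynamic_axes(
    ["feats", "feats_lengths"],
    ["encoder_out", "encoder_out_lens"],
)
print(dynamic_axes["feats"])          # {0: 'batch', 1: 'time'}
print(dynamic_axes["feats_lengths"])  # {0: 'batch'}
```

The resulting dictionary would then be passed as `torch.onnx.export(..., dynamic_axes=dynamic_axes)`.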

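3. ONNX inference (run_onnx.py)

Runs model inference with ONNX Runtime, with the following features:

Loads ONNX models and runs inference
Audio preprocessing and feature extraction
Supports language detection
Integrated beam search decoding

Usage example: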

python run_onnx.py

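4. Model quantization (optim.py)

Implements model quantization and optimization, supporting:

Dynamic quantization (UInt8)
Selective operator quantization
Model size comparison
Quantization process logging

Usage example: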

python optim.py -m outputs/model.onnx -o outputs/model_uint8.onnx
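
The command above takes an input model with `-m` and an output path with `-o`. A minimal sketch of how such a command line could be parsed, and how the size comparison could be computed, is shown below. The long option names and both function names are hypothetical; the article does not show the argument handling inside optim.py.

```python
import argparse


def parse_args(argv=None):
    """Parse -m/-o style arguments (the long names are hypothetical)."""
    parser = argparse.ArgumentParser(description="Quantize an ONNX model")
    parser.add_argument("-m", "--model", required=True, help="input ONNX model")
    parser.add_argument("-o", "--output", required=True, help="quantized ONNX model")
    return parser.parse_args(argv)


def compression_ratio(orig_bytes, quant_bytes):
    """Return how many times smaller the quantized file is."""
    if quant_bytes <= 0:
        raise ValueError("quantized size must be positive")
    return orig_bytes / quant_bytes


args = parse_args(["-m", "outputs/model.onnx", "-o", "outputs/model_uint8.onnx"])
print(args.model, "->", args.output)
print(f"{compression_ratio(400, 100):.2f}x")  # 4.00x
```

In a real script the two sizes would come from `os.path.getsize` on the input and output files.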

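5. Model information export (export_model_info.py)

Analyzes an ONNX model and exports its information:

Basic model information
Input and output node analysis
Operator statistics
Parameter statistics

Usage example: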

python export_model_info.py -m outputs/model.onnx -o outputs/model.info
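
Operator statistics amount to counting the `op_type` of every node in the graph. The sketch below shows the counting step on a hand-written list so it runs without a model file; the function name `summarize_ops` is hypothetical and is not taken from export_model_info.py.

```python
from collections import Counter


def summarize_ops(op_types):
    """Count operator types; return (total, counts sorted by frequency)."""
    counts = Counter(op_types)
    return sum(counts.values()), counts.most_common()


# With the onnx package, the op types could be collected as
#   [node.op_type for node in onnx.load(path).graph.node]
# A hand-written list stands in for a real model here.
total, by_type = summarize_ops(["MatMul", "Add", "MatMul", "Softmax", "MatMul"])
print(total)    # 5
print(by_type)  # [('MatMul', 3), ('Add', 1), ('Softmax', 1)]
```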

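Algorithm implementation details

1. Audio processing pipeline

Audio loading and preprocessing
- Load the audio with torchaudio and normalize it
- Automatically resample to 16 kHz
- Convert to mono (multi-channel audio is averaged)
- Normalize the audio length (zero-pad or truncate)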

import torch
import torchaudio

def load_audio_file(audio_path):
    """Load and preprocess an audio file."""
    # Load the audio; normalize=True returns float32 samples in [-1, 1]
    waveform, sample_rate = torchaudio.load(audio_path, normalize=True)

    # Resample to 16 kHz
    if sample_rate != 16000:
        waveform = torchaudio.functional.resample(waveform, sample_rate, 16000)

    # Downmix to mono by averaging the channels
    if waveform.dim() == 2 and waveform.size(0) > 1:
        waveform = waveform.mean(dim=0, keepdim=True)

    # Normalize the length to 30 s of audio
    target_length = int(16000 * 30)
    if waveform.size(1) > target_length:
        waveform = waveform[:, :target_length]
    else:
        pad_length = target_length - waveform.size(1)
        waveform = torch.nn.functional.pad(waveform, (0, pad_length))

    return waveform

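Feature extraction
- Extract features with the ESPnet frontend
- Compute the short-time Fourier transform (STFT)
- Extract 80-dimensional log-mel spectrogram features
- Normalize features using global statistics
- Supports online feature extraction and batching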

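import numpy as np
import torch
import torchaudio

def extract_features(waveform):
    """Extract log-mel features from a [1, num_samples] waveform."""
    # STFT parameters
    n_fft = 512
    win_length = 400
    hop_length = 160
    n_mels = 80

    # Compute the STFT: [1, n_fft // 2 + 1, num_frames]
    spec = torch.stft(
        waveform,
        n_fft=n_fft,
        win_length=win_length,
        hop_length=hop_length,
        window=torch.hann_window(win_length),
        return_complex=True
    )

    # Map the power spectrum onto the mel scale
    mel_basis = torchaudio.transforms.MelScale(
        n_mels=n_mels,
        sample_rate=16000,
        f_min=0,
        f_max=8000,
        n_stft=n_fft // 2 + 1
    )

    mel_spec = mel_basis(spec.abs().pow(2))

    # Log-mel features, transposed to [1, num_frames, n_mels] so that the
    # per-dimension statistics broadcast over the last axis
    features = torch.log(mel_spec + 1e-7).transpose(1, 2)

    # Normalize with global statistics. An .npz file is a NumPy archive,
    # so it is read with np.load; the 'mean' and 'std' keys are assumed.
    stats = np.load('feats_stats.npz')
    mean = torch.from_numpy(stats['mean']).to(features)
    std = torch.from_numpy(stats['std']).to(features)
    features = (features - mean) / std

    return features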
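
The hop length fixes how many feature frames a clip produces. As a rough check, the sketch below applies the frame-count formula documented for `torch.stft` to the parameters used above; the function name `num_stft_frames` is hypothetical.

```python
def num_stft_frames(num_samples, hop_length=160, n_fft=512, center=True):
    """Number of STFT frames, following the torch.stft length formula."""
    if center:
        # The signal is padded by n_fft // 2 on both sides
        return 1 + num_samples // hop_length
    return 1 + (num_samples - n_fft) // hop_length


samples_30s = 16000 * 30
print(num_stft_frames(samples_30s))                # 3001
print(num_stft_frames(samples_30s, center=False))  # 2997
```

With a hop of 160 samples at 16 kHz, consecutive frames are 10 ms apart.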

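2. Model architecture

Dolphin ASR uses a Conformer-Transformer encoder-decoder architecture:

1. Encoder (Conformer)
   - Stacked Conformer blocks process the acoustic features
   - Each Conformer block contains:
     - Multi-head self-attention
     - A convolution module (depthwise separable convolution)
     - A feed-forward network
     - Layer normalization and residual connections
   - Positional encoding preserves sequence order
   - Supports relative-position attention

2. Decoder (Transformer)
   - Autoregressive decoding
   - Stacked Transformer decoder blocks
   - Cross-attention links acoustic and text information
   - Integrates language model scores
   - Supports caching to speed up inference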

```python
import torch.nn as nn

class TransformerDecoder(nn.Module):
    """Transformer decoder."""
    def __init__(self, vocab_size, d_model=256, nhead=4, num_layers=6):
        super().__init__()
        self.d_model = d_model

        # Positional encoding (PositionalEncoding must be defined elsewhere)
        self.pos_encoder = PositionalEncoding(d_model)

        # Decoder layers
        decoder_layer = nn.TransformerDecoderLayer(
            d_model=d_model,
            nhead=nhead,
            dim_feedforward=2048,
            dropout=0.1,
            batch_first=True
        )
        self.decoder = nn.TransformerDecoder(
            decoder_layer,
            num_layers=num_layers
        )

        # Output projection onto the vocabulary
        self.output_layer = nn.Linear(d_model, vocab_size)

    def forward(self, tgt, memory, tgt_mask=None, tgt_key_padding_mask=None):
        """
        Args:
            tgt: target sequence [batch, seq_len, d_model]
            memory: encoder output [batch, src_len, d_model]
            tgt_mask: decoder self-attention mask
            tgt_key_padding_mask: padding mask
        """
        # Add positional encoding
        tgt = self.pos_encoder(tgt)

        # Transformer decoding
        output = self.decoder(
            tgt,
            memory,
            tgt_mask=tgt_mask,
            tgt_key_padding_mask=tgt_key_padding_mask
        )

        # Output layer
        logits = self.output_layer(output)

        return logits
```
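
The decoder references a `PositionalEncoding` module that the article does not show. A common choice is the sinusoidal encoding from the original Transformer; whether Dolphin uses this exact variant is an assumption. The plain-Python sketch below computes the table so the formula is visible. A real module would store it as a tensor and add it to the input; the name `sinusoidal_table` is hypothetical.

```python
import math


def sinusoidal_table(max_len, d_model):
    """Return a max_len x d_model table with
    PE[pos][2i]   = sin(pos / 10000^(2i / d_model)) and
    PE[pos][2i+1] = cos(pos / 10000^(2i / d_model))."""
    table = [[0.0] * d_model for _ in range(max_len)]
    for pos in range(max_len):
        # i walks over the even column indices, so it plays the role of 2i
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            table[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                table[pos][i + 1] = math.cos(angle)
    return table


pe = sinusoidal_table(max_len=4, d_model=8)
print(pe[0][:2])  # [0.0, 1.0]
```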

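Multi-task learning
- Automatic speech recognition (ASR) task
- Language identification task
- Timestamp prediction task
- Task-specific tokens control the output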
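
The task-specific tokens appear in the sample output at the end of this article as a `<zh><CN><asr><0.00>` prefix. A minimal sketch of assembling such a prefix is shown below. The function name `build_prompt` is hypothetical, and starting the first segment at `<0.00>` is an assumption drawn from that one log line rather than from the model code.

```python
def build_prompt(lang_sym, region_sym, task="asr", predict_time=True):
    """Build the control-token prefix that conditions the decoder.
    The token order follows the sample output: <lang><region><task>."""
    tokens = [f"<{lang_sym}>", f"<{region_sym}>", f"<{task}>"]
    if predict_time:
        # Assumption: the first segment starts at time 0.00
        tokens.append("<0.00>")
    return tokens


print("".join(build_prompt("zh", "CN")))  # <zh><CN><asr><0.00>
```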

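3. Inference optimization

Model export optimization
- Prune and compress the model before export
- Static graph optimization to eliminate redundant computation
- Operator fusion to reduce memory access
- Supports dynamic batch size and sequence length
- Remove training-only modules

Quantization strategy
- Selective operator quantization
  - Quantize only compute-intensive operators
  - Keep key operators in FP32 precision
- Weights quantized to UInt8
  - Each layer quantized independently
  - Tuned quantization scale factors
- Activation handling
  - Quantization ranges computed dynamically
  - Calibration set to tune quantization parameters (static quantization only)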

def quantize_model(model_path, output_path):
    """Quantize an ONNX model with dynamic quantization."""
    import os
    from onnxruntime.quantization import quantize_dynamic, QuantType

    # Quantization settings
    weight_type = QuantType.QUInt8
    op_types_to_quantize = ['MatMul', 'Gemm', 'Conv']

    # Dynamic quantization computes activation ranges at run time, so
    # quantize_dynamic takes no calibration data reader (that argument
    # belongs to quantize_static). The optimize_model argument is also
    # omitted: it was deprecated and recent onnxruntime releases no
    # longer accept it.
    try:
        quantize_dynamic(
            model_input=model_path,
            model_output=output_path,
            weight_type=weight_type,
            op_types_to_quantize=op_types_to_quantize,
            per_channel=False
        )

        print(f"Quantized model saved to: {output_path}")

        # Compare model sizes
        orig_size = os.path.getsize(model_path) / (1024 * 1024)
        quant_size = os.path.getsize(output_path) / (1024 * 1024)
        print(f"Original model size: {orig_size:.2f} MB")
        print(f"Quantized model size: {quant_size:.2f} MB")
        print(f"Compression ratio: {orig_size/quant_size:.2f}x")

    except Exception as e:
        print(f"Quantization failed: {e}")
        return False

    return True
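
To make the idea of UInt8 weights with a scale factor concrete, the sketch below applies plain asymmetric quantization to a short list of floats and maps the result back. It illustrates the arithmetic only; it is not the code path ONNX Runtime uses internally, and both function names are hypothetical.

```python
def quantize_uint8(values):
    """Asymmetric per-tensor quantization of floats to the range 0..255."""
    # The represented range must include 0 so that 0.0 maps to an integer
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    scale = (hi - lo) / 255.0
    if scale == 0.0:
        scale = 1.0  # all-zero input: any scale works
    zero_point = round(-lo / scale)
    q = [min(255, max(0, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point


def dequantize(q, scale, zero_point):
    """Map quantized integers back to approximate floats."""
    return [(x - zero_point) * scale for x in q]


weights = [-1.0, -0.5, 0.0, 0.5, 1.0]
q, scale, zp = quantize_uint8(weights)
restored = dequantize(q, scale, zp)
print(q, zp)
print(max(abs(a - b) for a, b in zip(weights, restored)) <= scale)  # True
```

Each weight is stored in one byte instead of four, and the reconstruction error stays within one quantization step.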

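Decoding optimization
- Improved beam search
  - Adaptive beam width
  - Length penalty
  - Coverage penalty
- Batch parallelization
  - Dynamic batching
  - Automatic load balancing
- Decoding acceleration
  - KV cache
  - Early stopping
  - Local attention optimization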

import torch
import torch.nn.functional as F

# Special token IDs. These values match the first and last entries of yseq
# in the sample output of this article; adjust them to the model vocabulary.
BOS_ID = 39999
EOS_ID = 40000

def beam_search_decode(model, encoder_out, beam_size=5, max_len=100):
    """Beam search decoding for a single utterance."""
    device = encoder_out.device

    # Initialize the beam with (sequence, cumulative log-probability)
    beams = [(torch.tensor([BOS_ID], device=device), 0.0)]

    for _ in range(max_len):
        candidates = []

        # Expand every hypothesis in the beam
        for seq, score in beams:
            if seq[-1].item() == EOS_ID:
                candidates.append((seq, score))
                continue

            # Predict the next token
            decoder_out = model.decode(seq.unsqueeze(0), encoder_out)
            logits = decoder_out[0, -1]  # last time step
            log_probs = F.log_softmax(logits, dim=-1)

            # Keep the top-k candidates
            values, indices = log_probs.topk(beam_size)
            for value, index in zip(values, indices):
                new_seq = torch.cat([seq, index.unsqueeze(0)])
                new_score = score + value.item()
                candidates.append((new_seq, new_score))

        # Keep the best beam_size candidates
        beams = sorted(candidates, key=lambda x: x[1], reverse=True)[:beam_size]

        # Stop once every hypothesis has ended
        if all(seq[-1].item() == EOS_ID for seq, _ in beams):
            break

    # Return the best sequence without BOS, and without EOS if present
    best_seq, best_score = beams[0]
    end = -1 if best_seq[-1].item() == EOS_ID else None
    return best_seq[1:end], best_score
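
The function above ranks hypotheses by raw cumulative log-probability, which favors short outputs. The length penalty listed under decoding optimization is not shown in the article; one widely used form is the GNMT penalty, sketched below. Treating this as the penalty Dolphin applies is an assumption, and the scores in the example are made up.

```python
def length_normalized_score(log_prob_sum, length, alpha=0.6):
    """Divide a cumulative log-probability by the GNMT length penalty
    lp(length) = ((5 + length) / 6) ** alpha."""
    penalty = ((5.0 + length) / 6.0) ** alpha
    return log_prob_sum / penalty


# Two finished hypotheses: (token count, cumulative log-probability)
shorter = (4, -4.0)
longer = (12, -5.5)
for name, (n, lp) in (("shorter", shorter), ("longer", longer)):
    print(name, round(length_normalized_score(lp, n), 3))
print(length_normalized_score(-3.0, 1))  # -3.0, since lp(1) == 1
```

Under raw scores the shorter hypothesis wins; after normalization the longer one ranks higher.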

Environment setup

The project's Python dependencies are listed in the req.info file. The main ones are:

PyTorch
torchaudio
ESPnet
ONNX Runtime
NumPy
SoundFile

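Performance optimization

1. Compute optimization

Operator fusion: reduces memory access and data movement
Memory optimization: reuses memory and avoids redundant allocation
Parallel computation: uses multi-core CPUs and vector instructions
Inference scheduling: dynamic batching and load balancing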

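2. Memory optimization

Feature caching: reuses preprocessed features
Weight sharing: reduces the memory taken by model parameters
Memory pool: efficient memory allocation and reclamation
Incremental computation: supports streaming processing of long audio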
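
One simple way to process long audio incrementally is to cut it into fixed windows and recognize each window in turn. The sketch below computes the window boundaries. The 30 s window matches the length used in `load_audio_file`, but the chunking scheme itself and the name `chunk_bounds` are assumptions, not the project's confirmed approach.

```python
def chunk_bounds(num_samples, sample_rate=16000, window_s=30.0, overlap_s=0.0):
    """Yield (start, end) sample indices of consecutive windows."""
    window = int(window_s * sample_rate)
    step = window - int(overlap_s * sample_rate)
    if step <= 0:
        raise ValueError("overlap must be shorter than the window")
    start = 0
    while start < num_samples:
        yield start, min(start + window, num_samples)
        start += step


# 70 s of 16 kHz audio split into 30 s windows
print(list(chunk_bounds(70 * 16000)))
# [(0, 480000), (480000, 960000), (960000, 1120000)]
```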

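3. Precision control

Accuracy comparison after quantization
Double-precision protection for key operators
Numerical stability optimization
Automatic calibration of quantization parameters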
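
A common way to compare accuracy before and after quantization is the character error rate (CER) of each model's output against a reference transcript. The sketch below computes CER with a standard edit distance; the article does not say which metric the project uses, and the transcripts in the example are made up.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (single-row DP)."""
    row = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        prev_diag, row[0] = row[0], i
        for j, h in enumerate(hyp, start=1):
            prev_diag, row[j] = row[j], min(
                row[j] + 1,            # deletion
                row[j - 1] + 1,        # insertion
                prev_diag + (r != h),  # substitution or match
            )
    return row[len(hyp)]


def cer(ref, hyp):
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


reference = "今天天气很好"
fp32_output = "今天天气很好"   # hypothetical FP32 result
uint8_output = "今天天汽很好"  # hypothetical quantized result
print(f"{cer(reference, fp32_output):.3f}")   # 0.000
print(f"{cer(reference, uint8_output):.3f}")  # 0.167
```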

Deployment guide

1. Environment preparation

Python 3.10.12
CUDA >= 12.4 (optional, for GPU acceleration)
CMake >= 3.28.3
G++ compiler, version 13
GCC compiler, version 13

2. Installation steps

# 1. Create a virtual environment
conda create -n dolphin python==3.10.12
conda activate dolphin

# 2. Install the base dependencies
pip install -r req.info

# 3. Install ONNX Runtime
pip install onnxruntime-gpu  # GPU build
# or
pip install onnxruntime     # CPU build

Sample output

2025-08-19 16:29:57,726 - __main__ - INFO - ONNX Input names: ['feats', 'feats_lengths']
2025-08-19 16:29:57,727 - __main__ - INFO - ONNX Output names: ['encoder_out', 'encoder_out_lens']
2025-08-19 16:29:57,727 - __main__ - INFO - Features shape: (1, 1000, 80)
2025-08-19 16:29:57,727 - __main__ - INFO - Feature lengths shape: (1,)
2025-08-19 16:29:58,497 - __main__ - INFO - 
=== ASR Results ===
2025-08-19 16:29:58,498 - __main__ - INFO - Text: <zh><CN><asr><0.00> 感谢使用有数据服务和精华大学联合开发的一种录音和阅读任务的一种识别模型<7.94>
2025-08-19 16:29:58,498 - __main__ - INFO - Clean Text: 感谢使用有数据服务和精华大学联合开发的一种录音和阅读任务的一种识别模型
2025-08-19 16:29:58,498 - __main__ - INFO - Language: zh
2025-08-19 16:29:58,498 - __main__ - INFO - Region: CN
2025-08-19 16:29:58,500 - __main__ - INFO - Score: Hypothesis(yseq=tensor([39999,   143,   175,     6,   325,  1826,  5893,  2908,  1843,  4458,
         2654,  1886,  3721,  2484,  3079, 17445,  9012, 15666, 12673,  1886,
        11471,  5353, 15666, 16726, 17598,   722, 40000]), score=tensor(-25.7387), scores={'decoder': tensor(-25.7387), 'scorefilter': tensor(0.)}, states={'decoder': [tensor([[-0.0559,  0.0024,  0.2962,  ...,  0.1710, -0.0679, -0.1606],
        [ 0.6933,  0.0921, -1.1097,  ..., -0.5377, -0.0695,  0.8377],
        [ 0.5695,  0.1781,  0.3112,  ...,  0.3090,  0.1677, -0.2476],
        ...,
        [-0.1352, -1.1039, -0.1234,  ...,  0.4055, -0.5021,  0.0355],
        [-0.1164, -0.0749, -0.5474,  ...,  0.7158,  0.4599,  0.3187],
        [ 0.2039,  0.7832, -0.3054,  ...,  0.6979, -0.0881,  0.3987]]), tensor([[-0.0210,  0.1519,  0.1822,  ...,  0.2000, -0.0646, -0.1083],
        [ 0.6161,  0.0176, -1.0275,  ..., -0.4555, -0.0348,  0.7722],
        [ 0.6035,  0.1107,  0.5663,  ...,  0.4942,  0.2629, -0.3459],
        ...,
        [-0.0956, -0.8588,  0.0351,  ...,  0.4005, -0.4152, -0.1518],
        [-0.1492,  0.0388, -0.5141,  ...,  0.6685,  0.5209,  0.0262],
        [ 0.0413,  0.8956, -0.3908,  ...,  0.5826,  0.0737,  0.3107]]), tensor([[-0.1251,  0.3811,  0.2250,  ...,  0.2681, -0.0140, -0.1142],
        [ 0.5175,  0.1782, -1.0779,  ..., -0.4581,  0.0794,  0.8473],
        [ 0.6025,  0.3691,  0.8968,  ...,  0.6299,  0.4678, -0.2685],
        ...,
        [-0.1942, -0.9402,  0.2420,  ...,  0.2509, -0.1502, -0.2688],
        [-0.2646, -0.2218, -0.3126,  ...,  0.6305,  0.6521, -0.1964],
        [-0.1782,  1.0727, -0.3937,  ...,  0.5300,  0.2716,  0.1100]]), tensor([[-0.0760,  0.4523,  0.1773,  ...,  0.2886, -0.0159, -0.1692],
        [ 0.4968,  0.2197, -0.9863,  ..., -0.5276,  0.0706,  0.7036],
        [ 0.7338,  0.4345,  1.1088,  ...,  0.8963,  0.5496, -0.3469],
        ...,
        [-0.3948, -1.1404,  0.4496,  ...,  0.6916,  0.0749, -0.7918],
        [-0.6575, -0.3179, -0.1926,  ...,  1.0463,  1.1432, -0.8874],
        [-0.0174,  1.1767, -0.7258,  ...,  0.3832, -0.0849, -0.0311]]), tensor([[ 0.0791,  0.6426,  0.4632,  ...,  0.1128, -0.0367, -0.2546],
        [ 0.5814,  0.2835, -0.7559,  ..., -0.6847,  0.1353,  0.6821],
        [ 0.8037,  0.4703,  1.2129,  ...,  0.8614,  0.6099, -0.4334],
        ...,
        [-0.8724, -1.3997,  0.5137,  ...,  0.4874,  0.3613, -1.2213],
        [-0.9555, -0.7569, -0.8970,  ...,  1.0926,  1.8838, -0.8425],
        [ 0.1763,  0.8431, -0.7783,  ...,  0.3326, -0.1989, -0.0824]]), tensor([[ 0.2680,  0.5685,  0.5597,  ...,  0.1078,  0.0216, -0.2524],
        [ 0.7731,  0.2350, -0.5138,  ..., -0.6490,  0.2301,  0.4662],
        [ 0.9573,  0.5148,  1.4831,  ...,  0.8033,  0.7197, -0.4451],
        ...,
        [-1.1165, -2.3988,  0.1883,  ..., -0.2096,  0.6099, -1.7917],
        [-1.1990, -1.4435, -1.1971,  ...,  0.6126,  1.9833, -0.8168],
        [-0.0419,  0.4274, -0.4500,  ...,  0.4606,  0.0210,  0.0624]])], 'scorefilter': None}, hs=[])
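
The `Text` line in the log carries the language, region, task, and timestamps as angle-bracket tags, while `Clean Text` is the same string with the tags removed. A minimal sketch of that post-processing is shown below; the function name `parse_result` and the short input string are hypothetical, and the project's own parsing may differ.

```python
import re

TAG = re.compile(r"<([^<>]+)>")
TIME = re.compile(r"\d+\.\d+")


def parse_result(text):
    """Split tagged output into language, region, task, timestamps, and text."""
    times, labels = [], []
    for tag in TAG.findall(text):
        if TIME.fullmatch(tag):
            times.append(float(tag))
        else:
            labels.append(tag)
    # Pad so that missing tags come back as None
    lang, region, task = (labels + [None, None, None])[:3]
    clean = TAG.sub("", text).strip()
    return {"lang": lang, "region": region, "task": task,
            "times": times, "text": clean}


parsed = parse_result("<zh><CN><asr><0.00> 你好世界<1.25>")
print(parsed["lang"], parsed["region"], parsed["task"])  # zh CN asr
print(parsed["times"])                                   # [0.0, 1.25]
print(parsed["text"])                                    # 你好世界
```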