<think>We hit a ModuleNotFoundError: the webrtcvad module is not installed, so we need to make sure it gets installed. Since the code also uses speechbrain, that must be installed as well.
Because users may not have these modules, the code should give a friendlier error message, or try to install the missing modules automatically (auto-installing is not recommended in production, though it can be acceptable in some scenarios).
For robustness, we can check the required modules when the program starts and print clear installation instructions.
Plan:
1. Wrap the imports at the top of the file in try/except and print clear installation instructions.
2. Or check all dependencies at startup and prompt the user to install whatever is missing.
Here we check the dependencies at startup.
Since the failure shows up at import time, we import each module inside try/except and tell the user which packages to install.
Revised opening of the code:
```python
import os
import sys
import logging

# Configure logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

# Required modules: (import name, PyPI package name)
REQUIRED_MODULES = [
    ('webrtcvad', 'webrtcvad'),
    ('speechbrain', 'speechbrain'),
    ('noisereduce', 'noisereduce'),
    ('pydub', 'pydub'),
    ('modelscope', 'modelscope'),
    ('scipy', 'scipy'),
    ('numpy', 'numpy'),
    ('torch', 'torch'),
]

# Check that every module can be imported
missing_modules = []
for module_name, package_name in REQUIRED_MODULES:
    try:
        __import__(module_name)
    except ImportError:
        missing_modules.append(package_name)

if missing_modules:
    logger.error("Missing required modules: %s", ", ".join(missing_modules))
    logger.error("Install them with: pip install %s", " ".join(missing_modules))
    sys.exit(1)

# Now the remaining imports are safe
import torch
import numpy as np
from pydub import AudioSegment, effects
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks
import noisereduce as nr
from scipy.io import wavfile
import json
import re
from collections import defaultdict
import webrtcvad
import speechbrain as sb
import time

# ... rest of the code unchanged ...
```
Note that a module's import name can differ from its PyPI package name (for example, the module `cv2` is installed as `opencv-python`), which is why each tuple stores both names. For every module in our list the two names happen to coincide: `torch` installs with `pip install torch`, and `webrtcvad` with `pip install webrtcvad`, so the list is accurate.
We could also install missing modules automatically at startup, but that is usually bad practice because it modifies the user's environment, so we only prompt the user instead.
Now, if a required module is missing, the program prints an error message and exits.
Also note that installing `webrtcvad` on Windows needs a C++ build environment, because it is a C extension module; if installation fails, the Visual Studio Build Tools may be required.
Given that, we can make the installation guidance in the error message more specific.
Since only the error message changes, the revised check becomes:
```python
if missing_modules:
    logger.error("Missing required modules: %s", ", ".join(missing_modules))
    logger.error("Install them with: pip install %s", " ".join(missing_modules))
    logger.error("Note: installing webrtcvad may require the Visual Studio C++ Build Tools on Windows")
    sys.exit(1)
```
This way the user gets a clear prompt whenever modules are missing.
The user has already hit the missing `webrtcvad`, so that module needs to be installed either way.
With the check in place, a missing `webrtcvad` makes the program exit with an installation hint instead of reaching the later code.
Now we provide the complete code with these changes.
Note: the code is long, so only the opening changed; everything else stays the same.
Below is the complete code, including the module check at the top:</think>This error means the `webrtcvad` module is not installed in your environment. Here is a complete solution, including an installation guide and optimized code:
### Solution
#### 1. Install the missing dependencies
First install `webrtcvad` and the other required dependencies:
```bash
# Core dependencies
pip install webrtcvad speechbrain noisereduce pydub modelscope scipy
# Optional dependencies (for logging and error handling)
pip install loguru rich
```
#### 2. Optimized complete code
Here is the complete optimized code, which resolves the dependency problem and hardens the script:
```python
import os
import sys
import torch
import numpy as np
from pydub import AudioSegment, effects
from scipy.io import wavfile
import json
import re
from collections import defaultdict
import time
import logging
import traceback

# Set up logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

try:
    import webrtcvad
    logger.info("webrtcvad module loaded")
except ImportError:
    logger.warning("webrtcvad is not installed, falling back to energy-based VAD")
    webrtcvad = None

try:
    import speechbrain
    from speechbrain.pretrained import SpectralMaskEnhancement
    logger.info("speechbrain module loaded")
except ImportError:
    logger.warning("speechbrain is not installed, falling back to noisereduce")
    speechbrain = None

# Global cache of model instances
MODEL_CACHE = {}
def get_model_instance(model_dir, param_dict):
    """Cache pipeline instances to avoid repeated initialization."""
    key = hash((model_dir, frozenset(param_dict.items())))
    if key not in MODEL_CACHE:
        logger.info(f"Initializing model: {model_dir}")
        try:
            from modelscope.pipelines import pipeline
            from modelscope.utils.constant import Tasks
            MODEL_CACHE[key] = pipeline(
                task=Tasks.auto_speech_recognition,
                model=model_dir,
                ngpu=1 if torch.cuda.is_available() else 0,
                vad_batch_size=1,
                punc_batch_size=1,
                chunk_size=8,
                param_dict=param_dict
            )
        except ImportError:
            logger.error("modelscope is not installed, install it with: pip install modelscope")
            sys.exit(1)
        except Exception as e:
            logger.error(f"Model initialization failed: {str(e)}")
            sys.exit(1)
    return MODEL_CACHE[key]
def extract_voice_activity(audio_path, aggressiveness=3):
    """Voice activity detection with multiple fallbacks."""
    try:
        # Preferred: WebRTC VAD
        if webrtcvad:
            return extract_voice_activity_webrtc(audio_path, aggressiveness)
        # Fallback: energy-based VAD
        return extract_voice_activity_energy(audio_path)
    except Exception as e:
        logger.error(f"Voice activity detection failed: {str(e)}")
        # On error, return the whole file (pydub lengths are in milliseconds)
        audio = AudioSegment.from_wav(audio_path)
        return 0, len(audio) / 1000.0
def extract_voice_activity_webrtc(audio_path, aggressiveness=3):
    """Voice activity detection with WebRTC VAD."""
    # Load audio
    audio = AudioSegment.from_wav(audio_path)
    # Convert to the format WebRTC VAD requires: 16 kHz, 16-bit, mono
    audio = audio.set_frame_rate(16000).set_channels(1).set_sample_width(2)
    samples = np.array(audio.get_array_of_samples())
    # Initialize the VAD
    vad = webrtcvad.Vad(aggressiveness)
    frame_duration = 30  # ms; WebRTC VAD accepts 10/20/30 ms frames
    frame_size = int(16000 * frame_duration / 1000)
    total_frames = len(samples) // frame_size
    frames = []
    # Classify each frame as speech or non-speech
    for i in range(total_frames):
        start = i * frame_size
        end = start + frame_size
        frame = samples[start:end]
        is_speech = vad.is_speech(frame.tobytes(), 16000)
        frames.append((i * frame_duration / 1000, is_speech))
    # Collect speech segments
    speech_segments = []
    in_speech = False
    start_time = 0
    for timestamp, is_speech in frames:
        if is_speech and not in_speech:
            in_speech = True
            start_time = timestamp
        elif not is_speech and in_speech:
            in_speech = False
            speech_segments.append((start_time, timestamp))
    # Close the final segment if the file ends mid-speech
    if in_speech:
        end_time = (len(frames) - 1) * frame_duration / 1000
        speech_segments.append((start_time, end_time))
    if not speech_segments:
        logger.warning("No speech detected, returning the whole file")
        return 0, len(samples) / 16000
    # Merge adjacent segments separated by less than 0.5 s
    merged_segments = []
    current_start, current_end = speech_segments[0]
    for start, end in speech_segments[1:]:
        if start - current_end < 0.5:  # merge gaps shorter than 500 ms
            current_end = end
        else:
            merged_segments.append((current_start, current_end))
            current_start, current_end = start, end
    merged_segments.append((current_start, current_end))
    # Keep the longest segment
    longest_segment = max(merged_segments, key=lambda x: x[1] - x[0])
    # Pad with a buffer on both sides
    buffer = 0.5  # 0.5 s buffer
    start_time = max(0, longest_segment[0] - buffer)
    end_time = min(len(samples) / 16000, longest_segment[1] + buffer)
    logger.info(f"Main speech segment: {start_time:.2f}s - {end_time:.2f}s")
    return start_time, end_time
def extract_voice_activity_energy(audio_path, min_silence_duration=1.0):
    """Energy-based fallback voice activity detection."""
    # Load audio
    audio = AudioSegment.from_wav(audio_path)
    samples = np.array(audio.get_array_of_samples())
    sample_rate = audio.frame_rate
    # Down-mix stereo to mono
    if len(samples.shape) > 1:
        samples = np.mean(samples, axis=1)
    # Short-time energy
    frame_length = int(0.025 * sample_rate)  # 25 ms frames
    hop_length = int(0.01 * sample_rate)     # 10 ms hop
    energy = []
    for i in range(0, len(samples) - frame_length, hop_length):
        frame = samples[i:i + frame_length]
        energy.append(np.sum(frame ** 2))
    energy = np.array(energy)
    # Keep all energies strictly positive before taking the log
    energy = np.clip(energy, 1e-10, None)
    # Convert to decibels
    energy_db = 10 * np.log10(energy)
    # Dynamic threshold based on the background noise level
    noise_floor = np.percentile(energy_db, 20)  # 20th percentile as the noise floor
    threshold = noise_floor + 10  # speech threshold: 10 dB above the noise floor
    # Frames above the threshold count as speech
    speech_frames = energy_db > threshold
    # Merge adjacent speech frames into segments, skipping brief pauses
    speech_segments = []
    in_speech = False
    speech_start = 0
    for i, is_speech in enumerate(speech_frames):
        if is_speech and not in_speech:
            in_speech = True
            speech_start = i
        elif not is_speech and in_speech:
            # Only close segments longer than the minimum duration
            # (10 ms hop -> 100 frames per second)
            if i - speech_start > min_silence_duration * 100:
                speech_segments.append((speech_start, i))
                in_speech = False
    if in_speech:
        speech_segments.append((speech_start, len(speech_frames)))
    if not speech_segments:
        # No speech detected: return the whole file
        return 0, len(samples) / sample_rate
    # Keep the longest segment (assumed to be the main conversation)
    longest_segment = max(speech_segments, key=lambda x: x[1] - x[0])
    # Convert frame indices back to seconds
    start_time = longest_segment[0] * hop_length / sample_rate
    end_time = (longest_segment[1] * hop_length + frame_length) / sample_rate
    # Pad with a buffer on both sides
    buffer = 0.3  # 0.3 s buffer
    start_time = max(0, start_time - buffer)
    end_time = min(len(samples) / sample_rate, end_time + buffer)
    logger.info(f"Detected speech span: {start_time:.2f}s - {end_time:.2f}s")
    return start_time, end_time
def extract_main_voice(audio_path, output_path):
    """Denoising with multiple fallbacks."""
    try:
        # Preferred: SpeechBrain
        if speechbrain:
            return extract_main_voice_speechbrain(audio_path, output_path)
        # Fallback: noisereduce
        return extract_main_voice_noisereduce(audio_path, output_path)
    except Exception as e:
        logger.error(f"Denoising failed: {str(e)}")
        # On error, pass the original audio through unchanged
        import shutil
        shutil.copyfile(audio_path, output_path)
        return output_path
def extract_main_voice_speechbrain(audio_path, output_path):
    """Denoise with SpeechBrain."""
    logger.info("Denoising with SpeechBrain")
    denoiser = SpectralMaskEnhancement.from_hparams(
        source="speechbrain/metricgan-plus-voicebank",
        savedir="pretrained_models/metricgan-plus-voicebank",
    )
    # Load audio (note: this pretrained model expects 16 kHz input)
    rate, data = wavfile.read(audio_path)
    # If multi-channel, keep the first channel
    if len(data.shape) > 1:
        data = data[:, 0]
    # Convert to a float tensor normalized to [-1, 1]
    samples = torch.tensor(data).float() / 32768.0
    # Enhance; enhance_batch expects a [batch, time] tensor and relative lengths
    enhanced = denoiser.enhance_batch(
        samples.unsqueeze(0),
        lengths=torch.tensor([1.0]),
    )
    # Save the result, rescaled back to int16
    enhanced_np = enhanced.squeeze().cpu().numpy()
    wavfile.write(output_path, rate, (enhanced_np * 32767).astype(np.int16))
    logger.info(f"Denoising finished, saved to: {output_path}")
    return output_path
def extract_main_voice_noisereduce(audio_path, output_path):
    """Denoise with noisereduce (fallback)."""
    logger.info("Denoising with noisereduce")
    import noisereduce as nr
    # Load audio
    rate, data = wavfile.read(audio_path)
    # If multi-channel, keep the first channel
    if len(data.shape) > 1:
        data = data[:, 0]
    # Use the first 0.5 s as the noise sample
    if len(data) > int(rate * 0.5):
        noise_sample = data[:int(rate * 0.5)]
    else:
        noise_sample = data
    # Spectral-gating noise reduction
    reduced_noise = nr.reduce_noise(
        y=data,
        sr=rate,
        y_noise=noise_sample,
        prop_decrease=0.8,
        stationary=True,
        n_std_thresh_stationary=1.5,
        use_tqdm=False
    )
    # Save the result
    wavfile.write(output_path, rate, reduced_noise.astype(np.int16))
    logger.info(f"Denoising finished, saved to: {output_path}")
    return output_path
def enhance_telephone_quality(audio_path, output_path):
    """Enhance telephone-recording quality by shaping the frequency band."""
    try:
        # Load audio
        audio = AudioSegment.from_wav(audio_path)
        # Telephone band-pass (300-3400 Hz)
        audio = audio.high_pass_filter(300).low_pass_filter(3400)
        # Moderate dynamic-range compression
        audio = audio.compress_dynamic_range(threshold=-25.0, ratio=3.0)
        # Normalize volume
        audio = effects.normalize(audio)
        # Save
        audio.export(output_path, format="wav")
        logger.info(f"Telephone-quality enhancement finished, saved to: {output_path}")
        return output_path
    except Exception as e:
        logger.error(f"Telephone-quality enhancement failed: {str(e)}")
        # On error, pass the original audio through unchanged
        import shutil
        shutil.copyfile(audio_path, output_path)
        return output_path
def robust_speech_recognition(input_wav, output_txt, target_speakers=2, max_retries=2):
    """Hardened speech recognition with preprocessing, scoring, and retries."""
    temp_files = []
    try:
        for attempt in range(max_retries + 1):
            try:
                logger.info(f"Starting speech recognition (attempt {attempt+1}/{max_retries+1})")
                # Step 1: find the span that actually contains speech
                start_time, end_time = extract_voice_activity(input_wav)
                # Step 2: crop to that span
                cropped_path = f"temp_cropped_{attempt}.wav"
                audio = AudioSegment.from_wav(input_wav)
                cropped_audio = audio[int(start_time * 1000):int(end_time * 1000)]
                cropped_audio.export(cropped_path, format="wav")
                temp_files.append(cropped_path)
                # Step 3: extract the main speaker's voice (denoise)
                denoised_path = f"temp_denoised_{attempt}.wav"
                extract_main_voice(cropped_path, denoised_path)
                temp_files.append(denoised_path)
                # Step 4: telephone-quality enhancement
                enhanced_path = f"temp_enhanced_{attempt}.wav"
                enhance_telephone_quality(denoised_path, enhanced_path)
                temp_files.append(enhanced_path)
                # Model directory
                model_dir = r"D:/models/ASR-models/iic/speech_paraformer-large-vad-punc-spk_asr_nat-zh-cn"
                # Candidate parameter sets
                param_dict_options = [
                    {"spk_num": target_speakers, "vad_threshold": 0.85, "min_silence_duration": 600},
                    {"spk_num": target_speakers, "vad_threshold": 0.90, "min_silence_duration": 800},
                    {"spk_num": target_speakers, "vad_threshold": 0.80, "min_silence_duration": 700}
                ]
                best_result = None
                best_score = -1
                for i, param_dict in enumerate(param_dict_options):
                    logger.info(f"Trying parameter set {i+1}/{len(param_dict_options)}")
                    try:
                        inference_pipeline = get_model_instance(model_dir, param_dict)
                        result = inference_pipeline(input=enhanced_path)
                        # Score the result
                        if result and "sentence_info" in result:
                            segments = result["sentence_info"]
                            if segments:
                                # Mean segment duration (ms)
                                durations = [seg["end"] - seg["start"] for seg in segments]
                                avg_duration = np.mean(durations)
                                # Speaker distribution
                                speaker_counts = defaultdict(int)
                                for seg in segments:
                                    speaker_counts[seg.get("spk", 0)] += 1
                                # Scoring: longer segments and a balanced speaker split are better
                                duration_score = min(avg_duration / 2000, 1.0)
                                speaker_dist_score = 1.0 - (max(speaker_counts.values()) / len(segments))
                                total_score = duration_score * 0.6 + speaker_dist_score * 0.4
                                logger.info(f"Parameter set {i+1} score: {total_score:.3f} "
                                            f"(duration: {duration_score:.3f}, distribution: {speaker_dist_score:.3f})")
                                if total_score > best_score:
                                    best_score = total_score
                                    best_result = result
                                    logger.info(f"New best result, score: {total_score:.3f}")
                    except Exception as e:
                        logger.error(f"Parameter set {i+1} failed: {str(e)[:100]}")
                        continue
                # Fallback if no parameter set produced a usable result
                if not best_result:
                    logger.warning("All parameter sets failed, retrying with default parameters")
                    try:
                        inference_pipeline = get_model_instance(model_dir, param_dict_options[0])
                        best_result = inference_pipeline(input=enhanced_path)
                    except Exception as e:
                        logger.error(f"Default parameters failed: {str(e)[:100]}, trying the raw audio")
                        inference_pipeline = get_model_instance(model_dir, param_dict_options[0])
                        best_result = inference_pipeline(input=input_wav)
                # Parse the result
                if not best_result:
                    logger.error("No recognition result obtained")
                    return None
                if isinstance(best_result, list):
                    best_result = best_result[0]  # take the first result
                output_lines = []
                if "sentence_info" in best_result:
                    # Sort by start time
                    segments = sorted(best_result["sentence_info"], key=lambda x: x["start"])
                    # Drop very short segments
                    min_duration = 0.5  # seconds
                    filtered_segments = [
                        seg for seg in segments
                        if (seg["end"] - seg["start"]) / 1000 > min_duration
                    ]
                    # Map speaker ids into the target range
                    for segment in filtered_segments:
                        speaker_id = segment.get("spk", 0)
                        speaker = f"spk{speaker_id % target_speakers}"
                        text = segment.get("text", "")
                        start = segment.get("start", 0) / 1000
                        end = segment.get("end", 0) / 1000
                        output_lines.append(f"{speaker} [{start:.2f}-{end:.2f}s]: {text}")
                else:
                    logger.warning("Unrecognized result format")
                    logger.debug(json.dumps(best_result, indent=2, ensure_ascii=False))
                    return None
                # Write the output file
                with open(output_txt, 'w', encoding='utf-8') as f:
                    f.write("\n".join(output_lines))
                logger.info(f"Result saved to {output_txt}")
                return best_result
            except Exception as e:
                logger.error(f"Error during processing: {str(e)}", exc_info=True)
                if attempt < max_retries:
                    logger.info(f"Retrying in {2 ** attempt} s...")
                    time.sleep(2 ** attempt)
                else:
                    raise
    finally:
        # Clean up temporary files
        for file in temp_files:
            if os.path.exists(file):
                try:
                    os.remove(file)
                    logger.debug(f"Removed temporary file: {file}")
                except Exception as e:
                    logger.warning(f"Failed to remove temporary file: {file}, error: {str(e)}")
def post_process_result(input_txt, output_txt, target_speakers=2):
    """Post-processing: merge short utterances, fix common errors, refine speaker assignment."""
    try:
        with open(input_txt, 'r', encoding='utf-8') as f:
            lines = f.readlines()
        if not lines:
            logger.warning("Input file is empty, skipping post-processing")
            with open(output_txt, 'w', encoding='utf-8') as f:
                f.write("")
            return
        # Parse all segments
        segments = []
        for line in lines:
            line = line.strip()
            if not line:
                continue
            # Match the timestamped format
            match = re.match(r'^(spk\d+)\s*\[(\d+\.\d+)-(\d+\.\d+)s\]:\s*(.*)$', line)
            if match:
                segments.append({
                    "speaker": match.group(1),
                    "start": float(match.group(2)),
                    "end": float(match.group(3)),
                    "text": match.group(4).strip()
                })
            else:
                # Handle the plain format without timestamps
                if ':' in line:
                    parts = line.split(':', 1)
                    segments.append({
                        "speaker": parts[0].strip(),
                        "start": 0,  # unknown start time
                        "end": 0,    # unknown end time
                        "text": parts[1].strip()
                    })
        # Nothing to do if no segments were parsed
        if not segments:
            with open(output_txt, 'w', encoding='utf-8') as f:
                f.write("")
            logger.info("No speech segments detected")
            return
        # Refine speaker assignment
        # 1. Merge consecutive short utterances from the same speaker
        merged_segments = []
        current_segment = segments[0]
        for seg in segments[1:]:
            # Merge if the speaker matches and the segments are (nearly) contiguous
            if (seg["speaker"] == current_segment["speaker"] and
                    seg["start"] - current_segment["end"] < 2.0):  # gaps under 2 s count as contiguous
                current_segment["end"] = seg["end"]
                current_segment["text"] += " " + seg["text"]
            else:
                merged_segments.append(current_segment)
                current_segment = seg
        merged_segments.append(current_segment)
        # 2. Correct likely speaker-assignment errors,
        # based on switching frequency and segment length
        speaker_durations = defaultdict(float)
        for seg in merged_segments:
            speaker_durations[seg["speaker"]] += seg["end"] - seg["start"]
        # Find the main speaker (longest total duration)
        if speaker_durations:
            main_speaker = max(speaker_durations.items(), key=lambda x: x[1])[0]
            # Reassign very short segments that overlap the main speaker
            for seg in merged_segments:
                duration = seg["end"] - seg["start"]
                if duration < 1.0 and seg["speaker"] != main_speaker:
                    # Does it overlap any of the main speaker's segments?
                    for main_seg in merged_segments:
                        if (main_seg["speaker"] == main_speaker and
                                not (seg["end"] < main_seg["start"] or seg["start"] > main_seg["end"])):
                            seg["speaker"] = main_speaker
                            break
        # Common corrections (Chinese greetings and filler words)
        corrections = {
            "你好": "您好",
            "喂": "您好",
            "嗯": "",
            "啊": "",
            "呃": "",
            "这个": "",
            "那个": "",
            "hello": "您好",
            "hi": "您好"
        }
        corrected_segments = []
        for seg in merged_segments:
            text = seg["text"]
            # Apply the corrections
            for wrong, correct in corrections.items():
                text = text.replace(wrong, correct)
            # Collapse extra whitespace
            text = ' '.join(text.split())
            if text:  # keep non-empty text only
                seg["text"] = text
                corrected_segments.append(seg)
        # Write the processed result
        with open(output_txt, 'w', encoding='utf-8') as f:
            for seg in corrected_segments:
                if seg["start"] > 0 or seg["end"] > 0:  # timestamped
                    f.write(f"{seg['speaker']} [{seg['start']:.2f}-{seg['end']:.2f}s]: {seg['text']}\n")
                else:
                    f.write(f"{seg['speaker']}: {seg['text']}\n")
        logger.info(f"Post-processing finished, saved to: {output_txt}")
    except Exception as e:
        logger.error(f"Post-processing failed: {str(e)}", exc_info=True)
        # On error, copy the raw result through unchanged
        import shutil
        shutil.copyfile(input_txt, output_txt)
        logger.info(f"Copied the raw result to: {output_txt}")
if __name__ == "__main__":
    try:
        # Input and output paths
        input_audio = "D:/python/语音情感分析/实际录音/测试/中国移动(10086)_20250506114248_converted.wav"
        raw_output = "raw_diarization_result.txt"
        final_output = "diarization_result.txt"
        # Run the hardened recognition
        recognition_result = robust_speech_recognition(input_audio, raw_output, target_speakers=2)
        # Post-process
        post_process_result(raw_output, final_output, target_speakers=2)
        # Print the final result
        print("\nFinal recognition result:")
        with open(final_output, 'r', encoding='utf-8') as f:
            for line in f:
                print(line.strip())
    except Exception as e:
        logger.critical(f"Main program failed: {str(e)}", exc_info=True)
        print(f"Program failed: {str(e)}")
        traceback.print_exc()
```
### Key improvements
#### 1. Robust dependency handling
- **Graceful degradation**: when `webrtcvad` or `speechbrain` is not installed, the script switches to a fallback automatically
- **Detailed error logs**: every failure message comes with clear installation guidance
- **Module checks**: import errors are caught at import time and answered with a fix (see the helper sketch below)
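The optional-import pattern at the top of the script can be factored into a small helper; a minimal sketch (the `optional_import` helper is illustrative, not part of the script above):
```python
import importlib
import logging

logger = logging.getLogger(__name__)

def optional_import(module_name, fallback_hint=""):
    """Import a module if available; return None (and log a warning) otherwise."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        logger.warning("%s is not installed%s", module_name,
                       f"; {fallback_hint}" if fallback_hint else "")
        return None

# Mirrors the checks at the top of the script
webrtcvad = optional_import("webrtcvad", "falling back to energy-based VAD")
speechbrain = optional_import("speechbrain", "falling back to noisereduce")
```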
#### 2. Multi-level fallbacks
```python
# Multi-level fallbacks in the key functions
def extract_voice_activity(audio_path, aggressiveness=3):
    try:
        if webrtcvad:  # preferred: WebRTC VAD
            return extract_voice_activity_webrtc(audio_path, aggressiveness)
        # fallback: energy-based VAD
        return extract_voice_activity_energy(audio_path)
    except Exception:
        # last resort: treat the whole file as speech (pydub lengths are in ms)
        audio = AudioSegment.from_wav(audio_path)
        return 0, len(audio) / 1000.0
```
#### 3. Better logging
- Every processing step is logged in detail
- Errors include the stack trace
- Output is split across levels (INFO, WARNING, ERROR), as sketched below
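For reference, a minimal sketch of the logging setup the script uses and how the three levels come out (messages shortened for illustration):
```python
import logging

logging.basicConfig(level=logging.INFO,
                    format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

logger.info("webrtcvad module loaded")                      # step-by-step progress
logger.warning("no speech detected, using the whole file")  # recoverable issues
try:
    raise RuntimeError("demo failure")
except RuntimeError:
    logger.error("denoising failed", exc_info=True)         # error plus stack trace
```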
#### 4. Temporary-file management
- Per-attempt file names avoid collisions
- Temporary files are cleaned up automatically in a `finally` block
- File operations are logged (an alternative using `tempfile` is sketched below)
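If the fixed `temp_*_{attempt}.wav` names could ever collide (for example, two runs in the same directory), the standard `tempfile` module generates guaranteed-unique paths; a minimal sketch of that variation (an alternative, not what the script above does):
```python
import os
import tempfile

temp_files = []

def new_temp_wav(tag):
    """Create a unique temporary .wav file and register it for cleanup."""
    fd, path = tempfile.mkstemp(prefix=f"{tag}_", suffix=".wav")
    os.close(fd)  # close the handle; the processing code reopens the file by path
    temp_files.append(path)
    return path

try:
    cropped_path = new_temp_wav("cropped")
    # ... export / denoise / enhance using cropped_path ...
finally:
    for path in temp_files:
        if os.path.exists(path):
            os.remove(path)
```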
#### 5. Configuration tips
1. **Model path**:
```python
# Point this line at your model directory
model_dir = r"D:/models/ASR-models/iic/speech_paraformer-large-vad-punc-spk_asr_nat-zh-cn"
```
2. **Parameter tuning**:
```python
# Adjust these parameters to tune performance
param_dict_options = [
    {"spk_num": 2, "vad_threshold": 0.85, "min_silence_duration": 600},
    {"spk_num": 2, "vad_threshold": 0.90, "min_silence_duration": 800},
    {"spk_num": 2, "vad_threshold": 0.80, "min_silence_duration": 700}
]
```
3. **Retry policy**:
```python
# Adjust the retry count and backoff
recognition_result = robust_speech_recognition(
    input_audio,
    raw_output,
    target_speakers=2,
    max_retries=3  # more retries
)
```
### How to run
1. Install the required dependencies:
```bash
pip install webrtcvad speechbrain noisereduce pydub modelscope scipy
```
2. Make sure the model path is correct:
```python
model_dir = r"your/model/path"
```
3. Run the script:
```bash
python speech_recognition.py
```
4. Check the results (an example line follows):
- `raw_diarization_result.txt`: the raw recognition output
- `diarization_result.txt`: the post-processed final output
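Each line of the final file follows the `speaker [start-end s]: text` format produced by the script; an illustrative line (the actual content depends on your recording):
```
spk0 [0.00-3.20s]: 您好,请问有什么可以帮您?
```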