speech or no speech detection

This page covers time-domain analysis of speech signals, focusing on endpoint detection and its Python implementation. Worked examples show how to distinguish human speech from other noise using a volume threshold and the zero-crossing rate, and the page also introduces frequency-range filtering and the concept of VAD (voice activity detection). Both time-domain and frequency-domain endpoint detection methods are discussed to help readers understand and apply basic speech signal processing techniques.
Source: http://python.developermemo.com/668_18026120/ (speech or no speech detection)

http://ibillxia.github.io/blog/2013/05/22/audio-signal-processing-time-domain-Voice-Activity-Detection/ 

Time-Domain Analysis of Speech Signals: Endpoint Detection and a Python Implementation



import audioop
import pyaudio as pa
import wave

class speech:
    def __init__(self):
        # soundtrack properties
        self.format = pa.paInt16
        self.rate = 16000
        self.channel = 1
        self.chunk = 1024
        self.threshold = 150
        self.file = 'audio.wav'

        # initialise microphone stream
        self.audio = pa.PyAudio()
        self.stream = self.audio.open(format=self.format,
                                      channels=self.channel,
                                      rate=self.rate,
                                      input=True,
                                      frames_per_buffer=self.chunk)

    def record(self):
        # wait until the input volume rises above the threshold
        while True:
            data = self.stream.read(self.chunk)
            rms = audioop.rms(data, 2)   # input volume of the chunk (2-byte samples)
            if rms > self.threshold:     # input volume greater than threshold
                break

        # list to store frames; keep the chunk that triggered recording
        frames = [data]
        # record up to the first silence only
        while rms > self.threshold:
            data = self.stream.read(self.chunk)
            rms = audioop.rms(data, 2)
            frames.append(data)

        print('finished recording... writing file...')
        write_frames = wave.open(self.file, 'wb')
        write_frames.setnchannels(self.channel)
        write_frames.setsampwidth(self.audio.get_sample_size(self.format))
        write_frames.setframerate(self.rate)
        write_frames.writeframes(b''.join(frames))
        write_frames.close()

Is there a way to differentiate between human voice and other noise in Python? I hope somebody can point me to a solution.

I am testing your code on Ubuntu; where do I get the wave package needed to write the file?

I think the issue is that at the moment you are recording without any recognition of the speech, so nothing is discriminating between voice and noise; recognisable speech is anything that gives meaningful results after recognition, so it is a catch-22. You could simplify matters by listening for an opening keyword. You can also filter on the human voice frequency range, as the ear and the telephone companies both do, and you can look at the mark-space ratio; there were some publications on that a while back, but be aware that it varies from language to language. A quick Google search can be very informative, and you may also find this article interesting.
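To make the frequency-range idea concrete, here is a minimal sketch (not from the original answer) that band-passes each raw 16-bit PCM chunk to roughly the 300-3400 Hz telephone voice band before measuring its level, so broadband noise outside that band contributes less to the threshold test. The 4th-order Butterworth filter and the use of SciPy are assumptions on my part:

import numpy as np
from scipy.signal import butter, sosfilt

def voice_band_rms(frame_bytes, rate=16000, low=300.0, high=3400.0):
    # decode 16-bit mono PCM, band-pass it to the voice band, return its RMS
    samples = np.frombuffer(frame_bytes, dtype=np.int16).astype(np.float64)
    sos = butter(4, [low, high], btype='band', fs=rate, output='sos')
    filtered = sosfilt(sos, samples)
    return np.sqrt(np.mean(filtered ** 2))

You could call voice_band_rms(data) in place of audioop.rms(data, 2) in the recorder above and compare the result against a threshold tuned for the filtered signal.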

Thanks for the response. This is just the recording part; I have the recognition part as a separate module. It would be good if you elaborated on the frequency-range filtering, which I haven't heard of. Could you suggest some documentation so that I can learn about it?
Added some more details to the answer above.
Thank you for the reply Steve. Those links were informative.

I think what you are looking for is VAD (voice activity detection). VAD can be used to preprocess speech for ASR. There are open-source VAD implementations available (see the link); I hope that helps.
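As one concrete, hedged illustration (not part of the original comment), the open-source py-webrtcvad package wraps the WebRTC voice activity detector; it expects 10, 20 or 30 ms frames of 16-bit mono PCM at 8000/16000/32000/48000 Hz, so the 1024-sample chunks used in the recorder above would need to be re-framed (30 ms at 16 kHz is 480 samples, i.e. 960 bytes):

import webrtcvad

def frame_is_speech(frame_bytes, rate=16000, aggressiveness=2):
    # classify one 10/20/30 ms frame of 16-bit mono PCM as speech or non-speech
    vad = webrtcvad.Vad(aggressiveness)  # 0 = least aggressive, 3 = most aggressive
    return vad.is_speech(frame_bytes, rate)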



Endpoint Detection

Endpoint detection (End-Point Detection, EPD) aims to determine where the speech in a signal starts and ends, so it is also called speech detection or Voice Activity Detection (VAD). Endpoint detection plays a very important role in speech preprocessing.

Common endpoint detection methods fall roughly into two categories:
(1) Time-domain methods: computationally cheap, and therefore easy to port to embedded platforms with limited computing power.
(a) Volume only: using volume alone is the simplest approach, but it easily misclassifies unvoiced sounds. Different ways of computing volume also give somewhat different results, and there is no consensus on which is best.
(b) Volume and zero-crossing rate (ZCR): volume is the primary feature and ZCR the secondary one, which allows unvoiced sounds to be detected more precisely.
(2) Frequency-domain methods: computationally more expensive.
(a) Spectral variance: the spectrum of voiced frames changes in a fairly regular way, which can serve as a decision criterion.
(b) Spectral entropy: a structured spectrum generally has low entropy, which can also be used as an endpoint detection criterion.

Below we look at the concrete methods and procedures for endpoint detection from each of these two perspectives.

Time-Domain Endpoint Detection

Time-domain endpoint detection comes in two flavours: using volume alone, or using volume together with the zero-crossing rate. The volume-only method is the simplest and cheapest to compute: we set a volume threshold, and any frame whose volume falls below that threshold is treated as silence. The key question is how to choose the threshold; one common approach is to learn it from labelled data so that the classification error is minimised, as sketched below.
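As a hedged sketch of that training idea (the per-frame volumes vols and the 0/1 speech labels labels are assumed inputs, not part of the original article), a simple grid search over candidate thresholds is already enough:

import numpy as np

def train_threshold(vols, labels, n_candidates=200):
    # pick the volume threshold that misclassifies the fewest labelled frames
    vols = np.asarray(vols, dtype=float)
    labels = np.asarray(labels, dtype=int)   # 1 = speech frame, 0 = silence frame
    candidates = np.linspace(vols.min(), vols.max(), n_candidates)
    errors = [np.sum((vols > t).astype(int) != labels) for t in candidates]
    return candidates[int(np.argmin(errors))]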

First let us look at the simplest method, which requires no training. The code is as follows:

import wave
import numpy as np
import matplotlib.pyplot as plt
import Volume as vp

def findIndex(vol, thres):
    # return the frame indices of the first and last threshold crossings
    crossings = []
    for i in range(len(vol) - 1):
        if (vol[i] - thres) * (vol[i + 1] - thres) < 0:  # sign change => crossing
            crossings.append(i)
    if not crossings:                # no crossing found
        return np.array([0, 0])
    return np.array([crossings[0], crossings[-1]])

fw = wave.open('sunday.wav','r')
params = fw.getparams()
nchannels, sampwidth, framerate, nframes = params[:4]
strData = fw.readframes(nframes)
waveData = np.frombuffer(strData, dtype=np.int16)
waveData = waveData*1.0/max(abs(waveData))  # normalization
fw.close()

frameSize = 256
overLap = 128
vol = vp.calVolume(waveData,frameSize,overLap)
threshold1 = max(vol)*0.10
threshold2 = min(vol)*10.0
threshold3 = max(vol)*0.05+min(vol)*5.0

time = np.arange(0,nframes) * (1.0/framerate)
frame = np.arange(0,len(vol)) * (nframes*1.0/len(vol)/framerate)
index1 = findIndex(vol,threshold1)*(nframes*1.0/len(vol)/framerate)
index2 = findIndex(vol,threshold2)*(nframes*1.0/len(vol)/framerate)
index3 = findIndex(vol,threshold3)*(nframes*1.0/len(vol)/framerate)
end = nframes * (1.0/framerate)

plt.subplot(211)
plt.plot(time,waveData,color="black")
plt.plot([index1,index1],[-1,1],'-r')
plt.plot([index2,index2],[-1,1],'-g')
plt.plot([index3,index3],[-1,1],'-b')
plt.ylabel('Amplitude')

plt.subplot(212)
plt.plot(frame,vol,color="black")
plt.plot([0,end],[threshold1,threshold1],'-r', label="threshold 1")
plt.plot([0,end],[threshold2,threshold2],'-g', label="threshold 2")
plt.plot([0,end],[threshold3,threshold3],'-b', label="threshold 3")
plt.legend()
plt.ylabel('Volume(absSum)')
plt.xlabel('time(seconds)')
plt.show()
The calVolume function used to compute the frame volume is described in the companion post on volume and its Python implementation (a minimal stand-in is sketched below). Running the script produces a two-panel figure: the waveform with the detected endpoints on top, and the frame volume with the three thresholds underneath.
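For readers without that companion post at hand, here is a minimal stand-in for calVolume consistent with the 'Volume(absSum)' axis label used above (its exact definition in the companion post is an assumption here): the per-frame sum of absolute amplitude after removing the frame mean.

import numpy as np

def calVolume(waveData, frameSize, overLap):
    # frame the signal with the given overlap and return the absolute-sum volume per frame
    step = frameSize - overLap
    frameNum = int(np.ceil(len(waveData) / float(step)))
    volume = np.zeros(frameNum)
    for i in range(frameNum):
        frame = waveData[i * step: i * step + frameSize]
        frame = frame - np.mean(frame)      # remove the DC offset of the frame
        volume[i] = np.sum(np.abs(frame))   # absolute-sum 'volume'
    return volume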

Three ways of setting the threshold are used here, but each of them applies the same fixed rule to every input, so they may perform poorly on certain signals, for example recordings with strong background noise, many unvoiced sounds, or large volume variation. A single-threshold method does not work well in such cases; increasing the overlap between frames can help somewhat, but at a noticeably higher computational cost. Below we bring in additional features for endpoint detection, such as the zero-crossing rate, with the following procedure:
(1) Using a relatively high volume threshold (τu), find preliminary endpoints;
(2) extend the endpoints outwards to where the volume drops to a lower threshold (τl);
(3) extend them further outwards to the zero-crossing-rate threshold (τzc), so that the unvoiced parts of the speech are included.
This method requires three thresholds (τu, τl, τzc), which can be tuned with various search strategies; a schematic can be found in reference [1], and a code sketch of the procedure follows below.
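A hedged sketch of this three-threshold procedure (the function name epd_vol_zcr is ours, not from [1]; it assumes frame-aligned vol and zcr arrays and user-chosen thresholds tau_u > tau_l for volume and tau_zc for the zero-crossing rate):

import numpy as np

def epd_vol_zcr(vol, zcr, tau_u, tau_l, tau_zc):
    above = np.where(vol > tau_u)[0]
    if len(above) == 0:
        return None                                  # no frame exceeds the high threshold
    start, end = above[0], above[-1]                 # step 1: rough endpoints at tau_u

    while start > 0 and vol[start - 1] > tau_l:      # step 2: extend outwards to tau_l
        start -= 1
    while end < len(vol) - 1 and vol[end + 1] > tau_l:
        end += 1

    while start > 0 and zcr[start - 1] > tau_zc:     # step 3: extend further to tau_zc,
        start -= 1                                   # pulling in unvoiced sounds
    while end < len(zcr) - 1 and zcr[end + 1] > tau_zc:
        end += 1
    return start, end                                # frame indices of the speech segment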

Plotting the volume and zero-crossing-rate thresholds in the same figure shows that the ZCR threshold lets us pull the misclassified unvoiced sounds back into the speech region. The thresholds used there are still chosen the same simple, direct way as the volume thresholds above.

In addition, we can difference (differentiate) the waveform before computing the volume; this emphasises the unvoiced parts so that they are segmented correctly, for example as in the two-line sketch below. See reference [1] for details.
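For instance, reusing the variable names of the script above (not code from [1]); differencing acts as a crude high-pass filter, so the high-frequency unvoiced sounds keep most of their energy and stand out in the volume of the differenced signal:

diffData = np.diff(waveData)                          # first-order difference of the waveform
volDiff = vp.calVolume(diffData, frameSize, overLap)  # volume of the differenced signal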

Frequency-Domain Endpoint Detection

Voiced signals exhibit a repeating harmonic structure in the spectrum, so spectral variation or spectral entropy can also be used for endpoint detection; see the following link: http://neural.cs.nthu.edu.tw/jang/books/audiosignalprocessing/paper/endPointDetection/
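A hedged per-frame spectral-entropy sketch (reusing waveData, frameSize and overLap from the time-domain script; the Hanning window and the normalisation chosen here are assumptions, not taken from the linked page). Frames with a structured, harmonic spectrum concentrate their energy in a few bins and therefore have low entropy:

import numpy as np

def spectral_entropy(waveData, frameSize=256, overLap=128):
    step = frameSize - overLap
    entropies = []
    for start in range(0, len(waveData) - frameSize, step):
        frame = waveData[start:start + frameSize] * np.hanning(frameSize)
        spectrum = np.abs(np.fft.rfft(frame)) ** 2
        p = spectrum / (np.sum(spectrum) + 1e-12)        # normalise to a 'probability'
        entropies.append(-np.sum(p * np.log2(p + 1e-12)))
    return np.array(entropies)                           # low entropy suggests speech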

In short, endpoint detection is a central piece of speech preprocessing, and it can be implemented in many different ways. This article presents only the simplest, most basic and easiest-to-understand methods; making them truly practical requires some extra fine-tuning for special cases, but for ordinary application scenarios they should be more or less sufficient.

References

[1] EPD in Time Domain: http://neural.cs.nthu.edu.tw/jang/books/audiosignalprocessing/epdTimeDomain.asp?title=6-2%20EPD%20in%20Time%20Domain
[2] EPD in Frequency Domain: http://neural.cs.nthu.edu.tw/jang/books/audiosignalprocessing/epdFreqDomain.asp?title=6-3%20EPD%20in%20Frequency%20Domain

Original Link:  http://ibillxia.github.io/blog/2013/05/22/audio-signal-processing-time-domain-Voice-Activity-Detection/
Attribution - NON-Commercial - ShareAlike - Copyright ©  Bill Xia

May 22nd, 2013. Posted in ASSP. Tagged with: Python, Speech, signal processing, endpoint detection.

• OP, I have also been looking at VAD recently. About your findIndex function, is my understanding right that it walks over the per-frame volume values and, whenever a frame does not satisfy the threshold, stores that frame number in the index array? I got an error when running this code; could there be a problem with findIndex?

    def findIndex(vol,thres):
        l = len(vol)
        ii = 0
        index = np.zeros(4,dtype=np.int16)  # index is a 4-element array
        for i in range(l-1):                # len(vol) is normally far larger than 4
            if((vol[i]-thres)*(vol[i+1]-thres)<0):
                index[ii]=i                 # ii is incremented inside the for loop, so once there are
                ii = ii+1                   # more than 4 matching values, index will surely go out of bounds, right?
        return index[[0,-1]]


• (Pasted code: the silero-vad utility source, including the OnnxWrapper and Validator classes, the read_audio / save_audio / init_jit_model helpers, get_speech_timestamps, the streaming VADIterator class, and collect_chunks / drop_chunks.) Is this the model framework of silero VAD?
• (Pasted code: a Raspberry Pi smart-mirror project combining Flask, an HC-SR501 PIR sensor on GPIO, face recognition, wake-word detection, and a Baidu/DeepSeek voice assistant, each running in its own thread.) Can this code detect a person approaching and then run face recognition, without conflicting with the speech recognition?