Mel-spectrogram extraction and the griffin_lim vocoder (a Python code walkthrough)

This article focuses on speech analysis, synthesis, and conversion, introducing the mel spectrogram commonly used when tackling speech tasks with machine learning. It walks through the steps of extracting a mel spectrogram from an audio waveform (pre-emphasis, framing and windowing, STFT, and so on), and also explains how to reconstruct an audio waveform from a mel spectrogram (converting back to a magnitude spectrogram, reconstruction with the griffin_lim algorithm, and so on).

In speech analysis, synthesis, and conversion, the first step is usually to extract speech feature parameters.
Machine-learning approaches to these tasks commonly use the mel spectrogram.
This article covers extracting a mel spectrogram from an audio file, and turning a mel spectrogram back into an audio waveform.

Extracting a mel spectrogram from an audio waveform:

  1. Apply pre-emphasis, framing, and windowing to the audio signal
  2. Apply the short-time Fourier transform (STFT) to each frame to obtain the short-time magnitude spectrogram
  3. Pass the magnitude spectrogram through a mel filterbank to obtain the mel spectrogram
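The three steps above can be sketched with plain NumPy on a synthetic tone. This is a minimal illustration only: the triangular mel filterbank is hand-rolled here to keep the sketch self-contained (the full code below uses librosa.filters.mel), and all parameter values are illustrative.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels):
    # Triangular filters with centers evenly spaced on the mel scale.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):
            fb[i - 1, k] = (k - l) / max(c - l, 1)  # rising slope
        for k in range(c, r):
            fb[i - 1, k] = (r - k) / max(r - c, 1)  # falling slope
    return fb

sr, n_fft, hop, win, n_mels = 24000, 2048, 300, 1200, 80
t = np.arange(sr) / sr
y = np.sin(2 * np.pi * 440.0 * t)  # 1 second, 440 Hz test tone

# Step 1: pre-emphasis
y = np.append(y[0], y[1:] - 0.97 * y[:-1])

# Step 2: framing + Hann window + FFT magnitude (a minimal STFT)
n_frames = 1 + (len(y) - win) // hop
window = np.hanning(win)
frames = np.stack([y[i * hop:i * hop + win] * window for i in range(n_frames)])
mag = np.abs(np.fft.rfft(frames, n=n_fft))  # (T, 1+n_fft//2)

# Step 3: project the magnitude spectrogram onto the mel filterbank
mel = mag @ mel_filterbank(sr, n_fft, n_mels).T  # (T, n_mels)
print(mel.shape)  # -> (77, 80)
```

Each row of `mel` is one frame; each column is the energy collected by one triangular mel band.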

Reconstructing an audio waveform from a mel spectrogram:

  1. Convert the mel spectrogram back to a magnitude spectrogram
  2. Reconstruct the waveform with the griffin_lim vocoder algorithm
  3. Apply de-emphasis

There are many vocoders, such as WORLD and STRAIGHT, but griffin_lim is special: it can reconstruct a waveform from the magnitude spectrogram alone, without phase information. In effect, it estimates the phase from the relationship between adjacent frames. The synthesized audio quality is fairly high, and the code is quite simple.
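To make the phase-estimation idea concrete, here is a self-contained NumPy sketch of the Griffin-Lim loop, using a toy STFT/ISTFT pair hand-rolled for illustration (the article's actual code uses librosa; window and hop values here are illustrative):

```python
import numpy as np

def stft(y, win, hop):
    '''Framing + Hann window + rfft.'''
    w = np.hanning(win)
    n = 1 + (len(y) - win) // hop
    return np.stack([np.fft.rfft(y[i * hop:i * hop + win] * w) for i in range(n)])

def istft(S, win, hop):
    '''Weighted overlap-add inverse of the toy STFT above.'''
    w = np.hanning(win)
    n = S.shape[0]
    y = np.zeros((n - 1) * hop + win)
    norm = np.zeros_like(y)
    for i in range(n):
        y[i * hop:i * hop + win] += np.fft.irfft(S[i], n=win) * w
        norm[i * hop:i * hop + win] += w ** 2
    return y / np.maximum(norm, 1e-8)

def griffin_lim(mag, win, hop, n_iter=30):
    '''Alternate ISTFT/STFT while pinning |S| to the known magnitude.'''
    rng = np.random.default_rng(0)
    phase = np.exp(2j * np.pi * rng.random(mag.shape))  # random initial phase
    for _ in range(n_iter):
        y = istft(mag * phase, win, hop)          # best-effort signal estimate
        S = stft(y, win, hop)                     # its actual spectrum...
        phase = S / np.maximum(np.abs(S), 1e-8)   # ...keep only the phase
    return istft(mag * phase, win, hop)

# Usage: discard the phase of a 220 Hz tone, then recover a waveform
sr = 8000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 220.0 * t)
mag = np.abs(stft(tone, 1024, 256))
rec = griffin_lim(mag, 1024, 256)
err = np.linalg.norm(np.abs(stft(rec, 1024, 256)) - mag) / np.linalg.norm(mag)
```

The residual `err` shrinks with each iteration because every pass makes the phase more consistent with the fixed magnitude across overlapping frames.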

Audio waveform to mel-spectrogram

import copy

import librosa
import numpy as np
from scipy import signal

sr = 24000 # Sample rate.
n_fft = 2048 # FFT points (samples)
frame_shift = 0.0125 # seconds
frame_length = 0.05 # seconds
hop_length = int(sr*frame_shift) # samples.
win_length = int(sr*frame_length) # samples.
n_mels = 512 # Number of mel bands to generate
power = 1.2 # Exponent for amplifying the predicted magnitude
n_iter = 100 # Number of inversion iterations
preemphasis = .97 # or None
max_db = 100
ref_db = 20
top_db = 15
def get_spectrograms(fpath):
    '''Returns normalized log(melspectrogram) and log(magnitude) from `fpath`.
    Args:
      fpath: A string. The full path of a sound file.

    Returns:
      mel: A 2d array of shape (T, n_mels) <- Transposed
      mag: A 2d array of shape (T, 1+n_fft/2) <- Transposed
    '''
    # Loading sound file, resampled to the global `sr`.
    # Note: do not rebind `sr` here, or Python treats it as a local
    # variable and the `sr=sr` argument raises UnboundLocalError.
    y, _ = librosa.load(fpath, sr=sr)

    # Trimming leading/trailing silence
    y, _ = librosa.effects.trim(y, top_db=top_db)

    # Preemphasis
    y = np.append(y[0], y[1:] - preemphasis * y[:-1])

    # STFT
    linear = librosa.stft(y=y,
                          n_fft=n_fft,
                          hop_length=hop_length,
                          win_length=win_length)

    # magnitude spectrogram
    mag = np.abs(linear)  # (1+n_fft//2, T)

    # mel spectrogram
    mel_basis = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)  # (n_mels, 1+n_fft//2)
    mel = np.dot(mel_basis, mag)  # (n_mels, T)

    # to decibel
    mel = 20 * np.log10(np.maximum(1e-5, mel))
    mag = 20 * np.log10(np.maximum(1e-5, mag))

    # normalize to (0, 1]
    mel = np.clip((mel - ref_db + max_db) / max_db, 1e-8, 1)
    mag = np.clip((mag - ref_db + max_db) / max_db, 1e-8, 1)

    # Transpose
    mel = mel.T.astype(np.float32)  # (T, n_mels)
    mag = mag.T.astype(np.float32)  # (T, 1+n_fft//2)

    return mel, mag

Mel-spectrogram to audio waveform

def melspectrogram2wav(mel):
    '''Generate a waveform from a mel spectrogram.'''
    # transpose
    mel = mel.T

    # de-normalize
    mel = (np.clip(mel, 0, 1) * max_db) - max_db + ref_db

    # to amplitude
    mel = np.power(10.0, mel * 0.05)
    m = _mel_to_linear_matrix(sr, n_fft, n_mels)
    mag = np.dot(m, mel)

    # wav reconstruction
    wav = griffin_lim(mag)

    # de-preemphasis
    wav = signal.lfilter([1], [1, -preemphasis], wav)

    # trim
    wav, _ = librosa.effects.trim(wav)

    return wav.astype(np.float32)

def spectrogram2wav(mag):
    '''Generate a waveform from a magnitude spectrogram.'''
    # transpose
    mag = mag.T

    # de-normalize
    mag = (np.clip(mag, 0, 1) * max_db) - max_db + ref_db

    # to amplitude
    mag = np.power(10.0, mag * 0.05)

    # wav reconstruction
    wav = griffin_lim(mag)

    # de-preemphasis
    wav = signal.lfilter([1], [1, -preemphasis], wav)

    # trim
    wav, _ = librosa.effects.trim(wav)

    return wav.astype(np.float32)
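As a sanity check, the dB conversion and normalization in get_spectrograms and the de-normalization in melspectrogram2wav are exact inverses of each other, as long as no clipping occurs. A NumPy-only check with the article's max_db/ref_db values:

```python
import numpy as np

max_db, ref_db = 100, 20

amp = np.array([1e-3, 0.05, 0.3, 1.0])  # example amplitudes

# forward: amplitude -> dB -> [0, 1] (as in get_spectrograms)
db = 20 * np.log10(np.maximum(1e-5, amp))
norm = np.clip((db - ref_db + max_db) / max_db, 1e-8, 1)

# inverse: [0, 1] -> dB -> amplitude (as in melspectrogram2wav)
db_rec = (np.clip(norm, 0, 1) * max_db) - max_db + ref_db
amp_rec = np.power(10.0, db_rec * 0.05)

print(np.allclose(amp_rec, amp))  # True
```

The round trip breaks only for amplitudes below the 1e-5 floor or above the (ref_db) ceiling, which the clipping deliberately discards.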

A few helper functions:

def _mel_to_linear_matrix(sr, n_fft, n_mels):
    '''Approximate (pseudo-)inverse of the mel filterbank matrix.'''
    m = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)
    m_t = np.transpose(m)
    p = np.matmul(m, m_t)
    d = [1.0 / x if np.abs(x) > 1.0e-8 else x for x in np.sum(p, axis=0)]
    return np.matmul(m_t, np.diag(d))

def griffin_lim(spectrogram):
    '''Applies the Griffin-Lim algorithm.'''
    X_best = copy.deepcopy(spectrogram)
    for i in range(n_iter):
        X_t = invert_spectrogram(X_best)
        est = librosa.stft(X_t, n_fft=n_fft, hop_length=hop_length,
                           win_length=win_length)
        phase = est / np.maximum(1e-8, np.abs(est))
        X_best = spectrogram * phase
    X_t = invert_spectrogram(X_best)
    y = np.real(X_t)

    return y


def invert_spectrogram(spectrogram):
    '''
    spectrogram: [1+n_fft//2, t]
    '''
    return librosa.istft(spectrogram, hop_length=hop_length,
                         win_length=win_length, window="hann")

Pre-emphasis:
The average power spectrum of a speech signal is shaped by the glottal excitation and lip/nostril radiation: above roughly 800 Hz, the high end rolls off at about 6 dB/octave. Pre-emphasis boosts the high-frequency components to flatten the spectrum, which makes spectral analysis and vocal-tract parameter analysis easier.
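A tiny worked example of the pre-emphasis filter y[n] = x[n] - 0.97*x[n-1] and its exact inverse, which is the same de-emphasis the code applies with signal.lfilter([1], [1, -preemphasis], wav):

```python
import numpy as np

a = 0.97
x = np.array([1.0, 0.5, -0.2, 0.3])

# Pre-emphasis (FIR): y[n] = x[n] - a * x[n-1]; first sample passes through
y = np.append(x[0], x[1:] - a * x[:-1])

# De-emphasis (IIR): x[n] = y[n] + a * x[n-1] -- the exact inverse filter,
# equivalent to scipy.signal.lfilter([1], [1, -a], y)
x_rec = np.zeros_like(y)
x_rec[0] = y[0]
for n in range(1, len(y)):
    x_rec[n] = y[n] + a * x_rec[n - 1]

print(np.allclose(x_rec, x))  # True
```

Because the two filters cancel exactly, applying de-emphasis after waveform reconstruction restores the original spectral tilt.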
