Neural Noise Embedding for End-to-End Speech Enhancement with Conditional Layer Normalization



ICASSP 2021

0. Abstract

To cope with diverse and complex noise scenarios, this paper introduces a new enhancement architecture that combines a deep autoencoder with a neural noise embedding. The study proposes a new normalization method, Conditional Layer Normalization (CLN), to improve the generalization of deep-learning-based speech enhancement to unseen environments. The noise embedding conditions the network for the enhancement task through the CLN layers, so the proposed network can adapt to the noise information extracted from each noisy input. The whole network is trained end-to-end, and experiments show that the model captures noise information and improves robustness.

1. Introduction

Most deep-learning-based speech enhancement methods predict the clean speech signal directly from the noisy signal without considering noise information. In general, the training set can cover a large number of different noise environments to improve generalization and denoising performance. If the noise information can be estimated and embedded into the network as an additional cue, the problem of noise-type mismatch can be significantly alleviated.

This paper proposes a new conditional normalization method (CLN) for time-domain speech enhancement. The noisy signal is fed into the network to learn a noise embedding. In the enhancement network, the noisy signal passes through CLN layers conditioned on that embedding to produce the enhanced signal. All sub-networks are trained end-to-end with the proposed loss function.

2. Method

[Figure 1: Overall architecture of the proposed AECLN]

This paper explores a new autoencoder framework with CLN layers (AECLN) to improve the generalization of monaural speech enhancement; the overall architecture is shown in Figure 1. The model consists of three parts: a noise estimation network, a residual convolutional network, and an enhancement network. The noise estimation network produces a noise embedding vector $n_{emb}$, which is then passed through the residual convolutional network to compress the noise-environment features and strengthen the representational power of the embedding.
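To make the data flow concrete, here is a toy NumPy sketch of the three-stage pipeline. The sub-network bodies are placeholders, and the function names, tensor shapes, and the way the CLN step consumes the conditioning vector are all my assumptions for illustration, not the paper's actual layers:

```python
import numpy as np

def noise_estimation_net(noisy):
    # Placeholder: pool over time to get a fixed-size noise embedding n_emb.
    return noisy.mean(axis=-1)                      # (C,)

def residual_conv_net(n_emb):
    # Placeholder for the residual conv blocks that compress/refine n_emb.
    return np.tanh(n_emb)                           # (C,)

def enhancement_net(noisy, cond, eps=1e-8):
    # One CLN step: the scale and bias are derived from the conditioning
    # vector instead of being fixed learned parameters.
    gamma = 1.0 + cond[:, None]                     # (C, 1)
    beta = cond[:, None]                            # (C, 1)
    mean = noisy.mean(axis=0, keepdims=True)
    var = noisy.var(axis=0, keepdims=True)
    return gamma * (noisy - mean) / np.sqrt(var + eps) + beta

noisy = np.random.default_rng(1).standard_normal((16, 100))  # (channels, time)
enhanced = enhancement_net(noisy, residual_conv_net(noise_estimation_net(noisy)))
print(enhanced.shape)   # (16, 100)
```

The point of the sketch is only the wiring: the noisy input is summarized into an embedding, the embedding is refined, and the refined vector modulates the normalization inside the enhancement path.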

2.1 Conditional Layer Normalization

Layer normalization (LN) is a technique that normalizes the distribution of intermediate layers. It yields smoother gradients and faster training, and can be viewed as a form of regularization. Channel-wise LN is defined in Eq. (1):

$$\mathrm{cLN}(x) = \gamma \odot \frac{x - \mathrm{E}[x]}{\sqrt{\mathrm{Var}[x] + \epsilon}} + \beta \tag{1}$$

where $x \in \mathbb{R}^{C \times T}$ is the input feature, $\mathrm{E}[x]$ and $\mathrm{Var}[x]$ are the mean and variance computed over the channel dimension, $\gamma, \beta \in \mathbb{R}^{C \times 1}$ are trainable gain and bias parameters, and $\epsilon$ is a small constant for numerical stability.
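As a concrete illustration of Eq. (1), here is a minimal NumPy implementation of channel-wise LN; the function name and example shapes are my own choices, not from the paper:

```python
import numpy as np

def channelwise_layer_norm(x, gamma, beta, eps=1e-8):
    """Channel-wise LN: normalize each time step over the channel axis.

    x:     (C, T) feature map
    gamma: (C, 1) trainable gain
    beta:  (C, 1) trainable bias
    """
    mean = x.mean(axis=0, keepdims=True)   # (1, T), per-time-step mean over channels
    var = x.var(axis=0, keepdims=True)     # (1, T), per-time-step variance
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

x = np.random.default_rng(0).standard_normal((8, 50))
y = channelwise_layer_norm(x, np.ones((8, 1)), np.zeros((8, 1)))
print(y.mean(axis=0).round(6))   # each time step now has ~zero mean over channels
```

With unit gain and zero bias, every column of the output has approximately zero mean and unit variance; in CLN these two parameters would instead be produced from the noise embedding.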
