Notes: https://flowus.cn/share/1683b50b-1469-4d57-bef0-7631d39ac8f0
【FlowUs 息流】FastSpeech2
Paper: FastSpeech 2: Fast and High-Quality End-to-End Text to Speech, https://arxiv.org/abs/2006.04558
Abstract:
Tacotron → FastSpeech introduced knowledge distillation to alleviate the one-to-many problem in TTS. Problems with the teacher-student distillation pipeline: 1) it is complicated and slow; 2) it is not accurate enough; 3) the student model learns from the teacher model's outputs rather than directly from the mel-spectrograms, which causes information loss.
FastSpeech 2's solutions: 1) train directly on the ground truth; 2) introduce more conditional inputs: pitch, energy, and more accurate duration. Concretely: extract duration, pitch, and energy from the speech waveform, take them directly as conditional inputs in training, and use predicted values in inference.
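As a rough sketch of what "extract energy from the speech waveform" means, the paper computes frame-level energy as the L2-norm of each STFT frame's amplitude. The frame/hop sizes and window below are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np

def frame_energy(wav, frame_length=1024, hop_length=256):
    """Frame-level energy: L2-norm of each STFT frame's amplitude spectrum.
    frame_length/hop_length are assumed values, not the paper's exact setup."""
    n_frames = 1 + max(0, len(wav) - frame_length) // hop_length
    window = np.hanning(frame_length)
    energies = []
    for i in range(n_frames):
        frame = wav[i * hop_length : i * hop_length + frame_length] * window
        spec = np.abs(np.fft.rfft(frame))       # amplitude spectrum of the frame
        energies.append(np.linalg.norm(spec))   # L2-norm = frame energy
    return np.array(energies)

# toy usage: a 1-second 440 Hz sine at 16 kHz
sr = 16000
t = np.arange(sr) / sr
wav = np.sin(2 * np.pi * 440 * t).astype(np.float32)
e = frame_energy(wav)
print(e.shape)  # one scalar per frame
```

During training this sequence would be quantized and fed as a conditional input; at inference an energy predictor supplies it instead.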
1. Introduction:
Improvements in FastSpeech 2:
1. Train the FastSpeech 2 model directly with the ground truth.
2. To alleviate the one-to-many problem, introduce more speech conditions: during training, first extract pitch, energy, and duration from the target speech waveform, then feed them in as conditional inputs.
3. Pitch is hard to predict yet important. Approach: "we convert the pitch contour into pitch spectrogram using continuous wavelet transform and predict the pitch in the frequency domain, which can improve the accuracy of predicted pitch."
4. FastSpeech 2s: skip the mel-spectrogram and generate the speech waveform directly from text.
Contributions:
- FastSpeech 2 achieves a 3x training speed-up over FastSpeech by simplifying the training pipeline.
- FastSpeech 2 alleviates the one-to-many mapping problem in TTS and achieves better voice quality.
- FastSpeech 2s further simplifies the inference pipeline for speech synthesis while maintaining high voice quality, by directly generating speech waveform from text.
2. FastSpeech 2 and 2s
2.1 Motivation
Address the one-to-many problem of autoregressive models, and the complexity, information loss, and inaccuracy of FastSpeech's teacher-student pipeline.