Paper: VITS2: Improving Quality and Efficiency of Single-Stage Text-to-Speech with Adversarial Learning and Architecture Design
Demo: https://vits-2.github.io/demo/
Paper (arXiv): https://arxiv.org/abs/2307.16430


Problems that still remain in prior single-stage TTS:
- intermittent unnaturalness
- low efficiency of the duration predictor
- complex input format required to alleviate the limitations of alignment and duration modeling (use of the blank token)
- insufficient speaker similarity in the multi-speaker model
- slow training and strong dependence on phoneme conversion
Proposed methods:
- a stochastic duration predictor trained through adversarial learning
- normalizing flows improved by utilizing the transformer block
- a speaker-conditioned text encoder to better model multiple speakers' characteristics (see the sketch after this list)
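As a rough illustration of the last point, the sketch below conditions a transformer text encoder on a learned speaker embedding by broadcast-adding it to the phoneme embeddings. This is a minimal PyTorch sketch under that assumption; the dimensions (e.g. d_model=192), layer counts, and the exact injection point used in VITS2 may differ.

```python
import torch
import torch.nn as nn

class SpeakerConditionedTextEncoder(nn.Module):
    """Minimal sketch of a speaker-conditioned text encoder.

    Assumption: the speaker embedding is simply added to the phoneme
    embeddings before the transformer stack; the paper's exact injection
    point (e.g. a specific intermediate block) may differ.
    """

    def __init__(self, n_phonemes, n_speakers, d_model=192, n_heads=2, n_layers=6):
        super().__init__()
        self.phoneme_emb = nn.Embedding(n_phonemes, d_model)
        self.speaker_emb = nn.Embedding(n_speakers, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads,
            dim_feedforward=4 * d_model, batch_first=True,
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, phoneme_ids, speaker_ids):
        # phoneme_ids: (batch, seq_len), speaker_ids: (batch,)
        x = self.phoneme_emb(phoneme_ids)               # (B, T, D)
        g = self.speaker_emb(speaker_ids).unsqueeze(1)  # (B, 1, D)
        return self.encoder(x + g)                      # speaker-conditioned hidden states


# Usage: encode two phoneme sequences for two different speakers.
enc = SpeakerConditionedTextEncoder(n_phonemes=100, n_speakers=4)
phonemes = torch.randint(0, 100, (2, 10))
speakers = torch.tensor([0, 3])
hidden = enc(phonemes, speakers)  # shape (2, 10, 192)
```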

This post introduces VITS2, an improved single-stage text-to-speech method that tackles the intermittent unnaturalness and inefficiency of earlier models through an adversarially trained duration predictor, transformer-augmented normalizing flows, and a speaker-conditioned text encoder, improving audio quality and the ability to model multiple speakers' characteristics. A rough sketch of the adversarial duration objective follows below.
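To make the adversarial duration-prediction idea concrete, here is a minimal least-squares-GAN-style sketch: a small discriminator scores pairs of text hidden states and per-phoneme durations, and the duration predictor is pushed to make its predicted durations indistinguishable from the ground-truth ones. The discriminator architecture, the loss form, and the names (DurationDiscriminator, lsgan_duration_losses) are assumptions for illustration, not the paper's exact design.

```python
import torch
import torch.nn as nn

class DurationDiscriminator(nn.Module):
    """Sketch: scores (text hidden states, per-phoneme durations) pairs.

    Assumption: a small conv stack over the text hidden sequence with the
    duration appended as an extra channel; not the paper's architecture.
    """

    def __init__(self, d_model=192, d_hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(d_model + 1, d_hidden, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(d_hidden, 1, kernel_size=3, padding=1),
        )

    def forward(self, text_hidden, durations):
        # text_hidden: (B, T, D), durations: (B, T)
        x = torch.cat([text_hidden, durations.unsqueeze(-1)], dim=-1)  # (B, T, D+1)
        return self.net(x.transpose(1, 2)).squeeze(1)                  # (B, T) scores


def lsgan_duration_losses(disc, text_hidden, d_real, d_fake):
    """Least-squares GAN losses for the duration predictor (hedged sketch)."""
    real_score = disc(text_hidden, d_real)
    fake_score = disc(text_hidden, d_fake.detach())
    # Discriminator: push real durations toward 1, predicted ones toward 0.
    loss_disc = ((real_score - 1) ** 2).mean() + (fake_score ** 2).mean()
    # Duration predictor (generator): make its predictions score like real ones.
    loss_gen = ((disc(text_hidden, d_fake) - 1) ** 2).mean()
    return loss_disc, loss_gen
```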