Multilingual text-to-waveform with cross-speaker prosody transfer

尚增强; 张鹏远; 王丽

doi:10.12395/0371-0025.2022146

Abstract

Multilingual speech recorded by a single speaker is scarce, which makes it difficult for one voice to support synthesis in several languages. Unlike previous methods that decouple timbre and pronunciation only within the acoustic model, this paper proposes an end-to-end multilingual speech synthesis method that incorporates cross-speaker prosody transfer. The method uses a two-level hierarchical conditional variational autoencoder to model generation from text to waveform directly, and it decouples timbre, pronunciation, and prosody. It improves the prosody of cross-lingual synthesis by transferring the prosody style of existing speakers of the target language. In cross-lingual speech generation experiments, the proposed model achieves mean opinion scores (MOS) of 3.91 for naturalness and 4.01 for similarity. Objective evaluation shows that it reduces the word error rate of cross-lingual synthesis to 5.85% compared with the baselines. Prosody transfer and ablation experiments further support the effectiveness of the proposed method.
