Indexed in EI / SCOPUS / CSCD

Chinese Core Journal

Citation: HAO Xiaoyang (蒿晓阳), ZHANG Pengyuan (张鹏远). Autoregressive multi-speaker model in Chinese speech synthesis based on variational autoencoder[J]. ACTA ACUSTICA (声学学报), 2022, 47(3): 405-416. DOI: 10.15949/j.cnki.0371-0025.2022.03.004

Autoregressive multi-speaker model in Chinese speech synthesis based on variational autoencoder

Abstract: Multi-speaker speech synthesis commonly relies on one of two approaches: parameter adaptation or speaker labels. A model obtained by parameter adaptation can only synthesize speech for the speakers it was adapted to, so it is not robust. The conventional speaker-label approach requires speaker information to be obtained with supervision and does not learn speaker labels from the speech signal itself in an unsupervised way. To address these problems, an autoregressive multi-speaker speech synthesis method based on a variational autoencoder is proposed. The variational autoencoder first learns speaker information without supervision and implicitly encodes it as a speaker label, which is then fed, together with the linguistic features of the text, into an autoregressive acoustic parameter prediction network. In addition, to suppress the over-fitting of fundamental frequency prediction caused by multi-speaker speech data, the acoustic parameter network adopts multi-task learning for fundamental frequency. Preliminary experiments show that adding the autoregressive structure reduces cepstral distortion by 1.018 dB, and that multi-task learning reduces the root mean square error of fundamental frequency by 6.861 Hz. In the subsequent multi-speaker comparison experiments, the proposed method achieves Mean Opinion Scores (MOS) of 3.71, 3.55, and 3.15 and Pinyin error rates of 6.71%, 7.54%, and 9.87% on the three multi-speaker tasks, respectively, improving the quality of multi-speaker synthesized speech.
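The conditioning step described in the abstract, in which a speaker code learned by a variational autoencoder is combined with linguistic features before acoustic prediction, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the 4-dimensional code, the 3-dimensional linguistic frames, and the choice to concatenate one utterance-level code onto every frame are all assumptions made for illustration. Only the reparameterization step and the KL term are standard for a variational autoencoder with a diagonal Gaussian posterior.

```python
import math
import random


def sample_speaker_code(mu, logvar, rng):
    """Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, logvar)]


def kl_to_standard_normal(mu, logvar):
    """KL divergence from N(mu, diag(exp(logvar))) to N(0, I)."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv)
                      for m, lv in zip(mu, logvar))


def condition_on_speaker(linguistic_frames, z):
    """Append the utterance-level speaker code to every linguistic frame.

    Concatenation is an assumption of this sketch; the abstract only says
    that both are fed into the autoregressive acoustic network.
    """
    return [frame + z for frame in linguistic_frames]


rng = random.Random(0)

# Hypothetical encoder outputs for one utterance (made-up sizes and values).
mu = [0.0, 0.0, 0.0, 0.0]
logvar = [0.0, 0.0, 0.0, 0.0]

# Hypothetical frame-level linguistic features: 5 frames, 3 dimensions each.
linguistic_frames = [[0.1 * t, 0.2 * t, 0.3 * t] for t in range(5)]

z = sample_speaker_code(mu, logvar, rng)
acoustic_input = condition_on_speaker(linguistic_frames, z)
kl = kl_to_standard_normal(mu, logvar)

print(len(acoustic_input), len(acoustic_input[0]))
print(kl)
```

In a trained system the mean and log-variance would come from an encoder network applied to the speech signal, and the KL term would be added to the reconstruction loss so that the code is learned without speaker supervision.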
