MelGAN vocoder based on a window self-attention mechanism
Abstract: Vocoders based on generative adversarial networks offer significant advantages in real-time speech synthesis efficiency, but existing models often improve speech quality at the cost of a larger parameter count or reduced generalization ability. This paper proposes a MelGAN vocoder based on a window self-attention mechanism: long-term dependencies in speech are captured effectively by introducing window self-attention together with a layer-wise window shifting strategy, and a mel-spectrogram loss is incorporated during training to suppress noise and improve the quality of the synthesized speech. Compared with conventional approaches that model long-term dependencies using dilated convolutions, the proposed model captures long-term speech features effectively while maintaining a lower parameter count. Experimental results show that the proposed model outperforms the classic MelGAN on both subjective mean opinion score (MOS) and speech-quality evaluation model scores in the single-speaker setting, and also generalizes well to unseen speakers. Moreover, its synthesis quality is comparable to that of the high-performance HiFi-GAN vocoder, with fewer parameters and faster inference.
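The core idea of window self-attention with layer-wise shifting can be sketched as follows. This is a minimal illustrative example, not the paper's implementation: it uses identity Q/K/V projections, a toy window size, and a half-window cyclic shift on alternate layers (as in shifted-window attention) so that information can propagate across window boundaries over depth.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def window_self_attention(x, window, shift=0):
    """Self-attention restricted to non-overlapping windows of a 1-D sequence.

    x: (T, C) frame features; T must be divisible by `window`.
    shift: cyclic shift applied before partitioning, so that shifted layers
    attend across the previous layer's window boundaries.
    Identity Q/K/V projections keep the sketch minimal (assumption).
    """
    T, C = x.shape
    if shift:
        x = np.roll(x, -shift, axis=0)              # move boundary frames into one window
    w = x.reshape(T // window, window, C)           # (num_windows, window, C)
    scores = w @ w.transpose(0, 2, 1) / np.sqrt(C)  # per-window attention scores
    out = (softmax(scores) @ w).reshape(T, C)       # attend within each window only
    if shift:
        out = np.roll(out, shift, axis=0)           # undo the cyclic shift
    return out

# Alternating zero and half-window shifts across layers lets the receptive
# field grow with depth, approximating long-range context at low cost.
x = np.random.randn(16, 8)
for layer in range(4):
    x = window_self_attention(x, window=4, shift=(layer % 2) * 2)
print(x.shape)
```

Because attention is computed within windows of fixed size W, the cost per layer is O(T·W·C) rather than the O(T²·C) of full self-attention, which is what keeps the parameter and compute budget low relative to modeling long-term context with stacked dilated convolutions.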