MelGAN vocoder based on a window self-attention mechanism
Abstract
Vocoders based on generative adversarial networks offer significant efficiency advantages for real-time speech generation. However, improvements in speech quality often come at the cost of increased model size or reduced generalization ability. In this paper, a MelGAN vocoder based on a window self-attention mechanism is proposed. Long-term dependencies in speech are captured effectively by introducing a window self-attention mechanism together with a layer-wise window-shifting strategy. A mel-spectrogram loss is incorporated to suppress noise during training, which enhances the quality of the synthesized speech. Compared with conventional methods that model long-term dependencies using dilated convolutions, the proposed model captures speech features efficiently while maintaining a lower parameter count. Experimental results demonstrate that the proposed model outperforms the classic MelGAN on both subjective mean opinion scores and objective speech-quality model scores in single-speaker scenarios, while exhibiting strong generalization to unseen speakers. Moreover, it achieves synthesis quality comparable to the high-performance HiFi-GAN vocoder with significantly fewer parameters and faster inference speed.
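To make the core mechanism concrete, the following is a minimal NumPy sketch of windowed self-attention with a layer-wise shift, in the spirit described above. It is an illustration only: the window length, the circular shift, and the use of identity Q/K/V projections are assumptions for clarity, not the paper's actual architecture, which would use learned projections and multiple heads.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def window_self_attention(x, window, shift=0):
    """Self-attention restricted to non-overlapping windows.

    x      : (T, d) feature sequence, T divisible by `window`
    window : window length (attention is computed within each window)
    shift  : circular offset applied before partitioning; alternating
             the shift across layers lets information flow across
             window boundaries (the layer-wise shifting idea)
    """
    T, d = x.shape
    assert T % window == 0, "sequence length must be divisible by window"
    if shift:
        x = np.roll(x, -shift, axis=0)
    out = np.empty_like(x)
    for s in range(0, T, window):
        w = x[s:s + window]                     # (window, d)
        # identity Q/K/V projections for illustration; real models learn them
        scores = w @ w.T / np.sqrt(d)           # (window, window)
        out[s:s + window] = softmax(scores) @ w
    if shift:
        out = np.roll(out, shift, axis=0)       # undo the shift
    return out
```

Because each window attends only within itself, cost grows linearly in sequence length for a fixed window size, which is one way a windowed mechanism can stay cheaper than modeling long-range context with ever-larger dilated convolutions.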