Monaural speech enhancement using U-net fused with multi-head self-attention
Graphical Abstract
Abstract
Under low Signal-to-Noise Ratio (SNR) and burst background noise conditions, existing deep learning-based speech enhancement methods perform unsatisfactorily. Humans, in contrast, exploit the long-term correlation of speech to form an integrated perception of different speech signals, which suggests that modeling the long-term dependencies of speech can improve enhancement performance under low SNR and burst background noise. Inspired by this observation, we propose TU-net, a time-domain end-to-end monaural speech enhancement model that fuses the multi-head self-attention mechanism with the U-net deep network. TU-net adopts the encoder-decoder structure of U-net to achieve multi-scale feature fusion and introduces a dual-path Transformer module, built on multi-head self-attention, to compute the speech mask and better model long-term correlation. The model is trained with a weighted sum of loss functions defined in the time domain, time-frequency domain, and perceptual domain. Extensive experiments show that TU-net outperforms other comparable monaural enhancement network models on several evaluation metrics, including Perceptual Evaluation of Speech Quality (PESQ), Short-Time Objective Intelligibility (STOI), and SNR gain, under low SNR and burst background noise conditions, while keeping the number of model parameters relatively small.
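
A dual-path Transformer module of the kind described above typically chunks the feature sequence and alternates self-attention within each chunk (short-term structure) and across chunks (long-term correlation). The following is a minimal sketch of one such block, assuming PyTorch and an input of shape (batch, channels, num_chunks, chunk_len); the class name and hyperparameters are illustrative and not taken from the paper.

# Minimal sketch (not the authors' code) of a dual-path Transformer
# block with multi-head self-attention over chunked features.
import torch
import torch.nn as nn

class DualPathTransformerBlock(nn.Module):
    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        # Intra-chunk layer models short-term structure inside a chunk;
        # inter-chunk layer models long-term dependencies across chunks.
        self.intra = nn.TransformerEncoderLayer(
            d_model=channels, nhead=num_heads, batch_first=True)
        self.inter = nn.TransformerEncoderLayer(
            d_model=channels, nhead=num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, k, s = x.shape  # batch, channels, num_chunks, chunk_len
        # Intra-chunk self-attention: attend over positions within a chunk.
        intra_in = x.permute(0, 2, 3, 1).reshape(b * k, s, c)
        intra_out = self.intra(intra_in).reshape(b, k, s, c)
        # Inter-chunk self-attention: attend over chunks at each position,
        # which is what captures the long-term correlation of speech.
        inter_in = intra_out.permute(0, 2, 1, 3).reshape(b * s, k, c)
        inter_out = self.inter(inter_in).reshape(b, s, k, c)
        return inter_out.permute(0, 3, 2, 1)  # back to (b, c, k, s)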
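The weighted sum loss can likewise be sketched. The form below is an assumption for illustration only: an L1 waveform term (time domain), an STFT-magnitude term (time-frequency domain), and a log-magnitude term standing in for the perceptual term; the weights w1, w2, w3, the STFT settings, and the function name are hypothetical, and the paper's actual terms may differ.

# Minimal sketch (assumed form) of a weighted sum loss over the time,
# time-frequency, and perceptual domains, for (batch, time) waveforms.
import torch
import torch.nn.functional as F

def weighted_enhancement_loss(est: torch.Tensor, ref: torch.Tensor,
                              w1: float = 1.0, w2: float = 1.0,
                              w3: float = 1.0) -> torch.Tensor:
    # Time-domain term: sample-level L1 distance between waveforms.
    time_loss = F.l1_loss(est, ref)
    # Time-frequency term: L1 distance between STFT magnitudes.
    win = torch.hann_window(512, device=est.device)
    est_mag = torch.stft(est, n_fft=512, hop_length=128, window=win,
                         return_complex=True).abs()
    ref_mag = torch.stft(ref, n_fft=512, hop_length=128, window=win,
                         return_complex=True).abs()
    tf_loss = F.l1_loss(est_mag, ref_mag)
    # Perceptual term: log-magnitude distance as a crude stand-in for a
    # perceptually motivated loss (the paper's term is not specified here).
    perc_loss = F.l1_loss(torch.log1p(est_mag), torch.log1p(ref_mag))
    return w1 * time_loss + w2 * tf_loss + w3 * perc_loss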