Indexed in EI / SCOPUS / CSCD

Chinese Core Journal

Citation: FAN Junyi, YANG Jibin, ZHANG Xiongwei, ZHENG Changyan. Monaural speech enhancement using U-net fused with multi-head self-attention[J]. ACTA ACUSTICA, 2022, 47(6): 703-716. DOI: 10.15949/j.cnki.0371-0025.2022.06.007

Monaural speech enhancement using U-net fused with multi-head self-attention

  • Abstract: Under low signal-to-noise ratio (SNR) and burst background noise conditions, existing deep learning models for monaural speech enhancement perform unsatisfactorily, whereas humans can exploit the long-term correlation of speech to form an integrated perception of different speech signals. Modeling the long-term dependencies of speech can therefore help improve enhancement performance under low SNR and burst background noise. Inspired by this property, TU-net, a time-domain end-to-end monaural speech enhancement model that fuses the multi-head self-attention mechanism with the U-net deep network, is proposed. TU-net uses the encoder and decoder layers of U-net to perform multi-scale feature fusion on the noisy speech signal, and uses the multi-head self-attention mechanism to implement a dual-path Transformer that computes the speech mask and better models long-term correlation. The model computes loss functions in the time domain, the time-frequency domain and the perceptual domain, and is trained with their weighted combination. Simulation experiments show that, under low SNR and burst background noise conditions, TU-net outperforms comparable monaural enhancement network models on several evaluation metrics, including Perceptual Evaluation of Speech Quality (PESQ), Short-Time Objective Intelligibility (STOI) and SNR gain, while keeping a relatively small number of model parameters.
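The weighted combination of losses mentioned in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not give the individual loss terms or weights, so the L1 time-domain term, the STFT-magnitude term, the frame length, the hop size, the default weights and all function names below are assumptions chosen for illustration. The perceptual-domain term is passed in as a precomputed number because the abstract does not say how it is computed.

```python
import numpy as np

def frame_signal(x, frame_len=512, hop=128):
    """Split a 1-D signal into overlapping frames (no padding)."""
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]

def stft_magnitude(x, frame_len=512, hop=128):
    """Magnitude spectrogram of Hann-windowed frames."""
    frames = frame_signal(x, frame_len, hop) * np.hanning(frame_len)
    return np.abs(np.fft.rfft(frames, axis=1))

def time_loss(est, ref):
    """Assumed time-domain term: mean absolute error between waveforms."""
    return np.mean(np.abs(est - ref))

def tf_loss(est, ref, frame_len=512, hop=128):
    """Assumed time-frequency term: mean absolute error between STFT magnitudes."""
    diff = stft_magnitude(est, frame_len, hop) - stft_magnitude(ref, frame_len, hop)
    return np.mean(np.abs(diff))

def combined_loss(est, ref, perceptual_term=0.0, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three terms; the weights are hypothetical."""
    w_time, w_tf, w_perc = weights
    return (w_time * time_loss(est, ref)
            + w_tf * tf_loss(est, ref)
            + w_perc * perceptual_term)

# Usage: compare a noisy estimate against a reference waveform.
rng = np.random.default_rng(0)
ref = rng.standard_normal(4000)
est = ref + 0.1 * rng.standard_normal(4000)
print(combined_loss(est, ref, weights=(0.5, 0.5, 0.0)))
```

In a training setup the same weighted sum would be written in an automatic-differentiation framework so that gradients flow through every term; the NumPy version only shows how the terms are combined.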

     
