
Monaural speech enhancement using U-net fused with multi-head self-attention

FAN Junyi, YANG Jibin, ZHANG Xiongwei, ZHENG Changyan

Citation: FAN Junyi, YANG Jibin, ZHANG Xiongwei, ZHENG Changyan. Monaural speech enhancement using U-net fused with multi-head self-attention[J]. ACTA ACUSTICA, 2022, 47(6): 703-716. doi: 10.15949/j.cnki.0371-0025.2022.06.007


doi: 10.15949/j.cnki.0371-0025.2022.06.007
Funding: Supported by the National Natural Science Foundation of China (62071484)

Corresponding authors:

    YANG Jibin, yjbice@sina.com

    ZHANG Xiongwei, xwzhang9898@163.com


  • Abstract: Under conditions of low signal-to-noise ratio (SNR) and burst background noise, existing deep learning network models perform poorly at monaural speech enhancement, whereas humans can exploit the long-term correlation of speech to form an integrated perception of different speech signals. Characterizing the long-term dependencies of speech therefore helps improve enhancement performance under low SNR and burst background noise. Inspired by this property, an enhancement model called TU-net is proposed that fuses a multi-head attention mechanism with a U-net deep network to achieve end-to-end monaural speech enhancement in the time domain. TU-net uses the encoder-decoder layers of U-net to perform multi-scale feature fusion on the noisy speech signal, and uses multi-head attention to implement a dual-path Transformer that computes the speech mask, modeling long-term correlation more effectively. The model computes loss functions in the time, time-frequency, and perceptual domains, and training is guided by their weighted combination. Simulation results show that under low SNR and burst background noise, TU-net outperforms comparable monaural enhancement network models on multiple evaluation metrics, including perceptual evaluation of speech quality (PESQ), short-time objective intelligibility (STOI), and SNR gain, while keeping a relatively small number of network parameters.
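
    The weighted multi-domain training objective described in the abstract can be made concrete with a short sketch. The following is a minimal PyTorch illustration under stated assumptions, not the authors' implementation: the loss weights (w_time, w_freq, w_perc), the use of L1 distances, the STFT settings, and the log-magnitude stand-in for the perceptual-domain term are all assumptions made for illustration.

        import torch
        import torch.nn.functional as F

        def combined_loss(enhanced, clean, w_time=1.0, w_freq=1.0, w_perc=0.1,
                          n_fft=512, hop=128):
            # Time-domain term: sample-wise L1 distance between waveforms.
            loss_time = F.l1_loss(enhanced, clean)

            # Time-frequency term: L1 distance between STFT magnitude spectrograms.
            window = torch.hann_window(n_fft, device=enhanced.device)
            mag_e = torch.stft(enhanced, n_fft, hop_length=hop, window=window,
                               return_complex=True).abs()
            mag_c = torch.stft(clean, n_fft, hop_length=hop, window=window,
                               return_complex=True).abs()
            loss_freq = F.l1_loss(mag_e, mag_c)

            # Perceptual-domain term: a log-magnitude distance is used here only
            # as a stand-in; the paper's actual perceptual loss is not reproduced.
            loss_perc = F.l1_loss(torch.log1p(mag_e), torch.log1p(mag_c))

            # Weighted combination of the three domain losses guides training.
            return w_time * loss_time + w_freq * loss_freq + w_perc * loss_perc

        # Usage on a batch of 1 s, 16 kHz waveforms (random data for illustration):
        enhanced = torch.randn(4, 16000)
        clean = torch.randn(4, 16000)
        loss = combined_loss(enhanced, clean)

    In such a setup the relative weights trade off waveform fidelity against spectral and perceptual quality; the values above are placeholders, not the published settings.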

     

Publication history
  • Received: 2021-11-04
  • Revised: 2022-03-27
  • Published: 2022-11-15
