
Consistency self-supervised learning method for robust automatic speech recognition

GAO Changfeng, CHENG Gaofeng, ZHANG Pengyuan

Citation: GAO Changfeng, CHENG Gaofeng, ZHANG Pengyuan. Consistency self-supervised learning method for robust automatic speech recognition[J]. ACTA ACUSTICA, 2023, 48(3): 578-587. doi: 10.15949/j.cnki.0371-0025.2023.03.008

doi: 10.15949/j.cnki.0371-0025.2023.03.008
Funding: National Key Research and Development Program of China (2020AAA0108002)


  • Abstract:

    A robust automatic speech recognition method based on consistency self-supervised learning (CSSL) is proposed. Using speech signal simulation, the method generates copies of an utterance under different acoustic scenes; while learning speech representations in a self-supervised manner, it maximizes the similarity between the representations of the same utterance in different acoustic environments, thereby obtaining speech representations that are insensitive to environmental interference and improving the performance of the downstream speech recognition model. Experiments on the far-field dataset CHiME-4 and the meeting dataset AMI show that the proposed consistency self-supervised learning algorithm achieves a relative word error rate reduction of more than 30% over the existing wav2vec2.0 self-supervised learning baseline, demonstrating that the proposed method is an effective way to obtain noise-invariant speech representations and to improve robust speech recognition performance.
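    The abstract describes two ingredients: simulating copies of one utterance under different acoustic scenes, and maximizing the similarity of the corresponding representations during self-supervised pre-training. The sketch below illustrates one plausible realization in PyTorch; it is not the authors' implementation, and the function names, the SNR-based noise mixing, the InfoNCE-style formulation, and the temperature value are assumptions made for this example.

```python
# Minimal sketch (assumed, not the paper's code) of the two ideas in the
# abstract: (1) simulate a second acoustic-scene copy of an utterance,
# (2) pull the frame representations of the two copies together with a
# contrastive loss so the learned representation ignores the environment.
import torch
import torch.nn.functional as F

def simulate_noisy_copy(speech: torch.Tensor, noise: torch.Tensor,
                        snr_db: float) -> torch.Tensor:
    """Mix `noise` into `speech` at a target SNR; a stand-in for the
    paper's broader acoustic-scene simulation (noise, reverberation, etc.)."""
    noise = noise[:speech.numel()]
    speech_power = speech.pow(2).mean()
    noise_power = noise.pow(2).mean().clamp_min(1e-10)
    scale = torch.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return speech + scale * noise

def consistency_contrastive_loss(z_a: torch.Tensor, z_b: torch.Tensor,
                                 temperature: float = 0.1) -> torch.Tensor:
    """z_a, z_b: (T, D) frame representations of the same utterance under
    two acoustic conditions. Frame t of one view is the positive for frame
    t of the other; all remaining frames act as negatives (InfoNCE)."""
    z_a = F.normalize(z_a, dim=-1)              # compare in cosine space
    z_b = F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / temperature        # (T, T) similarity matrix
    targets = torch.arange(z_a.size(0))         # diagonal = aligned frames
    # symmetric: each view must identify its aligned frame in the other
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

# Toy usage with random stand-ins for waveforms and encoder outputs
wav, noise = torch.randn(16000), torch.randn(16000)
wav_noisy = simulate_noisy_copy(wav, noise, snr_db=5.0)
z_clean, z_noisy = torch.randn(50, 256), torch.randn(50, 256)
print(consistency_contrastive_loss(z_clean, z_noisy).item())
```

    Treating time-aligned frames of the two copies as positive pairs is what ties the representation to the utterance content rather than to the acoustic scene.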


  • Fig. 1  Computation flow of the consistency contrastive loss function

    Fig. 2  CSSL pre-training pipeline for robust speech recognition

    Fig. 3  Spectrograms of clean speech (a) and noisy speech (b)

    Fig. 4  Visualization of the output features of the baseline model (a) and the DAMC-CSSL model (b)

    Table 1  Word error rates (%) on CHiME-4 of wav2vec2.0 models with different pre-training strategies (dt = development set, et = evaluation set; -s = simulated, -r = real recordings)

    Training strategy | Language model | Single channel (dt-s / dt-r / et-s / et-r) | Six channels (dt-s / dt-r / et-s / et-r)
    Source-data pre-training | Trigram LM | 10.8 / 8.5 / 17.2 / 15.5 | 6.2 / 5.5 / 10.2 / 9.2
    Data-mixing pre-training | | 12.8 / 10.3 / 19.2 / 16.5 | 8.5 / 7.2 / 12.4 / 10.6
    Continual pre-training | | 9.3 / 7.6 / 16.8 / 13.7 | 6.3 / 5.4 / 10.1 / 8.6
    +1:1 data replay | | 9.0 / 7.3 / 15.8 / 13.1 | 5.9 / 5.1 / 9.4 / 8.0
    +1:3 data replay | | 9.5 / 7.4 / 15.9 / 13.7 | 5.8 / 5.2 / 9.4 / 8.0
    +1:9 data replay | | 9.8 / 7.8 / 16.3 / 13.9 | 6.0 / 5.2 / 9.7 / 8.3
    Source-data pre-training | Neural network LM | 7.7 / 5.8 / 13.3 / 11.3 | 4.2 / 3.4 / 7.1 / 6.2
    Data-mixing pre-training | | 9.4 / 7.0 / 15.0 / 12.1 | 5.7 / 4.6 / 8.8 / 7.2
    Continual pre-training | | 6.7 / 5.0 / 13.2 / 10.1 | 3.8 / 3.3 / 7.2 / 5.6
    +1:1 data replay | | 6.4 / 4.8 / 12.3 / 9.7 | 3.8 / 3.2 / 6.6 / 5.4
    +1:3 data replay | | 6.5 / 4.7 / 12.1 / 9.7 | 3.6 / 3.0 / 6.4 / 5.1
    +1:9 data replay | | 6.9 / 4.8 / 12.4 / 10.1 | 3.6 / 3.0 / 6.3 / 5.3

    Table 2  Word error rate (%) comparison on CHiME-4 between the wav2vec2.0 self-supervised learning algorithm and the proposed CSSL algorithm

    Training strategy | Language model | Single channel (dt-s / dt-r / et-s / et-r) | Six channels (dt-s / dt-r / et-s / et-r)
    Source-data pre-training | Trigram LM | 10.8 / 8.5 / 17.2 / 15.5 | 6.2 / 5.5 / 10.2 / 9.2
    +DA-SIMU | | 10.5 / 8.9 / 16.8 / 15.5 | 6.4 / 5.6 / 10.0 / 8.8
    +DA-CSSL | | 9.5 / 7.3 / 15.7 / 13.3 | 5.9 / 5.1 / 9.4 / 8.2
    Continual pre-training (1:1) | | 9.0 / 7.3 / 15.8 / 13.1 | 5.9 / 5.1 / 9.4 / 8.0
    +DA-SIMU | | 9.0 / 7.2 / 15.5 / 13.0 | 5.9 / 5.2 / 9.5 / 7.9
    +DA-CSSL | | 8.7 / 6.5 / 15.2 / 12.0 | 5.7 / 4.8 / 9.1 / 7.8
    +MC-CSSL | | 9.0 / 7.0 / 15.0 / 12.7 | 5.6 / 4.9 / 8.8 / 7.5
    +DAMC-CSSL | | 8.4 / 6.3 / 14.7 / 11.4 | 5.3 / 4.6 / 8.4 / 7.3
    +DAMCSE-CSSL | | 8.8 / 6.5 / 15.2 / 12.2 | 5.7 / 4.8 / 9.1 / 7.6
    Source-data pre-training | Neural network LM | 7.7 / 5.8 / 13.3 / 11.3 | 4.2 / 3.4 / 7.1 / 6.2
    +DA-SIMU | | 7.3 / 5.5 / 12.9 / 10.9 | 3.9 / 3.2 / 6.8 / 5.6
    +DA-CSSL | | 6.6 / 4.5 / 11.9 / 9.3 | 3.6 / 2.9 / 6.2 / 5.3
    Continual pre-training (1:1) | | 6.4 / 4.8 / 12.3 / 9.7 | 3.8 / 3.2 / 6.6 / 5.4
    +DA-SIMU | | 6.4 / 4.5 / 11.8 / 9.2 | 3.7 / 3.0 / 6.6 / 5.1
    +DA-CSSL | | 6.2 / 4.2 / 11.7 / 8.5 | 3.7 / 2.9 / 6.3 / 4.9
    +MC-CSSL | | 6.2 / 4.5 / 11.6 / 9.0 | 3.5 / 2.8 / 5.7 / 4.8
    +DAMC-CSSL | | 5.8 / 3.7 / 11.0 / 7.7 | 3.4 / 2.6 / 5.4 / 4.2
    +DAMCSE-CSSL | | 6.2 / 4.3 / 11.6 / 8.6 | 3.5 / 2.8 / 6.3 / 4.8
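    As a worked reading of the abstract's headline claim: with the neural network language model on the single-channel et-r set in Table 2, the source-data pre-trained wav2vec2.0 baseline gives 11.3% WER while the proposed DAMC-CSSL gives 7.7%, i.e. a relative reduction of (11.3 − 7.7)/11.3 ≈ 32%, consistent with the reported figure of more than 30%.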

    Table 3  Word error rate (%) comparison on AMI SDM between the wav2vec2.0 self-supervised learning algorithm and the proposed CSSL algorithm

    Training strategy | Trigram LM (dev / test) | Neural network LM (dev / test)
    Source-data pre-training | 29.8 / 32.9 | 29.2 / 32.5
    +DA-CSSL | 28.2 / 31.4 | 26.9 / 30.2
    Continual pre-training | 29.2 / 32.5 | 28.0 / 31.3
    +DA-CSSL | 27.9 / 31.1 | 26.7 / 30.2
    +MC-CSSL | 28.8 / 32.3 | 27.7 / 31.2
    +DAMC-CSSL | 27.6 / 31.3 | 26.4 / 29.8

    Table 4  Word error rate (%) comparison on CHiME-4 between the proposed method and existing robust speech recognition algorithms

    Method | Model structure | Single channel (dt-r / et-r) | Six channels (dt-r / et-r)
    Kaldi[23] | Hybrid-TDNN | 5.6 / 11.4 | 1.9 / 2.7
    Du et al.[24] | Hybrid-DCNN | 4.6 / 9.2 | 2.1 / 2.2
    Wang et al.[25] | Hybrid-BLSTM | 3.5 / 6.8 | 1.5 / 2.0
    ESPNet[26] | E2E-Conformer | 11.7 / 20.6 | 7.9 / 14.2
    Guo et al.[13] | E2E-Transformer | 15.8 / 26.8 | —
    Tsunoo et al.[27] | E2E-Conformer | 8.4 / 15.7 | —
    Wang et al.[28] | E2E-Wav2vec2.0 | 5.0 / 9.0 | —
    Hubert[16] | E2E-Hubert-base | 5.5 / 10.8 | 3.3 / 5.5
    Proposed | E2E-Wav2vec2.0 | 3.7 / 7.7 | 2.0 / 3.5
  • [1] Anguera X, Wooters C, Hernando J. Acoustic beamforming for speaker diarization of meetings. IEEE Trans. Audio Speech Lang. Process., 2007; 15(7): 2011—2022 doi: 10.1109/TASL.2007.902460
    [2] Chen M J, Hu Z B, Chen L. A new wideband DOA estimation method based on weighted TOPS. Journal of Data Acquisition and Processing, 2019; 34(3): 453—461 doi: 10.16337/j.1004-9037.2019.03.009
    [3] Wang Z T, Sun X W, Li J F, et al. Minimum variance distortionless response beamforming under the approximate narrowband assumption. ACTA ACUSTICA, 2020; 45(2): 161—168 doi: 10.15949/j.cnki.0371-0025.2020.02.002
    [4] Shi Q, Chen H T, Zhang P Y. Speech enhancement based on a spatial mixture probability model initialized by direction of arrival. ACTA ACUSTICA, 2022; 47(1): 139—150 doi: 10.15949/j.cnki.0371-0025.2022.01.016
    [5] Higuchi T, Ito N, Yoshioka T, et al. Robust MVDR beamforming using time-frequency masks for online/offline ASR in noise. IEEE International Conference on Acoustics, Speech and Signal Processing, Shanghai, China, 2016: 5210—5214
    [6] Ke Y X, Li J, Peng R H, et al. Mask estimation in the spherical harmonic domain for adaptive beamforming speech enhancement. ACTA ACUSTICA, 2021; 46(1): 67—80 doi: 10.15949/j.cnki.0371-0025.2021.01.007
    [7] Gao F, Huang Z Y, Wang Z T, et al. Influence of early/late reverberation division on the speech recognition performance of the ideal ratio mask. ACTA ACUSTICA, 2019; 44(4): 788—795 doi: 10.15949/j.cnki.0371-0025.2019.04.041
    [8] Wang Z Q, Wang D L. A joint training framework for robust automatic speech recognition. IEEE/ACM Trans. Audio Speech Lang. Process., 2016; 24(4): 796—806 doi: 10.1109/TASLP.2016.2528171
    [9] Chen H. Research on noise robustness methods for automatic speech recognition. Ph.D. dissertation, Nanjing: Nanjing University of Posts and Telecommunications, 2014
    [10] Du J. Noise robustness methods in automatic speech recognition. Ph.D. dissertation, Hefei: University of Science and Technology of China, 2009
    [11] Bai C T, Cui X L, Zheng H J, et al. Robust speech recognition technology based on self-supervised knowledge transfer. Journal of Computer Applications, 2022; 42(10): 3217—3223 doi: 10.11772/j.issn.1001-9081.2021050808
    [12] Zhang K S, Zhao X F. Robust speech recognition based on adaptive deep neural networks in complex environments. Computer Engineering & Science, 2022; 44(6): 1105—1113 doi: 10.3969/j.issn.1007-130X.2022.06.019
    [13] Guo Y, Chen Y, Cheng G, et al. Far-field speech recognition based on complex-valued neural networks and inter-frame similarity difference method. IEEE Automatic Speech Recognition and Understanding Workshop, Cartagena, Colombia, 2021: 1003—1010
    [14] Fan C, Yi J, Tao J, et al. Gated recurrent fusion with joint training framework for robust end-to-end speech recognition. IEEE/ACM Trans. Audio Speech Lang. Process., 2021; 29: 198—209 doi: 10.1109/TASLP.2020.3039600
    [15] Baevski A, Zhou Y, Mohamed A, et al. wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in Neural Information Processing Systems, 2020; 33: 12449—12460
    [16] Hsu W N, Bolte B, Tsai Y H H, et al. Hubert: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM Trans. Audio Speech Lang. Process., 2021; 29: 3451—3460 doi: 10.1109/TASLP.2021.3122291
    [17] Gao C, Cheng G, Yang R, et al. Pre-training transformer decoder for end-to-end ASR model with unpaired text data. IEEE International Conference on Acoustics, Speech and Signal Processing, Toronto, Ontario, Canada, 2021: 6543—6547
    [18] Liu X, Zhang F, Hou Z, et al. Self-supervised learning: Generative or contrastive. IEEE Trans. Knowl. Data Eng., 2023; 35(1): 857—876 doi: 10.1109/TKDE.2021.3090866
    [19] Barker J, Marxer R, Vincent E, et al. The third ‘CHiME’ speech separation and recognition challenge: Dataset, task and baselines. IEEE Workshop on Automatic Speech Recognition and Understanding, Scottsdale, Arizona, USA, 2015: 504—511
    [20] Renals S, Hain T, Bourlard H. Interpretation of multiparty meetings: the AMI and AMIDA projects. 2008 Hands-Free Speech Communication and Microphone Arrays, IEEE, 2008: 115—118
    [21] Panayotov V, Chen G, Povey D, et al. Librispeech: an ASR corpus based on public domain audio books. IEEE International Conference on Acoustics, Speech and Signal Processing, South Brisbane, Queensland, Australia, 2015: 5206—5210
    [22] Ko T, Peddinti V, Povey D, et al. A study on data augmentation of reverberant speech for robust speech recognition. IEEE International Conference on Acoustics, Speech and Signal Processing, New Orleans, USA, 2017: 5220—5224
    [23] Chen S J, Subramanian A S, Xu H, et al. Building state-of-the-art distant speech recognition using the CHiME-4 challenge with a setup of speech enhancement baseline. Proc. Interspeech, ISCA, Hyderabad, 2018: 1571—1575
    [24] Du J, Tu Y H, Sun L, et al. The USTC-iFlytek system for CHiME-4 challenge. Proc. CHiME, 2016: 36—38
    [25] Wang Z Q, Wang P, Wang D L. Complex spectral mapping for single- and multi-channel speech enhancement and robust ASR. IEEE/ACM Trans. Audio. Speech. Lang. Process., 2020; 28: 1778—1787 doi: 10.1109/TASLP.2020.2998279
    [26] Guo P, Boyer F, Chang X, et al. Recent developments on espnet toolkit boosted by conformer. IEEE International Conference on Acoustics, Speech and Signal Processing, Toronto, Ontario, Canada, 2021: 5874—5878
    [27] Tsunoo E, Shibata K, Narisetty C, et al. Data augmentation methods for end-to-end speech recognition on distant-talk scenarios. Proc. Interspeech, ISCA, Brno, Czechia, 2021: 301—305
    [28] Wang H, Qian Y, Wang X, et al. Improving noise robustness of contrastive speech representation learning with speech reconstruction. IEEE International Conference on Acoustics, Speech and Signal Processing, Singapore, 2022: 6062—6066
Publication history
  • Received:  2022-06-02
  • Revised:  2022-10-28
  • Published:  2023-05-11
