LIU Zuozhen, WU Chou, LI Ta, ZHAO Qingwei. Keyword-dependent monaural speech enhancement for open-vocabulary keyword spotting[J]. ACTA ACUSTICA, 2023, 48(2): 415-424. DOI: 10.15949/j.cnki.0371-0025.2023.02.012

Keyword-dependent monaural speech enhancement for open-vocabulary keyword spotting

Abstract: A monaural speech enhancement method for open-vocabulary keyword spotting is proposed. The method stores keyword phoneme information in a text encoding matrix in advance, and adds an attention-based phoneme bias module on top of a conventional speech enhancement model. This module uses the intermediate features of the speech enhancement model to retrieve the phoneme information of the current frame from the text encoding matrix, and integrates it into the model's subsequent computation, thereby improving enhancement of keyword-related phonemes. Experimental results under different noise conditions show that the proposed method suppresses noise in the keyword segments more effectively and recovers speech details better. Meanwhile, the proposed method achieves a 14.3% relative improvement in open-vocabulary keyword spotting over conventional speech enhancement methods, and a 7.6% relative improvement over other text-dependent speech enhancement methods.
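
To make the described phoneme bias module concrete, the following is a minimal PyTorch sketch of one way such an attention-based module could be wired: each frame of the enhancement model's intermediate features acts as a query against the precomputed keyword text encoding matrix, and the retrieved phoneme information is fused back into the features. The class name, layer sizes, and additive fusion below are illustrative assumptions, not the paper's actual implementation.

    import torch
    import torch.nn as nn

    class PhonemeBiasModule(nn.Module):
        """Hypothetical sketch: cross-attention from speech enhancement (SE)
        features (queries) to a precomputed keyword text encoding matrix
        (keys/values), in the spirit of the module described in the abstract."""

        def __init__(self, feat_dim: int, text_dim: int, attn_dim: int = 128):
            super().__init__()
            self.q_proj = nn.Linear(feat_dim, attn_dim)    # SE features -> queries
            self.k_proj = nn.Linear(text_dim, attn_dim)    # phoneme encodings -> keys
            self.v_proj = nn.Linear(text_dim, attn_dim)    # phoneme encodings -> values
            self.out_proj = nn.Linear(attn_dim, feat_dim)  # retrieved info -> feature space

        def forward(self, features: torch.Tensor, text_enc: torch.Tensor) -> torch.Tensor:
            # features: (batch, frames, feat_dim) intermediate SE-model features
            # text_enc: (batch, phonemes, text_dim) keyword text encoding matrix
            q = self.q_proj(features)                     # (B, T, A)
            k = self.k_proj(text_enc)                     # (B, P, A)
            v = self.v_proj(text_enc)                     # (B, P, A)
            scores = torch.matmul(q, k.transpose(1, 2))   # (B, T, P)
            attn = torch.softmax(scores / k.shape[-1] ** 0.5, dim=-1)  # per-frame phoneme weights
            bias = self.out_proj(torch.matmul(attn, v))   # (B, T, feat_dim)
            # Fuse the retrieved phoneme information into the SE features
            # (additive fusion is an assumption; concatenation would also work).
            return features + bias

    # Example: 200 frames of 256-dim SE features, a 12-phoneme keyword encoded in 64 dims.
    module = PhonemeBiasModule(feat_dim=256, text_dim=64)
    out = module(torch.randn(1, 200, 256), torch.randn(1, 12, 64))  # -> (1, 200, 256)

Because the attention weights are computed per frame, each frame can attend to different keyword phonemes, which matches the abstract's description of retrieving the phoneme information of the current frame before it enters the model's subsequent computation.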
