Indexed in EI / SCOPUS / CSCD

Chinese Core Journal

Two-stage training method for far-field speech enhancement model using objective sound quality metric

  • Abstract: Speech enhancement aims to extract clean speech from background interference to improve its intelligibility and perceptual quality. Although deep learning-based models perform well in this field, data are significantly degraded in far-field scenarios (where the speaker is distant from the microphone), so commonly used training schemes struggle to produce effective enhancement. To address this problem, a two-stage training method is proposed. In the first stage, supervised learning initializes the model parameters: by modeling real recorded data, the model forms a preliminary estimate of the energy at each time-frequency point. The output signal of this stage, however, contains strong phase distortion and artifacts. In the second stage, the perceptual evaluation of speech quality (PESQ) metric is introduced, and reinforcement learning is used to work around the non-differentiability of this objective metric. This effectively mitigates the first-stage problems and raises PESQ by 0.10 on real recorded data. Ablation experiments on three typical speech enhancement models show that, whether a masking or a mapping inference strategy is used, second-stage training effectively alleviates the phase distortion in the first-stage inference results and improves the perceptual quality of recorded speech.
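The abstract does not say which reinforcement learning algorithm is used in the second stage. As a rough illustration only, the sketch below uses a score-function (REINFORCE-style) gradient estimate, one common way to optimize a metric that cannot be differentiated. Everything in it is an assumption made for illustration: the per-frequency sigmoid mask, the Gaussian sampling policy, the names `stand_in_metric` and `reinforce_gradient`, and the rounded-error score, which stands in for PESQ and is not PESQ.

```python
import math
import random


def stand_in_metric(enhanced, clean):
    """Hypothetical non-differentiable score (NOT PESQ); higher is better."""
    err = sum(abs(e - c) for e, c in zip(enhanced, clean)) / len(clean)
    return -round(err, 2)  # rounding makes the score piecewise constant


def reinforce_gradient(theta, noisy, clean, metric, rng, n_samples=16, sigma=0.1):
    """Score-function (REINFORCE) estimate of d E[metric] / d theta."""
    mean_mask = [1.0 / (1.0 + math.exp(-t)) for t in theta]  # sigmoid mask
    samples, rewards = [], []
    for _ in range(n_samples):
        # Gaussian policy centred on the model's mask output.
        mask = [m + sigma * rng.gauss(0.0, 1.0) for m in mean_mask]
        enhanced = [a * x for a, x in zip(mask, noisy)]
        samples.append(mask)
        rewards.append(metric(enhanced, clean))  # metric is only evaluated, never differentiated
    baseline = sum(rewards) / n_samples  # mean-reward baseline to reduce variance
    grad = []
    for k, m in enumerate(mean_mask):
        # d log N(mask; m, sigma^2) / d m = (mask - m) / sigma^2
        g = sum((r - baseline) * (s[k] - m) / sigma ** 2
                for r, s in zip(rewards, samples)) / n_samples
        grad.append(g * m * (1.0 - m))  # chain rule through the sigmoid
    return grad
```

In this sketch a second-stage update would be a gradient ascent step such as `theta[k] += lr * grad[k]`, starting from parameters obtained in the supervised first stage. The metric is only called on sampled outputs, so it never needs a derivative.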

     
