Two-stage training method for far-field speech enhancement model using objective sound quality metric

LEI Tong; HU Qinwen; LU Jing; CHEN Kai

doi:10.12395/0371-0025.2024393

LEI Tong, HU Qinwen, LU Jing, CHEN Kai. Two-stage training method for far-field speech enhancement model using objective sound quality metric[J]. ACTA ACUSTICA. DOI: 10.12395/0371-0025.2024393



Abstract


Speech enhancement aims to extract clean speech from background interference in order to improve its intelligibility and perceptual quality. Although deep learning-based models perform remarkably well in this field, the severe signal degradation in far-field scenarios (where the speaker is distant from the microphone) makes it difficult for commonly adopted training schemes to achieve effective enhancement. To address this issue, this paper proposes a two-stage training method. In the first stage, supervised learning initializes the model parameters: the model is trained on recorded data to obtain a preliminary estimate of the energy magnitude at each time-frequency point. The output signal of this stage, however, contains strong phase distortion and artifacts. In the second stage, the perceptual evaluation of speech quality (PESQ) metric is introduced as the training objective, and reinforcement learning is employed to handle the fact that this objective metric is non-differentiable. The second stage effectively mitigates the problems left by the first stage and improves speech quality, yielding a PESQ increase of 0.10. Ablation experiments on three typical speech enhancement models show that, regardless of whether a masking or a mapping inference strategy is used, second-stage training effectively alleviates the phase distortion in the first-stage inference results and improves the perceptual quality of recorded speech.
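The two-stage idea can be illustrated with a toy sketch. Everything in it is an assumption made for illustration, not the paper's implementation: the "model" is a single gain parameter, `toy_metric` is a hypothetical piecewise-constant stand-in for PESQ (it is not PESQ), the supervised label of 0.5 is an arbitrary proxy target, and the second stage uses a REINFORCE-style score-function estimator, since the abstract does not name the specific reinforcement learning algorithm.

```python
import random


def toy_metric(gain, ideal=0.8):
    """Hypothetical stand-in for PESQ (NOT the real metric).

    It is piecewise constant, so its gradient is zero almost everywhere
    and ordinary backpropagation cannot optimize it directly.
    """
    return -round(abs(gain - ideal) * 10) / 10


def stage1_supervised(label=0.5, steps=100, lr=0.1):
    """Stage 1: gradient descent on a differentiable loss (theta - label)^2.

    The label is an assumed proxy target that does not coincide with the
    optimum of the metric, mimicking a supervised loss that leaves
    quality on the table.
    """
    theta = 0.0
    for _ in range(steps):
        theta -= lr * 2.0 * (theta - label)
    return theta


def stage2_reinforce(theta, steps=300, batch=32, sigma=0.2, lr=0.05, seed=0):
    """Stage 2: score-function (REINFORCE-style) ascent on the metric.

    Actions are sampled as a = theta + sigma * e with e ~ N(0, 1).
    For a Gaussian policy, d/dtheta log p(a) = e / sigma, so the update
    only needs metric values, never metric gradients.
    """
    rng = random.Random(seed)
    for _ in range(steps):
        noise = [rng.gauss(0.0, 1.0) for _ in range(batch)]
        rewards = [toy_metric(theta + sigma * e) for e in noise]
        baseline = sum(rewards) / batch  # batch mean reduces variance
        grad = sum((r - baseline) * e
                   for r, e in zip(rewards, noise)) / (batch * sigma)
        theta += lr * grad
    return theta


theta1 = stage1_supervised()
theta2 = stage2_reinforce(theta1)
print("stage 1:", theta1, toy_metric(theta1))
print("stage 2:", theta2, toy_metric(theta2))
```

In this toy setting the first stage settles at the supervised label, and the second stage moves the parameter toward the region where the non-differentiable score is highest. A real system would replace the scalar gain with a network's output distribution and the toy score with PESQ computed on enhanced waveforms.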
