A non-invasive speech quality evaluation algorithm combining auxiliary target learning and convolutional recurrent network

TANG Guichen; LIANG Ruiyu; KONG Fanliu; XIE Yue; JU Mengjie

doi:10.15949/j.cnki.0371-0025.2022.05.003

TANG Guichen, LIANG Ruiyu, KONG Fanliu, XIE Yue, JU Mengjie. A non-invasive speech quality evaluation algorithm combining auxiliary target learning and convolutional recurrent network[J]. ACTA ACUSTICA, 2022, 47(5): 692-702. DOI: 10.15949/j.cnki.0371-0025.2022.05.003

Abstract

Objective speech quality evaluation can replace expensive manual scoring, but current objective metrics usually require clean reference speech, which is difficult to obtain in many practical acoustic systems. A non-invasive speech quality evaluation algorithm combining auxiliary target learning and a Convolutional Recurrent Network (CRN) is proposed. Bark Frequency Cepstral Coefficients (BFCCs), which are based on human-like auditory filters, are used as the input of the CRN to effectively reduce the network complexity. First, a Convolutional Neural Network (CNN) extracts frame-level features from the BFCCs. Then, Bidirectional Long Short-Term Memory (BiLSTM) networks model the long-term temporal dependencies and sequence characteristics of the frame-level features. Finally, a self-attention mechanism is introduced into the CRN to adaptively extract useful information from the frame-level features, which is then aggregated into sentence-level features and mapped to the final objective score. In addition, a multi-task training strategy is adopted, with Voice Activity Detection (VAD) introduced as an auxiliary learning target to improve the performance of the algorithm. Experiments on public databases show that, compared with other non-invasive algorithms, the proposed algorithm correlates better with the Mean Opinion Score (MOS). Moreover, it has a small parameter size and generalizes well to the distorted speech database with MOS released by ITU-T P.808, where its accuracy is close to that of the Perceptual Evaluation of Speech Quality (PESQ).
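Two steps named in the abstract can be illustrated with a minimal sketch: pooling frame-level features into a sentence-level vector with attention weights, and combining the MOS objective with an auxiliary VAD objective. The abstract gives no layer sizes, loss definitions, or weighting, so everything below is an assumption for illustration rather than the authors' implementation: the function names `attention_pool` and `multitask_loss`, the squared-error and cross-entropy terms, and the balancing factor `aux_weight` are all hypothetical.

```python
import math


def attention_pool(frames, scores):
    """Collapse frame-level feature vectors into one sentence-level vector.

    frames: list of T feature vectors (each a list of D floats)
    scores: list of T unnormalized attention scores, one per frame
    (in a real model the scores would come from a learned layer; here
    they are simply passed in, which is an assumption of this sketch)
    """
    # Softmax over time, shifted by the max score for numerical stability.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of the frames, one feature dimension at a time.
    dim = len(frames[0])
    pooled = [sum(w * f[d] for w, f in zip(weights, frames)) for d in range(dim)]
    return pooled, weights


def multitask_loss(mos_pred, mos_true, vad_probs, vad_labels, aux_weight=0.5):
    """Squared error on the MOS plus a weighted frame-level VAD cross-entropy.

    aux_weight is a hypothetical balancing factor, not a value from the paper.
    """
    mos_term = (mos_pred - mos_true) ** 2
    eps = 1e-7  # keeps log() finite when a probability is exactly 0 or 1
    vad_term = 0.0
    for p, y in zip(vad_probs, vad_labels):
        p = min(max(p, eps), 1.0 - eps)
        vad_term -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    vad_term /= len(vad_labels)
    return mos_term + aux_weight * vad_term


# Three frames with two features each; equal scores give equal weights.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
pooled, weights = attention_pool(frames, [0.0, 0.0, 0.0])

# A perfect MOS prediction still incurs the auxiliary VAD penalty.
loss = multitask_loss(3.0, 3.0, [0.5, 0.5], [1, 0])
```

In this sketch, frames with higher attention scores contribute more to the sentence-level vector, and the auxiliary term pushes the shared features to also separate speech from non-speech frames.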
