Mispronunciation detection and diagnosis with acoustic pronunciation model aided modeling

LIU Zongming; WANG Li; LI Junfeng; ZHANG Pengyuan

doi:10.15949/j.cnki.0371-0025.2023.01.020

Volume 48 Issue 1

Jan. 2023


CSTR: 32049.14.11-2065.2023.01.020

Citation: LIU Zongming, WANG Li, LI Junfeng, ZHANG Pengyuan. Mispronunciation detection and diagnosis with acoustic pronunciation model aided modeling[J]. ACTA ACUSTICA, 2023, 48(1): 264-273. DOI: 10.15949/j.cnki.0371-0025.2023.01.020


1 Key Laboratory of Speech Acoustic and Content Understanding, Institute of Acoustics, Chinese Academy of Sciences, Beijing 100190
2 University of Chinese Academy of Sciences, Beijing 100049


Received Date: May 05, 2022
Revised Date: July 14, 2022
Available Online: January 17, 2023

Abstract

Expert-annotated data for Mispronunciation Detection and Diagnosis (MDD) are scarce. To model pronunciation regularities efficiently from limited data and thereby aid MDD systems, an acoustic pronunciation model that integrates acoustic and textual information is proposed. It models the mispronunciation generation process in a more theoretically complete way. Exploiting the acoustic correlation between different parts of this process, the model aids phoneme recognition by sharing its acoustic encoder parameters with the phoneme recognition model and optimizing the two jointly through multi-task learning. In addition, an acoustic confidence masking-prediction training approach is proposed to strengthen the correlation between the two tasks and improve the efficiency of aided modeling. Experiments show that the acoustic pronunciation model effectively captures mispronunciation regularities. With its aid in phoneme recognition modeling, the MDD system improved by 4.9%, 9.5%, and 14.0% in mispronunciation detection, diagnosis, and phoneme recognition, respectively. The acoustic confidence masking-prediction training method further improves the efficiency of aided modeling, and both the masking parameters and the multi-task learning parameters affect its effectiveness.
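
The two training ideas named in the abstract, confidence-based masking and a weighted multi-task objective, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not give the masking rule, the loss form, or any parameter values. The sketch assumes that tokens whose acoustic confidence falls below a threshold are masked, and that the two task losses are combined by linear interpolation. The names `MASK`, `mask_by_confidence`, `multitask_loss`, `threshold`, and `weight` are all hypothetical.

```python
from typing import List, Sequence

# Hypothetical mask symbol; the paper's actual token inventory is not given.
MASK = "<mask>"


def mask_by_confidence(
    phonemes: Sequence[str],
    confidences: Sequence[float],
    threshold: float,
) -> List[str]:
    """Replace phonemes whose confidence is below `threshold` with MASK.

    Assumption: low-confidence tokens are the ones masked, so that a
    prediction task would have to recover them from the remaining context.
    """
    if len(phonemes) != len(confidences):
        raise ValueError("phonemes and confidences must have the same length")
    return [p if c >= threshold else MASK for p, c in zip(phonemes, confidences)]


def multitask_loss(
    recognition_loss: float,
    pronunciation_loss: float,
    weight: float,
) -> float:
    """Linearly interpolate two task losses (an assumed form of the objective).

    `weight` scales the phoneme recognition loss and `1 - weight` scales the
    acoustic pronunciation model loss.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return weight * recognition_loss + (1.0 - weight) * pronunciation_loss


if __name__ == "__main__":
    # Toy values chosen only for illustration.
    print(mask_by_confidence(["dh", "ah", "s"], [0.9, 0.4, 0.8], threshold=0.5))
    print(multitask_loss(2.0, 4.0, weight=0.25))
```

In this sketch `threshold` stands in for the masking parameters and `weight` for the multi-task learning parameters that the abstract reports as influencing the effectiveness of aided modeling. In a real system both losses would be computed from networks sharing one acoustic encoder.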

