Short-time acoustic scene recognition method using multi-scale feature fusion

WANG Meng; ZHANG Pengyuan

doi:10.15949/j.cnki.0371-0025.2022.06.002

Volume 47 Issue 6

Nov. 2022

CSTR: 32049.14.11-2065.2022.06.002

WANG Meng, ZHANG Pengyuan. Short-time acoustic scene recognition method using multi-scale feature fusion[J]. ACTA ACUSTICA, 2022, 47(6): 717-726. DOI: 10.15949/j.cnki.0371-0025.2022.06.002

WANG Meng^1,2
ZHANG Pengyuan^1,2

1 Key Laboratory of Speech Acoustics and Content Understanding, Institute of Acoustics, Chinese Academy of Sciences, Beijing 100190
2 University of Chinese Academy of Sciences, Beijing 100049

Received Date: October 25, 2021
Revised Date: March 29, 2022
Available Online: November 04, 2022

Abstract

To address poor recognition performance in the short-time acoustic scene recognition task, a method based on multi-scale feature fusion is proposed. First, the method takes the sum and difference of the left and right channels of the stereo audio as input, and uses a long frame length when framing the signal so that each extracted frame-level feature contains enough audio information. Then, the features are fed frame by frame into a one-dimensional convolutional neural network that uses multi-scale feature fusion to make full use of the shallow, middle, and deep embeddings at different scales in the network. Finally, all frame-level soft labels are integrated to obtain the scene label of the audio. Experimental results show that the method reaches an accuracy of 79.02% on the Detection and Classification of Acoustic Scenes and Events (DCASE) 2021 short-time audio scene dataset, which is state-of-the-art performance on this dataset to date.

Keywords:
- Acoustic scene recognition
- Multi-scale feature fusion
- Frame level
- One-dimensional convolution
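
The processing steps named in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not give the frame length, the hop size, the fusion operator, or the rule used to integrate the soft labels. The function names below are hypothetical, fusion is assumed to be global average pooling over time followed by concatenation, and integration is assumed to be a plain average of the frame-level class probabilities.

```python
import numpy as np


def sum_difference(stereo):
    """Turn a (2, n_samples) stereo signal into its sum and difference channels."""
    left, right = stereo[0], stereo[1]
    return np.stack([left + right, left - right])


def frame_signal(x, frame_len, hop):
    """Split a (channels, n_samples) signal into (n_frames, channels, frame_len).

    A trailing partial frame is dropped. frame_len and hop are free
    parameters here; the abstract only says the frame length is long.
    """
    n_samples = x.shape[-1]
    if n_samples < frame_len:
        raise ValueError("signal is shorter than one frame")
    n_frames = 1 + (n_samples - frame_len) // hop
    return np.stack([x[:, i * hop:i * hop + frame_len] for i in range(n_frames)])


def fuse_multiscale(feature_maps):
    """Fuse feature maps taken at different network depths (assumed scheme).

    Each map has shape (channels_k, time_k). Every map is averaged over
    time and the resulting vectors are concatenated into one embedding.
    """
    return np.concatenate([fm.mean(axis=-1) for fm in feature_maps])


def aggregate_soft_labels(frame_probs):
    """Combine (n_frames, n_classes) soft labels by averaging (assumed rule).

    Returns the index of the winning class and the averaged probabilities.
    """
    mean_probs = frame_probs.mean(axis=0)
    return int(np.argmax(mean_probs)), mean_probs
```

In a full system the output of `frame_signal` would be passed frame by frame through the one-dimensional convolutional network, with `fuse_multiscale` applied to its shallow, middle, and deep activations before the classifier, and `aggregate_soft_labels` applied to the per-frame outputs of one clip.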
