In this study, we propose a dynamic template adaptation approach for noise-robust sound classification and distance estimation in single-channel audio environments. Traditional cross-correlation methods rely on fixed sound templates, which limits their performance under dynamic and noisy conditions. Our method integrates a low-pass filter for noise reduction and uses an online support vector machine (SVM) to update the sound templates dynamically based on real-time audio inputs. This hybrid approach enables continuous refinement of the templates, improving both the accuracy of sound classification and the estimation of the relative distance between sound sources from time delays. The robustness and adaptability of the algorithm make it suitable for real-world applications such as environmental monitoring, speaker recognition, and sound event localization. We demonstrate the effectiveness of the proposed method in various noisy and overlapping sound scenarios and compare it with traditional approaches such as independent component analysis (ICA), non-negative matrix factorization (NMF), and time-frequency masking (TFM). The results show that dynamic template adaptation and incremental learning significantly improve classification accuracy and distance determination in changing environments. These findings indicate that the proposed method not only enhances real-time sound classification and distance determination but also holds potential for applications in autonomous vehicles, urban noise monitoring, and smart home systems, where robust audio processing in dynamic environments is critical.
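To make the pipeline concrete, the following is a minimal illustrative sketch, not the authors' implementation, of the processing chain described above. It assumes a Butterworth low-pass filter, peak-normalized cross-correlation features, and scikit-learn's SGDClassifier with hinge loss as a stand-in for the online linear SVM; the 16 kHz sampling rate, 4 kHz cutoff, exponential-moving-average template update, and all function names are assumptions for illustration only.

```python
# Illustrative sketch of the abstract's pipeline. Assumptions (not from
# the paper): scipy/scikit-learn implementations, 16 kHz audio, 4 kHz
# cutoff, and an EMA rule standing in for template adaptation.
import numpy as np
from scipy.signal import butter, lfilter, correlate
from sklearn.linear_model import SGDClassifier

FS = 16000  # sampling rate in Hz (assumed)

def lowpass(x, cutoff=4000.0, fs=FS, order=4):
    # Butterworth low-pass filter for noise reduction.
    b, a = butter(order, cutoff / (fs / 2), btype="low")
    return lfilter(b, a, x)

def correlation_features(x, templates):
    # Peak of the normalized cross-correlation against each template.
    feats = []
    for t in templates:
        c = correlate(x, t, mode="full")
        denom = np.linalg.norm(x) * np.linalg.norm(t) + 1e-12
        feats.append(np.max(np.abs(c)) / denom)
    return np.asarray(feats)

def time_delay(x1, x2, fs=FS):
    # Lag (in seconds) of the cross-correlation peak between two
    # segments; a larger delay maps to a larger relative distance.
    c = correlate(x1, x2, mode="full")
    lag = int(np.argmax(c)) - (len(x2) - 1)
    return lag / fs

# Hinge loss makes SGDClassifier a linear SVM that can be updated
# incrementally with partial_fit, i.e. an "online SVM".
clf = SGDClassifier(loss="hinge")

def process_segment(x, templates, label=None, n_classes=None, alpha=0.1):
    # Filter, extract features, then either classify or learn online.
    # `templates` is a list of float arrays; at least one labelled call
    # must precede the first unlabelled (prediction) call.
    x = lowpass(np.asarray(x, dtype=float))
    f = correlation_features(x, templates).reshape(1, -1)
    if label is None:
        return int(clf.predict(f)[0])
    clf.partial_fit(f, [label], classes=np.arange(n_classes))
    # Exponential moving average: adapt the matched template toward
    # the new input (the dynamic template adaptation step).
    n = min(len(templates[label]), len(x))
    templates[label][:n] = (1 - alpha) * templates[label][:n] + alpha * x[:n]
    return label
```

In this sketch, partial_fit supplies the incremental SVM update, while the moving average approximates the continuous template refinement described above; the feature set, update rule, and parameters used in the actual study may differ.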
dynamic template adaptation, noise-robust sound classification, distance determination, cross-correlation, single-channel audio processing, online support vector machine
Rezaul Tutul. Corresponding author. Doctor of Philosophy Candidate, Humboldt University of Berlin, Berlin, Germany. Email: rezaul.tutul@yahoo.com
André Jakob. Berliner Hochschule für Technik (BHT), Berlin, Germany
Ilona Buchem. Berliner Hochschule für Technik (BHT), Berlin, Germany
"All authors equally contributed to the conception, design, preparation, data gathering and analysis, and writing of the manuscript. All authors read and approved the final manuscript."
No potential conflict of interest was reported by the author(s).
This work was not supported by any funding.
Not applicable.
The author declares the use of Artificial Intelligence (AI) in writing this paper. In particular, the author used Paperpal to search for appropriate literature, summarize key points, and paraphrase ideas. The author takes full responsibility for ensuring proper review and editing of content generated using AI.
Abdoli, S., Cardinal, P., & Lameiras Koerich, A. (2019). End-to-end environmental sound classification using a 1D convolutional neural network. Expert Systems with Applications, 136, 252–263. https://doi.org/10.1016/j.eswa.2019.06.040
Adavanne, S., Politis, A., Nikunen, J., & Virtanen, T. (2019). Sound event localization and detection of overlapping sources using convolutional recurrent neural networks. IEEE Journal of Selected Topics in Signal Processing, 13(1), 34–48. https://doi.org/10.1109/JSTSP.2018.2885636
Bahmei, B., Birmingham, E., & Arzanpour, S. (2022). CNN-RNN and data augmentation using deep convolutional generative adversarial network for environmental sound classification. IEEE Signal Processing Letters, 29, 682–686. https://doi.org/10.1109/LSP.2022.3150258
Benesty, J., Chen, J., & Huang, Y. (2004). Time-delay estimation via linear interpolation and cross correlation. IEEE Transactions on Speech and Audio Processing, 12(5), 509–519. https://doi.org/10.1109/TSA.2004.833008
Carter, G. C. (1987). Coherence and time delay estimation. Proceedings of the IEEE, 75(2), 236–255. https://doi.org/10.1109/PROC.1987.13723
Chu, H.-C., Zhang, Y.-L., & Chiang, H.-C. (2023). A CNN sound classification mechanism using data augmentation. Sensors, 23(15), 6972. https://doi.org/10.3390/s23156972
Cordourier, H., Lopez Meyer, P., Huang, J., Del Hoyo Ontiveros, J., & Lu, H. (2019). GCC-PHAT cross-correlation audio features for simultaneous sound event localization and detection (SELD) on multiple rooms. Proceedings of the Detection and Classification of Acoustic Scenes and Events 2019 Workshop (DCASE2019), 55–58. https://doi.org/10.33682/3re4-nd65
Dai, W., Dai, C., Qu, S., Li, J., & Das, S. (2017). Very deep convolutional neural networks for raw waveforms. 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 421–425. https://doi.org/10.1109/ICASSP.2017.7952190
Fang, Z., Yin, B., Du, Z., & Huang, X. (2022). Fast environmental sound classification based on resource adaptive convolutional neural network. Scientific Reports, 12(1), 6599. https://doi.org/10.1038/s41598-022-10382-x
Gupta, C., Kamath, P., & Wyse, L. (2021). Signal representations for synthesizing audio textures with generative adversarial networks. Proceedings of the 18th Sound and Music Computing Conference. https://doi.org/10.5281/zenodo.5054145
Knapp, C., & Carter, G. (1976). The generalized correlation method for estimation of time delay. IEEE Transactions on Acoustics, Speech, and Signal Processing, 24(4), 320–327. https://doi.org/10.1109/TASSP.1976.1162830
Li, S., Yao, Y., Hu, J., Liu, G., Yao, X., & Hu, J. (2018). An ensemble stacked convolutional neural network model for environmental event sound recognition. Applied Sciences, 8(7), 1152. https://doi.org/10.3390/app8071152
Liu, M., Zeng, Q., Jian, Z., Peng, Y., & Nie, L. (2023). A sound source localization method based on improved second correlation time delay estimation. Measurement Science and Technology, 34(4), 045102. https://doi.org/10.1088/1361-6501/aca5a6
Maghfira, T. N., Basaruddin, T., & Krisnadhi, A. (2020). Infant cry classification using CNN-RNN. Journal of Physics: Conference Series, 1528(1), 012019. https://doi.org/10.1088/1742-6596/1528/1/012019
Narayana Murthy, B. H. V. S., Yegnanarayana, B., & Kadiri, S. R. (2020). Time delay estimation from mixed multispeaker speech signals using single frequency filtering. Circuits, Systems, and Signal Processing, 39(4), 1988–2005. https://doi.org/10.1007/s00034-019-01239-2
Raina, A., & Arora, V. (2023). SyncNet: Correlating objective for time delay estimation in audio signals. ICASSP 2023 – 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 1–5. https://doi.org/10.1109/ICASSP49357.2023.10096874
Ren, Z., Kong, Q., Han, J., Plumbley, M. D., & Schuller, B. W. (2021). CAA-Net: Conditional atrous CNNs with attention for explainable device-robust acoustic scene classification. IEEE Transactions on Multimedia, 23, 4131–4142. https://doi.org/10.1109/TMM.2020.3037534
Salih, A. O. M. (2017). Audio noise reduction using low pass filters. OALib, 4(11), 1–7. https://doi.org/10.4236/oalib.1103709
Shimada, K., Koyama, Y., Takahashi, N., Takahashi, S., & Mitsufuji, Y. (2021). ACCDOA: Activity-coupled Cartesian direction of arrival representation for sound event localization and detection. ICASSP 2021 – 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 915–919. https://doi.org/10.1109/ICASSP39728.2021.9413609
Sun, Z., Bao, C., Jia, M., & Bu, B. (2014). Relative distance estimation in multi-channel spatial audio signal. 2014 International Conference on Audio, Language and Image Processing, 35–38. https://doi.org/10.1109/ICALIP.2014.7009752
Tho Nguyen, T. N., Jones, D. L., Watcharasupat, K. N., Phan, H., & Gan, W.-S. (2022). SALSA-Lite: A fast and effective feature for polyphonic sound event localization and detection with microphone arrays. ICASSP 2022 – 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 716–720. https://doi.org/10.1109/ICASSP43922.2022.9746132
Tutul, R., Buchem, I., Jakob, A., & Pinkwart, N. (2024). Enhancing learner motivation, engagement, and enjoyment through sound-recognizing humanoid robots in quiz-based educational games. In C. Biele et al. (Eds.), Digital interaction and machine intelligence: MIDI 2023 (Lecture Notes in Networks and Systems, Vol. 1076). Springer, Cham. https://doi.org/10.1007/978-3-031-66594-3_13
Venkatesan, R., & Ganesh, A. B. (2020). Analysis of monaural and binaural statistical properties for the estimation of distance of a target speaker. Circuits, Systems, and Signal Processing, 39(7), 3626–3651. https://doi.org/10.1007/s00034-019-01333-5
Virtanen, T. (2007). Monaural sound source separation by nonnegative matrix factorization with temporal continuity and sparseness criteria. IEEE Transactions on Audio, Speech, and Language Processing, 15(3), 1066–1074. https://doi.org/10.1109/TASL.2006.885253
Wang, C., Jia, M., & Zhang, X. (2023). Deep encoder/decoder dual-path neural network for speech separation in noisy reverberation environments. EURASIP Journal on Audio, Speech, and Music Processing, 2023(1), 41. https://doi.org/10.1186/s13636-023-00307-5
Cite this article:
Tutul, R., Jakob, A., & Buchem, I. (2025). A dynamic template adaptation approach for noise-robust sound classification and distance determination in single-channel audio. International Journal of Science, Technology, Engineering and Mathematics, 5(1), 22–41. https://doi.org/10.53378/ijstem.353154
License:
This work is licensed under a Creative Commons Attribution (CC BY 4.0) International License.