In this study, we propose a dynamic template adaptation approach for noise-robust sound classification and distance estimation in single-channel audio environments. Traditional cross-correlation methods rely on fixed sound templates, which limits their performance under dynamic and noisy conditions. Our method integrates a low-pass filter for noise reduction and uses an online support vector machine (SVM) to update the sound templates dynamically based on real-time audio inputs. This hybrid approach enables continuous refinement of the templates, improving both the accuracy of sound classification and the estimation of the relative distance between sound sources from time delays. The robustness and adaptability of the algorithm make it suitable for real-world applications such as environmental monitoring, speaker recognition, and sound event localization. We demonstrate the effectiveness of the proposed method in various noisy and overlapping sound scenarios and compare it with traditional approaches such as independent component analysis (ICA), non-negative matrix factorization (NMF), and time-frequency masking (TFM). The results show that dynamic template adaptation and incremental learning significantly improve classification accuracy and distance determination in changing environments. These findings indicate that the proposed method not only enhances real-time sound classification and distance determination but also holds potential for applications in autonomous vehicles, urban noise monitoring, and smart home systems, where robust audio processing in dynamic environments is critical.
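To make the pipeline concrete, the following is a minimal illustrative sketch, not the authors' implementation, of the processing chain described above. It assumes a Butterworth low-pass filter, peak-normalized cross-correlation features, and scikit-learn's SGDClassifier with hinge loss as a stand-in for the online linear SVM; the 16 kHz sampling rate, 4 kHz cutoff, exponential-moving-average template update, and all function names are assumptions for illustration only.

```python
# Illustrative sketch of the abstract's pipeline. Assumptions (not from
# the paper): scipy/scikit-learn implementations, 16 kHz audio, 4 kHz
# cutoff, and an EMA rule standing in for template adaptation.
import numpy as np
from scipy.signal import butter, lfilter, correlate
from sklearn.linear_model import SGDClassifier

FS = 16000  # sampling rate in Hz (assumed)

def lowpass(x, cutoff=4000.0, fs=FS, order=4):
    # Butterworth low-pass filter for noise reduction.
    b, a = butter(order, cutoff / (fs / 2), btype="low")
    return lfilter(b, a, x)

def correlation_features(x, templates):
    # Peak of the normalized cross-correlation against each template.
    feats = []
    for t in templates:
        c = correlate(x, t, mode="full")
        denom = np.linalg.norm(x) * np.linalg.norm(t) + 1e-12
        feats.append(np.max(np.abs(c)) / denom)
    return np.asarray(feats)

def time_delay(x1, x2, fs=FS):
    # Lag (in seconds) of the cross-correlation peak between two
    # segments; a larger delay maps to a larger relative distance.
    c = correlate(x1, x2, mode="full")
    lag = int(np.argmax(c)) - (len(x2) - 1)
    return lag / fs

# Hinge loss makes SGDClassifier a linear SVM that can be updated
# incrementally with partial_fit, i.e. an "online SVM".
clf = SGDClassifier(loss="hinge")

def process_segment(x, templates, label=None, n_classes=None, alpha=0.1):
    # Filter, extract features, then either classify or learn online.
    # `templates` is a list of float arrays; at least one labelled call
    # must precede the first unlabelled (prediction) call.
    x = lowpass(np.asarray(x, dtype=float))
    f = correlation_features(x, templates).reshape(1, -1)
    if label is None:
        return int(clf.predict(f)[0])
    clf.partial_fit(f, [label], classes=np.arange(n_classes))
    # Exponential moving average: adapt the matched template toward
    # the new input (the dynamic template adaptation step).
    n = min(len(templates[label]), len(x))
    templates[label][:n] = (1 - alpha) * templates[label][:n] + alpha * x[:n]
    return label
```

In this sketch, partial_fit supplies the incremental SVM update, while the moving average approximates the continuous template refinement described above; the feature set, update rule, and parameters used in the actual study may differ.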
dynamic template adaptation, noise-robust sound classification, distance determination, cross-correlation, single-channel audio processing, online support vector machine
Rezaul Tutul. Corresponding author. Doctor of Philosophy Candidate, Humboldt University of Berlin, Berlin, Germany. Email: rezaul.tutul@yahoo.com
André Jakob. Berliner Hochschule für Technik (BHT), Berlin, Germany
Ilona Buchem. Berliner Hochschule für Technik (BHT), Berlin, Germany
"All authors equally contributed to the conception, design, preparation, data gathering and analysis, and writing of the manuscript. All authors read and approved the final manuscript."
No potential conflict of interest was reported by the author(s).
This work was not supported by any funding.
Not applicable.
The author declares the use of Artificial Intelligence (AI) in writing this paper. In particular, the author used Paperpal to search for appropriate literature, summarize key points, and paraphrase ideas. The author takes full responsibility for ensuring proper review and editing of content generated using AI.
Abdoli, S., Cardinal, P., & Lameiras Koerich, A. (2019). End-to-end environmental sound classification using a 1D convolutional neural network. Expert Systems with Applications, 136, 252–263. https://doi.org/10.1016/j.eswa.2019.06.040
Adavanne, S., Politis, A., Nikunen, J., & Virtanen, T. (2019). Sound event localization and detection of overlapping sources using convolutional recurrent neural networks. IEEE Journal of Selected Topics in Signal Processing, 13(1), 34–48. https://doi.org/10.1109/JSTSP.2018.2885636
Bahmei, B., Birmingham, E., & Arzanpour, S. (2022). CNN-RNN and data augmentation using deep convolutional generative adversarial network for environmental sound classification. IEEE Signal Processing Letters, 29, 682–686. https://doi.org/10.1109/LSP.2022.3150258
Benesty, J., Chen, J., & Huang, Y. (2004). Time-delay estimation via linear interpolation and cross correlation. IEEE Transactions on Speech and Audio Processing, 12(5), 509–519. https://doi.org/10.1109/TSA.2004.833008
Carter, G. C. (1987). Coherence and time delay estimation. Proceedings of the IEEE, 75(2), 236–255. https://doi.org/10.1109/PROC.1987.13723
Chu, H.-C., Zhang, Y.-L., & Chiang, H.-C. (2023). A CNN sound classification mechanism using data augmentation. Sensors, 23(15), 6972. https://doi.org/10.3390/s23156972
Cordourier, H., Lopez Meyer, P., Huang, J., Del Hoyo Ontiveros, J., & Lu, H. (2019). GCC-PHAT cross-correlation audio features for simultaneous sound event localization and detection (SELD) on multiple rooms. Proceedings of the Detection and Classification of Acoustic Scenes and Events 2019 Workshop (DCASE2019), 55–58. https://doi.org/10.33682/3re4-nd65
Dai, W., Dai, C., Qu, S., Li, J., & Das, S. (2017). Very deep convolutional neural networks for raw waveforms. 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 421–425. https://doi.org/10.1109/ICASSP.2017.7952190
Fang, Z., Yin, B., Du, Z., & Huang, X. (2022). Fast environmental sound classification based on resource adaptive convolutional neural network. Scientific Reports, 12(1), 6599. https://doi.org/10.1038/s41598-022-10382-x
Gupta, C., Kamath, P., & Wyse, L. (2021). Signal representations for synthesizing audio textures with generative adversarial networks. Proceedings of the 18th Sound and Music Computing Conference. https://doi.org/10.5281/zenodo.5054145
Knapp, C., & Carter, G. (1976). The generalized correlation method for estimation of time delay. IEEE Transactions on Acoustics, Speech, and Signal Processing, 24(4), 320–327. https://doi.org/10.1109/TASSP.1976.1162830
Li, S., Yao, Y., Hu, J., Liu, G., Yao, X., & Hu, J. (2018). An ensemble stacked convolutional neural network model for environmental event sound recognition. Applied Sciences, 8(7), 1152. https://doi.org/10.3390/app8071152
Liu, M., Zeng, Q., Jian, Z., Peng, Y., & Nie, L. (2023). A sound source localization method based on improved second correlation time delay estimation. Measurement Science and Technology, 34(4), 045102. https://doi.org/10.1088/1361-6501/aca5a6
Maghfira, T. N., Basaruddin, T., & Krisnadhi, A. (2020). Infant cry classification using CNN-RNN. Journal of Physics: Conference Series, 1528(1), 012019. https://doi.org/10.1088/1742-6596/1528/1/012019
Narayana Murthy, B. H. V. S., Yegnanarayana, B., & Kadiri, S. R. (2020). Time delay estimation from mixed multispeaker speech signals using single frequency filtering. Circuits, Systems, and Signal Processing, 39(4), 1988–2005. https://doi.org/10.1007/s00034-019-01239-2
Raina, A., & Arora, V. (2023). SyncNet: Correlating objective for time delay estimation in audio signals. ICASSP 2023 – 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 1–5. https://doi.org/10.1109/ICASSP49357.2023.10096874
Ren, Z., Kong, Q., Han, J., Plumbley, M. D., & Schuller, B. W. (2021). CAA-Net: Conditional atrous CNNs with attention for explainable device-robust acoustic scene classification. IEEE Transactions on Multimedia, 23, 4131–4142. https://doi.org/10.1109/TMM.2020.3037534
Salih, A. O. M. (2017). Audio noise reduction using low pass filters. OALib, 4(11), 1–7. https://doi.org/10.4236/oalib.1103709
Shimada, K., Koyama, Y., Takahashi, N., Takahashi, S., & Mitsufuji, Y. (2021). ACCDOA: Activity-coupled Cartesian direction of arrival representation for sound event localization and detection. ICASSP 2021 – 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 915–919. https://doi.org/10.1109/ICASSP39728.2021.9413609
Sun, Z., Bao, C., Jia, M., & Bu, B. (2014). Relative distance estimation in multi-channel spatial audio signal. 2014 International Conference on Audio, Language and Image Processing, 35–38. https://doi.org/10.1109/ICALIP.2014.7009752
Tho Nguyen, T. N., Jones, D. L., Watcharasupat, K. N., Phan, H., & Gan, W.-S. (2022). SALSA-Lite: A fast and effective feature for polyphonic sound event localization and detection with microphone arrays. ICASSP 2022 – 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 716–720. https://doi.org/10.1109/ICASSP43922.2022.9746132
Tutul, R., Buchem, I., Jakob, A., & Pinkwart, N. (2024). Enhancing learner motivation, engagement, and enjoyment through sound-recognizing humanoid robots in quiz-based educational games. In C. Biele et al. (Eds.), Digital interaction and machine intelligence: MIDI 2023 (Lecture Notes in Networks and Systems, Vol. 1076). Springer, Cham. https://doi.org/10.1007/978-3-031-66594-3_13
Venkatesan, R., & Ganesh, A. B. (2020). Analysis of monaural and binaural statistical properties for the estimation of distance of a target speaker. Circuits, Systems, and Signal Processing, 39(7), 3626–3651. https://doi.org/10.1007/s00034-019-01333-5
Virtanen, T. (2007). Monaural sound source separation by nonnegative matrix factorization with temporal continuity and sparseness criteria. IEEE Transactions on Audio, Speech, and Language Processing, 15(3), 1066–1074. https://doi.org/10.1109/TASL.2006.885253
Wang, C., Jia, M., & Zhang, X. (2023). Deep encoder/decoder dual-path neural network for speech separation in noisy reverberation environments. EURASIP Journal on Audio, Speech, and Music Processing, 2023(1), 41. https://doi.org/10.1186/s13636-023-00307-5
Cite this article:
Tutul, R., Jakob, A., & Buchem, I. (2025). A dynamic template adaptation approach for noise-robust sound classification and distance determination in single-channel audio. International Journal of Science, Technology, Engineering and Mathematics, 5(1), 22–41. https://doi.org/10.53378/ijstem.353154
License:
This work is licensed under a Creative Commons Attribution (CC BY 4.0) International License.