International Journal of Emerging Research in Science, Engineering, and Management
Vol. 2, Issue 1, pp. 226-233, January 2026.

https://doi.org/10.58482/ijersem.v2i1.31

CNN-Based U-Net Deep Learning Model for Speech Signal Enhancement

S. Roja, T. K. Dileep, M. Bharath, G. Ananda Reddy, S. K. Jagadeesh, and G. Hemasundar Reddy

Department of ECE, Siddartha Institute of Science and Technology, Puttur, India.

Abstract: Speech signals are often corrupted by environmental noise during acquisition and transmission, which degrades speech quality and intelligibility and adversely affects downstream applications such as speech recognition, communication systems, and assistive hearing devices. Traditional speech enhancement techniques, including spectral subtraction and Wiener filtering, rely on statistical assumptions about the noise and frequently fail under non-stationary, real-world conditions. To overcome these limitations, this paper proposes a CNN-based U-Net deep learning framework for speech enhancement that suppresses noise while preserving essential speech characteristics. In the proposed approach, noisy speech signals are first transformed into the time-frequency domain using the Short-Time Fourier Transform (STFT). A U-Net architecture built from convolutional neural network (CNN) layers then learns a nonlinear mapping from noisy to clean speech representations. The encoder-decoder structure captures both local spectral patterns and, through skip connections, long-range contextual information, enabling accurate reconstruction of the clean speech components. The enhanced speech signal is finally recovered with the inverse STFT. The framework is evaluated under real-world noise conditions, including traffic, fan, and household noise, at different signal-to-noise ratio (SNR) levels. Quantitative evaluation using SNR improvement and Mean Squared Error (MSE) demonstrates that the CNN U-Net model significantly outperforms conventional speech enhancement methods, confirming the effectiveness and robustness of the proposed approach in challenging noisy environments.
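To make the pipeline described in the abstract concrete, the sketch below implements the same three stages in PyTorch: STFT analysis, a small encoder-decoder CNN with a skip connection that maps the noisy magnitude spectrogram to a clean-magnitude estimate, and inverse-STFT synthesis. This is a minimal illustration, not the authors' implementation: the layer widths, n_fft = 512, hop = 128, and the reuse of the noisy phase at synthesis are our assumptions, and the training loop that fits the network to noisy/clean pairs is omitted.

```python
import torch
import torch.nn as nn

class SmallUNet(nn.Module):
    """Encoder-decoder CNN with a skip connection over magnitude spectrograms.
    All layer widths are illustrative, not taken from the paper."""
    def __init__(self):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU())
        self.enc2 = nn.Sequential(nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU())
        self.bottleneck = nn.Sequential(nn.Conv2d(32, 32, 3, padding=1), nn.ReLU())
        self.up = nn.ConvTranspose2d(32, 16, 2, stride=2)
        self.dec = nn.Sequential(nn.Conv2d(32, 16, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(16, 1, 3, padding=1))

    def forward(self, x):                      # x: (batch, 1, freq, time)
        e1 = self.enc1(x)                      # local spectral patterns
        e2 = self.enc2(e1)                     # downsampled, wider context
        b = self.bottleneck(e2)
        u = self.up(b)                         # upsample back toward input size
        u = u[:, :, :e1.size(2), :e1.size(3)]  # crop to handle odd sizes
        u = torch.cat([u, e1], dim=1)          # skip connection from the encoder
        return self.dec(u)                     # estimated clean magnitude

def enhance(noisy, model, n_fft=512, hop=128):
    """Enhance a 1-D waveform with a trained model (training loop omitted)."""
    window = torch.hann_window(n_fft)
    spec = torch.stft(noisy, n_fft, hop_length=hop, window=window,
                      return_complex=True)                  # STFT analysis
    mag, phase = spec.abs(), torch.angle(spec)
    with torch.no_grad():
        est = model(mag[None, None]).squeeze(0).squeeze(0)  # clean-magnitude estimate
    est = est.clamp(min=0.0)                                # magnitudes are non-negative
    complex_est = est * torch.exp(1j * phase)               # reuse the noisy phase
    return torch.istft(complex_est, n_fft, hop_length=hop, window=window)
```

Reusing the noisy phase at synthesis is a common simplification in magnitude-domain enhancement: the network only has to predict the magnitude, at the cost of some residual phase distortion in the reconstructed waveform.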

Keywords: Speech Enhancement, Convolutional Neural Network, U-Net Architecture, Noise Reduction, STFT.
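The two reported metrics are straightforward to compute from time-aligned waveforms. Below is a minimal NumPy sketch, assuming the clean, noisy, and enhanced signals are 1-D arrays at the same sampling rate; the trimming to a common length and the 1e-12 guard against division by zero are our additions.

```python
import numpy as np

def mse(clean, enhanced):
    """Mean squared error between clean and enhanced waveforms."""
    n = min(len(clean), len(enhanced))   # align lengths after ISTFT
    return np.mean((clean[:n] - enhanced[:n]) ** 2)

def snr_db(clean, estimate):
    """SNR in dB, treating (estimate - clean) as the noise term."""
    n = min(len(clean), len(estimate))
    clean, estimate = clean[:n], estimate[:n]
    noise = estimate - clean
    return 10.0 * np.log10(np.sum(clean ** 2) / (np.sum(noise ** 2) + 1e-12))

def snr_improvement(clean, noisy, enhanced):
    """Output SNR minus input SNR: positive values mean the model helped."""
    return snr_db(clean, enhanced) - snr_db(clean, noisy)
```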
