Deep Neural Networks for Speech Enhancement in Complex-Noisy Environments

Authors

N. Saleem, M. Irfan Khattak

DOI:

https://doi.org/10.9781/ijimai.2019.06.001

Keywords:

Speech Enhancement, Deep Learning, Intelligibility, Time-Frequency Masking, Ideal Binary Mask

Abstract

In this paper, we consider speech enhancement under realistic conditions in which several complex noise sources simultaneously degrade the quality and intelligibility of a target speech signal. The existing speech enhancement literature focuses principally on mixtures containing a single noise source. In real-world situations, however, the target speech is typically corrupted by several stationary and nonstationary noise sources at once. Here, we apply deep learning to speech enhancement in such complex-noisy environments, formulating the ideal binary mask (IBM) as a binary classification target for deep neural networks (DNNs). The IBM serves as the target function during training, and the trained DNNs estimate the IBM during the enhancement stage; the estimated mask is then applied to the complex-noisy mixture to recover the target speech. The mean square error (MSE) is used as the objective cost function across training epochs. Experimental results at different input signal-to-noise ratios (SNRs) show that DNN-based complex-noisy speech enhancement outperforms the competing methods in speech quality, as measured by the perceptual evaluation of speech quality (PESQ), segmental signal-to-noise ratio (SNRSeg), log-likelihood ratio (LLR), and weighted spectral slope (WSS). Moreover, short-time objective intelligibility (STOI) scores confirm improved speech intelligibility.
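The ideal binary mask used as the training target above is conventionally defined per time-frequency unit by comparing the local SNR against a threshold known as the local criterion (LC). A minimal sketch of that construction, assuming magnitude spectrograms of the clean speech and the noise are available (as they are during supervised training), with the function name and LC value chosen here for illustration:

```python
import numpy as np

def ideal_binary_mask(speech_mag, noise_mag, lc_db=0.0):
    """Compute the ideal binary mask (IBM) from magnitude spectrograms.

    A time-frequency unit is retained (mask = 1) when the local SNR of
    the clean speech over the noise exceeds the local criterion lc_db
    (in dB); otherwise it is discarded (mask = 0).
    """
    eps = 1e-12  # guard against division by zero in silent units
    local_snr_db = 20.0 * np.log10((speech_mag + eps) / (noise_mag + eps))
    return (local_snr_db > lc_db).astype(np.float32)

# Toy 2x2 spectrograms: speech dominates the first column,
# noise dominates the second.
speech = np.array([[1.0, 0.1], [0.5, 0.01]])
noise = np.array([[0.1, 1.0], [0.1, 0.1]])
mask = ideal_binary_mask(speech, noise)
# Applying the mask to the noisy-mixture spectrogram keeps only the
# speech-dominated units, which is the enhancement step the DNN
# approximates once it has learned to estimate the IBM.
```

In practice the DNN never sees the clean speech at test time; it is trained to predict this mask from features of the noisy mixture alone.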



Published

2020-03-01

How to Cite

Saleem, N. and Irfan Khattak, M. (2020). Deep Neural Networks for Speech Enhancement in Complex-Noisy Environments. International Journal of Interactive Multimedia and Artificial Intelligence, 6(1), 84–90. https://doi.org/10.9781/ijimai.2019.06.001
