Spectral Restoration based Speech Enhancement for Robust Speaker Identification

Spectral restoration based speech enhancement algorithms are used to enhance the quality of noise-masked speech for robust speaker identification (SID). In the presence of background noise, the performance of speaker identification systems deteriorates severely. The present study employs and evaluates the Minimum Mean-Square-Error Short-Time Spectral Amplitude Estimator (MMSE-STSA) with a modified a priori SNR estimate, applied prior to speaker identification, to improve the performance of speaker identification systems in the presence of background noise. For speaker identification, Mel Frequency Cepstral Coefficients (MFCC) are used to extract the speech features and Vector Quantization (VQ) is used to model the extracted features. The experimental results show a significant improvement in speaker identification rates when spectral restoration based speech enhancement is applied as a pre-processing step.


Introduction
Speech enhancement aspires to improve the quality of noise-masked speech by employing a variety of speech processing algorithms. The intention of the enhancement is to improve the intelligibility and/or the overall perceptual quality of the degraded speech. Enhancement of speech degraded by background noise, called noise reduction, is a significant area of speech enhancement and is considered for diverse applications, e.g., mobile phones, speech/speaker recognition and identification, and hearing aids. Speech signals are frequently contaminated by background noise, which affects the performance of speaker identification (SID) systems. SID systems are used in online banking, voice mail, remote computer access, etc. Therefore, for effective use of such systems, a speech enhancement stage must be positioned at the front-end to improve identification accuracy. Figure 1 shows the procedural block diagram of the speech enhancement and speaker identification system. Speech enhancement algorithms are categorized into three fundamental classes: (i) filtering techniques, including spectral subtraction [1][2][3][4], Wiener filtering [5][6][7] and signal subspace techniques [8][9]; (ii) spectral restoration algorithms, including the Minimum Mean-Square-Error Short-Time Spectral Amplitude Estimator [10][11][12]; and (iii) speech-model based algorithms. The systems in [5][6][7][10][11][12] depend principally on accurate estimates of the signal-to-noise ratio (SNR) in all frequency bands, because the gain is computed as a function of the spectral SNR. A conventional and well-recognized technique for SNR estimation is the decision-directed (DD) method suggested in [10]. Because the DD a priori SNR estimate trails the shape of the instantaneous SNR, it introduces a one-frame delay. Therefore, momentum terms are incorporated to improve the tracking speed of the system and avoid the frame-delay problem. All the systems in [10][11][12] can significantly improve speech quality. Binary masking [13][14][15][16][17][18] is
another class that increases speech quality and intelligibility simultaneously. This paper presents the Minimum Mean-Square-Error Short-Time Spectral Amplitude Estimator with a modified a priori SNR estimate to reduce background noise and to improve the identification rates of speaker identification systems in the presence of background noise. The paper is organized as follows: Section 2 presents an overview of the speech enhancement system; Section 3 describes the speaker identification system; Section 4 presents the experimental setup, results and discussions; and Section 5 presents the summary and concluding remarks. Matlab R2015b was used to implement the algorithms and simulations. In a classical spectral restoration based speech enhancement system, the noisy speech is modeled as the sum of clean speech and additive background noise, as given in the next section.

Robotics & Automation Engineering Journal

Spectral Restoration based Speech Enhancement System
The noisy speech is given as:

y(n) = s(n) + d(n),

where s(n) is the clean speech and d(n) is the additive background noise. In the short-time Fourier transform (STFT) domain this becomes Y(ω,k) = S(ω,k) + D(ω,k). The a posteriori SNR γ(ω,k) and the a priori SNR ξ(ω,k) are defined as:

γ(ω,k) = |Y(ω,k)|² / λd(ω,k),   ξ(ω,k) = λs(ω,k) / λd(ω,k),

where λd(ω,k) = E{|D(ω,k)|²}, λs(ω,k) = E{|S(ω,k)|²}, and E{.} denotes the expectation operator. The noise power spectral density is estimated recursively as:

λ̂d(ω,k) = β λ̂d(ω,k−1) + (1−β) |Y(ω,k)|²,

where β is a smoothing factor and λ̂d(ω,k−1) is the estimate in the previous frame. The a priori SNR estimate ξ̂(ω,k) is obtained via the decision-directed (DD) method:

ξ̂(ω,k) = α |Ŝ(ω,k−1)|² / λ̂d(ω,k−1) + (1−α) F{γ(ω,k) − 1},

where α is a smoothing factor with the constant value 0.98 and F{.} is half-wave rectification, F{x} = max(x, 0). The DD method is efficient and performs well in speech enhancement applications; however, the a priori SNR estimate follows the shape of the instantaneous SNR with a single-frame delay. To overcome the single-frame delay, a modified form of the DD approach, in which a momentum term is added to the recursion, is used to estimate the a priori SNR. In the gain function, the parameter ς is used to avoid large gain values at low a posteriori SNR; ς = 10 is chosen here.
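As an illustration, the decision-directed estimate above can be sketched in a few lines of Python. This is a hypothetical sketch, not the paper's Matlab code; the simpler Wiener gain ξ/(1+ξ) stands in for the full MMSE-STSA gain (which requires Bessel functions), and all function names are assumptions:

```python
import numpy as np

def dd_a_priori_snr(noisy_power, noise_psd, prev_clean_power, alpha=0.98):
    """Decision-directed a priori SNR estimate for one frame.

    noisy_power      : |Y(w,k)|^2 per frequency bin (1-D array)
    noise_psd        : estimated noise PSD, lambda_d(w,k)
    prev_clean_power : |S_hat(w,k-1)|^2 from the previous enhanced frame
    """
    gamma = noisy_power / noise_psd                 # a posteriori SNR
    inst = np.maximum(gamma - 1.0, 0.0)             # half-wave rectification F{.}
    xi = alpha * prev_clean_power / noise_psd + (1.0 - alpha) * inst
    return xi, gamma

def enhance_frame(noisy_spec, noise_psd, prev_clean_power, alpha=0.98):
    """One frame of spectral-gain enhancement (Wiener gain as a stand-in
    for the full MMSE-STSA gain)."""
    xi, _ = dd_a_priori_snr(np.abs(noisy_spec) ** 2, noise_psd,
                            prev_clean_power, alpha)
    gain = xi / (1.0 + xi)
    clean_spec = gain * noisy_spec
    return clean_spec, np.abs(clean_spec) ** 2
```

Frame by frame, the returned clean power feeds back as `prev_clean_power` for the next call, which is what gives the DD estimate its recursive, smoothed character.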

Speaker Identification System
The intention of a speaker identification system is to extract identity information regarding a speaker; the task is categorized into two sub-categories, called Speaker Identification (SID) and Speaker Verification (SVR). For SID, Mel Frequency Cepstral Coefficients (MFCC) are used to extract the speech features and Vector Quantization (VQ) is used to model the extracted features. The speaker identification system operates in two stages, training and testing. In the training stage the system is allowed to create a database of speech signals

and formulate a feature model of the speech utterances. In the testing stage, the system uses the information in the database and attempts to segregate and identify the speakers. Here, Mel Frequency Cepstral Coefficient (MFCC) features are used for constructing the SID system. The extracted features of the speakers are quantized to a number of centroids using the vector quantization (VQ) K-means algorithm. MFCCs are computed in the training as well as the testing stage. The Euclidean distance between the MFCCs of each speaker in the training stage and the centroids of an isolated speaker in the testing stage is calculated, and a particular speaker is identified according to the minimum Euclidean distance.

Feature Extraction
The MFCCs are acquired by first pre-emphasizing [ref] the speech to emphasize high frequencies and remove the effects of glottal and lip radiation. The resulting speech is framed, windowed, and an FFT is computed to obtain the spectra. To approximate the human auditory system, a bank of triangular band-pass filters is applied: for center frequencies below 1 kHz a linear spacing is used, while beyond 1 kHz a logarithmic spacing is assumed. The filter bank response is given in Figure 2. The cepstrum is computed by taking the discrete cosine transform of the log Mel spectrum:

Mg = Σ (n = 1 to Nf) log(Sn) cos( π g (n − 1/2) / Nf ),   g = 1, ..., K,

where Mg denotes the MFCCs, Sn is the nth Mel filter output, K is the number of MFCCs, chosen between 5 and 26, and Nf is the number of Mel filters. Only the first few coefficients are retained, since most of the speaker-specific information is present in them [ref].
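The filter-bank and cepstrum computation can be sketched as follows. This is a minimal Python illustration; the function names, FFT size, and details such as the DCT-II over the log filter-bank energies are assumptions, not the paper's exact implementation:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, fs):
    """Triangular filters spaced linearly on the mel scale."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for b in range(left, center):                 # rising edge
            fb[i - 1, b] = (b - left) / max(center - left, 1)
        for b in range(center, right):                # falling edge
            fb[i - 1, b] = (right - b) / max(right - center, 1)
    return fb

def mfcc_frame(frame, fs, n_filters=26, n_ceps=13):
    """MFCCs for one pre-emphasised, windowed frame."""
    spec = np.abs(np.fft.rfft(frame)) ** 2            # power spectrum
    energies = mel_filterbank(n_filters, len(frame), fs) @ spec
    log_e = np.log(energies + 1e-10)                  # floor avoids log(0)
    n = np.arange(n_filters)
    # DCT-II of the log filter-bank energies gives the cepstrum
    return np.array([np.sum(log_e * np.cos(np.pi * g * (n + 0.5) / n_filters))
                     for g in range(n_ceps)])
```

Calling `mfcc_frame` on every windowed frame of an utterance yields the feature-vector sequence that the VQ stage below clusters.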

Vector Quantization
Vector quantization (VQ) is a lossy compression method based on block coding theory [19]. The purpose of VQ in speaker recognition systems is to create a classification model for every speaker: a large set of acoustic vectors is converted to a smaller set that represents the centroids of the distribution, shown in Figure 2. VQ is employed because not every MFCC feature vector can be stored; instead, the extracted acoustic vectors are clustered into a set of codewords (referred to as a codebook). This clustering is achieved using the K-means algorithm, which partitions the M feature vectors into K centroids. Initially, K cluster centroids are chosen randomly from among the M feature vectors; then every feature vector is allocated to its nearest centroid, and new clusters are formed around the updated centroids. The process continues until a stopping condition is reached, i.e., the mean square error (MSE) between the acoustic vectors and their cluster centroids falls below a predefined threshold, or there are no further changes in the cluster-centroid assignment [20][21].
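The K-means codebook training described above can be sketched as follows (a hypothetical Python illustration; the initialization strategy, iteration cap, and convergence test are assumed details):

```python
import numpy as np

def train_codebook(features, k, iters=50, seed=0):
    """K-means codebook: features is (M, D); returns (k, D) centroids."""
    rng = np.random.default_rng(seed)
    # initial centroids: k randomly chosen feature vectors
    centroids = features[rng.choice(len(features), k, replace=False)]
    for _ in range(iters):
        # assign each vector to its nearest centroid (Euclidean distance)
        d = np.linalg.norm(features[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # recompute each centroid as the mean of its assigned vectors
        new = np.array([features[labels == j].mean(axis=0)
                        if np.any(labels == j) else centroids[j]
                        for j in range(k)])
        if np.allclose(new, centroids):               # assignments stabilized
            break
        centroids = new
    return centroids
```

One codebook is trained per speaker from that speaker's MFCC vectors; the set of codebooks forms the system's speaker database.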

Speaker Identification
The speaker recognition phase is characterized by a set of acoustic feature vectors {M1, M2, ..., Mt}, which is judged against the codebooks in the list. For each codebook a distortion is calculated, and the speaker with the lowest distortion is selected; this distortion is the sum of squared Euclidean distances between the vectors and their nearest centroids. As a result, all feature vectors in the sequence are compared with the codebooks, and the codebook with the minimum average distance is selected. The Euclidean distance between two points x and y is given by

d(x, y) = sqrt( Σi (xi − yi)² ).

Figure 3: 2D acoustic vector analysis

Experimental Setup, Results and Discussion
Six different speakers, three male and three female, were selected from the Noizeus [22] and TIMIT databases respectively. To evaluate the performance of the system, four signal-to-noise ratio levels (0 dB, 5 dB, 10 dB and 15 dB) are used, and three noisy conditions (car, street and white noise) are used to degrade the clean speech. The Perceptual Evaluation of Speech Quality (PESQ) [23] and segmental SNR (SNRSeg) are used to predict speech quality after enhancement. Three sets of experiments are conducted to measure the speaker identification rates: clean speech with no background noise, speech degraded by background noise, and speech processed by the spectral restoration enhancement algorithms. Figure 4 shows the PESQ scores obtained after applying the Minimum Mean-Square-Error Short-Time Spectral Amplitude Estimator with the modified a priori SNR estimate (MMSE-MDD). The modified version consistently offers the best results across all SNR levels and noisy conditions when compared to noisy speech and to speech processed by the traditional MMSE-STSA algorithm. Similarly, Figure 5 shows speech quality in terms of segmental SNR (SNRSeg), where the highest SNRSeg scores are obtained with MMSE-MDD. The enhanced speech of the six speakers is then tested for speaker identification. Figure 6 shows the identification rates; the lowest rates are observed in the presence of background noise (babble, car
and street); however, the employment of speech enhancement

before speaker identification has tremendously increased the identification rates, as is evident in Figure 6. The identification rates for MMSE-MDD are the highest in all noise conditions and SNR levels.
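The matching rule described in the speaker identification section, selecting the speaker whose codebook gives the minimum average distortion, can be sketched as follows (a hypothetical Python illustration; the names and data layout are assumptions):

```python
import numpy as np

def avg_distortion(features, codebook):
    """Mean squared Euclidean distance from each test vector (rows of
    `features`) to its nearest codeword in `codebook`."""
    d = np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=2)
    return np.mean(d.min(axis=1) ** 2)

def identify(test_features, codebooks):
    """Return the key of the codebook with the lowest average distortion.

    codebooks : dict mapping speaker id -> (k, D) centroid array
    """
    scores = {spk: avg_distortion(test_features, cb)
              for spk, cb in codebooks.items()}
    return min(scores, key=scores.get)
```

In the experiments, `test_features` would be the MFCC sequence of an enhanced utterance, so improved enhancement lowers the distortion against the correct speaker's codebook.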

Figure 1: Proposed speech enhancement and speaker identification system

The STFT coefficients Y(ω,k) are indexed by frequency bin ω and time frame k. The quasi-stationary nature of speech is exploited in frame-based analysis, since both the noise and the speech signal exhibit non-stationary behavior (Figure 2). A speech enhancement algorithm multiplies the noisy spectrum by a spectral gain that follows two key parameters, the a posteriori SNR γ(ω,k) and the a priori SNR estimate ξ(ω,k). In a practical implementation only the noisy speech is available, not the power spectral density of the clean speech; therefore, both the instantaneous and the a priori SNR must be estimated. The noise power spectral density is estimated during speech gaps using the standard recursive relation:

λ̂d(ω,k) = β λ̂d(ω,k−1) + (1−β) |Y(ω,k)|².

Eq. 6 gives the modified DD (MDD) version used in the speech enhancement system, where α is the smoothing parameter (α = 0.98) and ζ is the momentum parameter of the modified a priori SNR estimate. The estimated clean speech magnitude spectrum S_EST(ω,k) is obtained by multiplying the gain function with the noisy speech Y(ω,k):

S_EST(ω,k) = G(ω,k) Y(ω,k).
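The recursive noise-PSD update above translates directly into code (a minimal sketch; the value β = 0.9 and the external speech-gap flag are assumptions, since the text does not specify them):

```python
import numpy as np

def update_noise_psd(noise_psd, noisy_power, is_speech_gap, beta=0.9):
    """Recursive noise PSD update, applied only in detected speech gaps:
        lambda_d(k) = beta * lambda_d(k-1) + (1 - beta) * |Y(k)|^2
    During speech activity the previous estimate is simply held."""
    if not is_speech_gap:
        return noise_psd
    return beta * noise_psd + (1.0 - beta) * noisy_power
```

A voice activity detector (not shown) would supply `is_speech_gap`; the smoothing factor β trades tracking speed against estimate variance.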