Universal back-end parameters for multi-lingual speech recognition performance predictions with an automatic speech recognizer
The framework for auditory discrimination experiments (FADE) by Schädler et al. (2015) predicts human speech recognition performance using an automatic speech recognition (ASR) system based on Gaussian mixture models (GMMs) and hidden Markov models (HMMs). The GMM/HMM parameters, such as the number of states per word, training samples, and training iterations, were optimized for several front-ends with the German matrix sentence test and basic psychoacoustic experiments. Here, this work is extended to the matrix sentence test in four languages (German, Polish, Russian, and Spanish), with traditional Mel-frequency cepstral coefficient (MFCC) features and robust separable Gabor filter bank (SGBFB) features as front-ends. Two types of maskers were considered: the test-specific stationary noise and modulated noise. Different settings of the back-end parameters had a similar influence on the predicted speech recognition performance across the four languages. The robust SGBFB features generally yielded better predicted speech recognition performance than the traditional MFCC features, particularly in the modulated noise. These results show that the originally proposed parameter set of the ASR back-end can be applied to different languages and noise conditions. The spectro-temporal processing of the SGBFB turns out to be crucial for accurate speech recognition predictions in all languages.