Learning Acoustic Features from the Raw Waveform for Automatic Speech Recognition
Automatic speech recognition (ASR) is usually performed with handcrafted preprocessing, which extracts relevant information from the speech waveform while reducing the redundancy of the resulting feature vectors. Prominent examples are the Mel Frequency Cepstral Coefficients (MFCCs) or the Gammatone (GT) filter bank, originally designed for use in Gaussian mixture hidden Markov models. However, the successful introduction of neural network (NN) acoustic models has raised the following question: can preprocessing become part of acoustic modeling and training, taking unprocessed waveforms as direct input? Recent work shows that a fully connected feed-forward NN is indeed able to learn the feature extraction as part of the acoustic model to a large extent. Introducing convolutional layers in the first stages of the NN further closed the performance gap to handcrafted preprocessing. Improvements, even for multichannel speech input, have been reported on top of manually designed preprocessing, using large amounts of training data for a proprietary task. In this work, waveform-based ASR modeling and training is investigated and analyzed on a publicly available medium-sized data set, namely the CHiME-4 data set, which supplies real multichannel noisy data for training and evaluation.
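The learned front-end described above can be illustrated with a minimal sketch: a bank of 1-D convolution filters applied directly to the raw waveform, followed by rectification and log compression, yields frame-wise feature vectors in place of MFCC or Gammatone preprocessing. All filter weights, sizes, and strides below are illustrative assumptions (random weights stand in for filters that would, in the setting discussed here, be trained jointly with the acoustic model); this is not the paper's actual architecture.

```python
import numpy as np

def conv1d_features(wave, filters, stride):
    """Strided 1-D convolution of a waveform with a filter bank.

    wave:    1-D array of raw samples
    filters: (num_filters, filter_len) array of filter coefficients
    stride:  hop between analysis frames, in samples
    Returns a (num_frames, num_filters) feature matrix.
    """
    num_filters, flen = filters.shape
    num_frames = (len(wave) - flen) // stride + 1
    out = np.empty((num_frames, num_filters))
    for t in range(num_frames):
        window = wave[t * stride : t * stride + flen]
        out[t] = filters @ window  # one inner product per filter
    return out

rng = np.random.default_rng(0)
wave = rng.standard_normal(16000)          # dummy 1 s of audio at 16 kHz
filters = rng.standard_normal((40, 400))   # 40 filters of 25 ms (assumed sizes)
feats = conv1d_features(wave, filters, stride=160)  # 10 ms frame shift
feats = np.log(np.abs(feats) + 1e-6)       # rectification + log compression
print(feats.shape)  # (98, 40): time frames x learned feature channels
```

In a trained system the random filter matrix would be a convolutional layer whose weights are optimized by backpropagation together with the rest of the NN acoustic model, which is precisely what lets the network learn its own feature extraction.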