Deep Recurrent Neural Networks for Emotion Recognition in Speech
Emotion recognition in speech (ERS) is an active research topic in the field of affective computing. Giving computers the ability to recognise the emotions of a subject is an important aspect of naturalistic human-computer interaction and user profiling. Recent methods tackling the complex task of ERS employ deep neural networks. As human emotion is an affective state that changes over time, the neural network has to perform sequence-to-sequence modelling. This is usually implemented with a recurrent neural network (RNN), such as a long short-term memory (LSTM) RNN. Although in some recent work the raw audio signal is fed directly into the network, the most common approach is to feed the RNN with acoustic low-level features describing prosodic, spectral, or cepstral characteristics of the speech. Instead of using the raw low-level features, they can also be encoded with the bag-of-audio-words approach, in which the feature vectors are quantised using a previously learnt codebook of templates and the occurrence of each template is encoded in a sparse histogram vector. In this contribution, we propose a deep learning framework for ERS and compare different feature representations. Results are presented on a state-of-the-art benchmark database from the domain of affective computing.
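The bag-of-audio-words encoding mentioned above can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the codebook size, feature dimensionality, and the simple k-means training loop are placeholder assumptions for the sake of the example.

```python
import numpy as np

def learn_codebook(frames, n_words, n_iter=10, seed=0):
    """Learn a codebook of templates via a few k-means iterations."""
    rng = np.random.default_rng(seed)
    # Initialise templates from randomly chosen frames.
    codebook = frames[rng.choice(len(frames), n_words, replace=False)]
    for _ in range(n_iter):
        # Assign each frame to its nearest template (Euclidean distance).
        d = np.linalg.norm(frames[:, None, :] - codebook[None, :, :], axis=2)
        assign = d.argmin(axis=1)
        # Move each template to the mean of its assigned frames.
        for k in range(n_words):
            members = frames[assign == k]
            if len(members) > 0:
                codebook[k] = members.mean(axis=0)
    return codebook

def bag_of_audio_words(frames, codebook):
    """Quantise frames with the codebook; return a normalised histogram."""
    d = np.linalg.norm(frames[:, None, :] - codebook[None, :, :], axis=2)
    assign = d.argmin(axis=1)
    hist = np.bincount(assign, minlength=len(codebook)).astype(float)
    return hist / hist.sum()

# Toy low-level descriptors: 200 frames of 13-dim features (e.g. MFCCs).
rng = np.random.default_rng(1)
frames = rng.normal(size=(200, 13))
codebook = learn_codebook(frames, n_words=16)
boaw = bag_of_audio_words(frames, codebook)
print(boaw.shape)
```

Regardless of the utterance length, the encoding yields one fixed-length histogram vector per utterance, which is what makes it an alternative input representation to the raw frame-wise low-level features.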