DOI
https://doi.org/10.25772/H2HS-8726
Author ORCID Identifier
https://orcid.org/0009-0000-9171-8955
Defense Date
2024
Document Type
Thesis
Degree Name
Master of Science
Department
Computer Science
First Advisor
Tamer Nadeem
Abstract
Speech Emotion Recognition (SER) is pivotal in advancing human-computer interaction by enabling machines to understand and respond to human emotions. Despite significant progress with self-supervised learning models, SER systems often struggle with generalization across diverse languages and unseen data distributions, limiting their real-world applicability. This thesis addresses these challenges by first introducing a large-scale benchmark to evaluate the robustness and adaptability of state-of-the-art SER models in both in-domain and out-of-domain settings. The benchmark includes a diverse set of multilingual datasets, emphasizing cross-lingual and out-of-domain evaluations to assess model generalization. Surprisingly, we find that the Whisper model, originally designed for automatic speech recognition, outperforms dedicated self-supervised learning models in cross-lingual SER tasks.
Building upon these findings, we propose a novel approach that combines soft labeling and aggressive data augmentation techniques to model temporal emotion shifts in large-scale multilingual speech data. We introduce Eisper, a novel training paradigm that leverages the Whisper encoder augmented with Matryoshka Representation Learning to capture hierarchical emotional representations. By aggregating 17 diverse datasets and employing stochastic sample packing, we create a comprehensive dataset encompassing 571 hours of multilingual emotional speech. Extensive experiments demonstrate significant improvements in zero-shot generalization, achieving state-of-the-art performance on out-of-domain datasets and surpassing existing models on the SER Evals benchmark.
This thesis contributes to advancing the field of SER by providing valuable insights into model generalization, offering a robust benchmark for future research, and presenting novel methodologies to enhance the generalizability of SER models across diverse languages and data distributions.
Rights
© The Author
Is Part Of
VCU University Archives
Is Part Of
VCU Theses and Dissertations
Date of Submission
11-25-2024