Bridging Natural Language Speaking Styles and Ordinal Speech Emotion via Rank-N-Contrast
This work introduces EmotionRankCLAP, a supervised contrastive learning framework that aligns emotional speech with natural language speaking style descriptions, leveraging the ordinal nature of emotions via a Rank-N-Contrast objective.
Current emotion-based contrastive language-audio pretraining (CLAP) methods typically learn by naïvely aligning audio samples with their corresponding text prompts. As a result, they fail to capture the ordinal nature of emotions, which hinders inter-emotion understanding and often leaves a wide modality gap between the audio and text embeddings due to insufficient alignment. To address these drawbacks, we introduce EmotionRankCLAP, a supervised contrastive learning approach that uses dimensional attributes of emotional speech and natural language prompts to jointly capture fine-grained emotion variations and improve cross-modal alignment. Our approach utilizes a Rank-N-Contrast objective to learn ordered relationships by contrasting samples based on their rankings in the valence-arousal space. EmotionRankCLAP outperforms existing emotion-CLAP methods in modeling emotion ordinality across modalities, as measured via a cross-modal retrieval task.
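To make the objective concrete, below is a minimal PyTorch sketch of a batch-wise Rank-N-Contrast loss following the general Rank-N-Contrast formulation: for each anchor-candidate pair, the softmax denominator ranges over samples at least as far from the anchor in valence-arousal space, so embedding similarity is forced to respect label order. This is an illustrative unimodal sketch, not the paper's cross-modal implementation; the function name, default temperature, and the O(B²) loop are assumptions made for readability.

```python
import torch

def rank_n_contrast_loss(features, labels, temperature=0.1):
    """Minimal batch-wise Rank-N-Contrast loss (illustrative sketch).

    features: (B, D) L2-normalized embeddings.
    labels:   (B, 2) continuous valence-arousal annotations.
    For each anchor i and candidate j, the softmax denominator ranges
    over samples k with d(y_i, y_k) >= d(y_i, y_j), so samples closer
    in label space are pulled closer in embedding space.
    """
    B = features.size(0)
    sim = features @ features.T / temperature       # (B, B) scaled similarities
    label_dist = torch.cdist(labels, labels, p=2)   # (B, B) valence-arousal distances

    eye = torch.eye(B, dtype=torch.bool, device=features.device)
    sim = sim.masked_fill(eye, float('-inf'))       # never contrast a sample with itself

    loss = features.new_zeros(())
    for i in range(B):
        for j in range(B):
            if i == j:
                continue
            # Denominator set: j itself plus all samples at least as far
            # from anchor i in label space as j is.
            in_denom = label_dist[i] >= label_dist[i, j]
            in_denom[i] = False
            loss = loss + torch.logsumexp(sim[i][in_denom], dim=0) - sim[i, j]
    return loss / (B * (B - 1))
```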
On modality-gap evaluations, EmotionRankCLAP achieves the lowest MMD (0.087) and Wasserstein distance (0.065) between audio and text embeddings, outperforming the SCE and SupCon baselines; lower values indicate tighter cross-modal alignment.
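For context on these numbers, the following sketch shows one standard way to compute a squared MMD between audio and text embedding sets with an RBF kernel; the kernel choice, bandwidth, and biased estimator are assumptions, not the paper's reported configuration.

```python
import torch

def squared_mmd_rbf(audio_emb, text_emb, sigma=1.0):
    """Squared MMD between audio and text embedding sets (sketch).

    Uses an RBF kernel and the biased V-statistic estimator; a smaller
    value indicates a narrower modality gap.
    """
    def rbf(a, b):
        # Pairwise squared Euclidean distances -> Gaussian kernel values.
        return torch.exp(-torch.cdist(a, b).pow(2) / (2 * sigma ** 2))

    return (rbf(audio_emb, audio_emb).mean()
            + rbf(text_emb, text_emb).mean()
            - 2 * rbf(audio_emb, text_emb).mean())
```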
On the cross-modal retrieval task, EmotionRankCLAP reaches a Kendall's Tau of 0.616 for valence and 0.552 for arousal, a substantial gain over prior methods.
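One plausible way to score such ordinal retrieval is sketched below: for each text query, audio items are ranked both by embedding similarity and by closeness in valence, and Kendall's Tau compares the two rankings. The protocol details here (per-query averaging, negated absolute valence distance as the label-side ranking) are assumptions, not the paper's confirmed procedure.

```python
import numpy as np
from scipy.stats import kendalltau

def mean_valence_tau(text_emb, audio_emb, text_valence, audio_valence):
    """Average per-query Kendall's Tau for valence-ordered retrieval (sketch).

    text_emb: (N, D) and audio_emb: (M, D), assumed L2-normalized so the
    dot product is cosine similarity; *_valence are continuous labels.
    """
    sims = text_emb @ audio_emb.T  # (N, M) query-to-item similarities
    taus = []
    for i in range(len(text_emb)):
        # Label-side ranking: audio items closer in valence to the query
        # should rank higher, hence the negated absolute distance.
        label_score = -np.abs(audio_valence - text_valence[i])
        tau, _ = kendalltau(sims[i], label_score)
        taus.append(tau)
    return float(np.mean(taus))
```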