TEMOTTS

TEMOTTS: Text-aware Emotional Text-to-Speech with no labels

Authors: Shreeram Suresh Chandra, Zongyang Du, Berrak Sisman

Speech and Machine Learning Lab - The University of Texas at Dallas

Submitted to Speaker Odyssey 2024. Code will be released after acceptance.

Abstract

Many frameworks for emotional Text-to-Speech (E-TTS) rely on human-annotated emotion labels that are often inaccurate and difficult to obtain. Learning emotional prosody implicitly presents challenges due to the subjective nature of emotions and the hierarchical structure of speech. In this study, we propose a novel approach that leverages text awareness to acquire emotional styles without the need for explicit emotion labels. We present TEMOTTS, a two-stage framework for E-TTS that is trained without emotion labels and is capable of inference without auxiliary inputs. Our proposed method performs knowledge transfer between the linguistic space learned by BERT and the emotional style space constructed by global style tokens. Our experimental results demonstrate the effectiveness of our proposed framework in comparison to baselines, showcasing improvements in emotional accuracy, naturalness, and intelligibility. This is one of the first studies to leverage the emotional correlation between spoken content and expressive delivery for emotional TTS.

Model Architecture

Fig. 1. Proposed TEMOTTS framework with Stage I and Stage II. Red modules contain trainable weights and purple modules contain fixed weights; CE loss denotes categorical cross-entropy loss.
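
The caption above mentions a categorical cross-entropy (CE) loss and a split between trainable and fixed modules, and the abstract describes knowledge transfer between BERT's linguistic space and the style space built from global style tokens. One way such a transfer could look is sketched below. This is an illustration under stated assumptions, not the authors' implementation: the text embedding, the style-token target weights, the single linear projection, and every function name here are hypothetical stand-ins, and the released code may differ in all of these respects.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_cross_entropy(target, logits):
    """CE between a target distribution and softmax(logits)."""
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0)

def project(embedding, weights, bias):
    """Trainable linear map from text embedding to style-token logits."""
    return [sum(w * x for w, x in zip(row, embedding)) + b
            for row, b in zip(weights, bias)]

def sgd_step(embedding, target, weights, bias, lr=0.5):
    """One gradient step on the CE loss; only weights and bias change."""
    probs = softmax(project(embedding, weights, bias))
    # For softmax + CE with a target that sums to 1, dLoss/dlogits = probs - target.
    grad = [p - t for p, t in zip(probs, target)]
    new_weights = [[w - lr * g * x for w, x in zip(row, embedding)]
                   for row, g in zip(weights, grad)]
    new_bias = [b - lr * g for b, g in zip(bias, grad)]
    return new_weights, new_bias

# ASSUMPTION: stand-in for a sentence embedding from a fixed text encoder.
text_embedding = [0.2, -0.1, 0.7]
# ASSUMPTION: stand-in for style-token weights from a fixed style encoder.
style_target = [0.1, 0.7, 0.2]

# The projection is the only trainable part of this toy example.
weights = [[0.0] * len(text_embedding) for _ in style_target]
bias = [0.0] * len(style_target)

loss_before = categorical_cross_entropy(
    style_target, project(text_embedding, weights, bias))
for _ in range(50):
    weights, bias = sgd_step(text_embedding, style_target, weights, bias)
loss_after = categorical_cross_entropy(
    style_target, project(text_embedding, weights, bias))

print(f"CE loss before: {loss_before:.4f}, after: {loss_after:.4f}")
```

In this toy setup the two inputs are never updated, mirroring the fixed modules in the caption, while the projection plays the role of a trainable module that learns to predict a style distribution from text alone.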

Let's test the voice quality!

	FastSpeech2 [1]	VITS [2]	TEMOTTS
1.
2.
3.
4.
5.
6.
7.

Emotional text-awareness: notice how closely the synthesized speech captures the emotion expressed in the text

	Text	FastSpeech2 [1]	VITS [2]	TEMOTTS
1.	Blowing out birthday candles makes me feel special!
2.	Her heart felt heavy with sorrow.
3.	I am feeling sad.
4.	I feel joy when I see colourful balloons.
5.	I feel like a broken toy discarded and forgotten.
6.	I'm about to explode with anger!
7.	I'm so angry I can't even breathe.
8.	I'm so angry I could spit fire.
9.	Playing with toys brings me so much happiness!
10.	She felt like a part of her was missing.
11.	Singing and dancing make me feel so good.
12.	Smiling at others fills me with happiness.
13.	Tears welled up in her eyes.
14.	This is driving me crazy.
15.	Watching a funny movie makes me laugh out loud.

References

[1] Yi Ren, Chenxu Hu, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, and Tie-Yan Liu, “FastSpeech 2: Fast and high-quality end-to-end text to speech,” in International Conference on Learning Representations, 2021.

[2] Jaehyeon Kim, Jungil Kong, and Juhee Son, “Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech,” in International Conference on Machine Learning. PMLR, 2021, pp. 5530–5540.