SelectTTS: Synthesizing Anyone’s Voice via Discrete Unit-Based Frame Selection

Ismail Rasim Ulgen*, Shreeram Suresh Chandra*, Junchen Lu, Berrak Sisman

*Equal contribution

Speech and Machine Learning Lab - The University of Texas at Dallas

Proposed Method

[Figure] Proposed SelectTTS framework with the frame-selection method, showing the tokenizers and the SelectTTS pipeline. During frame selection, frames z1, z2, z3, and z4 are chosen through sub-sequence matching, while frames z7, z9, z6, and z10 are chosen via inverse k-means sampling.
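To make the two selection strategies in the figure concrete, here is a minimal, hypothetical Python sketch. It assumes the target text has already been mapped to discrete unit (k-means cluster) IDs and that the reference speech provides both unit IDs and aligned frame-level features; all names (select_frames, target_units, ref_units, ref_feats, min_match) are illustrative, not from the paper.

import numpy as np

def select_frames(target_units, ref_units, ref_feats, min_match=2, rng=None):
    # Hypothetical sketch of SelectTTS-style frame selection.
    # target_units: length-T array of unit IDs predicted for the target text
    # ref_units:    length-R array of unit IDs from the reference speaker
    # ref_feats:    (R, D) continuous features aligned with ref_units
    # Returns a (T, D) array of frames selected from the reference.
    rng = rng or np.random.default_rng(0)
    selected = np.empty((len(target_units), ref_feats.shape[1]))

    # Index reference frames by their cluster ID so unmatched target
    # units can be filled by inverse k-means sampling.
    cluster_frames = {}
    for idx, unit in enumerate(ref_units):
        cluster_frames.setdefault(int(unit), []).append(idx)

    t = 0
    while t < len(target_units):
        # Greedy sub-sequence matching: find the longest run of target
        # units starting at t that appears contiguously in the reference.
        best_len, best_start = 0, -1
        for r in range(len(ref_units)):
            k = 0
            while (t + k < len(target_units) and r + k < len(ref_units)
                   and target_units[t + k] == ref_units[r + k]):
                k += 1
            if k > best_len:
                best_len, best_start = k, r
        if best_len >= min_match:
            # Matched: copy the reference frames for the whole run.
            selected[t:t + best_len] = ref_feats[best_start:best_start + best_len]
            t += best_len
        else:
            # Unmatched: inverse k-means sampling draws a random reference
            # frame from the same cluster (the "rand" variant; the "avg"
            # variant would average the cluster's frames instead).
            pool = cluster_frames.get(int(target_units[t]), list(range(len(ref_units))))
            selected[t] = ref_feats[rng.choice(pool)]
            t += 1
    return selected

The "rand" and "avg" labels in the samples further down refer to this fallback branch: a random draw from the matching cluster versus its average.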

Comparison with Baselines

[Audio samples: Ground Truth, X-TTS, VALL-E, YourTTS, SelectTTS (no sub-sequence), SelectTTS (with sub-sequence)]

How much reference speech do we need?

[Audio samples: Ground Truth, 30 seconds, 1 minute, 3 minutes, 5 minutes]

Effect of different frame selection strategies

[Audio samples: Ground Truth, SelectTTS (only inv. k-means (rand)), SelectTTS (only inv. k-means (avg)), SelectTTS (inv. k-means (rand) + sub-match), SelectTTS (inv. k-means (avg) + sub-match)]

Effect of vocoder fine-tuning

[Audio samples: Ground Truth, Vocoder (no prematched fine-tuning), Vocoder (prematched fine-tuning)]
