SelectTTS: Synthesizing Anyone’s Voice via Discrete Unit-Based Frame Selection

Ismail Rasim Ulgen*, Shreeram Suresh Chandra*, Junchen Lu, Berrak Sisman

*Equal contribution

Speech and Machine Learning Lab - The University of Texas at Dallas

Proposed Method

[Figure] Proposed SelectTTS framework with the frame-selection method, showing the tokenizers and the SelectTTS pipeline. During frame selection, frames z1, z2, z3, and z4 are chosen through sub-sequence matching, while frames z7, z9, z6, and z10 are chosen via inverse k-means sampling.
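To make the two selection strategies in the figure concrete, here is a minimal, hypothetical Python sketch. It assumes the target text has already been mapped to discrete unit (k-means cluster) IDs and that the reference speech provides both unit IDs and aligned frame-level features; all names (select_frames, target_units, ref_units, ref_feats, min_match) are illustrative, not from the paper.

import numpy as np

def select_frames(target_units, ref_units, ref_feats, min_match=2, rng=None):
    # Hypothetical sketch of SelectTTS-style frame selection.
    # target_units: length-T array of unit IDs predicted for the target text
    # ref_units:    length-R array of unit IDs from the reference speaker
    # ref_feats:    (R, D) continuous features aligned with ref_units
    # Returns a (T, D) array of frames selected from the reference.
    rng = rng or np.random.default_rng(0)
    selected = np.empty((len(target_units), ref_feats.shape[1]))

    # Index reference frames by their cluster ID so unmatched target
    # units can be filled by inverse k-means sampling.
    cluster_frames = {}
    for idx, unit in enumerate(ref_units):
        cluster_frames.setdefault(int(unit), []).append(idx)

    t = 0
    while t < len(target_units):
        # Greedy sub-sequence matching: find the longest run of target
        # units starting at t that appears contiguously in the reference.
        best_len, best_start = 0, -1
        for r in range(len(ref_units)):
            k = 0
            while (t + k < len(target_units) and r + k < len(ref_units)
                   and target_units[t + k] == ref_units[r + k]):
                k += 1
            if k > best_len:
                best_len, best_start = k, r
        if best_len >= min_match:
            # Matched: copy the reference frames for the whole run.
            selected[t:t + best_len] = ref_feats[best_start:best_start + best_len]
            t += best_len
        else:
            # Unmatched: inverse k-means sampling draws a random reference
            # frame from the same cluster (the "rand" variant; the "avg"
            # variant would average the cluster's frames instead).
            pool = cluster_frames.get(int(target_units[t]), list(range(len(ref_units))))
            selected[t] = ref_feats[rng.choice(pool)]
            t += 1
    return selected

The "rand" and "avg" labels in the samples further down refer to this fallback branch: a random draw from the matching cluster versus its average.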

Comparison with Baselines

[Audio samples: Ground Truth, X-TTS, VALL-E, YourTTS, SelectTTS (no sub-sequence), SelectTTS (with sub-sequence)]

How much reference speech do we need?

[Audio samples: Ground Truth, 30 seconds, 1 minute, 3 minutes, 5 minutes]

Effect of different frame selection strategies

[Audio samples: Ground Truth, SelectTTS (only inv. k-means (rand)), SelectTTS (only inv. k-means (avg)), SelectTTS (inv. k-means (rand) + sub-match), SelectTTS (inv. k-means (avg) + sub-match)]

Effect of vocoder fine-tuning

[Audio samples: Ground Truth, Vocoder (no prematched fine-tuning), Vocoder (prematched fine-tuning)]
