GSA-TTS: Toward Zero-Shot Speech Synthesis based on Gradual Style Adaptor

Anonymous submission to Interspeech 2025

ABSTRACT

We present the gradual style adaptor TTS (GSA-TTS), which uses a novel style encoder that gradually encodes speaking styles from an acoustic reference for zero-shot speech synthesis. GSA first captures the local style of each semantic sound unit. The local styles are then combined by self-attention to obtain a global style condition. This semantic and hierarchical encoding strategy provides a robust and rich style representation for the acoustic model. We test GSA-TTS on unseen speakers and obtain promising results in naturalness, speaker similarity, and intelligibility. Additionally, we explore the potential of GSA in terms of interpretability and controllability, both of which stem from its hierarchical structure.
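The two-stage encoding described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the segment boundaries, the mean pooling, the `tanh` projection, the single attention head, and every name and dimension below are assumptions chosen only to show the local-then-global flow.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def local_styles(mel, boundaries, w_local):
    """One style vector per segment (assumed: mean pooling + linear projection)."""
    # mel: (T, n_mels); boundaries: list of (start, end) frame indices.
    pooled = np.stack([mel[s:e].mean(axis=0) for s, e in boundaries])
    return np.tanh(pooled @ w_local)  # (S, d)

def global_style(local, w_q, w_k, w_v):
    """Combine local styles with single-head self-attention, then average."""
    q, k, v = local @ w_q, local @ w_k, local @ w_v
    attn = softmax(q @ k.T / np.sqrt(k.shape[-1]))  # (S, S)
    return (attn @ v).mean(axis=0)  # (d,)

# Hypothetical reference: 120 frames of an 80-bin mel-spectrogram,
# split into three segments standing in for semantic sound units.
rng = np.random.default_rng(0)
mel = rng.standard_normal((120, 80))
boundaries = [(0, 30), (30, 75), (75, 120)]
d = 16
w_local = rng.standard_normal((80, d)) * 0.1
w_q, w_k, w_v = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))

local = local_styles(mel, boundaries, w_local)
style = global_style(local, w_q, w_k, w_v)
print(local.shape, style.shape)
```

In this sketch the per-segment vectors remain available next to the pooled vector, which is the kind of intermediate representation a hierarchical encoder exposes for inspection or manipulation.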

 

 

We compare GSA-TTS with the following TTS models:

1. Meta-StyleSpeech: Multi-Speaker Adaptive Text-to-Speech Generation [Official Demo page] [Official Code]

2. YourTTS: Towards Zero-Shot Multi-Speaker TTS and Zero-Shot Voice Conversion for everyone [Official Demo page] [Official Code]

 

ZERO-SHOT TTS (VCTK)

Sentence 1 (p294):

Domestic abuse is related to power and control.

Reference Audio

Text: Some have accepted it as a miracle without physical explanation.

GT

ASR: Domestic abuse is related to power and control.

GT(voc.)

ASR: Domestic abuse is related to power and control.

GSA-TTS(Ours)

ASR: Domestic abuse is related to power and control.

Meta-StyleSpeech

ASR: The Mestika view is related to power and control.

YourTTS

ASR: Genastic abuse is related to power and control.

Sentence 2 (p248):

He is a young man with a new deal.

Reference Audio

Text: According to the criteria, he is qualified for Scotland.

GT

ASR: He's a young man with a new deal.

GT(voc.)

ASR: He's a young man with a new deal.

GSA-TTS(Ours)

ASR: He is a young man with a new deal.

Meta-StyleSpeech

ASR: Yes, a young man with a new deal.

YourTTS

ASR: You see young men with a new deal.

Sentence 3 (p302):

Lord Levy said he'd started his life with nothing.

Reference Audio

Text: Glasgow deserved their win, but we made them look good.

GT

ASR: Lord Levy said he started his life with nothing.

GT(voc.)

ASR: Lord Levy said he started his life with nothing.

GSA-TTS(Ours)

ASR: Lord Levy said he'd started his life with nothing.

Meta-StyleSpeech

ASR: Lord Letty said he'd started his life with nothing.

YourTTS

ASR: What Levy said he'd started his life with nothing.

Sentence 4 (p238):

He was dead on arrival at hospital.

Reference Audio

Text: The most important thing in theatre is to listen.

GT

ASR: He was dead on arrival at hospital.

GT(voc.)

ASR: He was dead on arrival at hospital.

GSA-TTS(Ours)

ASR: He was dead on arrival at hospital.

Meta-StyleSpeech

ASR: He was dead on her idol at hospital.

YourTTS

ASR: He was dead on a rival at hospital.

Sentence 5 (p347):

The Old Firm are going nowhere.

Reference Audio

Text: It was a lot of hard work, but it wasn't difficult.

GT

ASR: The old firm are going nowhere.

GT(voc.)

ASR: The old firm are going nowhere.

GSA-TTS(Ours)

ASR: The old firm are going nowhere.

Meta-StyleSpeech

ASR: The old St. Octanequih flower.

YourTTS

ASR: A old farmer going nowhere.
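The ASR transcripts listed with each sample can be scored against the target sentence to quantify intelligibility. The sketch below computes a word error rate (WER) for Sentence 4. The page does not state which metric or text normalization was used, so the lowercasing, the punctuation stripping, and the function names `normalize` and `word_error_rate` are assumptions made for illustration.

```python
import re

def normalize(text):
    # Assumed normalization: lowercase, keep letters, digits and apostrophes.
    return re.sub(r"[^a-z0-9' ]+", " ", text.lower()).split()

def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))  # distances for an empty reference prefix
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1] / len(ref)

# Target text and ASR transcripts of Sentence 4 (p238), copied from this page.
target = "He was dead on arrival at hospital."
transcripts = {
    "GSA-TTS": "He was dead on arrival at hospital.",
    "Meta-StyleSpeech": "He was dead on her idol at hospital.",
    "YourTTS": "He was dead on a rival at hospital.",
}
for system, asr in transcripts.items():
    print(f"{system}: WER = {word_error_rate(target, asr):.3f}")
```

A per-sentence score like this is noisy, since the ASR system itself makes errors (several GT transcripts on this page differ from the target text), so such numbers are best read as averages over many utterances.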

ABLATION STUDY

Sentence 1 (p347):

I have just got to get back.

Reference Audio

Text: To which the only answer is, don't hold your breath.

GT

ASR: I have just got to go back.

GT(voc.)

ASR: I have just got to go back.

GSA-TTS

ASR: I have just got to get back.

w/o LSE

ASR: I have just got to get back.

w/o GSE

ASR: I have just got to get back.

w/o Style Segments

ASR: I have just got to get back.

FastPitch+MSE

ASR: I have just got to get back.

Sentence 2 (p248):

You were very proud that day to be among their friends.

Reference Audio

Text: That's the principal difference between an artist and a dog.

GT

ASR: You were very proud that day to be among the way friends.

GT(voc.)

ASR: You were very proud that day to be among the different.

GSA-TTS

ASR: You were very proud that day to be among their friends.

w/o LSE

ASR: You were very proud that day to be among their friends.

w/o GSE

ASR: You are very proud that they to be among their friends

w/o Style Segments

ASR: You are very proud that day to be among their friends.

FastPitch+MSE

ASR: You were very proud that day to be among difference.

Sentence 3 (p335):

Mummy believed in one thing above all.

Reference Audio

Text: However, we had no failures in our side.

GT

ASR: Mummy believes in one thing above all.

GT(voc.)

ASR: Mommy believed in one thing above all.

GSA-TTS

ASR: Mummy believed in one thing above all.

w/o LSE

ASR: Mommy believed in one thing above all.

w/o GSE

ASR: Mummy believed in one thing above all.

w/o Style Segments

ASR: Mummy believed in one thing above all.

FastPitch+MSE

ASR: By me bill eves and one it bye bye.

Sentence 4 (p326):

It had until tomorrow to comply.

Reference Audio

Text: We should savour this, because it's unlikely ever to happen again.

GT

ASR: It had until tomorrow to comply.

GT(voc.)

ASR: It had until tomorrow to comply.

GSA-TTS

ASR: It had until tomorrow to comply.

w/o LSE

ASR: It had until tomorrow to come play.

w/o GSE

ASR: It head until tomorrow to comply.

w/o Style Segments

ASR: It had until tomorrow to comply.

FastPitch+MSE

ASR: It had in sealed tomorrow to comply.

Sentence 5 (p347):

Altman has every right to be bitter.

Reference Audio

Text: However, there are signs of dissent among his colleagues.

GT

ASR: Ultima has every right to be better.

GT(voc.)

ASR: Ultima has every right to be better.

GSA-TTS

ASR: Altman has every right to be bitter.

w/o LSE

ASR: Altman has every right to be bitter.

w/o GSE

ASR: Altman has every right to be bitter.

w/o Style Segments

ASR: Altman has every right to be bitter.

FastPitch+MSE

ASR: Woman has every right to be possessed.

CONTROLLABILITY

(Section 4.3)

Sentence 1 (p294):

Our traditions and cultures remain the same.

Reference Audio

Text: Ask her to bring these things with her from the store.

GT

ASR: Our traditions and cultures remain the same.

GT(voc.)

ASR: Our traditions and cultures remain the same.

GSA-TTS

ASR: Out traditions and cultures remain the same.

GSA-TTS(Control)

ASR: Our traditions and cultures remain the same.

Sentence 2 (p326):

Brown is an interesting man, but he's not desperate.

Reference Audio

Text: It's a record label, not a form of music.

GT

ASR: Brown is an interesting man but he is not desperate.

GT(voc.)

ASR: Brown is an interesting man but he is not desperate.

GSA-TTS

ASR: Round is an interesting man but he is not desperate.

GSA-TTS(Control)

ASR: Brown is an interesting man but he is not desperate.