PromptTTS 2: Describing and Generating Voices with Text Prompt

Abstract

Speech conveys more information than text, as the same word can be uttered in various voices to convey diverse information. Compared to traditional text-to-speech (TTS) methods that rely on speech prompts (reference speech) for voice variability, text prompts (descriptions) are more user-friendly, since speech prompts can be hard to find or may not exist at all. TTS approaches based on text prompts face two main challenges: 1) the one-to-many problem, where not all details of voice variability can be described in the text prompt, and 2) the limited availability of text prompt datasets, since writing text prompts for speech requires vendors and incurs a large data labeling cost. In this work, we introduce PromptTTS 2 to address these challenges with a variation network that provides the voice variability information not captured by text prompts, and a prompt generation pipeline that uses large language models (LLMs) to compose high-quality text prompts. Specifically, the variation network predicts the representation extracted from the reference speech (which contains full information about voice variability) based on the text prompt representation. The prompt generation pipeline generates text prompts for speech with a speech language understanding model that recognizes voice attributes (e.g., gender, speed) from speech and a large language model that formulates text prompts based on the recognition results. Experiments on a large-scale (44K hours) speech dataset demonstrate that, compared to previous works, PromptTTS 2 generates voices more consistent with text prompts and supports the sampling of diverse voice variability, thereby offering users more choices in voice generation. Additionally, the prompt generation pipeline produces high-quality text prompts, eliminating the large labeling cost. The demo page of PromptTTS 2 is online.
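The prompt generation pipeline described in the abstract can be pictured with the minimal sketch below. It is an illustration only, not the authors' implementation: the attribute classes are taken from the attributes shown on this page, while the names `ATTRIBUTE_CLASSES`, `TEMPLATES`, and `compose_prompt` are hypothetical, and a pair of hand-written templates stands in for the large language model. The attribute labels are assumed to have already been recognized by a speech language understanding model.

```python
import random

# Hypothetical attribute classes, mirroring the attributes shown on this page.
ATTRIBUTE_CLASSES = {
    "gender": ("male", "female"),
    "speed": ("slow", "normal", "fast"),
    "pitch": ("low", "normal", "high"),
    "volume": ("low", "normal", "high"),
}

# Hand-written templates standing in for sentences an LLM would compose.
TEMPLATES = (
    "Please generate a {gender} voice with {pitch} pitch, "
    "{speed} speed and {volume} volume.",
    "I want a {gender} speaker who talks at a {speed} speed "
    "with {pitch} pitch and {volume} volume.",
)


def compose_prompt(attributes, rng=None):
    """Turn recognized attribute labels into one text prompt.

    `attributes` maps each attribute name to a recognized class label,
    as a speech language understanding model might output it.
    """
    rng = rng or random.Random()
    # Reject unknown attribute names and unknown class labels.
    for name, value in attributes.items():
        if name not in ATTRIBUTE_CLASSES:
            raise ValueError(f"unknown attribute: {name}")
        if value not in ATTRIBUTE_CLASSES[name]:
            raise ValueError(f"unknown class for {name}: {value}")
    # Every attribute must be present, since the templates use all of them.
    missing = set(ATTRIBUTE_CLASSES) - set(attributes)
    if missing:
        raise ValueError(f"missing attributes: {sorted(missing)}")
    # Pick one phrasing at random to get varied prompts for the same labels.
    return rng.choice(TEMPLATES).format(**attributes)


if __name__ == "__main__":
    recognized = {"gender": "female", "speed": "slow",
                  "pitch": "low", "volume": "normal"}
    print(compose_prompt(recognized, random.Random(0)))
```

Choosing among several phrasings for the same labels mimics, in a very reduced form, how one set of recognized attributes can map to many differently worded text prompts.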

Content

1. Audio Samples
1.1 Attribute Control
1.2 Timbre Variation
1.3 Extension on Face2Voice

1. Audio Samples

We demonstrate the advantages of PromptTTS 2 through the following three aspects:

Attribute Control: we aim to show that we can control a specific attribute with text prompts of different meanings.

Timbre Variation: we aim to show that different sampling results of the variation network yield speech in different timbres, while the timbre stays the same when we change the wording of the text prompt (keeping the same intention), the text content, or the sampling results of the TTS backbone. In this case, the voice variability is mainly controlled by the variation network.

Extension on Face2Voice: we aim to show that we can synthesize speech that matches the facial image.

1.1 Attribute Control

The audio samples in this part are from PromptTTS 2, PromptTTS, and InstructTTS. They show that PromptTTS 2 can control a specific attribute with text prompts of different meanings.

1.1.1 Gender

Gender-1: Please ask a man with a normal voice to say: Fire a whole platoon, Major.

Gender-2: Please ask a woman with a normal voice to say: Fire a whole platoon, Major.

Model	Gender-1	Gender-2
PromptTTS 2
PromptTTS
InstructTTS

1.1.2 Speed

Speed-1: Please speak at a slow speed, gentleman: Him sorely and yet, it was but a woman's fancy, a passing fancy. She would become reconciled to the inevitable, as women do, and when her children came, she would grow accustomed to her sorrow. And her trouble would be forgotten in their laughter.

Speed-2: Please speak at a normal speed, gentleman: Him sorely and yet, it was but a woman's fancy, a passing fancy. She would become reconciled to the inevitable, as women do, and when her children came, she would grow accustomed to her sorrow. And her trouble would be forgotten in their laughter.

Speed-3: Please speak at a fast speed, gentleman: Him sorely and yet, it was but a woman's fancy, a passing fancy. She would become reconciled to the inevitable, as women do, and when her children came, she would grow accustomed to her sorrow. And her trouble would be forgotten in their laughter.

Model	Speed-1	Speed-2	Speed-3
PromptTTS 2
PromptTTS
InstructTTS

1.1.3 Pitch

Pitch-1: She said in a low pitch: But it is not with a view to distinction that you should cultivate this talent, if you consult your own happiness.

Pitch-2: She said in a normal pitch: But it is not with a view to distinction that you should cultivate this talent, if you consult your own happiness.

Pitch-3: She said in a high pitch: But it is not with a view to distinction that you should cultivate this talent, if you consult your own happiness.

Model	Pitch-1	Pitch-2	Pitch-3
PromptTTS 2
PromptTTS
InstructTTS

1.1.4 Volume

Volume-1: Generate a boy's voice with a low volume for me: But their health and strength, child; they can never stand the severe application.

Volume-2: Generate a boy's voice with a normal volume for me: But their health and strength, child; they can never stand the severe application.

Volume-3: Generate a boy's voice with a high volume for me: But their health and strength, child; they can never stand the severe application.

Model	Volume-1	Volume-2	Volume-3
PromptTTS 2
PromptTTS
InstructTTS

1.2 Timbre Variation

The audio samples in this part are all from PromptTTS 2. They show that different sampling results of the variation network yield speech in different timbres, while the timbre stays the same when we change the wording of the text prompt (keeping the same intention), the text content, or the sampling results of the TTS backbone. In this case, the voice variability is mainly controlled by the variation network.

1.2.1 Variation Network

We can change the timbre by altering the sampling results of the variation network while keeping the speech consistent with the intention of the text prompt.

Variation Network-[1, 2, 3]: I want a low pitched female voice: Jason went back sadly and told the heroes what he had heard, and they leapt onshore and searched till gone. At dawn, they found the body all rolled in dust and blood among the corpses of those monstrous beasts.

Variation Network-1	Variation Network-2	Variation Network-3
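The idea of drawing several samples from the variation network for one text prompt can be illustrated with the toy sketch below. It is not the model's actual network: `sample_variation` is a hypothetical stand-in that adds seeded Gaussian noise to a made-up prompt embedding, and it only shows that a fixed prompt with different sampling seeds gives different variation vectors, while a repeated seed reproduces the same one.

```python
import random


def sample_variation(prompt_embedding, seed, noise_scale=1.0):
    """Toy stand-in for sampling from a variation network.

    Returns a vector conditioned on the prompt embedding, perturbed by
    Gaussian noise drawn from a generator seeded with `seed`.
    """
    rng = random.Random(seed)
    return [x + noise_scale * rng.gauss(0.0, 1.0) for x in prompt_embedding]


# Hypothetical embedding of the prompt "I want a low pitched female voice".
prompt_embedding = [0.2, -0.5, 0.9]

# Same prompt, three different sampling seeds: three different vectors,
# standing in for the three timbres of Variation Network-[1, 2, 3].
samples = [sample_variation(prompt_embedding, seed=s) for s in (1, 2, 3)]

# Re-using a seed reproduces the same vector, i.e. the same voice.
repeat = sample_variation(prompt_embedding, seed=1)

if __name__ == "__main__":
    for i, vec in enumerate(samples, start=1):
        print(f"sample {i}:", [round(v, 3) for v in vec])
    print("seed 1 reproduced:", repeat == samples[0])
```

In this picture, the text prompt fixes what all samples have in common, and the sampled noise accounts for the remaining voice variability that the prompt does not describe.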

1.2.2 Text Prompt

Even if we reword the text prompt while keeping the same intention, the timbre is not altered, which is exactly what we aim to achieve.

Text Prompt-1: I want a low pitched female voice: Jason went back sadly and told the heroes what he had heard, and they leapt onshore and searched till gone. At dawn, they found the body all rolled in dust and blood among the corpses of those monstrous beasts.

Text Prompt-2: This madam talks to me with a deep voice: Jason went back sadly and told the heroes what he had heard, and they leapt onshore and searched till gone. At dawn, they found the body all rolled in dust and blood among the corpses of those monstrous beasts.

Text Prompt-3: Decrease the pitch of her voice for me: Jason went back sadly and told the heroes what he had heard, and they leapt onshore and searched till gone. At dawn, they found the body all rolled in dust and blood among the corpses of those monstrous beasts.

Text Prompt-1	Text Prompt-2	Text Prompt-3

1.2.3 Text Content

Even if we change the text content, the timbre is not altered, which is exactly what we aim to achieve.

Text-1: I want a low pitched female voice: Delaney had read one or two works on psychic phenomena and understood from them that spirit projection was not only quite feasible but far from uncommon.

Text-2: I want a low pitched female voice: And in this additional chapter to amplify and fortify, here and there, the result must necessarily be disconnected but a glance at the index will point the way to what is new.

Text-3: I want a low pitched female voice: If you wore the pink bonnet, I'll give it to you, and I'll back you up again, Mrs. Danvey. I think you might have done something with our member, as my father calls him, when you had him for so long in the house, but altogether.

Text Content-1	Text Content-2	Text Content-3

1.2.4 TTS Backbone

Changing the sampling results of the TTS backbone does not alter the timbre.

TTS Backbone-[1, 2, 3]: I want a low pitched female voice: Jason went back sadly and told the heroes what he had heard, and they leapt onshore and searched till gone. At dawn, they found the body all rolled in dust and blood among the corpses of those monstrous beasts.

TTS Backbone-1	TTS Backbone-2	TTS Backbone-3

1.3 Extension on Face2Voice

The audio samples in this part are from PromptTTS 2, SP-FaceVC, and the ground-truth voice (Ground-Truth). They show that PromptTTS 2 can synthesize speech that matches the facial image.

PromptTTS 2: Face-1: That summer is immigration, however, being mainly from the free states, greatly changed the relative strength of the two parties.

SP-FaceVC: Face-1: That summer is immigration, however, being mainly from the free states, greatly changed the relative strength of the two parties.

Ground-Truth: Face-1: That includes requiring background checks for students and other school employees, rights for student expulsion.

Face-1	PromptTTS 2	SP-FaceVC	Ground-Truth

PromptTTS 2: Face-2: There was an average cost per lane for meter operation of 22 cents a year and each meter took care of an average of 17 lamps.

SP-FaceVC: Face-2: There was an average cost per lane for meter operation of 22 cents a year and each meter took care of an average of 17 lamps.

Ground-Truth: Face-2: We currently know it. This is important that we listen to the people will be negatively impacted and everyone who cares deeply about the direction this budget will take.

Face-2	PromptTTS 2	SP-FaceVC	Ground-Truth

PromptTTS 2: Face-3: That summer is immigration, however, being mainly from the free states, greatly changed the relative strength of the two parties.

SP-FaceVC: Face-3: That summer is immigration, however, being mainly from the free states, greatly changed the relative strength of the two parties.

Ground-Truth: Face-3: Non-fiscal provision for me that would have devastating effects for our citizens. This budget also includes significant change.

Face-3	PromptTTS 2	SP-FaceVC	Ground-Truth

PromptTTS 2: Face-4: There was an average cost per lane for meter operation of 22 cents a year and each meter took care of an average of 17 lamps.

SP-FaceVC: Face-4: There was an average cost per lane for meter operation of 22 cents a year and each meter took care of an average of 17 lamps.

Ground-Truth: Face-4: I am taking those opportunities away from our kids and giving them to private schools that all interests.

Face-4	PromptTTS 2	SP-FaceVC	Ground-Truth