PromptTTS: controllable text-to-speech with text descriptions


Authors

* Equal contribution.

^ Corresponding author.

Abstract

Using a text description as a prompt to guide the generation of text or images (e.g., GPT-3 or DALLE-2) has drawn wide attention recently. Beyond text and image generation, in this work we explore the possibility of using text descriptions to guide speech synthesis. To this end, we develop a text-to-speech (TTS) system, dubbed PromptTTS, that takes a prompt with both style and content descriptions as input and synthesizes the corresponding speech. Specifically, PromptTTS consists of a style encoder and a content encoder, which extract the corresponding representations from the prompt, and a speech decoder, which synthesizes speech from the extracted style and content representations. Compared with previous work in controllable TTS, which requires users to have acoustic knowledge to understand style factors such as prosody and pitch, PromptTTS is more user-friendly because text descriptions are a more natural way to express speech style (e.g., "A lady whispers to her friend slowly"). Since no TTS dataset with prompts exists, we construct and release a dataset containing prompts with style and content information and the corresponding speech to benchmark the task. Experiments show that PromptTTS can generate speech with precise style control and high speech quality. Audio samples and our dataset are publicly available.

Content

1. Dataset
1.1 PromptSpeech
2. Audio Samples
2.1 Comparison of Different Speech Styles
2.2 Comparison of Different Style Prompts
2.3 Comparison of Different Systems

1. Dataset

Since no existing TTS dataset includes prompts, we construct and release a dataset called PromptSpeech, which consists of speech and the corresponding prompts. We synthesize speech with 5 different style factors (gender, pitch, speaking speed, volume, and emotion) using a commercial TTS API. The emotion factor has 5 categories and the gender factor has 2 categories. For the remaining style factors (pitch, speaking speed, and volume), we extract the factor values from the speech with signal processing tools and divide the speech into 3 categories (high/normal/low) by proportion. Because the style of speech can be described in natural language, we ask experts to write style prompts for each category. To further augment the style prompts, we use SimBERT to generate additional style prompts with similar semantics. Each speech clip and the prompt describing its style and content form a paired example in our dataset.
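The split of a continuous style factor into high/normal/low categories can be sketched as rank-based binning. This is only an illustration: the page does not state the actual proportions, so the sketch assumes three equally sized groups, and the function name `bin_by_proportion` and the pitch values are hypothetical.

```python
from typing import List, Sequence


def bin_by_proportion(
    values: Sequence[float],
    labels: Sequence[str] = ("low", "normal", "high"),
) -> List[str]:
    """Label each value by its rank so the labels split the data into
    equally sized groups (an assumed proportion, not the authors')."""
    n = len(values)
    if n == 0:
        return []
    # Indices sorted by value; ties keep their original order.
    order = sorted(range(n), key=lambda i: values[i])
    result = [""] * n
    for rank, idx in enumerate(order):
        # rank * k // n maps ranks 0..n-1 onto bins 0..k-1.
        result[idx] = labels[rank * len(labels) // n]
    return result


# Hypothetical mean-pitch values (Hz) for six utterances.
pitches = [110.0, 220.0, 180.0, 95.0, 260.0, 150.0]
categories = bin_by_proportion(pitches)
for pitch, category in zip(pitches, categories):
    print(f"{pitch:6.1f} Hz -> {category}")
```

The same function could be applied to speaking speed or volume once those values have been extracted from the audio; different proportions would require replacing the rank-to-bin mapping.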

Since there may be a gap between synthesized and real speech, in addition to the dataset described above (the synthesized version), we also construct a dataset (the real version) from real speech data in LibriTTS using a similar construction process. Because emotion is not available in LibriTTS, the real version of PromptSpeech has only 4 style factors. As shown in the table below, we split both versions of PromptSpeech into training and test sets, which are publicly released. Experiments are conducted on PromptSpeech to verify the effectiveness of PromptTTS.

1.1 PromptSpeech

You can download both versions of PromptSpeech from the table below; the numbers in the table give the amount of data in each download.

Version      Training Prompt  Test Prompt  Speech
Synthesized  155092           5032         160124
Real         26588            1305         27893

2. Audio Samples

The following audio samples are generated on both versions of PromptSpeech by PromptTTS and the two-stage baseline system. The samples are organized in 3 parts:

Comparison of Different Speech Styles: we show that a specific style factor can be controlled with prompts of different meanings.

Comparison of Different Style Prompts: we show that PromptTTS can synthesize speech in the same style from different prompts with the same meaning.

Comparison of Different Systems: we show that PromptTTS synthesizes speech whose style is more consistent with the intent of the prompt than the baseline.

2.1 Comparison of Different Speech Styles

The audio samples in this part are generated by PromptTTS using prompts with different meanings to control a specific style factor.
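Each sample line on this page appears to follow the pattern "label: style prompt: content". A minimal sketch for splitting such a line into its three parts is shown here. The colon-separated layout is an assumption drawn from how the samples are displayed, not a documented input format of PromptTTS, and the names `Sample` and `parse_sample_line` are hypothetical.

```python
from typing import NamedTuple


class Sample(NamedTuple):
    label: str         # e.g. "Pitch-1" or a system name such as "Two-stage"
    style_prompt: str  # natural-language description of the speaking style
    content: str       # the text to be spoken


def parse_sample_line(line: str) -> Sample:
    """Split 'label: style prompt: content' at the first two colons.

    Any later colons are kept as part of the content.
    """
    parts = [part.strip() for part in line.split(":", 2)]
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected 'label: style: content', got: {line!r}")
    return Sample(*parts)


sample = parse_sample_line(
    "Pitch-1: His voice has a low pitch: What shall we do, he asked his father."
)
print(sample.label)
print(sample.style_prompt)
print(sample.content)
```

Splitting at only the first two colons keeps the function usable for content sentences that themselves contain a colon.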

2.1.1 Synthesized Version

2.1.1.1 Gender

Gender-1: Please ask a man with a normal voice to say: So Tom took his father's pipes and walked over the hill to farmer Bowser's house; for you must know that.

Gender-2: Please ask a woman with a normal voice to say: So Tom took his father's pipes and walked over the hill to farmer Bowser's house; for you must know that.

[Audio samples: Gender-1, Gender-2]

2.1.1.2 Pitch

Pitch-1: His voice has a low pitch: What shall we do, he asked his father.

Pitch-2: His voice has a normal pitch: What shall we do, he asked his father.

Pitch-3: His voice has a high pitch: What shall we do, he asked his father.

[Audio samples: Pitch-1, Pitch-2, Pitch-3]

2.1.1.3 Speaking Speed

Speaking Speed-1: Please speak at a slow speed, gentleman: This man Tom, Tom, the piper's son, was hungry when the day begun; he wanted a bun and asked for one, but soon found out that there were none.

Speaking Speed-2: Please speak at a normal speed, gentleman: This man Tom, Tom, the piper's son, was hungry when the day begun; he wanted a bun and asked for one, but soon found out that there were none.

Speaking Speed-3: Please speak at a fast speed, gentleman: This man Tom, Tom, the piper's son, was hungry when the day begun; he wanted a bun and asked for one, but soon found out that there were none.

[Audio samples: Speaking Speed-1, Speaking Speed-2, Speaking Speed-3]

2.1.1.4 Volume

Volume-1: Generate a boy's voice with a low volume for me: Tom did not like to steal, but he had no one to teach him to be honest, and so, under his father's guidance, he fell into bad ways.

Volume-2: Generate a boy's voice with a normal volume for me: Tom did not like to steal, but he had no one to teach him to be honest, and so, under his father's guidance, he fell into bad ways.

Volume-3: Generate a boy's voice with a high volume for me: Tom did not like to steal, but he had no one to teach him to be honest, and so, under his father's guidance, he fell into bad ways.

[Audio samples: Volume-1, Volume-2, Volume-3]

2.1.1.5 Emotion

Emotion-1: Look, that man says to his friend: Tom, Tom, the piper's son, stole a pig and away he run; the pig was eat and Tom was beat and Tom ran crying down the street.

Emotion-2: Look, that man shouts to his friend: Tom, Tom, the piper's son, stole a pig and away he run; the pig was eat and Tom was beat and Tom ran crying down the street.

Emotion-3: Look, that man whispers to his friend: Tom, Tom, the piper's son, stole a pig and away he run; the pig was eat and Tom was beat and Tom ran crying down the street.

Emotion-4: Look, that man says cheerfully to his friend: Tom, Tom, the piper's son, stole a pig and away he run; the pig was eat and Tom was beat and Tom ran crying down the street.

Emotion-5: Look, that man says sadly to his friend: Tom, Tom, the piper's son, stole a pig and away he run; the pig was eat and Tom was beat and Tom ran crying down the street.

[Audio samples: Emotion-1, Emotion-2, Emotion-3, Emotion-4, Emotion-5]

2.1.2 Real Version

2.1.2.1 Gender

Gender-1: Provide me with a male voice: Go hungry, replied Barney, unless you want to take my pipes and play in the village.

Gender-2: Provide me with a female voice: Go hungry, replied Barney, unless you want to take my pipes and play in the village.

[Audio samples: Gender-1, Gender-2]

2.1.2.2 Pitch

Pitch-1: She said in a deep voice: There was not a worse vagabond in Shrewsbury than old Barney the piper.

Pitch-2: She said in a sharp voice: There was not a worse vagabond in Shrewsbury than old Barney the piper.

[Audio samples: Pitch-1, Pitch-2]

2.1.2.3 Speaking Speed

Speaking Speed-1: I want to hear that girl talk slowly: Tom, the piper's son

Speaking Speed-2: I want to hear that girl talk quickly: Tom, the piper's son

[Audio samples: Speaking Speed-1, Speaking Speed-2]

2.1.2.4 Volume

Volume-1: Let's speak in a low voice, ladies!: Tom did not like to steal, but he had no one to teach him to be honest, and so, under his father's guidance, he fell into bad ways.

Volume-2: Let's speak in a high voice, ladies!: Tom did not like to steal, but he had no one to teach him to be honest, and so, under his father's guidance, he fell into bad ways.

[Audio samples: Volume-1, Volume-2]

2.2 Comparison of Different Style Prompts

The audio samples in this part are generated by PromptTTS using different prompts with the same meaning to synthesize speech in the same style.

2.2.1 Synthesized Version

Gender-1: Ask a boy to speak with normal speed and volume: Tom, the piper's son.

Gender-2: I need a man with a normal voice to talk: Tom, the piper's son.

[Audio samples: Gender-1, Gender-2]

Pitch-1: The boy's voice is very sharp: Tom, the piper's son.

Pitch-2: Synthesize a very squeaky voice and ask him to say: Tom, the piper's son.

[Audio samples: Pitch-1, Pitch-2]

Speaking Speed-1: Generate a male voice with slow speaking speed, please: Tom, the piper's son.

Speaking Speed-2: He spoke even more slowly than usual: Tom, the piper's son.

[Audio samples: Speaking Speed-1, Speaking Speed-2]

Volume-1: Gentlemen, please talk softly: Tom, the piper's son.

Volume-2: Could you please help me generate a manlike voice to talk with me in a low voice: Tom, the piper's son.

[Audio samples: Volume-1, Volume-2]

2.2.2 Real Version

Pitch-1: Please let a lady speak for me and she has a very deep voice: He never did any work except to play the pipes, and he played so badly that few pennies ever found their way into his pouch.

Pitch-2: The woman has a low-pitched voice: He never did any work except to play the pipes, and he played so badly that few pennies ever found their way into his pouch.

[Audio samples: Pitch-1, Pitch-2]

Volume-1: I want a loud female voice: He never did any work except to play the pipes, and he played so badly that few pennies ever found their way into his pouch.

Volume-2: This madam talks to me with high volume: He never did any work except to play the pipes, and he played so badly that few pennies ever found their way into his pouch.

[Audio samples: Volume-1, Volume-2]

2.3 Comparison of Different Systems

The audio samples in this part are generated by PromptTTS and the two-stage baseline system (denoted as Two-stage) from the same prompts; they show that PromptTTS synthesizes speech whose style is more consistent with the intent of the prompt than the baseline.

2.3.1 Synthesized Version

PromptTTS: Listen, she increased her speaking speed: There was not a worse vagabond in Shrewsbury than old Barney the piper.

Two-stage: Listen, she increased her speaking speed: There was not a worse vagabond in Shrewsbury than old Barney the piper.

[Audio samples: PromptTTS, Two-stage]

PromptTTS: Reduce the volume of her voice for me: There was not a worse vagabond in Shrewsbury than old Barney the piper.

Two-stage: Reduce the volume of her voice for me: There was not a worse vagabond in Shrewsbury than old Barney the piper.

[Audio samples: PromptTTS, Two-stage]

2.3.2 Real Version

PromptTTS: Please find a girl to talk at a higher pitch than usual: Barney had one son, named Tom; and they lived all alone in a little hut away at the end of the village street, for Tom's mother had died when he was a baby.

Two-stage: Please find a girl to talk at a higher pitch than usual: Barney had one son, named Tom; and they lived all alone in a little hut away at the end of the village street, for Tom's mother had died when he was a baby.

[Audio samples: PromptTTS, Two-stage]
