UWSpeech: Speech to Speech Translation for Unwritten Languages

arXiv: 2006.07926 (accepted at AAAI 2021)



Existing speech-to-speech translation systems rely heavily on the text of the target language: they usually either translate source-language speech to target text and then synthesize target speech from that text, or translate directly to target speech while using target text for auxiliary training. However, those methods cannot be applied to unwritten target languages, which have no written text or phonemes available. In this paper, we develop a translation system for unwritten languages, named UWSpeech, which converts target unwritten speech into discrete tokens with a converter, translates source-language speech into target discrete tokens with a translator, and finally synthesizes target speech from the target discrete tokens with an inverter. We propose a method called XL-VAE, which enhances the vector quantized variational autoencoder (VQ-VAE) with cross-lingual (XL) speech recognition, to train the converter and inverter of UWSpeech jointly. Experiments on the Fisher Spanish-English conversation translation dataset show that UWSpeech outperforms the direct translation and VQ-VAE baselines by about 16 and 10 BLEU points respectively, which demonstrates the advantages and potential of UWSpeech.
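To make the idea of "discrete tokens" more concrete, here is a minimal sketch of the generic vector-quantization step used in VQ-VAE-style models: each frame-level vector is replaced by the index of its nearest codebook entry (the converter direction), and indices are mapped back to vectors by table lookup (the first step of the inverter direction). This is not the UWSpeech implementation and does not include the cross-lingual recognition part of XL-VAE; the function names, the 2-dimensional vectors, and the 3-entry codebook are all assumptions chosen for illustration.

```python
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(frames: List[Vector], codebook: List[Vector]) -> List[int]:
    """Map each frame vector to the index of its nearest codebook entry."""
    return [
        min(range(len(codebook)), key=lambda k: squared_distance(frame, codebook[k]))
        for frame in frames
    ]


def dequantize(tokens: List[int], codebook: List[Vector]) -> List[Vector]:
    """Look up the codebook vector for each discrete token."""
    return [codebook[t] for t in tokens]


# Hypothetical toy codebook and encoder outputs (illustrative values only).
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]
frames = [[0.1, -0.1], [0.9, 1.2], [-0.8, 0.7], [0.2, 0.1]]

tokens = quantize(frames, codebook)
print(tokens)  # [0, 1, 2, 0]
```

In a pipeline like the one described above, a translator would be trained to predict such token sequences from source-language speech, and an inverter would turn the predicted tokens back into target speech.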

Audio Samples

All audio samples use Griffin-Lim as the vocoder. The translation direction is Spanish to English.

Case 1

Source(Spanish): Yo no entendí lo que ella dijo.

Target(English): I didn’t understand what she said.

Direct Translation: I don't know.

VQ-VAE: I didn't understand.

UWSpeech: I didn't understand what she say.

Case 2

Source(Spanish): Qué tal, ¿de dónde eres?

Target(English): How’s it going, where are you from?

Direct Translation: Had a price.

VQ-VAE: Like are you there are you from?

UWSpeech: How are you, where are you from?

Case 3

Source(Spanish): Yo soy puertoriqueña.

Target(English): I am Puerto Rican.

Direct Translation: Ah.

VQ-VAE: I'm from.

UWSpeech: I'm from Puerto Rico.

Case 4

Source(Spanish): Halo, buenas noches.

Target(English): Hello good evening.

Direct Translation: And a.

VQ-VAE: Hello video.

UWSpeech: Hello good evening.
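For readers who want a rough, text-level feel for how outputs like these compare with the target, the sketch below computes a simplified unigram score (clipped unigram precision times a brevity penalty) on the Case 1 transcripts. It is only a toy approximation of one ingredient of BLEU: it is not the evaluation protocol behind the BLEU numbers quoted in the abstract, and the tokenizer and function names are assumptions made for illustration.

```python
import math
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase, normalize apostrophes, drop basic punctuation, split on spaces."""
    text = text.lower().replace("\u2019", "'")
    for mark in ".,?!":
        text = text.replace(mark, "")
    return text.split()


def unigram_score(hypothesis: str, reference: str) -> float:
    """Clipped unigram precision multiplied by a BLEU-style brevity penalty."""
    hyp, ref = tokenize(hypothesis), tokenize(reference)
    if not hyp:
        return 0.0
    hyp_counts, ref_counts = Counter(hyp), Counter(ref)
    # Each hypothesis word is credited at most as often as it occurs in the reference.
    clipped = sum(min(count, ref_counts[word]) for word, count in hyp_counts.items())
    precision = clipped / len(hyp)
    # Penalize hypotheses that are shorter than the reference.
    penalty = 1.0 if len(hyp) >= len(ref) else math.exp(1.0 - len(ref) / len(hyp))
    return precision * penalty


reference = "I didn\u2019t understand what she said."
outputs = {
    "Direct Translation": "I don't know.",
    "VQ-VAE": "I didn't understand.",
    "UWSpeech": "I didn't understand what she say.",
}
scores = {name: unigram_score(text, reference) for name, text in outputs.items()}
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

The brevity penalty matters here: without it, a short but correct fragment such as the VQ-VAE output would score a perfect precision despite omitting half of the sentence.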
