ViT-TTS: Visual Text-to-Speech with Scalable Diffusion Transformer

Abstract

Text-to-speech (TTS) has improved remarkably in performance, particularly with the advent of Denoising Diffusion Probabilistic Models (DDPMs). However, the perceived quality of audio depends not only on its content, pitch, rhythm, and energy, but also on the physical environment. To address this, we propose the Visual TTS task, which converts written text and target environmental images into audio that matches the target environment. Despite its benefits, this task presents two challenges: 1) the high cost of annotating high-quality target environmental images and audio leads to a scarcity of training data, and 2) modeling room acoustic information and integrating it with the audio properties to produce realistic target environmental audio is difficult. In this work, we introduce self-supervised pre-training techniques for both the encoder and decoder, along with a transformer-based DDPM. Specifically, we pre-train the encoder and decoder on a large volume of unlabeled data with a masked prediction loss to mitigate data scarcity. Additionally, we show that the WaveNet inductive bias is not crucial to the performance of DDPMs and that it can readily be replaced with transformers. Thanks to the scaling properties of transformers, model capacity can be easily expanded to better model room acoustic data and generate high-quality audio that matches the target environment. Experimental results demonstrate that our method outperforms the WaveNet backbone and a cascaded model (composed of TTS and visual acoustic matching) on the Visual TTS task, achieving state-of-the-art performance and significant results in low-resource scenarios (15min/1h/2h). https://ViT-TTS.github.io/

A. Preliminary Analyses

Test-Seen

Recording WaveNet Transformer-S Transformer-M Transformer-L Transformer-XL
Sample 1:
Target RGB/Depth Image:
Reference Text: when you argue about the nature of god apart from the question of justification you may be as profound as you like
Sample 2:
Target RGB/Depth Image:
Reference Text: the wharves of brooklyn and every part of new york bordering the east river were crowded with curiosity seekers
Sample 3:
Target RGB/Depth Image:
Reference Text: the top floor belongs to miles mc laren

Test-Unseen

Recording WaveNet Transformer-S Transformer-M Transformer-L Transformer-XL
Sample 1:
Target RGB/Depth Image:
Reference Text: the task will not be difficult returned david hesitating though i greatly fear your presence would rather increase than mitigate his unhappy fortunes
Sample 2:
Target RGB/Depth Image:
Reference Text: by this time the two gentlemen had reached the palings and had got down from their horses it was plain they meant to come in
Sample 3:
Target RGB/Depth Image:
Reference Text: im going to see mister marshall said kenneth and discover what i can do to assist you thank you sir

B. Model Performances

Test-Seen

Recording DiffSpeech ProDiff Visual-DiffSpeech Cascaded ViT-TTS
Sample 1:
Target RGB/Depth Image:
Reference Text: when you argue about the nature of god apart from the question of justification you may be as profound as you like
Sample 2:
Target RGB/Depth Image:
Reference Text: the wharves of brooklyn and every part of new york bordering the east river were crowded with curiosity seekers
Sample 3:
Target RGB/Depth Image:
Reference Text: the top floor belongs to miles mc laren

Test-Unseen

Recording DiffSpeech ProDiff Visual-DiffSpeech Cascaded ViT-TTS
Sample 1:
Target RGB/Depth Image:
Reference Text: the task will not be difficult returned david hesitating though i greatly fear your presence would rather increase than mitigate his unhappy fortunes
Sample 2:
Target RGB/Depth Image:
Reference Text: by this time the two gentlemen had reached the palings and had got down from their horses it was plain they meant to come in
Sample 3:
Target RGB/Depth Image:
Reference Text: im going to see mister marshall said kenneth and discover what i can do to assist you thank you sir

C. Low Resource Evaluation

Test-Seen

Recording Finetune-1h Finetune-2h Finetune-5h
Sample 1:
Target RGB/Depth Image:
Reference Text: when you argue about the nature of god apart from the question of justification you may be as profound as you like
Sample 2:
Target RGB/Depth Image:
Reference Text: the wharves of brooklyn and every part of new york bordering the east river were crowded with curiosity seekers
Sample 3:
Target RGB/Depth Image:
Reference Text: the top floor belongs to miles mc laren

Test-Unseen

Recording Finetune-1h Finetune-2h Finetune-5h
Sample 1:
Target RGB/Depth Image:
Reference Text: the task will not be difficult returned david hesitating though i greatly fear your presence would rather increase than mitigate his unhappy fortunes
Sample 2:
Target RGB/Depth Image:
Reference Text: by this time the two gentlemen had reached the palings and had got down from their horses it was plain they meant to come in
Sample 3:
Target RGB/Depth Image:
Reference Text: im going to see mister marshall said kenneth and discover what i can do to assist you thank you sir