Tell Your Story: Text-Driven Face Video Synthesis With High Diversity via Adversarial Learning
Xia Hou, Meng Sun, Wenfeng Song
SPS
Face synthesis is a rapidly growing area of research in computer vision. Text-driven face synthesis is particularly flexible, but challenges remain in fusing the semantics of text and images and in generating diverse faces. To address these challenges, we propose a cross-modality adversarial learning framework that generates highly diverse face videos corresponding to given text descriptions. We encode text and images into a common latent space and align text and image features to control the synthesis of face attributes. We design a novel auto-encoder with a face identity discriminator that enlarges the margin between different individuals, increasing the diversity of the generated faces while maintaining semantic coherence between text and images. Our method is validated on the recently released Multimodal VoxCeleb dataset. Our code is publicly available at https://github.com/sunmeng7/TYS.git.
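To illustrate the margin-enlarging idea behind the identity discriminator, the sketch below shows a generic hinge-style loss on cosine similarity between identity embeddings: same-identity pairs are pulled together, while different identities are pushed apart by at least a margin. The function names, the hinge form, and the margin value are our own illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def cosine_sim(a, b):
    # Cosine similarity between two embedding vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def identity_margin_loss(emb_a, emb_b, same_identity, margin=0.5):
    # Hypothetical hinge loss in cosine space:
    #  - same identity: penalize similarity below 1
    #  - different identity: penalize similarity above (1 - margin)
    sim = cosine_sim(emb_a, emb_b)
    if same_identity:
        return max(0.0, 1.0 - sim)
    return max(0.0, sim - (1.0 - margin))

rng = np.random.default_rng(0)
x = rng.normal(size=128)
print(identity_margin_loss(x, x, True))   # identical embeddings, same identity -> 0.0
print(identity_margin_loss(x, x, False))  # identical embeddings, different identity -> 0.5
```

A margin of this kind widens the separation between individuals in the embedding space, which is one way to encourage diversity among generated faces.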