• Researchers at Alibaba have developed an AI system called EMO that can animate a single portrait photo and generate videos of the person talking or singing in a lifelike fashion.
  • EMO uses diffusion models to generate video frames directly from audio, bypassing intermediate representations such as 3D face models or facial landmarks, which lets it capture subtle facial motions and individual speaking styles.
  • It was trained on over 250 hours of talking head videos and can generate fluid and expressive movements matching an input audio track.
  • In the authors' experiments, EMO outperforms existing methods on metrics of video quality, identity preservation, and expressiveness.
  • Beyond speech, EMO can also animate singing portraits with synchronized mouth shapes and facial expressions.
  • Potential applications include personalized video content generated from a photo and audio, but ethical concerns around impersonation and misinformation remain.
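The core idea behind the diffusion approach above is iterative denoising conditioned on audio: the model starts from random noise and repeatedly refines it toward a video frame consistent with the input sound. The sketch below is a toy illustration of that loop only, not EMO's actual architecture; `toy_denoiser`, the audio feature vector, and the update rule are all hypothetical stand-ins for a learned neural network.

```python
import numpy as np

def toy_denoiser(x, t, audio_feat):
    """Stand-in for a learned denoising network (hypothetical).

    A real model would be a neural net trained to predict the noise in x
    given the timestep t and audio conditioning; here we just compute the
    residual between x and a toy 'target frame' derived from the audio.
    """
    target = np.outer(audio_feat, audio_feat)  # toy audio-conditioned target
    return x - target  # pretend this is the predicted noise

def reverse_diffusion(audio_feat, steps=50, shape=(8, 8), seed=0):
    """Toy reverse-diffusion loop: start from noise, denoise step by step."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(shape)  # pure noise 'frame'
    for t in range(steps, 0, -1):
        eps_hat = toy_denoiser(x, t, audio_feat)
        x = x - (1.0 / steps) * eps_hat  # small step toward the data manifold
    return x

# Each audio clip's features would steer the denoising toward matching frames.
feat = np.linspace(0.0, 1.0, 8)
frame = reverse_diffusion(feat)
```

In the real system the denoiser is a large trained network and the conditioning comes from learned audio embeddings, but the structure of the loop (noise in, conditioned refinement over many steps, frame out) is the same general pattern used by diffusion-based video generators.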