• Researchers at Alibaba have developed an AI system called EMO that can animate a single portrait photo and generate videos of the person talking or singing in a lifelike fashion.
  • EMO uses diffusion models to generate video frames directly from audio, bypassing intermediate representations such as 3D face models or facial landmarks, which lets it capture subtle facial motions and individual speaking styles.
  • It was trained on over 250 hours of talking head videos and can generate fluid and expressive movements matching an input audio track.
  • In the authors' experiments, EMO outperforms existing methods on metrics of video quality, identity preservation, and expressiveness.
  • Beyond speech, EMO can also animate singing portraits with synchronized mouth shapes and facial expressions.
  • Potential applications include personalized video content generated from a photo and audio, but ethical concerns around impersonation and misinformation remain.
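The core idea behind the diffusion approach above is iterative denoising conditioned on audio: the model starts from random noise and repeatedly refines it toward a video frame consistent with the input sound. The sketch below is a toy illustration of that loop only, not EMO's actual architecture; `toy_denoiser`, the audio feature vector, and the update rule are all hypothetical stand-ins for a learned neural network.

```python
import numpy as np

def toy_denoiser(x, t, audio_feat):
    """Stand-in for a learned denoising network (hypothetical).

    A real model would be a neural net trained to predict the noise in x
    given the timestep t and audio conditioning; here we just compute the
    residual between x and a toy 'target frame' derived from the audio.
    """
    target = np.outer(audio_feat, audio_feat)  # toy audio-conditioned target
    return x - target  # pretend this is the predicted noise

def reverse_diffusion(audio_feat, steps=50, shape=(8, 8), seed=0):
    """Toy reverse-diffusion loop: start from noise, denoise step by step."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(shape)  # pure noise 'frame'
    for t in range(steps, 0, -1):
        eps_hat = toy_denoiser(x, t, audio_feat)
        x = x - (1.0 / steps) * eps_hat  # small step toward the data manifold
    return x

# Each audio clip's features would steer the denoising toward matching frames.
feat = np.linspace(0.0, 1.0, 8)
frame = reverse_diffusion(feat)
```

In the real system the denoiser is a large trained network and the conditioning comes from learned audio embeddings, but the structure of the loop (noise in, conditioned refinement over many steps, frame out) is the same general pattern used by diffusion-based video generators.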