Alibaba’s EMO: A New AI Model That Generates Videos from Still Images by Adding Lip Sync

Mar 1, 2024

Alibaba's EMO Converting Still Image to Video

Alibaba has unveiled EMO, short for Emote Portrait Alive, an AI model that turns still images and portraits into videos by adding lip sync driven by an audio file. Given a portrait and an audio clip, EMO can make the subject appear to sing, rap, or talk. For example, users can pair a portrait with a recording of a song, and the character in the image will look like it is singing that song.

The generated output looks highly realistic because the character's head position and facial expressions change along with the lips. In some cases, it is hard to tell what is real and what is not. EMO can also generate videos of any duration, since the length of the video follows the length of the audio file.
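As a rough illustration of how video length follows audio length, the number of frames to generate is simply the audio duration multiplied by the frame rate. The sketch below is not from Alibaba; the function name `frames_needed` and the default of 30 frames per second are assumptions chosen for the example.

```python
import math


def frames_needed(audio_seconds: float, fps: int = 30) -> int:
    """Return how many video frames are needed to cover an audio clip.

    Both the function and the 30 fps default are hypothetical; the
    article does not state the frame rate EMO uses.
    """
    if audio_seconds < 0 or fps <= 0:
        raise ValueError("audio_seconds must be >= 0 and fps must be > 0")
    # Round up so the final partial frame interval is still covered.
    return math.ceil(audio_seconds * fps)


# A hypothetical 90-second song at the assumed 30 fps.
print(frames_needed(90.0))  # 2700
```

A longer audio file simply produces a proportionally larger frame count, which is why the output video has no fixed length.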

Alibaba shared demo videos on GitHub showcasing EMO's output. In one example, the woman from a well-known video generated by OpenAI's Sora model sings a Dua Lipa song, with only a reference image and an audio file of the song as input.

Technical Details of Alibaba’s EMO

The model was trained on a large and diverse audio-video dataset of more than 250 hours of footage and 150 million images. The material comes from various sources, including films, speeches, and music performances, and covers different languages, including Chinese and English.

EMO uses the reference image to guide the video frames that are generated during the diffusion process, so the character in the output matches the person in the input image. The diffusion process itself is driven by the audio: it converts the audio signal into video frames. This captures the subtle facial motions and identity-specific nuances associated with natural speech, making the resulting video look more realistic.
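The general idea of refining noise into frames under the guidance of a reference image and audio can be illustrated with a toy loop. This is only a sketch and not EMO's network: real diffusion models use a trained neural network to predict the noise at each step, whereas here a hand-written target stands in for that prediction. The names `toy_generate_frame` and `toy_generate_video`, the tiny list-of-floats "frames", and the way the audio value shifts the reference are all assumptions made for illustration.

```python
import random


def toy_generate_frame(reference, audio_feature, steps=50, seed=0):
    """Refine random noise into a frame, guided by reference and audio.

    Hypothetical stand-in: `reference` is a list of floats playing the
    role of image features, `audio_feature` is a single float playing
    the role of the audio signal for this frame.
    """
    rng = random.Random(seed)
    # Start from pure Gaussian noise, one value per reference feature.
    frame = [rng.gauss(0.0, 1.0) for _ in reference]
    # Assumed conditioning target; a trained model would predict this.
    target = [r + audio_feature for r in reference]
    for step in range(steps):
        remaining = steps - step
        # Close an equal share of the remaining gap on every step.
        frame = [x + (t - x) / remaining for x, t in zip(frame, target)]
    return frame


def toy_generate_video(reference, audio_features):
    """Produce one toy frame per audio feature, all tied to one reference."""
    return [
        toy_generate_frame(reference, a, seed=i)
        for i, a in enumerate(audio_features)
    ]


# Three audio features give three frames that share the same reference.
video = toy_generate_video([0.2, 0.8], [0.0, 0.1, 0.3])
print(len(video))
```

The point of the sketch is the shape of the process: every frame starts as noise, the reference stays fixed across frames, and the audio is what makes one frame differ from the next.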

Image Showing Technical Working of Alibaba's EMO
Image Credits: EMO’s Research Paper

Describing the problem with traditional audio-driven video models, Alibaba's researchers wrote in their research paper, “We identify the limitations of traditional techniques that often fail to capture the full spectrum of human expressions and the uniqueness of individual facial styles. To address these issues, we propose EMO, a novel framework that utilizes a direct audio-to-video synthesis approach, bypassing the need for intermediate 3D models or facial landmarks.”

For more technical details, see EMO's research paper.

Limitations of Alibaba’s EMO

In their research paper, Alibaba's researchers noted some limitations of EMO. First, it is more time-consuming than methods that do not rely on diffusion models. Second, because the model does not use explicit control signals to control the character's motion, it can inadvertently generate other body parts, such as hands, which results in artifacts in the video.


Alibaba's EMO could make AI-generated videos more realistic with its diffusion-based approach. That same realism could increase the number of deepfake videos created with the model, so the company needs to put safeguards in place before fully releasing it to the public. Although the model still has some limitations, it will be interesting to see how Alibaba refines it to get more accurate results.
