What is a talking head?

Talking head is a technique that animates a single image, typically a face photo, so that it looks like a "speaking person". It moves the mouth and facial expressions of the input image, using a separately prepared reference video or audio track as the driving signal.

It is closely related to lip sync, but lip sync focuses on matching only the mouth of an existing video to audio. Talking head, by contrast, starts from a single still image, and many methods are driven by the motion of a reference video rather than by audio.

As the name talking head suggests, it started with animating just the face, but it is evolving toward animating the upper body and even the whole body.


Deformation-based talking head

Thin-Plate Spline Motion Model for Image Animation

Given a single image and a video of a moving person, it deforms the image to mimic that movement.

Rather than building a 3D model, it essentially stretches and twists the image in 2D, like a soft sheet of rubber. Think of Photoshop's Puppet Warp.
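As a rough illustration of the deformation idea, here is a minimal classical thin-plate spline warp with numpy and scipy. This is not the paper's method: the actual model learns where the keypoints are and combines several local TPS transforms with a network, while here the point pairs are supplied by hand.

```python
import numpy as np
from scipy.interpolate import RBFInterpolator
from scipy.ndimage import map_coordinates

def tps_warp(image, src_pts, dst_pts):
    """Warp `image` so pixels near src_pts move toward dst_pts,
    using a thin-plate spline fitted on the sparse point pairs."""
    h, w = image.shape[:2]
    # Fit the spline backwards (output -> input coordinates) so every
    # output pixel knows where to sample from; this avoids holes.
    tps = RBFInterpolator(dst_pts, src_pts, kernel="thin_plate_spline")
    ys, xs = np.mgrid[0:h, 0:w]
    grid = np.stack([ys.ravel(), xs.ravel()], axis=1).astype(float)
    sample_at = tps(grid)  # (h*w, 2) source coordinates, (row, col)
    warped = [
        map_coordinates(image[..., c], sample_at.T,
                        order=1, mode="nearest").reshape(h, w)
        for c in range(image.shape[2])
    ]
    return np.stack(warped, axis=-1)

# Toy usage: nudge two "mouth corner" points upward, as a driving
# frame might, and let the spline deform everything else smoothly.
src = np.array([[60.0, 40.0], [60.0, 88.0], [20.0, 64.0]])
dst = np.array([[55.0, 40.0], [55.0, 88.0], [20.0, 64.0]])
face = np.random.rand(128, 128, 3)  # stand-in for a face photo
out = tps_warp(face, src, dst)
```

The warp itself is the same kind of operation the model performs; the learned part is predicting good keypoints and blending multiple local warps.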

LivePortrait

AdvancedLivePortrait_image2video.json

This also takes a single image and a reference video as input, but it is designed to reproduce the motion of individual facial parts, gaze direction, and subtle emotional nuances more stably.

Because it is not a diffusion model, it is relatively lightweight and suitable for real-time use. It also allows edits such as "turn the face slightly down" or "open the eyes a little", so it is still widely used today.
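That editability comes from driving the face through a compact set of implicit 3D keypoints, so "face slightly down" or "eyes slightly open" become small transforms on those keypoints. A minimal sketch of the intuition in plain numpy; the keypoint count and eyelid indices here are invented for illustration, since the real keypoints are learned:

```python
import numpy as np

# Hypothetical 3D keypoints: rows = points, cols = (x, y, z).
# The eyelid indices below are made up for illustration only.
UPPER_LIDS = [4, 5]   # left/right upper eyelid points
LOWER_LIDS = [6, 7]   # left/right lower eyelid points

def pitch_down(kps, degrees):
    """Rotate all keypoints about the x-axis through their centroid,
    i.e. nod the head down by `degrees`."""
    t = np.deg2rad(degrees)
    R = np.array([[1.0, 0.0, 0.0],
                  [0.0, np.cos(t), -np.sin(t)],
                  [0.0, np.sin(t),  np.cos(t)]])
    center = kps.mean(axis=0)
    return (kps - center) @ R.T + center

def open_eyes(kps, amount):
    """Push upper and lower eyelid keypoints apart along y."""
    out = kps.copy()
    out[UPPER_LIDS, 1] -= amount
    out[LOWER_LIDS, 1] += amount
    return out

kps = np.random.rand(21, 3)  # stand-in for detected keypoints
edited = open_eyes(pitch_down(kps, 5), 0.02)
# A generator network then renders the face image from the edited keypoints.
```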


Diffusion model-based talking head

The next generation brought talking heads that "redraw the picture itself" with diffusion models. X-Portrait and HelloMeme belong to this lineage.

HelloMeme_video.json

These extract signals corresponding to head orientation and facial expression changes from the reference video and pass them to the diffusion model as conditioning. Conceptually, this is close to generating an image while pinning down the pose and composition with ControlNet: "redraw this character's face with this movement".
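To make the ControlNet analogy concrete, here is a sketch that conditions an image diffusion model on per-frame pose maps extracted from a reference video, using the diffusers and controlnet_aux libraries. This is only the analogy, not how X-Portrait or HelloMeme are actually built: generating each frame independently like this flickers, and the real models use richer expression signals plus temporal modules. Checkpoint names are common examples and may need adjusting, and reading the mp4 assumes an ffmpeg-backed imageio plugin.

```python
import torch
import imageio.v3 as iio
from PIL import Image
from controlnet_aux import OpenposeDetector
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

# Extract a pose map from every frame of the reference video.
detector = OpenposeDetector.from_pretrained("lllyasviel/Annotators")
pose_maps = [detector(Image.fromarray(f)) for f in iio.imread("reference.mp4")]

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-openpose", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

# Redraw the character frame by frame, pinned to each pose map.
frames = [
    pipe("portrait of the character, talking",
         image=pose, num_inference_steps=20).images[0]
    for pose in pose_maps
]
```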


Video generation model-based talking head

More recently, talking head / avatar models built directly on video generation models have appeared. OmniAvatar and Wan-Animate fall into this lineage.

Wan-Animate

Wan-Animate takes a character image and a reference video containing movement as input, and animates the character to trace that movement.


Toward Human Motion Transfer

Once talking head technology can handle the face stably, wanting to move the upper body and the whole body as well is a natural next step.

Older methods like the Thin-Plate Spline model could always be applied to the whole body as well as the face, and Wan-Animate handles the whole body just fine, so the distinction from talking head is arguably moot; still, Human Motion Transfer has evolved as its own field, so let's take a look.

Human Motion Transfer