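What is a talking head?
A talking head is a technique that animates a single image or portrait photo so that it looks like a person speaking. It moves the mouth and facial expression of the input image, guided by the motion in a separately prepared reference video or by audio.
It closely resembles lip sync, but lip sync mostly means matching only the mouth in an existing video to an audio track. A talking head basically animates a single still image, and many models are driven mainly by the motion of a reference video rather than by audio.
As the name suggests, talking heads started with moving the face, but they have been evolving toward moving the upper body and even the whole body.
Deformation-based talking heads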
Thin-Plate Spline Motion Model for Image Animation

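Given a single image and a video of a person in motion, the image is deformed so that it imitates that motion.
Rather than working with a 3D model, it is closer to twisting the picture while it stays in 2D, much like Puppet Warp in Photoshop.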
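To make the "twisting in 2D" idea concrete, here is a minimal sketch of a plain thin-plate spline warp in NumPy. It is not the model's implementation: the real model learns its control points and does considerably more. The hand-picked control points and the function names `fit_tps` and `apply_tps` are assumptions made for illustration only.

```python
import numpy as np

def tps_kernel(r2):
    """Radial basis U(r) = r^2 log(r), written in terms of r^2, with U(0) = 0."""
    out = np.zeros_like(r2)
    mask = r2 > 0
    out[mask] = 0.5 * r2[mask] * np.log(r2[mask])
    return out

def fit_tps(src, dst):
    """Solve for a 2D thin-plate spline that maps each src point onto dst."""
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    n = src.shape[0]
    d2 = np.sum((src[:, None, :] - src[None, :, :]) ** 2, axis=-1)
    K = tps_kernel(d2)
    P = np.hstack([np.ones((n, 1)), src])  # affine part: [1, x, y]
    A = np.zeros((n + 3, n + 3))
    A[:n, :n] = K
    A[:n, n:] = P
    A[n:, :n] = P.T
    b = np.zeros((n + 3, 2))
    b[:n] = dst
    params = np.linalg.solve(A, b)
    return params[:n], params[n:]  # non-affine weights, affine coefficients

def apply_tps(points, src, w, a):
    """Warp arbitrary 2D points with a fitted spline."""
    points = np.asarray(points, dtype=float)
    src = np.asarray(src, dtype=float)
    d2 = np.sum((points[:, None, :] - src[None, :, :]) ** 2, axis=-1)
    P = np.hstack([np.ones((points.shape[0], 1)), points])
    return tps_kernel(d2) @ w + P @ a

# Hypothetical control points: four fixed corners and a centre that is dragged right.
src = np.array([[0, 0], [1, 0], [0, 1], [1, 1], [0.5, 0.5]], dtype=float)
dst = src.copy()
dst[4] = [0.6, 0.5]

w, a = fit_tps(src, dst)
print(apply_tps(np.array([[0.5, 0.25]]), src, w, a))  # a point near the centre is pulled along
```

The corners stay pinned while everything in between is bent smoothly, which is the puppet-warp behaviour described above. Warping an actual image would mean applying `apply_tps` to a grid of pixel coordinates and resampling, which this sketch leaves out.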
LivePortrait
{
"last_node_id": 65,
"last_link_id": 100,
"nodes": [
{
"id": 45,
"type": "LoadImage",
"pos": [
110,
110
],
"size": [
296.8722126163094,
418.8513744298035
],
"flags": {},
"order": 0,
"mode": 0,
"outputs": [
{
"name": "IMAGE",
"type": "IMAGE",
"links": [
88,
95,
99
],
"slot_index": 0,
"shape": 3
},
{
"name": "MASK",
"type": "MASK",
"links": null,
"shape": 3
}
],
"properties": {
"Node name for S&R": "LoadImage"
},
"widgets_values": [
"Emma.png",
"image"
]
},
{
"id": 59,
"type": "ExpressionEditor",
"pos": [
436,
183
],
"size": [
315,
690
],
"flags": {},
"order": 2,
"mode": 0,
"inputs": [
{
"name": "src_image",
"type": "IMAGE",
"link": 88
},
{
"name": "motion_link",
"type": "EDITOR_LINK",
"link": null
},
{
"name": "sample_image",
"type": "IMAGE",
"link": null
},
{
"name": "add_exp",
"type": "EXP_DATA",
"link": null
}
],
"outputs": [
{
"name": "image",
"type": "IMAGE",
"links": [],
"shape": 3,
"slot_index": 0
},
{
"name": "motion_link",
"type": "EDITOR_LINK",
"links": [
98
],
"shape": 3,
"slot_index": 1
},
{
"name": "save_exp",
"type": "EXP_DATA",
"links": null,
"shape": 3
}
],
"properties": {
"Node name for S&R": "ExpressionEditor"
},
"widgets_values": [
0,
20,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
1,
1,
"All",
1.7
],
"color": "#232",
"bgcolor": "#353"
},
{
"id": 60,
"type": "AdvancedLivePortrait",
"pos": [
1460,
110
],
"size": [
283.1057506328382,
431.99999618530273
],
"flags": {},
"order": 4,
"mode": 0,
"inputs": [
{
"name": "src_images",
"type": "IMAGE",
"link": 95
},
{
"name": "motion_link",
"type": "EDITOR_LINK",
"link": 100
},
{
"name": "driving_images",
"type": "IMAGE",
"link": 97
}
],
"outputs": [
{
"name": "images",
"type": "IMAGE",
"links": [
94
],
"shape": 3,
"slot_index": 0
}
],
"properties": {
"Node name for S&R": "AdvancedLivePortrait"
},
"widgets_values": [
0,
0,
1.7,
true,
"1 = 1:10\n2 = 11:32"
],
"color": "#323",
"bgcolor": "#535"
},
{
"id": 62,
"type": "VHS_LoadVideo",
"pos": [
1121,
304
],
"size": [
298.8722126163094,
432.86561959667404
],
"flags": {},
"order": 1,
"mode": 0,
"inputs": [
{
"name": "meta_batch",
"type": "VHS_BatchManager",
"link": null
},
{
"name": "vae",
"type": "VAE",
"link": null
}
],
"outputs": [
{
"name": "IMAGE",
"type": "IMAGE",
"links": [
97
],
"shape": 3,
"slot_index": 0
},
{
"name": "frame_count",
"type": "INT",
"links": null,
"shape": 3
},
{
"name": "audio",
"type": "AUDIO",
"links": null,
"shape": 3
},
{
"name": "video_info",
"type": "VHS_VIDEOINFO",
"links": null,
"shape": 3
}
],
"properties": {
"Node name for S&R": "VHS_LoadVideo"
},
"widgets_values": {
"video": "7327398-uhd_3840_2160_25fps.mp4",
"force_rate": 8,
"force_size": "?x512",
"custom_width": 512,
"custom_height": 512,
"frame_load_cap": 32,
"skip_first_frames": 0,
"select_every_nth": 1,
"choose video to upload": "image",
"videopreview": {
"hidden": false,
"paused": false,
"params": {
"frame_load_cap": 32,
"skip_first_frames": 0,
"force_rate": 8,
"filename": "7327398-uhd_3840_2160_25fps.mp4",
"type": "input",
"format": "video/mp4",
"select_every_nth": 1,
"force_size": "?x512"
},
"muted": false
}
}
},
{
"id": 64,
"type": "VHS_VideoCombine",
"pos": [
1783,
110
],
"size": [
315,
682.9166666666667
],
"flags": {},
"order": 5,
"mode": 0,
"inputs": [
{
"name": "images",
"type": "IMAGE",
"link": 94
},
{
"name": "audio",
"type": "AUDIO",
"link": null
},
{
"name": "meta_batch",
"type": "VHS_BatchManager",
"link": null
},
{
"name": "vae",
"type": "VAE",
"link": null
}
],
"outputs": [
{
"name": "Filenames",
"type": "VHS_FILENAMES",
"links": null,
"shape": 3
}
],
"properties": {
"Node name for S&R": "VHS_VideoCombine"
},
"widgets_values": {
"frame_rate": 8,
"loop_count": 0,
"filename_prefix": "AnimateDiff",
"format": "video/h264-mp4",
"pix_fmt": "yuv420p",
"crf": 19,
"save_metadata": true,
"pingpong": false,
"save_output": false,
"videopreview": {
"hidden": false,
"paused": false,
"params": {
"filename": "AnimateDiff_00035.mp4",
"subfolder": "",
"type": "temp",
"format": "video/h264-mp4",
"frame_rate": 8
},
"muted": false
}
}
},
{
"id": 65,
"type": "ExpressionEditor",
"pos": [
773,
183
],
"size": {
"0": 315,
"1": 690
},
"flags": {},
"order": 3,
"mode": 0,
"inputs": [
{
"name": "src_image",
"type": "IMAGE",
"link": 99
},
{
"name": "motion_link",
"type": "EDITOR_LINK",
"link": 98
},
{
"name": "sample_image",
"type": "IMAGE",
"link": null
},
{
"name": "add_exp",
"type": "EXP_DATA",
"link": null
}
],
"outputs": [
{
"name": "image",
"type": "IMAGE",
"links": [],
"shape": 3,
"slot_index": 0
},
{
"name": "motion_link",
"type": "EDITOR_LINK",
"links": [
100
],
"shape": 3,
"slot_index": 1
},
{
"name": "save_exp",
"type": "EXP_DATA",
"links": null,
"shape": 3
}
],
"properties": {
"Node name for S&R": "ExpressionEditor"
},
"widgets_values": [
10,
0,
10,
0,
0,
0,
0,
0,
0,
0,
0,
0,
1,
1,
"All",
1.7
],
"color": "#232",
"bgcolor": "#353"
}
],
"links": [
[
88,
45,
0,
59,
0,
"IMAGE"
],
[
94,
60,
0,
64,
0,
"IMAGE"
],
[
95,
45,
0,
60,
0,
"IMAGE"
],
[
97,
62,
0,
60,
2,
"IMAGE"
],
[
98,
59,
1,
65,
1,
"EDITOR_LINK"
],
[
99,
45,
0,
65,
0,
"IMAGE"
],
[
100,
65,
1,
60,
1,
"EDITOR_LINK"
]
],
"groups": [],
"config": {},
"extra": {
"ds": {
"scale": 0.6830134553650706,
"offset": [
26.84286476099541,
551.0218305182884
]
}
},
"version": 0.4
}
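This one also takes a single image and a reference video as input, but it is designed to stably reproduce the motion of individual facial parts, gaze direction, and emotional nuance.
Because it is not a diffusion model, it is relatively lightweight and suited to near-real-time use. It also supports edits such as "tilt the face slightly downward" or "open the eyes a little", so it is still widely used today.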
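The workflow JSON above is easier to follow once its `links` array is turned into readable edges. The sketch below copies only the node ids, node types, and links from that workflow. It reads each link as `[link id, source node, source slot, target node, target slot, data type]`. That reading is an assumption, though it matches the `link` and `links` values in the node entries. The helper name `describe_edges` is hypothetical.

```python
import json

# Subset of the LivePortrait workflow above: node ids/types and the links array only.
WORKFLOW_JSON = """
{
  "nodes": [
    {"id": 45, "type": "LoadImage"},
    {"id": 59, "type": "ExpressionEditor"},
    {"id": 65, "type": "ExpressionEditor"},
    {"id": 62, "type": "VHS_LoadVideo"},
    {"id": 60, "type": "AdvancedLivePortrait"},
    {"id": 64, "type": "VHS_VideoCombine"}
  ],
  "links": [
    [88, 45, 0, 59, 0, "IMAGE"],
    [94, 60, 0, 64, 0, "IMAGE"],
    [95, 45, 0, 60, 0, "IMAGE"],
    [97, 62, 0, 60, 2, "IMAGE"],
    [98, 59, 1, 65, 1, "EDITOR_LINK"],
    [99, 45, 0, 65, 0, "IMAGE"],
    [100, 65, 1, 60, 1, "EDITOR_LINK"]
  ]
}
"""

def describe_edges(workflow):
    """Return one 'source -> target (type)' string per link in the workflow."""
    types = {node["id"]: node["type"] for node in workflow["nodes"]}
    edges = []
    for _link_id, src, _src_slot, dst, _dst_slot, data_type in workflow["links"]:
        edges.append(f"{types[src]}#{src} -> {types[dst]}#{dst} ({data_type})")
    return edges

workflow = json.loads(WORKFLOW_JSON)
for edge in describe_edges(workflow):
    print(edge)
```

Read this way, the source image feeds both ExpressionEditor nodes and AdvancedLivePortrait. The two editors are chained through `motion_link`, and the driving video enters AdvancedLivePortrait as `driving_images`. The expression edits and the reference-video motion therefore meet in a single node before the frames are combined into a video.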
Diffusion-model-based talking heads
In the next generation, talking heads appeared that use a diffusion model to redraw the picture itself. X-Portrait and HelloMeme belong to this family.
{
"last_node_id": 26,
"last_link_id": 40,
"nodes": [
{
"id": 21,
"type": "GetReferenceImageRT",
"pos": [
740,
410
],
"size": [
241.79998779296875,
46
],
"flags": {},
"order": 5,
"mode": 0,
"inputs": [
{
"name": "face_toolkits",
"type": "FACE_TOOLKITS",
"link": 32
},
{
"name": "image",
"type": "IMAGE",
"link": 35
}
],
"outputs": [
{
"name": "REFRT",
"type": "REFRT",
"links": [
31
]
}
],
"properties": {
"Node name for S&R": "GetReferenceImageRT"
},
"widgets_values": [],
"color": "#232",
"bgcolor": "#353"
},
{
"id": 24,
"type": "PreviewImage",
"pos": [
744,
551
],
"size": [
210,
246
],
"flags": {},
"order": 6,
"mode": 0,
"inputs": [
{
"name": "images",
"type": "IMAGE",
"link": 37
}
],
"outputs": [],
"properties": {
"Node name for S&R": "PreviewImage"
}
},
{
"id": 19,
"type": "GetVideoDriveParams",
"pos": [
1064,
330
],
"size": [
270.3999938964844,
98
],
"flags": {},
"order": 7,
"mode": 0,
"inputs": [
{
"name": "face_toolkits",
"type": "FACE_TOOLKITS",
"link": 33
},
{
"name": "images",
"type": "IMAGE",
"link": 40
},
{
"name": "ref_rt",
"type": "REFRT",
"link": 31
}
],
"outputs": [
{
"name": "drive_video_params",
"type": "DRIVE_VIDEO_PARAMS",
"links": [
29
]
}
],
"properties": {
"Node name for S&R": "GetVideoDriveParams"
},
"widgets_values": [
0
],
"color": "#323",
"bgcolor": "#535"
},
{
"id": 22,
"type": "HMFaceToolkitsLoader",
"pos": [
458,
330
],
"size": [
230.03500366210938,
58
],
"flags": {},
"order": 0,
"mode": 0,
"inputs": [],
"outputs": [
{
"name": "FACE_TOOLKITS",
"type": "FACE_TOOLKITS",
"links": [
32,
33
],
"slot_index": 0
}
],
"properties": {
"Node name for S&R": "HMFaceToolkitsLoader"
},
"widgets_values": [
0
],
"color": "#232",
"bgcolor": "#353"
},
{
"id": 14,
"type": "ImageResize",
"pos": [
380,
470
],
"size": [
315,
246
],
"flags": {},
"order": 4,
"mode": 0,
"inputs": [
{
"name": "pixels",
"type": "IMAGE",
"link": 19
},
{
"name": "mask_optional",
"type": "MASK",
"link": null,
"shape": 7
}
],
"outputs": [
{
"name": "IMAGE",
"type": "IMAGE",
"links": [
35,
36,
37
],
"slot_index": 0
},
{
"name": "MASK",
"type": "MASK",
"links": null
}
],
"properties": {
"Node name for S&R": "ImageResize"
},
"widgets_values": [
"crop to ratio",
0,
0,
0,
"reduce size only",
"1:1",
0,
20
],
"color": "#432",
"bgcolor": "#653"
},
{
"id": 26,
"type": "VHS_LoadVideo",
"pos": [
386,
803
],
"size": [
305.44378662109375,
436.5621337890625
],
"flags": {},
"order": 1,
"mode": 0,
"inputs": [
{
"name": "meta_batch",
"type": "VHS_BatchManager",
"link": null,
"shape": 7
},
{
"name": "vae",
"type": "VAE",
"link": null,
"shape": 7
}
],
"outputs": [
{
"name": "IMAGE",
"type": "IMAGE",
"links": [
40
],
"slot_index": 0
},
{
"name": "frame_count",
"type": "INT",
"links": null
},
{
"name": "audio",
"type": "AUDIO",
"links": null
},
{
"name": "video_info",
"type": "VHS_VIDEOINFO",
"links": null
}
],
"properties": {
"Node name for S&R": "VHS_LoadVideo"
},
"widgets_values": {
"video": "3762907-uhd_3840_2160_25fps.mp4",
"force_rate": 16,
"force_size": "512x?",
"custom_width": 512,
"custom_height": 512,
"frame_load_cap": 48,
"skip_first_frames": 0,
"select_every_nth": 1,
"choose video to upload": "image",
"videopreview": {
"hidden": false,
"paused": false,
"params": {
"force_rate": 16,
"frame_load_cap": 48,
"skip_first_frames": 0,
"select_every_nth": 1,
"filename": "3762907-uhd_3840_2160_25fps.mp4",
"type": "input",
"format": "video/mp4"
},
"muted": false
}
}
},
{
"id": 18,
"type": "HMVideoPipelineLoader",
"pos": [
981,
161
],
"size": [
352.79998779296875,
106
],
"flags": {},
"order": 2,
"mode": 0,
"inputs": [],
"outputs": [
{
"name": "HMVIDEOPIPELINE",
"type": "HMVIDEOPIPELINE",
"links": [
28
]
}
],
"properties": {
"Node name for S&R": "HMVideoPipelineLoader"
},
"widgets_values": [
"None",
"None",
0
],
"color": "#323",
"bgcolor": "#535"
},
{
"id": 25,
"type": "VHS_VideoCombine",
"pos": [
1762,
445
],
"size": [
341.20306396484375,
645.2030639648438
],
"flags": {},
"order": 9,
"mode": 0,
"inputs": [
{
"name": "images",
"type": "IMAGE",
"link": 39
},
{
"name": "audio",
"type": "AUDIO",
"link": null,
"shape": 7
},
{
"name": "meta_batch",
"type": "VHS_BatchManager",
"link": null,
"shape": 7
},
{
"name": "vae",
"type": "VAE",
"link": null,
"shape": 7
}
],
"outputs": [
{
"name": "Filenames",
"type": "VHS_FILENAMES",
"links": null
}
],
"properties": {
"Node name for S&R": "VHS_VideoCombine"
},
"widgets_values": {
"frame_rate": 16,
"loop_count": 0,
"filename_prefix": "AnimateDiff",
"format": "video/h264-mp4",
"pix_fmt": "yuv420p",
"crf": 19,
"save_metadata": true,
"pingpong": false,
"save_output": true,
"videopreview": {
"hidden": false,
"paused": false,
"params": {
"filename": "AnimateDiff_00033.mp4",
"subfolder": "",
"type": "output",
"format": "video/h264-mp4",
"frame_rate": 16
},
"muted": false
}
}
},
{
"id": 17,
"type": "HMPipelineVideo",
"pos": [
1409,
448
],
"size": [
315,
218
],
"flags": {},
"order": 8,
"mode": 0,
"inputs": [
{
"name": "pipeline",
"type": "HMVIDEOPIPELINE",
"link": 28
},
{
"name": "image",
"type": "IMAGE",
"link": 36
},
{
"name": "drive_video_params",
"type": "DRIVE_VIDEO_PARAMS",
"link": 29
}
],
"outputs": [
{
"name": "IMAGE",
"type": "IMAGE",
"links": [
39
],
"slot_index": 0
}
],
"properties": {
"Node name for S&R": "HMPipelineVideo"
},
"widgets_values": [
"(best quality), highly detailed, ultra-detailed, headshot, person, well-placed five sense organs, looking at the viewer, centered composition, sharp focus, realistic skin texture",
"",
25,
1234,
"fixed",
2
],
"color": "#323",
"bgcolor": "#535"
},
{
"id": 4,
"type": "LoadImage",
"pos": [
32,
471
],
"size": [
315,
314
],
"flags": {},
"order": 3,
"mode": 0,
"inputs": [],
"outputs": [
{
"name": "IMAGE",
"type": "IMAGE",
"links": [
19
],
"slot_index": 0
},
{
"name": "MASK",
"type": "MASK",
"links": null
}
],
"properties": {
"Node name for S&R": "LoadImage"
},
"widgets_values": [
"pexels-photo-28252721.jpg",
"image"
]
}
],
"links": [
[
19,
4,
0,
14,
0,
"IMAGE"
],
[
28,
18,
0,
17,
0,
"HMVIDEOPIPELINE"
],
[
29,
19,
0,
17,
2,
"DRIVE_VIDEO_PARAMS"
],
[
31,
21,
0,
19,
2,
"REFRT"
],
[
32,
22,
0,
21,
0,
"FACE_TOOLKITS"
],
[
33,
22,
0,
19,
0,
"FACE_TOOLKITS"
],
[
35,
14,
0,
21,
1,
"IMAGE"
],
[
36,
14,
0,
17,
1,
"IMAGE"
],
[
37,
14,
0,
24,
0,
"IMAGE"
],
[
39,
17,
0,
25,
0,
"IMAGE"
],
[
40,
26,
0,
19,
1,
"IMAGE"
]
],
"groups": [],
"config": {},
"extra": {
"ds": {
"scale": 0.7513148009015778,
"offset": [
93.6229236924027,
107.24457527189101
]
}
},
"version": 0.4
}
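These models extract signals corresponding to head orientation and changes in facial expression from the reference video and pass them to the diffusion model as conditions. The approach is close to generating an image while fixing the pose and composition with ControlNet: in effect, you are asking the model to "redraw this character's face with this motion".
Video-generation-model-based talking heads
In an even newer generation, talking head / avatar models built on video generation models themselves have appeared. OmniAvatar and Wan-Animate fall into this line.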

Wan-Animate
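Wan-Animate is a model that takes a character image and a reference video containing motion as input, and animates the character so that it traces that motion.
Toward Human Motion Transfer
Once talking head techniques can handle the face region stably, it is a natural step to want to move the upper body and the whole body as well.
Even older methods such as Thin-Plate Spline could be applied to the whole body, not just the face, from the start, and Wan-Animate handles the whole body as well, so there may be little need to distinguish these from talking heads. Still, Human Motion Transfer has evolved along its own path, so let's take a brief look at it.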