ControlNet

What is ControlNet?

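At its core, generative AI learns a correspondence between two things. In text2image the model learns the relationship "noise ↔ image", but the same can be done with inputs other than noise.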

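Train on line art ↔ image pairs → automatic colorization from line art
Train on stick figure ↔ image pairs → image generation with a specified pose
Train on depth map ↔ image pairs → image generation from depth information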

ControlNet is one of the techniques that makes this possible.

SD1.5 × ControlNet Scribble

ControlNet comes in many varieties, but let's start by trying "scribble".
The scribble model is a ControlNet that generates an image from a rough doodle.

Downloading the ControlNet model

control_v11p_sd15_scribble_fp16.safetensors

📂ComfyUI/
  └── 📂models/
      └── 📂controlnet/
          └── control_v11p_sd15_scribble_fp16.safetensors
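To confirm the file landed in the right folder, a small check along these lines may help. It is only a sketch: the root folder name "ComfyUI" is an assumption about where your install lives, and the helper names are made up for illustration.

```python
from pathlib import Path


def controlnet_model_path(comfyui_root, filename):
    """Build the expected location of a ControlNet model under the ComfyUI root."""
    return Path(comfyui_root) / "models" / "controlnet" / filename


def is_installed(comfyui_root, filename):
    """True if the model file exists at the expected location."""
    return controlnet_model_path(comfyui_root, filename).is_file()


# Assumption: ComfyUI lives in ./ComfyUI; change this to your actual install path.
root = "ComfyUI"
name = "control_v11p_sd15_scribble_fp16.safetensors"
status = "found" if is_installed(root, name) else "missing"
print(f"{controlnet_model_path(root, name)}: {status}")
```

If the script reports "missing", check that the file is directly inside models/controlnet/ and not in a nested subfolder.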

workflow

SD1.5_ControlNet_scribble.json

{
  "id": "ff9a9120-9e06-4d07-93fd-048b505d0534",
  "revision": 0,
  "last_node_id": 23,
  "last_link_id": 35,
  "nodes": [
    {
      "id": 7,
      "type": "CLIPTextEncode",
      "pos": [
        481.0621643066406,
        427.9620666503906
      ],
      "size": [
        419.9831237792969,
        100.87960815429688
      ],
      "flags": {},
      "order": 6,
      "mode": 0,
      "inputs": [
        {
          "name": "clip",
          "type": "CLIP",
          "link": 5
        }
      ],
      "outputs": [
        {
          "name": "CONDITIONING",
          "type": "CONDITIONING",
          "slot_index": 0,
          "links": [
            22
          ]
        }
      ],
      "properties": {
        "cnr_id": "comfy-core",
        "ver": "0.3.49",
        "Node name for S&R": "CLIPTextEncode"
      },
      "widgets_values": [
        "text, watermark, low quality"
      ]
    },
    {
      "id": 17,
      "type": "VAELoader",
      "pos": [
        1301.9801940917969,
        183.17780701188025
      ],
      "size": [
        280.8620910644531,
        58
      ],
      "flags": {
        "collapsed": false
      },
      "order": 0,
      "mode": 0,
      "inputs": [],
      "outputs": [
        {
          "name": "VAE",
          "type": "VAE",
          "links": [
            16
          ]
        }
      ],
      "properties": {
        "cnr_id": "comfy-core",
        "ver": "0.3.49",
        "Node name for S&R": "VAELoader"
      },
      "widgets_values": [
        "vae-ft-mse-840000-ema-pruned.safetensors"
      ]
    },
    {
      "id": 13,
      "type": "ControlNetLoader",
      "pos": [
        586.0452880859375,
        606.119140625
      ],
      "size": [
        315,
        58
      ],
      "flags": {},
      "order": 1,
      "mode": 0,
      "inputs": [],
      "outputs": [
        {
          "name": "CONTROL_NET",
          "type": "CONTROL_NET",
          "links": [
            25
          ]
        }
      ],
      "properties": {
        "cnr_id": "comfy-core",
        "ver": "0.3.49",
        "Node name for S&R": "ControlNetLoader"
      },
      "widgets_values": [
        "control_v11p_sd15_scribble_fp16.safetensors"
      ],
      "color": "#232",
      "bgcolor": "#353"
    },
    {
      "id": 14,
      "type": "LoadImage",
      "pos": [
        506.46689675070996,
        739.2381924715907
      ],
      "size": [
        312.5415954589844,
        402.5836486816406
      ],
      "flags": {},
      "order": 2,
      "mode": 0,
      "inputs": [],
      "outputs": [
        {
          "name": "IMAGE",
          "type": "IMAGE",
          "links": [
            33,
            34
          ]
        },
        {
          "name": "MASK",
          "type": "MASK",
          "links": null
        }
      ],
      "properties": {
        "cnr_id": "comfy-core",
        "ver": "0.3.49",
        "Node name for S&R": "LoadImage"
      },
      "widgets_values": [
        "fd112e311d4e0503fbb4df2044fc9325.png",
        "image"
      ],
      "color": "#232",
      "bgcolor": "#353"
    },
    {
      "id": 6,
      "type": "CLIPTextEncode",
      "pos": [
        481.0621643066406,
        227.43450927734375
      ],
      "size": [
        419.9831237792969,
        140.84524536132812
      ],
      "flags": {},
      "order": 5,
      "mode": 0,
      "inputs": [
        {
          "name": "clip",
          "type": "CLIP",
          "link": 3
        }
      ],
      "outputs": [
        {
          "name": "CONDITIONING",
          "type": "CONDITIONING",
          "slot_index": 0,
          "links": [
            21
          ]
        }
      ],
      "properties": {
        "cnr_id": "comfy-core",
        "ver": "0.3.49",
        "Node name for S&R": "CLIPTextEncode"
      },
      "widgets_values": [
        "high quality,high detailed,RAW Photograph of a cat"
      ]
    },
    {
      "id": 3,
      "type": "KSampler",
      "pos": [
        1267.84228515625,
        299.4739990234375
      ],
      "size": [
        315,
        262
      ],
      "flags": {},
      "order": 9,
      "mode": 0,
      "inputs": [
        {
          "name": "model",
          "type": "MODEL",
          "link": 1
        },
        {
          "name": "positive",
          "type": "CONDITIONING",
          "link": 23
        },
        {
          "name": "negative",
          "type": "CONDITIONING",
          "link": 24
        },
        {
          "name": "latent_image",
          "type": "LATENT",
          "link": 2
        }
      ],
      "outputs": [
        {
          "name": "LATENT",
          "type": "LATENT",
          "slot_index": 0,
          "links": [
            7
          ]
        }
      ],
      "properties": {
        "cnr_id": "comfy-core",
        "ver": "0.3.49",
        "Node name for S&R": "KSampler"
      },
      "widgets_values": [
        11111,
        "fixed",
        20,
        8,
        "euler",
        "normal",
        1
      ]
    },
    {
      "id": 20,
      "type": "GetImageSize",
      "pos": [
        843.8379185975353,
        740.3050594888713
      ],
      "size": [
        140,
        124
      ],
      "flags": {},
      "order": 4,
      "mode": 0,
      "inputs": [
        {
          "name": "image",
          "type": "IMAGE",
          "link": 33
        }
      ],
      "outputs": [
        {
          "name": "width",
          "type": "INT",
          "links": [
            28
          ]
        },
        {
          "name": "height",
          "type": "INT",
          "links": [
            29
          ]
        },
        {
          "name": "batch_size",
          "type": "INT",
          "links": null
        }
      ],
      "properties": {
        "cnr_id": "comfy-core",
        "ver": "0.3.49",
        "Node name for S&R": "GetImageSize"
      },
      "widgets_values": [
        "width: 512, height: 512\n batch size: 1"
      ],
      "color": "#432",
      "bgcolor": "#653"
    },
    {
      "id": 5,
      "type": "EmptyLatentImage",
      "pos": [
        1005.7074254334752,
        716.0095069019711
      ],
      "size": [
        210,
        106
      ],
      "flags": {},
      "order": 7,
      "mode": 0,
      "inputs": [
        {
          "name": "width",
          "type": "INT",
          "widget": {
            "name": "width"
          },
          "link": 28
        },
        {
          "name": "height",
          "type": "INT",
          "widget": {
            "name": "height"
          },
          "link": 29
        }
      ],
      "outputs": [
        {
          "name": "LATENT",
          "type": "LATENT",
          "slot_index": 0,
          "links": [
            2
          ]
        }
      ],
      "properties": {
        "cnr_id": "comfy-core",
        "ver": "0.3.49",
        "Node name for S&R": "EmptyLatentImage"
      },
      "widgets_values": [
        512,
        512,
        1
      ],
      "color": "#432",
      "bgcolor": "#653"
    },
    {
      "id": 4,
      "type": "CheckpointLoaderSimple",
      "pos": [
        105.53061575140833,
        331.77475253018486
      ],
      "size": [
        315,
        98
      ],
      "flags": {},
      "order": 3,
      "mode": 0,
      "inputs": [],
      "outputs": [
        {
          "name": "MODEL",
          "type": "MODEL",
          "slot_index": 0,
          "links": [
            1
          ]
        },
        {
          "name": "CLIP",
          "type": "CLIP",
          "slot_index": 1,
          "links": [
            3,
            5
          ]
        },
        {
          "name": "VAE",
          "type": "VAE",
          "slot_index": 2,
          "links": []
        }
      ],
      "properties": {
        "cnr_id": "comfy-core",
        "ver": "0.3.49",
        "Node name for S&R": "CheckpointLoaderSimple"
      },
      "widgets_values": [
        "v1-5-pruned-emaonly-fp16.safetensors"
      ]
    },
    {
      "id": 8,
      "type": "VAEDecode",
      "pos": [
        1615.2298583984375,
        299.4739990234375
      ],
      "size": [
        172.8817596435547,
        46
      ],
      "flags": {},
      "order": 10,
      "mode": 0,
      "inputs": [
        {
          "name": "samples",
          "type": "LATENT",
          "link": 7
        },
        {
          "name": "vae",
          "type": "VAE",
          "link": 16
        }
      ],
      "outputs": [
        {
          "name": "IMAGE",
          "type": "IMAGE",
          "slot_index": 0,
          "links": [
            35
          ]
        }
      ],
      "properties": {
        "cnr_id": "comfy-core",
        "ver": "0.3.49",
        "Node name for S&R": "VAEDecode"
      },
      "widgets_values": []
    },
    {
      "id": 23,
      "type": "SaveImage",
      "pos": [
        1820.499267578125,
        299.4739990234375
      ],
      "size": [
        460.22799999999984,
        432.01099999999985
      ],
      "flags": {},
      "order": 11,
      "mode": 0,
      "inputs": [
        {
          "name": "images",
          "type": "IMAGE",
          "link": 35
        }
      ],
      "outputs": [],
      "properties": {
        "cnr_id": "comfy-core",
        "ver": "0.3.76"
      },
      "widgets_values": [
        "ComfyUI"
      ]
    },
    {
      "id": 21,
      "type": "ControlNetApplyAdvanced",
      "pos": [
        948.958740234375,
        320.1973571777344
      ],
      "size": [
        270,
        186
      ],
      "flags": {},
      "order": 8,
      "mode": 0,
      "inputs": [
        {
          "name": "positive",
          "type": "CONDITIONING",
          "link": 21
        },
        {
          "name": "negative",
          "type": "CONDITIONING",
          "link": 22
        },
        {
          "name": "control_net",
          "type": "CONTROL_NET",
          "link": 25
        },
        {
          "name": "image",
          "type": "IMAGE",
          "link": 34
        },
        {
          "name": "vae",
          "shape": 7,
          "type": "VAE",
          "link": null
        }
      ],
      "outputs": [
        {
          "name": "positive",
          "type": "CONDITIONING",
          "links": [
            23
          ]
        },
        {
          "name": "negative",
          "type": "CONDITIONING",
          "links": [
            24
          ]
        }
      ],
      "properties": {
        "cnr_id": "comfy-core",
        "ver": "0.3.49",
        "Node name for S&R": "ControlNetApplyAdvanced"
      },
      "widgets_values": [
        0.8,
        0,
        0.4
      ],
      "color": "#232",
      "bgcolor": "#353"
    }
  ],
  "links": [
    [
      1,
      4,
      0,
      3,
      0,
      "MODEL"
    ],
    [
      2,
      5,
      0,
      3,
      3,
      "LATENT"
    ],
    [
      3,
      4,
      1,
      6,
      0,
      "CLIP"
    ],
    [
      5,
      4,
      1,
      7,
      0,
      "CLIP"
    ],
    [
      7,
      3,
      0,
      8,
      0,
      "LATENT"
    ],
    [
      16,
      17,
      0,
      8,
      1,
      "VAE"
    ],
    [
      21,
      6,
      0,
      21,
      0,
      "CONDITIONING"
    ],
    [
      22,
      7,
      0,
      21,
      1,
      "CONDITIONING"
    ],
    [
      23,
      21,
      0,
      3,
      1,
      "CONDITIONING"
    ],
    [
      24,
      21,
      1,
      3,
      2,
      "CONDITIONING"
    ],
    [
      25,
      13,
      0,
      21,
      2,
      "CONTROL_NET"
    ],
    [
      28,
      20,
      0,
      5,
      0,
      "INT"
    ],
    [
      29,
      20,
      1,
      5,
      1,
      "INT"
    ],
    [
      33,
      14,
      0,
      20,
      0,
      "IMAGE"
    ],
    [
      34,
      14,
      0,
      21,
      3,
      "IMAGE"
    ],
    [
      35,
      8,
      0,
      23,
      0,
      "IMAGE"
    ]
  ],
  "groups": [],
  "config": {},
  "extra": {
    "ds": {
      "scale": 0.6830134553650705,
      "offset": [
        -5.530615751408334,
        -83.17780701188025
      ]
    },
    "frontendVersion": "1.34.6",
    "VHS_latentpreview": false,
    "VHS_latentpreviewrate": 0,
    "VHS_MetadataImage": true,
    "VHS_KeepIntermediate": true
  },
  "version": 0.4
}
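The wiring in this JSON can be read programmatically. The sketch below uses a trimmed excerpt of the workflow (only node ids, types, and the relevant links) and lists what feeds the ControlNetApplyAdvanced node. It assumes each entry in "links" is ordered as [link id, source node, source slot, destination node, destination slot, type], which is inferred from the JSON above rather than taken from a format specification.

```python
import json

# Trimmed excerpt of the workflow above: just enough to trace the ControlNet wiring.
WORKFLOW_EXCERPT = """
{
  "nodes": [
    {"id": 6,  "type": "CLIPTextEncode"},
    {"id": 7,  "type": "CLIPTextEncode"},
    {"id": 13, "type": "ControlNetLoader"},
    {"id": 14, "type": "LoadImage"},
    {"id": 21, "type": "ControlNetApplyAdvanced"},
    {"id": 3,  "type": "KSampler"}
  ],
  "links": [
    [21, 6, 0, 21, 0, "CONDITIONING"],
    [22, 7, 0, 21, 1, "CONDITIONING"],
    [25, 13, 0, 21, 2, "CONTROL_NET"],
    [34, 14, 0, 21, 3, "IMAGE"],
    [23, 21, 0, 3, 1, "CONDITIONING"],
    [24, 21, 1, 3, 2, "CONDITIONING"]
  ]
}
"""


def incoming(workflow, node_id):
    """Return (source node type, destination slot, data type) for links into node_id."""
    types = {node["id"]: node["type"] for node in workflow["nodes"]}
    result = []
    # Assumed field order: link id, source node, source slot, dest node, dest slot, type.
    for _link_id, src, _src_slot, dst, dst_slot, data_type in workflow["links"]:
        if dst == node_id:
            result.append((types[src], dst_slot, data_type))
    return sorted(result, key=lambda item: item[1])


workflow = json.loads(WORKFLOW_EXCERPT)
for source_type, slot, data_type in incoming(workflow, 21):
    print(f"slot {slot}: {data_type} from {source_type}")
```

Read this way, the two prompts pass through ControlNetApplyAdvanced before reaching KSampler, so the ControlNet is applied to the conditioning rather than to the sampler directly.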

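🟩 Feed the ControlNet model and the scribble image into the Apply ControlNet node.
🟨 No error is raised if the ControlNet image and the generated image differ in size, but keep them the same size anyway. In this workflow, GetImageSize passes the scribble's width and height to EmptyLatentImage for that purpose.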

The scribble model is tuned for white lines drawn on a black background.
Black lines on a white background often fail to produce a good response, so take care.
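If your doodle is black on white, inverting it first is one way around this. The following is a minimal pure-Python sketch, not part of the workflow: it treats a grayscale image as rows of 0-255 integers and inverts it when the average brightness is high. The 127 threshold and the helper names are assumptions for illustration, and the heuristic presumes the lines cover only a small share of the pixels. In practice you would do the same thing with an image library or an invert node.

```python
def mean_brightness(pixels):
    """Average value of a grayscale image given as rows of 0-255 integers."""
    total = sum(sum(row) for row in pixels)
    count = sum(len(row) for row in pixels)
    return total / count


def to_white_on_black(pixels, threshold=127):
    """Invert the image if it looks like dark lines on a light background."""
    if mean_brightness(pixels) > threshold:
        return [[255 - value for value in row] for row in pixels]
    # Already mostly dark: return an unchanged copy.
    return [list(row) for row in pixels]


# A tiny 3x3 "scribble": one black pixel on a white background.
black_on_white = [
    [255, 255, 255],
    [255,   0, 255],
    [255, 255, 255],
]
print(to_white_on_black(black_on_white))
```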

Sample image

Balancing ControlNet's control

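A diffusion model tends to produce its best quality when it generates without constraints.
Fully unconstrained output is not useful, though, so we steer it with conditioning such as text or ControlNet.
When the control is too strong, quality drops. The same is true of text prompts and LoRA.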

So how should we balance control against quality?

start_percent / end_percent

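During sampling, the overall shape is settled in the early steps and the details are filled in during the later steps.

Most ControlNets (pose / depth / scribble, etc.) are shape-determining controls.
That suggests it can be enough to apply ControlNet only in the early steps.

Apply ControlNet lets you specify the interval over which ControlNet is active.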

start_percent: the point at which it starts taking effect
end_percent: the point at which it stops taking effect

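The lower you set end_percent, the more freedom the model regains in the later steps, which can improve quality while keeping the shape. The workflow above uses strength 0.8, start_percent 0, and end_percent 0.4.

Combine strength with start_percent / end_percent
to find a balance that neither over-constrains the image nor lets the shape fall apart.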
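To get a feel for what a given interval means, the sketch below maps start_percent / end_percent onto step indices. It is a deliberate simplification that treats progress as step index divided by total steps. The actual implementation may map these percentages onto the noise schedule instead, so treat the boundaries as approximate. The function name is hypothetical.

```python
def active_steps(total_steps, start_percent, end_percent):
    """Approximate which sampling steps fall inside [start_percent, end_percent).

    Simplifying assumption: progress is step_index / total_steps.
    """
    return [
        step
        for step in range(total_steps)
        if start_percent <= step / total_steps < end_percent
    ]


# Values from the workflow above: 20 steps, start_percent 0, end_percent 0.4.
steps = active_steps(total_steps=20, start_percent=0.0, end_percent=0.4)
print(f"ControlNet active on {len(steps)} of 20 steps: {steps}")
```

Under this approximation, ControlNet guides roughly the first 8 of 20 steps and the remaining steps run on the text conditioning alone.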

Main types of ControlNet

There are countless "concepts" that can be paired with images.
Here we introduce only the representative ones.

Downloading the models

List

Canny

Redraws a photo or image in a different style while preserving its outlines.

Lineart

Similar to Canny, but better suited to illustrations.
It is used for tasks such as coloring line art.

Depth

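Uses a depth map (near/far information) to generate an image while preserving the depth and composition of the source image.
It suits buildings, landscapes, and other cases where you do not want the sense of depth to break.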

Normal

Uses a normal map to control how light falls on surfaces and how three-dimensional they look.

Pose

Generates an image of a person or character in the same pose from "stick figure pose information" extracted with OpenPose or a similar tool.

Inpaint

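A model for redrawing only part of an image.
It repaints just the area specified by a mask so that it blends in naturally (removing unwanted objects, swapping small items, and so on).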

QR Code Monster

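Produces images that can still be scanned as QR codes.
Beyond QR codes, it can also take any black-and-white pattern image as a base and morph it into whatever artwork you like.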

Tile

Produces a clean image from a heavily blurred or low-resolution one.
It can be used on its own, but in practice it is often combined with a super-resolution upscaler such as Ultimate SD Upscale.

ControlNet Union

This mainly concerns Flux and later models: "ControlNet Union" bundles basic ControlNets
such as Scribble, Pose, and Depth into a single model.

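For now, it is enough to think of it as a model that automatically recognizes the features of the input image (pose, lines, depth, etc.)
and tries to reproduce the behavior of the ControlNet that best matches them.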
