What is ControlNet?

Roughly speaking, diffusion models learn the relationship between "noise" and "images" so that they can restore images from noise.

What if we added "another piece of information corresponding to the image" along with the noise?

  • If a model learns the relationship between line art and colored illustrations
    • → Passing in line art alone is enough for it to color the drawing automatically
  • If a model learns the relationship between stick figures (pose images) and photos of people
    • → Passing in a stick figure alone is enough to generate an image of a person in that pose

AI models like these can be built.

In this way, ControlNet is one mechanism for controlling generation results by using "additional image conditions (pose, line art, depth, etc.)" as guidance.
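
As a rough illustration of what "adding another piece of information along with the noise" means for training data, here is a minimal Python sketch. The function names (add_noise, make_training_example) and the simple linear blend between image and noise are assumptions made for illustration. Real diffusion models use a specific noise schedule, and this is not ControlNet's actual implementation.

```python
import random


def add_noise(image, noise, t):
    """Blend a clean image with noise: t=0 keeps the image, t=1 is pure noise.

    Simplified linear blend (an assumption), not a real diffusion schedule.
    """
    return [(1 - t) * px + t * n for px, n in zip(image, noise)]


def make_training_example(image, condition, t, rng):
    """Build one training example that carries an extra condition image."""
    noise = [rng.gauss(0.0, 1.0) for _ in image]
    return {
        "noisy_image": add_noise(image, noise, t),
        "condition": condition,   # the additional input (pose, line art, depth, ...)
        "timestep": t,
        "target_noise": noise,    # what the denoiser would be trained to predict
    }


# Toy 4-pixel "image" and a matching 4-pixel "line art" condition.
example = make_training_example(
    image=[0.0, 1.0, 0.5, 0.25],
    condition=[0, 1, 1, 0],
    t=0.5,
    rng=random.Random(0),
)
print(sorted(example.keys()))
```

The only difference from an unconditional setup in this sketch is the "condition" entry: the model sees it alongside the noisy image, so it can learn how the condition relates to the final picture.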


Typical Types of ControlNet

The kinds of "additional information" that ControlNet can handle are limited only by imagination, but the commonly used ones fall into a few patterns. Only the representative ones are listed here.

openpose (Pose / Stick Figure)

Specifies the pose of a person or character with a stick figure or skeleton.

depth (Depth Map)

Fixes the composition and depth using a depth map.

scribble (Doodle)

Takes only a rough doodle and generates an image based on it.

lineart / anime (Line Art)

Takes line art and generates a colored version of it.

inpaint (For Inpainting)

Naturally fills in masked areas.

Beyond these, there are many variations, such as edge extraction (Canny), segmentation, and QR codes. In principle, a ControlNet can be trained for any kind of condition as long as pairs of "images" and "corresponding representations" can be prepared.
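
To make the idea of preparing "images" and "corresponding representations" concrete, here is a minimal Python sketch that derives a condition from each image and stores the two as a pair. The edge detector below is a simple neighbor-difference threshold chosen for illustration, not Canny, and the names edge_map and build_pairs are hypothetical.

```python
def edge_map(image, threshold=0.5):
    """Mark pixels whose right or lower neighbor differs by more than threshold."""
    height, width = len(image), len(image[0])
    edges = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            dx = abs(image[y][x + 1] - image[y][x]) if x + 1 < width else 0.0
            dy = abs(image[y + 1][x] - image[y][x]) if y + 1 < height else 0.0
            if max(dx, dy) > threshold:
                edges[y][x] = 1
    return edges


def build_pairs(images, extractor):
    """Pair every image with the representation derived from it."""
    return [(extractor(img), img) for img in images]


# Toy grayscale image: dark on the left half, bright on the right half.
toy_image = [
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
]
pairs = build_pairs([toy_image], edge_map)
condition, target = pairs[0]
print(condition)
```

Swapping edge_map for a different extractor (a pose estimator, a depth estimator, and so on) would give a different kind of pair under the same structure.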


Instruction-Based Image Editing

In recent image editing models, tasks traditionally handled by ControlNet are increasingly treated as "Instruction-Based Image Editing."

Instruction-based image editing edits a given image according to instructions such as "zoom out" or "make it night."

This means that ControlNet-like operations can also be treated as "image editing."

  • Pose image + "Draw a character in black clothes with this pose"
  • Depth map + "Make it a night view photo with the same composition"
  • Rough image + "Make this rough sketch into a beautiful illustration"
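
The examples above can be sketched as a single request shape: an input image plus a text instruction. The following Python sketch is only an illustration of that framing. EditRequest, controlnet_to_edit_request, the instruction templates, and the file name are all hypothetical and do not correspond to any particular model's API.

```python
from dataclasses import dataclass


@dataclass
class EditRequest:
    image: str        # path or identifier of the input image (hypothetical)
    instruction: str  # natural-language edit instruction


# Hypothetical templates mirroring the three examples in the list above.
TEMPLATES = {
    "openpose": "Draw {prompt} with this pose",
    "depth": "Make it {prompt} with the same composition",
    "scribble": "Make this rough sketch into {prompt}",
}


def controlnet_to_edit_request(control_image, control_type, prompt):
    """Rephrase a ControlNet-style input as an instruction-based edit."""
    instruction = TEMPLATES[control_type].format(prompt=prompt)
    return EditRequest(image=control_image, instruction=instruction)


request = controlnet_to_edit_request(
    "pose.png", "openpose", "a character in black clothes"
)
print(request.instruction)
```

In this framing the control image is simply the image being "edited," and the control type is expressed in words rather than by choosing a dedicated ControlNet.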