image2image

What is image2image?

image2image is a method of using a reference image as a draft and having a picture drawn over it.

But if the draft is traced perfectly, the result is just a copy. It has no originality.

So we add noise to the original image, only to the extent that it is still recognizable, and then remove that noise. The result is a different version of the picture that follows the prompt while moderately inheriting the composition and atmosphere of the original image.

Mechanism of image2image

Let's review diffusion models and sampling once more. In ComfyUI, KSampler first fills an "empty latent" with noise, then generates an image by gradually removing that noise.

In image2image, this "empty latent" is replaced with a latent encoded from the reference image. You then use start_at_step to choose the step at which sampling starts, which determines how much noise is added to the draft.
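As a toy illustration of why the amount of noise matters, the sketch below mixes Gaussian noise into a stand-in "latent". Everything here is an assumption made for illustration: the latent is a plain list of floats, add_noise is a hypothetical helper, and the real KSampler works on tensors with noise levels taken from its scheduler.

```python
import random

def add_noise(latent, sigma, seed=0):
    """Toy noising step: return latent + sigma * Gaussian noise."""
    rng = random.Random(seed)  # fixed seed so the same noise pattern is reused
    return [x + sigma * rng.gauss(0.0, 1.0) for x in latent]

def mean_abs_diff(a, b):
    """Average absolute difference between two equally sized lists."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

# Stand-in for the latent produced by VAE Encode (hypothetical values).
reference_latent = [0.5, -1.0, 0.25, 2.0]

light = add_noise(reference_latent, sigma=0.1)   # draft still dominates
heavy = add_noise(reference_latent, sigma=10.0)  # noise dominates

print("light noise, distance from draft:", mean_abs_diff(light, reference_latent))
print("heavy noise, distance from draft:", mean_abs_diff(heavy, reference_latent))
```

With a small sigma the noisy latent stays close to the draft, so the sampler has something to inherit. With a large sigma the draft is buried, which corresponds to the text2image-like case.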

Now, let's see what happens when we change start_at_step with a KSampler (Advanced) of steps: 20.

start_at_step: 0

The latent is completely filled with noise from the start.
The draft image is not visible at all. The result is almost the same as normal text2image.
* Only Stable Diffusion 1.5 behaves slightly differently. See "image2image and text2image when denoise is 1.0".

start_at_step: 1

Sampling starts one step in.
Therefore, the amount of noise added to the draft (= the amount of noise still to be removed) decreases slightly.
However, the draft image is still barely visible.

start_at_step: 9

The amount of noise added to the draft (= the amount of noise still to be removed) decreases significantly.
The outline and composition of the draft remain clearly recognizable.

start_at_step: 20

Since it starts from the last step out of 20 steps, it is effectively the same as "doing nothing".
In other words, practically no sampling is done, and no noise is added.
Therefore, the input image is output as it is.

In this way, setting start_at_step somewhere between 1 and (steps - 1) runs sampling while preserving part of the original picture.

This is called image2image.
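The four cases above come down to simple arithmetic, sketched here in Python. The function names are hypothetical and this is not ComfyUI code. It only counts how many sampling steps remain for a given start_at_step.

```python
def steps_run(total_steps, start_at_step):
    """Number of sampling steps actually executed."""
    start = min(max(start_at_step, 0), total_steps)  # clamp to a valid range
    return total_steps - start

def is_image2image(total_steps, start_at_step):
    """True when some noise is added but the draft is not fully replaced."""
    return 1 <= start_at_step <= total_steps - 1

for start in (0, 1, 9, 20):
    print(
        f"start_at_step={start:2d}: "
        f"{steps_run(20, start):2d} steps run, "
        f"image2image={is_image2image(20, start)}"
    )
```

The fewer steps that remain, the less noise is added at the start, and the more of the draft survives.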

Workflow with KSampler (Advanced)

SD1.5_image2image_KSampler_(Advanced).json

{
  "id": "8b9f7796-0873-4025-be3c-0f997f67f866",
  "revision": 0,
  "last_node_id": 15,
  "last_link_id": 32,
  "nodes": [
    {
      "id": 8,
      "type": "VAEDecode",
      "pos": [
        1209,
        186
      ],
      "size": [
        210,
        46
      ],
      "flags": {},
      "order": 7,
      "mode": 0,
      "inputs": [
        {
          "name": "samples",
          "type": "LATENT",
          "link": 28
        },
        {
          "name": "vae",
          "type": "VAE",
          "link": 10
        }
      ],
      "outputs": [
        {
          "name": "IMAGE",
          "type": "IMAGE",
          "slot_index": 0,
          "links": [
            9
          ]
        }
      ],
      "properties": {
        "cnr_id": "comfy-core",
        "ver": "0.3.33",
        "Node name for S&R": "VAEDecode"
      },
      "widgets_values": []
    },
    {
      "id": 7,
      "type": "CLIPTextEncode",
      "pos": [
        416.1970166015625,
        392.37848510742185
      ],
      "size": [
        410.75801513671877,
        158.82607910156253
      ],
      "flags": {},
      "order": 5,
      "mode": 0,
      "inputs": [
        {
          "name": "clip",
          "type": "CLIP",
          "link": 5
        }
      ],
      "outputs": [
        {
          "name": "CONDITIONING",
          "type": "CONDITIONING",
          "slot_index": 0,
          "links": [
            12
          ]
        }
      ],
      "properties": {
        "cnr_id": "comfy-core",
        "ver": "0.3.33",
        "Node name for S&R": "CLIPTextEncode"
      },
      "widgets_values": [
        "text, watermark"
      ]
    },
    {
      "id": 10,
      "type": "VAELoader",
      "pos": [
        464.1892561983473,
        736.7997591425777
      ],
      "size": [
        210,
        58
      ],
      "flags": {},
      "order": 0,
      "mode": 0,
      "inputs": [],
      "outputs": [
        {
          "name": "VAE",
          "type": "VAE",
          "links": [
            10,
            30
          ]
        }
      ],
      "properties": {
        "cnr_id": "comfy-core",
        "ver": "0.3.76",
        "Node name for S&R": "VAELoader"
      },
      "widgets_values": [
        "vae-ft-mse-840000-ema-pruned.safetensors"
      ]
    },
    {
      "id": 13,
      "type": "LoadImage",
      "pos": [
        145.97903082644623,
        611.5931484814206
      ],
      "size": [
        272.2618963068182,
        377.6363636363636
      ],
      "flags": {},
      "order": 1,
      "mode": 0,
      "inputs": [],
      "outputs": [
        {
          "name": "IMAGE",
          "type": "IMAGE",
          "links": [
            18
          ]
        },
        {
          "name": "MASK",
          "type": "MASK",
          "links": null
        }
      ],
      "properties": {
        "cnr_id": "comfy-core",
        "ver": "0.3.76",
        "Node name for S&R": "LoadImage"
      },
      "widgets_values": [
        "vivi (1).png",
        "image"
      ],
      "color": "#232",
      "bgcolor": "#353"
    },
    {
      "id": 9,
      "type": "SaveImage",
      "pos": [
        1451,
        186
      ],
      "size": [
        354.2876035004722,
        433.23967321788405
      ],
      "flags": {},
      "order": 8,
      "mode": 0,
      "inputs": [
        {
          "name": "images",
          "type": "IMAGE",
          "link": 9
        }
      ],
      "outputs": [],
      "properties": {
        "cnr_id": "comfy-core",
        "ver": "0.3.33"
      },
      "widgets_values": [
        "ComfyUI"
      ]
    },
    {
      "id": 6,
      "type": "CLIPTextEncode",
      "pos": [
        415,
        186
      ],
      "size": [
        411.95503173828126,
        151.0030493164063
      ],
      "flags": {},
      "order": 4,
      "mode": 0,
      "inputs": [
        {
          "name": "clip",
          "type": "CLIP",
          "link": 3
        }
      ],
      "outputs": [
        {
          "name": "CONDITIONING",
          "type": "CONDITIONING",
          "slot_index": 0,
          "links": [
            11
          ]
        }
      ],
      "properties": {
        "cnr_id": "comfy-core",
        "ver": "0.3.33",
        "Node name for S&R": "CLIPTextEncode"
      },
      "widgets_values": [
        "high quality, cute clay figure of a small humanoid character with long pink hair, yellow curved horns, purple boots, simple flat colors, minimal facial features, soft studio lighting, clean background"
      ]
    },
    {
      "id": 12,
      "type": "VAEEncode",
      "pos": [
        685.9517580991734,
        611.5931484814206
      ],
      "size": [
        140,
        46
      ],
      "flags": {},
      "order": 3,
      "mode": 0,
      "inputs": [
        {
          "name": "pixels",
          "type": "IMAGE",
          "link": 18
        },
        {
          "name": "vae",
          "type": "VAE",
          "link": 30
        }
      ],
      "outputs": [
        {
          "name": "LATENT",
          "type": "LATENT",
          "links": [
            32
          ]
        }
      ],
      "properties": {
        "cnr_id": "comfy-core",
        "ver": "0.3.76",
        "Node name for S&R": "VAEEncode"
      },
      "widgets_values": [],
      "color": "#232",
      "bgcolor": "#353"
    },
    {
      "id": 11,
      "type": "KSamplerAdvanced",
      "pos": [
        867.0434936363629,
        186
      ],
      "size": [
        306.34804687500014,
        334
      ],
      "flags": {},
      "order": 6,
      "mode": 0,
      "inputs": [
        {
          "name": "model",
          "type": "MODEL",
          "link": 14
        },
        {
          "name": "positive",
          "type": "CONDITIONING",
          "link": 11
        },
        {
          "name": "negative",
          "type": "CONDITIONING",
          "link": 12
        },
        {
          "name": "latent_image",
          "type": "LATENT",
          "link": 32
        }
      ],
      "outputs": [
        {
          "name": "LATENT",
          "type": "LATENT",
          "links": [
            28
          ]
        }
      ],
      "properties": {
        "cnr_id": "comfy-core",
        "ver": "0.3.76",
        "Node name for S&R": "KSamplerAdvanced"
      },
      "widgets_values": [
        "enable",
        123,
        "fixed",
        20,
        8,
        "euler",
        "normal",
        6,
        20,
        "enable"
      ],
      "color": "#432",
      "bgcolor": "#653"
    },
    {
      "id": 4,
      "type": "CheckpointLoaderSimple",
      "pos": [
        38.43636363636362,
        363.0864500000007
      ],
      "size": [
        315,
        98
      ],
      "flags": {},
      "order": 2,
      "mode": 0,
      "inputs": [],
      "outputs": [
        {
          "name": "MODEL",
          "type": "MODEL",
          "slot_index": 0,
          "links": [
            14
          ]
        },
        {
          "name": "CLIP",
          "type": "CLIP",
          "slot_index": 1,
          "links": [
            3,
            5
          ]
        },
        {
          "name": "VAE",
          "type": "VAE",
          "slot_index": 2,
          "links": []
        }
      ],
      "properties": {
        "cnr_id": "comfy-core",
        "ver": "0.3.33",
        "Node name for S&R": "CheckpointLoaderSimple"
      },
      "widgets_values": [
        "v1-5-pruned-emaonly-fp16.safetensors"
      ]
    }
  ],
  "links": [
    [
      3,
      4,
      1,
      6,
      0,
      "CLIP"
    ],
    [
      5,
      4,
      1,
      7,
      0,
      "CLIP"
    ],
    [
      9,
      8,
      0,
      9,
      0,
      "IMAGE"
    ],
    [
      10,
      10,
      0,
      8,
      1,
      "VAE"
    ],
    [
      11,
      6,
      0,
      11,
      1,
      "CONDITIONING"
    ],
    [
      12,
      7,
      0,
      11,
      2,
      "CONDITIONING"
    ],
    [
      14,
      4,
      0,
      11,
      0,
      "MODEL"
    ],
    [
      18,
      13,
      0,
      12,
      0,
      "IMAGE"
    ],
    [
      28,
      11,
      0,
      8,
      0,
      "LATENT"
    ],
    [
      30,
      10,
      0,
      12,
      1,
      "VAE"
    ],
    [
      32,
      12,
      0,
      11,
      3,
      "LATENT"
    ]
  ],
  "groups": [],
  "config": {},
  "extra": {
    "ds": {
      "scale": 0.7513148009015777,
      "offset": [
        61.56363636363638,
        -86
      ]
    },
    "frontendVersion": "1.34.5",
    "VHS_latentpreview": false,
    "VHS_latentpreviewrate": 0,
    "VHS_MetadataImage": true,
    "VHS_KeepIntermediate": true
  },
  "version": 0.4
}

🟩 Convert the image to latent with the VAE Encode node.
🟨 Try changing the value of start_at_step to see how much of the original image remains.
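To try several start_at_step values without editing the node by hand, the saved workflow can be patched with a short Python script. This is a sketch under assumptions: set_start_at_step is a hypothetical helper, and the position of start_at_step inside widgets_values (index 7, the value 6 in the JSON above) is inferred from this workflow rather than taken from ComfyUI documentation. A one-node stand-in is used here so the example runs on its own.

```python
import copy

# Assumption: start_at_step is the 8th entry of widgets_values,
# inferred from the KSamplerAdvanced node in the workflow above.
START_AT_STEP_INDEX = 7

def set_start_at_step(workflow, start_at_step):
    """Return a copy of the workflow with start_at_step changed."""
    updated = copy.deepcopy(workflow)  # leave the original untouched
    for node in updated["nodes"]:
        if node["type"] == "KSamplerAdvanced":
            node["widgets_values"][START_AT_STEP_INDEX] = start_at_step
    return updated

# Minimal stand-in for the full workflow JSON (only the sampler node).
workflow = {
    "nodes": [
        {
            "id": 11,
            "type": "KSamplerAdvanced",
            "widgets_values": [
                "enable", 123, "fixed", 20, 8,
                "euler", "normal", 6, 20, "enable",
            ],
        }
    ]
}

variants = {s: set_start_at_step(workflow, s) for s in (1, 9, 19)}
for s, wf in variants.items():
    print(s, wf["nodes"][0]["widgets_values"])
```

With the real file, the dictionary would come from json.load on the downloaded workflow, and each variant could be written back out with json.dump.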

Workflow with KSampler

Of course, you can do image2image with the standard KSampler as well. However, the parameter that determines how much of the original image remains is different from the one in KSampler (Advanced).

SD1.5_image2image_KSampler.json

{
  "id": "8b9f7796-0873-4025-be3c-0f997f67f866",
  "revision": 0,
  "last_node_id": 16,
  "last_link_id": 39,
  "nodes": [
    {
      "id": 8,
      "type": "VAEDecode",
      "pos": [
        1209,
        186
      ],
      "size": [
        210,
        46
      ],
      "flags": {},
      "order": 7,
      "mode": 0,
      "inputs": [
        {
          "name": "samples",
          "type": "LATENT",
          "link": 39
        },
        {
          "name": "vae",
          "type": "VAE",
          "link": 10
        }
      ],
      "outputs": [
        {
          "name": "IMAGE",
          "type": "IMAGE",
          "slot_index": 0,
          "links": [
            9
          ]
        }
      ],
      "properties": {
        "cnr_id": "comfy-core",
        "ver": "0.3.33",
        "Node name for S&R": "VAEDecode"
      },
      "widgets_values": []
    },
    {
      "id": 10,
      "type": "VAELoader",
      "pos": [
        464.1892561983473,
        736.7997591425777
      ],
      "size": [
        210,
        58
      ],
      "flags": {},
      "order": 0,
      "mode": 0,
      "inputs": [],
      "outputs": [
        {
          "name": "VAE",
          "type": "VAE",
          "links": [
            10,
            30
          ]
        }
      ],
      "properties": {
        "cnr_id": "comfy-core",
        "ver": "0.3.76",
        "Node name for S&R": "VAELoader"
      },
      "widgets_values": [
        "vae-ft-mse-840000-ema-pruned.safetensors"
      ]
    },
    {
      "id": 13,
      "type": "LoadImage",
      "pos": [
        145.97903082644623,
        611.5931484814206
      ],
      "size": [
        272.2618963068182,
        377.6363636363636
      ],
      "flags": {},
      "order": 1,
      "mode": 0,
      "inputs": [],
      "outputs": [
        {
          "name": "IMAGE",
          "type": "IMAGE",
          "links": [
            18
          ]
        },
        {
          "name": "MASK",
          "type": "MASK",
          "links": null
        }
      ],
      "properties": {
        "cnr_id": "comfy-core",
        "ver": "0.3.76",
        "Node name for S&R": "LoadImage"
      },
      "widgets_values": [
        "vivi (1).png",
        "image"
      ],
      "color": "#232",
      "bgcolor": "#353"
    },
    {
      "id": 9,
      "type": "SaveImage",
      "pos": [
        1451,
        186
      ],
      "size": [
        354.2876035004722,
        433.23967321788405
      ],
      "flags": {},
      "order": 8,
      "mode": 0,
      "inputs": [
        {
          "name": "images",
          "type": "IMAGE",
          "link": 9
        }
      ],
      "outputs": [],
      "properties": {
        "cnr_id": "comfy-core",
        "ver": "0.3.33"
      },
      "widgets_values": [
        "ComfyUI"
      ]
    },
    {
      "id": 6,
      "type": "CLIPTextEncode",
      "pos": [
        415,
        186
      ],
      "size": [
        411.95503173828126,
        151.0030493164063
      ],
      "flags": {},
      "order": 4,
      "mode": 0,
      "inputs": [
        {
          "name": "clip",
          "type": "CLIP",
          "link": 3
        }
      ],
      "outputs": [
        {
          "name": "CONDITIONING",
          "type": "CONDITIONING",
          "slot_index": 0,
          "links": [
            35
          ]
        }
      ],
      "properties": {
        "cnr_id": "comfy-core",
        "ver": "0.3.33",
        "Node name for S&R": "CLIPTextEncode"
      },
      "widgets_values": [
        "high quality, cute clay figure of a small humanoid character with long pink hair, yellow curved horns, purple boots, simple flat colors, minimal facial features, soft studio lighting, clean background"
      ]
    },
    {
      "id": 7,
      "type": "CLIPTextEncode",
      "pos": [
        416.1970166015625,
        392.37848510742185
      ],
      "size": [
        410.75801513671877,
        158.82607910156253
      ],
      "flags": {},
      "order": 5,
      "mode": 0,
      "inputs": [
        {
          "name": "clip",
          "type": "CLIP",
          "link": 5
        }
      ],
      "outputs": [
        {
          "name": "CONDITIONING",
          "type": "CONDITIONING",
          "slot_index": 0,
          "links": [
            36
          ]
        }
      ],
      "properties": {
        "cnr_id": "comfy-core",
        "ver": "0.3.33",
        "Node name for S&R": "CLIPTextEncode"
      },
      "widgets_values": [
        "text, watermark"
      ]
    },
    {
      "id": 12,
      "type": "VAEEncode",
      "pos": [
        685.9517580991734,
        611.5931484814206
      ],
      "size": [
        140,
        46
      ],
      "flags": {},
      "order": 3,
      "mode": 0,
      "inputs": [
        {
          "name": "pixels",
          "type": "IMAGE",
          "link": 18
        },
        {
          "name": "vae",
          "type": "VAE",
          "link": 30
        }
      ],
      "outputs": [
        {
          "name": "LATENT",
          "type": "LATENT",
          "links": [
            37
          ]
        }
      ],
      "properties": {
        "cnr_id": "comfy-core",
        "ver": "0.3.76",
        "Node name for S&R": "VAEEncode"
      },
      "widgets_values": [],
      "color": "#232",
      "bgcolor": "#353"
    },
    {
      "id": 4,
      "type": "CheckpointLoaderSimple",
      "pos": [
        38.43636363636362,
        363.0864500000007
      ],
      "size": [
        315,
        98
      ],
      "flags": {},
      "order": 2,
      "mode": 0,
      "inputs": [],
      "outputs": [
        {
          "name": "MODEL",
          "type": "MODEL",
          "slot_index": 0,
          "links": [
            38
          ]
        },
        {
          "name": "CLIP",
          "type": "CLIP",
          "slot_index": 1,
          "links": [
            3,
            5
          ]
        },
        {
          "name": "VAE",
          "type": "VAE",
          "slot_index": 2,
          "links": []
        }
      ],
      "properties": {
        "cnr_id": "comfy-core",
        "ver": "0.3.33",
        "Node name for S&R": "CheckpointLoaderSimple"
      },
      "widgets_values": [
        "v1-5-pruned-emaonly-fp16.safetensors"
      ]
    },
    {
      "id": 16,
      "type": "KSampler",
      "pos": [
        871.9451695085444,
        186
      ],
      "size": [
        301.7355371900828,
        262
      ],
      "flags": {},
      "order": 6,
      "mode": 0,
      "inputs": [
        {
          "name": "model",
          "type": "MODEL",
          "link": 38
        },
        {
          "name": "positive",
          "type": "CONDITIONING",
          "link": 35
        },
        {
          "name": "negative",
          "type": "CONDITIONING",
          "link": 36
        },
        {
          "name": "latent_image",
          "type": "LATENT",
          "link": 37
        }
      ],
      "outputs": [
        {
          "name": "LATENT",
          "type": "LATENT",
          "links": [
            39
          ]
        }
      ],
      "properties": {
        "cnr_id": "comfy-core",
        "ver": "0.3.76",
        "Node name for S&R": "KSampler"
      },
      "widgets_values": [
        123,
        "fixed",
        20,
        8,
        "euler",
        "normal",
        0.7
      ],
      "color": "#323",
      "bgcolor": "#535"
    }
  ],
  "links": [
    [
      3,
      4,
      1,
      6,
      0,
      "CLIP"
    ],
    [
      5,
      4,
      1,
      7,
      0,
      "CLIP"
    ],
    [
      9,
      8,
      0,
      9,
      0,
      "IMAGE"
    ],
    [
      10,
      10,
      0,
      8,
      1,
      "VAE"
    ],
    [
      18,
      13,
      0,
      12,
      0,
      "IMAGE"
    ],
    [
      30,
      10,
      0,
      12,
      1,
      "VAE"
    ],
    [
      35,
      6,
      0,
      16,
      1,
      "CONDITIONING"
    ],
    [
      36,
      7,
      0,
      16,
      2,
      "CONDITIONING"
    ],
    [
      37,
      12,
      0,
      16,
      3,
      "LATENT"
    ],
    [
      38,
      4,
      0,
      16,
      0,
      "MODEL"
    ],
    [
      39,
      16,
      0,
      8,
      0,
      "LATENT"
    ]
  ],
  "groups": [],
  "config": {},
  "extra": {
    "ds": {
      "scale": 0.9090909090909091,
      "offset": [
        61.56363636363638,
        -86
      ]
    },
    "frontendVersion": "1.34.5",
    "VHS_latentpreview": false,
    "VHS_latentpreviewrate": 0,
    "VHS_MetadataImage": true,
    "VHS_KeepIntermediate": true
  },
  "version": 0.4
}

🟪 Set how much of the original image remains by changing the value of denoise.
- At 1.0, it fills completely with noise. In other words, it is the same as text2image.
- At 0.0, no noise is added at all, so the original image is output as it is.

Difference between Standard and Advanced

Let's compare the standard KSampler with KSampler (Advanced).

The goal is the same: both adjust how much noise is added to the original image and then removed.

However, the parameters are assigned differently, which is a bit confusing. Let's look at how each behaves with settings that appear to produce the same result.

KSampler (Advanced)

For example, if you set steps: 20 and start_at_step: 4, it executes only the part from step 4 to step 20 of the 20 total steps.
The actual number of sampling steps executed is 20 - 4 = 16.

Standard KSampler

Similarly, if you set steps: 20 and denoise: 0.8, the amount of noise applied is similar, but sampling still runs 20 times.
Even if you change denoise to 0.5 or 0.1, it still samples 20 times.

Advanced
- steps is the total number of steps, and only the steps from start_at_step onward are executed → the execution count changes
Standard
- steps is the actual execution count, and denoise changes only the strength of the noise → the execution count does not change
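The difference in execution count can be sketched in Python. This is a toy model, not ComfyUI's code: toy_sigmas is a made-up linear noise schedule, and the way standard_sigmas stretches the schedule to int(steps / denoise) steps and keeps only the tail is an assumption about how denoise is applied.

```python
def toy_sigmas(n_steps, sigma_max=1.0):
    """Toy linear schedule: n_steps + 1 noise levels from sigma_max down to 0."""
    return [sigma_max * (n_steps - i) / n_steps for i in range(n_steps + 1)]

def advanced_sigmas(steps, start_at_step):
    """Advanced: build the full schedule, then skip the first start_at_step steps."""
    return toy_sigmas(steps)[start_at_step:]

def standard_sigmas(steps, denoise):
    """Standard (assumed behavior): build a longer schedule, keep the last `steps` steps."""
    if denoise >= 1.0:
        return toy_sigmas(steps)
    if denoise <= 0.0:
        return []
    longer = int(steps / denoise)
    return toy_sigmas(longer)[-(steps + 1):]

def executed(sigmas):
    """Each step moves between two neighboring noise levels."""
    return max(len(sigmas) - 1, 0)

adv = advanced_sigmas(20, 4)
std = standard_sigmas(20, 0.8)
print("Advanced: start noise", adv[0], "steps run", executed(adv))
print("Standard: start noise", std[0], "steps run", executed(std))
```

In this toy model both samplers start from about the same noise level, but Advanced runs 16 steps while Standard still runs 20.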

If you want the standard KSampler to apply noise in a way close to Advanced, the following formula gives a rough estimate. (It does not match perfectly.)

Steps to set in the standard KSampler ≒ total steps in Advanced * denoise
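As a worked example of this estimate, the hypothetical helper below converts KSampler (Advanced) settings into approximate standard KSampler settings. It assumes that denoise corresponds to the fraction of steps that Advanced actually runs, which is an interpretation rather than something ComfyUI guarantees.

```python
def advanced_to_standard(total_steps, start_at_step):
    """Estimate (steps, denoise) for the standard KSampler.

    Assumption: denoise ~= (total_steps - start_at_step) / total_steps.
    """
    if not 0 <= start_at_step <= total_steps:
        raise ValueError("start_at_step must be between 0 and total_steps")
    denoise = (total_steps - start_at_step) / total_steps
    steps = round(total_steps * denoise)  # the formula from the text
    return steps, denoise

steps, denoise = advanced_to_standard(total_steps=20, start_at_step=4)
print(f"standard KSampler: steps={steps}, denoise={denoise:.2f}")
```

For steps: 20 and start_at_step: 4 in Advanced, this suggests roughly steps: 16 and denoise: 0.8 in the standard KSampler.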

You don't really need to worry about it

After all that explanation, both simply determine how much noise to add to the original image.

Care must be taken when mixing the standard KSampler and KSampler (Advanced), but hardly anyone builds such a workflow, so there is no need to worry.

It is enough to know which parameter controls how much of the original image remains.

image2image and text2image when denoise is 1.0

With denoise: 1.0, the original image is completely covered by noise, so in principle image2image should be identical to text2image using the Empty Latent Image node.

But they are not the same in Stable Diffusion 1.5. (I suspect it is an implementation difference, but I don't know the exact cause.) On the other hand, in recent models (Flux etc.), they produce exactly the same image.

Stable Diffusion 1.5 is a special case. On this site, we treat image2image with denoise 1.0 and text2image as the same thing, as originally designed.
