What is Qwen-Image-Layered?
Qwen-Image-Layered is a diffusion model that decomposes an input image into an arbitrary number of layers.
With the instruction-based image editing that is popular lately, parts of the image unrelated to the instruction sometimes change as a side effect. Hence the motivation: why not separate the image into layers, the way designers have always worked, and edit only the target layer?
It is also notable as the first general-purpose method to handle transparent (RGBA) images. Previous methods required post-processing or special handling only at decode time, whereas this one takes the more straightforward approach of treating everything as RGBA images throughout.
Model Download
- diffusion_models
  - qwen_image_layered_fp8mixed.safetensors (20.5 GB)
- text_encoders
  - qwen_2.5_vl_7b_fp8_scaled.safetensors (9.38 GB)
- vae
  - qwen_image_layered_vae.safetensors (254 MB)
- gguf (Optional)
```
📂ComfyUI/
└── 📂models/
    ├── 📂diffusion_models/
    │   └── qwen_image_layered_fp8mixed.safetensors
    ├── 📂text_encoders/
    │   └── qwen_2.5_vl_7b_fp8_scaled.safetensors
    ├── 📂unet/
    │   └── Qwen_Image_Layered-XXXX.gguf ← Only when using gguf
    └── 📂vae/
        └── qwen_image_layered_vae.safetensors
```
Workflow

```json
{
"id": "d8034549-7e0a-40f1-8c2e-de3ffc6f1cae",
"revision": 0,
"last_node_id": 87,
"last_link_id": 148,
"nodes": [
{
"id": 38,
"type": "CLIPLoader",
"pos": [
56.288665771484375,
312.74468994140625
],
"size": [
301.3524169921875,
106
],
"flags": {},
"order": 0,
"mode": 0,
"inputs": [],
"outputs": [
{
"name": "CLIP",
"type": "CLIP",
"slot_index": 0,
"links": [
74,
75
]
}
],
"properties": {
"cnr_id": "comfy-core",
"ver": "0.3.33",
"Node name for S&R": "CLIPLoader"
},
"widgets_values": [
"qwen_2.5_vl_7b_fp8_scaled.safetensors",
"qwen_image",
"default"
],
"color": "#432",
"bgcolor": "#653"
},
{
"id": 57,
"type": "ReferenceLatent",
"pos": [
864.2781462760086,
186
],
"size": [
204.134765625,
46
],
"flags": {},
"order": 12,
"mode": 0,
"inputs": [
{
"name": "conditioning",
"type": "CONDITIONING",
"link": 103
},
{
"name": "latent",
"shape": 7,
"type": "LATENT",
"link": 110
}
],
"outputs": [
{
"name": "CONDITIONING",
"type": "CONDITIONING",
"links": [
104
]
}
],
"properties": {
"cnr_id": "comfy-core",
"ver": "0.6.0",
"Node name for S&R": "ReferenceLatent"
},
"widgets_values": []
},
{
"id": 58,
"type": "ReferenceLatent",
"pos": [
864.2781462760086,
405.392333984375
],
"size": [
204.134765625,
46
],
"flags": {},
"order": 11,
"mode": 0,
"inputs": [
{
"name": "conditioning",
"type": "CONDITIONING",
"link": 102
},
{
"name": "latent",
"shape": 7,
"type": "LATENT",
"link": 109
}
],
"outputs": [
{
"name": "CONDITIONING",
"type": "CONDITIONING",
"links": [
105
]
}
],
"properties": {
"cnr_id": "comfy-core",
"ver": "0.6.0",
"Node name for S&R": "ReferenceLatent"
},
"widgets_values": []
},
{
"id": 54,
"type": "ModelSamplingAuraFlow",
"pos": [
838.0823302359695,
42.94671378647985
],
"size": [
230.33058166503906,
58
],
"flags": {},
"order": 7,
"mode": 0,
"inputs": [
{
"name": "model",
"type": "MODEL",
"link": 99
}
],
"outputs": [
{
"name": "MODEL",
"type": "MODEL",
"links": [
100
]
}
],
"properties": {
"cnr_id": "comfy-core",
"ver": "0.3.49",
"Node name for S&R": "ModelSamplingAuraFlow"
},
"widgets_values": [
1
]
},
{
"id": 7,
"type": "CLIPTextEncode",
"pos": [
415.9506530761719,
405.392333984375
],
"size": [
418.3189392089844,
107.08506774902344
],
"flags": {},
"order": 6,
"mode": 0,
"inputs": [
{
"name": "clip",
"type": "CLIP",
"link": 75
}
],
"outputs": [
{
"name": "CONDITIONING",
"type": "CONDITIONING",
"slot_index": 0,
"links": [
102
]
}
],
"title": "CLIP Text Encode (Negative Prompt)",
"properties": {
"cnr_id": "comfy-core",
"ver": "0.3.33",
"Node name for S&R": "CLIPTextEncode"
},
"widgets_values": [
"text, worst quality, blurry, ugly"
]
},
{
"id": 64,
"type": "ImageScaleToTotalPixels",
"pos": [
249.72535062227473,
718.9234534762987
],
"size": [
229.5555480957031,
106
],
"flags": {},
"order": 8,
"mode": 0,
"inputs": [
{
"name": "image",
"type": "IMAGE",
"link": 115
}
],
"outputs": [
{
"name": "IMAGE",
"type": "IMAGE",
"links": [
113,
114
]
}
],
"properties": {
"cnr_id": "comfy-core",
"ver": "0.6.0",
"Node name for S&R": "ImageScaleToTotalPixels"
},
"widgets_values": [
"nearest-exact",
0.5,
1
]
},
{
"id": 6,
"type": "CLIPTextEncode",
"pos": [
415,
186
],
"size": [
419.26959228515625,
156.00363159179688
],
"flags": {},
"order": 5,
"mode": 0,
"inputs": [
{
"name": "clip",
"type": "CLIP",
"link": 74
}
],
"outputs": [
{
"name": "CONDITIONING",
"type": "CONDITIONING",
"slot_index": 0,
"links": [
103
]
}
],
"title": "CLIP Text Encode (Positive Prompt)",
"properties": {
"cnr_id": "comfy-core",
"ver": "0.3.33",
"Node name for S&R": "CLIPTextEncode"
},
"widgets_values": [
"Intimate macro of a 33-year-old Brazilian dancer's feet en pointe, focus on toes and ballet shoe, studio lighting from above, shot on Sony FE 90mm f/2.8 macro, realistic worn shoe fabric texture, individual toe details visible through shoe, strained tendons, slight blood spot on shoe tip, dusty studio floor texture, ankle ribbons tied tight uphill"
]
},
{
"id": 37,
"type": "UNETLoader",
"pos": [
497.22367921939565,
42.94671378647985
],
"size": [
305.3782043457031,
82
],
"flags": {},
"order": 1,
"mode": 0,
"inputs": [],
"outputs": [
{
"name": "MODEL",
"type": "MODEL",
"slot_index": 0,
"links": [
99
]
}
],
"properties": {
"cnr_id": "comfy-core",
"ver": "0.3.33",
"Node name for S&R": "UNETLoader"
},
"widgets_values": [
"Qwen-Image\\qwen_image_layered_fp8mixed.safetensors",
"fp8_e4m3fn"
],
"color": "#323",
"bgcolor": "#535"
},
{
"id": 39,
"type": "VAELoader",
"pos": [
223.02005587937379,
578.5647381339587
],
"size": [
256.26084283860405,
58
],
"flags": {},
"order": 2,
"mode": 0,
"inputs": [],
"outputs": [
{
"name": "VAE",
"type": "VAE",
"slot_index": 0,
"links": [
116,
122
]
}
],
"properties": {
"cnr_id": "comfy-core",
"ver": "0.3.33",
"Node name for S&R": "VAELoader"
},
"widgets_values": [
"qwen_image_layered_vae.safetensors"
],
"color": "#322",
"bgcolor": "#533"
},
{
"id": 61,
"type": "LoadImage",
"pos": [
-134.3561028852609,
718.9234534762987
],
"size": [
353.5766357421875,
459.44451904296864
],
"flags": {},
"order": 3,
"mode": 0,
"inputs": [],
"outputs": [
{
"name": "IMAGE",
"type": "IMAGE",
"links": [
115
]
},
{
"name": "MASK",
"type": "MASK",
"links": null
}
],
"properties": {
"cnr_id": "comfy-core",
"ver": "0.6.0",
"Node name for S&R": "LoadImage"
},
"widgets_values": [
"pasted/image (113).png",
"image"
]
},
{
"id": 60,
"type": "VAEEncode",
"pos": [
512.2301683876235,
581.0691055180919
],
"size": [
171.72218557769065,
46
],
"flags": {},
"order": 9,
"mode": 0,
"inputs": [
{
"name": "pixels",
"type": "IMAGE",
"link": 113
},
{
"name": "vae",
"type": "VAE",
"link": 116
}
],
"outputs": [
{
"name": "LATENT",
"type": "LATENT",
"links": [
109,
110
]
}
],
"properties": {
"cnr_id": "comfy-core",
"ver": "0.6.0",
"Node name for S&R": "VAEEncode"
},
"widgets_values": []
},
{
"id": 63,
"type": "GetImageSize",
"pos": [
512.2301683876235,
718.9234534762987
],
"size": [
210,
136
],
"flags": {},
"order": 10,
"mode": 0,
"inputs": [
{
"name": "image",
"type": "IMAGE",
"link": 114
}
],
"outputs": [
{
"name": "width",
"type": "INT",
"links": [
117
]
},
{
"name": "height",
"type": "INT",
"links": [
118
]
},
{
"name": "batch_size",
"type": "INT",
"links": null
}
],
"properties": {
"cnr_id": "comfy-core",
"ver": "0.6.0",
"Node name for S&R": "GetImageSize"
},
"widgets_values": []
},
{
"id": 66,
"type": "VAEDecode",
"pos": [
1696.6426615505557,
173.13380452764375
],
"size": [
166.0271370269786,
46
],
"flags": {},
"order": 16,
"mode": 0,
"inputs": [
{
"name": "samples",
"type": "LATENT",
"link": 121
},
{
"name": "vae",
"type": "VAE",
"link": 122
}
],
"outputs": [
{
"name": "IMAGE",
"type": "IMAGE",
"slot_index": 0,
"links": [
120,
128,
129
]
}
],
"properties": {
"cnr_id": "comfy-core",
"ver": "0.3.33",
"Node name for S&R": "VAEDecode"
},
"widgets_values": []
},
{
"id": 77,
"type": "ImageFromBatch",
"pos": [
1562.2305676328322,
947.806885766381
],
"size": [
210,
82
],
"flags": {},
"order": 19,
"mode": 0,
"inputs": [
{
"name": "image",
"type": "IMAGE",
"link": 129
}
],
"outputs": [
{
"name": "IMAGE",
"type": "IMAGE",
"links": [
131
]
}
],
"properties": {
"cnr_id": "comfy-core",
"ver": "0.6.0",
"Node name for S&R": "ImageFromBatch"
},
"widgets_values": [
2,
1
]
},
{
"id": 3,
"type": "KSampler",
"pos": [
1104.4448189452391,
173.13380452764375
],
"size": [
315,
262
],
"flags": {},
"order": 14,
"mode": 0,
"inputs": [
{
"name": "model",
"type": "MODEL",
"link": 100
},
{
"name": "positive",
"type": "CONDITIONING",
"link": 104
},
{
"name": "negative",
"type": "CONDITIONING",
"link": 105
},
{
"name": "latent_image",
"type": "LATENT",
"link": 108
}
],
"outputs": [
{
"name": "LATENT",
"type": "LATENT",
"slot_index": 0,
"links": [
119
]
}
],
"properties": {
"cnr_id": "comfy-core",
"ver": "0.3.33",
"Node name for S&R": "KSampler"
},
"widgets_values": [
1234,
"fixed",
20,
2.5,
"euler",
"simple",
1
]
},
{
"id": 76,
"type": "ImageFromBatch",
"pos": [
1562.2305676328322,
797.301847591665
],
"size": [
210,
82
],
"flags": {},
"order": 18,
"mode": 0,
"inputs": [
{
"name": "image",
"type": "IMAGE",
"link": 128
}
],
"outputs": [
{
"name": "IMAGE",
"type": "IMAGE",
"links": [
136
]
}
],
"properties": {
"cnr_id": "comfy-core",
"ver": "0.6.0",
"Node name for S&R": "ImageFromBatch"
},
"widgets_values": [
1,
1
]
},
{
"id": 67,
"type": "SaveImage",
"pos": [
1939.850523648282,
171.13269321533235
],
"size": [
428.5909735732416,
468.94454416638166
],
"flags": {
"collapsed": false
},
"order": 17,
"mode": 0,
"inputs": [
{
"name": "images",
"type": "IMAGE",
"link": 120
}
],
"outputs": [],
"properties": {
"cnr_id": "comfy-core",
"ver": "0.3.76"
},
"widgets_values": [
"ComfyUI"
]
},
{
"id": 65,
"type": "LatentCutToBatch",
"pos": [
1453.0437402478974,
173.13380452764375
],
"size": [
210,
82
],
"flags": {},
"order": 15,
"mode": 0,
"inputs": [
{
"name": "samples",
"type": "LATENT",
"link": 119
}
],
"outputs": [
{
"name": "LATENT",
"type": "LATENT",
"links": [
121
]
}
],
"properties": {
"cnr_id": "comfy-core",
"ver": "0.6.0",
"Node name for S&R": "LatentCutToBatch"
},
"widgets_values": [
"t",
1
],
"color": "#332922",
"bgcolor": "#593930"
},
{
"id": 55,
"type": "MarkdownNote",
"pos": [
12.546970997699502,
-11.88447421897053
],
"size": [
345.70001220703125,
225.77000427246094
],
"flags": {},
"order": 4,
"mode": 0,
"inputs": [],
"outputs": [],
"properties": {},
"widgets_values": [
"## models\n\n- [qwen_image_layered_fp8mixed.safetensors](https://huggingface.co/Comfy-Org/Qwen-Image-Layered_ComfyUI/blob/main/split_files/diffusion_models/qwen_image_layered_fp8mixed.safetensors)\n- [qwen_2.5_vl_7b_fp8_scaled.safetensors](https://huggingface.co/Comfy-Org/Qwen-Image_ComfyUI/blob/main/split_files/text_encoders/qwen_2.5_vl_7b_fp8_scaled.safetensors)\n- [qwen_image_layered_vae.safetensors](https://huggingface.co/Comfy-Org/Qwen-Image-Layered_ComfyUI/blob/main/split_files/vae/qwen_image_layered_vae.safetensors)\n\n\n```\n📂ComfyUI/\n└── 📂models/\n ├── 📂diffusion_models/\n │ └── qwen_image_layered_fp8mixed.safetensors\n ├── 📂text_encoders/\n │ └── qwen_2.5_vl_7b_fp8_scaled.safetensors\n └── 📂vae/\n └── qwen_image_layered_vae.safetensors\n```"
],
"color": "#323",
"bgcolor": "#535"
},
{
"id": 80,
"type": "PreviewImage",
"pos": [
2460.612938515339,
795.0211535441636
],
"size": [
353.88890380859357,
371.8889038085938
],
"flags": {},
"order": 24,
"mode": 0,
"inputs": [
{
"name": "images",
"type": "IMAGE",
"link": 135
}
],
"outputs": [],
"properties": {
"cnr_id": "comfy-core",
"ver": "0.6.0",
"Node name for S&R": "PreviewImage"
},
"widgets_values": []
},
{
"id": 59,
"type": "EmptyQwenImageLayeredLatentImage",
"pos": [
755.2170791040447,
693.0025348793122
],
"size": [
305.1563720703124,
130
],
"flags": {},
"order": 13,
"mode": 0,
"inputs": [
{
"name": "width",
"type": "INT",
"widget": {
"name": "width"
},
"link": 117
},
{
"name": "height",
"type": "INT",
"widget": {
"name": "height"
},
"link": 118
}
],
"outputs": [
{
"name": "LATENT",
"type": "LATENT",
"links": [
108
]
}
],
"properties": {
"cnr_id": "comfy-core",
"ver": "0.6.0",
"Node name for S&R": "EmptyQwenImageLayeredLatentImage"
},
"widgets_values": [
640,
640,
2,
1
],
"color": "#232",
"bgcolor": "#353"
},
{
"id": 81,
"type": "SplitImageWithAlpha",
"pos": [
1792.8667692821584,
797.301847591665
],
"size": [
213.68285814424544,
46
],
"flags": {},
"order": 20,
"mode": 0,
"inputs": [
{
"name": "image",
"type": "IMAGE",
"link": 136
}
],
"outputs": [
{
"name": "IMAGE",
"type": "IMAGE",
"links": [
144
]
},
{
"name": "MASK",
"type": "MASK",
"links": []
}
],
"properties": {
"cnr_id": "comfy-core",
"ver": "0.6.0",
"Node name for S&R": "SplitImageWithAlpha"
},
"widgets_values": []
},
{
"id": 87,
"type": "InvertMask",
"pos": [
2032.7321075290527,
965.5261917188795
],
"size": [
140,
26
],
"flags": {},
"order": 22,
"mode": 0,
"inputs": [
{
"name": "mask",
"type": "MASK",
"link": 147
}
],
"outputs": [
{
"name": "MASK",
"type": "MASK",
"links": [
148
]
}
],
"properties": {
"cnr_id": "comfy-core",
"ver": "0.6.0",
"Node name for S&R": "InvertMask"
}
},
{
"id": 79,
"type": "SplitImageWithAlpha",
"pos": [
1792.8667692821584,
947.806885766381
],
"size": [
213.68285814424544,
46
],
"flags": {},
"order": 21,
"mode": 0,
"inputs": [
{
"name": "image",
"type": "IMAGE",
"link": 131
}
],
"outputs": [
{
"name": "IMAGE",
"type": "IMAGE",
"links": [
143
]
},
{
"name": "MASK",
"type": "MASK",
"links": [
147
]
}
],
"properties": {
"cnr_id": "comfy-core",
"ver": "0.6.0",
"Node name for S&R": "SplitImageWithAlpha"
},
"widgets_values": []
},
{
"id": 74,
"type": "ImageCompositeMasked",
"pos": [
2200.3106540392373,
795.0211535441636
],
"size": [
228.33342285156277,
146
],
"flags": {},
"order": 23,
"mode": 0,
"inputs": [
{
"name": "destination",
"type": "IMAGE",
"link": 144
},
{
"name": "source",
"type": "IMAGE",
"link": 143
},
{
"name": "mask",
"shape": 7,
"type": "MASK",
"link": 148
}
],
"outputs": [
{
"name": "IMAGE",
"type": "IMAGE",
"links": [
135
]
}
],
"properties": {
"cnr_id": "comfy-core",
"ver": "0.6.0",
"Node name for S&R": "ImageCompositeMasked"
},
"widgets_values": [
0,
0,
false
]
}
],
"links": [
[
74,
38,
0,
6,
0,
"CLIP"
],
[
75,
38,
0,
7,
0,
"CLIP"
],
[
99,
37,
0,
54,
0,
"MODEL"
],
[
100,
54,
0,
3,
0,
"MODEL"
],
[
102,
7,
0,
58,
0,
"CONDITIONING"
],
[
103,
6,
0,
57,
0,
"CONDITIONING"
],
[
104,
57,
0,
3,
1,
"CONDITIONING"
],
[
105,
58,
0,
3,
2,
"CONDITIONING"
],
[
108,
59,
0,
3,
3,
"LATENT"
],
[
109,
60,
0,
58,
1,
"LATENT"
],
[
110,
60,
0,
57,
1,
"LATENT"
],
[
113,
64,
0,
60,
0,
"IMAGE"
],
[
114,
64,
0,
63,
0,
"IMAGE"
],
[
115,
61,
0,
64,
0,
"IMAGE"
],
[
116,
39,
0,
60,
1,
"VAE"
],
[
117,
63,
0,
59,
0,
"INT"
],
[
118,
63,
1,
59,
1,
"INT"
],
[
119,
3,
0,
65,
0,
"LATENT"
],
[
120,
66,
0,
67,
0,
"IMAGE"
],
[
121,
65,
0,
66,
0,
"LATENT"
],
[
122,
39,
0,
66,
1,
"VAE"
],
[
128,
66,
0,
76,
0,
"IMAGE"
],
[
129,
66,
0,
77,
0,
"IMAGE"
],
[
131,
77,
0,
79,
0,
"IMAGE"
],
[
135,
74,
0,
80,
0,
"IMAGE"
],
[
136,
76,
0,
81,
0,
"IMAGE"
],
[
143,
79,
0,
74,
1,
"IMAGE"
],
[
144,
81,
0,
74,
0,
"IMAGE"
],
[
147,
79,
1,
87,
0,
"MASK"
],
[
148,
87,
0,
74,
2,
"MASK"
]
],
"groups": [
{
"id": 1,
"title": "Image Composite",
"bounding": [
1552.2305676328322,
723.701847591665,
1277.625191808733,
456.85072813132865
],
"color": "#3f789e",
"font_size": 24,
"flags": {}
}
],
"config": {},
"extra": {
"ds": {
"scale": 0.6209213230591553,
"offset": [
306.76748492398815,
311.2185045299079
]
},
"frontendVersion": "1.36.12",
"VHS_latentpreview": false,
"VHS_latentpreviewrate": 0,
"VHS_MetadataImage": true,
"VHS_KeepIntermediate": true
},
"version": 0.4
}
```
- Resize input image
  - The model can handle inputs up to 1024px, but processing gets heavier as the number of layers increases, so the input is scaled down to 0.5 megapixels here.
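For reference, ImageScaleToTotalPixels keeps the aspect ratio while scaling the image to the target pixel count. A minimal sketch of the math in plain Python; this is a conceptual reconstruction rather than the node's actual source, and the 1024×1024-pixels-per-megapixel convention is an assumption:

```python
import math

def scale_to_total_pixels(width: int, height: int, megapixels: float = 0.5):
    """Pick a new size with ~`megapixels` total pixels, keeping the aspect
    ratio (conceptual sketch of the ImageScaleToTotalPixels math)."""
    target = megapixels * 1024 * 1024              # assumed MP convention
    scale = math.sqrt(target / (width * height))   # uniform scale factor
    return round(width * scale), round(height * scale)

print(scale_to_total_pixels(1920, 1080))  # -> roughly (965, 543), ~0.5 MP
```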
- 🟩 Empty Qwen Image Layered Latent
  - layers: the number of layers you want to split into. Increasing it also increases memory and time costs.
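As a mental model (the exact tensor layout is my assumption, not confirmed): the layered latent can be pictured as a video-style latent with one "frame" per layer, plus one extra slot for the composite image, which is why memory grows with the layer count.

```python
import torch

# Hedged sketch: assumed layout [batch, channels, layers + 1, H/8, W/8],
# treating layers like video frames; the channel count and the /8 spatial
# downscale are assumptions for illustration only.
batch, channels, layers = 1, 16, 2
width = height = 640
latent = torch.zeros(batch, channels, layers + 1, height // 8, width // 8)
print(latent.shape)  # torch.Size([1, 16, 3, 80, 80])
```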
- 🟫 LatentCutToBatch
  - It may be hard to tell what this node is doing; think of it as a "formatting" step for implementation convenience.
  - As the name suggests, the model outputs multiple images as "layers", but the current VAE Decode does not understand the concept of layers, so this node converts them into a plain batch of N images.
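Conceptually, the node just moves the layer axis into the batch axis so that a standard VAE Decode treats each layer as an independent image. A sketch under the same assumed layout as above:

```python
import torch

def cut_layers_to_batch(latent: torch.Tensor) -> torch.Tensor:
    """[B, C, T, H, W] -> [B * T, C, H, W]: layers become batch entries."""
    b, c, t, h, w = latent.shape
    return latent.permute(0, 2, 1, 3, 4).reshape(b * t, c, h, w)

print(cut_layers_to_batch(torch.zeros(1, 16, 3, 80, 80)).shape)
# torch.Size([3, 16, 80, 80]) -- a plain batch of 3 latent images
```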
- 🟦 Synthesize images again (Optional)
  - If the image is split into 2 layers, a total of 3 RGBA images are output (the original image plus the decomposition results).
  - You can reconstruct the original single image by successively overlaying the 2nd and subsequent images with ImageCompositeMasked.
  - However, this node only handles RGB images, so each layer must first be converted into an RGB image + mask (cf. Mask and Alpha Channel); see the sketch below.
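To make the detour concrete, here is a hedged sketch of the composite step. It assumes ComfyUI's convention that a mask derived from alpha is inverted (mask = 1 − alpha), which is why the workflow runs the split mask through InvertMask before ImageCompositeMasked; the helper names are mine.

```python
import torch

def split_rgba(img: torch.Tensor):
    """[H, W, 4] RGBA -> (RGB, mask). mask = 1 - alpha mirrors what
    SplitImageWithAlpha appears to do (assumption)."""
    return img[..., :3], 1.0 - img[..., 3]

def composite_over(dest_rgb, src_rgb, alpha):
    """Standard "over" blend: source covers destination where alpha = 1."""
    a = alpha.unsqueeze(-1)                # [H, W] -> [H, W, 1] broadcast
    return src_rgb * a + dest_rgb * (1.0 - a)

layer1 = torch.rand(64, 64, 4)             # decomposed layer 1 (RGBA)
layer2 = torch.rand(64, 64, 4)             # decomposed layer 2 (RGBA)
dest_rgb, _ = split_rgba(layer1)
src_rgb, mask = split_rgba(layer2)
recombined = composite_over(dest_rgb, src_rgb, 1.0 - mask)  # InvertMask step
```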
- It is a hassle, I know, but node-based UIs and layer systems just do not mix well, and that is not limited to ComfyUI 😥