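What is Ideogram 4.0?
Ideogram 4.0 is a 9.3B-parameter model in the DiT family.
Its defining feature is the JSON-formatted caption, which lets you specify the individual elements of an image in fine detail.
FIBO took a similar approach earlier, but Ideogram 4.0 is better at placing elements by BBOX coordinates and at following color specifications. It is a good fit for DTP-style design work such as posters, logos, UI, and packaging.
In exchange, it is probably not a model that works well with casual prompting. The prompt has to follow the prescribed format for the model to deliver the quality it is capable of.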
Model downloads
- diffusion_models
  - ideogram4_fp8_scaled.safetensors (9.28 GB)
  - ideogram4_nvfp4_mixed.safetensors (5.49 GB)
  - ideogram4_unconditional_fp8_scaled.safetensors (9.28 GB)
  - ideogram4_unconditional_nvfp4_mixed.safetensors (5.49 GB)
- text_encoders
  - qwen3vl_8b_fp8_scaled.safetensors (10.6 GB)
- vae
  - flux2-vae.safetensors (336 MB)
📂ComfyUI/
└── 📂models/
    ├── 📂diffusion_models/
    │   ├── ideogram4_fp8_scaled.safetensors
    │   ├── ideogram4_nvfp4_mixed.safetensors
    │   ├── ideogram4_unconditional_fp8_scaled.safetensors
    │   └── ideogram4_unconditional_nvfp4_mixed.safetensors
    ├── 📂text_encoders/
    │   └── qwen3vl_8b_fp8_scaled.safetensors
    └── 📂vae/
        └── flux2-vae.safetensors
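The details come later, but because the workflow loads two diffusion models, it is fairly heavy.
ComfyUI manages memory internally, so running short of VRAM does not necessarily make generation impossible, but it can become very slow.
To lighten the load you can use the nvfp4 files, at the cost of some image quality.
The unconditional side has comparatively less influence on quality, so using fp8 for the regular side and nvfp4 for the unconditional side is probably a reasonable balance.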
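To make the trade-off concrete, the sketch below adds up the download sizes listed above for each precision pairing. It is only a rough guide: on-disk size is used here as a stand-in for memory footprint, which is an assumption, and the function name is made up for illustration.

```python
# File sizes in GB, copied from the download list above.
DIFFUSION_GB = {"fp8": 9.28, "nvfp4": 5.49}
TEXT_ENCODER_GB = 10.6
VAE_GB = 0.336  # 336 MB


def total_weights_gb(conditional: str, unconditional: str) -> float:
    """Sum the on-disk size of every file one run of the workflow loads."""
    total = (
        DIFFUSION_GB[conditional]
        + DIFFUSION_GB[unconditional]
        + TEXT_ENCODER_GB
        + VAE_GB
    )
    return round(total, 3)


# Compare the three pairings discussed in the text.
for cond, uncond in [("fp8", "fp8"), ("fp8", "nvfp4"), ("nvfp4", "nvfp4")]:
    print(f"{cond:>5} + {uncond:<5} -> {total_weights_gb(cond, uncond):.2f} GB")
```

The mixed pairing sits between the other two, which matches the suggestion of keeping fp8 only where it matters most.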
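Prompt
Plain natural language also produces an image, but quality is hard to reach unless the prompt follows the prescribed JSON schema.
The basic form looks like this.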
{
  "high_level_description": "A 1-2 sentence description of the whole image.",
  "style_description": {
    "aesthetics": "Mood and aesthetic direction.",
    "lighting": "Lighting.",
    "medium": "illustration / photograph / graphic_design, etc.",
    "art_style": "Art style, for non-photographic images.",
    "color_palette": ["#FFFFFF", "#000000"]
  },
  "compositional_deconstruction": {
    "background": "Description of the background and environment.",
    "elements": [
      {
        "type": "obj",
        "bbox": [100, 200, 800, 700],
        "desc": "Description of the object, person, or element.",
        "color_palette": ["#FFFFFF", "#000000"]
      },
      {
        "type": "text",
        "bbox": [820, 200, 920, 800],
        "text": "HELLO",
        "desc": "Description of how the text looks.",
        "color_palette": ["#000000"]
      }
    ]
  }
}
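Since the caption is ordinary JSON, it can be assembled from code instead of typed by hand. The following is a minimal sketch using only the keys shown in the schema above; `make_element` and `build_caption` are hypothetical helper names, not part of any Ideogram or ComfyUI API.

```python
import json


def make_element(kind, bbox, desc, colors, text=None):
    """Build one entry of compositional_deconstruction.elements."""
    element = {"type": kind, "bbox": list(bbox)}
    if kind == "text":
        if text is None:
            raise ValueError("a text element needs the literal string to render")
        element["text"] = text
    element["desc"] = desc
    element["color_palette"] = list(colors)
    return element


def build_caption(description, style, background, elements):
    """Assemble the full caption and serialize it as a JSON string."""
    caption = {
        "high_level_description": description,
        "style_description": style,
        "compositional_deconstruction": {
            "background": background,
            "elements": elements,
        },
    }
    # ensure_ascii=False keeps non-ASCII text readable in the prompt box.
    return json.dumps(caption, indent=2, ensure_ascii=False)


caption_json = build_caption(
    description="A minimal poster with one object and one headline.",
    style={"medium": "graphic_design", "color_palette": ["#FFFFFF", "#000000"]},
    background="Plain white background.",
    elements=[
        make_element("obj", [100, 200, 800, 700], "A black circle.", ["#000000"]),
        make_element("text", [820, 200, 920, 800], "Bold sans-serif headline.",
                     ["#000000"], text="HELLO"),
    ],
)
print(caption_json)
```

The resulting string can be pasted into the text encode node of the workflow as the prompt.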
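Overall description, style, background, and a description per element: the structure itself is not complicated. Writing it by hand every time is still not realistic.
Coordinates are the most tedious part. Where each element sits in the image has to be given as a BBOX, and working that out in your head is close to impossible.
So here are a few ways to generate the prompt.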
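If you already know where an element should go in pixels, the conversion to a BBOX can be scripted. The sketch below assumes that `bbox` is `[y_min, x_min, y_max, x_max]` on a 0 to 1000 grid that is independent of the image size. This article does not state that convention; it is inferred from the example workflow, where the "lower right" signature has the box `[955, 790, 995, 985]`. Check the official Prompting Guide before relying on it. The function name and the 1024x1536 canvas are illustrative.

```python
def pixel_rect_to_bbox(left, top, right, bottom, width, height, scale=1000):
    """Convert a pixel rectangle to [y_min, x_min, y_max, x_max] on a 0..scale grid.

    The axis order and the 0..1000 range are assumptions inferred from the
    example prompt, not a documented specification.
    """
    if not (0 <= left < right <= width and 0 <= top < bottom <= height):
        raise ValueError("rectangle must lie inside the image and have positive area")
    return [
        round(top * scale / height),
        round(left * scale / width),
        round(bottom * scale / height),
        round(right * scale / width),
    ]


# A box covering the right half of a 1024x1536 portrait canvas,
# from y=256 down to y=768.
print(pixel_rect_to_bbox(512, 256, 1024, 768, width=1024, height=1536))
# -> [167, 500, 500, 1000]
```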
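Hand it to an LLM
The easiest method is to give an LLM the official Prompting Guide together with a description of the image you want, and have it convert that into a JSON caption.
You can also pass in a reference image or a rough sketch you drew yourself.
Models that run locally inside ComfyUI are generally not capable enough for this, so it is better to simply use ChatGPT, Gemini, or a similar service.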
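An LLM can return malformed or incomplete JSON, so it may be worth checking the result before pasting it into the workflow. The sketch below is one possible check, assuming only the top-level keys and element fields shown in the basic schema; `check_caption` is a hypothetical helper, and the rules it enforces are not an official validator.

```python
import json

REQUIRED_TOP_LEVEL = (
    "high_level_description",
    "style_description",
    "compositional_deconstruction",
)


def check_caption(raw: str) -> list:
    """Return a list of problems found in a caption string (empty if none)."""
    try:
        caption = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc}"]
    if not isinstance(caption, dict):
        return ["top level must be a JSON object"]

    problems = [f"missing key: {key}" for key in REQUIRED_TOP_LEVEL if key not in caption]

    layout = caption.get("compositional_deconstruction")
    elements = layout.get("elements", []) if isinstance(layout, dict) else []
    for index, element in enumerate(elements):
        if not isinstance(element, dict):
            problems.append(f"elements[{index}]: must be an object")
            continue
        bbox = element.get("bbox")
        is_box = (
            isinstance(bbox, list)
            and len(bbox) == 4
            and all(isinstance(v, (int, float)) for v in bbox)
        )
        if not is_box:
            problems.append(f"elements[{index}]: bbox must be a list of 4 numbers")
        if element.get("type") == "text" and not element.get("text"):
            problems.append(f"elements[{index}]: text element has no text")
    return problems


reply = '{"high_level_description": "x", "style_description": {}, ' \
        '"compositional_deconstruction": {"background": "y", "elements": ' \
        '[{"type": "text", "bbox": [1, 2, 3], "desc": "z"}]}}'
for problem in check_caption(reply):
    print(problem)
```

If the list is non-empty, the problems can be sent back to the LLM as a correction request.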

Use a dedicated prompt builder
You can also build the prompt visually with a dedicated prompt builder.
One commonly used option is the Ideogram 4 Prompt Builder KJ node in ComfyUI-KJNodes.

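- Set the size of the image to generate, then fill in the background, style, and so on.
- Drag in the region panel to create a BBOX, then set the prompt for what to draw in that region and its color codes.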
text2image

{
"id": "d8034549-7e0a-40f1-8c2e-de3ffc6f1cae",
"revision": 0,
"last_node_id": 109,
"last_link_id": 172,
"nodes": [
{
"id": 76,
"type": "KSamplerSelect",
"pos": [
560.1252292712549,
324.6528974302894
],
"size": [
270,
68.88020833333334
],
"flags": {},
"order": 0,
"mode": 0,
"inputs": [],
"outputs": [
{
"name": "SAMPLER",
"type": "SAMPLER",
"links": [
132
]
}
],
"properties": {
"Node name for S&R": "KSamplerSelect",
"cnr_id": "comfy-core",
"ver": "0.3.56",
"enableTabs": false,
"tabWidth": 65,
"tabXOffset": 10,
"hasSecondTab": false,
"secondTabText": "Send Back",
"secondTabOffset": 80,
"secondTabWidth": 65
},
"widgets_values": [
"euler"
]
},
{
"id": 8,
"type": "VAEDecode",
"pos": [
1180.6146240234375,
195.84114925861235
],
"size": [
157.56002807617188,
46
],
"flags": {},
"order": 15,
"mode": 0,
"inputs": [
{
"name": "samples",
"type": "LATENT",
"link": 164
},
{
"name": "vae",
"type": "VAE",
"link": 76
}
],
"outputs": [
{
"name": "IMAGE",
"type": "IMAGE",
"slot_index": 0,
"links": [
101
]
}
],
"properties": {
"Node name for S&R": "VAEDecode",
"cnr_id": "comfy-core",
"ver": "0.3.33"
},
"widgets_values": []
},
{
"id": 92,
"type": "EmptyFlux2LatentImage",
"pos": [
560.1252292712549,
721.3543590454891
],
"size": [
270,
106
],
"flags": {},
"order": 9,
"mode": 0,
"inputs": [
{
"name": "width",
"type": "INT",
"widget": {
"name": "width"
},
"link": 153
},
{
"name": "height",
"type": "INT",
"widget": {
"name": "height"
},
"link": 154
}
],
"outputs": [
{
"name": "LATENT",
"type": "LATENT",
"links": [
152
]
}
],
"properties": {
"Node name for S&R": "EmptyFlux2LatentImage"
},
"widgets_values": [
1024,
1024,
1
]
},
{
"id": 88,
"type": "DualModelGuider",
"pos": [
560.1252292712549,
119.74227078935617
],
"size": [
270,
118
],
"flags": {},
"order": 13,
"mode": 0,
"inputs": [
{
"name": "model",
"type": "MODEL",
"link": 172
},
{
"name": "positive",
"type": "CONDITIONING",
"link": 146
},
{
"name": "model_negative",
"shape": 7,
"type": "MODEL",
"link": 149
},
{
"name": "negative",
"shape": 7,
"type": "CONDITIONING",
"link": 145
}
],
"outputs": [
{
"name": "GUIDER",
"type": "GUIDER",
"links": [
144
]
}
],
"properties": {
"Node name for S&R": "DualModelGuider"
},
"widgets_values": [
7
]
},
{
"id": 91,
"type": "CLIPLoader",
"pos": [
-510.80318076960083,
311.42496280403503
],
"size": [
289.8073985431536,
106
],
"flags": {},
"order": 1,
"mode": 0,
"inputs": [],
"outputs": [
{
"name": "CLIP",
"type": "CLIP",
"links": [
150
]
}
],
"properties": {
"Node name for S&R": "CLIPLoader"
},
"widgets_values": [
"qwen3vl_8b_fp8_scaled.safetensors",
"ideogram4",
"default"
],
"color": "#432",
"bgcolor": "#653"
},
{
"id": 39,
"type": "VAELoader",
"pos": [
904.0583918587276,
72.88171068869869
],
"size": [
242.12760404770165,
58
],
"flags": {},
"order": 2,
"mode": 0,
"inputs": [],
"outputs": [
{
"name": "VAE",
"type": "VAE",
"slot_index": 0,
"links": [
76
]
}
],
"properties": {
"Node name for S&R": "VAELoader",
"cnr_id": "comfy-core",
"ver": "0.3.33"
},
"widgets_values": [
"flux2-vae.safetensors"
],
"color": "#322",
"bgcolor": "#533"
},
{
"id": 90,
"type": "UNETLoader",
"pos": [
-96.592885685151,
131.5898766922886
],
"size": [
305.3782043457031,
82
],
"flags": {},
"order": 3,
"mode": 0,
"inputs": [],
"outputs": [
{
"name": "MODEL",
"type": "MODEL",
"slot_index": 0,
"links": [
149
]
}
],
"properties": {
"Node name for S&R": "UNETLoader",
"cnr_id": "comfy-core",
"ver": "0.3.33"
},
"widgets_values": [
"ideogram4_unconditional_nvfp4_mixed.safetensors",
"default"
],
"color": "#323",
"bgcolor": "#535"
},
{
"id": 87,
"type": "ConditioningZeroOut",
"pos": [
278.5119793477361,
427.0196777116257
],
"size": [
211.88658923633488,
26
],
"flags": {},
"order": 12,
"mode": 0,
"inputs": [
{
"name": "conditioning",
"type": "CONDITIONING",
"link": 142
}
],
"outputs": [
{
"name": "CONDITIONING",
"type": "CONDITIONING",
"links": [
145
]
}
],
"properties": {
"Node name for S&R": "ConditioningZeroOut"
},
"widgets_values": []
},
{
"id": 79,
"type": "SamplerCustomAdvanced",
"pos": [
904.0583918587276,
195.84114925861235
],
"size": [
242.12760404770165,
106
],
"flags": {},
"order": 14,
"mode": 0,
"inputs": [
{
"name": "noise",
"type": "NOISE",
"link": 130
},
{
"name": "guider",
"type": "GUIDER",
"link": 144
},
{
"name": "sampler",
"type": "SAMPLER",
"link": 132
},
{
"name": "sigmas",
"type": "SIGMAS",
"link": 167
},
{
"name": "latent_image",
"type": "LATENT",
"link": 152
}
],
"outputs": [
{
"name": "output",
"type": "LATENT",
"links": [
164
]
},
{
"name": "denoised_output",
"type": "LATENT",
"links": []
}
],
"properties": {
"Node name for S&R": "SamplerCustomAdvanced",
"cnr_id": "comfy-core",
"ver": "0.3.60",
"enableTabs": false,
"tabWidth": 65,
"tabXOffset": 10,
"hasSecondTab": false,
"secondTabText": "Send Back",
"secondTabOffset": 80,
"secondTabWidth": 65
},
"widgets_values": []
},
{
"id": 95,
"type": "Ideogram4Scheduler",
"pos": [
560.1252292712549,
480.44373240455593
],
"size": [
270,
154
],
"flags": {},
"order": 10,
"mode": 0,
"inputs": [
{
"name": "width",
"type": "INT",
"widget": {
"name": "width"
},
"link": 156
},
{
"name": "height",
"type": "INT",
"widget": {
"name": "height"
},
"link": 157
}
],
"outputs": [
{
"name": "SIGMAS",
"type": "SIGMAS",
"links": [
167
]
}
],
"properties": {
"Node name for S&R": "Ideogram4Scheduler"
},
"widgets_values": [
20,
1024,
1024,
0,
1.75
]
},
{
"id": 56,
"type": "SaveImage",
"pos": [
1371.5615738427737,
195.84114925861235
],
"size": [
436.7195313170437,
711.2421298391242
],
"flags": {},
"order": 16,
"mode": 0,
"inputs": [
{
"name": "images",
"type": "IMAGE",
"link": 101
}
],
"outputs": [],
"properties": {
"cnr_id": "comfy-core",
"ver": "0.3.75"
},
"widgets_values": [
"ComfyUI"
]
},
{
"id": 94,
"type": "ResolutionSelector",
"pos": [
249.45527396590353,
701.3543590454891
],
"size": [
270,
126
],
"flags": {},
"order": 4,
"mode": 0,
"inputs": [],
"outputs": [
{
"name": "width",
"type": "INT",
"links": [
153,
156
]
},
{
"name": "height",
"type": "INT",
"links": [
154,
157
]
}
],
"properties": {
"Node name for S&R": "ResolutionSelector"
},
"widgets_values": [
"2:3 (Portrait Photo)",
1,
16
]
},
{
"id": 83,
"type": "CLIPTextEncode",
"pos": [
-194.71785512781273,
311.42496280403503
],
"size": [
408.34315901785703,
324.50164397511764
],
"flags": {},
"order": 8,
"mode": 0,
"inputs": [
{
"name": "clip",
"type": "CLIP",
"link": 150
}
],
"outputs": [
{
"name": "CONDITIONING",
"type": "CONDITIONING",
"links": [
142,
146
]
}
],
"properties": {
"Node name for S&R": "CLIPTextEncode",
"cnr_id": "comfy-core",
"ver": "0.3.56",
"enableTabs": false,
"tabWidth": 65,
"tabXOffset": 10,
"hasSecondTab": false,
"secondTabText": "Send Back",
"secondTabOffset": 80,
"secondTabWidth": 65
},
"widgets_values": [
"{\n \"high_level_description\": \"A cinematic Leica-style twilight photograph shows a tall modern office tower from a dramatic low angle, rising against a deep evening sky. Warm illuminated rooms on the front facade spell \\\"Comfy,\\\" while a small handwritten \\\"Ideogram 4.0\\\" signature appears at the lower right.\",\n \"style_description\": {\n \"aesthetics\": \"cinematic, atmospheric, elegant, slightly dreamy, high-end urban photography with strong vertical composition and restrained visual clutter\",\n \"lighting\": \"blue-hour ambient light with soft haze, gentle street glow near the lower frame, and warm yellow interior window lights standing out against the dark facade\",\n \"photo\": \"shot like a Leica photograph with a low-angle perspective, subtle filmic contrast, crisp architectural lines, natural depth, and a refined editorial cityscape look\",\n \"medium\": \"photograph\",\n \"color_palette\": [\"#1E2148\", \"#4A3F7E\", \"#F3E34B\", \"#C8CEDF\", \"#6E314D\"]\n },\n \"compositional_deconstruction\": {\n \"background\": \"A dusky urban evening sky fills most of the frame with deep navy and violet tones, fading slightly brighter near the horizon. The atmosphere is lightly hazy, with a soft bloom of city light near the lower left. Minimal surrounding street-level structures appear as subdued silhouettes near the bottom edges, keeping the tower dominant in the portrait-oriented composition.\",\n \"elements\": [\n {\n \"type\": \"obj\",\n \"bbox\": [120, 330, 945, 845],\n \"desc\": \"A tall dark-glass office tower viewed from below, centered slightly right of frame. The building has sharp modern edges, horizontal floor bands, a subtly reflective facade, and a tapering sense of height emphasized by the perspective. 
The front-facing plane is the main visual surface, while the right side recedes into shadow.\",\n \"color_palette\": [\"#161A33\", \"#252C54\", \"#BEC6DC\"]\n },\n {\n \"type\": \"text\",\n \"bbox\": [170, 455, 785, 615],\n \"text\": \"Comfy\",\n \"desc\": \"The word is formed by warm glowing room windows arranged vertically on the front facade of the tower. Each letter is clearly legible through clusters of illuminated office rooms, appearing as bright yellow typographic shapes embedded within the architecture.\",\n \"color_palette\": [\"#F3E34B\"]\n },\n {\n \"type\": \"obj\",\n \"bbox\": [40, 85, 90, 130],\n \"desc\": \"A small crescent moon in the upper left portion of the sky, softly glowing and isolated against the dark twilight background.\",\n \"color_palette\": [\"#F3E34B\", \"#F8F0A8\"]\n },\n {\n \"type\": \"text\",\n \"bbox\": [955, 790, 995, 985],\n \"text\": \"Ideogram 4.0\",\n \"desc\": \"A small handwritten signature placed at the lower right corner, rendered in a light ink-like white script with a casual, unobtrusive appearance.\",\n \"color_palette\": [\"#F3F3F0\"]\n }\n ]\n }\n}"
]
},
{
"id": 78,
"type": "RandomNoise",
"pos": [
560.1252292712549,
-49.16835585157703
],
"size": [
270,
82
],
"flags": {},
"order": 5,
"mode": 0,
"inputs": [],
"outputs": [
{
"name": "NOISE",
"type": "NOISE",
"links": [
130
]
}
],
"properties": {
"Node name for S&R": "RandomNoise",
"cnr_id": "comfy-core",
"ver": "0.3.56",
"enableTabs": false,
"tabWidth": 65,
"tabXOffset": 10,
"hasSecondTab": false,
"secondTabText": "Send Back",
"secondTabOffset": 80,
"secondTabWidth": 65
},
"widgets_values": [
9999,
"fixed"
]
},
{
"id": 37,
"type": "UNETLoader",
"pos": [
-96.592885685151,
-49.16835585157703
],
"size": [
305.3782043457031,
82
],
"flags": {},
"order": 6,
"mode": 0,
"inputs": [],
"outputs": [
{
"name": "MODEL",
"type": "MODEL",
"slot_index": 0,
"links": [
148
]
}
],
"properties": {
"Node name for S&R": "UNETLoader",
"cnr_id": "comfy-core",
"ver": "0.3.33"
},
"widgets_values": [
"ideogram4_fp8_scaled.safetensors",
"default"
],
"color": "#323",
"bgcolor": "#535"
},
{
"id": 89,
"type": "CFGOverride",
"pos": [
249.45527396590353,
-49.16835585157703
],
"size": [
270,
106
],
"flags": {},
"order": 11,
"mode": 0,
"inputs": [
{
"name": "model",
"type": "MODEL",
"link": 148
}
],
"outputs": [
{
"name": "MODEL",
"type": "MODEL",
"links": [
172
]
}
],
"properties": {
"Node name for S&R": "CFGOverride"
},
"widgets_values": [
3,
0.7,
1
]
},
{
"id": 71,
"type": "MarkdownNote",
"pos": [
-560.0606701629572,
-141.99890306621
],
"size": [
402.9868769880169,
355.5887797584986
],
"flags": {},
"order": 7,
"mode": 0,
"inputs": [],
"outputs": [],
"properties": {},
"widgets_values": [
"## models\n\n- diffusion_models\n - [ideogram4_fp8_scaled.safetensors](https://huggingface.co/Comfy-Org/Ideogram-4/blob/main/diffusion_models/ideogram4_fp8_scaled.safetensors) (9.28 GB)\n - [ideogram4_nvfp4_mixed.safetensors](https://huggingface.co/Comfy-Org/Ideogram-4/blob/main/diffusion_models/ideogram4_nvfp4_mixed.safetensors) (5.49 GB)\n - [ideogram4_unconditional_fp8_scaled.safetensors](https://huggingface.co/Comfy-Org/Ideogram-4/blob/main/diffusion_models/ideogram4_unconditional_fp8_scaled.safetensors) (9.28 GB)\n - [ideogram4_unconditional_nvfp4_mixed.safetensors](https://huggingface.co/Comfy-Org/Ideogram-4/blob/main/diffusion_models/ideogram4_unconditional_nvfp4_mixed.safetensors) (5.49 GB)\n- text_encoders\n - [qwen3vl_8b_fp8_scaled.safetensors](https://huggingface.co/Comfy-Org/Ideogram-4/blob/main/text_encoders/qwen3vl_8b_fp8_scaled.safetensors) (10.6 GB)\n- vae\n - [flux2-vae.safetensors](https://huggingface.co/Comfy-Org/Ideogram-4/blob/main/vae/flux2-vae.safetensors) (336 MB)\n\n```text\n📂ComfyUI/\n└── 📂models/\n ├── 📂diffusion_models/\n │ ├── ideogram4_fp8_scaled.safetensors\n │ ├── ideogram4_nvfp4_mixed.safetensors\n │ ├── ideogram4_unconditional_fp8_scaled.safetensors\n │ └── ideogram4_unconditional_nvfp4_mixed.safetensors\n ├── 📂text_encoders/\n │ └── qwen3vl_8b_fp8_scaled.safetensors\n └── 📂vae/\n └── flux2-vae.safetensors\n```"
],
"color": "#323",
"bgcolor": "#535"
}
],
"links": [
[
76,
39,
0,
8,
1,
"VAE"
],
[
101,
8,
0,
56,
0,
"IMAGE"
],
[
130,
78,
0,
79,
0,
"NOISE"
],
[
132,
76,
0,
79,
2,
"SAMPLER"
],
[
142,
83,
0,
87,
0,
"CONDITIONING"
],
[
144,
88,
0,
79,
1,
"GUIDER"
],
[
145,
87,
0,
88,
3,
"CONDITIONING"
],
[
146,
83,
0,
88,
1,
"CONDITIONING"
],
[
148,
37,
0,
89,
0,
"MODEL"
],
[
149,
90,
0,
88,
2,
"MODEL"
],
[
150,
91,
0,
83,
0,
"CLIP"
],
[
152,
92,
0,
79,
4,
"LATENT"
],
[
153,
94,
0,
92,
0,
"INT"
],
[
154,
94,
1,
92,
1,
"INT"
],
[
156,
94,
0,
95,
0,
"INT"
],
[
157,
94,
1,
95,
1,
"INT"
],
[
164,
79,
0,
8,
0,
"LATENT"
],
[
167,
95,
0,
79,
3,
"SIGMAS"
],
[
172,
89,
0,
88,
0,
"MODEL"
]
],
"groups": [],
"config": {},
"extra": {
"ds": {
"scale": 0.6209213230591553,
"offset": [
821.830919056688,
383.46442973242
]
},
"frontendVersion": "1.45.15",
"VHS_latentpreview": false,
"VHS_latentpreviewrate": 0,
"VHS_MetadataImage": true,
"VHS_KeepIntermediate": true
},
"version": 0.4
}
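The wiring in this JSON is easier to follow once the links are printed by node type. The sketch below reads each entry of `links` as `[link_id, source_node, source_slot, target_node, target_slot, data_type]`. That reading is taken from the workflow above, where link 76 runs from node 39 (VAELoader) to input slot 1 of node 8 (VAEDecode). The helper name is made up, and the two-node excerpt is only there so the snippet runs on its own.

```python
import json


def describe_links(workflow: dict) -> list:
    """Render each link as 'SourceType[slot] -> TargetType[slot] (DATA_TYPE)'."""
    node_type = {node["id"]: node["type"] for node in workflow["nodes"]}
    lines = []
    for _link_id, src, src_slot, dst, dst_slot, data_type in workflow["links"]:
        lines.append(
            f"{node_type[src]}[{src_slot}] -> {node_type[dst]}[{dst_slot}] ({data_type})"
        )
    return lines


# A two-node excerpt of the workflow above.
excerpt = json.loads("""
{"nodes": [{"id": 39, "type": "VAELoader"}, {"id": 8, "type": "VAEDecode"}],
 "links": [[76, 39, 0, 8, 1, "VAE"]]}
""")
print("\n".join(describe_links(excerpt)))
# -> VAELoader[0] -> VAEDecode[1] (VAE)
```

With the full workflow saved to a file, passing the result of `json.load` on that file to `describe_links` lists every connection, including the two separate MODEL links into DualModelGuider.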
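Apart from the prompt, this workflow differs from an ordinary one in a few places, so only those parts are covered here.
Load Diffusion Model
Ideogram 4.0 loads two diffusion models because it uses a slightly unusual form of CFG.
- Ordinary CFG compares the result with the prompt against the result without it, and pushes generation toward the prompt.
- On the unconditional side, Ideogram 4.0 does not pass in an empty prompt. Instead, an image-only input with no text tokens is fed to a dedicated unconditional model.
- The difference may not be obvious at first glance, but you can think of it as a way of handling the positive prompt more carefully.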
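The contrast in the list above can be illustrated with a toy example. It assumes the two predictions are combined with the usual CFG formula, `uncond + cfg * (cond - uncond)`; whether the DualModelGuider node does exactly this is not confirmed here. Both model functions are hypothetical stand-ins operating on plain lists, not ComfyUI APIs.

```python
def cfg_combine(cond_pred, uncond_pred, cfg):
    """Usual CFG combination, element-wise: uncond + cfg * (cond - uncond)."""
    return [u + cfg * (c - u) for c, u in zip(cond_pred, uncond_pred)]


def conditional_model(latent, prompt_tokens):
    """Hypothetical stand-in for the regular model: sees image and text tokens."""
    return [x + 0.1 * len(prompt_tokens) for x in latent]


def unconditional_model(latent):
    """Hypothetical stand-in for the unconditional model: image-only input."""
    return [x * 0.5 for x in latent]


latent = [1.0, 2.0]
tokens = ["a", "b"]

# Ordinary CFG would call the SAME model twice, once with an empty prompt:
#   uncond = conditional_model(latent, [])
# The dual-model setup calls a SEPARATE model that takes no text at all:
cond = conditional_model(latent, tokens)
uncond = unconditional_model(latent)

guided = cfg_combine(cond, uncond, cfg=7.0)
print(guided)
```

With `cfg=1.0` the combination reduces to the conditional prediction, so the unconditional model only matters once guidance is above 1.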
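CFG
This is a long-standing trick: use different CFG values for the earlier and later parts of sampling.
- In this workflow, the earlier part runs at CFG 7 and the later part at CFG 3.
- Lowering the CFG partway through is usually more stable than applying a high CFG from start to finish.
- This is done with the CFG Override node.
  - It overrides the CFG value only within the specified step range.
  - In this workflow, cfg becomes 3 once 70% of the total steps have passed.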
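The resulting schedule can be sketched as a simple function of the step index. This is an illustration of the behavior described above, not the node's implementation: it assumes the 70% threshold is measured as a fraction of the step count, whereas the real node may map the percentage differently. The 20-step count and the function name are also illustrative.

```python
def cfg_for_step(step, total_steps, base_cfg=7.0, override_cfg=3.0,
                 start=0.7, end=1.0):
    """Return the CFG value for a 0-based step index.

    The override applies while the step's progress fraction lies in
    [start, end]; outside that range the base CFG is used.
    """
    progress = step / total_steps
    return override_cfg if start <= progress <= end else base_cfg


total_steps = 20
schedule = [cfg_for_step(i, total_steps) for i in range(total_steps)]
print(schedule)
```

With 20 steps, the first 14 run at CFG 7 and the last 6 at CFG 3.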