What is prompt generation / editing?
Back when the prompt was virtually the only parameter you could meaningfully touch, prompt engineering and the word "incantation" were everywhere (those were the days).
Compared with today's natural-language prompts, a prompt for Stable Diffusion 1.5 really was an incantation: a string of tags. The model's comprehension was poor too, so you had to refine prompts by trial and error while watching the actual output.
Writing these by hand every time is tedious, though, and the skill inevitably turns into artisanal craft. "Prompt generation / editing", as this page calls it, is the attempt to offload that work to an LLM.
Prompt generation in the Stable Diffusion era
Models of the Stable Diffusion / SDXL generation could not interpret natural language well, so the basic style was to string comma-separated tags together.
masterpiece, (best quality:1.05), 1girl, blue hair, …
There are tricks, such as grouping words with related meanings or matching the phrasing of the captions the model was trained on, but hand-assembling this machine-oriented style every time is a chore.
That is where dedicated models came in: models that convert a roughly written prompt into a Stable Diffusion-style tag sequence.
Representative examples

- A lightweight model that generates Danbooru tag sequences: feed it rough tags or a description and it converts them into a dense tag sequence suited to Stable Diffusion.
- Qwen 1.8B Stable Diffusion Prompt: a small Qwen-family model focused on generating prompts for SD (e.g. converting Japanese into English tag sequences).

Neither is designed for human readability; both are tools focused on emitting prompts in a form that SD1.5 / SDXL can easily digest.
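To a first approximation, the shape of that transformation can be imitated mechanically. The sketch below is a hand-rolled illustration (not any of the models above): quality tags go first, duplicates are dropped, and selected tags get the (tag:weight) emphasis syntax that SD1.5 / SDXL front ends understand.

```python
def to_sd_tags(rough, emphasize=None):
    """Convert a rough comma-separated description into an SD-style tag string.

    A hand-rolled illustration of the normalization these converter models
    perform: quality tags first, duplicates dropped, and selected tags wrapped
    in the (tag:weight) emphasis syntax.
    """
    quality = ["masterpiece", "best quality"]
    emphasize = emphasize or {}
    seen, out = set(), []
    for tag in quality + [t.strip().lower() for t in rough.split(",")]:
        if not tag or tag in seen:  # skip empties and duplicates
            continue
        seen.add(tag)
        weight = emphasize.get(tag)
        out.append("({}:{})".format(tag, weight) if weight else tag)
    return ", ".join(out)

print(to_sd_tags("1girl, Blue Hair, 1girl", {"best quality": 1.05}))
# masterpiece, (best quality:1.05), 1girl, blue hair
```

The real converter models do far more than this, of course — they hallucinate plausible extra tags from context — but the output format is exactly this kind of dense tag string.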
Recent models and their prompts
DiT-based models like FLUX, as well as recent image-editing models, use LLM-based text encoders such as T5 or Qwen.
Thanks to that, natural-language comprehension has improved dramatically over the Stable Diffusion era, and "incantation"-style prompt tricks are mostly unnecessary now.
On the other hand, it is not as if anything dashed off will reliably produce good results.
It is the same when directing people: you could say the job of a good director is to state the following concisely.
- Quantitative information: distance, angle of view, focal length, time of day, number of shots, and so on
- Explicit choices for each element: background, composition, lighting, style, facial expression, and so on
That said, writing all of this by hand every time is tedious, so use an LLM such as ChatGPT. Even rough requests like "flesh out this Japanese prompt for FLUX.2", "add composition, lighting, and camera information", or "reshape this prompt for Qwen-Image" are enough to raise a prompt's density considerably.
Some image-generation models ship with a dedicated LLM, but it does not improve things that much. Ultimately, the performance of the image-generation model itself matters most.
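As a concrete sketch of that kind of delegation, the following assembles a single-turn refinement request. The instruction wording is my own placeholder, and the commented-out call assumes the openai package, an API key, and a placeholder model name:

```python
def build_refine_messages(rough_prompt, target_model):
    """Assemble a single-turn chat request asking an LLM to densify a prompt."""
    system = (
        "You rewrite rough image-generation prompts into a single detailed prompt. "
        "Add composition, lighting, and camera information suited to "
        + target_model + ". "
        "Respond in English with the refined prompt only, no commentary."
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": rough_prompt},
    ]

# Actual call (requires `pip install openai` and OPENAI_API_KEY):
# from openai import OpenAI
# client = OpenAI()
# reply = client.chat.completions.create(
#     model="gpt-4o-mini",  # placeholder model name
#     messages=build_refine_messages("a cat walking through a back alley at dusk", "FLUX.2"),
# )
# print(reply.choices[0].message.content)
```

Keeping the request single-turn, as here, matters: you want a fresh rewrite of each rough prompt, not a conversation that drifts.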
Using it in ComfyUI
There are LLMs that can run locally inside ComfyUI, but also consider calling Gemini or ChatGPT through API nodes.

{
"id": "d8034549-7e0a-40f1-8c2e-de3ffc6f1cae",
"revision": 0,
"last_node_id": 60,
"last_link_id": 105,
"nodes": [
{
"id": 8,
"type": "VAEDecode",
"pos": [
1252.432861328125,
188.1918182373047
],
"size": [
157.56002807617188,
46
],
"flags": {},
"order": 12,
"mode": 0,
"inputs": [
{
"name": "samples",
"type": "LATENT",
"link": 35
},
{
"name": "vae",
"type": "VAE",
"link": 76
}
],
"outputs": [
{
"name": "IMAGE",
"type": "IMAGE",
"slot_index": 0,
"links": [
101
]
}
],
"properties": {
"cnr_id": "comfy-core",
"ver": "0.3.33",
"Node name for S&R": "VAEDecode"
},
"widgets_values": []
},
{
"id": 7,
"type": "CLIPTextEncode",
"pos": [
492,
394.392333984375
],
"size": [
418.3189392089844,
107.08506774902344
],
"flags": {
"collapsed": true
},
"order": 7,
"mode": 0,
"inputs": [
{
"name": "clip",
"type": "CLIP",
"link": 75
}
],
"outputs": [
{
"name": "CONDITIONING",
"type": "CONDITIONING",
"slot_index": 0,
"links": [
52
]
}
],
"title": "CLIP Text Encode (Negative Prompt)",
"properties": {
"cnr_id": "comfy-core",
"ver": "0.3.33",
"Node name for S&R": "CLIPTextEncode"
},
"widgets_values": [
""
]
},
{
"id": 37,
"type": "UNETLoader",
"pos": [
250.6552734375,
-167.9522705078125
],
"size": [
305.3782043457031,
82
],
"flags": {},
"order": 0,
"mode": 0,
"inputs": [],
"outputs": [
{
"name": "MODEL",
"type": "MODEL",
"slot_index": 0,
"links": [
99
]
}
],
"properties": {
"cnr_id": "comfy-core",
"ver": "0.3.33",
"Node name for S&R": "UNETLoader"
},
"widgets_values": [
"Z-Image\\z_image_turbo_bf16.safetensors",
"fp8_e4m3fn"
]
},
{
"id": 54,
"type": "ModelSamplingAuraFlow",
"pos": [
586.9390258789062,
-167.9522705078125
],
"size": [
230.33058166503906,
58
],
"flags": {},
"order": 6,
"mode": 0,
"inputs": [
{
"name": "model",
"type": "MODEL",
"link": 99
}
],
"outputs": [
{
"name": "MODEL",
"type": "MODEL",
"links": [
100
]
}
],
"properties": {
"cnr_id": "comfy-core",
"ver": "0.3.49",
"Node name for S&R": "ModelSamplingAuraFlow"
},
"widgets_values": [
3.1
]
},
{
"id": 6,
"type": "CLIPTextEncode",
"pos": [
492,
175
],
"size": [
330.26959228515625,
142.00363159179688
],
"flags": {},
"order": 9,
"mode": 0,
"inputs": [
{
"name": "clip",
"type": "CLIP",
"link": 74
},
{
"name": "text",
"type": "STRING",
"widget": {
"name": "text"
},
"link": 102
}
],
"outputs": [
{
"name": "CONDITIONING",
"type": "CONDITIONING",
"slot_index": 0,
"links": [
46
]
}
],
"title": "CLIP Text Encode (Positive Prompt)",
"properties": {
"cnr_id": "comfy-core",
"ver": "0.3.33",
"Node name for S&R": "CLIPTextEncode"
},
"widgets_values": [
""
]
},
{
"id": 38,
"type": "CLIPLoader",
"pos": [
120.78603616968121,
342.5854112036154
],
"size": [
301.3524169921875,
106
],
"flags": {},
"order": 1,
"mode": 0,
"inputs": [],
"outputs": [
{
"name": "CLIP",
"type": "CLIP",
"slot_index": 0,
"links": [
74,
75
]
}
],
"properties": {
"cnr_id": "comfy-core",
"ver": "0.3.33",
"Node name for S&R": "CLIPLoader"
},
"widgets_values": [
"qwen_3_4b.safetensors",
"lumina2",
"default"
]
},
{
"id": 53,
"type": "EmptySD3LatentImage",
"pos": [
597.2695922851562,
482.05751390379885
],
"size": [
237,
106
],
"flags": {},
"order": 2,
"mode": 0,
"inputs": [],
"outputs": [
{
"name": "LATENT",
"type": "LATENT",
"links": [
98
]
}
],
"properties": {
"cnr_id": "comfy-core",
"ver": "0.3.49",
"Node name for S&R": "EmptySD3LatentImage"
},
"widgets_values": [
1024,
1024,
1
]
},
{
"id": 56,
"type": "SaveImage",
"pos": [
1442.0747874475098,
188.22962825237536
],
"size": [
510.21224258223606,
595.4940064248622
],
"flags": {},
"order": 13,
"mode": 0,
"inputs": [
{
"name": "images",
"type": "IMAGE",
"link": 101
}
],
"outputs": [],
"properties": {
"cnr_id": "comfy-core",
"ver": "0.3.75"
},
"widgets_values": [
"ComfyUI"
]
},
{
"id": 59,
"type": "PreviewAny",
"pos": [
492,
1.5167060232018699
],
"size": [
330,
111
],
"flags": {},
"order": 10,
"mode": 0,
"inputs": [
{
"name": "source",
"type": "*",
"link": 104
}
],
"outputs": [],
"properties": {
"cnr_id": "comfy-core",
"ver": "0.3.75",
"Node name for S&R": "PreviewAny"
},
"widgets_values": []
},
{
"id": 57,
"type": "GeminiNode",
"pos": [
131.26602226763393,
0.08407710682253366
],
"size": [
273,
266
],
"flags": {},
"order": 8,
"mode": 0,
"inputs": [
{
"name": "images",
"shape": 7,
"type": "IMAGE",
"link": null
},
{
"name": "audio",
"shape": 7,
"type": "AUDIO",
"link": null
},
{
"name": "video",
"shape": 7,
"type": "VIDEO",
"link": null
},
{
"name": "files",
"shape": 7,
"type": "GEMINI_INPUT_FILES",
"link": null
},
{
"name": "prompt",
"type": "STRING",
"widget": {
"name": "prompt"
},
"link": 105
}
],
"outputs": [
{
"name": "STRING",
"type": "STRING",
"links": [
102,
104
]
}
],
"properties": {
"cnr_id": "comfy-core",
"ver": "0.3.75",
"Node name for S&R": "GeminiNode"
},
"widgets_values": [
"",
"gemini-3-pro-preview",
12345,
"fixed",
"Status: Completed\nPrice: $0.0113\nTime elapsed: 10s"
],
"color": "#432",
"bgcolor": "#653"
},
{
"id": 55,
"type": "MarkdownNote",
"pos": [
-136.07276600955444,
-300.4671673650518
],
"size": [
349.13103718118725,
214.5148968572393
],
"flags": {},
"order": 3,
"mode": 0,
"inputs": [],
"outputs": [],
"properties": {},
"widgets_values": [
"## models\n- [z_image_turbo_bf16.safetensors](https://huggingface.co/Comfy-Org/z_image_turbo/blob/main/split_files/diffusion_models/z_image_turbo_bf16.safetensors)\n- [qwen_3_4b.safetensors](https://huggingface.co/Comfy-Org/z_image_turbo/blob/main/split_files/text_encoders/qwen_3_4b.safetensors)\n- [ae.safetensors](https://huggingface.co/Comfy-Org/z_image_turbo/blob/main/split_files/vae/ae.safetensors)\n\n```\n📂ComfyUI/\n└── 📂models/\n ├── 📂diffusion_models/\n │ └── z_image_turbo_bf16.safetensors\n ├── 📂text_encoders/\n │ └── qwen_3_4b.safetensors\n └── 📂vae/\n └── ae.safetensors\n```"
]
},
{
"id": 60,
"type": "StringConcatenate",
"pos": [
-181.55781163713942,
-8.244166137499546
],
"size": [
283.8399999999999,
276.23
],
"flags": {},
"order": 4,
"mode": 0,
"inputs": [],
"outputs": [
{
"name": "STRING",
"type": "STRING",
"links": [
105
]
}
],
"properties": {
"cnr_id": "comfy-core",
"ver": "0.3.75",
"Node name for S&R": "StringConcatenate"
},
"widgets_values": [
"You are a prompt refiner for image generation models (e.g. Stable Diffusion, FLUX, Qwen-Image, etc.).\n\nThe user will give you a short, rough prompt describing an image. Your job is to rewrite it into a single, detailed prompt that is easy for an image generation model to follow.\n\nInteraction rules:\n- This is strictly single-turn. For each input, you read the rough prompt once and respond once.\n- Do NOT ask the user questions.\n- Do NOT rely on or refer to any previous conversation history.\n\nGoals:\n- Keep the same core subject, theme, and intent as the original prompt.\n- Do NOT change the meaning or add new story elements; only clarify and enrich what is already implied.\n- Make implicit visual details explicit: subject appearance, pose, composition, environment, lighting, mood, and style.\n- Focus only on what should be visible in a single still image.\n\nWhen expanding the prompt:\n- Prefer concrete, visual, testable details over emotional or metaphorical language.\n- Describe:\n - Who or what is in the image (age, gender expression, clothing, notable features, materials, etc.).\n - Pose and action of the main subject.\n - Camera and composition (shot type, angle, distance, framing, depth of field).\n - Environment and background (indoor/outdoor, location type, props, weather, time of day).\n - Lighting (soft/hard, key direction, contrast, highlights, reflections, etc.).\n - Color palette and overall mood, if implied.\n - Rendering style (photograph, watercolor, anime illustration, 3D render, flat graphic, etc.), based on the user’s words.\n- If the prompt clearly suggests a photograph, add subtle camera details (for example: lens focal length, aperture, high-resolution, realistic textures) but keep them plausible and not overly technical.\n- If the prompt clearly suggests illustration or anime style, describe line quality, shading style, and level of detail instead of camera specs.\n- Do not invent extra characters, locations, or objects that are not 
suggested in the original prompt.\n\nLanguage rules:\n- Always respond in English, regardless of the input language.\n- Use one concise paragraph or 1–3 sentences, not a long list.\n- Avoid overly poetic or flowery language; keep it functional and descriptive.\n- Do NOT mention “prompt”, “model”, “negative prompt”, “system prompt”, or give any meta commentary.\n\nOutput format:\n- Output ONLY the refined image-generation prompt as plain text.\n- Do NOT add explanations, headings, bullet points, quotes, or any extra filler.\n",
"万華鏡の中で撮影したかのようなファッションショー",
"---"
],
"color": "#432",
"bgcolor": "#653"
},
{
"id": 39,
"type": "VAELoader",
"pos": [
999.1927782010846,
509.5303495842456
],
"size": [
210,
58
],
"flags": {
"collapsed": false
},
"order": 5,
"mode": 0,
"inputs": [],
"outputs": [
{
"name": "VAE",
"type": "VAE",
"slot_index": 0,
"links": [
76
]
}
],
"properties": {
"cnr_id": "comfy-core",
"ver": "0.3.33",
"Node name for S&R": "VAELoader"
},
"widgets_values": [
"ae.safetensors"
]
},
{
"id": 3,
"type": "KSampler",
"pos": [
898.7548217773438,
188.1918182373047
],
"size": [
315,
262
],
"flags": {},
"order": 11,
"mode": 0,
"inputs": [
{
"name": "model",
"type": "MODEL",
"link": 100
},
{
"name": "positive",
"type": "CONDITIONING",
"link": 46
},
{
"name": "negative",
"type": "CONDITIONING",
"link": 52
},
{
"name": "latent_image",
"type": "LATENT",
"link": 98
}
],
"outputs": [
{
"name": "LATENT",
"type": "LATENT",
"slot_index": 0,
"links": [
35
]
}
],
"properties": {
"cnr_id": "comfy-core",
"ver": "0.3.33",
"Node name for S&R": "KSampler"
},
"widgets_values": [
55555,
"fixed",
8,
1,
"euler",
"simple",
1
]
}
],
"links": [
[
35,
3,
0,
8,
0,
"LATENT"
],
[
46,
6,
0,
3,
1,
"CONDITIONING"
],
[
52,
7,
0,
3,
2,
"CONDITIONING"
],
[
74,
38,
0,
6,
0,
"CLIP"
],
[
75,
38,
0,
7,
0,
"CLIP"
],
[
76,
39,
0,
8,
1,
"VAE"
],
[
98,
53,
0,
3,
3,
"LATENT"
],
[
99,
37,
0,
54,
0,
"MODEL"
],
[
100,
54,
0,
3,
0,
"MODEL"
],
[
101,
8,
0,
56,
0,
"IMAGE"
],
[
102,
57,
0,
6,
1,
"STRING"
],
[
104,
57,
0,
59,
0,
"*"
],
[
105,
60,
0,
57,
4,
"STRING"
]
],
"groups": [],
"config": {},
"extra": {
"ds": {
"scale": 1.3310000000000004,
"offset": [
66.47718451467517,
14.701606057764138
]
},
"frontendVersion": "1.34.2",
"VHS_latentpreview": false,
"VHS_latentpreviewrate": 0,
"VHS_MetadataImage": true,
"VHS_KeepIntermediate": true
},
"version": 0.4
}
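A workflow like the one above can also be driven without the GUI through ComfyUI's HTTP API: you POST the API-format export of the workflow (from "Export (API)") to a running instance. Note that the graph JSON shown above is the editor format, which the /prompt endpoint does not accept, so treat the following as a sketch of the mechanism. It assumes a default local server at 127.0.0.1:8188 and that the StringConcatenate node's second input is named string_b (both are assumptions about your setup):

```python
import json
import urllib.request

def set_rough_prompt(workflow, node_id, text):
    """Set the user's rough prompt on a node in an API-format ComfyUI workflow.

    API-format workflows are dicts keyed by node id, each entry carrying an
    "inputs" mapping. Which input name to patch depends on the node type;
    "string_b" (the second string of StringConcatenate) is assumed here.
    """
    workflow[node_id]["inputs"]["string_b"] = text
    return workflow

def queue_prompt(workflow, host="127.0.0.1:8188"):
    """POST the workflow to a locally running ComfyUI instance."""
    req = urllib.request.Request(
        "http://{}/prompt".format(host),
        data=json.dumps({"prompt": workflow}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)
```

With this, a small script can loop over a list of rough prompts, patch node 60 each time, and queue a batch of Gemini-refined generations unattended.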
I'm the sort of person who wants to stick to local models myself, but honestly, keeping an LLM of usable quality running locally tends to demand more of your PC than running an image-generation model does.
Fortunately, LLM API usage is remarkably cheap. The $5 of credit I bought ages ago still isn't used up (´・ω・`)