LLM / MLLM

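What are LLMs and MLLMs?

An LLM (large language model) is, put simply, an AI like ChatGPT that reads text and answers in text.

An MLLM builds on this: it is an LLM that also accepts other kinds of input, such as images. As the name suggests, it is a "multimodal" LLM.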

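What are they used for in ComfyUI?

In ComfyUI, an LLM is used less for conversation and more as a behind-the-scenes assistant that prepares input material for image generation models.

Prompt expansion and translation
- Expands a rough human instruction into a detailed English prompt that the AI can understand more easily
Tag generation and image captioning
- Takes an image as input and outputs tags or text describing it
- The output can serve as caption text for training, or be fed back in as a prompt to regenerate the image
Object detection and segmentation
- Some MLLMs can perform more specialized tasks
- MLLM-based object detection is especially practical, because the target can be specified in natural language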

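Four ways to use an LLM in ComfyUI

ComfyUI is an engine designed for image generation, and an LLM works on quite different underlying mechanics, so ComfyUI's LLM support is fairly limited.

In practice, an LLM is therefore used through a core node, custom nodes, or an external integration.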

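TextGenerate node

This node was recently added to core. Its goal is to let the text encoders used for image generation double as LLMs / MLLMs.

Because it is forced to run inside ComfyUI's own code environment, it falls well short of dedicated engines such as llama-cpp in both speed and features.

That it runs in core at all is impressive, but as things stand it is not yet the recommended option.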

TextGenerate_gemma3.json

{
  "id": "cc2eec2f-681d-45b7-a301-a8f315a9bce8",
  "revision": 0,
  "last_node_id": 7,
  "last_link_id": 6,
  "nodes": [
    {
      "id": 2,
      "type": "CLIPLoader",
      "pos": [
        729.6315004012546,
        764.4307765124615
      ],
      "size": [
        270,
        106
      ],
      "flags": {},
      "order": 0,
      "mode": 0,
      "inputs": [],
      "outputs": [
        {
          "name": "CLIP",
          "type": "CLIP",
          "links": [
            1
          ]
        }
      ],
      "properties": {
        "cnr_id": "comfy-core",
        "ver": "0.17.0",
        "Node name for S&R": "CLIPLoader"
      },
      "widgets_values": [
        "gemma_3_12B_it_fp8_scaled.safetensors",
        "stable_diffusion",
        "default"
      ],
      "color": "#432",
      "bgcolor": "#653"
    },
    {
      "id": 3,
      "type": "LoadImage",
      "pos": [
        467.40009544257674,
        931.5659151585422
      ],
      "size": [
        270,
        326
      ],
      "flags": {},
      "order": 1,
      "mode": 0,
      "inputs": [],
      "outputs": [
        {
          "name": "IMAGE",
          "type": "IMAGE",
          "links": [
            4
          ]
        },
        {
          "name": "MASK",
          "type": "MASK",
          "links": null
        }
      ],
      "properties": {
        "cnr_id": "comfy-core",
        "ver": "0.18.0",
        "Node name for S&R": "LoadImage"
      },
      "widgets_values": [
        "000032_00007_.png",
        "image"
      ]
    },
    {
      "id": 4,
      "type": "PreviewAny",
      "pos": [
        1496.5240313386585,
        764.4307765124615
      ],
      "size": [
        286,
        154
      ],
      "flags": {},
      "order": 4,
      "mode": 0,
      "inputs": [
        {
          "name": "source",
          "type": "*",
          "link": 3
        }
      ],
      "outputs": [],
      "properties": {
        "cnr_id": "comfy-core",
        "ver": "0.18.0",
        "Node name for S&R": "PreviewAny"
      },
      "widgets_values": [
        null,
        null,
        null
      ]
    },
    {
      "id": 5,
      "type": "ResizeImageMaskNode",
      "pos": [
        762.4053115790771,
        932.9639654966344
      ],
      "size": [
        236.556640625,
        106
      ],
      "flags": {},
      "order": 2,
      "mode": 0,
      "inputs": [
        {
          "name": "input",
          "type": "IMAGE,MASK",
          "link": 4
        }
      ],
      "outputs": [
        {
          "name": "resized",
          "type": "IMAGE",
          "links": [
            5
          ]
        }
      ],
      "properties": {
        "cnr_id": "comfy-core",
        "ver": "0.18.0",
        "Node name for S&R": "ResizeImageMaskNode"
      },
      "widgets_values": [
        "scale total pixels",
        0.25,
        "nearest-exact"
      ]
    },
    {
      "id": 1,
      "type": "TextGenerate",
      "pos": [
        1056.12688558396,
        764.4307765124615
      ],
      "size": [
        400,
        300
      ],
      "flags": {},
      "order": 3,
      "mode": 0,
      "inputs": [
        {
          "name": "clip",
          "type": "CLIP",
          "link": 1
        },
        {
          "name": "image",
          "shape": 7,
          "type": "IMAGE",
          "link": 5
        }
      ],
      "outputs": [
        {
          "name": "generated_text",
          "type": "STRING",
          "links": [
            3
          ]
        }
      ],
      "properties": {
        "cnr_id": "comfy-core",
        "ver": "0.17.0",
        "Node name for S&R": "TextGenerate"
      },
      "widgets_values": [
        "Please describe this image in detail in 200 characters",
        256,
        "on",
        0.7,
        64,
        0.95,
        0.05,
        1.05,
        0
      ],
      "color": "#232",
      "bgcolor": "#353"
    }
  ],
  "links": [
    [
      1,
      2,
      0,
      1,
      0,
      "CLIP"
    ],
    [
      3,
      1,
      0,
      4,
      0,
      "STRING"
    ],
    [
      4,
      3,
      0,
      5,
      0,
      "IMAGE"
    ],
    [
      5,
      5,
      0,
      1,
      1,
      "IMAGE"
    ]
  ],
  "groups": [],
  "config": {},
  "extra": {
    "ds": {
      "scale": 1.1,
      "offset": [
        -258.6635250536924,
        -464.0999220954758
      ]
    },
    "frontendVersion": "1.41.21",
    "VHS_latentpreview": false,
    "VHS_latentpreviewrate": 0,
    "VHS_MetadataImage": true,
    "VHS_KeepIntermediate": true
  },
  "version": 0.4
}
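
To make the wiring of the workflow above easier to read, here is a minimal Python sketch that lists its connections. It assumes each entry in "links" is laid out as [link id, source node id, source output slot, target node id, target input slot, type], which matches the workflow shown here; the helper names `describe_links` and `load_workflow` are hypothetical and not part of ComfyUI.

```python
import json


def describe_links(workflow: dict) -> list[str]:
    """Return one 'source.output -> target.input (type)' line per link."""
    nodes = {node["id"]: node for node in workflow["nodes"]}
    lines = []
    # Assumed layout of each link entry:
    # [link_id, source_node, source_slot, target_node, target_slot, type]
    for _link_id, src_id, src_slot, dst_id, dst_slot, link_type in workflow["links"]:
        src, dst = nodes[src_id], nodes[dst_id]
        out_name = src["outputs"][src_slot]["name"]
        in_name = dst["inputs"][dst_slot]["name"]
        lines.append(
            f"{src['type']}.{out_name} -> {dst['type']}.{in_name} ({link_type})"
        )
    return lines


def load_workflow(path: str) -> dict:
    """Read a workflow JSON file saved from ComfyUI."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


# Usage, assuming the JSON above was saved as TextGenerate_gemma3.json:
# for line in describe_links(load_workflow("TextGenerate_gemma3.json")):
#     print(line)
```

Read this way, the workflow has two branches feeding TextGenerate: CLIPLoader supplies the model through the clip input, and LoadImage passes through ResizeImageMaskNode into the image input. The generated text then goes to PreviewAny.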

Supported models

Gemma 3
Qwen3
Qwen-3.5

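ComfyUI custom nodes

As with image generation models, you download the model files and run them on your own machine.

These are mainly lightweight models optimized for a specific task, such as caption generation or object detection.

Representative supported models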

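External LLM server integration

LLM inference is handed off to a dedicated engine such as Ollama or LM Studio, which ComfyUI calls through an API.

Running the server on the same PC still means competing for VRAM, but the biggest advantage is that the inference environment is fully isolated from ComfyUI.

- ComfyUI's dependency environment stays clean, which makes maintenance easier
- Running the server on a separate PC and connecting over the network removes the VRAM contention entirely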
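
As a rough illustration of what "calling through an API" means, the sketch below sends a prompt-expansion request to a local Ollama server using only the Python standard library. It is a sketch under assumptions, not how any particular ComfyUI node is implemented: the address assumes Ollama's default port on the same machine, the model name `gemma3` is a placeholder for whatever model you have pulled, and `SYSTEM_PROMPT`, `build_payload`, and `expand_prompt` are hypothetical names.

```python
import json
import urllib.request

# Assumed default address of a local Ollama server; change the host
# to point at another PC on the network.
OLLAMA_URL = "http://localhost:11434/api/generate"

# Hypothetical instruction for the prompt-expansion use case.
SYSTEM_PROMPT = (
    "You expand short image ideas into one detailed English prompt "
    "for an image generation model. Reply with the prompt only."
)


def build_payload(rough_idea: str, model: str = "gemma3") -> dict:
    """Build the JSON body for Ollama's /api/generate endpoint."""
    return {
        "model": model,            # placeholder; use a model you have pulled
        "system": SYSTEM_PROMPT,
        "prompt": rough_idea,
        "stream": False,           # one JSON object instead of a token stream
    }


def expand_prompt(rough_idea: str, model: str = "gemma3",
                  url: str = OLLAMA_URL) -> str:
    """POST the request and return the generated text."""
    body = json.dumps(build_payload(rough_idea, model)).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=120) as response:
        return json.loads(response.read())["response"].strip()


# Usage (requires a running Ollama server):
# print(expand_prompt("a cat on a windowsill, rainy evening"))
```

Because the only coupling is an HTTP request, nothing from the inference engine has to be installed into ComfyUI's Python environment, and moving the server to another machine only changes the URL.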

→ External LLM server integration

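Official paid API nodes

These are nodes provided officially by ComfyUI for calling closed-source services such as ChatGPT or Gemini through an API.

Frankly, these services are much smarter and much faster than local models.

- They add essentially no load to your PC: you can have one refine prompts in the background while images are generating, without affecting generation speed
- They are billed by usage, of course, and NSFW (adult) content is blocked by safety filters, so keep that in mind

→ API nodes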

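Aside: are you already using one?

Recent image generation models (such as Qwen-Image and Z-Image) have an MLLM such as Qwen or Gemma built in as the text encoder, the component that understands the prompt.

They use the MLLM to understand text prompts and reference images, and then carry out generation and editing tasks. Put the other way around, the MLLM is used only for that purpose, which makes it a rather "extravagant" use.
It would be interesting if one day it could be used directly as a general-purpose MLLM...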
