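Object Detection

What is object detection?

Object detection is the task of finding what appears in an image (the class) and where it appears (the position).

A detector typically outputs a bounding box (a rectangle) and a label for each object.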
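
As a rough illustration of that output, the sketch below models one detection as a small record. The field names, the confidence score, and the corner-based pixel coordinates are assumptions made for this example; real detectors each use their own layout.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    """One detected object: class label, confidence score, and bounding box.

    The box uses pixel coordinates, (x1, y1) top-left and (x2, y2) bottom-right.
    This layout is an assumption for illustration, not a fixed standard.
    """
    label: str
    score: float
    x1: int
    y1: int
    x2: int
    y2: int

    @property
    def width(self) -> int:
        return self.x2 - self.x1

    @property
    def height(self) -> int:
        return self.y2 - self.y1

    @property
    def area(self) -> int:
        return self.width * self.height


# Made-up detector output for a single image.
detections = [
    Detection("dog", 0.91, 40, 60, 200, 220),
    Detection("person", 0.78, 210, 20, 300, 240),
]

for det in detections:
    print(f"{det.label}: score={det.score:.2f}, box=({det.x1}, {det.y1}, {det.x2}, {det.y2})")
```

A list of such records is all a plain detector hands to the next step: one label and one rectangle per object.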

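In ComfyUI it mainly serves as the entry point for mask generation. Detecting a dog in an image and removing it, or detecting only the face and refining it: either way, this technique comes up constantly.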

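Representative methods

The object detection field contains many families of models, but from ComfyUI's point of view the following are the representative ones.

YOLO family

A long-established and powerful family of models for detecting specific objects (cars, people, dogs, and so on).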

yolo8.json

{
  "id": "ffcc6c64-e535-4685-ab04-be903b4cdf3c",
  "revision": 0,
  "last_node_id": 7,
  "last_link_id": 5,
  "nodes": [
    {
      "id": 3,
      "type": "UltralyticsDetectorProvider",
      "pos": [
        -131.74129771892854,
        275.10463657117793
      ],
      "size": [
        225.47324988344883,
        100.20074983277442
      ],
      "flags": {},
      "order": 0,
      "mode": 0,
      "inputs": [],
      "outputs": [
        {
          "name": "BBOX_DETECTOR",
          "type": "BBOX_DETECTOR",
          "links": [
            2
          ]
        },
        {
          "name": "SEGM_DETECTOR",
          "type": "SEGM_DETECTOR",
          "links": null
        }
      ],
      "properties": {
        "cnr_id": "comfyui-impact-subpack",
        "ver": "1.3.5",
        "Node name for S&R": "UltralyticsDetectorProvider"
      },
      "widgets_values": [
        "segm/person_yolov8m-seg.pt"
      ],
      "color": "#232",
      "bgcolor": "#353"
    },
    {
      "id": 2,
      "type": "LoadImage",
      "pos": [
        -192.01296976493634,
        433.54398787774375
      ],
      "size": [
        288.15658006702404,
        326
      ],
      "flags": {},
      "order": 1,
      "mode": 0,
      "inputs": [],
      "outputs": [
        {
          "name": "IMAGE",
          "type": "IMAGE",
          "links": [
            1
          ]
        },
        {
          "name": "MASK",
          "type": "MASK",
          "links": null
        }
      ],
      "properties": {
        "cnr_id": "comfy-core",
        "ver": "0.3.71",
        "Node name for S&R": "LoadImage"
      },
      "widgets_values": [
        "1f421a11eb7f46ffcf970787036c5cc1.jpg",
        "image"
      ]
    },
    {
      "id": 1,
      "type": "ImpactSimpleDetectorSEGS",
      "pos": [
        137.03559995799336,
        275.10463657117793
      ],
      "size": [
        244.07421875,
        310
      ],
      "flags": {},
      "order": 2,
      "mode": 0,
      "inputs": [
        {
          "name": "bbox_detector",
          "type": "BBOX_DETECTOR",
          "link": 2
        },
        {
          "name": "image",
          "type": "IMAGE",
          "link": 1
        },
        {
          "name": "sam_model_opt",
          "shape": 7,
          "type": "SAM_MODEL",
          "link": null
        },
        {
          "name": "segm_detector_opt",
          "shape": 7,
          "type": "SEGM_DETECTOR",
          "link": null
        }
      ],
      "outputs": [
        {
          "name": "SEGS",
          "type": "SEGS",
          "links": [
            5
          ]
        }
      ],
      "properties": {
        "cnr_id": "comfyui-impact-pack",
        "ver": "61bd8397a18e7e7668e6a24e95168967768c2bed",
        "Node name for S&R": "ImpactSimpleDetectorSEGS"
      },
      "widgets_values": [
        0.5,
        0,
        3,
        10,
        0.5,
        0,
        0,
        0.7,
        0
      ],
      "color": "#232",
      "bgcolor": "#353"
    },
    {
      "id": 7,
      "type": "SEGSPreview",
      "pos": [
        416.62826858269676,
        275.10463657117793
      ],
      "size": [
        332.13668518001396,
        314
      ],
      "flags": {},
      "order": 3,
      "mode": 0,
      "inputs": [
        {
          "name": "segs",
          "type": "SEGS",
          "link": 5
        },
        {
          "name": "fallback_image_opt",
          "shape": 7,
          "type": "IMAGE",
          "link": null
        }
      ],
      "outputs": [
        {
          "name": "IMAGE",
          "shape": 6,
          "type": "IMAGE",
          "links": null
        }
      ],
      "properties": {
        "cnr_id": "comfyui-impact-pack",
        "ver": "61bd8397a18e7e7668e6a24e95168967768c2bed",
        "Node name for S&R": "SEGSPreview"
      },
      "widgets_values": [
        true,
        0.1
      ]
    }
  ],
  "links": [
    [
      1,
      2,
      0,
      1,
      1,
      "IMAGE"
    ],
    [
      2,
      3,
      0,
      1,
      0,
      "BBOX_DETECTOR"
    ],
    [
      5,
      1,
      0,
      7,
      0,
      "SEGS"
    ]
  ],
  "groups": [],
  "config": {},
  "extra": {
    "ds": {
      "scale": 1.01525597994771,
      "offset": [
        522.496714378834,
        -22.433780096160543
      ]
    },
    "frontendVersion": "1.34.3",
    "VHS_latentpreview": false,
    "VHS_latentpreviewrate": 0,
    "VHS_MetadataImage": true,
    "VHS_KeepIntermediate": true
  },
  "version": 0.4
}
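
To make the wiring in the workflow above easier to follow, here is a small sketch that prints each link as "source node, data type, target node". The node types and links are copied from yolo8.json. Reading each link as `[link_id, source_node, source_slot, target_node, target_slot, type]` is inferred from the data itself, not taken from a format specification, and `describe_links` is a hypothetical helper.

```python
# Node types and links copied from yolo8.json (positions and widgets omitted).
workflow = {
    "nodes": [
        {"id": 3, "type": "UltralyticsDetectorProvider"},
        {"id": 2, "type": "LoadImage"},
        {"id": 1, "type": "ImpactSimpleDetectorSEGS"},
        {"id": 7, "type": "SEGSPreview"},
    ],
    "links": [
        [1, 2, 0, 1, 1, "IMAGE"],
        [2, 3, 0, 1, 0, "BBOX_DETECTOR"],
        [5, 1, 0, 7, 0, "SEGS"],
    ],
}


def describe_links(wf):
    """Render each link as 'SourceType -[DATA_TYPE]-> TargetType'.

    Assumes the six-element link layout described in the text above.
    """
    type_by_id = {node["id"]: node["type"] for node in wf["nodes"]}
    lines = []
    for _link_id, src, _src_slot, dst, _dst_slot, data_type in wf["links"]:
        lines.append(f"{type_by_id[src]} -[{data_type}]-> {type_by_id[dst]}")
    return lines


for line in describe_links(workflow):
    print(line)
```

Read this way, the graph is a short chain: the image and the detector both feed the detector node, whose SEGS output goes to the preview.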

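They are extremely fast, and lightweight enough for real-time processing.
Each model is trained on a fixed set of classes ("person", "car", and so on) and detects only objects from that set.

If no model exists for the class you need, you have to train one yourself.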
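
The fixed class set can be pictured with a minimal sketch. The class list, the `(class_id, score)` output shape, and the `select_class` helper are all hypothetical; a real model ships its own class list and richer results.

```python
# Hypothetical closed class list; a real model defines its own at training time.
CLASS_NAMES = ["person", "car", "dog"]


def select_class(detections, wanted):
    """Keep detections whose class name equals `wanted`.

    `detections` is a list of (class_id, score) pairs, where class_id indexes
    into CLASS_NAMES. A class outside the trained set can never be detected,
    so this raises instead of silently returning an empty list.
    """
    if wanted not in CLASS_NAMES:
        raise ValueError(f"{wanted!r} is not in the trained class set {CLASS_NAMES}")
    wanted_id = CLASS_NAMES.index(wanted)
    return [(cid, score) for cid, score in detections if cid == wanted_id]


raw = [(0, 0.88), (2, 0.93), (0, 0.41)]  # made-up model output
print(select_class(raw, "person"))  # [(0, 0.88), (0, 0.41)]
```

Asking for a class such as "coffee" raises a `ValueError` here, which mirrors the practical situation: the model has no output slot for a class it was never trained on.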

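DETR family

Detection models built on a Transformer rather than a purely CNN-based design. You will almost never handle them directly in ComfyUI, but you are likely to see the name in object detection contexts.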

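Text-prompted object detection

The detectors above can only detect the classes they were trained on, so they quickly become hard to use once you want to detect anything other than common objects such as people and cars.

What matters for ComfyUI is detection where the type of object can be specified with text.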

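Grounding DINO

A model that combines an image encoder with a text encoder and aligns image features with text features.
It can detect whatever you describe in a text prompt, such as "red car" or "traffic light".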
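
The idea of aligning image features with text features can be sketched as a toy example. This is not Grounding DINO's actual architecture or API: the three-dimensional vectors, the cosine-similarity matching, the threshold, and the `ground_phrases` helper are all assumptions chosen to show the principle of pairing regions with phrases.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def ground_phrases(region_feats, phrase_feats, threshold=0.8):
    """Pair each region with its most similar phrase.

    Returns (region_index, phrase, similarity) for every region whose best
    match reaches `threshold`; regions below it are dropped.
    """
    matches = []
    for i, region in enumerate(region_feats):
        best_phrase, best_sim = None, -1.0
        for phrase, feat in phrase_feats.items():
            sim = cosine(region, feat)
            if sim > best_sim:
                best_phrase, best_sim = phrase, sim
        if best_sim >= threshold:
            matches.append((i, best_phrase, round(best_sim, 3)))
    return matches


# Made-up features standing in for image-encoder and text-encoder outputs.
region_feats = [
    [0.9, 0.1, 0.0],  # region 0: close to "red car"
    [0.0, 0.2, 0.9],  # region 1: close to "traffic light"
    [0.5, 0.5, 0.5],  # region 2: matches neither phrase well
]
phrase_feats = {
    "red car": [1.0, 0.0, 0.0],
    "traffic light": [0.0, 0.0, 1.0],
}

print(ground_phrases(region_feats, phrase_feats))
```

Because the phrases are ordinary text, changing the prompt changes what gets matched without retraining anything, which is the practical difference from a fixed class list.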

Florence-2

Florence-2.json

{
  "id": "b3c4cb62-a4e3-43d1-8cab-97b76da640ea",
  "revision": 0,
  "last_node_id": 5,
  "last_link_id": 4,
  "nodes": [
    {
      "id": 2,
      "type": "DownloadAndLoadFlorence2Model",
      "pos": [
        -172.8312043876651,
        730.6295594867262
      ],
      "size": [
        258.6021484375,
        139.84973267580756
      ],
      "flags": {},
      "order": 0,
      "mode": 0,
      "inputs": [
        {
          "name": "lora",
          "shape": 7,
          "type": "PEFTLORA",
          "link": null
        }
      ],
      "outputs": [
        {
          "name": "florence2_model",
          "type": "FL2MODEL",
          "links": [
            1
          ]
        }
      ],
      "properties": {
        "cnr_id": "comfyui-florence2",
        "ver": "00b63382966a444a9fefacb65b8deb188d12a458",
        "Node name for S&R": "DownloadAndLoadFlorence2Model"
      },
      "widgets_values": [
        "microsoft/Florence-2-base-ft",
        "fp16",
        "sdpa",
        true
      ],
      "color": "#232",
      "bgcolor": "#353"
    },
    {
      "id": 1,
      "type": "Florence2Run",
      "pos": [
        162.05970658979237,
        378.9941029603949
      ],
      "size": [
        400,
        364
      ],
      "flags": {},
      "order": 2,
      "mode": 0,
      "inputs": [
        {
          "name": "image",
          "type": "IMAGE",
          "link": 3
        },
        {
          "name": "florence2_model",
          "type": "FL2MODEL",
          "link": 1
        }
      ],
      "outputs": [
        {
          "name": "image",
          "type": "IMAGE",
          "links": [
            4
          ]
        },
        {
          "name": "mask",
          "type": "MASK",
          "links": null
        },
        {
          "name": "caption",
          "type": "STRING",
          "links": null
        },
        {
          "name": "data",
          "type": "JSON",
          "links": null
        }
      ],
      "properties": {
        "cnr_id": "comfyui-florence2",
        "ver": "00b63382966a444a9fefacb65b8deb188d12a458",
        "Node name for S&R": "Florence2Run"
      },
      "widgets_values": [
        "coffee",
        "caption_to_phrase_grounding",
        true,
        false,
        1024,
        3,
        true,
        "",
        1234,
        "fixed"
      ],
      "color": "#232",
      "bgcolor": "#353"
    },
    {
      "id": 4,
      "type": "LoadImage",
      "pos": [
        -199.4499034371617,
        176.5861666100186
      ],
      "size": [
        283.34567757826187,
        480.9894372866636
      ],
      "flags": {},
      "order": 1,
      "mode": 0,
      "inputs": [],
      "outputs": [
        {
          "name": "IMAGE",
          "type": "IMAGE",
          "links": [
            3
          ]
        },
        {
          "name": "MASK",
          "type": "MASK",
          "links": null
        }
      ],
      "properties": {
        "cnr_id": "comfy-core",
        "ver": "0.3.76",
        "Node name for S&R": "LoadImage"
      },
      "widgets_values": [
        "download (1).jpg",
        "image"
      ]
    },
    {
      "id": 5,
      "type": "PreviewImage",
      "pos": [
        620.7629211596435,
        281.30273069624826
      ],
      "size": [
        397.0780228385779,
        544.4469000769693
      ],
      "flags": {},
      "order": 3,
      "mode": 0,
      "inputs": [
        {
          "name": "images",
          "type": "IMAGE",
          "link": 4
        }
      ],
      "outputs": [],
      "properties": {
        "cnr_id": "comfy-core",
        "ver": "0.3.76",
        "Node name for S&R": "PreviewImage"
      },
      "widgets_values": []
    }
  ],
  "links": [
    [
      1,
      2,
      0,
      1,
      1,
      "FL2MODEL"
    ],
    [
      3,
      4,
      0,
      1,
      0,
      "IMAGE"
    ],
    [
      4,
      1,
      0,
      5,
      0,
      "IMAGE"
    ]
  ],
  "groups": [],
  "config": {},
  "extra": {
    "ds": {
      "scale": 1.015255979947711,
      "offset": [
        299.4499034371617,
        -76.58616661001861
      ]
    },
    "frontendVersion": "1.34.3",
    "VHS_latentpreview": false,
    "VHS_latentpreviewrate": 0,
    "VHS_MetadataImage": true,
    "VHS_KeepIntermediate": true
  },
  "version": 0.4
}

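A general-purpose VLM in which a single model takes on several roles: it looks at an image and performs captioning, object detection, segmentation, and more.
Because its structure is close to that of an LLM, its strength over Grounding DINO is that it accepts more complex, sentence-like instructions.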

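Use in ComfyUI (as mask generation)

In ComfyUI, object detection is almost always used as the entry point for mask generation.

That said, an object detection model outputs only a BBOX (a rectangle).

A box alone is already useful for tasks such as removing an object through inpainting, but when the detected object is a person, for example, much of the box is background, which makes it a rather wasteful mask.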
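
To see why a raw box makes a coarse mask, the sketch below rasterizes a box into a binary grid. The `bbox_to_mask` and `coverage` helpers, the tiny 8x6 image, and the exclusive right and bottom edges are assumptions for illustration; ComfyUI nodes handle this conversion internally.

```python
def bbox_to_mask(width, height, bbox):
    """Rasterize a box (x1, y1, x2, y2) into a height x width grid of 0/1.

    x2 and y2 are treated as exclusive; that convention is an assumption.
    """
    x1, y1, x2, y2 = bbox
    # Clamp to the image so boxes that spill past the edge are cut off cleanly.
    x1, x2 = max(0, x1), min(width, x2)
    y1, y2 = max(0, y1), min(height, y2)
    return [
        [1 if (x1 <= x < x2 and y1 <= y < y2) else 0 for x in range(width)]
        for y in range(height)
    ]


def coverage(mask):
    """Fraction of pixels in the mask that are set to 1."""
    total = sum(len(row) for row in mask)
    return sum(sum(row) for row in mask) / total


mask = bbox_to_mask(8, 6, (2, 1, 6, 5))
for row in mask:
    print("".join("#" if v else "." for v in row))
print(f"masked fraction: {coverage(mask):.2f}")
```

Every pixel inside the rectangle is masked, whether it belongs to the object or to the background behind it.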

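For that reason, detection results are often not used on their own but combined with a subsequent matting or segmentation step. Let's look at those next.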
