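What is object detection?
Object detection is the task of finding what appears in an image (its class) and where it appears (its position).
A detector typically outputs a bounding box (a rectangle) and a label for each object.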
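To make "a bounding box and a label per object" concrete, here is a minimal sketch of what one detection result could look like as data. The `Detection` class, its field names, and the `score` field are assumptions made for illustration; real detectors each use their own output format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    """One detected object: what it is and where it is."""
    label: str    # class name, e.g. "dog"
    score: float  # detector confidence in [0, 1] (assumed field)
    x1: int       # left edge in pixels
    y1: int       # top edge in pixels
    x2: int       # right edge in pixels (exclusive)
    y2: int       # bottom edge in pixels (exclusive)

    @property
    def width(self) -> int:
        return self.x2 - self.x1

    @property
    def height(self) -> int:
        return self.y2 - self.y1

    @property
    def area(self) -> int:
        return self.width * self.height


# Made-up values for illustration only.
detections = [
    Detection("dog", 0.91, 40, 60, 200, 220),
    Detection("person", 0.78, 210, 10, 300, 250),
]
labels = [d.label for d in detections]
```

Everything downstream (masking, cropping, inpainting) only needs these few numbers per object.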
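In ComfyUI, it is mainly used as the entry point for generating masks. Detecting a dog in an image and removing it, or detecting only the faces and refining them: either way, it is a technique that comes up constantly.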
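Representative approaches
The object detection field proper has many families of models, but from ComfyUI's point of view the following are the representative ones.
YOLO family
A long-established and powerful family of models for detecting specific kinds of objects (cars, people, dogs, and so on).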

{
"id": "ffcc6c64-e535-4685-ab04-be903b4cdf3c",
"revision": 0,
"last_node_id": 7,
"last_link_id": 5,
"nodes": [
{
"id": 3,
"type": "UltralyticsDetectorProvider",
"pos": [
-131.74129771892854,
275.10463657117793
],
"size": [
225.47324988344883,
100.20074983277442
],
"flags": {},
"order": 0,
"mode": 0,
"inputs": [],
"outputs": [
{
"name": "BBOX_DETECTOR",
"type": "BBOX_DETECTOR",
"links": [
2
]
},
{
"name": "SEGM_DETECTOR",
"type": "SEGM_DETECTOR",
"links": null
}
],
"properties": {
"cnr_id": "comfyui-impact-subpack",
"ver": "1.3.5",
"Node name for S&R": "UltralyticsDetectorProvider"
},
"widgets_values": [
"segm/person_yolov8m-seg.pt"
],
"color": "#232",
"bgcolor": "#353"
},
{
"id": 2,
"type": "LoadImage",
"pos": [
-192.01296976493634,
433.54398787774375
],
"size": [
288.15658006702404,
326
],
"flags": {},
"order": 1,
"mode": 0,
"inputs": [],
"outputs": [
{
"name": "IMAGE",
"type": "IMAGE",
"links": [
1
]
},
{
"name": "MASK",
"type": "MASK",
"links": null
}
],
"properties": {
"cnr_id": "comfy-core",
"ver": "0.3.71",
"Node name for S&R": "LoadImage"
},
"widgets_values": [
"1f421a11eb7f46ffcf970787036c5cc1.jpg",
"image"
]
},
{
"id": 1,
"type": "ImpactSimpleDetectorSEGS",
"pos": [
137.03559995799336,
275.10463657117793
],
"size": [
244.07421875,
310
],
"flags": {},
"order": 2,
"mode": 0,
"inputs": [
{
"name": "bbox_detector",
"type": "BBOX_DETECTOR",
"link": 2
},
{
"name": "image",
"type": "IMAGE",
"link": 1
},
{
"name": "sam_model_opt",
"shape": 7,
"type": "SAM_MODEL",
"link": null
},
{
"name": "segm_detector_opt",
"shape": 7,
"type": "SEGM_DETECTOR",
"link": null
}
],
"outputs": [
{
"name": "SEGS",
"type": "SEGS",
"links": [
5
]
}
],
"properties": {
"cnr_id": "comfyui-impact-pack",
"ver": "61bd8397a18e7e7668e6a24e95168967768c2bed",
"Node name for S&R": "ImpactSimpleDetectorSEGS"
},
"widgets_values": [
0.5,
0,
3,
10,
0.5,
0,
0,
0.7,
0
],
"color": "#232",
"bgcolor": "#353"
},
{
"id": 7,
"type": "SEGSPreview",
"pos": [
416.62826858269676,
275.10463657117793
],
"size": [
332.13668518001396,
314
],
"flags": {},
"order": 3,
"mode": 0,
"inputs": [
{
"name": "segs",
"type": "SEGS",
"link": 5
},
{
"name": "fallback_image_opt",
"shape": 7,
"type": "IMAGE",
"link": null
}
],
"outputs": [
{
"name": "IMAGE",
"shape": 6,
"type": "IMAGE",
"links": null
}
],
"properties": {
"cnr_id": "comfyui-impact-pack",
"ver": "61bd8397a18e7e7668e6a24e95168967768c2bed",
"Node name for S&R": "SEGSPreview"
},
"widgets_values": [
true,
0.1
]
}
],
"links": [
[
1,
2,
0,
1,
1,
"IMAGE"
],
[
2,
3,
0,
1,
0,
"BBOX_DETECTOR"
],
[
5,
1,
0,
7,
0,
"SEGS"
]
],
"groups": [],
"config": {},
"extra": {
"ds": {
"scale": 1.01525597994771,
"offset": [
522.496714378834,
-22.433780096160543
]
},
"frontendVersion": "1.34.3",
"VHS_latentpreview": false,
"VHS_latentpreviewrate": 0,
"VHS_MetadataImage": true,
"VHS_KeepIntermediate": true
},
"version": 0.4
}
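The `links` array in the workflow above is compact and a little hard to read by eye. The sketch below prints the wiring in plain words. It assumes each link entry is laid out as [link id, source node id, source output slot, target node id, target input slot, data type]; that reading is inferred from how the link numbers line up with each node's `links` and `link` fields in the JSON. The node types and link entries are copied from the workflow, while `describe_links` is a hypothetical helper.

```python
# Node ids and types copied from the workflow above.
NODE_TYPES = {
    1: "ImpactSimpleDetectorSEGS",
    2: "LoadImage",
    3: "UltralyticsDetectorProvider",
    7: "SEGSPreview",
}

# Link entries copied from the workflow above.
LINKS = [
    [1, 2, 0, 1, 1, "IMAGE"],
    [2, 3, 0, 1, 0, "BBOX_DETECTOR"],
    [5, 1, 0, 7, 0, "SEGS"],
]


def describe_links(node_types, links):
    """Render each link as 'source --TYPE--> target'."""
    lines = []
    for _link_id, src, _src_slot, dst, _dst_slot, data_type in links:
        lines.append(f"{node_types[src]} --{data_type}--> {node_types[dst]}")
    return lines


wiring = describe_links(NODE_TYPES, LINKS)
for line in wiring:
    print(line)
```

Read this way, the graph is simply: load an image, load a detector, run the detector on the image, and preview the resulting SEGS.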
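- Very fast, and lightweight enough to be used for real-time processing.
- Trained on a predetermined set of classes (such as "person" and "car") and detects only objects from that set.
- If no model exists for the class you want, you have to train one yourself.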
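The "predetermined set of classes" point can be illustrated with a toy post-processing step. This is only a sketch and not the API of any real YOLO library: `CLASS_NAMES`, `decode`, the tuple layout, and the 0.5 threshold are all assumptions chosen for the example.

```python
# Hypothetical class list; a real model ships with its own fixed list.
CLASS_NAMES = ["person", "car", "dog"]


def decode(raw_detections, threshold=0.5):
    """Keep confident detections and map class indices to names.

    raw_detections: list of (class_index, score, (x1, y1, x2, y2)).
    """
    kept = []
    for class_index, score, box in raw_detections:
        if score < threshold:
            continue  # drop low-confidence candidates
        kept.append((CLASS_NAMES[class_index], score, box))
    return kept


# Made-up raw outputs for illustration.
raw = [
    (0, 0.92, (12, 30, 180, 400)),
    (2, 0.35, (200, 220, 320, 380)),
    (1, 0.66, (300, 100, 620, 330)),
]
results = decode(raw)

# A class outside the list has no index, so it can never be reported.
can_detect_cat = "cat" in CLASS_NAMES
```

Because the model can only ever answer with an index into its fixed list, a "cat" cannot come out of this detector no matter what the image contains, which is why a missing class means training a new model.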
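DETR family
Detection models built around a Transformer encoder-decoder rather than a purely CNN-based detection pipeline (the original DETR still uses a CNN backbone to extract image features). You will almost never handle them directly in ComfyUI, but the name is likely to come up whenever object detection is discussed.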
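Text-prompted object detection
The detectors above can only detect predetermined classes, so they quickly become hard to use once you try to detect anything other than common objects such as people and cars.
What matters for ComfyUI is detection in which the type of object can be specified with text.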
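Grounding DINO
- A model that combines an image encoder with a text encoder and aligns image features with text features.
- It can detect whatever a text prompt describes, such as "red car" or "traffic light".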
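The idea of aligning image features with text features can be pictured with a toy example. This is not Grounding DINO's actual architecture, just a sketch of the general matching idea: the three-dimensional vectors, the `ground` helper, and the 0.8 threshold are invented for illustration, whereas real encoders produce learned, much higher-dimensional features.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def ground(region_features, text_feature, threshold=0.8):
    """Return indices of regions whose feature matches the text feature."""
    return [
        i for i, feat in enumerate(region_features)
        if cosine(feat, text_feature) >= threshold
    ]


# Hand-made stand-ins for features of three candidate image regions.
regions = [
    [0.9, 0.1, 0.0],  # region 0
    [0.0, 0.2, 0.9],  # region 1
    [0.8, 0.3, 0.1],  # region 2
]
# Hand-made stand-in for the encoded prompt "red car".
red_car = [1.0, 0.2, 0.0]

matches = ground(regions, red_car)  # regions 0 and 2 point the same way
```

Since the prompt is just another vector, changing the text changes which regions match, with no retraining and no fixed class list.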
Florence-2

{
"id": "b3c4cb62-a4e3-43d1-8cab-97b76da640ea",
"revision": 0,
"last_node_id": 5,
"last_link_id": 4,
"nodes": [
{
"id": 2,
"type": "DownloadAndLoadFlorence2Model",
"pos": [
-172.8312043876651,
730.6295594867262
],
"size": [
258.6021484375,
139.84973267580756
],
"flags": {},
"order": 0,
"mode": 0,
"inputs": [
{
"name": "lora",
"shape": 7,
"type": "PEFTLORA",
"link": null
}
],
"outputs": [
{
"name": "florence2_model",
"type": "FL2MODEL",
"links": [
1
]
}
],
"properties": {
"cnr_id": "comfyui-florence2",
"ver": "00b63382966a444a9fefacb65b8deb188d12a458",
"Node name for S&R": "DownloadAndLoadFlorence2Model"
},
"widgets_values": [
"microsoft/Florence-2-base-ft",
"fp16",
"sdpa",
true
],
"color": "#232",
"bgcolor": "#353"
},
{
"id": 1,
"type": "Florence2Run",
"pos": [
162.05970658979237,
378.9941029603949
],
"size": [
400,
364
],
"flags": {},
"order": 2,
"mode": 0,
"inputs": [
{
"name": "image",
"type": "IMAGE",
"link": 3
},
{
"name": "florence2_model",
"type": "FL2MODEL",
"link": 1
}
],
"outputs": [
{
"name": "image",
"type": "IMAGE",
"links": [
4
]
},
{
"name": "mask",
"type": "MASK",
"links": null
},
{
"name": "caption",
"type": "STRING",
"links": null
},
{
"name": "data",
"type": "JSON",
"links": null
}
],
"properties": {
"cnr_id": "comfyui-florence2",
"ver": "00b63382966a444a9fefacb65b8deb188d12a458",
"Node name for S&R": "Florence2Run"
},
"widgets_values": [
"coffee",
"caption_to_phrase_grounding",
true,
false,
1024,
3,
true,
"",
1234,
"fixed"
],
"color": "#232",
"bgcolor": "#353"
},
{
"id": 4,
"type": "LoadImage",
"pos": [
-199.4499034371617,
176.5861666100186
],
"size": [
283.34567757826187,
480.9894372866636
],
"flags": {},
"order": 1,
"mode": 0,
"inputs": [],
"outputs": [
{
"name": "IMAGE",
"type": "IMAGE",
"links": [
3
]
},
{
"name": "MASK",
"type": "MASK",
"links": null
}
],
"properties": {
"cnr_id": "comfy-core",
"ver": "0.3.76",
"Node name for S&R": "LoadImage"
},
"widgets_values": [
"download (1).jpg",
"image"
]
},
{
"id": 5,
"type": "PreviewImage",
"pos": [
620.7629211596435,
281.30273069624826
],
"size": [
397.0780228385779,
544.4469000769693
],
"flags": {},
"order": 3,
"mode": 0,
"inputs": [
{
"name": "images",
"type": "IMAGE",
"link": 4
}
],
"outputs": [],
"properties": {
"cnr_id": "comfy-core",
"ver": "0.3.76",
"Node name for S&R": "PreviewImage"
},
"widgets_values": []
}
],
"links": [
[
1,
2,
0,
1,
1,
"FL2MODEL"
],
[
3,
4,
0,
1,
0,
"IMAGE"
],
[
4,
1,
0,
5,
0,
"IMAGE"
]
],
"groups": [],
"config": {},
"extra": {
"ds": {
"scale": 1.015255979947711,
"offset": [
299.4499034371617,
-76.58616661001861
]
},
"frontendVersion": "1.34.3",
"VHS_latentpreview": false,
"VHS_latentpreviewrate": 0,
"VHS_MetadataImage": true,
"VHS_KeepIntermediate": true
},
"version": 0.4
}
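- A general-purpose VLM in which a single model plays several roles: it looks at an image and performs captioning, object detection, segmentation, and more.
- Because its structure is close to that of an LLM, its strength is that it can take more complex sentences as instructions than Grounding DINO can.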
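Use in ComfyUI (as mask generation)
In ComfyUI, object detection is almost always used as the entry point for mask generation.
That said, all an object detection model outputs is BBOXes (rectangles).
That alone is already useful for tasks such as removing an object with inpainting, but when a person is detected, for example, much of the box is background, which makes it a rather wasteful mask.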
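A small toy calculation shows how much background a rectangular mask can carry. The 8x8 "image", the person shape, and the helper names `bbox_mask` and `coverage` are all made up for this sketch; it uses plain Python lists rather than ComfyUI's own mask type.

```python
def bbox_mask(height, width, box):
    """Binary mask (list of rows) that is 1 inside the box, 0 outside."""
    x1, y1, x2, y2 = box  # x2 and y2 are exclusive
    return [
        [1 if x1 <= x < x2 and y1 <= y < y2 else 0 for x in range(width)]
        for y in range(height)
    ]


def coverage(mask):
    """Number of pixels set to 1."""
    return sum(sum(row) for row in mask)


# Toy 8x8 image: a "person" standing as a thin vertical strip ...
person_pixels = {(x, y) for y in range(1, 7) for x in (3, 4)}
# ... with outstretched arms on one row, which widen the bounding box.
person_pixels |= {(x, 3) for x in range(1, 7)}

# Tightest box around the person, as a detector would ideally return it.
xs = [x for x, _ in person_pixels]
ys = [y for _, y in person_pixels]
box = (min(xs), min(ys), max(xs) + 1, max(ys) + 1)

mask = bbox_mask(8, 8, box)
background_in_box = coverage(mask) - len(person_pixels)
```

In this toy case the box covers 36 pixels while the person covers only 16, so more than half of the rectangular mask is background.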
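For that reason, detection results are often not used on their own but combined with a subsequent matting or segmentation step. Let's look at those next.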