AI Mask Generation
Inpainting and many other tasks need masks, but drawing them by hand or preparing mask images every time is tedious, and above all it cannot be automated.
So let's use various AI models to generate masks automatically.
- Object Detection
  - Detects objects in an image and outputs a bounding box (BBOX), guided by an instruction such as text.
- Matting
  - Separates foreground from background with a graduated mask (an alpha matte); in ComfyUI this often ends up as a binary mask.
- Segmentation
  - Extracts the shape of an object as a black-and-white (binary) mask.
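To make the binary-mask/alpha-matte distinction concrete, here is a minimal numpy sketch (toy values, not taken from any node):

import numpy as np

# A binary mask: every pixel is either fully selected (1) or not (0).
binary = np.array([[0, 1],
                   [1, 1]], dtype=np.float32)

# An alpha matte: pixels can be partially selected, which preserves
# soft edges such as hair or semi-transparent glass.
matte = np.array([[0.0, 0.3],
                  [0.8, 1.0]], dtype=np.float32)

# Many ComfyUI nodes reduce a matte to a binary mask by thresholding:
binarized = (matte > 0.5).astype(np.float32)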
Required Custom Nodes
There are many implementations of these techniques, and correspondingly many custom nodes, but the following set should suffice for now.
- 1038lab/ComfyUI-RMBG
  - Covers many techniques from matting to segmentation, and is easy to use.
- ltdrdata/ComfyUI-Impact-Pack
- ltdrdata/ComfyUI-Impact-Subpack
  - Built for Detailer-style workflows, so it is a little quirky when used purely for mask generation.
- kijai/ComfyUI-Florence2
  - Runs Florence-2, a multimodal LLM.
- kijai/ComfyUI-segment-anything-2
  - Runs the SAM 2 segmentation model; used as a set with Florence2.
Object Detection

As the name suggests, it locates a specific object in an image and outputs a rectangular region called a BBOX.
The available technologies differ in accuracy, versatility, and speed.
YOLO Family
An ultra-fast family of detectors aimed at real-time object detection.
Each model is trained for a single object type (faces only, hands only, etc.), so if no model exists for your target you have to train one yourself, and it is a poor fit when you want to detect several kinds of objects at once.

{
"id": "ffcc6c64-e535-4685-ab04-be903b4cdf3c",
"revision": 0,
"last_node_id": 7,
"last_link_id": 5,
"nodes": [
{
"id": 3,
"type": "UltralyticsDetectorProvider",
"pos": [
-131.74129771892854,
275.10463657117793
],
"size": [
225.47324988344883,
100.20074983277442
],
"flags": {},
"order": 0,
"mode": 0,
"inputs": [],
"outputs": [
{
"name": "BBOX_DETECTOR",
"type": "BBOX_DETECTOR",
"links": [
2
]
},
{
"name": "SEGM_DETECTOR",
"type": "SEGM_DETECTOR",
"links": null
}
],
"properties": {
"cnr_id": "comfyui-impact-subpack",
"ver": "1.3.5",
"Node name for S&R": "UltralyticsDetectorProvider"
},
"widgets_values": [
"segm/person_yolov8m-seg.pt"
],
"color": "#232",
"bgcolor": "#353"
},
{
"id": 2,
"type": "LoadImage",
"pos": [
-192.01296976493634,
433.54398787774375
],
"size": [
288.15658006702404,
326
],
"flags": {},
"order": 1,
"mode": 0,
"inputs": [],
"outputs": [
{
"name": "IMAGE",
"type": "IMAGE",
"links": [
1
]
},
{
"name": "MASK",
"type": "MASK",
"links": null
}
],
"properties": {
"cnr_id": "comfy-core",
"ver": "0.3.71",
"Node name for S&R": "LoadImage"
},
"widgets_values": [
"1f421a11eb7f46ffcf970787036c5cc1.jpg",
"image"
]
},
{
"id": 5,
"type": "SegsToCombinedMask",
"pos": [
424.4134665014664,
275.10463657117793
],
"size": [
211.851171875,
26
],
"flags": {},
"order": 3,
"mode": 0,
"inputs": [
{
"name": "segs",
"type": "SEGS",
"link": 3
}
],
"outputs": [
{
"name": "MASK",
"type": "MASK",
"links": [
4
]
}
],
"properties": {
"cnr_id": "comfyui-impact-pack",
"ver": "61bd8397a18e7e7668e6a24e95168967768c2bed",
"Node name for S&R": "SegsToCombinedMask"
},
"color": "#232",
"bgcolor": "#353"
},
{
"id": 6,
"type": "MaskPreview",
"pos": [
679.5682861699395,
275.10463657117793
],
"size": [
294.93629499045346,
258
],
"flags": {},
"order": 5,
"mode": 0,
"inputs": [
{
"name": "mask",
"type": "MASK",
"link": 4
}
],
"outputs": [],
"properties": {
"cnr_id": "comfy-core",
"ver": "0.3.71",
"Node name for S&R": "MaskPreview"
},
"widgets_values": []
},
{
"id": 7,
"type": "SEGSPreview",
"pos": [
424.5080547233428,
380.8224702427784
],
"size": [
210,
314
],
"flags": {},
"order": 4,
"mode": 0,
"inputs": [
{
"name": "segs",
"type": "SEGS",
"link": 5
},
{
"name": "fallback_image_opt",
"shape": 7,
"type": "IMAGE",
"link": null
}
],
"outputs": [
{
"name": "IMAGE",
"shape": 6,
"type": "IMAGE",
"links": null
}
],
"properties": {
"cnr_id": "comfyui-impact-pack",
"ver": "61bd8397a18e7e7668e6a24e95168967768c2bed",
"Node name for S&R": "SEGSPreview"
},
"widgets_values": [
true,
0.2
]
},
{
"id": 1,
"type": "ImpactSimpleDetectorSEGS",
"pos": [
137.03559995799336,
275.10463657117793
],
"size": [
244.07421875,
310
],
"flags": {},
"order": 2,
"mode": 0,
"inputs": [
{
"name": "bbox_detector",
"type": "BBOX_DETECTOR",
"link": 2
},
{
"name": "image",
"type": "IMAGE",
"link": 1
},
{
"name": "sam_model_opt",
"shape": 7,
"type": "SAM_MODEL",
"link": null
},
{
"name": "segm_detector_opt",
"shape": 7,
"type": "SEGM_DETECTOR",
"link": null
}
],
"outputs": [
{
"name": "SEGS",
"type": "SEGS",
"links": [
3,
5
]
}
],
"properties": {
"cnr_id": "comfyui-impact-pack",
"ver": "61bd8397a18e7e7668e6a24e95168967768c2bed",
"Node name for S&R": "ImpactSimpleDetectorSEGS"
},
"widgets_values": [
0.5,
0,
3,
10,
0.5,
0,
0,
0.7,
0
],
"color": "#232",
"bgcolor": "#353"
}
],
"links": [
[
1,
2,
0,
1,
1,
"IMAGE"
],
[
2,
3,
0,
1,
0,
"BBOX_DETECTOR"
],
[
3,
1,
0,
5,
0,
"SEGS"
],
[
4,
5,
0,
6,
0,
"MASK"
],
[
5,
1,
0,
7,
0,
"SEGS"
]
],
"groups": [],
"config": {},
"extra": {
"ds": {
"scale": 1.0152559799477097,
"offset": [
292.0129697649363,
-175.10463657117793
]
},
"frontendVersion": "1.33.8",
"VHS_latentpreview": false,
"VHS_latentpreviewrate": 0,
"VHS_MetadataImage": true,
"VHS_KeepIntermediate": true
},
"version": 0.4
}
Suitable when high-speed processing is required and the target is fixed in advance (face detection, for example).
- How to get models: ComfyUI Manager -> Install Models -> search for "YOLO" to find various YOLO models beyond faces.
- I won't paste the link, but searching for "Adetailer" on Civitai also turns up models specialized for NSFW.
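For reference, this is roughly what UltralyticsDetectorProvider drives under the hood, sketched with the ultralytics package (the model file name is a placeholder):

from ultralytics import YOLO

model = YOLO("face_yolov8m.pt")          # one checkpoint per target type
results = model("input.jpg", conf=0.5)   # detection threshold

for box in results[0].boxes.xyxy:        # each BBOX is (x0, y0, x1, y1)
    print(box.tolist())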
Grounding DINO
Detects objects specified by text and outputs a BBOX.
Unlike YOLO, you can specify targets with arbitrary text such as "white dog" or "red car", and detect multiple kinds of objects at once, which makes it easy to use.
Since there is no node that runs Grounding DINO by itself, a workflow combining it with segmentation is introduced further below.
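For reference, a minimal sketch of the underlying call, assuming the reference groundingdino package (config and checkpoint paths are placeholders):

from groundingdino.util.inference import load_model, load_image, predict

model = load_model("GroundingDINO_SwinT_OGC.py", "groundingdino_swint_ogc.pth")
image_source, image = load_image("street.jpg")

# free-form text; separate multiple phrases with periods
boxes, logits, phrases = predict(
    model, image,
    caption="white dog . red car",
    box_threshold=0.35,
    text_threshold=0.25,
)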
Florence-2
Florence-2 is a vision-language model that understands images and describes them in text.
It can perform a variety of tasks such as caption generation, and one of them is object detection.

{
"id": "57b8cf9b-11ed-420b-be41-187510d36325",
"revision": 0,
"last_node_id": 9,
"last_link_id": 9,
"nodes": [
{
"id": 4,
"type": "PreviewImage",
"pos": [
500.84779414328955,
53.49562866388473
],
"size": [
357.987809336234,
366.9149013951313
],
"flags": {},
"order": 3,
"mode": 0,
"inputs": [
{
"name": "images",
"type": "IMAGE",
"link": 6
}
],
"outputs": [],
"properties": {
"cnr_id": "comfy-core",
"ver": "0.3.68",
"Node name for S&R": "PreviewImage"
},
"widgets_values": []
},
{
"id": 7,
"type": "DownloadAndLoadFlorence2Model",
"pos": [
-199.95852064582468,
506.0635940169577
],
"size": [
258.6021484375,
130
],
"flags": {},
"order": 0,
"mode": 0,
"inputs": [
{
"name": "lora",
"shape": 7,
"type": "PEFTLORA",
"link": null
}
],
"outputs": [
{
"name": "florence2_model",
"type": "FL2MODEL",
"links": [
7
]
}
],
"properties": {
"cnr_id": "comfyui-florence2",
"ver": "00b63382966a444a9fefacb65b8deb188d12a458",
"Node name for S&R": "DownloadAndLoadFlorence2Model"
},
"widgets_values": [
"microsoft/Florence-2-base-ft",
"fp16",
"sdpa",
true
],
"color": "#232",
"bgcolor": "#353"
},
{
"id": 9,
"type": "MaskPreview",
"pos": [
504.15530090191146,
487.1967803209515
],
"size": [
356.4644286534351,
363.80642544479423
],
"flags": {},
"order": 4,
"mode": 0,
"inputs": [
{
"name": "mask",
"type": "MASK",
"link": 9
}
],
"outputs": [],
"properties": {
"cnr_id": "comfy-core",
"ver": "0.3.71",
"Node name for S&R": "MaskPreview"
},
"widgets_values": []
},
{
"id": 6,
"type": "Florence2Run",
"pos": [
95.85142311428962,
53.49562866388473
],
"size": [
366.62910569436383,
364
],
"flags": {},
"order": 2,
"mode": 0,
"inputs": [
{
"name": "image",
"type": "IMAGE",
"link": 4
},
{
"name": "florence2_model",
"type": "FL2MODEL",
"link": 7
}
],
"outputs": [
{
"name": "image",
"type": "IMAGE",
"links": [
6
]
},
{
"name": "mask",
"type": "MASK",
"links": [
9
]
},
{
"name": "caption",
"type": "STRING",
"links": null
},
{
"name": "data",
"type": "JSON",
"links": null
}
],
"properties": {
"cnr_id": "comfyui-florence2",
"ver": "00b63382966a444a9fefacb65b8deb188d12a458",
"Node name for S&R": "Florence2Run"
},
"widgets_values": [
"Potted plant",
"caption_to_phrase_grounding",
true,
false,
1024,
3,
true,
"",
1234,
"fixed"
],
"color": "#232",
"bgcolor": "#353"
},
{
"id": 5,
"type": "LoadImage",
"pos": [
-232.51584222034649,
53.49562866388473
],
"size": [
290,
390
],
"flags": {},
"order": 1,
"mode": 0,
"inputs": [],
"outputs": [
{
"name": "IMAGE",
"type": "IMAGE",
"links": [
4
]
},
{
"name": "MASK",
"type": "MASK",
"links": null
}
],
"properties": {
"cnr_id": "comfy-core",
"ver": "0.3.68",
"Node name for S&R": "LoadImage"
},
"widgets_values": [
"ComfyUI_05189_.png",
"image"
]
}
],
"links": [
[
4,
5,
0,
6,
0,
"IMAGE"
],
[
6,
6,
0,
4,
0,
"IMAGE"
],
[
7,
7,
0,
6,
1,
"FL2MODEL"
],
[
9,
6,
1,
9,
0,
"MASK"
]
],
"groups": [],
"config": {},
"extra": {
"ds": {
"scale": 1.1167815779424781,
"offset": [
332.5158422203465,
46.50437133611527
]
},
"frontendVersion": "1.33.8",
"VHS_latentpreview": false,
"VHS_latentpreviewrate": 0,
"VHS_MetadataImage": true,
"VHS_KeepIntermediate": true
},
"version": 0.4
}
- Model: I don't notice much difference between them, so try a few; the selected model is downloaded automatically.
- Prompt: Describe the object you want to detect.
- task: caption_to_phrase_grounding
- output_mask_select: When several objects are detected, selects which of them to output (leave blank to output them all).
Suitable when you want to specify the target with complex sentence-level descriptions or leverage the model's language understanding (though it is slow).
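Because grounding returns BBOXes, the node's mask output is (as far as I can tell) just those boxes rasterized. Turning BBOXes into a MASK yourself is a few lines of numpy, assuming pixel-space (x0, y0, x1, y1) coordinates:

import numpy as np

def bboxes_to_mask(bboxes, height, width):
    # paint each (x0, y0, x1, y1) box as a filled white rectangle
    mask = np.zeros((height, width), dtype=np.float32)
    for x0, y0, x1, y1 in bboxes:
        mask[int(y0):int(y1), int(x0):int(x1)] = 1.0
    return mask

mask = bboxes_to_mask([(10, 20, 120, 200)], 512, 512)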
Matting
Services and features offered under the name "Background Removal" are essentially this.
You cannot specify an object, and the question "what exactly counts as background?" is left to the AI, so it works best when you simply want to drop the background, or when the boundary between foreground and background is clear.
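The gradation is the whole point: an alpha matte supports soft compositing rather than a hard cutout. The standard alpha blend, as a numpy sketch:

import numpy as np

def composite(fg, bg, alpha):
    # standard alpha blend: out = alpha * fg + (1 - alpha) * bg
    a = alpha[..., None]          # broadcast the (H, W) matte over RGB channels
    return a * fg + (1.0 - a) * bg

h, w = 64, 64
fg = np.ones((h, w, 3), dtype=np.float32)    # white foreground
bg = np.zeros((h, w, 3), dtype=np.float32)   # black background
alpha = np.linspace(0, 1, w, dtype=np.float32)[None, :].repeat(h, axis=0)
out = composite(fg, bg, alpha)               # smooth gradient, no hard edge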
BiRefNet
Probably the most widely used matting model. Speed and quality are both excellent, so it is a safe default.

{
"id": "57b8cf9b-11ed-420b-be41-187510d36325",
"revision": 0,
"last_node_id": 5,
"last_link_id": 3,
"nodes": [
{
"id": 5,
"type": "LoadImage",
"pos": [
-232.51584222034649,
53.49562866388473
],
"size": [
283.4437144886363,
493.72727272727275
],
"flags": {},
"order": 0,
"mode": 0,
"inputs": [],
"outputs": [
{
"name": "IMAGE",
"type": "IMAGE",
"links": [
3
]
},
{
"name": "MASK",
"type": "MASK",
"links": null
}
],
"properties": {
"cnr_id": "comfy-core",
"ver": "0.3.68",
"Node name for S&R": "LoadImage"
},
"widgets_values": [
"viewfilename=ComfyUI_temp_gzdac_00001_.png",
"image"
]
},
{
"id": 4,
"type": "PreviewImage",
"pos": [
500.8477941432896,
53.49562866388473
],
"size": [
352.3299825744998,
503.21998838299993
],
"flags": {},
"order": 2,
"mode": 0,
"inputs": [
{
"name": "images",
"type": "IMAGE",
"link": 2
}
],
"outputs": [],
"properties": {
"cnr_id": "comfy-core",
"ver": "0.3.68",
"Node name for S&R": "PreviewImage"
},
"widgets_values": []
},
{
"id": 3,
"type": "BiRefNetRMBG",
"pos": [
105.88783320578972,
53.49562866388473
],
"size": [
340,
254
],
"flags": {},
"order": 1,
"mode": 0,
"inputs": [
{
"name": "image",
"type": "IMAGE",
"link": 3
}
],
"outputs": [
{
"name": "IMAGE",
"type": "IMAGE",
"links": [
2
]
},
{
"name": "MASK",
"type": "MASK",
"links": null
},
{
"name": "MASK_IMAGE",
"type": "IMAGE",
"links": null
}
],
"properties": {
"cnr_id": "comfyui-rmbg",
"ver": "2.9.3",
"Node name for S&R": "BiRefNetRMBG"
},
"widgets_values": [
"BiRefNet-general",
0,
0,
false,
false,
"Color",
"#00ff00"
],
"color": "#222e40",
"bgcolor": "#364254"
}
],
"links": [
[
2,
3,
0,
4,
0,
"IMAGE"
],
[
3,
5,
0,
3,
0,
"IMAGE"
]
],
"groups": [],
"config": {},
"extra": {
"ds": {
"scale": 0.8390545288824014,
"offset": [
492.21940782589115,
157.34341313697843
]
},
"frontendVersion": "1.33.8",
"VHS_latentpreview": false,
"VHS_latentpreviewrate": 0,
"VHS_MetadataImage": true,
"VHS_KeepIntermediate": true
},
"version": 0.4
}
- If you set Background to Alpha, a transparent image with an alpha channel is output.
- Note: the output is then RGBA, so it may cause errors when fed into image2image etc. (see Mask & Alpha Channel).
There are several derivative models for particular domains, such as ToonOut, which is strong on anime images. Try a few.
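Regarding the RGBA note above: conceptually the fix is just splitting the alpha channel back out (a core node such as Split Image with Alpha does the equivalent):

import numpy as np

rgba = np.zeros((512, 512, 4), dtype=np.float32)  # stand-in for the node's RGBA output

rgb = rgba[..., :3]    # safe to feed into image2image
alpha = rgba[..., 3]   # can be reused as a MASK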
Segmentation
SAM (Segment Anything Model)
Currently the most famous segmentation model.
It has a strong prior for the shapes of things: point at a car or similar in a photo with a point or a box, and it finds the outline accurately and turns it into a mask.

This segments the object you point at with clicks; in practice, though, it is usually combined with object detection.
- Right-click the image node -> Open in SAM Detector
- Left-click the object you want to extract (right-click regions you want to exclude)
- Press Detect to generate the mask
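For reference, the point-prompt interaction above maps directly onto the original segment-anything API; a minimal sketch (the checkpoint name matches the SAMLoader default used later on this page):

import numpy as np
from PIL import Image
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")
predictor = SamPredictor(sam)
predictor.set_image(np.array(Image.open("input.jpg").convert("RGB")))

masks, scores, _ = predictor.predict(
    point_coords=np.array([[320, 240]]),  # left-click -> positive point
    point_labels=np.array([1]),           # 1 = include, 0 = exclude
)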
SAM is under active development; there are the original SAM, SAM 2, SAM 2.1, and SAM 3.
The latest version, SAM 3, accepts not only point and BBOX prompts but also text prompts. It is covered again below, but honestly, SAM 3 alone covers AI mask generation for still images.
Clothing / Body Parts Segmentation
Performs segmentation of specific parts such as "upper body", "skirt", "face", "hair".

{
"id": "207761f3-951e-495d-82e6-ba18f812bf62",
"revision": 0,
"last_node_id": 6,
"last_link_id": 4,
"nodes": [
{
"id": 4,
"type": "LoadImage",
"pos": [
-196.19169533724752,
147.27211328602687
],
"size": [
300.2159903749374,
523.434865885697
],
"flags": {},
"order": 0,
"mode": 0,
"inputs": [],
"outputs": [
{
"name": "IMAGE",
"type": "IMAGE",
"links": [
1
]
},
{
"name": "MASK",
"type": "MASK",
"links": null
}
],
"properties": {
"cnr_id": "comfy-core",
"ver": "0.3.71",
"Node name for S&R": "LoadImage"
},
"widgets_values": [
"ComfyUI_temp_jgbjo_00009_.png",
"image"
]
},
{
"id": 5,
"type": "PreviewImage",
"pos": [
554.1983967152759,
147.27211328602687
],
"size": [
279.6810290221624,
519.4029697754617
],
"flags": {},
"order": 2,
"mode": 0,
"inputs": [
{
"name": "images",
"type": "IMAGE",
"link": 3
}
],
"outputs": [],
"properties": {
"cnr_id": "comfy-core",
"ver": "0.3.71",
"Node name for S&R": "PreviewImage"
},
"widgets_values": []
},
{
"id": 1,
"type": "ClothesSegment",
"pos": [
159.1113458764829,
147.27211328602687
],
"size": [
340,
662
],
"flags": {},
"order": 1,
"mode": 0,
"inputs": [
{
"name": "images",
"type": "IMAGE",
"link": 1
}
],
"outputs": [
{
"name": "IMAGE",
"type": "IMAGE",
"links": [
3
]
},
{
"name": "MASK",
"type": "MASK",
"links": null
},
{
"name": "MASK_IMAGE",
"type": "IMAGE",
"links": null
}
],
"properties": {
"cnr_id": "comfyui-rmbg",
"ver": "2.9.4",
"Node name for S&R": "ClothesSegment"
},
"widgets_values": [
false,
false,
false,
false,
true,
false,
false,
false,
false,
false,
false,
false,
false,
true,
false,
false,
false,
false,
512,
0,
0,
false,
"Color",
"#00ff00"
],
"color": "#232",
"bgcolor": "#353"
}
],
"links": [
[
1,
4,
0,
1,
0,
"IMAGE"
],
[
3,
1,
0,
5,
0,
"IMAGE"
]
],
"groups": [],
"config": {},
"extra": {
"ds": {
"scale": 0.6934334949441355,
"offset": [
552.8853816068156,
29.159152850417545
]
},
"frontendVersion": "1.33.8",
"VHS_latentpreview": false,
"VHS_latentpreviewrate": 0,
"VHS_MetadataImage": true,
"VHS_KeepIntermediate": true
},
"version": 0.4
}
- Select the category you want to segment.
I used to rely on this a lot for tasks like changing clothes, but these days object detection + segmentation may be more versatile and perform better.
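As far as I can tell, the category checkboxes simply select which part masks get merged into the output; merging part masks yourself is a per-pixel maximum (logical OR):

import numpy as np

hair = np.zeros((512, 512), dtype=np.float32)    # stand-ins for per-part masks
skirt = np.zeros((512, 512), dtype=np.float32)

combined = np.maximum(hair, skirt)  # union of the selected parts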
Combining
By combining object detection, segmentation, and matting, more precise mask generation becomes possible.
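All of the combinations below follow the same pattern: the detector supplies a BBOX, and the segmenter refines it into a pixel-accurate mask. With the segment-anything API from the earlier sketch, that is a box prompt:

import numpy as np
from PIL import Image
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")
predictor = SamPredictor(sam)
predictor.set_image(np.array(Image.open("input.jpg").convert("RGB")))

box = np.array([100, 80, 400, 560])  # BBOX from YOLO / Grounding DINO / Florence-2
masks, scores, _ = predictor.predict(box=box)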
YOLO x SAM

{
"id": "ffcc6c64-e535-4685-ab04-be903b4cdf3c",
"revision": 0,
"last_node_id": 8,
"last_link_id": 6,
"nodes": [
{
"id": 3,
"type": "UltralyticsDetectorProvider",
"pos": [
-131.74129771892854,
275.10463657117793
],
"size": [
225.47324988344883,
100.20074983277442
],
"flags": {},
"order": 0,
"mode": 0,
"inputs": [],
"outputs": [
{
"name": "BBOX_DETECTOR",
"type": "BBOX_DETECTOR",
"links": [
2
]
},
{
"name": "SEGM_DETECTOR",
"type": "SEGM_DETECTOR",
"links": null
}
],
"properties": {
"cnr_id": "comfyui-impact-subpack",
"ver": "1.3.5",
"Node name for S&R": "UltralyticsDetectorProvider"
},
"widgets_values": [
"segm/person_yolov8m-seg.pt"
],
"color": "#232",
"bgcolor": "#353"
},
{
"id": 5,
"type": "SegsToCombinedMask",
"pos": [
424.4134665014664,
275.10463657117793
],
"size": [
211.851171875,
26
],
"flags": {},
"order": 4,
"mode": 0,
"inputs": [
{
"name": "segs",
"type": "SEGS",
"link": 3
}
],
"outputs": [
{
"name": "MASK",
"type": "MASK",
"links": [
4
]
}
],
"properties": {
"cnr_id": "comfyui-impact-pack",
"ver": "61bd8397a18e7e7668e6a24e95168967768c2bed",
"Node name for S&R": "SegsToCombinedMask"
},
"color": "#232",
"bgcolor": "#353"
},
{
"id": 6,
"type": "MaskPreview",
"pos": [
679.5682861699395,
275.10463657117793
],
"size": [
294.93629499045346,
258
],
"flags": {},
"order": 6,
"mode": 0,
"inputs": [
{
"name": "mask",
"type": "MASK",
"link": 4
}
],
"outputs": [],
"properties": {
"cnr_id": "comfy-core",
"ver": "0.3.71",
"Node name for S&R": "MaskPreview"
},
"widgets_values": []
},
{
"id": 7,
"type": "SEGSPreview",
"pos": [
424.5080547233428,
380.8224702427784
],
"size": [
210,
314
],
"flags": {},
"order": 5,
"mode": 0,
"inputs": [
{
"name": "segs",
"type": "SEGS",
"link": 5
},
{
"name": "fallback_image_opt",
"shape": 7,
"type": "IMAGE",
"link": null
}
],
"outputs": [
{
"name": "IMAGE",
"shape": 6,
"type": "IMAGE",
"links": null
}
],
"properties": {
"cnr_id": "comfyui-impact-pack",
"ver": "61bd8397a18e7e7668e6a24e95168967768c2bed",
"Node name for S&R": "SEGSPreview"
},
"widgets_values": [
true,
0.2
]
},
{
"id": 8,
"type": "SAMLoader",
"pos": [
-116.2680478354797,
435.37734731069196
],
"size": [
210,
82
],
"flags": {},
"order": 1,
"mode": 0,
"inputs": [],
"outputs": [
{
"name": "SAM_MODEL",
"type": "SAM_MODEL",
"links": [
6
]
}
],
"properties": {
"cnr_id": "comfyui-impact-pack",
"ver": "61bd8397a18e7e7668e6a24e95168967768c2bed",
"Node name for S&R": "SAMLoader"
},
"widgets_values": [
"sam_vit_b_01ec64.pth",
"AUTO"
],
"color": "#232",
"bgcolor": "#353"
},
{
"id": 2,
"type": "LoadImage",
"pos": [
-199.16827143603965,
581.4934848883244
],
"size": [
288.15658006702404,
326
],
"flags": {},
"order": 2,
"mode": 0,
"inputs": [],
"outputs": [
{
"name": "IMAGE",
"type": "IMAGE",
"links": [
1
]
},
{
"name": "MASK",
"type": "MASK",
"links": null
}
],
"properties": {
"cnr_id": "comfy-core",
"ver": "0.3.71",
"Node name for S&R": "LoadImage"
},
"widgets_values": [
"1f421a11eb7f46ffcf970787036c5cc1.jpg",
"image"
]
},
{
"id": 1,
"type": "ImpactSimpleDetectorSEGS",
"pos": [
137.03559995799336,
275.10463657117793
],
"size": [
244.07421875,
310
],
"flags": {},
"order": 3,
"mode": 0,
"inputs": [
{
"name": "bbox_detector",
"type": "BBOX_DETECTOR",
"link": 2
},
{
"name": "image",
"type": "IMAGE",
"link": 1
},
{
"name": "sam_model_opt",
"shape": 7,
"type": "SAM_MODEL",
"link": 6
},
{
"name": "segm_detector_opt",
"shape": 7,
"type": "SEGM_DETECTOR",
"link": null
}
],
"outputs": [
{
"name": "SEGS",
"type": "SEGS",
"links": [
3,
5
]
}
],
"properties": {
"cnr_id": "comfyui-impact-pack",
"ver": "61bd8397a18e7e7668e6a24e95168967768c2bed",
"Node name for S&R": "ImpactSimpleDetectorSEGS"
},
"widgets_values": [
0.5,
0,
3,
10,
0.5,
0,
0,
0.7,
0
],
"color": "#232",
"bgcolor": "#353"
}
],
"links": [
[
1,
2,
0,
1,
1,
"IMAGE"
],
[
2,
3,
0,
1,
0,
"BBOX_DETECTOR"
],
[
3,
1,
0,
5,
0,
"SEGS"
],
[
4,
5,
0,
6,
0,
"MASK"
],
[
5,
1,
0,
7,
0,
"SEGS"
],
[
6,
8,
0,
1,
2,
"SAM_MODEL"
]
],
"groups": [],
"config": {},
"extra": {
"ds": {
"scale": 0.839054528882405,
"offset": [
431.4600310048111,
-114.3219362287694
]
},
"frontendVersion": "1.33.8",
"VHS_latentpreview": false,
"VHS_latentpreviewrate": 0,
"VHS_MetadataImage": true,
"VHS_KeepIntermediate": true
},
"version": 0.4
}
A combination of high-speed YOLO detection (here a person model) and the original SAM.
Grounding DINO x SAM

{
"id": "45213769-31e7-40a4-9027-26c67d437c51",
"revision": 0,
"last_node_id": 6,
"last_link_id": 4,
"nodes": [
{
"id": 4,
"type": "LoadImage",
"pos": [
-84.57715485740746,
436.65995789100543
],
"size": [
306.56906795083313,
543.6425774433825
],
"flags": {},
"order": 0,
"mode": 0,
"inputs": [],
"outputs": [
{
"name": "IMAGE",
"type": "IMAGE",
"links": [
1
]
},
{
"name": "MASK",
"type": "MASK",
"links": null
}
],
"properties": {
"cnr_id": "comfy-core",
"ver": "0.3.71",
"Node name for S&R": "LoadImage"
},
"widgets_values": [
"pexels-photo-14705585.jpg",
"image"
]
},
{
"id": 2,
"type": "SegmentV2",
"pos": [
270.53229781565096,
436.65995789100543
],
"size": [
340,
332
],
"flags": {},
"order": 1,
"mode": 0,
"inputs": [
{
"name": "image",
"type": "IMAGE",
"link": 1
}
],
"outputs": [
{
"name": "IMAGE",
"type": "IMAGE",
"links": [
3
]
},
{
"name": "MASK",
"type": "MASK",
"links": null
},
{
"name": "MASK_IMAGE",
"type": "IMAGE",
"links": null
}
],
"properties": {
"cnr_id": "comfyui-rmbg",
"ver": "2.9.4",
"Node name for S&R": "SegmentV2"
},
"widgets_values": [
"horse",
"sam_hq_vit_h (2.57GB)",
"GroundingDINO_SwinT_OGC (694MB)",
0.35,
0,
0,
false,
"Color",
"#00ff00"
],
"color": "#232",
"bgcolor": "#353"
},
{
"id": 5,
"type": "PreviewImage",
"pos": [
659.0726825378763,
436.65995789100543
],
"size": [
332.83609638042526,
541.6899599010097
],
"flags": {},
"order": 2,
"mode": 0,
"inputs": [
{
"name": "images",
"type": "IMAGE",
"link": 3
}
],
"outputs": [],
"properties": {
"cnr_id": "comfy-core",
"ver": "0.3.71",
"Node name for S&R": "PreviewImage"
},
"widgets_values": []
}
],
"links": [
[
1,
4,
0,
2,
0,
"IMAGE"
],
[
3,
2,
0,
5,
0,
"IMAGE"
]
],
"groups": [],
"config": {},
"extra": {
"ds": {
"scale": 0.7627768444385543,
"offset": [
184.57715485740746,
-336.65995789100543
]
},
"frontendVersion": "1.33.8",
"VHS_latentpreview": false,
"VHS_latentpreviewrate": 0,
"VHS_MetadataImage": true,
"VHS_KeepIntermediate": true
},
"version": 0.4
}
A combination of Grounding DINO and HQ-SAM, an improved variant of SAM.
It is one of the most used combinations because you can specify the target by text and still get high-precision masks.
Florence2 x SAM2

{
"id": "b13968f1-cfe5-4646-9f22-ac07831aae2b",
"revision": 0,
"last_node_id": 33,
"last_link_id": 41,
"nodes": [
{
"id": 27,
"type": "DownloadAndLoadFlorence2Model",
"pos": [
797.5498046875,
435.3081359863281
],
"size": [
270,
130
],
"flags": {},
"order": 0,
"mode": 0,
"inputs": [
{
"name": "lora",
"shape": 7,
"type": "PEFTLORA",
"link": null
}
],
"outputs": [
{
"name": "florence2_model",
"type": "FL2MODEL",
"links": [
28
]
}
],
"properties": {
"cnr_id": "comfyui-florence2",
"ver": "de485b65b3e1b9b887ab494afa236dff4bef9a7e",
"Node name for S&R": "DownloadAndLoadFlorence2Model"
},
"widgets_values": [
"microsoft/Florence-2-base",
"fp16",
"sdpa",
true
],
"color": "#232",
"bgcolor": "#353"
},
{
"id": 30,
"type": "Florence2toCoordinates",
"pos": [
1548.1920166015625,
275.46484375
],
"size": [
270,
102
],
"flags": {},
"order": 4,
"mode": 0,
"inputs": [
{
"name": "data",
"type": "JSON",
"link": 36
}
],
"outputs": [
{
"name": "center_coordinates",
"type": "STRING",
"links": null
},
{
"name": "bboxes",
"type": "BBOX",
"links": [
37
]
}
],
"properties": {
"cnr_id": "ComfyUI-segment-anything-2",
"ver": "c59676b008a76237002926f684d0ca3a9b29ac54",
"Node name for S&R": "Florence2toCoordinates"
},
"widgets_values": [
"0",
false
],
"color": "#432",
"bgcolor": "#653"
},
{
"id": 16,
"type": "LoadImage",
"pos": [
797.5498046875,
-13.30300235748291
],
"size": [
270,
392.65997314453125
],
"flags": {},
"order": 1,
"mode": 0,
"inputs": [],
"outputs": [
{
"name": "IMAGE",
"type": "IMAGE",
"links": [
26,
34,
41
]
},
{
"name": "MASK",
"type": "MASK",
"links": null
}
],
"properties": {
"cnr_id": "comfy-core",
"ver": "0.3.39",
"Node name for S&R": "LoadImage"
},
"widgets_values": [
"Clipboard - 2025-05-13 21.27.11.png",
"image"
]
},
{
"id": 29,
"type": "InvertMask",
"pos": [
2183.08349609375,
215.1739044189453
],
"size": [
140,
26
],
"flags": {},
"order": 6,
"mode": 0,
"inputs": [
{
"name": "mask",
"type": "MASK",
"link": 38
}
],
"outputs": [
{
"name": "MASK",
"type": "MASK",
"links": [
35
]
}
],
"properties": {
"cnr_id": "comfy-core",
"ver": "0.3.39",
"Node name for S&R": "InvertMask"
},
"widgets_values": []
},
{
"id": 23,
"type": "PreviewImage",
"pos": [
2585.65771484375,
-6.269532203674316
],
"size": [
374.6875305175781,
390.1878356933594
],
"flags": {},
"order": 8,
"mode": 0,
"inputs": [
{
"name": "images",
"type": "IMAGE",
"link": 32
}
],
"outputs": [],
"properties": {
"cnr_id": "comfy-core",
"ver": "0.3.39",
"Node name for S&R": "PreviewImage"
},
"widgets_values": []
},
{
"id": 32,
"type": "Sam2Segmentation",
"pos": [
1870.6756591796875,
216.38262939453125
],
"size": [
272.087890625,
182
],
"flags": {},
"order": 5,
"mode": 0,
"inputs": [
{
"name": "sam2_model",
"type": "SAM2MODEL",
"link": 40
},
{
"name": "image",
"type": "IMAGE",
"link": 41
},
{
"name": "coordinates_positive",
"shape": 7,
"type": "STRING",
"link": null
},
{
"name": "coordinates_negative",
"shape": 7,
"type": "STRING",
"link": null
},
{
"name": "bboxes",
"shape": 7,
"type": "BBOX",
"link": 37
},
{
"name": "mask",
"shape": 7,
"type": "MASK",
"link": null
}
],
"outputs": [
{
"name": "mask",
"type": "MASK",
"links": [
38
]
}
],
"properties": {
"cnr_id": "ComfyUI-segment-anything-2",
"ver": "c59676b008a76237002926f684d0ca3a9b29ac54",
"Node name for S&R": "Sam2Segmentation"
},
"widgets_values": [
true,
false
],
"color": "#432",
"bgcolor": "#653"
},
{
"id": 28,
"type": "JoinImageWithAlpha",
"pos": [
2368.4716796875,
-6.269532203674316
],
"size": [
176.86484375,
46
],
"flags": {},
"order": 7,
"mode": 0,
"inputs": [
{
"name": "image",
"type": "IMAGE",
"link": 34
},
{
"name": "alpha",
"type": "MASK",
"link": 35
}
],
"outputs": [
{
"name": "IMAGE",
"type": "IMAGE",
"links": [
32
]
}
],
"properties": {
"cnr_id": "comfy-core",
"ver": "0.3.39",
"Node name for S&R": "JoinImageWithAlpha"
},
"widgets_values": []
},
{
"id": 33,
"type": "DownloadAndLoadSAM2Model",
"pos": [
1548.1920166015625,
82.7560043334961
],
"size": [
270,
130
],
"flags": {},
"order": 2,
"mode": 0,
"inputs": [],
"outputs": [
{
"name": "sam2_model",
"type": "SAM2MODEL",
"links": [
40
]
}
],
"properties": {
"cnr_id": "ComfyUI-segment-anything-2",
"ver": "c59676b008a76237002926f684d0ca3a9b29ac54",
"Node name for S&R": "DownloadAndLoadSAM2Model"
},
"widgets_values": [
"sam2.1_hiera_base_plus.safetensors",
"single_image",
"cuda",
"fp16"
],
"color": "#432",
"bgcolor": "#653"
},
{
"id": 25,
"type": "Florence2Run",
"pos": [
1107.8709716796875,
74.4581298828125
],
"size": [
400,
364
],
"flags": {},
"order": 3,
"mode": 0,
"inputs": [
{
"name": "image",
"type": "IMAGE",
"link": 26
},
{
"name": "florence2_model",
"type": "FL2MODEL",
"link": 28
}
],
"outputs": [
{
"name": "image",
"type": "IMAGE",
"links": []
},
{
"name": "mask",
"type": "MASK",
"links": []
},
{
"name": "caption",
"type": "STRING",
"links": null
},
{
"name": "data",
"type": "JSON",
"links": [
36
]
}
],
"properties": {
"cnr_id": "comfyui-florence2",
"ver": "de485b65b3e1b9b887ab494afa236dff4bef9a7e",
"Node name for S&R": "Florence2Run"
},
"widgets_values": [
"goldfish",
"caption_to_phrase_grounding",
true,
false,
1024,
3,
true,
"",
1234,
"fixed"
],
"color": "#232",
"bgcolor": "#353"
}
],
"links": [
[
26,
16,
0,
25,
0,
"IMAGE"
],
[
28,
27,
0,
25,
1,
"FL2MODEL"
],
[
32,
28,
0,
23,
0,
"IMAGE"
],
[
34,
16,
0,
28,
0,
"IMAGE"
],
[
35,
29,
0,
28,
1,
"MASK"
],
[
36,
25,
3,
30,
0,
"JSON"
],
[
37,
30,
1,
32,
4,
"BBOX"
],
[
38,
32,
0,
29,
0,
"MASK"
],
[
40,
33,
0,
32,
0,
"SAM2MODEL"
],
[
41,
16,
0,
32,
1,
"IMAGE"
]
],
"groups": [],
"config": {},
"extra": {
"ds": {
"scale": 0.620921323059155,
"offset": [
-697.5498046875,
113.30300235748291
]
},
"reroutes": [
{
"id": 1,
"pos": [
1829.7442626953125,
3.2779242992401123
],
"linkIds": [
34,
41
]
}
],
"linkExtensions": [
{
"id": 34,
"parentId": 1
},
{
"id": 41,
"parentId": 1
}
],
"frontendVersion": "1.33.8",
"VHS_latentpreview": false,
"VHS_latentpreviewrate": 0,
"VHS_MetadataImage": true,
"VHS_KeepIntermediate": true
},
"version": 0.4
}
A combination of Florence-2 and SAM 2.1.
For straightforward targets such as a person or an animal almost anything works, but when you need complex conditions like "a man wearing sunglasses" or "a cat lying under a tree", language-model-based pipelines like this show their strength.
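For reference, the same box-prompted call against SAM 2.1 via the sam2 package, as a sketch (the model id is an example from the official releases):

import numpy as np
import torch
from PIL import Image
from sam2.sam2_image_predictor import SAM2ImagePredictor

predictor = SAM2ImagePredictor.from_pretrained("facebook/sam2.1-hiera-base-plus")
with torch.inference_mode():
    predictor.set_image(np.array(Image.open("input.jpg").convert("RGB")))
    # box from Florence-2, as (x0, y0, x1, y1)
    masks, scores, _ = predictor.predict(box=np.array([100, 80, 400, 560]))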
SAM 3

{
"id": "45213769-31e7-40a4-9027-26c67d437c51",
"revision": 0,
"last_node_id": 11,
"last_link_id": 11,
"nodes": [
{
"id": 6,
"type": "PreviewImage",
"pos": [
410.5883107288138,
420.92796486120585
],
"size": [
597.0143975156826,
437.7992150216443
],
"flags": {},
"order": 2,
"mode": 0,
"inputs": [
{
"name": "images",
"type": "IMAGE",
"link": 4
}
],
"outputs": [],
"properties": {
"cnr_id": "comfy-core",
"ver": "0.3.71",
"Node name for S&R": "PreviewImage"
},
"widgets_values": []
},
{
"id": 4,
"type": "LoadImage",
"pos": [
-513.4050648613645,
420.92796486120585
],
"size": [
507.5333607299855,
441.38462274968504
],
"flags": {},
"order": 0,
"mode": 0,
"inputs": [],
"outputs": [
{
"name": "IMAGE",
"type": "IMAGE",
"links": [
11
]
},
{
"name": "MASK",
"type": "MASK",
"links": null
}
],
"properties": {
"cnr_id": "comfy-core",
"ver": "0.3.71",
"Node name for S&R": "LoadImage"
},
"widgets_values": [
"pasted/image (34).png",
"image"
]
},
{
"id": 3,
"type": "SAM3Segment",
"pos": [
32.358303298717374,
420.92796486120585
],
"size": [
340,
332
],
"flags": {},
"order": 1,
"mode": 0,
"inputs": [
{
"name": "image",
"type": "IMAGE",
"link": 11
}
],
"outputs": [
{
"name": "IMAGE",
"type": "IMAGE",
"links": [
4
]
},
{
"name": "MASK",
"type": "MASK",
"links": null
},
{
"name": "MASK_IMAGE",
"type": "IMAGE",
"links": null
}
],
"properties": {
"cnr_id": "comfyui-rmbg",
"ver": "2.9.4",
"Node name for S&R": "SAM3Segment"
},
"widgets_values": [
"a woman wearing an apron",
"sam3",
"Auto",
0.5,
0,
0,
false,
"Color",
"#00ff00"
],
"color": "#232",
"bgcolor": "#353"
}
],
"links": [
[
4,
3,
0,
6,
0,
"IMAGE"
],
[
11,
4,
0,
3,
0,
"IMAGE"
]
],
"groups": [],
"config": {},
"extra": {
"ds": {
"scale": 1.0152559799477263,
"offset": [
613.4050648613645,
-320.92796486120585
]
},
"frontendVersion": "1.33.8",
"VHS_latentpreview": false,
"VHS_latentpreviewrate": 0,
"VHS_MetadataImage": true,
"VHS_KeepIntermediate": true
},
"version": 0.4
}
The latest version of SAM also accepts text prompts, so it performs object detection and segmentation in a single step.
Accuracy, quality, and speed are all excellent, so this is the one to reach for (´ε` )
If you want to do something more complex, try custom nodes such as Ltamann/ComfyUI-TBG-SAM3.
SAM 3 x BiRefNet

{
"id": "5231bbde-3d9e-483d-9963-63165fedc646",
"revision": 0,
"last_node_id": 12,
"last_link_id": 18,
"nodes": [
{
"id": 2,
"type": "PreviewImage",
"pos": [
1836.5379900055684,
293.7408968602474
],
"size": [
554.9600255276209,
422.8923553539689
],
"flags": {},
"order": 4,
"mode": 0,
"inputs": [
{
"name": "images",
"type": "IMAGE",
"link": 17
}
],
"outputs": [],
"properties": {
"cnr_id": "comfy-core",
"ver": "0.3.71",
"Node name for S&R": "PreviewImage"
},
"widgets_values": []
},
{
"id": 1,
"type": "LoadImage",
"pos": [
477.2842309638515,
293.7408968602474
],
"size": [
526.1926943110356,
491.5335516952887
],
"flags": {},
"order": 0,
"mode": 0,
"inputs": [],
"outputs": [
{
"name": "IMAGE",
"type": "IMAGE",
"links": [
18
]
},
{
"name": "MASK",
"type": "MASK",
"links": null
}
],
"properties": {
"cnr_id": "comfy-core",
"ver": "0.3.71",
"Node name for S&R": "LoadImage"
},
"widgets_values": [
"pasted/image (35).png",
"image"
]
},
{
"id": 11,
"type": "BiRefNetRMBG",
"pos": [
1445.5176350953413,
293.7408968602474
],
"size": [
340,
254
],
"flags": {},
"order": 3,
"mode": 0,
"inputs": [
{
"name": "image",
"type": "IMAGE",
"link": 16
}
],
"outputs": [
{
"name": "IMAGE",
"type": "IMAGE",
"links": [
17
]
},
{
"name": "MASK",
"type": "MASK",
"links": null
},
{
"name": "MASK_IMAGE",
"type": "IMAGE",
"links": null
}
],
"properties": {
"cnr_id": "comfyui-rmbg",
"ver": "2.9.4",
"Node name for S&R": "BiRefNetRMBG"
},
"widgets_values": [
"BiRefNet-general",
0,
0,
false,
false,
"Alpha",
"#222222"
],
"color": "#323",
"bgcolor": "#535"
},
{
"id": 5,
"type": "PreviewImage",
"pos": [
1448.15746204173,
611.2211523676546
],
"size": [
332.392016078781,
258
],
"flags": {},
"order": 2,
"mode": 0,
"inputs": [
{
"name": "images",
"type": "IMAGE",
"link": 4
}
],
"outputs": [],
"properties": {
"cnr_id": "comfy-core",
"ver": "0.3.71",
"Node name for S&R": "PreviewImage"
},
"widgets_values": []
},
{
"id": 4,
"type": "SAM3Segment",
"pos": [
1054.497280185114,
293.7408968602474
],
"size": [
340,
332
],
"flags": {},
"order": 1,
"mode": 0,
"inputs": [
{
"name": "image",
"type": "IMAGE",
"link": 18
}
],
"outputs": [
{
"name": "IMAGE",
"type": "IMAGE",
"links": [
4,
16
]
},
{
"name": "MASK",
"type": "MASK",
"links": []
},
{
"name": "MASK_IMAGE",
"type": "IMAGE",
"links": null
}
],
"properties": {
"cnr_id": "comfyui-rmbg",
"ver": "2.9.4",
"Node name for S&R": "SAM3Segment"
},
"widgets_values": [
"the woman on the right",
"sam3",
"Auto",
0.5,
0,
7,
false,
"Color",
"#00ff00"
],
"color": "#432",
"bgcolor": "#653"
}
],
"links": [
[
4,
4,
0,
5,
0,
"IMAGE"
],
[
16,
4,
0,
11,
0,
"IMAGE"
],
[
17,
11,
0,
2,
0,
"IMAGE"
],
[
18,
1,
0,
4,
0,
"IMAGE"
]
],
"groups": [],
"config": {},
"extra": {
"ds": {
"scale": 0.8390545288824087,
"offset": [
-377.2842309638515,
-193.7408968602474
]
},
"frontendVersion": "1.33.8",
"VHS_latentpreview": false,
"VHS_latentpreviewrate": 0,
"VHS_MetadataImage": true,
"VHS_KeepIntermediate": true
},
"version": 0.4
}
Segmentation is fundamentally about telling objects apart; it is not designed for fine cutouts.
Matting, on the other hand, handles fine detail such as hair and semi-transparency such as glass.
Combining the two lets each make up for the other's weaknesses: segmentation picks the object, matting refines its edges.
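A toy numpy sketch of that idea: the segmentation mask chooses the object, the matte supplies the soft edges, and their product restricts the matte to the chosen object:

import numpy as np

seg = np.zeros((512, 512), dtype=np.float32)
seg[:, 256:] = 1.0                                   # e.g. SAM 3 picked the person on the right

matte = np.full((512, 512), 0.9, dtype=np.float32)   # e.g. BiRefNet's soft alpha

refined = seg * matte  # soft edges, limited to the selected object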