AI Mask Generation

Inpainting and similar workflows often need masks, but drawing them by hand or preparing mask images every time is tedious. More importantly, a manual step like that cannot be automated.

However, few techniques can take an instruction like "mask this part" and reliably produce a clean mask on their own.

In practice, you combine several AI techniques.

  • Object Detection - Finds where the target is in the image.
  • Segmentation - Cuts out the target shape as a mask.
  • Matting - Handles the boundary between foreground and background in more detail.

For example, you might use object detection to find the target, then pass that result to segmentation to turn it into a mask.
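The detection-then-segmentation chain can be sketched in plain Python. Everything here is a toy stand-in: `detect_bright_region` and `segment_in_box` are hypothetical helpers that use simple brightness thresholds instead of a real detector or segmentation model, and the image is a small grid of grayscale values. Only the data flow (image, then box, then mask) is the point.

```python
from typing import List, Optional, Tuple

Image = List[List[int]]            # grayscale values, 0-255
Mask = List[List[int]]             # 1 = masked, 0 = not masked
BBox = Tuple[int, int, int, int]   # assumed layout: (x_min, y_min, x_max, y_max), max exclusive


def detect_bright_region(image: Image, threshold: int = 100) -> Optional[BBox]:
    """Toy 'detector': bounding box around every pixel brighter than threshold."""
    xs = [x for row in image for x, v in enumerate(row) if v > threshold]
    ys = [y for y, row in enumerate(image) if any(v > threshold for v in row)]
    if not xs:
        return None  # nothing detected
    return (min(xs), min(ys), max(xs) + 1, max(ys) + 1)


def segment_in_box(image: Image, box: BBox, threshold: int = 200) -> Mask:
    """Toy 'segmenter': inside the box, keep only pixels above a stricter threshold."""
    x0, y0, x1, y1 = box
    return [
        [1 if (x0 <= x < x1 and y0 <= y < y1 and v > threshold) else 0
         for x, v in enumerate(row)]
        for y, row in enumerate(image)
    ]


def generate_mask(image: Image) -> Mask:
    """Pass the detection result (a box) to segmentation, which returns the mask."""
    box = detect_bright_region(image)
    if box is None:
        return [[0] * len(row) for row in image]  # empty mask when nothing is found
    return segment_in_box(image, box)


image = [
    [0,   0,   0, 0],
    [0, 150, 255, 0],
    [0, 255, 255, 0],
    [0,   0,   0, 0],
]
mask = generate_mask(image)
```

In this toy case the box covers the whole 2x2 bright area, while the mask drops the dimmer corner pixel, which mirrors how segmentation refines a coarse detection.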

Let's look at the main techniques.


Object Detection

As the name suggests, object detection identifies the position of a specific object in an image and outputs a rectangular area called a bounding box (BBOX).
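To make the idea of a BBOX concrete, here is a minimal sketch that turns one into a rectangular mask. The `(x_min, y_min, x_max, y_max)` layout with exclusive maximums is an assumption for illustration; real tools use various conventions (for example center plus width and height), and `bbox_to_mask` is a hypothetical helper, not part of any library.

```python
from typing import List, Tuple

BBox = Tuple[int, int, int, int]  # assumed layout: (x_min, y_min, x_max, y_max), max exclusive


def bbox_to_mask(box: BBox, width: int, height: int) -> List[List[int]]:
    """Return a height x width mask with 1 inside the box and 0 outside."""
    x0, y0, x1, y1 = box
    # Clamp the box to the image so out-of-range coordinates are ignored.
    x0, x1 = max(0, x0), min(width, x1)
    y0, y1 = max(0, y0), min(height, y1)
    return [
        [1 if (x0 <= x < x1 and y0 <= y < y1) else 0 for x in range(width)]
        for y in range(height)
    ]


rect_mask = bbox_to_mask((1, 0, 3, 2), width=4, height=3)
```

A mask like this is only a rectangle, which is why a box is usually handed to a segmentation model instead of being used as the final mask.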

YOLO Family

YOLO is an extremely fast detection technique designed for real-time object detection.

A YOLO model can only detect the classes it was trained on, and in practice a separate model is usually prepared for each target, such as faces or hands. If no model exists for your target, you need to train one yourself, which makes YOLO a poor fit when you want to detect arbitrary or many different categories.

In exchange, it is very lightweight, which makes it a good choice when processing speed matters.

Grounding DINO and Others

Grounding DINO detects objects specified by text and outputs BBOXes.

Unlike YOLO, you can specify objects with text such as "white dog" or "red car", so it is easy to use and can detect multiple objects at the same time.

VLM / MLLM

VLMs (vision-language models) and MLLMs (multimodal LLMs) are language models that can take images as input.

They can do many things, such as caption generation, and some of them can also perform object detection.

A representative older example is Florence-2.

It is slow, but it understands both text and image content well, so you can specify targets with complex text such as "the woman on the right side of the image wearing a blue hat."


Matting

Many tools labeled "background removal" are actually performing matting.

Matting separates the foreground from the background, and can handle fine boundaries such as hair and semi-transparent areas.

However, unlike segmentation, it is not meant for picking out one particular object and cutting it out.
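The difference between a matte and a hard mask can be shown with a small sketch. It assumes the usual matting model, where each pixel is a blend `alpha * foreground + (1 - alpha) * background` with alpha between 0 and 1. The helper names `composite` and `binarize` are hypothetical, and the one-row "images" are toy data.

```python
from typing import List

Grid = List[List[float]]


def composite(fg: Grid, bg: Grid, alpha: Grid) -> Grid:
    """Blend foreground over background using a soft alpha matte."""
    return [
        [a * f + (1 - a) * b for f, b, a in zip(f_row, b_row, a_row)]
        for f_row, b_row, a_row in zip(fg, bg, alpha)
    ]


def binarize(alpha: Grid, threshold: float = 0.5) -> List[List[int]]:
    """Collapse a soft matte into a hard 0/1 mask, losing the partial coverage."""
    return [[1 if a >= threshold else 0 for a in row] for row in alpha]


alpha = [[0.0, 0.5, 1.0]]        # background, half-covered edge (e.g. hair), foreground
fg = [[200.0, 200.0, 200.0]]
bg = [[0.0, 0.0, 0.0]]

blended = composite(fg, bg, alpha)   # the edge pixel is a mix of both layers
hard = binarize(alpha)               # the edge pixel is forced to fully "in"
```

The middle pixel keeps its half coverage in the matte but becomes fully foreground in the hard mask, which is the detail that matting preserves around hair and semi-transparent areas.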

BiRefNet

The detailed usage is covered on the BiRefNet page.


Segmentation

SAM (Segment Anything Model)

SAM is currently the best-known segmentation model.

It understands the shape of objects, so if you indicate a car in a photo with points or boxes (or, in newer versions such as SAM 3, with text), it can find the outline and turn it into a mask.

The current latest model is covered on the SAM 3 / 3.1 page.


Practical Examples

Let's combine the techniques above to generate masks for arbitrary text prompts or categories.

The workflows below were commonly used before SAM 3. If your goal is target-specified segmentation, start with SAM 3 / 3.1 now.

They remain here as references for understanding older workflows or reproducing the same setup in an existing environment.

Required Custom Nodes

These custom nodes may be needed to run the practical examples on this page.

YOLO x SAM

YOLO_face-SAM.json

This combines fast YOLO face detection with the original SAM.

Grounding DINO x SAM

Grounding_DINO_HQ-SAM.json

This combines Grounding DINO with HQ-SAM, an improved version of SAM.

It can specify targets by text and generate high-precision masks, so it was one of the most commonly used combinations.

Florence2 x SAM2

Florence2_SAM2.1.json

This combines Florence-2 and SAM 2.1.

For easy targets such as people or animals, many methods work fine. But when you want to specify a complex condition like "a man wearing sunglasses" or "a cat lying under a tree", this kind of VLM-based model is useful.

SAM 3 x BiRefNet

SAM3_BiRefNet.json

Segmentation is good at telling objects apart, but not at fine cutouts.

By contrast, matting can handle fine details like hair and semi-transparent objects like glass.

Combining them lets you take advantage of both.
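One plausible way to combine the two outputs is to keep the matte's soft alpha values only where the segmentation mask says the target is. This is a sketch of the idea, not a description of what the SAM3_BiRefNet.json workflow does internally; `combine` is a hypothetical helper and the one-row arrays are toy data.

```python
from typing import List


def combine(seg_mask: List[List[int]], alpha: List[List[float]]) -> List[List[float]]:
    """Keep soft alpha values inside the segmented object, zero everything else."""
    return [
        [a if m else 0.0 for m, a in zip(m_row, a_row)]
        for m_row, a_row in zip(seg_mask, alpha)
    ]


seg_mask = [[1, 1, 0, 0]]            # segmentation: which pixels belong to the chosen object
alpha = [[1.0, 0.4, 0.3, 0.0]]       # matting: soft foreground coverage for the whole image

refined = combine(seg_mask, alpha)
```

Segmentation decides which object is kept, and matting supplies the soft edge values. In practice the segmentation mask would likely need to be expanded a little first, so that fine detail just outside its outline is not cut off, as happens to the third pixel here.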