What is Florence-2?
Florence-2 is a general-purpose vision-language model (VLM) that can handle multiple tasks, such as caption generation, object detection, segmentation, and OCR, with a single model, all from a single input image.
On this page, we focus on four tasks commonly used in ComfyUI: "Caption Generation", "Object Detection (Coordinate Extraction)", "OCR", and "Q&A about Images".
Custom Node
- kijai/ComfyUI-Florence2
- The model is downloaded automatically on the first run.
Florence2Run Node
Florence2Run is the main node for running Florence-2 tasks on an input image. By switching the task widget, you can choose between caption generation, object detection, OCR, and other functions.
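Under the hood, Florence-2 selects its task via a special prompt token prepended to the text input. The sketch below is a hedged illustration of how a ComfyUI-style task name could map to the task prompts listed on the Florence-2 model card; the exact mapping used inside Florence2Run may differ, and `build_prompt` is a hypothetical helper, not part of the node.

```python
# Hedged sketch: Florence-2 task names -> task prompt tokens.
# Token names are taken from the Florence-2 model card; the internal
# mapping in the Florence2Run node is an assumption here.
TASK_PROMPTS = {
    "caption": "<CAPTION>",
    "detailed_caption": "<DETAILED_CAPTION>",
    "more_detailed_caption": "<MORE_DETAILED_CAPTION>",
    "caption_to_phrase_grounding": "<CAPTION_TO_PHRASE_GROUNDING>",
    "ocr": "<OCR>",
}

def build_prompt(task: str, text_input: str = "") -> str:
    """Combine the task token with optional free text (e.g. a phrase to ground)."""
    return TASK_PROMPTS[task] + text_input

print(build_prompt("caption_to_phrase_grounding", "fox"))
# <CAPTION_TO_PHRASE_GROUNDING>fox
```

This is why caption tasks leave the node's text input empty, while grounding (and Q&A) tasks put the phrase or question there.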
caption, detailed caption
Generates a natural language caption from the image.

{
"id": "063054af-873b-492c-a642-b59c68b22c0b",
"revision": 0,
"last_node_id": 12,
"last_link_id": 13,
"nodes": [
{
"id": 4,
"type": "DownloadAndLoadFlorence2Model",
"pos": [
349.41423462195155,
229.87996065705917
],
"size": [
286.86661124741727,
130
],
"flags": {},
"order": 0,
"mode": 0,
"inputs": [
{
"name": "lora",
"shape": 7,
"type": "PEFTLORA",
"link": null
}
],
"outputs": [
{
"name": "florence2_model",
"type": "FL2MODEL",
"links": [
3
]
}
],
"properties": {
"cnr_id": "comfyui-florence2",
"ver": "00b63382966a444a9fefacb65b8deb188d12a458",
"Node name for S&R": "DownloadAndLoadFlorence2Model"
},
"widgets_values": [
"microsoft/Florence-2-base-ft",
"fp16",
"sdpa",
true
],
"color": "#232",
"bgcolor": "#353"
},
{
"id": 1,
"type": "Florence2Run",
"pos": [
674.4302630294422,
423.43518886551453
],
"size": [
313.6363636363636,
364
],
"flags": {},
"order": 2,
"mode": 0,
"inputs": [
{
"name": "image",
"type": "IMAGE",
"link": 1
},
{
"name": "florence2_model",
"type": "FL2MODEL",
"link": 3
}
],
"outputs": [
{
"name": "image",
"type": "IMAGE",
"links": []
},
{
"name": "mask",
"type": "MASK",
"links": []
},
{
"name": "caption",
"type": "STRING",
"links": [
13
]
},
{
"name": "data",
"type": "JSON",
"links": []
}
],
"properties": {
"cnr_id": "comfyui-florence2",
"ver": "00b63382966a444a9fefacb65b8deb188d12a458",
"Node name for S&R": "Florence2Run"
},
"widgets_values": [
"",
"detailed_caption",
true,
false,
1024,
3,
true,
"",
1234,
"fixed"
],
"color": "#232",
"bgcolor": "#353"
},
{
"id": 2,
"type": "LoadImage",
"pos": [
248.54931487603312,
423.43518886551453
],
"size": [
390.44371448863615,
395.81818181818187
],
"flags": {},
"order": 1,
"mode": 0,
"inputs": [],
"outputs": [
{
"name": "IMAGE",
"type": "IMAGE",
"links": [
1
]
},
{
"name": "MASK",
"type": "MASK",
"links": null
}
],
"properties": {
"cnr_id": "comfy-core",
"ver": "0.3.76",
"Node name for S&R": "LoadImage"
},
"widgets_values": [
"pasted/image (74).png",
"image"
]
},
{
"id": 12,
"type": "PreviewAny",
"pos": [
1025.4266881474668,
427.6300114135301
],
"size": [
297.27272727272725,
182.36363636363637
],
"flags": {},
"order": 3,
"mode": 0,
"inputs": [
{
"name": "source",
"type": "*",
"link": 13
}
],
"outputs": [],
"properties": {
"cnr_id": "comfy-core",
"ver": "0.3.76",
"Node name for S&R": "PreviewAny"
},
"widgets_values": [
null,
null,
false
]
}
],
"links": [
[
1,
2,
0,
1,
0,
"IMAGE"
],
[
3,
4,
0,
1,
1,
"FL2MODEL"
],
[
13,
1,
2,
12,
0,
"*"
]
],
"groups": [],
"config": {},
"extra": {
"ds": {
"scale": 1,
"offset": [
222.45068512396688,
-43.87996065705917
]
},
"frontendVersion": "1.34.6",
"VHS_latentpreview": false,
"VHS_latentpreviewrate": 0,
"VHS_MetadataImage": true,
"VHS_KeepIntermediate": true
},
"version": 0.4
}
- caption: briefly describes the overall content of the image.
- detailed caption: describes the composition and appearance in somewhat more detail.
However, if all you need is a caption to use as a prompt, a caption-specialized model such as JoyCaption will produce far more flexible and higher-quality results.
caption_to_phrase_grounding
Outputs the position of each phrase in the specified caption as a rectangle (bounding box).

{
"id": "063054af-873b-492c-a642-b59c68b22c0b",
"revision": 0,
"last_node_id": 11,
"last_link_id": 12,
"nodes": [
{
"id": 4,
"type": "DownloadAndLoadFlorence2Model",
"pos": [
349.41423462195155,
229.87996065705917
],
"size": [
286.86661124741727,
130
],
"flags": {},
"order": 0,
"mode": 0,
"inputs": [
{
"name": "lora",
"shape": 7,
"type": "PEFTLORA",
"link": null
}
],
"outputs": [
{
"name": "florence2_model",
"type": "FL2MODEL",
"links": [
3
]
}
],
"properties": {
"cnr_id": "comfyui-florence2",
"ver": "00b63382966a444a9fefacb65b8deb188d12a458",
"Node name for S&R": "DownloadAndLoadFlorence2Model"
},
"widgets_values": [
"microsoft/Florence-2-base-ft",
"fp16",
"sdpa",
true
],
"color": "#232",
"bgcolor": "#353"
},
{
"id": 1,
"type": "Florence2Run",
"pos": [
674.4302630294422,
423.43518886551453
],
"size": [
313.6363636363636,
364
],
"flags": {},
"order": 3,
"mode": 0,
"inputs": [
{
"name": "image",
"type": "IMAGE",
"link": 1
},
{
"name": "florence2_model",
"type": "FL2MODEL",
"link": 3
}
],
"outputs": [
{
"name": "image",
"type": "IMAGE",
"links": [
2
]
},
{
"name": "mask",
"type": "MASK",
"links": []
},
{
"name": "caption",
"type": "STRING",
"links": []
},
{
"name": "data",
"type": "JSON",
"links": [
7
]
}
],
"properties": {
"cnr_id": "comfyui-florence2",
"ver": "00b63382966a444a9fefacb65b8deb188d12a458",
"Node name for S&R": "Florence2Run"
},
"widgets_values": [
"fox",
"caption_to_phrase_grounding",
true,
false,
1024,
3,
true,
"",
1234,
"fixed"
],
"color": "#232",
"bgcolor": "#353"
},
{
"id": 3,
"type": "PreviewImage",
"pos": [
1023.5038603305788,
423.43518886551453
],
"size": [
419.6727272727271,
391.9818181818181
],
"flags": {},
"order": 4,
"mode": 0,
"inputs": [
{
"name": "images",
"type": "IMAGE",
"link": 2
}
],
"outputs": [],
"properties": {
"cnr_id": "comfy-core",
"ver": "0.3.76",
"Node name for S&R": "PreviewImage"
},
"widgets_values": []
},
{
"id": 10,
"type": "DownloadAndLoadSAM2Model",
"pos": [
1031.2774982383762,
876.8182919589856
],
"size": [
210,
130
],
"flags": {},
"order": 1,
"mode": 0,
"inputs": [],
"outputs": [
{
"name": "sam2_model",
"type": "SAM2MODEL",
"links": [
10
]
}
],
"properties": {
"cnr_id": "ComfyUI-segment-anything-2",
"ver": "0c35fff5f382803e2310103357b5e985f5437f32",
"Node name for S&R": "DownloadAndLoadSAM2Model"
},
"widgets_values": [
"sam2.1_hiera_base_plus.safetensors",
"single_image",
"cuda",
"fp16"
],
"color": "#323",
"bgcolor": "#535"
},
{
"id": 2,
"type": "LoadImage",
"pos": [
248.54931487603312,
423.43518886551453
],
"size": [
390.44371448863615,
395.81818181818187
],
"flags": {},
"order": 2,
"mode": 0,
"inputs": [],
"outputs": [
{
"name": "IMAGE",
"type": "IMAGE",
"links": [
1,
11
]
},
{
"name": "MASK",
"type": "MASK",
"links": null
}
],
"properties": {
"cnr_id": "comfy-core",
"ver": "0.3.76",
"Node name for S&R": "LoadImage"
},
"widgets_values": [
"pasted/image (73).png",
"image"
]
},
{
"id": 11,
"type": "MaskPreview",
"pos": [
1535.0502255111053,
980.9273828680758
],
"size": [
374.29999999999995,
323
],
"flags": {},
"order": 7,
"mode": 0,
"inputs": [
{
"name": "mask",
"type": "MASK",
"link": 12
}
],
"outputs": [],
"properties": {
"cnr_id": "comfy-core",
"ver": "0.3.76",
"Node name for S&R": "MaskPreview"
},
"widgets_values": []
},
{
"id": 8,
"type": "Florence2toCoordinates",
"pos": [
1030.8481877951024,
1066.5042611550825
],
"size": [
210,
102
],
"flags": {},
"order": 5,
"mode": 0,
"inputs": [
{
"name": "data",
"type": "JSON",
"link": 7
}
],
"outputs": [
{
"name": "center_coordinates",
"type": "STRING",
"links": [
8
]
},
{
"name": "bboxes",
"type": "BBOX",
"links": [
9
]
}
],
"properties": {
"cnr_id": "ComfyUI-segment-anything-2",
"ver": "0c35fff5f382803e2310103357b5e985f5437f32",
"Node name for S&R": "Florence2toCoordinates"
},
"widgets_values": [
"0",
false
],
"color": "#432",
"bgcolor": "#653"
},
{
"id": 9,
"type": "Sam2Segmentation",
"pos": [
1281.994151431467,
982.5618884278075
],
"size": [
212.087890625,
182
],
"flags": {},
"order": 6,
"mode": 0,
"inputs": [
{
"name": "sam2_model",
"type": "SAM2MODEL",
"link": 10
},
{
"name": "image",
"type": "IMAGE",
"link": 11
},
{
"name": "coordinates_positive",
"shape": 7,
"type": "STRING",
"link": 8
},
{
"name": "coordinates_negative",
"shape": 7,
"type": "STRING",
"link": null
},
{
"name": "bboxes",
"shape": 7,
"type": "BBOX",
"link": 9
},
{
"name": "mask",
"shape": 7,
"type": "MASK",
"link": null
}
],
"outputs": [
{
"name": "mask",
"type": "MASK",
"links": [
12
]
}
],
"properties": {
"cnr_id": "ComfyUI-segment-anything-2",
"ver": "0c35fff5f382803e2310103357b5e985f5437f32",
"Node name for S&R": "Sam2Segmentation"
},
"widgets_values": [
false,
false
],
"color": "#323",
"bgcolor": "#535"
}
],
"links": [
[
1,
2,
0,
1,
0,
"IMAGE"
],
[
2,
1,
0,
3,
0,
"IMAGE"
],
[
3,
4,
0,
1,
1,
"FL2MODEL"
],
[
7,
1,
3,
8,
0,
"JSON"
],
[
8,
8,
0,
9,
2,
"STRING"
],
[
9,
8,
1,
9,
4,
"BBOX"
],
[
10,
10,
0,
9,
0,
"SAM2MODEL"
],
[
11,
2,
0,
9,
1,
"IMAGE"
],
[
12,
9,
0,
11,
0,
"MASK"
]
],
"groups": [],
"config": {},
"extra": {
"ds": {
"scale": 0.8264462809917358,
"offset": [
-56.58931487603314,
-89.94996065705918
]
},
"frontendVersion": "1.34.6",
"VHS_latentpreview": false,
"VHS_latentpreviewrate": 0,
"VHS_MetadataImage": true,
"VHS_KeepIntermediate": true
},
"version": 0.4
}
- It can locate objects even from moderately complex phrases such as "left tree" or "red car".
- By extracting coordinates with the Florence2toCoordinates node and combining them with a segmentation model such as SAM2, you can mask only a specific object.
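Conceptually, the coordinate-extraction step just converts each bounding box `[x1, y1, x2, y2]` from the grounding output into a center point for SAM2 to use. The sketch below shows that computation under assumptions about the data layout (one list of boxes per phrase); `bboxes_to_centers` is a hypothetical function, and the real Florence2toCoordinates node's exact output format may differ.

```python
import json

def bboxes_to_centers(data, index=0):
    """Sketch of the Florence2toCoordinates step: for each bounding box
    [x1, y1, x2, y2] returned by caption_to_phrase_grounding, emit the
    box center as {"x": ..., "y": ...} (assumed output format)."""
    bboxes = data[index]  # assumed layout: one list of boxes per phrase
    centers = [
        {"x": round((x1 + x2) / 2), "y": round((y1 + y2) / 2)}
        for x1, y1, x2, y2 in bboxes
    ]
    return json.dumps(centers)

# Example grounding result for the phrase "fox" (made-up coordinates):
print(bboxes_to_centers([[[120.0, 80.0, 360.0, 300.0]]]))
# [{"x": 240, "y": 190}]
```

SAM2's point-based segmentation then takes these centers as positive click coordinates, which is why the workflow wires `center_coordinates` into `coordinates_positive`.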
ocr
Reads the text in the image and outputs it as a string.

{
"id": "063054af-873b-492c-a642-b59c68b22c0b",
"revision": 0,
"last_node_id": 12,
"last_link_id": 13,
"nodes": [
{
"id": 4,
"type": "DownloadAndLoadFlorence2Model",
"pos": [
349.41423462195155,
229.87996065705917
],
"size": [
286.86661124741727,
130
],
"flags": {},
"order": 0,
"mode": 0,
"inputs": [
{
"name": "lora",
"shape": 7,
"type": "PEFTLORA",
"link": null
}
],
"outputs": [
{
"name": "florence2_model",
"type": "FL2MODEL",
"links": [
3
]
}
],
"properties": {
"cnr_id": "comfyui-florence2",
"ver": "00b63382966a444a9fefacb65b8deb188d12a458",
"Node name for S&R": "DownloadAndLoadFlorence2Model"
},
"widgets_values": [
"microsoft/Florence-2-base-ft",
"fp16",
"sdpa",
true
],
"color": "#232",
"bgcolor": "#353"
},
{
"id": 12,
"type": "PreviewAny",
"pos": [
1025.4266881474668,
427.6300114135301
],
"size": [
297.27272727272725,
182.36363636363637
],
"flags": {},
"order": 3,
"mode": 0,
"inputs": [
{
"name": "source",
"type": "*",
"link": 13
}
],
"outputs": [],
"properties": {
"cnr_id": "comfy-core",
"ver": "0.3.76",
"Node name for S&R": "PreviewAny"
},
"widgets_values": [
null,
null,
null
]
},
{
"id": 2,
"type": "LoadImage",
"pos": [
248.54931487603312,
423.43518886551453
],
"size": [
390.44371448863615,
395.81818181818187
],
"flags": {},
"order": 1,
"mode": 0,
"inputs": [],
"outputs": [
{
"name": "IMAGE",
"type": "IMAGE",
"links": [
1
]
},
{
"name": "MASK",
"type": "MASK",
"links": null
}
],
"properties": {
"cnr_id": "comfy-core",
"ver": "0.3.76",
"Node name for S&R": "LoadImage"
},
"widgets_values": [
"pasted/image (75).png",
"image"
]
},
{
"id": 1,
"type": "Florence2Run",
"pos": [
674.4302630294422,
423.43518886551453
],
"size": [
313.6363636363636,
364
],
"flags": {},
"order": 2,
"mode": 0,
"inputs": [
{
"name": "image",
"type": "IMAGE",
"link": 1
},
{
"name": "florence2_model",
"type": "FL2MODEL",
"link": 3
}
],
"outputs": [
{
"name": "image",
"type": "IMAGE",
"links": []
},
{
"name": "mask",
"type": "MASK",
"links": []
},
{
"name": "caption",
"type": "STRING",
"links": [
13
]
},
{
"name": "data",
"type": "JSON",
"links": []
}
],
"properties": {
"cnr_id": "comfyui-florence2",
"ver": "00b63382966a444a9fefacb65b8deb188d12a458",
"Node name for S&R": "Florence2Run"
},
"widgets_values": [
"",
"ocr",
true,
false,
1024,
3,
true,
"",
1234,
"fixed"
],
"color": "#232",
"bgcolor": "#353"
}
],
"links": [
[
1,
2,
0,
1,
0,
"IMAGE"
],
[
3,
4,
0,
1,
1,
"FL2MODEL"
],
[
13,
1,
2,
12,
0,
"*"
]
],
"groups": [],
"config": {},
"extra": {
"ds": {
"scale": 1.2100000000000006,
"offset": [
-148.54931487603312,
-129.87996065705917
]
},
"frontendVersion": "1.34.6",
"VHS_latentpreview": false,
"VHS_latentpreviewrate": 0,
"VHS_MetadataImage": true,
"VHS_KeepIntermediate": true
},
"version": 0.4
}
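When a Florence-2 generation is decoded without skipping special tokens, the raw string typically contains markers such as `<s>`, `</s>`, and `<pad>`. The sketch below shows a minimal cleanup pass of the kind an OCR pipeline needs; the real Florence2Run node may do more (or less) post-processing than this, so treat it as an assumption.

```python
import re

def clean_ocr_output(raw: str) -> str:
    """Strip common special tokens from a raw Florence-2 generation and
    normalize whitespace. A hedged sketch, not the node's actual logic."""
    text = re.sub(r"</?s>|<pad>", "", raw)
    return " ".join(text.split())

print(clean_ocr_output("</s><s>OPEN 9:00 - 18:00</s>"))
# OPEN 9:00 - 18:00
```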
docvqa
A task that answers questions about the image.

{
"id": "063054af-873b-492c-a642-b59c68b22c0b",
"revision": 0,
"last_node_id": 12,
"last_link_id": 13,
"nodes": [
{
"id": 4,
"type": "DownloadAndLoadFlorence2Model",
"pos": [
349.41423462195155,
229.87996065705917
],
"size": [
286.86661124741727,
130
],
"flags": {},
"order": 0,
"mode": 0,
"inputs": [
{
"name": "lora",
"shape": 7,
"type": "PEFTLORA",
"link": null
}
],
"outputs": [
{
"name": "florence2_model",
"type": "FL2MODEL",
"links": [
3
]
}
],
"properties": {
"cnr_id": "comfyui-florence2",
"ver": "00b63382966a444a9fefacb65b8deb188d12a458",
"Node name for S&R": "DownloadAndLoadFlorence2Model"
},
"widgets_values": [
"microsoft/Florence-2-base-ft",
"fp16",
"sdpa",
true
],
"color": "#232",
"bgcolor": "#353"
},
{
"id": 12,
"type": "PreviewAny",
"pos": [
1025.4266881474668,
427.6300114135301
],
"size": [
297.27272727272725,
182.36363636363637
],
"flags": {},
"order": 3,
"mode": 0,
"inputs": [
{
"name": "source",
"type": "*",
"link": 13
}
],
"outputs": [],
"properties": {
"cnr_id": "comfy-core",
"ver": "0.3.76",
"Node name for S&R": "PreviewAny"
},
"widgets_values": [
null,
null,
null
]
},
{
"id": 2,
"type": "LoadImage",
"pos": [
248.54931487603312,
423.43518886551453
],
"size": [
390.44371448863615,
395.81818181818187
],
"flags": {},
"order": 1,
"mode": 0,
"inputs": [],
"outputs": [
{
"name": "IMAGE",
"type": "IMAGE",
"links": [
1
]
},
{
"name": "MASK",
"type": "MASK",
"links": null
}
],
"properties": {
"cnr_id": "comfy-core",
"ver": "0.3.76",
"Node name for S&R": "LoadImage"
},
"widgets_values": [
"pasted/image (76).png",
"image"
]
},
{
"id": 1,
"type": "Florence2Run",
"pos": [
674.4302630294422,
423.43518886551453
],
"size": [
313.6363636363636,
364
],
"flags": {},
"order": 2,
"mode": 0,
"inputs": [
{
"name": "image",
"type": "IMAGE",
"link": 1
},
{
"name": "florence2_model",
"type": "FL2MODEL",
"link": 3
}
],
"outputs": [
{
"name": "image",
"type": "IMAGE",
"links": []
},
{
"name": "mask",
"type": "MASK",
"links": []
},
{
"name": "caption",
"type": "STRING",
"links": [
13
]
},
{
"name": "data",
"type": "JSON",
"links": []
}
],
"properties": {
"cnr_id": "comfyui-florence2",
"ver": "00b63382966a444a9fefacb65b8deb188d12a458",
"Node name for S&R": "Florence2Run"
},
"widgets_values": [
"How many eggs are on the ramen?",
"docvqa",
true,
false,
1024,
3,
true,
"",
1234,
"fixed"
],
"color": "#232",
"bgcolor": "#353"
}
],
"links": [
[
1,
2,
0,
1,
0,
"IMAGE"
],
[
3,
4,
0,
1,
1,
"FL2MODEL"
],
[
13,
1,
2,
12,
0,
"*"
]
],
"groups": [],
"config": {},
"extra": {
"ds": {
"scale": 1.2100000000000006,
"offset": [
-148.54931487603312,
-129.87996065705917
]
},
"frontendVersion": "1.34.6",
"VHS_latentpreview": false,
"VHS_latentpreviewrate": 0,
"VHS_MetadataImage": true,
"VHS_KeepIntermediate": true
},
"version": 0.4
}
- You can ask questions such as "Where is X in this image?" or "What value does this table show?" and receive the answer as text.
- Think of it as similar to uploading an image to ChatGPT and asking questions about it.
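Unlike the caption tasks, docvqa uses the node's text input to carry the question, which gets appended to the task token. A hedged sketch of that prompt construction is below; the token spelling `<DocVQA>` follows the Florence-2 model card but is an assumption here, as is the helper name.

```python
def build_docvqa_prompt(question: str) -> str:
    """Sketch: docvqa prepends the task token to the user's question.
    The token name <DocVQA> is assumed from the Florence-2 model card."""
    return "<DocVQA>" + question

print(build_docvqa_prompt("How many eggs are on the ramen?"))
# <DocVQA>How many eggs are on the ramen?
```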