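What is Florence-2?
Florence-2 is a general-purpose VLM (Visual Language Model) that looks at an image and handles several tasks with a single model, including caption generation, object detection, segmentation, and OCR.
This page focuses on the four tasks most often used in ComfyUI: caption generation, object detection (coordinate extraction), OCR, and Q&A about an image.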
Custom node
- kijai/ComfyUI-Florence2
- The model is downloaded automatically on the first run.
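Florence2Run node
Florence2Run is the main node that makes Florence-2 run a task on the input image. Switching task selects between functions such as caption generation, object detection, and OCR.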
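Because the task is just one widget value on the node, an exported workflow can also be switched between tasks with a small script. The following is a minimal sketch, not part of the custom node: `set_florence2_task` is a hypothetical helper, and it assumes that index 0 of `widgets_values` is the text input and index 1 is the task, which is the order seen in the workflows on this page and may differ in other versions of the node.

```python
import copy


def set_florence2_task(workflow, task, text_input=""):
    """Return a copy of a ComfyUI workflow dict with the Florence2Run task changed."""
    updated = copy.deepcopy(workflow)
    for node in updated.get("nodes", []):
        if node.get("type") == "Florence2Run":
            # Assumption: index 0 is the text input and index 1 is the task,
            # matching the widgets_values order in the workflows on this page.
            node["widgets_values"][0] = text_input
            node["widgets_values"][1] = task
    return updated


# Trimmed-down stand-in for an exported workflow (only the fields used here).
workflow = {
    "nodes": [
        {
            "id": 1,
            "type": "Florence2Run",
            "widgets_values": ["", "detailed_caption", True, False,
                               1024, 3, True, "", 1234, "fixed"],
        },
        {"id": 2, "type": "LoadImage", "widgets_values": ["example.png", "image"]},
    ]
}

grounding = set_florence2_task(workflow, "caption_to_phrase_grounding", "fox")
print(grounding["nodes"][0]["widgets_values"][:2])
```

The helper works on a deep copy, so the original workflow dict stays unchanged and can be reused for other tasks.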
caption, detailed caption
Generates a natural-language description from the image.

{
"id": "063054af-873b-492c-a642-b59c68b22c0b",
"revision": 0,
"last_node_id": 12,
"last_link_id": 13,
"nodes": [
{
"id": 4,
"type": "DownloadAndLoadFlorence2Model",
"pos": [
349.41423462195155,
229.87996065705917
],
"size": [
286.86661124741727,
130
],
"flags": {},
"order": 0,
"mode": 0,
"inputs": [
{
"name": "lora",
"shape": 7,
"type": "PEFTLORA",
"link": null
}
],
"outputs": [
{
"name": "florence2_model",
"type": "FL2MODEL",
"links": [
3
]
}
],
"properties": {
"cnr_id": "comfyui-florence2",
"ver": "00b63382966a444a9fefacb65b8deb188d12a458",
"Node name for S&R": "DownloadAndLoadFlorence2Model"
},
"widgets_values": [
"microsoft/Florence-2-base-ft",
"fp16",
"sdpa",
true
],
"color": "#232",
"bgcolor": "#353"
},
{
"id": 1,
"type": "Florence2Run",
"pos": [
674.4302630294422,
423.43518886551453
],
"size": [
313.6363636363636,
364
],
"flags": {},
"order": 2,
"mode": 0,
"inputs": [
{
"name": "image",
"type": "IMAGE",
"link": 1
},
{
"name": "florence2_model",
"type": "FL2MODEL",
"link": 3
}
],
"outputs": [
{
"name": "image",
"type": "IMAGE",
"links": []
},
{
"name": "mask",
"type": "MASK",
"links": []
},
{
"name": "caption",
"type": "STRING",
"links": [
13
]
},
{
"name": "data",
"type": "JSON",
"links": []
}
],
"properties": {
"cnr_id": "comfyui-florence2",
"ver": "00b63382966a444a9fefacb65b8deb188d12a458",
"Node name for S&R": "Florence2Run"
},
"widgets_values": [
"",
"detailed_caption",
true,
false,
1024,
3,
true,
"",
1234,
"fixed"
],
"color": "#232",
"bgcolor": "#353"
},
{
"id": 2,
"type": "LoadImage",
"pos": [
248.54931487603312,
423.43518886551453
],
"size": [
390.44371448863615,
395.81818181818187
],
"flags": {},
"order": 1,
"mode": 0,
"inputs": [],
"outputs": [
{
"name": "IMAGE",
"type": "IMAGE",
"links": [
1
]
},
{
"name": "MASK",
"type": "MASK",
"links": null
}
],
"properties": {
"cnr_id": "comfy-core",
"ver": "0.3.76",
"Node name for S&R": "LoadImage"
},
"widgets_values": [
"pasted/image (74).png",
"image"
]
},
{
"id": 12,
"type": "PreviewAny",
"pos": [
1025.4266881474668,
427.6300114135301
],
"size": [
297.27272727272725,
182.36363636363637
],
"flags": {},
"order": 3,
"mode": 0,
"inputs": [
{
"name": "source",
"type": "*",
"link": 13
}
],
"outputs": [],
"properties": {
"cnr_id": "comfy-core",
"ver": "0.3.76",
"Node name for S&R": "PreviewAny"
},
"widgets_values": [
null,
null,
false
]
}
],
"links": [
[
1,
2,
0,
1,
0,
"IMAGE"
],
[
3,
4,
0,
1,
1,
"FL2MODEL"
],
[
13,
1,
2,
12,
0,
"*"
]
],
"groups": [],
"config": {},
"extra": {
"ds": {
"scale": 1,
"offset": [
222.45068512396688,
-43.87996065705917
]
},
"frontendVersion": "1.34.6",
"VHS_latentpreview": false,
"VHS_latentpreviewrate": 0,
"VHS_MetadataImage": true,
"VHS_KeepIntermediate": true
},
"version": 0.4
}
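caption - Briefly describes the overall content of the image.
detailed caption - Describes the composition and appearance in somewhat more detail.
However, if all you need is a description to use as a prompt, a dedicated captioning model such as JoyCaption produces far more flexible and higher-quality output.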
caption_to_phrase_grounding
For each phrase in the specified text, outputs the position of the object as a rectangle (bounding box).

{
"id": "063054af-873b-492c-a642-b59c68b22c0b",
"revision": 0,
"last_node_id": 11,
"last_link_id": 12,
"nodes": [
{
"id": 4,
"type": "DownloadAndLoadFlorence2Model",
"pos": [
349.41423462195155,
229.87996065705917
],
"size": [
286.86661124741727,
130
],
"flags": {},
"order": 0,
"mode": 0,
"inputs": [
{
"name": "lora",
"shape": 7,
"type": "PEFTLORA",
"link": null
}
],
"outputs": [
{
"name": "florence2_model",
"type": "FL2MODEL",
"links": [
3
]
}
],
"properties": {
"cnr_id": "comfyui-florence2",
"ver": "00b63382966a444a9fefacb65b8deb188d12a458",
"Node name for S&R": "DownloadAndLoadFlorence2Model"
},
"widgets_values": [
"microsoft/Florence-2-base-ft",
"fp16",
"sdpa",
true
],
"color": "#232",
"bgcolor": "#353"
},
{
"id": 1,
"type": "Florence2Run",
"pos": [
674.4302630294422,
423.43518886551453
],
"size": [
313.6363636363636,
364
],
"flags": {},
"order": 3,
"mode": 0,
"inputs": [
{
"name": "image",
"type": "IMAGE",
"link": 1
},
{
"name": "florence2_model",
"type": "FL2MODEL",
"link": 3
}
],
"outputs": [
{
"name": "image",
"type": "IMAGE",
"links": [
2
]
},
{
"name": "mask",
"type": "MASK",
"links": []
},
{
"name": "caption",
"type": "STRING",
"links": []
},
{
"name": "data",
"type": "JSON",
"links": [
7
]
}
],
"properties": {
"cnr_id": "comfyui-florence2",
"ver": "00b63382966a444a9fefacb65b8deb188d12a458",
"Node name for S&R": "Florence2Run"
},
"widgets_values": [
"fox",
"caption_to_phrase_grounding",
true,
false,
1024,
3,
true,
"",
1234,
"fixed"
],
"color": "#232",
"bgcolor": "#353"
},
{
"id": 3,
"type": "PreviewImage",
"pos": [
1023.5038603305788,
423.43518886551453
],
"size": [
419.6727272727271,
391.9818181818181
],
"flags": {},
"order": 4,
"mode": 0,
"inputs": [
{
"name": "images",
"type": "IMAGE",
"link": 2
}
],
"outputs": [],
"properties": {
"cnr_id": "comfy-core",
"ver": "0.3.76",
"Node name for S&R": "PreviewImage"
},
"widgets_values": []
},
{
"id": 10,
"type": "DownloadAndLoadSAM2Model",
"pos": [
1031.2774982383762,
876.8182919589856
],
"size": [
210,
130
],
"flags": {},
"order": 1,
"mode": 0,
"inputs": [],
"outputs": [
{
"name": "sam2_model",
"type": "SAM2MODEL",
"links": [
10
]
}
],
"properties": {
"cnr_id": "ComfyUI-segment-anything-2",
"ver": "0c35fff5f382803e2310103357b5e985f5437f32",
"Node name for S&R": "DownloadAndLoadSAM2Model"
},
"widgets_values": [
"sam2.1_hiera_base_plus.safetensors",
"single_image",
"cuda",
"fp16"
],
"color": "#323",
"bgcolor": "#535"
},
{
"id": 2,
"type": "LoadImage",
"pos": [
248.54931487603312,
423.43518886551453
],
"size": [
390.44371448863615,
395.81818181818187
],
"flags": {},
"order": 2,
"mode": 0,
"inputs": [],
"outputs": [
{
"name": "IMAGE",
"type": "IMAGE",
"links": [
1,
11
]
},
{
"name": "MASK",
"type": "MASK",
"links": null
}
],
"properties": {
"cnr_id": "comfy-core",
"ver": "0.3.76",
"Node name for S&R": "LoadImage"
},
"widgets_values": [
"pasted/image (73).png",
"image"
]
},
{
"id": 11,
"type": "MaskPreview",
"pos": [
1535.0502255111053,
980.9273828680758
],
"size": [
374.29999999999995,
323
],
"flags": {},
"order": 7,
"mode": 0,
"inputs": [
{
"name": "mask",
"type": "MASK",
"link": 12
}
],
"outputs": [],
"properties": {
"cnr_id": "comfy-core",
"ver": "0.3.76",
"Node name for S&R": "MaskPreview"
},
"widgets_values": []
},
{
"id": 8,
"type": "Florence2toCoordinates",
"pos": [
1030.8481877951024,
1066.5042611550825
],
"size": [
210,
102
],
"flags": {},
"order": 5,
"mode": 0,
"inputs": [
{
"name": "data",
"type": "JSON",
"link": 7
}
],
"outputs": [
{
"name": "center_coordinates",
"type": "STRING",
"links": [
8
]
},
{
"name": "bboxes",
"type": "BBOX",
"links": [
9
]
}
],
"properties": {
"cnr_id": "ComfyUI-segment-anything-2",
"ver": "0c35fff5f382803e2310103357b5e985f5437f32",
"Node name for S&R": "Florence2toCoordinates"
},
"widgets_values": [
"0",
false
],
"color": "#432",
"bgcolor": "#653"
},
{
"id": 9,
"type": "Sam2Segmentation",
"pos": [
1281.994151431467,
982.5618884278075
],
"size": [
212.087890625,
182
],
"flags": {},
"order": 6,
"mode": 0,
"inputs": [
{
"name": "sam2_model",
"type": "SAM2MODEL",
"link": 10
},
{
"name": "image",
"type": "IMAGE",
"link": 11
},
{
"name": "coordinates_positive",
"shape": 7,
"type": "STRING",
"link": 8
},
{
"name": "coordinates_negative",
"shape": 7,
"type": "STRING",
"link": null
},
{
"name": "bboxes",
"shape": 7,
"type": "BBOX",
"link": 9
},
{
"name": "mask",
"shape": 7,
"type": "MASK",
"link": null
}
],
"outputs": [
{
"name": "mask",
"type": "MASK",
"links": [
12
]
}
],
"properties": {
"cnr_id": "ComfyUI-segment-anything-2",
"ver": "0c35fff5f382803e2310103357b5e985f5437f32",
"Node name for S&R": "Sam2Segmentation"
},
"widgets_values": [
false,
false
],
"color": "#323",
"bgcolor": "#535"
}
],
"links": [
[
1,
2,
0,
1,
0,
"IMAGE"
],
[
2,
1,
0,
3,
0,
"IMAGE"
],
[
3,
4,
0,
1,
1,
"FL2MODEL"
],
[
7,
1,
3,
8,
0,
"JSON"
],
[
8,
8,
0,
9,
2,
"STRING"
],
[
9,
8,
1,
9,
4,
"BBOX"
],
[
10,
10,
0,
9,
0,
"SAM2MODEL"
],
[
11,
2,
0,
9,
1,
"IMAGE"
],
[
12,
9,
0,
11,
0,
"MASK"
]
],
"groups": [],
"config": {},
"extra": {
"ds": {
"scale": 0.8264462809917358,
"offset": [
-56.58931487603314,
-89.94996065705918
]
},
"frontendVersion": "1.34.6",
"VHS_latentpreview": false,
"VHS_latentpreviewrate": 0,
"VHS_MetadataImage": true,
"VHS_KeepIntermediate": true
},
"version": 0.4
}
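- Its distinguishing feature is that it can locate objects from slightly more complex instructions such as “left tree” or “red car”.
- 🟨 Extract the coordinates with the Florence2 Coordinates node and combine them with a segmentation model such as SAM2 to mask only a specific object.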
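To make the coordinate step concrete, here is a minimal sketch of turning bounding boxes into center points, the kind of point prompt a segmentation model can take. It assumes each box is `[x1, y1, x2, y2]` in pixels, as in the published Florence-2 examples. The `detection` dict and the `bbox_centers` helper are illustrative only; the actual data passed between the nodes and the internals of the Florence2 Coordinates node may differ.

```python
def bbox_centers(bboxes):
    """Return the (x, y) center of each [x1, y1, x2, y2] bounding box."""
    centers = []
    for x1, y1, x2, y2 in bboxes:
        # The center is the midpoint of the two opposite corners.
        centers.append(((x1 + x2) / 2, (y1 + y2) / 2))
    return centers


# Hypothetical detection result for the phrase "fox" (values are made up).
detection = {"bboxes": [[100, 50, 300, 250]], "labels": ["fox"]}

centers = bbox_centers(detection["bboxes"])
for label, center in zip(detection["labels"], centers):
    print(label, center)
```

Passing both the box and its center, as the workflow above does with the bboxes and center_coordinates outputs, gives the segmentation model two complementary hints about which object to mask.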
ocr
Reads the text in the image and outputs it as text.

{
"id": "063054af-873b-492c-a642-b59c68b22c0b",
"revision": 0,
"last_node_id": 12,
"last_link_id": 13,
"nodes": [
{
"id": 4,
"type": "DownloadAndLoadFlorence2Model",
"pos": [
349.41423462195155,
229.87996065705917
],
"size": [
286.86661124741727,
130
],
"flags": {},
"order": 0,
"mode": 0,
"inputs": [
{
"name": "lora",
"shape": 7,
"type": "PEFTLORA",
"link": null
}
],
"outputs": [
{
"name": "florence2_model",
"type": "FL2MODEL",
"links": [
3
]
}
],
"properties": {
"cnr_id": "comfyui-florence2",
"ver": "00b63382966a444a9fefacb65b8deb188d12a458",
"Node name for S&R": "DownloadAndLoadFlorence2Model"
},
"widgets_values": [
"microsoft/Florence-2-base-ft",
"fp16",
"sdpa",
true
],
"color": "#232",
"bgcolor": "#353"
},
{
"id": 12,
"type": "PreviewAny",
"pos": [
1025.4266881474668,
427.6300114135301
],
"size": [
297.27272727272725,
182.36363636363637
],
"flags": {},
"order": 3,
"mode": 0,
"inputs": [
{
"name": "source",
"type": "*",
"link": 13
}
],
"outputs": [],
"properties": {
"cnr_id": "comfy-core",
"ver": "0.3.76",
"Node name for S&R": "PreviewAny"
},
"widgets_values": [
null,
null,
null
]
},
{
"id": 2,
"type": "LoadImage",
"pos": [
248.54931487603312,
423.43518886551453
],
"size": [
390.44371448863615,
395.81818181818187
],
"flags": {},
"order": 1,
"mode": 0,
"inputs": [],
"outputs": [
{
"name": "IMAGE",
"type": "IMAGE",
"links": [
1
]
},
{
"name": "MASK",
"type": "MASK",
"links": null
}
],
"properties": {
"cnr_id": "comfy-core",
"ver": "0.3.76",
"Node name for S&R": "LoadImage"
},
"widgets_values": [
"pasted/image (75).png",
"image"
]
},
{
"id": 1,
"type": "Florence2Run",
"pos": [
674.4302630294422,
423.43518886551453
],
"size": [
313.6363636363636,
364
],
"flags": {},
"order": 2,
"mode": 0,
"inputs": [
{
"name": "image",
"type": "IMAGE",
"link": 1
},
{
"name": "florence2_model",
"type": "FL2MODEL",
"link": 3
}
],
"outputs": [
{
"name": "image",
"type": "IMAGE",
"links": []
},
{
"name": "mask",
"type": "MASK",
"links": []
},
{
"name": "caption",
"type": "STRING",
"links": [
13
]
},
{
"name": "data",
"type": "JSON",
"links": []
}
],
"properties": {
"cnr_id": "comfyui-florence2",
"ver": "00b63382966a444a9fefacb65b8deb188d12a458",
"Node name for S&R": "Florence2Run"
},
"widgets_values": [
"",
"ocr",
true,
false,
1024,
3,
true,
"",
1234,
"fixed"
],
"color": "#232",
"bgcolor": "#353"
}
],
"links": [
[
1,
2,
0,
1,
0,
"IMAGE"
],
[
3,
4,
0,
1,
1,
"FL2MODEL"
],
[
13,
1,
2,
12,
0,
"*"
]
],
"groups": [],
"config": {},
"extra": {
"ds": {
"scale": 1.2100000000000006,
"offset": [
-148.54931487603312,
-129.87996065705917
]
},
"frontendVersion": "1.34.6",
"VHS_latentpreview": false,
"VHS_latentpreviewrate": 0,
"VHS_MetadataImage": true,
"VHS_KeepIntermediate": true
},
"version": 0.4
}
docvqa
A task that answers questions about the image.

{
"id": "063054af-873b-492c-a642-b59c68b22c0b",
"revision": 0,
"last_node_id": 12,
"last_link_id": 13,
"nodes": [
{
"id": 4,
"type": "DownloadAndLoadFlorence2Model",
"pos": [
349.41423462195155,
229.87996065705917
],
"size": [
286.86661124741727,
130
],
"flags": {},
"order": 0,
"mode": 0,
"inputs": [
{
"name": "lora",
"shape": 7,
"type": "PEFTLORA",
"link": null
}
],
"outputs": [
{
"name": "florence2_model",
"type": "FL2MODEL",
"links": [
3
]
}
],
"properties": {
"cnr_id": "comfyui-florence2",
"ver": "00b63382966a444a9fefacb65b8deb188d12a458",
"Node name for S&R": "DownloadAndLoadFlorence2Model"
},
"widgets_values": [
"microsoft/Florence-2-base-ft",
"fp16",
"sdpa",
true
],
"color": "#232",
"bgcolor": "#353"
},
{
"id": 12,
"type": "PreviewAny",
"pos": [
1025.4266881474668,
427.6300114135301
],
"size": [
297.27272727272725,
182.36363636363637
],
"flags": {},
"order": 3,
"mode": 0,
"inputs": [
{
"name": "source",
"type": "*",
"link": 13
}
],
"outputs": [],
"properties": {
"cnr_id": "comfy-core",
"ver": "0.3.76",
"Node name for S&R": "PreviewAny"
},
"widgets_values": [
null,
null,
null
]
},
{
"id": 2,
"type": "LoadImage",
"pos": [
248.54931487603312,
423.43518886551453
],
"size": [
390.44371448863615,
395.81818181818187
],
"flags": {},
"order": 1,
"mode": 0,
"inputs": [],
"outputs": [
{
"name": "IMAGE",
"type": "IMAGE",
"links": [
1
]
},
{
"name": "MASK",
"type": "MASK",
"links": null
}
],
"properties": {
"cnr_id": "comfy-core",
"ver": "0.3.76",
"Node name for S&R": "LoadImage"
},
"widgets_values": [
"pasted/image (76).png",
"image"
]
},
{
"id": 1,
"type": "Florence2Run",
"pos": [
674.4302630294422,
423.43518886551453
],
"size": [
313.6363636363636,
364
],
"flags": {},
"order": 2,
"mode": 0,
"inputs": [
{
"name": "image",
"type": "IMAGE",
"link": 1
},
{
"name": "florence2_model",
"type": "FL2MODEL",
"link": 3
}
],
"outputs": [
{
"name": "image",
"type": "IMAGE",
"links": []
},
{
"name": "mask",
"type": "MASK",
"links": []
},
{
"name": "caption",
"type": "STRING",
"links": [
13
]
},
{
"name": "data",
"type": "JSON",
"links": []
}
],
"properties": {
"cnr_id": "comfyui-florence2",
"ver": "00b63382966a444a9fefacb65b8deb188d12a458",
"Node name for S&R": "Florence2Run"
},
"widgets_values": [
"How many eggs are on the ramen?",
"docvqa",
true,
false,
1024,
3,
true,
"",
1234,
"fixed"
],
"color": "#232",
"bgcolor": "#353"
}
],
"links": [
[
1,
2,
0,
1,
0,
"IMAGE"
],
[
3,
4,
0,
1,
1,
"FL2MODEL"
],
[
13,
1,
2,
12,
0,
"*"
]
],
"groups": [],
"config": {},
"extra": {
"ds": {
"scale": 1.2100000000000006,
"offset": [
-148.54931487603312,
-129.87996065705917
]
},
"frontendVersion": "1.34.6",
"VHS_latentpreview": false,
"VHS_latentpreviewrate": 0,
"VHS_MetadataImage": true,
"VHS_KeepIntermediate": true
},
"version": 0.4
}
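- Ask a question such as “Where is XX in this image?” or “What is the value in this table?” and receive the answer as text.
- In use, it feels similar to sending an image to ChatGPT and asking a question about it.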