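What is Florence-2?
Florence-2 is a general-purpose VLM (vision-language model): a single model that handles many image tasks, including caption generation, object detection, segmentation, and OCR.
This page covers only the four tasks most often used in ComfyUI: caption generation, object detection (coordinate extraction), OCR, and Q&A about an image.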
Custom node
- kijai/ComfyUI-Florence2
- The model is downloaded automatically the first time it is run.
Florence2Run node
Florence2Run is the main node for running a Florence-2 task on an input image. Changing task switches between functions such as caption generation, object detection, and OCR.
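In the workflow JSONs on this page, the selected task appears as the second entry of the Florence2Run node's widgets_values, right after the text input. The following is a minimal sketch, assuming that widget order holds for your version of the node, of how one might read or switch the task in an exported workflow. The helper names `florence2_settings` and `set_task` are hypothetical and not part of ComfyUI or the custom node.

```python
import json


def florence2_settings(workflow):
    """List the text and task of every Florence2Run node in a workflow dict."""
    settings = []
    for node in workflow.get("nodes", []):
        if node.get("type") != "Florence2Run":
            continue
        values = node["widgets_values"]
        # Assumption: values[0] is the text input and values[1] is the task,
        # matching the order seen in the workflows on this page.
        settings.append({"node_id": node["id"], "text": values[0], "task": values[1]})
    return settings


def set_task(workflow, task, text=""):
    """Switch every Florence2Run node to another task (edits the dict in place)."""
    for node in workflow.get("nodes", []):
        if node.get("type") == "Florence2Run":
            node["widgets_values"][0] = text
            node["widgets_values"][1] = task


# Trimmed-down stand-in for a workflow exported from ComfyUI.
workflow = json.loads("""
{"nodes": [
  {"id": 4, "type": "DownloadAndLoadFlorence2Model",
   "widgets_values": ["microsoft/Florence-2-base-ft", "fp16", "sdpa", true]},
  {"id": 1, "type": "Florence2Run",
   "widgets_values": ["", "detailed_caption", true, false, 1024, 3, true, "", 1234, "fixed"]}
]}
""")

before = florence2_settings(workflow)
set_task(workflow, "docvqa", "How many eggs are on the ramen?")
after = florence2_settings(workflow)
print(before)
print(after)
```

Inside ComfyUI itself you would simply pick the task from the node's dropdown; a script like this is only useful for inspecting or batch-editing saved workflow files.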
caption, detailed_caption
Generates a natural-language caption from the image.

{
"id": "063054af-873b-492c-a642-b59c68b22c0b",
"revision": 0,
"last_node_id": 12,
"last_link_id": 13,
"nodes": [
{
"id": 4,
"type": "DownloadAndLoadFlorence2Model",
"pos": [
349.41423462195155,
229.87996065705917
],
"size": [
286.86661124741727,
130
],
"flags": {},
"order": 0,
"mode": 0,
"inputs": [
{
"name": "lora",
"shape": 7,
"type": "PEFTLORA",
"link": null
}
],
"outputs": [
{
"name": "florence2_model",
"type": "FL2MODEL",
"links": [
3
]
}
],
"properties": {
"cnr_id": "comfyui-florence2",
"ver": "00b63382966a444a9fefacb65b8deb188d12a458",
"Node name for S&R": "DownloadAndLoadFlorence2Model"
},
"widgets_values": [
"microsoft/Florence-2-base-ft",
"fp16",
"sdpa",
true
],
"color": "#232",
"bgcolor": "#353"
},
{
"id": 1,
"type": "Florence2Run",
"pos": [
674.4302630294422,
423.43518886551453
],
"size": [
313.6363636363636,
364
],
"flags": {},
"order": 2,
"mode": 0,
"inputs": [
{
"name": "image",
"type": "IMAGE",
"link": 1
},
{
"name": "florence2_model",
"type": "FL2MODEL",
"link": 3
}
],
"outputs": [
{
"name": "image",
"type": "IMAGE",
"links": []
},
{
"name": "mask",
"type": "MASK",
"links": []
},
{
"name": "caption",
"type": "STRING",
"links": [
13
]
},
{
"name": "data",
"type": "JSON",
"links": []
}
],
"properties": {
"cnr_id": "comfyui-florence2",
"ver": "00b63382966a444a9fefacb65b8deb188d12a458",
"Node name for S&R": "Florence2Run"
},
"widgets_values": [
"",
"detailed_caption",
true,
false,
1024,
3,
true,
"",
1234,
"fixed"
],
"color": "#232",
"bgcolor": "#353"
},
{
"id": 2,
"type": "LoadImage",
"pos": [
248.54931487603312,
423.43518886551453
],
"size": [
390.44371448863615,
395.81818181818187
],
"flags": {},
"order": 1,
"mode": 0,
"inputs": [],
"outputs": [
{
"name": "IMAGE",
"type": "IMAGE",
"links": [
1
]
},
{
"name": "MASK",
"type": "MASK",
"links": null
}
],
"properties": {
"cnr_id": "comfy-core",
"ver": "0.3.76",
"Node name for S&R": "LoadImage"
},
"widgets_values": [
"pasted/image (74).png",
"image"
]
},
{
"id": 12,
"type": "PreviewAny",
"pos": [
1025.4266881474668,
427.6300114135301
],
"size": [
297.27272727272725,
182.36363636363637
],
"flags": {},
"order": 3,
"mode": 0,
"inputs": [
{
"name": "source",
"type": "*",
"link": 13
}
],
"outputs": [],
"properties": {
"cnr_id": "comfy-core",
"ver": "0.3.76",
"Node name for S&R": "PreviewAny"
},
"widgets_values": [
null,
null,
false
]
}
],
"links": [
[
1,
2,
0,
1,
0,
"IMAGE"
],
[
3,
4,
0,
1,
1,
"FL2MODEL"
],
[
13,
1,
2,
12,
0,
"*"
]
],
"groups": [],
"config": {},
"extra": {
"ds": {
"scale": 1,
"offset": [
222.45068512396688,
-43.87996065705917
]
},
"frontendVersion": "1.34.6",
"VHS_latentpreview": false,
"VHS_latentpreviewrate": 0,
"VHS_MetadataImage": true,
"VHS_KeepIntermediate": true
},
"version": 0.4
}
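- caption: Gives a simple overview of the image.
- detailed_caption: Describes the composition and appearance in somewhat more detail.
If captions for prompts are all you need, however, a dedicated captioning model such as JoyCaption is far more flexible and produces higher-quality results.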
caption_to_phrase_grounding
For each phrase in the caption you provide, outputs the location of the matching object as a rectangle (bounding box).

{
"id": "063054af-873b-492c-a642-b59c68b22c0b",
"revision": 0,
"last_node_id": 11,
"last_link_id": 12,
"nodes": [
{
"id": 4,
"type": "DownloadAndLoadFlorence2Model",
"pos": [
349.41423462195155,
229.87996065705917
],
"size": [
286.86661124741727,
130
],
"flags": {},
"order": 0,
"mode": 0,
"inputs": [
{
"name": "lora",
"shape": 7,
"type": "PEFTLORA",
"link": null
}
],
"outputs": [
{
"name": "florence2_model",
"type": "FL2MODEL",
"links": [
3
]
}
],
"properties": {
"cnr_id": "comfyui-florence2",
"ver": "00b63382966a444a9fefacb65b8deb188d12a458",
"Node name for S&R": "DownloadAndLoadFlorence2Model"
},
"widgets_values": [
"microsoft/Florence-2-base-ft",
"fp16",
"sdpa",
true
],
"color": "#232",
"bgcolor": "#353"
},
{
"id": 1,
"type": "Florence2Run",
"pos": [
674.4302630294422,
423.43518886551453
],
"size": [
313.6363636363636,
364
],
"flags": {},
"order": 3,
"mode": 0,
"inputs": [
{
"name": "image",
"type": "IMAGE",
"link": 1
},
{
"name": "florence2_model",
"type": "FL2MODEL",
"link": 3
}
],
"outputs": [
{
"name": "image",
"type": "IMAGE",
"links": [
2
]
},
{
"name": "mask",
"type": "MASK",
"links": []
},
{
"name": "caption",
"type": "STRING",
"links": []
},
{
"name": "data",
"type": "JSON",
"links": [
7
]
}
],
"properties": {
"cnr_id": "comfyui-florence2",
"ver": "00b63382966a444a9fefacb65b8deb188d12a458",
"Node name for S&R": "Florence2Run"
},
"widgets_values": [
"fox",
"caption_to_phrase_grounding",
true,
false,
1024,
3,
true,
"",
1234,
"fixed"
],
"color": "#232",
"bgcolor": "#353"
},
{
"id": 3,
"type": "PreviewImage",
"pos": [
1023.5038603305788,
423.43518886551453
],
"size": [
419.6727272727271,
391.9818181818181
],
"flags": {},
"order": 4,
"mode": 0,
"inputs": [
{
"name": "images",
"type": "IMAGE",
"link": 2
}
],
"outputs": [],
"properties": {
"cnr_id": "comfy-core",
"ver": "0.3.76",
"Node name for S&R": "PreviewImage"
},
"widgets_values": []
},
{
"id": 10,
"type": "DownloadAndLoadSAM2Model",
"pos": [
1031.2774982383762,
876.8182919589856
],
"size": [
210,
130
],
"flags": {},
"order": 1,
"mode": 0,
"inputs": [],
"outputs": [
{
"name": "sam2_model",
"type": "SAM2MODEL",
"links": [
10
]
}
],
"properties": {
"cnr_id": "ComfyUI-segment-anything-2",
"ver": "0c35fff5f382803e2310103357b5e985f5437f32",
"Node name for S&R": "DownloadAndLoadSAM2Model"
},
"widgets_values": [
"sam2.1_hiera_base_plus.safetensors",
"single_image",
"cuda",
"fp16"
],
"color": "#323",
"bgcolor": "#535"
},
{
"id": 2,
"type": "LoadImage",
"pos": [
248.54931487603312,
423.43518886551453
],
"size": [
390.44371448863615,
395.81818181818187
],
"flags": {},
"order": 2,
"mode": 0,
"inputs": [],
"outputs": [
{
"name": "IMAGE",
"type": "IMAGE",
"links": [
1,
11
]
},
{
"name": "MASK",
"type": "MASK",
"links": null
}
],
"properties": {
"cnr_id": "comfy-core",
"ver": "0.3.76",
"Node name for S&R": "LoadImage"
},
"widgets_values": [
"pasted/image (73).png",
"image"
]
},
{
"id": 11,
"type": "MaskPreview",
"pos": [
1535.0502255111053,
980.9273828680758
],
"size": [
374.29999999999995,
323
],
"flags": {},
"order": 7,
"mode": 0,
"inputs": [
{
"name": "mask",
"type": "MASK",
"link": 12
}
],
"outputs": [],
"properties": {
"cnr_id": "comfy-core",
"ver": "0.3.76",
"Node name for S&R": "MaskPreview"
},
"widgets_values": []
},
{
"id": 8,
"type": "Florence2toCoordinates",
"pos": [
1030.8481877951024,
1066.5042611550825
],
"size": [
210,
102
],
"flags": {},
"order": 5,
"mode": 0,
"inputs": [
{
"name": "data",
"type": "JSON",
"link": 7
}
],
"outputs": [
{
"name": "center_coordinates",
"type": "STRING",
"links": [
8
]
},
{
"name": "bboxes",
"type": "BBOX",
"links": [
9
]
}
],
"properties": {
"cnr_id": "ComfyUI-segment-anything-2",
"ver": "0c35fff5f382803e2310103357b5e985f5437f32",
"Node name for S&R": "Florence2toCoordinates"
},
"widgets_values": [
"0",
false
],
"color": "#432",
"bgcolor": "#653"
},
{
"id": 9,
"type": "Sam2Segmentation",
"pos": [
1281.994151431467,
982.5618884278075
],
"size": [
212.087890625,
182
],
"flags": {},
"order": 6,
"mode": 0,
"inputs": [
{
"name": "sam2_model",
"type": "SAM2MODEL",
"link": 10
},
{
"name": "image",
"type": "IMAGE",
"link": 11
},
{
"name": "coordinates_positive",
"shape": 7,
"type": "STRING",
"link": 8
},
{
"name": "coordinates_negative",
"shape": 7,
"type": "STRING",
"link": null
},
{
"name": "bboxes",
"shape": 7,
"type": "BBOX",
"link": 9
},
{
"name": "mask",
"shape": 7,
"type": "MASK",
"link": null
}
],
"outputs": [
{
"name": "mask",
"type": "MASK",
"links": [
12
]
}
],
"properties": {
"cnr_id": "ComfyUI-segment-anything-2",
"ver": "0c35fff5f382803e2310103357b5e985f5437f32",
"Node name for S&R": "Sam2Segmentation"
},
"widgets_values": [
false,
false
],
"color": "#323",
"bgcolor": "#535"
}
],
"links": [
[
1,
2,
0,
1,
0,
"IMAGE"
],
[
2,
1,
0,
3,
0,
"IMAGE"
],
[
3,
4,
0,
1,
1,
"FL2MODEL"
],
[
7,
1,
3,
8,
0,
"JSON"
],
[
8,
8,
0,
9,
2,
"STRING"
],
[
9,
8,
1,
9,
4,
"BBOX"
],
[
10,
10,
0,
9,
0,
"SAM2MODEL"
],
[
11,
2,
0,
9,
1,
"IMAGE"
],
[
12,
9,
0,
11,
0,
"MASK"
]
],
"groups": [],
"config": {},
"extra": {
"ds": {
"scale": 0.8264462809917358,
"offset": [
-56.58931487603314,
-89.94996065705918
]
},
"frontendVersion": "1.34.6",
"VHS_latentpreview": false,
"VHS_latentpreviewrate": 0,
"VHS_MetadataImage": true,
"VHS_KeepIntermediate": true
},
"version": 0.4
}
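- It can locate objects even from slightly more complex instructions such as "left tree" or "red car".
- By extracting the coordinates with the 🟨 Florence2 Coordinates node and combining them with a segmentation model such as SAM2, you can, for example, turn only a specific object into a mask.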
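To illustrate the idea of turning bounding boxes into point coordinates, here is a minimal sketch. It assumes the grounding result is a dict with "bboxes" given as [x1, y1, x2, y2] pixel values plus matching "labels", which is the shape of the post-processed grounding output described for Florence-2. The data actually passed between the ComfyUI nodes may be structured differently, and `bbox_centers` is a hypothetical helper, not the node's implementation.

```python
def bbox_centers(data):
    """Return the center point of each bounding box, paired with its label."""
    centers = []
    for label, (x1, y1, x2, y2) in zip(data["labels"], data["bboxes"]):
        # The midpoint of the box is a simple choice of point prompt.
        centers.append({"label": label, "x": int((x1 + x2) / 2), "y": int((y1 + y2) / 2)})
    return centers


# Hypothetical grounding result for the phrase "fox" (values made up for illustration).
data = {"bboxes": [[100.0, 50.0, 300.0, 250.0]], "labels": ["fox"]}
centers = bbox_centers(data)
print(centers)
```

A segmentation model can take either the box itself or a point inside it as a prompt; the workflow above wires both the center coordinates and the boxes into Sam2Segmentation.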
ocr
Reads the text in the image and outputs it as a string.

{
"id": "063054af-873b-492c-a642-b59c68b22c0b",
"revision": 0,
"last_node_id": 12,
"last_link_id": 13,
"nodes": [
{
"id": 4,
"type": "DownloadAndLoadFlorence2Model",
"pos": [
349.41423462195155,
229.87996065705917
],
"size": [
286.86661124741727,
130
],
"flags": {},
"order": 0,
"mode": 0,
"inputs": [
{
"name": "lora",
"shape": 7,
"type": "PEFTLORA",
"link": null
}
],
"outputs": [
{
"name": "florence2_model",
"type": "FL2MODEL",
"links": [
3
]
}
],
"properties": {
"cnr_id": "comfyui-florence2",
"ver": "00b63382966a444a9fefacb65b8deb188d12a458",
"Node name for S&R": "DownloadAndLoadFlorence2Model"
},
"widgets_values": [
"microsoft/Florence-2-base-ft",
"fp16",
"sdpa",
true
],
"color": "#232",
"bgcolor": "#353"
},
{
"id": 12,
"type": "PreviewAny",
"pos": [
1025.4266881474668,
427.6300114135301
],
"size": [
297.27272727272725,
182.36363636363637
],
"flags": {},
"order": 3,
"mode": 0,
"inputs": [
{
"name": "source",
"type": "*",
"link": 13
}
],
"outputs": [],
"properties": {
"cnr_id": "comfy-core",
"ver": "0.3.76",
"Node name for S&R": "PreviewAny"
},
"widgets_values": [
null,
null,
null
]
},
{
"id": 2,
"type": "LoadImage",
"pos": [
248.54931487603312,
423.43518886551453
],
"size": [
390.44371448863615,
395.81818181818187
],
"flags": {},
"order": 1,
"mode": 0,
"inputs": [],
"outputs": [
{
"name": "IMAGE",
"type": "IMAGE",
"links": [
1
]
},
{
"name": "MASK",
"type": "MASK",
"links": null
}
],
"properties": {
"cnr_id": "comfy-core",
"ver": "0.3.76",
"Node name for S&R": "LoadImage"
},
"widgets_values": [
"pasted/image (75).png",
"image"
]
},
{
"id": 1,
"type": "Florence2Run",
"pos": [
674.4302630294422,
423.43518886551453
],
"size": [
313.6363636363636,
364
],
"flags": {},
"order": 2,
"mode": 0,
"inputs": [
{
"name": "image",
"type": "IMAGE",
"link": 1
},
{
"name": "florence2_model",
"type": "FL2MODEL",
"link": 3
}
],
"outputs": [
{
"name": "image",
"type": "IMAGE",
"links": []
},
{
"name": "mask",
"type": "MASK",
"links": []
},
{
"name": "caption",
"type": "STRING",
"links": [
13
]
},
{
"name": "data",
"type": "JSON",
"links": []
}
],
"properties": {
"cnr_id": "comfyui-florence2",
"ver": "00b63382966a444a9fefacb65b8deb188d12a458",
"Node name for S&R": "Florence2Run"
},
"widgets_values": [
"",
"ocr",
true,
false,
1024,
3,
true,
"",
1234,
"fixed"
],
"color": "#232",
"bgcolor": "#353"
}
],
"links": [
[
1,
2,
0,
1,
0,
"IMAGE"
],
[
3,
4,
0,
1,
1,
"FL2MODEL"
],
[
13,
1,
2,
12,
0,
"*"
]
],
"groups": [],
"config": {},
"extra": {
"ds": {
"scale": 1.2100000000000006,
"offset": [
-148.54931487603312,
-129.87996065705917
]
},
"frontendVersion": "1.34.6",
"VHS_latentpreview": false,
"VHS_latentpreviewrate": 0,
"VHS_MetadataImage": true,
"VHS_KeepIntermediate": true
},
"version": 0.4
}
docvqa
A task that answers questions about the image. The question is entered as the node's text input.

{
"id": "063054af-873b-492c-a642-b59c68b22c0b",
"revision": 0,
"last_node_id": 12,
"last_link_id": 13,
"nodes": [
{
"id": 4,
"type": "DownloadAndLoadFlorence2Model",
"pos": [
349.41423462195155,
229.87996065705917
],
"size": [
286.86661124741727,
130
],
"flags": {},
"order": 0,
"mode": 0,
"inputs": [
{
"name": "lora",
"shape": 7,
"type": "PEFTLORA",
"link": null
}
],
"outputs": [
{
"name": "florence2_model",
"type": "FL2MODEL",
"links": [
3
]
}
],
"properties": {
"cnr_id": "comfyui-florence2",
"ver": "00b63382966a444a9fefacb65b8deb188d12a458",
"Node name for S&R": "DownloadAndLoadFlorence2Model"
},
"widgets_values": [
"microsoft/Florence-2-base-ft",
"fp16",
"sdpa",
true
],
"color": "#232",
"bgcolor": "#353"
},
{
"id": 12,
"type": "PreviewAny",
"pos": [
1025.4266881474668,
427.6300114135301
],
"size": [
297.27272727272725,
182.36363636363637
],
"flags": {},
"order": 3,
"mode": 0,
"inputs": [
{
"name": "source",
"type": "*",
"link": 13
}
],
"outputs": [],
"properties": {
"cnr_id": "comfy-core",
"ver": "0.3.76",
"Node name for S&R": "PreviewAny"
},
"widgets_values": [
null,
null,
null
]
},
{
"id": 2,
"type": "LoadImage",
"pos": [
248.54931487603312,
423.43518886551453
],
"size": [
390.44371448863615,
395.81818181818187
],
"flags": {},
"order": 1,
"mode": 0,
"inputs": [],
"outputs": [
{
"name": "IMAGE",
"type": "IMAGE",
"links": [
1
]
},
{
"name": "MASK",
"type": "MASK",
"links": null
}
],
"properties": {
"cnr_id": "comfy-core",
"ver": "0.3.76",
"Node name for S&R": "LoadImage"
},
"widgets_values": [
"pasted/image (76).png",
"image"
]
},
{
"id": 1,
"type": "Florence2Run",
"pos": [
674.4302630294422,
423.43518886551453
],
"size": [
313.6363636363636,
364
],
"flags": {},
"order": 2,
"mode": 0,
"inputs": [
{
"name": "image",
"type": "IMAGE",
"link": 1
},
{
"name": "florence2_model",
"type": "FL2MODEL",
"link": 3
}
],
"outputs": [
{
"name": "image",
"type": "IMAGE",
"links": []
},
{
"name": "mask",
"type": "MASK",
"links": []
},
{
"name": "caption",
"type": "STRING",
"links": [
13
]
},
{
"name": "data",
"type": "JSON",
"links": []
}
],
"properties": {
"cnr_id": "comfyui-florence2",
"ver": "00b63382966a444a9fefacb65b8deb188d12a458",
"Node name for S&R": "Florence2Run"
},
"widgets_values": [
"How many eggs are on the ramen?",
"docvqa",
true,
false,
1024,
3,
true,
"",
1234,
"fixed"
],
"color": "#232",
"bgcolor": "#353"
}
],
"links": [
[
1,
2,
0,
1,
0,
"IMAGE"
],
[
3,
4,
0,
1,
1,
"FL2MODEL"
],
[
13,
1,
2,
12,
0,
"*"
]
],
"groups": [],
"config": {},
"extra": {
"ds": {
"scale": 1.2100000000000006,
"offset": [
-148.54931487603312,
-129.87996065705917
]
},
"frontendVersion": "1.34.6",
"VHS_latentpreview": false,
"VHS_latentpreviewrate": 0,
"VHS_MetadataImage": true,
"VHS_KeepIntermediate": true
},
"version": 0.4
}
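- You can ask questions such as "Where is X in this image?" or "What is the value in this table?" and receive the answer as text.
- The usage is similar to giving ChatGPT an image and asking a question about it.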