Florence-2

What is Florence-2?

Florence-2 is a general-purpose VLM (vision-language model): given an image, a single model can handle many tasks, including caption generation, object detection, segmentation, and OCR.

This page focuses on the four tasks used most often in ComfyUI: caption generation, object detection (coordinate extraction), OCR, and question answering about an image.

Custom node

kijai/ComfyUI-Florence2
- The model is downloaded automatically the first time the workflow runs.

Florence2Run node

Florence2Run is the main node: it runs a Florence-2 task on the input image. Switching the task setting selects which function is used, such as caption generation, object detection, or OCR.
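
In the workflow files exported on this page, the task choice is stored in the Florence2Run node's widgets_values list. The following is a minimal sketch for checking which task a workflow uses. It assumes the positions seen in those files (index 0 is the text input, index 1 is the task name); the helper name florence2_tasks and the trimmed sample workflow are illustrative only, not part of the custom node.

```python
import json

# Trimmed-down workflow containing only the fields this sketch reads.
WORKFLOW_JSON = """
{
  "nodes": [
    {"id": 4, "type": "DownloadAndLoadFlorence2Model",
     "widgets_values": ["microsoft/Florence-2-base-ft", "fp16", "sdpa", true]},
    {"id": 1, "type": "Florence2Run",
     "widgets_values": ["fox", "caption_to_phrase_grounding", true, false,
                        1024, 3, true, "", 1234, "fixed"]}
  ]
}
"""


def florence2_tasks(workflow: dict) -> list[tuple[int, str, str]]:
    """Return (node id, task, text input) for every Florence2Run node.

    Assumption: widgets_values[0] is the text input and widgets_values[1]
    is the task name, as in the workflows exported on this page.
    """
    result = []
    for node in workflow.get("nodes", []):
        if node.get("type") == "Florence2Run":
            values = node.get("widgets_values", [])
            if len(values) >= 2:
                result.append((node["id"], values[1], values[0]))
    return result


workflow = json.loads(WORKFLOW_JSON)
for node_id, task, text in florence2_tasks(workflow):
    print(f"node {node_id}: task={task!r}, text={text!r}")
```

Widget positions can shift between versions of a custom node, so treat the indices as something to verify against your own exported file.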

caption, detailed caption

Generates a natural-language caption from the image.

Florence2-detailed_caption.json

{
  "id": "063054af-873b-492c-a642-b59c68b22c0b",
  "revision": 0,
  "last_node_id": 12,
  "last_link_id": 13,
  "nodes": [
    {
      "id": 4,
      "type": "DownloadAndLoadFlorence2Model",
      "pos": [
        349.41423462195155,
        229.87996065705917
      ],
      "size": [
        286.86661124741727,
        130
      ],
      "flags": {},
      "order": 0,
      "mode": 0,
      "inputs": [
        {
          "name": "lora",
          "shape": 7,
          "type": "PEFTLORA",
          "link": null
        }
      ],
      "outputs": [
        {
          "name": "florence2_model",
          "type": "FL2MODEL",
          "links": [
            3
          ]
        }
      ],
      "properties": {
        "cnr_id": "comfyui-florence2",
        "ver": "00b63382966a444a9fefacb65b8deb188d12a458",
        "Node name for S&R": "DownloadAndLoadFlorence2Model"
      },
      "widgets_values": [
        "microsoft/Florence-2-base-ft",
        "fp16",
        "sdpa",
        true
      ],
      "color": "#232",
      "bgcolor": "#353"
    },
    {
      "id": 1,
      "type": "Florence2Run",
      "pos": [
        674.4302630294422,
        423.43518886551453
      ],
      "size": [
        313.6363636363636,
        364
      ],
      "flags": {},
      "order": 2,
      "mode": 0,
      "inputs": [
        {
          "name": "image",
          "type": "IMAGE",
          "link": 1
        },
        {
          "name": "florence2_model",
          "type": "FL2MODEL",
          "link": 3
        }
      ],
      "outputs": [
        {
          "name": "image",
          "type": "IMAGE",
          "links": []
        },
        {
          "name": "mask",
          "type": "MASK",
          "links": []
        },
        {
          "name": "caption",
          "type": "STRING",
          "links": [
            13
          ]
        },
        {
          "name": "data",
          "type": "JSON",
          "links": []
        }
      ],
      "properties": {
        "cnr_id": "comfyui-florence2",
        "ver": "00b63382966a444a9fefacb65b8deb188d12a458",
        "Node name for S&R": "Florence2Run"
      },
      "widgets_values": [
        "",
        "detailed_caption",
        true,
        false,
        1024,
        3,
        true,
        "",
        1234,
        "fixed"
      ],
      "color": "#232",
      "bgcolor": "#353"
    },
    {
      "id": 2,
      "type": "LoadImage",
      "pos": [
        248.54931487603312,
        423.43518886551453
      ],
      "size": [
        390.44371448863615,
        395.81818181818187
      ],
      "flags": {},
      "order": 1,
      "mode": 0,
      "inputs": [],
      "outputs": [
        {
          "name": "IMAGE",
          "type": "IMAGE",
          "links": [
            1
          ]
        },
        {
          "name": "MASK",
          "type": "MASK",
          "links": null
        }
      ],
      "properties": {
        "cnr_id": "comfy-core",
        "ver": "0.3.76",
        "Node name for S&R": "LoadImage"
      },
      "widgets_values": [
        "pasted/image (74).png",
        "image"
      ]
    },
    {
      "id": 12,
      "type": "PreviewAny",
      "pos": [
        1025.4266881474668,
        427.6300114135301
      ],
      "size": [
        297.27272727272725,
        182.36363636363637
      ],
      "flags": {},
      "order": 3,
      "mode": 0,
      "inputs": [
        {
          "name": "source",
          "type": "*",
          "link": 13
        }
      ],
      "outputs": [],
      "properties": {
        "cnr_id": "comfy-core",
        "ver": "0.3.76",
        "Node name for S&R": "PreviewAny"
      },
      "widgets_values": [
        null,
        null,
        false
      ]
    }
  ],
  "links": [
    [
      1,
      2,
      0,
      1,
      0,
      "IMAGE"
    ],
    [
      3,
      4,
      0,
      1,
      1,
      "FL2MODEL"
    ],
    [
      13,
      1,
      2,
      12,
      0,
      "*"
    ]
  ],
  "groups": [],
  "config": {},
  "extra": {
    "ds": {
      "scale": 1,
      "offset": [
        222.45068512396688,
        -43.87996065705917
      ]
    },
    "frontendVersion": "1.34.6",
    "VHS_latentpreview": false,
    "VHS_latentpreviewrate": 0,
    "VHS_MetadataImage": true,
    "VHS_KeepIntermediate": true
  },
  "version": 0.4
}
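
The wiring of this workflow lives in its links array, where each entry reads as [link id, source node, source output slot, target node, target input slot, type]. The sketch below prints those connections in a readable form. It uses a reduced copy of the workflow that keeps only node types and slot names; describe_links is a hypothetical helper name.

```python
# Reduced copy of the caption workflow: node types and slot names only.
workflow = {
    "nodes": [
        {"id": 4, "type": "DownloadAndLoadFlorence2Model",
         "inputs": [{"name": "lora"}],
         "outputs": [{"name": "florence2_model"}]},
        {"id": 1, "type": "Florence2Run",
         "inputs": [{"name": "image"}, {"name": "florence2_model"}],
         "outputs": [{"name": "image"}, {"name": "mask"},
                     {"name": "caption"}, {"name": "data"}]},
        {"id": 2, "type": "LoadImage", "inputs": [],
         "outputs": [{"name": "IMAGE"}, {"name": "MASK"}]},
        {"id": 12, "type": "PreviewAny",
         "inputs": [{"name": "source"}], "outputs": []},
    ],
    "links": [
        [1, 2, 0, 1, 0, "IMAGE"],
        [3, 4, 0, 1, 1, "FL2MODEL"],
        [13, 1, 2, 12, 0, "*"],
    ],
}


def describe_links(workflow: dict) -> list[str]:
    """Render each link as 'SourceType.output -> TargetType.input'."""
    nodes = {node["id"]: node for node in workflow["nodes"]}
    lines = []
    for _link_id, src_id, src_slot, dst_id, dst_slot, _type in workflow["links"]:
        src, dst = nodes[src_id], nodes[dst_id]
        src_name = src["outputs"][src_slot]["name"]
        dst_name = dst["inputs"][dst_slot]["name"]
        lines.append(f'{src["type"]}.{src_name} -> {dst["type"]}.{dst_name}')
    return lines


for line in describe_links(workflow):
    print(line)
```

Read this way, the caption workflow is a straight line: the image and the model both feed Florence2Run, and its caption output goes to a preview node.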

caption
- Gives a simple overview of the image.
detailed caption
- Describes the composition and appearance in somewhat more detail.

However, if all you need is a caption to use as a prompt, a dedicated captioning model such as JoyCaption gives far more flexible and higher-quality results.

caption_to_phrase_grounding

For each phrase in the caption you supply, this task outputs the object's position as a rectangle (bounding box).

Florence2-caption_to_phrase_grounding.json

{
  "id": "063054af-873b-492c-a642-b59c68b22c0b",
  "revision": 0,
  "last_node_id": 11,
  "last_link_id": 12,
  "nodes": [
    {
      "id": 4,
      "type": "DownloadAndLoadFlorence2Model",
      "pos": [
        349.41423462195155,
        229.87996065705917
      ],
      "size": [
        286.86661124741727,
        130
      ],
      "flags": {},
      "order": 0,
      "mode": 0,
      "inputs": [
        {
          "name": "lora",
          "shape": 7,
          "type": "PEFTLORA",
          "link": null
        }
      ],
      "outputs": [
        {
          "name": "florence2_model",
          "type": "FL2MODEL",
          "links": [
            3
          ]
        }
      ],
      "properties": {
        "cnr_id": "comfyui-florence2",
        "ver": "00b63382966a444a9fefacb65b8deb188d12a458",
        "Node name for S&R": "DownloadAndLoadFlorence2Model"
      },
      "widgets_values": [
        "microsoft/Florence-2-base-ft",
        "fp16",
        "sdpa",
        true
      ],
      "color": "#232",
      "bgcolor": "#353"
    },
    {
      "id": 1,
      "type": "Florence2Run",
      "pos": [
        674.4302630294422,
        423.43518886551453
      ],
      "size": [
        313.6363636363636,
        364
      ],
      "flags": {},
      "order": 3,
      "mode": 0,
      "inputs": [
        {
          "name": "image",
          "type": "IMAGE",
          "link": 1
        },
        {
          "name": "florence2_model",
          "type": "FL2MODEL",
          "link": 3
        }
      ],
      "outputs": [
        {
          "name": "image",
          "type": "IMAGE",
          "links": [
            2
          ]
        },
        {
          "name": "mask",
          "type": "MASK",
          "links": []
        },
        {
          "name": "caption",
          "type": "STRING",
          "links": []
        },
        {
          "name": "data",
          "type": "JSON",
          "links": [
            7
          ]
        }
      ],
      "properties": {
        "cnr_id": "comfyui-florence2",
        "ver": "00b63382966a444a9fefacb65b8deb188d12a458",
        "Node name for S&R": "Florence2Run"
      },
      "widgets_values": [
        "fox",
        "caption_to_phrase_grounding",
        true,
        false,
        1024,
        3,
        true,
        "",
        1234,
        "fixed"
      ],
      "color": "#232",
      "bgcolor": "#353"
    },
    {
      "id": 3,
      "type": "PreviewImage",
      "pos": [
        1023.5038603305788,
        423.43518886551453
      ],
      "size": [
        419.6727272727271,
        391.9818181818181
      ],
      "flags": {},
      "order": 4,
      "mode": 0,
      "inputs": [
        {
          "name": "images",
          "type": "IMAGE",
          "link": 2
        }
      ],
      "outputs": [],
      "properties": {
        "cnr_id": "comfy-core",
        "ver": "0.3.76",
        "Node name for S&R": "PreviewImage"
      },
      "widgets_values": []
    },
    {
      "id": 10,
      "type": "DownloadAndLoadSAM2Model",
      "pos": [
        1031.2774982383762,
        876.8182919589856
      ],
      "size": [
        210,
        130
      ],
      "flags": {},
      "order": 1,
      "mode": 0,
      "inputs": [],
      "outputs": [
        {
          "name": "sam2_model",
          "type": "SAM2MODEL",
          "links": [
            10
          ]
        }
      ],
      "properties": {
        "cnr_id": "ComfyUI-segment-anything-2",
        "ver": "0c35fff5f382803e2310103357b5e985f5437f32",
        "Node name for S&R": "DownloadAndLoadSAM2Model"
      },
      "widgets_values": [
        "sam2.1_hiera_base_plus.safetensors",
        "single_image",
        "cuda",
        "fp16"
      ],
      "color": "#323",
      "bgcolor": "#535"
    },
    {
      "id": 2,
      "type": "LoadImage",
      "pos": [
        248.54931487603312,
        423.43518886551453
      ],
      "size": [
        390.44371448863615,
        395.81818181818187
      ],
      "flags": {},
      "order": 2,
      "mode": 0,
      "inputs": [],
      "outputs": [
        {
          "name": "IMAGE",
          "type": "IMAGE",
          "links": [
            1,
            11
          ]
        },
        {
          "name": "MASK",
          "type": "MASK",
          "links": null
        }
      ],
      "properties": {
        "cnr_id": "comfy-core",
        "ver": "0.3.76",
        "Node name for S&R": "LoadImage"
      },
      "widgets_values": [
        "pasted/image (73).png",
        "image"
      ]
    },
    {
      "id": 11,
      "type": "MaskPreview",
      "pos": [
        1535.0502255111053,
        980.9273828680758
      ],
      "size": [
        374.29999999999995,
        323
      ],
      "flags": {},
      "order": 7,
      "mode": 0,
      "inputs": [
        {
          "name": "mask",
          "type": "MASK",
          "link": 12
        }
      ],
      "outputs": [],
      "properties": {
        "cnr_id": "comfy-core",
        "ver": "0.3.76",
        "Node name for S&R": "MaskPreview"
      },
      "widgets_values": []
    },
    {
      "id": 8,
      "type": "Florence2toCoordinates",
      "pos": [
        1030.8481877951024,
        1066.5042611550825
      ],
      "size": [
        210,
        102
      ],
      "flags": {},
      "order": 5,
      "mode": 0,
      "inputs": [
        {
          "name": "data",
          "type": "JSON",
          "link": 7
        }
      ],
      "outputs": [
        {
          "name": "center_coordinates",
          "type": "STRING",
          "links": [
            8
          ]
        },
        {
          "name": "bboxes",
          "type": "BBOX",
          "links": [
            9
          ]
        }
      ],
      "properties": {
        "cnr_id": "ComfyUI-segment-anything-2",
        "ver": "0c35fff5f382803e2310103357b5e985f5437f32",
        "Node name for S&R": "Florence2toCoordinates"
      },
      "widgets_values": [
        "0",
        false
      ],
      "color": "#432",
      "bgcolor": "#653"
    },
    {
      "id": 9,
      "type": "Sam2Segmentation",
      "pos": [
        1281.994151431467,
        982.5618884278075
      ],
      "size": [
        212.087890625,
        182
      ],
      "flags": {},
      "order": 6,
      "mode": 0,
      "inputs": [
        {
          "name": "sam2_model",
          "type": "SAM2MODEL",
          "link": 10
        },
        {
          "name": "image",
          "type": "IMAGE",
          "link": 11
        },
        {
          "name": "coordinates_positive",
          "shape": 7,
          "type": "STRING",
          "link": 8
        },
        {
          "name": "coordinates_negative",
          "shape": 7,
          "type": "STRING",
          "link": null
        },
        {
          "name": "bboxes",
          "shape": 7,
          "type": "BBOX",
          "link": 9
        },
        {
          "name": "mask",
          "shape": 7,
          "type": "MASK",
          "link": null
        }
      ],
      "outputs": [
        {
          "name": "mask",
          "type": "MASK",
          "links": [
            12
          ]
        }
      ],
      "properties": {
        "cnr_id": "ComfyUI-segment-anything-2",
        "ver": "0c35fff5f382803e2310103357b5e985f5437f32",
        "Node name for S&R": "Sam2Segmentation"
      },
      "widgets_values": [
        false,
        false
      ],
      "color": "#323",
      "bgcolor": "#535"
    }
  ],
  "links": [
    [
      1,
      2,
      0,
      1,
      0,
      "IMAGE"
    ],
    [
      2,
      1,
      0,
      3,
      0,
      "IMAGE"
    ],
    [
      3,
      4,
      0,
      1,
      1,
      "FL2MODEL"
    ],
    [
      7,
      1,
      3,
      8,
      0,
      "JSON"
    ],
    [
      8,
      8,
      0,
      9,
      2,
      "STRING"
    ],
    [
      9,
      8,
      1,
      9,
      4,
      "BBOX"
    ],
    [
      10,
      10,
      0,
      9,
      0,
      "SAM2MODEL"
    ],
    [
      11,
      2,
      0,
      9,
      1,
      "IMAGE"
    ],
    [
      12,
      9,
      0,
      11,
      0,
      "MASK"
    ]
  ],
  "groups": [],
  "config": {},
  "extra": {
    "ds": {
      "scale": 0.8264462809917358,
      "offset": [
        -56.58931487603314,
        -89.94996065705918
      ]
    },
    "frontendVersion": "1.34.6",
    "VHS_latentpreview": false,
    "VHS_latentpreviewrate": 0,
    "VHS_MetadataImage": true,
    "VHS_KeepIntermediate": true
  },
  "version": 0.4
}

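Its strength is that it can locate objects even from slightly more complex phrases such as "left tree" or "red car".
🟨 By extracting the coordinates with the Florence2 Coordinates node and combining them with a segmentation model such as SAM2, you can turn just one specific object into a mask.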
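
To make the coordinate step concrete, here is a minimal sketch of reducing bounding boxes to center points, the kind of value that can serve as a positive point prompt for a segmentation model. It assumes each box is given as [x1, y1, x2, y2] in pixels; the sample values and the helper name bbox_centers are made up for illustration, and this is not the actual code of the Florence2 Coordinates node.

```python
def bbox_centers(bboxes: list[list[float]]) -> list[tuple[float, float]]:
    """Return the (x, y) center of each [x1, y1, x2, y2] box."""
    return [((x1 + x2) / 2, (y1 + y2) / 2) for x1, y1, x2, y2 in bboxes]


# Hypothetical grounding result for the phrase "fox" (values are invented).
sample = {"bboxes": [[120.0, 80.0, 420.0, 380.0]], "labels": ["fox"]}

for label, center in zip(sample["labels"], bbox_centers(sample["bboxes"])):
    print(label, center)
```

A center point alone can land on the background for oddly shaped objects, which is one reason the workflow above passes both the points and the boxes to the segmentation node.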

ocr

Reads the text in the image and outputs it as a string.

Florence2-ocr.json

{
  "id": "063054af-873b-492c-a642-b59c68b22c0b",
  "revision": 0,
  "last_node_id": 12,
  "last_link_id": 13,
  "nodes": [
    {
      "id": 4,
      "type": "DownloadAndLoadFlorence2Model",
      "pos": [
        349.41423462195155,
        229.87996065705917
      ],
      "size": [
        286.86661124741727,
        130
      ],
      "flags": {},
      "order": 0,
      "mode": 0,
      "inputs": [
        {
          "name": "lora",
          "shape": 7,
          "type": "PEFTLORA",
          "link": null
        }
      ],
      "outputs": [
        {
          "name": "florence2_model",
          "type": "FL2MODEL",
          "links": [
            3
          ]
        }
      ],
      "properties": {
        "cnr_id": "comfyui-florence2",
        "ver": "00b63382966a444a9fefacb65b8deb188d12a458",
        "Node name for S&R": "DownloadAndLoadFlorence2Model"
      },
      "widgets_values": [
        "microsoft/Florence-2-base-ft",
        "fp16",
        "sdpa",
        true
      ],
      "color": "#232",
      "bgcolor": "#353"
    },
    {
      "id": 12,
      "type": "PreviewAny",
      "pos": [
        1025.4266881474668,
        427.6300114135301
      ],
      "size": [
        297.27272727272725,
        182.36363636363637
      ],
      "flags": {},
      "order": 3,
      "mode": 0,
      "inputs": [
        {
          "name": "source",
          "type": "*",
          "link": 13
        }
      ],
      "outputs": [],
      "properties": {
        "cnr_id": "comfy-core",
        "ver": "0.3.76",
        "Node name for S&R": "PreviewAny"
      },
      "widgets_values": [
        null,
        null,
        null
      ]
    },
    {
      "id": 2,
      "type": "LoadImage",
      "pos": [
        248.54931487603312,
        423.43518886551453
      ],
      "size": [
        390.44371448863615,
        395.81818181818187
      ],
      "flags": {},
      "order": 1,
      "mode": 0,
      "inputs": [],
      "outputs": [
        {
          "name": "IMAGE",
          "type": "IMAGE",
          "links": [
            1
          ]
        },
        {
          "name": "MASK",
          "type": "MASK",
          "links": null
        }
      ],
      "properties": {
        "cnr_id": "comfy-core",
        "ver": "0.3.76",
        "Node name for S&R": "LoadImage"
      },
      "widgets_values": [
        "pasted/image (75).png",
        "image"
      ]
    },
    {
      "id": 1,
      "type": "Florence2Run",
      "pos": [
        674.4302630294422,
        423.43518886551453
      ],
      "size": [
        313.6363636363636,
        364
      ],
      "flags": {},
      "order": 2,
      "mode": 0,
      "inputs": [
        {
          "name": "image",
          "type": "IMAGE",
          "link": 1
        },
        {
          "name": "florence2_model",
          "type": "FL2MODEL",
          "link": 3
        }
      ],
      "outputs": [
        {
          "name": "image",
          "type": "IMAGE",
          "links": []
        },
        {
          "name": "mask",
          "type": "MASK",
          "links": []
        },
        {
          "name": "caption",
          "type": "STRING",
          "links": [
            13
          ]
        },
        {
          "name": "data",
          "type": "JSON",
          "links": []
        }
      ],
      "properties": {
        "cnr_id": "comfyui-florence2",
        "ver": "00b63382966a444a9fefacb65b8deb188d12a458",
        "Node name for S&R": "Florence2Run"
      },
      "widgets_values": [
        "",
        "ocr",
        true,
        false,
        1024,
        3,
        true,
        "",
        1234,
        "fixed"
      ],
      "color": "#232",
      "bgcolor": "#353"
    }
  ],
  "links": [
    [
      1,
      2,
      0,
      1,
      0,
      "IMAGE"
    ],
    [
      3,
      4,
      0,
      1,
      1,
      "FL2MODEL"
    ],
    [
      13,
      1,
      2,
      12,
      0,
      "*"
    ]
  ],
  "groups": [],
  "config": {},
  "extra": {
    "ds": {
      "scale": 1.2100000000000006,
      "offset": [
        -148.54931487603312,
        -129.87996065705917
      ]
    },
    "frontendVersion": "1.34.6",
    "VHS_latentpreview": false,
    "VHS_latentpreviewrate": 0,
    "VHS_MetadataImage": true,
    "VHS_KeepIntermediate": true
  },
  "version": 0.4
}

docvqa

This task answers questions about the image.

Florence2-docvqa.json

{
  "id": "063054af-873b-492c-a642-b59c68b22c0b",
  "revision": 0,
  "last_node_id": 12,
  "last_link_id": 13,
  "nodes": [
    {
      "id": 4,
      "type": "DownloadAndLoadFlorence2Model",
      "pos": [
        349.41423462195155,
        229.87996065705917
      ],
      "size": [
        286.86661124741727,
        130
      ],
      "flags": {},
      "order": 0,
      "mode": 0,
      "inputs": [
        {
          "name": "lora",
          "shape": 7,
          "type": "PEFTLORA",
          "link": null
        }
      ],
      "outputs": [
        {
          "name": "florence2_model",
          "type": "FL2MODEL",
          "links": [
            3
          ]
        }
      ],
      "properties": {
        "cnr_id": "comfyui-florence2",
        "ver": "00b63382966a444a9fefacb65b8deb188d12a458",
        "Node name for S&R": "DownloadAndLoadFlorence2Model"
      },
      "widgets_values": [
        "microsoft/Florence-2-base-ft",
        "fp16",
        "sdpa",
        true
      ],
      "color": "#232",
      "bgcolor": "#353"
    },
    {
      "id": 12,
      "type": "PreviewAny",
      "pos": [
        1025.4266881474668,
        427.6300114135301
      ],
      "size": [
        297.27272727272725,
        182.36363636363637
      ],
      "flags": {},
      "order": 3,
      "mode": 0,
      "inputs": [
        {
          "name": "source",
          "type": "*",
          "link": 13
        }
      ],
      "outputs": [],
      "properties": {
        "cnr_id": "comfy-core",
        "ver": "0.3.76",
        "Node name for S&R": "PreviewAny"
      },
      "widgets_values": [
        null,
        null,
        null
      ]
    },
    {
      "id": 2,
      "type": "LoadImage",
      "pos": [
        248.54931487603312,
        423.43518886551453
      ],
      "size": [
        390.44371448863615,
        395.81818181818187
      ],
      "flags": {},
      "order": 1,
      "mode": 0,
      "inputs": [],
      "outputs": [
        {
          "name": "IMAGE",
          "type": "IMAGE",
          "links": [
            1
          ]
        },
        {
          "name": "MASK",
          "type": "MASK",
          "links": null
        }
      ],
      "properties": {
        "cnr_id": "comfy-core",
        "ver": "0.3.76",
        "Node name for S&R": "LoadImage"
      },
      "widgets_values": [
        "pasted/image (76).png",
        "image"
      ]
    },
    {
      "id": 1,
      "type": "Florence2Run",
      "pos": [
        674.4302630294422,
        423.43518886551453
      ],
      "size": [
        313.6363636363636,
        364
      ],
      "flags": {},
      "order": 2,
      "mode": 0,
      "inputs": [
        {
          "name": "image",
          "type": "IMAGE",
          "link": 1
        },
        {
          "name": "florence2_model",
          "type": "FL2MODEL",
          "link": 3
        }
      ],
      "outputs": [
        {
          "name": "image",
          "type": "IMAGE",
          "links": []
        },
        {
          "name": "mask",
          "type": "MASK",
          "links": []
        },
        {
          "name": "caption",
          "type": "STRING",
          "links": [
            13
          ]
        },
        {
          "name": "data",
          "type": "JSON",
          "links": []
        }
      ],
      "properties": {
        "cnr_id": "comfyui-florence2",
        "ver": "00b63382966a444a9fefacb65b8deb188d12a458",
        "Node name for S&R": "Florence2Run"
      },
      "widgets_values": [
        "How many eggs are on the ramen?",
        "docvqa",
        true,
        false,
        1024,
        3,
        true,
        "",
        1234,
        "fixed"
      ],
      "color": "#232",
      "bgcolor": "#353"
    }
  ],
  "links": [
    [
      1,
      2,
      0,
      1,
      0,
      "IMAGE"
    ],
    [
      3,
      4,
      0,
      1,
      1,
      "FL2MODEL"
    ],
    [
      13,
      1,
      2,
      12,
      0,
      "*"
    ]
  ],
  "groups": [],
  "config": {},
  "extra": {
    "ds": {
      "scale": 1.2100000000000006,
      "offset": [
        -148.54931487603312,
        -129.87996065705917
      ]
    },
    "frontendVersion": "1.34.6",
    "VHS_latentpreview": false,
    "VHS_latentpreviewrate": 0,
    "VHS_MetadataImage": true,
    "VHS_KeepIntermediate": true
  },
  "version": 0.4
}

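You can ask questions such as "Where is X in this image?" or "What is the value in this table?" and receive the answer as text.
It is similar in spirit to sending an image to ChatGPT and asking a question about it.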
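
Comparing the ocr and docvqa workflows on this page, the Florence2Run settings differ only in the first two widget values: the question text and the task name. The sketch below switches a loaded workflow from one task to the other under that positional assumption. with_task is a hypothetical helper, and the sample dictionary keeps only the Florence2Run node.

```python
import copy


def with_task(workflow: dict, task: str, text: str = "") -> dict:
    """Return a copy of the workflow with every Florence2Run node retargeted.

    Assumption: widgets_values[0] is the text input and widgets_values[1]
    is the task name, as in the workflows exported on this page.
    """
    updated = copy.deepcopy(workflow)  # leave the caller's workflow untouched
    for node in updated["nodes"]:
        if node["type"] == "Florence2Run":
            node["widgets_values"][0] = text
            node["widgets_values"][1] = task
    return updated


# Florence2Run node as it appears in the ocr workflow.
ocr_workflow = {
    "nodes": [
        {"id": 1, "type": "Florence2Run",
         "widgets_values": ["", "ocr", True, False, 1024, 3, True,
                            "", 1234, "fixed"]},
    ]
}

vqa_workflow = with_task(ocr_workflow, "docvqa",
                         "How many eggs are on the ramen?")
print(vqa_workflow["nodes"][0]["widgets_values"][:2])
print(ocr_workflow["nodes"][0]["widgets_values"][:2])
```

Working on a deep copy keeps the original workflow reusable, which is convenient when generating several task variants from one base file.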
