What is Florence-2?

It is a general-purpose VLM (Vision-Language Model) that can handle multiple tasks — caption generation, object detection, segmentation, OCR, and more — with a single model, given only an image.

On this page, we will focus on four tasks often used in ComfyUI: "Caption Generation", "Object Detection (Coordinate Extraction)", "OCR", and "Q&A about Images".


Custom Node


Florence2Run Node

Florence2Run is the main node for running Florence-2 tasks on an input image. Switching the task parameter selects the function: caption generation, object detection, OCR, and so on.
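Under the hood, each task name corresponds to a special prompt token that Florence-2 was trained on; tasks that take user text (grounding, Q&A) append it after the token. The sketch below illustrates this mapping — the token strings follow the Florence-2 model card, and the exact names and behavior of the actual node may differ:

```python
# Hypothetical mapping from Florence2Run task names to Florence-2 prompt
# tokens (based on the Florence-2 model card, not this node's source code).
TASK_PROMPTS = {
    "caption": "<CAPTION>",
    "detailed_caption": "<DETAILED_CAPTION>",
    "more_detailed_caption": "<MORE_DETAILED_CAPTION>",
    "caption_to_phrase_grounding": "<CAPTION_TO_PHRASE_GROUNDING>",
    "ocr": "<OCR>",
    "docvqa": "<DocVQA>",
}

def build_prompt(task: str, text_input: str = "") -> str:
    """Build the text prompt sent to the model: the task token first,
    then the user text for tasks that accept one (grounding, docvqa)."""
    return TASK_PROMPTS[task] + text_input

print(build_prompt("caption"))                       # <CAPTION>
print(build_prompt("docvqa", "What is the total?"))  # <DocVQA>What is the total?
```

This is why captioning and OCR need no text input, while grounding and docvqa do: their prompt is incomplete without the caption or question appended.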

caption, detailed caption

Generates a natural language caption from the image.

Florence2-detailed_caption.json
  • caption
    • Gives a brief summary of the image.
  • detailed caption
    • Describes the composition and appearance in somewhat more detail.

That said, if all you need is a caption to use as a prompt, a caption-specialized model such as JoyCaption will produce far more flexible and higher-quality results.

caption_to_phrase_grounding

For each phrase in the supplied caption, it outputs the position of the matching object as a rectangle (bounding box).

Florence2-caption_to_phrase_grounding.json
  • It can locate objects even from moderately complex phrases such as "left tree" or "red car".
  • By extracting the coordinates with the 🟨 Florence2 Coordinates node and combining them with a segmentation model such as SAM2, you can mask only specific objects.
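Internally, Florence-2 emits each box as four `<loc_N>` tokens appended to the phrase, where N is a bin in 0–999 spanning the image width or height. A minimal parser for that raw output might look like this — the dequantization formula assumes the commonly described 1000-bin scheme, and the official processor's rounding may differ slightly:

```python
import re

def parse_grounding(raw: str, width: int, height: int):
    """Parse raw Florence-2 grounding output such as
    'red car<loc_100><loc_200><loc_500><loc_600>' into
    (label, [x1, y1, x2, y2]) pairs in pixel coordinates.

    Assumes each <loc_N> is a 0-999 bin over the image size,
    dequantized as (bin + 0.5) / 1000 * size.
    """
    results = []
    for m in re.finditer(r"([^<]+)((?:<loc_\d+>){4})", raw):
        label = m.group(1).strip()
        bins = [int(b) for b in re.findall(r"<loc_(\d+)>", m.group(2))]
        box = [
            (bins[0] + 0.5) / 1000 * width,   # x1
            (bins[1] + 0.5) / 1000 * height,  # y1
            (bins[2] + 0.5) / 1000 * width,   # x2
            (bins[3] + 0.5) / 1000 * height,  # y2
        ]
        results.append((label, box))
    return results

boxes = parse_grounding("red car<loc_100><loc_200><loc_500><loc_600>", 1000, 800)
print(boxes)  # [('red car', [100.5, 160.4, 500.5, 480.4])]
```

Boxes in this `[x1, y1, x2, y2]` pixel form are exactly what a downstream segmentation model such as SAM2 can take as box prompts.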

ocr

Reads the text in the image and outputs it as a string.

Florence2-ocr.json

docvqa

A task that answers questions about the image (DocVQA: Document Visual Question Answering).

Florence2-docvqa.json
  • You can ask questions such as "Where is XX in this image?" or "What value does this table show?" and receive the answer as text.
  • Think of it as similar to uploading an image to ChatGPT and asking questions about it.
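When the model's answer is decoded without skipping special tokens, it typically still carries the tokenizer's sentinel tokens. A small illustrative cleanup step — the token names `<s>`, `</s>`, and `<pad>` are the standard ones for this tokenizer family, but treat this as a sketch rather than the node's actual code:

```python
def clean_answer(generated_text: str) -> str:
    """Strip tokenizer sentinel tokens from a decoded Florence-2 answer.

    Assumes the standard <s>/</s>/<pad> special tokens; a real pipeline
    could instead decode with skip_special_tokens=True.
    """
    for tok in ("<s>", "</s>", "<pad>"):
        generated_text = generated_text.replace(tok, "")
    return generated_text.strip()

print(clean_answer("<s>The total is 42</s>"))  # The total is 42
```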