Tag & Caption Generation

What is Tag & Caption Generation?

It is a task to automatically add tags and descriptions (captions) from images.

It is used for creating datasets for LoRA and fine-tuning, and for generating prompts to create images similar to reference images.

Tag Generation (Tagger)

This automatically assigns Danbooru-style tags and genre labels.

WD Family Taggers

WD14-tagger / WD-tagger-v3 family

WD-tagger-v3.json

{
  "last_node_id": 11,
  "last_link_id": 10,
  "nodes": [
    {
      "id": 11,
      "type": "LoadImage",
      "pos": [
        135,
        145
      ],
      "size": [
        289.42388916015625,
        379.69337463378906
      ],
      "flags": {},
      "order": 0,
      "mode": 0,
      "outputs": [
        {
          "name": "IMAGE",
          "type": "IMAGE",
          "links": [
            10
          ],
          "shape": 3
        },
        {
          "name": "MASK",
          "type": "MASK",
          "links": null,
          "shape": 3
        }
      ],
      "properties": {
        "Node name for S&R": "LoadImage"
      },
      "widgets_values": [
        "yellow-dress (1).png",
        "image"
      ]
    },
    {
      "id": 10,
      "type": "WD14Tagger|pysssss",
      "pos": [
        461,
        145
      ],
      "size": [
        292.44875507447193,
        277.8525060634534
      ],
      "flags": {},
      "order": 1,
      "mode": 0,
      "inputs": [
        {
          "name": "image",
          "type": "IMAGE",
          "link": 10,
          "slot_index": 0
        }
      ],
      "outputs": [
        {
          "name": "STRING",
          "type": "STRING",
          "links": null,
          "shape": 6
        }
      ],
      "properties": {
        "Node name for S&R": "WD14Tagger|pysssss"
      },
      "widgets_values": [
        "wd-swinv2-tagger-v3",
        0.35,
        0.85,
        false,
        false,
        "",
        "1girl, solo, long_hair, looking_at_viewer, gloves, dress, sitting, full_body, yellow_eyes, monochrome, sleeveless, striped_clothes, from_side, sleeveless_dress, yellow_background, vertical-striped_clothes, limited_palette, striped_dress, yellow_theme, vertical-striped_dress"
      ]
    }
  ],
  "links": [
    [
      10,
      11,
      0,
      10,
      0,
      "IMAGE"
    ]
  ],
  "groups": [],
  "config": {},
  "extra": {
    "0246.VERSION": [
      0,
      0,
      4
    ]
  },
  "version": 0.4
}

Tagging models for illustration and anime images.
They provide very detailed tags such as characters, hair color, clothing, facial expressions, and composition.

JoyTagger

While the WD family is specialized for anime, this tagger supports more general images.
It is by the same author as JoyCaption mentioned later, which is convenient if you want to align tagging and caption generation with the same family of tools.

Local Caption Generation Models

When LLMs (VLMs) that could handle images were rare, many caption generation models running locally were proposed. Moondream, LLaVA family, InternLM-XComposer2-VL, etc., the list goes on.

Looking at them from current standards, many are tough in terms of the balance between accuracy, stability, and cost, and those worth introducing anew are becoming limited.

Here, I will list only those that are still relatively easy to use.

JoyCaption

JoyCaption
A lightweight model specialized for caption generation created by the same author as JoyTagger.
Unlike VLMs aiming for general-purpose use, it specializes in "Image → Description," so you don't need to be particular about prompts and can use it casually.
Being lightweight is the best part.

Qwen-2.5 / Qwen3-VL Family

As a lightweight local MLLM, this series can be said to be SoTA class at the moment.
It supports not only general caption generation but also slightly more in-depth instructions like "Make it a writing style suitable for training captions."
If you want to run an LVLM like ChatGPT locally, try using this for now.

APIs like ChatGPT / Gemini

Just like with prompt generation, using closed models via API is also a good choice.

Setup is very easy.
Japanese captions can be handled as is.
You can ask for post-processing together, like "Make it a slightly more technical description for LoRA training."

Setting up an MLLM is difficult, and computational costs tend to be high... Being able to use it casually is the happiest thing above all.

Reasons to Use Local Models

Not limited to LLMs, a big reason to use local models is whether they can handle NSFW data.

Public APIs often blur or reject NSFW content.
On the other hand, training datasets sometimes require "captions as they are" regardless of content.
Even local models are often censored.

I think this is the sole reason why WD family taggers and JoyCaption still maintain a certain demand.

If you need completely local operation or are creating a dataset including NSFW, please use these local models in combination.

Tag & Caption Generation

What is Tag & Caption Generation?

Tag Generation (Tagger)

WD Family Taggers

JoyTagger

Local Caption Generation Models

JoyCaption

Qwen-2.5 / Qwen3-VL Family

APIs like ChatGPT / Gemini

Reasons to Use Local Models

What is the JSON copy button?

This page has an issue!

Please explain more!

Thank you

Tag & Caption Generation

What is Tag & Caption Generation?

Tag Generation (Tagger)

WD Family Taggers

JoyTagger

Local Caption Generation Models

JoyCaption

Qwen-2.5 / Qwen3-VL Family

APIs like ChatGPT / Gemini

Reasons to Use Local Models

Related workflows