FLUX.2 [klein] Schematic LoRA

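Overview

A number of studies apply the prior knowledge of image generation models to CV tasks. Representative examples include Marigold, Lotus-2, and SDPose.

Although these build on pretrained image generation models, each one is ultimately designed for a single, specific task.

Now that instruction-based image editing models have become common, however, one study asks whether tasks such as depth estimation and segmentation from an image could simply be treated as image editing in a broad sense. That study is Vision Banana from Google DeepMind.

Inspired by this, I wondered whether something along the same lines could be done with FLUX.2 [klein], which is the motivation for this experiment.

I cannot claim SOTA performance, but I hope to show that even simple LoRA training lets you try behavior in the spirit of Vision Banana with a local model.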

Downloads

LoRA: nomadoor/flux-2-klein-9B-schematic-lora
Dataset: nomadoor/flux-2-klein-9B-schematic-dataset

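Task setup

Vision Banana mainly covers depth / normal / segmentation.

Rather than using exactly the same task lineup, I chose mostly outputs of the ControlNet Preprocessor kind, which are familiar to the ComfyUI and image generation community.

I personally call these tasks image2schematic, and every LoRA created here has the word schematic in its name.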

task	output
relative depth	near = white / far = black
normal map	RGB normal map
pose body	OpenPose-style body skeleton
pose full	body + hands + face
binary segmentation	visible region mask
amodal segmentation	mask that includes occluded parts

amodal segmentation

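Of these, amodal segmentation may be an unfamiliar task to some readers.

Ordinary segmentation masks only the visible region of the target.
amodal segmentation estimates the shape of the whole target, including parts hidden by occluders, and masks all of it.

For example, if a branch passes in front of a deer, ordinary segmentation does not output the part hidden behind the branch.
amodal segmentation, by contrast, outputs a mask of the entire deer, including the part hidden by the branch.

Because the invisible part has to be estimated, this is closer to generation than to plain classification.
Put the other way around, it is a task where an image generation model should be able to show its strengths, so I decided to take it on.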
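The difference between the two mask types can be shown on a tiny toy grid. This is only a minimal sketch with made-up arrays (`deer` and `branch` are hypothetical, not taken from the dataset): the amodal mask is the whole object, and the visible mask is the object minus whatever the occluder covers.

```python
import numpy as np

# Toy 4x6 scene (hypothetical data): True marks pixels that belong to each object.
deer = np.array([
    [0, 1, 1, 1, 1, 0],
    [0, 1, 1, 1, 1, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 1, 0, 0, 1, 0],
], dtype=bool)

branch = np.zeros_like(deer)
branch[:, 2] = True  # a vertical branch passing in front of the deer

amodal_mask = deer                      # whole object, hidden pixels included
visible_mask = deer & ~branch           # only pixels not covered by the branch
hidden_part = amodal_mask & ~visible_mask

print(int(amodal_mask.sum()), int(visible_mask.sum()), int(hidden_part.sum()))
```

In a real image the hidden part is unknown and has to be inferred by the model, which is what makes the task generative.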

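Per-task LoRA training

The original plan was to train all tasks as a single LoRA, but the tasks blended together internally and could not be switched reliably with the prompt alone.

For that reason, this release uses one LoRA per task.

Dataset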

task	positive	negative	total
depth	300	0	300
normal	300	0	300
pose body	300	30	330
pose full	300	30	330
binary segmentation	300	30	330
amodal segmentation	300	30	330

Depth / Normal

Images were taken from Open Images, and the teacher outputs were created with Lotus-2.

depth
- relative depth
- near = white
- far = black
normal
- RGB normal map

depth and normal use the same set of input images.
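The two target encodings above can be sketched as follows. This is an illustration under stated assumptions, not the script used for the dataset: it assumes a raw depth array where larger values are farther away and a normal array with components in [-1, 1]. Whether Lotus-2 output already follows these conventions is not stated here, and axis conventions for normal maps differ between tools. The function names are hypothetical.

```python
import numpy as np

def depth_to_image(depth: np.ndarray) -> np.ndarray:
    """Map raw depth (assumed: larger = farther) to uint8, near = white, far = black."""
    d = depth.astype(np.float64)
    span = d.max() - d.min()
    if span == 0:
        # Flat depth has no contrast to show; return a black image.
        return np.zeros(d.shape, dtype=np.uint8)
    norm = (d - d.min()) / span  # 0 = nearest, 1 = farthest
    return np.round((1.0 - norm) * 255).astype(np.uint8)

def normal_to_image(normals: np.ndarray) -> np.ndarray:
    """Map normal components in [-1, 1] to RGB values in [0, 255]."""
    n = np.clip(normals.astype(np.float64), -1.0, 1.0)
    return np.round((n + 1.0) * 0.5 * 255).astype(np.uint8)

depth_img = depth_to_image(np.array([[1.0, 2.0, 3.0]]))
normal_img = normal_to_image(np.array([[[0.0, 0.0, 1.0]]]))
```

Because relative depth is normalized per image, the resulting gray levels are only comparable within one image, not across images.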

Pose

Images of people were taken from Open Images, and the teacher outputs were created with DWPose.

pose body
pose full

Candidate images were reviewed by eye, and crowd shots and images whose output was clearly broken were excluded.

Amodal Segmentation

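amodal segmentation is the task of producing a mask that covers not only the visible part of the target object but also its hidden parts.

I had neither an existing dataset nor a teacher that could generate these masks directly, so I built the data by combining image generation and image editing.

Procedure:

1. Use GPT-5.5 to write prompts for occlusion scenes that contain a clear subject and an occluder that hides it naturally
2. Generate the source image with Z-Image-Turbo
3. Have GPT-5.5 check the source image
4. Remove the occluder with FLUX.2 [klein] 9B image edit
5. Segment the target object in the occluder-free image with SAM 3.1
6. Pair the source image with the complete-object mask
7. Review by eye
8. Refine with BiRefNet and by hand

The masks were unstable with SAM 3.1 alone, so nearly all of them were corrected with BiRefNet and manual work.

As an aside, when you have an LLM mass-produce image generation prompts, the results converge on:

- similar subjects
- similar occluders
- similar compositions

To counter this, I showed it random Open Images pictures as inspiration to increase scene variation.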

Binary Segmentation

The source images made for amodal segmentation were reused.

The region of the target object that is actually visible in the input image was segmented with SAM 3.1 and then refined by hand.
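Since the binary and amodal masks share the same source images, one simple consistency check is that the visible mask should lie inside the amodal mask. The sketch below is an assumption about how such a check could look, not something described in the article; the function name and the threshold of 127 for 8-bit masks are hypothetical choices.

```python
import numpy as np

def visible_outside_amodal(visible: np.ndarray, amodal: np.ndarray) -> float:
    """Return the fraction of visible-mask pixels that fall outside the amodal mask."""
    v = visible > 127  # binarize 8-bit masks (threshold is an assumption)
    a = amodal > 127
    total = int(v.sum())
    if total == 0:
        return 0.0  # empty visible mask: nothing can be outside
    return float((v & ~a).sum()) / total

visible = np.array([[255, 255, 0, 0]], dtype=np.uint8)
amodal_ok = np.array([[255, 255, 255, 0]], dtype=np.uint8)
amodal_bad = np.array([[255, 0, 0, 0]], dtype=np.uint8)

print(visible_outside_amodal(visible, amodal_ok))
print(visible_outside_amodal(visible, amodal_bad))
```

A pair with a clearly non-zero value would be a candidate for another round of manual refinement.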

Negative Samples

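For both pose and segmentation, hallucination is a problem when the target is not present in the input image.

For example, if you give an instruction such as generate mask of the cat even though there is no cat in the input image, the model may make up a cat-shaped mask.

To address this, I added a small number of negative pairs with an all-black target.

segmentation
- If a target that is not in the image is specified, return an all-black mask
- Example: request the amodal mask of cat when the cond image shows only a giraffe
pose
- For input images with no people, return an all-black pose image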
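A negative pair of this kind can be sketched as below. This is a minimal illustration: the dictionary layout and the function name are hypothetical and do not reflect the actual dataset format, while the caption follows the amodal segmentation prompt used in the workflow.

```python
import numpy as np

def make_negative_pair(cond_image: np.ndarray, absent_target: str) -> dict:
    """Pair a cond image with an all-black target for an object that is not in it."""
    height, width = cond_image.shape[:2]
    return {
        "cond": cond_image,
        # All-black RGB target: "nothing to segment".
        "target": np.zeros((height, width, 3), dtype=np.uint8),
        "caption": (
            f"Generate an amodal segmentation mask of the {absent_target} "
            "in the input image."
        ),
    }

# Stand-in for a photo that shows only a giraffe (uniform gray placeholder).
giraffe_photo = np.full((4, 6, 3), 200, dtype=np.uint8)
pair = make_negative_pair(giraffe_photo, "cat")
```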

At this scale, however, I could not confirm a clear improvement. For pose in particular, it may actually have made training less stable.

Training

Training was done with AI-Toolkit.

item	value
base model	`black-forest-labs/FLUX.2-klein-base-9B`
architecture	`flux2_klein_9b`
LoRA rank	linear 32 / conv 16
optimizer	`adamw8bit`
lr	`5e-5`
dtype	`bf16`
quantization	transformer / text encoder: `qfloat8`
batch size	4
text encoder	frozen
caption dropout	0.05
EMA	enabled

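To keep compute down, the 768 bucket was used as the default resolution.

Only pose full was trained with both the 768 and 1024 buckets, because fine detail in faces and hands matters there.

A checkpoint was saved every 100 steps.
I then ran each checkpoint in ComfyUI and picked the step that looked best. All of the LoRAs converged at around 2,000 to 2,500 steps.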
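To put those step counts in perspective, the number of passes over each dataset can be estimated from the batch size and dataset sizes given above. This is back-of-the-envelope arithmetic that assumes each step consumes exactly one batch of 4 with no gradient accumulation, which the article does not state.

```python
def epochs_seen(steps: int, batch_size: int, dataset_size: int) -> float:
    """Approximate number of passes over the dataset after a given number of steps."""
    return steps * batch_size / dataset_size

# Dataset sizes (300 or 330) and batch size 4 come from the tables above.
for dataset_size in (300, 330):
    for steps in (2000, 2500):
        print(dataset_size, steps, round(epochs_seen(steps, 4, dataset_size), 1))
```

Under that assumption, every image was seen roughly 24 to 33 times by the time the LoRAs converged.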

workflow

Below is a workflow for use in ComfyUI.

Note that a LoRA trained on FLUX.2 [klein] Base does not work well with the FLUX.2 [klein] Distilled model. Either use the Base model or apply the Distilled-to-Base difference LoRA (Klein 4B/9B Base to Turbo Lora).

Model downloads

The base model is FLUX.2 [klein].

LoRA

📂ComfyUI/
└── 📂models/
    └── 📂loras/
        ├── flux2-klein-schematic-relative-depth-lora.safetensors
        ├── flux2-klein-schematic-surface-normal-lora.safetensors
        ├── flux2-klein-schematic-body-pose-lora.safetensors
        ├── flux2-klein-schematic-full-pose-lora.safetensors
        ├── flux2-klein-schematic-binary-segmentation-lora.safetensors
        └── flux2-klein-schematic-amodal-segmentation-lora.safetensors

image edit Base

Flux.2-klein-base-9b_image-edit.json

{
  "id": "37b279c2-46a8-4e38-ae9d-efce5a7f30a1",
  "revision": 0,
  "last_node_id": 90,
  "last_link_id": 204,
  "nodes": [
    {
      "id": 49,
      "type": "ReferenceLatent",
      "pos": [
        1021.3615651604734,
        160.40068341159756
      ],
      "size": [
        204.134765625,
        46
      ],
      "flags": {},
      "order": 12,
      "mode": 0,
      "inputs": [
        {
          "name": "conditioning",
          "type": "CONDITIONING",
          "link": 70
        },
        {
          "name": "latent",
          "shape": 7,
          "type": "LATENT",
          "link": 75
        }
      ],
      "outputs": [
        {
          "name": "CONDITIONING",
          "type": "CONDITIONING",
          "links": [
            71
          ]
        }
      ],
      "properties": {
        "cnr_id": "comfy-core",
        "ver": "0.9.2",
        "Node name for S&R": "ReferenceLatent"
      },
      "widgets_values": [],
      "color": "#232",
      "bgcolor": "#353"
    },
    {
      "id": 52,
      "type": "VAEEncode",
      "pos": [
        821.9650317382822,
        579.5488468801722
      ],
      "size": [
        162,
        46
      ],
      "flags": {},
      "order": 11,
      "mode": 0,
      "inputs": [
        {
          "name": "pixels",
          "type": "IMAGE",
          "link": 149
        },
        {
          "name": "vae",
          "type": "VAE",
          "link": 76
        }
      ],
      "outputs": [
        {
          "name": "LATENT",
          "type": "LATENT",
          "links": [
            74,
            75,
            84
          ]
        }
      ],
      "properties": {
        "cnr_id": "comfy-core",
        "ver": "0.9.2",
        "Node name for S&R": "VAEEncode"
      },
      "widgets_values": [],
      "color": "#232",
      "bgcolor": "#353"
    },
    {
      "id": 55,
      "type": "ReferenceLatent",
      "pos": [
        1021.3615651604734,
        387.24938745307793
      ],
      "size": [
        204.134765625,
        46
      ],
      "flags": {},
      "order": 13,
      "mode": 0,
      "inputs": [
        {
          "name": "conditioning",
          "type": "CONDITIONING",
          "link": 83
        },
        {
          "name": "latent",
          "shape": 7,
          "type": "LATENT",
          "link": 84
        }
      ],
      "outputs": [
        {
          "name": "CONDITIONING",
          "type": "CONDITIONING",
          "links": [
            85
          ]
        }
      ],
      "properties": {
        "cnr_id": "comfy-core",
        "ver": "0.9.2",
        "Node name for S&R": "ReferenceLatent"
      },
      "widgets_values": [],
      "color": "#232",
      "bgcolor": "#353"
    },
    {
      "id": 44,
      "type": "CLIPLoader",
      "pos": [
        224.49879211425772,
        292.66704483032197
      ],
      "size": [
        283.80000000000007,
        106
      ],
      "flags": {},
      "order": 0,
      "mode": 0,
      "inputs": [],
      "outputs": [
        {
          "name": "CLIP",
          "type": "CLIP",
          "links": [
            63,
            64
          ]
        }
      ],
      "properties": {
        "cnr_id": "comfy-core",
        "ver": "0.3.73",
        "Node name for S&R": "CLIPLoader"
      },
      "widgets_values": [
        "qwen_3_8b.safetensors",
        "flux2",
        "default"
      ],
      "color": "#432",
      "bgcolor": "#653"
    },
    {
      "id": 43,
      "type": "VAELoader",
      "pos": [
        520.2735954205992,
        751.5950256347664
      ],
      "size": [
        269.8313103058076,
        58
      ],
      "flags": {},
      "order": 1,
      "mode": 0,
      "inputs": [],
      "outputs": [
        {
          "name": "VAE",
          "type": "VAE",
          "links": [
            62,
            76
          ]
        }
      ],
      "properties": {
        "cnr_id": "comfy-core",
        "ver": "0.3.39",
        "Node name for S&R": "VAELoader"
      },
      "widgets_values": [
        "flux2-vae.safetensors"
      ],
      "color": "#322",
      "bgcolor": "#533"
    },
    {
      "id": 51,
      "type": "ResizeImageMaskNode",
      "pos": [
        213.01819139432666,
        579.5488468801722
      ],
      "size": [
        270,
        106
      ],
      "flags": {},
      "order": 9,
      "mode": 0,
      "inputs": [
        {
          "name": "input",
          "type": "IMAGE,MASK",
          "link": 196
        }
      ],
      "outputs": [
        {
          "name": "resized",
          "type": "IMAGE",
          "links": [
            82
          ]
        }
      ],
      "properties": {
        "cnr_id": "comfy-core",
        "ver": "0.9.2",
        "Node name for S&R": "ResizeImageMaskNode"
      },
      "widgets_values": [
        "scale total pixels",
        1,
        "nearest-exact"
      ],
      "color": "#232",
      "bgcolor": "#353"
    },
    {
      "id": 33,
      "type": "CLIPTextEncode",
      "pos": [
        558.700000000001,
        387.24938745307793
      ],
      "size": [
        425.2650317382812,
        122.99611236572264
      ],
      "flags": {
        "collapsed": false
      },
      "order": 7,
      "mode": 0,
      "inputs": [
        {
          "name": "clip",
          "type": "CLIP",
          "link": 64
        }
      ],
      "outputs": [
        {
          "name": "CONDITIONING",
          "type": "CONDITIONING",
          "slot_index": 0,
          "links": [
            83
          ]
        }
      ],
      "title": "CLIP Text Encode (Negative Prompt)",
      "properties": {
        "cnr_id": "comfy-core",
        "ver": "0.3.39",
        "Node name for S&R": "CLIPTextEncode"
      },
      "widgets_values": [
        "text, worst quality, blurry, ugly"
      ]
    },
    {
      "id": 48,
      "type": "UNETLoader",
      "pos": [
        500.7576387664047,
        0.22617585350672265
      ],
      "size": [
        308.1592787377913,
        82
      ],
      "flags": {},
      "order": 2,
      "mode": 0,
      "inputs": [],
      "outputs": [
        {
          "name": "MODEL",
          "type": "MODEL",
          "links": [
            140
          ]
        }
      ],
      "properties": {
        "cnr_id": "comfy-core",
        "ver": "0.9.2",
        "Node name for S&R": "UNETLoader"
      },
      "widgets_values": [
        "Flux.2/flux-2-klein-base-9b-fp8.safetensors",
        "default"
      ],
      "color": "#323",
      "bgcolor": "#535"
    },
    {
      "id": 45,
      "type": "SaveImage",
      "pos": [
        1827.035886825295,
        160.40068341159756
      ],
      "size": [
        672.8774941199636,
        835.2894627396925
      ],
      "flags": {},
      "order": 16,
      "mode": 0,
      "inputs": [
        {
          "name": "images",
          "type": "IMAGE",
          "link": 65
        }
      ],
      "outputs": [],
      "properties": {
        "cnr_id": "comfy-core",
        "ver": "0.3.73"
      },
      "widgets_values": [
        "ComfyUI"
      ]
    },
    {
      "id": 8,
      "type": "VAEDecode",
      "pos": [
        1615.039087801503,
        160.40068341159756
      ],
      "size": [
        161.111083984375,
        46
      ],
      "flags": {},
      "order": 15,
      "mode": 0,
      "inputs": [
        {
          "name": "samples",
          "type": "LATENT",
          "link": 52
        },
        {
          "name": "vae",
          "type": "VAE",
          "link": 62
        }
      ],
      "outputs": [
        {
          "name": "IMAGE",
          "type": "IMAGE",
          "slot_index": 0,
          "links": [
            65
          ]
        }
      ],
      "properties": {
        "cnr_id": "comfy-core",
        "ver": "0.3.39",
        "Node name for S&R": "VAEDecode"
      },
      "widgets_values": []
    },
    {
      "id": 54,
      "type": "ResizeImageMaskNode",
      "pos": [
        517.4916115663044,
        579.5488468801722
      ],
      "size": [
        270,
        106
      ],
      "flags": {},
      "order": 10,
      "mode": 0,
      "inputs": [
        {
          "name": "input",
          "type": "IMAGE,MASK",
          "link": 82
        }
      ],
      "outputs": [
        {
          "name": "resized",
          "type": "IMAGE",
          "links": [
            149
          ]
        }
      ],
      "properties": {
        "cnr_id": "comfy-core",
        "ver": "0.9.2",
        "Node name for S&R": "ResizeImageMaskNode"
      },
      "widgets_values": [
        "scale to multiple",
        16,
        "nearest-exact"
      ],
      "color": "#232",
      "bgcolor": "#353"
    },
    {
      "id": 50,
      "type": "LoadImage",
      "pos": [
        -139.92346878942647,
        579.5488468801722
      ],
      "size": [
        324.77206892429183,
        437.2170640540477
      ],
      "flags": {},
      "order": 3,
      "mode": 0,
      "inputs": [],
      "outputs": [
        {
          "name": "IMAGE",
          "type": "IMAGE",
          "links": [
            196
          ]
        },
        {
          "name": "MASK",
          "type": "MASK",
          "links": []
        }
      ],
      "properties": {
        "cnr_id": "comfy-core",
        "ver": "0.9.2",
        "Node name for S&R": "LoadImage"
      },
      "widgets_values": [
        "000144_00001_.png",
        "image"
      ],
      "color": "#232",
      "bgcolor": "#353"
    },
    {
      "id": 84,
      "type": "MarkdownNote",
      "pos": [
        12.952843572078905,
        -258.08006758864195
      ],
      "size": [
        425.4143199001238,
        450.1487253162147
      ],
      "flags": {},
      "order": 4,
      "mode": 0,
      "inputs": [],
      "outputs": [],
      "properties": {},
      "widgets_values": [
        "# Workflow Prompts\n\n## Relative Depth\n\n```text\nGenerate a relative depth map of the input image.\n```\n\n## Surface Normal\n\n```text\nGenerate a surface normal map of the input image.\n```\n\n## Body Pose\n\n```text\nGenerate a body pose map of all visible people in the input image.\n```\n\n## Full Pose\n\n```text\nGenerate a full pose map of all visible people in the input image.\n```\n\n## Binary Segmentation\n\n```text\nGenerate a binary segmentation mask of [target] in the input image.\n```\n\n## Amodal Segmentation\n\n```text\nGenerate an amodal segmentation mask of [target] in the input image.\n```"
      ],
      "color": "#432",
      "bgcolor": "#653"
    },
    {
      "id": 6,
      "type": "CLIPTextEncode",
      "pos": [
        558.700000000001,
        160.40068341159756
      ],
      "size": [
        425.2650317382812,
        167.9430462646484
      ],
      "flags": {},
      "order": 6,
      "mode": 0,
      "inputs": [
        {
          "name": "clip",
          "type": "CLIP",
          "link": 63
        }
      ],
      "outputs": [
        {
          "name": "CONDITIONING",
          "type": "CONDITIONING",
          "slot_index": 0,
          "links": [
            70
          ]
        }
      ],
      "title": "CLIP Text Encode (Positive Prompt)",
      "properties": {
        "cnr_id": "comfy-core",
        "ver": "0.3.39",
        "Node name for S&R": "CLIPTextEncode"
      },
      "widgets_values": [
        "Generate a relative depth map of the input image.\n"
      ]
    },
    {
      "id": 31,
      "type": "KSampler",
      "pos": [
        1262.7259909887628,
        160.40068341159756
      ],
      "size": [
        315,
        262
      ],
      "flags": {},
      "order": 14,
      "mode": 0,
      "inputs": [
        {
          "name": "model",
          "type": "MODEL",
          "link": 193
        },
        {
          "name": "positive",
          "type": "CONDITIONING",
          "link": 71
        },
        {
          "name": "negative",
          "type": "CONDITIONING",
          "link": 85
        },
        {
          "name": "latent_image",
          "type": "LATENT",
          "link": 74
        }
      ],
      "outputs": [
        {
          "name": "LATENT",
          "type": "LATENT",
          "slot_index": 0,
          "links": [
            52
          ]
        }
      ],
      "properties": {
        "cnr_id": "comfy-core",
        "ver": "0.3.39",
        "Node name for S&R": "KSampler"
      },
      "widgets_values": [
        1234,
        "fixed",
        20,
        5,
        "euler",
        "simple",
        1
      ]
    },
    {
      "id": 81,
      "type": "LoraLoaderModelOnly",
      "pos": [
        834.1673366934359,
        0.22617585350672265
      ],
      "size": [
        393.62824686843055,
        82
      ],
      "flags": {},
      "order": 8,
      "mode": 0,
      "inputs": [
        {
          "name": "model",
          "type": "MODEL",
          "link": 140
        }
      ],
      "outputs": [
        {
          "name": "MODEL",
          "type": "MODEL",
          "links": [
            193
          ]
        }
      ],
      "properties": {
        "cnr_id": "comfy-core",
        "ver": "0.20.1",
        "Node name for S&R": "LoraLoaderModelOnly"
      },
      "widgets_values": [
        "flux2-klein-schematic-relative-depth-lora.safetensors",
        0.8
      ]
    },
    {
      "id": 85,
      "type": "MarkdownNote",
      "pos": [
        -458.2999499066289,
        -259.54092129190855
      ],
      "size": [
        423.753902071114,
        543.7932327033926
      ],
      "flags": {},
      "order": 5,
      "mode": 0,
      "inputs": [],
      "outputs": [],
      "properties": {},
      "widgets_values": [
        "## models\n\n* diffusion_models\n\n  * [flux-2-klein-base-9b-fp8.safetensors](https://huggingface.co/black-forest-labs/FLUX.2-klein-base-9b-fp8/blob/main/flux-2-klein-base-9b-fp8.safetensors)\n* text_encoders\n\n  * [qwen_3_8b.safetensors](https://huggingface.co/Comfy-Org/vae-text-encorder-for-flux-klein-9b/blob/main/split_files/text_encoders/qwen_3_8b.safetensors)\n* vae\n\n  * [flux2-vae.safetensors](https://huggingface.co/Comfy-Org/vae-text-encorder-for-flux-klein-9b/blob/main/split_files/vae/flux2-vae.safetensors)\n\n\n```text\n📂ComfyUI/\n└── 📂models/\n    ├── 📂diffusion_models/\n    │   ├── flux-2-klein-9b-fp8.safetensors\n    │   └── flux-2-klein-base-9b-fp8.safetensors\n    ├── 📂text_encoders/\n    │   └── qwen_3_8b.safetensors\n    └── 📂vae/\n         └── flux2-vae.safetensors\n```\n\n### loras\n\n- loras\n  - [flux2-klein-schematic-relative-depth-lora.safetensors](https://huggingface.co/nomadoor/flux-2-klein-9B-schematic-lora/blob/main/loras/flux2-klein-schematic-relative-depth-lora.safetensors)\n  - [flux2-klein-schematic-surface-normal-lora.safetensors](https://huggingface.co/nomadoor/flux-2-klein-9B-schematic-lora/blob/main/loras/flux2-klein-schematic-surface-normal-lora.safetensors)\n  - [flux2-klein-schematic-body-pose-lora.safetensors](https://huggingface.co/nomadoor/flux-2-klein-9B-schematic-lora/blob/main/loras/flux2-klein-schematic-body-pose-lora.safetensors)\n  - [flux2-klein-schematic-full-pose-lora.safetensors](https://huggingface.co/nomadoor/flux-2-klein-9B-schematic-lora/blob/main/loras/flux2-klein-schematic-full-pose-lora.safetensors)\n  - [flux2-klein-schematic-binary-segmentation-lora.safetensors](https://huggingface.co/nomadoor/flux-2-klein-9B-schematic-lora/blob/main/loras/flux2-klein-schematic-binary-segmentation-lora.safetensors)\n  - [flux2-klein-schematic-amodal-segmentation-lora.safetensors](https://huggingface.co/nomadoor/flux-2-klein-9B-schematic-lora/blob/main/loras/flux2-klein-schematic-amodal-segmentation-lora.safetensors)\n\n```\n📂ComfyUI/\n└── 📂models/\n    └── 📂loras/\n        ├── flux2-klein-schematic-relative-depth-lora.safetensors\n        ├── flux2-klein-schematic-surface-normal-lora.safetensors\n        ├── flux2-klein-schematic-body-pose-lora.safetensors\n        ├── flux2-klein-schematic-full-pose-lora.safetensors\n        ├── flux2-klein-schematic-binary-segmentation-lora.safetensors\n        └── flux2-klein-schematic-amodal-segmentation-lora.safetensors\n```"
      ],
      "color": "#323",
      "bgcolor": "#535"
    }
  ],
  "links": [
    [
      52,
      31,
      0,
      8,
      0,
      "LATENT"
    ],
    [
      62,
      43,
      0,
      8,
      1,
      "VAE"
    ],
    [
      63,
      44,
      0,
      6,
      0,
      "CLIP"
    ],
    [
      64,
      44,
      0,
      33,
      0,
      "CLIP"
    ],
    [
      65,
      8,
      0,
      45,
      0,
      "IMAGE"
    ],
    [
      70,
      6,
      0,
      49,
      0,
      "CONDITIONING"
    ],
    [
      71,
      49,
      0,
      31,
      1,
      "CONDITIONING"
    ],
    [
      74,
      52,
      0,
      31,
      3,
      "LATENT"
    ],
    [
      75,
      52,
      0,
      49,
      1,
      "LATENT"
    ],
    [
      76,
      43,
      0,
      52,
      1,
      "VAE"
    ],
    [
      82,
      51,
      0,
      54,
      0,
      "IMAGE"
    ],
    [
      83,
      33,
      0,
      55,
      0,
      "CONDITIONING"
    ],
    [
      84,
      52,
      0,
      55,
      1,
      "LATENT"
    ],
    [
      85,
      55,
      0,
      31,
      2,
      "CONDITIONING"
    ],
    [
      140,
      48,
      0,
      81,
      0,
      "MODEL"
    ],
    [
      149,
      54,
      0,
      52,
      0,
      "IMAGE"
    ],
    [
      193,
      81,
      0,
      31,
      0,
      "MODEL"
    ],
    [
      196,
      50,
      0,
      51,
      0,
      "IMAGE"
    ]
  ],
  "groups": [],
  "config": {},
  "extra": {
    "ds": {
      "scale": 0.7627768444385989,
      "offset": [
        1370.6596805987829,
        510.4137858114659
      ]
    },
    "frontendVersion": "1.42.15",
    "VHS_latentpreview": false,
    "VHS_latentpreviewrate": 0,
    "VHS_MetadataImage": true,
    "VHS_KeepIntermediate": true
  },
  "version": 0.4
}
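Because each task has its own LoRA and its own prompt, switching tasks means changing two widget values in the workflow above. The following is a minimal sketch of doing that on the saved workflow JSON with plain Python. The helper name `set_task` and the `TASKS` table are hypothetical. The node types, the "Positive Prompt" title, the file names, and the prompts are taken from the workflow itself.

```python
import copy

# Task key -> prompt template; keys match the middle part of the LoRA file names.
TASKS = {
    "relative-depth": "Generate a relative depth map of the input image.",
    "surface-normal": "Generate a surface normal map of the input image.",
    "body-pose": "Generate a body pose map of all visible people in the input image.",
    "full-pose": "Generate a full pose map of all visible people in the input image.",
    "binary-segmentation": "Generate a binary segmentation mask of {target} in the input image.",
    "amodal-segmentation": "Generate an amodal segmentation mask of {target} in the input image.",
}

def set_task(workflow: dict, task: str, target: str = "") -> dict:
    """Return a copy of the workflow with the LoRA file and positive prompt swapped."""
    wf = copy.deepcopy(workflow)
    prompt = TASKS[task].format(target=target)
    for node in wf["nodes"]:
        if node["type"] == "LoraLoaderModelOnly":
            node["widgets_values"][0] = f"flux2-klein-schematic-{task}-lora.safetensors"
        elif node["type"] == "CLIPTextEncode" and "Positive" in node.get("title", ""):
            node["widgets_values"][0] = prompt
    return wf

# Reduced stand-in for the full workflow, keeping only the fields used here.
demo = {"nodes": [
    {"type": "LoraLoaderModelOnly",
     "widgets_values": ["flux2-klein-schematic-relative-depth-lora.safetensors", 0.8]},
    {"type": "CLIPTextEncode", "title": "CLIP Text Encode (Positive Prompt)",
     "widgets_values": ["Generate a relative depth map of the input image.\n"]},
    {"type": "CLIPTextEncode", "title": "CLIP Text Encode (Negative Prompt)",
     "widgets_values": ["text, worst quality, blurry, ugly"]},
]}
patched = set_task(demo, "amodal-segmentation", target="the bench")
```

With the real file, the same function could be applied to the result of `json.load` and written back with `json.dump` before loading it in ComfyUI.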

Hands-on tests

relative depth

Generate a relative depth map of the input image.

input	Depth Anything V2	FLUX.2 [klein] LoRA

normal map

Generate a surface normal map of the input image.

input	Lotus-2	FLUX.2 [klein] LoRA

pose body

Generate a body pose map of all visible people in the input image.

input	DWPose	FLUX.2 [klein] LoRA

pose full

Generate a full pose map of all visible people in the input image.

input	DWPose	FLUX.2 [klein] LoRA

binary segmentation

Generate a binary segmentation mask of the stretcher in the input image.
Generate a binary segmentation mask of the tuna sushi in the input image.
Generate a binary segmentation mask of all jars in the input image.

input	SAM 3.1	FLUX.2 [klein] LoRA

amodal segmentation

Generate an amodal segmentation mask of the woman in the input image.
Generate an amodal segmentation mask of the bench in the input image.
Generate an amodal segmentation mask of the steam locomotive in the input image.

input	SAM 3.1 visible mask	FLUX.2 [klein] LoRA

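Limitations and open issues

Depth/Normal

I used Lotus-2 as the teacher, but the LoRA also learns Lotus-2's noise as is.

As is common for this kind of task, I should also have considered synthetic data rendered from 3D models.

As an aside, before Lotus-2 I also trained on target images made with DSINE. DSINE produces much flatter normal maps than Lotus-2, and as a result the LoRA output became equally flat.

Because teacher quality shows up directly in the LoRA output, this was a fresh reminder of how much dataset quality matters.

pose

The fundamental problem is that, of the tasks covered here, pose is the least suited to being expressed as an RGB image.

Even if the model can output an OpenPose-style image, converting it back into keypoints is not easy, which makes it hard to use in practice. In addition, the colors and the number of bones are strictly fixed, so even small deviations are very noticeable.

I still expected it to be an easy task to learn, but the output broke down more than I imagined. Hallucinations on animal images and on images without people were not prevented either.

segmentation

I had high hopes for the prompt understanding of the text encoder, Qwen3 8B, but control fell short of expectations.

The model can follow an instruction such as "remove the person who is ...", yet when the LoRA is applied and it is told to "segment the person who is ...", it fails or segments a different person.

So rather than a simple prompt-understanding problem, it may be that the model did not properly learn the segmentation task itself from the dataset.

As for fine cutout precision, I was really hoping for smooth, matting-like boundaries, but at present the results stay at roughly the coarseness of SAM 3.1.

Overall

Overall, the dataset is too small.

Creating the amodal segmentation dataset was very costly this time, so I aligned every task to roughly 300 images to match it.
To properly isolate the causes, though, I think about 2,000 to 3,000 images per task would have been needed.

Many improvements remain, but I have already spent too much budget and time, so I am stopping here for now.
If I get the chance, I would like to try again with a larger dataset.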

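Closing thoughts

Quality aside, even small-scale LoRA training was enough to teach FLUX.2 [klein] CV-task-style RGB outputs to some degree.

What really matters, however, is not "whether the CV tasks worked" but that, depending on what we choose to treat as image editing, the range of uses for image editing models can still grow considerably.

When people hear "image editing", style conversion and object removal come to mind first.
But producing CV-task-style outputs like these, or custom intermediate representations, can also be treated as image editing in a broad sense.

It is fun to watch image generation models, which used to only draw pictures, gradually look more like general-purpose vision models.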
