FLUX.2 [klein] Schematic LoRA

Overview

Many studies reuse the priors that image generation models learn during pretraining for computer vision (CV) tasks. Representative examples include Marigold, Lotus-2, and SDPose.

Although these methods build on pretrained image generation models, each one is ultimately designed for a single task.

Now that instruction-based image editing models have become common, another idea has appeared: maybe tasks such as depth estimation and segmentation can be treated, in a broad sense, as image editing. That is the idea behind Google DeepMind's Vision Banana.

Inspired by that direction, I wanted to see whether something similar could be done with FLUX.2 [klein].

The results are not SOTA-level. Still, I hope this experiment shows that even a simple LoRA can push a local model toward Vision Banana-like behavior.

Downloads

LoRA: nomadoor/flux-2-klein-9B-schematic-lora
Dataset: nomadoor/flux-2-klein-9B-schematic-dataset

Task Setup

Vision Banana mainly covers depth / normal / segmentation.

For this experiment, I did not use exactly the same task set. Instead, I chose outputs that are familiar in ComfyUI and the image generation community, roughly like ControlNet Preprocessor outputs.

Personally, I call these tasks image2schematic, and every LoRA created for this experiment includes the word schematic.

task	output
relative depth	near = white / far = black
normal map	RGB normal map
pose body	OpenPose-style body skeleton
pose full	body + hands + face
binary segmentation	visible region mask
amodal segmentation	mask including occluded parts

amodal segmentation

Some readers may not be familiar with amodal segmentation.

Regular segmentation masks only the visible region of the target.
Amodal segmentation estimates and masks the full shape of the target, including parts hidden behind occluders.

For example, if branches are blocking a deer, regular segmentation will not output the parts hidden behind the branches.
With amodal segmentation, the mask includes the full deer, including the hidden parts.

Because it has to infer invisible regions, this is closer to generation than simple classification.
In that sense, it is also a task where an image generation model may be able to show its strengths, so I decided to try it.
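The difference between the two mask types can be made concrete with a toy example. This is only a sketch on an invented 5x3 grid, not part of the actual dataset code: the visible mask is the amodal mask with the occluder's pixels removed.

```python
# Toy binary masks: 1 = pixel belongs to the mask, 0 = background.
# The grid and shapes are invented purely for illustration.

def visible_mask(amodal, occluder):
    """Visible region = full object minus the pixels covered by the occluder."""
    return [
        [a & (1 - o) for a, o in zip(a_row, o_row)]
        for a_row, o_row in zip(amodal, occluder)
    ]

# Full extent of the "deer" (what amodal segmentation should output).
deer_amodal = [
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 0, 1, 0],
]
# A "branch" crossing the middle row in front of the deer.
branch = [
    [0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 0, 0, 0],
]

# What regular (binary) segmentation should output.
deer_visible = visible_mask(deer_amodal, branch)
for row in deer_visible:
    print(row)
```

Going from the amodal mask to the visible mask is a simple subtraction. The reverse direction has no such formula, which is why the model has to infer the hidden middle row.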

One LoRA Per Task

At first, I planned to train all tasks into a single LoRA, but the tasks interfered with one another, and the prompt alone could not reliably switch between them.

So this experiment uses one LoRA per task.

Dataset

task	positive	negative	total
depth	300	0	300
normal	300	0	300
pose body	300	30	330
pose full	300	30	330
binary segmentation	300	30	330
amodal segmentation	300	30	330

Depth / Normal

Images were taken from Open Images, and the teacher outputs were created with Lotus-2.

depth
- relative depth
- near = white
- far = black
normal
- RGB normal map

Depth and normal use the same input image set.
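As a rough illustration of these two output conventions, the sketch below maps distances to a near = white grayscale and a unit normal to RGB. It assumes the common (n + 1) / 2 encoding for normals; the exact normalization and axis orientation used by Lotus-2 are not verified here, and the function names are made up for this example.

```python
def depth_to_gray(depths):
    """Map distances to 0-255 so that the nearest point is white (255)
    and the farthest is black (0). Only relative depth survives."""
    d_min, d_max = min(depths), max(depths)
    if d_max == d_min:
        # Flat scene: avoid division by zero (returning white is arbitrary).
        return [255 for _ in depths]
    return [round(255 * (d_max - d) / (d_max - d_min)) for d in depths]


def normal_to_rgb(normal):
    """Map a unit normal with components in [-1, 1] to 8-bit RGB."""
    return tuple(round((c + 1.0) / 2.0 * 255) for c in normal)


print(depth_to_gray([1.0, 2.0, 5.0]))   # [255, 191, 0]
print(normal_to_rgb((0.0, 0.0, 1.0)))   # (128, 128, 255)
```

A surface facing the camera therefore shows up as the light blue that dominates typical normal maps.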

Pose

Person images were taken from Open Images, and the teacher outputs were created with DWPose.

pose body
pose full

Candidate images were reviewed manually, and crowded images or obviously broken outputs were removed.

Amodal Segmentation

Amodal segmentation is a task that creates a mask for the whole object, including not only the visible area but also parts hidden by occluders.

I had neither an existing dataset nor a teacher model that could generate such masks directly, so I built the dataset by combining image generation and image editing.

Creation flow:

1. GPT-5.5 created prompts for occlusion scenes with a clear subject and a natural occluder
2. Z-Image-Turbo generated the source image
3. GPT-5.5 reviewed the source image
4. FLUX.2 [klein] 9B image edit removed the occluder
5. SAM 3.1 segmented the target object from the edited image
6. The source image and complete-object mask were paired
7. Manual review
8. Refinement with BiRefNet and manual editing

SAM 3.1 alone was unstable for these masks, so almost all of them were fixed with BiRefNet and manual edits.
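The creation flow above can be sketched as a small function. This is a structural sketch only: every stage is a hypothetical callable supplied by the caller, and no real API of GPT-5.5, Z-Image-Turbo, FLUX.2 [klein], SAM 3.1, or BiRefNet is shown. The point it illustrates is that the mask is computed from the edited image but paired with the original source image.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class AmodalPair:
    source_image: str   # image with the occluder still present (model input)
    amodal_mask: str    # mask of the complete object (training target)


def build_pair(
    write_prompt: Callable[[], str],
    generate: Callable[[str], str],
    review: Callable[[str], bool],
    remove_occluder: Callable[[str], str],
    segment: Callable[[str], str],
    refine: Callable[[str], str],
) -> Optional[AmodalPair]:
    """Run one sample through the flow; return None if review rejects it."""
    prompt = write_prompt()
    source = generate(prompt)
    if not review(source):
        return None
    edited = remove_occluder(source)      # occluder gone, full object visible
    mask = refine(segment(edited))        # mask of the now fully visible object
    return AmodalPair(source_image=source, amodal_mask=mask)


# Stub stages standing in for the real models (file names are placeholders).
pair = build_pair(
    write_prompt=lambda: "a deer behind branches",
    generate=lambda prompt: "source.png",
    review=lambda image: True,
    remove_occluder=lambda image: "edited.png",
    segment=lambda image: "mask_raw.png",
    refine=lambda mask: "mask_refined.png",
)
print(pair)
```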

As a side note, when an LLM generates image prompts in bulk, the results tend to converge toward:

- similar subjects
- similar occluders
- similar compositions

To avoid that, I showed the LLM randomly sampled Open Images examples as inspiration, which increased the scene variation.

Binary Segmentation

The source images created for amodal segmentation were reused.

The region of the target object that is actually visible in the input image was segmented with SAM 3.1 and refined manually.

Negative Samples

For both pose and segmentation, hallucination becomes a problem when the requested target is not present in the input image.

For example, if the input image contains no cat but the prompt asks for a mask of the cat, the model may invent a cat-shaped mask.

To address this, I added some negative pairs with all-black targets.

segmentation
- If the specified target does not exist in the image, return an all-black mask
- Example: asking for a cat amodal mask when the conditioning image only contains a giraffe
pose
- If no person appears in the input image, return an all-black pose image

However, at this scale, I could not confirm a clear improvement. For pose in particular, it may have made training less stable.
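A negative pair of this kind could be assembled roughly as follows. This is a minimal sketch: the dictionary keys and the function name are hypothetical, the mask is a plain nested list rather than an image file, and only the prompt wording follows the prompts used with these LoRAs.

```python
def make_negative_sample(image_path, target_phrase, width, height, amodal=True):
    """Pair an image with an all-black target for an object that is not in it."""
    kind = "an amodal" if amodal else "a binary"
    prompt = (
        f"Generate {kind} segmentation mask of "
        f"{target_phrase} in the input image."
    )
    # Every pixel is 0 (black): "nothing to segment".
    black_mask = [[0] * width for _ in range(height)]
    return {"image": image_path, "prompt": prompt, "target": black_mask}


# The image only contains a giraffe, but the prompt asks for a cat.
sample = make_negative_sample("giraffe.png", "the cat", width=4, height=3)
print(sample["prompt"])
```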

Training

Training was done with AI Toolkit.

item	value
base model	`black-forest-labs/FLUX.2-klein-base-9B`
architecture	`flux2_klein_9b`
LoRA rank	linear 32 / conv 16
optimizer	`adamw8bit`
lr	`5e-5`
dtype	`bf16`
quantization	transformer / text encoder: `qfloat8`
batch size	4
text encoder	frozen
caption dropout	0.05
EMA	enabled

To keep the compute cost down, I mostly trained with a 768 resolution bucket.

Only pose full was trained with both 768 and 1024 buckets, because fine detail in the face and hands matters more.

Checkpoints were saved every 100 steps.
I tested them in ComfyUI and picked the step that looked best. All LoRAs converged at around 2000-2500 steps.
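To put the step counts in perspective, they can be converted to approximate passes over each dataset. This is back-of-the-envelope arithmetic that assumes every step consumes one full batch of 4 and that there is no gradient accumulation, which the settings above do not state either way.

```python
def epochs_seen(steps, batch_size, dataset_size):
    """Approximate number of passes over the dataset after `steps` steps."""
    return steps * batch_size / dataset_size


for name, size in [("depth", 300), ("pose body", 330)]:
    low = epochs_seen(2000, 4, size)
    high = epochs_seen(2500, 4, size)
    print(f"{name}: {low:.1f} to {high:.1f} epochs")
# depth: 26.7 to 33.3 epochs
# pose body: 24.2 to 30.3 epochs
```

Under those assumptions, each image was seen roughly 25 to 35 times before the chosen checkpoint.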

Workflow

This is the workflow for using the LoRAs in ComfyUI.

Note that LoRAs trained on FLUX.2 [klein] Base do not work well with the FLUX.2 [klein] Distilled model. Use the Base model, or use a Base-to-Distilled difference LoRA such as Klein 4B/9B Base to Turbo Lora.

Model Download

The base is FLUX.2 [klein].

LoRA

📂ComfyUI/
└── 📂models/
    └── 📂loras/
        ├── flux2-klein-schematic-relative-depth-lora.safetensors
        ├── flux2-klein-schematic-surface-normal-lora.safetensors
        ├── flux2-klein-schematic-body-pose-lora.safetensors
        ├── flux2-klein-schematic-full-pose-lora.safetensors
        ├── flux2-klein-schematic-binary-segmentation-lora.safetensors
        └── flux2-klein-schematic-amodal-segmentation-lora.safetensors

image edit Base

Flux.2-klein-base-9b_image-edit.json

{
  "id": "37b279c2-46a8-4e38-ae9d-efce5a7f30a1",
  "revision": 0,
  "last_node_id": 90,
  "last_link_id": 204,
  "nodes": [
    {
      "id": 49,
      "type": "ReferenceLatent",
      "pos": [
        1021.3615651604734,
        160.40068341159756
      ],
      "size": [
        204.134765625,
        46
      ],
      "flags": {},
      "order": 12,
      "mode": 0,
      "inputs": [
        {
          "name": "conditioning",
          "type": "CONDITIONING",
          "link": 70
        },
        {
          "name": "latent",
          "shape": 7,
          "type": "LATENT",
          "link": 75
        }
      ],
      "outputs": [
        {
          "name": "CONDITIONING",
          "type": "CONDITIONING",
          "links": [
            71
          ]
        }
      ],
      "properties": {
        "cnr_id": "comfy-core",
        "ver": "0.9.2",
        "Node name for S&R": "ReferenceLatent"
      },
      "widgets_values": [],
      "color": "#232",
      "bgcolor": "#353"
    },
    {
      "id": 52,
      "type": "VAEEncode",
      "pos": [
        821.9650317382822,
        579.5488468801722
      ],
      "size": [
        162,
        46
      ],
      "flags": {},
      "order": 11,
      "mode": 0,
      "inputs": [
        {
          "name": "pixels",
          "type": "IMAGE",
          "link": 149
        },
        {
          "name": "vae",
          "type": "VAE",
          "link": 76
        }
      ],
      "outputs": [
        {
          "name": "LATENT",
          "type": "LATENT",
          "links": [
            74,
            75,
            84
          ]
        }
      ],
      "properties": {
        "cnr_id": "comfy-core",
        "ver": "0.9.2",
        "Node name for S&R": "VAEEncode"
      },
      "widgets_values": [],
      "color": "#232",
      "bgcolor": "#353"
    },
    {
      "id": 55,
      "type": "ReferenceLatent",
      "pos": [
        1021.3615651604734,
        387.24938745307793
      ],
      "size": [
        204.134765625,
        46
      ],
      "flags": {},
      "order": 13,
      "mode": 0,
      "inputs": [
        {
          "name": "conditioning",
          "type": "CONDITIONING",
          "link": 83
        },
        {
          "name": "latent",
          "shape": 7,
          "type": "LATENT",
          "link": 84
        }
      ],
      "outputs": [
        {
          "name": "CONDITIONING",
          "type": "CONDITIONING",
          "links": [
            85
          ]
        }
      ],
      "properties": {
        "cnr_id": "comfy-core",
        "ver": "0.9.2",
        "Node name for S&R": "ReferenceLatent"
      },
      "widgets_values": [],
      "color": "#232",
      "bgcolor": "#353"
    },
    {
      "id": 44,
      "type": "CLIPLoader",
      "pos": [
        224.49879211425772,
        292.66704483032197
      ],
      "size": [
        283.80000000000007,
        106
      ],
      "flags": {},
      "order": 0,
      "mode": 0,
      "inputs": [],
      "outputs": [
        {
          "name": "CLIP",
          "type": "CLIP",
          "links": [
            63,
            64
          ]
        }
      ],
      "properties": {
        "cnr_id": "comfy-core",
        "ver": "0.3.73",
        "Node name for S&R": "CLIPLoader"
      },
      "widgets_values": [
        "qwen_3_8b.safetensors",
        "flux2",
        "default"
      ],
      "color": "#432",
      "bgcolor": "#653"
    },
    {
      "id": 43,
      "type": "VAELoader",
      "pos": [
        520.2735954205992,
        751.5950256347664
      ],
      "size": [
        269.8313103058076,
        58
      ],
      "flags": {},
      "order": 1,
      "mode": 0,
      "inputs": [],
      "outputs": [
        {
          "name": "VAE",
          "type": "VAE",
          "links": [
            62,
            76
          ]
        }
      ],
      "properties": {
        "cnr_id": "comfy-core",
        "ver": "0.3.39",
        "Node name for S&R": "VAELoader"
      },
      "widgets_values": [
        "flux2-vae.safetensors"
      ],
      "color": "#322",
      "bgcolor": "#533"
    },
    {
      "id": 51,
      "type": "ResizeImageMaskNode",
      "pos": [
        213.01819139432666,
        579.5488468801722
      ],
      "size": [
        270,
        106
      ],
      "flags": {},
      "order": 9,
      "mode": 0,
      "inputs": [
        {
          "name": "input",
          "type": "IMAGE,MASK",
          "link": 196
        }
      ],
      "outputs": [
        {
          "name": "resized",
          "type": "IMAGE",
          "links": [
            82
          ]
        }
      ],
      "properties": {
        "cnr_id": "comfy-core",
        "ver": "0.9.2",
        "Node name for S&R": "ResizeImageMaskNode"
      },
      "widgets_values": [
        "scale total pixels",
        1,
        "nearest-exact"
      ],
      "color": "#232",
      "bgcolor": "#353"
    },
    {
      "id": 33,
      "type": "CLIPTextEncode",
      "pos": [
        558.700000000001,
        387.24938745307793
      ],
      "size": [
        425.2650317382812,
        122.99611236572264
      ],
      "flags": {
        "collapsed": false
      },
      "order": 7,
      "mode": 0,
      "inputs": [
        {
          "name": "clip",
          "type": "CLIP",
          "link": 64
        }
      ],
      "outputs": [
        {
          "name": "CONDITIONING",
          "type": "CONDITIONING",
          "slot_index": 0,
          "links": [
            83
          ]
        }
      ],
      "title": "CLIP Text Encode (Negative Prompt)",
      "properties": {
        "cnr_id": "comfy-core",
        "ver": "0.3.39",
        "Node name for S&R": "CLIPTextEncode"
      },
      "widgets_values": [
        "text, worst quality, blurry, ugly"
      ]
    },
    {
      "id": 48,
      "type": "UNETLoader",
      "pos": [
        500.7576387664047,
        0.22617585350672265
      ],
      "size": [
        308.1592787377913,
        82
      ],
      "flags": {},
      "order": 2,
      "mode": 0,
      "inputs": [],
      "outputs": [
        {
          "name": "MODEL",
          "type": "MODEL",
          "links": [
            140
          ]
        }
      ],
      "properties": {
        "cnr_id": "comfy-core",
        "ver": "0.9.2",
        "Node name for S&R": "UNETLoader"
      },
      "widgets_values": [
        "Flux.2/flux-2-klein-base-9b-fp8.safetensors",
        "default"
      ],
      "color": "#323",
      "bgcolor": "#535"
    },
    {
      "id": 45,
      "type": "SaveImage",
      "pos": [
        1827.035886825295,
        160.40068341159756
      ],
      "size": [
        672.8774941199636,
        835.2894627396925
      ],
      "flags": {},
      "order": 16,
      "mode": 0,
      "inputs": [
        {
          "name": "images",
          "type": "IMAGE",
          "link": 65
        }
      ],
      "outputs": [],
      "properties": {
        "cnr_id": "comfy-core",
        "ver": "0.3.73"
      },
      "widgets_values": [
        "ComfyUI"
      ]
    },
    {
      "id": 8,
      "type": "VAEDecode",
      "pos": [
        1615.039087801503,
        160.40068341159756
      ],
      "size": [
        161.111083984375,
        46
      ],
      "flags": {},
      "order": 15,
      "mode": 0,
      "inputs": [
        {
          "name": "samples",
          "type": "LATENT",
          "link": 52
        },
        {
          "name": "vae",
          "type": "VAE",
          "link": 62
        }
      ],
      "outputs": [
        {
          "name": "IMAGE",
          "type": "IMAGE",
          "slot_index": 0,
          "links": [
            65
          ]
        }
      ],
      "properties": {
        "cnr_id": "comfy-core",
        "ver": "0.3.39",
        "Node name for S&R": "VAEDecode"
      },
      "widgets_values": []
    },
    {
      "id": 54,
      "type": "ResizeImageMaskNode",
      "pos": [
        517.4916115663044,
        579.5488468801722
      ],
      "size": [
        270,
        106
      ],
      "flags": {},
      "order": 10,
      "mode": 0,
      "inputs": [
        {
          "name": "input",
          "type": "IMAGE,MASK",
          "link": 82
        }
      ],
      "outputs": [
        {
          "name": "resized",
          "type": "IMAGE",
          "links": [
            149
          ]
        }
      ],
      "properties": {
        "cnr_id": "comfy-core",
        "ver": "0.9.2",
        "Node name for S&R": "ResizeImageMaskNode"
      },
      "widgets_values": [
        "scale to multiple",
        16,
        "nearest-exact"
      ],
      "color": "#232",
      "bgcolor": "#353"
    },
    {
      "id": 50,
      "type": "LoadImage",
      "pos": [
        -139.92346878942647,
        579.5488468801722
      ],
      "size": [
        324.77206892429183,
        437.2170640540477
      ],
      "flags": {},
      "order": 3,
      "mode": 0,
      "inputs": [],
      "outputs": [
        {
          "name": "IMAGE",
          "type": "IMAGE",
          "links": [
            196
          ]
        },
        {
          "name": "MASK",
          "type": "MASK",
          "links": []
        }
      ],
      "properties": {
        "cnr_id": "comfy-core",
        "ver": "0.9.2",
        "Node name for S&R": "LoadImage"
      },
      "widgets_values": [
        "000144_00001_.png",
        "image"
      ],
      "color": "#232",
      "bgcolor": "#353"
    },
    {
      "id": 84,
      "type": "MarkdownNote",
      "pos": [
        12.952843572078905,
        -258.08006758864195
      ],
      "size": [
        425.4143199001238,
        450.1487253162147
      ],
      "flags": {},
      "order": 4,
      "mode": 0,
      "inputs": [],
      "outputs": [],
      "properties": {},
      "widgets_values": [
        "# Workflow Prompts\n\n## Relative Depth\n\n```text\nGenerate a relative depth map of the input image.\n```\n\n## Surface Normal\n\n```text\nGenerate a surface normal map of the input image.\n```\n\n## Body Pose\n\n```text\nGenerate a body pose map of all visible people in the input image.\n```\n\n## Full Pose\n\n```text\nGenerate a full pose map of all visible people in the input image.\n```\n\n## Binary Segmentation\n\n```text\nGenerate a binary segmentation mask of [target] in the input image.\n```\n\n## Amodal Segmentation\n\n```text\nGenerate an amodal segmentation mask of [target] in the input image.\n```"
      ],
      "color": "#432",
      "bgcolor": "#653"
    },
    {
      "id": 6,
      "type": "CLIPTextEncode",
      "pos": [
        558.700000000001,
        160.40068341159756
      ],
      "size": [
        425.2650317382812,
        167.9430462646484
      ],
      "flags": {},
      "order": 6,
      "mode": 0,
      "inputs": [
        {
          "name": "clip",
          "type": "CLIP",
          "link": 63
        }
      ],
      "outputs": [
        {
          "name": "CONDITIONING",
          "type": "CONDITIONING",
          "slot_index": 0,
          "links": [
            70
          ]
        }
      ],
      "title": "CLIP Text Encode (Positive Prompt)",
      "properties": {
        "cnr_id": "comfy-core",
        "ver": "0.3.39",
        "Node name for S&R": "CLIPTextEncode"
      },
      "widgets_values": [
        "Generate a relative depth map of the input image.\n"
      ]
    },
    {
      "id": 31,
      "type": "KSampler",
      "pos": [
        1262.7259909887628,
        160.40068341159756
      ],
      "size": [
        315,
        262
      ],
      "flags": {},
      "order": 14,
      "mode": 0,
      "inputs": [
        {
          "name": "model",
          "type": "MODEL",
          "link": 193
        },
        {
          "name": "positive",
          "type": "CONDITIONING",
          "link": 71
        },
        {
          "name": "negative",
          "type": "CONDITIONING",
          "link": 85
        },
        {
          "name": "latent_image",
          "type": "LATENT",
          "link": 74
        }
      ],
      "outputs": [
        {
          "name": "LATENT",
          "type": "LATENT",
          "slot_index": 0,
          "links": [
            52
          ]
        }
      ],
      "properties": {
        "cnr_id": "comfy-core",
        "ver": "0.3.39",
        "Node name for S&R": "KSampler"
      },
      "widgets_values": [
        1234,
        "fixed",
        20,
        5,
        "euler",
        "simple",
        1
      ]
    },
    {
      "id": 81,
      "type": "LoraLoaderModelOnly",
      "pos": [
        834.1673366934359,
        0.22617585350672265
      ],
      "size": [
        393.62824686843055,
        82
      ],
      "flags": {},
      "order": 8,
      "mode": 0,
      "inputs": [
        {
          "name": "model",
          "type": "MODEL",
          "link": 140
        }
      ],
      "outputs": [
        {
          "name": "MODEL",
          "type": "MODEL",
          "links": [
            193
          ]
        }
      ],
      "properties": {
        "cnr_id": "comfy-core",
        "ver": "0.20.1",
        "Node name for S&R": "LoraLoaderModelOnly"
      },
      "widgets_values": [
        "flux2-klein-schematic-relative-depth-lora.safetensors",
        0.8
      ]
    },
    {
      "id": 85,
      "type": "MarkdownNote",
      "pos": [
        -458.2999499066289,
        -259.54092129190855
      ],
      "size": [
        423.753902071114,
        543.7932327033926
      ],
      "flags": {},
      "order": 5,
      "mode": 0,
      "inputs": [],
      "outputs": [],
      "properties": {},
      "widgets_values": [
        "## models\n\n* diffusion_models\n\n  * [flux-2-klein-base-9b-fp8.safetensors](https://huggingface.co/black-forest-labs/FLUX.2-klein-base-9b-fp8/blob/main/flux-2-klein-base-9b-fp8.safetensors)\n* text_encoders\n\n  * [qwen_3_8b.safetensors](https://huggingface.co/Comfy-Org/vae-text-encorder-for-flux-klein-9b/blob/main/split_files/text_encoders/qwen_3_8b.safetensors)\n* vae\n\n  * [flux2-vae.safetensors](https://huggingface.co/Comfy-Org/vae-text-encorder-for-flux-klein-9b/blob/main/split_files/vae/flux2-vae.safetensors)\n\n\n```text\n📂ComfyUI/\n└── 📂models/\n    ├── 📂diffusion_models/\n    │   ├── flux-2-klein-9b-fp8.safetensors\n    │   └── flux-2-klein-base-9b-fp8.safetensors\n    ├── 📂text_encoders/\n    │   └── qwen_3_8b.safetensors\n    └── 📂vae/\n         └── flux2-vae.safetensors\n```\n\n### loras\n\n- loras\n  - [flux2-klein-schematic-relative-depth-lora.safetensors](https://huggingface.co/nomadoor/flux-2-klein-9B-schematic-lora/blob/main/loras/flux2-klein-schematic-relative-depth-lora.safetensors)\n  - [flux2-klein-schematic-surface-normal-lora.safetensors](https://huggingface.co/nomadoor/flux-2-klein-9B-schematic-lora/blob/main/loras/flux2-klein-schematic-surface-normal-lora.safetensors)\n  - [flux2-klein-schematic-body-pose-lora.safetensors](https://huggingface.co/nomadoor/flux-2-klein-9B-schematic-lora/blob/main/loras/flux2-klein-schematic-body-pose-lora.safetensors)\n  - [flux2-klein-schematic-full-pose-lora.safetensors](https://huggingface.co/nomadoor/flux-2-klein-9B-schematic-lora/blob/main/loras/flux2-klein-schematic-full-pose-lora.safetensors)\n  - [flux2-klein-schematic-binary-segmentation-lora.safetensors](https://huggingface.co/nomadoor/flux-2-klein-9B-schematic-lora/blob/main/loras/flux2-klein-schematic-binary-segmentation-lora.safetensors)\n  - [flux2-klein-schematic-amodal-segmentation-lora.safetensors](https://huggingface.co/nomadoor/flux-2-klein-9B-schematic-lora/blob/main/loras/flux2-klein-schematic-amodal-segmentation-lora.safetensors)\n\n```\n📂ComfyUI/\n└── 📂models/\n    └── 📂loras/\n        ├── flux2-klein-schematic-relative-depth-lora.safetensors\n        ├── flux2-klein-schematic-surface-normal-lora.safetensors\n        ├── flux2-klein-schematic-body-pose-lora.safetensors\n        ├── flux2-klein-schematic-full-pose-lora.safetensors\n        ├── flux2-klein-schematic-binary-segmentation-lora.safetensors\n        └── flux2-klein-schematic-amodal-segmentation-lora.safetensors\n```"
      ],
      "color": "#323",
      "bgcolor": "#535"
    }
  ],
  "links": [
    [
      52,
      31,
      0,
      8,
      0,
      "LATENT"
    ],
    [
      62,
      43,
      0,
      8,
      1,
      "VAE"
    ],
    [
      63,
      44,
      0,
      6,
      0,
      "CLIP"
    ],
    [
      64,
      44,
      0,
      33,
      0,
      "CLIP"
    ],
    [
      65,
      8,
      0,
      45,
      0,
      "IMAGE"
    ],
    [
      70,
      6,
      0,
      49,
      0,
      "CONDITIONING"
    ],
    [
      71,
      49,
      0,
      31,
      1,
      "CONDITIONING"
    ],
    [
      74,
      52,
      0,
      31,
      3,
      "LATENT"
    ],
    [
      75,
      52,
      0,
      49,
      1,
      "LATENT"
    ],
    [
      76,
      43,
      0,
      52,
      1,
      "VAE"
    ],
    [
      82,
      51,
      0,
      54,
      0,
      "IMAGE"
    ],
    [
      83,
      33,
      0,
      55,
      0,
      "CONDITIONING"
    ],
    [
      84,
      52,
      0,
      55,
      1,
      "LATENT"
    ],
    [
      85,
      55,
      0,
      31,
      2,
      "CONDITIONING"
    ],
    [
      140,
      48,
      0,
      81,
      0,
      "MODEL"
    ],
    [
      149,
      54,
      0,
      52,
      0,
      "IMAGE"
    ],
    [
      193,
      81,
      0,
      31,
      0,
      "MODEL"
    ],
    [
      196,
      50,
      0,
      51,
      0,
      "IMAGE"
    ]
  ],
  "groups": [],
  "config": {},
  "extra": {
    "ds": {
      "scale": 0.7627768444385989,
      "offset": [
        1370.6596805987829,
        510.4137858114659
      ]
    },
    "frontendVersion": "1.42.15",
    "VHS_latentpreview": false,
    "VHS_latentpreviewrate": 0,
    "VHS_MetadataImage": true,
    "VHS_KeepIntermediate": true
  },
  "version": 0.4
}
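Because each task has its own LoRA and prompt, switching tasks means editing two nodes in the workflow above. The sketch below does that on the workflow's JSON structure, relying only on fields visible in the file (`nodes`, `type`, `title`, `widgets_values`). The function name `switch_task` and the `TASKS` table are made up for this example, and the small `demo` dictionary stands in for the full file.

```python
import copy
import json

# Prompts and LoRA file name suffixes as listed in the workflow notes.
TASKS = {
    "relative-depth": "Generate a relative depth map of the input image.",
    "surface-normal": "Generate a surface normal map of the input image.",
    "body-pose": "Generate a body pose map of all visible people in the input image.",
    "full-pose": "Generate a full pose map of all visible people in the input image.",
    "binary-segmentation": "Generate a binary segmentation mask of {target} in the input image.",
    "amodal-segmentation": "Generate an amodal segmentation mask of {target} in the input image.",
}


def switch_task(workflow, task, target=""):
    """Return a copy of the workflow with the LoRA file and positive prompt swapped."""
    wf = copy.deepcopy(workflow)
    prompt = TASKS[task].format(target=target)
    for node in wf["nodes"]:
        if node["type"] == "LoraLoaderModelOnly":
            # widgets_values is [file name, strength]; keep the strength as-is.
            node["widgets_values"][0] = f"flux2-klein-schematic-{task}-lora.safetensors"
        elif node["type"] == "CLIPTextEncode" and "Positive" in node.get("title", ""):
            node["widgets_values"][0] = prompt
    return wf


# Cut-down stand-in for the full workflow file.
demo = {"nodes": [
    {"type": "LoraLoaderModelOnly",
     "widgets_values": ["flux2-klein-schematic-relative-depth-lora.safetensors", 0.8]},
    {"type": "CLIPTextEncode", "title": "CLIP Text Encode (Positive Prompt)",
     "widgets_values": ["Generate a relative depth map of the input image.\n"]},
    {"type": "CLIPTextEncode", "title": "CLIP Text Encode (Negative Prompt)",
     "widgets_values": ["text, worst quality, blurry, ugly"]},
]}
patched = switch_task(demo, "amodal-segmentation", target="the bench")
print(json.dumps(patched["nodes"][0]["widgets_values"]))
print(patched["nodes"][1]["widgets_values"][0])
```

With the real file, `json.load` would replace the `demo` dictionary and `json.dump` would write the patched copy back out.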

Practical Tests

relative depth

Generate a relative depth map of the input image.

input	Depth Anything V2	FLUX.2 [klein] LoRA

normal map

Generate a surface normal map of the input image.

input	Lotus-2	FLUX.2 [klein] LoRA

pose body

Generate a body pose map of all visible people in the input image.

input	DWPose	FLUX.2 [klein] LoRA

pose full

Generate a full pose map of all visible people in the input image.

input	DWPose	FLUX.2 [klein] LoRA

binary segmentation

Generate a binary segmentation mask of the stretcher in the input image.
Generate a binary segmentation mask of the tuna sushi in the input image.
Generate a binary segmentation mask of all jars in the input image.

input	SAM 3.1	FLUX.2 [klein] LoRA

amodal segmentation

Generate an amodal segmentation mask of the woman in the input image.
Generate an amodal segmentation mask of the bench in the input image.
Generate an amodal segmentation mask of the steam locomotive in the input image.

input	SAM 3.1 visible mask	FLUX.2 [klein] LoRA

Limitations and Issues

Depth / Normal

I used Lotus-2 as the teacher, but the LoRA also learned noise that came from Lotus-2.

For this kind of task, I should probably also have considered synthetic data rendered from 3D models.

As a side note, before Lotus-2, I also trained with target images created by DSINE. DSINE produces much flatter normal maps than Lotus-2, and the LoRA outputs became similarly flat.

The quality of the teacher shows up directly in the LoRA output, which reminded me how much dataset quality matters.

pose

The first problem is that pose is the least suited to RGB-image representation among the tasks tested here.

Even if the model outputs an OpenPose-style image, converting that image back into keypoints is not easy, which makes it difficult to use in practice. The format also fixes the color and number of bones, so even small deviations stand out.
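To show what converting a pose image back into keypoints involves, here is a naive sketch that locates one joint by color matching. The joint color is a hypothetical placeholder, not the real OpenPose palette, and the image is a plain nested list of RGB tuples.

```python
def centroid_of_color(pixels, color, tol=30):
    """Return the (x, y) centroid of pixels within `tol` of `color` per channel,
    or None if nothing matches. `pixels` is a list of rows of (r, g, b) tuples."""
    xs, ys = [], []
    for y, row in enumerate(pixels):
        for x, rgb in enumerate(row):
            if all(abs(c - t) <= tol for c, t in zip(rgb, color)):
                xs.append(x)
                ys.append(y)
    if not xs:
        return None
    return (sum(xs) / len(xs), sum(ys) / len(ys))


BLACK = (0, 0, 0)
JOINT = (255, 0, 0)   # hypothetical joint color, not the real OpenPose palette
image = [[BLACK] * 5 for _ in range(4)]
image[1][2] = JOINT
image[1][3] = (250, 10, 5)   # slightly off-color pixel, as a generated image might have

print(centroid_of_color(image, JOINT))   # (2.5, 1.0)
```

Even this simple case needs a color tolerance. With two people in the frame, the same color appears twice and a single centroid lands between them, so a real decoder would also need to cluster pixels and assign joints to people.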

I thought this would be an easy task to train, but the outputs broke down more than expected. Hallucinated skeletons on animal images and other images without people are not prevented either.

segmentation

I expected the prompt understanding ability of the Qwen3 8B text encoder to help, but the control was not as strong as I hoped.

The base model can follow instructions like "remove the person in X", but with the LoRA applied, a request to "segment the person in X" may fail or segment a different person.

So this may not be just a problem of prompt understanding. The model may not have learned the segmentation task itself well enough from the dataset.

For boundary precision, I was hoping for smoother edges closer to matting, but for now the edges are still about as rough as those from SAM 3.1.

Overall

Overall, the datasets were too small.

Creating the amodal segmentation dataset was very labor-intensive, so I capped every task at roughly 300 images to match.
To properly isolate the causes of the problems above, I think each task would need around 2000-3000 images.

There are many things left to improve, but I spent too much budget and time on this, so I am stopping here for now.
If I get the chance, I would like to try again with a larger dataset.

Closing

Quality aside, this small-scale LoRA training was enough to teach FLUX.2 [klein] some CV-task-like RGB outputs.

The important point is not really whether it "can do CV tasks." It is that the possible uses of image editing models can expand quite a lot depending on what we decide to treat as image editing.

When we hear "image editing," style transfer and object removal come to mind first.
But outputs like these CV-task-like images, or custom intermediate representations, can also be treated as image editing in a broad sense.

It is fun to watch image generation models, which used to be mostly about drawing pictures, gradually look more like general-purpose vision models.
