What is SCAIL-2?

SCAIL-2 is a Wan2.1-based model specialized for motion transfer to people and characters.

The major difference from Wan-Animate and the previous SCAIL-1 is that it does not convert the input into an intermediate representation such as a stick figure.

The usual idea has been to make a stick figure with ViTPose or OpenPose, then use that as the condition for moving the person. But once you convert the video into a stick figure, a lot of information is lost.

Depth, contact, intertwined multi-person motion, non-human character motion, and so on are all hard to recover from a stick figure.

So SCAIL-2 passes the reference image and motion video almost directly to the DiT.

Rather than humans building a complicated processing pipeline by hand, it is often more flexible to prepare the right dataset and let the AI understand the task. That way of thinking will probably become more common from here.


Model Download

📂ComfyUI/
└── 📂models/
    ├── 📂checkpoints/
    │   └── sam3.1_multiplex_fp16.safetensors
    ├── 📂clip_vision/
    │   └── clip_vision_h.safetensors
    ├── 📂diffusion_models/
    │   └── wan2.1_14B_SCAIL_2_fp8_scaled.safetensors
    ├── 📂loras/
    │   └── Wan21_I2V_14B_lightx2v_cfg_step_distill_lora_rank64.safetensors
    ├── 📂text_encoders/
    │   └── umt5_xxl_fp8_e4m3fn_scaled.safetensors
    └── 📂vae/
        └── wan_2.1_vae.safetensors
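
Before loading the workflow, it can help to confirm that every file is where the tree above says it should be. The following is a minimal sketch, not part of ComfyUI: the function name find_missing_models and the install path "ComfyUI" are assumptions for illustration, while the file names are copied from the tree.

```python
from pathlib import Path

# Relative paths copied from the directory tree above.
EXPECTED_MODELS = [
    "checkpoints/sam3.1_multiplex_fp16.safetensors",
    "clip_vision/clip_vision_h.safetensors",
    "diffusion_models/wan2.1_14B_SCAIL_2_fp8_scaled.safetensors",
    "loras/Wan21_I2V_14B_lightx2v_cfg_step_distill_lora_rank64.safetensors",
    "text_encoders/umt5_xxl_fp8_e4m3fn_scaled.safetensors",
    "vae/wan_2.1_vae.safetensors",
]


def find_missing_models(comfyui_root):
    """Return the expected files that are absent under <comfyui_root>/models."""
    models_dir = Path(comfyui_root) / "models"
    return [rel for rel in EXPECTED_MODELS if not (models_dir / rel).is_file()]


if __name__ == "__main__":
    # "ComfyUI" is an assumed install location; change it to your own path.
    for rel in find_missing_models("ComfyUI"):
        print(f"missing: models/{rel}")
```

If nothing is printed, all six files were found.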

Animation Mode

Move a reference image using a motion video.

SCAIL-2_Animation.json

The base workflow is similar to Wan-Animate, but this one is much simpler, so let's look through it.

Reference Image / Motion Video

The reference image and motion video are resized internally, so they do not need to be the same size.

  • Similar aspect ratios are easier to handle.
  • The pose in the image and the pose in the video do not need to match perfectly.
  • However, if they are too different, generation will fail.
  • It is usually safer to choose a reference image close to the first frame of the motion video.

Prompt

Since this is just motion transfer, you do not need a detailed prompt.

  • However, if the prompt is too short, generation can fail more easily, especially in Replacement Mode.
  • For this example, write enough to describe the intended video, such as "a man in a shirt is standing with one hand on his waist and touching his hair".

Resolution / Frame Count

Set the generation size and frame count in WanSCAILToVideo.

  • Recommended resolution ranges from 480p (864×480) to roughly 720p (1280×704), with width and height each a multiple of 32
  • Maximum frame count is 81
  • In this workflow, the reference image is resized and that size is used as the generation resolution.
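
The resolution rule above can be sketched in a few lines. This is only an illustration of scaling a source size and snapping it to multiples of 32; it is not the workflow's actual resize node, and the name fit_resolution and its parameters are assumptions.

```python
def fit_resolution(src_width, src_height, target_long_side=864, multiple=32):
    """Scale a size so its longer side is near target_long_side, then snap
    both sides to the nearest multiple of `multiple`."""
    scale = target_long_side / max(src_width, src_height)

    def snap(value):
        # Round to the nearest multiple, but never below one multiple.
        return max(multiple, round(value * scale / multiple) * multiple)

    return snap(src_width), snap(src_height)


print(fit_resolution(1920, 1080))                         # (864, 480)
print(fit_resolution(1920, 1080, target_long_side=1280))  # (1280, 704)
```

A 16:9 source lands exactly on the two recommended sizes, which is why 720p appears as 1280×704 rather than 1280×720.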

Mask Generation with SAM3.1

Mask the people in the reference image and motion video with SAM 3 / 3.1.

  • This is not a strict inpainting mask. It is just a helper that tells SCAIL-2 which people correspond to each other, so a little misalignment is fine.

Create SCAIL-2 Colored Mask

The generated masks are painted in colors that SCAIL-2 uses to tell which person is which.

  • This becomes a little more important when there are multiple people. More on that later.
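
To make the idea of a colored mask concrete, here is a rough sketch that merges per-person binary masks into a single RGB image. It is not the node's implementation: the palette, the black background, and the name colorize_masks are all assumptions, since the actual colors the node uses are not stated here.

```python
# Hypothetical palette and background; the real node's colors are not given here.
PALETTE = [(255, 0, 0), (0, 255, 0), (0, 0, 255), (255, 255, 0)]
BACKGROUND = (0, 0, 0)


def colorize_masks(masks, palette=PALETTE, background=BACKGROUND):
    """Merge binary masks (lists of rows of 0/1) into one RGB image.

    Mask i is painted with palette[i]; later masks overwrite earlier ones
    where they overlap.
    """
    if not masks:
        raise ValueError("at least one mask is required")
    if len(masks) > len(palette):
        raise ValueError("more masks than palette colors")
    height, width = len(masks[0]), len(masks[0][0])
    image = [[background for _ in range(width)] for _ in range(height)]
    for index, mask in enumerate(masks):
        for y in range(height):
            for x in range(width):
                if mask[y][x]:
                    image[y][x] = palette[index]
    return image
```

Running the same function over the reference image masks and over each frame of the motion video would give matching colors to masks that share an index.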

6-Step Generation

SCAIL-2 can also use the Lightx2v LoRA for fast Wan2.1 generation.

  • cfg is 1.0
  • steps is 6

Output Example

reference image
motion video
output

Replacement Mode

Replace the person in the video with the person in the reference image.

SCAIL-2_Replacement.json

Basically, just set replacement_mode to true in Create SCAIL-2 Colored Mask and WanSCAILToVideo.

Resolution

Replacement uses the video size as the base.

  • In this workflow, it resizes the first frame of the video, reads that size, and sets it as the output size.

Create SCAIL-2 Colored Mask and WanSCAILToVideo

Set replacement_mode to true.

  • Incidentally, in Create SCAIL-2 Colored Mask this setting changes only one thing: the background of the pose_video output becomes white.

Output Example

motion video
reference image
output

Animation Mode (Multiple People)

SCAIL-2 also supports videos and images with multiple people.

No special steps are required. As before, just input the video and reference image.

SCAIL-2_Animation_multi-char.json

Create SCAIL-2 Colored Mask

When there are multiple people, it becomes important to control which person should follow which motion. SCAIL-2 uses colored masks for this.

  • When SAM3.1 segments multiple targets, Create SCAIL-2 Colored Mask paints them in different colors in order.
  • Basically, matching colors are linked together, so use options such as sort_by to align the colors.

However, as in the output example below, the color correspondence and the motion may not always match. The colored mask is only a weak conditioning signal, and the model may simply follow whichever composition is closer.
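
Aligning colors comes down to putting the masks in the same order on both sides. The sketch below orders masks left to right by their horizontal center. This is an assumed example of what a sort option might do, not the documented behavior of sort_by, and the function names are hypothetical.

```python
def mask_center_x(mask):
    """Mean x coordinate of the nonzero pixels in a binary mask (list of rows)."""
    xs = [x for row in mask for x, value in enumerate(row) if value]
    if not xs:
        raise ValueError("mask is empty")
    return sum(xs) / len(xs)


def sort_left_to_right(masks):
    """Return the masks ordered by horizontal position, leftmost first."""
    return sorted(masks, key=mask_center_x)
```

If the reference image masks and the motion video masks are both sorted this way, index 0 refers to the leftmost person in each, so that pair receives the same color.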

Output Example

reference image
motion video
output

Animation Mode (Over 81 Frames)

SCAIL-2 generates at most 81 frames per pass, but with WAN Context Windows (Manual) you can generate longer videos by splitting the generation along the time axis.

SCAIL-2_Animation_WAN-Context-Windows.json

WAN Context Windows (Manual)

It works like tiling along the time axis, that is, a sliding context window.

  • Set context_length to 81, and it generates internally in chunks of 81 frames.
  • If you leave it as-is, the seams will be obvious, so set an appropriate number of frames in context_overlap as overlap.
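
The interaction between context_length and context_overlap is easier to see with numbers. The sketch below lists the frame ranges a generic sliding-window scheme would cover. It is an illustration only: the node's real scheduling and blending may differ, and the overlap of 16 frames is an assumed value, not a recommendation from the workflow.

```python
def context_windows(total_frames, context_length=81, context_overlap=16):
    """Return (start, end) frame ranges, end exclusive, covering total_frames.

    Consecutive windows share at least context_overlap frames.
    """
    if total_frames < 1:
        raise ValueError("total_frames must be at least 1")
    if not 0 <= context_overlap < context_length:
        raise ValueError("context_overlap must be in [0, context_length)")
    if total_frames <= context_length:
        return [(0, total_frames)]
    stride = context_length - context_overlap
    windows = []
    start = 0
    while start + context_length < total_frames:
        windows.append((start, start + context_length))
        start += stride
    # The final window is shifted back so it ends exactly at the last frame.
    windows.append((total_frames - context_length, total_frames))
    return windows


print(context_windows(161))  # [(0, 81), (65, 146), (80, 161)]
```

With context_overlap set to 0, neighboring windows share no frames, which is the case where seams become obvious.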

Output Example

reference image
motion video
output