What is LTX 2.3?

LTX 2.3 is an improved version of Lightricks' video generation model LTX-2.

The basic ideas and node structure are the same as in LTX-2,
so this page covers only what has changed from LTX-2.


Recommended Settings

  • Resolution
    • Final output around 1.5M pixels
    • Must be a multiple of 32
  • FPS
    • 24 / 25 / 48 / 50
  • Frames
    • 65 / 97 / 121 / 161 / 257
    • Must be 8n + 1
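
The constraints above can be checked with a small helper. A minimal sketch (the function names are my own, not part of ComfyUI):

```python
def is_valid_resolution(width, height):
    """LTX 2.3 requires both dimensions to be multiples of 32."""
    return width % 32 == 0 and height % 32 == 0

def is_valid_frame_count(frames):
    """Frame count must satisfy 8n + 1 (e.g. 65, 97, 121, 161, 257)."""
    return frames >= 9 and (frames - 1) % 8 == 0

print(is_valid_resolution(1664, 896))   # True: both divisible by 32
print(is_valid_frame_count(121))        # True: 121 = 8*15 + 1
print(is_valid_frame_count(120))        # False: 120 - 1 is not divisible by 8
```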

Model Download

📂ComfyUI/
└── 📂models/
    ├── 📂checkpoints/
    │   └── ltx-2.3-22b-dev-fp8.safetensors
    ├── 📂latent_upscale_models/
    │   └── ltx-2.3-spatial-upscaler-x2-1.1.safetensors
    ├── 📂loras/
    │   └── ltx-2.3-22b-distilled-lora-384.safetensors
    └── 📂text_encoders/
        └── gemma_3_12B_it_fp8_scaled.safetensors

Basic Process Flow

The architecture is the same as LTX-2, so the workflow itself can be reused.
However, the results are not very good if you use it as-is.

So on this page, we use the community-discovered 3-stage workflow.

Originally, LTX-2 used a 2-stage process: generate once at low resolution, then Hires.fix it to 1.5MP.
In 2.3, you add one more stage: generate at a very small resolution, do 2x Hires.fix, then do another 2x Hires.fix.

This is not the officially recommended method, but the results are clearly better, so this is what we use here.

Everything here uses distilled-lora with 8-step generation.
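
As a concrete illustration of the 3-stage math (the specific numbers are my own example, not official values):

```python
# Sketch of the 3-stage resolution progression:
# stage 1 generates at a small base resolution (~0.1MP),
# then each Hires.fix stage doubles width and height (4x overall).
base_w, base_h = 416, 224           # multiples of 32, ~0.09MP

stages = [(base_w, base_h)]
for _ in range(2):                  # two 2x upscale stages
    w, h = stages[-1]
    stages.append((w * 2, h * 2))

for i, (w, h) in enumerate(stages, 1):
    print(f"stage {i}: {w}x{h} ({w * h / 1e6:.2f}MP)")
# stage 3 lands at 1664x896, about 1.49MP -- close to the recommended ~1.5MP
```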


About prompts

Just like LTX-2, prompt quality directly affects video quality.
It is a good idea to use the official prompt guide as a reference and write prompts that are both specific and information-rich.

It can also help to let an LLM assist with prompt writing. Give it the reference link and a rough description of what you want, and have it clean the prompt up for you.

ComfyUI has a core TextGenerate node that can run an LLM directly.
Many LTX-2 workflows use it to refine prompts, but since it only rewrites the prompt text, the workflows on this page do not use it.
Personally, I think it is easier to make prompts separately with ChatGPT or Gemini.


text2video

LTX-2.3_text2video_distilled_3stage.json

Set video resolution, length, and FPS

This is where you decide the parameters for the video and audio you want to generate.

  • Enter resolution, frame count, and FPS in EmptyLTXVLatentVideo / LTXV Empty Latent Audio
  • 🚨This is the part that differs from LTX-2
    • Since it upscales by 2x twice, meaning 4x in width and height overall, set a value around 0.1MP with that in mind

Output example


image2video

LTX-2.3_image2video_distilled_3stage.json

Output example

Input
Input
Output

audio2video

LTX-2.3_audio2video_distilled_3stage.json

Output example


audio-image2video

LTX-2.3_audio-image2video_distilled_3stage.json

Output example


IC-LoRA

LTX-2.3 can also use IC-LoRA-based extensions, just like LTX-2.

Model Download

📂ComfyUI/
└── 📂models/
    └── 📂loras/
        └── ltx-2.3-22b-ic-lora-union-control-ref0.5.safetensors

IC-LoRA Union (Pose)

LTX-2.3_IC-LoRA(Pose)_distilled_2stage.json
  • 🚨For IC-LoRA, use a 2-stage workflow instead of 3-stage
  • IC-LoRA Union uses a special method where the control video is generated at half the resolution of the final video
    • So if you use 3 stages, the control image resolution drops to "half of half of half of half", roughly around 100px
    • At that point, it no longer keeps enough information to work as a control image
    • That is why IC-LoRA stays at 2 stages
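
The arithmetic behind those bullets, sketched with an example final width of my own choosing:

```python
final_w = 1664   # example final video width (~1.5MP output)

# With a 3-stage workflow the control image ends up halved four times
# ("half of half of half of half"), per the note above:
control_w = final_w // 2 // 2 // 2 // 2
print(control_w)   # 104 -- roughly 100px, too little detail for pose control
```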

Output example

Input
Output

ID-LoRA

Generate a talking-head video of a person speaking in a scene, using one reference image, a short reference audio clip, and a text prompt.

Unlike generating cloned audio separately and then feeding it into audio-image2video, ID-LoRA generates the audio and video at the same time.
Because of that, the mouth movements and voice tend to feel more naturally unified.

Model Download

Both distributed files are named lora_weights.safetensors.
To keep them easy to tell apart, it is helpful to rename them to LTX-2.3-ID-LoRA-CelebVHQ-3K.safetensors and LTX-2.3-ID-LoRA-TalkVid-3K.safetensors.
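
The renaming can be scripted. A sketch; the download locations here are hypothetical and should be adjusted to wherever you actually saved each file:

```python
from pathlib import Path

# Hypothetical download locations -- adjust to where each file was saved.
renames = {
    Path("downloads/CelebVHQ-3K/lora_weights.safetensors"):
        "LTX-2.3-ID-LoRA-CelebVHQ-3K.safetensors",
    Path("downloads/TalkVid-3K/lora_weights.safetensors"):
        "LTX-2.3-ID-LoRA-TalkVid-3K.safetensors",
}

loras_dir = Path("ComfyUI/models/loras")
for src, new_name in renames.items():
    if src.exists():                      # skip files not downloaded yet
        src.rename(loras_dir / new_name)  # move + rename in one step
```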

📂ComfyUI/
└── 📂models/
    └── 📂loras/
        ├── LTX-2.3-ID-LoRA-CelebVHQ-3K.safetensors
        └── LTX-2.3-ID-LoRA-TalkVid-3K.safetensors

workflow

LTX-2.3_ID-LoRA_distilled_3stage.json

The overall base is image2video.
On top of that, you add the ID-LoRA weights and the reference-audio condition.

ID-LoRA model

Load the ID-LoRA model.

  • LTX-2.3-ID-LoRA-CelebVHQ-3K
  • LTX-2.3-ID-LoRA-TalkVid-3K

There are two versions, but the method is the same and only the dataset differs.
There is not a huge difference between them, but it is worth trying both to see which one works better for you.

LTXV Reference Audio (ID-LoRA)

Connect ID-LoRA and the reference audio.

  • Use a reference audio clip trimmed to around 5 seconds
  • It is only used as a reference, so it does not determine the final video length

Prompt

The prompt format is fixed, so write it in this structure.

[VISUAL]: Scene description and the character's appearance
[SPEECH]: The line the character speaks
[SOUNDS]: Speaking style + ambient / surrounding sounds
  • To avoid ending up with audio that feels like narration laid over the video, it helps to state in [VISUAL] that the character is actually speaking
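
A hypothetical prompt following that structure (my own example, not taken from the official guide):

```
[VISUAL]: A woman in a red coat stands on a rainy city street at night, facing the camera and speaking directly to the viewer, neon signs reflecting in the puddles behind her.
[SPEECH]: "The storm will pass by morning. Until then, stay close to me."
[SOUNDS]: Calm, steady female voice; soft rain, distant traffic, faint neon hum.
```

Note that [VISUAL] explicitly states the character is speaking, per the bullet above.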

Output example

input
input
ref_audio
output