What is SDXL?

SDXL (specifically SDXL 1.0) is the official successor model from Stability AI, the developer of Stable Diffusion 1.5. (There was also the Stable Diffusion 2.1 lineage, but, well... its performance was...)

Broadly speaking, there are two main differences from Stable Diffusion 1.5:

Two-stage configuration of base and refiner

  • Basic text2image is completed with the base model alone.
  • The refiner model is then designed to "finish" the result, adjusting details and texture via image2image.

Change in training resolution

  • Stable Diffusion 1.5
    • Trained mainly on 512 x 512px square images
  • SDXL
    • Trained mainly at around 1024 x 1024px, across various aspect ratios
    • Handles high-resolution generation and portrait/landscape compositions more naturally from the start.

Model Download

📂ComfyUI/
  └── 📂models/
      └── 📂checkpoints/
          ├── sd_xl_base_1.0_0.9vae.safetensors
          └── sd_xl_refiner_1.0_0.9vae.safetensors
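
If you'd rather grab the files from a script than a browser, here is a minimal sketch using the huggingface_hub package. It assumes the two checkpoints are still hosted under the official Stability AI repositories with these exact file names, and that the local_dir path points at your ComfyUI install.

```python
# Sketch: download the SDXL base and refiner checkpoints into ComfyUI's
# checkpoints folder. Assumes `pip install huggingface_hub` and that the
# file names below still match the official Stability AI repositories.
from huggingface_hub import hf_hub_download

for repo_id, filename in [
    ("stabilityai/stable-diffusion-xl-base-1.0", "sd_xl_base_1.0_0.9vae.safetensors"),
    ("stabilityai/stable-diffusion-xl-refiner-1.0", "sd_xl_refiner_1.0_0.9vae.safetensors"),
]:
    hf_hub_download(
        repo_id=repo_id,
        filename=filename,
        local_dir="ComfyUI/models/checkpoints",  # adjust to your install path
    )
```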

text2image with only base model

First, let's do a simple text2image with only the base model.

Basic generation works just by swapping the checkpoint in an SD1.5 text2image workflow for the SDXL base model.

SDXL_text2image_base.json
  • Set the resolution to approximately 1M pixels (around 1024 x 1024px).

    • Examples: 1024 x 1024 / 896 x 1152 / 1152 x 896, etc.
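
For reference, the same base-only text2image can be sketched outside of ComfyUI with the diffusers library. The prompt, step count, and resolution below are illustrative assumptions, not values taken from the workflow file.

```python
# Sketch: base-only SDXL text2image with diffusers (not ComfyUI).
# Assumes a CUDA GPU and `pip install diffusers transformers torch`.
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    variant="fp16",
).to("cuda")

# Keep the output around 1M pixels, e.g. 1024x1024, 896x1152, 1152x896.
image = pipe(
    prompt="a watercolor lighthouse on a cliff at dusk",
    width=1024,
    height=1024,
    num_inference_steps=25,
).images[0]
image.save("sdxl_base_only.png")
```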

CLIPTextEncodeSDXL

The SDXL base model uses two CLIP models (OpenCLIP-ViT/G and CLIP-ViT/L) in combination as its text encoders.

ComfyUI has a node that lets you input separate text into each CLIP, but let me say this up front: you don't need to use it.

SDXL_text2image_base_CLIPTextEncodeSDXL.json
  • If you input the same prompt to both CLIPs, the behavior ends up almost identical to using the regular CLIP Text Encode node.
  • Empirically, the output tends to be most stable when both CLIPs receive the same text.
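
The diffusers pipeline exposes the same split, which makes the point easy to see in code: prompt feeds one encoder and prompt_2 the other, and leaving prompt_2 out simply reuses prompt for both. This sketch reuses the pipe object from the previous snippet (an assumption of these examples).

```python
# Sketch: same text to both SDXL encoders vs. split text.
# Omitting `prompt_2` reuses `prompt` for both encoders, which mirrors the
# advice above: identical text tends to give the most stable results.
same_text = pipe(prompt="a cozy cabin in a snowy forest").images[0]

split_text = pipe(
    prompt="a cozy cabin in a snowy forest",      # first text encoder
    prompt_2="oil painting, warm soft lighting",  # second text encoder
).images[0]
```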

base + refiner

Next, let's finish the image generated by the base model using the refiner.

Generate with base → image2image with refiner

SDXL base and SDXL refiner share the same latent representation, so the latent produced by the base model can be fed straight into the refiner's KSampler for image2image.

SDXL_text2image_base-refiner.json
    1. 🟪 text2image as usual with SDXL base (output the latent)
    2. 🟨 Connect that latent to a KSampler that uses the SDXL refiner
    3. 🟨 image2image with a low denoise (e.g., 0.2 to 0.3)
    • Since the refiner specializes in adding detail, a very light touch is enough.

The idea is to keep the original style of the base output and let the refiner adjust only the details and textures.
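
Expressed in code, the latent handoff looks roughly like the sketch below. This is again a diffusers illustration rather than the ComfyUI workflow itself; the strength value plays the role of the low denoise mentioned above.

```python
# Sketch: base text2image -> refiner image2image on the same latent.
# The two models share a latent space, so no decode/encode round trip is needed.
import torch
from diffusers import (
    StableDiffusionXLPipeline,
    StableDiffusionXLImg2ImgPipeline,
)

base = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16, variant="fp16",
).to("cuda")
refiner = StableDiffusionXLImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-refiner-1.0",
    torch_dtype=torch.float16, variant="fp16",
).to("cuda")

prompt = "a portrait photo of an astronaut in a sunflower field"

# 1. text2image with base, keeping the result as a latent
latent = base(prompt=prompt, output_type="latent").images

# 2. light image2image with the refiner; low strength = low denoise
image = refiner(prompt=prompt, image=latent, strength=0.25).images[0]
image.save("sdxl_base_plus_refiner.png")
```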

Switching during sampling (KSampler Advanced)

As a slightly smarter approach, you can also switch from base to refiner partway through sampling, using the KSampler (Advanced) node.

SDXL_text2image_base-refiner_Advanced.json
  • 🟪 Sample with SDXL base up to a midpoint step
  • 🟨 Sample the remaining steps with SDXL refiner
  • 🟦 Set the switchover step with an int node.

Personally, I prefer the image2image approach because it is easier to understand, but it is worth remembering that you can also switch between base and refiner within a single sampling pass.
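
The diffusers counterpart of this handover is the denoising_end / denoising_start pair, which is one way to sketch what the KSampler (Advanced) setup does. The snippet reuses the base, refiner, and prompt objects from the previous sketch, and the 0.8 switch point is just an example value.

```python
# Sketch: switch from base to refiner partway through a single sampling run.
# `denoising_end` on the base and `denoising_start` on the refiner mark the
# handover point; here base does the first 80% of the steps.
switch_at = 0.8
steps = 30

latent = base(
    prompt=prompt,
    num_inference_steps=steps,
    denoising_end=switch_at,     # stop base sampling at 80%
    output_type="latent",
).images

image = refiner(
    prompt=prompt,
    num_inference_steps=steps,
    denoising_start=switch_at,   # refiner picks up from the same point
    image=latent,
).images[0]
image.save("sdxl_base_refiner_switch.png")
```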


refiner is not necessary, but "refiner-like thinking" is important

Refiner-less SDXL Models

There are many derivative models based on SDXL (both community and commercial models), and most of them are tuned to produce sufficient image quality without using the refiner.

To put it a bit bluntly, the "post-process with the refiner" design was partly a compromise to make up for what the base model alone could achieve at the time.

Refiner-like Thinking

However, the idea of finishing a single image across multiple models is still valid in itself.

  • Models whose style is great but which don't follow prompts very well
  • Conversely, models that follow prompts well but whose style isn't to your taste

There are plenty of such "not quite right" models.

In such situations, SDXL's refiner-like thinking is useful.

  • First, generate a base image with a model that excels at composition and prompt adherence
  • Then finish that image via image2image with a model whose style you prefer

With a two-stage setup like this, you can build a "best of both worlds" workflow where the composition comes from model A and the style from model B.
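
As a rough sketch of this two-stage pattern (the checkpoint paths, prompt, and strength here are hypothetical placeholders for "your composition model" and "your style model"):

```python
# Sketch: "composition from model A, style from model B" in two stages.
# MODEL_A_PATH / MODEL_B_PATH are hypothetical local SDXL checkpoints.
import torch
from diffusers import (
    StableDiffusionXLPipeline,
    StableDiffusionXLImg2ImgPipeline,
)

MODEL_A_PATH = "checkpoints/composition_model.safetensors"  # follows prompts well
MODEL_B_PATH = "checkpoints/style_model.safetensors"        # has the style you like

model_a = StableDiffusionXLPipeline.from_single_file(
    MODEL_A_PATH, torch_dtype=torch.float16
).to("cuda")
model_b = StableDiffusionXLImg2ImgPipeline.from_single_file(
    MODEL_B_PATH, torch_dtype=torch.float16
).to("cuda")

prompt = "two cats playing chess on a rooftop at sunset"

# Stage 1: nail the composition with model A
draft = model_a(prompt=prompt, width=1024, height=1024).images[0]

# Stage 2: restyle with model B at a moderate denoise so the layout survives
final = model_b(prompt=prompt, image=draft, strength=0.5).images[0]
final.save("two_model_finish.png")
```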

SDXL's base / refiner is just one concrete example of this. Go look for your own combinations of models to "multiply" together.