What is RunPod?

RunPod is a service where you can rent cloud GPU power for short periods of time.

There are several similar services, but I feel RunPod has the best balance of price and ease of use.

Compared with simple image generation, model training needs more compute power and VRAM, and you keep the GPU running for anywhere from tens of minutes to several hours.

If you only want to train a simple LoRA, it can be done for a few dollars, so it is easy enough to try.
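As a rough sketch of how that "few dollars" figure comes about: the cost of a run is simply the GPU's hourly rate times the hours it runs. The rate used below is an illustrative placeholder, not a current RunPod price, so check the pricing page before deploying.

```python
# Rough cost sketch for a LoRA training run on a rented GPU.
# The hourly rate is a made-up placeholder -- check RunPod's
# pricing page for real numbers.

def estimate_cost(hourly_rate_usd: float, hours: float) -> float:
    """Return the approximate credit spend for one training run."""
    return round(hourly_rate_usd * hours, 2)

# Example: a hypothetical $0.45/hr GPU running for 2 hours.
print(estimate_cost(0.45, 2.0))  # -> 0.9
```

Even doubling or tripling the assumed rate, a short run stays in the single-digit-dollar range, which is why a small LoRA is cheap to try.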

This note walks through the flow from launching AI Toolkit on RunPod to downloading the LoRA you trained.

I plan to cover detailed training settings and dataset preparation for each model in separate notes.


Overall flow

  1. Create a RunPod account and buy credits
  2. Create a Pod
  3. Open the AI Toolkit UI
  4. Upload a Dataset
  5. Create a Job
  6. Run training
  7. Download the LoRA file
  8. Stop the Pod

1. Create a RunPod account and buy credits

Create an account

Open RunPod and create an account from Sign Up.

Buy credits

RunPod works by buying credits first and spending them as you use GPUs.

If you only want to try LoRA training, about 10 dollars is enough.

  • Click the + button in the upper right
  • Choose Other if you want to start with a small amount
  • Enter the amount and continue to Go to Checkout

You do not have to use it, but this is my referral code. If you register from here, you get a little extra credit on your first purchase.


2. Create a Pod

What is a Pod?

RunPod has several features, but for this note you only need to understand Pod.

A Pod is like a customizable rental PC in the cloud.

You choose which GPU to use and how much storage to allocate, then rent that environment.

This time, we will create a Pod that can run AI Toolkit, open AI Toolkit from the browser, and train a LoRA.

Create the Pod

Open Pods from the sidebar and create a new Pod.

Here you choose Cloud, GPU, CUDA version, storage, and so on.

  • Cloud

    • Secure Cloud
      • This is an environment prepared by RunPod. It is stable, but the price is higher.
    • Community Cloud
      • This is an externally provided environment that has passed RunPod's review.
    • Community Cloud cannot use Network Volume, but we will not use it this time, so either option is fine.
  • Network Volume

    • When you close a Pod, the data inside it is also deleted.
    • Network Volume is a service for keeping data somewhere else so that does not happen.
    • In this note, we will download the LoRA after training and delete the whole Pod.
  • CUDA version

    • AI Toolkit may not run with an old CUDA version.
    • Choose 12.8 or later for this flow.
  • GPU

    • There are many choices, so it is easy to get lost, but the first thing to look at is VRAM.
    • If you do not have enough VRAM, training will fail with Out of Memory.
    • After that, you can choose a higher-grade GPU for speed.

Here are a few GPUs I often use.

GPU            VRAM   Note
RTX 3090       24GB   For SDXL-scale LoRA training, a good balance of speed and cost.
A40            48GB   48GB of VRAM at a relatively low price; my usual first pick when 24GB is not enough.
RTX PRO 6000   96GB   Overkill for most LoRA training, but useful for large models such as LTX-2 or other settings with high VRAM usage.
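The selection rule described above (VRAM first, then price) can be sketched as a small helper. The VRAM figures come from the table; the hourly prices here are made-up placeholders, not real RunPod rates.

```python
# Sketch of the GPU selection rule: filter by required VRAM first,
# then take the cheapest remaining option.
# Prices are illustrative placeholders, not RunPod's actual rates.

GPUS = [
    {"name": "RTX 3090", "vram_gb": 24, "price_per_hr": 0.45},
    {"name": "A40", "vram_gb": 48, "price_per_hr": 0.55},
    {"name": "RTX PRO 6000", "vram_gb": 96, "price_per_hr": 1.80},
]

def pick_gpu(required_vram_gb: int) -> dict:
    """Cheapest GPU whose VRAM meets the requirement."""
    candidates = [g for g in GPUS if g["vram_gb"] >= required_vram_gb]
    if not candidates:
        raise ValueError("no GPU has enough VRAM for this job")
    return min(candidates, key=lambda g: g["price_per_hr"])

print(pick_gpu(30)["name"])  # -> A40
```

The point is the ordering of the two criteria: a faster GPU that runs out of VRAM is useless, so VRAM is a hard filter and price/speed is only a tiebreaker.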

Deployment settings

After the hardware settings are done, choose what software to run.

We will use a template, so there is nothing especially difficult here.

  • Pod name

    • Give it any name you like.
  • Pod template

    • Choose AI Toolkit - ostris - ui - official.
    • This is the template made by Ostris, the author of AI Toolkit.
    • Be careful when searching for AI Toolkit, because many templates with similar names appear.

    Next, edit this template slightly.

    • Click Edit.
    • Open Environment Variables at the bottom.
    • Change the value of AI_TOOLKIT_AUTH to a password only you know.

You will use this value when opening AI Toolkit. If you leave the default as-is, anyone who knows the default password can open your instance, so set your own value.

  • Storage configuration

    • Allocate enough Volume disk for the dataset, base model, and output LoRA files.
    • The default value is fine.

After that, click Deploy On-Demand and the Pod will be created.

Credit usage starts at this point. Prepare your dataset before deploying the Pod.


3. Open the AI Toolkit UI

It takes a little while for the Pod to be created. Wait for it to finish.

When the Pod is ready, it will show 🟢Ready, and a link for opening AI Toolkit will appear.

Click HTTP Service, and AI Toolkit should open.

When it asks for a password, enter the value you just set in AI_TOOLKIT_AUTH.

From here, we will look at the rough training flow in AI Toolkit.


4. Upload a Dataset

Upload the images and caption files used for training to AI Toolkit.

  • Open the Dataset tab
  • Click New Dataset in the upper right
  • Give the dataset a name
  • Drag and drop the images and .txt files

If the images and their matching caption files are loaded, you are ready.
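Before uploading, it is worth checking locally that every image really has a matching caption file. This is a small sketch of such a check; the extension list is an assumption, so adjust it to whatever formats your dataset uses.

```python
# Local sanity check before uploading a dataset: every training image
# should have a sibling .txt caption with the same base name.
# The extension set below is an assumption -- adjust for your data.

from pathlib import Path

IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".webp"}

def missing_captions(dataset_dir: str) -> list[str]:
    """Return image filenames that have no matching .txt caption."""
    missing = []
    for img in sorted(Path(dataset_dir).iterdir()):
        if img.suffix.lower() in IMAGE_EXTS:
            if not img.with_suffix(".txt").exists():
                missing.append(img.name)
    return missing
```

Run it on your dataset folder; an empty list means every image is paired and the folder is ready to drag and drop.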


5. Create a Job

In AI Toolkit, the flow is to create a training setup called a Job, then run it.

It is a bit like a workflow in ComfyUI.

Set the base model, learning rate, dataset you just loaded, and other training parameters here.

When the settings are done, click Create Job in the upper right.

Before training starts, you can change the settings as many times as you want.


6. Run training

After creating the Job, run the training.

  • Click the run button in the upper right

If there is no error and the progress bar is moving, it is basically working.

You can stop training and resume it later.

You can also stop it, change parameters, and run it again, but some parameter changes can break the training state. If you are not sure, it is safer to start training again from scratch.


7. Download the LoRA file

Depending on the settings, AI Toolkit periodically outputs LoRA files during training.

The only real way to know whether training went well is to generate images with ComfyUI or another tool. To be honest, the Loss Graph is not very useful for judging that.

  • Output LoRA files appear in the Checkpoints area
  • Click the download button and save them

That is the basic flow.

If you delete the Pod, the uploaded dataset and generated LoRA files are deleted too. Make sure to download every file you need.
It is also useful to save the config file that contains all the settings, so you can review it later.
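When several intermediate LoRA files have accumulated, you usually want the most recent one. Assuming the files embed the step count in the name (e.g. my_lora_000001000.safetensors, a common pattern, though your Job's naming may differ), picking the latest is a one-liner:

```python
# Pick the highest-step LoRA file from a list of checkpoint names.
# Assumes filenames embed the step count before .safetensors,
# e.g. my_lora_000001000.safetensors -- adjust the regex if yours differ.

import re

def latest_checkpoint(filenames: list[str]) -> str:
    def step(name: str) -> int:
        m = re.search(r"_(\d+)\.safetensors$", name)
        return int(m.group(1)) if m else -1
    return max(filenames, key=step)

files = [
    "my_lora_000000500.safetensors",
    "my_lora_000001000.safetensors",
    "my_lora_000000750.safetensors",
]
print(latest_checkpoint(files))  # -> my_lora_000001000.safetensors
```

That said, the highest-step file is not always the best one; if you can spare the disk space, download several checkpoints and compare them by generating images.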


8. Stop the Pod

RunPod charges while the Pod is running, even if you are not actively working.

Do not forget to stop it.

  • Return to the RunPod page
  • Open the running Pod
  • Stop the Pod with Stop
  • Confirm that you have downloaded all necessary files

Stop ends GPU billing, but the storage fee for the Volume disk continues.
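To make that concrete, here is a rough sketch of what a stopped Pod still costs per month. The per-GB rate below is a placeholder, not RunPod's actual storage price, so check the current pricing before deciding between Stop and Terminate.

```python
# Approximate storage cost of a stopped Pod's volume disk.
# The per-GB monthly rate is a placeholder -- check RunPod's
# current storage pricing for the real figure.

def stopped_storage_cost(volume_gb: float, days: int,
                         rate_per_gb_month: float = 0.10) -> float:
    """Approximate storage spend while a Pod sits stopped for `days` days."""
    return round(volume_gb * rate_per_gb_month * days / 30, 2)

# Example: a hypothetical 50GB volume left stopped for a month.
print(stopped_storage_cost(50, 30))  # -> 5.0
```

A few dollars a month is small, but it quietly eats credits for a Pod you may have forgotten about, which is the argument for Terminate once everything is downloaded.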

If you no longer need AI Toolkit, use Terminate to delete the Pod completely.

  • Delete the Pod with Terminate

Training a specific model

This note mainly covered the part up to launching AI Toolkit from RunPod.

For the concrete flow of training a model with AI Toolkit, see the following note.