Fine-Tune an LLM on a Single Consumer GPU with QLoRA
A step-by-step walkthrough for fine-tuning Llama 4 8B or Qwen3 7B on one 24GB RTX 4090 using QLoRA. Full code, dataset prep, training config, and deployment.
Why QLoRA, Why Now
Fine-tuning a 7-8B LLM on a single GPU used to require painful engineering — CPU offloading, ZeRO stage 3, gradient checkpointing tuned by hand. QLoRA changed that. By combining 4-bit quantization of the frozen base model with low-rank adapters trained in higher precision, you get near-full-fine-tune quality while fitting a 7-8B model comfortably in 24GB VRAM.
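A rough sketch of why the 4-bit base model matters. The arithmetic below counts only the memory taken by the weights themselves. The parameter count of 8e9 is an assumption for an "8B" model, and the figures ignore quantization constants, activations, adapter weights, and optimizer state, all of which add to the real total.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Memory needed to hold the model weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


N_PARAMS = 8e9  # assumption: nominal parameter count of an "8B" model

fp16_gb = weight_memory_gb(N_PARAMS, 16)  # 16-bit base weights
nf4_gb = weight_memory_gb(N_PARAMS, 4)    # 4-bit quantized base weights

print(f"16-bit weights: {fp16_gb:.1f} GB")
print(f"4-bit weights:  {nf4_gb:.1f} GB")
```

At 16 bits the weights alone take roughly 16 GB of a 24GB card before any training state is allocated. At 4 bits they take roughly 4 GB, and the difference is the room that activations, adapters, and batches then share.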
In 2026, a single RTX 4090 or RTX 6000 Ada is enough to produce a specialized model that can match or beat GPT-5 on a narrow task, in a weekend, for the electricity cost of running the GPU. This tutorial walks through the whole pipeline: dataset prep, training config, evaluation, deployment.
We’ll fine-tune Llama 4 8B into a specialized SQL-generation model using the Spider dataset. Swap in your own data and the same pipeline works for customer support, medical notes, code migration, or anything else.
Hardware and Environment
Recommended: 24GB VRAM, which covers the RTX 3090, 4090, A5000, and L4; the 48GB A6000 or a rented cloud equivalent also works. You can squeeze Llama 4 8B into 16GB with aggressive settings, but your batch size suffers.
# Fresh Ubuntu 22.04 or WSL2 with NVIDIA drivers
python -m venv venv && source venv/bin/activate
pip install torch==2.5.1 --index-url https://download.pytorch.org/whl/cu124
pip install transformers accelerate peft bitsandbytes datasets trl
pip install unsloth # used by the code in this tutorial; roughly doubles training speed
We’ll use Unsloth — an open-source library from Daniel Han that reimplements the inner training loop in Triton kernels. It delivers roughly 2x faster QLoRA training and ~50% less memory than vanilla Hugging Face on the same hardware. It’s the default choice in 2026.
Step 1: Prepare the Dataset
Spider is a text-to-SQL dataset with ~10k examples (the Hugging Face train split holds 7,000 of them). For any real fine-tune, aim for at least 1k high-quality examples. With fewer than that, you’re usually better off with few-shot prompting.
from datasets import load_dataset
raw = load_dataset("spider", split="train")
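def format_example(row):
    # Note: the Hugging Face Spider rows carry `db_id` rather than schema text, so
    # `db_schema` is assumed to have been added to each row beforehand.
    prompt = f"""### Schema:
{row['db_schema']}
### Question:
{row['question']}
### SQL:
"""
    # Append the stop token so the model learns where the answer ends
    return {"text": prompt + row['query'] + "</s>"}
formatted = raw.map(format_example, remove_columns=raw.column_names)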
formatted.save_to_disk("./spider-formatted")
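The template above needs a schema string for every example. One possible way to produce it, sketched here under the assumption that you have each database as a SQLite file, is to read the CREATE TABLE statements straight from SQLite. The helper names `schema_from_sqlite` and `build_prompt` are hypothetical, and the in-memory database only stands in for a real one.

```python
import sqlite3


def schema_from_sqlite(conn: sqlite3.Connection) -> str:
    """Return the CREATE TABLE statements of a database as one string."""
    rows = conn.execute(
        "SELECT sql FROM sqlite_master "
        "WHERE type = 'table' AND sql IS NOT NULL ORDER BY name"
    ).fetchall()
    return "\n".join(sql.strip() for (sql,) in rows)


def build_prompt(schema: str, question: str) -> str:
    """Same layout as the training template, without the answer."""
    return f"### Schema:\n{schema}\n### Question:\n{question}\n### SQL:\n"


# Toy database standing in for a real one (hypothetical tables)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE singer (id INTEGER PRIMARY KEY, name TEXT, age INTEGER)")
conn.execute("CREATE TABLE concert (id INTEGER PRIMARY KEY, singer_id INTEGER, year INTEGER)")

demo_schema = schema_from_sqlite(conn)
demo_prompt = build_prompt(demo_schema, "How many singers are there?")
print(demo_prompt)
```

Using one function for both training and inference prompts keeps the template identical in both places, which is the property the model relies on.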
Key principles for fine-tune datasets:
- Every example should be correct. A model trained on noisy labels learns noise.
- Use a consistent prompt template. The model memorizes the structure.
- Include the stop token explicitly: </s> or <|endoftext|>, depending on the base model.
- Balance your distribution. If 80% of your examples are short, the model will struggle with long ones.
Step 2: Load the Base Model in 4-bit
from unsloth import FastLanguageModel
import torch
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-4-8b-bnb-4bit",
    max_seq_length=2048,
    dtype=None, # auto-detect
    load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16, # LoRA rank
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=32,
    lora_dropout=0.0,
    bias="none",
    use_gradient_checkpointing="unsloth",
    random_state=42,
)
The important hyperparameters:
- r (rank): controls adapter capacity. 8-16 for most tasks, 32-64 for hard reasoning transfers.
- lora_alpha: scaling factor. Rule of thumb: alpha = 2×r.
- target_modules: which layers get adapters. Including all attention + MLP layers is standard.
- max_seq_length: longer = more memory. 2048 is fine for most tasks; 4096 needs ~20GB.
With these settings on a 4090, peak VRAM is ~18GB. You have headroom for a batch size of 4-8.
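To make the effect of rank and target modules concrete, here is a small sketch that counts the trainable LoRA parameters. A LoRA adapter on a linear layer adds two matrices with r × (d_in + d_out) parameters in total. The layer dimensions below are an assumption (Llama-3-8B-style shapes), since the exact shapes of the model used here are not given in this article.

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """A LoRA adapter adds A (r x d_in) and B (d_out x r)."""
    return r * (d_in + d_out)


# Assumption: Llama-3-8B-style dimensions, used purely for illustration
HIDDEN, KV_DIM, FFN, LAYERS = 4096, 1024, 14336, 32

SHAPES = {  # module name -> (d_in, d_out)
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, FFN),
    "up_proj": (HIDDEN, FFN),
    "down_proj": (FFN, HIDDEN),
}


def total_lora_params(r: int) -> int:
    """Trainable parameters when every listed module in every layer gets an adapter."""
    per_layer = sum(lora_params(d_in, d_out, r) for d_in, d_out in SHAPES.values())
    return LAYERS * per_layer


trainable_r16 = total_lora_params(16)
trainable_r64 = total_lora_params(64)
print(f"r=16: {trainable_r16:,} trainable parameters")  # 41,943,040
print(f"r=64: {trainable_r64:,} trainable parameters")  # 167,772,160
```

Under these assumed shapes, r=16 trains about 42M parameters, roughly half a percent of an 8B model, and the count scales linearly with the rank.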
Step 3: Training Configuration
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_from_disk
dataset = load_from_disk("./spider-formatted")
# Note: this call matches older TRL releases. Recent TRL versions expect
# dataset_text_field and the sequence length on SFTConfig, and the tokenizer
# as processing_class, so pin TRL or adapt the call if it is rejected.
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    args=TrainingArguments(
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4, # effective batch size 16
        warmup_steps=50,
        num_train_epochs=3,
        learning_rate=2e-4,
        fp16=False,
        bf16=True,
        logging_steps=10,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="cosine",
        seed=42,
        output_dir="./outputs",
        save_strategy="epoch",
        report_to="none",
    ),
)
trainer.train()
The tuning knobs that matter most:
- learning_rate: 2e-4 is standard for QLoRA. Drop to 1e-4 for sensitive tasks, bump to 3e-4 for big dataset shifts.
- num_train_epochs: 3 is usually right. More than 5 almost always overfits on small datasets.
- effective batch size: 16 is the sweet spot for stability on small datasets.
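- adamw_8bit: quantized optimizer states. With LoRA, only the adapter weights carry optimizer state, so the saving is modest (a few hundred MB rather than gigabytes), but it comes with near-zero quality impact.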
On a 4090, this trains in about 90 minutes on 10k examples. On an A100, about 25 minutes.
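The batch size, epoch count, warmup, and scheduler settings interact, so it helps to see them as numbers. The sketch below derives the optimizer step count for the configuration above and evaluates a linear-warmup plus cosine-decay schedule. It is an illustration of the schedule's shape, not the trainer's own code, and the trainer may round the step count slightly differently.

```python
import math


def total_optimizer_steps(n_examples: int, per_device_batch: int,
                          grad_accum: int, epochs: int) -> int:
    """Optimizer updates over the whole run for a single GPU."""
    effective_batch = per_device_batch * grad_accum
    return math.ceil(n_examples / effective_batch) * epochs


def lr_at(step: int, total_steps: int, peak_lr: float = 2e-4,
          warmup_steps: int = 50) -> float:
    """Linear warmup to peak_lr, then cosine decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


total = total_optimizer_steps(10_000, per_device_batch=4, grad_accum=4, epochs=3)
warmup_fraction = 50 / total

print(f"total optimizer steps: {total}")
print(f"warmup fraction: {warmup_fraction:.1%}")
for step in (0, 25, 50, total // 2, total):
    print(f"step {step:>4}: lr = {lr_at(step, total):.2e}")
```

With 10k examples and an effective batch of 16, the run has 1,875 optimizer steps, so 50 warmup steps is a little under 3% of training.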
Step 4: Evaluate the Model
Don’t trust training loss. Always hold out a test set and run actual inference:
FastLanguageModel.for_inference(model)  # switch Unsloth to inference mode
test_set = load_dataset("spider", split="validation").select(range(100))
correct = 0
for row in test_set:
    # Rebuild the training prompt and cut everything after the final marker
    prompt = format_example(row)["text"].split("### SQL:")[0] + "### SQL:\n"
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    # Greedy decoding; temperature has no effect when do_sample=False, so it is omitted
    out = model.generate(**inputs, max_new_tokens=128, do_sample=False)
    predicted = tokenizer.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
    if predicted.strip().split("</s>")[0].strip() == row["query"].strip():
        correct += 1
print(f"Exact match accuracy: {correct/len(test_set):.2%}")
For Spider, exact match is overly strict — use the official evaluation script for a more realistic score (it compares execution results against a SQLite database). Expect 65-75% execution accuracy after 3 epochs on Llama 4 8B. GPT-5 base scores ~70% on the same benchmark without fine-tuning, so a well-tuned 8B can match frontier models on a narrow task.
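To illustrate the difference between exact match and execution accuracy, here is a simplified sketch that runs the predicted and gold queries against a SQLite database and compares the result rows. It is not the official Spider evaluation script: it ignores row order, uses a toy in-memory database, and the helper names are hypothetical.

```python
import sqlite3
from collections import Counter


def run_query(conn, sql):
    """Return the result rows as a multiset, or None if the query fails."""
    try:
        return Counter(conn.execute(sql).fetchall())
    except sqlite3.Error:
        return None


def execution_match(conn, predicted_sql, gold_sql):
    """True when both queries run and return the same rows (order ignored)."""
    gold = run_query(conn, gold_sql)
    pred = run_query(conn, predicted_sql)
    return gold is not None and pred == gold


# Toy database standing in for a Spider database
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE singer (id INTEGER PRIMARY KEY, name TEXT, age INTEGER);
INSERT INTO singer VALUES (1, 'Ana', 31), (2, 'Bo', 24), (3, 'Cy', 45);
""")

pairs = [  # (predicted, gold)
    ("SELECT COUNT(*) FROM singer", "select count(id) from singer"),
    ("SELECT name FROM singer WHERE age > 30", "SELECT name FROM singer WHERE age > 40"),
    ("SELEC name FROM singer", "SELECT name FROM singer"),
]

results = [execution_match(conn, pred, gold) for pred, gold in pairs]
accuracy = sum(results) / len(results)
print(results, f"{accuracy:.2%}")
```

The first pair would fail a string comparison yet returns the same result, which is the kind of case exact match penalizes unfairly. Queries with ORDER BY need an order-sensitive comparison, which this sketch leaves out.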
Step 5: Merge and Deploy
For production you usually want the LoRA adapter merged into the base weights:
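# Merge the adapter into 16-bit weights with Unsloth's helper, which also saves the tokenizer.
# Calling merge_and_unload() on the 4-bit model keeps the base weights quantized,
# which does not match serving with --dtype bfloat16.
model.save_pretrained_merged("./llama4-sql-merged", tokenizer, save_method="merged_16bit")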
Serve it with vLLM for production throughput:
pip install vllm
python -m vllm.entrypoints.openai.api_server \
--model ./llama4-sql-merged \
--dtype bfloat16 \
--max-model-len 2048 \
--port 8000
vLLM gives you OpenAI-compatible endpoints and PagedAttention-powered throughput that handily beats Hugging Face Transformers. On a single 4090 you can serve 40-60 requests/second on short prompts.
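A minimal client sketch for the server started above, using only the standard library. It assumes the server listens on localhost:8000, that the served model name is the path passed to --model, and that the prompt template matches the one used in training. The function names are hypothetical, and `generate_sql` only works while the server is running.

```python
import json
import urllib.request


def build_completion_request(schema, question,
                             model="./llama4-sql-merged",
                             base_url="http://localhost:8000"):
    """Build a POST request for the OpenAI-compatible /v1/completions endpoint."""
    prompt = f"### Schema:\n{schema}\n### Question:\n{question}\n### SQL:\n"
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": 128,
        "temperature": 0.0,   # deterministic output for SQL generation
        "stop": ["</s>"],     # assumption: same stop token as in training
    }
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate_sql(schema, question):
    """Send the request and return the generated SQL (needs a live server)."""
    req = build_completion_request(schema, question)
    with urllib.request.urlopen(req, timeout=30) as resp:
        body = json.load(resp)
    return body["choices"][0]["text"].strip()


# Building the request does not touch the network
req = build_completion_request("CREATE TABLE t (a INT)", "How many rows are in t?")
print(req.full_url)
print(json.loads(req.data)["prompt"])
```

Because the endpoint follows the OpenAI completions format, an existing OpenAI client pointed at this base URL should also work in place of the raw HTTP call.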
Cost Reality Check
A rough cost comparison for fine-tuning + serving 1M SQL queries per month:
| Option | Fine-tune cost | Monthly serving | Notes |
|---|---|---|---|
| GPT-5 API | $0 | ~$2,500 | No training required |
| Claude Sonnet 4.6 API | $0 | ~$3,000 | Best zero-shot quality |
| QLoRA Llama 4 8B on rented 4090 | ~$10 | ~$350 | 1 GPU, ~40 req/s |
| QLoRA Llama 4 8B owned 4090 | electricity | ~$50 | ~$1,500 hardware upfront |
Past ~500k queries/month, self-hosted fine-tunes are cheaper than frontier APIs. For specialized tasks where a fine-tuned 8B matches or beats frontier models, the economics strongly favor self-hosting.
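The break-even point can be checked against the table with a few lines of arithmetic. This sketch assumes API cost scales linearly with query volume and that self-hosting is a flat monthly cost up to the GPU's capacity. It uses only the table's figures, so costs the table does not itemize (engineering time, redundancy, monitoring) would push the real threshold higher.

```python
def break_even_queries(api_cost_per_million: float, self_host_monthly: float) -> float:
    """Monthly query volume at which the API bill equals the self-hosting bill."""
    cost_per_query = api_cost_per_million / 1_000_000
    return self_host_monthly / cost_per_query


GPT5_PER_MILLION = 2500  # from the table: ~$2,500 per 1M queries
RENTED_4090 = 350        # from the table: ~$350 per month
OWNED_4090 = 50          # from the table: ~$50 per month, hardware excluded

rented = round(break_even_queries(GPT5_PER_MILLION, RENTED_4090))
owned = round(break_even_queries(GPT5_PER_MILLION, OWNED_4090))

print(f"rented 4090 breaks even at ~{rented:,} queries/month")
print(f"owned 4090 breaks even at ~{owned:,} queries/month")
```

On the table's numbers alone, the rented GPU pays for itself at about 140,000 queries per month and the owned GPU at about 20,000, before counting the upfront hardware cost.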
Pitfalls That Will Bite You
- Catastrophic forgetting. Train too many epochs and your model forgets general instruction following. Mix 10-20% of general instruction data (UltraChat, Alpaca) into your training set as a regularizer.
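- Data leakage in eval. Make sure test examples aren’t in your train set. Spider’s official splits use separate databases, but near-duplicate questions can still creep in if you re-split or augment the data.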
- Bad stop tokens. If the model keeps generating past your answer, check that your training template includes the EOS token.
- Oversized rank. r=64 isn’t automatically better than r=16 — larger ranks need more data to avoid overfitting.
- Ignoring inference quantization. Fine-tuning in 4-bit and then deploying the merged model in fp16 or bf16 works fine in practice, but deploying at yet another quantization can shift outputs. Re-run your eval on the exact artifact you serve.
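Three of the pitfalls above can be checked mechanically before training starts. The sketch below shows one way to do it: detect test questions that also appear in the training set, flag examples missing the stop token, and mix a fixed share of general instruction data into the task data. All function names are hypothetical, and "</s>" is assumed to be the stop token.

```python
import random
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return re.sub(r"\s+", " ", text.strip().lower())


def find_leaks(train_questions, test_questions):
    """Return test questions that also appear in the training set."""
    seen = {normalize(q) for q in train_questions}
    return [q for q in test_questions if normalize(q) in seen]


def missing_eos(texts, eos="</s>"):
    """Return indices of training examples that do not end with the stop token."""
    return [i for i, t in enumerate(texts) if not t.rstrip().endswith(eos)]


def mix_in_general(task_examples, general_examples, general_fraction=0.15, seed=42):
    """Shuffle task data together with enough general data to reach general_fraction."""
    if not 0 <= general_fraction < 1:
        raise ValueError("general_fraction must be in [0, 1)")
    n_general = round(len(task_examples) * general_fraction / (1 - general_fraction))
    n_general = min(n_general, len(general_examples))
    rng = random.Random(seed)
    mixed = list(task_examples) + rng.sample(list(general_examples), n_general)
    rng.shuffle(mixed)
    return mixed


leaks = find_leaks(["How many singers?"], ["how  many singers? ", "List all names"])
bad = missing_eos(["SELECT 1</s>", "SELECT 2"])
task = [f"task-{i}" for i in range(85)]
general = [f"gen-{i}" for i in range(100)]
mixed = mix_in_general(task, general, general_fraction=0.15)
print(leaks, bad, len(mixed))
```

Exact-match leak detection after normalization only catches verbatim duplicates. Paraphrased overlap needs a fuzzier comparison, which is outside this sketch.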
What To Do Next
- Try different base models (Qwen3 7B, DeepSeek-coder, Phi-4) on the same task
- Compare against DPO/ORPO preference tuning for tasks with ranked outputs
- Experiment with longer contexts using rope scaling
- Deploy the merged model behind a vLLM + FastAPI gateway for production
QLoRA on consumer hardware is the most underappreciated leverage in 2026. If you have 100 hours of specialized data, you can almost certainly build something better than any frontier API on that task — and run it for pennies.
Appendix: Hyperparameter Cheatsheet
After hundreds of QLoRA training runs on different base models, these are the defaults that almost always work as a starting point:
| Parameter | Value | Notes |
|---|---|---|
| r (rank) | 16 | Raise to 32-64 only for large domain shifts |
| lora_alpha | 32 | Keep at 2x rank |
| lora_dropout | 0.0 | Raise to 0.05 if overfitting |
| learning_rate | 2e-4 | Drop to 1e-4 for sensitive tasks |
| epochs | 3 | Never exceed 5 without a reason |
| batch size (effective) | 16 | Fine for 1k-100k dataset sizes |
| warmup_steps | 50 or 3% of total | Helps training stability |
| weight_decay | 0.01 | Standard AdamW |
| lr_scheduler | cosine | Slightly better than linear |
| max_seq_length | 2048 | Raise only if your data needs it |
If any of these are clearly wrong for your task, the symptoms usually appear in training loss within the first 100 steps. Loss that flatlines after a few steps means the learning rate is too low. Loss that spikes means the learning rate is too high or gradients are exploding. Loss that keeps falling on the training set while eval loss gets worse means you’re overfitting: reduce epochs or increase dataset diversity.
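The three symptoms described above can be turned into a rough automated check on logged loss values. This is a heuristic sketch only: the thresholds `flat_tol` and `spike_ratio` are arbitrary assumptions, and the function name is hypothetical.

```python
def diagnose(train_loss, eval_loss=None, flat_tol=0.01, spike_ratio=1.5):
    """Return a list of symptom labels found in the logged loss curves."""
    notes = []
    if len(train_loss) < 2:
        return notes

    # Flatline: loss has barely moved between the first and last logged value
    start, end = train_loss[0], train_loss[-1]
    if abs(start - end) <= flat_tol * start:
        notes.append("flatline")

    # Spike: a value jumps well above the lowest loss seen so far
    running_min = train_loss[0]
    for value in train_loss[1:]:
        if value > spike_ratio * running_min:
            notes.append("spike")
            break
        running_min = min(running_min, value)

    # Overfit: training loss went down while eval loss rose from its best value
    if eval_loss and len(eval_loss) >= 2:
        if end < start and eval_loss[-1] > min(eval_loss):
            notes.append("overfit")
    return notes


flat = diagnose([2.0, 2.0, 1.99, 2.0])
spiky = diagnose([2.0, 1.5, 1.2, 3.0, 1.1])
overfit = diagnose([2.0, 1.2, 0.6, 0.3], eval_loss=[1.8, 1.3, 1.4, 1.6])
healthy = diagnose([2.0, 1.5, 1.2, 1.0], eval_loss=[1.9, 1.5, 1.3, 1.2])
print(flat, spiky, overfit, healthy)
```

Running a check like this on the logs of each short experiment makes it easier to compare runs with different settings side by side.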
Treat your first few fine-tunes as experiments, not production artifacts. You’ll learn more from five 30-minute runs with different settings than one 5-hour run with your best guess.