Fine-Tune an LLM on a Single Consumer GPU with QLoRA

Why QLoRA, Why Now

Fine-tuning a 7-8B LLM on a single GPU used to require painful engineering — CPU offloading, ZeRO stage 3, gradient checkpointing tuned by hand. QLoRA changed that. By combining 4-bit quantization of the frozen base model with low-rank adapters trained in higher precision, you get near-full-fine-tune quality while fitting a 7-8B model comfortably in 24GB VRAM.
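The memory argument can be checked with back-of-the-envelope arithmetic. The sketch below assumes a round 8 billion parameters and counts only the stored weights; quantization constants, activations, adapters, optimizer state, and CUDA overhead all come on top.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Memory needed to store the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

N_PARAMS = 8e9  # assumption: round figure for an "8B" model

bf16_gb = weight_memory_gb(N_PARAMS, 16)  # full-precision baseline
nf4_gb = weight_memory_gb(N_PARAMS, 4)    # 4-bit quantized base model

print(f"bf16 weights:  {bf16_gb:.1f} GB")
print(f"4-bit weights: {nf4_gb:.1f} GB")
```

The 4x reduction in weight storage is what leaves room on a 24GB card for activations and the trainable adapters.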

In 2026, a single RTX 4090 or RTX 6000 Ada is enough to produce a specialized model that can match or beat GPT-5 on your narrow task, in a weekend, for little more than the cost of the electricity to run the GPU. This tutorial walks through the whole pipeline: dataset prep, training config, evaluation, and deployment.

We’ll fine-tune Llama 4 8B into a specialized SQL-generation model using the Spider dataset. Swap in your own data and the same pipeline works for customer support, medical notes, code migration, or anything else.

Hardware and Environment

Recommended: 24GB of VRAM or more. An RTX 3090, RTX 4090, A5000, A6000, L4, or a rented cloud equivalent all work. You can squeeze Llama 4 8B into 16GB with aggressive settings, but your batch size suffers.

# Fresh Ubuntu 22.04 or WSL2 with NVIDIA drivers
python -m venv venv && source venv/bin/activate
pip install torch==2.5.1 --index-url https://download.pytorch.org/whl/cu124
pip install transformers accelerate peft bitsandbytes datasets trl
pip install unsloth  # used in the steps below; roughly doubles training speed

We’ll use Unsloth, an open-source library from Daniel and Michael Han that reimplements the performance-critical parts of the training path as hand-written Triton kernels. The project reports roughly 2x faster QLoRA training and about 50% less memory than vanilla Hugging Face on the same hardware. It’s the default choice in 2026.

Step 1: Prepare the Dataset

Spider is a text-to-SQL dataset with ~10k examples. For any real fine-tune, aim for at least 1k high-quality examples. With fewer than that, you’re better off with few-shot prompting.

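import json
from datasets import load_dataset

raw = load_dataset("spider", split="train")

# The Hugging Face rows carry db_id, question, and query but no schema text,
# so build it from tables.json in the official Spider download (path is an assumption).
with open("./spider/tables.json") as f:
    tables = {db["db_id"]: db for db in json.load(f)}

def schema_text(db_id):
    db = tables[db_id]
    columns = {}
    for table_idx, name in db["column_names_original"]:
        if table_idx >= 0:  # index -1 is the "*" pseudo-column
            columns.setdefault(table_idx, []).append(name)
    return "\n".join(f"{t}({', '.join(columns.get(i, []))})" for i, t in enumerate(db["table_names_original"]))

def format_example(row):
    prompt = f"""### Schema:
{schema_text(row['db_id'])}

### Question:
{row['question']}

### SQL:
"""
    # "</s>" must match the base model's EOS token (check tokenizer.eos_token)
    return {"text": prompt + row['query'] + "</s>"}

formatted = raw.map(format_example, remove_columns=raw.column_names)
formatted.save_to_disk("./spider-formatted")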

Key principles for fine-tune datasets:

Every example should be correct. A model trained on noisy labels learns noise.
Use a consistent prompt template. The model memorizes the structure.
Include the stop token explicitly. </s> or <|endoftext|> depending on the base model.
Balance your distribution. If 80% of your examples are short, the model will struggle with long ones.
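Several of these principles can be checked mechanically before any GPU time is spent. The following is a minimal sketch in plain Python, assuming the prompt template from this tutorial; `check_example`, `length_histogram`, and the `EOS` constant are illustrative names, not part of any library.

```python
from collections import Counter

REQUIRED_MARKERS = ["### Schema:", "### Question:", "### SQL:"]  # template used in this tutorial
EOS = "</s>"  # assumption: replace with your base model's EOS token

def check_example(text: str) -> list:
    """Return a list of problems found in one formatted training example."""
    problems = []
    positions = [text.find(marker) for marker in REQUIRED_MARKERS]
    if -1 in positions:
        problems.append("missing template marker")
    elif positions != sorted(positions):
        problems.append("template markers out of order")
    if not text.endswith(EOS):
        problems.append("missing EOS token")
    elif not text[: -len(EOS)].split("### SQL:")[-1].strip():
        problems.append("empty answer")
    return problems

def length_histogram(texts, bucket=200):
    """Count examples per character-length bucket to spot a skewed distribution."""
    return Counter((len(t) // bucket) * bucket for t in texts)

good = ("### Schema:\nsinger(id, name)\n\n### Question:\nHow many singers?\n\n"
        "### SQL:\nSELECT count(*) FROM singer</s>")
bad = "### Question:\nHow many singers?\n\n### SQL:\nSELECT count(*) FROM singer"

print(check_example(good))  # no problems expected
print(check_example(bad))   # missing schema marker and EOS
print(length_histogram([good, bad]))
```

Running `check_example` over every row and printing the histogram takes seconds and catches template drift and missing stop tokens. Whether the labels themselves are correct still needs human review or execution against the database.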

Step 2: Load the Base Model in 4-bit

from unsloth import FastLanguageModel
import torch

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-4-8b-bnb-4bit",
    max_seq_length=2048,
    dtype=None,  # auto-detect
    load_in_4bit=True,
)

model = FastLanguageModel.get_peft_model(
    model,
    r=16,                    # LoRA rank
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=32,
    lora_dropout=0.0,
    bias="none",
    use_gradient_checkpointing="unsloth",
    random_state=42,
)

The important hyperparameters:

r (rank): controls adapter capacity. 8-16 for most tasks, 32-64 for hard reasoning transfers.
lora_alpha: scaling factor. Rule of thumb: alpha = 2×r.
target_modules: which layers get adapters. Including all attention + MLP layers is standard.
max_seq_length: longer = more memory. 2048 is fine for most tasks; 4096 needs ~20GB.

With these settings on a 4090, peak VRAM is ~18GB. You have headroom for a batch size of 4-8.
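To see how rank translates into trainable parameters, the count can be worked out by hand. The layer shapes below are an assumption (they are the published Llama-3-8B-style dimensions); read the real values from your base model's config.json.

```python
# Assumed Llama-3-8B-style shapes; check your base model's config.json.
HIDDEN = 4096
KV_DIM = 1024         # 8 KV heads x 128 head dim (grouped-query attention)
INTERMEDIATE = 14336
N_LAYERS = 32

# (in_features, out_features) of each targeted projection
SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_param_count(r: int) -> int:
    """Each adapted layer adds A (r x in) and B (out x r): r * (in + out) weights."""
    per_layer = sum(r * (d_in + d_out) for d_in, d_out in SHAPES.values())
    return per_layer * N_LAYERS

for r in (8, 16, 64):
    print(f"r={r}: {lora_param_count(r):,} trainable parameters")
```

Under these assumed shapes, r=16 comes to about 42 million trainable parameters, roughly 0.5% of the base model, and the count scales linearly with rank.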

Step 3: Training Configuration

from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_from_disk

dataset = load_from_disk("./spider-formatted")

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    args=TrainingArguments(
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,  # effective batch size 16
        warmup_steps=50,
        num_train_epochs=3,
        learning_rate=2e-4,
        fp16=False,
        bf16=True,
        logging_steps=10,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="cosine",
        seed=42,
        output_dir="./outputs",
        save_strategy="epoch",
        report_to="none",
    ),
)

trainer.train()

The tuning knobs that matter most:

learning_rate: 2e-4 is standard for QLoRA. Drop to 1e-4 for sensitive tasks, bump to 3e-4 for big dataset shifts.
num_train_epochs: 3 is usually right. More than 5 almost always overfits on small datasets.
effective batch size: 16 is the sweet spot for stability on small datasets.
adamw_8bit: quantized optimizer states. With LoRA, only the adapter weights carry optimizer state, so the saving is a few hundred MB rather than gigabytes, with near-zero quality impact.

On a 4090, this trains in about 90 minutes on 10k examples. On an A100, about 25 minutes.
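The step count behind those timings follows from the batch settings. This sketch assumes the configuration above and a hypothetical helper name, `training_schedule`; the Trainer's own count can differ by a step or so depending on how it handles the last partial batch.

```python
import math

def training_schedule(n_examples, per_device_batch=4, grad_accum=4,
                      epochs=3, warmup_ratio=0.03):
    """Derive optimizer steps and a ratio-based warmup from the batch settings."""
    effective_batch = per_device_batch * grad_accum
    steps_per_epoch = math.ceil(n_examples / effective_batch)
    total_steps = steps_per_epoch * epochs
    return {
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_steps_at_ratio": round(total_steps * warmup_ratio),
    }

schedule = training_schedule(10_000)
for key, value in schedule.items():
    print(f"{key}: {value}")
```

For 10k examples, a 3% warmup lands close to the fixed `warmup_steps=50` used in the config. On a much smaller dataset, 50 warmup steps would eat a large share of training, so a ratio is the safer choice there.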

Step 4: Evaluate the Model

Don’t trust training loss. Always hold out a test set and run actual inference:

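FastLanguageModel.for_inference(model)  # switch Unsloth to its inference mode

# Validation rows are grouped by database, so shuffle before sampling 100 of them
test_set = load_dataset("spider", split="validation").shuffle(seed=42).select(range(100))

correct = 0
for row in test_set:
    # Reuse the training template, cut off right after the "### SQL:" header
    prompt = format_example(row)["text"].split("### SQL:")[0] + "### SQL:\n"
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    # Greedy decoding: with do_sample=False no temperature is needed
    out = model.generate(**inputs, max_new_tokens=128, do_sample=False)
    # Decode only the newly generated tokens, not the prompt
    predicted = tokenizer.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
    if predicted.strip().split("</s>")[0].strip() == row["query"].strip():
        correct += 1

print(f"Exact match accuracy: {correct/len(test_set):.2%}")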

For Spider, exact match is overly strict — use the official evaluation script for a more realistic score (it compares execution results against a SQLite database). Expect 65-75% execution accuracy after 3 epochs on Llama 4 8B. GPT-5 base scores ~70% on the same benchmark without fine-tuning, so a well-tuned 8B can match frontier models on a narrow task.
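The idea behind execution accuracy can be illustrated with the standard-library sqlite3 module. This is a simplified sketch, not the official Spider script: the `singer` table is a toy example, `execution_match` is a hypothetical helper, and the comparison ignores row order, so it would not penalize a missing ORDER BY.

```python
import sqlite3

def execution_match(conn, predicted_sql, gold_sql):
    """True if both queries run and return the same rows (order-insensitive)."""
    try:
        pred_rows = conn.execute(predicted_sql).fetchall()
    except sqlite3.Error:
        return False  # SQL that fails to execute counts as a miss
    gold_rows = conn.execute(gold_sql).fetchall()
    return sorted(map(repr, pred_rows)) == sorted(map(repr, gold_rows))

# Toy in-memory database standing in for one of Spider's SQLite files
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE singer (id INTEGER PRIMARY KEY, name TEXT, age INTEGER);
    INSERT INTO singer VALUES (1, 'Ana', 31), (2, 'Bo', 24), (3, 'Cy', 45);
""")

gold = "SELECT count(*) FROM singer WHERE age > 30"
same_result = execution_match(conn, "SELECT COUNT(id) FROM singer WHERE age > 30", gold)
wrong_result = execution_match(conn, "SELECT count(*) FROM singer", gold)
invalid_sql = execution_match(conn, "SELEC count(*) FROM singer", gold)

print(same_result, wrong_result, invalid_sql)
```

The first prediction differs from the gold query as text but returns the same rows, which is exactly the case exact-match scoring gets wrong.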

Step 5: Merge and Deploy

For production you usually want the LoRA adapter merged into the base weights:

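# merge_and_unload() on a 4-bit model would merge into quantized weights;
# Unsloth's helper dequantizes and writes a 16-bit merged checkpoint (plus tokenizer) instead.
model.save_pretrained_merged("./llama4-sql-merged", tokenizer, save_method="merged_16bit")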

Serve it with vLLM for production throughput:

pip install vllm
python -m vllm.entrypoints.openai.api_server \
    --model ./llama4-sql-merged \
    --dtype bfloat16 \
    --max-model-len 2048 \
    --port 8000

vLLM gives you OpenAI-compatible endpoints and PagedAttention-powered throughput that handily beats Hugging Face Transformers. On a single 4090 you can serve 40-60 requests/second on short prompts.
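Once the server is up, a client only needs to send the same prompt template used in training. Below is a minimal standard-library sketch. It assumes the server from the command above is listening on localhost:8000 and that the model is addressed by the `--model` path, which is vLLM's default. The `###` stop string is an added guard, not something the training setup requires.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1/completions"  # port from the command above
MODEL = "./llama4-sql-merged"  # assumption: served under the --model value

def build_payload(schema: str, question: str) -> dict:
    """Build a completions request using the training prompt template."""
    prompt = f"### Schema:\n{schema}\n\n### Question:\n{question}\n\n### SQL:\n"
    return {
        "model": MODEL,
        "prompt": prompt,
        "max_tokens": 128,
        "temperature": 0.0,   # deterministic output for SQL
        "stop": ["###"],      # guard against running into a new section
    }

def generate_sql(schema: str, question: str) -> str:
    """POST the request and return the generated SQL text."""
    request = urllib.request.Request(
        BASE_URL,
        data=json.dumps(build_payload(schema, question)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        body = json.load(response)
    return body["choices"][0]["text"].strip()

payload = build_payload("singer(id, name, age)", "How many singers are there?")
print(payload["prompt"])
```

With the server running, `generate_sql("singer(id, name, age)", "How many singers are there?")` returns the model's query as a string. Any deviation from the training template, even in whitespace, tends to hurt output quality.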

Cost Reality Check

A rough cost comparison for fine-tuning + serving 1M SQL queries per month:

Option	Fine-tune cost	Monthly serving	Notes
GPT-5 API	$0	~$2,500	No training required
Claude Sonnet 4.6 API	$0	~$3,000	Best zero-shot quality
QLoRA Llama 4 8B on rented 4090	~$10	~$350	1 GPU, ~40 req/s
QLoRA Llama 4 8B owned 4090	electricity	~$50	~$1,500 hardware upfront

Past ~500k queries/month, self-hosted fine-tunes are cheaper than frontier APIs. For specialized tasks where a fine-tuned 8B matches or beats frontier models, the economics strongly favor self-hosting.
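The break-even point can be worked out from the table's own figures. This sketch makes simplifying assumptions: API cost scales linearly with query volume, the rented GPU is a flat monthly fee, and one GPU handles the load.

```python
def break_even_queries(api_cost_at_1m: float, fixed_monthly: float) -> float:
    """Monthly query volume at which a flat-fee GPU matches per-query API pricing."""
    cost_per_query = api_cost_at_1m / 1_000_000
    return fixed_monthly / cost_per_query

RENTED_4090_MONTHLY = 350.0  # from the table above

gpt5_break_even = break_even_queries(2500.0, RENTED_4090_MONTHLY)
claude_break_even = break_even_queries(3000.0, RENTED_4090_MONTHLY)

print(f"vs GPT-5 API:  {gpt5_break_even:,.0f} queries/month")
print(f"vs Claude API: {claude_break_even:,.0f} queries/month")
```

On serving cost alone, the crossover under these assumptions sits well below 500k queries/month. Engineering time, monitoring, and redundancy are not priced into the table, so a higher practical threshold is a reasonable margin.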

Pitfalls That Will Bite You

Catastrophic forgetting. Train too many epochs and your model forgets general instruction following. Mix 10-20% of general instruction data (UltraChat, Alpaca) into your training set as a regularizer.
Data leakage in eval. Make sure test examples aren’t in your train set — Spider has known overlap issues.
Bad stop tokens. If the model keeps generating past your answer, check that your training template includes the EOS token.
Oversized rank. r=64 isn’t automatically better than r=16 — larger ranks need more data to avoid overfitting.
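Ignoring inference quantization. Fine-tuning in 4-bit and then deploying the merged model in fp16 or bf16 works fine, but re-quantizing it with a different scheme than the one used in training can shift outputs. Re-run your eval on the exact artifact you plan to ship.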
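The data-leakage pitfall is the easiest of these to test for automatically. A minimal sketch, with hypothetical helper names, that flags test questions whose normalized text also appears in the training set:

```python
import re

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return re.sub(r"\s+", " ", text.lower()).strip()

def find_leaks(train_questions, test_questions):
    """Return test questions that also occur in the training set."""
    seen = {normalize(q) for q in train_questions}
    return [q for q in test_questions if normalize(q) in seen]

train = ["How many singers  are there?", "List the names of all stadiums."]
test = ["how many singers are there?", "What is the average age of singers?"]

leaks = find_leaks(train, test)
print(leaks)
```

This only catches duplicates that match after normalization. Paraphrases and near-duplicates need fuzzy or embedding-based matching, which is beyond this sketch.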

What To Do Next

Try different base models (Qwen3 7B, DeepSeek-coder, Phi-4) on the same task
Compare against DPO/ORPO preference tuning for tasks with ranked outputs
Experiment with longer contexts using rope scaling
Deploy the merged model behind a vLLM + FastAPI gateway for production

QLoRA on consumer hardware is the most underappreciated leverage in 2026. If you can put 100 hours into curating specialized data, you can almost certainly build something better than any frontier API on that task, and run it for pennies.

Appendix: Hyperparameter Cheatsheet

After running hundreds of QLoRA training runs on different base models, here are defaults that almost always work as a starting point:

Parameter	Value	Notes
r (rank)	16	Raise to 32-64 only for large domain shifts
lora_alpha	32	Keep at 2x rank
lora_dropout	0.0	Raise to 0.05 if overfitting
learning_rate	2e-4	Drop to 1e-4 for sensitive tasks
epochs	3	Never exceed 5 without a reason
batch size (effective)	16	Fine for 1k-100k dataset sizes
warmup_steps	50 or 3% of total	Helps training stability
weight_decay	0.01	Standard AdamW
lr_scheduler	cosine	Slightly better than linear
max_seq_length	2048	Raise only if your data needs it

If any of these are clearly wrong for your task, the symptoms usually appear in training loss within the first 100 steps. Loss that flatlines after a few steps usually means the learning rate is too low. Loss that spikes means the learning rate is too high or gradients are exploding. Training loss that keeps falling while eval loss or eval accuracy gets worse means you’re overfitting: reduce epochs or increase dataset diversity.
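These rules of thumb can be turned into a rough automated check over logged losses. The thresholds below (`flat_tol`, `spike_factor`) are illustrative assumptions rather than established values, and `diagnose` is a hypothetical helper.

```python
def diagnose(train_loss, eval_loss=None, flat_tol=0.02, spike_factor=1.5):
    """Attach rough labels to a loss curve; thresholds are illustrative."""
    findings = []
    if len(train_loss) >= 2:
        start, end = train_loss[0], train_loss[-1]
        # Less than flat_tol relative change from first to last logged value
        if abs(start - end) <= flat_tol * start:
            findings.append("flat: learning rate may be too low")
        # Any step where loss jumps by more than spike_factor
        if any(b > spike_factor * a for a, b in zip(train_loss, train_loss[1:])):
            findings.append("spike: learning rate may be too high")
    if eval_loss and len(eval_loss) >= 2:
        # Train loss fell overall but eval loss is above its best value
        if train_loss[-1] < train_loss[0] and eval_loss[-1] > min(eval_loss):
            findings.append("overfitting: eval loss rising while train loss falls")
    return findings

flat = diagnose([2.0, 1.99, 1.99, 1.98])
spiky = diagnose([2.0, 1.5, 3.1, 1.4])
overfit = diagnose([2.0, 1.2, 0.6, 0.3], eval_loss=[1.9, 1.4, 1.5, 1.8])

print(flat, spiky, overfit, sep="\n")
```

A check like this is a prompt to look at the curve, not a verdict. Noisy losses from small batches can trip the spike rule on a perfectly healthy run.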

Treat your first few fine-tunes as experiments, not production artifacts. You’ll learn more from five 30-minute runs with different settings than one 5-hour run with your best guess.
