Fine-Tune an LLM on a Single Consumer GPU with QLoRA

A step-by-step walkthrough for fine-tuning Llama 4 8B or Qwen3 7B on one 24GB RTX 4090 using QLoRA. Full code, dataset prep, training config, and deployment.

By EgoistAI

Why QLoRA, Why Now

Fine-tuning a 7-8B LLM on a single GPU used to require painful engineering — CPU offloading, ZeRO stage 3, gradient checkpointing tuned by hand. QLoRA changed that. By combining 4-bit quantization of the frozen base model with low-rank adapters trained in higher precision, you get near-full-fine-tune quality while fitting a 7-8B model comfortably in 24GB VRAM.
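
To see why the 4-bit base is what makes this fit, a back-of-the-envelope estimate helps. The sketch below counts weight storage only. It assumes a nominal 8 billion parameters and ignores quantization constants, activations, adapters, and optimizer state, so read the numbers as rough lower bounds rather than measured VRAM.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

PARAMS = 8e9  # assumption: a nominal 8B-parameter model

bf16_gb = weight_memory_gb(PARAMS, 16)  # full-precision-ish baseline
nf4_gb = weight_memory_gb(PARAMS, 4)    # 4-bit quantized base

print(f"bf16 weights : {bf16_gb:.1f} GB")
print(f"4-bit weights: {nf4_gb:.1f} GB")
```

The gap between the two figures is the headroom QLoRA frees up for activations, adapters, and optimizer state on a 24GB card.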

In 2026, a single RTX 4090 or RTX 6000 Ada is enough to produce a specialized model that can outperform GPT-5 on a narrow task, in a weekend, for little more than the cost of the electricity to run the GPU. This tutorial walks through the whole pipeline: dataset prep, training config, evaluation, and deployment.

We’ll fine-tune Llama 4 8B into a specialized SQL-generation model using the Spider dataset. Swap in your own data and the same pipeline works for customer support, medical notes, code migration, or anything else.


Hardware and Environment

Recommended minimum: 24GB VRAM. That means an RTX 3090, 4090, A5000, A6000, L4, or a rented cloud equivalent. You can squeeze Llama 4 8B into 16GB with aggressive settings, but your batch size suffers.

# Fresh Ubuntu 22.04 or WSL2 with NVIDIA drivers
python -m venv venv && source venv/bin/activate
pip install torch==2.5.1 --index-url https://download.pytorch.org/whl/cu124
pip install transformers accelerate peft bitsandbytes datasets trl
pip install unsloth  # optional, but doubles training speed

We’ll use Unsloth — an open-source library from Daniel Han that reimplements the inner training loop in Triton kernels. It delivers roughly 2x faster QLoRA training and ~50% less memory than vanilla Hugging Face on the same hardware. It’s the default choice in 2026.


Step 1: Prepare the Dataset

Spider is a text-to-SQL dataset with ~10k examples. For any real fine-tune, aim for at least 1k high-quality examples. With fewer than that, you’re better off with few-shot prompting.

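from datasets import load_dataset

raw = load_dataset("spider", split="train")

# Spider rows on the Hub carry `db_id`, `question`, and `query`, not the schema text.
# Assumption: `schemas` maps each db_id to a schema string that you build yourself
# from the tables.json file shipped with the Spider download. Fill it before running.
schemas = {}  # hypothetical entry: {"concert_singer": "singer(id, name, age), ..."}

EOS = "</s>"  # stop token; swap in your base model's tokenizer.eos_token if it differs

def format_example(row):
    # Raises KeyError if a database is missing from `schemas`, which is what we want.
    schema = schemas[row["db_id"]]
    prompt = f"""### Schema:
{schema}

### Question:
{row['question']}

### SQL:
"""
    return {"text": prompt + row['query'] + EOS}

formatted = raw.map(format_example, remove_columns=raw.column_names)
formatted.save_to_disk("./spider-formatted")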

Key principles for fine-tune datasets:

  1. Every example should be correct. A model trained on noisy labels learns noise.
  2. Use a consistent prompt template. The model memorizes the structure.
  3. Include the stop token explicitly. </s> or <|endoftext|> depending on the base model.
  4. Balance your distribution. If 80% of your examples are short, the model will struggle with long ones.
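
Principles 2 to 4 are easy to check mechanically before you spend GPU time. Here is a minimal sketch, assuming the `### Schema:` / `### Question:` / `### SQL:` template from this tutorial and `</s>` as the stop token. The helper names (`check_example`, `length_buckets`) are made up for illustration, and character length is only a crude stand-in for token length.

```python
from collections import Counter

EOS = "</s>"  # assumption: must match the stop token of your base model
REQUIRED_MARKERS = ["### Schema:", "### Question:", "### SQL:"]

def check_example(text: str) -> list[str]:
    """Return a list of problems found in one formatted training example."""
    problems = []
    position = -1
    for marker in REQUIRED_MARKERS:
        # Each marker must appear after the previous one.
        found = text.find(marker, position + 1)
        if found == -1:
            problems.append(f"missing or out-of-order marker: {marker}")
            break
        position = found
    if not text.endswith(EOS):
        problems.append("missing stop token at end")
    return problems

def length_buckets(texts, bucket_size=200):
    """Histogram of example lengths in characters, keyed by bucket start."""
    return Counter((len(t) // bucket_size) * bucket_size for t in texts)

examples = [
    "### Schema:\nt(a)\n\n### Question:\nHow many rows?\n\n### SQL:\nSELECT count(*) FROM t</s>",
    "### Question:\nHow many rows?\n\n### SQL:\nSELECT count(*) FROM t",
]
for i, text in enumerate(examples):
    print(i, check_example(text))
print(length_buckets(examples))
```

Run the same checks over `formatted["text"]` and eyeball the histogram: a single dominant bucket is a hint that long examples are underrepresented.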

Step 2: Load the Base Model in 4-bit

from unsloth import FastLanguageModel
import torch

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-4-8b-bnb-4bit",
    max_seq_length=2048,
    dtype=None,  # auto-detect
    load_in_4bit=True,
)

model = FastLanguageModel.get_peft_model(
    model,
    r=16,                    # LoRA rank
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=32,
    lora_dropout=0.0,
    bias="none",
    use_gradient_checkpointing="unsloth",
    random_state=42,
)

The important hyperparameters:

  • r (rank): controls adapter capacity. 8-16 for most tasks, 32-64 for hard reasoning transfers.
  • lora_alpha: scaling factor. Rule of thumb: alpha = 2×r.
  • target_modules: which layers get adapters. Including all attention + MLP layers is standard.
  • max_seq_length: longer = more memory. 2048 is fine for most tasks; 4096 needs ~20GB.
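
To get a feel for what `r` and `lora_alpha` actually control, it helps to count parameters. In standard LoRA, an adapter on a layer with `d_in` inputs and `d_out` outputs adds two small matrices (`r x d_in` and `d_out x r`), and its output is scaled by `lora_alpha / r`. The sketch below uses illustrative layer shapes that are assumptions, not values read from any real checkpoint.

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters added by one LoRA adapter: A is r x d_in, B is d_out x r."""
    return r * (d_in + d_out)

def lora_scale(lora_alpha: int, r: int) -> float:
    """Multiplier applied to the adapter output in standard LoRA."""
    return lora_alpha / r

# Assumption: illustrative layer shapes (input dim, output dim) for one block.
HIDDEN, MLP = 4096, 14336
layer_shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "o_proj": (HIDDEN, HIDDEN),
    "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}

for r in (8, 16, 64):
    total = sum(lora_params(d_in, d_out, r) for d_in, d_out in layer_shapes.values())
    print(f"r={r:>2}: {total:,} adapter params per block, scale={lora_scale(2 * r, r)}")
```

Adapter size grows linearly with `r`, while keeping `lora_alpha` at twice the rank holds the scale constant, which is one reason the 2x rule of thumb makes rank changes less disruptive.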

With these settings on a 4090, peak VRAM is ~18GB. You have headroom for a batch size of 4-8.


Step 3: Training Configuration

from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_from_disk

dataset = load_from_disk("./spider-formatted")

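# These keyword names match older TRL releases. Newer releases expect dataset_text_field
# and max_seq_length on SFTConfig, and take the tokenizer as processing_class.
trainer = SFTTrainer(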
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    args=TrainingArguments(
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,  # effective batch size 16
        warmup_steps=50,
        num_train_epochs=3,
        learning_rate=2e-4,
        fp16=False,
        bf16=True,
        logging_steps=10,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="cosine",
        seed=42,
        output_dir="./outputs",
        save_strategy="epoch",
        report_to="none",
    ),
)

trainer.train()

The tuning knobs that matter most:

  • learning_rate: 2e-4 is standard for QLoRA. Drop to 1e-4 for sensitive tasks, bump to 3e-4 for big dataset shifts.
  • num_train_epochs: 3 is usually right. More than 5 almost always overfits on small datasets.
  • effective batch size: 16 is the sweet spot for stability on small datasets.
  • adamw_8bit: quantized optimizer states. Saves ~4GB of VRAM with near-zero quality impact.
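
It is worth doing the step arithmetic before launching, since warmup and logging intervals are expressed in optimizer steps rather than examples. This is a rough sketch assuming 10k examples and a single GPU. The trainer's own rounding can differ by a step, and `training_steps` is a made-up helper, not a library function.

```python
import math

def training_steps(num_examples, per_device_batch, grad_accum, epochs, num_gpus=1):
    """Return (effective_batch, steps_per_epoch, total_steps) for a simple SFT run."""
    effective_batch = per_device_batch * grad_accum * num_gpus
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return effective_batch, steps_per_epoch, steps_per_epoch * epochs

effective, per_epoch, total = training_steps(
    num_examples=10_000, per_device_batch=4, grad_accum=4, epochs=3
)
print(f"effective batch: {effective}")
print(f"steps per epoch: {per_epoch}, total steps: {total}")
print(f"warmup share   : {50 / total:.1%}")
```

If your dataset is much smaller, a fixed `warmup_steps=50` can eat a large share of the run, so it is worth re-checking that last number.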

On a 4090, this trains in about 90 minutes on 10k examples. On an A100, about 25 minutes.


Step 4: Evaluate the Model

Don’t trust training loss. Always hold out a test set and run actual inference:

FastLanguageModel.for_inference(model)

test_set = load_dataset("spider", split="validation").select(range(100))

correct = 0
for row in test_set:
    prompt = format_example(row)["text"].split("### SQL:")[0] + "### SQL:\n"
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    # Greedy decoding: with do_sample=False, temperature is unused, so it is omitted.
    out = model.generate(**inputs, max_new_tokens=128, do_sample=False)
    predicted = tokenizer.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
    if predicted.strip().split("</s>")[0].strip() == row["query"].strip():
        correct += 1

print(f"Exact match accuracy: {correct/len(test_set):.2%}")

For Spider, exact match is overly strict — use the official evaluation script for a more realistic score (it compares execution results against a SQLite database). Expect 65-75% execution accuracy after 3 epochs on Llama 4 8B. GPT-5 base scores ~70% on the same benchmark without fine-tuning, so a well-tuned 8B can match frontier models on a narrow task.
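
If you want a quick feel for execution-based scoring before wiring up the official script, the idea can be sketched with the standard library alone. This is a simplified stand-in, not the official evaluator: it uses a toy in-memory database with a hypothetical schema, compares result rows while ignoring order, and treats any query that fails to run as wrong.

```python
import sqlite3

def run_query(conn, sql):
    """Execute sql and return rows as a sorted list, or None if the query fails."""
    try:
        rows = conn.execute(sql).fetchall()
    except sqlite3.Error:
        return None
    return sorted(rows, key=repr)  # order-insensitive comparison

def execution_match(conn, predicted_sql, gold_sql):
    """True when both queries run and return the same multiset of rows."""
    gold = run_query(conn, gold_sql)
    pred = run_query(conn, predicted_sql)
    return gold is not None and pred == gold

# Toy database standing in for one Spider SQLite file (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE singer (id INTEGER PRIMARY KEY, name TEXT, age INTEGER);
    INSERT INTO singer (name, age) VALUES ('Ana', 31), ('Bo', 45), ('Cy', 28);
""")

gold = "SELECT name FROM singer WHERE age > 30"
print(execution_match(conn, "SELECT name FROM singer WHERE age >= 31", gold))  # different text, same rows
print(execution_match(conn, "SELECT name FROM singer", gold))                  # runs, wrong rows
print(execution_match(conn, "SELEC name FROM singer", gold))                   # does not parse
```

The first case is exactly what exact match misses: two differently written queries that return the same rows. For real runs, open each database read-only so a bad prediction cannot modify it.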


Step 5: Merge and Deploy

For production you usually want the LoRA adapter merged into the base weights:

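# Calling merge_and_unload() on a 4-bit base merges into the quantized weights.
# Unsloth's helper dequantizes first and writes 16-bit merged weights plus the tokenizer.
model.save_pretrained_merged("./llama4-sql-merged", tokenizer, save_method="merged_16bit")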
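
Merging sounds more mysterious than it is. In standard LoRA, each adapted layer computes the base output plus a scaled low-rank correction, and merging just folds that correction into the weight matrix once: W' = W + (lora_alpha / r) * B @ A. Here is a tiny pure-Python sketch on a made-up 2x3 layer showing that the merged layer gives the same output as base plus adapter.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def merge_lora(W, A, B, lora_alpha, r):
    """Return W + (lora_alpha / r) * B @ A, the merged weight matrix."""
    scale = lora_alpha / r
    delta = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, delta)]

# Tiny hypothetical layer: 2 outputs, 3 inputs, rank 1.
W = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]]  # frozen base weight, shape 2x3
A = [[1.0, 2.0, 3.0]]                   # adapter down-projection, shape 1x3
B = [[0.5], [-1.0]]                     # adapter up-projection, shape 2x1
x = [[1.0], [1.0], [1.0]]               # one input column vector

scale = 2 / 1  # lora_alpha / r
base_out = matmul(W, x)
adapter_out = matmul(B, matmul(A, x))
unmerged = [[b[0] + scale * a[0]] for b, a in zip(base_out, adapter_out)]
merged = matmul(merge_lora(W, A, B, lora_alpha=2, r=1), x)

print("base + adapter:", unmerged)
print("merged weights:", merged)
```

After merging there is no adapter left to run at inference time, which is why a merged checkpoint serves like any ordinary model.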

Serve it with vLLM for production throughput:

pip install vllm
python -m vllm.entrypoints.openai.api_server \
    --model ./llama4-sql-merged \
    --dtype bfloat16 \
    --max-model-len 2048 \
    --port 8000

vLLM gives you OpenAI-compatible endpoints and PagedAttention-powered throughput that handily beats Hugging Face Transformers. On a single 4090 you can serve 40-60 requests/second on short prompts.
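
Once the server is up, any HTTP client can talk to it. Below is a minimal client sketch using only the standard library. It assumes the server from the command above is listening on localhost:8000 and that the model is registered under the path passed to `--model`. The function names are made up for illustration. The key detail is that the prompt must reuse the exact training template.

```python
import json
import urllib.request

PROMPT_TEMPLATE = "### Schema:\n{schema}\n\n### Question:\n{question}\n\n### SQL:\n"

def build_payload(schema, question, model="./llama4-sql-merged"):
    """Build a /v1/completions request body using the training prompt template."""
    return {
        "model": model,  # assumption: served under the same value given to --model
        "prompt": PROMPT_TEMPLATE.format(schema=schema, question=question),
        "max_tokens": 128,
        "temperature": 0.0,
    }

def generate_sql(schema, question, base_url="http://localhost:8000"):
    """POST the payload to the server and return the generated SQL text."""
    request = urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(build_payload(schema, question)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        body = json.load(response)
    return body["choices"][0]["text"].strip()

# Building the payload needs no server; generate_sql() does.
payload = build_payload("singer(id, name, age)", "How many singers are there?")
print(json.dumps(payload, indent=2))
```

With the server running, `generate_sql("singer(id, name, age)", "How many singers are there?")` returns the model's completion as a string.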


Cost Reality Check

A rough cost comparison for fine-tuning + serving 1M SQL queries per month:

Option                          | Fine-tune cost | Monthly serving | Notes
GPT-5 API                       | $0             | ~$2,500         | No training required
Claude Sonnet 4.6 API           | $0             | ~$3,000         | Best zero-shot quality
QLoRA Llama 4 8B on rented 4090 | ~$10           | ~$350           | 1 GPU, ~40 req/s
QLoRA Llama 4 8B on owned 4090  | electricity    | ~$50            | ~$1,500 hardware upfront

Past ~500k queries/month, self-hosted fine-tunes are cheaper than frontier APIs. For specialized tasks where a fine-tuned 8B matches or beats frontier models, the economics tilt heavily toward self-hosting.
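
You can rerun the comparison with your own numbers. This sketch takes the table's figures at face value, treats API cost as linear in query volume, and treats self-hosting as a flat monthly bill. It leaves out engineering time, monitoring, and redundancy, so the raw break-even it prints lands below the ~500k threshold quoted above, which is best read as a figure with margin for those extras.

```python
def break_even_queries(api_cost_per_million, self_host_monthly):
    """Monthly query volume at which the API bill equals the fixed self-hosting bill."""
    cost_per_query = api_cost_per_million / 1_000_000
    return self_host_monthly / cost_per_query

# Figures taken from the table above (USD per month at 1M queries).
rented = break_even_queries(api_cost_per_million=2500, self_host_monthly=350)
owned = break_even_queries(api_cost_per_million=2500, self_host_monthly=50)

print(f"rented 4090 break-even: {rented:,.0f} queries/month")
print(f"owned 4090 break-even : {owned:,.0f} queries/month (excluding hardware)")
```

Plug in your real API pricing and GPU rental rate. The shape of the result matters more than the exact crossover point.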


Pitfalls That Will Bite You

  1. Catastrophic forgetting. Train too many epochs and your model forgets general instruction following. Mix 10-20% of general instruction data (UltraChat, Alpaca) into your training set as a regularizer.
  2. Data leakage in eval. Make sure test examples aren’t in your train set — Spider has known overlap issues.
  3. Bad stop tokens. If the model keeps generating past your answer, check that your training template includes the EOS token.
  4. Oversized rank. r=64 isn’t automatically better than r=16 — larger ranks need more data to avoid overfitting.
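  5. Ignoring inference quantization. Fine-tuning in 4-bit and then deploying the merged model in fp16 or bf16 works fine, but deploying under a different quantization scheme than the one you trained with can shift outputs. Re-run your eval on the exact artifact you plan to serve.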
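
Pitfall 2 is the cheapest one to guard against. A minimal leakage check, assuming you compare questions after light normalization, looks like this. It only catches near-verbatim duplicates, not paraphrases, and the sample questions are made up.

```python
import re

def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace for fuzzy-exact matching."""
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    return " ".join(text.split())

def find_leaks(train_questions, test_questions):
    """Return test questions whose normalized form also appears in the training set."""
    seen = {normalize(q) for q in train_questions}
    return [q for q in test_questions if normalize(q) in seen]

train = ["How many singers are there?", "List all stadium names."]
test = ["how many singers are there", "What is the average age of singers?"]

leaks = find_leaks(train, test)
print(f"{len(leaks)} of {len(test)} test questions also appear in train: {leaks}")
```

Drop any leaked rows from the test set before reporting accuracy, otherwise the score partly measures memorization.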

What To Do Next

  • Try different base models (Qwen3 7B, DeepSeek-coder, Phi-4) on the same task
  • Compare against DPO/ORPO preference tuning for tasks with ranked outputs
  • Experiment with longer contexts using rope scaling
  • Deploy the merged model behind a vLLM + FastAPI gateway for production

QLoRA on consumer hardware is the most underappreciated leverage in 2026. If you have 100 hours of specialized data, you can almost certainly build something better than any frontier API on that task — and run it for pennies.


Appendix: Hyperparameter Cheatsheet

After running hundreds of QLoRA training runs on different base models, here are defaults that almost always work as a starting point:

Parameter              | Value             | Notes
r (rank)               | 16                | Raise to 32-64 only for large domain shifts
lora_alpha             | 32                | Keep at 2x rank
lora_dropout           | 0.0               | Raise to 0.05 if overfitting
learning_rate          | 2e-4              | Drop to 1e-4 for sensitive tasks
epochs                 | 3                 | Never exceed 5 without a reason
batch size (effective) | 16                | Fine for 1k-100k dataset sizes
warmup_steps           | 50 or 3% of total | Helps training stability
weight_decay           | 0.01              | Standard AdamW
lr_scheduler           | cosine            | Slightly better than linear
max_seq_length         | 2048              | Raise only if your data needs it

If any of these are clearly wrong for your task, the symptoms usually appear in training loss within the first 100 steps. Loss that flatlines after a few steps usually means the learning rate is too low. Loss that spikes means the learning rate is too high or gradients are exploding. Training loss that keeps falling while eval performance stalls or gets worse means you’re overfitting, so reduce epochs or increase dataset diversity.
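
When reading a loss curve, it helps to know what the learning rate was doing at each step. The warmup and cosine entries in the cheatsheet combine into a schedule like the one sketched below: linear warmup to the peak, then cosine decay to zero. The total of 1875 steps is an assumption (10k examples, effective batch 16, 3 epochs), and `lr_at_step` is a made-up helper that mirrors the usual warmup-plus-cosine shape rather than any library's exact code.

```python
import math

def lr_at_step(step, total_steps, peak_lr=2e-4, warmup_steps=50):
    """Linear warmup to peak_lr, then cosine decay to zero at total_steps."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

TOTAL = 1875  # assumption: 10k examples, effective batch 16, 3 epochs
for step in (0, 25, 50, 500, 1000, 1875):
    print(f"step {step:>4}: lr = {lr_at_step(step, TOTAL):.2e}")
```

A spike right around the end of warmup points at the peak learning rate. A flatline during the first few dozen steps may simply be warmup doing its job.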

Treat your first few fine-tunes as experiments, not production artifacts. You’ll learn more from five 30-minute runs with different settings than one 5-hour run with your best guess.
