LLM Fine-Tuning Cost Estimator

Can I fine-tune this model on my hardware, how long will it take, what will it cost, how much VRAM will I need, and what is the cheapest viable option? Configure your run on the left — VRAM, time, cost, storage and carbon update instantly.

Inputs

Configure your training run

Fill out the details below, then click Estimate.

Model

The base LLM being fine-tuned. Drives base VRAM, disk footprint, and the active-parameter count used in the throughput formula. VRAM grows roughly linearly with model size, while throughput falls sub-linearly.

Fine-tuning method

QLoRA 4-bit / 8-bit quantizes the base weights and trains low-rank adapters (lowest VRAM). LoRA FP16 keeps the base weights in FP16 and trains only the adapters. Full BF16 / FP16 trains every parameter (highest VRAM and slowest throughput).

Dataset size

Number of training examples. Affects total tokens, training time, and storage — but not VRAM.

Avg tokens / example

Average tokenized length of one training sample. Long sequences are expensive: the estimator applies a +25% activation memory penalty above 8,192 tokens.

Epochs (3)

Number of full passes through the dataset. Scales total tokens and therefore training time and cost linearly.
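As a rough illustration of how the three inputs above combine, the sketch below computes total training tokens. The function name is hypothetical, and the simple product is an assumption based on the field descriptions; the page does not state the exact formula.

```python
def total_training_tokens(num_examples: int, avg_tokens: int, epochs: int) -> int:
    """Tokens processed over the whole run.

    Assumption: total tokens = examples x average tokens per example x epochs.
    """
    return num_examples * avg_tokens * epochs


# Example: 10,000 examples of 512 tokens each, trained for 3 epochs.
print(total_training_tokens(10_000, 512, 3))
```

Doubling any one of the three inputs doubles the token count, which is why training time and cost scale linearly with epochs.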

Batch size

Examples processed in parallel per step. Activation memory scales in direct proportion to batch size; larger batches improve GPU utilization until VRAM runs out.

GPU / accelerator

Target hardware. VRAM determines feasibility; memory bandwidth drives throughput; type (Local/Cloud) drives the cost path.

Grad checkpoint

Recomputes activations during the backward pass instead of storing them. Cuts activation memory by ~70% but reduces throughput by ~30%.

Seq packing

Concatenates short samples to eliminate padding. Raises throughput by ~25% with no VRAM impact.

Flash attention

Memory-optimized attention kernel. Reduces activation memory by ~10% and improves throughput by ~20%.
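The three toggles above, together with the long-context penalty, can be sketched as multipliers on activation memory and throughput. The function names are hypothetical, and treating the effects as multiplicative is an assumption: the page lists the individual factors but not how they stack.

```python
def activation_multiplier(grad_checkpoint: bool, flash_attention: bool,
                          avg_tokens: int) -> float:
    """Combined scaling factor for activation memory (assumed multiplicative)."""
    m = 1.0
    if grad_checkpoint:
        m *= 0.30   # ~70% less activation memory
    if flash_attention:
        m *= 0.90   # ~10% less activation memory
    if avg_tokens > 8192:
        m *= 1.25   # +25% long-context penalty
    return m


def throughput_multiplier(grad_checkpoint: bool, seq_packing: bool,
                          flash_attention: bool) -> float:
    """Combined scaling factor for tokens per second (assumed multiplicative)."""
    m = 1.0
    if grad_checkpoint:
        m *= 0.70   # ~30% slower due to recomputation
    if seq_packing:
        m *= 1.25   # ~25% faster, no VRAM impact
    if flash_attention:
        m *= 1.20   # ~20% faster
    return m


print(activation_multiplier(True, True, 4096))
print(throughput_multiplier(True, True, True))
```

Under these assumptions, enabling all three toggles roughly cancels the checkpointing slowdown while keeping most of its memory savings.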

Training region

Sets the grid carbon intensity used for the carbon footprint estimate and, for local GPUs, the electricity rate.

Methodology

How the numbers are derived

VRAM = base-model memory × method multiplier (from 0.35 for QLoRA 4-bit up to 1.15 for Full FP16) + activation memory, plus a 10% safety buffer. Activation memory is scaled by × 0.30 with gradient checkpointing, × 0.90 with Flash Attention, and × 1.25 for long context (> 8K tokens).
Throughput uses an A100 80GB baseline of 2,000 tok/s scaled by memory bandwidth, with a (active_params / 8)^0.85 model-size slowdown and method/packing/checkpoint/flash multipliers.
Training time adds 5% warmup + 10% evaluation + 3% checkpoint saving overhead on top of raw tokens / throughput.
Cost uses cloud €/hr for cloud GPUs and regional €/kWh × TDP for local GPUs. Multi-GPU multiplies hourly cost by GPU count.
Carbon = hours × kW × regional grid carbon intensity (Germany 0.38, AWS EU-West 0.32, Azure WE 0.30, RunPod 0.40 kg CO₂/kWh).
Storage includes base model + adapter + tokenized dataset (~5 bytes/token) + 2 retained checkpoints.
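
The formulas above can be sketched in Python as follows. This is a minimal illustration, not the estimator's actual code: all function and parameter names are hypothetical, and several details are assumptions, namely that the 10% buffer applies to the summed VRAM, that active_params is measured in billions, that the three time overheads are additive, that storage uses 1 GB = 10^9 bytes, and that bandwidth scaling is a simple ratio against the A100 80GB.

```python
BASELINE_TOK_PER_S = 2_000        # A100 80GB baseline from the methodology
OVERHEAD = 0.05 + 0.10 + 0.03     # warmup + evaluation + checkpoint saving


def estimate_vram_gb(base_model_gb, method_multiplier, activation_gb,
                     activation_multiplier=1.0):
    """VRAM with a 10% safety buffer (assumed to apply to the whole sum)."""
    raw = base_model_gb * method_multiplier + activation_gb * activation_multiplier
    return raw * 1.10


def estimate_throughput(active_params_b, bandwidth_ratio, other_multipliers=1.0):
    """Tokens/s. bandwidth_ratio = GPU memory bandwidth / A100 80GB bandwidth."""
    slowdown = (active_params_b / 8) ** 0.85   # model-size slowdown
    return BASELINE_TOK_PER_S * bandwidth_ratio * other_multipliers / slowdown


def estimate_hours(total_tokens, tok_per_s):
    """Wall-clock hours including the 18% combined overhead (assumed additive)."""
    return total_tokens / tok_per_s / 3600 * (1 + OVERHEAD)


def estimate_cost_eur(hours, *, cloud_eur_per_hr=None, tdp_kw=None,
                      eur_per_kwh=None, gpu_count=1):
    """Cloud path uses EUR/hr; local path uses TDP (kW) x EUR/kWh."""
    if cloud_eur_per_hr is not None:
        hourly = cloud_eur_per_hr
    elif tdp_kw is not None and eur_per_kwh is not None:
        hourly = tdp_kw * eur_per_kwh
    else:
        raise ValueError("provide cloud_eur_per_hr, or tdp_kw and eur_per_kwh")
    return hours * hourly * gpu_count


def estimate_carbon_kg(hours, power_kw, kg_co2_per_kwh):
    """Carbon = hours x kW x regional grid carbon intensity."""
    return hours * power_kw * kg_co2_per_kwh


def estimate_storage_gb(base_model_gb, adapter_gb, dataset_tokens, checkpoint_gb):
    """Base model + adapter + tokenized dataset (~5 bytes/token) + 2 checkpoints."""
    return base_model_gb + adapter_gb + dataset_tokens * 5 / 1e9 + 2 * checkpoint_gb


# Example: an 8B-parameter model on a GPU with A100-class bandwidth.
tok_per_s = estimate_throughput(active_params_b=8.0, bandwidth_ratio=1.0)
hours = estimate_hours(total_tokens=15_360_000, tok_per_s=tok_per_s)
print(round(hours, 2), "hours")
print(round(estimate_cost_eur(hours, cloud_eur_per_hr=2.0), 2), "EUR")
print(round(estimate_carbon_kg(hours, 0.4, 0.38), 2), "kg CO2")
```

The example values (2.0 €/hr, 0.4 kW) are placeholders rather than figures from the tool; only the 0.38 kg CO₂/kWh intensity for Germany comes from the list above.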

These are planning estimates for sizing and budgeting — actuals vary with framework, optimizer state, data distribution, and cluster overhead.