Tool

LLM Fine-Tuning Cost Estimator

Can I fine-tune this model on my hardware, how long will it take, what will it cost, how much VRAM will I need, and what is the cheapest viable option? Configure your run on the left — VRAM, time, cost, storage and carbon update instantly.

Inputs

Configure your training run

Fill out the details below, then click Estimate.

The base LLM being fine-tuned. Drives base VRAM, disk footprint, and the active-parameter count used in the throughput formula. Larger models grow VRAM roughly linearly and slow throughput sub-linearly.
QLoRA 4-bit / 8-bit quantize base weights and train low-rank adapters (lowest VRAM). LoRA FP16 keeps weights in FP16. Full BF16 / FP16 train every parameter (highest VRAM and slowest throughput).
Number of training examples. Affects total tokens, training time, and storage — but not VRAM.
Average tokenized length of one training sample. Long sequences are expensive: the estimator applies a +25% activation memory penalty above 8,192 tokens.
Number of full passes through the dataset. Scales total tokens and therefore training time and cost linearly.
Examples processed in parallel per step. Directly multiplies activation memory and improves GPU utilization up to the memory ceiling.
Target hardware. VRAM determines feasibility; memory bandwidth drives throughput; type (Local/Cloud) drives the cost path.
Recomputes activations during backward pass. Cuts activation memory by ~70% but reduces throughput by ~30%.
Concatenates short samples to eliminate padding. Raises throughput by ~25% with no VRAM impact.
Memory-optimized attention kernel. Reduces activation memory by ~10% and improves throughput by ~20%.
Used for carbon footprint estimate and, for local GPUs, the electricity rate.

Methodology

How the numbers are derived

  • VRAM= base-model memory × method multiplier (QLoRA 4-bit 0.35 → Full FP16 1.15) + activation memory, with a 10% safety buffer. Gradient checkpointing × 0.30, Flash Attention × 0.90, long-context (> 8K tokens) × 1.25.
  • Throughput uses an A100 80GB baseline of 2,000 tok/s scaled by memory bandwidth, with a (active_params / 8)^0.85 model-size slowdown and method/packing/checkpoint/flash multipliers.
  • Training time adds 5% warmup + 10% evaluation + 3% checkpoint saving overhead on top of raw tokens / throughput.
  • Cost uses cloud €/hr for cloud GPUs and regional €/kWh × TDP for local GPUs. Multi-GPU multiplies hourly cost by GPU count.
  • Carbon = hours × kW × regional grid carbon intensity (Germany 0.38, AWS EU-West 0.32, Azure WE 0.30, RunPod 0.40 kg CO₂/kWh).
  • Storage includes base model + adapter + tokenized dataset (~5 bytes/token) + 2 retained checkpoints.

These are planning estimates for sizing and budgeting — actuals vary with framework, optimizer state, data distribution, and cluster overhead.