Memory Budget

A practical guide to estimating how much VRAM a training run needs, and which MemScale mode brings it within your GPU’s budget.

What consumes VRAM in training

Peak training memory is the sum of four parts:

1. Model parameters: the weights themselves.
2. Gradients: one value per parameter.
3. Optimizer state: Adam/AdamW keeps two moments per parameter.
4. Activations: intermediate tensors held for the backward pass. This term scales with batch size and sequence length and is usually the largest and most variable part.

MemScale’s techniques target parts 3 and 4: checkpointing and tiling cut activations, offloading moves optimizer state and activations to CPU, and the 8-bit optimizer shrinks optimizer state.

Rough rule of thumb (FP32, Adam)

Without any optimization, parameters, gradients, and optimizer state alone take roughly 16 bytes per parameter in FP32 (4 for the weight, 4 for its gradient, and 8 for Adam's two moments):

Model size	Params + grads + Adam state	Plus activations	Practical GPU (unoptimized)
110M (BERT-Base)	~1.8 GB	+ a few GB	8 GB is tight
355M (GPT-2 Medium)	~5.7 GB	+ several GB	12–16 GB
774M (GPT-2 Large)	~12 GB	+ several GB	16–24 GB
1.5B (GPT-2 XL)	~24 GB	+ several GB	OOM on 24 GB

Activations push the real peak well above the "Params + grads + Adam state" column, which is exactly where MemScale helps.
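The 16-bytes-per-parameter rule can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: `static_training_bytes` is a hypothetical helper, not part of MemScale, and it assumes FP32 values, Adam's two moments, and decimal gigabytes (which is what the table's figures correspond to). Activations are deliberately left out.

```python
GB = 1e9  # decimal gigabytes (assumption: matches the table's rounding)


def static_training_bytes(num_params, bytes_per_value=4, optimizer_moments=2):
    """Bytes for parameters + gradients + optimizer state (no activations).

    Assumes every value is stored at `bytes_per_value` (4 for FP32) and the
    optimizer keeps `optimizer_moments` extra values per parameter (2 for Adam).
    """
    weights = num_params * bytes_per_value
    gradients = num_params * bytes_per_value
    optimizer_state = num_params * bytes_per_value * optimizer_moments
    return weights + gradients + optimizer_state


models = [
    ("BERT-Base", 110_000_000),
    ("GPT-2 Medium", 355_000_000),
    ("GPT-2 Large", 774_000_000),
    ("GPT-2 XL", 1_500_000_000),
]

for name, num_params in models:
    static_gb = static_training_bytes(num_params) / GB
    print(f"{name}: ~{static_gb:.1f} GB before activations")
```

Treat the result as a floor: the real peak adds activations, which depend on batch size and sequence length.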

Measured reductions (RTX 3090, 24 GB)

From the benchmark suite, with the default config:

Model	Batch × Seq	Baseline	MemScale	Reduction
BERT-Base	16 × 128	3.14 GB	0.84 GB	73.1%
BERT-Large	16 × 128	7.60 GB	2.02 GB	73.4%
GPT-2 Medium	4 × 512	10.87 GB	2.61 GB	76.0%
GPT-2 Large	2 × 512	14.87 GB	4.68 GB	68.5%
GPT-2 XL	1 × 512	OOM	9.25 GB	fits
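The Reduction column appears to be the fractional drop from the baseline peak to the MemScale peak. The sketch below recomputes it under that assumption; `reduction_percent` is a hypothetical helper, not a MemScale function. Because the GB figures in the table are rounded, a recomputed value can differ from the printed one in the last digit.

```python
def reduction_percent(baseline_gb, optimized_gb):
    """Percentage of peak memory saved relative to the baseline run."""
    if baseline_gb <= 0:
        raise ValueError("baseline_gb must be positive")
    return (1.0 - optimized_gb / baseline_gb) * 100.0


# GPT-2 Medium row from the table above: 10.87 GB baseline, 2.61 GB with MemScale.
print(f"{reduction_percent(10.87, 2.61):.1f}%")
```

The GPT-2 XL row has no percentage because the baseline run hit OOM, so there is no baseline peak to divide by.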

Choosing a mode by budget

Estimate your unoptimized peak, then compare to your GPU:

Situation	Recommended mode
Unoptimized peak is below your VRAM	none needed — or `CONSERVATIVE` for headroom
Peak is 1–1.5× your VRAM	`BALANCED`
Peak is 1.5–3× your VRAM	`AGGRESSIVE`
Peak is more than ~3× your VRAM	`AGGRESSIVE` + reduce batch size / sequence length

See Optimization Modes for what each mode enables.
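The budget table amounts to a lookup on the ratio of unoptimized peak to available VRAM. A minimal sketch of that lookup follows. `recommend_mode` is a hypothetical helper that returns plain strings, not MemScale's API, and the handling of the exact boundaries (1.0, 1.5, 3.0) is an assumption, since the table's ranges touch at those points.

```python
def recommend_mode(unoptimized_peak_gb, vram_gb):
    """Map the peak-to-VRAM ratio onto the budget table's recommendation."""
    if vram_gb <= 0:
        raise ValueError("vram_gb must be positive")

    ratio = unoptimized_peak_gb / vram_gb

    if ratio < 1.0:
        # The run already fits; a light mode only buys headroom.
        return "none (or CONSERVATIVE for headroom)"
    if ratio <= 1.5:
        return "BALANCED"
    if ratio <= 3.0:
        return "AGGRESSIVE"
    # Beyond roughly 3x, memory techniques alone are unlikely to close the gap.
    return "AGGRESSIVE + reduce batch size / sequence length"


# Example: a 14.87 GB unoptimized peak on a hypothetical 8 GB card.
print(recommend_mode(14.87, 8.0))
```

The thresholds are the guide's rough bands, not hard limits, so a result near a boundary is worth checking with a real run.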

When you need `AGGRESSIVE` vs. `BALANCED`

`BALANCED` (checkpointing + offloading) is enough for most medium models that overshoot VRAM by a moderate margin.
`AGGRESSIVE` (full stack, including tiling and the 8-bit optimizer) is for models that otherwise OOM, such as GPT-2 XL on a 24 GB card.
If `AGGRESSIVE` still OOMs, the budget gap is too large for memory techniques alone: lower the batch size or sequence length, as described in Troubleshooting.

Practical advice for fixed GPUs

On a single fixed card — a workstation or campus GPU — the workflow is:

1. Estimate the unoptimized peak from the rule-of-thumb table, adding an allowance for activations.
2. Pick a mode from the budget table.
3. Run, and read the plan that MemScale logs.
4. If it OOMs, drop the batch size first; it is the cheapest lever.
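The last step of the workflow can be automated as a simple back-off loop. The sketch below is framework-agnostic and makes several assumptions: `first_fitting_batch_size` and `run_step` are hypothetical names, halving is just one reasonable way to "drop batch size", and the OOM exception type is passed in by the caller. With PyTorch on CUDA, that type would typically be `torch.cuda.OutOfMemoryError` in recent releases.

```python
def first_fitting_batch_size(run_step, start_batch_size, oom_error=MemoryError):
    """Halve the batch size until `run_step(batch_size)` stops raising OOM.

    `run_step` is any callable that runs one training step at the given batch
    size and raises `oom_error` when memory runs out. Returns the first batch
    size that succeeds, or None if even a batch size of 1 fails.
    """
    batch_size = start_batch_size
    while batch_size >= 1:
        try:
            run_step(batch_size)
            return batch_size
        except oom_error:
            batch_size //= 2  # back off and retry with half the batch
    return None


# Stand-in for a real training step: pretend anything above 3 samples OOMs.
def fake_step(batch_size):
    if batch_size > 3:
        raise MemoryError(f"simulated OOM at batch size {batch_size}")


print(first_fitting_batch_size(fake_step, start_batch_size=16))
```

If the loop returns None, the remaining lever from this guide is a shorter sequence length.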
