Benchmarking

MemScale ships a reproducible benchmark suite so you can measure VRAM savings on standard models — and on your own.

The benchmark CLI

Run the built-in suite with:

python -m memscale.benchmarks

This runs MemScale over a registry of standard models (BENCHMARK_MODELS) and reports baseline vs. optimized peak VRAM for each. The suite was added in v1.1.0 as the basis for the project’s committed, reproducible numbers.

Canonical results

These are the v1.1.0 figures, measured on an NVIDIA RTX 3090 (24 GB). Peak training VRAM is compared with and without MemScale:

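| Model | Params | Batch × Seq | Baseline VRAM | MemScale VRAM | Reduction |
|---|---|---|---|---|---|
| BERT-Base | 110M | 16 × 128 | 3.14 GB | 0.84 GB | 73.1% |
| BERT-Large | 340M | 16 × 128 | 7.60 GB | 2.02 GB | 73.4% |
| GPT-2 Medium | 355M | 4 × 512 | 10.87 GB | 2.61 GB | 76.0% |
| GPT-2 Large | 774M | 2 × 512 | 14.87 GB | 4.68 GB | 68.5% |
| GPT-2 XL | 1.5B | 1 × 512 | OOM | 9.25 GB | enables training |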

Without MemScale, GPT-2 XL runs out of memory (OOM) on the 24 GB card, so there is no baseline figure to compare against; with MemScale it trains in 9.25 GB.

Reduction percentages depend on the workload — model, batch size, and sequence length. Older marketing figures used different baselines, so they differ from the numbers above. Always trust a measurement on your own configuration over any single quoted percentage.
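As a minimal sketch of how a reduction percentage follows from the two peak-VRAM columns, the helper below applies `100 * (1 - optimized / baseline)` to a few rows of the table above. The function name `reduction_percent` and the use of `None` to mark an out-of-memory baseline are illustrative assumptions, not part of MemScale.

```python
from typing import Optional


def reduction_percent(baseline_gb: Optional[float], optimized_gb: float) -> Optional[float]:
    """Peak-VRAM reduction in percent; None when the baseline ran out of memory."""
    if baseline_gb is None:
        # Baseline OOM: there is no finite baseline, so no percentage can be computed.
        return None
    if baseline_gb <= 0:
        raise ValueError("baseline_gb must be positive")
    return 100.0 * (1.0 - optimized_gb / baseline_gb)


# (model, baseline GB, MemScale GB) taken from the canonical results table.
rows = [
    ("BERT-Large", 7.60, 2.02),
    ("GPT-2 Medium", 10.87, 2.61),
    ("GPT-2 Large", 14.87, 4.68),
    ("GPT-2 XL", None, 9.25),  # hypothetical convention: None marks a baseline OOM
]

for name, baseline, optimized in rows:
    pct = reduction_percent(baseline, optimized)
    label = "enables training" if pct is None else f"{pct:.1f}%"
    print(f"{name}: {label}")
```

Values recomputed from the two-decimal GB figures may differ from a published percentage in the last digit.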

Benchmarking your own model

To measure savings for a model that is not in the registry, compare peak VRAM yourself:

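import torch
import memscale

# `model` and run_one_training_step() are your own; the model must already be on the GPU.

# --- baseline ---
run_one_training_step(model)            # warm-up step, not measured
torch.cuda.reset_peak_memory_stats()
run_one_training_step(model)            # measured step
baseline = torch.cuda.max_memory_allocated() / 1e9   # bytes -> GB

# --- with MemScale ---
model = memscale.wrap(model)
run_one_training_step(model)            # warm-up step, not measured
torch.cuda.reset_peak_memory_stats()
run_one_training_step(model)            # measured step
optimized = torch.cuda.max_memory_allocated() / 1e9  # bytes -> GB

print(f"baseline {baseline:.2f} GB -> memscale {optimized:.2f} GB")
print(f"reduction: {100 * (1 - optimized / baseline):.1f}%")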

Tips for fair comparison

  • Measure peak VRAM (max_memory_allocated), not current usage.
  • Call torch.cuda.reset_peak_memory_stats() before each measured run.
  • Use the same batch size and sequence length for both runs.
  • Warm up first — the first step allocates caches that distort the peak.
  • Make sure no other process is holding VRAM (check with nvidia-smi).
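The tips above can be folded into one small helper. This is a sketch under stated assumptions: `measure_peak_gb` and its parameters are hypothetical names, not part of MemScale. By default it uses `torch.cuda.reset_peak_memory_stats` and `torch.cuda.max_memory_allocated`; the two counter functions can be swapped out so the logic can be checked on a machine without a GPU.

```python
from typing import Callable, Optional


def measure_peak_gb(
    step: Callable[[], None],
    warmup_steps: int = 1,
    reset_peak: Optional[Callable[[], None]] = None,
    read_peak_bytes: Optional[Callable[[], int]] = None,
) -> float:
    """Run warm-up steps, reset the peak counter, run one measured step,
    and return that step's peak memory in GB (1e9 bytes)."""
    if reset_peak is None or read_peak_bytes is None:
        import torch  # imported lazily so the helper also works with injected counters

        reset_peak = reset_peak or torch.cuda.reset_peak_memory_stats
        read_peak_bytes = read_peak_bytes or torch.cuda.max_memory_allocated

    for _ in range(warmup_steps):
        step()            # warm-up: first-step caches are allocated here, unmeasured

    reset_peak()          # start the peak counter fresh for the measured run
    step()                # measured step
    return read_peak_bytes() / 1e9
```

With the names from the example earlier in this guide, usage would look like `baseline = measure_peak_gb(lambda: run_one_training_step(model))`, called once before and once after `memscale.wrap(model)` with the same batch size and sequence length.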