v1.2.0 · Intelligence Foundation
Memory Engineering for AI
Train bigger models on the GPU you already have. A drop-in PyTorch memory optimizer — one call rewrites how your model uses VRAM, with no change to your training loop.
up to 76% less peak VRAM · ~59% median across 25 RTX 3090 configs
MemScale Documentation
Memory Engineering for AI — Optimize PyTorch training memory.
What is MemScale?
MemScale is a Python library that reduces the peak VRAM of PyTorch training
runs. You hand it a model (or a Hugging Face Trainer), and it profiles the
model, builds a per-layer optimization plan, and attaches the necessary
hooks, all behind a single wrap() call.
Under the hood MemScale applies well-understood memory techniques — gradient checkpointing, mixed precision, CPU offloading, activation tiling, and 8-bit optimizers — but it chooses which technique to apply to which layer for you. The decision is made by a deterministic, rule-based engine, so the same model and hardware always produce the same plan.
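To make the idea of a deterministic, rule-based plan concrete, here is a minimal sketch in plain Python. It is not MemScale's actual engine: the `LayerProfile` type, the `plan_layer` and `build_plan` functions, the technique labels, and the 10% / 25% thresholds are all hypothetical, chosen only to show how fixed rules map the same inputs to the same per-layer plan.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LayerProfile:
    """Hypothetical per-layer profile; not a MemScale type."""
    name: str
    param_mb: float       # memory held by the layer's parameters
    activation_mb: float  # activations kept alive for the backward pass


def plan_layer(layer: LayerProfile, free_vram_mb: float) -> tuple:
    """Pick techniques for one layer using fixed, illustrative thresholds."""
    techniques = []
    # Assumed rule: recompute activations when they are large relative to VRAM.
    if layer.activation_mb > 0.10 * free_vram_mb:
        techniques.append("gradient_checkpointing")
    # Assumed rule: move very large parameter blocks to host memory.
    if layer.param_mb > 0.25 * free_vram_mb:
        techniques.append("cpu_offload")
    return tuple(techniques)


def build_plan(layers, free_vram_mb: float) -> dict:
    """Apply the rules layer by layer; no randomness, so the result is repeatable."""
    return {layer.name: plan_layer(layer, free_vram_mb) for layer in layers}


layers = [
    LayerProfile("embed", param_mb=200.0, activation_mb=10.0),
    LayerProfile("block_0", param_mb=50.0, activation_mb=150.0),
    LayerProfile("lm_head", param_mb=400.0, activation_mb=120.0),
]

plan_a = build_plan(layers, free_vram_mb=1000.0)
plan_b = build_plan(layers, free_vram_mb=1000.0)

for name, techniques in plan_a.items():
    print(name, techniques)
```

Because the rules are pure functions of the profile and the available memory, `plan_a` and `plan_b` are identical, which is the property the paragraph above describes.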
The current release is v1.2.0 (“Intelligence Foundation”), published 2026-05-21. It ships an opt-in ML-assisted policy framework (off by default) on top of the existing rule engine, plus the empirically backed benchmark numbers (see Benchmarking) and the experimental async CPU offload from v1.1.
Why MemScale?
Frameworks like DeepSpeed and Accelerate are powerful, but they ask you to restructure your training code, manage config files, and understand ZeRO stages or sharding strategies before you see a benefit.
MemScale optimizes for a different goal: the shortest path from “my model
OOMs” to “my model trains.” You do not rewrite your loop. You do not pick
a sharding strategy. You call wrap() and keep training as usual. If you
later want control, the Config object exposes every knob.
This makes MemScale a good fit for single-GPU workstations, lab machines, and campus GPUs — where the constraint is one fixed card, not a cluster.
Quick example
import memscale
import torch
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("gpt2-medium")
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
# full optimization stack: checkpointing + offload + BF16 + 8-bit Adam
model, optimizer = memscale.apply_all_optimizations(model, optimizer)
# train exactly as you would without MemScale

On an RTX 3090, GPT-2 Medium training drops from 10.87 GB to 2.61 GB
peak VRAM, a 76% reduction.[1] The single wrap() call still works for
the lightest setup (checkpointing + offload only); the headline reductions
come from apply_all_optimizations, which adds BF16 and 8-bit Adam on top.
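The 76% figure follows directly from the two peak-memory numbers quoted above. The short sketch below reproduces the arithmetic; the helper name `vram_reduction` is made up for this illustration and is not part of MemScale.

```python
def vram_reduction(baseline_gb: float, optimized_gb: float) -> float:
    """Fractional drop in peak VRAM relative to the unoptimized baseline."""
    if baseline_gb <= 0:
        raise ValueError("baseline_gb must be positive")
    return (baseline_gb - optimized_gb) / baseline_gb


# Peak VRAM figures quoted above for GPT-2 Medium on an RTX 3090.
reduction = vram_reduction(baseline_gb=10.87, optimized_gb=2.61)
print(f"{reduction:.1%}")  # 76.0%
```

In PyTorch, peak figures of this kind are commonly read from `torch.cuda.max_memory_allocated()` after a call to `torch.cuda.reset_peak_memory_stats()`. Whether the numbers on this page were collected that way is not stated here.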
Where to go next
- Quick Start — get a model optimized in 5 minutes.
- API Reference — every public function and its real signature.
- Core Concepts — how the decision engine and techniques work.
Footnotes
1. VRAM reduction is workload-dependent: up to ~76% best case, typically ~59% median across 25 measured configurations on an RTX 3090. Highest (~70%) on small-batch / short-sequence runs, where optimizer and parameter state dominate peak memory; lowest (~51%) on large-batch / long-sequence runs, where activations dominate. These figures require the full optimization stack (apply_all_optimizations); a plain wrap() call yields only a small reduction.
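A back-of-envelope estimate shows why parameter and optimizer state can dominate when batches are small. The sketch below assumes FP32 weights and gradients, an Adam-style optimizer with two state tensors per parameter, and roughly 355 million parameters for GPT-2 Medium. These are simplifying assumptions for illustration, not measurements from MemScale; the estimate ignores activations, allocator overhead, and the small metadata that 8-bit optimizers keep.

```python
GB = 1e9  # decimal gigabytes, to keep the arithmetic easy to follow


def static_state_gb(
    n_params: int,
    weight_bytes: int = 4,           # FP32 weights
    grad_bytes: int = 4,             # FP32 gradients
    optim_bytes_per_state: int = 4,  # FP32 optimizer state
    n_optim_states: int = 2,         # Adam keeps two moments per parameter
) -> float:
    """Memory that does not depend on batch size or sequence length."""
    per_param = weight_bytes + grad_bytes + optim_bytes_per_state * n_optim_states
    return n_params * per_param / GB


N_PARAMS = 355_000_000  # approximate size of GPT-2 Medium (assumption)

fp32_state = static_state_gb(N_PARAMS)
eight_bit_state = static_state_gb(N_PARAMS, optim_bytes_per_state=1)

print(f"FP32 Adam state:  {fp32_state:.2f} GB")       # 5.68 GB
print(f"8-bit Adam state: {eight_bit_state:.2f} GB")  # 3.55 GB
```

Under these assumptions the batch-independent state is a fixed cost of several gigabytes, so techniques that shrink it matter most when activations are small, and matter less once large batches or long sequences make activations the main consumer.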