v1.2.0 · Intelligence Foundation

Memory Engineering for AI

Train bigger models on the GPU you already have. MemScale is a drop-in PyTorch memory optimizer: one call rewrites how your model uses VRAM, with no change to your training loop.

up to 76% less peak VRAM · ~59% median across 25 RTX 3090 configs

MemScale Documentation

Memory Engineering for AI — Optimize PyTorch training memory.

What is MemScale?

MemScale is a Python library that reduces the peak VRAM of PyTorch training runs. You hand it a model (or a Hugging Face Trainer); it profiles the model, decides on a per-layer optimization plan, and attaches the necessary hooks, all behind a single wrap() call.

Under the hood, MemScale applies well-understood memory techniques (gradient checkpointing, mixed precision, CPU offloading, activation tiling, and 8-bit optimizers), but it chooses which technique to apply to which layer for you. The decision is made by a deterministic, rule-based engine, so the same model and hardware always produce the same plan.
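To make the idea of a deterministic, rule-based plan concrete, here is a minimal sketch in plain Python. It is not MemScale's engine: the LayerProfile type, the plan_layer and build_plan helpers, the threshold values, and the technique labels are all hypothetical, chosen only to show how fixed rules map a layer profile to the same plan every time.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class LayerProfile:
    """Hypothetical per-layer profile; not a MemScale type."""
    name: str
    param_mb: float       # memory held by the layer's parameters
    activation_mb: float  # activations kept for the backward pass


def plan_layer(layer: LayerProfile, free_vram_mb: float) -> tuple[str, ...]:
    """Pick techniques for one layer using fixed, illustrative thresholds."""
    techniques = []
    # Large activations: recompute them in backward instead of storing them.
    if layer.activation_mb > 0.10 * free_vram_mb:
        techniques.append("gradient_checkpointing")
    # Large parameter blocks: keep them on the CPU between uses.
    if layer.param_mb > 0.25 * free_vram_mb:
        techniques.append("cpu_offload")
    return tuple(techniques)


def build_plan(layers: list[LayerProfile], free_vram_mb: float) -> dict[str, tuple[str, ...]]:
    """No randomness and no learned state: same inputs give the same plan."""
    return {layer.name: plan_layer(layer, free_vram_mb) for layer in layers}


layers = [
    LayerProfile("embed", param_mb=400.0, activation_mb=20.0),
    LayerProfile("block.0", param_mb=50.0, activation_mb=300.0),
    LayerProfile("lm_head", param_mb=10.0, activation_mb=5.0),
]
plan = build_plan(layers, free_vram_mb=1000.0)
print(plan)
```

Because the rules are pure functions of the profile and the available memory, rerunning build_plan on the same inputs reproduces the plan exactly, which is the property the paragraph above describes.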

The current release is v1.2.0 (“Intelligence Foundation”), published 2026-05-21. It ships an opt-in ML-assisted policy framework on top of the existing rule engine (off by default), plus the empirically backed benchmark numbers (see Benchmarking) and experimental async CPU offload from v1.1.

Why MemScale?

Frameworks like DeepSpeed and Accelerate are powerful, but they ask you to restructure your training code, manage config files, and understand ZeRO stages or sharding strategies before you see a benefit.

MemScale optimizes for a different goal: the shortest path from “my model OOMs” to “my model trains.” You do not rewrite your loop. You do not pick a sharding strategy. You call wrap() and keep training as usual. If you later want control, the Config object exposes every knob.

This makes MemScale a good fit for single-GPU workstations, lab machines, and campus GPUs — where the constraint is one fixed card, not a cluster.

Quick example

import memscale
import torch
from transformers import AutoModelForCausalLM
 
model = AutoModelForCausalLM.from_pretrained("gpt2-medium")
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
 
# full optimization stack: checkpointing + offload + BF16 + 8-bit Adam
model, optimizer = memscale.apply_all_optimizations(model, optimizer)
 
# train exactly as you would without MemScale

On an RTX 3090, GPT-2 Medium training drops from 10.87 GB to 2.61 GB peak VRAM, a 76% reduction.¹ A plain wrap() call still works as the lightest setup (checkpointing + offload only); the headline reductions come from apply_all_optimizations, which adds BF16 and 8-bit Adam on top.
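As a quick check on the headline figure, the arithmetic can be reproduced in a few lines. The percent_reduction helper is a hypothetical name introduced here for illustration, not part of MemScale; the two inputs are the peak-VRAM numbers quoted above.

```python
def percent_reduction(before_gb: float, after_gb: float) -> float:
    """Percentage drop in peak memory going from before_gb to after_gb."""
    if before_gb <= 0:
        raise ValueError("before_gb must be positive")
    return 100.0 * (before_gb - after_gb) / before_gb


# Peak VRAM figures quoted for GPT-2 Medium on an RTX 3090.
headline = percent_reduction(10.87, 2.61)
print(f"{headline:.1f}% less peak VRAM")  # rounds to the quoted 76%
```

The same helper can be applied to your own before/after measurements to compare a run against the published range.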

Where to go next

Quick Start — get a model optimized in 5 minutes.
API Reference — every public function and its real signature.
Core Concepts — how the decision engine and techniques work.

Footnotes

¹ VRAM reduction is workload-dependent: up to ~76% in the best case, with a median of ~59% across 25 measured configurations on an RTX 3090. Reductions are larger (~70%) on small-batch / short-sequence runs, where optimizer and parameter state dominate peak memory, and smaller (~51%) on large-batch / long-sequence runs, where activations dominate. These figures require the full optimization stack (apply_all_optimizations); a plain wrap() call yields only a small reduction.
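The distinction in the footnote between runs dominated by parameter and optimizer state and runs dominated by activations can be illustrated with a rough back-of-the-envelope model. This is a sketch under stated assumptions, not a MemScale measurement: it assumes FP32 training with AdamW (4 bytes of weights, 4 of gradients, and 8 of optimizer moments per parameter), a parameter count of roughly 355 million for GPT-2 Medium, and a purely hypothetical activation cost per token.

```python
BYTES_PER_PARAM_FP32_ADAMW = 4 + 4 + 8  # weights + gradients + two Adam moments


def static_share(n_params: int, batch: int, seq_len: int,
                 act_bytes_per_token: int) -> float:
    """Fraction of estimated memory taken by parameter and optimizer state."""
    static = n_params * BYTES_PER_PARAM_FP32_ADAMW
    # Activation memory grows with the number of tokens in a step.
    activations = batch * seq_len * act_bytes_per_token
    return static / (static + activations)


N_PARAMS = 355_000_000        # approximate size of GPT-2 Medium (assumption)
ACT_PER_TOKEN = 1_000_000     # hypothetical: 1 MB of activations per token

small = static_share(N_PARAMS, batch=1, seq_len=128, act_bytes_per_token=ACT_PER_TOKEN)
large = static_share(N_PARAMS, batch=16, seq_len=1024, act_bytes_per_token=ACT_PER_TOKEN)
print(f"static share, small run: {small:.2f}")
print(f"static share, large run: {large:.2f}")
```

Under these assumptions the static state accounts for most of the estimate in the small run and a minority in the large one, which matches the two regimes the footnote describes.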
