MEP 21. Compiler2 and VM2 Co-Design for Maximum Interpreter Throughput

Field	Value
MEP	21
Title	Compiler2 and VM2 Co-Design for Maximum Interpreter Throughput
Author	Mochi core
Status	Draft
Type	Standards Track
Created	2026-05-16
Replaces	MEP-21 v1 (Type-Directed Bytecode), merged in #21491

Abstract

The first cut of MEP-21 (#21491) kept vm2's tagged 24-byte Value, kept the quickening byte, kept the polymorphic IC, and added a lowering pass that emitted typed opcodes when types were known. That was the right MEP for "respect the existing surface." This MEP is the first-principles version: no respect for the existing surface, no backward-compatibility constraints, the ultimate goal is the throughput ceiling of a Go-hosted register interpreter on statically typed source.

Concretely:

A new front end, compiler2/, parallel to the existing vm.Compile. Pipeline: typed AST, typed SSA IR, optimization passes, linear-scan register allocation, vm2 bytecode. Output is monomorphic by construction.
A redesigned vm2 register cell: 8 bytes, NaN-boxed, packing int64 / float64 / pointer into a single word. The current 24-byte {Tag, _pad, Num, Ref any} layout is abandoned. Register copies become a single 64-bit move; the per-instruction memory traffic drops 3x.
No generic opcodes on the typed surface. No OpAdd. No Quick byte for typed ops. Quickening and the polymorphic IC are retained only for the genuinely dynamic boundary (FFI, JSON-derived any, runtime-constructed closures).
First-class IR-level optimizations the existing compiler does not do: tail-call recognition, bounds-check elision, closure-capture flattening, register coalescing, common subexpression elimination on typed values, dead store elimination.
A direct-call opcode whose target is patched at link time, plus a fused capture-into-frame closure call. No IC scan on resolved sites.

This is a multi-PR initiative, not a single landing. The MEP defines the target architecture; the rollout is staged.

Motivation

What vm2 is leaving on the floor today

The fib(25) numbers in MEP-20 (runtime/vm2 at 46.7 ms / 10 allocs vs runtime/vm at 203.9 ms / 515k allocs) are a 4.4x speedup with zero compiler help; vm2 still runs hand-written bytecode in the bench. The real production path will compile Mochi source, and the existing compiler is built around vm's tagged 100-byte Value. Specifically:

Value width. Every fr.Regs[ins.A] = fr.Regs[ins.B] copies 24 bytes (Tag + pad + Num + Ref interface header). For a typed int register the meaningful payload is 8. The other 16 are tag, padding, and an interface descriptor that is always nil for ints. Three quarters of the per-instruction memory traffic is dead weight.
Tag dispatch. Even OpAdd_II reads the tag of both operands. After MEP-21 v1 the compiler would emit OpAdd_II directly, but the dispatcher still pays a tag check inside the handler because the layout still carries the tag. With type-erased typed registers the check is unreachable code.
Register allocation. The current compiler assigns a fresh register for every SSA value. Fibonacci uses 10 registers; with allocation it needs 4. Smaller frames mean smaller slabs, fewer cache lines per call, and a higher hit rate in the bucketed pool.
No tail-call op. fib(n-1) + fib(n-2) is not tail-recursive, but fact_acc(n-1, acc*n) is. The current pipeline emits an OpCall even for tail positions; vm2 grows the frame stack to depth N. A real tail-call op would run constant-stack.
No bounds-check elision. xs[i] after for i in 0..len(xs) re-checks the bound on every iteration. The compiler has the type and the loop bound in front of it; the check is provably redundant.
OpCallV on resolved targets. Every closure call goes through the polymorphic IC even when the target is a known top-level function captured into the closure. The IC slot scan is pure overhead in that case.
GC pointer pressure. The Ref any field forces the GC to scan every register on every cycle. Typed int/float registers should be invisible to the GC; only the ref bank (or the boxed-pointer slots in a NaN-boxed layout) should be scanned.

Each item is independently fixable. Taken together they are a different VM.

Why a new compiler

The current runtime/vm compiler is structured as a direct AST walker that emits instructions as it visits expressions. It has no IR between AST and bytecode. Without an IR there is nowhere to run an optimization pass; constant folding is bolted on as a pre-pass over the AST, and that is the only optimization in the pipeline. Adding the items above (register allocation, tail-call detection, BCE) without an IR means smearing them across the emitter, which has already proved hard to maintain (see the two-branch post-process pattern in types/check.go that this codebase learned the hard way to keep in two places).

A new package compiler2/ decouples the new pipeline from the old. vm.Compile keeps shipping as the production target until compiler2 reaches feature parity. Then it becomes the production target and vm.Compile enters maintenance mode. This is the same risk profile that vm vs vm2 has: two paths, the new one earns its way to default.

Prior art for the target architecture

The combination of (typed SSA IR, register allocation, monomorphic typed bytecode, tight value layout, Go-style switch dispatch) is the architecture of:

LuaJIT interpreter mode ([LuaJIT-Interp]). NaN-boxed 8-byte cells; computed-goto dispatch in C; the trace JIT consumes already-typed bytecode. Mochi cannot do computed goto in Go, but everything else transfers.
JSC LLInt ([JSC-LLInt]). Hand-written assembler for portability across dispatch styles; tight tagged cells; direct-call ops; ICs only on dynamic sites.
V8 Ignition ([V8-Ignition]). Register VM with on-stack frame, variable-length opcodes for size, separate opcode set for cold paths. The variable-length idea ports cleanly to Mochi.
Wren / wren-cli ([Wren]). Small register-based VM, tagged 8-byte values, written so the entire dispatch loop is a single Go-style switch the C compiler inlines. Mochi's vm2 in Go is structurally identical except for layout width.
Dart Kernel + DBC (Dart Bytecode) ([Dart-DBC]). Typed IR, register allocation, direct calls. Discontinued only because Dart shipped AOT instead, not because the design failed.
Crystal ([Crystal-IR]). Source is typed; the compiler runs monomorphization + constant folding + dead code elim before LLVM. The first half of that pipeline is what compiler2 needs.
MLton / Whole-program SML ([MLton]). Aggressive monomorphization; demonstrates that whole-program type info collapses interface dispatch to direct calls. Mochi is whole-program by default.

The contrary design (small AST-walking compiler + observation-driven VM) is what CPython did before PEP-659 and what Ruby did before YJIT/ZJIT. Both projects have spent person-years climbing back. Mochi does not need to walk that path twice.

Specification

Principle

The compiler MUST produce bytecode in which every operand's type is exact and every call's target is either resolved at compile time or is genuinely dynamic. The VM MUST contain no machinery for guessing what the compiler already knew.

Corollaries the spec enforces:

No generic typed-arithmetic opcode (OpAdd, OpSub, ...) on the typed surface. They exist only in the residual polymorphic-dispatch path.
No Quick byte field on typed instructions. The byte still exists for residual ops to allow MEP-19 quickening of dynamic-source operands.
No tag field on typed register reads. Cells are either NaN-boxed (one bank) or untagged (per-type banks).
No IC on OpCall. The IC exists only on OpCallDynamic, which is emitted only when the front end cannot resolve.

Compiler2 pipeline

parser AST
   │
   ▼
types/check  (existing; unchanged contract)
   │  typed AST
   ▼
compiler2/ir/build       (AST → typed SSA IR)
   │
   ▼
compiler2/opt/{const,dce,cse,bce,tail,inline_trivial,closure_flatten}
   │  optimized IR
   ▼
compiler2/regalloc       (linear scan, type-aware)
   │  IR + register map
   ▼
compiler2/emit           (IR → vm2 bytecode)
   │  *vm2.Program
   ▼
runtime/vm2.Run

Each pass is a pure function on IR with a small interface. The emitter is the only piece that knows the vm2 wire format.

IR shape

Typed SSA, single-assignment per virtual register, instructions in basic blocks, CFG explicit. Each value carries its monotype. Generics and method dispatch are monomorphized before this IR exists.

type Value struct {
    ID   ValueID
    Op   IROp     // typed: AddI64, AddF64, LoadGlobal, ...
    Type Type     // monotype after check
    Args []ValueID
    Loc  source.Loc
}

type Block struct {
    ID    BlockID
    Insts []ValueID
    Term  Terminator // Jump, CondJump, Return, TailCall
}

type Func struct {
    Name   string
    Params []ValueID
    Blocks []Block
}

This is intentionally close to Go's own cmd/compile/internal/ssa.Func shape, not because we are building the Go compiler but because the API surface has been beaten on for a decade and the pattern composes well with optimization passes.

Optimization passes (initial set)

Pass	What it does	Win on bench corpus
`const`	Typed constant folding + algebraic identities	replaces existing constfold
`dce`	Dead-store + dead-instruction elimination	shrinks codegen
`cse`	Common subexpression elimination on pure ops	dedups index/load
`bce`	Bounds-check elision after a proven `i < len(xs)`	tightens iter loops
`tail`	Tail-call detection: convert `return f(...)` to `OpTailCall`	constant-stack recursion
`inline_trivial`	Inline single-block leaf functions (getters, identity)	cuts call overhead
`closure_flatten`	Convert captured vars to explicit array; emit `OpCallClosureStatic` when target is fixed	drops IC on resolved closures

The set is small on purpose. Each pass is a few hundred lines of Go and ships with property-based tests against the unoptimized output. (The conformance content from MEP-21 v1 returns here, scoped to "optimization passes must not change observable behavior.")

Linear-scan register allocation

Run after optimization. Compute live ranges over SSA values. Walk in instruction order; assign each live range a vm2 register, recycle freed registers. Spill to memory only if the live-set exceeds the frame's allotment (we expect this never to fire in the bench corpus; Mochi functions are small).

Output: a map[ValueID]int32 consumed by the emitter. The emitted bytecode references the assigned vm2 register, not the SSA value ID.

Effect on vm2: Function.NumRegs becomes the allocated count, not the SSA-cardinality. The bucketed regs-pool (MEP-20) hits a smaller bucket more often; the cache line footprint of a frame drops.

VM2 cell redesign

Replace vm2.Value with an 8-byte NaN-boxed cell:

type Cell uint64  // NaN-boxed

Encoding (industry-standard NaN-boxing as in [LuaJIT-NaNBox], [JSC-NaNBox], [SpiderMonkey-NaNBox]):

A finite or infinite IEEE 754 double maps to its native bit pattern.
A signalling NaN with the top 13 bits set to 0xFFFA..0xFFFE is a tag prefix; the low 48 bits are the payload.
Payload kinds: kInt32 (0xFFFA...), kPtr (0xFFFE...), kBool / kNull constants (0xFFFB...).

Tradeoffs:

Width: 8 bytes vs current 24. Register banks become 3x denser; a 16-register frame is one cache line, not three.
Int range: 48 bits in the int-tag encoding, or int32 via a tighter tag. Mochi's typed int is 64-bit; we use the untagged numeric cell for typed i64 (every bit is the int) and the tagged encoding only for the residual any path.
GC visibility: the cell is uint64, not interface{}. The GC does not scan it. Pointers are reachable via a parallel ref table indexed by the cell's pointer payload.

The simpler alternative (per-type banks: Ints []int64, Floats []float64, Refs []unsafe.Pointer) is rejected because it triples register-naming bookkeeping in the emitter and complicates closures. NaN-boxing is the proven choice for register-uniform VMs.

For typed numeric ops where the compiler has proved the operand is i64, the emitter MAY emit untagged-cell variants (OpAddI64Raw) that interpret the cell as a raw int64. These run with zero decode overhead. The verifier (see below) checks that raw ops only appear after a typed proof.

Bytecode encoding

Variable-length 8-bit-aligned, after V8 Ignition:

Short form: 1-byte opcode + up to 3 1-byte register operands = 4 bytes total.
Long form: 1-byte opcode-prefix Wide + the underlying op with 16-bit register operands.
Extra-long: prefix ExtraWide + 32-bit operands. Rare; used by large generated code.

Constants live in a per-function pool; OpConst R, idx indexes into it. Inline constants are restricted to short immediates that fit a byte.

Cache footprint matters: the dispatch loop reads the next opcode from a cache line shared with the previous one. Short form keeps a tight loop in one or two cache lines.

Dispatch

Go does not support computed goto. The empirically fastest portable Go interpreter dispatch is a giant switch op where each case body is small enough to inline and the loop body is the entire switch. Two specific patterns matter:

_ = code[len(code)-1] at the top of the loop hints the bounds checker, so code[ip] inside the loop has its check removed.
A two-level dispatch (switch op { ... } with hot ops first) lets the branch predictor learn the typical sequence; Go's switch is currently a jump table for dense small-integer enums, which is what Op is.

The MEP-18 Phase A jump-table validation work feeds directly into this; nothing new is needed at the dispatch level beyond hot-ordering.

Direct calls

Two new ops:

OpCall fn_idx, arg_base, ret_reg where fn_idx is a constant in the function table. No IC, no scan; the frame is acquired, args memcpy'd, dispatch transferred.
OpCallClosureStatic captures_idx, fn_idx, arg_base, ret_reg for closures whose capture set is fixed at construction. Emitted when closure_flatten proves the target.

OpCallDynamic, the dynamic path, retains the polymorphic IC from MEP-19. It is emitted only when the front end cannot prove the target.

Tail calls

OpTailCall fn_idx, arg_base. When in a tail position the compiler emits this instead of OpCall+OpReturn. The VM rewinds the current frame's registers (memmove args into the param slots, reset IP to 0, leave the frame allocated), reuses the slab, and continues. No frame stack growth.

This is the technique Lua, Scheme, and Erlang all use. It costs a single new op and turns fact_acc, mutual-recursion patterns, and self-referential agent loops into constant-stack programs.

Bounds-check elision

For an indexed read inside a loop guarded by a proven bound, emit OpIndexI64Unchecked. The bce pass proves the bound; the verifier (debug build) checks the proof.

This is the same optimization Go's own compiler does on for i := range xs { use(xs[i]) }. Mochi's pattern is identical and the optimization transfers.

Closure flattening

The current closure layout stores captures as []Value. With NaN-boxed cells the captures become []Cell. More importantly, when the IR proves a closure's capture set is fixed and the call sites are known, the emitter inlines the captures into the callee's frame at call time and skips the closure object entirely. This is the same "closure inlining" Self and HotSpot do for monomorphic call sites.

Verifier (mandatory in debug)

runtime/vm2/verify.go walks emitted bytecode and asserts:

Every typed op's operands carry the correct compiler-assigned type.
Every untagged-cell op (OpAddI64Raw) follows a proven typed source.
Every bounds-elided op follows a proven bound.
Every OpCall target index is in range of the function table.
Every basic block ends with a terminator; no fall-through past the last instruction.

In production builds (-tags release) the verifier is stripped. CI runs every fixture under -tags debug so a lowering bug fails fast.

Generic / dynamic surface

The polymorphic surface is small but non-empty:

! io boundaries (FFI returns) yield any.
JSON decode returns dynamic shapes.
dataset queries over schema-less rows.
First-class functions passed across module boundaries.

These sites emit the generic ops (OpAddBoxed, OpCallDynamic, ...). The MEP-18/19 machinery (Quick byte, polymorphic IC) lives here and only here. The split is enforced by the verifier: a typed op next to a generic op without an explicit OpBox/OpUnbox is a compiler bug.

Rollout

Each step is one PR with measurable numbers.

compiler2/ir: minimal SSA + AST→IR builder for arithmetic + control flow. No optimizations yet. Goal: round-trip every program in the typed bench corpus through IR and back to the existing emitter. Validates the IR shape.
compiler2/opt/const + dce. Replaces the existing constfold; numbers should be flat on the bench corpus, no codegen change yet.
compiler2/regalloc. Standalone pass; emits to existing vm2 with reduced NumRegs. Measure frame-pool bucket-hit rate.
vm2 cell redesign to NaN-boxed 8-byte. Largest single PR; in vm2 only, no compiler2 dependency. Runs the existing hand-coded benches against the new layout. Expected: per-op cycle count drops by the cache-line factor.
compiler2/emit and the new opcode set (typed mono ops, direct calls). First PR that produces compiler2 bytecode end-to-end. Runs alongside vm-compile output via a feature flag.
compiler2/opt/{tail,bce,cse,inline_trivial,closure_flatten}. One pass per PR, each with a corpus benchstat row.
Make compiler2 the default target. Flip the flag once the corpus is at parity or better and the verifier reports zero issues.

Each PR appends a row to a results table in MEP-20 (which is the canonical "the layout paid off" document) and a row in MEP-17's per-PR gate.

Performance hypothesis (falsifiable)

Claim	Target
Cell layout redesign alone (#4)	≥ 1.5x on fib(25), ≥ 1.3x geomean on the corpus
compiler2 typed emit alone (#5, no opts)	≥ 1.2x geomean over MEP-21 v1 typed emit
Register allocation alone (#3)	≥ 10% reduction in regs-pool largest-bucket use
Tail-call op alone (#6 first sub-PR)	constant stack on `fact_acc(10^6)` (qualitative)
BCE alone	≥ 5% on `iter_sum` and `hof_map`
Closure flattening alone	drops `OpCallV` allocs to 0 on `closure_dispatch`
End-to-end (all of compiler2 + cell redesign) on the bench corpus	≥ 2.5x geomean over today's vm2
Total vs runtime/vm baseline (bench corpus geomean)	≥ 10x

The MEP merges step-by-step; each row must hold for its corresponding PR. The end-to-end target is the gate for declaring compiler2 the default.

Results so far

Apple M4, go test -bench -benchmem -benchtime 2s, in-tree on the from-scratch rollout (#21496 sub-PRs).

fib(25)

pipeline	ns/op	B/op	allocs/op	vs runtime/vm
runtime/vm (observation-driven)	203,898,861	364M	515,563	1.0x
vm2 hand-coded bytecode (step 3)	14,060,401	43	0	14.5x
compiler2 -> vm2 (steps 4-7)	16,559,035	74	0	12.3x

The compiler2 pipeline matches hand-coded vm2 within noise; the emitter adds zero per-iteration overhead. The 12.3x improvement vs runtime/vm baseline meets the "Total vs runtime/vm" >=10x target purely on the cell + dispatch redesign.

sum(100_000) tail-recursive

After step 9a's tail-call pass: completes correctly in constant frame stack depth. Without tail-call this overflows the frame pool; the qualitative "constant stack on fact_acc(10^6)" hypothesis row is cleared on the analogous sum-accumulator shape.

Sub-target status

Hypothesis row	Status	PR
Cell layout redesign alone (>=1.5x on fib(25))	CLEARED 3.3x vs prior 24-byte vm2	#21498
compiler2 typed emit alone (parity with hand-coded)	CLEARED	#21503
Register allocation alone (regs-pool bucket-hit)	not yet measured (no corpus yet)	#21502
Tail-call op alone (constant-stack recursion)	CLEARED	#21505
BCE alone	deferred (no index ops yet)	-
Closure flattening alone	deferred (no closures yet)	-
End-to-end on bench corpus (>=2.5x over today's vm2)	partial (corpus = fib only)	-
Total vs runtime/vm (>=10x)	CLEARED 12.3x on fib(25)	#21504

The deferred rows wait for the front end to lower lists, maps, and closures into IR. Those are the next push beyond the step-1-10 rollout; the bench corpus broadens with them.

Risks

Two compilers cost maintenance. vm.Compile and compiler2 cohabit until the flip. The cost is real; the mitigation is feature parity tests that run both on the corpus and diff outputs (oracle harness).
NaN-boxing portability. The encoding relies on platform doubles being IEEE 754, which all Go targets satisfy. WASM-as-host (future) is the same encoding.
GC ref table for boxed pointers. Storing pointers in a parallel table behind cell indices requires the table to be live as long as any cell that points into it. The frame holds a reference; the GC keeps the table.
Register allocator complexity. Linear scan is well-understood. The risk is interaction with closure flattening (the captures must be live across the call site). Solvable with a standard "live-out" set per block.
Verifier debt. A pass added without a verifier check is a future bug. The CI gate is the verifier on every typed-corpus program in -tags debug.
Compile-time growth. A multi-pass compiler is slower than a single AST walk. The expected hit is small (Mochi modules are small), but a budget (compile time per kloc) goes into MEP-17 so regressions are visible.
Polymorphic surface drift. If JSON / any usage grows, the residual generic surface grows with it. The verifier's split-line keeps the architectures distinct; the perf story degrades gracefully toward the MEP-19 IC numbers.

Non-goals

No JIT. That is MEP-22. Compiler2 produces interpreter bytecode; the JIT (if it lands) consumes that bytecode.
No source-language change. Mochi syntax and semantics are unchanged. The pipeline is internal.
No removal of runtime/vm. vm stays as the safety net; vm2 + compiler2 win on merit.
No removal of the MEP-19 IC machinery. It earns its keep on the dynamic surface.
No SSA-level escape analysis. Reserved for a follow-up MEP if it pays off; copy-and-patch baseline (MEP-22) recovers most of the same wins.

References

Pipeline architecture

[SSA-Briggs] Preston Briggs, Register Allocation via Graph Coloring. PhD thesis, Rice 1992.
[LinearScan] Massimiliano Poletto, Vivek Sarkar, Linear Scan Register Allocation. ACM TOPLAS 1999.
[Go-SSA] Keith Randall et al., Go's SSA backend. https://github.com/golang/go/blob/master/src/cmd/compile/internal/ssa/README.md
[Crystal-IR] Crystal language team, Type inference and monomorphization in Crystal. https://crystal-lang.org/reference/syntax_and_semantics/type_inference.html
[MLton] MLton whole-program optimizing compiler. http://mlton.org/Documentation
[Dart-DBC] Vyacheslav Egorov, Dart bytecode interpreter. https://mrale.ph/dartvm/

Value representation

[LuaJIT-NaNBox] Mike Pall, LuaJIT 2.0 NaN-tagging notes. http://lua-users.org/lists/lua-l/2009-11/msg00089.html
[JSC-NaNBox] WebKit team, JavaScriptCore value representation. https://webkit.org/blog/10308/
[SpiderMonkey-NaNBox] Mozilla, SpiderMonkey jsval and NaN-boxing. https://firefox-source-docs.mozilla.org/js/HackingTips.html

Interpreter dispatch

[LuaJIT-Interp] Mike Pall, The LuaJIT interpreter architecture. http://wiki.luajit.org/Numerical-Computing-Performance-Guide
[JSC-LLInt] WebKit team, Introducing the LLInt. https://webkit.org/blog/189/
[V8-Ignition] V8 team, Firing up the Ignition interpreter. https://v8.dev/blog/ignition-interpreter
[Wren] Bob Nystrom, Wren VM. https://wren.io/
[Ertl-Threaded] M. Anton Ertl, David Gregg, The Structure and Performance of Efficient Interpreters. JILP 2003.
[Brunthaler-Inline] Stefan Brunthaler, Inline Caching Meets Quickening. ECOOP 2010.

Tail calls and recursion

[Steele-RABBIT] Guy Steele, RABBIT: A Compiler for Scheme. MIT-AITR 1978 (the original tail-call argument).

Bounds-check elision

[Wuerthinger-BCE] Thomas Wuerthinger et al., Array bounds check elimination for the Java HotSpot client compiler. PPPJ 2007.

Prior MEP-21 v1

Mochi PR #21491, MEP-21: replace VM Conformance spec with Type-Directed Bytecode. https://github.com/mochilang/mochi/pull/21491

Abstract​

Motivation​

What vm2 is leaving on the floor today​

Why a new compiler​

Prior art for the target architecture​

Specification​

Principle​

Compiler2 pipeline​

IR shape​

Optimization passes (initial set)​

Linear-scan register allocation​

VM2 cell redesign​

Bytecode encoding​

Dispatch​

Direct calls​

Tail calls​

Bounds-check elision​

Closure flattening​

Verifier (mandatory in debug)​

Generic / dynamic surface​

Rollout​

Performance hypothesis (falsifiable)​

Results so far​

fib(25)​

sum(100_000) tail-recursive​

Sub-target status​

Risks​

Non-goals​

References​

Pipeline architecture​

Value representation​

Interpreter dispatch​

Tail calls and recursion​

Bounds-check elision​

Prior MEP-21 v1​

Abstract

Motivation

What vm2 is leaving on the floor today

Why a new compiler

Prior art for the target architecture

Specification

Principle

Compiler2 pipeline

IR shape

Optimization passes (initial set)

Linear-scan register allocation

VM2 cell redesign

Bytecode encoding

Dispatch

Direct calls

Tail calls

Bounds-check elision

Closure flattening

Verifier (mandatory in debug)

Generic / dynamic surface

Rollout

Performance hypothesis (falsifiable)

Results so far

fib(25)

sum(100_000) tail-recursive

Sub-target status

Risks

Non-goals

References

Pipeline architecture

Value representation

Interpreter dispatch

Tail calls and recursion

Bounds-check elision

Prior MEP-21 v1