MEP 21. Compiler2 and VM2 Co-Design for Maximum Interpreter Throughput
| Field | Value |
|---|---|
| MEP | 21 |
| Title | Compiler2 and VM2 Co-Design for Maximum Interpreter Throughput |
| Author | Mochi core |
| Status | Draft |
| Type | Standards Track |
| Created | 2026-05-16 |
| Replaces | MEP-21 v1 (Type-Directed Bytecode), merged in #21491 |
Abstract
The first cut of MEP-21 (#21491) kept vm2's tagged 24-byte Value, kept the quickening byte, kept the polymorphic IC, and added a lowering pass that emitted typed opcodes when types were known. That was the right MEP for "respect the existing surface." This MEP is the first-principles version: no respect for the existing surface, no backward-compatibility constraints, the ultimate goal is the throughput ceiling of a Go-hosted register interpreter on statically typed source.
Concretely:
- A new front end,
compiler2/, parallel to the existingvm.Compile. Pipeline: typed AST, typed SSA IR, optimization passes, linear-scan register allocation, vm2 bytecode. Output is monomorphic by construction. - A redesigned vm2 register cell: 8 bytes, NaN-boxed, packing int64 / float64 / pointer into a single word. The current 24-byte
{Tag, _pad, Num, Ref any}layout is abandoned. Register copies become a single 64-bit move; the per-instruction memory traffic drops 3x. - No generic opcodes on the typed surface. No
OpAdd. NoQuickbyte for typed ops. Quickening and the polymorphic IC are retained only for the genuinely dynamic boundary (FFI, JSON-derivedany, runtime-constructed closures). - First-class IR-level optimizations the existing compiler does not do: tail-call recognition, bounds-check elision, closure-capture flattening, register coalescing, common subexpression elimination on typed values, dead store elimination.
- A direct-call opcode whose target is patched at link time, plus a fused capture-into-frame closure call. No IC scan on resolved sites.
This is a multi-PR initiative, not a single landing. The MEP defines the target architecture; the rollout is staged.
Motivation
What vm2 is leaving on the floor today
The fib(25) numbers in MEP-20 (runtime/vm2 at 46.7 ms / 10 allocs vs runtime/vm at 203.9 ms / 515k allocs) are a 4.4x speedup with zero compiler help; vm2 still runs hand-written bytecode in the bench. The real production path will compile Mochi source, and the existing compiler is built around vm's tagged 100-byte Value. Specifically:
- Value width. Every
fr.Regs[ins.A] = fr.Regs[ins.B]copies 24 bytes (Tag + pad + Num + Ref interface header). For a typed int register the meaningful payload is 8. The other 16 are tag, padding, and an interface descriptor that is always nil for ints. Three quarters of the per-instruction memory traffic is dead weight. - Tag dispatch. Even
OpAdd_IIreads the tag of both operands. After MEP-21 v1 the compiler would emitOpAdd_IIdirectly, but the dispatcher still pays a tag check inside the handler because the layout still carries the tag. With type-erased typed registers the check is unreachable code. - Register allocation. The current compiler assigns a fresh register for every SSA value. Fibonacci uses 10 registers; with allocation it needs 4. Smaller frames mean smaller slabs, fewer cache lines per call, and a higher hit rate in the bucketed pool.
- No tail-call op.
fib(n-1) + fib(n-2)is not tail-recursive, butfact_acc(n-1, acc*n)is. The current pipeline emits anOpCalleven for tail positions; vm2 grows the frame stack to depth N. A real tail-call op would run constant-stack. - No bounds-check elision.
xs[i]afterfor i in 0..len(xs)re-checks the bound on every iteration. The compiler has the type and the loop bound in front of it; the check is provably redundant. - OpCallV on resolved targets. Every closure call goes through the polymorphic IC even when the target is a known top-level function captured into the closure. The IC slot scan is pure overhead in that case.
- GC pointer pressure. The
Ref anyfield forces the GC to scan every register on every cycle. Typed int/float registers should be invisible to the GC; only the ref bank (or the boxed-pointer slots in a NaN-boxed layout) should be scanned.
Each item is independently fixable. Taken together they are a different VM.
Why a new compiler
The current runtime/vm compiler is structured as a direct AST walker that emits instructions as it visits expressions. It has no IR between AST and bytecode. Without an IR there is nowhere to run an optimization pass; constant folding is bolted on as a pre-pass over the AST, and that is the only optimization in the pipeline. Adding the items above (register allocation, tail-call detection, BCE) without an IR means smearing them across the emitter, which has already proved hard to maintain (see the two-branch post-process pattern in types/check.go that this codebase learned the hard way to keep in two places).
A new package compiler2/ decouples the new pipeline from the old. vm.Compile keeps shipping as the production target until compiler2 reaches feature parity. Then it becomes the production target and vm.Compile enters maintenance mode. This is the same risk profile that vm vs vm2 has: two paths, the new one earns its way to default.
Prior art for the target architecture
The combination of (typed SSA IR, register allocation, monomorphic typed bytecode, tight value layout, Go-style switch dispatch) is the architecture of:
- LuaJIT interpreter mode ([LuaJIT-Interp]). NaN-boxed 8-byte cells; computed-goto dispatch in C; the trace JIT consumes already-typed bytecode. Mochi cannot do computed goto in Go, but everything else transfers.
- JSC LLInt ([JSC-LLInt]). Hand-written assembler for portability across dispatch styles; tight tagged cells; direct-call ops; ICs only on dynamic sites.
- V8 Ignition ([V8-Ignition]). Register VM with on-stack frame, variable-length opcodes for size, separate opcode set for cold paths. The variable-length idea ports cleanly to Mochi.
- Wren / wren-cli ([Wren]). Small register-based VM, tagged 8-byte values, written so the entire dispatch loop is a single Go-style switch the C compiler inlines. Mochi's vm2 in Go is structurally identical except for layout width.
- Dart Kernel + DBC (Dart Bytecode) ([Dart-DBC]). Typed IR, register allocation, direct calls. Discontinued only because Dart shipped AOT instead, not because the design failed.
- Crystal ([Crystal-IR]). Source is typed; the compiler runs monomorphization + constant folding + dead code elim before LLVM. The first half of that pipeline is what compiler2 needs.
- MLton / Whole-program SML ([MLton]). Aggressive monomorphization; demonstrates that whole-program type info collapses interface dispatch to direct calls. Mochi is whole-program by default.
The contrary design (small AST-walking compiler + observation-driven VM) is what CPython did before PEP-659 and what Ruby did before YJIT/ZJIT. Both projects have spent person-years climbing back. Mochi does not need to walk that path twice.
Specification
Principle
The compiler MUST produce bytecode in which every operand's type is exact and every call's target is either resolved at compile time or is genuinely dynamic. The VM MUST contain no machinery for guessing what the compiler already knew.
Corollaries the spec enforces:
- No generic typed-arithmetic opcode (
OpAdd,OpSub, ...) on the typed surface. They exist only in the residual polymorphic-dispatch path. - No
Quickbyte field on typed instructions. The byte still exists for residual ops to allow MEP-19 quickening of dynamic-source operands. - No tag field on typed register reads. Cells are either NaN-boxed (one bank) or untagged (per-type banks).
- No IC on
OpCall. The IC exists only onOpCallDynamic, which is emitted only when the front end cannot resolve.
Compiler2 pipeline
parser AST
│
▼
types/check (existing; unchanged contract)
│ typed AST
▼
compiler2/ir/build (AST → typed SSA IR)
│
▼
compiler2/opt/{const,dce,cse,bce,tail,inline_trivial,closure_flatten}
│ optimized IR
▼
compiler2/regalloc (linear scan, type-aware)
│ IR + register map
▼
compiler2/emit (IR → vm2 bytecode)
│ *vm2.Program
▼
runtime/vm2.Run
Each pass is a pure function on IR with a small interface. The emitter is the only piece that knows the vm2 wire format.
IR shape
Typed SSA, single-assignment per virtual register, instructions in basic blocks, CFG explicit. Each value carries its monotype. Generics and method dispatch are monomorphized before this IR exists.
type Value struct {
ID ValueID
Op IROp // typed: AddI64, AddF64, LoadGlobal, ...
Type Type // monotype after check
Args []ValueID
Loc source.Loc
}
type Block struct {
ID BlockID
Insts []ValueID
Term Terminator // Jump, CondJump, Return, TailCall
}
type Func struct {
Name string
Params []ValueID
Blocks []Block
}
This is intentionally close to Go's own cmd/compile/internal/ssa.Func shape, not because we are building the Go compiler but because the API surface has been beaten on for a decade and the pattern composes well with optimization passes.
Optimization passes (initial set)
| Pass | What it does | Win on bench corpus |
|---|---|---|
const | Typed constant folding + algebraic identities | replaces existing constfold |
dce | Dead-store + dead-instruction elimination | shrinks codegen |
cse | Common subexpression elimination on pure ops | dedups index/load |
bce | Bounds-check elision after a proven i < len(xs) | tightens iter loops |
tail | Tail-call detection: convert return f(...) to OpTailCall | constant-stack recursion |
inline_trivial | Inline single-block leaf functions (getters, identity) | cuts call overhead |
closure_flatten | Convert captured vars to explicit array; emit OpCallClosureStatic when target is fixed | drops IC on resolved closures |
The set is small on purpose. Each pass is a few hundred lines of Go and ships with property-based tests against the unoptimized output. (The conformance content from MEP-21 v1 returns here, scoped to "optimization passes must not change observable behavior.")
Linear-scan register allocation
Run after optimization. Compute live ranges over SSA values. Walk in instruction order; assign each live range a vm2 register, recycle freed registers. Spill to memory only if the live-set exceeds the frame's allotment (we expect this never to fire in the bench corpus; Mochi functions are small).
Output: a map[ValueID]int32 consumed by the emitter. The emitted bytecode references the assigned vm2 register, not the SSA value ID.
Effect on vm2: Function.NumRegs becomes the allocated count, not the SSA-cardinality. The bucketed regs-pool (MEP-20) hits a smaller bucket more often; the cache line footprint of a frame drops.
VM2 cell redesign
Replace vm2.Value with an 8-byte NaN-boxed cell:
type Cell uint64 // NaN-boxed
Encoding (industry-standard NaN-boxing as in [LuaJIT-NaNBox], [JSC-NaNBox], [SpiderMonkey-NaNBox]):
- A finite or infinite IEEE 754 double maps to its native bit pattern.
- A signalling NaN with the top 13 bits set to
0xFFFA..0xFFFEis a tag prefix; the low 48 bits are the payload. - Payload kinds:
kInt32(0xFFFA...),kPtr(0xFFFE...),kBool/kNullconstants (0xFFFB...).
Tradeoffs:
- Width: 8 bytes vs current 24. Register banks become 3x denser; a 16-register frame is one cache line, not three.
- Int range: 48 bits in the int-tag encoding, or
int32via a tighter tag. Mochi's typedintis 64-bit; we use the untagged numeric cell for typedi64(every bit is the int) and the tagged encoding only for the residualanypath. - GC visibility: the cell is
uint64, notinterface{}. The GC does not scan it. Pointers are reachable via a parallel ref table indexed by the cell's pointer payload.
The simpler alternative (per-type banks: Ints []int64, Floats []float64, Refs []unsafe.Pointer) is rejected because it triples register-naming bookkeeping in the emitter and complicates closures. NaN-boxing is the proven choice for register-uniform VMs.
For typed numeric ops where the compiler has proved the operand is i64, the emitter MAY emit untagged-cell variants (OpAddI64Raw) that interpret the cell as a raw int64. These run with zero decode overhead. The verifier (see below) checks that raw ops only appear after a typed proof.
Bytecode encoding
Variable-length 8-bit-aligned, after V8 Ignition:
- Short form: 1-byte opcode + up to 3 1-byte register operands = 4 bytes total.
- Long form: 1-byte opcode-prefix
Wide+ the underlying op with 16-bit register operands. - Extra-long: prefix
ExtraWide+ 32-bit operands. Rare; used by large generated code.
Constants live in a per-function pool; OpConst R, idx indexes into it. Inline constants are restricted to short immediates that fit a byte.
Cache footprint matters: the dispatch loop reads the next opcode from a cache line shared with the previous one. Short form keeps a tight loop in one or two cache lines.
Dispatch
Go does not support computed goto. The empirically fastest portable Go interpreter dispatch is a giant switch op where each case body is small enough to inline and the loop body is the entire switch. Two specific patterns matter:
_ = code[len(code)-1]at the top of the loop hints the bounds checker, socode[ip]inside the loop has its check removed.- A two-level dispatch (
switch op { ... }with hot ops first) lets the branch predictor learn the typical sequence; Go's switch is currently a jump table for dense small-integer enums, which is whatOpis.
The MEP-18 Phase A jump-table validation work feeds directly into this; nothing new is needed at the dispatch level beyond hot-ordering.
Direct calls
Two new ops:
OpCall fn_idx, arg_base, ret_regwherefn_idxis a constant in the function table. No IC, no scan; the frame is acquired, args memcpy'd, dispatch transferred.OpCallClosureStatic captures_idx, fn_idx, arg_base, ret_regfor closures whose capture set is fixed at construction. Emitted whenclosure_flattenproves the target.
OpCallDynamic, the dynamic path, retains the polymorphic IC from MEP-19. It is emitted only when the front end cannot prove the target.
Tail calls
OpTailCall fn_idx, arg_base. When in a tail position the compiler emits this instead of OpCall+OpReturn. The VM rewinds the current frame's registers (memmove args into the param slots, reset IP to 0, leave the frame allocated), reuses the slab, and continues. No frame stack growth.
This is the technique Lua, Scheme, and Erlang all use. It costs a single new op and turns fact_acc, mutual-recursion patterns, and self-referential agent loops into constant-stack programs.
Bounds-check elision
For an indexed read inside a loop guarded by a proven bound, emit OpIndexI64Unchecked. The bce pass proves the bound; the verifier (debug build) checks the proof.
This is the same optimization Go's own compiler does on for i := range xs { use(xs[i]) }. Mochi's pattern is identical and the optimization transfers.
Closure flattening
The current closure layout stores captures as []Value. With NaN-boxed cells the captures become []Cell. More importantly, when the IR proves a closure's capture set is fixed and the call sites are known, the emitter inlines the captures into the callee's frame at call time and skips the closure object entirely. This is the same "closure inlining" Self and HotSpot do for monomorphic call sites.
Verifier (mandatory in debug)
runtime/vm2/verify.go walks emitted bytecode and asserts:
- Every typed op's operands carry the correct compiler-assigned type.
- Every untagged-cell op (
OpAddI64Raw) follows a proven typed source. - Every bounds-elided op follows a proven bound.
- Every
OpCalltarget index is in range of the function table. - Every basic block ends with a terminator; no fall-through past the last instruction.
In production builds (-tags release) the verifier is stripped. CI runs every fixture under -tags debug so a lowering bug fails fast.
Generic / dynamic surface
The polymorphic surface is small but non-empty:
! ioboundaries (FFI returns) yieldany.- JSON decode returns dynamic shapes.
datasetqueries over schema-less rows.- First-class functions passed across module boundaries.
These sites emit the generic ops (OpAddBoxed, OpCallDynamic, ...). The MEP-18/19 machinery (Quick byte, polymorphic IC) lives here and only here. The split is enforced by the verifier: a typed op next to a generic op without an explicit OpBox/OpUnbox is a compiler bug.
Rollout
Each step is one PR with measurable numbers.
compiler2/ir: minimal SSA + AST→IR builder for arithmetic + control flow. No optimizations yet. Goal: round-trip every program in the typed bench corpus through IR and back to the existing emitter. Validates the IR shape.compiler2/opt/const+dce. Replaces the existing constfold; numbers should be flat on the bench corpus, no codegen change yet.compiler2/regalloc. Standalone pass; emits to existing vm2 with reducedNumRegs. Measure frame-pool bucket-hit rate.- vm2 cell redesign to NaN-boxed 8-byte. Largest single PR; in vm2 only, no compiler2 dependency. Runs the existing hand-coded benches against the new layout. Expected: per-op cycle count drops by the cache-line factor.
compiler2/emitand the new opcode set (typed mono ops, direct calls). First PR that produces compiler2 bytecode end-to-end. Runs alongside vm-compile output via a feature flag.compiler2/opt/{tail,bce,cse,inline_trivial,closure_flatten}. One pass per PR, each with a corpus benchstat row.- Make compiler2 the default target. Flip the flag once the corpus is at parity or better and the verifier reports zero issues.
Each PR appends a row to a results table in MEP-20 (which is the canonical "the layout paid off" document) and a row in MEP-17's per-PR gate.
Performance hypothesis (falsifiable)
| Claim | Target |
|---|---|
| Cell layout redesign alone (#4) | ≥ 1.5x on fib(25), ≥ 1.3x geomean on the corpus |
| compiler2 typed emit alone (#5, no opts) | ≥ 1.2x geomean over MEP-21 v1 typed emit |
| Register allocation alone (#3) | ≥ 10% reduction in regs-pool largest-bucket use |
| Tail-call op alone (#6 first sub-PR) | constant stack on fact_acc(10^6) (qualitative) |
| BCE alone | ≥ 5% on iter_sum and hof_map |
| Closure flattening alone | drops OpCallV allocs to 0 on closure_dispatch |
| End-to-end (all of compiler2 + cell redesign) on the bench corpus | ≥ 2.5x geomean over today's vm2 |
| Total vs runtime/vm baseline (bench corpus geomean) | ≥ 10x |
The MEP merges step-by-step; each row must hold for its corresponding PR. The end-to-end target is the gate for declaring compiler2 the default.
Results so far
Apple M4, go test -bench -benchmem -benchtime 2s, in-tree on the from-scratch rollout (#21496 sub-PRs).
fib(25)
| pipeline | ns/op | B/op | allocs/op | vs runtime/vm |
|---|---|---|---|---|
| runtime/vm (observation-driven) | 203,898,861 | 364M | 515,563 | 1.0x |
| vm2 hand-coded bytecode (step 3) | 14,060,401 | 43 | 0 | 14.5x |
| compiler2 -> vm2 (steps 4-7) | 16,559,035 | 74 | 0 | 12.3x |
The compiler2 pipeline matches hand-coded vm2 within noise; the emitter adds zero per-iteration overhead. The 12.3x improvement vs runtime/vm baseline meets the "Total vs runtime/vm" >=10x target purely on the cell + dispatch redesign.
sum(100_000) tail-recursive
After step 9a's tail-call pass: completes correctly in constant frame stack depth. Without tail-call this overflows the frame pool; the qualitative "constant stack on fact_acc(10^6)" hypothesis row is cleared on the analogous sum-accumulator shape.
Sub-target status
| Hypothesis row | Status | PR |
|---|---|---|
| Cell layout redesign alone (>=1.5x on fib(25)) | CLEARED 3.3x vs prior 24-byte vm2 | #21498 |
| compiler2 typed emit alone (parity with hand-coded) | CLEARED | #21503 |
| Register allocation alone (regs-pool bucket-hit) | not yet measured (no corpus yet) | #21502 |
| Tail-call op alone (constant-stack recursion) | CLEARED | #21505 |
| BCE alone | deferred (no index ops yet) | - |
| Closure flattening alone | deferred (no closures yet) | - |
| End-to-end on bench corpus (>=2.5x over today's vm2) | partial (corpus = fib only) | - |
| Total vs runtime/vm (>=10x) | CLEARED 12.3x on fib(25) | #21504 |
The deferred rows wait for the front end to lower lists, maps, and closures into IR. Those are the next push beyond the step-1-10 rollout; the bench corpus broadens with them.
Risks
- Two compilers cost maintenance. vm.Compile and compiler2 cohabit until the flip. The cost is real; the mitigation is feature parity tests that run both on the corpus and diff outputs (oracle harness).
- NaN-boxing portability. The encoding relies on platform doubles being IEEE 754, which all Go targets satisfy. WASM-as-host (future) is the same encoding.
- GC ref table for boxed pointers. Storing pointers in a parallel table behind cell indices requires the table to be live as long as any cell that points into it. The frame holds a reference; the GC keeps the table.
- Register allocator complexity. Linear scan is well-understood. The risk is interaction with closure flattening (the captures must be live across the call site). Solvable with a standard "live-out" set per block.
- Verifier debt. A pass added without a verifier check is a future bug. The CI gate is the verifier on every typed-corpus program in
-tags debug. - Compile-time growth. A multi-pass compiler is slower than a single AST walk. The expected hit is small (Mochi modules are small), but a budget (compile time per kloc) goes into MEP-17 so regressions are visible.
- Polymorphic surface drift. If JSON /
anyusage grows, the residual generic surface grows with it. The verifier's split-line keeps the architectures distinct; the perf story degrades gracefully toward the MEP-19 IC numbers.
Non-goals
- No JIT. That is MEP-22. Compiler2 produces interpreter bytecode; the JIT (if it lands) consumes that bytecode.
- No source-language change. Mochi syntax and semantics are unchanged. The pipeline is internal.
- No removal of runtime/vm. vm stays as the safety net; vm2 + compiler2 win on merit.
- No removal of the MEP-19 IC machinery. It earns its keep on the dynamic surface.
- No SSA-level escape analysis. Reserved for a follow-up MEP if it pays off; copy-and-patch baseline (MEP-22) recovers most of the same wins.
References
Pipeline architecture
- [SSA-Briggs] Preston Briggs, Register Allocation via Graph Coloring. PhD thesis, Rice 1992.
- [LinearScan] Massimiliano Poletto, Vivek Sarkar, Linear Scan Register Allocation. ACM TOPLAS 1999.
- [Go-SSA] Keith Randall et al., Go's SSA backend. https://github.com/golang/go/blob/master/src/cmd/compile/internal/ssa/README.md
- [Crystal-IR] Crystal language team, Type inference and monomorphization in Crystal. https://crystal-lang.org/reference/syntax_and_semantics/type_inference.html
- [MLton] MLton whole-program optimizing compiler. http://mlton.org/Documentation
- [Dart-DBC] Vyacheslav Egorov, Dart bytecode interpreter. https://mrale.ph/dartvm/
Value representation
- [LuaJIT-NaNBox] Mike Pall, LuaJIT 2.0 NaN-tagging notes. http://lua-users.org/lists/lua-l/2009-11/msg00089.html
- [JSC-NaNBox] WebKit team, JavaScriptCore value representation. https://webkit.org/blog/10308/
- [SpiderMonkey-NaNBox] Mozilla, SpiderMonkey jsval and NaN-boxing. https://firefox-source-docs.mozilla.org/js/HackingTips.html
Interpreter dispatch
- [LuaJIT-Interp] Mike Pall, The LuaJIT interpreter architecture. http://wiki.luajit.org/Numerical-Computing-Performance-Guide
- [JSC-LLInt] WebKit team, Introducing the LLInt. https://webkit.org/blog/189/
- [V8-Ignition] V8 team, Firing up the Ignition interpreter. https://v8.dev/blog/ignition-interpreter
- [Wren] Bob Nystrom, Wren VM. https://wren.io/
- [Ertl-Threaded] M. Anton Ertl, David Gregg, The Structure and Performance of Efficient Interpreters. JILP 2003.
- [Brunthaler-Inline] Stefan Brunthaler, Inline Caching Meets Quickening. ECOOP 2010.
Tail calls and recursion
- [Steele-RABBIT] Guy Steele, RABBIT: A Compiler for Scheme. MIT-AITR 1978 (the original tail-call argument).
Bounds-check elision
- [Wuerthinger-BCE] Thomas Wuerthinger et al., Array bounds check elimination for the Java HotSpot client compiler. PPPJ 2007.
Prior MEP-21 v1
- Mochi PR #21491, MEP-21: replace VM Conformance spec with Type-Directed Bytecode. https://github.com/mochilang/mochi/pull/21491