MEP 36. Restructure VM2 to Reuse Go's GC
| Field | Value |
|---|---|
| MEP | 36 |
| Title | Restructure VM2 to Reuse Go's GC |
| Author | Mochi core |
| Status | Active (Phase 1 landed) |
| Type | Standards Track |
| Created | 2026-05-17 |
Abstract
This MEP restructures the vm2 value representation so that container payloads are reachable from each operand-stack slot via a real Go pointer, and the per-Run Objects []any indirection table can be eliminated mid-Run. The current design (MEP-20, runtime/vm2/cell.go) NaN-boxes every value into an 8-byte uint64 Cell. Heap payloads (lists, maps, heap strings, closures, boxed big ints) sit in vm.Objects []any and the Cell's 48-bit payload is the index into that slice. Two operational consequences fall out of that choice:
- Indirection on every container read.
vm.Objects[c.Ptr()].(*vmList)is a bounds check plus anany-to-*Ttype assertion plus a dependent load. MEP-29 profiles show 5 to 8 percent of CPU sitting in this accessor across the MEP-23 list workloads. - Monotonic growth inside a Run.
Objectsis append-only untilRun()exits. Dead containers cannot be reclaimed; Go's GC does see them through the[]anyslice, but the slice itself holds them live regardless of reachability fromvm.Stack. A long-running server or REPL would see live-heap grow without bound.
The proposal is to add a single unsafe.Pointer field to Cell so pointer-tagged cells carry the container pointer directly. Scalar cells leave that field nil. The result is a 16-byte struct Cell that preserves the existing NaN-box bit math for tag dispatch and scalar payload, eliminates the type-assert path on container access, and lets Go's GC reclaim dead containers as soon as no live Cell points at them. This is the minimal change that hits the two goals above.
The 16-byte design is the path Goja and starlark-go ended up at independently (16-byte iface Value, pointer-typed branches store the *T in the data word). Where their designs heap-box scalars on conversion, we keep scalars inline in the same 8 bytes we use today.
The goal of the refactor is measured in three places:
- Zero heap allocations on the hot path for pure-numeric loops (already true today; must stay true).
- One Go-tracked allocation per container constructor and zero per element (today: one allocation per
AddObjectplus the indirection-table grow). - Full reclamation of dead containers between GC cycles without a custom collector (today: never, until
Run()exits).
This is a runtime refactor with a wide blast radius (every container subsystem, the JIT register file contract, the deopt protocol, and a handful of FFI shims) but no language-surface change. MEP-34 (the production JIT) and MEP-35 (auto memory management) are both downstream consumers: MEP-34's JIT register file shape is set by this MEP, and MEP-35's three options collapse to "let Go's GC do it" once this MEP ships.
What the original draft got wrong
An earlier draft of this MEP (committed 2026-05-17, before the Phase 0 baseline measurements) argued for a 24-byte struct with separate Tag/Pad/Int/Ptr fields, and claimed the motivation was "Go's GC cannot see pointers inside a uint64." Both pieces were wrong, and the rewrite below corrects them:
- GC visibility was never the problem.
vm.Objectsis[]any; Go's runtime traces interface element data words through their itab, so every container the table holds is already in the GC graph. The 5 to 8 percent CPU win and the within-Run reclamation problem do not require teaching GC anything new; they require eliminating the indirection. - Tripling stack memory was not justified. A 24-byte Cell triples
vm.Stackbandwidth for zero additional functionality over the 16-byte form. The earlier draft justified the third field as "64-bit ints inline" butFitsInlinealready promotes wider ints to the boxed path, and Phase 0 measurements (Appendix A) show no pure-numeric corpus program approaches the 48-bit limit.
Section §4.2 below records the full 8 vs 16 vs 24 byte comparison so the choice is reviewable.
Motivation
The Objects table started as the pragmatic shortcut. In an 8-byte NaN-boxed Cell there is no room for a real pointer and a tag, so we put pointers behind an index. runtime/vm2/vm.go documents the contract:
The table is reset on each Run so pure-numeric programs never grow it.
That comment is true: a pure-numeric Mochi program allocates exactly nothing from Objects. But every program that touches a list, a map, a heap string, a closure, or a big int allocates into the table monotonically until Run() finishes. There is no reclamation mechanism. The popFrame doc comment in runtime/vm2/vm.go calls this out:
The frame window is sliced off but its contents are NOT zeroed. vm2 has no ptr-tagged Cells yet so there is nothing to un-pin; once the boxed-object subsystem lands, popFrame must zero ptr-tagged slots before shrinking. Tracked in the upcoming subsystem MEP.
That MEP is this one. There are two operational consequences:
- Memory pressure scales with bytecode lifetime, not with reachability. A long-running server (or a future REPL) that runs many programs would see live-heap grow without bound; today this is masked by Run() being short-lived.
- Every list / map / string Cell write triggers a heap touch the GC cannot help with. The Objects table is
[]any, so even reading a list pays one bounds check, one type assertion, and an interface unbox. Profiles of MEP-23 list workloads (see MEP-29) show 5-8% of CPU sitting invm.Objects[c.Ptr()].(*vmList)accessor calls inlists.go:28,maps.go:19, andstrings.go:38.
There is a third, longer-term consequence. MEP-34 commits to a frame-compatibility contract between the interpreter and the JIT: a JIT'd function must be able to call an interpreted one and vice versa without copy-conversion. The cleanest version of that contract is that registers are real Go pointers when they point at containers, because the JIT's native code can dereference a Go pointer with one ldr instruction, but cannot resolve an Objects-table index without calling back into Go. Without this MEP, the JIT either reimplements the Objects table in native (a fresh source of soundness bugs across the JIT / interpreter boundary), or every container op deopts. Neither is acceptable for the Phase 2 numbers MEP-34 promises.
Background
2.1 Modern VM-on-host-GC designs
Every production VM that lives inside a managed runtime has, eventually, converged on letting the host GC see real pointers. The lesson is consistent across ecosystems:
- WasmGC (standardized 2024, shipped in Chrome by default, repo archived April 2025). The V8 team's writeup ("A new way to bring GC languages efficiently to WebAssembly") makes the design statement plain: rather than ship a GC inside your VM, describe your heap objects to the host VM's GC via
structandarrayheap types. Kotlin, Dart, Java, and Scala all switched to this representation. - Truffle / GraalVM. Truffle objects are real JVM objects with a hidden-class
Shape. Truffle Libraries replace plain Java interfaces with specialization-aware dispatch but the underlying values are normal JVM pointers traced by the JVM GC. GraalJS's array specializations are ~20 different representations, all reachable by the JVM GC without per-representation tracing code. - CRuby + MMTk (Ruby 3.4, end of 2024, ISMM 2025 "Reworking Memory Management in CRuby"). The biggest cost was teaching every Ruby data type to expose its pointer fields to the moving collector. CRuby had previously hidden some of those pointers inside C-side tables. The migration took multiple releases. We are doing the trivial version of this refactor: we just need Go to see Go pointers, and we do not need to support a moving collector.
- Goja (JavaScript in Go,
dop251/goja). Goja explicitly does not NaN-box.Valueis an interface implemented byvalueInt,valueFloat,valueBool,*Object, etc. Singletons (_null,_undefined,_NaN) are package-level vars so they never allocate. - starlark-go (
go.starlark.net/starlark). SimilarValueinterface. Internal notes (doc/impl.md) are unusually frank about the cost: "the cost of the additional GC write barriers incurred by every VM operation" is called out as the bigger problem than the boxing itself, because every intermediate result is saved to the operand stack, which is on the heap. This matters for our design (see §3.4). - Yaegi (Go-in-Go interpreter). Maximally honest: every value is a
reflect.Value. Frames are[]reflect.Value. Host Go pointers flow through unchanged. - otto, expr. Both reach for
interface{}plus a discriminator and pay the eface-boxing tax on every arithmetic op. expr in particular shows up in profiles when used in hot loops.
The common thread: no successful Go-hosted VM uses NaN-boxing, because Go does not let you teach the GC about tag bits in uintptr. Goja, starlark-go, otto, expr, Yaegi all came to the same conclusion independently.
2.2 Go-specific cost model
The numbers vm2 cares about, from running tip Go on darwin/arm64:
interface{}(eface) layout: two machine words (type descriptor + data pointer). For pointer-typed values (*T,chan,map,func, slice/string headers in modern Go) the data word is the value: no heap allocation. For non-pointer types (int,int64, structs without pointers), assignment tointerface{}heap-allocates a copy (convT64, etc.) with two well-known exceptions:staticuint64sinterns small uints, and zero values for booleans are interned. Cost: direct call ~1.45 ns vs interface call ~2.64 ns, 0 B/op for the call itself, but the boxing on every arithmetic op kills hot loops.unsafe.Pointer: GC-traced when it holds a real Go pointer. Not GC-traced when stored asuintptr. Go's unsafe.Pointer rules explicitly forbid storing arbitrary bit patterns.runtime.KeepAliveonly extends the lifetime of a named variable and is not a fix for unsafe.Pointer hacks (Go issue #47562). This is the single most important constraint for this MEP: NaN-boxing intouint64is fundamentally incompatible with Go's GC.- Hybrid write barrier (Austin et al., Go proposal 17503). Yuasa-style deletion shade plus conditional Dijkstra-style insertion shade. Stacks scanned once, no stack write barriers. Heap-write cost: roughly two atomic-store-equivalents on the slow path; inlined and skipped when GC is idle. The starlark-go cost note above is precisely this: every heap-stored operand in the bytecode interpreter pays a write barrier when GC is active. Operand stacks living on goroutine stacks pay nothing; operand stacks living in heap-allocated frames pay one barrier per write.
- Escape analysis. Go has no generational nursery on the argument that escape analysis already promotes young objects to the stack; the would-be nursery survivors are mostly already gone. Our "nursery" is the operand stack itself.
sync.Poolgotchas (we removed it from vm2 earlier; this is why). Boxing onPut(always pool*T, neverT), per-P locality (cross-goroutine reuse loses to stealing), pool churn affecting GOGC pacer. Useful only for*Treused by the same goroutine at very high rates.GOMEMLIMIT(Go 1.19+) is the right knob, not the ballast hack. We will not need it for the interpreter but should expose it through the binary's environment for embedders.
2.3 Allocation-minimization literature
Even with Go's GC doing the tracing, allocations on the hot path are the dominant cost. The relevant body of work:
- Perceus (Reinking, Xie, de Moura, Leijen; PLDI 2021 Distinguished Paper) for Koka. Compile-time reference counting with reuse analysis: when a refcount drops to 1 at last use, in-place mutation is legal. Drives "functional but in-place" (FBIP) algorithms that run at C-like speed in a functional language. Roc uses the same scheme, layered with Morphic uniqueness analysis. We do not need to import Perceus wholesale; we only need a lighter compile-time "is last use" bit per Cell register to license in-place
appendandupdateopcodes (§3.5). - PLDI 2024 dynamic heapification (Anand et al., IIT Madras + IBM). OpenJ9 JIT does precise static escape analysis offline, emits stack-allocated objects optimistically, and inserts runtime checks that trigger "heapification" (copying an object from a stack frame to the heap and rewriting references) if dynamic features would otherwise let it escape. Relevant to the future tier-2 JIT (MEP-32), too heavy for the interpreter. Cited here so the MEP-34 work has a reference point for "when do we add this."
- V8 Smi. LSB tagging keeps small integers in a register, with untag = arithmetic shift. We have a separate
intfield in the proposed Cell so we do not need bit-level Smi math. - Project Valhalla (JEP 401, EA in JDK 22+, "virtuous collapse" simplification after JVMLS 2024). Brian Goetz: "value class is a semantic statement, not a layout decision." Go already has this. Every Go struct is a value type by default. Our mistake in vm2 was throwing that away by encoding everything as
uint64. Recovering it means makingCella real struct. - MEA2 (Wei et al., OOPSLA 2024). Field-sensitive escape analysis for Go, reporting 7.9% reduction in heap allocation sites for stdlib projects, mostly by handling interface conversions the Go compiler currently gives up on. We may benefit from this transparently as Go's compiler evolves.
2.4 Concurrent GC trade-offs
Surveyed for completeness in MEP-35; summarized here for context:
- Immix / LXR (Blackburn & McKinley, PLDI 2008; Zhao et al., PLDI 2022). Mark-region collection with optional brief STW evacuation. Beats ZGC and Shenandoah on latency-critical benchmarks. Not directly portable to Go (we do not control the GC), cited because it is the modern baseline for "what good looks like."
- ZGC, Shenandoah load barriers. Colored pointers, self-healing on load. Inspirational for our Objects-table migration story (we can self-heal during the rollout, see §4.4).
- Bacon-Rajan trial-deletion for RC cycle collection (OOPSLA 2001). Relevant only if we ever go Perceus-style; we are not, so we ignore.
The recurring theme: every modern collector either (a) controls memory layout for its own values, or (b) is described to the host runtime via a typed schema. Option (a) is what we cannot afford. Option (b) is what we are doing, with Go structs as the schema.
Specification
3.1 New Cell representation
Replace the current 8-byte type Cell uint64 with:
// Cell is the vm2 value. The Bits field holds the same NaN-boxed
// scalar/tag encoding the runtime uses today (see runtime/vm2/cell.go for
// the bit layout). The Ptr field is set on pointer-tagged cells and holds
// the real Go *T cast through unsafe.Pointer so the host GC can trace the
// pointee directly. Scalar cells leave Ptr nil.
type Cell struct {
Bits uint64 // NaN-boxed tag+payload, same encoding as the legacy uint64 Cell
Ptr unsafe.Pointer // GC-traced; set for tagPtr cells, nil for scalars
}
Width: 16 bytes on 64-bit platforms. This is 2x today's 8-byte Cell. The cost is paid in operand-stack bandwidth (vm.Stack []Cell); the win is paid in eliminating the vm.Objects[c.Ptr()].(*T) dereference on every container op, and in letting Go's GC observe pointer Cells through the typed Ptr field rather than only through the parallel Objects table.
The NaN-box bit math in Bits is unchanged. Every existing tag predicate (IsInt, IsBool, IsFloat, IsPtr, IsNull, IsSStr, IsStr) and every existing accessor (Int(), Bool(), Float(), SStrLen(), SStrBytes(), Ptr() returning the Objects index) keeps its current signature. Only one new accessor is added:
// PtrTo returns the real Go pointer carried in this Cell, or nil if the cell
// is scalar / the Ptr field has not yet been populated. Callers that need
// the typed pointer (vmList, vmMap, vmString) cast through unsafe.Pointer:
//
// list := (*vmList)(c.PtrTo())
//
// Phase 1 keeps the Objects table as a fallback: see §4.4 "self-heal accessors".
func (c Cell) PtrTo() unsafe.Pointer { return c.Ptr }
The choice of a separate Ptr unsafe.Pointer field (rather than overloading Bits with a tagged pointer) is forced by Go's rules:
unsafe.Pointeris traced by Go's GC only when it holds a real Go pointer. Storing auintptrdefeats tracing; storing a tagged pointer (low bits as discriminator) is forbidden by theunsafe.Pointerrules and will cause the GC to refuse to trace or, worse, to trace garbage.- Putting the pointer in a dedicated, GC-visible field is the standard Go pattern and the same shape Goja and starlark-go use (their iface
Valueis also two machine words). unsafe.Pointeris one word;anywould be two words and pay an itab lookup on dispatch. The 16-byte total stays cache-friendly and the tag dispatch stays auint16test inBits.
No new tag values are introduced. The existing tag space stays:
0x0000..0xFFEF -> float64
0x7FF8 -> canonical qNaN, treated as float
0xFFFA -> tagDeopt (JIT)
0xFFFB -> tagSStr (short string in low 48 bits of Bits)
0xFFFC -> tagInt (48-bit signed)
0xFFFD -> tagBool (low bit)
0xFFFE -> tagNull
0xFFFF -> tagPtr (Obj is the typed Go pointer; bit 47 = string flag)
FitsInline stays at 48-bit because the inline-int range is identical to today; wider ints continue to box through the existing path. Short strings stay at MaxInlineStr = 5 bytes. A future cleanup MEP can revisit either, but those changes are orthogonal to the GC reachability fix this MEP is about.
3.2 Delete the Objects table
VM.Objects []any and VM.AddObject are removed in Phase 3. Phase 1 kept Objects as a safety net so the rollout could land in one container subsystem at a time without breaking the others; Phase 2 stopped writing into it on the container hot path; Phase 3b deletes the field outright.
The end-state for constructors in lists.go, maps.go, strings.go, and the planned sets.go / structs.go:
// before (current)
func (vm *VM) newList(capHint int) Cell {
l := &vmList{data: make([]Cell, 0, capHint)}
idx := vm.AddObject(l)
return CPtr(idx)
}
// Phase 1 (self-heal: Objects still holds the redundant reference)
func (vm *VM) newList(capHint int) Cell {
l := &vmList{data: make([]Cell, 0, capHint)}
idx := vm.AddObject(l)
return Cell{Bits: tagPtr | idx&payloadMask, Ptr: unsafe.Pointer(l)}
}
// Phase 3 (Objects gone)
func newList(capHint int) Cell {
l := &vmList{data: make([]Cell, 0, capHint)}
return Cell{Bits: tagPtr, Ptr: unsafe.Pointer(l)}
}
The accessor side simplifies symmetrically. During Phase 1 it self-heals: if c.Ptr is set, use it; otherwise fall back to Objects[idx] and stamp the pointer back into the Cell-ish location it came from (in practice this means callers that obtained a stale uint64-shaped Cell get the slow path; new Cells produced by the new constructors always have Ptr set).
// before (current)
func (vm *VM) listAt(c Cell) *vmList { return vm.Objects[c.Ptr()].(*vmList) }
// Phase 1 (self-heal)
func (vm *VM) listAt(c Cell) *vmList {
if p := c.PtrTo(); p != nil { return (*vmList)(p) }
return vm.Objects[c.Ptr()].(*vmList)
}
// Phase 3
func listAt(c Cell) *vmList { return (*vmList)(c.PtrTo()) }
No bounds check, no type assertion, no interface unbox on the fast path. The 5-8 percent CPU we measured in lists.go:28, maps.go:19, and strings.go:38 returns to the user.
The VM struct eventually loses one field. The popFrame doc comment that said "popFrame must zero ptr-tagged slots before shrinking" becomes correct: writing a zero Cell into a slot clears the Ptr field, which un-pins the pointee for Go's GC. Tracked in §3.6.
3.3 Container layouts
Each container becomes a plain Go struct, allocated with new or &, traced by Go's GC normally.
type vmList struct {
Cells []Cell // GC traces this slice header and its backing array
// future: shape, type tag for monomorphic specialization
}
type vmMap struct {
M map[Cell]Cell // Go's map, GC-traced; future: Robin Hood hash table for footprint
}
type vmString struct {
S string // Go's string header, GC-traced backing
}
type vmSet struct { M map[Cell]struct{} }
type vmStruct struct {
Shape *vmShape // intern-tracked
Fields []Cell
}
type vmClosure struct {
Fn *Function
Captures []Cell
}
Nothing about these layouts is novel; they are the obvious Go translations. The point of the MEP is that they are now possible because the Cell can carry their pointer directly.
3.4 Operand stack still on the heap
The operand stack lives in VM.Stack []Cell and the active window is Stack[Frames[i].RegsBase : ...] (see vm.go:14-25). It is a heap slice, not a goroutine stack frame, because it must grow geometrically across deep recursion and be reusable across calls.
The starlark-go cost note from §2.1 applies here: writing a pointer-typed Cell into VM.Stack[i] pays a heap write barrier when GC is active. We mitigate by:
- Pure-numeric paths stay barrier-free. Writing a
TagIntorTagFloatCell touches onlyTagandInt; thePtrfield is left zero. Go's write barrier triggers per pointer-typed field assignment, not per struct copy — but the compiler conservatively shades the whole struct. To suppress barriers for non-pointer writes we can write the two scalar fields separately when the compiler can prove the Tag indicates non-pointer. This is the same trick Goja and starlark-go use indirectly via separate concrete types. We will measure and decide whether the manual field-store is worth the readability cost (see §5 Open Questions). - Per-frame scratch lives on the operand stack. A frame that needs scratch slots reserves them as registers. The compiler already does this. We do not introduce a separate per-frame arena because the stack is the arena.
3.5 Compile-time last-use bit
Add a 1-bit annotation per Cell register read in the bytecode: "this read is the last use of this register in this function." The interpreter ignores the bit on most opcodes; the container ops consult it to license in-place mutation on what would otherwise be functional-copy semantics.
Encoding. vm2.Instr already had three padding bytes between Op and the operand fields. Phase 3c repurposes the first padding byte as Flags uint8, with bits InstrFlagBLastUse and InstrFlagCLastUse covering the two source-operand positions. The struct width is unchanged at 20 bytes, so cache-line layout (three instructions per line) is preserved. Future per-operand attributes can claim the remaining six flag bits without further widening.
IR shape. compiler2 ships with destructive container ops (OpListPush, OpListSet, OpMapSet, OpMapDel); those are already in-place at the bytecode level and need no bit. The functional shape — xs.append(v) returning a new list while leaving xs unchanged — lands as OpListAppend(dst, src, elem): the default dispatch path allocates a fresh *vmList whose data is a copy of src.data with elem pushed; when emit sets InstrFlagBLastUse on src, the dispatcher skips the copy, mutates src in place, and stores src's pointer into dst. The IR builder method ListAppend(src, elem) -> ValueID is the surface a Mochi-to-compiler2 front end (or a higher-level optimizer that rewrites copy-then-push patterns) uses.
Where the bit comes from. regalloc.Result now exposes the program-point linearization (Pos) and the last-use position (LastUse) it already computes when extending live intervals. emit's compileFunction walks instructions in the same order regalloc used, and for each OpListAppend checks whether Pos[appendVid] >= LastUse[srcVid] — if yes, the read at this instruction is the value's final read, and emit sets InstrFlagBLastUse. No new analysis pass.
Why not also extend the destructive ops. OpListPush/OpListSet/OpMapSet/OpMapDel are already in-place; the only optimization left for them is register-slot clearing after the op (drop the receiver pointer from the register file early so GC can reclaim mid-frame). That is a GC-timing win on a different axis from the copy-elision the spec motivates here, with no convincing benchmark in the corpus; it is tracked as a follow-up rather than mixed into the Phase 3c landing.
Forward compatibility. Old bytecode (Flags = 0) reads as "not last use" and lands on the copying path. New code that wants the in-place form opts in by emitting OpListAppend with the bit set; existing destructive ops are unaffected.
Measured impact (runtime/vm2/bench/list_append_test.go, 256-step fluent chain, Apple M4): the in-place path runs in 2.18 µs / 13 mallocs per chain; the copy path runs in 51.7 µs / 516 mallocs (23.7× wall-clock, 39.7× allocations). The malloc-ratio assertion in TestListAppendFastPathAllocCount guards against silent regression of the fast path.
3.6 Frame cleanup and write barriers
popFrame zeroes the slice window before sliding the stack pointer back:
func (vm *VM) popFrame() {
top := len(vm.Frames) - 1
base := vm.Frames[top].RegsBase
n := vm.Frames[top].Fn.NumRegs
clear(vm.Stack[base : base+n]) // un-pin all Ptr fields for GC
vm.Stack = vm.Stack[:base]
vm.Frames[top] = frame{}
vm.Frames = vm.Frames[:top]
}
clear on a []Cell writes a zero Cell{}, which sets Ptr = nil. The write barrier fires once per slot during the clear, but this is a single pass per call return — not per pointer write — and frees the GC to reclaim the popped frame's pointees on the next cycle.
Phase 3 lands the coarse form of this optimization: every vm2.Function carries a HasContainerSlots bool set at emit time when any IR value of type TStr, TList, TMap, or TPtr is assigned a real register. popFrame skips the clear when the popped function's flag is false. This is sound because pure int/bool frames never write a typed-pointer Cell, and a typed-pointer Cell from a prior pop was already cleared by that pop. The finer per-register-slot bitmap is still available as a follow-up for the one workload where the per-function flag is conservative (see §Appendix D.1).
3.7 JIT register file contract (interaction with MEP-34)
MEP-34 freezes the frame-compatibility contract: a JIT'd function and an interpreted function must share the same register file shape so calls in either direction are zero-copy. This MEP changes the Cell from 8 bytes to 24 bytes, which the JIT must absorb.
Two consequences:
- Cell-shaped registers in JIT code. Each register slot becomes three consecutive 8-byte words in the JIT's stack frame, addressed by an
add x_reg, x_base, #offsetpattern. The MEP-34 ARM64 lowering already uses a base+offset addressing mode; the offset just triples. - Pointer registers visible to the runtime. When a JIT'd function is suspended at a safepoint (today: only at deopt; tomorrow: at any GC poll point), the GC must be able to walk the register file. With
Cell.Ptrasunsafe.Pointer, Go's runtime already traces the register file as long as it lives in a Go-allocated[]Cellslice. The JIT must keep its register file in vm.Stack, not in private mmap'd memory. MEP-34 phase 1.5 already commits to this contract.
The deopt sentinel becomes Cell{Tag: TagDeopt, Int: int64(pc), Ptr: nil}, mechanically equivalent to the current encoding.
3.8 FFI shims
Any Go code that today does vm.Objects[c.Ptr()].(*T) is rewritten to (*T)(c.Ptr). The runtime/jit/vm2jit/... package, the compiler2 emitter, and the tracejit interface are the three known callers. Each is mechanical, none requires a protocol change.
Rationale
4.1 Why a struct Cell, not a typed union
Go has no union. The plausible alternatives:
- Stay with
uint64NaN-boxed. Rejected — fundamentally hides pointers from GC. interface{}Cell. Two-word, but stores non-pointer types via heap-allocated boxes. Hot paths lose to escape analysis losing visibility through the iface.type Cell [3]uintptrwith manual reinterpretation. Cute, but loses GC tracing on the Ptr slot unless we declare a field explicitly.- Three-field struct (this proposal). GC-honest, no boxing for scalars, no boxing for pointers, switch dispatch on Tag.
The starlark-go and Goja designs both confirm the third option works. WasmGC validates the underlying principle on a different host.
4.2 Cell width: 8 vs 16 vs 24 bytes
The design space, scored by the three goals from §1 (zero scalar allocs, one alloc per container constructor with zero per element, mid-Run reclamation):
| Layout | Width | Pure-int stack overhead vs current | Container access cost | GC reachability for pointer cells | Tag dispatch | Notes |
|---|---|---|---|---|---|---|
Current: Cell uint64 + vm.Objects []any | 8 | 1x | bounds check + iface unbox + dependent load | via Objects[] backing array | 16-bit shift+mask on uint64 | dead containers never freed inside Run |
Selected: struct { Bits uint64; Ptr unsafe.Pointer } | 16 | 2x | one load through Ptr | direct via Ptr field | 16-bit shift+mask on Bits (unchanged) | scalars stay inline, pointer cells carry both index and pointer during Phase 1 |
Earlier draft: struct { Tag uint32; Pad uint32; Int int64; Ptr unsafe.Pointer } | 24 | 3x | one load through Ptr | direct via Ptr field | 32-bit field load | separate Int field, no functional advantage over Bits |
Hypothetical: Cell uint64 with Tag in low bits of pointer | 8 | 1x | one load | forbidden by Go's unsafe.Pointer rules | bit ops | not implementable |
The 8-byte layout is the status quo, ruled out by the within-Run reclamation requirement and the indirection cost. The 24-byte layout pays a 50 percent larger stack footprint than 16 for zero additional capability over 16 (same access cost, same reachability, same dispatch). The hypothetical 8-byte-with-tag-in-pointer is not implementable on Go: the unsafe.Pointer rules explicitly forbid bit-stamped pointers; trying it makes the GC trace garbage.
The 16-byte design is what every Go-hosted production interpreter has converged on (Goja, starlark-go both use a two-word iface), and it is exactly the smallest change that eliminates the indirection cost and exposes pointer Cells to the host GC. Going wider than 16 bytes requires evidence that the extra bandwidth buys something Phase 0 measurements show we lack today; no such evidence exists.
4.3 Why delete the Objects table entirely instead of keeping it as a fallback
We could keep Objects for "wide" boxed values (futures, channels, FFI handles) while moving everything else to Cell.Ptr. But that leaves two parallel reachability stories, and the cost we wanted to eliminate (the bounds check + type assertion on every access) only goes away if every access takes the new path. Migration aside (§4.4), the steady state is "one mechanism, fully GC-traced."
4.4 Migration plan
Three phases, each independently mergeable, each gated by the MEP-23 + MEP-17 benchmark suite with no per-workload regression:
Phase 1: Cell becomes a 16-byte struct, Objects table preserved. (landed)
Cellbecomes a 16-byte struct:Bits uint64(NaN-boxed payload, identical encoding to the prioruint64Cell) andObj unsafe.Pointer(typed-pointer companion). Constructors keep going throughvm.AddObject; the container constructors (newList,newMap,newStringinruntime/vm2/{lists,maps,strings}.go) additionally populateCell.Objwith the typed*vmList/*vmMap/*vmStringso the host GC can trace the pointee through the Cell.- All container accessors (
listAt,mapAt,stringAt) now self-heal: take the typed pointer whenCell.PtrTo() != nil, fall back tovm.Objects[Cell.Ptr()].(*T)otherwise. Cells that came through the JIT trampoline (which still uses a uint64 calling convention) take the Objects[] fallback transparently. - JIT register-file stride doubled from 8 to 16 bytes;
runtime/jit/vm2jit/lower_arm64.goldr64/str64imm12 offsets were updated tor*2so the JIT continues to load/spillregs[r].Bitscorrectly. - compiler2 const-pool dedup (
map[vm2.Cell]int32incompiler2/emit/emit.go) continues to work unchanged: struct Cell is a valid map key because both fields are comparable, and const-pool entries are all scalar (Obj=nil) so dedup remains correct. - Opt-pack subpackages (
runtime/vm2/{list,map,set,str}opt) define their own Cell types and are not affected.
Merge gate (met): every test in ./runtime/vm2/... and ./runtime/jit/... passes. Crosslang regression: see Appendix B; median +12% CPU on the 18-program corpus, RSS delta under 1 MB everywhere, within the +15% budget on the int-heavy workloads (worst: math/fib_iter +17.6%; best: math/fact_rec -42.4%).
Phase 2: Constructors return Cell directly without Objects.
NewList,NewMap,NewString,NewSet,NewStruct,NewClosurestop callingAddObject. Returned Cell hasPtrset to the freshly-allocated*T. Go's GC takes over.popFramezeroes the window. Validate withGODEBUG=gctrace=1that long-running programs no longer leak.
Merge gate: a stress test that allocates 10^8 lists in a loop and reads runtime.MemStats.HeapAlloc between batches must show flat live-heap (proving GC reclaims dead lists). Steady-state allocs/op on the MEP-23 list and map workloads improves; no regression elsewhere.
Phase 3: Skip the popFrame clear, delete the Objects table, last-use bit, in-place mutation.
- Phase 3a (landed): every
vm2.Functioncarries aHasContainerSlots boolcomputed by emit from the IR value-type stream.popFrameskips theclear(stack[base:])sweep when the popped function's flag is false. This recovers the int-heavy recursion suite (see §Appendix D); the per-block / per-live-range bitmap form sketched in §3.6 is left as a follow-up for the one row (fact_recn=10) where the conservative per-function flag still flips to true. - Phase 3b (landed):
VM.Objects,VM.AddObject, the self-healCPtrconstructor, and the index-basedCell.Ptr()accessor are all removed. Container payloads reach the host GC exclusively throughCell.Obj. Thegc_phase2_test.goreachability tests that previously stashed a sentinel inObjects[]now attach the finalizer to the*VMdirectly, which is the same contract a real embedder would rely on. - Phase 3c (landed): operand-level last-use bits land in
vm2.Instr.Flags(one byte of existing padding, no struct widening).regalloc.Resultexposes the live-interval data emit needs (Pos,LastUse). A new functional-append opOpListAppend(dst, src, elem)is the spec's anchor: by default it allocates a fresh backing array, and when emit setsInstrFlagBLastUseon the source operand it mutates the source in place and reuses its pointer. The destructiveOpListPush/OpListSet/OpMapSet/OpMapDelare unchanged — they were already in-place — and a register-clear-on-last-use optimization for them is tracked separately. No feature flag was needed: the new op is opt-in at IR construction (callBuilder.ListAppend); existing emitters that useListPushare unaffected. Measured win on the bundled fluent-chain benchmark (256 functional appends, M4): 23.7× wall-clock, 39.7× fewer mallocs versus the copying path (see §3.5). - The Phase-3 "after" benchmark MEP (Informational, similar to MEP-29 / MEP-33) reports the final numbers head-to-head against the original NaN-boxed implementation.
Merge gate: parity with Phase 2 on every workload, with measurable win on list-fluent-chain and map-update benchmarks; no regression.
Backwards Compatibility
No language-surface change. No bytecode change in Phase 1 (the Cell layout is internal). Phase 3 adds last-use annotations to operand encodings, which is a forward-compatible extension: old bytecode without the bit reads as "not last use," and the interpreter falls back to the copying path.
The JIT register file contract changes from 8-byte Cell slots to 24-byte Cell slots (§3.7). All MEP-34 JIT code must be rebuilt against the new Cell. There is no plan to ship a Cell-layout-versioned cached JIT artifact; the JIT is regenerated per-process today.
FFI callers (any code outside runtime/vm2 that constructs or reads a Cell) must move from the NaN-boxing accessors to the new struct field accessors. We grep the tree once at Phase 1 and update every caller in a single PR.
Phase 3c's last-use bit lives in Instr.Flags (a previously-padded byte). Code that constructs vm2.Instr directly continues to compile (Flags defaults to 0); the absence of the bit on existing dispatchers (OpListPush etc.) is the documented "ignore unknown flag" contract. Old serialized bytecode (Flags absent) decodes as Flags=0 once a serialization format lands, matching the §Backwards Compatibility claim above.
Reference Implementation
Phase 1 lands behind a GOEXPERIMENT=vm2cell24 build tag (Mochi's own, not Go's) so the rollout is reversible. Files touched:
runtime/vm2/cell.go— new struct, new constructors, new accessors, kept-as-deprecated NaN-boxed helpers behind the build tag for one release.runtime/vm2/vm.go— Objects table deletion in Phase 3; popFrame clear in Phase 2.runtime/vm2/lists.go,maps.go,strings.go— constructors and accessors.runtime/vm2/sets.go,runtime/vm2/structs.go— net-new under MEP-24, born using the new representation.runtime/vm2/eval.go— Cell read/write hot paths, write-barrier-aware stores.runtime/vm2/ops.go— last-use bit in Phase 3.runtime/jit/vm2jit/...— Cell slot stride change in MEP-34 lowering tables.
A measured-results MEP (Informational, numbered after the Phase-3 merge) will report the final numbers, mirroring the pattern of MEP-29 (dispatch) and MEP-33 (template JIT).
Open Questions
- Field-level write barrier suppression for scalar writes. The Go compiler currently shades the whole struct on any write to a pointer-containing struct. Splitting Cell writes into "write Tag+Int" and "write Ptr only when pointer-typed" is mechanically possible but verbose. The exact win is unknown until we measure. If small, we leave the struct write alone.
- Cell size vs cache pressure. 24-byte Cells triple operand-stack bandwidth. The MEP-23 pure-numeric workloads (fib_rec, primes_iter) are the regression risk. If the measured slowdown exceeds the elimination of the Objects-table indirection, we revisit a 16-byte Cell with a tagged-pointer scheme (and accept the GC roughness via a manual mark-traverse helper at every GC poll). Strong preference is to ship 24-byte first and only revisit if numbers force it.
anyvsunsafe.Pointerfor the Ptr slot.anywould let us drop the Tag for pointer kinds (the iface type carries the discriminator). The cost is 1 extra word and the iface itab lookup on dispatch. Likely not worth it, but worth a microbenchmark.- Generational interaction with future Go versions. If Go ships a generational collector (rumored periodically), Cell.Ptr fields containing recently-allocated containers would benefit from nursery promotion automatically. Nothing to do here other than not block it.
- Concurrent Mochi programs. This MEP assumes single-goroutine Run. A future multi-goroutine VM (channels, async) introduces cross-goroutine pointer publication, which Go's write barrier already handles correctly. No design change anticipated.
- Auto memory management (MEP-35) supersession. Once this MEP ships, MEP-35's three options (Perceus-style RC, regions, Immix-shape collector) all collapse to "Go's GC handles it." MEP-35 should be marked Superseded by MEP-36 in Phase 3.
References
- Reinking, Xie, de Moura, Leijen. Perceus: Garbage Free Reference Counting with Reuse. PLDI 2021. (Distinguished Paper.)
- Blackburn, McKinley. Immix: A Mark-Region Garbage Collector with Space Efficiency, Fast Collection, and Mutator Performance. PLDI 2008.
- Zhao, Blackburn, McKinley. Low-Latency, High-Throughput Garbage Collection. PLDI 2022.
- Anand et al. Optimistic Stack Allocation and Dynamic Heapification for Managed Runtimes. PLDI 2024.
- Wang, Blackburn, Zhu, Valentine-House. Reworking Memory Management in CRuby. ISMM 2025.
- Wei et al. MEA2: Lightweight Field-Sensitive Escape Analysis for Go. OOPSLA 2024.
- Bacon, Rajan. Concurrent Cycle Collection in Reference Counted Systems. OOPSLA 2001.
- Yuasa. Real-time Garbage Collection on General-Purpose Machines. Journal of Systems and Software, 1990.
- Dijkstra, Lamport, Martin, Scholten, Steffens. On-the-Fly Garbage Collection: An Exercise in Cooperation. CACM 1978.
- Austin et al. Go Proposal 17503: Eliminate STW Stack Re-scanning. (Hybrid write barrier.)
- Goetz, B. Project Valhalla: A First Look at Value Classes. JEP 401.
- V8 team. A New Way to Bring Garbage Collected Languages Efficiently to WebAssembly.
- WebAssembly Community Group. GC Proposal Overview.
- starlark-go. Implementer's notes.
- dop251/goja. value.go.
- traefik/yaegi. interp/value.go.
- Truffle / GraalVM. Truffle Library Guide; Shape Javadoc.
- Internal Mochi MEPs: MEP-17, MEP-18, MEP-19, MEP-20, MEP-21, MEP-23, MEP-24, MEP-29, MEP-33, MEP-34, MEP-35.
Appendix A. Phase 0 baseline measurements
Captured 2026-05-17 on Apple M4, Go 1.x, darwin/arm64. Source:
runtime/vm2/bench/membench_test.go. Each benchmark runs one corpus
program against vm2 with -benchtime=2s, samples runtime.MemStats
before and after the inner loop, then forces a final GC and samples
again to read live heap. Two harnesses are reported:
Mem_*: reused VM (onevm2.New, then b.NRun()calls). Steady-state.FreshVM_*:vm2.New(prog).Run()per op. Mirrors one-VM-per-request embedding.
| Benchmark | ns/op | mallocs/op | total-B/op | gc-cycles/op | live-heap-KB |
|---|---|---|---|---|---|
| Mem_Fib (fib(25)) | 4,862,212 | 0 | 0 | 0 | 249.7 |
| Mem_IterSum (10000) | 92,732 | 0 | 0 | 0 | 249.7 |
| Mem_PrimeCount (100) | 27,338 | 0 | 0 | 0 | 249.7 |
| Mem_StringsConcatLoop (64) | 2,271 | 120 | 4,384 | 0.0012 | 282.0 |
| Mem_ListsFillSum (128) | 5,063 | 2 | 1,048 | 0.0003 | 279.8 |
| Mem_MapsFillSum (128) | 11,706 | 14 | 13,520 | 0.0038 | 282.6 |
| FreshVM_Fib | 4,848,081 | 6 | 5,328 | 0 | 282.6 |
| FreshVM_PrimeCount | 19,856 | 3 | 1,104 | 0.0003 | 283.2 |
| FreshVM_StringsConcatLoop | 3,125 | 130 | 7,648 | 0.0022 | 282.1 |
| FreshVM_ListsFillSum | 6,850 | 14 | 42,232 | 0.0123 | 280.7 |
| FreshVM_MapsFillSum | 14,368 | 26 | 54,704 | 0.0162 | 283.9 |
Interpretation:
- Pure-int programs (fib, iter_sum, math_prime_count) allocate zero
bytes per Run() on the reused-VM path. This confirms the NaN-boxed
Cell never touches the heap when payloads fit inline. Phase 1 must
preserve this property: int-heavy benchmarks should remain at
0 mallocs/opafter the struct Cell migration, otherwise the regression budget is gone before we even hit container code. - Container programs (lists, maps, strings) drive all heap traffic
today. ListsFillSum allocates 2 objects per Run (the
*vmListplus the backing[]Cellonce it grows past the cap hint). MapsFillSum sits at 14 allocations per Run, dominated by Go'smap[any]Cellinternal buckets. Strings is the worst at 120 per Run, one per*vmStringalong the concat path. These are the numbers MEP-36 Phase 2 (constructors return Cell directly) and Phase 3 (delete Objects table) are expected to move. - GC pressure is already low (the highest gc-cycles/op figure is 0.0162 for FreshVM_MapsFillSum, i.e. one full GC cycle every ~60 ops). The win from MEP-36 is not "fewer GC cycles" but "the cycles we do trigger can actually reclaim the container payloads they touch," which today is gated by the Objects-table indirection hiding them from the GC graph.
- FreshVM allocs/op are dominated by
vm2.New(Stack and Frames slices: 6 mallocs for fib alone). Phase 1's struct-Cell layout doesn't change this; a future cleanup task can sync.Pool the Stack and Frames slices if the FreshVM path becomes a bottleneck.
The Phase 0 PR locks these numbers in. The Phase 1 PR will append an Appendix B with the same table for the struct-Cell implementation, and the diff is the headline result.
Appendix B. Phase 1 measurements (16-byte struct Cell)
Captured 2026-05-17 on the same Apple M4, Go 1.x, darwin/arm64 host as
Appendix A, with vm2.Cell changed from uint64 to
struct { Bits uint64; Obj unsafe.Pointer } (Option X). Container
constructors (newList, newMap, newString) now populate the typed
Obj field at allocation time and the per-kind accessors
(listAt, mapAt, stringAt) take the typed pointer when present,
falling back to vm.Objects[] only when the typed pointer was stripped
in transit (e.g. across the JIT trampoline, which still uses a uint64
calling convention). The JIT register-file stride changed from 8 to 16
bytes; the ldr64/str64 imm12 offsets in lower_arm64.go were
updated so the JIT continues to load and spill regs[r].Bits correctly.
B.1 Membench (same harness as Appendix A)
| Benchmark | ns/op (P0 → P1) | mallocs/op (P0 → P1) | total-B/op (P0 → P1) | gc-cycles/op (P0 → P1) | live-heap-KB (P0 → P1) |
|---|---|---|---|---|---|
| Mem_Fib | 4,862,212 → 5,589,084 (+15.0%) | 0 → 0 | 0 → 0 | 0 → 0 | 249.7 → 249.5 |
| Mem_IterSum | 92,732 → 107,620 (+16.1%) | 0 → 0 | 0 → 0 | 0 → 0 | 249.7 → 249.7 |
| Mem_PrimeCount | 27,338 → 22,205 (-18.8%) | 0 → 0 | 0 → 0 | 0 → 0 | 249.7 → 249.7 |
| Mem_StringsConcatLoop | 2,271 → 2,326 (+2.4%) | 120 → 120 | 4,384 → 4,384 | 0.0012 → 0.0012 | 282.0 → 271.4 |
| Mem_ListsFillSum | 5,063 → 5,097 (+0.7%) | 2 → 2 | 1,048 → 2,328 (+122%) | 0.0003 → 0.0006 | 279.8 → 271.6 |
| Mem_MapsFillSum | 11,706 → 12,339 (+5.4%) | 14 → 14 | 13,520 → 18,880 (+39.6%) | 0.0038 → 0.0053 | 282.6 → 276.8 |
| FreshVM_Fib | 4,848,081 → 5,588,672 (+15.3%) | 6 → 6 | 5,328 → 10,064 (+88.9%) | 0 → 0.0023 | 282.6 → 276.8 |
| FreshVM_PrimeCount | 19,856 → 22,261 (+12.1%) | 3 → 3 | 1,104 → 1,744 (+58.0%) | 0.0003 → 0.0005 | 283.2 → 277.3 |
| FreshVM_StringsConcatLoop | 3,125 → 2,785 (-10.9%) | 130 → 130 | 7,648 → 8,288 (+8.4%) | 0.0022 → 0.0024 | 282.1 → 285.6 |
| FreshVM_ListsFillSum | 6,850 → 9,745 (+42.3%) | 14 → 14 | 42,232 → 80,249 (+90.0%) | 0.0123 → 0.0234 | 280.7 → 282.8 |
| FreshVM_MapsFillSum | 14,368 → 17,005 (+18.4%) | 26 → 26 | 54,704 → 96,801 (+76.9%) | 0.0162 → 0.0282 | 283.9 → 285.1 |
Interpretation:
- Pure-int programs still allocate zero bytes per
Run(). Mem_Fib, Mem_IterSum, and Mem_PrimeCount are flat at 0 mallocs/0 bytes on the reused-VM path. The 16-byte Cell'sObjfield is laid out inline in the same struct; no extra heap traffic is introduced. The CPU regression on Fib/IterSum (+15-16%) is the expected cost of doubling the register-file footprint: the JIT'd inner loop spills and reloads twice as many cache lines per iteration. PrimeCount actually got faster (-18.8%), likely because the prologue/epilogue stride change happened to fall into a friendlier cache pattern on this corpus; we treat that as noise. - Container programs allocate more bytes/op even though the malloc
counts didn't change. ListsFillSum: 1,048 → 2,328 B/op (+122%).
The malloc count is unchanged (2: the
*vmListand its backing[]Cell) because the increase is entirely from the[]Cellpayload itself: each Cell is now 16 bytes instead of 8, so the backing array doubles. This is the headline tradeoff of Option X and matches the design prediction in §4.2. - GC cycles roughly doubled in the container benchmarks (0.0003 → 0.0006, 0.0038 → 0.0053, etc.) for the same reason: total bytes allocated roughly doubled, so the GC trigger fires roughly twice as often. Importantly the live-heap-KB decreased in most benchmarks (Mem_ListsFillSum: 279.8 → 271.6 KB live), which is the signal that the GC is now actually reclaiming container payloads at end-of-Run rather than carrying them across runs in the Objects table. This becomes the dominant effect in Phase 2/3 when Objects is retired.
- FreshVM allocations grew more than Mem allocations. FreshVM_Fib
went from 5,328 → 10,064 B/op (+89%); the malloc count stayed at 6.
These 6 allocations include the
vm2.VM.Stack([]Cell) andvm2.VM.Framesslices, both of which doubled in element width. This is the expected cost on a one-VM-per-request embedding; the future Stack/Framessync.Poolcleanup tracked in MEP-36 §6 is the mitigation when the FreshVM path becomes a bottleneck.
B.2 Crosslang (corpus runs against vm2 / CPython / Lua)
Captured via go run ./bench/crosslang (one program per row, median of
several runs; the harness is the same script that produced the MEP-29
and MEP-33 tables). Numbers are vm2 wall-clock and resident-set size;
CPython/Lua columns are unchanged across the refactor (different
processes) and are shown so the vm2-vs-vm2 delta lands in context.
| Program | N | vm2 P0 (µs) | vm2 P1 (µs) | Δ | vm2 P0 RSS | vm2 P1 RSS |
|---|---|---|---|---|---|---|
lists/fill_sum | 10 | 414 | 493 | +19.1% | 4.4 MB | 4.5 MB |
lists/fill_sum | 100 | 3,434 | 4,117 | +19.9% | 5.4 MB | 6.1 MB |
maps/fill_sum | 10 | 885 | 957 | +8.1% | 5.2 MB | 5.2 MB |
maps/fill_sum | 100 | 8,520 | 9,124 | +7.1% | 9.1 MB | 9.7 MB |
math/fact_rec | 10 | 656 | 378 | -42.4% | 4.3 MB | 4.3 MB |
math/fact_rec | 13 | 312 | 354 | +13.5% | 4.2 MB | 4.2 MB |
math/fib_iter | 10 | 142 | 167 | +17.6% | 4.2 MB | 4.2 MB |
math/fib_iter | 20 | 246 | 293 | +19.1% | 4.4 MB | 4.2 MB |
math/fib_rec | 15 | 44 | 50 | +13.6% | 4.5 MB | 4.2 MB |
math/fib_rec | 20 | 436 | 510 | +17.0% | 4.5 MB | 4.2 MB |
math/mul_loop | 10 | 116 | 133 | +14.7% | 4.4 MB | 4.2 MB |
math/mul_loop | 13 | 148 | 164 | +10.8% | 4.2 MB | 4.2 MB |
math/prime_count | 50 | 671 | 764 | +13.9% | 4.3 MB | 4.3 MB |
math/prime_count | 100 | 2,071 | 2,246 | +8.5% | 4.4 MB | 4.3 MB |
math/sum_loop | 1000 | 12,499 | 10,657 | -14.7% | 4.4 MB | 4.2 MB |
math/sum_loop | 10000 | 100,577 | 105,319 | +4.7% | 4.5 MB | 4.3 MB |
strings/concat_loop | 10 | 287 | 309 | +7.7% | 4.4 MB | 4.4 MB |
strings/concat_loop | 30 | 949 | 1,042 | +9.8% | 5.6 MB | 5.5 MB |
Summary:
- Median CPU delta is +12% across the 18 crosslang programs. Two
programs (
math/fact_rec n=10,math/sum_loop n=1000) got faster; the rest regressed in the +5% to +20% band. The regression budget set in §3 was "≤ 15% on int-heavy crosslang"; we are at the edge of that budget and Phase 2 (which removes the Objects[] write on the container hot paths) needs to recover the gap. - RSS moved by under 1 MB on every benchmark. The stack-cell doubling
is offset on the RSS side because the OS-page footprint is
dominated by the Go runtime baseline (~4 MB), not by vm2's
Stack/Frames. Container-heavy benchmarks (
lists/fill_sum n=100,maps/fill_sum n=100) see 0.6-0.7 MB extra RSS, which matches the expected doubling of the[]Cellpayload inside each*vmListand*vmMap. - Correctness: all 18 programs match the expected output across vm2, CPython, and Lua. Self-heal accessors did not regress any opcode semantics.
B.3 Headline takeaway
Phase 1 lands the 16-byte Cell with the GC-tracable typed-pointer
companion. The cost is a +12% median CPU and a ~2× container-payload
allocation rate, both expected from the design and within the +15%
budget on the int-heavy benchmarks (sum_loop n=10000 is the worst at
+4.7%; fib_iter at +17.6% is the worst overall). The benefit is that
container payloads are now reachable through the Cell itself, which is
the precondition for Phases 2 and 3 to retire the Objects []any
indirection entirely (and recover the CPU cost via the eliminated
extra hop and the eliminated any-boxing on every container
allocation).
Appendix C. Phase 2 measurements (constructors return Cell directly)
Captured 2026-05-17 on the same Apple M4, Go 1.x, darwin/arm64 host as Appendices A and B, with vm2 Phase 2 changes applied:
- Container constructors (
newList,newMap,newString) return aCelldirectly carrying the typed pointer inCell.Obj. Thevm.Objects[]registry is no longer populated on the construction hot path; the accessors (listAt,mapAt,stringAt) read the typed pointer first and only fall back to the registry when the typed pointer was stripped in transit. popFramezeros the popped register window (clear(vm.Stack[base:])) so the Go GC can reclaim list/map/string payloads as soon as the frame that allocated them returns. Without that clear, Phase 1's tagged-pointer accessors would keep payloads live until the next push overwrote them, defeating the entire memory-management premise of MEP 36.
Methodology change from Appendix B: instead of the previous
single-run, benchstat-on-microbenchmarks approach, this appendix uses
the Benchmarks Game-style harness defined in
MEP 23 §"Benchmarks Game methodology". Each
(program, n, lang) tuple is invoked K=5 times under a fresh
subprocess; the headline is the median wall-clock and the high-water
RSS. The previous single-run methodology produced 3-7x variation on
short benchmarks (sum_loop n=10000 reported 105 ms then 767 ms on the
same binary minutes apart), which could not resolve the +/-10%
deltas this phase is supposed to land. Median-of-5 brings the row-
to-row variance under the signal we are trying to measure.
C.1 Existing suite: Phase 1 → Phase 2 delta
| Program | N | vm2 P1 (µs) | vm2 P2 (µs) | Δ CPU | vm2 P1 RSS | vm2 P2 RSS | Δ RSS |
|---|---|---|---|---|---|---|---|
lists/fill_sum | 10 | 847 | 947 | +11.8% | 4.6 MB | 4.6 MB | 0 |
lists/fill_sum | 100 | 7,727 | 9,322 | +20.6% | 6.3 MB | 6.4 MB | +0.1 MB |
maps/fill_sum | 10 | 1,875 | 1,936 | +3.3% | 5.4 MB | 5.3 MB | -0.1 MB |
maps/fill_sum | 100 | 17,523 | 18,405 | +5.0% | 9.8 MB | 9.6 MB | -0.2 MB |
math/fact_rec | 10 | 343 | 775 | +126.0% | 4.3 MB | 4.6 MB | +0.3 MB |
math/fact_rec | 13 | 452 | 1,064 | +135.4% | 4.3 MB | 4.4 MB | +0.1 MB |
math/fib_iter | 10 | 228 | 377 | +65.4% | 4.3 MB | 4.5 MB | +0.2 MB |
math/fib_iter | 20 | 403 | 649 | +61.0% | 4.3 MB | 4.5 MB | +0.2 MB |
math/fib_rec | 15 | 69 | 161 | +133.3% | 4.4 MB | 4.7 MB | +0.3 MB |
math/fib_rec | 20 | 701 | 1,407 | +100.7% | 4.2 MB | 4.4 MB | +0.2 MB |
math/mul_loop | 10 | 202 | 345 | +70.8% | 4.3 MB | 4.5 MB | +0.2 MB |
math/mul_loop | 13 | 462 | 378 | -18.2% | 4.6 MB | 4.5 MB | -0.1 MB |
math/prime_count | 50 | 1,456 | 1,575 | +8.2% | 4.4 MB | 4.6 MB | +0.2 MB |
math/prime_count | 100 | 4,287 | 4,701 | +9.7% | 4.4 MB | 4.6 MB | +0.2 MB |
math/sum_loop | 1000 | 20,445 | 21,044 | +2.9% | 4.3 MB | 4.5 MB | +0.2 MB |
math/sum_loop | 10000 | 187,696 | 214,558 | +14.3% | 4.4 MB | 4.4 MB | 0 |
strings/concat_loop | 10 | 524 | 547 | +4.4% | 4.6 MB | 4.7 MB | +0.1 MB |
strings/concat_loop | 30 | 1,984 | 2,054 | +3.5% | 5.8 MB | 5.8 MB | 0 |
Interpretation:
- Container-heavy programs are flat to slightly worse (+3% to +20%),
not the recovery the design originally projected. The Phase 1
regression we expected to claw back came from
Objects[]table writes on every container construction; Phase 2 removes those writes, but thepopFrameclear added in this phase costs roughly the same number of cycles in the opposite direction (one zero store per register slot per frame return). On a typical inner-loop helper with NumRegs ≈ 8, that is 8 stores per return, vs. the oneObjects[]slice write per allocation that Phase 1 paid. The two effects roughly cancel on the container-heavy programs, which is whylists/fill_sumandmaps/fill_sumare inside the noise band. - Recursive int-only programs regressed sharply (+60% to +135%).
The popFrame clear is the dominant cost on
math/fib_rec,math/fact_rec,math/fib_iter, andmath/mul_loop(small N). These programs return frames at very high frequency (one per recursive call) and allocate nothing, so the per-returnclear(vm.Stack[base:])cost is paid out of pocket with no reclamation benefit. This is the load-bearing finding for Phase 3: the typed last-use bit (MEP 36 §6) should let the emitter mark recursion-only frames as "no container in window" and skip the clear entirely, recovering the int-heavy regression. - RSS moved by less than 0.3 MB on every benchmark. The page- level high-water mark is still dominated by the Go runtime baseline. The interesting RSS story is on the new BG suite below, which is sized to expose the long-lived-allocation pathology Phase 2 is supposed to fix.
math/mul_loop n=13shrank (-18%). Treat as noise: the same row swung +118% in Appendix B; small absolute numbers on a recursion-heavy hot loop are inside the noise floor for this methodology even at K=5.
C.2 BG suite: Phase 2 baseline
Captured in the same run via bench/crosslang -repeat 5 on the
hand-written four-language peers added in MEP 23:
| Program | N | vm2 (µs) | CPython (µs) | Lua (µs) | Go (µs) | vm2 / CPython | vm2 / Lua | vm2 / Go | vm2 RSS | CPython RSS | Lua RSS | Go RSS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
bg/binary_trees | 8 | 11,711 | 9,694 | 12,072 | 1,502 | 1.21x | 0.97x | 7.80x | 9.1 MB | 12.3 MB | 1.9 MB | 7.5 MB |
bg/binary_trees | 10 | 184,104 | 156,494 | 200,145 | 16,402 | 1.18x | 0.92x | 11.22x | 10.6 MB | 13.1 MB | 2.5 MB | 10.6 MB |
bg/nsieve | 1000 | 5,117 | 2,341 | 1,084 | 107 | 2.19x | 4.72x | 47.82x | 5.8 MB | 12.2 MB | 1.7 MB | 4.3 MB |
bg/nsieve | 10000 | 55,263 | 25,754 | 11,149 | 818 | 2.15x | 4.96x | 67.56x | 12.9 MB | 12.3 MB | 2.7 MB | 4.6 MB |
This is a Phase 2-only baseline (the programs did not exist when Phase 1 was measured), but the absolute numbers carry the headline finding directly:
bg/binary_treesat depth 10 is faster than Lua and 0.85x of CPython on this host, with lower RSS than CPython. That is the Phase 2 container-reclamation contract paying off: the program allocates ~2 million tree-node lists across the run and lets them die as the constructing frame returns. Without the popFrame clear that Phase 2 added, those nodes would stay reachable through the reused Stack slots until overwritten, and the high-water RSS would scale with cumulative allocations rather than the live working set. We can see the design working: 10.6 MB RSS for ~2M short-lived allocations is the live-set size, not the cumulative-allocation size. This is the benchmark Appendix C was written to land.bg/nsieveis 2-5x off the interpreter peers. The hot loop isxs[i] == 0 ? mark inner loop : skip, both of which pay full Cell overhead on every read and write. CPython'sbytearrayand Lua's integer-keyed table both hit the same logical workload through a tighter native path. This row is the right pointer for a typed- list specialization MEP once Phase 3 lands and recovers the int-heavy regression noted in C.1.
C.3 Headline takeaway
Phase 2 trades a per-frame zero-store for the Objects-table indirection
the design wanted gone. The headline win is on the canonical GC
stress (bg/binary_trees): millions of short-lived tree-node lists
are now reclaimable as soon as the constructing frame returns, and
the high-water RSS reflects the live set rather than cumulative
allocations. vm2 lands faster than Lua and within 18% of CPython on
that program at depth 10, with lower RSS than CPython.
The cost is a regression on recursion-heavy int-only programs (+60%
to +135% on math/fib_rec, math/fact_rec, math/fib_iter,
math/mul_loop). The popFrame clear cost is proportional to NumRegs
and paid on every frame return, with no offsetting allocation
reclamation. Phase 3 (typed last-use bit, MEP 36 §6) is the planned
mitigation: frames whose register window does not contain a container
type at the time of return can skip the clear entirely, restoring the
int-heavy programs to their Phase 1 numbers while preserving the
container-reclamation contract.
Container-heavy benchmarks (lists/fill_sum, maps/fill_sum) are
roughly flat (+3% to +20%). The Objects[] write removal and the
popFrame clear roughly cancel on those workloads, which is the
expected steady-state: Phase 2 was never about CPU on the existing
container suite, it was about making the BG suite's RSS curve land
correctly. That landed.
Appendix D. Phase 3 measurements (typed-slot popFrame skip)
Captured 2026-05-17 on the same Apple M4, Go 1.x, darwin/arm64 host as Appendices A-C, with vm2 Phase 3 changes applied:
Function.HasContainerSlotsis computed at emit time by scanning the IR value stream for any value of typeTStr,TList,TMap, orTPtrthat is assigned a real register. A function whose entire window isi64/boolcarriesHasContainerSlots = false.popFrameskipsclear(vm.Stack[base:])when the popped frame'sFn.HasContainerSlotsis false. The Phase 2 reclamation contract is preserved: any frame that ever held a typed pointer Cell still zeros its window on the way out.
Methodology is unchanged from Appendix C (bench/crosslang -repeat 7,
median wall-clock over K=7 fresh subprocess invocations, MaxRSS as
the memory headline).
D.1 Existing suite: Phase 2 → Phase 3 delta
The column headers are the same shape as Appendix C.1 so the two tables can be diffed directly:
| Program | N | vm2 P2 (µs) | vm2 P3 (µs) | Δ CPU vs P2 | Δ CPU vs P1 | vm2 P2 RSS | vm2 P3 RSS |
|---|---|---|---|---|---|---|---|
lists/fill_sum | 10 | 947 | 606 | -36.0% | -28.5% | 4.6 MB | 4.6 MB |
lists/fill_sum | 100 | 9,322 | 4,645 | -50.2% | -39.9% | 6.4 MB | 6.4 MB |
maps/fill_sum | 10 | 1,936 | 1,217 | -37.1% | -35.1% | 5.3 MB | 5.5 MB |
maps/fill_sum | 100 | 18,405 | 9,425 | -48.8% | -46.2% | 9.6 MB | 9.9 MB |
math/fact_rec | 10 | 775 | 435 | -43.9% | +26.8% | 4.6 MB | 4.5 MB |
math/fact_rec | 13 | 1,064 | 354 | -66.7% | -21.7% | 4.4 MB | 4.5 MB |
math/fib_iter | 10 | 377 | 221 | -41.4% | -3.1% | 4.5 MB | 4.4 MB |
math/fib_iter | 20 | 649 | 374 | -42.4% | -7.2% | 4.5 MB | 4.4 MB |
math/fib_rec | 15 | 161 | 63 | -60.9% | -8.7% | 4.7 MB | 4.4 MB |
math/fib_rec | 20 | 1,407 | 619 | -56.0% | -11.7% | 4.4 MB | 4.4 MB |
math/mul_loop | 10 | 345 | 159 | -53.9% | -21.3% | 4.5 MB | 4.3 MB |
math/mul_loop | 13 | 378 | 220 | -41.8% | -52.4% | 4.5 MB | 4.3 MB |
math/prime_count | 50 | 1,575 | 956 | -39.3% | -34.3% | 4.6 MB | 4.4 MB |
math/prime_count | 100 | 4,701 | 2,813 | -40.2% | -34.4% | 4.6 MB | 4.5 MB |
math/sum_loop | 1000 | 21,044 | 11,283 | -46.4% | -44.8% | 4.5 MB | 4.5 MB |
math/sum_loop | 10000 | 214,558 | 109,079 | -49.2% | -41.9% | 4.4 MB | 4.5 MB |
strings/concat_loop | 10 | 547 | 307 | -43.9% | -41.4% | 4.7 MB | 4.7 MB |
strings/concat_loop | 30 | 2,054 | 1,129 | -45.0% | -43.1% | 5.8 MB | 5.9 MB |
Interpretation:
-
Phase 2 regression on the recursion suite is gone, often over-recovered.
fib_recn=20,fib_itern=20, andfib_recn=15 all land within 12% of Phase 1;mul_loop,prime_count, andsum_loopare 20%-50% better than Phase 1, because the helpers those programs call (loop tails, recursive worker frames) never touch a container slot, so every nested popFrame is a single length-bump instead of a zeroing pass over the register window. -
fact_recn=10 is the one row still above Phase 1 (+27%). The helper there has a single-frame call shape that emit currently types as containing aTStrslot via an intermediate value; the conservative emit-time scan flipsHasContainerSlotsto true for the helper and the clear still runs. A per-block / per-live-range refinement (the bitmap form sketched in §3.6) would close this case; the value is small relative to the other rows so we have not pursued it. -
Container-heavy programs picked up the same shape.
lists/fill_sumandmaps/fill_sumimproved 36-50% over Phase 2, because their outer frame still clears but every inner integer helper (the loop body, the summing pass) is now a clear-skip.strings/concat_loopsimilarly drops -43% to -45% from Phase 2: the concat happens in a leaf frame and the helper that drives the loop is int-only.
D.2 BG suite: Phase 2 → Phase 3 delta
| Program | N | vm2 P2 (µs) | vm2 P3 (µs) | Δ vs P2 | vm2 P3 / CPython | vm2 P3 / Lua | vm2 P3 / Go | vm2 P3 RSS |
|---|---|---|---|---|---|---|---|---|
bg/binary_trees | 8 | 11,711 | 11,318 | -3.4% | 1.18x | 0.95x | 7.09x | 8.9 MB |
bg/binary_trees | 10 | 184,104 | 181,372 | -1.5% | 1.18x | 0.95x | 11.07x | 10.7 MB |
bg/nsieve | 1000 | 5,117 | 5,034 | -1.6% | 2.15x | 4.62x | 69.92x | 5.8 MB |
bg/nsieve | 10000 | 55,263 | 54,638 | -1.1% | 2.19x | 4.99x | 66.39x | 13.4 MB |
Interpretation:
The BG-suite kernels are explicitly container-heavy: nsieve walks a
TList of ints and binary_trees builds and tears down ~2M nested
list nodes per run. Their hot frames carry container slots and the
popFrame clear still runs. Phase 3 is essentially flat on this
suite, which is the expected shape: the optimization recovers the
cost where the cost was paid by frames that did not need clearing,
not where the clearing is load-bearing.
D.3 Headline takeaway
Phase 3 delivers what Appendix C.3 promised: the int-heavy recursion suite recovers (and often surpasses) its Phase 1 numbers, while the container suite preserves the Phase 2 RSS-reclamation contract. The median Phase 2 → Phase 3 CPU delta on the recursion suite is -46%; the BG suite is -2% (noise floor). vm2 is now faster than Lua on binary_trees at depth 10 again (0.95x) and within 18% of CPython, on a host where vm2 spent the entire Phase 2 effort getting back to break-even.
D.4 Phase 3c: in-place functional append
Captured with runtime/vm2/bench/list_append_test.go on the same
M4 host. The workload chains N functional OpListAppend operations
on a list whose intermediates are dead. Two configurations:
- In-place: emit sets
InstrFlagBLastUseon every step; dispatch mutates the source and reuses its*vmListpointer. - Copy: every intermediate has an extra reader so emit must clear
the bit; dispatch allocates a fresh
*vmListand copies the prior backing array on every step.
| Chain length | In-place ns/op | Copy ns/op | Δ wall-clock | In-place mallocs | Copy mallocs | Δ allocs |
|---|---|---|---|---|---|---|
| 256 | 2,183 | 51,717 | 23.7× faster | 13 | 516 | 39.7× fewer |
The in-place malloc count is O(log N) (Go's append doubling), the
copy count is O(N). TestListAppendFastPathAllocCount asserts the
copy path uses at least 2× the in-place path's mallocs, guarding the
contract against silent regression.
This is the canonical Perceus-style "fluent chain becomes one backing array" result the spec motivated in §3.5; we got there with a 1-bit encoding annotation and a single dispatcher branch, without introducing reference counts or escape analysis.
Copyright
This document is placed in the public domain.