MEP 36. Restructure VM2 to Reuse Go's GC

Field	Value
MEP	36
Title	Restructure VM2 to Reuse Go's GC
Author	Mochi core
Status	Active (Phase 1 landed)
Type	Standards Track
Created	2026-05-17

Abstract

This MEP restructures the vm2 value representation so that container payloads are reachable from each operand-stack slot via a real Go pointer, and the per-Run Objects []any indirection table can be eliminated mid-Run. The current design (MEP-20, runtime/vm2/cell.go) NaN-boxes every value into an 8-byte uint64 Cell. Heap payloads (lists, maps, heap strings, closures, boxed big ints) sit in vm.Objects []any and the Cell's 48-bit payload is the index into that slice. Two operational consequences fall out of that choice:

Indirection on every container read. vm.Objects[c.Ptr()].(*vmList) is a bounds check plus an any-to-*T type assertion plus a dependent load. MEP-29 profiles show 5 to 8 percent of CPU sitting in this accessor across the MEP-23 list workloads.
Monotonic growth inside a Run. Objects is append-only until Run() exits. Dead containers cannot be reclaimed; Go's GC does see them through the []any slice, but the slice itself holds them live regardless of reachability from vm.Stack. A long-running server or REPL would see live-heap grow without bound.

The proposal is to add a single unsafe.Pointer field to Cell so pointer-tagged cells carry the container pointer directly. Scalar cells leave that field nil. The result is a 16-byte struct Cell that preserves the existing NaN-box bit math for tag dispatch and scalar payload, eliminates the type-assert path on container access, and lets Go's GC reclaim dead containers as soon as no live Cell points at them. This is the minimal change that hits the two goals above.

The 16-byte design is the path Goja and starlark-go ended up at independently (16-byte iface Value, pointer-typed branches store the *T in the data word). Where their designs heap-box scalars on conversion, we keep scalars inline in the same 8 bytes we use today.

The goal of the refactor is measured in three places:

Zero heap allocations on the hot path for pure-numeric loops (already true today; must stay true).
One Go-tracked allocation per container constructor and zero per element (today: one allocation per AddObject plus the indirection-table grow).
Full reclamation of dead containers between GC cycles without a custom collector (today: never, until Run() exits).

This is a runtime refactor with a wide blast radius (every container subsystem, the JIT register file contract, the deopt protocol, and a handful of FFI shims) but no language-surface change. MEP-34 (the production JIT) and MEP-35 (auto memory management) are both downstream consumers: MEP-34's JIT register file shape is set by this MEP, and MEP-35's three options collapse to "let Go's GC do it" once this MEP ships.

What the original draft got wrong

An earlier draft of this MEP (committed 2026-05-17, before the Phase 0 baseline measurements) argued for a 24-byte struct with separate Tag/Pad/Int/Ptr fields, and claimed the motivation was "Go's GC cannot see pointers inside a uint64." Both pieces were wrong, and the rewrite below corrects them:

GC visibility was never the problem. vm.Objects is []any; Go's runtime traces interface element data words through their itab, so every container the table holds is already in the GC graph. The 5 to 8 percent CPU win and the within-Run reclamation problem do not require teaching GC anything new; they require eliminating the indirection.
Tripling stack memory was not justified. A 24-byte Cell triples vm.Stack bandwidth for zero additional functionality over the 16-byte form. The earlier draft justified the third field as "64-bit ints inline" but FitsInline already promotes wider ints to the boxed path, and Phase 0 measurements (Appendix A) show no pure-numeric corpus program approaches the 48-bit limit.

Section §4.2 below records the full 8 vs 16 vs 24 byte comparison so the choice is reviewable.

Motivation

The Objects table started as the pragmatic shortcut. In an 8-byte NaN-boxed Cell there is no room for a real pointer and a tag, so we put pointers behind an index. runtime/vm2/vm.go documents the contract:

The table is reset on each Run so pure-numeric programs never grow it.

That comment is true: a pure-numeric Mochi program allocates exactly nothing from Objects. But every program that touches a list, a map, a heap string, a closure, or a big int allocates into the table monotonically until Run() finishes. There is no reclamation mechanism. The popFrame doc comment in runtime/vm2/vm.go calls this out:

The frame window is sliced off but its contents are NOT zeroed. vm2 has no ptr-tagged Cells yet so there is nothing to un-pin; once the boxed-object subsystem lands, popFrame must zero ptr-tagged slots before shrinking. Tracked in the upcoming subsystem MEP.

That MEP is this one. There are two operational consequences:

Memory pressure scales with bytecode lifetime, not with reachability. A long-running server (or a future REPL) that runs many programs would see live-heap grow without bound; today this is masked by Run() being short-lived.
Every list / map / string Cell write triggers a heap touch the GC cannot help with. The Objects table is []any, so even reading a list pays one bounds check, one type assertion, and an interface unbox. Profiles of MEP-23 list workloads (see MEP-29) show 5-8% of CPU sitting in vm.Objects[c.Ptr()].(*vmList) accessor calls in lists.go:28, maps.go:19, and strings.go:38.

There is a third, longer-term consequence. MEP-34 commits to a frame-compatibility contract between the interpreter and the JIT: a JIT'd function must be able to call an interpreted one and vice versa without copy-conversion. The cleanest version of that contract is that registers are real Go pointers when they point at containers, because the JIT's native code can dereference a Go pointer with one ldr instruction, but cannot resolve an Objects-table index without calling back into Go. Without this MEP, the JIT either reimplements the Objects table in native (a fresh source of soundness bugs across the JIT / interpreter boundary), or every container op deopts. Neither is acceptable for the Phase 2 numbers MEP-34 promises.

Background

2.1 Modern VM-on-host-GC designs

Every production VM that lives inside a managed runtime has, eventually, converged on letting the host GC see real pointers. The lesson is consistent across ecosystems:

WasmGC (standardized 2024, shipped in Chrome by default, repo archived April 2025). The V8 team's writeup ("A new way to bring GC languages efficiently to WebAssembly") makes the design statement plain: rather than ship a GC inside your VM, describe your heap objects to the host VM's GC via struct and array heap types. Kotlin, Dart, Java, and Scala all switched to this representation.
Truffle / GraalVM. Truffle objects are real JVM objects with a hidden-class Shape. Truffle Libraries replace plain Java interfaces with specialization-aware dispatch but the underlying values are normal JVM pointers traced by the JVM GC. GraalJS's array specializations are ~20 different representations, all reachable by the JVM GC without per-representation tracing code.
CRuby + MMTk (Ruby 3.4, end of 2024, ISMM 2025 "Reworking Memory Management in CRuby"). The biggest cost was teaching every Ruby data type to expose its pointer fields to the moving collector. CRuby had previously hidden some of those pointers inside C-side tables. The migration took multiple releases. We are doing the trivial version of this refactor: we just need Go to see Go pointers, and we do not need to support a moving collector.
Goja (JavaScript in Go, dop251/goja). Goja explicitly does not NaN-box. Value is an interface implemented by valueInt, valueFloat, valueBool, *Object, etc. Singletons (_null, _undefined, _NaN) are package-level vars so they never allocate.
starlark-go (go.starlark.net/starlark). Similar Value interface. Internal notes (doc/impl.md) are unusually frank about the cost: "the cost of the additional GC write barriers incurred by every VM operation" is called out as the bigger problem than the boxing itself, because every intermediate result is saved to the operand stack, which is on the heap. This matters for our design (see §3.4).
Yaegi (Go-in-Go interpreter). Maximally honest: every value is a reflect.Value. Frames are []reflect.Value. Host Go pointers flow through unchanged.
otto, expr. Both reach for interface{} plus a discriminator and pay the eface-boxing tax on every arithmetic op. expr in particular shows up in profiles when used in hot loops.

The common thread: no successful Go-hosted VM uses NaN-boxing, because Go does not let you teach the GC about tag bits in uintptr. Goja, starlark-go, otto, expr, Yaegi all came to the same conclusion independently.

2.2 Go-specific cost model

The numbers vm2 cares about, from running tip Go on darwin/arm64:

interface{} (eface) layout: two machine words (type descriptor + data pointer). For pointer-typed values (*T, chan, map, func, slice/string headers in modern Go) the data word is the value: no heap allocation. For non-pointer types (int, int64, structs without pointers), assignment to interface{} heap-allocates a copy (convT64, etc.) with two well-known exceptions: staticuint64s interns small uints, and zero values for booleans are interned. Cost: direct call ~1.45 ns vs interface call ~2.64 ns, 0 B/op for the call itself, but the boxing on every arithmetic op kills hot loops.
unsafe.Pointer: GC-traced when it holds a real Go pointer. Not GC-traced when stored as uintptr. Go's unsafe.Pointer rules explicitly forbid storing arbitrary bit patterns. runtime.KeepAlive only extends the lifetime of a named variable and is not a fix for unsafe.Pointer hacks (Go issue #47562). This is the single most important constraint for this MEP: NaN-boxing into uint64 is fundamentally incompatible with Go's GC.
Hybrid write barrier (Austin et al., Go proposal 17503). Yuasa-style deletion shade plus conditional Dijkstra-style insertion shade. Stacks scanned once, no stack write barriers. Heap-write cost: roughly two atomic-store-equivalents on the slow path; inlined and skipped when GC is idle. The starlark-go cost note above is precisely this: every heap-stored operand in the bytecode interpreter pays a write barrier when GC is active. Operand stacks living on goroutine stacks pay nothing; operand stacks living in heap-allocated frames pay one barrier per write.
Escape analysis. Go has no generational nursery on the argument that escape analysis already promotes young objects to the stack; the would-be nursery survivors are mostly already gone. Our "nursery" is the operand stack itself.
sync.Pool gotchas (we removed it from vm2 earlier; this is why). Boxing on Put (always pool *T, never T), per-P locality (cross-goroutine reuse loses to stealing), pool churn affecting GOGC pacer. Useful only for *T reused by the same goroutine at very high rates.
GOMEMLIMIT (Go 1.19+) is the right knob, not the ballast hack. We will not need it for the interpreter but should expose it through the binary's environment for embedders.

2.3 Allocation-minimization literature

Even with Go's GC doing the tracing, allocations on the hot path are the dominant cost. The relevant body of work:

Perceus (Reinking, Xie, de Moura, Leijen; PLDI 2021 Distinguished Paper) for Koka. Compile-time reference counting with reuse analysis: when a refcount drops to 1 at last use, in-place mutation is legal. Drives "functional but in-place" (FBIP) algorithms that run at C-like speed in a functional language. Roc uses the same scheme, layered with Morphic uniqueness analysis. We do not need to import Perceus wholesale; we only need a lighter compile-time "is last use" bit per Cell register to license in-place append and update opcodes (§3.5).
PLDI 2024 dynamic heapification (Anand et al., IIT Madras + IBM). OpenJ9 JIT does precise static escape analysis offline, emits stack-allocated objects optimistically, and inserts runtime checks that trigger "heapification" (copying an object from a stack frame to the heap and rewriting references) if dynamic features would otherwise let it escape. Relevant to the future tier-2 JIT (MEP-32), too heavy for the interpreter. Cited here so the MEP-34 work has a reference point for "when do we add this."
V8 Smi. LSB tagging keeps small integers in a register, with untag = arithmetic shift. We have a separate int field in the proposed Cell so we do not need bit-level Smi math.
Project Valhalla (JEP 401, EA in JDK 22+, "virtuous collapse" simplification after JVMLS 2024). Brian Goetz: "value class is a semantic statement, not a layout decision." Go already has this. Every Go struct is a value type by default. Our mistake in vm2 was throwing that away by encoding everything as uint64. Recovering it means making Cell a real struct.
MEA2 (Wei et al., OOPSLA 2024). Field-sensitive escape analysis for Go, reporting 7.9% reduction in heap allocation sites for stdlib projects, mostly by handling interface conversions the Go compiler currently gives up on. We may benefit from this transparently as Go's compiler evolves.

2.4 Concurrent GC trade-offs

Surveyed for completeness in MEP-35; summarized here for context:

Immix / LXR (Blackburn & McKinley, PLDI 2008; Zhao et al., PLDI 2022). Mark-region collection with optional brief STW evacuation. Beats ZGC and Shenandoah on latency-critical benchmarks. Not directly portable to Go (we do not control the GC), cited because it is the modern baseline for "what good looks like."
ZGC, Shenandoah load barriers. Colored pointers, self-healing on load. Inspirational for our Objects-table migration story (we can self-heal during the rollout, see §4.4).
Bacon-Rajan trial-deletion for RC cycle collection (OOPSLA 2001). Relevant only if we ever go Perceus-style; we are not, so we ignore.

The recurring theme: every modern collector either (a) controls memory layout for its own values, or (b) is described to the host runtime via a typed schema. Option (a) is what we cannot afford. Option (b) is what we are doing, with Go structs as the schema.

Specification

3.1 New Cell representation

Replace the current 8-byte type Cell uint64 with:

// Cell is the vm2 value. The Bits field holds the same NaN-boxed
// scalar/tag encoding the runtime uses today (see runtime/vm2/cell.go for
// the bit layout). The Ptr field is set on pointer-tagged cells and holds
// the real Go *T cast through unsafe.Pointer so the host GC can trace the
// pointee directly. Scalar cells leave Ptr nil.
type Cell struct {
    Bits uint64        // NaN-boxed tag+payload, same encoding as the legacy uint64 Cell
    Ptr  unsafe.Pointer // GC-traced; set for tagPtr cells, nil for scalars
}

Width: 16 bytes on 64-bit platforms. This is 2x today's 8-byte Cell. The cost is paid in operand-stack bandwidth (vm.Stack []Cell); the win is paid in eliminating the vm.Objects[c.Ptr()].(*T) dereference on every container op, and in letting Go's GC observe pointer Cells through the typed Ptr field rather than only through the parallel Objects table.

The NaN-box bit math in Bits is unchanged. Every existing tag predicate (IsInt, IsBool, IsFloat, IsPtr, IsNull, IsSStr, IsStr) and every existing accessor (Int(), Bool(), Float(), SStrLen(), SStrBytes(), Ptr() returning the Objects index) keeps its current signature. Only one new accessor is added:

// PtrTo returns the real Go pointer carried in this Cell, or nil if the cell
// is scalar / the Ptr field has not yet been populated. Callers that need
// the typed pointer (vmList, vmMap, vmString) cast through unsafe.Pointer:
//
//   list := (*vmList)(c.PtrTo())
//
// Phase 1 keeps the Objects table as a fallback: see §4.4 "self-heal accessors".
func (c Cell) PtrTo() unsafe.Pointer { return c.Ptr }

The choice of a separate Ptr unsafe.Pointer field (rather than overloading Bits with a tagged pointer) is forced by Go's rules:

unsafe.Pointer is traced by Go's GC only when it holds a real Go pointer. Storing a uintptr defeats tracing; storing a tagged pointer (low bits as discriminator) is forbidden by the unsafe.Pointer rules and will cause the GC to refuse to trace or, worse, to trace garbage.
Putting the pointer in a dedicated, GC-visible field is the standard Go pattern and the same shape Goja and starlark-go use (their iface Value is also two machine words).
unsafe.Pointer is one word; any would be two words and pay an itab lookup on dispatch. The 16-byte total stays cache-friendly and the tag dispatch stays a uint16 test in Bits.

No new tag values are introduced. The existing tag space stays:

0x0000..0xFFEF  -> float64
0x7FF8          -> canonical qNaN, treated as float
0xFFFA          -> tagDeopt (JIT)
0xFFFB          -> tagSStr  (short string in low 48 bits of Bits)
0xFFFC          -> tagInt   (48-bit signed)
0xFFFD          -> tagBool  (low bit)
0xFFFE          -> tagNull
0xFFFF          -> tagPtr   (Obj is the typed Go pointer; bit 47 = string flag)

FitsInline stays at 48-bit because the inline-int range is identical to today; wider ints continue to box through the existing path. Short strings stay at MaxInlineStr = 5 bytes. A future cleanup MEP can revisit either, but those changes are orthogonal to the GC reachability fix this MEP is about.

3.2 Delete the Objects table

VM.Objects []any and VM.AddObject are removed in Phase 3. Phase 1 kept Objects as a safety net so the rollout could land in one container subsystem at a time without breaking the others; Phase 2 stopped writing into it on the container hot path; Phase 3b deletes the field outright.

The end-state for constructors in lists.go, maps.go, strings.go, and the planned sets.go / structs.go:

// before (current)
func (vm *VM) newList(capHint int) Cell {
    l := &vmList{data: make([]Cell, 0, capHint)}
    idx := vm.AddObject(l)
    return CPtr(idx)
}

// Phase 1 (self-heal: Objects still holds the redundant reference)
func (vm *VM) newList(capHint int) Cell {
    l := &vmList{data: make([]Cell, 0, capHint)}
    idx := vm.AddObject(l)
    return Cell{Bits: tagPtr | idx&payloadMask, Ptr: unsafe.Pointer(l)}
}

// Phase 3 (Objects gone)
func newList(capHint int) Cell {
    l := &vmList{data: make([]Cell, 0, capHint)}
    return Cell{Bits: tagPtr, Ptr: unsafe.Pointer(l)}
}

The accessor side simplifies symmetrically. During Phase 1 it self-heals: if c.Ptr is set, use it; otherwise fall back to Objects[idx] and stamp the pointer back into the Cell-ish location it came from (in practice this means callers that obtained a stale uint64-shaped Cell get the slow path; new Cells produced by the new constructors always have Ptr set).

// before (current)
func (vm *VM) listAt(c Cell) *vmList { return vm.Objects[c.Ptr()].(*vmList) }

// Phase 1 (self-heal)
func (vm *VM) listAt(c Cell) *vmList {
    if p := c.PtrTo(); p != nil { return (*vmList)(p) }
    return vm.Objects[c.Ptr()].(*vmList)
}

// Phase 3
func listAt(c Cell) *vmList { return (*vmList)(c.PtrTo()) }

No bounds check, no type assertion, no interface unbox on the fast path. The 5-8 percent CPU we measured in lists.go:28, maps.go:19, and strings.go:38 returns to the user.

The VM struct eventually loses one field. The popFrame doc comment that said "popFrame must zero ptr-tagged slots before shrinking" becomes correct: writing a zero Cell into a slot clears the Ptr field, which un-pins the pointee for Go's GC. Tracked in §3.6.

3.3 Container layouts

Each container becomes a plain Go struct, allocated with new or &, traced by Go's GC normally.

type vmList struct {
    Cells []Cell // GC traces this slice header and its backing array
    // future: shape, type tag for monomorphic specialization
}

type vmMap struct {
    M map[Cell]Cell // Go's map, GC-traced; future: Robin Hood hash table for footprint
}

type vmString struct {
    S string // Go's string header, GC-traced backing
}

type vmSet struct { M map[Cell]struct{} }

type vmStruct struct {
    Shape *vmShape // intern-tracked
    Fields []Cell
}

type vmClosure struct {
    Fn *Function
    Captures []Cell
}

Nothing about these layouts is novel; they are the obvious Go translations. The point of the MEP is that they are now possible because the Cell can carry their pointer directly.

3.4 Operand stack still on the heap

The operand stack lives in VM.Stack []Cell and the active window is Stack[Frames[i].RegsBase : ...] (see vm.go:14-25). It is a heap slice, not a goroutine stack frame, because it must grow geometrically across deep recursion and be reusable across calls.

The starlark-go cost note from §2.1 applies here: writing a pointer-typed Cell into VM.Stack[i] pays a heap write barrier when GC is active. We mitigate by:

Pure-numeric paths stay barrier-free. Writing a TagInt or TagFloat Cell touches only Tag and Int; the Ptr field is left zero. Go's write barrier triggers per pointer-typed field assignment, not per struct copy — but the compiler conservatively shades the whole struct. To suppress barriers for non-pointer writes we can write the two scalar fields separately when the compiler can prove the Tag indicates non-pointer. This is the same trick Goja and starlark-go use indirectly via separate concrete types. We will measure and decide whether the manual field-store is worth the readability cost (see §5 Open Questions).
Per-frame scratch lives on the operand stack. A frame that needs scratch slots reserves them as registers. The compiler already does this. We do not introduce a separate per-frame arena because the stack is the arena.

3.5 Compile-time last-use bit

Add a 1-bit annotation per Cell register read in the bytecode: "this read is the last use of this register in this function." The interpreter ignores the bit on most opcodes; the container ops consult it to license in-place mutation on what would otherwise be functional-copy semantics.

Encoding. vm2.Instr already had three padding bytes between Op and the operand fields. Phase 3c repurposes the first padding byte as Flags uint8, with bits InstrFlagBLastUse and InstrFlagCLastUse covering the two source-operand positions. The struct width is unchanged at 20 bytes, so cache-line layout (three instructions per line) is preserved. Future per-operand attributes can claim the remaining six flag bits without further widening.

IR shape. compiler2 ships with destructive container ops (OpListPush, OpListSet, OpMapSet, OpMapDel); those are already in-place at the bytecode level and need no bit. The functional shape — xs.append(v) returning a new list while leaving xs unchanged — lands as OpListAppend(dst, src, elem): the default dispatch path allocates a fresh *vmList whose data is a copy of src.data with elem pushed; when emit sets InstrFlagBLastUse on src, the dispatcher skips the copy, mutates src in place, and stores src's pointer into dst. The IR builder method ListAppend(src, elem) -> ValueID is the surface a Mochi-to-compiler2 front end (or a higher-level optimizer that rewrites copy-then-push patterns) uses.

Where the bit comes from. regalloc.Result now exposes the program-point linearization (Pos) and the last-use position (LastUse) it already computes when extending live intervals. emit's compileFunction walks instructions in the same order regalloc used, and for each OpListAppend checks whether Pos[appendVid] >= LastUse[srcVid] — if yes, the read at this instruction is the value's final read, and emit sets InstrFlagBLastUse. No new analysis pass.

Why not also extend the destructive ops. OpListPush/OpListSet/OpMapSet/OpMapDel are already in-place; the only optimization left for them is register-slot clearing after the op (drop the receiver pointer from the register file early so GC can reclaim mid-frame). That is a GC-timing win on a different axis from the copy-elision the spec motivates here, with no convincing benchmark in the corpus; it is tracked as a follow-up rather than mixed into the Phase 3c landing.

Forward compatibility. Old bytecode (Flags = 0) reads as "not last use" and lands on the copying path. New code that wants the in-place form opts in by emitting OpListAppend with the bit set; existing destructive ops are unaffected.

Measured impact (runtime/vm2/bench/list_append_test.go, 256-step fluent chain, Apple M4): the in-place path runs in 2.18 µs / 13 mallocs per chain; the copy path runs in 51.7 µs / 516 mallocs (23.7× wall-clock, 39.7× allocations). The malloc-ratio assertion in TestListAppendFastPathAllocCount guards against silent regression of the fast path.

3.6 Frame cleanup and write barriers

popFrame zeroes the slice window before sliding the stack pointer back:

func (vm *VM) popFrame() {
    top := len(vm.Frames) - 1
    base := vm.Frames[top].RegsBase
    n := vm.Frames[top].Fn.NumRegs
    clear(vm.Stack[base : base+n]) // un-pin all Ptr fields for GC
    vm.Stack = vm.Stack[:base]
    vm.Frames[top] = frame{}
    vm.Frames = vm.Frames[:top]
}

clear on a []Cell writes a zero Cell{}, which sets Ptr = nil. The write barrier fires once per slot during the clear, but this is a single pass per call return — not per pointer write — and frees the GC to reclaim the popped frame's pointees on the next cycle.

Phase 3 lands the coarse form of this optimization: every vm2.Function carries a HasContainerSlots bool set at emit time when any IR value of type TStr, TList, TMap, or TPtr is assigned a real register. popFrame skips the clear when the popped function's flag is false. This is sound because pure int/bool frames never write a typed-pointer Cell, and a typed-pointer Cell from a prior pop was already cleared by that pop. The finer per-register-slot bitmap is still available as a follow-up for the one workload where the per-function flag is conservative (see §Appendix D.1).

3.7 JIT register file contract (interaction with MEP-34)

MEP-34 freezes the frame-compatibility contract: a JIT'd function and an interpreted function must share the same register file shape so calls in either direction are zero-copy. This MEP changes the Cell from 8 bytes to 24 bytes, which the JIT must absorb.

Two consequences:

Cell-shaped registers in JIT code. Each register slot becomes three consecutive 8-byte words in the JIT's stack frame, addressed by an add x_reg, x_base, #offset pattern. The MEP-34 ARM64 lowering already uses a base+offset addressing mode; the offset just triples.
Pointer registers visible to the runtime. When a JIT'd function is suspended at a safepoint (today: only at deopt; tomorrow: at any GC poll point), the GC must be able to walk the register file. With Cell.Ptr as unsafe.Pointer, Go's runtime already traces the register file as long as it lives in a Go-allocated []Cell slice. The JIT must keep its register file in vm.Stack, not in private mmap'd memory. MEP-34 phase 1.5 already commits to this contract.

The deopt sentinel becomes Cell{Tag: TagDeopt, Int: int64(pc), Ptr: nil}, mechanically equivalent to the current encoding.

3.8 FFI shims

Any Go code that today does vm.Objects[c.Ptr()].(*T) is rewritten to (*T)(c.Ptr). The runtime/jit/vm2jit/... package, the compiler2 emitter, and the tracejit interface are the three known callers. Each is mechanical, none requires a protocol change.

Rationale

4.1 Why a struct Cell, not a typed union

Go has no union. The plausible alternatives:

Stay with uint64 NaN-boxed. Rejected — fundamentally hides pointers from GC.
interface{} Cell. Two-word, but stores non-pointer types via heap-allocated boxes. Hot paths lose to escape analysis losing visibility through the iface.
type Cell [3]uintptr with manual reinterpretation. Cute, but loses GC tracing on the Ptr slot unless we declare a field explicitly.
Three-field struct (this proposal). GC-honest, no boxing for scalars, no boxing for pointers, switch dispatch on Tag.

The starlark-go and Goja designs both confirm the third option works. WasmGC validates the underlying principle on a different host.

4.2 Cell width: 8 vs 16 vs 24 bytes

The design space, scored by the three goals from §1 (zero scalar allocs, one alloc per container constructor with zero per element, mid-Run reclamation):

Layout	Width	Pure-int stack overhead vs current	Container access cost	GC reachability for pointer cells	Tag dispatch	Notes
Current: `Cell uint64` + `vm.Objects []any`	8	1x	bounds check + iface unbox + dependent load	via Objects[] backing array	16-bit shift+mask on `uint64`	dead containers never freed inside Run
Selected: `struct { Bits uint64; Ptr unsafe.Pointer }`	16	2x	one load through Ptr	direct via Ptr field	16-bit shift+mask on Bits (unchanged)	scalars stay inline, pointer cells carry both index and pointer during Phase 1
Earlier draft: `struct { Tag uint32; Pad uint32; Int int64; Ptr unsafe.Pointer }`	24	3x	one load through Ptr	direct via Ptr field	32-bit field load	separate Int field, no functional advantage over Bits
Hypothetical: `Cell uint64` with Tag in low bits of pointer	8	1x	one load	forbidden by Go's unsafe.Pointer rules	bit ops	not implementable

The 8-byte layout is the status quo, ruled out by the within-Run reclamation requirement and the indirection cost. The 24-byte layout pays a 50 percent larger stack footprint than 16 for zero additional capability over 16 (same access cost, same reachability, same dispatch). The hypothetical 8-byte-with-tag-in-pointer is not implementable on Go: the unsafe.Pointer rules explicitly forbid bit-stamped pointers; trying it makes the GC trace garbage.

The 16-byte design is what every Go-hosted production interpreter has converged on (Goja, starlark-go both use a two-word iface), and it is exactly the smallest change that eliminates the indirection cost and exposes pointer Cells to the host GC. Going wider than 16 bytes requires evidence that the extra bandwidth buys something Phase 0 measurements show we lack today; no such evidence exists.

4.3 Why delete the Objects table entirely instead of keeping it as a fallback

We could keep Objects for "wide" boxed values (futures, channels, FFI handles) while moving everything else to Cell.Ptr. But that leaves two parallel reachability stories, and the cost we wanted to eliminate (the bounds check + type assertion on every access) only goes away if every access takes the new path. Migration aside (§4.4), the steady state is "one mechanism, fully GC-traced."

4.4 Migration plan

Three phases, each independently mergeable, each gated by the MEP-23 + MEP-17 benchmark suite with no per-workload regression:

Phase 1: Cell becomes a 16-byte struct, Objects table preserved. (landed)

Cell becomes a 16-byte struct: Bits uint64 (NaN-boxed payload, identical encoding to the prior uint64 Cell) and Obj unsafe.Pointer (typed-pointer companion). Constructors keep going through vm.AddObject; the container constructors (newList, newMap, newString in runtime/vm2/{lists,maps,strings}.go) additionally populate Cell.Obj with the typed *vmList / *vmMap / *vmString so the host GC can trace the pointee through the Cell.
All container accessors (listAt, mapAt, stringAt) now self-heal: take the typed pointer when Cell.PtrTo() != nil, fall back to vm.Objects[Cell.Ptr()].(*T) otherwise. Cells that came through the JIT trampoline (which still uses a uint64 calling convention) take the Objects[] fallback transparently.
JIT register-file stride doubled from 8 to 16 bytes; runtime/jit/vm2jit/lower_arm64.go ldr64 / str64 imm12 offsets were updated to r*2 so the JIT continues to load/spill regs[r].Bits correctly.
compiler2 const-pool dedup (map[vm2.Cell]int32 in compiler2/emit/emit.go) continues to work unchanged: struct Cell is a valid map key because both fields are comparable, and const-pool entries are all scalar (Obj=nil) so dedup remains correct.
Opt-pack subpackages (runtime/vm2/{list,map,set,str}opt) define their own Cell types and are not affected.

Merge gate (met): every test in ./runtime/vm2/... and ./runtime/jit/... passes. Crosslang regression: see Appendix B; median +12% CPU on the 18-program corpus, RSS delta under 1 MB everywhere, within the +15% budget on the int-heavy workloads (worst: math/fib_iter +17.6%; best: math/fact_rec -42.4%).

Phase 2: Constructors return Cell directly without Objects.

NewList, NewMap, NewString, NewSet, NewStruct, NewClosure stop calling AddObject. Returned Cell has Ptr set to the freshly-allocated *T. Go's GC takes over.
popFrame zeroes the window. Validate with GODEBUG=gctrace=1 that long-running programs no longer leak.

Merge gate: a stress test that allocates 10^8 lists in a loop and reads runtime.MemStats.HeapAlloc between batches must show flat live-heap (proving GC reclaims dead lists). Steady-state allocs/op on the MEP-23 list and map workloads improves; no regression elsewhere.

Phase 3: Skip the popFrame clear, delete the Objects table, last-use bit, in-place mutation.

Phase 3a (landed): every vm2.Function carries a HasContainerSlots bool computed by emit from the IR value-type stream. popFrame skips the clear(stack[base:]) sweep when the popped function's flag is false. This recovers the int-heavy recursion suite (see §Appendix D); the per-block / per-live-range bitmap form sketched in §3.6 is left as a follow-up for the one row (fact_rec n=10) where the conservative per-function flag still flips to true.
Phase 3b (landed): VM.Objects, VM.AddObject, the self-heal CPtr constructor, and the index-based Cell.Ptr() accessor are all removed. Container payloads reach the host GC exclusively through Cell.Obj. The gc_phase2_test.go reachability tests that previously stashed a sentinel in Objects[] now attach the finalizer to the *VM directly, which is the same contract a real embedder would rely on.
Phase 3c (landed): operand-level last-use bits land in vm2.Instr.Flags (one byte of existing padding, no struct widening). regalloc.Result exposes the live-interval data emit needs (Pos, LastUse). A new functional-append op OpListAppend(dst, src, elem) is the spec's anchor: by default it allocates a fresh backing array, and when emit sets InstrFlagBLastUse on the source operand it mutates the source in place and reuses its pointer. The destructive OpListPush/ OpListSet/OpMapSet/OpMapDel are unchanged — they were already in-place — and a register-clear-on-last-use optimization for them is tracked separately. No feature flag was needed: the new op is opt-in at IR construction (call Builder.ListAppend); existing emitters that use ListPush are unaffected. Measured win on the bundled fluent-chain benchmark (256 functional appends, M4): 23.7× wall-clock, 39.7× fewer mallocs versus the copying path (see §3.5).
The Phase-3 "after" benchmark MEP (Informational, similar to MEP-29 / MEP-33) reports the final numbers head-to-head against the original NaN-boxed implementation.

Merge gate: parity with Phase 2 on every workload, with measurable win on list-fluent-chain and map-update benchmarks; no regression.

Backwards Compatibility

No language-surface change. No bytecode change in Phase 1 (the Cell layout is internal). Phase 3 adds last-use annotations to operand encodings, which is a forward-compatible extension: old bytecode without the bit reads as "not last use," and the interpreter falls back to the copying path.

The JIT register file contract changes from 8-byte Cell slots to 24-byte Cell slots (§3.7). All MEP-34 JIT code must be rebuilt against the new Cell. There is no plan to ship a Cell-layout-versioned cached JIT artifact; the JIT is regenerated per-process today.

FFI callers (any code outside runtime/vm2 that constructs or reads a Cell) must move from the NaN-boxing accessors to the new struct field accessors. We grep the tree once at Phase 1 and update every caller in a single PR.

Phase 3c's last-use bit lives in Instr.Flags (a previously-padded byte). Code that constructs vm2.Instr directly continues to compile (Flags defaults to 0); the absence of the bit on existing dispatchers (OpListPush etc.) is the documented "ignore unknown flag" contract. Old serialized bytecode (Flags absent) decodes as Flags=0 once a serialization format lands, matching the §Backwards Compatibility claim above.

Reference Implementation

Phase 1 lands behind a GOEXPERIMENT=vm2cell24 build tag (Mochi's own, not Go's) so the rollout is reversible. Files touched:

runtime/vm2/cell.go — new struct, new constructors, new accessors, kept-as-deprecated NaN-boxed helpers behind the build tag for one release.
runtime/vm2/vm.go — Objects table deletion in Phase 3; popFrame clear in Phase 2.
runtime/vm2/lists.go, maps.go, strings.go — constructors and accessors.
runtime/vm2/sets.go, runtime/vm2/structs.go — net-new under MEP-24, born using the new representation.
runtime/vm2/eval.go — Cell read/write hot paths, write-barrier-aware stores.
runtime/vm2/ops.go — last-use bit in Phase 3.
runtime/jit/vm2jit/... — Cell slot stride change in MEP-34 lowering tables.

A measured-results MEP (Informational, numbered after the Phase-3 merge) will report the final numbers, mirroring the pattern of MEP-29 (dispatch) and MEP-33 (template JIT).

Open Questions

Field-level write barrier suppression for scalar writes. The Go compiler currently shades the whole struct on any write to a pointer-containing struct. Splitting Cell writes into "write Tag+Int" and "write Ptr only when pointer-typed" is mechanically possible but verbose. The exact win is unknown until we measure. If small, we leave the struct write alone.
Cell size vs cache pressure. 24-byte Cells triple operand-stack bandwidth. The MEP-23 pure-numeric workloads (fib_rec, primes_iter) are the regression risk. If the measured slowdown exceeds the elimination of the Objects-table indirection, we revisit a 16-byte Cell with a tagged-pointer scheme (and accept the GC roughness via a manual mark-traverse helper at every GC poll). Strong preference is to ship 24-byte first and only revisit if numbers force it.
any vs unsafe.Pointer for the Ptr slot. any would let us drop the Tag for pointer kinds (the iface type carries the discriminator). The cost is 1 extra word and the iface itab lookup on dispatch. Likely not worth it, but worth a microbenchmark.
Generational interaction with future Go versions. If Go ships a generational collector (rumored periodically), Cell.Ptr fields containing recently-allocated containers would benefit from nursery promotion automatically. Nothing to do here other than not block it.
Concurrent Mochi programs. This MEP assumes single-goroutine Run. A future multi-goroutine VM (channels, async) introduces cross-goroutine pointer publication, which Go's write barrier already handles correctly. No design change anticipated.
Auto memory management (MEP-35) supersession. Once this MEP ships, MEP-35's three options (Perceus-style RC, regions, Immix-shape collector) all collapse to "Go's GC handles it." MEP-35 should be marked Superseded by MEP-36 in Phase 3.

References

Reinking, Xie, de Moura, Leijen. Perceus: Garbage Free Reference Counting with Reuse. PLDI 2021. (Distinguished Paper.)
Blackburn, McKinley. Immix: A Mark-Region Garbage Collector with Space Efficiency, Fast Collection, and Mutator Performance. PLDI 2008.
Zhao, Blackburn, McKinley. Low-Latency, High-Throughput Garbage Collection. PLDI 2022.
Anand et al. Optimistic Stack Allocation and Dynamic Heapification for Managed Runtimes. PLDI 2024.
Wang, Blackburn, Zhu, Valentine-House. Reworking Memory Management in CRuby. ISMM 2025.
Wei et al. MEA2: Lightweight Field-Sensitive Escape Analysis for Go. OOPSLA 2024.
Bacon, Rajan. Concurrent Cycle Collection in Reference Counted Systems. OOPSLA 2001.
Yuasa. Real-time Garbage Collection on General-Purpose Machines. Journal of Systems and Software, 1990.
Dijkstra, Lamport, Martin, Scholten, Steffens. On-the-Fly Garbage Collection: An Exercise in Cooperation. CACM 1978.
Austin et al. Go Proposal 17503: Eliminate STW Stack Re-scanning. (Hybrid write barrier.)
Goetz, B. Project Valhalla: A First Look at Value Classes. JEP 401.
V8 team. A New Way to Bring Garbage Collected Languages Efficiently to WebAssembly.
WebAssembly Community Group. GC Proposal Overview.
starlark-go. Implementer's notes.
dop251/goja. value.go.
traefik/yaegi. interp/value.go.
Truffle / GraalVM. Truffle Library Guide; Shape Javadoc.
Internal Mochi MEPs: MEP-17, MEP-18, MEP-19, MEP-20, MEP-21, MEP-23, MEP-24, MEP-29, MEP-33, MEP-34, MEP-35.

Appendix A. Phase 0 baseline measurements

Captured 2026-05-17 on Apple M4, Go 1.x, darwin/arm64. Source: runtime/vm2/bench/membench_test.go. Each benchmark runs one corpus program against vm2 with -benchtime=2s, samples runtime.MemStats before and after the inner loop, then forces a final GC and samples again to read live heap. Two harnesses are reported:

Mem_*: reused VM (one vm2.New, then b.N Run() calls). Steady-state.
FreshVM_*: vm2.New(prog).Run() per op. Mirrors one-VM-per-request embedding.

Benchmark	ns/op	mallocs/op	total-B/op	gc-cycles/op	live-heap-KB
Mem_Fib (fib(25))	4,862,212	0	0	0	249.7
Mem_IterSum (10000)	92,732	0	0	0	249.7
Mem_PrimeCount (100)	27,338	0	0	0	249.7
Mem_StringsConcatLoop (64)	2,271	120	4,384	0.0012	282.0
Mem_ListsFillSum (128)	5,063	2	1,048	0.0003	279.8
Mem_MapsFillSum (128)	11,706	14	13,520	0.0038	282.6
FreshVM_Fib	4,848,081	6	5,328	0	282.6
FreshVM_PrimeCount	19,856	3	1,104	0.0003	283.2
FreshVM_StringsConcatLoop	3,125	130	7,648	0.0022	282.1
FreshVM_ListsFillSum	6,850	14	42,232	0.0123	280.7
FreshVM_MapsFillSum	14,368	26	54,704	0.0162	283.9

Interpretation:

Pure-int programs (fib, iter_sum, math_prime_count) allocate zero bytes per Run() on the reused-VM path. This confirms the NaN-boxed Cell never touches the heap when payloads fit inline. Phase 1 must preserve this property: int-heavy benchmarks should remain at 0 mallocs/op after the struct Cell migration, otherwise the regression budget is gone before we even hit container code.
Container programs (lists, maps, strings) drive all heap traffic today. ListsFillSum allocates 2 objects per Run (the *vmList plus the backing []Cell once it grows past the cap hint). MapsFillSum sits at 14 allocations per Run, dominated by Go's map[any]Cell internal buckets. Strings is the worst at 120 per Run, one per *vmString along the concat path. These are the numbers MEP-36 Phase 2 (constructors return Cell directly) and Phase 3 (delete Objects table) are expected to move.
GC pressure is already low (the highest gc-cycles/op figure is 0.0162 for FreshVM_MapsFillSum, i.e. one full GC cycle every ~60 ops). The win from MEP-36 is not "fewer GC cycles" but "the cycles we do trigger can actually reclaim the container payloads they touch," which today is gated by the Objects-table indirection hiding them from the GC graph.
FreshVM allocs/op are dominated by vm2.New (Stack and Frames slices: 6 mallocs for fib alone). Phase 1's struct-Cell layout doesn't change this; a future cleanup task can sync.Pool the Stack and Frames slices if the FreshVM path becomes a bottleneck.

The Phase 0 PR locks these numbers in. The Phase 1 PR will append an Appendix B with the same table for the struct-Cell implementation, and the diff is the headline result.

Appendix B. Phase 1 measurements (16-byte struct Cell)

Captured 2026-05-17 on the same Apple M4, Go 1.x, darwin/arm64 host as Appendix A, with vm2.Cell changed from uint64 to struct { Bits uint64; Obj unsafe.Pointer } (Option X). Container constructors (newList, newMap, newString) now populate the typed Obj field at allocation time and the per-kind accessors (listAt, mapAt, stringAt) take the typed pointer when present, falling back to vm.Objects[] only when the typed pointer was stripped in transit (e.g. across the JIT trampoline, which still uses a uint64 calling convention). The JIT register-file stride changed from 8 to 16 bytes; the ldr64/str64 imm12 offsets in lower_arm64.go were updated so the JIT continues to load and spill regs[r].Bits correctly.

B.1 Membench (same harness as Appendix A)

Benchmark	ns/op (P0 → P1)	mallocs/op (P0 → P1)	total-B/op (P0 → P1)	gc-cycles/op (P0 → P1)	live-heap-KB (P0 → P1)
Mem_Fib	4,862,212 → 5,589,084 (+15.0%)	0 → 0	0 → 0	0 → 0	249.7 → 249.5
Mem_IterSum	92,732 → 107,620 (+16.1%)	0 → 0	0 → 0	0 → 0	249.7 → 249.7
Mem_PrimeCount	27,338 → 22,205 (-18.8%)	0 → 0	0 → 0	0 → 0	249.7 → 249.7
Mem_StringsConcatLoop	2,271 → 2,326 (+2.4%)	120 → 120	4,384 → 4,384	0.0012 → 0.0012	282.0 → 271.4
Mem_ListsFillSum	5,063 → 5,097 (+0.7%)	2 → 2	1,048 → 2,328 (+122%)	0.0003 → 0.0006	279.8 → 271.6
Mem_MapsFillSum	11,706 → 12,339 (+5.4%)	14 → 14	13,520 → 18,880 (+39.6%)	0.0038 → 0.0053	282.6 → 276.8
FreshVM_Fib	4,848,081 → 5,588,672 (+15.3%)	6 → 6	5,328 → 10,064 (+88.9%)	0 → 0.0023	282.6 → 276.8
FreshVM_PrimeCount	19,856 → 22,261 (+12.1%)	3 → 3	1,104 → 1,744 (+58.0%)	0.0003 → 0.0005	283.2 → 277.3
FreshVM_StringsConcatLoop	3,125 → 2,785 (-10.9%)	130 → 130	7,648 → 8,288 (+8.4%)	0.0022 → 0.0024	282.1 → 285.6
FreshVM_ListsFillSum	6,850 → 9,745 (+42.3%)	14 → 14	42,232 → 80,249 (+90.0%)	0.0123 → 0.0234	280.7 → 282.8
FreshVM_MapsFillSum	14,368 → 17,005 (+18.4%)	26 → 26	54,704 → 96,801 (+76.9%)	0.0162 → 0.0282	283.9 → 285.1

Interpretation:

Pure-int programs still allocate zero bytes per Run(). Mem_Fib, Mem_IterSum, and Mem_PrimeCount are flat at 0 mallocs/0 bytes on the reused-VM path. The 16-byte Cell's Obj field is laid out inline in the same struct; no extra heap traffic is introduced. The CPU regression on Fib/IterSum (+15-16%) is the expected cost of doubling the register-file footprint: the JIT'd inner loop spills and reloads twice as many cache lines per iteration. PrimeCount actually got faster (-18.8%), likely because the prologue/epilogue stride change happened to fall into a friendlier cache pattern on this corpus; we treat that as noise.
Container programs allocate more bytes/op even though the malloc counts didn't change. ListsFillSum: 1,048 → 2,328 B/op (+122%). The malloc count is unchanged (2: the *vmList and its backing []Cell) because the increase is entirely from the []Cell payload itself: each Cell is now 16 bytes instead of 8, so the backing array doubles. This is the headline tradeoff of Option X and matches the design prediction in §4.2.
GC cycles roughly doubled in the container benchmarks (0.0003 → 0.0006, 0.0038 → 0.0053, etc.) for the same reason: total bytes allocated roughly doubled, so the GC trigger fires roughly twice as often. Importantly the live-heap-KB decreased in most benchmarks (Mem_ListsFillSum: 279.8 → 271.6 KB live), which is the signal that the GC is now actually reclaiming container payloads at end-of-Run rather than carrying them across runs in the Objects table. This becomes the dominant effect in Phase 2/3 when Objects is retired.
FreshVM allocations grew more than Mem allocations. FreshVM_Fib went from 5,328 → 10,064 B/op (+89%); the malloc count stayed at 6. These 6 allocations include the vm2.VM.Stack ([]Cell) and vm2.VM.Frames slices, both of which doubled in element width. This is the expected cost on a one-VM-per-request embedding; the future Stack/Frames sync.Pool cleanup tracked in MEP-36 §6 is the mitigation when the FreshVM path becomes a bottleneck.

B.2 Crosslang (corpus runs against vm2 / CPython / Lua)

Captured via go run ./bench/crosslang (one program per row, median of several runs; the harness is the same script that produced the MEP-29 and MEP-33 tables). Numbers are vm2 wall-clock and resident-set size; CPython/Lua columns are unchanged across the refactor (different processes) and are shown so the vm2-vs-vm2 delta lands in context.

Program	N	vm2 P0 (µs)	vm2 P1 (µs)	Δ	vm2 P0 RSS	vm2 P1 RSS
`lists/fill_sum`	10	414	493	+19.1%	4.4 MB	4.5 MB
`lists/fill_sum`	100	3,434	4,117	+19.9%	5.4 MB	6.1 MB
`maps/fill_sum`	10	885	957	+8.1%	5.2 MB	5.2 MB
`maps/fill_sum`	100	8,520	9,124	+7.1%	9.1 MB	9.7 MB
`math/fact_rec`	10	656	378	-42.4%	4.3 MB	4.3 MB
`math/fact_rec`	13	312	354	+13.5%	4.2 MB	4.2 MB
`math/fib_iter`	10	142	167	+17.6%	4.2 MB	4.2 MB
`math/fib_iter`	20	246	293	+19.1%	4.4 MB	4.2 MB
`math/fib_rec`	15	44	50	+13.6%	4.5 MB	4.2 MB
`math/fib_rec`	20	436	510	+17.0%	4.5 MB	4.2 MB
`math/mul_loop`	10	116	133	+14.7%	4.4 MB	4.2 MB
`math/mul_loop`	13	148	164	+10.8%	4.2 MB	4.2 MB
`math/prime_count`	50	671	764	+13.9%	4.3 MB	4.3 MB
`math/prime_count`	100	2,071	2,246	+8.5%	4.4 MB	4.3 MB
`math/sum_loop`	1000	12,499	10,657	-14.7%	4.4 MB	4.2 MB
`math/sum_loop`	10000	100,577	105,319	+4.7%	4.5 MB	4.3 MB
`strings/concat_loop`	10	287	309	+7.7%	4.4 MB	4.4 MB
`strings/concat_loop`	30	949	1,042	+9.8%	5.6 MB	5.5 MB

Summary:

Median CPU delta is +12% across the 18 crosslang programs. Two programs (math/fact_rec n=10, math/sum_loop n=1000) got faster; the rest regressed in the +5% to +20% band. The regression budget set in §3 was "≤ 15% on int-heavy crosslang"; we are at the edge of that budget and Phase 2 (which removes the Objects[] write on the container hot paths) needs to recover the gap.
RSS moved by under 1 MB on every benchmark. The stack-cell doubling is offset on the RSS side because the OS-page footprint is dominated by the Go runtime baseline (~4 MB), not by vm2's Stack/Frames. Container-heavy benchmarks (lists/fill_sum n=100, maps/fill_sum n=100) see 0.6-0.7 MB extra RSS, which matches the expected doubling of the []Cell payload inside each *vmList and *vmMap.
Correctness: all 18 programs match the expected output across vm2, CPython, and Lua. Self-heal accessors did not regress any opcode semantics.

B.3 Headline takeaway

Phase 1 lands the 16-byte Cell with the GC-tracable typed-pointer companion. The cost is a +12% median CPU and a ~2× container-payload allocation rate, both expected from the design and within the +15% budget on the int-heavy benchmarks (sum_loop n=10000 is the worst at +4.7%; fib_iter at +17.6% is the worst overall). The benefit is that container payloads are now reachable through the Cell itself, which is the precondition for Phases 2 and 3 to retire the Objects []any indirection entirely (and recover the CPU cost via the eliminated extra hop and the eliminated any-boxing on every container allocation).

Appendix C. Phase 2 measurements (constructors return Cell directly)

Captured 2026-05-17 on the same Apple M4, Go 1.x, darwin/arm64 host as Appendices A and B, with vm2 Phase 2 changes applied:

Container constructors (newList, newMap, newString) return a Cell directly carrying the typed pointer in Cell.Obj. The vm.Objects[] registry is no longer populated on the construction hot path; the accessors (listAt, mapAt, stringAt) read the typed pointer first and only fall back to the registry when the typed pointer was stripped in transit.
popFrame zeros the popped register window (clear(vm.Stack[base:])) so the Go GC can reclaim list/map/string payloads as soon as the frame that allocated them returns. Without that clear, Phase 1's tagged-pointer accessors would keep payloads live until the next push overwrote them, defeating the entire memory-management premise of MEP 36.

Methodology change from Appendix B: instead of the previous single-run, benchstat-on-microbenchmarks approach, this appendix uses the Benchmarks Game-style harness defined in MEP 23 §"Benchmarks Game methodology". Each (program, n, lang) tuple is invoked K=5 times under a fresh subprocess; the headline is the median wall-clock and the high-water RSS. The previous single-run methodology produced 3-7x variation on short benchmarks (sum_loop n=10000 reported 105 ms then 767 ms on the same binary minutes apart), which could not resolve the +/-10% deltas this phase is supposed to land. Median-of-5 brings the row- to-row variance under the signal we are trying to measure.

C.1 Existing suite: Phase 1 → Phase 2 delta

Program	N	vm2 P1 (µs)	vm2 P2 (µs)	Δ CPU	vm2 P1 RSS	vm2 P2 RSS	Δ RSS
`lists/fill_sum`	10	847	947	+11.8%	4.6 MB	4.6 MB	0
`lists/fill_sum`	100	7,727	9,322	+20.6%	6.3 MB	6.4 MB	+0.1 MB
`maps/fill_sum`	10	1,875	1,936	+3.3%	5.4 MB	5.3 MB	-0.1 MB
`maps/fill_sum`	100	17,523	18,405	+5.0%	9.8 MB	9.6 MB	-0.2 MB
`math/fact_rec`	10	343	775	+126.0%	4.3 MB	4.6 MB	+0.3 MB
`math/fact_rec`	13	452	1,064	+135.4%	4.3 MB	4.4 MB	+0.1 MB
`math/fib_iter`	10	228	377	+65.4%	4.3 MB	4.5 MB	+0.2 MB
`math/fib_iter`	20	403	649	+61.0%	4.3 MB	4.5 MB	+0.2 MB
`math/fib_rec`	15	69	161	+133.3%	4.4 MB	4.7 MB	+0.3 MB
`math/fib_rec`	20	701	1,407	+100.7%	4.2 MB	4.4 MB	+0.2 MB
`math/mul_loop`	10	202	345	+70.8%	4.3 MB	4.5 MB	+0.2 MB
`math/mul_loop`	13	462	378	-18.2%	4.6 MB	4.5 MB	-0.1 MB
`math/prime_count`	50	1,456	1,575	+8.2%	4.4 MB	4.6 MB	+0.2 MB
`math/prime_count`	100	4,287	4,701	+9.7%	4.4 MB	4.6 MB	+0.2 MB
`math/sum_loop`	1000	20,445	21,044	+2.9%	4.3 MB	4.5 MB	+0.2 MB
`math/sum_loop`	10000	187,696	214,558	+14.3%	4.4 MB	4.4 MB	0
`strings/concat_loop`	10	524	547	+4.4%	4.6 MB	4.7 MB	+0.1 MB
`strings/concat_loop`	30	1,984	2,054	+3.5%	5.8 MB	5.8 MB	0

Interpretation:

Container-heavy programs are flat to slightly worse (+3% to +20%), not the recovery the design originally projected. The Phase 1 regression we expected to claw back came from Objects[] table writes on every container construction; Phase 2 removes those writes, but the popFrame clear added in this phase costs roughly the same number of cycles in the opposite direction (one zero store per register slot per frame return). On a typical inner-loop helper with NumRegs ≈ 8, that is 8 stores per return, vs. the one Objects[] slice write per allocation that Phase 1 paid. The two effects roughly cancel on the container-heavy programs, which is why lists/fill_sum and maps/fill_sum are inside the noise band.
Recursive int-only programs regressed sharply (+60% to +135%). The popFrame clear is the dominant cost on math/fib_rec, math/fact_rec, math/fib_iter, and math/mul_loop (small N). These programs return frames at very high frequency (one per recursive call) and allocate nothing, so the per-return clear(vm.Stack[base:]) cost is paid out of pocket with no reclamation benefit. This is the load-bearing finding for Phase 3: the typed last-use bit (MEP 36 §6) should let the emitter mark recursion-only frames as "no container in window" and skip the clear entirely, recovering the int-heavy regression.
RSS moved by less than 0.3 MB on every benchmark. The page- level high-water mark is still dominated by the Go runtime baseline. The interesting RSS story is on the new BG suite below, which is sized to expose the long-lived-allocation pathology Phase 2 is supposed to fix.
math/mul_loop n=13 shrank (-18%). Treat as noise: the same row swung +118% in Appendix B; small absolute numbers on a recursion-heavy hot loop are inside the noise floor for this methodology even at K=5.

C.2 BG suite: Phase 2 baseline

Captured in the same run via bench/crosslang -repeat 5 on the hand-written four-language peers added in MEP 23:

Program	N	vm2 (µs)	CPython (µs)	Lua (µs)	Go (µs)	vm2 / CPython	vm2 / Lua	vm2 / Go	vm2 RSS	CPython RSS	Lua RSS	Go RSS
`bg/binary_trees`	8	11,711	9,694	12,072	1,502	1.21x	0.97x	7.80x	9.1 MB	12.3 MB	1.9 MB	7.5 MB
`bg/binary_trees`	10	184,104	156,494	200,145	16,402	1.18x	0.92x	11.22x	10.6 MB	13.1 MB	2.5 MB	10.6 MB
`bg/nsieve`	1000	5,117	2,341	1,084	107	2.19x	4.72x	47.82x	5.8 MB	12.2 MB	1.7 MB	4.3 MB
`bg/nsieve`	10000	55,263	25,754	11,149	818	2.15x	4.96x	67.56x	12.9 MB	12.3 MB	2.7 MB	4.6 MB

This is a Phase 2-only baseline (the programs did not exist when Phase 1 was measured), but the absolute numbers carry the headline finding directly:

bg/binary_trees at depth 10 is faster than Lua and 0.85x of CPython on this host, with lower RSS than CPython. That is the Phase 2 container-reclamation contract paying off: the program allocates ~2 million tree-node lists across the run and lets them die as the constructing frame returns. Without the popFrame clear that Phase 2 added, those nodes would stay reachable through the reused Stack slots until overwritten, and the high-water RSS would scale with cumulative allocations rather than the live working set. We can see the design working: 10.6 MB RSS for ~2M short-lived allocations is the live-set size, not the cumulative-allocation size. This is the benchmark Appendix C was written to land.
bg/nsieve is 2-5x off the interpreter peers. The hot loop is xs[i] == 0 ? mark inner loop : skip, both of which pay full Cell overhead on every read and write. CPython's bytearray and Lua's integer-keyed table both hit the same logical workload through a tighter native path. This row is the right pointer for a typed- list specialization MEP once Phase 3 lands and recovers the int-heavy regression noted in C.1.

C.3 Headline takeaway

Phase 2 trades a per-frame zero-store for the Objects-table indirection the design wanted gone. The headline win is on the canonical GC stress (bg/binary_trees): millions of short-lived tree-node lists are now reclaimable as soon as the constructing frame returns, and the high-water RSS reflects the live set rather than cumulative allocations. vm2 lands faster than Lua and within 18% of CPython on that program at depth 10, with lower RSS than CPython.

The cost is a regression on recursion-heavy int-only programs (+60% to +135% on math/fib_rec, math/fact_rec, math/fib_iter, math/mul_loop). The popFrame clear cost is proportional to NumRegs and paid on every frame return, with no offsetting allocation reclamation. Phase 3 (typed last-use bit, MEP 36 §6) is the planned mitigation: frames whose register window does not contain a container type at the time of return can skip the clear entirely, restoring the int-heavy programs to their Phase 1 numbers while preserving the container-reclamation contract.

Container-heavy benchmarks (lists/fill_sum, maps/fill_sum) are roughly flat (+3% to +20%). The Objects[] write removal and the popFrame clear roughly cancel on those workloads, which is the expected steady-state: Phase 2 was never about CPU on the existing container suite, it was about making the BG suite's RSS curve land correctly. That landed.

Appendix D. Phase 3 measurements (typed-slot popFrame skip)

Captured 2026-05-17 on the same Apple M4, Go 1.x, darwin/arm64 host as Appendices A-C, with vm2 Phase 3 changes applied:

Function.HasContainerSlots is computed at emit time by scanning the IR value stream for any value of type TStr, TList, TMap, or TPtr that is assigned a real register. A function whose entire window is i64/bool carries HasContainerSlots = false.
popFrame skips clear(vm.Stack[base:]) when the popped frame's Fn.HasContainerSlots is false. The Phase 2 reclamation contract is preserved: any frame that ever held a typed pointer Cell still zeros its window on the way out.

Methodology is unchanged from Appendix C (bench/crosslang -repeat 7, median wall-clock over K=7 fresh subprocess invocations, MaxRSS as the memory headline).

D.1 Existing suite: Phase 2 → Phase 3 delta

The column headers are the same shape as Appendix C.1 so the two tables can be diffed directly:

Program	N	vm2 P2 (µs)	vm2 P3 (µs)	Δ CPU vs P2	Δ CPU vs P1	vm2 P2 RSS	vm2 P3 RSS
`lists/fill_sum`	10	947	606	-36.0%	-28.5%	4.6 MB	4.6 MB
`lists/fill_sum`	100	9,322	4,645	-50.2%	-39.9%	6.4 MB	6.4 MB
`maps/fill_sum`	10	1,936	1,217	-37.1%	-35.1%	5.3 MB	5.5 MB
`maps/fill_sum`	100	18,405	9,425	-48.8%	-46.2%	9.6 MB	9.9 MB
`math/fact_rec`	10	775	435	-43.9%	+26.8%	4.6 MB	4.5 MB
`math/fact_rec`	13	1,064	354	-66.7%	-21.7%	4.4 MB	4.5 MB
`math/fib_iter`	10	377	221	-41.4%	-3.1%	4.5 MB	4.4 MB
`math/fib_iter`	20	649	374	-42.4%	-7.2%	4.5 MB	4.4 MB
`math/fib_rec`	15	161	63	-60.9%	-8.7%	4.7 MB	4.4 MB
`math/fib_rec`	20	1,407	619	-56.0%	-11.7%	4.4 MB	4.4 MB
`math/mul_loop`	10	345	159	-53.9%	-21.3%	4.5 MB	4.3 MB
`math/mul_loop`	13	378	220	-41.8%	-52.4%	4.5 MB	4.3 MB
`math/prime_count`	50	1,575	956	-39.3%	-34.3%	4.6 MB	4.4 MB
`math/prime_count`	100	4,701	2,813	-40.2%	-34.4%	4.6 MB	4.5 MB
`math/sum_loop`	1000	21,044	11,283	-46.4%	-44.8%	4.5 MB	4.5 MB
`math/sum_loop`	10000	214,558	109,079	-49.2%	-41.9%	4.4 MB	4.5 MB
`strings/concat_loop`	10	547	307	-43.9%	-41.4%	4.7 MB	4.7 MB
`strings/concat_loop`	30	2,054	1,129	-45.0%	-43.1%	5.8 MB	5.9 MB

Interpretation:

Phase 2 regression on the recursion suite is gone, often over-recovered. fib_rec n=20, fib_iter n=20, and fib_rec n=15 all land within 12% of Phase 1; mul_loop, prime_count, and sum_loop are 20%-50% better than Phase 1, because the helpers those programs call (loop tails, recursive worker frames) never touch a container slot, so every nested popFrame is a single length-bump instead of a zeroing pass over the register window.
fact_rec n=10 is the one row still above Phase 1 (+27%). The helper there has a single-frame call shape that emit currently types as containing a TStr slot via an intermediate value; the conservative emit-time scan flips HasContainerSlots to true for the helper and the clear still runs. A per-block / per-live-range refinement (the bitmap form sketched in §3.6) would close this case; the value is small relative to the other rows so we have not pursued it.
Container-heavy programs picked up the same shape. lists/fill_sum and maps/fill_sum improved 36-50% over Phase 2, because their outer frame still clears but every inner integer helper (the loop body, the summing pass) is now a clear-skip. strings/concat_loop similarly drops -43% to -45% from Phase 2: the concat happens in a leaf frame and the helper that drives the loop is int-only.

D.2 BG suite: Phase 2 → Phase 3 delta

Program	N	vm2 P2 (µs)	vm2 P3 (µs)	Δ vs P2	vm2 P3 / CPython	vm2 P3 / Lua	vm2 P3 / Go	vm2 P3 RSS
`bg/binary_trees`	8	11,711	11,318	-3.4%	1.18x	0.95x	7.09x	8.9 MB
`bg/binary_trees`	10	184,104	181,372	-1.5%	1.18x	0.95x	11.07x	10.7 MB
`bg/nsieve`	1000	5,117	5,034	-1.6%	2.15x	4.62x	69.92x	5.8 MB
`bg/nsieve`	10000	55,263	54,638	-1.1%	2.19x	4.99x	66.39x	13.4 MB

Interpretation:

The BG-suite kernels are explicitly container-heavy: nsieve walks a TList of ints and binary_trees builds and tears down ~2M nested list nodes per run. Their hot frames carry container slots and the popFrame clear still runs. Phase 3 is essentially flat on this suite, which is the expected shape: the optimization recovers the cost where the cost was paid by frames that did not need clearing, not where the clearing is load-bearing.

D.3 Headline takeaway

Phase 3 delivers what Appendix C.3 promised: the int-heavy recursion suite recovers (and often surpasses) its Phase 1 numbers, while the container suite preserves the Phase 2 RSS-reclamation contract. The median Phase 2 → Phase 3 CPU delta on the recursion suite is -46%; the BG suite is -2% (noise floor). vm2 is now faster than Lua on binary_trees at depth 10 again (0.95x) and within 18% of CPython, on a host where vm2 spent the entire Phase 2 effort getting back to break-even.

D.4 Phase 3c: in-place functional append

Captured with runtime/vm2/bench/list_append_test.go on the same M4 host. The workload chains N functional OpListAppend operations on a list whose intermediates are dead. Two configurations:

In-place: emit sets InstrFlagBLastUse on every step; dispatch mutates the source and reuses its *vmList pointer.
Copy: every intermediate has an extra reader so emit must clear the bit; dispatch allocates a fresh *vmList and copies the prior backing array on every step.

Chain length	In-place ns/op	Copy ns/op	Δ wall-clock	In-place mallocs	Copy mallocs	Δ allocs
256	2,183	51,717	23.7× faster	13	516	39.7× fewer

The in-place malloc count is O(log N) (Go's append doubling), the copy count is O(N). TestListAppendFastPathAllocCount asserts the copy path uses at least 2× the in-place path's mallocs, guarding the contract against silent regression.

This is the canonical Perceus-style "fluent chain becomes one backing array" result the spec motivated in §3.5; we got there with a 1-bit encoding annotation and a single dispatcher branch, without introducing reference counts or escape analysis.

Copyright

This document is placed in the public domain.

Abstract​

What the original draft got wrong​

Motivation​

Background​

2.1 Modern VM-on-host-GC designs​

2.2 Go-specific cost model​

2.3 Allocation-minimization literature​

2.4 Concurrent GC trade-offs​

Specification​

3.1 New Cell representation​

3.2 Delete the Objects table​

3.3 Container layouts​

3.4 Operand stack still on the heap​

3.5 Compile-time last-use bit​

3.6 Frame cleanup and write barriers​

3.7 JIT register file contract (interaction with MEP-34)​

3.8 FFI shims​

Rationale​

4.1 Why a struct Cell, not a typed union​

4.2 Cell width: 8 vs 16 vs 24 bytes​

4.3 Why delete the Objects table entirely instead of keeping it as a fallback​

4.4 Migration plan​

Backwards Compatibility​

Reference Implementation​

Open Questions​

References​

Appendix A. Phase 0 baseline measurements​

Appendix B. Phase 1 measurements (16-byte struct Cell)​

B.1 Membench (same harness as Appendix A)​

B.2 Crosslang (corpus runs against vm2 / CPython / Lua)​

B.3 Headline takeaway​

Appendix C. Phase 2 measurements (constructors return Cell directly)​

C.1 Existing suite: Phase 1 → Phase 2 delta​

C.2 BG suite: Phase 2 baseline​

C.3 Headline takeaway​

Appendix D. Phase 3 measurements (typed-slot popFrame skip)​

D.1 Existing suite: Phase 2 → Phase 3 delta​

D.2 BG suite: Phase 2 → Phase 3 delta​

D.3 Headline takeaway​

D.4 Phase 3c: in-place functional append​

Copyright​

Abstract

What the original draft got wrong

Motivation

Background

2.1 Modern VM-on-host-GC designs

2.2 Go-specific cost model

2.3 Allocation-minimization literature

2.4 Concurrent GC trade-offs

Specification

3.1 New Cell representation

3.2 Delete the Objects table

3.3 Container layouts

3.4 Operand stack still on the heap

3.5 Compile-time last-use bit

3.6 Frame cleanup and write barriers

3.7 JIT register file contract (interaction with MEP-34)

3.8 FFI shims

Rationale

4.1 Why a struct Cell, not a typed union

4.2 Cell width: 8 vs 16 vs 24 bytes

4.3 Why delete the Objects table entirely instead of keeping it as a fallback

4.4 Migration plan

Backwards Compatibility

Reference Implementation

Open Questions

References

Appendix A. Phase 0 baseline measurements

Appendix B. Phase 1 measurements (16-byte struct Cell)

B.1 Membench (same harness as Appendix A)

B.2 Crosslang (corpus runs against vm2 / CPython / Lua)

B.3 Headline takeaway

Appendix C. Phase 2 measurements (constructors return Cell directly)

C.1 Existing suite: Phase 1 → Phase 2 delta

C.2 BG suite: Phase 2 baseline

C.3 Headline takeaway

Appendix D. Phase 3 measurements (typed-slot popFrame skip)

D.1 Existing suite: Phase 2 → Phase 3 delta

D.2 BG suite: Phase 2 → Phase 3 delta

D.3 Headline takeaway

D.4 Phase 3c: in-place functional append

Copyright