MEP 35. Auto Memory Management for the VM2 Objects Table

Field	Value
MEP	35
Title	Auto Memory Management for the VM2 Objects Table
Author	Mochi core
Status	Draft
Type	Standards Track
Created	2026-05-17

Abstract

vm2 carries every reference-typed value through an indirection table: VM.Objects []any (runtime/vm2/vm.go:14-26). A Cell with tag tagPtr is a 48-bit index into that slice (runtime/vm2/cell.go:40,83-85,137). The table is append-only for the duration of a Run. There is no reclamation. There is no cycle detection. There is no escape analysis. popFrame explicitly defers the work to a future MEP:

The frame window is sliced off but its contents are NOT zeroed. vm2 has no ptr-tagged Cells yet so there is nothing to un-pin; once the boxed-object subsystem lands, popFrame must zero ptr-tagged slots before shrinking. Tracked in the upcoming subsystem MEP.

runtime/vm2/vm.go:87-99

That MEP is this one.

The choice is not whether to reclaim. Append-only allocation is a non-starter for any program that touches lists or maps inside a loop. The choice is how. This MEP surveys the design space the last five years have produced (Perceus reuse analysis in Koka and Roc, Inko's runtime single ownership, Lobster's compile-time RC elimination, Verona's Reggio region forest, LXR over Immix, MMTk, generational ZGC and Shenandoah, JSC's Riptide), pins down the constraints unique to a Go-hosted VM, and proposes three concrete implementation paths with their measured-or-projected trade-offs.

The Phase 1 deliverable is option 1 (Perceus-style reference counting with compiler-driven RC elision and cycle collection at Run-exit), gated on the MEP-23 benchmark corpus and on allocs/op and heap-bytes-retained-at-Run-exit metrics introduced here.

Motivation

What "append-only" costs today

Run this program in vm2:

fun main() {
  for i in 0..1_000_000 {
    let xs = [i, i+1, i+2]
    println(xs[0])
  }
}

The loop allocates one *vmList per iteration (runtime/vm2/lists.go:16-28). Every one of those 1,000,000 lists stays pinned in vm.Objects until Run returns. Peak heap is 1_000_000 * sizeof(*vmList + backing array), even though only one list is reachable at a time from the program's perspective. The Go GC cannot help: every entry in Objects []any is a live root for Go.
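The pinning effect can be reproduced outside vm2 with a small Go model. This is a sketch only: the `vmList` shape, the `AddObject` signature, and `runLoop` are assumptions made for illustration, not the code in runtime/vm2.

```go
package main

import "fmt"

// vmList is a stand-in for the real *vmList payload (hypothetical shape).
type vmList struct{ items []int64 }

// VM models only the append-only indirection table.
type VM struct{ Objects []any }

// AddObject appends the payload and returns its table index.
// Nothing ever removes an entry, so every payload stays a Go GC root.
func (vm *VM) AddObject(p any) int {
	vm.Objects = append(vm.Objects, p)
	return len(vm.Objects) - 1
}

// runLoop mimics the Mochi loop: one fresh list per iteration,
// of which only the latest is still needed by the program.
func runLoop(vm *VM, n int) int {
	last := -1
	for i := 0; i < n; i++ {
		xs := &vmList{items: []int64{int64(i), int64(i + 1), int64(i + 2)}}
		last = vm.AddObject(xs)
	}
	return last
}

func main() {
	vm := &VM{}
	last := runLoop(vm, 1000)
	fmt.Println("needed by the program: 1 list at index", last)
	fmt.Println("pinned in Objects:", len(vm.Objects))
}
```

The table length grows with the iteration count even though the program only ever needs the most recent list.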

Three workloads in the MEP-23 corpus already trip this: map_get builds and discards 32 maps per iteration, string_cat accumulates an intern table that never shrinks, and quicksort_recursive allocates sublists that the parent overwrites but never frees. The iter_sum and fib_rec numbers in MEP-30 are clean only because the numeric fast path never touches Objects. The minute we benchmark anything container-shaped, the absence of reclamation becomes the dominant cost.

Why we cannot just lean on Go's GC

The host language is Go and its GC is excellent for Go-shaped programs. It does not help here because the lifetime our users care about is shorter than the lifetime visible to Go: a *vmList is reachable through vm.Objects[i] as long as the program runs, even if the program will never read index i again. We have a precise dependency graph at the bytecode level (the JIT and the type system both know which Cells reference which Objects); Go does not. The job is to publish that reachability to a per-VM allocator that runs inside Go's GC but reclaims at vm2's granularity.

Why this is the right time

Three things changed under us in the last twelve months that make the choice less open than it was when MEP-20 deferred this question:

Perceus is no longer research. Roc ships it. Lean ships it. Koka ships it. The PLDI 2021 paper has been productized for four years and Roc 2024 numbers show Set/List benchmarks within 1.3-2x of C with no GC pauses (Roc Functional, Perceus PLDI'21).
MMTk reached Ruby 3.4 in January 2025 (Rails at Scale 2025-01-08). The modular interface (VMBinding) is the first credible way for a small VM to consume a research-grade collector (Immix, LXR, SemiSpace, MarkSweep) without writing one. The cost is the Rust FFI; the benefit is that the GC algorithm becomes a swap-out.
Generational Shenandoah went production in JDK 25 (Perf Parlor 2025-09-14) and Generational ZGC is the default in JDK 23+. The "pauseless concurrent GC pays back even at small heaps" hypothesis has now been re-checked at scale; the answer is only above 8 GB. At vm2's scale (interpreter heap rarely above 1 GB) the right primitive is per-allocation reclamation, not concurrent tracing.

The conclusion that follows from those three is that reference counting plus reuse analysis is the front-runner for a small register VM in 2026, not tracing GC. This MEP commits to validating that hypothesis on the MEP-23 corpus before locking it in, hence three options rather than one.

Background: what we are borrowing from

This section is the deep-research output, condensed. Each subsection ends with the one or two ideas we are taking from that line of work.

Perceus (Koka, Roc, Lean 4)

Perceus is precise non-deferred reference counting with three compile-time passes that erase most of the runtime cost (Reinking, Xie, de Moura, Leijen, PLDI 2021):

Precise RC. Drops are inserted by the compiler at the exact last-use point. There is no liveness extension to end-of-scope. Programs are "garbage free" in the sense that an object's refcount hits zero the instant it becomes unreachable.
Reuse analysis. If a function consumes a unique value of one shape and constructs a value of the same or smaller shape, the compiler rewrites the construction as an in-place mutation of the consumed cell. Koka's map over a unique list does zero allocation.
Borrow inference. Function parameters annotated as borrowed do not get a refcount bump on entry. Most local helpers in Roc compile to zero RC ops because borrow-inference handles the steady state.
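The reuse rule can be illustrated at runtime level with a minimal Go sketch. It assumes a hypothetical `list` payload carrying its own refcount and a `mapInc` helper that consumes its argument; Perceus performs the equivalent rewrite at compile time, so this shows the effect rather than the compiler pass.

```go
package main

import "fmt"

// list is a hypothetical refcounted list payload.
type list struct {
	rc    uint32
	items []int64
}

// mapInc consumes xs (the caller's reference is transferred to the callee)
// and returns a list with every element incremented.
func mapInc(xs *list) *list {
	if xs.rc == 1 {
		// Unique: nobody else can observe xs, so mutate in place.
		for i := range xs.items {
			xs.items[i]++
		}
		return xs
	}
	// Shared: give up our reference and build a fresh list.
	xs.rc--
	out := &list{rc: 1, items: make([]int64, len(xs.items))}
	for i, v := range xs.items {
		out.items[i] = v + 1
	}
	return out
}

func main() {
	unique := &list{rc: 1, items: []int64{1, 2, 3}}
	fmt.Println(mapInc(unique) == unique) // same cell reused, no allocation

	shared := &list{rc: 2, items: []int64{1, 2, 3}}
	fmt.Println(mapInc(shared) == shared, shared.items) // fresh copy, original intact
}
```

The unique path allocates nothing, which is the property the Koka map example relies on.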

Status as of 2024: Lobster reports that its compile-time analysis elides ~95% of RC ops (Lobster memory management); Roc reports comparable elision rates and Set / List benchmarks within 1.3-2x of C (Roc Functional). The known weakness is cycles. Roc avoids them by language design (no mutation, hence no cycles); Koka resolves them with a runtime cycle collector run on demand; Lobster reports them at exit ("cycle report") and asks the programmer to break them. Bacon and Rajan's concurrent cycle collector (ECOOP 2001) is the underlying algorithm for the runtime-collector route.

What we take: the Perceus shape (drop at last-use, reuse-on-unique, borrow-on-parameter) is exactly what fits vm2: every container Cell already has a known type tag at JIT compile time (MEP-34 §Cell tag-check fast paths), so the analysis has the information it needs without a whole-program closed-world assumption.

Inko: runtime single ownership

Inko (inko-lang.org/papers/ownership.pdf) gives each value one owner; aliases are runtime-checked borrows. A "unique value" cannot be aliased outside its box and can be sent across processes without copy. Where Rust enforces single ownership at compile time, Inko enforces it at runtime with one extra word per object holding a borrow count; on drop, if the borrow count is non-zero the program aborts.
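A minimal Go sketch of the borrow-count check described above. The `owned` type and its method names are hypothetical, and `drop` returns an error where Inko would abort the program, purely so the behavior can be observed.

```go
package main

import (
	"errors"
	"fmt"
)

var errBorrowed = errors.New("drop of a value with live borrows")

// owned is a hypothetical single-owner value with one extra word
// that counts outstanding borrows.
type owned struct {
	borrows uint32
	dropped bool
}

func (o *owned) borrow()  { o.borrows++ }
func (o *owned) release() { o.borrows-- }

// drop is called by the owner. It refuses to free the value while
// any borrow is still live (Inko aborts at this point instead).
func (o *owned) drop() error {
	if o.borrows != 0 {
		return errBorrowed
	}
	o.dropped = true
	return nil
}

func main() {
	v := &owned{}
	v.borrow()
	fmt.Println(v.drop()) // rejected: one borrow still live
	v.release()
	fmt.Println(v.drop()) // accepted once the borrow is released
}
```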

What we take: the runtime enforcement story is the right one for a dynamic register VM. The Rust-style "prove single ownership at compile time" story is incompatible with Mochi's reflection / any paths.

Lobster: compile-time RC elimination

Lobster (Wouter van Oortmerssen, 2019 talk) interleaves ownership analysis with type checking. The ownership "kind" is a property of every AST node, not just of variables. Result: 95% of RC ops removed at compile time, cycles reported at exit. Influenced Nim's --gc:arc.

What we take: the architecture decision that ownership analysis lives inside the type checker, not as a separate pass. For Mochi this means types/check.go and the soundness work in MEP-7 are the right home, not a new analyzer.

Verona / Reggio: region forest

Verona (OOPSLA 2023, Cheney et al.) organizes all objects into a forest of isolated regions. Each region has a memory-management strategy (trace, RC, arena) chosen per-region. A thread has one "window of mutability" at a time. Region isolation is enforced by a reference-capability type system; regions are reclaimed wholesale when their owning reference is dropped.

What we take: the per-allocation-site strategy idea. A frame's local lists can live in an arena that dies with the frame; the program's global persistent map cannot. Verona's choice to make strategy a region property rather than a global one is the right abstraction.

Immix / LXR / MMTk

Immix (Blackburn & McKinley, PLDI 2008) is a mark-region collector: 128-byte lines grouped into 32 KB blocks, allocate bump-pointer inside a block, reclaim at line granularity, opportunistically copy fragmented blocks. LXR (Zhao, Blackburn, McKinley, PLDI 2022) layers reference counting over Immix: stop-the-world RC pauses (a few ms) reclaim 90%+ of memory without copying; occasional concurrent traces catch cycles. LXR beat ZGC and Shenandoah on the 2022 latency benchmarks while matching G1 throughput.

MMTk (mmtk.io, Ruby 3.4 integration 2025-01) packages Immix, LXR, SemiSpace, MarkSweep, and friends as a Rust library with a VMBinding trait. Bindings exist for OpenJDK, V8, Julia, Ruby.

What we take: if the RC option underperforms, LXR is the fallback. MMTk gives us LXR (and SemiSpace, MarkSweep, Immix) without writing a collector. The cost is the Rust FFI boundary and a write-barrier on every ptr-tagged Cell store.

Generational ZGC, Shenandoah, JSC Riptide

The big concurrent collectors (Java 25 ZGC, WebKit Riptide) are engineered for heaps in the 8 GB to multi-TB range. Their constant overhead (write barriers, colored pointers in ZGC, Brooks pointers in Shenandoah, conservative root scanning in JSC) is meaningful on small heaps. Riptide's "logical versioning" trick (bump a global version rather than physically clearing mark bits) is interesting and cheap; the rest of the bag of tricks is sized wrong for vm2.

What we take: logical versioning for mark bits, in option 3. Nothing else.
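Logical versioning is small enough to sketch directly. The Go below is an illustration under assumed names (`heap`, `marks`, `beginCycle`), not Riptide's implementation: each object stores the version in which it was last marked, and starting a new cycle bumps one counter instead of clearing every mark.

```go
package main

import "fmt"

// heap keeps one version stamp per object instead of a mark bit.
type heap struct {
	version uint32   // current collection cycle
	marks   []uint32 // marks[i] == version means "marked in this cycle"
}

func newHeap(n int) *heap { return &heap{version: 1, marks: make([]uint32, n)} }

func (h *heap) mark(i int)          { h.marks[i] = h.version }
func (h *heap) isMarked(i int) bool { return h.marks[i] == h.version }

// beginCycle invalidates every existing mark in O(1) by bumping the version.
func (h *heap) beginCycle() {
	h.version++
	if h.version == 0 {
		// The counter wrapped: fall back to one physical clear.
		for i := range h.marks {
			h.marks[i] = 0
		}
		h.version = 1
	}
}

func main() {
	h := newHeap(4)
	h.mark(2)
	fmt.Println(h.isMarked(2), h.isMarked(3))
	h.beginCycle()
	fmt.Println(h.isMarked(2)) // stale mark from the previous cycle
}
```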

Escape analysis: PLDI 2024 and OOPSLA 2024

"Optimistic Stack Allocation and Dynamic Heapification" (PLDI 2024) and MEA2 (OOPSLA 2024) advance the state of the art on stack-promoting heap allocations in managed runtimes. The PLDI'24 idea: do a precise static escape analysis offline, JIT-compile with optimistic stack allocation, and have the JIT and interpreter perform dynamic heapification (copy the stack object to the heap and rewrite references) when an optimistic assumption is invalidated.

What we take: the dynamic-heapification escape valve, in option 2 (regions). It is the answer to "what happens when an allocation outlives the frame we put it in".

Summary table

System	Mechanism	Cycle handling	Strength	Weakness
Perceus / Koka / Roc	Precise RC + reuse + borrow	Cycle collector or by-design	Predictable, allocation-free hot paths	Atomic RC for shared, cycle cost
Inko	RC + runtime borrow check	Single ownership rules out	Deterministic destructors, simple model	Aborts on borrow violation
Lobster	RC + compile-time elision	Reported at exit	95% RC ops gone, simple	Cycles leak silently
Verona / Reggio	Per-region strategy	Per-region	Mixes arena + RC + trace	Type system complexity
LXR / Immix	Mark-region + opportunistic copy	Concurrent trace	Throughput rivals G1, ms pauses	Write barrier, MMTk FFI
ZGC / Shenandoah	Concurrent copy, colored / Brooks ptrs	Trace	Pauseless at TB heaps	Overweight at MB heaps
JSC Riptide	Non-compacting concurrent mark	Trace	Conservative roots, logical versioning	Conservative roots are imprecise

Constraints unique to vm2

Before laying out options, the constraints that filter them:

Host is Go. Anything we allocate lives in Go's heap. The Go GC must remain able to scan our objects; otherwise we get use-after-free at the next concurrent mark. This rules out unsafe.Pointer-based bit-stealing for object payloads. See MEP-20 §"Why not NaN boxing in Go" for the analogous Cell argument.
Cells are 8 bytes and NaN-boxed. A ptr-tagged Cell is a 48-bit table index, not a pointer (runtime/vm2/cell.go:40). Adding a header word to every object is free for us (we already have an any slot per Objects[i]); adding a header bit to every Cell is not (the NaN-box is full).
The JIT and the interpreter share the frame format (MEP-34). Both must produce identical RC operations and identical write barriers, or a JIT'd function that calls an interpreted one (and vice versa) double-counts or skip-counts. The MEP-34 frame-compatibility contract extends to the memory subsystem.
Determinism is a feature. Mochi's test corpus diffs program output. Anything non-deterministic (concurrent tracing with cooperation points scheduled by wall-clock) breaks reproducibility. Pauses must be deterministic and scheduled at well-known program points (back-edges, OpReturn, allocation overflow).
No assumption of closed-world. Mochi has FFI (the Objects table holds opaque any payloads, some of which are Go-side resources). A whole-program escape analysis is not viable in v1. Per-function or per-call-site analysis is.
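To make the "NaN-box is full" constraint concrete, here is a sketch of a ptr-tagged Cell. The bit layout is an assumption for illustration (a quiet-NaN prefix, one tag bit at position 48, and a 48-bit payload); the real constants live in runtime/vm2/cell.go and may differ.

```go
package main

import (
	"fmt"
	"math"
)

// Hypothetical layout: bits 51-62 are the quiet-NaN prefix, bit 48 is
// the pointer tag, and the low 48 bits carry the Objects index.
const (
	qnanBits    = uint64(0x7FF8_0000_0000_0000)
	tagPtr      = uint64(0x0001_0000_0000_0000)
	payloadMask = uint64(0x0000_FFFF_FFFF_FFFF)
	ptrMask     = qnanBits | tagPtr
)

// Cell is 8 bytes: either an ordinary float64 or a tagged NaN.
type Cell uint64

func CFloat(f float64) Cell { return Cell(math.Float64bits(f)) }

// CPtr stores a table index, never a Go pointer, so the Go GC has
// nothing to scan inside a Cell.
func CPtr(idx uint64) Cell { return Cell(ptrMask | (idx & payloadMask)) }

func (c Cell) IsPtr() bool      { return uint64(c)&ptrMask == ptrMask }
func (c Cell) PtrIndex() uint64 { return uint64(c) & payloadMask }

func main() {
	c := CPtr(42)
	fmt.Println(c.IsPtr(), c.PtrIndex())
	// Ordinary floats and Go's default NaN do not carry the tag bit.
	fmt.Println(CFloat(1.5).IsPtr(), CFloat(math.NaN()).IsPtr())
}
```

With the prefix, the tag bits, and the 48-bit payload all spoken for, per-object metadata such as a refcount has to live in the Objects slot rather than in the Cell. A production encoding would also canonicalize float NaNs so they can never collide with a tag.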

Three implementation options

Option 1: Perceus-style reference counting with compile-time RC elision and cycle collection at Run exit

Mechanism. Add one uint32 refcount per Objects[i] slot. Compile every Move, Call, container construct, and container drop opcode to emit dup (increment) and drop (decrement-and-maybe-free) operations on ptr-tagged Cells. Run three compiler passes inside compiler2 (MEP-21):

Last-use insertion. Place each drop at the last lexical use of a ptr-tagged value within a function. Implemented as a backward dataflow over the bytecode, identical in shape to Koka's pass.
Reuse analysis. When a function consumes a ptr-tagged value via drop and then constructs a new container of the same shape, rewrite the pair as an in-place reuse guarded by a runtime check that the refcount is 1. An object with refcount 1 is "unique"; the JIT can inline the check as a one-instruction comparison (Roc Set.insert idiom).
Borrow inference. Function parameters used read-only and not stored into long-lived state are marked borrowed. Borrowed parameters skip the entry dup and the exit drop.
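The last-use pass can be sketched as a single backward scan over straight-line code. The `instr` shape and register numbering below are hypothetical, the sketch ignores control flow and type tags, and it does not handle values that are defined but never read; a real pass in compiler2 would need all three.

```go
package main

import "fmt"

// instr is a hypothetical straight-line bytecode instruction:
// it reads the registers in uses and writes def (-1 for none).
type instr struct {
	op   string
	def  int
	uses []int
}

// lastUses walks the code backward and returns, per instruction index,
// the registers whose final read happens there. A drop for each such
// register would be emitted right after that instruction.
func lastUses(code []instr) map[int][]int {
	live := map[int]bool{} // registers that are read later in the block
	out := map[int][]int{}
	for i := len(code) - 1; i >= 0; i-- {
		in := code[i]
		if in.def >= 0 {
			delete(live, in.def) // a write ends the older value's lifetime
		}
		for _, r := range in.uses {
			if !live[r] {
				out[i] = append(out[i], r)
				live[r] = true
			}
		}
	}
	return out
}

// sample builds two lists, reads r0 early and r1 late.
func sample() []instr {
	return []instr{
		{"newlist", 0, nil},
		{"newlist", 1, nil},
		{"len", 2, []int{0}},
		{"print", -1, []int{2}},
		{"get", 3, []int{1}},
		{"ret", -1, []int{3}},
	}
}

func main() {
	out := lastUses(sample())
	fmt.Println("drop after instr 2:", out[2]) // r0 dies here, not at end of scope
	fmt.Println("drop after instr 4:", out[4])
}
```

The point of the example is that r0 becomes droppable at instruction 2, well before the function returns, which is what separates precise RC from scope-based RC.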

Cycles. Detected once, at Run exit, by sweeping Objects for entries with refcount > 0 that are unreachable from the registers. Bacon-Rajan (ECOOP 2001) is the algorithm. A program with no cycles pays nothing. A program with cycles pays one full sweep per Run.
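The exit-time sweep can be approximated with a plain reachability pass. The sketch below is deliberately not Bacon-Rajan's trial-deletion algorithm: it marks from the roots and reports every slot that still has a positive refcount but was never reached. The `slot` type and its `children` field are assumptions for illustration.

```go
package main

import "fmt"

// slot is a hypothetical Objects entry: a refcount plus the indices
// of the slots this object references.
type slot struct {
	refcount uint32
	children []int
}

// cyclicGarbage returns the indices of slots that still hold a positive
// refcount but cannot be reached from roots.
func cyclicGarbage(objs []slot, roots []int) []int {
	reached := make([]bool, len(objs))
	stack := append([]int(nil), roots...)
	for len(stack) > 0 {
		i := stack[len(stack)-1]
		stack = stack[:len(stack)-1]
		if reached[i] {
			continue
		}
		reached[i] = true
		stack = append(stack, objs[i].children...)
	}
	var leaked []int
	for i, s := range objs {
		if s.refcount > 0 && !reached[i] {
			leaked = append(leaked, i)
		}
	}
	return leaked
}

func main() {
	objs := []slot{
		{refcount: 1, children: []int{1}}, // 0: held by a register
		{refcount: 1},                     // 1: held by object 0
		{refcount: 1, children: []int{3}}, // 2: cycle with 3
		{refcount: 1, children: []int{2}}, // 3: cycle with 2
		{refcount: 0},                     // 4: already freed
	}
	fmt.Println(cyclicGarbage(objs, []int{0}))
}
```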

Allocator. A small free-list per object class (*vmList, *vmMap, *vmString, *Closure, *big.Int) reuses freed slots in Objects. A live object's slot index is stable for the duration of a Run; freed slots are pushed onto a per-class freelist and reused by the next allocation of that class. The hazard is a stale Cell: a Cell that still holds a freed index continues to compare equal but dereferences to whatever now occupies the slot. RC must rule that out before a slot is reused, and the refcount-hit-zero check is exactly that proof: a slot is freed only when no Cell references it.

Mochi mapping.

vm2 concept	New RC concept
`Objects []any`	`Objects []objSlot` where `objSlot = {payload any; refcount uint32; next uint32}`
`CPtr(idx)` cell construction	Compiler emits `dup` on the resulting Cell
`OpReturn`	Compiler emits `drop` on every ptr-tagged register at last use
`popFrame`	No change. RC drops happen at last-use, not at frame exit.
`Run` exit	`runCycleCollect(vm)` then reset `Objects`
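A minimal sketch of the slotted table from the mapping above. It follows the `objSlot` field names from the table but simplifies in ways that are assumptions of this sketch: one freelist instead of one per class, uint32 indices, a refcount that starts at 1 (folding the first dup into allocation), and no recursive drop of an object's children.

```go
package main

import "fmt"

const noFree = ^uint32(0) // sentinel: empty freelist

type objSlot struct {
	payload  any
	refcount uint32
	next     uint32 // next free slot while this slot sits on the freelist
}

type VM struct {
	Objects []objSlot
	free    uint32 // head of the freelist
}

func NewVM() *VM { return &VM{free: noFree} }

// AddObject reuses a freed slot when one exists, else appends.
func (vm *VM) AddObject(p any) uint32 {
	if vm.free != noFree {
		idx := vm.free
		vm.free = vm.Objects[idx].next
		vm.Objects[idx] = objSlot{payload: p, refcount: 1, next: noFree}
		return idx
	}
	vm.Objects = append(vm.Objects, objSlot{payload: p, refcount: 1, next: noFree})
	return uint32(len(vm.Objects) - 1)
}

func (vm *VM) dup(idx uint32) { vm.Objects[idx].refcount++ }

// drop decrements and, at zero, clears the payload and frees the slot.
func (vm *VM) drop(idx uint32) {
	s := &vm.Objects[idx]
	s.refcount--
	if s.refcount == 0 {
		s.payload = nil // unpin the payload from the Go GC
		s.next = vm.free
		vm.free = idx
	}
}

func main() {
	vm := NewVM()
	a := vm.AddObject("a")
	vm.AddObject("b")
	vm.drop(a)
	c := vm.AddObject("c")
	fmt.Println("slot reused:", c == a, "table length:", len(vm.Objects))
}
```

Clearing the payload on the final drop is what lets the Go GC reclaim the object; the freelist is what keeps the table from growing.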

Pros. Allocation-free hot paths once reuse fires (Roc demonstrates this in practice). No GC pauses other than at Run exit. Deterministic. Co-evolves with MEP-7 (the type checker already knows ptr-tagged Cells). Aligns with the precedent set in MEP-20 of "Go-friendly value layout".

Cons. Refcount operations on shared containers are atomic if Mochi grows concurrency. The atomic op cost is ~3-5 ns on modern x86, less on ARM; Biased Reference Counting (PACT'18) is the known mitigation if it becomes a bottleneck. Cycles in long-running programs leak until Run exit; programs that build cyclic graphs (graph algorithms, mutable cyclic data) regress.

Projected numbers (extrapolating from Roc, Lobster, Koka on equivalent workloads):

Workload	vs append-only baseline	vs LuaJIT
`iter_sum` (no refs)	1.00x (unchanged, no ptr-tagged Cells)	parity
`map_get`	30x lower heap retained, 1.1-1.3x throughput	1.3-1.8x slower (RC bump on each map insert)
`quicksort` (in-place reuse fires)	100x lower heap retained, 1.0x throughput	parity
`string_cat` (interned)	50x lower retained	1.2-1.5x slower

Option 2: Per-frame region arenas with dynamic heapification

Mechanism. Each pushFrame opens a new region in a []Region stack. Every allocation made from inside that frame's bytecode goes into the current region's bump allocator. popFrame discards the region in one operation: drop the bump allocator's slab, no per-object work.

When the type system cannot prove a ptr-tagged value is frame-local (the value escapes via OpReturn, is stored into a parent register, is captured by a closure, or flows into the FFI), the compiler inserts a heapification opcode that copies the object out of the frame region into the parent's region or into a long-lived "global" region. This is the PLDI 2024 idea (Anand et al.), adapted: instead of stack-vs-heap, we have region-vs-region, and heapification rewrites the source Cell to point at the copy.

The compiler's escape analysis is conservative: a value is treated as escaping unless the bytecode demonstrably keeps it frame-local. Cases that qualify (a large fraction of the corpus, going by MEA2's measurements on Go): list comprehensions over a local list, map literals consumed by a single call, intermediate strings in string_cat.

Mochi mapping.

vm2 concept	New region concept
`Objects []any`	`Objects []any` becomes one of many regions; one `*Region` per frame
`pushFrame`	`vm.Regions = append(vm.Regions, NewRegion())`
`popFrame`	`vm.Regions[top].free()` then `Regions = Regions[:top]`
Container construct	Allocates in current frame's region
Escaping value	Compiler emits `OpHeapify dst, src` that copies into parent region
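The mapping above can be sketched as a region stack. The `ref` type (region depth plus index) and the method names are hypothetical, and `heapify` moves the Go reference into the parent region rather than copying bytes, since in a Go-hosted VM the Go GC owns the underlying storage.

```go
package main

import "fmt"

// Region holds every object allocated by one frame.
type Region struct{ objs []any }

type VM struct{ Regions []*Region }

// ref names an object by region depth and index inside that region.
type ref struct{ depth, idx int }

func (vm *VM) pushFrame() { vm.Regions = append(vm.Regions, &Region{}) }

// popFrame releases every object of the top region in one step.
func (vm *VM) popFrame() {
	top := len(vm.Regions) - 1
	vm.Regions[top] = nil // the Go GC can now reclaim the whole region
	vm.Regions = vm.Regions[:top]
}

// alloc places p in the current frame's region.
func (vm *VM) alloc(p any) ref {
	top := len(vm.Regions) - 1
	r := vm.Regions[top]
	r.objs = append(r.objs, p)
	return ref{top, len(r.objs) - 1}
}

// heapify re-homes an escaping object in the parent region and returns
// the rewritten reference (the OpHeapify dst, src idea).
func (vm *VM) heapify(src ref) ref {
	if src.depth == 0 {
		return src // already in the outermost region
	}
	parent := vm.Regions[src.depth-1]
	parent.objs = append(parent.objs, vm.Regions[src.depth].objs[src.idx])
	return ref{src.depth - 1, len(parent.objs) - 1}
}

func (vm *VM) load(r ref) any { return vm.Regions[r.depth].objs[r.idx] }

func main() {
	vm := &VM{}
	vm.pushFrame() // outermost ("global") region
	vm.pushFrame() // callee frame
	vm.alloc("temporary")
	kept := vm.heapify(vm.alloc("returned"))
	vm.popFrame() // the temporary dies with the frame
	fmt.Println(len(vm.Regions), vm.load(kept))
}
```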

Pros. O(1) reclamation per frame. No write barriers in the steady state. Aligns naturally with Mochi's frame structure (MEP-34 §frame-compatibility). Cycles among objects in the same region are reclaimed for free.

Cons. Escape analysis is the bottleneck. If most values escape (we have not measured), regions degrade to a slow heap. Long-lived values (the program's global state, the persistent map a user keeps growing) need their own region with one of the other two strategies layered on top, which means we end up implementing option 1 or 3 anyway, just on a smaller footprint. The Verona / Reggio paper makes precisely this argument (Cheney et al., OOPSLA 2023 §3.2).

Projected numbers.

Workload	vs append-only baseline	vs LuaJIT
`iter_sum`	1.00x	parity
`map_get`	100x lower retained if maps don't escape, 1.0x retained if they do	parity if non-escaping
`quicksort`	200x lower retained (frame-local sublists)	parity
`long_running_repl` (global growing map)	regresses, need fallback strategy	regresses

Option 3: Mark-region collector over the Objects table (Immix / LXR shape, no MMTk dependency)

Mechanism. Replace Objects []any with a slab-allocated region heap: 32 KB blocks, 128 B lines, bump-pointer inside a block. Every block holds objects of the same class (*vmList, *vmMap, ...) to keep the Go GC's type information precise. A mark phase runs at allocation overflow (when no block of the requested class has a free line):

Stop the mutator at the next safepoint (back-edge or OpCall / OpReturn).
Scan the register file plus the Objects indirection (we still need indirection for slot stability under refcount-free reclamation; see below) for ptr-tagged Cells. The Cells are the precise root set; no conservative scan.
Mark lines and blocks reachable. Use logical versioning (JSC Riptide) so mark bits do not need physical clearing between cycles.
Reclaim free lines for allocation. Compact only fragmented blocks via the Immix opportunistic-copy path.
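The allocation side of the last step can be sketched with the block geometry given above (32 KB blocks, 128 B lines, so 256 lines per block). The `block` type and its fields are assumptions; the sketch bump-allocates inside a run of unmarked lines and skips to the next run when the current one is exhausted. It omits the marker, the per-class block lists, and opportunistic copying.

```go
package main

import "fmt"

const (
	lineSize      = 128
	blockSize     = 32 * 1024
	linesPerBlock = blockSize / lineSize // 256
)

// block tracks which lines the last mark phase found live.
type block struct {
	lineMarked [linesPerBlock]bool
	cursor     int // next free byte in the current hole
	limit      int // end of the current hole
}

// nextHole moves cursor/limit to the next run of unmarked lines.
func (b *block) nextHole() bool {
	line := b.limit / lineSize
	for line < linesPerBlock && b.lineMarked[line] {
		line++
	}
	if line == linesPerBlock {
		b.cursor, b.limit = blockSize, blockSize
		return false
	}
	end := line
	for end < linesPerBlock && !b.lineMarked[end] {
		end++
	}
	b.cursor, b.limit = line*lineSize, end*lineSize
	return true
}

// alloc returns the byte offset of size bytes inside the block,
// or -1 when no remaining hole is large enough.
func (b *block) alloc(size int) int {
	for {
		if b.cursor+size <= b.limit {
			off := b.cursor
			b.cursor += size
			return off
		}
		if !b.nextHole() {
			return -1
		}
	}
}

func main() {
	b := &block{}
	b.lineMarked[0], b.lineMarked[1], b.lineMarked[3] = true, true, true
	fmt.Println(b.alloc(100)) // lands in the one-line hole at line 2
	fmt.Println(b.alloc(100)) // hole exhausted, skips past line 3
}
```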

LXR layers RC on top of this for incrementality. We would start without LXR and adopt it in a follow-on MEP if pause times exceed the budget.

Mochi mapping.

vm2 concept	New collector concept
`Objects []any`	Per-class block lists; ptr-tagged Cell still carries a 48-bit identity but resolved through a forwarding table
`pushFrame` / `popFrame`	No change
`OpCall` / back-edge	Safepoint check; collector runs if requested
Container construct	Bump-alloc in current block; allocate new block on overflow
`Run` exit	Discard all blocks

Pros. Throughput within the Immix / LXR range, which is to say competitive with G1 on Java benchmarks. Cycles handled. No compiler-side analysis required (no Perceus-style passes, no escape analysis). Simplifies the JIT contract: the JIT emits a write barrier on ptr-tagged stores, nothing else.

Cons. A write barrier on every ptr-tagged Cell write. The barrier is one branch and one byte-store (LXR §3.2) but it is not zero. Pause times are bounded but not zero: a Stop-The-World mark is in the millisecond range for heaps with a few hundred thousand live objects. The implementation effort is the largest of the three: we are writing a collector. We could lean on MMTk via cgo or a Rust subprocess; the cost is the FFI boundary and a build-system dependency on cargo and a Rust toolchain in CI.

Projected numbers.

Workload	vs append-only baseline	vs LuaJIT
`iter_sum`	0.97x (write-barrier cost)	parity
`map_get`	50x lower retained, 1.0-1.1x throughput	1.1-1.3x slower
`quicksort`	50x lower retained, 0.95x throughput	parity
cycle-heavy graphs	reclaimed correctly	parity

Comparison

Property	Option 1 (Perceus RC)	Option 2 (Regions)	Option 3 (Mark-region)
Compiler work	Last-use, reuse, borrow passes	Escape analysis + heapify opcode	Write-barrier emission only
Runtime work	RC bump on dup / drop	None in steady state	Mark sweep at overflow
Cycles	Bacon-Rajan at Run exit	Free per region	Free during mark
Pause	None (RC) or one per Run (cycles)	None	ms-scale per mark
Heap overhead	4 bytes per Object	One slab per frame	Block headers + mark bitvectors
Determinism	Full	Full	Pause time varies with heap
FFI	Clean (RC visible to Go)	Clean	Clean if pure-Go, FFI if MMTk
Failure mode	Cycles leak until Run exit	Escape analysis pessimism	Pause spikes under allocation pressure
Lines of code (estimate)	1500-2500 in compiler + runtime	2000-3000 in compiler + runtime	3000-5000 in runtime, or MMTk FFI
Risk of regressing MEP-30 numeric path	None (no ptr-tag in `fib_rec`)	None	Small (write barrier dead-coded out)

Recommendation

Option 1 is the proposed Phase 1. Specifically:

Land the slot freelist + refcount in Objects first, with the compiler emitting dup / drop at every ptr-tagged register move. This is the unoptimized RC baseline. Measure heap-retained on the MEP-23 corpus.
Add last-use insertion. Measure RC op count and heap-retained.
Add reuse analysis. Measure allocation rate on quicksort and map_get.
Add borrow inference. Measure RC op count on closure-heavy workloads.
Add Bacon-Rajan cycle collection. Measure on graph workloads.

Each of the five steps lands as a separate PR with its own gate against the MEP-23 baseline. If any step regresses geo-mean throughput against the previous step, the work stops and we re-evaluate against option 3.
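The per-step gate reduces to a geometric-mean comparison. A sketch, with hypothetical function names and made-up ratios used only to exercise the code; the actual gate would read per-benchmark throughput from the MEP-23 corpus.

```go
package main

import (
	"fmt"
	"math"
)

// geoMean returns the geometric mean of positive throughput ratios.
func geoMean(xs []float64) float64 {
	sum := 0.0
	for _, x := range xs {
		sum += math.Log(x)
	}
	return math.Exp(sum / float64(len(xs)))
}

// gatePasses reports whether the current step's geo-mean throughput
// is at least the previous step's.
func gatePasses(prev, cur []float64) bool {
	return geoMean(cur) >= geoMean(prev)
}

func main() {
	prev := []float64{1.0, 1.0}               // illustrative ratios, not measurements
	fmt.Println(gatePasses(prev, []float64{1.2, 0.9})) // one benchmark slower, mean still up
	fmt.Println(gatePasses(prev, []float64{1.1, 0.8})) // mean regresses, work stops
}
```

Using the geometric mean means a single slower benchmark does not fail the gate on its own, as long as the corpus as a whole does not regress.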

Option 3 is the explicit fallback. If at step 3 or step 5 we cannot reach 0.9x of Roc-shape numbers on the container corpus, the recommendation pivots to option 3, with MMTk via FFI as the route that avoids writing the collector ourselves. Option 2 is rejected as a primary path because the escape analysis dominates risk; we adopt regions only as an opportunistic optimization layered on whichever of options 1 or 3 wins.

Scope

In scope for the Phase 1 deliverable (option 1):

The slotted Objects table with refcount + freelist (runtime/vm2/vm.go).
The compiler passes for last-use, reuse, and borrow inference (compiler2/).
The Bacon-Rajan cycle collector triggered at Run exit (runtime/vm2/cycles.go, new).
JIT integration: vm2jit (MEP-34) must emit identical dup / drop operations to the interpreter at the matching opcodes.
Bench gates added to the MEP-23 corpus: heap-bytes-retained-at-Run-exit, rc-ops-per-iteration, cycle-collector-time-at-Run-exit.

Out of scope (deferred):

Concurrent reference counting / atomic RC ops. Mochi is single-threaded today; revisit when MEP-15 lands.
Generational RC (Blackburn & McKinley, "Ulterior RC") or Biased RC. Land if the steady-state RC overhead exceeds 10% of runtime on the corpus.
Region allocator for frame-local values (option 2). Land if escape analysis turns out cheap to bolt onto the type checker.
Mark-region collector (option 3). Land if option 1 stalls at step 3 or step 5.

Backwards Compatibility

The change is invisible to source-level Mochi programs. The bytecode changes (new dup / drop opcodes, new OpRunCycleCollect) are emitted by the compiler, not the user. The JIT and the interpreter must coordinate per MEP-34; a JIT'd function calling an interpreted one and vice versa must produce identical RC effects.

The Objects []any slot encoding stays compatible: existing CPtr Cells still resolve to the same Go-level payload. The added refcount and freelist are an internal detail of vm.Objects.

The only externally visible change is that indices into vm.Objects are no longer stable across Runs, because freed slots are reused. Test fixtures that compare len(vm.Objects) between Runs (none today) would break.

Reference Implementation

The implementation lands across three trees:

runtime/vm2/vm.go: Objects []any becomes Objects []objSlot; AddObject consults the freelist; new dup(idx), drop(idx) methods; popFrame runs no per-slot work (RC drops happen at the per-opcode level, not at frame retire).
runtime/vm2/cycles.go (new): Bacon-Rajan cycle collector.
compiler2/rc/ (new): the last-use, reuse, borrow passes.
runtime/jit/vm2jit/: emit matching dup / drop in the ARM64 and AMD64 templates.

Each PR lands behind a build tag (vm2_rc) until the gate passes; the append-only path remains the default until the measured-results MEP lands.

Open Questions

Atomic vs non-atomic RC. Today Mochi has no concurrency; non-atomic is correct. When MEP-15 introduces effect-tracked concurrency, we will need a thread-local-or-shared bit per slot (Inko's model) or BRC (the 2018 Biased RC paper). Defer.
Stable slot identity. If a slot is freed and reused, any Cell that still holds the old index has a stale reference. The RC invariant rules this out (refcount-zero is a precondition for freeing the slot), but the JIT must be audited for paths that read a Cell after a drop. Add a dropped debug flag in test builds.
Cycle-collector cost on Run exit. If a program's cyclic-garbage rate is high, the exit cost dominates. We may want to schedule it more often than at Run exit (every N allocations, or at GC.AssistMark-style triggers). Defer until measured.
MMTk fallback ergonomics. If option 3 becomes the path, the FFI boundary needs to coexist with go test. Bun's cgo+Rust experience (Bun blog 2024-09) suggests this is workable but not free.
Slot indirection vs direct pointer. Once RC is precise, the Objects indirection becomes optional; a ptr-tagged Cell could in principle carry a 48-bit Go pointer directly. Go's GC scanning rules block this today (Cells are uint64, not *X), but a parallel Roots []*objSlot array could expose the pointers. Defer.

References

Reference counting and reuse

Reinking, Xie, de Moura, Leijen. Perceus: Garbage Free Reference Counting with Reuse. PLDI 2021. ACM DL
Bacon, Rajan. Concurrent Cycle Collection in Reference Counted Systems. ECOOP 2001.
Choi, Shull, Torrellas. Biased Reference Counting: Minimizing Atomic Operations in Garbage Collection. PACT 2018.
Roc Programming Language. Functional design page (roc-lang.org/functional).
van Oortmerssen. Memory Management in Lobster. aardappel.github.io
Inko Project. Ownership You Can Count On: A Hybrid Approach to Safe Explicit Memory Management. inko-lang.org/papers/ownership.pdf

Regions

Cheney, Drossopoulou, et al. Reference Capabilities for Flexible Memory Management (Verona / Reggio). OOPSLA 2023. arXiv:2309.02983
Wikipedia. Region-based memory management.

Tracing and hybrid collectors

Blackburn, McKinley. Immix: a Mark-Region Garbage Collector with Space Efficiency, Fast Collection, and Mutator Performance. PLDI 2008.
Zhao, Blackburn, McKinley. Low-Latency, High-Throughput Garbage Collection. PLDI 2022 (LXR).
Memory Management Toolkit. mmtk.io, Ruby 3.4 integration 2025-01
WebKit. Understanding Garbage Collection in JavaScriptCore From Scratch. webkit.org/blog/12967
Norlinder, Österlund, Black-Schaffer, Wrigstad. Mark-Scavenge: Waiting for Trash to Take Itself Out. OOPSLA 2024.
Generational Shenandoah, Java 25. The Perf Parlor 2025-09
Pauseless Garbage Collection in Java 25: ZGC Deep Dive. andrewbaker.ninja 2025-12

Escape analysis

Anand, Adithya, Rustagi, Seth, Sundaresan, Maier, Nandivada, Thakur. Optimistic Stack Allocation and Dynamic Heapification for Managed Runtimes. PLDI 2024. ACM
MEA2: a Lightweight Field-Sensitive Escape Analysis with Points-to Calculation for Go. OOPSLA 2024. splashcon.org

Mochi context

Copyright

This document is placed in the public domain.
