MEP 35. Auto Memory Management for the VM2 Objects Table
| Field | Value |
|---|---|
| MEP | 35 |
| Title | Auto Memory Management for the VM2 Objects Table |
| Author | Mochi core |
| Status | Draft |
| Type | Standards Track |
| Created | 2026-05-17 |
Abstract
vm2 carries every reference-typed value through an indirection table: VM.Objects []any (runtime/vm2/vm.go:14-26). A Cell with tag tagPtr is a 48-bit index into that slice (runtime/vm2/cell.go:40,83-85,137). The table is append-only for the duration of a Run. There is no reclamation. There is no cycle detection. There is no escape analysis. popFrame explicitly defers the work to a future MEP:
> The frame window is sliced off but its contents are NOT zeroed. vm2 has no ptr-tagged Cells yet so there is nothing to un-pin; once the boxed-object subsystem lands, popFrame must zero ptr-tagged slots before shrinking. Tracked in the upcoming subsystem MEP.
That MEP is this one.
The choice is not whether to reclaim. Append-only allocation is a non-starter for any program that touches lists or maps inside a loop. The choice is how. This MEP surveys the design space the last five years have produced (Perceus reuse analysis in Koka and Roc, Inko's runtime single ownership, Lobster's compile-time RC elimination, Verona's Reggio region forest, LXR over Immix, MMTk, generational ZGC and Shenandoah, JSC's Riptide), pins down the constraints unique to a Go-hosted VM, and proposes three concrete implementation paths with their measured-or-projected trade-offs.
The Phase 1 deliverable is option 1 (Perceus-style reference counting with compiler-driven RC elision and cycle collection at Run-exit), gated on the MEP-23 benchmark corpus and on allocs/op and heap-bytes-retained-at-Run-exit metrics introduced here.
Motivation
What "append-only" costs today
Run this program in vm2:
```
fun main() {
  for i in 0..1_000_000 {
    let xs = [i, i+1, i+2]
    println(xs[0])
  }
}
```
The loop allocates one *vmList per iteration (runtime/vm2/lists.go:16-28). Every one of those 1,000,000 lists stays pinned in vm.Objects until Run returns. Peak heap is 1_000_000 * sizeof(*vmList + backing array), even though only one list is reachable at a time from the program's perspective. The Go GC cannot help: every entry in Objects []any is a live root for Go.
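The retention pattern can be reproduced outside vm2. The following is a minimal Go sketch, not the vm2 code: `appendOnlyTable`, `add`, and `runLoop` are hypothetical names, and a plain `[]int64` stands in for `*vmList`.

```go
package main

// appendOnlyTable mimics the append-only behaviour described above:
// every allocation gets a fresh index and nothing is ever released.
// (Hypothetical stand-in for VM.Objects.)
type appendOnlyTable struct {
	objects []any
}

// add stores obj and returns its table index.
func (t *appendOnlyTable) add(obj any) uint64 {
	t.objects = append(t.objects, obj)
	return uint64(len(t.objects) - 1)
}

// runLoop allocates one three-element list per iteration, like the Mochi
// loop above, and returns how many lists the table still pins afterwards.
func runLoop(iterations int) int {
	t := &appendOnlyTable{}
	for i := 0; i < iterations; i++ {
		idx := t.add([]int64{int64(i), int64(i + 1), int64(i + 2)})
		_ = t.objects[idx] // the program reads the list once, then forgets it
	}
	return len(t.objects)
}
```

Under these assumptions `runLoop(n)` always returns `n`: every list is still referenced from the slice, so the Go collector has to treat all of them as live even though the loop body only ever needs one.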
Three workloads in the MEP-23 corpus already trip this: map_get builds and discards 32 maps per iteration, string_cat accumulates an intern table that never shrinks, and quicksort_recursive allocates sublists that the parent overwrites but never frees. The iter_sum and fib_rec numbers in MEP-30 are clean only because the numeric fast path never touches Objects. The minute we benchmark anything container-shaped, the absence of reclamation becomes the dominant cost.
Why we cannot just lean on Go's GC
The host language is Go and its GC is excellent for Go-shaped programs. It does not help here because the lifetime our users care about is shorter than the lifetime visible to Go: a *vmList is reachable through vm.Objects[i] as long as the program runs, even if the program will never read index i again. We have a precise dependency graph at the bytecode level (the JIT and the type system both know which Cells reference which Objects); Go does not. The job is to publish that reachability to a per-VM allocator that lives inside Go's heap but reclaims at vm2's granularity.
Why this is the right time
Three things changed under us in the last twelve months that make the choice less open than it was when MEP-20 deferred this question:
- Perceus is no longer research. Roc ships it. Lean ships it. Koka ships it. The PLDI 2021 paper has been productized for four years and Roc 2024 numbers show Set/List benchmarks within 1.3-2x of C with no GC pauses (Roc Functional, Perceus PLDI'21).
- MMTk reached Ruby 3.4 as an experimental modular GC (Ruby 3.4 shipped in December 2024; Rails at Scale 2025-01-08). The modular interface (`VMBinding`) is the first credible way for a small VM to consume a research-grade collector (Immix, LXR, SemiSpace, MarkSweep) without writing one. The cost is the Rust FFI; the benefit is that the GC algorithm becomes a swap-out.
- Generational Shenandoah became a production feature in JDK 25 (Perf Parlor 2025-09-14) and generational mode has been the ZGC default since JDK 23. The "pauseless concurrent GC pays back even at small heaps" hypothesis has now been re-checked at scale; the answer is that it pays back only above 8 GB. At vm2's scale (interpreter heap rarely above 1 GB) the right primitive is per-allocation reclamation, not concurrent tracing.
The conclusion that follows from those three is that reference counting plus reuse analysis is the front-runner for a small register VM in 2026, not tracing GC. This MEP commits to validating that hypothesis on the MEP-23 corpus before locking it in, hence three options rather than one.
Background: what we are borrowing from
This section is the deep-research output, condensed. Each subsection ends with the one or two ideas we are taking from that line of work.
Perceus (Koka, Roc, Lean 4)
Perceus is precise non-deferred reference counting with three compile-time passes that erase most of the runtime cost (Reinking, Xie, de Moura, Leijen, PLDI 2021):
- Precise RC. Drops are inserted by the compiler at the exact last-use point. There is no liveness extension to end-of-scope. Programs are "garbage free" in the sense that an object's refcount hits zero the instant it becomes unreachable.
- Reuse analysis. If a function consumes a unique value of one shape and constructs a value of the same or smaller shape, the compiler rewrites the construction as an in-place mutation of the consumed cell. Koka's `map` over a unique list does zero allocation.
- Borrow inference. Function parameters annotated as borrowed do not get a refcount bump on entry. Most local helpers in Roc compile to zero RC ops because borrow-inference handles the steady state.
Status as of 2024: Lobster elides ~95% of RC ops through compile-time analysis (Lobster memory management); Roc reports comparable elision rates and Set / List benchmarks within 1.3-2x of C (Roc Functional). The known weakness of this family is cycles: Roc rules them out by language design (no mutation, no cycles); Koka resolves it with a runtime cycle collector run on demand; Lobster reports cycles at exit ("cycle report") and asks the programmer to break them. Bacon and Rajan's concurrent cycle collector (ECOOP 2001) is the underlying algorithm.
What we take: the Perceus shape (drop at last-use, reuse-on-unique, borrow-on-parameter) is exactly what fits vm2: every container Cell already has a known type tag at JIT compile time (MEP-34 §Cell tag-check fast paths), so the analysis has the information it needs without a whole-program closed-world assumption.
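To make reuse-on-unique concrete, here is a minimal Go sketch of the runtime check. It assumes a hypothetical refcounted `list` type standing in for `*vmList`; `mapList` is an illustrative helper, not a vm2 or Koka API.

```go
package main

// list is a hypothetical refcounted container standing in for *vmList.
type list struct {
	refcount uint32
	items    []int64
}

// mapList consumes the caller's reference to in and returns a list with f
// applied to every element. If in is unique (refcount == 1) its storage is
// reused in place; otherwise a fresh list is allocated and the consumed
// reference to the shared input is dropped.
func mapList(in *list, f func(int64) int64) *list {
	if in.refcount == 1 {
		for i, v := range in.items {
			in.items[i] = f(v)
		}
		return in // ownership moves to the result: zero allocation
	}
	out := &list{refcount: 1, items: make([]int64, len(in.items))}
	for i, v := range in.items {
		out.items[i] = f(v)
	}
	in.refcount-- // other owners keep the untouched input alive
	return out
}
```

The unique path performs no allocation, which is the property the reuse pass is after; the shared path falls back to copy-on-write semantics so other holders of the list never observe the mutation.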
Inko: runtime single ownership
Inko (inko-lang.org/papers/ownership.pdf) gives each value one owner; aliases are runtime-checked borrows. A "unique value" cannot be aliased outside its box and can be sent across processes without copy. Where Rust enforces single ownership at compile time, Inko enforces it at runtime with one extra word per object holding a borrow count; on drop, if the borrow count is non-zero the program aborts.
What we take: the runtime enforcement story is the right one for a dynamic register VM. The Rust-style "prove single ownership at compile time" story is incompatible with Mochi's reflection / any paths.
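The runtime check is small enough to sketch. The Go fragment below is an illustration under assumptions, not Inko's implementation: `owned` and its methods are hypothetical names, and `panic` stands in for aborting the program.

```go
package main

// owned is a hypothetical object header with Inko-style borrow counting:
// one owner, plus a count of outstanding runtime-checked borrows.
type owned struct {
	borrows uint32
	payload any
	dropped bool
}

// borrow registers an alias; release retires it.
func (o *owned) borrow()  { o.borrows++ }
func (o *owned) release() { o.borrows-- }

// drop destroys the value on behalf of its single owner. Dropping while
// borrows are outstanding is a program bug, so the sketch panics (the
// stand-in for an abort) instead of leaving a dangling alias behind.
func (o *owned) drop() {
	if o.borrows != 0 {
		panic("drop with outstanding borrows")
	}
	o.payload = nil
	o.dropped = true
}
```

The extra word (`borrows`) is the whole cost of the model: no static proof is required, which is why it composes with dynamically typed paths.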
Lobster: compile-time RC elimination
Lobster (Wouter van Oortmerssen, 2019 talk) interleaves ownership analysis with type checking. The ownership "kind" is a property of every AST node, not just of variables. Result: 95% of RC ops removed at compile time, cycles reported at exit. Influenced Nim's --gc:arc.
What we take: the architecture decision that ownership analysis lives inside the type checker, not as a separate pass. For Mochi this means types/check.go and the soundness work in MEP-7 are the right home, not a new analyzer.
Verona / Reggio: region forest
Verona (Arvidsson et al., OOPSLA 2023) organizes all objects into a forest of isolated regions. Each region has a memory-management strategy (trace, RC, arena) chosen per-region. A thread has one "window of mutability" at a time. Region isolation is enforced by a reference-capability type system; regions are reclaimed wholesale when their owning reference is dropped.
What we take: the per-allocation-site strategy idea. A frame's local lists can live in an arena that dies with the frame; the program's global persistent map cannot. Verona's choice to make strategy a region property rather than a global one is the right abstraction.
Immix / LXR / MMTk
Immix (Blackburn & McKinley, PLDI 2008) is a mark-region collector: 128-byte lines grouped into 32 KB blocks, allocate bump-pointer inside a block, reclaim at line granularity, opportunistically copy fragmented blocks. LXR (Zhao, Blackburn, McKinley, PLDI 2022) layers reference counting over Immix: stop-the-world RC pauses (a few ms) reclaim 90%+ of memory without copying; occasional concurrent traces catch cycles. LXR beat ZGC and Shenandoah on the 2022 latency benchmarks while matching G1 throughput.
MMTk (mmtk.io, Ruby 3.4 integration 2025-01) packages Immix, LXR, SemiSpace, MarkSweep, and friends as a Rust library with a VMBinding trait. Bindings exist for OpenJDK, V8, Julia, Ruby.
What we take: if the RC option underperforms, LXR is the fallback. MMTk gives us LXR (and SemiSpace, MarkSweep, Immix) without writing a collector. The cost is the Rust FFI boundary and a write-barrier on every ptr-tagged Cell store.
Generational ZGC, Shenandoah, JSC Riptide
The big concurrent collectors (Java 25 ZGC, WebKit Riptide) are engineered for heaps in the 8 GB to multi-TB range. Their constant overhead (write barriers, colored pointers in ZGC, Brooks pointers in Shenandoah, conservative root scanning in JSC) is meaningful on small heaps. Riptide's "logical versioning" trick (bump a global version rather than physically clearing mark bits) is interesting and cheap; the rest of the bag of tricks is sized wrong for vm2.
What we take: logical versioning for mark bits, in option 3. Nothing else.
Escape analysis: PLDI 2024 and OOPSLA 2024
"Optimistic Stack Allocation and Dynamic Heapification" (PLDI 2024) and MEA2 (OOPSLA 2024) advance the state of the art on stack-promoting heap allocations in managed runtimes. The PLDI'24 idea: do a precise static escape analysis offline, JIT-compile with optimistic stack allocation, and have the JIT and interpreter perform dynamic heapification (copy the stack object to the heap and rewrite references) when an optimistic assumption is invalidated.
What we take: the dynamic-heapification escape valve, in option 2 (regions). It is the answer to "what happens when an allocation outlives the frame we put it in".
Summary table
| System | Mechanism | Cycle handling | Strength | Weakness |
|---|---|---|---|---|
| Perceus / Koka / Roc | Precise RC + reuse + borrow | Cycle collector or by-design | Predictable, allocation-free hot paths | Atomic RC for shared, cycle cost |
| Inko | RC + runtime borrow check | Single ownership rules out | Deterministic destructors, simple model | Aborts on borrow violation |
| Lobster | RC + compile-time elision | Reported at exit | 95% RC ops gone, simple | Cycles leak until exit |
| Verona / Reggio | Per-region strategy | Per-region | Mixes arena + RC + trace | Type system complexity |
| LXR / Immix | Mark-region + opportunistic copy | Concurrent trace | Throughput rivals G1, ms pauses | Write barrier, MMTk FFI |
| ZGC / Shenandoah | Concurrent copy, colored / Brooks ptrs | Trace | Pauseless at TB heaps | Overweight at MB heaps |
| JSC Riptide | Non-compacting concurrent mark | Trace | Conservative roots, logical versioning | Conservative roots are imprecise |
Constraints unique to vm2
Before laying out options, the constraints that filter them:
- Host is Go. Anything we allocate lives in Go's heap. The Go GC must remain able to scan our objects; otherwise we get use-after-free at the next concurrent mark. This rules out `unsafe.Pointer`-based bit-stealing for object payloads. See MEP-20 §"Why not NaN boxing in Go" for the analogous Cell argument.
- Cells are 8 bytes and NaN-boxed. A ptr-tagged Cell is a 48-bit table index, not a pointer (runtime/vm2/cell.go:40). Adding a header word to every object is free for us (we already have an `any` slot per `Objects[i]`); adding a header bit to every Cell is not (the NaN-box is full).
- The JIT and the interpreter share the frame format (MEP-34). Both must produce identical RC operations and identical write barriers, or a JIT'd function that calls an interpreted one (and vice versa) double-counts or skip-counts. The MEP-34 frame-compatibility contract extends to the memory subsystem.
- Determinism is a feature. Mochi's test corpus diffs program output. Anything non-deterministic (concurrent tracing with cooperation points scheduled by wall-clock) breaks reproducibility. Pauses must be deterministic and scheduled at well-known program points (back-edges, `OpReturn`, allocation overflow).
- No assumption of closed-world. Mochi has FFI (the `Objects` table holds opaque `any` payloads, some of which are Go-side resources). A whole-program escape analysis is not viable in v1. Per-function or per-call-site analysis is.
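The second constraint is easier to see with the bit layout written out. The sketch below is illustrative only: the quiet-NaN prefix, the position of the tag bits, and the value of `tagPtr` are assumptions made for the example and are not taken from runtime/vm2/cell.go.

```go
package main

const (
	// Assumed layout: quiet-NaN prefix in the top 13 bits, a 3-bit tag in
	// bits 48-50, and a 48-bit payload. The real encoding may differ.
	nanBits     uint64 = 0x7FF8_0000_0000_0000
	tagShift           = 48
	tagPtr      uint64 = 0x1 // hypothetical tag value
	payloadMask uint64 = (1 << tagShift) - 1
)

// Cell is an 8-byte NaN-boxed value. To the Go GC it is just an integer.
type Cell uint64

// CPtr builds a ptr-tagged Cell that carries a table index, not a pointer.
func CPtr(idx uint64) Cell {
	return Cell(nanBits | tagPtr<<tagShift | idx&payloadMask)
}

// IsPtr reports whether the Cell carries the ptr tag.
func (c Cell) IsPtr() bool {
	return uint64(c)&^payloadMask == nanBits|tagPtr<<tagShift
}

// Index extracts the 48-bit table index from a ptr-tagged Cell.
func (c Cell) Index() uint64 { return uint64(c) & payloadMask }
```

Because every bit of the word is already spoken for (prefix, tag, payload), per-object metadata such as a refcount has to live in the table slot the index resolves to rather than in the Cell itself.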
Three implementation options
Option 1: Perceus-style reference counting with compile-time RC elision and cycle collection at Run exit
Mechanism. Add one uint32 refcount per Objects[i] slot. Compile every Move, Call, container construct, and container drop opcode to emit dup (increment) and drop (decrement-and-maybe-free) operations on ptr-tagged Cells. Run three compiler passes inside compiler2 (MEP-21):
- Last-use insertion. Place each `drop` at the last lexical use of a ptr-tagged value within a function. Implemented as a backward dataflow over the bytecode, identical in shape to Koka's pass.
- Reuse analysis. When a function consumes a ptr-tagged value via `drop` and then constructs a new container of the same shape, rewrite the pair as an in-place reuse if the refcount is observed to be 1 at runtime. Cells of refcount 1 are "unique"; the JIT can inline this check as a one-instruction comparison (Roc Set.insert idiom).
- Borrow inference. Function parameters used read-only and not stored into long-lived state are marked `borrowed`. Borrowed parameters skip the entry `dup` and the exit `drop`.
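The last-use pass can be sketched for a single basic block. The Go fragment below is a simplification under stated assumptions: `instr` and `insertDrops` are hypothetical, the block is straight-line code, no instruction overwrites a register it also reads, and every register is treated as ptr-tagged. A real pass would run the same backward scan per block inside a full liveness fixpoint.

```go
package main

// instr is a hypothetical three-address instruction: it reads the
// registers in uses and writes def (-1 when it defines nothing).
type instr struct {
	op   string
	def  int
	uses []int
}

// insertDrops walks one basic block backwards and emits "drop r" right
// after the last instruction that reads r. liveOut lists registers still
// needed after the block; they are never dropped here.
func insertDrops(block []instr, liveOut map[int]bool) []instr {
	live := make(map[int]bool, len(liveOut))
	for r := range liveOut {
		live[r] = true
	}
	dropsAfter := make([][]int, len(block))
	for i := len(block) - 1; i >= 0; i-- {
		in := block[i]
		if in.def >= 0 {
			delete(live, in.def) // the value does not exist above its definition
		}
		for _, r := range in.uses {
			if !live[r] { // not needed later: this is the last use
				live[r] = true
				dropsAfter[i] = append(dropsAfter[i], r)
			}
		}
	}
	out := make([]instr, 0, len(block))
	for i, in := range block {
		out = append(out, in)
		for _, r := range dropsAfter[i] {
			out = append(out, instr{op: "drop", def: -1, uses: []int{r}})
		}
	}
	return out
}
```

For a block that builds two lists, reads the first, and then concatenates the second with itself, the sketch places the first `drop` directly after the read rather than at the end of the block, which is the "no liveness extension to end-of-scope" property.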
Cycles. Detected once, at Run exit, by sweeping Objects for entries with refcount > 0 that are unreachable from the registers. Bacon-Rajan (ECOOP 2001) is the algorithm. A program with no cycles pays nothing. A program with cycles pays one full sweep per Run.
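The exit-time sweep described above can be approximated with a plain backup trace. The sketch below is deliberately not the Bacon-Rajan trial-deletion algorithm; it is a simpler mark-from-roots pass that finds the same garbage at Run exit. `slot`, `children`, and `collectCycles` are hypothetical names, and `children` stands in for tracing a payload's outgoing references.

```go
package main

// slot is a hypothetical Objects entry: a refcount plus the indices of
// the slots its payload points at.
type slot struct {
	refcount uint32
	children []uint32
	free     bool
}

// collectCycles marks everything reachable from roots, then frees every
// slot that is unmarked but still has a non-zero refcount. Such slots can
// only be kept alive by a cycle. It returns the number of slots freed.
func collectCycles(objects []slot, roots []uint32) int {
	marked := make([]bool, len(objects))
	work := append([]uint32(nil), roots...)
	for len(work) > 0 {
		idx := work[len(work)-1]
		work = work[:len(work)-1]
		if marked[idx] {
			continue
		}
		marked[idx] = true
		work = append(work, objects[idx].children...)
	}
	freed := 0
	for i := range objects {
		s := &objects[i]
		if !marked[i] && !s.free && s.refcount > 0 {
			s.refcount, s.children, s.free = 0, nil, true
			freed++
		}
	}
	return freed
}
```

The trade-off against trial deletion is that this trace visits every live object, whereas Bacon-Rajan only inspects candidate roots whose count was decremented to a non-zero value; for a once-per-Run sweep that difference may or may not matter and would need measuring.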
Allocator. A small free-list per object class (*vmList, *vmMap, *vmString, *Closure, *big.Int) reuses freed slots in Objects. A live object's slot index is stable for the duration of a Run; freed slots are pushed onto a per-class freelist and reused by the next allocator call. Cells that hold the old index continue to compare equal but dereference to whatever now occupies the slot, which is the failure mode RC must rule out before slot reuse; the refcount-hit-zero check is exactly that proof.
Mochi mapping.
| vm2 concept | New RC concept |
|---|---|
| `Objects []any` | `Objects []objSlot` where `objSlot = {payload any; refcount uint32; next uint32}` |
| `CPtr(idx)` cell construction | Compiler emits `dup` on the resulting Cell |
| `OpReturn` | Compiler emits `drop` on every ptr-tagged register at last use |
| `popFrame` | No change. RC drops happen at last-use, not at frame exit. |
| `Run` exit | `runCycleCollect(vm)` then reset `Objects` |
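The slotted table in the mapping above might look roughly like the following. This is a sketch under assumptions, not the planned implementation: it uses a single freelist rather than one per object class, 32-bit slot indices, a hypothetical `noFree` sentinel and `freeHead` field, and a `drop` that does not recurse into the payload's children.

```go
package main

const noFree = ^uint32(0) // sentinel: no free slot available

// objSlot follows the shape given in the mapping table.
type objSlot struct {
	payload  any
	refcount uint32
	next     uint32 // next free slot while this slot sits on the freelist
}

type VM struct {
	Objects  []objSlot
	freeHead uint32 // hypothetical: head of the slot freelist
}

func NewVM() *VM { return &VM{freeHead: noFree} }

// AddObject stores payload with refcount 1, reusing a freed slot if any.
func (vm *VM) AddObject(payload any) uint32 {
	if vm.freeHead != noFree {
		idx := vm.freeHead
		vm.freeHead = vm.Objects[idx].next
		vm.Objects[idx] = objSlot{payload: payload, refcount: 1, next: noFree}
		return idx
	}
	vm.Objects = append(vm.Objects, objSlot{payload: payload, refcount: 1, next: noFree})
	return uint32(len(vm.Objects) - 1)
}

// dup records one more reference to the slot.
func (vm *VM) dup(idx uint32) { vm.Objects[idx].refcount++ }

// drop releases one reference. At zero it clears the payload so the Go GC
// can reclaim it and pushes the slot onto the freelist for reuse.
func (vm *VM) drop(idx uint32) {
	s := &vm.Objects[idx]
	s.refcount--
	if s.refcount == 0 {
		s.payload = nil
		s.next = vm.freeHead
		vm.freeHead = idx
	}
}
```

Clearing `payload` on the zero transition is what hands the object back to Go: once the slot no longer references it, the host collector is free to reclaim the memory on its own schedule.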
Pros. Allocation-free hot paths once reuse fires (Roc demonstrates this in practice). No GC pauses other than at Run exit. Deterministic. Co-evolves with MEP-7 (the type checker already knows ptr-tagged Cells). Aligns with the precedent set in MEP-20 of "Go-friendly value layout".
Cons. Refcount operations on shared containers must become atomic if Mochi grows concurrency. The atomic op cost is ~3-5 ns on modern x86, less on ARM; Biased Reference Counting (PACT'18) is the known mitigation if it becomes a bottleneck. Cycles in long-running programs leak until Run exit; programs that build cyclic graphs (graph algorithms, mutable cyclic data) regress.
Projected numbers (extrapolating from Roc, Lobster, Koka on equivalent workloads):
| Workload | vs append-only baseline | vs LuaJIT |
|---|---|---|
| `iter_sum` (no refs) | 1.00x (unchanged, no ptr-tagged Cells) | parity |
| `map_get` | 30x lower heap retained, 1.1-1.3x throughput | 1.3-1.8x slower (RC bump on each map insert) |
| `quicksort` (in-place reuse fires) | 100x lower heap retained, 1.0x throughput | parity |
| `string_cat` (interned) | 50x lower retained | 1.2-1.5x slower |
Option 2: Per-frame region arenas with dynamic heapification
Mechanism. Each pushFrame opens a new region in a []Region stack. Every allocation made from inside that frame's bytecode goes into the current region's bump allocator. popFrame discards the region in one operation: drop the bump allocator's slab, no per-object work.
When the type system cannot prove a ptr-tagged value is frame-local (the value escapes via OpReturn, is stored into a parent register, is captured by a closure, or flows into the FFI), the compiler inserts a heapification opcode that copies the object out of the frame region into the parent's region or into a long-lived "global" region. This is the PLDI 2024 idea (Anand et al.), adapted: instead of stack-vs-heap, we have region-vs-region, and heapification rewrites the source Cell to point at the copy.
The compiler's escape analysis is conservative: a value is assumed to escape unless the bytecode demonstrably keeps it frame-local. Cases where it demonstrably stays local (a large fraction of the corpus, per MEA2's measurements on Go): list comprehensions over a local list, map literals consumed by a single call, intermediate strings in string_cat.
Mochi mapping.
| vm2 concept | New region concept |
|---|---|
| `Objects []any` | `Objects []any` becomes one of many regions; one `*Region` per frame |
| `pushFrame` | `vm.Regions = append(vm.Regions, NewRegion())` |
| `popFrame` | `vm.Regions[top].free()` then `Regions = Regions[:top]` |
| Container construct | Allocates in current frame's region |
| Escaping value | Compiler emits OpHeapify dst, src that copies into parent region |
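The region stack and the heapify escape valve from the table above can be sketched as follows. The names `Region`, `Ref`, `alloc`, `heapify`, and `load` are hypothetical, references are (depth, index) pairs rather than NaN-boxed Cells, and `heapify` moves only the top-level object into the parent region; a real implementation would also have to re-home anything the object points at.

```go
package main

// Region is a hypothetical per-frame arena: its objects die together.
type Region struct {
	objects []any
}

// Ref names an object by region depth and index inside that region.
type Ref struct {
	depth, index int
}

type VM struct {
	Regions []*Region
}

// pushFrame opens a fresh region for the new frame.
func (vm *VM) pushFrame() { vm.Regions = append(vm.Regions, &Region{}) }

// popFrame discards the top region in one step, with no per-object work.
func (vm *VM) popFrame() {
	top := len(vm.Regions) - 1
	vm.Regions[top].objects = nil
	vm.Regions[top] = nil
	vm.Regions = vm.Regions[:top]
}

// alloc places obj in the current frame's region.
func (vm *VM) alloc(obj any) Ref {
	top := len(vm.Regions) - 1
	r := vm.Regions[top]
	r.objects = append(r.objects, obj)
	return Ref{depth: top, index: len(r.objects) - 1}
}

// heapify re-homes an escaping object in the parent frame's region and
// returns the rewritten reference. It assumes src.depth >= 1.
func (vm *VM) heapify(src Ref) Ref {
	parent := vm.Regions[src.depth-1]
	parent.objects = append(parent.objects, vm.Regions[src.depth].objects[src.index])
	return Ref{depth: src.depth - 1, index: len(parent.objects) - 1}
}

// load resolves a reference to its payload.
func (vm *VM) load(r Ref) any { return vm.Regions[r.depth].objects[r.index] }
```

A callee that allocates a temporary and a result would heapify only the result before returning; the temporary disappears with the region when `popFrame` runs.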
Pros. O(1) reclamation per frame. No write barriers in the steady state. Aligns naturally with Mochi's frame structure (MEP-34 §frame-compatibility). Cycles among objects in the same region are reclaimed for free.
Cons. Escape analysis is the bottleneck. If most values escape (we have not measured), regions degrade to a slow heap. Long-lived values (the program's global state, the persistent map a user keeps growing) need their own region with one of the other two strategies layered on top, which means we end up implementing option 1 or 3 anyway, just on a smaller footprint. The Verona / Reggio paper makes precisely this argument (Arvidsson et al., OOPSLA 2023 §3.2).
Projected numbers.
| Workload | vs append-only baseline | vs LuaJIT |
|---|---|---|
| `iter_sum` | 1.00x | parity |
| `map_get` | 100x lower retained if maps don't escape, 1.0x retained if they do | parity if non-escaping |
| `quicksort` | 200x lower retained (frame-local sublists) | parity |
| `long_running_repl` (global growing map) | regresses, need fallback strategy | regresses |
Option 3: Mark-region collector over the Objects table (Immix / LXR shape, no MMTk dependency)
Mechanism. Replace Objects []any with a slab-allocated region heap: 32 KB blocks, 128 B lines, bump-pointer inside a block. Every block holds objects of the same class (*vmList, *vmMap, ...) to keep the Go GC's type information precise. A mark phase runs at allocation overflow (when no block of the requested class has a free line):
- Stop the mutator at the next safepoint (back-edge or `OpCall` / `OpReturn`).
- Scan the register file plus the `Objects` indirection (we still need indirection for slot stability under refcount-free reclamation; see below) for ptr-tagged Cells. The Cells are the precise root set; no conservative scan.
- Mark lines and blocks reachable. Use logical versioning (JSC Riptide) so mark bits do not need physical clearing between cycles.
- Reclaim free lines for allocation. Compact only fragmented blocks via the Immix opportunistic-copy path.
LXR layers RC on top of this for incrementality. We would start without LXR and adopt it in a follow-on MEP if pause times exceed the budget.
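Logical versioning, as used in the mark step above, replaces a clearable mark bit with a per-object epoch stamp. The Go sketch below illustrates the idea only; `heap`, `markedAt`, and `children` are hypothetical, objects are identified by table index, and epoch wrap-around is ignored.

```go
package main

// heap is a hypothetical object table whose mark state is a per-object
// epoch stamp rather than a bit that must be cleared between cycles.
type heap struct {
	epoch    uint32
	markedAt []uint32   // markedAt[i] == epoch means "marked this cycle"
	children [][]uint32 // outgoing references per object
}

// isMarked reports whether object i was reached in the current cycle.
func (h *heap) isMarked(i uint32) bool { return h.markedAt[i] == h.epoch }

// mark starts a new cycle by bumping the epoch, which logically unmarks
// every object in O(1), then traces from roots. It returns the number of
// objects found live.
func (h *heap) mark(roots []uint32) int {
	h.epoch++
	live := 0
	work := append([]uint32(nil), roots...)
	for len(work) > 0 {
		i := work[len(work)-1]
		work = work[:len(work)-1]
		if h.isMarked(i) {
			continue
		}
		h.markedAt[i] = h.epoch
		live++
		work = append(work, h.children[i]...)
	}
	return live
}
```

Two consecutive cycles with different roots give disjoint marked sets without any pass over `markedAt` in between, which is the cost the technique removes.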
Mochi mapping.
| vm2 concept | New collector concept |
|---|---|
| `Objects []any` | Per-class block lists; ptr-tagged Cell still carries a 48-bit identity but resolved through a forwarding table |
| `pushFrame` / `popFrame` | No change |
| `OpCall` / back-edge | Safepoint check; collector runs if requested |
| Container construct | Bump-alloc in current block; allocate new block on overflow |
| `Run` exit | Discard all blocks |
Pros. Throughput within the Immix / LXR range, which is to say competitive with G1 on Java benchmarks. Cycles handled. No compiler-side analysis required (no Perceus-style passes, no escape analysis). Simplifies the JIT contract: the JIT emits a write barrier on ptr-tagged stores, nothing else.
Cons. A write barrier on every ptr-tagged Cell write. The barrier is one branch and one byte-store (LXR §3.2) but it is not zero. Pause times are bounded but not zero: a Stop-The-World mark is in the millisecond range for heaps with a few hundred thousand live objects. The implementation effort is the largest of the three: we are writing a collector. We could lean on MMTk via cgo or a Rust subprocess; the cost is the FFI boundary and a build-system dependency on cargo and a Rust toolchain in CI.
Projected numbers.
| Workload | vs append-only baseline | vs LuaJIT |
|---|---|---|
| `iter_sum` | 0.97x (write-barrier cost) | parity |
| `map_get` | 50x lower retained, 1.0-1.1x throughput | 1.1-1.3x slower |
| `quicksort` | 50x lower retained, 0.95x throughput | parity |
| cycle-heavy graphs | reclaimed correctly | parity |
Comparison
| Property | Option 1 (Perceus RC) | Option 2 (Regions) | Option 3 (Mark-region) |
|---|---|---|---|
| Compiler work | Last-use, reuse, borrow passes | Escape analysis + heapify opcode | Write-barrier emission only |
| Runtime work | RC bump on dup / drop | None in steady state | Mark sweep at overflow |
| Cycles | Bacon-Rajan at Run exit | Free per region | Free during mark |
| Pause | None (RC) or one per Run (cycles) | None | ms-scale per mark |
| Heap overhead | 4 bytes per Object | One slab per frame | Block headers + mark bitvectors |
| Determinism | Full | Full | Pause time varies with heap |
| FFI | Clean (RC visible to Go) | Clean | Clean if pure-Go, FFI if MMTk |
| Failure mode | Cycles leak until Run exit | Escape analysis pessimism | Pause spikes under allocation pressure |
| Lines of code (estimate) | 1500-2500 in compiler + runtime | 2000-3000 in compiler + runtime | 3000-5000 in runtime, or MMTk FFI |
| Risk of regressing MEP-30 numeric path | None (no ptr-tag in fib_rec) | None | Small (write barrier dead-coded out) |
Recommendation
Option 1 is the proposed Phase 1. Specifically:
1. Land the slot freelist + refcount in `Objects` first, with the compiler emitting `dup` / `drop` at every ptr-tagged register move. This is the unoptimized RC baseline. Measure heap-retained on the MEP-23 corpus.
2. Add last-use insertion. Measure RC op count and heap-retained.
3. Add reuse analysis. Measure allocation rate on `quicksort` and `map_get`.
4. Add borrow inference. Measure RC op count on closure-heavy workloads.
5. Add Bacon-Rajan cycle collection. Measure on graph workloads.
Each of the five steps lands as a separate PR with its own gate against the MEP-23 baseline. If any step regresses geo-mean throughput against the previous step, the work stops and we re-evaluate against option 3.
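The per-step gate reduces to a geometric mean of per-workload throughput ratios. A minimal sketch follows, assuming throughput is reported as operations per second per workload and that both slices are non-empty and list the workloads in the same order; `geoMean` and `gatePasses` are hypothetical helper names, not part of the MEP-23 tooling.

```go
package main

import "math"

// geoMean returns the geometric mean of strictly positive ratios,
// computed in log space to avoid overflow on long corpora.
func geoMean(ratios []float64) float64 {
	sum := 0.0
	for _, r := range ratios {
		sum += math.Log(r)
	}
	return math.Exp(sum / float64(len(ratios)))
}

// gatePasses reports whether the current step has not regressed geo-mean
// throughput against the previous step. prev and cur hold ops/sec per
// workload, in the same order.
func gatePasses(prev, cur []float64) bool {
	ratios := make([]float64, len(prev))
	for i := range prev {
		ratios[i] = cur[i] / prev[i]
	}
	return geoMean(ratios) >= 1.0
}
```

Using the geometric rather than the arithmetic mean keeps one very fast workload from masking a regression on the others, since each workload contributes its ratio multiplicatively.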
Option 3 is the explicit fallback. If at step 3 or step 5 we cannot reach 0.9x of Roc-shape numbers on the container corpus, the recommendation pivots to option 3, implemented either in pure Go or on MMTk via FFI. Option 2 is rejected as a primary path because the escape analysis dominates risk; we adopt regions only as an opportunistic optimization layered on whichever of options 1 or 3 wins.
Scope
In scope for the Phase 1 deliverable (option 1):
- The slotted Objects table with refcount + freelist (`runtime/vm2/vm.go`).
- The compiler passes for last-use, reuse, and borrow inference (`compiler2/`).
- The Bacon-Rajan cycle collector triggered at Run exit (`runtime/vm2/cycles.go`, new).
- JIT integration: vm2jit (MEP-34) must emit identical `dup` / `drop` operations to the interpreter at the matching opcodes.
- Bench gates added to the MEP-23 corpus: `heap-bytes-retained-at-Run-exit`, `rc-ops-per-iteration`, `cycle-collector-time-at-Run-exit`.
Out of scope (deferred):
- Concurrent reference counting / atomic RC ops. Mochi is single-threaded today; revisit when MEP-15 lands.
- Generational RC (Blackburn & McKinley, "Ulterior RC") or Biased RC. Land if the steady-state RC overhead exceeds 10% of runtime on the corpus.
- Region allocator for frame-local values (option 2). Land if escape analysis turns out cheap to bolt onto the type checker.
- Mark-region collector (option 3). Land if option 1 stalls at step 3 or step 5.
Backwards Compatibility
The change is invisible to source-level Mochi programs. The bytecode changes (new dup / drop opcodes, new OpRunCycleCollect) are emitted by the compiler, not the user. The JIT and the interpreter must coordinate per MEP-34; a JIT'd function calling an interpreted one and vice versa must produce identical RC effects.
The Objects []any slot encoding stays compatible: existing CPtr Cells still resolve to the same Go-level payload. The added refcount and freelist are an internal detail of vm.Objects.
The only externally visible change is that slot indices in vm.Objects are not stable across Runs. Test fixtures that compare len(vm.Objects) between Runs (none today) would break.
Reference Implementation
The implementation lands across three trees:
- `runtime/vm2/vm.go`: `Objects []any` becomes `Objects []objSlot`; `AddObject` consults the freelist; new `dup(idx)`, `drop(idx)` methods; `popFrame` runs no per-slot work (RC drops happen at the per-opcode level, not at frame retire).
- `runtime/vm2/cycles.go` (new): Bacon-Rajan cycle collector.
- `compiler2/rc/` (new): the last-use, reuse, borrow passes.
- `runtime/jit/vm2jit/`: emit matching `dup` / `drop` in the ARM64 and AMD64 templates.
Each PR lands behind a build tag (vm2_rc) until the gate passes; the append-only path remains the default until the measured-results MEP lands.
Open Questions
- Atomic vs non-atomic RC. Today Mochi has no concurrency; non-atomic is correct. When MEP-15 introduces effect-tracked concurrency, we will need a thread-local-or-shared bit per slot (Inko's model) or BRC (the 2018 Biased RC paper). Defer.
- Stable slot identity. If a slot is freed and reused, any Cell that still holds the old index has a stale reference. The RC invariant rules this out (refcount-zero is a precondition for freeing the slot), but the JIT must be audited for paths that read a Cell after a drop. Add a `dropped` debug flag in test builds.
- Cycle-collector cost on Run exit. If a program's cyclic-garbage rate is high, the exit cost dominates. We may want to schedule it more often than at Run exit (every N allocations, or at GC.AssistMark-style triggers). Defer until measured.
- MMTk fallback ergonomics. If option 3 becomes the path, the FFI boundary needs to coexist with `go test`. Bun's cgo+Rust experience (Bun blog 2024-09) suggests this is workable but not free.
- Slot indirection vs direct pointer. Once RC is precise, the `Objects` indirection becomes optional; a ptr-tagged Cell could in principle carry a 48-bit Go pointer directly. Go's GC scanning rules block this today (Cells are `uint64`, not `*X`), but a parallel `Roots []*objSlot` array could expose the pointers. Defer.
References
Reference counting and reuse
- Reinking, Xie, de Moura, Leijen. Perceus: Garbage Free Reference Counting with Reuse. PLDI 2021.
- Bacon, Rajan. Concurrent Cycle Collection in Reference Counted Systems. ECOOP 2001.
- Choi, Shull, Torrellas. Biased Reference Counting: Minimizing Atomic Operations in Garbage Collection. PACT 2018.
- Roc Programming Language. Functional design page (roc-lang.org/functional).
- van Oortmerssen. Memory Management in Lobster. aardappel.github.io
- Dingle, Bacon. Ownership You Can Count On: A Hybrid Approach to Safe Explicit Memory Management. Hosted by the Inko project at inko-lang.org/papers/ownership.pdf
Regions
- Arvidsson, Castegren, Clebsch, Drossopoulou, Noble, Parkinson, Wrigstad. Reference Capabilities for Flexible Memory Management (Verona / Reggio). OOPSLA 2023. arXiv:2309.02983
- Wikipedia. Region-based memory management.
Tracing and hybrid collectors
- Blackburn, McKinley. Immix: a Mark-Region Garbage Collector with Space Efficiency, Fast Collection, and Mutator Performance. PLDI 2008.
- Zhao, Blackburn, McKinley. Low-Latency, High-Throughput Garbage Collection. PLDI 2022 (LXR).
- Memory Management Toolkit. mmtk.io, Ruby 3.4 integration 2025-01
- WebKit. Understanding Garbage Collection in JavaScriptCore From Scratch. webkit.org/blog/12967
- Norlinder, Österlund, Black-Schaffer, Wrigstad. Mark-Scavenge: Waiting for Trash to Take Itself Out. OOPSLA 2024.
- Generational Shenandoah, Java 25. The Perf Parlor 2025-09
- Pauseless Garbage Collection in Java 25: ZGC Deep Dive. andrewbaker.ninja 2025-12
Escape analysis
- Anand, Adithya, Rustagi, Seth, Sundaresan, Maier, Nandivada, Thakur. Optimistic Stack Allocation and Dynamic Heapification for Managed Runtimes. PLDI 2024. ACM
- MEA2: a Lightweight Field-Sensitive Escape Analysis with Points-to Calculation for Go. OOPSLA 2024. splashcon.org
Mochi context
- MEP-7. Soundness
- MEP-15. Effects, Mutability, and Purity
- MEP-20. Value Representation and Allocation Discipline
- MEP-21. Compiler2 and VM2 Co-Design
- MEP-23. Cross-language Baseline Benchmarks
- MEP-34. VM2 Full-Opcode JIT
Copyright
This document is placed in the public domain.