MEP 26. VM2 Data Model Option 1 - Monomorphic + One-Way Migration

| Field | Value |
| --- | --- |
| MEP | 26 |
| Title | VM2 Data Model Option 1 - Monomorphic + One-Way Migration |
| Author | Mochi core |
| Status | Draft |
| Type | Standards Track |
| Created | 2026-05-16 |

Abstract

MEP 25 specifies the in-memory shapes for vm2's five container kinds (strings, lists, maps, sets, structs). This MEP specifies how to dispatch on those shapes using one-way monotonic migration: every container begins life in the most specialized shape that fits its first operation; whenever an operation cannot be served by the current shape, the container migrates in place to the next-most-general shape. Migration is one-shot, O(n), and never reversed. Steady state is monomorphic: a hot loop touching a container of a fixed shape pays one shape guard per access and zero runtime branching beyond that.

This is Option 1 of three. MEP-27 specifies the polymorphic + inline-cache alternative; MEP-28 specifies the AOT codegen alternative. All three share the same shape spec; only the dispatch strategy differs. A common comparison matrix appears at the end of each MEP so a reader entering at any point can pick.

Motivation

vm2 is currently 1.5x behind Lua on lists/fill_sum and 4.9x behind on maps/fill_sum (MEP-23 baseline). Both gaps are the cost of carrying generic shapes on workloads whose actual element/key type is known and stable. The question is not whether to specialize but when: at compile time, at first-access time, or at every access. Migration picks first-access time:

  • Mochi's front end already produces typed IR. Most lists are typed list<int>, most map keys are typed int or string. The first push or set already tells the runtime which shape is right; subsequent operations don't add information.
  • Reversibility is not earned. Workloads that flip a list back and forth between int-only and mixed are not present in the corpus, and bidirectional specialization is a known anti-pattern (see PyPy storage strategies' rationale for IntStrategy → ObjectStrategy being one-way only).
  • Single-guard hot paths suit a future JIT. A JIT can compile one specialized handler per shape and let the shape guard fail into a recompile / fall-through. Migration's monomorphic steady state means almost every guard succeeds.

Lineage

Migration as a dispatch strategy is one of the most-deployed techniques in modern dynamic-language runtimes. The relevant prior art:

  1. V8 hidden classes (Lars Bak, 2008, drawing on Self 1989). Each JS object carries a hidden class pointer; field-add operations transition the object to a child hidden class via a precomputed transition tree. The transition is one-way in the steady state (you cannot "remove" a field's class lineage), and the JIT compiles inline caches keyed on the hidden class pointer. Property reads become a single pointer compare + offset load. See "An Efficient Implementation of SELF" (Chambers, Ungar, Lee, OOPSLA 1989) for the foundational design.

  2. PyPy storage strategies (Bolz et al., 2013). Each PyPy list has a Strategy object: EmptyStrategy → IntegerStrategy → FloatStrategy → ObjectStrategy, with migrations triggered by the first operation that does not fit. IntegerStrategy stores raw int64 slots; ObjectStrategy stores boxed PyObject pointers. Migration is one-way; the paper explicitly rejects reverse migration because it caused workload flapping in early experiments. ("Storage Strategies for Collections in Dynamically Typed Languages", Bolz, Diekmann, Tratt, OOPSLA 2013.) The vm2 list lattice in §1 below is a direct port of this idea.

  3. Lua table promotion (Roberto Ierusalimschy, 2005, see Lua 5.1 source). Lua tables start as an array part + hash part; insertion of an out-of-sequence key promotes a slot from array to hash. The promotion is one-way per slot. ("The Implementation of Lua 5.0", Ierusalimschy, de Figueiredo, Celes, J. Universal Computer Science, 2005.)

  4. CRuby shape tree (Aaron Patterson et al., Ruby 3.2, 2022). Ruby objects carry a shape ID. Adding an instance variable transitions to a child shape. The shape tree is shared across all objects of a class; the per-object cost is one 32-bit shape ID. This is the closest contemporary analog to the inline-2 / flat-N struct decision in MEP-25 §5.

  5. Rust SmallVec<[T; N]> and smallvec crate. Inline storage up to N elements, heap allocation beyond. The spill from inline to heap is exactly the migration model: O(n) copy once, never reversed. The Rust ecosystem has shown this scales to high-throughput production code (rustc, Servo, ripgrep).

  6. HotSpot inflated locks (Bacon et al., 1998). Object monitors start as a thin lock (one bit in the object header) and inflate to a heavyweight monitor on first contention. Inflation is one-way. The thin/inflated distinction is the same monomorphic-then-general pattern applied to synchronization.

The common pattern across all six: most objects never leave their initial shape, so the steady-state cost is one shape guard per access. Migration cost is amortized to zero over the object's lifetime when O(migrations) ≪ O(operations), which holds for every workload the corpus has so far measured.

The lattice, formally

A shape lattice for a container kind K is a finite directed acyclic graph (V, E) where:

  • V is the set of allowed shapes for kind K (e.g., for lists: {ListI64, List}).
  • E ⊆ V × V is the set of allowed migrations. An edge (s₁, s₂) means a container in shape s₁ may migrate to shape s₂ on a triggering operation; the reverse edge is never present.
  • There is a unique bottom shape (the most specialized; the one OpNew* produces) and a unique top shape (the most general; every other shape can reach it).
  • The lattice has no cycles. A container, once at shape s, never returns to a shape strictly below s in the lattice.

Per-kind lattices (matching MEP-25):

Strings: Inline ─→ Flat ─→ Rope (3 shapes, monotonic by length/depth)
Lists: ListI64 ─→ List (2 shapes)
Maps: MapEmpty ─→ MapI64 ─→ Map (3 shapes)
Sets: SetEmpty ─→ SetI64 ─→ Set (3 shapes)
Structs: StructSmall ─→ Struct (2 shapes, chosen at OpNewStruct from typeID)

Strings are a slight special case: the variant is picked at construction and changes only when concat overflows the current variant's bounds. The same one-way property holds (no rope ever becomes flat again unless a deliberate OpStrFlatten is emitted; flatten is treated as a fresh value, not a migration).

Invariant. For every container c at shape s and every opcode op, exactly one of three things happens:

  1. op is in s's served set: execute on s. No migration.
  2. op is not in s's served set, but there is a unique edge (s, s') such that op is in the served set of s': migrate to s', then execute. The migration replaces Objects[c.Ptr()] in place; the Cell encoding is unchanged.
  3. op is not served by s or any reachable shape from s: trap. (This case does not occur for any in-spec opcode against any in-spec shape; it is the type-error case.)
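The three-way invariant can be sketched as an opcode handler. This is an illustrative stand-in, not the MEP-25 implementation: `vmListI64`, `vmList`, the `objects` slice, and `listPush`/`migrateListToGeneric` are hypothetical names, and the boxed-value plumbing is simplified to Go's `any`.

```go
package main

import "fmt"

// Hypothetical stand-ins for the MEP-25 list shapes.
type vmListI64 struct{ data []int64 }
type vmList struct{ data []any }

// objects plays the role of the Objects table: one slot per live container.
var objects []any

// listPush dispatches per the invariant: serve on the current shape if the
// value fits (case 1), otherwise migrate along the one-way lattice edge and
// retry (case 2). Case 3 (trap) cannot arise for pushes, because the top
// shape List serves every value.
func listPush(idx int, v any) {
	switch s := objects[idx].(type) {
	case *vmListI64:
		if i, ok := v.(int64); ok {
			s.data = append(s.data, i) // case 1: served by ListI64
			return
		}
		migrateListToGeneric(idx) // case 2: ListI64 -> List, never reversed
		listPush(idx, v)
	case *vmList:
		s.data = append(s.data, v) // case 1: top shape serves all values
	}
}

// migrateListToGeneric swaps the shape in place at the same Objects index.
func migrateListToGeneric(idx int) {
	src := objects[idx].(*vmListI64)
	dst := &vmList{data: make([]any, len(src.data))}
	for i, x := range src.data {
		dst.data[i] = x
	}
	objects[idx] = dst
}

func main() {
	objects = append(objects, &vmListI64{})
	listPush(0, int64(1))
	listPush(0, int64(2))
	listPush(0, "three") // non-int: triggers the one-way migration
	fmt.Println(len(objects[0].(*vmList).data)) // 3
}
```

Note that the retry after migration is exactly one level deep here, because the list lattice has only two shapes; for the three-shape map lattice the same pattern recurses at most twice.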

Migration mechanics

In-place Objects-table swap. Every container Cell encodes tagPtr | objects_index. Migration allocates a new shape object and stores it at the same Objects-table index, overwriting the old pointer. Any other Cell that referenced the same container is automatically updated, because they all dereference through Objects[index]. This is the same trick V8 uses for in-place hidden-class transitions on objects whose layout changes.
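The aliasing property can be demonstrated in a few lines. This is a sketch with illustrative names: `Cell` here holds only an Objects index rather than the real tagPtr bit encoding, and `migrateAt` is a hypothetical helper. The point is that every alias dereferences through `objects[idx]`, so overwriting that one slot updates all of them at once.

```go
package main

import "fmt"

// Cell stands in for the tagPtr | objects_index encoding: copying a Cell
// copies the index, not the container.
type Cell struct{ idx int }

var objects []any

// migrateAt replaces the specialized []int64 at objects[idx] with the
// general []any shape, at the same index, so all aliases update together.
func migrateAt(idx int) {
	src := *objects[idx].(*[]int64)
	dst := make([]any, len(src))
	for i, v := range src {
		dst[i] = v
	}
	objects[idx] = &dst
}

func main() {
	objects = []any{&[]int64{1, 2, 3}}
	a := Cell{idx: 0}
	b := a // alias: same index, same container

	migrateAt(a.idx)

	// Both aliases now see the migrated shape with no per-alias fixup.
	fmt.Println(len(*objects[b.idx].(*[]any))) // 3
}
```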

Atomicity. vm2 is single-threaded today (per MEP-21 v2), so migration is observably atomic: the dispatch loop allocates the new shape, copies the data, swaps the pointer, then returns to the opcode handler. A multi-threaded vm2 would need either a stop-the-world phase per migration or a per-container lock; that is out of scope here.

Cost bound, per container, per program lifetime. Let n be the number of elements the container ever holds and k be the number of distinct shapes the container ever visits. The total migration cost is O(n × k). For the lattices above, k ≤ 3, so migration is O(n) per container over its lifetime.

Cost bound, per operation, in steady state. One shape guard (load Objects[i], type-assert to the expected shape). On a modern CPU this is one L1 load + one branch. The branch is monomorphic on hot loops (always the same shape), so the BTB predicts it perfectly. Measured cost: ~1.5 cycles on Apple M-series.

Per-kind specification

1. Strings

Lattice: Inline → Flat → Rope. Migration triggers:

  • Inline → Flat: any concat whose combined length exceeds 5 and is ≤ flatConcatThreshold (proposed 64 bytes per MEP-25 §1).
  • Flat → Rope: any concat whose combined length exceeds flatConcatThreshold and whose result's would-be depth ≤ maxRopeDepth.
  • Rope → Flat: not a migration. If depth would exceed maxRopeDepth, the rope is flattened into a fresh Flat string (a new Cell, not the same Objects entry). The original Rope remains a Rope.
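The triggers above amount to a small decision tree on the combined length (rope-depth bookkeeping omitted). A minimal sketch, assuming the proposed thresholds from MEP-25 §1 — an inline payload capacity of 5 bytes and a flatConcatThreshold of 64 — with `concatShape` as a hypothetical helper name:

```go
package main

import "fmt"

// Thresholds assumed from MEP-25 §1; both are proposals, not settled values.
const (
	inlineMax           = 5  // inline payload capacity in bytes
	flatConcatThreshold = 64 // proposed flat/rope crossover
)

// concatShape picks the result shape for a concat from the combined byte
// length. The maxRopeDepth check (which forces a flatten into a fresh Flat
// value rather than a migration) is elided.
func concatShape(combinedLen int) string {
	switch {
	case combinedLen <= inlineMax:
		return "Inline"
	case combinedLen <= flatConcatThreshold:
		return "Flat"
	default:
		return "Rope"
	}
}

func main() {
	fmt.Println(concatShape(4))   // Inline
	fmt.Println(concatShape(30))  // Flat
	fmt.Println(concatShape(200)) // Rope
}
```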

Steady-state cost per op:

| Op | Inline | Flat | Rope |
| --- | --- | --- | --- |
| OpLenStr | O(1) payload extract | O(1) struct field | O(1) cached len |
| OpIndexStr | O(1) payload | O(1) slice idx | O(depth) ≤ 32 |
| OpConcatStr | O(1) pack | O(n) buffer copy | O(1) alloc node |
| OpEqualStr | O(1) bit compare | O(n) byte compare with hash short-circuit | O(n) leaf walk with hash short-circuit |
| OpHashStr | O(n) FNV (≤5 bytes) | O(n) FNV memoized | O(n) FNV memoized at node |

2. Lists

Lattice: ListI64 → List. Migration trigger: OpListPush or OpListSet with a non-int value against ListI64.

Migration cost: allocate *vmList{data: make([]Cell, len(src.data), cap(src.data))}, then for i, v := range src.data { dst.data[i] = CInt(v) }. This is ~5 ns per element on M-series; for a 100-element list, migration is ~500 ns, less than a single mapKeyOf boxing on the generic path.

Steady-state cost per op (ListI64 hot path):

  • OpListGet: one shape guard + one int64 slice load + one CInt (which is bit packing, no alloc). ~3 cycles.
  • OpListPush: one shape guard + Go slice append. Amortized O(1) at ~5 cycles.
  • OpListLen: one shape guard + one length load. ~2 cycles.
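The hot path above is essentially one type assertion followed by a slice load. A sketch under the same assumptions as before (`vmListI64`, `objects`, and `opListGet` are illustrative names; the real handler packs into a Cell rather than returning `any`):

```go
package main

import "fmt"

type vmListI64 struct{ data []int64 }

var objects []any

// opListGet sketches the ListI64 hot path: the type assertion is the single
// shape guard, the slice index is the int64 load, and returning the value
// stands in for the no-alloc CInt bit pack. The generic fallback is elided.
func opListGet(idx, i int) (any, bool) {
	if s, ok := objects[idx].(*vmListI64); ok { // shape guard
		return s.data[i], true // int64 load + pack
	}
	return nil, false // generic List path (not shown)
}

func main() {
	objects = append(objects, &vmListI64{data: []int64{10, 20, 30}})
	v, _ := opListGet(0, 1)
	fmt.Println(v) // 20
}
```

Because the guard is the first instruction and almost always succeeds on a monomorphic workload, the branch predictor makes its cost close to free, which is where the ~3-cycle figure comes from.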

3. Maps

Lattice: MapEmpty → MapI64 → Map. Migration triggers per MEP-25 §3.

Migration cost: MapEmpty → MapI64 is one *vmMapI64 allocation with cap=8 (no copy). MapI64 → Map is one *vmMap allocation plus for i := range src.ctrl { if src.ctrl[i] occupied { dst.entries[src.keys[i]] = src.vals[i] } }. For a 64-entry int map this is ~3 µs; for 1024 entries, ~50 µs. Both are tiny compared to the lifetime cost saved.
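The MapI64 → Map copy loop can be written out concretely. This is a sketch, not the MEP-25 code: `vmMapI64` here uses a one-byte occupied flag where the real ctrl bytes carry hash bits, and the general Map's backing store is simplified to a plain Go map.

```go
package main

import "fmt"

// Minimal MapI64 stand-in: parallel ctrl/keys/vals arrays as in the MEP-25
// Swiss-table layout. Only occupancy matters for the migration walk.
type vmMapI64 struct {
	ctrl []byte // 0 = empty, 1 = occupied (real ctrl bytes also carry hash bits)
	keys []int64
	vals []int64
}

// migrateMapI64 walks the ctrl bytes and reinserts every occupied slot into
// the general shape's backing store. Cost is one pass over the table.
func migrateMapI64(src *vmMapI64) map[any]any {
	dst := make(map[any]any, len(src.keys))
	for i := range src.ctrl {
		if src.ctrl[i] != 0 { // occupied slot
			dst[src.keys[i]] = src.vals[i]
		}
	}
	return dst
}

func main() {
	src := &vmMapI64{
		ctrl: []byte{1, 0, 1, 0},
		keys: []int64{7, 0, 9, 0},
		vals: []int64{70, 0, 90, 0},
	}
	m := migrateMapI64(src)
	fmt.Println(len(m), m[int64(7)], m[int64(9)]) // 2 70 90
}
```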

Steady-state cost per op (MapI64 hot path):

  • OpMapGet: shape guard + splitmix64 hash + linear probe over ctrl bytes (≤2 probes at 7/8 load factor) + int compare. ~8 cycles.
  • OpMapSet (key already present): same as Get + value store. ~12 cycles.
  • OpMapSet (new key): same as Get for the probe + len/ctrl update.

Compare to generic Map shape on int keys: mapKeyOf type-switch (~5 cycles) + Go map lookup (~30 cycles + interface boxing alloc on insertion). Migration's MapI64 path is roughly 4-5× cheaper than the generic path on the same workload, which closes the 4.86× Lua gap.
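The splitmix64 + linear-probe lookup sketched below uses the standard splitmix64 finalizer constants; `probe` is a hypothetical scalar version of the lookup (a real Swiss table compares a group of ctrl bytes at once with SIMD), and it assumes a power-of-two table with at least one empty slot so the probe terminates.

```go
package main

import "fmt"

// splitmix64: the standard finalizer, used here as the integer key hash.
func splitmix64(x uint64) uint64 {
	x += 0x9e3779b97f4a7c15
	x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9
	x = (x ^ (x >> 27)) * 0x94d049bb133111eb
	return x ^ (x >> 31)
}

// probe sketches the MapI64 lookup: hash, mask to a home slot, then scan
// ctrl bytes until the key matches or an empty slot proves absence.
// Assumes power-of-two table size and at least one empty slot.
func probe(ctrl []byte, keys []int64, k int64) (int, bool) {
	mask := uint64(len(ctrl) - 1)
	i := splitmix64(uint64(k)) & mask
	for {
		switch ctrl[i] {
		case 0: // empty: key absent, probe ends
			return 0, false
		case 1: // occupied: compare the key
			if keys[i] == k {
				return int(i), true
			}
		}
		i = (i + 1) & mask
	}
}

func main() {
	ctrl := make([]byte, 8)
	keys := make([]int64, 8)
	home := splitmix64(42) & 7
	ctrl[home], keys[home] = 1, 42
	_, ok := probe(ctrl, keys, 42)
	fmt.Println(ok) // true
	_, ok = probe(ctrl, keys, 99)
	fmt.Println(ok) // false
}
```

The ≤2-probe expectation at 7/8 load factor is a property of the grouped ctrl-byte scan in the real layout; the scalar loop above is only the logic, not the performance model.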

4. Sets

Lattice: SetEmpty → SetI64 → Set. Mirrors maps exactly; per-op cost identical (less the value slot, saving 8 bytes per entry).

5. Structs

Lattice: StructSmall → Struct with the shape chosen at construction, not by migration. OpNewStruct reads Program.Types[typeID].NumFields once and picks StructSmall if NumFields ≤ 2 else Struct. Field writes never migrate (the field set is fixed at the type level). Structs therefore have a degenerate one-element lattice per object after construction; the "migration" framework still applies trivially.

This is intentional: Mochi's typed front end means struct shape is known at compile time, so there is nothing to discover at first-write time the way there is for collections.
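The construction-time decision is one table lookup and one comparison. A sketch with illustrative stand-ins (`typeInfo`, `programTypes`, and `newStructShape` are hypothetical names for the Program.Types machinery):

```go
package main

import "fmt"

// typeInfo stands in for an entry of Program.Types.
type typeInfo struct{ NumFields int }

var programTypes = []typeInfo{
	{NumFields: 2}, // typeID 0: e.g. a two-field point
	{NumFields: 5}, // typeID 1: a wider record
}

// newStructShape sketches the OpNewStruct decision: read NumFields once at
// construction and pick the shape; no field write ever migrates it.
func newStructShape(typeID int) string {
	if programTypes[typeID].NumFields <= 2 {
		return "StructSmall"
	}
	return "Struct"
}

func main() {
	fmt.Println(newStructShape(0)) // StructSmall
	fmt.Println(newStructShape(1)) // Struct
}
```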

Migration cost amortization

A standard amortized argument: assign each container a credit of 1 unit per element ever inserted. Migration of a container of size n costs O(n) and consumes the n credits accumulated so far. Subsequent operations on the migrated container pay no migration cost. Therefore the amortized per-operation cost is O(1) over the container's lifetime.

This is the same argument PyPy's paper makes for storage strategies. The argument breaks if migrations are reversible (a container can yo-yo and consume credits faster than they accumulate), which is precisely why this MEP is one-way only.

JIT integration

A future JIT (MEP-22, deferred) compiles one specialized handler per shape. Given a hot loop:

for i in 0..n { sum = sum + xs[i] } // xs : list<int>

the IR shows xs : list<int>. The JIT emits:

guard objects[xs_idx].type == *vmListI64 else deopt
loop:
i64_val = objects[xs_idx].data[i]
sum += i64_val
i++
if i < n goto loop

One guard, hoisted out of the loop. Inside the loop, the load is a single indirect int64 read. No type-switch, no Cell packing, no allocation. This is the ideal-case JIT output and is achievable because:

  1. The shape is known monomorphic at the guard site (hot loops do not migrate; if they did, the guard would have failed already and recompilation would have widened the type).
  2. Migration is one-way, so the guard cannot succeed-then-fail-then-succeed; once it fails, the loop deopts and recompiles for the wider shape.
  3. The Cell encoding is invariant across shapes (tagPtr | idx), so the deopt path is straightforward, just resume the interpreter with the same registers.

Compare with the IC strategy in MEP-27: the JIT would emit a per-call-site cache lookup + branch. Compare with the AOT strategy in MEP-28: the JIT would inline the same specialized code, but the bytecode would already encode the shape.

Cost analysis summary

Per operation, in steady state, on a monomorphic workload:

| Kind | Migration-strategy cost | Generic-only baseline cost | Speedup |
| --- | --- | --- | --- |
| Strings (concat-heavy) | rope alloc + cached hash | O(n²) total copy | O(n) vs O(n²) per program |
| Lists (int) | 1 guard + 1 int load + 1 pack | 1 cell load + tag dispatch | ~2x |
| Maps (int) | 1 guard + splitmix + probe | mapKeyOf + Go map + alloc | ~4-5x |
| Sets (int) | as maps, less value slot | as maps, less value slot | ~4-5x |
| Structs (≤2 fields) | 1 guard + inline load | 1 ptr deref + slice idx | ~2x |

Migration cost itself amortizes to O(1) per operation over the container's lifetime whenever O(operations) ≫ O(migrations), which holds for every measured corpus workload.

Predicted MEP-23 numbers under Option 1

Based on the per-op analysis above and the current MEP-23 baseline (vm2 vs Lua 5.5):

| Benchmark | Current (vs Lua) | Predicted under Option 1 | Mechanism |
| --- | --- | --- | --- |
| lists/fill_sum N=100 | 1.56x slower | 0.5-0.7x (parity to faster) | ListI64 removes per-element Cell pack |
| maps/fill_sum N=100 | 4.86x slower | 0.9-1.3x (near parity) | MapI64 Swiss table on int keys |
| sets/fill_check N=100 (new) | n/a | 0.8-1.2x | Same as maps, less value slot |
| strings/concat_loop N=30 | 2.0x slower | 0.4-0.6x (rope wins big) | O(N) total vs O(N²) total |
| strings/concat_loop N=10 | 1.07x slower | 1.07x (unchanged; under threshold) | Flat path keeps current behavior |
| structs/point_dot N=100 (new) | n/a | 0.7-1.0x | StructSmall inline-2 |

The "≤ 1.0x Lua on every measured cross-language row" target from MEP-25 is reached on every projected row.

Code-size estimate

The migration framework, implemented as a thin wrapper in each runtime/vm2/{strings,lists,maps,sets,structs}.go, is small:

| Module | New LOC (Go) | Notes |
| --- | --- | --- |
| strings.go | ~250 | Rope node + concat decision tree |
| lists.go | ~150 | *vmListI64 + migrate helper |
| maps.go | ~400 | Swiss table layout, splitmix, probe loop |
| sets.go | ~250 | Mirrors maps, smaller |
| structs.go | ~120 | StructSmall + Struct + type-table validation |
| Total | ~1170 | Plus ~30 dispatch cases in eval.go |

The largest single piece is the Swiss table; the rest is straightforward.

Disadvantages

This MEP is honest about the costs of choosing migration:

  1. First-touch latency spike. A list that holds 10000 ints and then receives a string will pay ~50 µs at the migration call. Most programs do not see this; programs that do (a 10000-element list whose 10001st operation widens it) will see a stutter. Pre-warming or explicit emitter hints can mitigate but not eliminate.

  2. Wrong-shape allocation cost. A program that creates many small maps that all immediately get string keys pays the MapEmpty → Map migration cost on every map. The two-allocation cost is ~30 ns, comparable to a Go map allocation; not catastrophic but not free. An emitter optimization (OpNewMapStr for emit-time-known string-keyed maps) can elide this; out of scope here.

  3. JIT deopt complexity. When a JITed loop's shape guard fails, the JIT must deopt cleanly. The Cell encoding's shape-invariance helps, but the deopt path is still real engineering. MEP-27's IC strategy avoids deopt entirely (the IC misses and falls through to a slower polymorphic path); this MEP requires a real deopt mechanism.

  4. Per-kind handler duplication. Each opcode handler that touches a container has at minimum two arms (specialized + generic). For maps with three shapes (Empty, MapI64, Map), some opcodes have three arms. This is more switch-case code than the IC strategy.

  5. Reverse-migration anti-pattern is permanent. A container that briefly held a non-int and was emptied is forever generic. Workloads that intentionally clear and reuse containers cannot benefit from re-specialization without an explicit emitter hint or a new MEP.

Failure modes and mitigation

  • Migration storm. If a workload migrates many containers nearly simultaneously (e.g., a query that fills 1000 ListI64s and then widens all of them), the GC pressure spikes briefly. Mitigation: the new shape replaces the old in the Objects table; the old shape is garbage and freed at the next collection. No allocator special-casing required.

  • Hot loop near a migration point. A loop whose body migrates on its first iteration pays the migration once, then runs in the migrated shape for the remaining n-1 iterations. The first iteration is a microsecond slower; subsequent iterations run at full speed. The net loss is negligible for n > 10.

  • Migration during iteration. The opcode set does not include a container-iterator opcode that holds a reference to internal storage; OpListGet re-reads Objects[idx] every call. Therefore a migration during iteration is observable only as a length / element change, which is the same as any other concurrent mutation. No iterator invalidation logic needed.

Comparison Matrix (identical across MEP-26/27/28)

Reading guide: lower is better for cost; higher is better for fit.

| Axis | Option 1 (Migration) | Option 2 (IC) | Option 3 (AOT) |
| --- | --- | --- | --- |
| Per-op steady-state cost (monomorphic) | 1 shape guard | 1 IC compare + load | 0 (specialized handler) |
| Per-op steady-state cost (polymorphic) | n/a (cannot be polymorphic) | 1 IC compare miss + slow path | 0; fallback path 2-3x cost |
| Cost of changing shape | O(n) one-shot, amortized | None - IC updates lazily | None - both paths exist |
| Reversibility | No | Yes (IC re-warms) | Yes (both handlers exist) |
| Bytecode size | Unchanged from MEP-25 | Unchanged | 2-3x more opcodes |
| Handler-table code size (Go LOC) | ~1200 | ~1500 | ~3500 |
| JIT integration complexity | Medium (deopt required) | Low (IC slot is JIT-visible) | High (must enumerate specializations at JIT time too) |
| JIT steady-state quality (monomorphic) | Excellent (1 hoisted guard) | Excellent | Excellent |
| JIT steady-state quality (polymorphic) | Poor (frequent deopt) | Good (IC handles 2-4 shapes) | Excellent (all paths compiled) |
| Memory overhead per container | 0 (shape via Objects type) | 0 | 0 |
| Memory overhead per call site | 0 | ~16 bytes per IC slot | 0 |
| Compile-time impact | None | None | 20-40% longer emit phase |
| Predicted Lua-relative on int maps | 0.9-1.3x (near parity) | 1.1-1.5x slower | 0.6-0.9x (faster than Lua) |
| Predicted Lua-relative on int lists | 0.5-0.7x | 0.6-0.8x | 0.4-0.6x |
| First-implementation risk | Low - well-trodden | Medium - IC sizing tricky | Medium-high - codegen explosion |
| Production deployments (prior art) | V8, PyPy, Lua, CRuby, Rust SmallVec, HotSpot | Self, V8 ICs, SpiderMonkey | LuaJIT, Julia, Crystal, Nim |

Recommendation (from this MEP's perspective)

Choose Option 1 if: the workload is mostly monomorphic (Mochi's typed front end strongly suggests this), the team prefers a small surface area, the JIT roadmap is willing to invest in deopt, and the codebase prizes a single canonical representation per kind over peak polymorphic performance.

Choose Option 2 (MEP-27) if: workloads are expected to be polymorphic in practice, the team wants to avoid deopt machinery, and per-site IC tuning is acceptable surface area.

Choose Option 3 (MEP-28) if: compile-time cost is unconstrained, peak performance on monomorphic code is the top priority, and the bytecode size cost is acceptable.

For Mochi specifically, this author recommends Option 1 as the first choice, with Option 3's codegen specialization layered on top later (the two compose: the AOT path emits the migration-aware opcodes, and the migration-aware runtime catches the rare polymorphic case). Option 2 is the right choice for runtimes whose source language is untyped (JS, Smalltalk); Mochi is not that runtime.

Open Questions

  • Migration hints from the emitter. Should the compiler emit OpNewListInt distinct from OpNewList when the IR knows the element type? This is a 1-line emit change that pre-specializes the bottom of the lattice. Recommended yes; out of scope as a normative requirement here.
  • Migration auditing. Should the VM record a counter per (kind, src-shape, dst-shape) for profile-guided optimization? Recommended yes (one atomic increment per migration is negligible); out of scope.
  • Lattice extensibility. Future shapes (e.g., *vmListF64, *vmMapStr) extend the lattice by splicing a node into the existing path. Backward compatibility is preserved because the topology stays a DAG; no other invariants break.

References

  • Chambers, Ungar, Lee. "An Efficient Implementation of SELF." OOPSLA 1989.
  • Hölzle, Chambers, Ungar. "Optimizing Dynamically-Typed Object-Oriented Languages with Polymorphic Inline Caches." ECOOP 1991. (Relevant for what migration is not doing.)
  • Bolz, Diekmann, Tratt. "Storage Strategies for Collections in Dynamically Typed Languages." OOPSLA 2013.
  • Ierusalimschy, de Figueiredo, Celes. "The Implementation of Lua 5.0." J. Universal Computer Science, 2005.
  • Patterson, "Adventures in Ruby Object Shapes." RubyKaigi 2022.
  • Bacon, Konuru, Murthy, Serrano. "Thin Locks: Featherweight Synchronization for Java." PLDI 1998.
  • Abseil engineering. "Swiss Tables Design Notes." absl/container/internal/raw_hash_set.h.