MEP 26. VM2 Data Model Option 1 - Monomorphic + One-Way Migration
| Field | Value |
|---|---|
| MEP | 26 |
| Title | VM2 Data Model Option 1 - Monomorphic + One-Way Migration |
| Author | Mochi core |
| Status | Draft |
| Type | Standards Track |
| Created | 2026-05-16 |
Abstract
MEP 25 specifies the in-memory shapes for vm2's five container kinds (strings, lists, maps, sets, structs). This MEP specifies how to dispatch on those shapes using one-way monotonic migration: every container begins life in the most specialized shape that fits its first operation; whenever an operation cannot be served by the current shape, the container migrates in place to the next-most-general shape. Migration is one-shot, O(n), and never reversed. Steady state is monomorphic: a hot loop touching a container of a fixed shape pays one shape guard per access and zero runtime branching beyond that.
This is Option 1 of three. MEP-27 specifies the polymorphic + inline-cache alternative; MEP-28 specifies the AOT codegen alternative. All three share the same shape spec; only the dispatch strategy differs. A common comparison matrix appears at the end of each MEP so a reader entering at any point can pick.
Motivation
vm2 is currently 1.5x behind Lua on lists/fill_sum and 4.9x behind on maps/fill_sum (MEP-23 baseline). Both gaps are the cost of carrying generic shapes on workloads whose actual element / key type is known and stable. The question is not whether to specialize, but when: at compile time, at first-access time, or at every-access time. Migration picks first-access time:
- Mochi's front end already produces typed IR. Most lists are typed `list<int>`; most map keys are typed `int` or `string`. The first push or set already tells the runtime which shape is right; subsequent operations don't add information.
- Reversibility is not earned. Workloads that flip a list back and forth between int-only and mixed are not present in the corpus, and bidirectional specialization is a known anti-pattern (see PyPy storage strategies' rationale for `IntStrategy → ObjectStrategy` being one-way only).
- Single-guard hot paths suit a future JIT. A JIT can compile one specialized handler per shape and let the shape guard fail into a recompile / fall-through. Migration's monomorphic steady state means almost every guard succeeds.
Lineage
Migration as a dispatch strategy is one of the most-deployed techniques in modern dynamic-language runtimes. The relevant prior art:
- V8 hidden classes (Lars Bak, 2008, drawing on Self 1989). Each JS object carries a hidden class pointer; field-add operations transition the object to a child hidden class via a precomputed transition tree. The transition is one-way in the steady state (you cannot "remove" a field's class lineage), and the JIT compiles inline caches keyed on the hidden class pointer. Property reads become a single pointer compare + offset load. See "An Efficient Implementation of SELF" (Chambers, Ungar, Lee, OOPSLA 1989) for the foundational design.
- PyPy storage strategies (Bolz et al., 2013). Each PyPy list has a `Strategy` object: `EmptyStrategy → IntegerStrategy → FloatStrategy → ObjectStrategy`, with migrations triggered by the first operation that does not fit. `IntegerStrategy` stores raw `int64` slots; `ObjectStrategy` stores boxed PyObject pointers. Migration is one-way; the paper explicitly rejects reverse migration because it caused workload flapping in early experiments. ("Storage Strategies for Collections in Dynamically Typed Languages", Bolz, Diekmann, Tratt, OOPSLA 2013.) The vm2 list lattice in §2 below is a direct port of this idea.
- Lua table promotion (Roberto Ierusalimschy, 2005; see the Lua 5.1 source). Lua tables start as an array part + hash part; insertion of an out-of-sequence key promotes a slot from array to hash. The promotion is one-way per slot. ("The Implementation of Lua 5.0", Ierusalimschy, de Figueiredo, Celes, J. Universal Computer Science, 2005.)
- CRuby shape tree (Aaron Patterson et al., Ruby 3.2, 2022). Ruby objects carry a shape ID. Adding an instance variable transitions to a child shape. The shape tree is shared across all objects of a class; the per-object cost is one 32-bit shape ID. This is the closest contemporary analog to the inline-2 / flat-N struct decision in MEP-25 §5.
- Rust `SmallVec<[T; N]>` and the `smallvec` crate. Inline storage up to N elements, heap allocation beyond. The spill from inline to heap is exactly the migration model: O(n) copy once, never reversed. The Rust ecosystem has shown this scales to high-throughput production code (rustc, Servo, ripgrep).
- HotSpot inflated locks (Bacon et al., 1998). Object monitors start as a thin lock (one bit in the object header) and inflate to a heavyweight monitor on first contention. Inflation is one-way. The thin/inflated distinction is the same monomorphic-then-general pattern applied to synchronization.
The common pattern across all six: most objects never leave their initial shape, so the steady-state cost is one shape guard per access. Migration cost is amortized to zero over the object's lifetime when O(migrations) ≪ O(operations), which holds for every workload the corpus has so far measured.
The lattice, formally
A shape lattice for a container kind K is a finite directed acyclic graph (V, E) where:
- `V` is the set of allowed shapes for kind `K` (e.g., for lists: `{ListI64, List}`).
- `E ⊆ V × V` is the set of allowed migrations. An edge `(s₁, s₂)` means a container in shape `s₁` may migrate to shape `s₂` on a triggering operation; the reverse edge is never present.
- There is a unique bottom shape (the most specialized; the one `OpNew*` produces) and a unique top shape (the most general; every other shape can reach it).
- The lattice has no cycles. A container, once at shape `s`, never returns to a shape strictly below `s` in the lattice.
Per-kind lattices (matching MEP-25):
Strings: Inline ─→ Flat ─→ Rope (3 shapes, monotonic by length/depth)
Lists: ListI64 ─→ List (2 shapes)
Maps: MapEmpty ─→ MapI64 ─→ Map (3 shapes)
Sets: SetEmpty ─→ SetI64 ─→ Set (3 shapes)
Structs: StructSmall ─→ Struct (2 shapes, chosen at OpNewStruct from typeID)
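The per-kind lattices can be encoded as plain data: a map from each shape to its unique more-general successor. The sketch below is illustrative (shape names follow the spec; the `Shape` type and helper functions are hypothetical, not the real vm2 types) and shows that one-way reachability to the top shape is a trivial walk:

```go
package main

import "fmt"

// Shape names mirror the lattices above; the encoding is illustrative.
type Shape string

// edges maps each shape to the unique next-more-general shape.
// An absent entry marks the top shape of its lattice.
var edges = map[Shape]Shape{
	"Inline": "Flat", "Flat": "Rope", // strings
	"ListI64": "List", // lists
	"MapEmpty": "MapI64", "MapI64": "Map", // maps
	"SetEmpty": "SetI64", "SetI64": "Set", // sets
	"StructSmall": "Struct", // structs (chosen at construction)
}

// next returns the migration target for s, or "" if s is a top shape.
func next(s Shape) Shape { return edges[s] }

// reachableTop follows migration edges until a top shape is reached.
// Every chain above has length <= 2, so this always terminates.
func reachableTop(s Shape) Shape {
	for edges[s] != "" {
		s = edges[s]
	}
	return s
}

func main() {
	fmt.Println(reachableTop("MapEmpty")) // Map
	fmt.Println(reachableTop("Inline"))   // Rope
}
```

Because no edge ever points downward, the DAG property is enforced by construction: there is nothing a handler could do to re-specialize a container.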
Strings are a slight special case: the variant is picked at construction and changes only when a concat overflows the current variant's bounds. The same one-way property holds (no rope ever becomes flat again unless a deliberate `OpStrFlatten` is emitted; flatten produces a fresh value, not a migration).
Invariant. For every container `c` at shape `s` and every opcode `op`, exactly one of three things happens:
- `op` is in `s`'s served set: execute on `s`. No migration.
- `op` is not in `s`'s served set but there is a unique edge `(s, s')` such that `op` is in `s'`'s served set: migrate to `s'`, then execute. The migration replaces `Objects[c.Ptr()]` in place; the Cell encoding is unchanged.
- `op` is not served by `s` or any shape reachable from `s`: trap. (This case does not occur for any in-spec opcode against any in-spec shape; it is the type-error case.)
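The three-case invariant can be sketched as a walk up the lattice; the `shape` struct and `dispatch` helper below are hypothetical stand-ins for the real vm2 types, kept minimal to show the control flow:

```go
package main

import "fmt"

// shape is an illustrative stand-in: a served-op set plus the unique
// more-general shape (nil at the top of the lattice).
type shape struct {
	name   string
	serves map[string]bool
	next   *shape
}

// dispatch returns the shape that ends up serving op, migrating at most
// one lattice step at a time, or an error for the type-error trap case.
func dispatch(s *shape, op string) (*shape, error) {
	for cur := s; cur != nil; cur = cur.next {
		if cur.serves[op] {
			// A real implementation would copy the payload and swap the
			// Objects-table pointer here whenever cur != s.
			return cur, nil
		}
	}
	return nil, fmt.Errorf("trap: %s served by no shape reachable from %s", op, s.name)
}

func main() {
	generic := &shape{name: "List", serves: map[string]bool{"push_any": true, "push_int": true}}
	i64 := &shape{name: "ListI64", serves: map[string]bool{"push_int": true}, next: generic}

	s, _ := dispatch(i64, "push_int")
	fmt.Println(s.name) // ListI64: case 1, served without migration
	s, _ = dispatch(i64, "push_any")
	fmt.Println(s.name) // List: case 2, migrate once, then serve
}
```

The loop structure also makes the trap case explicit: running off the top of the lattice without finding a serving shape is the type error, exactly as the invariant states.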
Migration mechanics
In-place Objects-table swap. Every container Cell encodes tagPtr | objects_index. Migration allocates a new shape object and stores it at the same Objects-table index, overwriting the old pointer. Any other Cell that referenced the same container is automatically updated, because they all dereference through Objects[index]. This is the same trick V8 uses for in-place hidden-class transitions on objects whose layout changes.
Atomicity. vm2 is single-threaded today (per MEP-21 v2), so migration is observably atomic: the dispatch loop allocates the new shape, copies the data, swaps the pointer, then returns to the opcode handler. A multi-threaded vm2 would need either a stop-the-world phase per migration or a per-container lock; that is out of scope here.
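The in-place swap can be sketched in a few lines. The names (`vmListI64`, `vmList`, `objects`) follow the spec's vocabulary but are illustrative; the point is that both "Cells" hold only an index, so one pointer store retargets every holder at once:

```go
package main

import "fmt"

// Illustrative shapes: specialized int64 storage vs. generic storage.
type vmListI64 struct{ data []int64 }
type vmList struct{ data []any }

// objects is the Objects table; Cells reference containers by index.
var objects []any

// migrateListI64 widens the container at idx in place: allocate the
// generic shape, copy elements, then overwrite the single table slot.
func migrateListI64(idx int) {
	src := objects[idx].(*vmListI64)
	dst := &vmList{data: make([]any, len(src.data))}
	for i, v := range src.data {
		dst.data[i] = v // a real VM would pack a Cell here
	}
	objects[idx] = dst // one pointer store: every referencing Cell sees the new shape
}

func main() {
	objects = append(objects, &vmListI64{data: []int64{1, 2, 3}})
	cellA, cellB := 0, 0 // two Cells referencing the same container
	migrateListI64(cellA)
	_, okA := objects[cellA].(*vmList)
	_, okB := objects[cellB].(*vmList)
	fmt.Println(okA, okB) // both holders now observe the generic shape
}
```

Because the interpreter is single-threaded, the allocate-copy-swap sequence runs to completion before any other opcode can observe the container, which is the atomicity claim above.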
Cost bound, per container, per program lifetime. Let n be the number of elements the container ever holds and k be the number of distinct shapes the container ever visits. The total migration cost is O(n × k). For the lattices above, k ≤ 3, so migration is O(n) per container over its lifetime.
Cost bound, per operation, in steady state. One shape guard (load Objects[i], type-assert to the expected shape). On a modern CPU this is one L1 load + one branch. The branch is monomorphic on hot loops (always the same shape), so the BTB predicts it perfectly. Measured cost: ~1.5 cycles on Apple M-series.
Per-kind specification
1. Strings
Lattice: Inline → Flat → Rope. Migration triggers:
- Inline → Flat: any concat whose combined length exceeds 5 and is ≤ `flatConcatThreshold` (proposed 64 bytes per MEP-25 §1).
- Flat → Rope: any concat whose combined length exceeds `flatConcatThreshold` and whose result's would-be depth is ≤ `maxRopeDepth`.
- Rope → Flat: not a migration. If depth would exceed `maxRopeDepth`, the rope is flattened into a fresh Flat string (a new Cell, not the same Objects entry). The original Rope remains a Rope.
Steady-state cost per op:
| Op | Inline | Flat | Rope |
|---|---|---|---|
| `OpLenStr` | O(1) payload extract | O(1) struct field | O(1) cached len |
| `OpIndexStr` | O(1) payload | O(1) slice idx | O(depth) ≤ 32 |
| `OpConcatStr` | O(1) pack | O(n) buffer copy | O(1) alloc node |
| `OpEqualStr` | O(1) bit compare | O(n) byte compare with hash short-circuit | O(n) leaf walk with hash short-circuit |
| `OpHashStr` | O(n) FNV (≤5 bytes) | O(n) FNV memoized | O(n) FNV memoized at node |
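The concat-time shape decision reduces to two length comparisons. A minimal sketch, using the thresholds proposed above (the constant names and the `concatShape` helper are illustrative, not the real vm2 API):

```go
package main

import "fmt"

// Thresholds follow the proposed values in MEP-25 §1.
const (
	inlineMax           = 5  // max bytes packed into an Inline payload
	flatConcatThreshold = 64 // max bytes before concat stops copying
)

// concatShape returns the shape the result of a concat would take,
// given the two operand lengths.
func concatShape(lenA, lenB int) string {
	n := lenA + lenB
	switch {
	case n <= inlineMax:
		return "Inline"
	case n <= flatConcatThreshold:
		return "Flat"
	default:
		return "Rope" // O(1) node allocation instead of an O(n) copy
	}
}

func main() {
	fmt.Println(concatShape(2, 2))   // Inline
	fmt.Println(concatShape(10, 20)) // Flat
	fmt.Println(concatShape(60, 60)) // Rope
}
```

This is why the N=10 concat benchmark is predicted unchanged below: its results never exceed `flatConcatThreshold`, so the Flat path is taken throughout.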
2. Lists
Lattice: ListI64 → List. Migration trigger: `OpListPush` or `OpListSet` with a non-int value against `ListI64`.
Migration cost: allocate `*vmList{data: make([]Cell, len, cap)}`, then `for i, v := range src.data { dst.data[i] = CInt(v) }`. This is ~5 ns per element on M-series; for a 100-element list, migration is ~500 ns, less than a single `mapKeyOf` boxing on the generic path.
Steady-state cost per op (ListI64 hot path):
- `OpListGet`: one shape guard + one int64 slice load + one `CInt` pack (bit packing, no alloc). ~3 cycles.
- `OpListPush`: one shape guard + Go slice append. Amortized O(1) at ~5 cycles.
- `OpListLen`: one shape guard + one length load. ~2 cycles.
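The hot path above amounts to one type assertion followed by raw `int64` slice operations. A minimal sketch, assuming hypothetical helper names (`listPushInt`, `listSumInt`) and eliding the Cell pack:

```go
package main

import "fmt"

type vmListI64 struct{ data []int64 }

var objects []any // the Objects table

// listPushInt is the specialized push: one shape guard, then plain append.
// On guard failure the real VM would migrate and retry on the generic shape;
// here we just report the miss.
func listPushInt(idx int, v int64) bool {
	l, ok := objects[idx].(*vmListI64) // the single shape guard
	if !ok {
		return false // would fall through to migration + generic path
	}
	l.data = append(l.data, v)
	return true
}

// listSumInt shows the loop body the JIT section targets: after one guard,
// there is no per-element tag dispatch and no Cell unpacking.
func listSumInt(idx int) int64 {
	l := objects[idx].(*vmListI64)
	var sum int64
	for _, v := range l.data {
		sum += v
	}
	return sum
}

func main() {
	objects = append(objects, &vmListI64{})
	for i := int64(1); i <= 100; i++ {
		listPushInt(0, i)
	}
	fmt.Println(listSumInt(0)) // 5050
}
```

In a hot loop the guard always sees the same concrete type, which is exactly the monomorphic branch the steady-state cost analysis assumes.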
3. Maps
Lattice: MapEmpty → MapI64 → Map. Migration triggers per MEP-25 §3.
Migration cost: MapEmpty → MapI64 is one `*vmMapI64` allocation with `cap=8` (no copy). MapI64 → Map is one `*vmMap` allocation plus `for i := range src.ctrl { if src.ctrl[i] occupied { dst.entries[src.keys[i]] = src.vals[i] } }`. For a 64-entry int map this is ~3 µs; for 1024 entries, ~50 µs. Both are tiny compared to the lifetime cost saved.
Steady-state cost per op (MapI64 hot path):
- `OpMapGet`: shape guard + splitmix64 hash + linear probe over ctrl bytes (≤2 probes at 7/8 load factor) + int compare. ~8 cycles.
- `OpMapSet` (key already present): same as Get + value store. ~12 cycles.
- `OpMapSet` (new key): same as Get for the probe + len/ctrl update.

Compare to the generic Map shape on int keys: `mapKeyOf` type-switch (~5 cycles) + Go map lookup (~30 cycles + interface boxing alloc on insertion). Migration's MapI64 path is roughly 4-5× cheaper than the generic path on the same workload, which closes the 4.86× Lua gap.
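The MapI64 → Map migration loop from the cost paragraph can be sketched directly. The layout below is an assumption for illustration (parallel `ctrl`/`keys`/`vals` arrays with the control byte's high bit marking an empty slot, in the style of Swiss tables); MEP-25 §3 defines the real layout:

```go
package main

import "fmt"

const ctrlEmpty = 0x80 // high bit set marks an empty/deleted slot (assumed convention)

// vmMapI64 is an illustrative specialized map: control bytes plus
// parallel key/value arrays (values would be Cells in the real VM).
type vmMapI64 struct {
	ctrl []byte
	keys []int64
	vals []any
}

// vmMap is the illustrative generic shape backed by a Go map.
type vmMap struct{ entries map[any]any }

// migrateMapI64 walks the control bytes and copies occupied slots
// into the generic shape. One pass, O(n), never reversed.
func migrateMapI64(src *vmMapI64) *vmMap {
	dst := &vmMap{entries: make(map[any]any, len(src.keys))}
	for i := range src.ctrl {
		if src.ctrl[i]&0x80 == 0 { // occupied slot
			dst.entries[src.keys[i]] = src.vals[i]
		}
	}
	return dst
}

func main() {
	src := &vmMapI64{
		ctrl: []byte{0, ctrlEmpty, 0, ctrlEmpty},
		keys: []int64{7, 0, 42, 0},
		vals: []any{"a", nil, "b", nil},
	}
	dst := migrateMapI64(src)
	fmt.Println(len(dst.entries), dst.entries[int64(42)]) // 2 b
}
```

Each occupied slot costs one map insert on the generic side, which is where the ~47 ns/entry figure implied by the 64-entry estimate comes from.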
4. Sets
Lattice: SetEmpty → SetI64 → Set. Mirrors maps exactly; per-op cost identical (less the value slot, saving 8 bytes per entry).
5. Structs
Lattice: StructSmall → Struct, with the shape chosen at construction, not by migration. `OpNewStruct` reads `Program.Types[typeID].NumFields` once and picks StructSmall if `NumFields ≤ 2`, else Struct. Field writes never migrate (the field set is fixed at the type level). Structs therefore have a degenerate one-element lattice per object after construction; the "migration" framework still applies trivially.
This is intentional: Mochi's typed front end means struct shape is known at compile time, so there is nothing to discover at first-write time the way there is for collections.
Migration cost amortization
A standard amortized argument: assign each container a credit of 1 unit per element ever inserted. Migration of a container of size n costs O(n) and consumes the n credits accumulated so far. Subsequent operations on the migrated container pay no migration cost. Therefore the amortized per-operation cost is O(1) over the container's lifetime.
This is the same argument PyPy's paper makes for storage strategies. The argument breaks if migrations are reversible (a container can yo-yo and consume credits faster than they accumulate), which is precisely why this MEP is one-way only.
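The credit argument can also be stated in potential-method form. This is a standard textbook restatement, not taken from the PyPy paper; `c` is the per-element copy cost and `k_max ≤ 3` the lattice depth:

```latex
% n = elements inserted so far, j = migrations performed so far, j <= k_max <= 3.
\Phi \;=\; c \cdot n \cdot (k_{\max} - j), \qquad \Phi \ge 0.
% Insertion: actual cost O(1), \Delta\Phi = c\,(k_{\max} - j) \le c\,k_{\max}
%   => amortized cost O(1).
% Migration (j increases by 1): actual cost \le c\,n, \Delta\Phi = -c\,n
%   => amortized cost \le 0.
```

Every operation therefore has O(1) amortized cost. The one-way restriction is what keeps `j` bounded by `k_max`; reversible migrations would let `j` grow without bound and the potential argument would collapse, which is the yo-yo failure described above.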
JIT integration
A future JIT (MEP-22, deferred) compiles one specialized handler per shape. Given a hot loop:
`for i in 0..n { sum = sum + xs[i] }  // xs : list<int>`
the IR shows xs : list<int>. The JIT emits:
```
guard objects[xs_idx].type == *vmListI64 else deopt
loop:
    i64_val = objects[xs_idx].data[i]
    sum += i64_val
    i++
    if i < n goto loop
```
One guard, hoisted out of the loop. Inside the loop, the load is a single indirect int64 read. No type-switch, no Cell packing, no allocation. This is the ideal-case JIT output and is achievable because:
- The shape is known monomorphic at the guard site (hot loops do not migrate; if they did, the guard would have failed already and recompilation would have widened the type).
- Migration is one-way, so the guard cannot succeed-then-fail-then-succeed; once it fails, the loop deopts and recompiles for the wider shape.
- The Cell encoding is invariant across shapes (`tagPtr | idx`), so the deopt path is straightforward: just resume the interpreter with the same registers.
Compare with the IC strategy in MEP-27: the JIT would emit a per-call-site cache lookup + branch. Compare with the AOT strategy in MEP-28: the JIT would inline the same specialized code, but the bytecode would already encode the shape.
Cost analysis summary
Per operation, in steady state, on a monomorphic workload:
| Kind | Migration strategy cost | Generic-only baseline cost | Speedup |
|---|---|---|---|
| Strings (concat-heavy) | rope alloc + cached hash | O(n²) total copy | O(n) / O(n²) per program |
| Lists (int) | 1 guard + 1 int load + 1 pack | 1 cell load + tag dispatch | ~2x |
| Maps (int) | 1 guard + splitmix + probe | mapKeyOf + Go map + alloc | ~4-5x |
| Sets (int) | as maps less value slot | as maps less value slot | ~4-5x |
| Structs (≤2 fields) | 1 guard + inline load | 1 ptr deref + slice idx | ~2x |
Migration cost itself is amortized away over each container's lifetime whenever O(operations) ≫ O(migrations), which holds for every measured corpus workload.
Predicted MEP-23 numbers under Option 1
Based on the per-op analysis above and the current MEP-23 baseline (vm2 vs Lua 5.5):
| Benchmark | Current (vs Lua) | Predicted under Option 1 | Mechanism |
|---|---|---|---|
| `lists/fill_sum` N=100 | 1.56x slower | 0.5-0.7x slower (parity-to-faster) | ListI64 removes per-element Cell pack |
| `maps/fill_sum` N=100 | 4.86x slower | 0.9-1.3x slower (near parity) | MapI64 Swiss table on int keys |
| `sets/fill_check` N=100 (new) | n/a | 0.8-1.2x | Same as maps, less value slot |
| `strings/concat_loop` N=30 | 2.0x slower | 0.4-0.6x (rope wins big) | O(N) total vs O(N²) total |
| `strings/concat_loop` N=10 | 1.07x slower | 1.07x (unchanged; under threshold) | Flat path keeps current behavior |
| `structs/point_dot` N=100 (new) | n/a | 0.7-1.0x | StructSmall inline-2 |
The "≤ 1.0x Lua on every measured cross-language row" target from MEP-25 is reached on every projected row.
Code-size estimate
The migration framework, implemented as a thin wrapper in each of `runtime/vm2/{strings,lists,maps,sets,structs}.go`, is small:
| Module | New LOC (Go) | Notes |
|---|---|---|
| strings.go | ~250 | Rope node + concat decision tree |
| lists.go | ~150 | *vmListI64 + migrate helper |
| maps.go | ~400 | Swiss table layout, splitmix, probe loop |
| sets.go | ~250 | Mirrors maps, smaller |
| structs.go | ~120 | StructSmall + Struct + type-table validation |
| Total | ~1170 | Plus ~30 dispatch cases in eval.go |
The largest single piece is the Swiss table; the rest is straightforward.
Disadvantages
This MEP is honest about the costs of choosing migration:
- First-touch latency spike. A list that holds 10000 ints and then receives a string will pay ~50 µs at the migration call. Most programs do not see this; programs that do (a 10000-element list whose 10001st operation widens it) will see a stutter. Pre-warming or explicit emitter hints can mitigate but not eliminate this.
- Wrong-shape allocation cost. A program that creates many small maps that all immediately get string keys pays the `MapEmpty → Map` migration cost on every map. The two-allocation cost is ~30 ns, comparable to a Go map allocation; not catastrophic, but not free. An emitter optimization (`OpNewMapStr` for emit-time-known string-keyed maps) can elide this; out of scope here.
- JIT deopt complexity. When a JITed loop's shape guard fails, the JIT must deopt cleanly. The Cell encoding's shape-invariance helps, but the deopt path is still real engineering. MEP-27's IC strategy avoids deopt entirely (the IC misses and falls through to a slower polymorphic path); this MEP requires a real deopt mechanism.
- Per-kind handler duplication. Each opcode handler that touches a container has at minimum two arms (specialized + generic). For maps with three shapes (MapEmpty, MapI64, Map), some opcodes have three arms. This is more switch-case code than the IC strategy needs.
- Reverse-migration anti-pattern is permanent. A container that briefly held a non-int and was emptied is forever generic. Workloads that intentionally clear and reuse containers cannot benefit from re-specialization without an explicit emitter hint or a new MEP.
Failure modes and mitigation
- Migration storm. If a workload migrates many containers nearly simultaneously (e.g., a query that fills 1000 ListI64s and then widens all of them), GC pressure spikes briefly. Mitigation: the new shape replaces the old in the Objects table; the old shape is garbage and freed at the next collection. No allocator special-casing required.
- Hot loop near a migration point. A loop whose body migrates on its first iteration pays the migration once, then runs on the migrated shape for the remaining n-1 iterations. The first iteration is a microsecond slower; subsequent iterations run at full speed, so the net loss is negligible for n > 10.
- Migration during iteration. The opcode set does not include a container-iterator opcode that holds a reference to internal storage; `OpListGet` re-reads `Objects[idx]` every call. Therefore a migration during iteration is observable only as a length / element change, which is the same as any other concurrent mutation. No iterator invalidation logic is needed.
Comparison Matrix (identical across MEP-26/27/28)
Reading guide: lower is better for cost; higher is better for fit.
| Axis | Option 1 (Migration) | Option 2 (IC) | Option 3 (AOT) |
|---|---|---|---|
| Per-op steady-state cost (monomorphic) | 1 shape guard | 1 IC compare + load | 0 (specialized handler) |
| Per-op steady-state cost (polymorphic) | n/a (cannot be polymorphic) | 1 IC compare miss + slow path | 0, fallback path 2-3x cost |
| Cost of changing shape | O(n) one-shot, amortized | None - IC updates lazily | None - both paths exist |
| Reversibility | No | Yes (IC re-warms) | Yes (both handlers exist) |
| Bytecode size | Unchanged from MEP-25 | Unchanged | 2-3x more opcodes |
| Handler-table code size (Go LOC) | ~1200 | ~1500 | ~3500 |
| JIT integration complexity | Medium (deopt required) | Low (IC slot is JIT-visible) | High (must enumerate specializations at JIT time too) |
| JIT steady-state quality (monomorphic) | Excellent (1 hoisted guard) | Excellent | Excellent |
| JIT steady-state quality (polymorphic) | Poor (frequent deopt) | Good (IC handles 2-4 shapes) | Excellent (all paths compiled) |
| Memory overhead per container | 0 (shape via Objects type) | 0 | 0 |
| Memory overhead per call site | 0 | ~16 bytes per IC slot | 0 |
| Compile-time impact | None | None | 20-40% longer emit phase |
| Predicted Lua-relative on int maps | 0.9-1.3x slower (near parity) | 1.1-1.5x slower | 0.6-0.9x slower (faster!) |
| Predicted Lua-relative on int lists | 0.5-0.7x slower | 0.6-0.8x slower | 0.4-0.6x slower |
| First-implementation risk | Low - well-trodden | Medium - IC sizing tricky | Medium-high - codegen explosion |
| Production deployments (prior art) | V8, PyPy, Lua, CRuby, Rust SmallVec, HotSpot | Self, V8 ICs, SpiderMonkey | LuaJIT, Julia, Crystal, Nim |
Recommendation (from this MEP's perspective)
Choose Option 1 if: the workload is mostly monomorphic (Mochi's typed front end strongly suggests this), the team prefers a small surface area, the JIT roadmap is willing to invest in deopt, and the codebase prizes a single canonical representation per kind over peak polymorphic performance.
Choose Option 2 (MEP-27) if: workloads are expected to be polymorphic in practice, the team wants to avoid deopt machinery, and per-site IC tuning is acceptable surface area.
Choose Option 3 (MEP-28) if: compile-time cost is unconstrained, peak performance on monomorphic code is the top priority, and the bytecode size cost is acceptable.
For Mochi specifically, this author recommends Option 1 as the first choice, with Option 3's codegen specialization layered on top later (the two compose: the AOT path emits the migration-aware opcodes, and the migration-aware runtime catches the rare polymorphic case). Option 2 is the right choice for runtimes whose source language is untyped (JS, Smalltalk); Mochi is not that runtime.
Open Questions
- Migration hints from the emitter. Should the compiler emit `OpNewListInt` distinct from `OpNewList` when the IR knows the element type? This is a 1-line emit change that pre-specializes the bottom of the lattice. Recommended yes; out of scope as a normative requirement here.
- Migration auditing. Should the VM record a counter per (kind, src-shape, dst-shape) for profile-guided optimization? Recommended yes (one atomic increment per migration is negligible); out of scope.
- Lattice extensibility. Future shapes (e.g., `*vmListF64`, `*vmMapStr`) extend the lattice by splicing a node into the existing path. Backward compatibility is preserved because the topology stays a DAG; no other invariants break.
References
- Chambers, Ungar, Lee. "An Efficient Implementation of SELF." OOPSLA 1989.
- Hölzle, Chambers, Ungar. "Optimizing Dynamically-Typed Object-Oriented Languages with Polymorphic Inline Caches." ECOOP 1991. (Relevant for what migration is not doing.)
- Bolz, Diekmann, Tratt. "Storage Strategies for Collections in Dynamically Typed Languages." OOPSLA 2013.
- Ierusalimschy, de Figueiredo, Celes. "The Implementation of Lua 5.0." J. Universal Computer Science, 2005.
- Patterson, "Adventures in Ruby Object Shapes." RubyKaigi 2022.
- Bacon, Konuru, Murthy, Serrano. "Thin Locks: Featherweight Synchronization for Java." PLDI 1998.
- Abseil engineering. "Swiss Tables Design Notes." absl/container/internal/raw_hash_set.h.