MEP 28. VM2 Data Model Option 3 - AOT Codegen Specialization

| Field   | Value                                                 |
|---------|-------------------------------------------------------|
| MEP     | 28                                                    |
| Title   | VM2 Data Model Option 3 - AOT Codegen Specialization  |
| Author  | Mochi core                                            |
| Status  | Draft                                                 |
| Type    | Standards Track                                       |
| Created | 2026-05-16                                            |

Abstract

This MEP specifies a dispatch strategy for the MEP-25 data model in which all shape decisions are made by the compiler at emit time, before the interpreter ever runs. Each generic opcode in MEP-25 expands into a family of specialized opcodes, one per concrete operand-shape tuple. The interpreter dispatches on opcode only; each handler knows its operand shapes statically and runs straight-line specialized code. There are no shape guards, no inline caches, and no migration. The handler set is large, and the bytecode is denser, but per-op cost is the theoretical minimum: one dispatch + the specialized work.

This is Option 3 of three. MEP-26 specifies the one-way-migration alternative; MEP-27 specifies the polymorphic + inline-cache alternative. The same shape spec underlies all three; only dispatch differs. A common comparison matrix appears at the end of each MEP.

Motivation

Mochi has a typed front end. The IR carries per-value type information that is precise enough, at most call sites, to identify the operand shape at compile time. Both Option 1 (migration) and Option 2 (IC) discover that shape at run time, paying a guard or a cache check on every access. Option 3 asks: if the compiler already knows the shape, why pay any runtime cost at all?

The answer in the broader literature is "you can, and it's the fastest path on monomorphic code". LuaJIT's specialized opcodes (ADDVV, ADDVN, ADDNV, three opcodes for one source-level +) and Julia's method specialization (one compiled method per concrete argument-type tuple) both work this way. The cost is more opcodes, larger bytecode, and a longer emit phase. The benefit is no dispatch overhead beyond opcode decode.

For Mochi specifically, the typing fidelity argues strongly for this approach in the steady state. The shape-unknown case still exists (interface-style sum types, dynamically-typed map values), and those fall back to generic opcodes, but the generic opcodes are no slower than they are under Options 1 or 2 (in fact, slightly faster, because they don't carry a vestigial shape guard).

Lineage

AOT codegen specialization is the dominant strategy in compiled-language runtimes whose source language gives them static type information:

  1. LuaJIT specialized opcodes (Mike Pall, 2005-present). Lua's stock interpreter uses generic ADD opcodes; LuaJIT's bytecode splits each into ADDVV (var-var), ADDVN (var-num), ADDNV (num-var), and similar. The interpreter dispatch table grows from ~40 opcodes to ~90, but every hot-path arithmetic op is a single dispatch with no operand-type check. LuaJIT's interpreter alone (no JIT) beats stock Lua 2-3x; the specialized opcodes are a large part of that. ("LuaJIT 2.0 Bytecode", luajit.org/bytecode.html.)

  2. Julia method specialization (Bezanson, Karpinski, Shah, Edelman, 2017). Julia compiles one specialized method per concrete argument-type tuple. A generic +(x, y) becomes +(Int64, Int64), +(Float64, Float64), etc., each fully unboxed and inlined. The specialization is not an optimization; it is the dispatch model. ("Julia: A Fresh Approach to Numerical Computing", SIAM Review 2017.)

  3. Crystal compiler (Manas, Borenszweig, 2014-present). Crystal monomorphizes all generic methods at compile time. Each call site gets its own copy of the callee specialized to the argument types. No virtual dispatch on monomorphic call sites; only on explicit interface dispatch.

  4. Nim generic instantiation (Andreas Rumpf, 2008-present). Nim's generics are instantiated per type combination at compile time, similar to C++ templates but with the front-end type system enforcing usage constraints.

  5. Rust generic monomorphization (Graydon Hoare et al., 2010-present). Every Vec<T>::push is compiled to a fully specialized function per concrete T. Per-call cost is zero beyond the specialized work. Code-size cost is real (the "Rust compile time" problem). Trade-off well understood. See "Monomorphization in Rust" RFCs.

  6. C++ templates (Stroustrup, 1990). The original mainstream monomorphization model. std::vector<int> and std::vector<double> are distinct types with distinct compiled code. Per-call zero overhead.

  7. GHC SPECIALIZE pragmas / GHC's type-class dictionary erasure (Jones et al., 1995-present). Haskell's type-class dispatch is erased at compile time when the type is known statically; falls back to dictionary-passing only when the type is genuinely abstract.

The pattern across all seven: shape is known at compile time → emit specialized code → pay zero at run time. The shared cost is binary size and compile-time complexity, both of which scale with the number of distinct shape tuples, not with the number of operations.

Strategy in detail

Opcode family expansion

Each MEP-25 generic opcode expands into a family indexed by operand shape:

OpListGet (1 generic)
↓ expand by container shape
OpListGetI64 // operand B is *vmListI64
OpListGetGeneric // operand B is *vmList

OpMapGet (1 generic)
↓ expand by (container shape, key shape)
OpMapGetEmpty // operand B is *vmMapEmpty -- elide entire body, return null
OpMapGetI64I64 // *vmMapI64, int key
OpMapGetI64Other // *vmMapI64, non-int key -- elide, return null
OpMapGetGenericI64 // *vmMap, int key
OpMapGetGenericStr // *vmMap, string key (Swiss table on byte hash)
OpMapGetGenericAny // *vmMap, anything else
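The contrast between a specialized variant and its generic fallback can be sketched in Go. All names below are hypothetical illustrations, not the actual vm2 API; the point is that the specialized handler body is straight-line code while the generic one pays a runtime type switch:

```go
package main

// vmMapI64 stands in for the MEP-25 int-specialized map shape (hypothetical).
type vmMapI64 struct{ m map[int64]int64 }

// value is a simplified tagged VM value: tag -1 = null, 0 = int.
type value struct {
	tag int
	i   int64
}

// OpMapGetI64I64: both operand shapes are known at emit time, so the
// handler runs straight-line code -- no shape guard, no inline cache.
func opMapGetI64I64(m *vmMapI64, key int64) value {
	v, ok := m.m[key]
	if !ok {
		return value{tag: -1} // key absent: null
	}
	return value{tag: 0, i: v}
}

// OpMapGetGenericAny: the fallback when emit could not infer shapes.
// It pays a runtime type switch, akin to Option 1's shape guard.
func opMapGetGenericAny(container interface{}, key value) value {
	switch m := container.(type) {
	case *vmMapI64:
		if key.tag != 0 {
			return value{tag: -1} // non-int key on an int map: null
		}
		return opMapGetI64I64(m, key.i)
	default:
		return value{tag: -1}
	}
}
```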

For each MEP-25 op family, the expansion is bounded by the cross product of operand shapes:

| Op family                   | # operands           | Distinct shapes per operand | Specialized variants               |
|-----------------------------|----------------------|-----------------------------|------------------------------------|
| Strings (Concat)            | 2                    | 3 (Inline, Flat, Rope)      | 9 variants                         |
| Strings (Len, Index, Hash)  | 1                    | 3                           | 3 variants each                    |
| Strings (Equal)             | 2                    | 3                           | 9 variants                         |
| Lists (Get, Set, Push, Len) | 1                    | 2 (I64, Generic)            | 2 variants each                    |
| Maps (Get, Set, Has, Del)   | 2 (container + key)  | 3 × 4 key shapes            | up to 12 each, pruned to 6 useful  |
| Maps (Len, New)             | 1                    | 3                           | 3 each                             |
| Sets (Add, Has, Del)        | 2                    | 3 × 4                       | up to 12, pruned to 6              |
| Structs (Get, Set, New)     | 1                    | 2 (Small, Generic)          | 2 each                             |

Counted as opcodes:

  • Strings: 6 ops × ~5 avg = 30 variants
  • Lists: 4 ops × 2 = 8 variants
  • Maps: 6 ops × 6 avg = 36 variants
  • Sets: 5 ops × 5 avg = 25 variants
  • Structs: 3 ops × 2 = 6 variants

Total: ~105 specialized container opcodes, on top of the current ~50 generic opcodes. The interpreter dispatch table grows to ~155 entries, larger than LuaJIT's ~90 but of the same order.

Emit-time shape inference

The IR (per MEP-21 v2 and the compiler2 pipeline) carries type annotations on every value. The emit phase walks the IR and, for each container op:

  1. Look up the operand's IR type.
  2. Use the IR type → shape mapping table (e.g., list<int> → *vmListI64, map<string, T> → string-keyed *vmMap).
  3. Emit the specialized opcode for that shape tuple.
  4. If the IR type is any or a sum type, emit the generic opcode (no specialization possible).

The shape mapping is defined once in compiler2/emit/shape.go. The emit table is straightforward.
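The emit-side lookup can be sketched as a small Go function in the spirit of compiler2/emit/shape.go. The IR-type and opcode names here are hypothetical stand-ins for the real compiler2 definitions:

```go
package main

// Hypothetical IR element-type tags for illustration.
type irType int

const (
	IRListInt irType = iota // list<int>
	IRListStr               // list<string>
	IRAny                   // any / sum type: shape unknown
)

type opcode string

const (
	OpListGetI64     opcode = "OpListGetI64"
	OpListGetGeneric opcode = "OpListGetGeneric"
)

// specializeListGet implements steps 1-4 above for OpListGet: a known
// int element type gets the specialized opcode; anything the front end
// could not pin down falls back to the generic opcode.
func specializeListGet(elem irType) opcode {
	switch elem {
	case IRListInt:
		return OpListGetI64
	default:
		return OpListGetGeneric // step 4: any / sum type, no specialization
	}
}
```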

The fallback path

For values whose shape is genuinely unknown at compile time (function parameters of type any, sum-type discriminants, results of dynamic operations), the emitter falls back to the generic opcode, which is identical to the current MEP-24 opcode. The interpreter handler for the generic opcode does whatever runtime dispatch is needed (a small type-switch, akin to Option 1's shape guard).

Crucially, the generic opcode's cost rarely matters: in a well-typed program it never runs in a hot loop, because well-typed programs emit specialized opcodes. The generic opcode is the safety net for any types, FFI, and similar boundary conditions.

Bytecode size impact

A specialized opcode is the same size as a generic opcode (4 bytes per Instr in the current encoding). The bytecode size cost is therefore zero per instruction; the cost is in the handler-table code (more Go functions, more LOC in eval.go) and in the opcode-decoding table (one more entry per variant).

Net program-byte-count change: zero. Net growth of the vm2 Go package: ~2000 LOC of handlers (~2550 LOC total with emit-side changes; see the code-size table below), roughly 30 KB of compiled Go code.
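A sketch of why the per-instruction cost is zero, using a hypothetical 4-byte instruction word (the real Instr encoding may order fields differently): the opcode byte absorbs the specialization, so a specialized instruction occupies exactly the same bytes as a generic one.

```go
package main

// Instr is a hypothetical 4-byte instruction word: one opcode byte
// plus three operand bytes (A, B, C). Specialization only changes the
// value of the opcode byte, never the width of the instruction.
type Instr uint32

func encode(op, a, b, c uint8) Instr {
	return Instr(uint32(op) | uint32(a)<<8 | uint32(b)<<16 | uint32(c)<<24)
}

func (i Instr) Op() uint8 { return uint8(i) }
func (i Instr) A() uint8  { return uint8(i >> 8) }
func (i Instr) B() uint8  { return uint8(i >> 16) }
func (i Instr) C() uint8  { return uint8(i >> 24) }
```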

Cost analysis

Per-op steady-state cost

| Strategy             | Cost (cycles) | What runs                           |
|----------------------|---------------|-------------------------------------|
| Option 1 (Migration) | ~3            | Dispatch + 1 shape guard + handler  |
| Option 2 (IC)        | ~5            | Dispatch + 1 IC compare + handler   |
| Option 3 (AOT)       | ~2            | Dispatch + handler (no guard)       |

The savings are ~1 cycle per op vs Option 1 and ~3 cycles vs Option 2 on hot loops. Over 10^7 inner-loop iterations that is 10^7 to 3×10^7 cycles, i.e. roughly 3-30 ms of saved CPU time per benchmark depending on clock rate, which moves the cross-language Lua-relative number measurably.

Compile-time cost

The emit phase walks the IR once more, looking up the shape per opcode. Cost: O(|IR|) with a small constant. Estimated +20-40% on the emit phase wall time (which is currently a small fraction of total compilation), so total compile time grows by single-digit percentage. Not a blocker.

Code-size cost

| Component                                 | LOC (Go)            |
|-------------------------------------------|---------------------|
| Strings handlers (30 variants × ~15 LOC)  | ~450                |
| Lists handlers (8 × ~10)                  | ~80                 |
| Maps handlers (36 × ~25)                  | ~900                |
| Sets handlers (25 × ~20)                  | ~500                |
| Structs handlers (6 × ~12)                | ~72                 |
| Emit-side shape mapping & dispatcher      | ~400                |
| Opcode-decoding table updates             | ~150                |
| Fallback generic handlers (unchanged)     | shared with MEP-24  |
| Total new LOC                             | ~2550               |

Larger than Option 1 (~1200) and Option 2 (~1700). The bulk is in the maps and sets families, which have the largest cross product.

Memory cost

Zero per container (no shape guards stored). Zero per call site (no IC slots). Slightly larger interpreter dispatch table (~155 vs ~100 entries), a few KB of static data, negligible.

JIT integration

The AOT path is the JIT-friendliest of the three. Each specialized opcode is essentially the same code the JIT would emit on a hoisted-guard path under Option 1, but already chosen at compile time. The JIT for Option 3 is conceptually:

  1. Walk the bytecode (which has specialized opcodes already).
  2. For each opcode, inline the handler body. No guard, no IC consult.
  3. The result is the same straight-line specialized native code that V8's TurboFan would emit after IC-driven specialization.

The shape-unknown fallback path is the only place the JIT must emit a shape dispatch. This is rare in well-typed code.

The cost: the JIT must implement the full ~155-opcode dispatch table, roughly 1.5x the handler count Options 1 and 2 require. This is the "JIT codegen explosion" cost.

Per-kind specification

Strings

Specialized variants per opcode:

OpStrConcat: {Inline,Flat,Rope} × {Inline,Flat,Rope} = 9 variants
OpStrLen: {Inline,Flat,Rope} = 3 variants
OpStrIndex: {Inline,Flat,Rope} = 3 variants
OpStrEqual: {Inline,Flat,Rope}² = 9 variants (some symmetric)
OpStrHash: {Inline,Flat,Rope} = 3 variants
OpStrLoadK: 1 variant (always picks Inline/Flat at emit time based on length)

The 9 concat variants can be reduced by exploiting symmetry (Inline×Flat and Flat×Inline emit the same handler with operand swap). Effective handler count: ~6 for concat.
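The operand-swap sharing can be sketched in Go. String payloads are simplified to plain Go strings here (the real Inline/Flat representations differ); note that concatenation itself is not commutative, so the shared body takes the operands in canonical (inline, flat) order plus a bit recording the source order:

```go
package main

// concatInlineFlat is a hypothetical shared handler body for the
// symmetric opcode pair. Both opcodes call the same function; only
// the inlineFirst flag (baked in per opcode) differs.
func concatInlineFlat(inline, flat string, inlineFirst bool) string {
	if inlineFirst {
		return inline + flat // OpStrConcat Inline×Flat
	}
	return flat + inline // OpStrConcat Flat×Inline: same body, operands swapped
}
```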

Lists

OpListNew: {I64,Generic} = 2 variants
OpListGet: {I64,Generic} = 2
OpListSet: {I64,Generic} × {Int,NonInt value} = 4
OpListPush: {I64,Generic} × {Int,NonInt} = 4
OpListLen: {I64,Generic} = 2

Note: OpListSetI64NonInt is a trap-or-widen handler. Under Option 3, the emit phase should have already known the value was non-int and emitted OpListSetGenericAny instead. The trap case exists only as a safety net for badly-typed IR; it should never execute in well-typed Mochi.
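A minimal sketch of that safety net, with hypothetical names (the real vm2 handler signature differs). Trapping loudly, rather than silently coercing, is what surfaces compiler bugs early:

```go
package main

import "fmt"

// vmListI64 is a hypothetical int-specialized list shape.
type vmListI64 struct{ elems []int64 }

// opListSetI64NonInt is the trap path: in well-typed Mochi it never
// executes, because emit would have chosen OpListSetGenericAny. It
// reports badly-typed IR instead of corrupting the int backing store.
func opListSetI64NonInt(l *vmListI64, idx int, v interface{}) error {
	return fmt.Errorf("vm2: non-int value %v stored into *vmListI64[%d]: badly-typed IR", v, idx)
}
```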

Maps

OpMapNew: {Empty,I64,Generic} = 3
OpMapGet: {Empty,I64,Generic} × {Int,Str,Bool,Other} = pruned to 6 useful (Empty just returns null regardless of key)
OpMapSet: similar = 6
OpMapHas: similar = 6
OpMapDel: similar = 6
OpMapLen: {Empty,I64,Generic} = 3

Pruning: (Empty, *) for Get/Has reduces to a single handler that returns null/false without touching the key. Effective handler count: ~25 for maps.

Sets

Symmetric to maps, ~22 handlers.

Structs

OpStructNew: {Small,Generic} (chosen at emit time from typeID.NumFields) = 2
OpFieldGet: {Small,Generic} × {slot 0, slot 1, slot ≥2} = effectively 4 (Small only has slots 0,1; Generic has any)
OpFieldSet: same = 4

OpFieldGetSmall0 is a tiny handler: read fields[0] from the struct header. Same for slot 1. Slots ≥2 only exist on Generic structs.
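A sketch of those handlers, using a hypothetical two-slot "Small" struct layout (the real vm2 header layout may differ). Because the slot index is baked into the opcode, each handler is a single indexed load with no bounds or shape logic:

```go
package main

// vmStructSmall is a hypothetical two-slot MEP-25 "Small" struct shape.
type vmStructSmall struct {
	slots [2]int64
}

// One handler per baked-in slot index: a direct field read, nothing else.
func opFieldGetSmall0(s *vmStructSmall) int64 { return s.slots[0] }
func opFieldGetSmall1(s *vmStructSmall) int64 { return s.slots[1] }
```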

Predicted MEP-23 numbers under Option 3

| Benchmark                 | Current (vs Lua) | Predicted under Option 3             | Mechanism                                          |
|---------------------------|------------------|--------------------------------------|----------------------------------------------------|
| lists/fill_sum N=100      | 1.56x slower     | 0.4-0.6x slower (faster than Lua)    | No guard; specialized int handler                  |
| maps/fill_sum N=100       | 4.86x slower     | 0.6-0.9x slower (parity-to-faster)   | MapGetI64I64 is single-table direct probe          |
| sets/fill_check N=100     | n/a              | 0.5-0.8x slower                      | Same as maps                                       |
| strings/concat_loop N=30  | 2.0x slower      | 0.4-0.6x faster                      | Same rope wins as Options 1/2, ~5% faster on dispatch |
| structs/point_dot N=100   | n/a              | 0.6-0.8x slower                      | Inline-2 with direct slot opcode                   |

Best per-row of the three options on every monomorphic benchmark. The advantage is largest on int maps (~0.2-0.4x faster than Option 1, ~0.5x faster than Option 2).

Disadvantages

  1. Codegen explosion is real. ~105 new opcodes is a lot of switch cases. The dispatch table grows; the binary grows by ~30 KB; the cognitive load on a future contributor adding a new opcode is higher.

  2. Compile-time cost (small but nonzero). Emit-phase shape lookup adds ~20-40% to emit time. Not a blocker but visible.

  3. Polymorphic-call regression risk. If a workload turns out to be polymorphic (the IR could not infer shape), every op falls to the generic opcode and Option 3 performs no better than MEP-24's status quo. Options 1 and 2 still extract some specialization benefit in that case.

  4. No runtime adaptation. A program whose shape distribution changes between phases (early init phase has different types from steady state) cannot adapt. The specialized opcodes are baked into the bytecode.

  5. JIT must mirror the dispatch table. A future JIT needs handlers for all ~155 opcodes. Double the JIT engineering surface vs Options 1 and 2.

  6. Worse on bytecode loading (slightly). A program loader must initialize a slightly larger dispatch table at startup. Negligible (~µs).

  7. Hides shape decisions from runtime tools. Profilers, tracers, and debuggers that want to know "what shape was this map?" must reverse-engineer it from the opcode, since there is no runtime shape registry. Inspectability is reduced.

Failure modes and mitigation

  • IR shape inference is wrong. A typed list<int> that actually holds a Cell of wrong tag is an internal compiler bug. The specialized opcode traps cleanly (the same way an out-of-bounds index traps). No silent corruption.

  • Hot loop falls back to generic. A profile-guided refinement could re-emit the function with specialized opcodes once a profiler observes the shape. Out of scope for the first version; tracked as future work.

  • Adding a new shape variant. Requires updating ~10-15 opcodes' dispatch families. This is the "extension cost" Option 3 pays vs Options 1 (which adds one new lattice node) and 2 (which just adds one entry to the handler map). A new shape is a meaningful change in any of the three; in Option 3 it is a structural change.

Comparison Matrix (identical across MEP-26/27/28)

| Axis                                      | Option 1 (Migration)                         | Option 2 (IC)                     | Option 3 (AOT)                                       |
|-------------------------------------------|----------------------------------------------|-----------------------------------|------------------------------------------------------|
| Per-op steady-state cost (monomorphic)    | 1 shape guard                                | 1 IC compare + load               | 0 (specialized handler)                              |
| Per-op steady-state cost (polymorphic)    | n/a (cannot be polymorphic)                  | 1 IC compare miss + slow path     | 0, fallback path 2-3x cost                           |
| Cost of changing shape                    | O(n) one-shot, amortized                     | None - IC updates lazily          | None - both paths exist                              |
| Reversibility                             | No                                           | Yes (IC re-warms)                 | Yes (both handlers exist)                            |
| Bytecode size                             | Unchanged from MEP-25                        | Unchanged                         | 2-3x more opcodes                                    |
| Handler-table code size (Go LOC)          | ~1200                                        | ~1500                             | ~3500                                                |
| JIT integration complexity                | Medium (deopt required)                      | Low (IC slot is JIT-visible)      | High (must enumerate specializations at JIT time too) |
| JIT steady-state quality (monomorphic)    | Excellent (1 hoisted guard)                  | Excellent                         | Excellent                                            |
| JIT steady-state quality (polymorphic)    | Poor (frequent deopt)                        | Good (IC handles 2-4 shapes)      | Excellent (all paths compiled)                       |
| Memory overhead per container             | 0 (shape via the object's type)              | 0                                 | 0                                                    |
| Memory overhead per call site             | 0                                            | ~16 bytes per IC slot             | 0                                                    |
| Compile-time impact                       | None                                         | None                              | 20-40% longer emit phase                             |
| Predicted Lua-relative on int maps        | 0.9-1.3x slower (near parity)                | 1.1-1.5x slower                   | 0.6-0.9x slower (faster!)                            |
| Predicted Lua-relative on int lists       | 0.5-0.7x slower                              | 0.6-0.8x slower                   | 0.4-0.6x slower                                      |
| First-implementation risk                 | Low - well-trodden                           | Medium - IC sizing tricky         | Medium-high - codegen explosion                      |
| Production deployments (prior art)        | V8, PyPy, Lua, CRuby, Rust SmallVec, HotSpot | Self, V8 ICs, SpiderMonkey        | LuaJIT, Julia, Crystal, Nim                          |

Recommendation (from this MEP's perspective)

Choose Option 3 if:

  • Source-language type information is rich and trustworthy (Mochi qualifies).
  • Peak per-op throughput on monomorphic loops is the top success criterion.
  • The team is willing to maintain ~155 dispatch cases instead of ~100.
  • The bytecode-size constraint is loose (server-side use; not embedded).
  • The JIT roadmap is willing to mirror the larger handler table.

For Mochi, this is the highest-ceiling strategy: it produces the best MEP-23 numbers in every projected row, sometimes substantially better than Lua. The cost is engineering surface, which scales with the number of container kinds and shapes. With the MEP-25 inventory frozen (5 kinds, ~12 shapes total), the surface is bounded.

The author's overall view (consistent across MEP-26, 27, 28): Option 1 is the right first step; Option 3 is the right long-term destination once Option 1's measured behavior validates the shape choices; Option 2 is the right middle layer if dynamic polymorphism turns out to be common in real Mochi code. The three are not mutually exclusive: Option 3's specialized opcodes can be emitted on top of Option 1's migration-aware runtime, with Option 2's IC slot reserved for the (rare) shape-unknown call sites. A fully layered runtime would deploy all three.

Open Questions

  • Symmetric opcode pruning. OpConcatStrInlineFlat and OpConcatStrFlatInline produce the same result with operand swap. Should they share a handler, or stay distinct for cache-friendlier dispatch? Empirically the swap is free; recommend sharing.
  • Slot-fused struct ops. Should OpFieldGetSmall0 and OpFieldGetSmall1 exist as distinct opcodes, or should one OpFieldGetSmall opcode take the slot as a C operand? The former saves one operand decode per op (~0.5 cycle); the latter saves one opcode-table entry. Recommend the fused version.
  • Profile-guided respecialization. Should the interpreter record opcode-frequency data and let a future re-emit pass refine generic opcodes into specialized ones if the runtime observed a monomorphic shape? This blurs the line with Options 1 and 2. Out of scope for the first version; promising future work.
  • Cross-kind specialization. A function generic over container type (e.g., len(c) that works on lists, maps, and sets) currently must emit three call sites or a generic dispatch. A higher-level monomorphization pass on the IR (à la Rust) could specialize the function per call site. Out of scope here.

References

  • Pall. "LuaJIT 2.0 Bytecode." luajit.org/bytecode.html, 2010.
  • Bezanson, Karpinski, Shah, Edelman. "Julia: A Fresh Approach to Numerical Computing." SIAM Review 2017.
  • Borenszweig et al. The Crystal Programming Language. crystal-lang.org docs, 2014-present.
  • Rumpf. Nim in Action. Manning, 2017.
  • Hoare et al. "Rust by Example: Generics and Monomorphization." doc.rust-lang.org, 2010-present.
  • Stroustrup. The Design and Evolution of C++. Addison-Wesley, 1994. Chapter 15 on templates.
  • Peyton Jones et al. "Type Classes in Haskell." ESOP 1996.
  • Würthinger et al. "Self-Optimizing AST Interpreters." DLS 2012. (Truffle/Graal, a hybrid of Options 2 and 3.)