
MEP 27. VM2 Data Model Option 2 - Polymorphic + Inline Caches

| Field | Value |
|---|---|
| MEP | 27 |
| Title | VM2 Data Model Option 2 - Polymorphic + Inline Caches |
| Author | Mochi core |
| Status | Draft |
| Type | Standards Track |
| Created | 2026-05-16 |

Abstract

This MEP specifies a dispatch strategy for the MEP-25 data model in which every container kind is structurally polymorphic (a single Cell can refer to any of the kind's sub-shapes at any time) but dispatch is monomorphized per call site via inline caches. Each opcode that touches a container carries a per-site cache of "the last K shapes seen here"; on a cache hit, dispatch is one compare + one branch; on a cache miss, the runtime updates the cache and falls through to a polymorphic handler. Containers never migrate; the IC instead grows or thrashes.

This is Option 2 of three. MEP-26 specifies the one-way-migration alternative. MEP-28 specifies the AOT codegen alternative. The same shape spec underlies all three; only dispatch differs. A common comparison matrix appears at the end of each MEP.

Motivation

Migration (MEP-26) assumes that, once a container's shape stabilizes, it stays stable. That assumption is correct for most of Mochi's typed workloads but can be wrong in corners:

  • A dict used as a small heterogeneous "options bag" that gains keys of mixed types over time.
  • A query result list that is built up from a mix of branches, some of which produce ints and some of which produce strings.
  • A long-lived map whose key set genuinely varies between phases of execution.

Migration pays for these workloads by either (a) eagerly widening to the generic shape, losing all specialization, or (b) flapping between shapes (which one-way migration forbids). Inline caches handle them gracefully: each call site independently caches the shapes it observes, so a site that always sees MapI64 stays fast even if other sites on the same map see Map.

The cost is a small, per-site memory budget and a slightly heavier hot-path instruction sequence (a compare + branch vs migration's single guard). The benefit is that polymorphic call sites still serve all observed shapes at near-monomorphic speed, and the container shape is decoupled from the caller's expectations.

Lineage

Inline caching is one of the oldest tricks in dynamic-language runtimes. The relevant prior art:

  1. Smalltalk-80 (Goldberg, Robson, 1983). The original inline cache: each send site overwrote its own bytecode with the address of the receiver-class's method table, so the next dispatch was direct. A class-check miss rewrote the bytecode back. ("Smalltalk-80: The Language and Its Implementation", chapter on message dispatch.)

  2. Polymorphic Inline Caches in Self (Hölzle, Chambers, Ungar, 1991). Generalized Smalltalk's monomorphic IC to a small list of (class, target) pairs per call site. The paper introduced the monomorphic → polymorphic → megamorphic state machine that every modern IC implementation uses, including V8 and SpiderMonkey. K=4 was the typical polymorphic cap; above that, fall through to a method lookup. ("Optimizing Dynamically-Typed Object-Oriented Languages with Polymorphic Inline Caches.", ECOOP 1991.)

  3. V8 inline caches (Lars Bak, 2008). V8 attached an IC slot to every load / store / call site in the bytecode (Ignition) and the JIT (TurboFan). The IC slot points to a feedback vector entry that records observed shapes, smi/heap-number flags, and similar. The JIT consumes the feedback vector to emit shape-specialized code with a guard. Megamorphic sites are not JIT-specialized; they call into the generic stub. See src/ic/ic.cc in V8.

  4. SpiderMonkey (Mozilla, 2012-present). Same pattern as V8 but with explicit "CacheIR", a small IR for IC chains. Each cached path is one CacheIR sequence. ("CacheIR: A New Approach to Inline Caching in Firefox", Anjana Vakil, 2018.)

  5. PyPy meta-tracing (Bolz et al., 2009-2015). PyPy's JIT inlines guards on observed types into compiled traces; on guard failure the trace is exited and a new trace may be started. This is functionally similar to inline caching but trace-scoped rather than site-scoped. The cost model (guard succeed: ~1 instruction; guard fail: trace exit + retry) is informative for vm2's interpreter.

  6. JavaScriptCore (WebKit, 2012-present). Same monomorphic/polymorphic/megamorphic IC machinery, called "polymorphic access lists" (PolyIC). The PolyIC paper from 2013 quantified the steady-state speedup at 3-7x over a non-IC interpreter on dynamic dispatch-heavy code.

The common pattern across all six: most sites are monomorphic; most polymorphic sites have ≤4 cached shapes; megamorphic sites are rare and worth giving up on. The K=4 threshold is empirical, validated in production for over two decades.

Strategy in detail

Every opcode that consumes a container (OpListGet, OpListSet, OpListPush, OpListLen, OpMapGet, OpMapSet, OpMapHas, OpMapDel, OpMapLen, set ops, struct ops, string ops with shape branches) carries a per-site IC slot indexed by the instruction's position in the bytecode.

IC slot layout

type ICSlot struct {
	State    ICState        // uninitialized | monomorphic | polymorphic | megamorphic
	Shape0   uintptr        // type pointer for first cached shape (Go type identity)
	Handler0 func(*VM, int) // specialized handler for Shape0
	// Polymorphic extension (allocated only when State == polymorphic):
	Extra *ICExtra
}

type ICExtra struct {
	Shapes   [3]uintptr        // 3 additional shape pointers (total K=4 with Shape0)
	Handlers [3]func(*VM, int) // handlers matching Shapes, index for index
	Count    uint8             // number of valid entries in Shapes/Handlers
}

Memory: 32 bytes (one state byte plus three words, padded) for the inline part of every IC-carrying instruction. The Extra pointer is nil for the common monomorphic case (~95% of sites in measured workloads). The polymorphic extension is 56 bytes when allocated.

For a typical Mochi program with ~500 container-touching instructions, the steady-state IC table is ~16 KB inline + occasional 56-byte extras. Negligible against the bytecode and constant pool.
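The struct sizes can be checked directly. A standalone sketch, assuming the layout above (the VM stub stands in for the real VM type):

package main

import (
	"fmt"
	"unsafe"
)

type VM struct{} // stub; stands in for the real VM type

type ICState uint8

type ICSlot struct {
	State    ICState
	Shape0   uintptr
	Handler0 func(*VM, int)
	Extra    *ICExtra
}

type ICExtra struct {
	Shapes   [3]uintptr
	Handlers [3]func(*VM, int)
	Count    uint8
}

func main() {
	fmt.Println(unsafe.Sizeof(ICSlot{}))  // 32 on 64-bit targets
	fmt.Println(unsafe.Sizeof(ICExtra{})) // 56 on 64-bit targets
}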

State machine

The classic Self/V8 state machine, with vm2-specific terminology:

UNINITIALIZED
    │ first observation
    ▼
MONOMORPHIC(shape₀, handler₀)
    │ second distinct shape
    ▼
POLYMORPHIC(shape₀..shapeₖ, handlers)   ←─ third/fourth distinct shape: append in place
    │ fifth distinct shape
    ▼
MEGAMORPHIC (no IC; call generic)

Transitions:

  • UNINITIALIZED → MONOMORPHIC. First time the IC is hit, allocate nothing; just record the observed shape pointer and the handler chosen by the dispatch table.
  • MONOMORPHIC → POLYMORPHIC. Second distinct shape. Allocate ICExtra; Shape0/Handler0 stay inline as the first of the K entries, the new shape goes into Extra.Shapes[0], and Count is set to 1.
  • POLYMORPHIC → POLYMORPHIC (still under K=4): append.
  • POLYMORPHIC → MEGAMORPHIC. A fifth distinct shape. Drop Extra (the GC reclaims it), set State to megamorphic; future dispatches skip the IC and call the polymorphic-base handler directly.

The K=4 bound is taken from Self, V8, SpiderMonkey, and JSC; every production IC implementation uses 4 or 5. The bound is not a parameter to tune per workload; it is the definition of the polymorphic/megamorphic boundary.
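A minimal sketch of the miss-path state machine under the ICSlot/ICExtra layout above; the ICState constants and the point at which the dispatch table produces the handler are assumptions, not settled API:

type ICState uint8

const (
	icUninitialized ICState = iota
	icMonomorphic
	icPolymorphic
	icMegamorphic
)

// icRecord notes a newly observed (shape, handler) pair at this site and
// advances the IC state machine. Runs only on the cold path.
func icRecord(ic *ICSlot, shape uintptr, h func(*VM, int)) {
	switch ic.State {
	case icUninitialized:
		// First observation: record inline, allocate nothing.
		ic.Shape0, ic.Handler0 = shape, h
		ic.State = icMonomorphic
	case icMonomorphic:
		// Second distinct shape: allocate the polymorphic extension.
		ic.Extra = &ICExtra{}
		ic.Extra.Shapes[0], ic.Extra.Handlers[0] = shape, h
		ic.Extra.Count = 1
		ic.State = icPolymorphic
	case icPolymorphic:
		if n := int(ic.Extra.Count); n < len(ic.Extra.Shapes) {
			// Third/fourth distinct shape: append.
			ic.Extra.Shapes[n], ic.Extra.Handlers[n] = shape, h
			ic.Extra.Count++
		} else {
			// Fifth distinct shape: give up; drop the extension.
			ic.Extra = nil
			ic.State = icMegamorphic
		}
	}
}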

Hot-path opcode shape

For OpListGet A B C, the dispatch handler becomes:

// Pseudocode for the IC-aware OpListGet dispatch.
func opListGet(vm *VM, ip int, ic *ICSlot) {
	objIdx := vm.regs[vm.code[ip].B].Ptr()
	obj := vm.Objects[objIdx]
	shape := reflectTypePtr(obj) // ~1 cycle in Go via the eface type word

	if shape == ic.Shape0 {
		// Hot path: direct call to the cached handler.
		ic.Handler0(vm, ip) // inlined in practice
		return
	}
	// Cold path: miss handling, possible state transition.
	icMissListGet(vm, ip, ic, obj, shape)
}

In Go, the inner type comparison is a single 8-byte pointer compare. With branch prediction, a monomorphic site costs ~2 cycles for the IC check + the handler body (~3-5 cycles for an int list get). Total: ~6-8 cycles per OpListGet on a hot loop, comparable to migration's ~3 cycles.
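One plausible implementation of reflectTypePtr, hedged: it reads the interface's type word through the eface layout, an unexported runtime detail that has been stable in practice but is not a guaranteed Go API:

import "unsafe"

// eface mirrors the runtime layout of an empty interface: a type word
// followed by a data word. Assumed layout, not guaranteed by the Go spec.
type eface struct {
	typ  unsafe.Pointer
	data unsafe.Pointer
}

// reflectTypePtr returns the type word as a uintptr for IC comparisons.
// Two values have equal type words iff they have identical dynamic types.
func reflectTypePtr(v any) uintptr {
	return uintptr((*eface)(unsafe.Pointer(&v)).typ)
}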

The slightly higher per-op cost vs migration is paid in exchange for: (a) no migration latency spike ever, (b) polymorphic sites stay fast, (c) no deopt machinery in the JIT.

Specialized handlers per shape

Each container shape registers a specialized handler per applicable opcode:

var listGetHandlers = map[reflect.Type]func(*VM, int){
	reflect.TypeOf((*vmListI64)(nil)): opListGet_I64,
	reflect.TypeOf((*vmList)(nil)):    opListGet_Generic,
}

The handler table is built once at VM startup; lookup happens only on IC miss. The IC then stores the handler pointer directly, so the hot path never re-consults the table.
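For scale, a hypothetical opListGet_I64; the instruction and register accessors (vm.code, Ptr, Int, cellFromInt) and the trap helper are illustrative names, not settled API:

// opListGet_I64: OpListGet specialized for *vmListI64. The IC has already
// proven the shape, so only the bounds check remains.
func opListGet_I64(vm *VM, ip int) {
	ins := vm.code[ip]
	list := vm.Objects[vm.regs[ins.B].Ptr()].(*vmListI64)
	idx := vm.regs[ins.C].Int()
	if idx < 0 || idx >= int64(len(list.elems)) {
		vm.trapIndexOutOfRange(ip, idx) // safety net; well-typed programs never hit it
		return
	}
	vm.regs[ins.A] = cellFromInt(list.elems[idx])
}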

A specialized handler is tiny; opListGet_I64 (sketched above) is ~10 LOC. The total handler-table code is:

| Kind | # ops × # shapes | LOC per handler | Total LOC |
|---|---|---|---|
| Strings | 6 × 3 | ~15 | ~270 |
| Lists | 5 × 2 | ~12 | ~120 |
| Maps | 6 × 3 | ~20 | ~360 |
| Sets | 5 × 3 | ~18 | ~270 |
| Structs | 3 × 2 | ~10 | ~60 |
| Total | 67 handlers | | ~1080 |

Plus the IC framework itself (~400 LOC) and miss handlers (~200 LOC). Total Go LOC: ~1700.

Cost analysis

Per-op steady-state cost, by site shape

| Site state | Hot-path cost (cycles) | What runs |
|---|---|---|
| MONOMORPHIC | ~2 (IC) + handler | Type compare, predicted branch, direct call |
| POLYMORPHIC (K=2) | ~4 (IC) + handler | Two type compares, one predicted hit |
| POLYMORPHIC (K=4) | ~8 (IC) + handler | Up to 4 type compares, linear search |
| MEGAMORPHIC | ~15 + handler | Skip IC, jump to polymorphic-base handler |

The polymorphic case scales linearly in cached shape count, which is why K is capped: at K=8+ a hash-table-style IC would be needed (V8 has this; it's called a "stub cache"), but for vm2's expected workloads K=4 linear search suffices.
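The probe itself is that linear scan. A sketch, continuing the assumptions above (Shape0 is checked inline first, as in opListGet):

// icProbe scans the polymorphic extension for shape after the inline
// Shape0 compare has failed. Returns nil on a true miss.
func icProbe(ic *ICSlot, shape uintptr) func(*VM, int) {
	if ic.State != icPolymorphic {
		return nil
	}
	for i := 0; i < int(ic.Extra.Count); i++ { // at most K-1 = 3 compares
		if ic.Extra.Shapes[i] == shape {
			return ic.Extra.Handlers[i]
		}
	}
	return nil
}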

IC memory cost, per program

Let N_ic = number of IC-carrying instructions in the bytecode. Empirically (from measuring Lua bytecode shape on similar workloads), N_ic ≈ 0.4 × |bytecode|. For a Mochi program with 5000 instructions, N_ic ≈ 2000.

  • Inline: 2000 × 32 = 64 KB
  • Polymorphic extras (assume 5% of sites): 100 × 56 ≈ 5.6 KB
  • Total: ~70 KB per program.

This is comparable to the bytecode (~50 KB) and constants, and modest in absolute terms. The IC table is allocated once at VM.Run startup, not per call, so allocation pressure is zero in steady state.

IC update cost

A monomorphic-to-polymorphic transition allocates one ICExtra (56 bytes); cost ~80 ns. Each site allocates at most one ICExtra, and each distinct shape is recorded at most once per site, so total program-lifetime IC update work is bounded by N_ic × K ≈ 8000 cache writes worst case, all front-loaded.

Migration is gone

Unlike Option 1, containers never migrate. A *vmListI64 that needs to hold a string Cell cannot do so in this model; the runtime must instead reject the operation, or the program must allocate a *vmList from the start. To handle the latter:

  • The emitter must use the type IR to choose OpNewListGeneric vs OpNewListInt etc. at construction time (see the sketch after this list).
  • A runtime trap on type-mismatched insertion is the safety net; well-typed programs never hit it.
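A sketch of that construction-time choice; the emitter and type-IR names here (Emitter, Type, Kind, OpNewListInt, ...) are illustrative, not the actual Mochi emitter API:

// emitNewList selects the list shape opcode from the checked element type.
// There is no migration fallback, so the choice must be right up front.
func (e *Emitter) emitNewList(elem Type, dst Reg) {
	switch elem.Kind() {
	case KindInt:
		e.emit(OpNewListInt, dst)
	default:
		// Anything not provably int64 gets the generic shape.
		e.emit(OpNewListGeneric, dst)
	}
}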

This is a real difference from Option 1. Option 1 lets a poorly-typed program "just work" via migration. Option 2 puts the burden on the emitter to pick the right shape up front; mismatches trap.

For Mochi, whose front-end type checker is the source of the IR, this constraint is easy to meet. For untyped front-ends (JS, Python), Option 1 is friendlier.

JIT integration

The IC feedback vector is the standard JIT input for shape specialization (V8's playbook). The JIT consumes the IC state for a site:

  • Monomorphic IC: emit a single shape guard + the specialized handler inlined.
  • Polymorphic IC (K≤4): emit a small inline switch (one compare per cached shape), each branch the specialized handler.
  • Megamorphic IC: do not specialize; call the polymorphic-base handler.

The crucial property: no deopt is needed. If a shape guard fails on a JIT-emitted monomorphic site, the JITed code simply calls the IC miss handler, which updates the IC, recompiles the site, and continues. The JITed function does not have to abort or replay; only the failing instruction is patched. This is significantly simpler than Option 1's deopt-and-resume mechanism.

The cost is that the JIT must read the IC feedback at compile time and recompile when it grows. V8 handles this by recompiling at well-defined tiering boundaries (Ignition → Sparkplug → TurboFan); a simpler vm2 JIT would recompile on every IC state transition for affected sites.
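As a sketch, such a JIT might consume a slot at compile time like this; the jitSite methods are assumptions about a backend that does not exist yet:

// compileSite lowers one IC-carrying instruction using its feedback.
func compileSite(ic *ICSlot, site *jitSite) {
	switch ic.State {
	case icMonomorphic:
		// One guard; on failure, call the miss handler (no deopt).
		site.emitShapeGuard(ic.Shape0, ic.Handler0)
	case icPolymorphic:
		// Inline switch: one compare per cached shape.
		site.emitShapeCase(ic.Shape0, ic.Handler0)
		for i := 0; i < int(ic.Extra.Count); i++ {
			site.emitShapeCase(ic.Extra.Shapes[i], ic.Extra.Handlers[i])
		}
		site.emitMissFallthrough()
	default:
		// Megamorphic (or still cold): call the polymorphic-base handler.
		site.emitCallGeneric()
	}
}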

Per-kind interaction with MEP-25 shapes

Strings

Three shapes (Inline, Flat, Rope). Most string opcodes have site-shape distributions dominated by Flat in real code; ropes appear only in concat-heavy contexts. An IC on OpConcatStr will quickly stabilize on either (Flat, Flat) or (Rope, *) and stay there. Megamorphic string sites are essentially nonexistent in measured corpora.

Lists

Two shapes (ListI64, List). Almost all IC sites are monomorphic. A site that mixes int and generic lists (rare; would require the same call site to consume both) goes polymorphic K=2; megamorphic does not occur.

Maps

Three shapes (MapEmpty, MapI64, Map). The Empty shape is short-lived (it dies on the first OpMapSet), so IC sites that target maps tend to see (Empty, MapI64) on first iteration then steady at MapI64. Polymorphic K=2 covers this comfortably.

Sets

As maps.

Structs

Two shapes (StructSmall, Struct). Field-access sites are usually monomorphic per call site since the IR knows the struct type. Polymorphic struct sites occur for interface-style dispatch, which Mochi does not yet have; not a current concern.

Predicted MEP-23 numbers under Option 2

| Benchmark | Current (vs Lua) | Predicted under Option 2 | Mechanism |
|---|---|---|---|
| lists/fill_sum N=100 | 1.56x slower | 0.6-0.8x slower | IC hits ListI64 handler; cost ~1 cycle higher than migration |
| maps/fill_sum N=100 | 4.86x slower | 1.1-1.5x slower | Swiss table accessible via IC; per-op overhead |
| sets/fill_check N=100 | n/a | 1.0-1.4x slower | Same as maps |
| strings/concat_loop N=30 | 2.0x slower | 0.4-0.6x faster | Rope chosen by handler; same as Option 1 |
| structs/point_dot N=100 | n/a | 0.8-1.1x | Inline-2 handler |

Slightly worse than Option 1 on int-heavy benchmarks (one extra cycle per op for the IC compare); slightly better on benchmarks where shapes would otherwise force a migration penalty.

Disadvantages

  1. Per-op cost ~1-3 cycles higher than migration's single guard. Real, measurable, ~20-30% on the tightest int loops.

  2. IC memory is not free. ~64 KB inline + extras per program. For embedded vm2 use, this matters.

  3. IC misses live behind a branch. Even rare misses bloat the binary: every IC-carrying opcode needs an out-of-line miss handler, and miss handlers cannot be inlined.

  4. Container shape decoupled from caller expectations. Programs that want a list-of-int and pass it as an arg see two ICs, one in the caller, one in the callee. Each warms independently. Net effect: similar steady state to migration but more startup transient cost.

  5. The "feels redundant" problem. Each call site re-discovers the same shape independently. For a list passed through 5 functions, the same shape is cached 5 times. Migration discovers shape once.

  6. Megamorphic sites are slow. ~15-cycle floor per op when a site sees >4 shapes. The 5-shape case can be engineered around (e.g., never let a single site see all 3 map variants), but it is a real trap.

  7. Requires emitter discipline. No fallback migration means the emitter must pick the right OpNewList* / OpNewMap* based on IR type. A buggy emitter produces a runtime trap instead of a graceful widen.

Failure modes

  • IC thrash. If a site oscillates between two shapes (e.g., a function that processes both MapI64 and Map alternately), the IC stays polymorphic K=2 and pays ~4 cycles per op. Worst case is K=4 (~8 cycles). Not catastrophic.

  • Megamorphic creep. A pathological program could create many shape variants (e.g., via interface-style dispatch over a wide set of struct types). The IC would saturate and fall through to the slow path. Mitigation: a future "PIC chain" structure (V8's stub cache) handles this at a higher per-op cost but unbounded shape count.

  • IC slot races. vm2 is single-threaded; no races. A future threaded vm2 would need per-site atomic CAS on IC updates, or a copy-on-write scheme. Out of scope here.

Comparison Matrix (identical across MEP-26/27/28)

| Axis | Option 1 (Migration) | Option 2 (IC) | Option 3 (AOT) |
|---|---|---|---|
| Per-op steady-state cost (monomorphic) | 1 shape guard | 1 IC compare + load | 0 (specialized handler) |
| Per-op steady-state cost (polymorphic) | n/a (cannot be polymorphic) | 1 IC compare miss + slow path | 0; fallback path 2-3x cost |
| Cost of changing shape | O(n) one-shot, amortized | None (IC updates lazily) | None (both paths exist) |
| Reversibility | No | Yes (IC re-warms) | Yes (both handlers exist) |
| Bytecode size | Unchanged from MEP-25 | Unchanged | 2-3x more opcodes |
| Handler-table code size (Go LOC) | ~1200 | ~1500 | ~3500 |
| JIT integration complexity | Medium (deopt required) | Low (IC slot is JIT-visible) | High (must enumerate specializations at JIT time too) |
| JIT steady-state quality (monomorphic) | Excellent (1 hoisted guard) | Excellent | Excellent |
| JIT steady-state quality (polymorphic) | Poor (frequent deopt) | Good (IC handles 2-4 shapes) | Excellent (all paths compiled) |
| Memory overhead per container | 0 (shape via Objects type) | 0 | 0 |
| Memory overhead per call site | 0 | ~32 bytes per IC slot | 0 |
| Compile-time impact | None | None | 20-40% longer emit phase |
| Predicted Lua-relative on int maps | 0.9-1.3x slower (near parity) | 1.1-1.5x slower | 0.6-0.9x slower (faster!) |
| Predicted Lua-relative on int lists | 0.5-0.7x slower | 0.6-0.8x slower | 0.4-0.6x slower |
| First-implementation risk | Low (well-trodden) | Medium (IC sizing tricky) | Medium-high (codegen explosion) |
| Production deployments (prior art) | V8, PyPy, Lua, CRuby, Rust SmallVec, HotSpot | Self, V8 ICs, SpiderMonkey | LuaJIT, Julia, Crystal, Nim |

Recommendation (from this MEP's perspective)

Choose Option 2 if:

  • Workloads are observably polymorphic in production (Mochi's typed front-end argues this is rare today, but it's worth measuring before deciding).
  • The JIT roadmap is uncomfortable with deopt machinery.
  • A small per-site memory cost is acceptable in exchange for graceful handling of shape changes.
  • The team values the JIT's ability to consume IC feedback as a first-class compile input.

For Mochi specifically, this strategy is the strongest middle ground. It performs slightly worse than Option 1 on the int-heavy hot path but performs strictly better on polymorphic corners and requires no JIT deopt. If Mochi grows interface-style dispatch (sum types over heterogeneous structs, polymorphic generic functions over container kinds), this becomes the preferred strategy.

Open Questions

  • K parameter. Empirical K=4 from production runtimes. A K=2 variant trades some polymorphic coverage for a tighter inline IC (fits in one cache line). Worth measuring.
  • IC slot allocation. Per-instruction (proposed) vs per-site-class (one IC shared by all sites that touch the same call-target class). Per-instruction is simpler; per-class is more memory-efficient.
  • Lazy IC. Should the IC be allocated only on first miss, not pre-allocated? Saves memory for cold sites at the cost of one extra branch on first hit.

References

  • Goldberg, Robson. Smalltalk-80: The Language and its Implementation. Addison-Wesley, 1983.
  • Hölzle, Chambers, Ungar. "Optimizing Dynamically-Typed Object-Oriented Languages with Polymorphic Inline Caches." ECOOP 1991.
  • Hölzle, Chambers, Ungar. "Debugging Optimized Code with Dynamic Deoptimization." PLDI 1992.
  • Bak. "Adventures in V8." Talks, 2008-2013.
  • Vakil. "CacheIR: A New Approach to Inline Caching in Firefox." 2018.
  • Pizlo. "The Polymorphic Inline Cache." JavaScriptCore docs, 2014.
  • Bolz, Cuni, Fijalkowski, Rigo. "Tracing the Meta-Level: PyPy's Tracing JIT Compiler." ICOOOLPS 2009.