# MEP 19. Adaptive Specialization and Inline Caches
| Field | Value |
|---|---|
| MEP | 19 |
| Title | Adaptive Specialization and Inline Caches |
| Author | Mochi core |
| Status | Draft |
| Type | Standards Track |
| Created | 2026-05-16 |
## Abstract
Mochi is statically typed, which means most of the specialization work that V8 and SpiderMonkey do at runtime (hidden classes, shape transitions) is already handled at compile time. The Mochi compiler already emits `OpAddInt` instead of `OpAdd` when it can prove both operands are `int` (via the `regTag` mechanism in `runtime/vm/compiler*.go`). What remains is the small set of cases where the compiler does not have enough information at emit time but the runtime sees the same pattern every time:
- Polymorphic call sites: a `fun apply(f: fun(int): int, x: int)` is called once with `f = double` and a thousand times with `f = print`. The compiler types `f` as a function value; the interpreter could observe that it is always the same closure.
- Untyped boundaries: values typed `any`, query rows, dynamic JSON, agent intent dispatch.
- Indirect operations: `OpIndex` on a `Value` whose tag is checked at runtime; once observed to always be a list, the typed list-index variant is faster.
This MEP adopts the PEP 659 model ([PEP-659]) for adaptive specialization (quickening) and the classic monomorphic / polymorphic inline cache ([Hölzle-91]) for the remaining dynamic dispatch points. Both reuse the bytecode rewriting slot defined in MEP-18 Phase C. No JIT, no codegen; bytes are patched in place after observation.
Expected ceiling per the literature: PEP 659 alone gave CPython 3.11 roughly 25% over 3.10 ([PEP-659]); inline caches gave Self 4-5x in the original Hölzle/Chambers/Ungar measurements ([Hölzle-91]). Mochi should gain less than V8 or Self, because the type system has already taken most of the property-access wins, and more than CPython, because a register VM spends less time in dispatch, so the type checks that specialization removes are a larger share of what remains.
## Motivation
The argument for adaptive specialization is that compile-time information is incomplete in a few places that show up disproportionately in real code. Three concrete examples from the Mochi corpus:
- Agent intent dispatch. An agent declared with three `intent` methods has its dispatch routed through the struct's method table at runtime. The compiler knows the agent type; it does not know which intent will be called from a given `on stream` block. Every observed call site is in practice monomorphic.
- Query row access. Inside `from ... select`, row fields are typed (MEP-14), but the compiled query plan reads them through a generic accessor when the row type is a generated struct from a `csv.read`. The accessor checks the tag every time. An IC would replace the check with a one-time guard plus a direct load.
- HOF callback invocation. After MEP-15 Stage 3d, the effect set propagates correctly, but the call itself still goes through a generic `OpCall` that resolves the closure at each invocation. The same closure value reaches the same call site `b.N` times; resolve it once.
The CPython team's PEP 659 motivation matches: "an enormous amount of work has gone into making the interpreter as fast as possible. We have now reached the limits of what is possible with an interpreter that does not specialize on the basis of dynamic feedback." Mochi is not at that limit yet, but specialization unlocks the next tier without requiring a JIT.
## Specification

### Adaptive specialization (quickening)
Quickening is the act of rewriting a generic opcode into a typed variant after observing stable types. The mechanism reuses MEP-18 Phase C: each `Instr` carries a `Quick uint8` byte; the interpreter dispatches on `Quick` when it is non-zero and on `Op` otherwise.
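To make the sketches in this MEP concrete, here is a minimal shape for the instruction word and the dispatch step. All field names, the `Value` layout, and the reserved `Flags` byte are illustrative (the real definitions live in `runtime/vm`); only the `Op`/`Quick` split is specified by this MEP:

```go
// Value, Instr, and VM as assumed by the sketches in this MEP.
type Value struct {
    Tag   uint8
    Int   int64
    Float float64
}

const (
    TagInt uint8 = iota + 1
    TagFloat
)

type Instr struct {
    Op      uint8 // generic opcode, fixed at compile time
    Quick   uint8 // quickened variant; 0 = not quickened, patched at run time
    Flags   uint8 // reserved byte; used below for observation/deopt counters
    A, B, C uint8 // register operands
}

type VM struct {
    reg      []Value
    handlers [256]func(*VM, *Instr)
}

func (vm *VM) dispatch(in *Instr) {
    op := in.Op
    if in.Quick != 0 {
        op = in.Quick // a non-zero Quick byte wins over the generic opcode
    }
    vm.handlers[op](vm, in)
}
```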
Three quickening families ship with this MEP:
- Numeric binary ops (`OpAdd`, `OpSub`, `OpMul`, `OpDiv`, `OpMod`, `OpEqual`, `OpNotEqual`, `OpLess`, `OpLessEq`). The generic handler observes the operand tags. After two consecutive observations of the same type pair (int+int, float+float), it patches `Quick = OpAdd_II` (or the relevant variant). The typed handler replaces the generic tag dispatch with a single guard; if a later call sees a different tag pair, the guard patches `Quick = 0` (deopt to generic) and resumes generic dispatch. The patching cost amortizes after the second iteration of any loop. The observation path is sketched after this list.
- Indexed access (`OpIndex`, `OpSetIndex`). The generic handler dispatches on the container tag (list vs map vs string). After observation, it quickens to `OpIndex_List` / `OpIndex_Map` / `OpIndex_Str`. The typed list variant performs one bounds check and indexes directly into the underlying slice.
- Call dispatch (`OpCall`, `OpCallV`). The generic handler resolves the callee through the env. After observation, it caches the resolved function pointer (closure index) in a side table keyed by bytecode offset; the quickened opcode loads from the cache and skips the env lookup. (This bleeds into inline caches; see below.)
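A sketch of the generic `OpAdd` observation path, using the types above. The encoding of the last-seen tag pair in the high nibble of `Flags` and the concrete opcode values are illustrative; only the two-consecutive-observations rule is from this MEP:

```go
// Illustrative quickened opcode values and last-seen-pair encoding.
const (
    OpAdd_II uint8 = 0x80 + iota // quickened int+int add
    OpAdd_FF                     // quickened float+float add
)

const (
    pairII   uint8 = 0x10 // high nibble of Flags: last pair was int+int
    pairFF   uint8 = 0x20 // high nibble of Flags: last pair was float+float
    pairMask uint8 = 0xF0
)

func opAddGeneric(vm *VM, in *Instr) {
    a, b := vm.reg[in.B], vm.reg[in.C]
    switch {
    case a.Tag == TagInt && b.Tag == TagInt:
        vm.reg[in.A] = Value{Tag: TagInt, Int: a.Int + b.Int}
        if in.Flags&pairMask == pairII { // second int+int in a row: quicken
            in.Quick = OpAdd_II
        }
        in.Flags = in.Flags&^pairMask | pairII
    case a.Tag == TagFloat && b.Tag == TagFloat:
        vm.reg[in.A] = Value{Tag: TagFloat, Float: a.Float + b.Float}
        if in.Flags&pairMask == pairFF {
            in.Quick = OpAdd_FF
        }
        in.Flags = in.Flags&^pairMask | pairFF
    default:
        in.Flags &^= pairMask        // mixed tags: reset the observation
        vm.reg[in.A] = slowAdd(a, b) // full tag-dispatch slow path, elided
    }
}

func slowAdd(a, b Value) Value { return Value{} } // placeholder for the sketch
```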
The deoptimization story is simple. Every quickened handler begins with a tag guard before it does the typed work. If the guard fails, it writes `Quick = 0` and re-dispatches to the generic handler in the same iteration. Total cost of a deopt: one branch, one byte write, one re-dispatch. The contract is: quickened handlers are an optimization, never a commitment. A function may quicken, deopt, quicken to a different variant, and so on.
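A matching sketch of the quickened handler, with the guard and the one-byte deopt (types as in the sketches above):

```go
// Quickened OpAdd_II: one guard, the typed work, deopt on guard failure.
func opAddII(vm *VM, in *Instr) {
    a, b := vm.reg[in.B], vm.reg[in.C]
    if a.Tag != TagInt || b.Tag != TagInt {
        in.Quick = 0    // deopt: one branch, one byte write...
        vm.dispatch(in) // ...one re-dispatch to the generic handler
        return
    }
    vm.reg[in.A] = Value{Tag: TagInt, Int: a.Int + b.Int} // no tag dispatch
}
```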
A counter limits churn. Each `Instr` gets a 4-bit deopt counter packed into a reserved field; after 8 deopts the instruction stops trying to quicken and stays generic for the lifetime of the program. This prevents pathological pinball on truly polymorphic sites.
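A sketch of the counter, packed into the low nibble of the reserved `Flags` byte (the packing is illustrative; the 8-deopt cap is the MEP's):

```go
const (
    deoptMask  uint8 = 0x0F // low nibble of Flags holds the deopt counter
    deoptLimit uint8 = 8
)

// recordDeopt bumps the saturating 4-bit counter on each deopt.
func recordDeopt(in *Instr) {
    if n := in.Flags & deoptMask; n < deoptLimit {
        in.Flags = in.Flags&^deoptMask | (n + 1)
    }
}

// mayQuicken gates every Quick write; once the cap trips, the
// instruction stays generic for the lifetime of the program.
func mayQuicken(in *Instr) bool {
    return in.Flags&deoptMask < deoptLimit
}
```

In this scheme the generic handler above would wrap each `in.Quick = ...` write in a `mayQuicken(in)` check, and the quickened handler would call `recordDeopt(in)` before clearing `Quick`.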
### Inline caches for call dispatch
The PEP 659 model handles the type-specialization case well. It does not handle the case where the callee is dynamic. Method dispatch on agents, function-typed parameters, and match-bound function values all route through `OpCall` with the function value resolved from a register at run time.
For these, MEP-19 adopts the classic Hölzle/Chambers/Ungar inline cache. A call site has a small inline structure:
```go
type CallSite struct {
    key   uintptr   // observed closure pointer
    fn    *Function // resolved target
    arity uint8     // expected argument count
    count uint8     // deopt counter
}
```
The cache lives in a side table indexed by the bytecode offset of the call instruction, allocated lazily on first execution. On entry the handler compares `callee.ptr == key`; on a hit it jumps to `fn` directly. On a miss it falls through to the generic call resolution, updates the cache, and re-dispatches. The structure is one 24-byte slot per cached call site.
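A sketch of the monomorphic hit/miss path. The package-level `callSites` map stands in for the per-function side table, and `Closure`, `Function`, and the `resolve`/`invoke` stubs are illustrative:

```go
type Function struct{ /* code, constants, register count; elided */ }

type Closure struct {
    ptr uintptr // stable identity of the closure, used as the cache key
    fn  *Function
}

var callSites = map[int]*CallSite{} // bytecode offset -> cached call site

func (vm *VM) callThroughIC(pc int, callee Closure, args []Value) Value {
    site := callSites[pc]
    if site == nil {
        site = &CallSite{} // uninitialized tier: allocated on first execution
        callSites[pc] = site
    }
    if site.key == callee.ptr { // monomorphic hit: one compare, one call
        return vm.invoke(site.fn, args)
    }
    // Miss: generic resolution, then fill (or overwrite) the cache.
    fn := vm.resolve(callee)
    site.key, site.fn = callee.ptr, fn
    return vm.invoke(fn, args)
}

func (vm *VM) resolve(c Closure) *Function { return c.fn } // env lookup elided
func (vm *VM) invoke(fn *Function, args []Value) Value {
    return Value{} // frame push and bytecode execution elided
}
```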
Four tiers:
- Uninitialized: the first call fills the cache. Cost: one cache fill.
- Monomorphic: subsequent hits with the same callee. Cost: one pointer compare and one indirect call. This is the common case.
- Polymorphic: an N-way linear list (N = 4) of `(key, fn)` pairs. Misses on all four ways bump a counter; when the counter trips, the site goes megamorphic. (See the sketch after this list.)
- Megamorphic: generic dispatch only; the cache is bypassed for the rest of the program.
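A sketch of the polymorphic tier: a fixed four-way linear cache per site. The entry layout, the `mega` flag, and the miss threshold of 8 are illustrative (the MEP fixes only N = 4):

```go
const polyWays = 4

type PolySite struct {
    entries [polyWays]CallSite
    n       uint8 // number of filled entries
    misses  uint8 // misses with a full cache; trips the site to megamorphic
    mega    bool  // once set, the cache is bypassed for good
}

// lookup scans the entries linearly: at most four pointer compares.
func (s *PolySite) lookup(key uintptr) *Function {
    for i := uint8(0); i < s.n; i++ {
        if s.entries[i].key == key {
            return s.entries[i].fn
        }
    }
    return nil
}

// record fills a free slot, or counts a full-cache miss toward megamorphic.
func (s *PolySite) record(key uintptr, fn *Function) {
    if s.n < polyWays {
        s.entries[s.n] = CallSite{key: key, fn: fn}
        s.n++
        return
    }
    if s.misses++; s.misses >= 8 {
        s.mega = true
    }
}
```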
This is exactly the staging V8 uses for property access ICs ([Bynens-Shapes]), scaled down to Mochi's much smaller call graph.
### Hidden classes (not needed)
V8 invents hidden classes because JavaScript object layout is dynamic. Mochi structs are static (MEP-2 onward), so each struct already has a stable layout descriptor (`types.StructType`) known at compile time. Property access `obj.x` already compiles to a slot load with a constant offset. There is no IC to add at the property level.
The one place hidden-class-like reasoning helps is intent method dispatch on agents: the intent table is indexed by method name. An IC can cache the resolved method pointer per call site exactly as for closures above.
### Profile gathering
Quickening and ICs both need first-observation feedback. The mechanism is the dispatch loop itself: the generic handler patches the quickened slot or fills the cache after observing. No separate profile pass, no tier-up event, no recompilation. This is the PEP 659 "specializing adaptive interpreter" model and is the cheapest path from "no feedback" to "useful feedback."
### Interaction with MEP-15 effects
Effect-typed call sites (`! io`, `! fs`, etc.) are observable via the callee's `FuncType.Effects` (after MEP-15). The IC stores the effect set alongside the resolved function pointer so the dispatching code path does not re-resolve effect information per call.
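A sketch of the widened cache line, assuming an `EffectSet` representation (MEP-15 defines the effect system; the bit-set encoding and field names here are illustrative):

```go
type EffectSet uint32 // bit set over io, fs, net, ...; representation assumed

type effectCallSite struct {
    key     uintptr
    fn      *Function
    effects EffectSet // copied once from the callee's FuncType.Effects on fill
}

// fill resolves the effect set exactly once; every subsequent hit reads
// the cached copy instead of walking the callee's type.
func (s *effectCallSite) fill(key uintptr, fn *Function, eff EffectSet) {
    s.key, s.fn, s.effects = key, fn, eff
}
```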
### Interaction with MEP-18
Quickening requires the bytecode rewriting slot from MEP-18 Phase C. ICs require the side table indexed by bytecode offset, which is new in this MEP but trivially aligned with the existing `Instr` indexing.
The combined gain of MEPs 18+19 is the target Mochi should hit before contemplating a JIT. The literature suggests this should land within 2-3x of LuaJIT's interpreter on type-stable workloads.
## Status at a glance
| Item | Status |
|---|---|
| Quickening: numeric binary ops (`OpAdd_II`, `OpLess_II`, ...) | proposed |
| Quickening: typed indexed access (`OpIndex_List`, ...) | proposed |
| Quickening: deopt counter and stable-fail path | proposed |
| Inline cache: monomorphic call site | proposed |
| Inline cache: polymorphic (N=4) call site | proposed |
| Inline cache: megamorphic fallback and counter | proposed |
| Effect-set caching alongside resolved callee | proposed |
| Side-table allocation strategy (`CallSite` per bytecode offset) | proposed |
## Risks
- Cache pollution. Long-lived programs may accumulate `CallSite` entries for one-shot code. Mitigation: allocate the side table lazily per function and release it with the function. (Mochi has no eval-style function-lifetime gymnastics yet.)
- Deopt thrashing. A site that flips between two types on every iteration would quicken and deopt on every step. The 4-bit deopt counter caps this at 8 events; after that the site stays generic.
- Concurrency. The patch model assumes single-goroutine VM execution. A future concurrent VM (MEP-22 or later) must revisit this; the single-byte patches cannot tear, but a reader could see a stale `Quick` byte. Stale dispatch is still correct (a stale quickened handler re-checks its guard, and the generic handler always works), so the only cost is a missed optimization. Sufficient.
## Non-goals
- No JIT. All wins are interpreter-level.
- No tracing or method-based compilation. That is MEP-22.
- No new opcodes beyond quickened variants. The variants are mechanical (`OpAdd_II`, `OpAdd_FF`, `OpIndex_List`, etc.).
## Implementation notes
The three quickening families are independent PRs. Each lands with MEP-17 numbers showing a measurable gain on at least one of `iter_sum`, `string_cat`, or `hof_map`. The IC infrastructure is one PR, with the monomorphic path landing first and the polymorphic extension following only if real workloads need it.
Total opcode budget: 32 new quickened opcodes (within the 256 - 107 = 149 headroom). The naming convention is `Op<Generic>_<TypeSig>` (e.g. `OpAdd_II`, `OpIndex_List`).
## References
- [PEP-659] Mark Shannon, PEP 659: Specializing Adaptive Interpreter. https://peps.python.org/pep-0659/
- [Hölzle-91] Urs Hölzle, Craig Chambers, David Ungar, Optimizing Dynamically-Typed Object-Oriented Languages With Polymorphic Inline Caches. ECOOP 1991.
- [Bynens-Shapes] Mathias Bynens, JavaScript engine fundamentals: Shapes and Inline Caches. https://mathiasbynens.be/notes/shapes-ics
- [Chambers-Self] Craig Chambers, David Ungar, Elgin Lee, An Efficient Implementation of SELF, a Dynamically-Typed Object-Oriented Language Based on Prototypes. OOPSLA 1989.
- [V8-FastProps] V8 team, Fast properties in V8. https://v8.dev/blog/fast-properties
- [Ungar-Self-IC] David Ungar, Randall Smith, Self: The Power of Simplicity. OOPSLA 1987 (origin of much of the IC machinery).
- [Shannon-CPython] Mark Shannon, Making CPython Faster: A four-year plan. https://github.com/markshannon/faster-cpython