MEP 32. VM2 JIT Option C - Tiered Method JIT with Type Feedback
| Field | Value |
|---|---|
| MEP | 32 |
| Title | VM2 JIT Option C - Tiered Method JIT with Type Feedback |
| Author | Mochi core |
| Status | Draft |
| Type | Standards Track |
| Created | 2026-05-17 |
Abstract
This MEP specifies the tiered method JIT for vm2: a two-tier compilation pipeline stacked on the existing interpreter, where tier 1 is the MEP-30 baseline JIT and tier 2 is a typed-SSA optimizing method compiler that consumes type feedback from MEP-27 inline caches. The optimizing tier performs profile-guided inlining, escape analysis, scalar replacement, and linear-scan register allocation, then emits native code via golang-asm. Frames move bidirectionally across tiers via on-stack replacement; tier-2 code may speculatively assume a stable IC and deopt back to tier 1 (or the interpreter) on guard failure.
The architecture is the same one every serious production engine converges on: HotSpot's two-tier C1 / C2 (since 1999), JSC's four-tier LLInt / Baseline / DFG / FTL pipeline, and V8's Ignition / Sparkplug / Maglev / Turbofan stack. The lesson Google taught with Maglev (2023) is that a mid-tier optimizing JIT between a baseline tier and a peak-optimizing tier is the right shape; this MEP specifies exactly that shape for Mochi.
This is Option C of three. MEP-30 specifies the baseline tier in isolation; MEP-31 specifies the tracing alternative. MEP-32 builds on top of MEP-30 (tier 1 = MEP-30 verbatim) and adds tier 2.
Motivation
The MEP-30 baseline JIT closes the dispatch-cost gap but cannot eliminate boxing, cannot inline across call sites, and cannot exploit loop-invariant structure. A method JIT that does these three things adds another 2-3x on top of the baseline tier, which is the same ratio HotSpot's C2 adds over C1 and JSC's DFG adds over Baseline.
The published evidence for tiered designs is unusually consistent across engines:
- HotSpot (1999-present): two-tier C1 (~5x over interpreter) + C2 (~2x over C1) has been the dominant Java performance architecture for 25 years. The brief C2-only experiment in early HotSpot was abandoned because warmup cost dominated short-running programs.
- JSC's four-tier strategy: LLInt -> Baseline (~2x) -> DFG (~2-3x) -> FTL (~1.1x). Each tier costs 4-6x the compile time of the one below.
- V8's Maglev (2023) was added specifically to fill the gap between Sparkplug (no IR) and Turbofan (deeply optimizing): "good enough code, fast enough". The lesson: even a team Google's size needed an intermediate tier; jumping straight from a baseline to a peak optimizer leaves too much performance on the table during warmup.
For Mochi, the case for a tier 2 is:
- The MEP-30 baseline JIT will land at ~2-4x over the interpreter. The remaining gap to LuaJIT and to Go-native is ~3x and is dominated by boxing and call overhead, both of which are tier-2 problems.
- The IC infrastructure from MEP-27 already collects the type feedback a tier 2 needs. Building a tier 2 without that feedback (i.e. without first doing MEP-27) would not work; building one with that feedback is mostly a matter of consuming what is already there.
- A tier 2 also unlocks ergonomic features: inlining means Mochi-language abstractions (small helper functions, getters) become free at runtime, which raises the ceiling on idiomatic Mochi performance.
The case against:
- Tier 2 is substantially more engineering than tiers 0 (interpreter) and 1 (baseline) combined. JSC's DFG is several KLOC; Maglev shipped as a multi-engineer multi-year project even at Google.
- Speculative deopt is the most subtle correctness hazard in any optimizing JIT.
This MEP exists to specify the tier-2 design precisely enough that the team can decide, after MEP-30 is shipped and measured, whether to fund the tier-2 work. It is the natural Phase 2 for the JIT effort.
Specification
Tier topology
    +-------------------+
    |    Interpreter    |   (tier 0, today)
    +---------+---------+
              |
              |  call hot / back-edge hot
              v
    +-------------------+
    |  Tier 1 baseline  |   (MEP-30, copy-and-patch)
    +---------+---------+
              |
              |  same function called >= OptThreshold,
              |  IC slot monomorphic for >= ICStableCount
              v
    +-------------------+
    | Tier 2 optimizing |   (this MEP)
    +-------------------+
Frames transition both ways: a tier-2 deopt sends the frame back to tier 1 (or the interpreter if tier 1 is not present for that function); an OSR back-edge in tier 1 sends the frame forward to tier 2. The interpreter remains the source of truth for semantics; tiers 1 and 2 are observably-equivalent optimizations.
Tier 1: by reference
Tier 1 is MEP-30 verbatim. This MEP introduces no changes to tier 1; it consumes its frame layout, code cache, and trampoline ABI.
Tier 2: optimizing method JIT
Tier 2 takes a function's vm2 bytecode plus its IC slot table as input and emits native code. The pipeline:
- IR construction: walk the bytecode and build typed SSA. Type each SSA value from the local IC slot, narrowing where possible.
- Inlining: replace `OpCall` with the callee's IR when the callee is small enough and the IC at that site is monomorphic. Recursive; bounded by `MaxInlineDepth` and `MaxInlineSize`.
- Type propagation: forward dataflow over the inlined IR; collapse `Box(Unbox(x))` and `Unbox(Box(x))` pairs revealed by inlining.
- Guard insertion: for every speculation (monomorphic shape, monomorphic callee, non-overflowing int op), insert an explicit `Guard` node. Failure semantics are specified below.
- Escape analysis: standard reachability analysis on SSA. Any allocation that is neither stored to a heap location nor returned is scalar-replaced.
- Loop-invariant code motion: standard.
- Common subexpression elimination + constant folding: standard.
- Linear-scan register allocation: a small linear-scan pass over the SSA values. Live ranges are computed per basic block.
- Backend: emit native code via `golang-asm`, sharing the executable code cache with tier 1.
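A minimal sketch of this pipeline as a single driver function, in Go. Every name here (`Function`, `ICTable`, `NativeCode`, the pass functions, `MaxInlineDepth`/`MaxInlineSize`) is an illustrative stand-in, not a committed API:

```go
// compileTier2 runs the tier-2 pipeline end to end for one hot function.
// All identifiers are hypothetical; pass order follows the list above.
func compileTier2(fn *Function, ics *ICTable) (*NativeCode, error) {
	ir, err := buildSSA(fn, ics) // typed SSA; deopt descriptors recorded here
	if err != nil {
		return nil, err
	}
	inline(ir, ics, MaxInlineDepth, MaxInlineSize) // monomorphic call sites only
	propagateTypes(ir)                // collapses Box(Unbox(x)) / Unbox(Box(x))
	insertGuards(ir)                  // one explicit Guard per speculation
	scalarReplace(escapeAnalysis(ir)) // non-escaping allocations -> scalars
	licm(ir)
	cseAndConstFold(ir)
	intervals := linearScan(ir) // live ranges computed per basic block
	return emit(ir, intervals)  // golang-asm backend, shared code cache
}
```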
Tier 2 IR
The IR is conventional typed SSA, modeled on Cranelift's CLIF and JSC's DFG IR:
    %0:i64 = LoadReg %frame, #r0
    %1:listptr = LoadObject %vm, %0
    Guard %1, ShapeIs(ListI64)
    %2:i64 = LoadReg %frame, #r1
    %3:i64 = ListGetI64 %1, %2
    %4:i64 = AddI64 %3, %2
    Guard %4, NoOverflow
    StoreReg %frame, #r2, %4
Types are restricted to a small set: i64, f64, bool, listptr<ListI64|Generic>, strptr<Inline|Flat|Rope>, setptr, mapptr, structptr<shape-id>, and the catch-all Cell. Promotion from Cell to a narrower type costs an unbox plus a tag guard.
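One plausible Go shape for these IR nodes, with all names illustrative rather than a committed API:

```go
// Type is the tier-2 IR's small, closed type set.
type Type uint8

const (
	TypeCell Type = iota // boxed catch-all; the widest type
	TypeI64
	TypeF64
	TypeBool
	TypeListPtr   // parameterized by element shape: ListI64 or Generic
	TypeStrPtr    // parameterized by string shape: Inline, Flat, or Rope
	TypeSetPtr
	TypeMapPtr
	TypeStructPtr // parameterized by shape id
)

// Opcode names follow the IR example above: LoadReg, ListGetI64, Guard, ...
type Opcode uint16

// Value is one node of the typed SSA graph.
type Value struct {
	ID   int
	Op   Opcode
	Type Type
	Args []*Value
	Aux  any // opcode-specific payload: register index, shape id, guard kind
}
```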
Speculation and deopt
A Guard node carries the speculation kind, the SSA value being checked, and a deopt descriptor: the bytecode PC to resume at and the live-variable map at that PC. On guard failure the tier-2 code jumps to a per-guard stub that:
- Walks the deopt descriptor and writes each tier-2 SSA value back into the frame's register file as the interpreter expects it.
- Sets `frame.PC` to the resume PC.
- Jumps to tier 1 if present, else to the interpreter.
Deopt is the single largest correctness hazard in this MEP. The protocol is constrained as follows to bound the hazard:
- Deopt descriptors are computed at IR-build time, before any optimization. Optimizations may rewrite IR but must preserve the descriptor's reachability and the mapping from PC to live-variable set.
- Inlining replicates the deopt descriptors of the inlined callee, adjusted for the new call chain. The descriptor records the full inlined call stack so the interpreter can resume with the right frame stack.
- A failed deopt (interpreter and tier 2 disagree on observable behavior after the deopt) is a process-fatal bug. There is no soft recovery. The fuzzers in `runtime/vm2/jit/conformance/` exist to find every such bug pre-merge.
This is the same protocol Google adopted for Maglev. The cost is high; the alternative (a non-speculative tier 2) is a worthless tier 2.
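A sketch of the descriptor and of the stub's three steps, in Go. `Frame`, `MachineRegs`, and `materialize` are assumed helpers, not existing vm2 API:

```go
// DeoptDescriptor maps tier-2 machine state back to interpreter state.
// Computed at IR-build time, before any optimization pass runs.
type DeoptDescriptor struct {
	ResumePC int            // bytecode PC the interpreter resumes at
	Live     []LiveSlot     // live-variable map at ResumePC
	Inlined  []InlinedFrame // full inlined call chain, outermost first
}

type LiveSlot struct {
	Val *Value // tier-2 SSA value holding the live variable
	Reg int    // interpreter register-file slot it must be written to
}

type InlinedFrame struct {
	Fn       *Function // inlined callee
	ReturnPC int       // caller PC to resume at after the callee returns
}

// onGuardFailure is the job of the per-guard stub, expressed in Go.
func onGuardFailure(d *DeoptDescriptor, frame *Frame, regs *MachineRegs) {
	// 1. Write every live SSA value back into the interpreter's register
	//    file (reconstructing one frame per entry in d.Inlined).
	for _, s := range d.Live {
		frame.Regs[s.Reg] = materialize(s.Val, regs)
	}
	// 2. Resume at the recorded bytecode PC.
	frame.PC = d.ResumePC
	// 3. Control then continues in tier 1 if the function has baseline
	//    code, else in the interpreter.
}
```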
Tiering policy
Tier transitions are governed by simple counters and thresholds, the same shape every production engine uses:
| Counter / Trigger | Threshold | Default | Tunable via |
|---|---|---|---|
| Function call count -> tier 1 | JITThreshold | 1000 | MOCHI_JIT_THRESHOLD |
| Function call count -> tier 2 | OptThreshold | 10000 | MOCHI_OPT_THRESHOLD |
| Back-edge OSR -> tier 1 | BackEdgeThreshold | 10000 | MOCHI_BACKEDGE_THRESHOLD |
| IC stability -> permits tier 2 | ICStableCount | 100 | MOCHI_IC_STABLE |
| Deopts before disabling tier 2 | MaxDeoptsPerFn | 10 | MOCHI_MAX_DEOPTS |
A function that deopts more than MaxDeoptsPerFn times has its tier-2 code discarded and is denylisted from tier 2 for the process lifetime; subsequent calls run in tier 1. This mirrors V8's "never-optimize" flag and bounds the cost of pathological speculation cycles.
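A sketch of the entry-point check under these thresholds; the counter fields and `enqueueTier2Compile` are illustrative:

```go
// maybeTierUp is consulted on function entry. Thresholds mirror the table
// above; field names (CallCount, Tier1Code, Tier2Denied, ...) are hypothetical.
func maybeTierUp(fn *Function) {
	fn.CallCount++
	switch {
	case fn.Tier2Denied || fn.Tier2Code != nil:
		return // denylisted after MaxDeoptsPerFn deopts, or already optimized
	case fn.CallCount >= OptThreshold && fn.ICs.StableFor() >= ICStableCount:
		enqueueTier2Compile(fn) // background pool; keep running tier 1 meanwhile
	case fn.Tier1Code == nil && fn.CallCount >= JITThreshold:
		compileTier1(fn) // MEP-30 baseline tier
	}
}
```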
Compilation off the main goroutine
Tier-2 compilation is performed on a dedicated background goroutine pool (runtime.NumCPU() / 4, minimum 1). The triggering call runs in tier 1 while compilation proceeds; the tier-2 code is installed atomically when ready, and subsequent calls pick it up via the JITCode pointer.
This is the V8 / HotSpot / JSC pattern. Synchronous tier-2 compilation is not specified because tier-2 compile times in published engines range from milliseconds to hundreds of milliseconds per function and would block whatever goroutine triggered them. Mochi cannot afford that.
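A sketch of the pool and the atomic install, assuming `fn.JITCode` is an `unsafe.Pointer` read atomically by the call trampoline (all non-stdlib names are the hypothetical ones used in the sketches above):

```go
import (
	"runtime"
	"sync/atomic"
	"unsafe"
)

var compileQueue = make(chan *Function, 256)

func startCompilePool() {
	workers := runtime.NumCPU() / 4
	if workers < 1 {
		workers = 1
	}
	for i := 0; i < workers; i++ {
		go func() {
			for fn := range compileQueue {
				code, err := compileTier2(fn, fn.ICs)
				if err != nil {
					continue // compile failed: the function stays on tier 1
				}
				// Atomic install: a racing caller sees either the old
				// tier-1 pointer or the finished tier-2 pointer, never a
				// half-written one.
				atomic.StorePointer(&fn.JITCode, unsafe.Pointer(code))
			}
		}()
	}
}
```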
Code-cache management
Tier 2 shares the executable code-cache pool with tier 1, but with two added rules:
- Per-function size cap: a single tier-2 compilation may not exceed 64 KiB. Functions whose IR exceeds this are denylisted from tier 2 (very rare in practice; bounds the cache fragmentation risk).
- Eviction policy: when the code cache exceeds `MaxCacheSize` (default 64 MiB), the least-recently-executed tier-2 code is evicted and the corresponding function falls back to tier 1. Tier-1 code is never evicted (it is cheaper to keep than to recompile).
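A sketch of the eviction loop under those two rules; `CodeCache` and its methods are illustrative:

```go
// evictIfNeeded drops the least-recently-executed tier-2 entries until the
// cache fits under MaxCacheSize. Tier-1 entries are never candidates.
func (c *CodeCache) evictIfNeeded() {
	for c.bytesUsed > MaxCacheSize {
		victim := c.leastRecentlyExecutedTier2()
		if victim == nil {
			return // only tier-1 code left; nothing is evictable
		}
		c.unmap(victim) // release the executable pages
		// The function falls back to its (never-evicted) tier-1 code.
		atomic.StorePointer(&victim.Fn.JITCode, unsafe.Pointer(victim.Fn.Tier1Code))
	}
}
```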
Interaction with MEP-27 inline caches
Tier 2 reads the IC slots at compile time as type predictions and consults IC stability counts at tiering time as a signal (a function whose ICs are still churning is a bad tier-2 candidate). Tier 2 does not emit ICs of its own; speculation is baked in as guards, as sketched below.
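A sketch of how one IC slot might become a compile-time prediction; `ICSlot` and its fields stand in for the MEP-27 structures:

```go
// predictType turns one MEP-27 IC slot into a tier-2 type prediction.
// All names here are hypothetical stand-ins.
func predictType(ic *ICSlot) (Type, bool) {
	if ic.State != Monomorphic || ic.StableFor < ICStableCount {
		return TypeCell, false // churning or polymorphic: stay boxed, no guard
	}
	// Stable and monomorphic: speculate on the recorded shape. The IR
	// builder pairs this prediction with a Guard node so a later shape
	// change deopts instead of miscomputing.
	return typeForShape(ic.Shape), true
}
```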
A change to an IC slot from monomorphic to polymorphic at runtime can invalidate tier-2 code. Two recovery options:
- Active invalidation: walk the JIT'd function list, mark any function whose IR depended on the changed IC as `Stale`, force-deopt all running frames of that function on the next safepoint. Correct, expensive.
- Passive invalidation (v1): do nothing; rely on the per-frame guard to fail naturally when the polymorphic case is hit. Simpler, slightly less optimal (some tier-2 frames keep running with stale assumptions until they hit the changed code path).
v1 ships passive invalidation. v2 may add active invalidation if measurements warrant.
Cost model
Per-call steady-state cost on a typical Mochi function (10-20 vm2 ops, one IC, one call), Apple M4:
| Path | Cycles per call |
|---|---|
| vm2 interpreter (post-MEP-19) | ~250 |
| Tier 1 baseline JIT (MEP-30) | ~80 |
| Tier 2 optimizing JIT (this MEP) | ~30 |
| Theoretical floor (Go inline) | ~15 |
Tier 2's win over tier 1 is dominated by inlining and unboxing, not by the classical scalar optimizations (LICM, CSE). A function with no inlining opportunities sees only a ~1.3x tier-2 win; a function with substantial inlining can see ~3x.
Engineering scope
Beyond the MEP-30 baseline tier:
| Component | Lines of Go | Engineer-weeks |
|---|---|---|
| Typed SSA IR + builder | 2500 | 8 |
| Type propagation + constant folding | 1200 | 5 |
| Inliner | 1500 | 6 |
| Escape analysis + scalar replacement | 1800 | 8 |
| LICM + CSE + DCE | 1500 | 5 |
| Linear-scan register allocator | 1200 | 6 |
| golang-asm backend (SSA lowering) | 2000 | 8 |
| Guard / deopt protocol + stubs | 1500 | 8 |
| Tiering policy + counters | 400 | 2 |
| Background compilation pool | 400 | 2 |
| Code-cache eviction | 500 | 3 |
| IC invalidation hooks | 300 | 2 |
| Conformance + deopt fuzzers (mandatory) | 3000 | 12 |
| Tier-2-debug tooling | 1500 | 6 |
| Tier 2 total (incremental over MEP-30) | ~19300 | ~81 weeks |
Combined with MEP-30, roughly 1.5 engineer-years to prototype and another year to harden. Consistent with the smaller end of published method-JIT timelines (Maglev took longer with a larger team).
JIT integration with the rest of vm2
- MEP-19 quickening: tier-2 IR consumes quickened opcodes directly; no separate path.
- MEP-25 shapes: shape transitions deopt tier-2 frames whose guards assumed the prior shape.
- MEP-27 inline caches: read at compile time as type predictions; consulted at tiering time as stability signal.
- MEP-28 AOT specialization: tier 2 subsumes MEP-28 for hot code (it specializes more aggressively); MEP-28 remains the right strategy for cold code.
- MEP-30 baseline JIT: tier 1 of this MEP. Hard prerequisite.
Risks
- Deopt correctness. The single largest failure mode. Mitigation: extensive conformance fuzzing in CI (mandatory before merge); deopt path is the only path tier 2 takes on any guard failure, exercised by every test that touches polymorphic code.
- Engineering scope. ~1.5 engineer-years is a substantial commitment for the Mochi team. Mitigation: gate the tier-2 work explicitly on measured tier-1 ceiling (a future Informational MEP, analogous to MEP-29, comparing measured tier-1 numbers to the LuaJIT floor).
- Inlining-driven code bloat. The cache eviction policy bounds it, but pathological cases (deeply recursive inlining of a small leaf function across many call sites) can dominate cache usage. Mitigation: per-call-site inlining budget; per-function code cap.
- Background compilation worker starvation under load. With one compile pool and N application goroutines, hot functions may take seconds to install tier-2 code under high load. Mitigation: prioritize compile jobs by accumulated tier-1 execution time; expose pool size via env var.
- Maglev's lesson on warmup. Even with two tiers, V8's Maglev exists because Sparkplug -> Turbofan was too slow a transition. If tier-1 -> tier-2 in Mochi proves similarly slow, the design admits a third tier; until then, accept the trade-off.
Alternatives considered
- Single-tier method JIT (skip tier 1): rejected. Every production engine that tried this regressed on short-running programs and reverted to a tiered design.
- Single-tier optimizing JIT triggered from interpreter (no baseline): HotSpot tried this. Warmup is unacceptable. Rejected.
- Adopt a third-party JIT framework (LLVM ORC, Cranelift): viable for tier 2, breaks single-binary distribution. Worth a follow-on MEP as an optional `mochi-jit-llvm` build tag, not the default.
- Skip tier 2 entirely, do only MEP-30: leaves ~2-3x on the table. Acceptable if the team chooses to invest the saved engineering elsewhere (e.g. on language features). The right answer depends on whether Mochi positions itself on the "fast scripting" vs "production-grade VM" axis. This MEP specifies the design; the decision to fund it is separate.
Comparison matrix
| Dimension | Option A (MEP-30) | Option B (MEP-31) | Option C (this MEP) |
|---|---|---|---|
| Predicted speedup on fill_sum | 2-4x | 5-15x | 4-8x |
| Workload coverage | Universal | Loop-heavy | Universal |
| Engineering scope (KLOC) | ~2.2 | ~11 | ~25 (incl. MEP-30) |
| Engineer-months to prototype | ~3 | ~12 | ~18 |
| Deopt complexity | None | Medium (local) | High (cross-method) |
| OSR complexity | None (free) | N/A | High |
| Single static binary preserved | Yes | Yes | Yes |
| Reuses MEP-27 IC infrastructure | Read-only | Yes | Yes (heavily) |
| Performance ceiling | Modest | High on loops | High broadly |
| Lineage | Sparkplug, CPython 3.13 | LuaJIT, PyPy | HotSpot, JSC, V8 |
Predicted MEP-23 numbers
On the MEP-23 cross-language lists/fill_sum bench, N=1024 on Apple M4, current vm2 ~3805 ns/op; predicted with the full tier-1 + tier-2 stack: ~800-1100 ns/op, reaching parity with Go-native on this monomorphic loop. On concat_loop with rope-shape allocations, ~1500-2000 ns/op, a 4-5x win over the current strings baseline. On fib (recursive, branchy), ~1.5-2x speedup from tier-2 inlining of the recursive call, which neither MEP-30 nor MEP-31 can match.
Open questions
- Whether to share IR with `compiler2`. The Mochi compiler2 work (MEP-17 area) introduced a typed SSA IR for compile-time optimizations. A unified IR across compile time and tier 2 would reduce maintenance cost but raise coupling. Likely yes, with a small adapter; to be specified in a follow-on MEP.
- Speculative call inlining policy. Inlining a monomorphic callee is uncontroversial; inlining the more common of two polymorphic callees behind a guard is the V8 Maglev choice and is worth measuring on Mochi's corpus before committing.
- Tier-2 compile latency target. Maglev targets ~1 ms/function; Turbofan targets ~100 ms. Where Mochi tier-2 should land depends on the typical Mochi function size and how often programs cross the optimization threshold. Suggest 10 ms p99 as a draft target.
- Whether tier 2 should be off by default in 1.0. Tier 1 should ship on by default once stable (per MEP-30). Tier 2's deopt-bug risk argues for a longer opt-in period. Decide post-measurement.
Related work
- Java HotSpot VM Performance Enhancements (Oracle, 2014)
- JavaScriptCore Deep Dive (WebKit Docs)
- Maglev: V8's Fastest Optimizing JIT (Verwaest, 2023)
- Sparkplug: a non-optimizing JavaScript compiler (Verwaest, 2021)
- Speculation in JavaScriptCore (Pizlo, 2020)
- Introducing the WebKit FTL JIT (Pizlo, 2014)
- Oracle Drops GraalVM JIT Compiler from JDK (InfoQ, 2024)
- Cranelift Codegen Primer (Bouvier)
- MEP-25, MEP-27, MEP-28, MEP-29: data model, IC infrastructure, AOT specialization, measured results.
- MEP-30: the baseline tier this MEP builds on.
- MEP-31: the tracing JIT alternative.