# MEP 31: VM2 JIT Option B - Tracing JIT over Typed Loops
| Field | Value |
|---|---|
| MEP | 31 |
| Title | VM2 JIT Option B - Tracing JIT over Typed Loops |
| Author | Mochi core |
| Status | Draft |
| Type | Standards Track |
| Created | 2026-05-17 |
## Abstract
This MEP specifies the tracing JIT for vm2: a per-loop compilation strategy that records the actual instructions executed on a hot back-edge, optimizes the recorded trace as straight-line code with explicit guards, and compiles it to native code. The compiled trace runs subsequent iterations directly; any deviation (a failed guard) exits to the interpreter. The lineage runs through LuaJIT 2 (Pall, 2010-) for the hand-coded single-level form, PyPy / RPython for the meta-tracing form, and SpiderMonkey TraceMonkey (2008) as the cautionary tale.
This is Option B of three. MEP-30 specifies the template / copy-and-patch baseline JIT; MEP-32 specifies the tiered method JIT. All three share the same vm2 bytecode, Cell ABI, and MEP-25 shapes; only the compilation strategy differs.
## Motivation
Tracing JITs are uniquely good at one specific shape of program: a hot loop whose dynamic behavior is far simpler than its static text. The interpreter walks branchy generic code; the trace records only what actually happened on the path through. The compiler then optimizes a linear sequence with no merge points, where escape analysis, constant folding, and guard hoisting become local rewrites instead of dataflow analyses. The published evidence:
- LuaJIT routinely matches or beats C on numeric loops and stays within 2-4x on most idiomatic Lua, on a codebase of ~30 KLOC of C plus DynASM.
- PyPy is typically 4-7x faster than CPython on numeric workloads and competitive on broader Python despite the dynamism overhead.
- Bolz & Tratt (2013) found that adding meta-tracing to an interpreter typically required ~2 KLOC of optimization code on top of the recorder, because trace optimizations are local.
The cost is well documented too:
- TraceMonkey was removed from SpiderMonkey because it lost on branchy code, and the warmup pathology (cold traces, megamorphic dispatch, trace abort cycles) was hard to diagnose.
- PyPy maintains a small expert team and a substantial set of trace-debugging tools because the failure modes are non-local.
For Mochi specifically, the workload mix matters. MEP-23's crosslang corpus is dominated by fill_sum, concat_loop, fib, and similar loop-heavy benchmarks; these are exactly the shape tracing JITs were built for. A tracing JIT would deliver a 5-15x win on this subset. The risk is what happens off this subset: code with deeply nested polymorphic dispatch, exceptions, or generators.
This MEP exists to specify the tracing alternative precisely enough that it can be measured against the baseline (MEP-30) and the tiered (MEP-32) options. A follow-on Informational MEP (analogous to MEP-29) will pick the winner from measured numbers.
## Specification
### Overview
The JIT consists of five components:
- Hotness profiler: per-back-edge counters in the interpreter.
- Recorder: records a typed-Cell trace from the moment a back-edge counter trips until control returns to the same back-edge, an inner trace exits, or a record budget is exceeded.
- Trace IR + optimizer: a small SSA IR with explicit `Guard` operations; passes are constant folding, guard hoisting, allocation removal, and dead store elimination.
- Backend: emits native code per trace via `golang-asm` (same backend as MEP-30, different IR upstream).
- Side-exit table: per-guard map from native PC to (interpreter PC, register snapshot) so the deopt path can resume.
### Hotness profiling
A `BackEdgeCount` table is added to the existing function metadata, indexed by source PC (the back-edge target). The interpreter increments it on each backward branch; when a counter crosses `TraceThreshold` (default 50), recording begins on the next entry.
TraceThreshold is deliberately low compared to MEP-30's 1000 because tracing has higher warmup cost per attempt; we want to amortize recording over many iterations but not over a million.
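Sketched in Go, the counter path the interpreter would take on each backward branch might look like this (all type and function names here are illustrative, not vm2 API; only `TraceThreshold` and the per-target-PC indexing come from this MEP):

```go
package main

import "fmt"

// Hypothetical sketch of the back-edge hotness check. TraceThreshold and
// the per-back-edge counter are named in this MEP; funcProfile and
// onBackwardBranch are invented for illustration.
const TraceThreshold = 50

type funcProfile struct {
	backEdgeCount map[int]int // indexed by back-edge target PC
}

// onBackwardBranch is called on every backward branch to targetPC.
// It returns true exactly once, when the counter crosses the threshold,
// signalling that recording should begin on the next loop entry.
func (p *funcProfile) onBackwardBranch(targetPC int) bool {
	p.backEdgeCount[targetPC]++
	return p.backEdgeCount[targetPC] == TraceThreshold
}

func main() {
	p := &funcProfile{backEdgeCount: map[int]int{}}
	hotAt := 0
	for i := 1; i <= 60; i++ {
		if p.onBackwardBranch(7) {
			hotAt = i
		}
	}
	fmt.Println(hotAt) // the loop becomes hot on its 50th back-edge
}
```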
### Trace recording
The recorder is a separate execution mode of the interpreter, not a separate VM. On each opcode dispatch, in addition to the normal handler, the recorder appends an IR node:
```go
type TraceOp struct {
	Op     vm.OpCode
	Args   [3]TraceRef // SSA references to prior IR nodes or guard outputs
	Imm    int64       // immediate from bytecode
	Guards []Guard     // type guards attached to this op
	PC     int         // originating interpreter PC for deopt
}
```
Type guards are inserted at every shape-dependent operation:
`OpListGet` records as `OpListGetTraced` with a `Guard{kind: ShapeIs, expect: ListI64}`. If the list is later not `ListI64`, the guard fails. `OpAdd` records as `OpAddI64` or `OpAddF64` per the observed types, each with a numeric tag guard.
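To make the `OpAdd` case concrete, a minimal sketch of how a recorder might specialize by observed operand tags. Only the opcode names come from this MEP; the tag type and the `recordAdd` helper are invented for this example, and mixed or non-numeric operands are treated as a hard abort:

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical operand tag observed by the interpreter at record time.
type Tag int

const (
	TagI64 Tag = iota
	TagF64
	TagOther
)

type Guard struct {
	Kind   string // e.g. "TagIs"
	Expect Tag
}

type TraceOp struct {
	Name   string
	Guards []Guard
}

// recordAdd emits the specialized form of a generic OpAdd based on the
// operand tags actually observed, attaching one tag guard per operand.
// Anything else aborts recording (a hard abort in this MEP's terms).
func recordAdd(lhs, rhs Tag) (TraceOp, error) {
	switch {
	case lhs == TagI64 && rhs == TagI64:
		return TraceOp{"OpAddI64", []Guard{{"TagIs", TagI64}, {"TagIs", TagI64}}}, nil
	case lhs == TagF64 && rhs == TagF64:
		return TraceOp{"OpAddF64", []Guard{{"TagIs", TagF64}, {"TagIs", TagF64}}}, nil
	default:
		return TraceOp{}, errors.New("untraceable operand mix: abort")
	}
}

func main() {
	op, _ := recordAdd(TagI64, TagI64)
	fmt.Println(op.Name, len(op.Guards)) // OpAddI64 2
	_, err := recordAdd(TagI64, TagOther)
	fmt.Println(err != nil) // true: mixed types abort the trace
}
```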
Recording terminates on any of:
- Loop close: control reaches the same back-edge that started the trace. The trace is closed and queued for optimization.
- Inner side-exit: control reaches a guard that was recorded as a trace exit (trace tree split).
- Record budget exceeded: trace length > 1024 IR nodes, indicating non-loop code or runaway recursion. The trace is discarded and the back-edge is marked "blacklisted" for `BlacklistDuration` (default 100 entries) before retry.
- Hard abort: any non-traceable opcode (an allocation that must call into Go-managed code with a stack-growth check, a GC barrier, an exception throw). The trace is discarded; the back-edge counter is decremented but the site is not blacklisted.
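The four termination conditions above can be sketched as a single check run after each recorded op. The result type and function are hypothetical; only the 1024-node budget and the discard/blacklist distinction come from this MEP:

```go
package main

import "fmt"

// MaxTraceLen is this MEP's record budget; the rest is illustrative.
const MaxTraceLen = 1024

type traceResult int

const (
	keepRecording    traceResult = iota
	closeTrace                   // loop closed: queue for optimization
	discardBlacklist             // budget exceeded: blacklist for BlacklistDuration
	discardOnly                  // hard abort: decrement counter, no blacklist
)

// checkTermination decides the recorder's fate after each recorded op.
func checkTermination(pc, startPC, traceLen int, opTraceable bool) traceResult {
	switch {
	case !opTraceable:
		return discardOnly
	case pc == startPC:
		return closeTrace
	case traceLen > MaxTraceLen:
		return discardBlacklist
	default:
		return keepRecording
	}
}

func main() {
	fmt.Println(checkTermination(7, 7, 12, true) == closeTrace)         // true
	fmt.Println(checkTermination(9, 7, 2000, true) == discardBlacklist) // true
	fmt.Println(checkTermination(9, 7, 12, false) == discardOnly)       // true
}
```

Note that the hard-abort check comes first: a non-traceable opcode discards the trace even if the budget is also exceeded, matching the "counter decremented but not blacklisted" rule above.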
### Trace IR and optimizations
The IR is SSA with one block (the trace itself). Operations are typed: integer add, float add, Cell box, Cell unbox, list-data-load (specialized to ListI64), and so on.
Optimization passes, in order:
- Constant folding: standard SSA constant propagation. Removes the bulk of address arithmetic produced by the recorder.
- Guard hoisting: move loop-invariant guards out of the loop body, into the trace's prologue. This is the largest single win and the reason tracing JITs are fast: a typed inner loop pays guard cost once on entry, not per iteration.
- Allocation removal: any `vmList`, `vmString`, or `vmStruct` allocated inside the trace and not escaping (no store to a heap location, no return) is scalar-replaced; its fields live in SSA values. PyPy reports this as the dominant optimization for Python performance, and the analog should hold for Mochi.
- Dead store elimination: stores to `frame.Registers[i]` whose value is overwritten before the next side-exit or trace end can be dropped, since side-exits restore from the snapshot table, not from the live frame.
- Common subexpression elimination: standard.
The optimizer is intentionally local and intentionally small: trace IR is straight-line by construction, so each pass is a single forward sweep.
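As a concrete illustration of the guard-hoisting pass, a single forward sweep over a straight-line trace might look like this. The IR node shape is invented, and the invariance test is deliberately simplified to "guards a loop input" (a real pass would compute invariance transitively over SSA operands):

```go
package main

import "fmt"

// Hypothetical trace IR node. A negative Arg marks a loop input, i.e. a
// value defined before the trace body, which makes a guard on it
// loop-invariant.
type IRNode struct {
	Op      string
	IsGuard bool
	Arg     int // SSA index of the guarded value; negative = loop input
}

// hoistGuards splits a trace body in one forward sweep: invariant guards
// move to the prologue (paid once per trace entry), everything else stays
// in the per-iteration loop body.
func hoistGuards(body []IRNode) (prologue, loop []IRNode) {
	for _, n := range body {
		if n.IsGuard && n.Arg < 0 {
			prologue = append(prologue, n) // checked once on entry
		} else {
			loop = append(loop, n) // variant guards and real work stay
		}
	}
	return prologue, loop
}

func main() {
	body := []IRNode{
		{Op: "GuardShapeIs(ListI64)", IsGuard: true, Arg: -1}, // guards the loop's list input
		{Op: "ListGetI64", Arg: 0},
		{Op: "GuardTagI64", IsGuard: true, Arg: 1}, // guards a per-iteration value
		{Op: "AddI64", Arg: 1},
	}
	pro, loop := hoistGuards(body)
	fmt.Println(len(pro), len(loop)) // 1 3: the shape guard left the loop
}
```

The payoff matches the claim above: the `ListI64` shape guard is checked once per trace entry instead of once per iteration.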
### Trace shape: linear, then trees, then meta-traces
Three trace shapes are specified in order of complexity. v1 ships only Linear.
- Linear (v1): each trace is a single loop closed by a back-edge. A side-exit drops to interpreter.
- Tree (v2): a side-exit that itself becomes hot is recorded as a child trace and patched into the parent's guard, forming a trace tree. Sharply reduces interpreter time in code with biased branches.
- Meta-traces (deferred): only relevant if Mochi grows a second front-end interpreter. Not in scope.
### Backend
Code generation reuses the MEP-30 golang-asm backend, but the input is a trace-SSA function rather than per-opcode templates. The backend performs a single linear-scan register allocation over the trace's SSA values, mapping live values to architecture registers and spilling to frame.Registers only at side-exits.
This is the key performance differentiator over MEP-30: a trace can keep an int64 accumulator in x0 across many iterations, never touching memory, because there are no opcode boundaries that need a memory-consistent state.
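A minimal sketch of the linear-scan pass described above, simplified in one respect: when no register is free it spills the incoming value, whereas a production allocator would spill the active interval with the furthest end point. All names are illustrative:

```go
package main

import (
	"fmt"
	"sort"
)

// Interval is a trace SSA value's live range in trace positions.
type Interval struct {
	Val        int // SSA value id
	Start, End int // live range
	Reg        int // assigned register; -1 = spilled to frame.Registers
}

// linearScan walks intervals in start order, freeing registers of expired
// intervals and spilling values that do not fit.
func linearScan(ivs []Interval, numRegs int) []Interval {
	sort.Slice(ivs, func(i, j int) bool { return ivs[i].Start < ivs[j].Start })
	free := make([]int, 0, numRegs)
	for r := numRegs - 1; r >= 0; r-- {
		free = append(free, r) // popping from the end hands out r0 first
	}
	var active []*Interval
	for i := range ivs {
		iv := &ivs[i]
		// Expire intervals that ended before this one starts.
		keep := active[:0]
		for _, a := range active {
			if a.End < iv.Start {
				free = append(free, a.Reg)
			} else {
				keep = append(keep, a)
			}
		}
		active = keep
		if len(free) > 0 {
			iv.Reg = free[len(free)-1]
			free = free[:len(free)-1]
			active = append(active, iv)
		} else {
			iv.Reg = -1 // spill: value lives in frame.Registers
		}
	}
	return ivs
}

func main() {
	out := linearScan([]Interval{
		{Val: 0, Start: 0, End: 5},
		{Val: 1, Start: 1, End: 3},
		{Val: 2, Start: 2, End: 6},
		{Val: 3, Start: 4, End: 7},
	}, 2)
	for _, iv := range out {
		fmt.Println(iv.Val, iv.Reg) // value 2 spills; value 3 reuses value 1's register
	}
}
```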
### Side-exit table
Each guard in the compiled trace stores:
```go
type SideExit struct {
	NativePC  uint32          // offset within the trace
	InterpPC  uint32          // PC to resume at
	Snapshot  []SnapshotEntry // SSA value -> frame register / immediate
	ExitCount uint32          // for tree formation
}
```
On guard failure the trace jumps to a trampoline that walks Snapshot, writes each captured SSA value back into frame.Registers, then returns control to the interpreter at InterpPC. This is the deopt path. Unlike MEP-32's deopt, it is local to the trace and never has to reconstruct state from an arbitrary point in an optimized method.
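The restore loop is small enough to sketch in full. The value representation here is simplified to `int64` (a real snapshot would restore boxed Cells and immediates), and `SnapshotEntry`, `Frame`, and `deopt` are invented for the example; `SideExit` mirrors the struct above:

```go
package main

import "fmt"

// SnapshotEntry maps one captured SSA value to its frame register.
type SnapshotEntry struct {
	FrameReg int   // destination index in frame.Registers
	Value    int64 // value captured at the guard (boxing elided)
}

type SideExit struct {
	InterpPC uint32
	Snapshot []SnapshotEntry
}

type Frame struct {
	Registers []int64
}

// deopt is the trampoline body: restore every snapshot entry into the
// interpreter frame, then hand back the PC at which to resume.
func deopt(exit *SideExit, frame *Frame) uint32 {
	for _, e := range exit.Snapshot {
		frame.Registers[e.FrameReg] = e.Value
	}
	return exit.InterpPC
}

func main() {
	f := &Frame{Registers: make([]int64, 4)}
	exit := &SideExit{
		InterpPC: 42,
		Snapshot: []SnapshotEntry{{FrameReg: 1, Value: 99}, {FrameReg: 3, Value: 7}},
	}
	pc := deopt(exit, f)
	fmt.Println(pc, f.Registers) // 42 [0 99 0 7]
}
```

Registers not named in the snapshot keep whatever the interpreter last stored, which is why dead store elimination above is allowed to drop stores that no side-exit observes.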
### Blacklisting and trace stability
If the same back-edge produces more than MaxAbortsPerSite (default 3) aborted traces in a row, the back-edge is permanently blacklisted for the process lifetime. This is the TraceMonkey lesson: trace abort cycles, where the recorder keeps trying and failing, are the worst possible interpreter state and worse than not tracing at all. Hard-blacklisting after a small retry count bounds the damage.
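A sketch of the accounting, assuming "in a row" means consecutive aborts reset by any successfully closed trace (the `site` type and its methods are invented; `MaxAbortsPerSite` is from this MEP):

```go
package main

import "fmt"

// MaxAbortsPerSite is this MEP's default retry bound.
const MaxAbortsPerSite = 3

type site struct {
	consecutiveAborts int
	blacklisted       bool
}

func (s *site) recordAbort() {
	s.consecutiveAborts++
	if s.consecutiveAborts >= MaxAbortsPerSite {
		s.blacklisted = true // permanent for the process lifetime
	}
}

// recordSuccess resets the streak: a closed trace proves the site is
// traceable, so earlier aborts stop counting toward the bound.
func (s *site) recordSuccess() { s.consecutiveAborts = 0 }

func (s *site) mayTrace() bool { return !s.blacklisted }

func main() {
	s := &site{}
	s.recordAbort()
	s.recordSuccess() // a closed trace breaks the streak
	s.recordAbort()
	s.recordAbort()
	s.recordAbort()
	fmt.Println(s.mayTrace()) // false: three consecutive aborts
}
```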
### Memory and concurrency
Compiled traces share the MEP-30 executable code cache (per-process append-only mmap'd pages). Side-exit tables live in regular Go heap memory and are kept alive by a back-pointer from each Function to its trace set.
Mochi's per-goroutine VM model means each goroutine has its own profiling counters and may trigger independent tracing; the recorded trace is shared across goroutines once compiled, since the resulting native code is pure (no goroutine-local state beyond R_VM).
### Cost model
Per-iteration steady-state cost on a fill_sum-style loop, Apple M4:
| Path | Cycles per loop iter |
|---|---|
| vm2 interpreter (post-MEP-19) | ~25 |
| vm2 + Option A baseline JIT (MEP-30) | ~10 |
| vm2 + Option B tracing JIT (this MEP) | ~3 |
| Theoretical floor (Go native, `for i := 0; i < n` summing `[]int64`) | ~1 |
The expected end-to-end speedup over the interpreter on numeric loops is 5-15x. On branchy or polymorphic code that aborts traces, the cost falls back to the interpreter plus a small profiling tax.
### Engineering scope
| Component | Lines of Go | Engineer-weeks |
|---|---|---|
| Back-edge profiler hooks | 100 | 1 |
| Trace recorder | 1500 | 6 |
| Trace IR + SSA infra | 1200 | 5 |
| Optimization passes (5) | 1500 | 8 |
| Linear-scan register allocator | 800 | 4 |
| golang-asm backend (trace lowering) | 1200 | 6 |
| Side-exit table + deopt trampoline | 600 | 4 |
| Trace tree formation (v2) | 1000 | 6 |
| Blacklisting + recorder budget | 200 | 1 |
| Conformance tests + trace-shape fuzzers | 1500 | 8 |
| Trace-debug tooling (mandatory, not nice-to-have) | 1500 | 8 |
| Total | ~11000 | ~57 |
Roughly one engineer-year to prototype, plus another year to harden and ship trace trees. Consistent with the published timelines for LuaJIT 2 (one architect, multi-year) and PyPy (small team, multi-year, with the trace-debug tooling cost dominating once correctness regressions show up in the wild).
## JIT integration with the rest of vm2
- MEP-19 quickening: the trace recorder works at the quickened-opcode level, so the IR is already type-narrow before recording starts.
- MEP-25 shapes: shape transitions on a hot loop's container trigger side-exit, which is the correct behavior; the next entry retries the trace under the new shape.
- MEP-27 inline caches: the recorder reads ICs as type predictions but always emits its own guard, so trace correctness does not depend on IC freshness.
- MEP-30 baseline JIT: complementary. Baseline JIT compiles whole functions cheaply; tracing JIT compiles hot inner loops expensively. A coexistence design is possible (baseline JIT serves as the fallback path on trace side-exit), but v1 ships tracing-only and reuses the interpreter as the side-exit target.
## Risks
- Trace abort pathology. The single biggest source of failure for tracing JITs in the field. Mitigation: hard blacklisting after 3 aborts; comprehensive trace-debug logging (`MOCHI_JIT_TRACE_DEBUG=1`); measured corpus tests in CI that fail if any expected-traceable loop aborts.
- Code with mixed types in the loop body. Each type change is a guard; many guards make the trace fat. Mitigation: cap trace length at 1024 IR nodes; bail out and let the interpreter handle the loop if the type histogram is too wide.
- Allocation in the loop. Allocations that escape cannot be removed and become a serialization point in the trace. Mitigation: rely on Mochi's existing escape analysis at compile time; emit a hint to the trace recorder so it can pre-emptively abort traces with known-escaping allocations.
- Trace-debug tooling cost. PyPy's experience: trace-debug tooling ends up the same order of magnitude as the optimizer itself. We have explicitly budgeted ~8 engineer-weeks for it above. Underestimating this is the single largest risk to the project schedule.
- Interaction with Go's GC. Trace code holds raw `*Cell` pointers into `frame.Registers`. The frame must be GC-pinned for the duration of trace execution. Mitigation: the trace prologue takes a `runtime.KeepAlive` analog; the trace epilogue releases it. The detailed protocol is specified in the `runtime/vm2/jit/trace_gc.go` design doc (deferred).
## Alternatives considered
- Pure method JIT (MEP-32 only): gives up the tracing JIT's structural advantage of seeing only the path taken. For numeric loops this is a 2-3x performance gap, per the LuaJIT-vs-V8 comparison on numeric benchmarks.
- Meta-tracing in RPython style: requires Mochi to be implemented in RPython, which it is not. The Go ecosystem has no equivalent meta-tracing toolkit. Rejected as out-of-scope.
- Tracing without trace trees: the v1 design, deliberately. Trade some peak performance for a simpler initial implementation.
## Comparison matrix
| Dimension | Option A (MEP-30, template) | Option B (this MEP) | Option C (MEP-32, tiered) |
|---|---|---|---|
| Predicted speedup on fill_sum | 2-4x | 5-15x | 4-8x |
| Engineering scope (KLOC) | ~2.2 | ~11 | ~25 |
| Engineer-months to prototype | ~3 | ~12 | ~18 |
| Deopt complexity | None | Medium (local) | High (cross-method) |
| OSR complexity | None (free) | N/A (back-edge native) | High |
| Single static binary preserved | Yes | Yes | Yes (tier 1) |
| Reuses MEP-27 IC infrastructure | Read-only | Yes | Yes |
| Performance ceiling | Modest | High | High |
| Failure mode on bad workload | None (= interpreter) | Abort + blacklist | Tier-2 deopt cascades |
## Predicted MEP-23 numbers
On the MEP-23 cross-language lists/fill_sum bench, N=1024, current vm2 ~3805 ns/op; predicted with Option B: ~600-900 ns/op, reaching parity with Go native and roughly 1.5x faster than LuaJIT on the same workload (LuaJIT pays a small Lua-table overhead vm2 does not). On concat_loop the win is modest (~2x) because the loop body is dominated by allocation, which tracing can only partially eliminate. On fib (recursive, no loop) the trace recorder cannot help; predict a small regression from the recorder's per-call overhead, mitigated by hot-recursion-as-loop detection (a v2 feature, deferred).
## Open questions
- Function inlining inside a trace. Tracing JITs naturally inline because they record the called function's body inline. This is mostly good but can blow the trace length budget. Bound by an `InlineDepth` parameter? A per-function annotation?
- Tracing across goroutine boundaries. A `go func()` inside a trace must abort. But blocking on a channel might or might not be worth tracing, depending on contention. Defer.
- Whether to share the executable code cache with MEP-30. The two fragment the cache differently. Recommendation: share it, with an eviction policy that prefers retaining trace code.
- Pause-for-recording cost. Recording roughly halves interpreter throughput while active. If a back-edge trips at the start of a long loop, recording cost is amortized; if at the end, it is pure waste. Mitigate via `MinIterationsBeforeRecord`?
## Related work
- The LuaJIT Wiki - Bytecode and IR
- Mike Pall, LuaJIT and tracing (LWN, 2010)
- Bolz et al., Tracing the Meta-Level (ICOOOLPS 2009)
- Bolz & Tratt, The Impact of Meta-Tracing on VM Design and Implementation (2013)
- Bolz-Tereick, Musings on Tracing in PyPy (2025)
- Gal et al., Trace-based Just-in-Time Type Specialization for Dynamic Languages (PLDI 2009) (TraceMonkey)
- Why TraceMonkey was removed (Bug 858032)
- MEP-25, MEP-27: data model and IC infrastructure consumed.
- MEP-30, MEP-32: the two competing JIT strategies.