# MEP 31: VM2 JIT Option B - Tracing JIT over Typed Loops
| Field | Value |
|---|---|
| MEP | 31 |
| Title | VM2 JIT Option B - Tracing JIT over Typed Loops |
| Author | Mochi core |
| Status | Draft |
| Type | Standards Track |
| Created | 2026-05-17 |
## Abstract
This MEP specifies the tracing JIT for vm2: a per-loop compilation strategy that records the actual instructions executed on a hot back-edge, optimizes the recorded trace as straight-line code with explicit guards, and compiles it to native code. The compiled trace runs subsequent iterations directly; any deviation (a failed guard) exits to the interpreter. The lineage runs through LuaJIT 2 (Pall, 2010-) for the hand-coded single-level form, PyPy / RPython for the meta-tracing form, and SpiderMonkey TraceMonkey (2008) as the cautionary tale.
This is Option B of three. MEP-30 specifies the template / copy-and-patch baseline JIT; MEP-32 specifies the tiered method JIT. All three share the same vm2 bytecode, Cell ABI, and MEP-25 shapes; only the compilation strategy differs.
## Motivation
Tracing JITs are uniquely good at one specific shape of program: a hot loop whose dynamic behavior is far simpler than its static text. The interpreter walks branchy generic code; the trace records only what actually happened on the path through. The compiler then optimizes a linear sequence with no merge points, where escape analysis, constant folding, and guard hoisting become local rewrites instead of dataflow analyses. The published evidence:
- LuaJIT routinely matches or beats C on numeric loops and stays within 2-4x on most idiomatic Lua, on a codebase of ~30 KLOC of C plus DynASM.
- PyPy is typically 4-7x faster than CPython on numeric workloads and competitive on broader Python despite the dynamism overhead.
- Bolz & Tratt (2013) found that adding meta-tracing to an interpreter typically required ~2 KLOC of optimization code on top of the recorder, because trace optimizations are local.
The cost is well documented too:
- TraceMonkey was removed from SpiderMonkey because it lost on branchy code, and the warmup pathology (cold traces, megamorphic dispatch, trace abort cycles) was hard to diagnose.
- PyPy maintains a small expert team and a substantial set of trace-debugging tools because the failure modes are non-local.
For Mochi specifically, the workload mix matters. MEP-23's crosslang corpus is dominated by fill_sum, concat_loop, fib, and similar loop-heavy benchmarks; these are exactly the shape tracing JITs were built for. A tracing JIT would deliver a 5-15x win on this subset. The risk is what happens off this subset: code with deeply nested polymorphic dispatch, exceptions, or generators.
This MEP exists to specify the tracing alternative precisely enough that it can be measured against the baseline (MEP-30) and the tiered (MEP-32) options. A follow-on Informational MEP (analogous to MEP-29) will pick the winner from measured numbers.
## Specification
### Overview
The JIT consists of five components:
- Hotness profiler: per-back-edge counters in the interpreter.
- Recorder: records a typed-Cell trace from the moment a back-edge counter trips until control returns to the same back-edge, an inner trace exits, or a record budget is exceeded.
- Trace IR + optimizer: a small SSA IR with explicit `Guard` operations; passes are constant folding, guard hoisting, allocation removal, and dead store elimination.
- Backend: emits native code per trace via `golang-asm` (same backend as MEP-30, different IR upstream).
- Side-exit table: per-guard map from native PC to (interpreter PC, register snapshot) so the deopt path can resume.
### Hotness profiling
A `BackEdgeCount` table is added to the existing function metadata, indexed by source PC (the back-edge target). The interpreter increments it on each backward branch; when a counter crosses `TraceThreshold` (default 50), recording begins on the next entry.
TraceThreshold is deliberately low compared to MEP-30's 1000 because tracing has higher warmup cost per attempt; we want to amortize recording over many iterations but not over a million.
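Sketched in Go, the counter path the interpreter would take on each backward branch might look like this (all type and function names here are illustrative, not vm2 API; only `TraceThreshold` and the per-target-PC indexing come from this MEP):

```go
package main

import "fmt"

// Hypothetical sketch of the back-edge hotness check. TraceThreshold and
// the per-back-edge counter are named in this MEP; funcProfile and
// onBackwardBranch are invented for illustration.
const TraceThreshold = 50

type funcProfile struct {
	backEdgeCount map[int]int // indexed by back-edge target PC
}

// onBackwardBranch is called on every backward branch to targetPC.
// It returns true exactly once, when the counter crosses the threshold,
// signalling that recording should begin on the next loop entry.
func (p *funcProfile) onBackwardBranch(targetPC int) bool {
	p.backEdgeCount[targetPC]++
	return p.backEdgeCount[targetPC] == TraceThreshold
}

func main() {
	p := &funcProfile{backEdgeCount: map[int]int{}}
	hotAt := 0
	for i := 1; i <= 60; i++ {
		if p.onBackwardBranch(7) {
			hotAt = i
		}
	}
	fmt.Println(hotAt) // the loop becomes hot on its 50th back-edge
}
```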
### Trace recording
The recorder is a separate execution mode of the interpreter, not a separate VM. On each opcode dispatch, in addition to the normal handler, the recorder appends an IR node:
```go
type TraceOp struct {
	Op     vm.OpCode
	Args   [3]TraceRef // SSA references to prior IR nodes or guard outputs
	Imm    int64       // immediate from bytecode
	Guards []Guard     // type guards attached to this op
	PC     int         // originating interpreter PC for deopt
}
```
Type guards are inserted at every shape-dependent operation:
`OpListGet` records as `OpListGetTraced` with a `Guard{kind: ShapeIs, expect: ListI64}`. If the list is later not `ListI64`, the guard fails. `OpAdd` records as `OpAddI64` or `OpAddF64` per the observed types, each with a numeric tag guard.
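To make the `OpAdd` case concrete, a minimal sketch of how a recorder might specialize by observed operand tags. Only the opcode names come from this MEP; the tag type and the `recordAdd` helper are invented for this example, and mixed or non-numeric operands are treated as a hard abort:

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical operand tag observed by the interpreter at record time.
type Tag int

const (
	TagI64 Tag = iota
	TagF64
	TagOther
)

type Guard struct {
	Kind   string // e.g. "TagIs"
	Expect Tag
}

type TraceOp struct {
	Name   string
	Guards []Guard
}

// recordAdd emits the specialized form of a generic OpAdd based on the
// operand tags actually observed, attaching one tag guard per operand.
// Anything else aborts recording (a hard abort in this MEP's terms).
func recordAdd(lhs, rhs Tag) (TraceOp, error) {
	switch {
	case lhs == TagI64 && rhs == TagI64:
		return TraceOp{"OpAddI64", []Guard{{"TagIs", TagI64}, {"TagIs", TagI64}}}, nil
	case lhs == TagF64 && rhs == TagF64:
		return TraceOp{"OpAddF64", []Guard{{"TagIs", TagF64}, {"TagIs", TagF64}}}, nil
	default:
		return TraceOp{}, errors.New("untraceable operand mix: abort")
	}
}

func main() {
	op, _ := recordAdd(TagI64, TagI64)
	fmt.Println(op.Name, len(op.Guards)) // OpAddI64 2
	_, err := recordAdd(TagI64, TagOther)
	fmt.Println(err != nil) // true: mixed types abort the trace
}
```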
Recording terminates on any of:
- Loop close: control reaches the same back-edge that started the trace. The trace is closed and queued for optimization.
- Inner side-exit: control reaches a guard that was recorded as a trace exit (trace tree split).
- Record budget exceeded: trace length > 1024 IR nodes, indicating non-loop code or runaway recursion. The trace is discarded and the back-edge is marked "blacklisted" for `BlacklistDuration` (default 100 entries) before retry.
- Hard abort: any non-traceable opcode (an allocation that must call into Go-managed code with a stack-growth check, a GC barrier, an exception throw). The trace is discarded; the back-edge counter is decremented but the site is not blacklisted.
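The four termination conditions above can be sketched as a single check run after each recorded op. The result type and function are hypothetical; only the 1024-node budget and the discard/blacklist distinction come from this MEP:

```go
package main

import "fmt"

// MaxTraceLen is this MEP's record budget; the rest is illustrative.
const MaxTraceLen = 1024

type traceResult int

const (
	keepRecording    traceResult = iota
	closeTrace                   // loop closed: queue for optimization
	discardBlacklist             // budget exceeded: blacklist for BlacklistDuration
	discardOnly                  // hard abort: decrement counter, no blacklist
)

// checkTermination decides the recorder's fate after each recorded op.
func checkTermination(pc, startPC, traceLen int, opTraceable bool) traceResult {
	switch {
	case !opTraceable:
		return discardOnly
	case pc == startPC:
		return closeTrace
	case traceLen > MaxTraceLen:
		return discardBlacklist
	default:
		return keepRecording
	}
}

func main() {
	fmt.Println(checkTermination(7, 7, 12, true) == closeTrace)         // true
	fmt.Println(checkTermination(9, 7, 2000, true) == discardBlacklist) // true
	fmt.Println(checkTermination(9, 7, 12, false) == discardOnly)       // true
}
```

Note that the hard-abort check comes first: a non-traceable opcode discards the trace even if the budget is also exceeded, matching the "counter decremented but not blacklisted" rule above.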
### Trace IR and optimizations
The IR is SSA with one block (the trace itself). Operations are typed: integer add, float add, Cell box, Cell unbox, list-data-load (specialized to ListI64), and so on.
Optimization passes, in order:
- Constant folding: standard SSA constant propagation. Removes the bulk of address arithmetic produced by the recorder.
- Guard hoisting: move loop-invariant guards out of the loop body, into the trace's prologue. This is the largest single win and the reason tracing JITs are fast: a typed inner loop pays guard cost once on entry, not per iteration.
- Allocation removal: any `vmList`, `vmString`, or `vmStruct` allocated inside the trace and not escaping (no store to a heap location, no return) is scalar-replaced; its fields live in SSA values. PyPy reports this as the dominant optimization for Python performance, and the analog should hold for Mochi.
- Dead store elimination: stores to `frame.Registers[i]` whose value is overwritten before the next side-exit or trace end can be dropped, since side-exits restore from the snapshot table, not from the live frame.
- Common subexpression elimination: standard.
The optimizer is intentionally local and intentionally small: trace IR is straight-line by construction, so each pass is a single forward sweep.
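As a concrete illustration of the guard-hoisting pass, a single forward sweep over a straight-line trace might look like this. The IR node shape is invented, and the invariance test is deliberately simplified to "guards a loop input" (a real pass would compute invariance transitively over SSA operands):

```go
package main

import "fmt"

// Hypothetical trace IR node. A negative Arg marks a loop input, i.e. a
// value defined before the trace body, which makes a guard on it
// loop-invariant.
type IRNode struct {
	Op      string
	IsGuard bool
	Arg     int // SSA index of the guarded value; negative = loop input
}

// hoistGuards splits a trace body in one forward sweep: invariant guards
// move to the prologue (paid once per trace entry), everything else stays
// in the per-iteration loop body.
func hoistGuards(body []IRNode) (prologue, loop []IRNode) {
	for _, n := range body {
		if n.IsGuard && n.Arg < 0 {
			prologue = append(prologue, n) // checked once on entry
		} else {
			loop = append(loop, n) // variant guards and real work stay
		}
	}
	return prologue, loop
}

func main() {
	body := []IRNode{
		{Op: "GuardShapeIs(ListI64)", IsGuard: true, Arg: -1}, // guards the loop's list input
		{Op: "ListGetI64", Arg: 0},
		{Op: "GuardTagI64", IsGuard: true, Arg: 1}, // guards a per-iteration value
		{Op: "AddI64", Arg: 1},
	}
	pro, loop := hoistGuards(body)
	fmt.Println(len(pro), len(loop)) // 1 3: the shape guard left the loop
}
```

The payoff matches the claim above: the `ListI64` shape guard is checked once per trace entry instead of once per iteration.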
### Trace shape: linear, then trees, then meta-traces
Three trace shapes are specified in order of complexity. v1 ships only Linear.
- Linear (v1): each trace is a single loop closed by a back-edge. A side-exit drops to interpreter.
- Tree (v2): a side-exit that itself becomes hot is recorded as a child trace and patched into the parent's guard, forming a trace tree. Sharply reduces interpreter time in code with biased branches.
- Meta-traces (deferred): only relevant if Mochi grows a second front-end interpreter. Not in scope.
### Backend
Code generation reuses the MEP-30 golang-asm backend, but the input is a trace-SSA function rather than per-opcode templates. The backend performs a single linear-scan register allocation over the trace's SSA values, mapping live values to architecture registers and spilling to frame.Registers only at side-exits.
This is the key performance differentiator over MEP-30: a trace can keep an int64 accumulator in x0 across many iterations, never touching memory, because there are no opcode boundaries that need a memory-consistent state.
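A minimal sketch of the linear-scan pass described above, simplified in one respect: when no register is free it spills the incoming value, whereas a production allocator would spill the active interval with the furthest end point. All names are illustrative:

```go
package main

import (
	"fmt"
	"sort"
)

// Interval is a trace SSA value's live range in trace positions.
type Interval struct {
	Val        int // SSA value id
	Start, End int // live range
	Reg        int // assigned register; -1 = spilled to frame.Registers
}

// linearScan walks intervals in start order, freeing registers of expired
// intervals and spilling values that do not fit.
func linearScan(ivs []Interval, numRegs int) []Interval {
	sort.Slice(ivs, func(i, j int) bool { return ivs[i].Start < ivs[j].Start })
	free := make([]int, 0, numRegs)
	for r := numRegs - 1; r >= 0; r-- {
		free = append(free, r) // popping from the end hands out r0 first
	}
	var active []*Interval
	for i := range ivs {
		iv := &ivs[i]
		// Expire intervals that ended before this one starts.
		keep := active[:0]
		for _, a := range active {
			if a.End < iv.Start {
				free = append(free, a.Reg)
			} else {
				keep = append(keep, a)
			}
		}
		active = keep
		if len(free) > 0 {
			iv.Reg = free[len(free)-1]
			free = free[:len(free)-1]
			active = append(active, iv)
		} else {
			iv.Reg = -1 // spill: value lives in frame.Registers
		}
	}
	return ivs
}

func main() {
	out := linearScan([]Interval{
		{Val: 0, Start: 0, End: 5},
		{Val: 1, Start: 1, End: 3},
		{Val: 2, Start: 2, End: 6},
		{Val: 3, Start: 4, End: 7},
	}, 2)
	for _, iv := range out {
		fmt.Println(iv.Val, iv.Reg) // value 2 spills; value 3 reuses value 1's register
	}
}
```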
### Side-exit table
Each guard in the compiled trace stores:
```go
type SideExit struct {
	NativePC  uint32          // offset within the trace
	InterpPC  uint32          // PC to resume at
	Snapshot  []SnapshotEntry // SSA value -> frame register / immediate
	ExitCount uint32          // for tree formation
}
```
On guard failure the trace jumps to a trampoline that walks Snapshot, writes each captured SSA value back into frame.Registers, then returns control to the interpreter at InterpPC. This is the deopt path. Unlike MEP-32's deopt, it is local to the trace and never has to reconstruct state from an arbitrary point in an optimized method.
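The restore loop is small enough to sketch in full. The value representation here is simplified to `int64` (a real snapshot would restore boxed Cells and immediates), and `SnapshotEntry`, `Frame`, and `deopt` are invented for the example; `SideExit` mirrors the struct above:

```go
package main

import "fmt"

// SnapshotEntry maps one captured SSA value to its frame register.
type SnapshotEntry struct {
	FrameReg int   // destination index in frame.Registers
	Value    int64 // value captured at the guard (boxing elided)
}

type SideExit struct {
	InterpPC uint32
	Snapshot []SnapshotEntry
}

type Frame struct {
	Registers []int64
}

// deopt is the trampoline body: restore every snapshot entry into the
// interpreter frame, then hand back the PC at which to resume.
func deopt(exit *SideExit, frame *Frame) uint32 {
	for _, e := range exit.Snapshot {
		frame.Registers[e.FrameReg] = e.Value
	}
	return exit.InterpPC
}

func main() {
	f := &Frame{Registers: make([]int64, 4)}
	exit := &SideExit{
		InterpPC: 42,
		Snapshot: []SnapshotEntry{{FrameReg: 1, Value: 99}, {FrameReg: 3, Value: 7}},
	}
	pc := deopt(exit, f)
	fmt.Println(pc, f.Registers) // 42 [0 99 0 7]
}
```

Registers not named in the snapshot keep whatever the interpreter last stored, which is why dead store elimination above is allowed to drop stores that no side-exit observes.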
### Blacklisting and trace stability
If the same back-edge produces more than MaxAbortsPerSite (default 3) aborted traces in a row, the back-edge is permanently blacklisted for the process lifetime. This is the TraceMonkey lesson: trace abort cycles, where the recorder keeps trying and failing, are the worst possible interpreter state and worse than not tracing at all. Hard-blacklisting after a small retry count bounds the damage.
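A sketch of the accounting, assuming "in a row" means consecutive aborts reset by any successfully closed trace (the `site` type and its methods are invented; `MaxAbortsPerSite` is from this MEP):

```go
package main

import "fmt"

// MaxAbortsPerSite is this MEP's default retry bound.
const MaxAbortsPerSite = 3

type site struct {
	consecutiveAborts int
	blacklisted       bool
}

func (s *site) recordAbort() {
	s.consecutiveAborts++
	if s.consecutiveAborts >= MaxAbortsPerSite {
		s.blacklisted = true // permanent for the process lifetime
	}
}

// recordSuccess resets the streak: a closed trace proves the site is
// traceable, so earlier aborts stop counting toward the bound.
func (s *site) recordSuccess() { s.consecutiveAborts = 0 }

func (s *site) mayTrace() bool { return !s.blacklisted }

func main() {
	s := &site{}
	s.recordAbort()
	s.recordSuccess() // a closed trace breaks the streak
	s.recordAbort()
	s.recordAbort()
	s.recordAbort()
	fmt.Println(s.mayTrace()) // false: three consecutive aborts
}
```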
### Memory and concurrency
Compiled traces share the MEP-30 executable code cache (per-process append-only mmap'd pages). Side-exit tables live in regular Go heap memory and are kept alive by a back-pointer from each Function to its trace set.
Mochi's per-goroutine VM model means each goroutine has its own profiling counters and may trigger independent tracing; the recorded trace is shared across goroutines once compiled, since the resulting native code is pure (no goroutine-local state beyond R_VM).
### Cost model
Per-iteration steady-state cost on a fill_sum-style loop, Apple M4:
| Path | Cycles per loop iter |
|---|---|
| vm2 interpreter (post-MEP-19) | ~25 |
| vm2 + Option A baseline JIT (MEP-30) | ~10 |
| vm2 + Option B tracing JIT (this MEP) | ~3 |
| Theoretical floor (Go native, `for i := 0; i < n` summing `[]int64`) | ~1 |
The expected end-to-end speedup over the interpreter on numeric loops is 5-15x. On branchy or polymorphic code that aborts traces, the cost falls back to the interpreter plus a small profiling tax.
### Engineering scope
| Component | Lines of Go | Engineer-weeks |
|---|---|---|
| Back-edge profiler hooks | 100 | 1 |
| Trace recorder | 1500 | 6 |
| Trace IR + SSA infra | 1200 | 5 |
| Optimization passes (5) | 1500 | 8 |
| Linear-scan register allocator | 800 | 4 |
| golang-asm backend (trace lowering) | 1200 | 6 |
| Side-exit table + deopt trampoline | 600 | 4 |
| Trace tree formation (v2) | 1000 | 6 |
| Blacklisting + recorder budget | 200 | 1 |
| Conformance tests + trace-shape fuzzers | 1500 | 8 |
| Trace-debug tooling (mandatory, not nice-to-have) | 1500 | 8 |
| Total | ~11000 | ~57 |
Roughly one engineer-year to prototype, plus another year to harden and ship trace trees. Consistent with the published timelines for LuaJIT 2 (one architect, multi-year) and PyPy (small team, multi-year, with the trace-debug tooling cost dominating once correctness regressions show up in the wild).
## JIT integration with the rest of vm2
- MEP-19 quickening: the trace recorder works at the quickened-opcode level, so the IR is already type-narrow before recording starts.
- MEP-25 shapes: shape transitions on a hot loop's container trigger side-exit, which is the correct behavior; the next entry retries the trace under the new shape.
- MEP-27 inline caches: the recorder reads ICs as type predictions but always emits its own guard, so trace correctness does not depend on IC freshness.
- MEP-30 baseline JIT: complementary. Baseline JIT compiles whole functions cheaply; tracing JIT compiles hot inner loops expensively. A coexistence design is possible (baseline JIT serves as the fallback path on trace side-exit), but v1 ships tracing-only and reuses the interpreter as the side-exit target.
## Risks
- Trace abort pathology. The single biggest source of failure for tracing JITs in the field. Mitigation: hard blacklisting after 3 aborts; comprehensive trace-debug logging (`MOCHI_JIT_TRACE_DEBUG=1`); measured corpus tests in CI that fail if any expected-traceable loop aborts.
- Code with mixed types in the loop body. Each type change is a guard; many guards make the trace fat. Mitigation: cap trace length at 1024 IR nodes; bail out and let the interpreter handle the loop if the type histogram is too wide.
- Allocation in the loop. Allocations that escape cannot be removed and become a serialization point in the trace. Mitigation: rely on Mochi's existing escape analysis at compile time; emit a hint to the trace recorder so it can pre-emptively abort traces with known-escaping allocations.
- Trace-debug tooling cost. PyPy's experience: trace-debug tooling ends up the same order of magnitude as the optimizer itself. We have explicitly budgeted ~8 engineer-weeks for it above. Underestimating this is the single largest risk to the project schedule.
- Interaction with Go's GC. Trace code holds raw `*Cell` pointers into `frame.Registers`. The frame must be GC-pinned for the duration of trace execution. Mitigation: the trace prologue takes a `runtime.KeepAlive` analog; the trace epilogue releases it. The detailed protocol is specified in the `runtime/vm2/jit/trace_gc.go` design doc (deferred).
## Alternatives considered
- Pure method JIT (MEP-32 only): gives up the tracing JIT's structural advantage of seeing only the path taken. For numeric loops this is a 2-3x performance gap, per the LuaJIT-vs-V8 comparison on numeric benchmarks.
- Meta-tracing in RPython style: requires Mochi to be implemented in RPython, which it is not. The Go ecosystem has no equivalent meta-tracing toolkit. Rejected as out-of-scope.
- Tracing without trace trees: the v1 design, deliberately. Trade some peak performance for a simpler initial implementation.
## Comparison matrix
| Dimension | Option A (MEP-30, template) | Option B (this MEP) | Option C (MEP-32, tiered) |
|---|---|---|---|
| Predicted speedup on fill_sum | 2-4x | 5-15x | 4-8x |
| Engineering scope (KLOC) | ~2.2 | ~11 | ~25 |
| Engineer-months to prototype | ~3 | ~12 | ~18 |
| Deopt complexity | None | Medium (local) | High (cross-method) |
| OSR complexity | None (free) | N/A (back-edge native) | High |
| Single static binary preserved | Yes | Yes | Yes (tier 1) |
| Reuses MEP-27 IC infrastructure | Read-only | Yes | Yes |
| Performance ceiling | Modest | High | High |
| Failure mode on bad workload | None (= interpreter) | Abort + blacklist | Tier-2 deopt cascades |
## Predicted MEP-23 numbers
On the MEP-23 cross-language lists/fill_sum bench, N=1024, current vm2 ~3805 ns/op; predicted with Option B: ~600-900 ns/op, reaching parity with Go native and roughly 1.5x faster than LuaJIT on the same workload (LuaJIT pays a small Lua-table overhead vm2 does not). On concat_loop the win is modest (~2x) because the loop body is dominated by allocation, which tracing can only partially eliminate. On fib (recursive, no loop) the trace recorder cannot help; predict a small regression from the recorder's per-call overhead, mitigated by hot-recursion-as-loop detection (a v2 feature, deferred).
## Open questions
- Function inlining inside a trace. Tracing JITs naturally inline because they record the called function's body inline. This is mostly good but can blow the trace length budget. Bound by an `InlineDepth` parameter? A per-function annotation?
- Tracing across goroutine boundaries. A `go func()` inside a trace must abort. But blocking on a channel might or might not be worth tracing, depending on contention. Defer.
- Whether to share the executable code cache with MEP-30. The two fragment the cache differently. Recommendation: share it, with an eviction policy that prefers retaining trace code.
- Pause-for-recording cost. Recording roughly halves interpreter throughput while active. If a back-edge trips at the start of a long loop, recording cost is amortized; if at the end, it is pure waste. Mitigate via `MinIterationsBeforeRecord`?
## Related work
- The LuaJIT Wiki - Bytecode and IR
- Mike Pall, LuaJIT and tracing (LWN, 2010)
- Bolz et al., Tracing the Meta-Level (ICOOOLPS 2009)
- Bolz & Tratt, The Impact of Meta-Tracing on VM Design and Implementation (2013)
- Bolz-Tereick, Musings on Tracing in PyPy (2025)
- Gal et al., Trace-based Just-in-Time Type Specialization for Dynamic Languages (PLDI 2009) (TraceMonkey)
- Why TraceMonkey was removed (Bug 858032)
- MEP-25, MEP-27: data model and IC infrastructure consumed.
- MEP-30, MEP-32: the two competing JIT strategies.