MEP 31. VM2 JIT Option B - Tracing JIT over Typed Loops

MEP:     31
Title:   VM2 JIT Option B - Tracing JIT over Typed Loops
Author:  Mochi core
Status:  Draft
Type:    Standards Track
Created: 2026-05-17

Abstract

This MEP specifies the tracing JIT for vm2: a per-loop compilation strategy that records the actual instructions executed on a hot back-edge, optimizes the recorded trace as straight-line code with explicit guards, and compiles it to native code. The compiled trace runs subsequent iterations directly; any deviation (a failed guard) exits to the interpreter. The lineage runs through LuaJIT 2 (Pall, 2010-) for the hand-coded single-level form, PyPy / RPython for the meta-tracing form, and SpiderMonkey's TraceMonkey (2008) as the cautionary tale.

This is Option B of three. MEP-30 specifies the template / copy-and-patch baseline JIT; MEP-32 specifies the tiered method JIT. All three share the same vm2 bytecode, Cell ABI, and MEP-25 shapes; only the compilation strategy differs.

Motivation

Tracing JITs are uniquely good at one specific shape of program: a hot loop whose dynamic behavior is far simpler than its static text. The interpreter walks branchy generic code; the trace records only what actually happened on the path through. The compiler then optimizes a linear sequence with no merge points, where escape analysis, constant folding, and guard hoisting become local rewrites instead of dataflow analyses. The published evidence:

  • LuaJIT routinely matches or beats C on numeric loops and stays within 2-4x on most idiomatic Lua, on a codebase of ~30 KLOC of C plus DynASM.
  • PyPy is typically 4-7x faster than CPython on numeric workloads and competitive on broader Python despite the dynamism overhead.
  • Bolz & Tratt (2013) found that adding meta-tracing to an interpreter typically required ~2 KLOC of optimization code on top of the recorder, because trace optimizations are local.

The cost is well documented too:

  • TraceMonkey was removed from SpiderMonkey because it lost on branchy code, and the warmup pathology (cold traces, megamorphic dispatch, trace abort cycles) was hard to diagnose.
  • PyPy maintains a small expert team and a substantial set of trace-debugging tools because the failure modes are non-local.

For Mochi specifically, the workload mix matters. MEP-23's crosslang corpus is dominated by fill_sum, concat_loop, fib, and similar loop-heavy benchmarks; these are exactly the shape tracing JITs were built for. A tracing JIT would deliver a 5-15x win on this subset. The risk is what happens off this subset: code with deeply nested polymorphic dispatch, exceptions, or generators.

This MEP exists to specify the tracing alternative precisely enough that it can be measured against the baseline (MEP-30) and the tiered (MEP-32) options. A follow-on Informational MEP (analogous to MEP-29) will pick the winner from measured numbers.

Specification

Overview

The JIT consists of five components:

  1. Hotness profiler: per-back-edge counters in the interpreter.
  2. Recorder: records a typed-Cell trace from the moment a back-edge counter trips until control returns to the same back-edge, an inner trace exits, or a record budget is exceeded.
  3. Trace IR + optimizer: a small SSA IR with explicit Guard operations; passes are constant folding, guard hoisting, allocation removal, and dead store elimination.
  4. Backend: emits native code per trace via golang-asm (same backend as MEP-30, different IR upstream).
  5. Side-exit table: per-guard map from native PC to (interpreter PC, register snapshot) so the deopt path can resume.

Hotness profiling

A BackEdgeCount is added to the existing function metadata, indexed by source PC (back-edge target). The interpreter increments on each backward branch; when a counter crosses TraceThreshold (default 50), recording begins on the next entry.

TraceThreshold is deliberately low compared to MEP-30's 1000 because tracing has higher warmup cost per attempt; we want to amortize recording over many iterations but not over a million.
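The counter bookkeeping is deliberately trivial. A minimal sketch, assuming a map keyed by back-edge target PC (the real implementation indexes into function metadata; `BackEdgeProfiler`, `NewBackEdgeProfiler`, and `OnBackEdge` are illustrative names, not the MEP's actual API):

```go
package main

import "fmt"

// TraceThreshold is the per-back-edge hotness trigger (default 50 per this MEP).
const TraceThreshold = 50

// BackEdgeProfiler holds per-back-edge trip counts, keyed by target PC.
type BackEdgeProfiler struct {
	counts map[int]int
}

func NewBackEdgeProfiler() *BackEdgeProfiler {
	return &BackEdgeProfiler{counts: make(map[int]int)}
}

// OnBackEdge is called by the interpreter on every backward branch.
// It returns true exactly once, when the edge crosses TraceThreshold,
// signalling that recording should begin on the next entry.
func (p *BackEdgeProfiler) OnBackEdge(targetPC int) bool {
	p.counts[targetPC]++
	return p.counts[targetPC] == TraceThreshold
}

func main() {
	p := NewBackEdgeProfiler()
	for i := 1; i <= 60; i++ {
		if p.OnBackEdge(7) {
			fmt.Println("back-edge at PC 7 became hot on trip", i)
		}
	}
}
```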

Trace recording

The recorder is a separate execution mode of the interpreter, not a separate VM. On each opcode dispatch, in addition to the normal handler, the recorder appends an IR node:

type TraceOp struct {
    Op     vm.OpCode
    Args   [3]TraceRef // SSA references to prior IR nodes or guard outputs
    Imm    int64       // immediate from bytecode
    Guards []Guard     // type guards attached to this op
    PC     int         // originating interpreter PC for deopt
}

Type guards are inserted at every shape-dependent operation:

  • OpListGet records as OpListGetTraced with a Guard{kind: ShapeIs, expect: ListI64}. If the list is later not ListI64, the guard fails.
  • OpAdd records as OpAddI64 or OpAddF64 per the observed types, each with a numeric tag guard.

Recording terminates on any of:

  • Loop close: control reaches the same back-edge that started the trace. The trace is closed and queued for optimization.
  • Inner side-exit: control reaches a guard that was recorded as a trace exit (trace tree split).
  • Record budget exceeded: trace length > 1024 IR nodes, indicating non-loop code or runaway recursion. Trace is discarded, back-edge marked "blacklisted" for BlacklistDuration (default 100 entries) before retry.
  • Hard abort: any non-traceable opcode (allocation that must call into Go-managed code with a stack-growth check, GC barrier, exception throw). Trace discarded; back-edge counter decremented but not blacklisted.
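The termination conditions above reduce to one check per dispatched opcode. A sketch of that check, with invented names (`TraceRecorder`, `Step`, the result enum) and PCs standing in for full TraceOp nodes:

```go
package main

import "fmt"

const recordBudget = 1024 // IR-node cap from this MEP

type recordResult int

const (
	recKeepGoing recordResult = iota
	recLoopClosed
	recBudgetExceeded
	recHardAbort
)

// TraceRecorder accumulates IR nodes while the interpreter runs in record mode.
type TraceRecorder struct {
	startPC   int   // back-edge target that triggered recording
	trace     []int // stands in for []TraceOp; this sketch records PCs only
	traceable func(op int) bool
}

// Step runs once per dispatched opcode: append the node, then decide whether
// recording continues, closes the loop, or aborts.
func (r *TraceRecorder) Step(pc, op int) recordResult {
	if !r.traceable(op) {
		return recHardAbort // e.g. GC barrier, exception throw
	}
	r.trace = append(r.trace, pc)
	if len(r.trace) > recordBudget {
		return recBudgetExceeded // non-loop code or runaway recursion
	}
	if pc == r.startPC && len(r.trace) > 1 {
		return recLoopClosed // control returned to the starting back-edge
	}
	return recKeepGoing
}

func main() {
	r := &TraceRecorder{startPC: 10, traceable: func(op int) bool { return true }}
	pcs := []int{10, 11, 12, 10} // a three-op loop body closing on PC 10
	var res recordResult
	for _, pc := range pcs {
		if res = r.Step(pc, 0); res != recKeepGoing {
			break
		}
	}
	fmt.Println(res == recLoopClosed)
}
```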

Trace IR and optimizations

The IR is SSA with one block (the trace itself). Operations are typed: integer add, float add, Cell box, Cell unbox, list-data-load (specialized to ListI64), and so on.

Optimization passes, in order:

  1. Constant folding: standard SSA constant propagation. Removes the bulk of address arithmetic produced by the recorder.
  2. Guard hoisting: move loop-invariant guards out of the loop body, into the trace's prologue. This is the largest single win and the reason tracing JITs are fast: a typed inner loop pays guard cost once on entry, not per iteration.
  3. Allocation removal: any vmList, vmString, or vmStruct allocated inside the trace and not escaping (no store to a heap location, no return) is scalar-replaced. Its fields live in SSA values. PyPy reports this as the dominant optimization for Python performance and the analog should hold for Mochi.
  4. Dead store elimination: stores to frame.Registers[i] whose value is overwritten before the next side-exit or trace end can be dropped, since side-exits restore from the snapshot table, not from the live frame.
  5. Common subexpression elimination: standard.

The optimizer is intentionally local and intentionally small: trace IR is straight-line by construction, so each pass is a single forward sweep.
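As an illustration of why the passes stay local: over straight-line trace IR, guard hoisting is a single forward sweep that moves any guard with loop-invariant inputs into the prologue. The node representation below is invented for the sketch; only the pass structure reflects the MEP:

```go
package main

import "fmt"

// node is a cut-down trace-IR node for this sketch.
type node struct {
	isGuard   bool
	invariant bool // true if the guarded value is defined before the loop body
	name      string
}

// hoistGuards splits a linear trace into (prologue, body): loop-invariant
// guards run once on trace entry; everything else stays per-iteration.
func hoistGuards(trace []node) (prologue, body []node) {
	for _, n := range trace {
		if n.isGuard && n.invariant {
			prologue = append(prologue, n)
		} else {
			body = append(body, n)
		}
	}
	return prologue, body
}

func main() {
	trace := []node{
		{isGuard: true, invariant: true, name: "guard xs is ListI64"}, // hoistable
		{name: "load xs.data[i]"},
		{isGuard: true, invariant: false, name: "guard i < len(xs)"}, // stays
		{name: "add acc, elem"},
	}
	pro, body := hoistGuards(trace)
	fmt.Println(len(pro), len(body)) // 1 3
}
```

The typed inner loop then pays the shape guard once on entry, per the prose above; per-iteration guards (like the bounds check here) remain in the body unless a later range-analysis pass can also prove them invariant.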

Trace shape: linear, then trees, then meta-traces

Three trace shapes are specified in order of complexity. v1 ships only Linear.

  • Linear (v1): each trace is a single loop closed by a back-edge. A side-exit drops back to the interpreter.
  • Tree (v2): a side-exit that itself becomes hot is recorded as a child trace and patched into the parent's guard, forming a trace tree. Sharply reduces interpreter time in code with biased branches.
  • Meta-traces (deferred): only relevant if Mochi grows a second front-end interpreter. Not in scope.

Backend

Code generation reuses the MEP-30 golang-asm backend, but the input is a trace-SSA function rather than per-opcode templates. The backend performs a single linear-scan register allocation over the trace's SSA values, mapping live values to architecture registers and spilling to frame.Registers only at side-exits.

This is the key performance differentiator over MEP-30: a trace can keep an int64 accumulator in x0 across many iterations, never touching memory, because there are no opcode boundaries that need a memory-consistent state.
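Straight-line traces make linear scan particularly simple: live intervals are exact, not approximations over a CFG. A sketch under that assumption; `interval`, `linearScan`, and the spill-the-newcomer heuristic are all illustrative, not the backend's actual code:

```go
package main

import "fmt"

// interval is one SSA value's live range over the trace, [start, end].
type interval struct {
	val, start, end int
}

// linearScan assigns each SSA value a register from regs, in interval-start
// order. Returns val -> register name; "" means spilled to frame.Registers.
// Simplification: when no register is free, the newcomer spills.
func linearScan(intervals []interval, regs []string) map[int]string {
	type active struct {
		val, end int
		reg      string
	}
	assign := make(map[int]string)
	var live []active
	free := append([]string(nil), regs...)

	for _, iv := range intervals { // intervals are sorted by start
		// Expire intervals that ended before this one starts.
		kept := live[:0]
		for _, a := range live {
			if a.end >= iv.start {
				kept = append(kept, a)
			} else {
				free = append(free, a.reg)
			}
		}
		live = kept
		if len(free) == 0 {
			assign[iv.val] = "" // spill
			continue
		}
		reg := free[len(free)-1]
		free = free[:len(free)-1]
		assign[iv.val] = reg
		live = append(live, active{iv.val, iv.end, reg})
	}
	return assign
}

func main() {
	// Value 0 is an accumulator live across the whole trace; 1-3 are temporaries.
	ivs := []interval{{0, 0, 9}, {1, 1, 2}, {2, 3, 4}, {3, 5, 6}}
	got := linearScan(ivs, []string{"x0", "x1"})
	fmt.Println(got[0] != "") // accumulator stays in a register: true
}
```

This is the property the prose describes: the long-lived accumulator holds a register for the entire trace, and short-lived temporaries cycle through the remaining ones without forcing it to memory.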

Side-exit table

Each guard in the compiled trace stores:

type SideExit struct {
    NativePC  uint32          // offset within the trace
    InterpPC  uint32          // PC to resume at
    Snapshot  []SnapshotEntry // SSA value -> frame register / immediate
    ExitCount uint32          // for tree formation
}

On guard failure the trace jumps to a trampoline that walks Snapshot, writes each captured SSA value back into frame.Registers, then returns control to the interpreter at InterpPC. This is the deopt path. Unlike MEP-32's deopt, it is local to the trace and never has to reconstruct state from an arbitrary point in an optimized method.
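A minimal sketch of the trampoline's snapshot walk. The MEP names SnapshotEntry but leaves its layout to the implementation; the two-field layout here is an assumption for the sketch:

```go
package main

import "fmt"

// SnapshotEntry maps one live SSA value to its home in the interpreter frame.
// Assumed layout for this sketch; the MEP does not fix the fields.
type SnapshotEntry struct {
	FrameReg int   // index into frame.Registers
	Value    int64 // captured SSA value at the guard (or immediate)
}

type SideExit struct {
	NativePC  uint32
	InterpPC  uint32
	Snapshot  []SnapshotEntry
	ExitCount uint32
}

// deopt materializes the snapshot into the frame, bumps the exit counter
// (which feeds trace-tree formation in v2), and returns the resume PC.
func deopt(exit *SideExit, frameRegisters []int64) uint32 {
	exit.ExitCount++
	for _, e := range exit.Snapshot {
		frameRegisters[e.FrameReg] = e.Value
	}
	return exit.InterpPC
}

func main() {
	frame := make([]int64, 8)
	exit := &SideExit{
		InterpPC: 42,
		Snapshot: []SnapshotEntry{{FrameReg: 0, Value: 1024}, {FrameReg: 3, Value: -1}},
	}
	pc := deopt(exit, frame)
	fmt.Println(pc, frame[0], frame[3]) // 42 1024 -1
}
```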

Blacklisting and trace stability

If the same back-edge produces more than MaxAbortsPerSite (default 3) aborted traces in a row, the back-edge is permanently blacklisted for the process lifetime. This is the TraceMonkey lesson: trace abort cycles, where the recorder keeps trying and failing, are the worst possible interpreter state and worse than not tracing at all. Hard-blacklisting after a small retry count bounds the damage.
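The bookkeeping is a few fields per back-edge. A sketch, with illustrative names around the MEP's MaxAbortsPerSite:

```go
package main

import "fmt"

const MaxAbortsPerSite = 3 // default from this MEP

// siteState tracks one back-edge's abort history. Blacklisting is permanent
// for the process lifetime, per the TraceMonkey lesson above.
type siteState struct {
	aborts      int
	blacklisted bool
}

// recordAbort notes a failed trace attempt; MaxAbortsPerSite consecutive
// aborts hard-blacklist the site.
func (s *siteState) recordAbort() {
	s.aborts++
	if s.aborts >= MaxAbortsPerSite {
		s.blacklisted = true
	}
}

// recordSuccess resets the consecutive-abort count.
func (s *siteState) recordSuccess() { s.aborts = 0 }

// shouldTrace gates the recorder.
func (s *siteState) shouldTrace() bool { return !s.blacklisted }

func main() {
	var s siteState
	s.recordAbort()
	s.recordSuccess() // a successful trace resets the streak
	for i := 0; i < MaxAbortsPerSite; i++ {
		s.recordAbort()
	}
	fmt.Println(s.shouldTrace()) // false: hard-blacklisted
}
```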

Memory and concurrency

Compiled traces share the MEP-30 executable code cache (per-process append-only mmap'd pages). Side-exit tables live in regular Go heap memory and are kept alive by a back-pointer from each Function to its trace set.

Mochi's per-goroutine VM model means each goroutine has its own profiling counters and may trigger independent tracing; the recorded trace is shared across goroutines once compiled, since the resulting native code is pure (no goroutine-local state beyond R_VM).

Cost model

Per-iteration steady-state cost on a fill_sum-style loop, Apple M4:

Path                                                          | Cycles per loop iter
vm2 interpreter (post-MEP-19)                                 | ~25
vm2 + Option A baseline JIT (MEP-30)                          | ~10
vm2 + Option B tracing JIT (this MEP)                         | ~3
Theoretical floor (Go native for loop summing []int64)        | ~1

The expected end-to-end speedup over the interpreter on numeric loops is 5-15x. On branchy or polymorphic code that aborts traces, the cost falls back to the interpreter plus a small profiling tax.

Engineering scope

Component                                         | Lines of Go | Engineer-weeks
Back-edge profiler hooks                          |         100 |  1
Trace recorder                                    |        1500 |  6
Trace IR + SSA infra                              |        1200 |  5
Optimization passes (5)                           |        1500 |  8
Linear-scan register allocator                    |         800 |  4
golang-asm backend (trace lowering)               |        1200 |  6
Side-exit table + deopt trampoline                |         600 |  4
Trace tree formation (v2)                         |        1000 |  6
Blacklisting + recorder budget                    |         200 |  1
Conformance tests + trace-shape fuzzers           |        1500 |  8
Trace-debug tooling (mandatory, not nice-to-have) |        1500 |  8
Total                                             |      ~11000 | ~57

Roughly one engineer-year to prototype, plus another year to harden and ship trace trees. This is consistent with the published timelines for LuaJIT 2 (one architect, multi-year) and PyPy (a small team, multi-year, with trace-debug tooling cost dominating once correctness regressions showed up in the wild).

JIT integration with the rest of vm2

  • MEP-19 quickening: the trace recorder works at the quickened-opcode level, so the IR is already type-narrow before recording starts.
  • MEP-25 shapes: shape transitions on a hot loop's container trigger side-exit, which is the correct behavior; the next entry retries the trace under the new shape.
  • MEP-27 inline caches: the recorder reads ICs as type predictions but always emits its own guard, so trace correctness does not depend on IC freshness.
  • MEP-30 baseline JIT: complementary. Baseline JIT compiles whole functions cheaply; tracing JIT compiles hot inner loops expensively. A coexistence design is possible (baseline JIT serves as the fallback path on trace side-exit), but v1 ships tracing-only and reuses the interpreter as the side-exit target.

Risks

  1. Trace abort pathology. The single biggest source of failure for tracing JITs in the field. Mitigation: hard blacklisting after 3 aborts; comprehensive trace-debug logging (MOCHI_JIT_TRACE_DEBUG=1); measured corpus tests in CI that fail if any expected-traceable loop aborts.
  2. Code with mixed types in the loop body. Each type change is a guard; many guards make the trace fat. Mitigation: cap trace length at 1024 IR nodes; bail out and let the interpreter handle the loop if the type histogram is too wide.
  3. Allocation in the loop. Allocations that escape cannot be removed and become a serialization point in the trace. Mitigation: rely on Mochi's existing escape analysis at compile time; emit a hint to the trace recorder so it can pre-emptively abort traces with known-escaping allocations.
  4. Trace-debug tooling cost. PyPy's experience: trace-debug tooling ends up the same order of magnitude as the optimizer itself. We have explicitly budgeted ~8 engineer-weeks for it above. Underestimating this is the single largest risk to the project schedule.
  5. Interaction with Go's GC. Trace code holds raw *Cell pointers into frame.Registers. The frame must be GC-pinned for the duration of trace execution. Mitigation: trace prologue takes a runtime.KeepAlive analog; trace epilogue releases. Detailed protocol specified in runtime/vm2/jit/trace_gc.go design doc (deferred).

Alternatives considered

  • Pure method JIT (MEP-32 only): gives up the tracing JIT's structural advantage of seeing only the path taken. For numeric loops this is a 2-3x performance gap, per the LuaJIT-vs-V8 comparison on numeric benchmarks.
  • Meta-tracing in RPython style: requires Mochi to be implemented in RPython, which it is not. The Go ecosystem has no equivalent meta-tracing toolkit. Rejected as out-of-scope.
  • Tracing without trace trees: the v1 design, deliberately. Trade some peak performance for a simpler initial implementation.

Comparison matrix

Dimension                       | Option A (MEP-30, template) | Option B (this MEP)    | Option C (MEP-32, tiered)
Predicted speedup on fill_sum   | 2-4x                        | 5-15x                  | 4-8x
Engineering scope (KLOC)        | ~2.2                        | ~11                    | ~25
Engineer-months to prototype    | ~3                          | ~12                    | ~18
Deopt complexity                | None                        | Medium (local)         | High (cross-method)
OSR complexity                  | None (free)                 | N/A (back-edge native) | High
Single static binary preserved  | Yes                         | Yes                    | Yes (tier 1)
Reuses MEP-27 IC infrastructure | Read-only                   | Yes                    | Yes
Performance ceiling             | Modest                      | High                   | High
Failure mode on bad workload    | None (= interpreter)        | Abort + blacklist      | Tier-2 deopt cascades

Predicted MEP-23 numbers

On the MEP-23 cross-language lists/fill_sum bench, N=1024, current vm2 ~3805 ns/op; predicted with Option B: ~600-900 ns/op, reaching parity with Go native and roughly 1.5x faster than LuaJIT on the same workload (LuaJIT pays a small Lua-table overhead vm2 does not). On concat_loop the win is modest (~2x) because the loop body is dominated by allocation, which tracing can only partially eliminate. On fib (recursive, no loop) the trace recorder cannot help; predict a small regression from the recorder's per-call overhead, mitigated by hot-recursion-as-loop detection (a v2 feature, deferred).

Open questions

  • Function inlining inside a trace. Tracing JITs naturally inline because they record the called function's body inline. This is mostly good but can blow the trace length budget. Bound by an InlineDepth parameter? Per-function annotation?
  • Tracing across goroutine boundaries. A go func() inside a trace must abort. But blocking on a channel might or might not be worth tracing depending on contention. Defer.
  • Whether to share the executable code cache with MEP-30. Both fragment the cache differently. Recommendation: share with eviction policy that prefers retaining trace code.
  • Pause-for-recording cost. Recording roughly halves interpreter throughput while active. If a back-edge trips at the start of a long loop, recording cost is amortized; if at the end, it is pure waste. Mitigate via MinIterationsBeforeRecord?