MEP 30. VM2 JIT Option A - Template / Copy-and-Patch Baseline JIT
| Field | Value |
|---|---|
| MEP | 30 |
| Title | VM2 JIT Option A - Template / Copy-and-Patch Baseline JIT |
| Author | Mochi core |
| Status | Draft |
| Type | Standards Track |
| Created | 2026-05-17 |
Abstract
This MEP specifies the template / copy-and-patch baseline JIT for vm2: the lowest-complexity tier in the JIT taxonomy, in the lineage of V8 Sparkplug (2021), JavaScriptCore's Baseline JIT, and CPython 3.13's copy-and-patch JIT (PEP 744). The vm2 compiler emits its existing bytecode; on the first call past a hotness threshold, a JIT pass walks the bytecode and concatenates one pre-compiled native snippet per (opcode, shape-tuple), patches in immediates and branch targets, then mmaps the result as executable. The interpreter frame layout is preserved bit-for-bit, so on-stack replacement (OSR) is a function-pointer swap and bailout is a JMP back into the interpreter.
This is Option A of three. MEP-31 specifies the tracing JIT alternative; MEP-32 specifies the tiered method JIT with type feedback. All three share the same vm2 bytecode, the same Cell ABI, and the same MEP-25 data-model shapes; only the compilation strategy differs.
Motivation
The vm2 interpreter, after the MEP-17 through MEP-29 work, is within ~4x of native Go on fill_sum-class loops and within ~5x of LuaJIT on the broader corpus. The remaining gap is dominated by dispatch overhead and Cell boxing on the hot path, not by missing optimizations. A baseline JIT that erases the opcode dispatch and inlines the per-shape handler bodies will close the bulk of that gap with no IR, no register allocator, and no speculative type system.
The design constraint for Mochi is sharper than for V8 or CPython: Mochi must continue to distribute as a single static Go binary, with no cgo and no external toolchain on the user's machine. This rules out LLVM-at-runtime (CPython's offline approach is fine, but their runtime is C; ours is Go) and rules out invoking the Go compiler as a subprocess. The only emission backend that satisfies the constraint is direct machine-code patching, for which github.com/twitchyliquid64/golang-asm (the Go compiler's own assembler, repackaged) is the proven Go-native path.
The historical evidence is consistent: every serious JIT engine started with a baseline tier of roughly this shape, and every one of them banked a 3-5x win at engineering costs measured in engineer-quarters, not engineer-years:
- V8 Sparkplug (2021): no IR, "switch in a for loop", compiles in ~10 us per function, 5-15% on real workloads and +45% on JetStream over the Ignition interpreter.
- JSC Baseline JIT: the second tier above the LLInt interpreter; ~2x over LLInt at a fraction of DFG's complexity.
- CPython 3.13 copy-and-patch JIT: ~0% in 3.13 (parity with the tier-2 interpreter, by design), single-digit gains in 3.14, ~12% on macOS arm64 in 3.15 alphas. The peak ceiling is modest precisely because it is the baseline tier; the engineering cost is modest for the same reason.
Mochi's relative gain should be larger than CPython's because vm2's interpreter is less aggressively optimized than CPython's tier-2 specializer, and the per-opcode handler bodies in vm2 are already short and shape-typed thanks to MEP-19 (quickening) and MEP-29's measured-AOT work.
Specification
Overview
The JIT is a build-time + run-time pipeline:
- Build time: a code generator produces one Go source function per (opcode, shape-tuple) handler. The Go compiler emits each as a regular function. A harvesting tool walks the resulting object file, extracts the function body bytes, records relocation sites (immediates, branch targets, helper-function addresses), and writes the result into a `templates_<arch>.go` table compiled into the Mochi binary.
- Run time: when a function's call count crosses `JITThreshold` (default 1000), the JIT walks its bytecode once, looks up each instruction's template, appends the body bytes to a per-function `[]byte` buffer, fills in the recorded relocation slots from the instruction's immediates and from the JIT's own jump-target map, then `mmap`s the buffer with `PROT_EXEC` and stores the entry pointer in the function's `JITCode` field.
- Dispatch: the interpreter's main loop checks `frame.fn.JITCode != nil` on function entry and on back-edges; if set, it jumps into the JIT code via a cgo-free trampoline.
Snippet ABI
Every snippet is compiled under one fixed calling convention so the patcher can splice them safely:
| Register / Slot | Role |
|---|---|
| `R_VM` (arm64 `x19`, amd64 `r14`) | `*vm2.VM` |
| `R_FP` | Pointer to the current frame's register file (`*Cell`) |
| `R_PC` | Current PC; only meaningful at instruction boundaries |
| `R_OBJECTS` | Cached `vm.Objects` slice header |
| `R_TMP0`..`R_TMP3` | Caller-saved scratch |
Each snippet expects its operand register indexes already encoded as immediate slots within the body. The patcher fills those slots at copy time. Snippets never spill, never allocate Go stack frames, and never call into Go-managed code on the hot path; the only off-snippet calls are to a small set of pre-resolved C-ABI helpers (alloc, GC barrier, deopt return) whose addresses are patched into snippet bodies at JIT time.
Templates table
`templates_<arch>.go` is generated by `cmd/mochi-jit-gen` at build time and looks like:

```go
// AUTOGENERATED by cmd/mochi-jit-gen. Do not edit.
package jit

var templatesARM64 = map[opShape]template{
	{Op: vm.OpListGetI64, Shape: shapeListI64Int}: {
		Body:  []byte{0xfd, 0x7b, 0xbf, 0xa9 /* ... */},
		Slots: []slot{{Off: 12, Kind: slotImmReg, Field: "Dst"} /* ... */},
	},
	// ... one entry per (opcode, shape-tuple)
}
```
Slot kinds:
- `slotImmReg`: write a small integer (register index) into the immediate field of an `add`/`ldr` instruction.
- `slotImmI64`: write a 64-bit constant into a pair of `movz`/`movk` instructions.
- `slotBranch`: write a relative offset to another snippet within the same function.
- `slotHelper`: write the absolute address of one of the pre-registered runtime helpers (resolved at JIT init).
The number of templates is bounded by len(opcodes) * len(shape-tuples). With ~80 opcodes and an average of ~4 shape tuples per dispatchable opcode, the table is ~320 entries. At an average snippet size of 64 bytes, the templates table compiles into ~20 KB of .rodata. This is small enough to embed in every Mochi binary regardless of whether the JIT is enabled at runtime.
Frame compatibility (free OSR, free bailout)
The single most consequential design choice in Sparkplug was making JIT frames bit-compatible with interpreter frames so that OSR was a function-pointer swap. Mochi adopts the same rule:
- The JIT does not allocate its own register file. It reads and writes the same `frame.Registers []Cell` slab the interpreter uses.
- The JIT does not invent its own PC encoding. It maintains a `PC -> NativeOffset` table so the interpreter can resume at any instruction boundary.
- The JIT never holds a Mochi value in an arch register across an instruction boundary. Between snippets, all live state is in `frame.Registers`, exactly as in the interpreter.

This means:
- OSR: a back-edge in the interpreter checks `JITCode != nil`; if set, it loads the corresponding `NativeOffset`, sets `R_FP` from the interpreter's frame pointer, and jumps. No frame rewriting.
- Bailout (e.g. shape changed under the JIT, or a helper signals deopt): the snippet writes the current PC into `R_PC` and returns to the trampoline, which falls through into the interpreter loop. No state reconstruction.
The cost of this rule is that the JIT cannot keep a value pinned in a register across opcodes (no cross-opcode register allocation). The benefit is the elimination of deopt as a category of engineering problem. This is the same tradeoff Sparkplug took explicitly. Mochi takes it for the same reason: a baseline tier is the wrong place to pay for a regalloc.
Hot-path example
`OpListGetI64 dst, list, idx` (Mochi's int-specialized list get, post-MEP-19 quickening) compiles, on arm64, to roughly:
```asm
ldr  x0, [R_FP, #idx_off]         ; idx (Cell)
and  x0, x0, #0x0000FFFFFFFFFFFF  ; unbox int
ldr  x1, [R_FP, #list_off]        ; list (Cell)
and  x1, x1, #0x0000FFFFFFFFFFFF  ; unbox ptr
ldr  x2, [R_OBJECTS, x1, lsl #3]  ; *vmListI64
ldr  x2, [x2, #data_off]          ; backing slice header
ldr  x0, [x2, x0, lsl #3]         ; load int64 element
mov  x3, #tagInt_hi
movk x3, #tagInt_lo, lsl #48
orr  x0, x0, x3                   ; rebox int
str  x0, [R_FP, #dst_off]         ; write result
```
Eleven instructions, no branches, no calls. The interpreter's equivalent path is ~30 Go statements plus a switch arm, an interface-type assertion, and a Cell pack/unpack helper call. The expected speedup on fill_sum-class loops is the ratio of these two paths, modulo Go's inlining of the interpreter loop and the bench harness; benchmarks should land in the 2-4x range.
Hotness threshold and warmup
A function's `CallCount` increments on entry and `BackEdgeCount` on each backward branch. JIT is triggered when either exceeds `JITThreshold` (default: 1000 for `CallCount`, 10000 for `BackEdgeCount`). Both thresholds are tunable via `MOCHI_JIT_THRESHOLD` and can be set to 1 for testing.
Compilation is synchronous on the calling goroutine. Extrapolating from Sparkplug's ~10 us/function, the expected compile time is a few microseconds per Mochi function; this is well below the threshold at which a background-compilation thread would pay for its complexity.
Memory and code-cache management
JIT code is allocated from a per-process pool of 64 KiB pages obtained via `syscall.Mmap`; each page is mapped read-write while snippets are patched in, then flipped to `PROT_READ|PROT_EXEC` with `syscall.Mprotect` (W^X). Pages are append-only within a process; there is no eviction or recompilation. A future MEP may add code-cache GC if real Mochi programs prove to need it.
Distribution and build constraints
The JIT subsystem lives under `runtime/vm2/jit/`. It is compiled into every Mochi binary but gated by a runtime flag `MOCHI_JIT=1`. The default in 0.x releases is off; the default in 1.0 will flip to on. Two GOOS/GOARCH pairs are supported in tier 1: darwin/arm64 and linux/amd64. Other pairs fall back to the interpreter with no source-level changes.
No cgo. No external toolchain. The Mochi binary remains a single static Go binary.
Cost model
Per-opcode steady-state cost, in machine cycles on Apple M4:
| Path | Cycles |
|---|---|
| vm2 interpreter (post-MEP-19) | 8-12 |
| vm2 + AOT specialization (MEP-29) | 6-9 |
| vm2 + Option A JIT (this MEP) | 2-4 |
| Theoretical floor (Go native) | 1-2 |
On fill_sum N=1024, this translates to a predicted ~2.5x over the current interpreter and ~2x over the MEP-29 AOT prototype.
Engineering scope
| Component | Lines of Go | Engineer-weeks |
|---|---|---|
| `cmd/mochi-jit-gen` (template harvester) | 800 | 3 |
| `runtime/vm2/jit/templates_arm64.go` | generated | - |
| `runtime/vm2/jit/patcher.go` | 600 | 4 |
| `runtime/vm2/jit/trampoline_arm64.s` | 80 | 1 |
| `runtime/vm2/jit/codecache.go` | 200 | 1 |
| Interpreter integration (OSR hooks) | 150 | 2 |
| Conformance tests + fuzzers | 400 | 3 |
| Total | ~2200 | ~14 weeks |
Roughly one engineer-quarter to prototype, two to harden. Consistent with Sparkplug's reported timeline and CPython 3.13's "small team" framing.
JIT integration with the rest of vm2
- MEP-19 quickening: each quickened opcode (e.g. `OpListGetI64`) is a separate template entry. The JIT consumes whatever shape the interpreter already saw.
- MEP-25 shapes: the JIT does not perform speculation. If a shape changes after JIT compilation, the snippet's shape guard fails and execution falls back to the interpreter via the bailout path.
- MEP-27 inline caches: ICs are read by the JIT as immediate hints when picking templates, but the JIT emits no IC of its own; it embeds the specialized handler directly.
- MEP-28 AOT specialization: orthogonal. MEP-28 specializes the interpreter loop; this MEP replaces the interpreter loop for hot functions.
Risks
- Go runtime safety on JIT code. JIT code must respect the goroutine preemption protocol, must not appear as a stack frame the GC tries to walk, and must not block stack growth. Mitigation: route all calls back into Go through the trampoline; never let the JIT execute across a back-edge without checking `g.preempt`.
- Snippet ABI drift between Go versions. The Go compiler's choice of callee-saved registers and stack frame shape can change. Mitigation: pin the Go toolchain version per Mochi release; verify the templates table at JIT init by re-running a small known-result snippet.
- Architecture coverage. arm64 + amd64 covers ~95% of Mochi's measured installs but excludes windows/arm64 and linux/arm64 (notable for CI runners). Mitigation: explicit fallback to the interpreter on unsupported pairs; track adoption in MEP-29-style measured-results MEPs.
- Code-cache pressure on long-running processes. There is no eviction in v1. Mitigation: document the per-process cap; add eviction in a follow-on MEP if measurements warrant.
Alternatives considered
- LLVM via cgo: highest peak performance, but breaks single-binary distribution and adds 50-100 MB of build dependencies. Rejected for the baseline tier; viable as an optional `mochi-jit-llvm` tag in a future MEP.
- `go build -buildmode=plugin`: works only on linux/darwin, requires the Go toolchain on the user's machine, and adds whole-second compile latency. Rejected.
- Emit WASM, run via wazero: viable, gives ~10x over the interpreter per wazero's own numbers, but inherits Wasm's calling-convention overhead on every host callback. A reasonable Option D worth a follow-on MEP; rejected here because the goal is to share frame layout with the interpreter.
- Cranelift via cgo: 10x faster codegen than LLVM but requires a Rust toolchain plus cgo. Rejected.
Comparison matrix
| Dimension | Option A (this MEP) | Option B (MEP-31, tracing) | Option C (MEP-32, tiered) |
|---|---|---|---|
| Predicted speedup on fill_sum | 2-4x | 5-15x | 4-8x |
| Engineering scope (KLOC) | ~2.2 | ~12 | ~25 |
| Engineer-months to prototype | ~3 | ~12 | ~18 |
| Deopt complexity | None | Medium | High |
| OSR complexity | None (free) | Medium | High |
| Single static binary preserved | Yes | Yes (with golang-asm) | Yes (tier 1) |
| Reuses MEP-27 IC infrastructure | Read-only | Yes | Yes (heavily) |
| Performance ceiling | Modest | High | High |
Predicted MEP-23 numbers
On the MEP-23 cross-language lists/fill_sum bench, N=1024 on Apple M4 (Go 1.25), current measured ~3805 ns/op for vm2; predicted with Option A: ~1500-1900 ns/op, closing ~70% of the gap to the Go-native floor at 935 ns/op. On strings/concat_loop (rope-shape baseline from MEP-29), the JIT's win is smaller (~1.3x) because the hot path is already dominated by allocation, not dispatch.
Open questions
- Per-shape vs per-shape-tuple templates. Templates for binary opcodes (e.g. `OpListConcat`) explode quadratically in shape count. Cap the explosion by deferring shape-tuples beyond a threshold to the interpreter? Or factor them via a runtime shape-merging table inside the snippet?
- AArch64 PAC interaction. Apple Silicon enforces pointer authentication on return addresses. The trampoline must sign/authenticate its own return path; the JIT body is leaf-only and unaffected. Verify on real hardware before merging.
- Whether to ship the JIT on or off by default in 1.0. Sparkplug shipped on by default; CPython 3.13's JIT shipped off. Mochi should pick once tier 1 lands and MEP-29-equivalent measurements are in hand.
Related work
- V8 Sparkplug post (Verwaest, 2021)
- PEP 744 - JIT Compilation
- Copy-and-Patch Compilation (Xu & Kjolstad, PLDI 2021)
- JSC Baseline JIT internals (Zon8)
- golang-asm
- Runtime code generation in Go (mathetake)
- MEP-25, MEP-27, MEP-28, MEP-29: the data model and dispatch-strategy MEPs whose work this JIT consumes.
- MEP-31, MEP-32: the two competing JIT strategies.