MEP 30. VM2 JIT Option A - Template / Copy-and-Patch Baseline JIT
| Field | Value |
|---|---|
| MEP | 30 |
| Title | VM2 JIT Option A - Template / Copy-and-Patch Baseline JIT |
| Author | Mochi core |
| Status | Draft |
| Type | Standards Track |
| Created | 2026-05-17 |
Abstract
This MEP specifies the template / copy-and-patch baseline JIT for vm2: the lowest-complexity tier in the JIT taxonomy, in the lineage of V8 Sparkplug (2021), JavaScriptCore's Baseline JIT, and CPython 3.13's copy-and-patch JIT (PEP 744). The vm2 compiler emits its existing bytecode; on the first call past a hotness threshold, a JIT pass walks the bytecode and concatenates one pre-compiled native snippet per (opcode, shape-tuple), patches in immediates and branch targets, then mmaps the result as executable. The interpreter frame layout is preserved bit-for-bit, so on-stack replacement (OSR) is a function-pointer swap and bailout is a JMP back into the interpreter.
This is Option A of three. MEP-31 specifies the tracing JIT alternative; MEP-32 specifies the tiered method JIT with type feedback. All three share the same vm2 bytecode, the same Cell ABI, and the same MEP-25 data-model shapes; only the compilation strategy differs.
Motivation
The vm2 interpreter, after the MEP-17 through MEP-29 work, is within ~4x of native Go on fill_sum-class loops and within ~5x of LuaJIT on the broader corpus. The remaining gap is dominated by dispatch overhead and Cell boxing on the hot path, not by missing optimizations. A baseline JIT that erases the opcode dispatch and inlines the per-shape handler bodies will close the bulk of that gap with no IR, no register allocator, and no speculative type system.
The design constraint for Mochi is sharper than for V8 or CPython: Mochi must continue to distribute as a single static Go binary, with no cgo and no external toolchain on the user's machine. This rules out LLVM-at-runtime (CPython's offline approach is fine, but their runtime is C; ours is Go) and rules out invoking the Go compiler as a subprocess. The only emission backend that satisfies the constraint is direct machine-code patching, for which github.com/twitchyliquid64/golang-asm (the Go compiler's own assembler, repackaged) is the proven Go-native path.
The historical evidence is consistent: every serious JIT engine started with a baseline tier of roughly this shape, and every one of them banked a 3-5x win at engineering costs measured in engineer-quarters, not engineer-years:
- V8 Sparkplug (2021): no IR, "switch in a for loop", compiles in ~10 us per function, 5-15% on real workloads and +45% on JetStream over the Ignition interpreter.
- JSC Baseline JIT: the second tier above the LLInt interpreter; ~2x over LLInt at a fraction of DFG's complexity.
- CPython 3.13 copy-and-patch JIT: ~0% in 3.13 (parity with the tier-2 interpreter, by design), single-digit gains in 3.14, ~12% on macOS arm64 in 3.15 alphas. The peak ceiling is modest precisely because it is the baseline tier; the engineering cost is modest for the same reason.
Mochi's relative gain should be larger than CPython's because vm2's interpreter is less aggressively optimized than CPython's tier-2 specializer, and the per-opcode handler bodies in vm2 are already short and shape-typed thanks to MEP-19 (quickening) and MEP-29's measured-AOT work.
Specification
Overview
The JIT is a build-time + run-time pipeline:
- Build time: a code generator produces one Go source function per (opcode, shape-tuple) handler. The Go compiler emits each as a regular function. A harvesting tool walks the resulting object file, extracts the function body bytes, records relocation sites (immediates, branch targets, helper-function addresses), and writes the result into a `templates_<arch>.go` table compiled into the Mochi binary.
- Run time: when a function's call count crosses `JITThreshold` (default 1000), the JIT walks its bytecode once, looks up each instruction's template, appends the body bytes to a per-function `[]byte` buffer, fills in the recorded relocation slots from the instruction's immediates and from the JIT's own jump-target map, then `mmap`s the buffer with `PROT_EXEC` and stores the entry pointer in the function's `JITCode` field.
- Dispatch: the interpreter's main loop checks `frame.fn.JITCode != nil` on function entry and on back-edges; if set, it jumps into the JIT code via a cgo-free trampoline.
Snippet ABI
Every snippet is compiled under one fixed calling convention so the patcher can splice them safely:
| Register / Slot | Role |
|---|---|
| `R_VM` (arm64 `x19`, amd64 `r14`) | `*vm2.VM` |
| `R_FP` | Pointer to the current frame's register file (`*Cell`) |
| `R_PC` | Current PC; only meaningful at instruction boundaries |
| `R_OBJECTS` | Cached `vm.Objects` slice header |
| `R_TMP0`..`R_TMP3` | Caller-saved scratch |
Each snippet expects its operand register indexes already encoded as immediate slots within the body. The patcher fills those slots at copy time. Snippets never spill, never allocate Go stack frames, and never call into Go-managed code on the hot path; the only off-snippet calls are to a small set of pre-resolved C-ABI helpers (alloc, GC barrier, deopt return) whose addresses are patched into snippet bodies at JIT time.
Templates table
`templates_<arch>.go` is generated by `cmd/mochi-jit-gen` at build time and looks like:

```go
// AUTOGENERATED by cmd/mochi-jit-gen. Do not edit.
package jit

var templatesARM64 = map[opShape]template{
	{Op: vm.OpListGetI64, Shape: shapeListI64Int}: {
		Body:  []byte{0xfd, 0x7b, 0xbf, 0xa9 /* ... */},
		Slots: []slot{{Off: 12, Kind: slotImmReg, Field: "Dst"} /* ... */},
	},
	// ... one entry per (opcode, shape-tuple)
}
```
Slot kinds:
- `slotImmReg`: write a small integer (register index) into the immediate field of an `add`/`ldr` instruction.
- `slotImmI64`: write a 64-bit constant into a pair of `movz`/`movk` instructions.
- `slotBranch`: write a relative offset to another snippet within the same function.
- `slotHelper`: write the absolute address of one of the pre-registered runtime helpers (resolved at JIT init).
The number of templates is bounded by len(opcodes) * len(shape-tuples). With ~80 opcodes and an average of ~4 shape tuples per dispatchable opcode, the table is ~320 entries. At an average snippet size of 64 bytes, the templates table compiles into ~20 KB of .rodata. This is small enough to embed in every Mochi binary regardless of whether the JIT is enabled at runtime.
Frame compatibility (free OSR, free bailout)
The single most consequential design choice in Sparkplug was making JIT frames bit-compatible with interpreter frames so that OSR was a function-pointer swap. Mochi adopts the same rule:
- The JIT does not allocate its own register file. It reads and writes the same `frame.Registers []Cell` slab the interpreter uses.
- The JIT does not invent its own PC encoding. It maintains a `PC -> NativeOffset` table so the interpreter can resume at any instruction boundary.
- The JIT never holds a Mochi value in an arch register across an instruction boundary. Between snippets, all live state is in `frame.Registers`, exactly as in the interpreter.

This means:
- OSR: a back-edge in the interpreter checks `JITCode != nil`; if set, it loads the corresponding `NativeOffset`, sets `R_FP` from the interpreter's frame pointer, and jumps. No frame rewriting.
- Bailout (e.g. shape changed under the JIT, or a helper signals deopt): the snippet writes the current PC into `R_PC` and returns to the trampoline, which falls through into the interpreter loop. No state reconstruction.
The cost of this rule is that the JIT cannot keep a value pinned in a register across opcodes (no cross-opcode register allocation). The benefit is the elimination of deopt as a category of engineering problem. This is the same tradeoff Sparkplug took explicitly. Mochi takes it for the same reason: a baseline tier is the wrong place to pay for a regalloc.
Hot-path example
`OpListGetI64 dst, list, idx` (Mochi's int-specialized list get, post-MEP-19 quickening) compiles, on arm64, to roughly:
```asm
ldr  x0, [R_FP, #idx_off]         ; idx (Cell)
and  x0, x0, #0x0000FFFFFFFFFFFF  ; unbox int
ldr  x1, [R_FP, #list_off]        ; list (Cell)
and  x1, x1, #0x0000FFFFFFFFFFFF  ; unbox ptr
ldr  x2, [R_OBJECTS, x1, lsl #3]  ; *vmListI64
ldr  x2, [x2, #data_off]          ; backing slice header
ldr  x0, [x2, x0, lsl #3]         ; load int64 element
mov  x3, #tagInt_hi
movk x3, #tagInt_lo, lsl #48
orr  x0, x0, x3                   ; rebox int
str  x0, [R_FP, #dst_off]         ; write result
```
Eleven instructions, no branches, no calls. The interpreter's equivalent path is ~30 Go statements plus a switch arm, an interface-type assertion, and a Cell pack/unpack helper call. The expected speedup on fill_sum-class loops is the ratio of these two paths, modulo Go's inlining of the interpreter loop and the bench harness; benchmarks should land in the 2-4x range.
Hotness threshold and warmup
A function's `CallCount` increments on entry and `BackEdgeCount` on each backward branch. JIT is triggered when either exceeds `JITThreshold` (default: 1000 for `CallCount`, 10000 for `BackEdgeCount`). Both thresholds are tunable via `MOCHI_JIT_THRESHOLD` and can be set to 1 for testing.
Compilation is synchronous on the calling goroutine. Extrapolating from Sparkplug's ~10 us/function, the expected compile time is a few microseconds per Mochi function; this is well below the threshold at which a background-compilation thread would pay for its complexity.
Memory and code-cache management
JIT code is allocated from a per-process pool of 64 KiB pages obtained via `syscall.Mmap`; each page is mapped read-write while snippets are patched in, then flipped to `PROT_READ|PROT_EXEC` with `syscall.Mprotect` (W^X). Pages are append-only within a process; there is no eviction or recompilation. A future MEP may add code-cache GC if real Mochi programs prove to need it.
Distribution and build constraints
The JIT subsystem lives under `runtime/vm2/jit/`. It is compiled into every Mochi binary but gated by a runtime flag `MOCHI_JIT=1`. The default in 0.x releases is off; the default in 1.0 will flip to on. Two GOOS/GOARCH pairs are supported in tier 1: darwin/arm64 and linux/amd64. Other pairs fall back to the interpreter with no source-level changes.
No cgo. No external toolchain. The Mochi binary remains a single static Go binary.
Cost model
Per-opcode steady-state cost, in machine cycles on Apple M4:
| Path | Cycles |
|---|---|
| vm2 interpreter (post-MEP-19) | 8-12 |
| vm2 + AOT specialization (MEP-29) | 6-9 |
| vm2 + Option A JIT (this MEP) | 2-4 |
| Theoretical floor (Go native) | 1-2 |
On fill_sum N=1024, this translates to a predicted ~2.5x over the current interpreter and ~2x over the MEP-29 AOT prototype.
Engineering scope
| Component | Lines of Go | Engineer-weeks |
|---|---|---|
| `cmd/mochi-jit-gen` (template harvester) | 800 | 3 |
| `runtime/vm2/jit/templates_arm64.go` | generated | - |
| `runtime/vm2/jit/patcher.go` | 600 | 4 |
| `runtime/vm2/jit/trampoline_arm64.s` | 80 | 1 |
| `runtime/vm2/jit/codecache.go` | 200 | 1 |
| Interpreter integration (OSR hooks) | 150 | 2 |
| Conformance tests + fuzzers | 400 | 3 |
| Total | ~2200 | ~14 weeks |
Roughly one engineer-quarter to prototype, two to harden. Consistent with Sparkplug's reported timeline and CPython 3.13's "small team" framing.
JIT integration with the rest of vm2
- MEP-19 quickening: each quickened opcode (e.g. `OpListGetI64`) is a separate template entry. The JIT consumes whatever shape the interpreter already saw.
- MEP-25 shapes: the JIT does not perform speculation. If a shape changes after JIT compilation, the snippet's shape guard fails and execution falls back to the interpreter via the bailout path.
- MEP-27 inline caches: ICs are read by the JIT as immediate hints when picking templates, but the JIT emits no IC of its own; it embeds the specialized handler directly.
- MEP-28 AOT specialization: orthogonal. MEP-28 specializes the interpreter loop; this MEP replaces the interpreter loop for hot functions.
Risks
- Go runtime safety on JIT code. JIT code must respect the goroutine preemption protocol, must not appear as a stack frame the GC tries to walk, and must not block stack growth. Mitigation: route all calls back into Go through the trampoline; never let the JIT execute across a back-edge without checking `g.preempt`.
- Snippet ABI drift between Go versions. The Go compiler's choice of callee-saved registers and stack frame shape can change. Mitigation: pin the Go toolchain version per Mochi release; verify the templates table at JIT init by re-running a small known-result snippet.
- Architecture coverage. arm64 + amd64 covers ~95% of Mochi's measured installs but excludes windows/arm64 and linux/arm64 (notable for CI runners). Mitigation: explicit fallback to the interpreter on unsupported pairs; track adoption in MEP-29-style measured-results MEPs.
- Code-cache pressure on long-running processes. There is no eviction in v1. Mitigation: document the per-process cap; add eviction in a follow-on MEP if measurements warrant.
Alternatives considered
- LLVM via cgo: highest peak performance, but breaks single-binary distribution and adds 50-100 MB of build dependencies. Rejected for the baseline tier; viable as an optional `mochi-jit-llvm` tag in a future MEP.
- `go build -buildmode=plugin`: works only on linux/darwin, requires the Go toolchain on the user's machine, and adds whole-second compile latency. Rejected.
- Emit WASM, run via wazero: viable, gives ~10x over the interpreter per wazero's own numbers, but inherits Wasm's calling-convention overhead on every host callback. A reasonable Option D worth a follow-on MEP; rejected here because the goal is to share frame layout with the interpreter.
- Cranelift via cgo: 10x faster codegen than LLVM but requires a Rust toolchain plus cgo. Rejected.
Comparison matrix
| Dimension | Option A (this MEP) | Option B (MEP-31, tracing) | Option C (MEP-32, tiered) |
|---|---|---|---|
| Predicted speedup on fill_sum | 2-4x | 5-15x | 4-8x |
| Engineering scope (KLOC) | ~2.2 | ~12 | ~25 |
| Engineer-months to prototype | ~3 | ~12 | ~18 |
| Deopt complexity | None | Medium | High |
| OSR complexity | None (free) | Medium | High |
| Single static binary preserved | Yes | Yes (with golang-asm) | Yes (tier 1) |
| Reuses MEP-27 IC infrastructure | Read-only | Yes | Yes (heavily) |
| Performance ceiling | Modest | High | High |
Predicted MEP-23 numbers
On the MEP-23 cross-language lists/fill_sum bench, N=1024 on Apple M4 (Go 1.25), current measured ~3805 ns/op for vm2; predicted with Option A: ~1500-1900 ns/op, closing ~70% of the gap to the Go-native floor at 935 ns/op. On strings/concat_loop (rope-shape baseline from MEP-29), the JIT's win is smaller (~1.3x) because the hot path is already dominated by allocation, not dispatch.
Open questions
- Per-shape vs per-shape-tuple templates. Templates for binary opcodes (e.g. `OpListConcat`) explode quadratically in shape count. Cap the explosion by deferring shape-tuples beyond a threshold to the interpreter? Or factor them via a runtime shape-merging table inside the snippet?
- AArch64 PAC interaction. Apple Silicon enforces pointer authentication on return addresses. The trampoline must sign/authenticate its own return path; the JIT body is leaf-only and unaffected. Verify on real hardware before merging.
- Whether to ship the JIT on or off by default in 1.0. Sparkplug shipped on by default; CPython 3.13's JIT shipped off. Mochi should pick once tier 1 lands and MEP-29-equivalent measurements are in hand.
Related work
- V8 Sparkplug post (Verwaest, 2021)
- PEP 744 - JIT Compilation
- Copy-and-Patch Compilation (Xu & Kjolstad, PLDI 2021)
- JSC Baseline JIT internals (Zon8)
- golang-asm
- Runtime code generation in Go (mathetake)
- MEP-25, MEP-27, MEP-28, MEP-29: the data model and dispatch-strategy MEPs whose work this JIT consumes.
- MEP-31, MEP-32: the two competing JIT strategies.