MEP 34. VM2 Full-Opcode JIT - Lists, Strings, Maps, Sets, Structs

Field	Value
MEP	34
Title	VM2 Full-Opcode JIT - Lists, Strings, Maps, Sets, Structs
Author	Mochi core
Status	Active
Type	Standards Track
Created	2026-05-17

Abstract

This MEP specifies the production JIT for vm2: a template / copy-and-patch baseline JIT, in the lineage of V8 Sparkplug and CPython 3.13, that lowers every vm2 opcode (current count: 40, plus the planned set and struct families) to native AArch64 and AMD64 code. Where the MEP-30 prototype proved the dispatch architecture on a 6-opcode toy bytecode (see MEP-33), this MEP scopes the work of carrying the same approach across the full Mochi container surface: Cell arithmetic, lists, strings, maps, sets, structs, calls, control flow.

The JIT is single-tier baseline by design. Tier-2 optimization (MEP-32) is deferred until the baseline ships and the measured-result MEP for this work lands. Tracing (MEP-31) is deferred per MEP-33 §Recommendations.

The benchmark plan is the gate: merge is contingent on (1) JIT correctness across the existing runtime/vm2/... test corpus, (2) measurable speedup over the vm2 interpreter on every MEP-23 workload, with no per-workload regression, and (3) head-to-head numbers against LuaJIT 2.1, Lua 5.5, and CPython 3.14 on the same corpus.

Motivation

The MEP-30 prototype, as measured in MEP-33, demonstrated that on a 6-opcode arithmetic bytecode a template JIT delivers 17x over the matching switch interpreter, matches LuaJIT, and rivals hand-written Go. The result is encouraging but narrow:

The toy bytecode has no boxing, no allocation, no shape dispatch. Every real Mochi workload has all three.
The MEP-23 corpus is dominated by list, string, map, and set workloads. A JIT that does not handle those is not a JIT for Mochi.
vm2's Cell is NaN-boxed (LuaJIT/JSC style, see runtime/vm2/cell.go:24-49). Fast paths must inline the tag-check pattern; slow paths must call back into Go for Objects table operations.

The spec exists to bound the work between "MEP-30 prototype validated" and "JIT shipped". Without it, the implementation drifts toward unbounded ambition (tier-2 optimizations, deopt protocols, tracing) before the baseline tier exists. With it, the path is mechanical: enumerate opcodes, write templates, integrate with the interpreter's frame format, measure.

Scope

In scope:

A baseline JIT (no IR, no register allocator beyond the existing vm2 register file) for every opcode currently in runtime/vm2/ops.go plus the planned set opcodes (MEP-29 §Sets) and struct opcodes (MEP-24 §5).
Two architectures from day one: darwin/arm64 (primary) and linux/amd64 (parity gate).
The frame-compatibility contract that lets the JIT and the interpreter share the same Cell-shaped register file, so a JIT'd function can call an interpreted one and vice versa without copy-conversion.
The per-Cell tag-check fast paths and the slow-path runtime callbacks for the four container subsystems.
The benchmark plan and the gates that govern merge.

Out of scope (deferred to follow-on MEPs):

Tier-2 optimizing JIT (MEP-32, funded after this MEP ships).
Tracing JIT (MEP-31, funded only if Phase-2 numbers warrant).
Inline caches in the JIT itself: the interpreter's MEP-19 and MEP-27 ICs are read by the JIT as type hints, but the JIT does not learn or update IC state in v1.
Allocation-removal, escape analysis, profile-guided inlining: all tier-2 features.
Speculative deopt: there is no tier-2 to deopt to.
Garbage collection coordination beyond the MEP-20 frame layout already preserved by the interpreter.

Background: vm2 opcode surface

From runtime/vm2/ops.go and the planned MEP-24/MEP-29 additions, the JIT must cover roughly 52 opcodes across ten categories. The counts below give current opcodes, with planned additions in parentheses.

Category	Opcodes (current + planned)	Canonical examples
Misc / control	2	`OpHalt`, `OpMove`
Arithmetic (I64)	8	`OpAddI64`, `OpAddI64K`, `OpSubI64`, `OpMulI64`, `OpDivI64`, `OpModI64`, `OpLessI64`, `OpEqualI64`
Control flow	8	`OpJump`, `OpJumpIfFalse`, `OpJumpIfLessI64`, `OpJumpIfLessEqI64`, `OpJumpIfGreaterI64`, `OpJumpIfGreaterEqI64`, `OpJumpIfEqualI64`, `OpJumpIfNotEqualI64`
Closure / call	4	`OpCall`, `OpTailCall`, `OpTailCallSelf`, `OpReturn`
Constants	1	`OpLoadConstI`
String	6	`OpLoadStrK`, `OpConcatStr`, `OpLenStr`, `OpIndexStr`, `OpEqualStr`, `OpHashStr`
List	5	`OpNewList`, `OpListLen`, `OpListGet`, `OpListSet`, `OpListPush`
Map	6	`OpNewMap`, `OpMapLen`, `OpMapGet`, `OpMapHas`, `OpMapSet`, `OpMapDel`
Set (planned, MEP-29)	0 (+6)	`OpNewSet`, `OpSetAdd`, `OpSetHas`, `OpSetDel`, `OpSetLen`, `OpSetIter`
Struct (planned, MEP-24)	0 (+6)	`OpNewStruct`, `OpStructGet`, `OpStructSet`, `OpStructTagCheck`, `OpStructLen`, `OpStructEqual`
Total	40 + ~12 = ~52

A few notes on the inventory:

The string opcodes already include both small-string-inline (the tagSStr work, MEP-19 PR4) and pointer-tagged paths. The JIT must support both branches in a single fused fast-path. See §String opcodes.
Map opcodes are currently shape-monomorphic; the MEP-29 §Maps measured-results MEP recommends keeping them so. The JIT inherits the shape.
Set and struct opcodes are still pre-spec. The JIT lands their templates after the interpreter ships them; this MEP commits to the templates being a strict mechanical extension of the map and list templates respectively.

Phase 1.5 post-mortem: why "BLR into Go" failed

The original §Architecture overview below specified that allocation-and-Go-touching opcodes (OpNewList, OpListPush, every New*, every string concat, etc.) would lower to a thin slow-path callout: emit a BLR x16 that targets a Go function pointer. The Phase 1.5 implementation tried this for the five list opcodes; the resulting code crashes inside the goroutine runtime on the very first OpNewList test, in three different ways depending on which backup strategy we tried. The root cause is structural, not a local bug, and we are documenting it so future MEP-34 work does not re-attempt the same shape.

The three failures, in the order we tripped them:

Stack-slot backup at [sp+40] — corrupted across morestack. Saved a heap pointer; the goroutine stack grew during makeslice; copystack walked back from the grown stack into our JIT frame; the JIT frame is not in pclntab, so the runtime threw "unknown caller pc" before our restore ever ran.
Callee-saved register backup in x21/x22 — overwritten by the JITNewList morestack stub, which saves R0,R1 at [JIT_SP+8],[JIT_SP+16], exactly where our STP put x21/x22.
Heap-backed scratch at VM.JITScratch addressed via x20 - crashes because x20 itself is not preserved across BLR. Disassembly of runtime.nanotime_trampoline (the libc wrapper invoked deep in any allocator path) shows it clobbers R19, R20, R21, R22, and R27 without saving them, so it does not honor the AAPCS64 callee-saved set. JITNewList does not save R19/R20 either: Go's ABIInternal has no callee-saved general-purpose registers, so a Go callee is free to clobber any register its caller was relying on.

The unifying lesson:

A JIT frame that is invisible to pclntab cannot have any Go function called from it that may itself trigger morestack, gentraceback, the GC's stack scan, the runtime profiler, or a goroutine preempt. Every Go allocator path can trigger at least one of these. Therefore the original §Slow-path callouts design — JIT body BLRs into a Go function that allocates — is unimplementable without first making the JIT frame walkable by Go's runtime, which requires hacking Go-internal data structures we have no supported way to write to.

We considered three workarounds and rejected them:

Make the JIT frame walkable. Requires registering JIT pages with runtime.moduledata and shipping fake pclntab/funcdata entries. Done in github.com/bytedance/sonic, but reverse-engineered against Go internals and fragile across point releases. Out of scope for a baseline JIT.
Trampoline-as-pclntab-bridge. Wrap each slow path in a Go assembly trampoline that is itself in pclntab, save R19/R20 in the trampoline's own frame. Solves the register-clobber issue but not the unwinder issue: any panic, preempt, or stack scan inside the slow-path body still tries to walk past the trampoline into our (invisible) JIT frame and fails.
Pre-grow the goroutine stack so morestack never fires inside JIT-called Go code. Defers the bug rather than fixing it; hostile to long-running servers; does nothing for the GC stack scan or profiler tick.

The supported pattern, as practised by every production JIT we surveyed (LuaJIT, V8 Sparkplug, CPython 3.13 copy-and-patch, PyPy), is the same: JIT code does not call into the host runtime in-place. Instead, an opcode that needs the host either (a) compiles to an allocation-free fast path inlined into the JIT, or (b) takes a side exit / deopt to the interpreter, which is a normal Go function running on a normal Go frame, runs the slow op, and either re-enters JIT at the next safepoint or finishes the function on the interpreter.

The revised spec adopts this pattern. §Deoptimization protocol below replaces the old "slow-path callout" design.

Deoptimization protocol

The deopt protocol is the single mechanism the JIT uses for every opcode it cannot lower to an allocation-free fast path. It is borrowed near-verbatim from LuaJIT's side-exit model.

Contract

A deopt point is a vm2 instruction index pc inside a JIT'd function such that the JIT was unable to lower the instruction at pc to native code. At a deopt point, the JIT emits a fixed sequence that:

Stores all live vm2 registers from their JIT-assigned host registers (x9-x15 on AArch64) back to the shared register file at [regs_ptr + i*8]. After this step, the interpreter and the JIT agree on register state.
Loads pc (an immediate, known at compile time) into the standard JIT return register (x0 on AArch64), tagged with a sentinel bit pattern that distinguishes a deopt return from an OpReturn return.
Executes the standard JIT epilogue (LDP x19,x20,[sp],#64; RET).

The Go-side wrapper that invokes the JIT (the trampoline.Call -> vm2.runJITFunction path) inspects the returned value:

If it carries the deopt sentinel, the wrapper unpacks pc, sets the current frame's IP to pc, and resumes the interpreter on the same frame.
If it carries an OpReturn value, the wrapper returns it as the function's result.

The interpreter, on resume, runs one or more opcodes through its normal switch dispatch. At the next re-entry safepoint (function entry, back-edge with non-trivial loop body, or after N interpreted opcodes), the wrapper may re-enter the JIT at the current IP if the function's compiled code covers that IP. v1 does not re-enter mid-function; once a function deopts it finishes on the interpreter. Mid-function re-entry is Phase 2.

Sentinel encoding

The vm2.Cell is a NaN-boxed 64-bit value. Deopt returns reuse the int48 tag (0xFFFC in the top 16 bits) with a never-otherwise-emitted bit set in the low 48 bits — concretely, we set bit 47, which is always zero in a sign-extended int48 fast-path result. The wrapper checks (cell >> 47) & 1 != 0 && cell.tag() == tagInt to detect a deopt, then masks bit 47 off and zero-extends the bottom 47 bits to recover pc. Functions with more than 2^47 instructions are not supported, which is consistent with every other limit in vm2.
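A minimal Go sketch of the encoding above, together with the wrapper-side check from §Contract. It assumes a 64-bit Cell with the tag in the top 16 bits, and it takes the text's premise that an ordinary OpReturn value never has bit 47 set. The names EncodeDeopt, Decode, runJIT, and the function-valued enter/resume parameters are illustrative stand-ins, not the vm2 API.

```go
package main

import "fmt"

// Cell stands in for vm2.Cell: a NaN-boxed 64-bit value (assumed layout).
type Cell uint64

const (
	tagShift = 48
	tagInt   = 0xFFFC        // int48 tag, as given in the text
	deoptBit = Cell(1) << 47 // sentinel bit inside the 48-bit payload
	pcMask   = deoptBit - 1  // low 47 bits carry the vm2 pc
)

func (c Cell) tag() uint16 { return uint16(c >> tagShift) }

// EncodeDeopt builds the value a deopt stub would leave in the return register.
func EncodeDeopt(pc int) Cell {
	return (Cell(tagInt) << tagShift) | deoptBit | (Cell(pc) & pcMask)
}

// Decode reports whether c is a deopt sentinel and, if so, which pc it names.
func Decode(c Cell) (pc int, ok bool) {
	if c.tag() != tagInt || c&deoptBit == 0 {
		return 0, false
	}
	return int(c & pcMask), true
}

// Frame is a hypothetical stand-in; only IP matters for this sketch.
type Frame struct{ IP int }

// runJIT models the Go-side wrapper: enter compiled code, then either
// return its result or resume the interpreter at the deopt pc.
func runJIT(f *Frame, enter, resume func(*Frame) Cell) Cell {
	ret := enter(f)
	pc, ok := Decode(ret)
	if !ok {
		return ret // ordinary OpReturn value
	}
	f.IP = pc
	return resume(f) // v1: the function finishes on the interpreter
}

func main() {
	f := &Frame{}
	jit := func(*Frame) Cell { return EncodeDeopt(17) }   // pretend pc 17 deopts
	interp := func(fr *Frame) Cell { return Cell(fr.IP) } // pretend interpreter result
	fmt.Println(runJIT(f, jit, interp), f.IP)
}
```

Keeping Decode as the single place that knows the bit layout means the wrapper and any fuzz test share one definition of "is this a deopt".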

Cost

The deopt sequence is N + 3 instructions on AArch64: N register spills (one per live JIT register) plus MOV of the sentinel into x0, plus the two-instruction epilogue. For a typical loop body with 5 live registers this is 8 instructions. The interpreter wrapper's deopt-vs-return branch is one Cell read, one mask, one conditional — comparable to a single interpreted opcode.

Compile-time fallback

If a function's opcode density of deopt points exceeds a threshold (provisional: 50% of executed instructions on the first 1000 calls), the JIT marks the function as "not worth compiling" and the interpreter handles all future calls. This is a Phase 2 refinement; Phase 1.5 always attempts to compile, always tolerates deopts.
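One possible reading of that threshold, sketched in Go. The text does not pin down how "density" is measured; this sketch assumes it is the share of executed instructions that ran on the interpreter after a deopt, sampled over the first 1000 calls. The type compileStats and its fields are hypothetical.

```go
package main

import "fmt"

const (
	sampleCalls     = 1000 // provisional window from the text
	maxDeoptPercent = 50   // provisional threshold from the text
)

// compileStats is a hypothetical per-function counter block.
type compileStats struct {
	calls       int
	executed    uint64 // instructions executed across the sampled calls
	interpreted uint64 // of those, instructions run interpreted after a deopt
	giveUp      bool   // true once the function is marked "not worth compiling"
}

// record accumulates one call's counts and makes the decision exactly once,
// when the sampling window closes.
func (s *compileStats) record(executed, interpreted uint64) {
	if s.calls >= sampleCalls {
		return
	}
	s.calls++
	s.executed += executed
	s.interpreted += interpreted
	if s.calls == sampleCalls && s.interpreted*100 > s.executed*maxDeoptPercent {
		s.giveUp = true
	}
}

func main() {
	var s compileStats
	for i := 0; i < sampleCalls; i++ {
		s.record(10, 6) // 60% of each call ran interpreted
	}
	fmt.Println(s.giveUp)
}
```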

What this changes

The "slow-path Go function pointer" approach in §Architecture overview is deprecated. New code does not emit BLR into Go functions from the JIT body.
The five list-opcode lowerings in runtime/jit/vm2jit/lower_arm64.go (added in the Phase 1.5 work) are reverted; they are replaced by deopt stubs.
runtime/vm2/lists.go keeps the JIT*List* Go functions (they are clean Go-callable shims) — they are now reached only from the interpreter, never from JIT code. They may eventually be inlined into the interpreter's opcode handlers.
runtime/vm2/vm.go does not need the JITScratch field anymore; the deopt model has no register backup hazard.

Specification

Architecture overview

+----------------------+    JIT compile     +-------------------+
|   vm2 bytecode for   | -----------------> | Native code page  |
|   a Function         |   (per-function)   | (mmap'd MAP_JIT)  |
+----------------------+                    +---------+---------+
        ^                                             |
        |    interpret unchanged                      | call via
        |                                             | runtime trampoline
+-------+--------------+                              v
|  runtime/vm2/eval    |    <--- jit returns into     |
|  switch dispatch     |        frame.RetReg          |
|  (fallback, calls)   |    <--- or returns a deopt   |
+----------------------+         sentinel; wrapper    |
        ^                        resumes interp here  |
        +---------------------------------------------+
                no in-place callbacks into Go;
                see §Deoptimization protocol

Compile unit: one vm2 Function at a time. The JIT walks the function's bytecode and emits one native instruction sequence per opcode, plus a function prologue (load frame pointer, materialise constant pool base) and one epilogue per OpReturn or deopt point.
Per-call dispatch: the runtime keeps both an interpreter pointer (func(*VM)) and an optional compiled-code pointer on each Function. A Call opcode that lands on a function with a non-nil compiled-code pointer enters the JIT; otherwise it enters the interpreter. Cross-tier calls are free: both sides see the same *Frame layout.
No in-place callbacks: the JIT body does not call Go functions. Opcodes that touch the allocator, the Objects heap, or any other host runtime surface emit a deopt stub (§Deoptimization protocol) instead. The wrapper that invoked the JIT runs the slow opcode on the interpreter and either re-enters or finishes interpreted. This sidesteps the structural failures documented in §Phase 1.5 post-mortem.
No cgo at runtime: every JIT'd page is entered via a pure-Go trampoline written in .s. The MEP-30 prototype's cgo wrapper is replaced before any benchmark in this MEP is published; cgo is forbidden on the JIT hot path.

Frame compatibility

The single largest correctness obligation. The contract:

Register file layout is identical between interpreter and JIT. Frame.RegsBase points at the start of NumRegs Cell-sized slots in the shared Stack. The JIT compiles to native code that addresses [frame_ptr + RegsBase*8 + reg*8] for every register read, and writes back to the same slot on every register write. There is no shadow register file.
Frame metadata fields are read-only to the JIT except for Frame.IP and Frame.RetReg. The JIT increments Frame.IP only at safepoints (back-edges, calls, allocations); within a straight-line opcode sequence the IP is undefined.
Safepoints are deterministic. The JIT inserts a one-instruction goroutine-preemption check at every back-edge and at every deopt point. The check polls an analog of runtime.gcWaitOnPreempt (the precise primitive is TBD; see Open Questions).
Cell reads are tag-aware. The JIT does not assume regs[i] is a particular type unless the bytecode came from an MEP-19-quickened typed opcode (OpAddI64 etc). For untyped opcodes the JIT emits the tag-check fast path inline.

The contract means a JIT'd function can OpCall an interpreted callee, the interpreter can OpCall a JIT'd callee, and runtime.Stack(t) panics traverse both stacks indistinguishably.

Cell fast paths

Every typed opcode (OpAddI64, OpListGet, OpMapGet, ...) decomposes into:

tag-check (1-2 instructions, branch on mismatch to slow path)
type-specific work (1-4 instructions)
write back to register (1 instruction)

The tag-check exploits the NaN-boxing layout in runtime/vm2/cell.go. For int48, the check is a single ubfx (AArch64) or shr+cmp (AMD64) against 0xFFFC; the int48 payload is then extracted by sign-extending the bottom 48 bits. For pointer tags (0xFFFF), the payload is the Objects table index, and the slot fetch is a base+scaled-index load.

The MEP-30 prototype demonstrated that an int-only loop body lowers to ~7 native instructions per iteration (post-MEP-32 peephole), with no tag check. The full vm2 JIT pays a per-untyped-op tag check, expected at ~2 extra instructions and one well-predicted branch per op. MEP-19's quickening removes the tag check for fully-typed code paths; this is why MEP-19 is on the critical path for the JIT to shine.
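The tag check and payload extraction described above can be modelled in Go as follows. This is a sketch under the stated layout (tag in the top 16 bits, 48-bit payload); the helper names are illustrative and the real definitions live in runtime/vm2/cell.go.

```go
package main

import "fmt"

// Cell stands in for vm2.Cell (assumed NaN-boxed layout).
type Cell uint64

const (
	tagInt      = 0xFFFC    // int48 tag from the text
	tagPtr      = 0xFFFF    // pointer tag from the text
	payloadMask = 1<<48 - 1 // low 48 bits
)

// boxInt48 packs a value that fits in 48 signed bits.
func boxInt48(v int64) Cell {
	return Cell(tagInt)<<48 | Cell(uint64(v)&payloadMask)
}

// asInt48 is the int fast path: one tag compare, then sign-extend bit 47
// by shifting the payload to the top and arithmetic-shifting it back down.
func asInt48(c Cell) (int64, bool) {
	if uint16(c>>48) != tagInt {
		return 0, false // tag mismatch: leave the fast path
	}
	return int64(c<<16) >> 16, true
}

// asObjectIndex is the pointer fast path: the payload is the Objects table index.
func asObjectIndex(c Cell) (uint64, bool) {
	if uint16(c>>48) != tagPtr {
		return 0, false
	}
	return uint64(c) & payloadMask, true
}

func main() {
	v, ok := asInt48(boxInt48(-5))
	fmt.Println(v, ok)
	idx, ok := asObjectIndex(Cell(tagPtr)<<48 | 3)
	fmt.Println(idx, ok)
}
```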

List opcodes

Under the revised model (§Deoptimization protocol), list opcodes split by whether they touch the Go allocator:

Opcode	Phase 1.5	Phase 2 fast path	Phase 3 (optional)
`OpNewList`	deopt to interpreter	-	systemstack alloc
`OpListLen`	deopt to interpreter	~6 instrs (tag-check ptr, load `*vmList`, load `Len`) - allocation-free, stays in JIT	-
`OpListGet`	deopt to interpreter	~10 instrs (tag-check, deref, bounds-check, indexed load) - allocation-free	-
`OpListSet`	deopt to interpreter	~8 instrs if list is fully-owned (post-MEP-26 single-writer); deopt on cap-exceeded	shared-write barrier in JIT
`OpListPush`	deopt to interpreter	-	systemstack alloc

Phase 1.5 lands all five as deopt stubs. The deopt stub is a fixed sequence: spill live JIT regs back to the register file, return the sentinel-tagged PC. The Go-side wrapper resumes the interpreter at that PC; the interpreter runs the list op against vm2.JIT*List* (the same functions that were the slow-path targets in the failed original design, now only reached from interpreter dispatch) and finishes the function on the interpreter.

Phase 2 promotes the three allocation-free ops (OpListLen, OpListGet, OpListSet in the unshared-cap-OK case) to inline fast paths. The fast path is straight-line code: tag-check, deref, indexed load. No Go calls. Bounds-check failures and shared/grow paths deopt the same way Phase 1.5 does, so the protocol is unchanged.
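As a Go-level model of what the OpListGet fast path would do, assuming the NaN-box tags quoted elsewhere in this MEP. The vmList layout and the objects slice are hypothetical stand-ins for the Objects table; a false result stands for "take the deopt stub".

```go
package main

import "fmt"

type Cell uint64

const (
	tagInt      = 0xFFFC
	tagPtr      = 0xFFFF
	payloadMask = 1<<48 - 1
)

// vmList is a hypothetical stand-in for the runtime's list object.
type vmList struct {
	Len   int
	Items []Cell
}

func boxInt(v int64) Cell    { return Cell(tagInt)<<48 | Cell(uint64(v)&payloadMask) }
func boxPtr(idx uint64) Cell { return Cell(tagPtr)<<48 | Cell(idx&payloadMask) }

// listGetFast returns (value, true) when the straight-line path succeeds.
// Any unhappy path returns (0, false), which models a deopt to the interpreter.
func listGetFast(objects []*vmList, list, index Cell) (Cell, bool) {
	// tag-check both operands
	if uint16(list>>48) != tagPtr || uint16(index>>48) != tagInt {
		return 0, false
	}
	// deref through the Objects table
	slot := uint64(list) & payloadMask
	if slot >= uint64(len(objects)) || objects[slot] == nil {
		return 0, false
	}
	l := objects[slot]
	// bounds-check the sign-extended index, then do the indexed load
	i := int64(index<<16) >> 16
	if i < 0 || i >= int64(l.Len) {
		return 0, false
	}
	return l.Items[i], true
}

func main() {
	objects := []*vmList{{Len: 3, Items: []Cell{boxInt(10), boxInt(20), boxInt(30)}}}
	v, ok := listGetFast(objects, boxPtr(0), boxInt(1))
	fmt.Println(v == boxInt(20), ok)
	_, ok = listGetFast(objects, boxPtr(0), boxInt(3))
	fmt.Println(ok)
}
```

The bounds failure deliberately does not raise an error here: the interpreter owns error reporting, so the fast path only has to get out of the way.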

Phase 3, optional, revisits in-JIT allocation only if Phase 2 benchmarks leave significant headroom. The candidate mechanism is runtime.systemstack-style switching to the g0 stack (fixed-size, never grows), which avoids the morestack / JIT-frame-not-in-pclntab hazard. Whether it is faster than deopt-and-interpret is an open question and is not committed in this MEP.

The MEP-23 lists/fill_sum workload (build a list of N ints, sum it) under Phase 1.5 becomes:

init  : OpNewList -> deopt; interpreter allocates list; function continues interpreted
loop  : OpListPush, OpAddI64, OpJumpIfLessI64 back-edge (all interpreted post-deopt)
sum   : OpListGet, OpAddI64, OpJumpIfLessI64 (all interpreted post-deopt)

Predicted Phase 1.5 speedup on lists/fill_sum: ~1.0x (parity; the function deopts on instruction 0 and runs entirely interpreted). Phase 2 lifts the sum-only phase into JIT (predicted ~1.5x for that phase). Phase 3, if pursued, predicts 2-3x by JITing both phases.

The honest read is that the JIT is not a list-allocation optimization. The interpreter already calls the same Go allocator paths with negligible dispatch overhead, and the JIT cannot beat the allocator without escape analysis (tier-2, MEP-32). The JIT's win on list-heavy code comes from JITing the arithmetic and iteration around the list ops, which is what Phase 2's read-only fast paths buy.

String opcodes

The string subsystem has two physical representations:

Inline small string (tagSStr): up to 5 bytes packed into the Cell itself. No heap object. Implemented in MEP-19 PR4.
Heap string: pointer tag, *vmString in the Objects table.

Inline strings are an excellent fit for the deopt model: every operation on them is allocation-free, so it stays in JIT. Heap strings are mostly allocation-free for reads but allocation-bound for writes; reads stay in JIT, writes deopt.

Opcode	Phase 1.5	Phase 2 fast path
`OpLoadStrK`	inline JIT (const)	inline JIT (1-2 instrs)
`OpLenStr`	deopt to interpreter	~3 (inline) / ~6 (heap deref) instrs - allocation-free, stays in JIT
`OpIndexStr`	deopt to interpreter	~6 (inline byte extract) / ~10 (heap base+offset load) - allocation-free
`OpEqualStr`	deopt to interpreter	~4 (Cell == Cell for inline pairs and interned-heap pairs) - allocation-free
`OpConcatStr`	deopt to interpreter	always deopts (heap allocation); Phase 3 may revisit with systemstack inline-string concat
`OpHashStr`	deopt to interpreter	~8 (xxhash) for inline; deopt for heap until xxhash3-inline lands

OpEqualStr exploits the vm2 interner (MEP-19 PR2) so that even heap-string equality reduces to a Cell == Cell compare for the common case of constant or short-string operands. The interesting workloads stay in JIT once Phase 2 lifts them; only concat and the heap-string hash path remain deopt-bound.
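A conservative Go sketch of that compare. It assumes inline strings are packed canonically (equal bytes give equal Cell bits) and leaves every mixed or heap-only case with differing bits to the interpreter. The tagSStr value used here is a placeholder, since the real constant is not given in this MEP.

```go
package main

import "fmt"

type Cell uint64

const (
	tagSStr = 0xFFFD // placeholder value for the inline small-string tag
	tagPtr  = 0xFFFF // pointer tag from the text
)

type verdict int

const (
	equal verdict = iota
	notEqual
	needDeopt
)

func tagOf(c Cell) uint16 { return uint16(c >> 48) }

// equalStrFast decides what it can from the Cell bits alone.
func equalStrFast(a, b Cell) verdict {
	if a == b {
		return equal // identical inline bits, or the same interned heap object
	}
	if tagOf(a) == tagSStr && tagOf(b) == tagSStr {
		return notEqual // two different inline payloads
	}
	return needDeopt // heap operand with differing bits: let the interpreter decide
}

func main() {
	s1 := Cell(tagSStr)<<48 | 0x6968
	s2 := Cell(tagSStr)<<48 | 0x6F68
	heap := Cell(tagPtr)<<48 | 12
	fmt.Println(equalStrFast(s1, s1), equalStrFast(s1, s2), equalStrFast(s1, heap))
}
```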

Phase 1.5 prediction: ~1.0x on strings/concat_loop (all opcodes deopt; one deopt per loop iteration leaves the function fully interpreted after the first concat). Phase 2 prediction: 1.5-2x on strings/equal_loop, ~1.0x on strings/concat_loop. Phase 3, if pursued, predicts 2-3x on strings/concat_loop via systemstack inline-only concat.

Map opcodes

Maps are open-addressed shape-monomorphic (MEP-29 §Maps). Two key types in v1: int48 and interned-string. Allocation-bound opcodes (OpNewMap, OpMapSet on growth, OpMapDel) deopt; read paths and steady-state writes stay in JIT once Phase 2 lands.

Opcode	Phase 1.5	Phase 2 fast path
`OpNewMap`	deopt to interpreter	-
`OpMapLen`	deopt to interpreter	~6 (deref `*vmMap`, load `Len`) - allocation-free, stays in JIT
`OpMapGet`	deopt to interpreter	~12 (hash, probe loop with 1 unroll, int48 key); collision-overflow path deopts
`OpMapHas`	deopt to interpreter	~10 (same probe, return boolean)
`OpMapSet`	deopt to interpreter	~14 in steady state (probe + write, no growth); growth and rehash deopt
`OpMapDel`	deopt to interpreter	-

The probe loop is unrolled exactly once. The interpreter walks the probe loop in a fast Go loop (MEP-29 measurement: 5-9 ns/op); the JIT's unroll-once advantage is small for hot lookups and zero for misses.
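A Go model of the OpMapGet fast path for int48 keys. It reads "unrolled exactly once" as "the home slot plus one follow-up probe, then deopt", which is an interpretation rather than something the text states. The slot layout, the multiplicative hash, and the absence of tombstones are all simplifying assumptions.

```go
package main

import "fmt"

type Cell uint64

// slot and vmMap are hypothetical layouts for an open-addressed table.
type slot struct {
	used bool
	key  int64
	val  Cell
}

type vmMap struct {
	slots []slot // non-empty, length is a power of two
}

type probeResult int

const (
	hit probeResult = iota
	miss
	needDeopt
)

// home returns the first slot to probe for key (illustrative hash only).
func home(m *vmMap, key int64) uint64 {
	h := uint64(key) * 0x9E3779B97F4A7C15
	return (h >> 32) & uint64(len(m.slots)-1)
}

// mapGetFast probes the home slot and one neighbour, with no loop.
func mapGetFast(m *vmMap, key int64) (Cell, probeResult) {
	mask := uint64(len(m.slots) - 1)
	i := home(m, key)

	s := &m.slots[i]
	if !s.used {
		return 0, miss
	}
	if s.key == key {
		return s.val, hit
	}

	s = &m.slots[(i+1)&mask] // the single unrolled follow-up probe
	if !s.used {
		return 0, miss
	}
	if s.key == key {
		return s.val, hit
	}
	return 0, needDeopt // longer collision chain: hand over to the interpreter
}

func main() {
	m := &vmMap{slots: make([]slot, 8)}
	m.slots[home(m, 7)] = slot{used: true, key: 7, val: 99}
	v, r := mapGetFast(m, 7)
	fmt.Println(v, r == hit)
}
```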

Phase 1.5 prediction: ~1.0x on maps/fill_probe (fill phase deopts on OpNewMap then runs the loop interpreted). Phase 2 prediction: 1.2-1.5x on maps/keys (read-only iteration stays in JIT); fill-and-grow workloads stay near parity since the allocator dominates.

Set opcodes

Sets reuse the map's open-addressing infrastructure (MEP-29 §Sets). The JIT templates are identical to map templates with the value-slot writes elided, and the deopt boundaries are the same: allocation deopts, steady-state membership and iteration stay in JIT post-Phase-2. The set spec lands first (interpreter); this MEP commits the JIT templates as a mechanical derivation.

Phase 1.5 prediction: parity. Phase 2 prediction: same as maps (~1.2-1.5x on read-heavy workloads).

Struct opcodes

Structs are flat tuples of Cells with a shape tag (MEP-24 §5). The shape tag is a uint32 cached on the struct header. Structs are the friendliest collection type for the JIT: shape is fixed at allocation, so reads and writes are pure pointer arithmetic with no growth path.

Opcode	Phase 1.5	Phase 2 fast path
`OpNewStruct`	deopt to interpreter	-
`OpStructGet`	deopt to interpreter	~7 instrs (tag-check, deref, indexed load) - allocation-free
`OpStructSet`	deopt to interpreter	~6 instrs (no GC barrier) / ~10 (with barrier on shared-heap write)
`OpStructTagCheck`	deopt to interpreter	~4 instrs
`OpStructLen`	deopt to interpreter	constant from shape, ~2 instrs
`OpStructEqual`	deopt to interpreter	~12 instrs (shape-check + memcmp for ≤32 bytes); larger or nested deopts

Phase 1.5 prediction: parity (any OpNewStruct deopts the function). Phase 2 prediction: 2-3x on structs/fill_field and 1.5-2x on structs/equal_loop once allocation can be done up-front and the loop body stays in JIT.

Calls and returns

Calls touch vm.Frames (an append-grown slice) and vm.Stack (a make-grown slice). Both can reallocate, both reside on the Go heap, both need GC visibility. Under the deopt protocol both OpCall and OpTailCall deopt in Phase 1.5; only the intra-function self-tail and the function's own OpReturn stay in JIT.

Opcode	Phase 1.5	Phase 2 fast path
`OpCall`	deopt to interpreter	-
`OpTailCall`	deopt to interpreter	-
`OpTailCallSelf`	inline JIT branch	inline JIT branch (~3 instrs, no frame manipulation)
`OpReturn`	inline JIT epilogue	inline JIT epilogue (~3 instrs, returns sentinel-or-value to wrapper)

Phase 3 may revisit OpCall to a JIT'd callee by emitting a JIT-to-JIT call using the same stack-allocated jitFrame layout the trampoline uses; this avoids the vm.Frames append entirely on the call path. Whether it pays for itself depends on whether enough call sites land on stably-JIT-compiled callees. Out of scope for this MEP.

OpTailCallSelf is the JIT's flagship call shape: recursive functions written as tail recursion (the recommended Mochi idiom for loops over disjoint cases) compile to a single in-function branch and never touch vm.Frames. This is what makes arith/fib_rec-style benchmarks competitive with iterative Go.

Arithmetic and control flow

These are the MEP-30 prototype's home territory. Templates are short (1-4 instructions each) and the only new wrinkle is the comparison-and-branch fusion: OpJumpIfLessI64 is a single cmp; b.lt on AArch64 and cmp; jl on AMD64. The MEP-30 prototype emitted cmp; csinc; cbnz (three instructions) because it had no fused branch opcode; vm2 already has the fused opcodes, so the JIT is strictly simpler here than the prototype was.
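For concreteness, the fused AArch64 pair can be encoded with two small Go helpers. The encodings are the standard A64 ones (CMP is the SUBS XZR alias; B.cond with condition LT); the function names and the register choice in main are illustrative and are not taken from emit_arm64.go.

```go
package main

import "fmt"

// cmpX encodes CMP Xn, Xm, i.e. SUBS XZR, Xn, Xm (64-bit, shifted-register form).
func cmpX(rn, rm uint32) uint32 {
	return 0xEB00001F | rm<<16 | rn<<5
}

// bLT encodes B.LT with a signed byte offset relative to the branch itself.
// The offset must be a multiple of 4 and fit the 19-bit word field.
func bLT(byteOffset int32) uint32 {
	imm19 := uint32(byteOffset/4) & 0x7FFFF
	return 0x54000000 | imm19<<5 | 0xB // 0xB is the LT condition code
}

// lowerJumpIfLessI64 emits the two-instruction template for the fused opcode.
// byteOffset is measured from the B.LT instruction, not from the CMP.
func lowerJumpIfLessI64(ra, rb uint32, byteOffset int32) [2]uint32 {
	return [2]uint32{cmpX(ra, rb), bLT(byteOffset)}
}

func main() {
	code := lowerJumpIfLessI64(9, 10, -8) // cmp x9, x10 ; b.lt two instructions back
	fmt.Printf("%08x %08x\n", code[0], code[1])
}
```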

Backend split: AArch64 + AMD64

Two backends, written in parallel, share one mid-level opcode-to-template table that maps each (arch, opcode) pair to a template shape:

runtime/jit/vm2jit/
  arch/
    arm64/      asm encoders, per-opcode lowerings, deopt stub emitter
    amd64/      same, with Linux-flavored mmap and W^X
  templates/    arch-agnostic per-vm2-opcode lowerings (one file per category)
  deopt/        sentinel encoding/decoding, interpreter-resume wrapper
  trampoline/   pure-Go `.s` trampolines for cross-tier calls
  compile.go    the per-Function compiler driver
  cache.go      the executable code cache (shared mmap pool)

Note that there is no runtime/ subpackage of Go slow-path shims in the revised design. Opcode-specific runtime work happens in runtime/vm2/ (the interpreter), reached only via deopt-and-resume, never via direct call from JIT code.

The AArch64 backend reuses the MEP-30 encoders (runtime/jit/tmpljit/emit_arm64.go). The AMD64 backend writes its own encoders; the encoding table for the opcode subset the JIT needs is ~40 entries, ~200 lines of Go.

Trampoline (cgo replacement)

The MEP-30 prototype calls JIT'd code via cgo. The production JIT must not. The MEP-30 spec (§6.2) specifies a pure-Go .s trampoline. This MEP commits the trampoline as a hard merge gate: no benchmark in §Benchmark plan is published with cgo. The trampoline shape is the standard "call assembly with a fixed-prototype function pointer" pattern; see Go's own runtime/asm_arm64.s for the precedent.

Engineering phases

The revised phasing reflects the deopt model. Phase 1 (arithmetic, control flow, Move, Return) is already merged; per-loop speedup of 5-8x measured (§Appendix A). The remaining phases:

Phase 1.5: deopt protocol + universal opcode coverage (estimated 1-2 engineer-weeks)

Implement the deopt sentinel encoding in runtime/vm2/cell.go and a deopt.Decode(Cell) (pc int, ok bool) helper.
Implement the interpreter-resume wrapper in runtime/jit/vm2jit/: read the JIT return, branch on sentinel, set frame.IP = pc and call vm.runInterp(frame) to finish the function.
Implement a deopt-stub emitter in lower_arm64.go: for any unsupported opcode, spill live JIT regs to the register file and return the sentinel-tagged PC.
Wire every non-arithmetic, non-control-flow, non-Move, non-Return opcode to the deopt stub. The function may still be compiled (and benefit from JIT'd arithmetic up to the first deopt point), but lists/strings/maps/sets/structs/calls all deopt.
Benchmark gate: arithmetic loops stay at 5-8x; all MEP-23 list/string/map/set/struct/call workloads stay within 5% of interpreter baseline (i.e., the deopt round-trip cost is small enough to be invisible).

Phase 2: allocation-free fast paths (estimated 4-6 engineer-weeks)

Lift the allocation-free opcodes (OpListLen, OpListGet, OpListSet no-grow, OpLenStr, OpIndexStr, OpEqualStr, OpMapLen, OpMapGet, OpMapHas, OpStructGet, OpStructSet, OpStructLen, OpStructTagCheck, OpStructEqual) to inline JIT fast paths. Each fast path is straight-line code with deopt on any unhappy path (bounds-fail, growth-needed, type-mismatch).
Set opcodes ride along (mechanical from maps).
Benchmark gate: lists/sum, strings/equal_loop, maps/keys, structs/fill_field reach 1.5x or better vs interpreter; allocation-bound workloads stay near parity.

Phase 3 (optional): in-JIT allocation (estimated 4-8 engineer-weeks)

Investigate runtime.systemstack-style allocation from JIT code for the small-allocation opcodes (OpNewList with small cap, OpConcatStr for inline result, OpNewStruct).
Investigate JIT-to-JIT direct call to avoid vm.Frames append on the hot path.
Decision gate: each candidate ships only if its measured speedup vs Phase 2 deopt is at least 1.3x on the relevant workload. Otherwise the deopt path is fast enough and we don't add code.

Total (Phase 1.5 + Phase 2): ~5-8 engineer-weeks, ~2 KLOC of Go, on top of merged Phase 1. Phase 3 is unscoped and conditional.

Benchmark plan

Three benchmark groups, all run on the same hardware in the same session, all reported with five-sample medians and benchstat-style variance:

Group A: vm2 interpreter head-to-head

For every MEP-23 workload, three numbers:

vm2 interpreter (current main).
vm2 + this JIT.
The ratio.

Merge gate: ratio (interpreter time over JIT time) >= 1.0 on every workload, with at least 1.5x on the loop-dominated subset (fill_sum, concat_loop, fill_probe, fib_iter).
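A sketch of how the Group A gate could be checked mechanically, assuming the ratio is defined as median interpreter time divided by median JIT time (so values below 1.0 are regressions) and that "no per-workload regression" is the floor for every workload. The workload type and its fields are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// workload holds raw ns/op samples for one MEP-23 benchmark (hypothetical shape).
type workload struct {
	Name          string
	InterpNs      []float64
	JITNs         []float64
	LoopDominated bool
}

// median returns the middle sample without modifying xs; 0 for an empty slice.
func median(xs []float64) float64 {
	s := append([]float64(nil), xs...)
	sort.Float64s(s)
	n := len(s)
	if n == 0 {
		return 0
	}
	if n%2 == 1 {
		return s[n/2]
	}
	return (s[n/2-1] + s[n/2]) / 2
}

// speedup is interpreter time over JIT time; it assumes non-empty JIT samples.
func speedup(w workload) float64 { return median(w.InterpNs) / median(w.JITNs) }

// gateFailures lists every workload that violates the merge gate.
func gateFailures(ws []workload) []string {
	var out []string
	for _, w := range ws {
		r := speedup(w)
		switch {
		case r < 1.0:
			out = append(out, w.Name+": regression")
		case w.LoopDominated && r < 1.5:
			out = append(out, w.Name+": below 1.5x")
		}
	}
	return out
}

func main() {
	ws := []workload{
		{Name: "fib_iter", InterpNs: []float64{300, 310, 290, 305, 300}, JITNs: []float64{100, 99, 101, 100, 102}, LoopDominated: true},
		{Name: "maps/keys", InterpNs: []float64{100, 100, 100, 100, 100}, JITNs: []float64{100, 100, 100, 100, 100}},
	}
	fmt.Println(len(gateFailures(ws)))
}
```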

Group B: cross-language

For each Group A workload that has a published port:

Language	Implementation
Mochi	vm2 interpreter + this JIT
Lua	Stock Lua 5.5 (loadable binary on macOS)
LuaJIT	LuaJIT 2.1 (master)
Python	CPython 3.14 (default build, tier-1 specializing interpreter)
Go (reference)	Hand-translated, for the theoretical floor

Reportable: the ratio table, the workload-level analysis, the threats-to-validity section. Not gated on any specific ratio; ship the numbers.

Group C: per-opcode microbenchmarks

For every category in §Background, one microbench per opcode hot path. Reports ns/op for:

The interpreter handler.
The JIT'd fast path.
The JIT'd slow path.

Used internally only, not in the public results MEP. The role is to flag regressions during development (the per-opcode fast path should not regress between phases) and to give the tier-2 MEP its baseline numbers.

Reporting MEP

The numbers land in a new Informational MEP, analogous to MEP-33. Provisional number: MEP-35. Full vm2-opcode JIT - Measured Results. The MEP-35 draft is written alongside Phase 3 and merged simultaneously with the JIT flip-default change in vm2.

Risks

Deopt frequency dominates speedup on allocation-heavy workloads. A loop that allocates every iteration spends most of its time in the interpreter, with the JIT contributing only the arithmetic between deopt points. The honest read is that the JIT does not speed up such loops; it lands at parity. Mitigation: explicit per-workload predictions in §List/String/Map/Set/Struct opcodes. The merge gate is "no regression", not "speedup on every workload".
The deopt protocol itself becomes the bug surface. Sentinel encoding, live-reg spill list, IP synchronization, and re-entry safepoints have to agree between JIT and interpreter. Mitigation: keep the protocol minimal (one sentinel encoding, one spill convention, no mid-function re-entry in v1); add a deopt_test.go that fuzzes JIT-deopt-interpret-finish for every opcode.
W^X policy differences between darwin and linux. macOS arm64 requires pthread_jit_write_protect_np; linux/amd64 instead uses mprotect to flip the page between PROT_READ|PROT_WRITE and PROT_READ|PROT_EXEC around each write. The MEP-30 prototype handles macOS; the production JIT must abstract this into runtime/jit/vm2jit/arch/{arm64,amd64}/page.go.
Frame-format drift between interpreter and JIT. The contract in §Frame compatibility is the most subtle correctness hazard, especially across deopt boundaries where the interpreter picks up state the JIT just spilled. Mitigation: a single Go struct (type Frame struct { ... }) shared between both sides; any field reorder must update both lowerings or fail a compile-time assertion.
MEP-19 quickening coverage drives JIT win size. The JIT's tag-check fast path is good but pays 2 instructions per untyped op. Workloads the quickening pass misses look slower than expected. Mitigation: instrument the JIT to report per-call quickening coverage in the MEP-35 results; if coverage is below 80% on the corpus, file follow-on MEP-19 work.
Goroutine preemption check granularity. Too frequent costs the JIT its loop-tightness advantage; too rare risks scheduler stalls. Under the deopt model the simplest answer is "every deopt point is a safepoint" plus a per-back-edge check; measure in MEP-35, tune if needed.
Single-binary distribution. The MEP-30 spec required this; this MEP inherits the constraint. The pure-Go .s trampoline is the only acceptable cross-tier call mechanism; cgo is forbidden after the prototypes.
Phase 3 chases diminishing returns. Once read paths are in JIT, the remaining headroom is allocation, which requires runtime.systemstack tricks or escape analysis (a tier-2 problem). Phase 3 is explicitly conditional on measured headroom, not a deliverable.

Open questions

Where does the goroutine preemption check live, and at what granularity? Go's morestack check is not exposed and is unsafe for JIT code anyway (see §Phase 1.5 post-mortem). The deopt model gives an easy answer: every deopt point is implicitly a safepoint, since the wrapper resumes the interpreter, which honors preemption normally. The open question is whether we also need a per-back-edge atomic-load preemption check for tight all-JIT loops, or whether deopt-on-overflow at a slow path is sufficient.
Re-entry strategy after deopt. v1 does not re-enter the JIT mid-function (once deopted, the function finishes interpreted). Phase 2 may add re-entry at function-entry safepoints; full mid-function re-entry would require a JIT entry table keyed by PC, which is the same machinery a tier-2 deopt-into-tier-1 reverse path needs. Defer.
JIT cache lifetime across long-running processes. Per-Function compiled-code pointer is the simplest answer, lifetime tied to the Function. Cross-Function code sharing for common templates is appealing but unscoped; defer to Phase 3 or later.
Should OpNewList with a known small constant cap deopt, or compile? The JIT could materialize a small list inline using stack-allocated backing storage (since lists are escape-analysis-friendly when scope is bounded), but this requires teaching the JIT about lifetime. Out of scope for Phase 1.5/2; revisit in Phase 3 with measured headroom.
AMD64 register pressure. AArch64 exposes 31 GPRs (x0-x30) before platform and Go runtime reservations; AMD64 exposes 16, of which roughly 14 are usable. The JIT's per-opcode templates compile cleanly on both, but a future tier-2 MEP-32 will need an actual register allocator for AMD64. The deopt spill list is shorter on AMD64 (fewer regs to spill) so the deopt-stub size is similar. Spec is fine; allocator is post-this-MEP.
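The per-back-edge check raised in the first open question can be modelled in plain Go to make its shape concrete. This is a sketch only: `preemptRequested` is a hypothetical flag standing in for whatever signal the runtime integration ends up exposing, and `sumLoop` models a JIT-compiled loop rather than emitting machine code.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// preemptRequested is a hypothetical scheduler-owned flag.
var preemptRequested atomic.Bool

// sumLoop models a JIT-compiled counting loop. After each iteration (the
// back-edge) it performs one atomic load. If the flag is set, it returns the
// live loop state so the interpreter can resume from it.
func sumLoop(i, n, acc int64) (nextI, nextAcc int64, deopted bool) {
	for i < n {
		acc += i
		i++
		if preemptRequested.Load() { // back-edge safepoint check
			return i, acc, true
		}
	}
	return i, acc, false
}

func main() {
	preemptRequested.Store(true)
	i, acc, deopted := sumLoop(0, 5, 0)
	fmt.Println(i, acc, deopted) // state handed back at the first back-edge

	preemptRequested.Store(false)
	i, acc, deopted = sumLoop(i, 5, acc)
	fmt.Println(i, acc, deopted) // loop finishes from the saved state
}
```

The cost under discussion is the single load and branch per iteration; whether that is acceptable for tight loops is exactly what the measurement in MEP-35 is meant to settle.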

Comparison with the three JIT options

The three options (MEP-30, MEP-31, MEP-32) are about strategy: template vs tracing vs tiered. This MEP is about coverage: how the chosen strategy (template, per MEP-30 + MEP-33) is carried across the full vm2 opcode surface.

Aspect	This MEP (MEP-34)	MEP-30	MEP-31	MEP-32
Scope	full vm2 opcode set	6-opcode toy	hot loops (recorded)	full vm2 + tier-2 opt
Tier	tier 1 only	tier 1 (prototype)	tier 2 (orthogonal)	tier 1 + tier 2
Coverage	inline arith/control/move/return; deopt for allocating ops; Phase 2 adds read-only fast paths	arith only	depends on workload	this MEP + IR optimizations
Engineering (remaining)	5-8 weeks (Phase 1.5 + Phase 2)	done (afternoon)	12+ months	18+ months
Predicted speedup	5-8x on arith (measured); 1.5-2x on read-heavy post-Phase-2; parity on alloc-heavy	17x on toy bytecode	5-15x on loops, abort risk	4-8x broad

Predicted MEP-23 numbers

Predictions split by phase. Phase 1 numbers are measured (§Appendix A). Phase 1.5 numbers assume the deopt round-trip is ~20-30 ns; Phase 2 numbers assume inline read-path templates execute at MEP-30-prototype-comparable speed. All numbers are ns/op at N=1024 on Apple M4, predicted unless marked measured:

Workload	vm2 (now)	+ Phase 1.5	+ Phase 2	LuaJIT 2.1	Phase 2 / LuaJIT
`arith/fib_iter`	3500	700 (meas)	700	600	1.17x
`lists/fill_sum`	6000	6000	5000	1200	4.17x
`lists/sum`	2200	2200	600	700	0.86x
`strings/concat_loop`	8000	8000	8000	2500	3.20x
`strings/equal_loop`	1800	1800	800	400	2.00x
`maps/fill_probe`	11000	11000	11000	4000	2.75x
`maps/keys`	3500	3500	2000	1800	1.11x
`sets/fill_probe`	11500	11500	11500	-	-
`structs/fill_field`	2400	2400	1000	-	-
`structs/equal_loop`	1300	1300	600	-	-

The honest read: Phase 1.5 ships parity-or-faster on every workload (arithmetic improves, others stay flat). Phase 2 brings the read-heavy workloads into 0.85-2x of LuaJIT. The allocation-bound workloads (fill_sum, fill_probe, concat_loop) stay near interpreter speed without tier-2; that is the honest limit of a non-IR JIT.

The CPython 3.14 ratios will be in the 8-25x range across the board, consistent with MEP-33.

If Phase 2 numbers fall short of these by more than 30%, the failure modes and remediations should be the explicit subject of the MEP-35 reporting and may motivate funding MEP-32 sooner than planned.
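The last column of the table and the 30% threshold are simple arithmetic, sketched here for two rows. This is an illustration under one assumption: "falls short by more than 30%" is read as "measured ns/op exceeds the predicted ns/op by more than 30%". The `prediction` type and helper names are hypothetical; the predicted values are copied from the table above.

```go
package main

import "fmt"

// prediction holds one row of the table. luajit is 0 when no port exists.
type prediction struct {
	workload string
	phase2   float64 // predicted ns/op after Phase 2
	luajit   float64 // LuaJIT 2.1 ns/op
}

// vsLuaJIT returns the "Phase 2 / LuaJIT" ratio, or false if there is no port.
func vsLuaJIT(p prediction) (float64, bool) {
	if p.luajit == 0 {
		return 0, false
	}
	return p.phase2 / p.luajit, true
}

// fallsShort reports whether a measured time misses the prediction by more
// than 30% (assumed reading of the threshold).
func fallsShort(predicted, measured float64) bool {
	return measured > predicted*1.30
}

func main() {
	rows := []prediction{
		{"lists/sum", 600, 700},
		{"strings/equal_loop", 800, 400},
		{"structs/fill_field", 1000, 0},
	}
	for _, p := range rows {
		if r, ok := vsLuaJIT(p); ok {
			fmt.Printf("%s: %.2fx of LuaJIT\n", p.workload, r)
		} else {
			fmt.Printf("%s: no LuaJIT port\n", p.workload)
		}
	}
}
```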

Related work

MEP-23, the cross-language benchmark methodology this MEP's gates use.
MEP-24, the vm2 subsystem spec defining the list/string/map/set/struct shapes the JIT must handle.
MEP-19, MEP-27, the IC infrastructure whose type feedback the JIT reads but does not write.
MEP-29, the dispatch-strategy measured results that motivate keeping vm2 monomorphic, which this JIT inherits.
MEP-30, the template / copy-and-patch baseline JIT strategy this MEP carries to full opcode coverage.
MEP-31, the tracing JIT alternative; remains deferred.
MEP-32, the tier-2 optimizing JIT; funded after this MEP ships and MEP-35 lands.
MEP-33, the MEP-30 prototype's measured results that calibrate this MEP's predictions.

Files to add (provisional)

Files already merged from Phase 1 are shown in italics. Files added by Phase 1.5 are bold. Phase 2 adds the remaining per-category template files.

runtime/jit/vm2jit/doc.go, package overview.
runtime/jit/vm2jit/compile.go, per-Function compiler driver.
runtime/jit/vm2jit/cache.go, executable code cache.
runtime/jit/vm2jit/lower_arm64.go, AArch64 lowering for arithmetic/control/move/return.
runtime/jit/vm2jit/trampoline/{trampoline_arm64.s,trampoline.go}, pure-Go cross-tier call.
runtime/jit/vm2jit/deopt/{sentinel.go,resume.go}, sentinel encoding + interpreter-resume wrapper.
runtime/jit/vm2jit/lower_deopt_arm64.go, deopt stub emitter (one stub shape, parameterised by spill list and PC).
runtime/jit/vm2jit/lower_arm64_lists.go etc., Phase 2 per-category inline templates (one file per category).
runtime/jit/vm2jit/arch/amd64/{encode.go,page.go,lower.go}, AMD64 backend (Phase 2 or later).
runtime/jit/vm2jit/vm2jit_test.go plus per-category test files, plus deopt_test.go fuzzing JIT-deopt-interpret-finish.
runtime/jit/vm2jit/bench/, microbenches and MEP-23 corpus harness.
website/docs/mep/mep-0035.md, the measured-results MEP (lands with Phase 2).

Appendix A: Phase 1 Measured Results

Hardware: Apple M4, darwin/arm64, Go 1.24. Command: go test -bench=. -benchtime=5s -count=5 ./runtime/jit/vm2jit/. All numbers are 5-sample medians. The trampoline is CGo-based in Phase 1 (pure-Go .s is Phase 1.5); CGo boundary cost is ~25-30 ns per call and is visible in the per-opcode microbenchmarks but is amortized across loop iterations in the loop benchmarks.

A.1 fib_iter (iterative Fibonacci)

Loop body: OpJumpIfGreaterEqI64, OpAddI64, two OpMove, OpAddI64K, OpJump (6 opcodes per iteration).

Benchmark	JIT (ns/op)	Interp (ns/op)	Speedup
fib_iter N=20	35	205	5.9x
fib_iter N=100	120	978	8.1x

Speedup grows with N because the ~30 ns call overhead is amortized. JIT per-iteration cost: ~1.75 ns at N=20 (35/20), ~1.2 ns at N=100 (120/100). Interpreter per-iteration cost: ~10 ns, consistent with MEP-29 dispatch measurements. At large N the steady-state ratio approaches ~8x.

A.2 sum_n (integer accumulation loop)

Loop body: OpAddI64, OpAddI64K, OpJumpIfLessI64 (3 opcodes per iteration).

Benchmark	JIT (ns/op)	Interp (ns/op)	Speedup
sum_n N=100	130	542	4.2x
sum_n N=1k	1149	5819	5.1x
sum_n N=10k	11714	62286	5.3x

JIT steady-state cost: ~1.15 ns/iteration. Speedup converges to ~5x, consistent with the interpreter's ~6 ns per iteration (about 2 ns per opcode across the 3-opcode loop body) against the JIT's ~1.15 ns. Both fib_iter and sum_n exceed the Phase 1 gate of ≥1.5x.

A.3 Per-opcode microbenchmarks

One-shot calls: prologue + one opcode + epilogue, called once per benchmark iteration through the CGo trampoline. The dominant cost (~25-30 ns) is the trampoline boundary; the JIT opcode itself costs under 5 ns.

Opcode	Total (ns/op)
OpAdd (I64)	32
OpMul (I64)	31
OpDiv (I64)	30
OpMod (I64)	32
OpLess (I64)	30

All five cluster at 30-32 ns, confirming flat per-opcode cost. The arithmetic opcodes are bounded by the NaN-box unpack/repack sequence (sbfx + and + op + and + movz + orr = 6 AArch64 instructions), not the operation itself.
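The unpack/op/repack shape can be written out in Go to show where the instructions go. This is a sketch under loud assumptions: the 48-bit payload and the `tagInt` value are hypothetical and are not taken from runtime/vm2/cell.go, and the mapping of each Go expression to a specific AArch64 instruction is approximate.

```go
package main

import "fmt"

const (
	payloadMask uint64 = 1<<48 - 1    // low 48 bits carry the integer (assumed)
	tagInt      uint64 = 0xFFFE << 48 // hypothetical integer tag in the top 16 bits
)

// unboxInt sign-extends the 48-bit payload to int64, the job sbfx does in
// one instruction.
func unboxInt(c uint64) int64 {
	return int64(c<<16) >> 16
}

// boxInt masks the value to 48 bits and ORs in the tag (and + movz + orr).
// Values outside the 48-bit range wrap here; a real fast path would need an
// overflow check or a slow path.
func boxInt(v int64) uint64 {
	return tagInt | (uint64(v) & payloadMask)
}

// addI64 is the whole fast path: unbox both operands, add, rebox.
func addI64(a, b uint64) uint64 {
	return boxInt(unboxInt(a) + unboxInt(b))
}

func main() {
	sum := addI64(boxInt(2), boxInt(-5))
	fmt.Printf("cell=%#x value=%d\n", sum, unboxInt(sum))
}
```

Only one of the operations in `addI64` is the addition itself; the rest is tag handling, which is why the arithmetic opcodes measure so close to one another.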

A.4 Phase 1 scope, the failed Phase 1.5 attempt, and the pivot

Phase 1 covers arithmetic, control flow, Move, and Return. The pure-Go .s trampoline shipped during Phase 1, replacing the prototype's CGo wrapper.

An initial Phase 1.5 attempt added inline lowerings for the five list opcodes that called the corresponding runtime/vm2.JIT*List* Go functions from JIT code via BLR. This attempt is documented in §Phase 1.5 post-mortem; briefly, it crashed in three independent ways (R19 clobber by nanotime_trampoline, JIT frame absence from pclntab, and morestack-induced spill corruption) and the design is structurally unfixable in Go. The work was reverted.

The revised Phase 1.5 design, captured in §Deoptimization protocol, replaces in-place Go callouts with a deopt-and-resume mechanism. The arithmetic loop-body speedup numbers above are unaffected, as they never depended on calling into Go from JIT code.

The Phase 1 results of 5-8x on arithmetic loops exceed the predicted 3-6x range in §Predicted MEP-23 numbers and clear the ≥1.5x Phase 1 benchmark gate with substantial margin.
