MEP 34. VM2 Full-Opcode JIT - Lists, Strings, Maps, Sets, Structs
| Field | Value |
|---|---|
| MEP | 34 |
| Title | VM2 Full-Opcode JIT - Lists, Strings, Maps, Sets, Structs |
| Author | Mochi core |
| Status | Active |
| Type | Standards Track |
| Created | 2026-05-17 |
Abstract
This MEP specifies the production JIT for vm2: a template / copy-and-patch baseline JIT, in the lineage of V8 Sparkplug and CPython 3.13, that lowers every vm2 opcode (current count: 40, plus the planned set and struct families) to native AArch64 and AMD64 code. Where the MEP-30 prototype proved the dispatch architecture on a 6-opcode toy bytecode (see MEP-33), this MEP scopes the work of carrying the same approach across the full Mochi container surface: Cell arithmetic, lists, strings, maps, sets, structs, calls, control flow.
The JIT is single-tier baseline by design. Tier-2 optimization (MEP-32) is deferred until the baseline ships and the measured-result MEP for this work lands. Tracing (MEP-31) is deferred per MEP-33 §Recommendations.
The benchmark plan is the gate: merge is contingent on (1) JIT correctness across the existing runtime/vm2/... test corpus, (2) measurable speedup over the vm2 interpreter on every MEP-23 workload, with no per-workload regression, and (3) head-to-head numbers against LuaJIT 2.1, Lua 5.5, and CPython 3.14 on the same corpus.
Motivation
The MEP-33 prototype demonstrated that on a 6-opcode arithmetic bytecode, a template JIT delivers 17x over the matching switch interpreter, matches LuaJIT, and rivals hand-written Go. The result is encouraging but narrow:
- The toy bytecode has no boxing, no allocation, no shape dispatch. Every real Mochi workload has all three.
- The MEP-23 corpus is dominated by list, string, map, and set workloads. A JIT that does not handle those is not a JIT for Mochi.
- vm2's Cell is NaN-boxed (LuaJIT/JSC style, see runtime/vm2/cell.go:24-49). Fast paths must inline the tag-check pattern; slow paths must call back into Go for Objects table operations.
The spec exists to bound the work between "MEP-30 prototype validated" and "JIT shipped". Without it, the implementation drifts toward unbounded ambition (tier-2 optimizations, deopt protocols, tracing) before the baseline tier exists. With it, the path is mechanical: enumerate opcodes, write templates, integrate with the interpreter's frame format, measure.
Scope
In scope:
- A baseline JIT (no IR, no register allocator beyond the existing vm2 register file) for every opcode currently in runtime/vm2/ops.go plus the planned set opcodes (MEP-29 §Sets) and struct opcodes (MEP-24 §5).
- Two architectures from day one: darwin/arm64 (primary) and linux/amd64 (parity gate).
- The frame-compatibility contract that lets the JIT and the interpreter share the same Cell-shaped register file, so a JIT'd function can call an interpreted one and vice versa without copy-conversion.
- The per-Cell tag-check fast paths and the slow-path runtime callbacks for the four container subsystems.
- The benchmark plan and the gates that govern merge.
Out of scope (deferred to follow-on MEPs):
- Tier-2 optimizing JIT (MEP-32, funded after this MEP ships).
- Tracing JIT (MEP-31, funded only if Phase-2 numbers warrant).
- Inline caches in the JIT itself: the interpreter's MEP-19 and MEP-27 ICs are read by the JIT as type hints, but the JIT does not learn or update IC state in v1.
- Allocation-removal, escape analysis, profile-guided inlining: all tier-2 features.
- Speculative deopt: there is no tier-2 to deopt to.
- Garbage collection coordination beyond the MEP-20 frame layout already preserved by the interpreter.
Background: vm2 opcode surface
From runtime/vm2/ops.go and the planned MEP-24/MEP-29 additions, the JIT must cover roughly 52 opcodes across ten categories. Counts below are current, with planned additions in parentheses.
| Category | Opcodes (current + planned) | Canonical examples |
|---|---|---|
| Misc / control | 2 | OpHalt, OpMove |
| Arithmetic (I64) | 8 | OpAddI64, OpAddI64K, OpSubI64, OpMulI64, OpDivI64, OpModI64, OpLessI64, OpEqualI64 |
| Control flow | 8 | OpJump, OpJumpIfFalse, OpJumpIfLessI64, OpJumpIfLessEqI64, OpJumpIfGreaterI64, OpJumpIfGreaterEqI64, OpJumpIfEqualI64, OpJumpIfNotEqualI64 |
| Closure / call | 4 | OpCall, OpTailCall, OpTailCallSelf, OpReturn |
| Constants | 1 | OpLoadConstI |
| String | 6 | OpLoadStrK, OpConcatStr, OpLenStr, OpIndexStr, OpEqualStr, OpHashStr |
| List | 5 | OpNewList, OpListLen, OpListGet, OpListSet, OpListPush |
| Map | 6 | OpNewMap, OpMapLen, OpMapGet, OpMapHas, OpMapSet, OpMapDel |
| Set (planned, MEP-29) | 0 (+6) | OpNewSet, OpSetAdd, OpSetHas, OpSetDel, OpSetLen, OpSetIter |
| Struct (planned, MEP-24) | 0 (+6) | OpNewStruct, OpStructGet, OpStructSet, OpStructTagCheck, OpStructLen, OpStructEqual |
| Total | 40 + ~12 = ~52 | |
A few notes on the inventory:
- The string opcodes already include both small-string-inline (the tagSStr work, MEP-19 PR4) and pointer-tagged paths. The JIT must support both branches in a single fused fast-path. See §String opcodes.
- Map opcodes are currently shape-monomorphic; the MEP-29 §Maps measured-results MEP recommends keeping them so. The JIT inherits the shape.
- Set and struct opcodes are still pre-spec. The JIT lands their templates after the interpreter ships them; this MEP commits to the templates being a strict mechanical extension of the map and list templates respectively.
Phase 1.5 post-mortem: why "BLR into Go" failed
The original §Architecture overview below specified that allocation-and-Go-touching opcodes (OpNewList, OpListPush, every New*, every string concat, etc.) would lower to a thin slow-path callout: emit a BLR x16 that targets a Go function pointer. The Phase 1.5 implementation tried this for the five list opcodes; the resulting code crashes inside the goroutine runtime on the very first OpNewList test, in three different ways depending on which backup strategy we tried. The root cause is structural, not a local bug, and we are documenting it so future MEP-34 work does not re-attempt the same shape.
The three failures, in the order we tripped them:
- Stack-slot backup at [sp+40]: corrupted across morestack. Saved a heap pointer; the goroutine stack grew during makeslice; copystack walked back from the grown stack into our JIT frame; the JIT frame is not in pclntab, so the runtime threw "unknown caller pc" before our restore ever ran.
- Callee-saved register backup in x21/x22: overwritten by the JITNewList morestack stub, which saves R0,R1 at [JIT_SP+8],[JIT_SP+16], exactly where our STP put x21/x22.
- Heap-backed scratch at VM.JITScratch addressed via x20: crashes because x20 itself is not preserved across BLR. Disassembly of runtime.nanotime_trampoline (the libc wrapper invoked deep in any allocator path) shows it clobbers R19, R20, R21, R22, and R27 without saving them. It is a non-conformant ABI shim. JITNewList does not save R19/R20 either: Go's ABIInternal has no callee-saved general-purpose registers, so a Go function is never obliged to preserve them for its caller.
The unifying lesson:
A JIT frame that is invisible to pclntab cannot have any Go function called from it that may itself trigger morestack, gentraceback, the GC's stack scan, the runtime profiler, or a goroutine preempt. Every Go allocator path can trigger at least one of these. Therefore the original §Slow-path callouts design (JIT body BLRs into a Go function that allocates) is unimplementable without first making the JIT frame walkable by Go's runtime, which requires hacking Go-internal data structures we have no supported way to write to.
We considered three workarounds and rejected them:
- Make the JIT frame walkable. Requires registering JIT pages with runtime.moduledata and shipping fake pclntab/funcdata entries. Done in github.com/bytedance/sonic, but reverse-engineered against Go internals and fragile across point releases. Out of scope for a baseline JIT.
- Trampoline-as-pclntab-bridge. Wrap each slow path in a Go assembly trampoline that is itself in pclntab, save R19/R20 in the trampoline's own frame. Solves the register-clobber issue but not the unwinder issue: any panic, preempt, or stack scan inside the slow-path body still tries to walk past the trampoline into our (invisible) JIT frame and fails.
- Pre-grow the goroutine stack so morestack never fires inside JIT-called Go code. Defers the bug rather than fixing it; hostile to long-running servers; does nothing for the GC stack scan or profiler tick.
The supported pattern, as practised by every production JIT we surveyed (LuaJIT, V8 Sparkplug, CPython 3.13 copy-and-patch, PyPy), is the same: JIT code does not call into the host runtime in-place. Instead, an opcode that needs the host either (a) compiles to an allocation-free fast path inlined into the JIT, or (b) takes a side exit / deopt to the interpreter, which is a normal Go function running on a normal Go frame, runs the slow op, and either re-enters JIT at the next safepoint or finishes the function on the interpreter.
The revised spec adopts this pattern. §Deoptimization protocol below replaces the old "slow-path callout" design.
Deoptimization protocol
The deopt protocol is the single mechanism the JIT uses for every opcode it cannot lower to an allocation-free fast path. It is borrowed near-verbatim from LuaJIT's side-exit model.
Contract
A deopt point is a vm2 instruction index pc inside a JIT'd function such that the JIT was unable to lower the instruction at pc to native code. At a deopt point, the JIT emits a fixed sequence that:
- Stores all live vm2 registers from their JIT-assigned host registers (x9-x15 on AArch64) back to the shared register file at [regs_ptr + i*8]. After this step, the interpreter and the JIT agree on register state.
- Loads pc (an immediate, known at compile time) into the standard JIT return register (x0 on AArch64), tagged with a sentinel bit pattern that distinguishes a deopt return from an OpReturn return.
- Executes the standard JIT epilogue (LDP x19,x20,[sp],#64; RET).
The Go-side wrapper that invokes the JIT (the trampoline.Call -> vm2.runJITFunction path) inspects the returned value:
- If it carries the deopt sentinel, the wrapper unpacks pc, sets the current frame's IP to pc, and resumes the interpreter on the same frame.
- If it carries an OpReturn value, the wrapper returns it as the function's result.
The interpreter, on resume, runs one or more opcodes through its normal switch dispatch. At the next re-entry safepoint (function entry, back-edge with non-trivial loop body, or after N interpreted opcodes), the wrapper may re-enter the JIT at the current IP if the function's compiled code covers that IP. v1 does not re-enter mid-function; once a function deopts it finishes on the interpreter. Mid-function re-entry is Phase 2.
Sentinel encoding
The vm2.Cell is a NaN-boxed 64-bit value. Deopt returns reuse the int48 tag (0xFFFC in the top 16 bits) with a never-otherwise-emitted bit set in the low 48 bits — concretely, we set bit 47, which is always zero in a sign-extended int48 fast-path result. The wrapper checks (cell >> 47) & 1 != 0 && cell.tag() == tagInt to detect a deopt, then masks bit 47 off and zero-extends the bottom 47 bits to recover pc. Functions with more than 2^47 instructions are not supported, which is consistent with every other limit in vm2.
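The encoding above can be sketched in a few lines of Go. This is a minimal illustration, not the vm2 implementation: `Cell` here is a local stand-in for `vm2.Cell`, and `EncodeDeopt` / `DecodeDeopt` are hypothetical names (the decode signature is modelled on the `deopt.Decode(Cell) (pc int, ok bool)` helper planned in §Engineering phases). It relies on the stated assumption that a legitimate int48 return never has bit 47 set.

```go
package main

import "fmt"

// Cell is a stand-in for vm2.Cell: a 64-bit NaN-boxed value.
type Cell uint64

const (
	tagInt   Cell = 0xFFFC       // int48 tag in the top 16 bits (from the text)
	deoptBit Cell = 1 << 47      // sentinel bit (from the text)
	pcMask   Cell = deoptBit - 1 // low 47 bits carry the pc
)

// EncodeDeopt builds the sentinel a deopt stub would return for pc.
// Hypothetical helper name; pc must fit in 47 bits.
func EncodeDeopt(pc int) Cell {
	return tagInt<<48 | deoptBit | Cell(pc)&pcMask
}

// DecodeDeopt reports whether c is a deopt sentinel and recovers pc if so.
func DecodeDeopt(c Cell) (pc int, ok bool) {
	if c>>48 != tagInt || c&deoptBit == 0 {
		return 0, false
	}
	return int(c & pcMask), true
}

func main() {
	ret := EncodeDeopt(42)
	if pc, ok := DecodeDeopt(ret); ok {
		fmt.Printf("deopt: resume interpreter at pc=%d\n", pc)
	} else {
		fmt.Println("normal OpReturn value")
	}
}
```

The wrapper's deopt-vs-return branch is the `if` in `main`: one tag compare, one bit test, one mask.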
Cost
The deopt sequence is N + 3 instructions on AArch64: N register spills (one per live JIT register) plus MOV of the sentinel into x0, plus the two-instruction epilogue. For a typical loop body with 5 live registers this is 8 instructions. The interpreter wrapper's deopt-vs-return branch is one Cell read, one mask, one conditional — comparable to a single interpreted opcode.
Compile-time fallback
If a function's opcode density of deopt points exceeds a threshold (provisional: 50% of executed instructions on the first 1000 calls), the JIT marks the function as "not worth compiling" and the interpreter handles all future calls. This is a Phase 2 refinement; Phase 1.5 always attempts to compile, always tolerates deopts.
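One way the provisional threshold could be tracked is sketched below. Everything here is an assumption for illustration: the `funcProfile` type, its field names, and the reading of "50% of executed instructions" as the share of vm2 instructions that ran on the interpreter after a deopt during the first 1000 calls.

```go
package main

import "fmt"

// funcProfile is a hypothetical per-function counter block; the field names
// are illustrative and do not come from vm2.
type funcProfile struct {
	calls        int
	jitInstrs    uint64 // vm2 instructions retired in JIT code
	interpInstrs uint64 // vm2 instructions retired on the interpreter after a deopt
	skipJIT      bool   // set once the function is judged "not worth compiling"
}

const (
	warmupCalls   = 1000 // provisional window from the text
	maxDeoptShare = 0.5  // provisional threshold from the text
)

// record accumulates one call's counts and makes the decision exactly once,
// when the warm-up window closes.
func (p *funcProfile) record(jit, interp uint64) {
	if p.calls >= warmupCalls {
		return
	}
	p.calls++
	p.jitInstrs += jit
	p.interpInstrs += interp
	if p.calls == warmupCalls {
		total := p.jitInstrs + p.interpInstrs
		p.skipJIT = total > 0 && float64(p.interpInstrs)/float64(total) > maxDeoptShare
	}
}

func main() {
	var p funcProfile
	for i := 0; i < warmupCalls; i++ {
		p.record(10, 90) // 90% of the work happens after a deopt
	}
	fmt.Println("skip JIT:", p.skipJIT)
}
```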
What this changes
- The "slow-path Go function pointer" approach in §Architecture overview is deprecated. New code does not emit BLR into Go functions from the JIT body.
- The five list-opcode lowerings in runtime/jit/vm2jit/lower_arm64.go (added in the Phase 1.5 work) are reverted; they are replaced by deopt stubs.
- runtime/vm2/lists.go keeps the JIT*List* Go functions (they are clean Go-callable shims); they are now reached only from the interpreter, never from JIT code. They may eventually be inlined into the interpreter's opcode handlers.
- runtime/vm2/vm.go does not need the JITScratch field anymore; the deopt model has no register backup hazard.
Specification
Architecture overview
    +----------------------+    JIT compile     +-------------------+
    | vm2 bytecode for     | -----------------> | Native code page  |
    | a Function           |   (per-function)   | (mmap'd MAP_JIT)  |
    +----------------------+                    +---------+---------+
            ^                                             |
            | interpret unchanged                         | call via
            |                                             | runtime trampoline
    +-------+--------------+                              v
    | runtime/vm2/eval     | <--- jit returns into        |
    | switch dispatch      |      frame.RetReg            |
    | (fallback, calls)    | <--- or returns a deopt      |
    +----------------------+      sentinel; wrapper       |
            ^                     resumes interp here     |
            +---------------------------------------------+
                  no in-place callbacks into Go;
                  see §Deoptimization protocol
- Compile unit: one vm2 Function at a time. The JIT walks the function's bytecode and emits one native instruction sequence per opcode, plus a function prologue (load frame pointer, materialise constant pool base) and one epilogue per OpReturn or deopt point.
- Per-call dispatch: the runtime keeps both an interpreter pointer (func(*VM)) and an optional compiled-code pointer on each Function. A Call opcode that lands on a function with a non-nil compiled-code pointer enters the JIT; otherwise it enters the interpreter. Cross-tier calls are free: both sides see the same *Frame layout.
- No in-place callbacks: the JIT body does not call Go functions. Opcodes that touch the allocator, the Objects heap, or any other host runtime surface emit a deopt stub (§Deoptimization protocol) instead. The wrapper that invoked the JIT runs the slow opcode on the interpreter and either re-enters or finishes interpreted. This sidesteps the structural failures documented in §Phase 1.5 post-mortem.
- No cgo at runtime: every JIT'd page is entered via a pure-Go trampoline written in .s. The MEP-30 prototype's cgo wrapper is replaced before any benchmark in this MEP is published; cgo is forbidden on the JIT hot path.
Frame compatibility
The single largest correctness obligation. The contract:
- Register file layout is identical between interpreter and JIT. Frame.RegsBase points at the start of NumRegs Cell-sized slots in the shared Stack. The JIT compiles to native code that addresses [frame_ptr + RegsBase*8 + reg*8] for every register read, and writes back to the same slot on every register write. There is no shadow register file.
- Frame metadata fields are read-only to the JIT except for Frame.IP and Frame.RetReg. The JIT increments Frame.IP only at safepoints (back-edges, calls, allocations); within a straight-line opcode sequence the IP is undefined.
- Safepoints are deterministic. The JIT inserts a one-instruction goroutine-preemption check at every back-edge and at every slow-path callout. The check polls a runtime.gcWaitOnPreempt analog (the precise primitive is TBD; see Open Questions).
- Cell reads are tag-aware. The JIT does not assume regs[i] is a particular type unless the bytecode came from an MEP-19-quickened typed opcode (OpAddI64 etc). For untyped opcodes the JIT emits the tag-check fast path inline.
The contract means a JIT'd function can OpCall an interpreted callee, the interpreter can OpCall a JIT'd callee, and runtime.Stack(t) panics traverse both stacks indistinguishably.
Cell fast paths
Every typed opcode (OpAddI64, OpListGet, OpMapGet, ...) decomposes into:
tag-check (1-2 instructions, branch on mismatch to slow path)
type-specific work (1-4 instructions)
write back to register (1 instruction)
The tag-check exploits the NaN-boxing layout in runtime/vm2/cell.go. For int48, the check is a single ubfx (AArch64) or shr+cmp (AMD64) against 0xFFFC; the int48 payload is then extracted by sign-extending the bottom 48 bits. For pointer tags (0xFFFF), the payload is the Objects table index, and the slot fetch is a base+scaled-index load.
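The int48 check and payload extraction can be written out in Go as a reference for what the native template computes. This is a sketch under assumptions: `boxInt`, `unboxInt`, and `addUntyped` are hypothetical names, and results outside the int48 range simply wrap because overflow handling is not specified in this section.

```go
package main

import "fmt"

type Cell uint64

const (
	tagInt      = 0xFFFC // int48 tag (from the text)
	payloadMask = 1<<48 - 1
)

// boxInt packs v into an int48 Cell, keeping only the low 48 bits.
func boxInt(v int64) Cell {
	return Cell(tagInt)<<48 | Cell(uint64(v)&payloadMask)
}

// unboxInt is the tag-check fast path: compare the top 16 bits, then
// sign-extend the low 48 bits with a shift-left / arithmetic-shift-right pair.
func unboxInt(c Cell) (v int64, ok bool) {
	if c>>48 != tagInt {
		return 0, false // caller takes the slow path
	}
	return int64(c<<16) >> 16, true
}

// addUntyped models an untyped add: two tag checks, the add, one write-back.
// Overflow past int48 wraps in this sketch (an assumption).
func addUntyped(a, b Cell) (Cell, bool) {
	x, okX := unboxInt(a)
	y, okY := unboxInt(b)
	if !okX || !okY {
		return 0, false
	}
	return boxInt(x + y), true
}

func main() {
	sum, ok := addUntyped(boxInt(-2), boxInt(44))
	v, _ := unboxInt(sum)
	fmt.Println(v, ok)
}
```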
The MEP-30 prototype demonstrated that an int-only loop body lowers to ~7 native instructions per iteration (post-MEP-32 peephole), with no tag check. The full vm2 JIT pays a per-untyped-op tag check, expected at ~2 extra instructions and one well-predicted branch per op. MEP-19's quickening removes the tag check for fully-typed code paths; this is why MEP-19 is on the critical path for the JIT to shine.
List opcodes
Under the revised model (§Deoptimization protocol), list opcodes split by whether they touch the Go allocator:
| Opcode | Phase 1.5 | Phase 2 fast path | Phase 3 (optional) |
|---|---|---|---|
| OpNewList | deopt to interpreter | - | systemstack alloc |
| OpListLen | deopt to interpreter | ~6 instrs (tag-check ptr, load *vmList, load Len) - allocation-free, stays in JIT | - |
| OpListGet | deopt to interpreter | ~10 instrs (tag-check, deref, bounds-check, indexed load) - allocation-free | - |
| OpListSet | deopt to interpreter | ~8 instrs if list is fully-owned (post-MEP-26 single-writer); deopt on cap-exceeded | shared-write barrier in JIT |
| OpListPush | deopt to interpreter | - | systemstack alloc |
Phase 1.5 lands all five as deopt stubs. The deopt stub is a fixed sequence: spill live JIT regs back to the register file, return the sentinel-tagged PC. The Go-side wrapper resumes the interpreter at that PC; the interpreter runs the list op against vm2.JIT*List* (the same functions that were the slow-path targets in the failed original design, now only reached from interpreter dispatch) and finishes the function on the interpreter.
Phase 2 promotes the three allocation-free ops (OpListLen, OpListGet, OpListSet in the unshared-cap-OK case) to inline fast paths. The fast path is straight-line code: tag-check, deref, indexed load. No Go calls. Bounds-check failures and shared/grow paths deopt the same way Phase 1.5 does, so the protocol is unchanged.
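The shape of the OpListGet fast path (tag-check, deref, bounds-check, indexed load, deopt on any failure) can be modelled in Go. This is a sketch, not the vm2 code: `vmList`, `objects`, and `listGet` are simplified stand-ins, and a `false` result stands for "emit the deopt sentinel".

```go
package main

import "fmt"

type Cell uint64

const (
	tagInt      = 0xFFFC // int48 tag (from the text)
	tagPtr      = 0xFFFF // pointer tag (from the text)
	payloadMask = 1<<48 - 1
)

// vmList and objects are hypothetical stand-ins for *vmList and the Objects table.
type vmList struct{ items []Cell }
type objects []*vmList

// listGet returns (value, true) on the fast path and (0, false) wherever the
// JIT would deopt to the interpreter instead.
func listGet(objs objects, list, idx Cell) (Cell, bool) {
	if list>>48 != tagPtr || idx>>48 != tagInt {
		return 0, false // type mismatch
	}
	slot := uint64(list) & payloadMask
	if slot >= uint64(len(objs)) || objs[slot] == nil {
		return 0, false // not a live object
	}
	i := int64(idx<<16) >> 16 // sign-extend the int48 payload
	l := objs[slot]
	if i < 0 || i >= int64(len(l.items)) {
		return 0, false // bounds failure
	}
	return l.items[i], true
}

func main() {
	objs := objects{{items: []Cell{10, 20, 30}}}
	list := Cell(tagPtr) << 48 // Objects index 0
	idx := Cell(tagInt)<<48 | 1
	fmt.Println(listGet(objs, list, idx))
}
```

Every exit with `false` is the same deopt stub, which is why the protocol does not change between Phase 1.5 and Phase 2.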
Phase 3, optional, revisits in-JIT allocation only if Phase 2 benchmarks leave significant headroom. The candidate mechanism is runtime.systemstack-style switching to the g0 stack (fixed-size, never grows), which avoids the morestack / JIT-frame-not-in-pclntab hazard. Whether it is faster than deopt-and-interpret is an open question and is not committed in this MEP.
The MEP-23 lists/fill_sum workload (build a list of N ints, sum it) under Phase 1.5 becomes:
init : OpNewList -> deopt; interpreter allocates list; function continues interpreted
loop : OpListPush, OpAddI64, OpJumpIfLessI64 back-edge (all interpreted post-deopt)
sum : OpListGet, OpAddI64, OpJumpIfLessI64 (all interpreted post-deopt)
Predicted Phase 1.5 speedup on lists/fill_sum: ~1.0x (parity; the function deopts on instruction 0 and runs entirely interpreted). Phase 2 lifts the sum-only phase into JIT (predicted ~1.5x for that phase). Phase 3, if pursued, predicts 2-3x by JITing both phases.
The honest read is that the JIT is not a list-allocation optimization. The interpreter already calls the same Go allocator paths with negligible dispatch overhead, and the JIT cannot beat the allocator without escape analysis (tier-2, MEP-32). The JIT's win on list-heavy code comes from JITing the arithmetic and iteration around the list ops, which is what Phase 2's read-only fast paths buy.
String opcodes
The string subsystem has two physical representations:
- Inline small string (tagSStr): up to 5 bytes packed into the Cell itself. No heap object. Implemented in MEP-19 PR4.
- Heap string: pointer tag, *vmString in the Objects table.
Inline strings are an excellent fit for the deopt model: every operation on them is allocation-free, so it stays in JIT. Heap strings are mostly allocation-free for reads but allocation-bound for writes; reads stay in JIT, writes deopt.
| Opcode | Phase 1.5 | Phase 2 fast path |
|---|---|---|
| OpLoadStrK | inline JIT (const) | inline JIT (1-2 instrs) |
| OpLenStr | deopt to interpreter | ~3 (inline) / ~6 (heap deref) instrs - allocation-free, stays in JIT |
| OpIndexStr | deopt to interpreter | ~6 (inline byte extract) / ~10 (heap base+offset load) - allocation-free |
| OpEqualStr | deopt to interpreter | ~4 (Cell == Cell for inline pairs and interned-heap pairs) - allocation-free |
| OpConcatStr | deopt to interpreter | always deopts (heap allocation); Phase 3 may revisit with systemstack inline-string concat |
| OpHashStr | deopt to interpreter | ~8 (xxhash) for inline; deopt for heap until xxhash3-inline lands |
OpEqualStr exploits the vm2 interner (MEP-19 PR2) so that even heap-string equality reduces to a Cell == Cell compare for the common case of constant or short-string operands. The interesting workloads stay in JIT once Phase 2 lifts them; only concat and the heap-string hash path remain deopt-bound.
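To make the "allocation-free" claim for inline strings concrete, here is a sketch of one possible packing. The tag value `0xFFFD` and the payload layout (length in bits 40-47, bytes in bits 0-39) are assumptions chosen for illustration; the real tagSStr layout lives in the MEP-19 PR4 work and may differ.

```go
package main

import "fmt"

type Cell uint64

// tagSStr's numeric value is an assumption for this sketch.
const tagSStr = 0xFFFD

// packSStr packs up to 5 bytes into a Cell: length in bits 40-47,
// byte i in bits 8*i .. 8*i+7. Longer strings need the heap representation.
func packSStr(s string) (Cell, bool) {
	if len(s) > 5 {
		return 0, false
	}
	c := Cell(tagSStr)<<48 | Cell(len(s))<<40
	for i := 0; i < len(s); i++ {
		c |= Cell(s[i]) << (8 * i)
	}
	return c, true
}

// sstrLen is a shift and a mask, mirroring the OpLenStr inline path.
func sstrLen(c Cell) int { return int(c>>40) & 0xFF }

// sstrIndex extracts one byte, mirroring the OpIndexStr inline path.
func sstrIndex(c Cell, i int) (byte, bool) {
	if i < 0 || i >= sstrLen(c) {
		return 0, false
	}
	return byte(c >> (8 * i)), true
}

func main() {
	a, _ := packSStr("mochi")
	b, _ := packSStr("mochi")
	ch, _ := sstrIndex(a, 1)
	// Equality of two inline strings is a plain Cell comparison.
	fmt.Println(a == b, sstrLen(a), string(ch))
}
```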
Phase 1.5 prediction: ~1.0x on strings/concat_loop (all opcodes deopt; one deopt per loop iteration leaves the function fully interpreted after the first concat). Phase 2 prediction: 1.5-2x on strings/equal_loop, ~1.0x on strings/concat_loop. Phase 3, if pursued, predicts 2-3x on strings/concat_loop via systemstack inline-only concat.
Map opcodes
Maps are open-addressed shape-monomorphic (MEP-29 §Maps). Two key types in v1: int48 and interned-string. Allocation-bound opcodes (OpNewMap, OpMapSet on growth, OpMapDel) deopt; read paths and steady-state writes stay in JIT once Phase 2 lands.
| Opcode | Phase 1.5 | Phase 2 fast path |
|---|---|---|
| OpNewMap | deopt to interpreter | - |
| OpMapLen | deopt to interpreter | ~6 (deref *vmMap, load Len) - allocation-free, stays in JIT |
| OpMapGet | deopt to interpreter | ~12 (hash, probe loop with 1 unroll, int48 key); collision-overflow path deopts |
| OpMapHas | deopt to interpreter | ~10 (same probe, return boolean) |
| OpMapSet | deopt to interpreter | ~14 in steady state (probe + write, no growth); growth and rehash deopt |
| OpMapDel | deopt to interpreter | - |
The probe loop is unrolled exactly once. The interpreter walks the probe loop in a fast Go loop (MEP-29 measurement: 5-9 ns/op); the JIT's unroll-once advantage is small for hot lookups and zero for misses.
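The split between an unrolled JIT probe and the interpreter's full probe loop can be sketched as follows. The sketch makes several assumptions that are not from MEP-29: "unrolled once" is read as "home slot plus one neighbour", the table uses linear probing without tombstones, and the hash is a placeholder multiplicative hash. All type and function names are hypothetical.

```go
package main

import "fmt"

type Cell uint64

type entry struct {
	key  int64
	val  Cell
	used bool
}

// vmMap is a hypothetical open-addressed table; len(slots) must be a
// non-zero power of two.
type vmMap struct{ slots []entry }

// home picks the starting slot with a placeholder hash (not vm2's).
func home(m *vmMap, k int64) uint64 {
	return (uint64(k) * 0x9E3779B97F4A7C15) & uint64(len(m.slots)-1)
}

// put is the interpreter-side path: an unbounded linear probe.
// It assumes the table always has a free slot.
func put(m *vmMap, k int64, v Cell) {
	mask := uint64(len(m.slots) - 1)
	for i := home(m, k); ; i = (i + 1) & mask {
		if e := &m.slots[i]; !e.used || e.key == k {
			*e = entry{key: k, val: v, used: true}
			return
		}
	}
}

// get is the JIT-side path: two straight-line probes, no loop.
// ok == false means "collision overflow, deopt to the interpreter".
func get(m *vmMap, k int64) (val Cell, found, ok bool) {
	mask := uint64(len(m.slots) - 1)
	i := home(m, k)
	e := &m.slots[i]
	if !e.used {
		return 0, false, true // definite miss
	}
	if e.key == k {
		return e.val, true, true
	}
	e = &m.slots[(i+1)&mask]
	if !e.used {
		return 0, false, true
	}
	if e.key == k {
		return e.val, true, true
	}
	return 0, false, false
}

func main() {
	m := &vmMap{slots: make([]entry, 8)}
	put(m, 0, 100)
	put(m, 8, 200) // collides with key 0 under the placeholder hash
	fmt.Println(get(m, 8))
}
```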
Phase 1.5 prediction: ~1.0x on maps/fill_probe (fill phase deopts on OpNewMap then runs the loop interpreted). Phase 2 prediction: 1.2-1.5x on maps/keys (read-only iteration stays in JIT); fill-and-grow workloads stay near parity since the allocator dominates.
Set opcodes
Sets reuse the map's open-addressing infrastructure (MEP-29 §Sets). The JIT templates are identical to map templates with the value-slot writes elided, and the deopt boundaries are the same: allocation deopts, steady-state membership and iteration stay in JIT post-Phase-2. The set spec lands first (interpreter); this MEP commits the JIT templates as a mechanical derivation.
Phase 1.5 prediction: parity. Phase 2 prediction: same as maps (~1.2-1.5x on read-heavy workloads).
Struct opcodes
Structs are flat tuples of Cells with a shape tag (MEP-24 §5). The shape tag is a uint32 cached on the struct header. Structs are the friendliest collection type for the JIT: shape is fixed at allocation, so reads and writes are pure pointer arithmetic with no growth path.
| Opcode | Phase 1.5 | Phase 2 fast path |
|---|---|---|
| OpNewStruct | deopt to interpreter | - |
| OpStructGet | deopt to interpreter | ~7 instrs (tag-check, deref, indexed load) - allocation-free |
| OpStructSet | deopt to interpreter | ~6 instrs (no GC barrier) / ~10 (with barrier on shared-heap write) |
| OpStructTagCheck | deopt to interpreter | ~4 instrs |
| OpStructLen | deopt to interpreter | constant from shape, ~2 instrs |
| OpStructEqual | deopt to interpreter | ~12 instrs (shape-check + memcmp for ≤32 bytes); larger or nested deopts |
Phase 1.5 prediction: parity (any OpNewStruct deopts the function). Phase 2 prediction: 2-3x on structs/fill_field and 1.5-2x on structs/equal_loop once allocation can be done up-front and the loop body stays in JIT.
Calls and returns
Calls touch vm.Frames (an append-grown slice) and vm.Stack (a make-grown slice). Both can reallocate, both reside on the Go heap, both need GC visibility. Under the deopt protocol all of OpCall / OpTailCall deopt in Phase 1.5; only the intra-function self-tail and the function's own OpReturn stay in JIT.
| Opcode | Phase 1.5 | Phase 2 fast path |
|---|---|---|
| OpCall | deopt to interpreter | - |
| OpTailCall | deopt to interpreter | - |
| OpTailCallSelf | inline JIT branch | inline JIT branch (~3 instrs, no frame manipulation) |
| OpReturn | inline JIT epilogue | inline JIT epilogue (~3 instrs, returns sentinel-or-value to wrapper) |
Phase 3 may revisit OpCall to a JIT'd callee by emitting a JIT-to-JIT call using the same stack-allocated jitFrame layout the trampoline uses; this avoids the vm.Frames append entirely on the call path. Whether it pays for itself depends on whether enough call sites land on stably-JIT-compiled callees. Out of scope for this MEP.
OpTailCallSelf is the JIT's flagship call shape: recursive functions written as tail recursion (the recommended Mochi idiom for loops over disjoint cases) compile to a single in-function branch and never touch vm.Frames. This is what makes arith/fib_rec-style benchmarks competitive with iterative Go.
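The effect of lowering a self tail call to a branch can be shown with an analogy in Go. `sumRec` is written in the tail-recursive style; `sumJIT` is a hand-written picture of what the lowered code does (overwrite the argument registers, jump back to the entry). Both function names are illustrative and nothing here is generated by the JIT.

```go
package main

import "fmt"

// sumRec is written in tail-recursive style: the recursive call is the
// last thing the function does.
func sumRec(n, acc int64) int64 {
	if n == 0 {
		return acc
	}
	return sumRec(n-1, acc+n) // self tail call
}

// sumJIT shows the shape a self tail call lowers to: no new frame,
// just an in-place argument update and a branch to the function entry.
func sumJIT(n, acc int64) int64 {
entry:
	if n == 0 {
		return acc
	}
	n, acc = n-1, acc+n // overwrite the argument registers in place
	goto entry          // single in-function branch
}

func main() {
	fmt.Println(sumRec(10, 0), sumJIT(10, 0))
}
```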
Arithmetic and control flow
These are the MEP-30 prototype's home territory. Templates are short (1-4 instructions each) and the only new wrinkle is the comparison-and-branch fusion: OpJumpIfLessI64 is a single cmp; b.lt on AArch64 and cmp; jl on AMD64. The MEP-30 prototype emitted cmp; csinc; cbnz (three instructions) because it had no fused branch opcode; vm2 already has the fused opcodes, so the JIT is strictly simpler here than the prototype was.
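As an illustration of how small the fused template is on AArch64, the two instructions can be encoded directly. The helper names `cmpX`, `bLT`, and `lowerJumpIfLess` are hypothetical and are not the MEP-30 encoders; the bit patterns follow the standard A64 encodings for CMP (shifted register, 64-bit) and B.cond.

```go
package main

import "fmt"

// cmpX encodes "CMP Xn, Xm" (an alias of SUBS XZR, Xn, Xm) for 64-bit registers.
func cmpX(rn, rm uint32) uint32 {
	return 0xEB00001F | rm<<16 | rn<<5
}

// bLT encodes "B.LT target", where off is the byte offset from the branch
// instruction to the target (a multiple of 4, within +/-1 MiB).
func bLT(off int32) uint32 {
	imm19 := uint32(off/4) & 0x7FFFF
	return 0x54000000 | imm19<<5 | 0xB // 0xB is the LT condition code
}

// lowerJumpIfLess emits the two-instruction compare-and-branch template.
// off is measured from the B.LT instruction, not from the CMP.
func lowerJumpIfLess(rn, rm uint32, off int32) [2]uint32 {
	return [2]uint32{cmpX(rn, rm), bLT(off)}
}

func main() {
	// Loop back-edge: compare x9 with x10, branch 16 bytes backwards if less.
	code := lowerJumpIfLess(9, 10, -16)
	fmt.Printf("%08x %08x\n", code[0], code[1])
}
```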
Backend split: AArch64 + AMD64
Two backends, written in parallel, sharing one mid-level opcode-to-template table keyed by (arch, opcode) -> template:
    runtime/jit/vm2jit/
      arch/
        arm64/        asm encoders, per-opcode lowerings, deopt stub emitter
        amd64/        same, with Linux-flavored mmap and W^X
      templates/      arch-agnostic per-vm2-opcode lowerings (one file per category)
      deopt/          sentinel encoding/decoding, interpreter-resume wrapper
      trampoline/     pure-Go `.s` trampolines for cross-tier calls
      compile.go      the per-Function compiler driver
      cache.go        the executable code cache (shared mmap pool)
Note that there is no runtime/ subpackage of Go slow-path shims in the revised design. Opcode-specific runtime work happens in runtime/vm2/ (the interpreter), reached only via deopt-and-resume, never via direct call from JIT code.
The AArch64 backend reuses the MEP-30 encoders (runtime/jit/tmpljit/emit_arm64.go). The AMD64 backend writes its own encoders; the encoding table for the opcode subset the JIT needs is ~40 entries, ~200 lines of Go.
Trampoline (cgo replacement)
The MEP-30 prototype calls JIT'd code via cgo. The production JIT must not. The MEP-30 spec (§6.2) specifies a pure-Go .s trampoline. This MEP commits the trampoline as a hard merge gate: no benchmark in §Benchmark plan is published with cgo. The trampoline shape is the standard "call assembly with a fixed-prototype function pointer" pattern; see Go's own runtime/asm_arm64.s for the precedent.
Engineering phases
The revised phasing reflects the deopt model. Phase 1 (arithmetic, control flow, Move, Return) is already merged; per-loop speedup of 5-8x measured (§Appendix A). The remaining phases:
Phase 1.5: deopt protocol + universal opcode coverage (estimated 1-2 engineer-weeks)
- Implement the deopt sentinel encoding in runtime/vm2/cell.go and a deopt.Decode(Cell) (pc int, ok bool) helper.
- Implement the interpreter-resume wrapper in runtime/jit/vm2jit/: read the JIT return, branch on sentinel, set frame.IP = pc and call vm.runInterp(frame) to finish the function.
- Implement a deopt-stub emitter in lower_arm64.go: for any unsupported opcode, spill live JIT regs to the register file and return the sentinel-tagged PC.
- Wire every non-arithmetic, non-control-flow, non-Move, non-Return opcode to the deopt stub. The function may still be compiled (and benefit from JIT'd arithmetic up to the first deopt point), but lists/strings/maps/sets/structs/calls all deopt.
- Benchmark gate: arithmetic loops stay at 5-8x; all MEP-23 list/string/map/set/struct/call workloads stay within 5% of interpreter baseline (i.e., the deopt round-trip cost is small enough to be invisible).
Phase 2: allocation-free fast paths (estimated 4-6 engineer-weeks)
- Lift the read-only opcodes (OpListLen, OpListGet, OpListSet no-grow, OpLenStr, OpIndexStr, OpEqualStr, OpMapLen, OpMapGet, OpMapHas, OpStructGet, OpStructSet, OpStructLen, OpStructTagCheck, OpStructEqual) to inline JIT fast paths. Each fast path is straight-line code with deopt on any unhappy path (bounds-fail, growth-needed, type-mismatch).
- Set opcodes ride along (mechanical from maps).
- Benchmark gate: lists/sum, strings/equal_loop, maps/keys, structs/fill_field reach 1.5x or better vs interpreter; allocation-bound workloads stay near parity.
Phase 3 (optional): in-JIT allocation (estimated 4-8 engineer-weeks)
- Investigate runtime.systemstack-style allocation from JIT code for the small-allocation opcodes (OpNewList with small cap, OpConcatStr for inline result, OpNewStruct).
- Investigate JIT-to-JIT direct call to avoid vm.Frames append on the hot path.
- Decision gate: each candidate ships only if its measured speedup vs Phase 2 deopt is at least 1.3x on the relevant workload. Otherwise the deopt path is fast enough and we don't add code.
Total (Phase 1.5 + Phase 2): ~5-8 engineer-weeks, ~2 KLOC of Go, on top of merged Phase 1. Phase 3 is unscoped and conditional.
Benchmark plan
Three benchmark groups, all run on the same hardware in the same session, all reported with five-sample medians and benchstat-style variance:
Group A: vm2 interpreter head-to-head
For every MEP-23 workload, three numbers:
- vm2 interpreter (current main).
- vm2 + this JIT.
- The ratio.
Merge gate: speedup ratio >= 1.0 on every workload (no regression), with at least 1.5x on the loop-dominated subset (fill_sum, concat_loop, fill_probe, fib_iter).
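A minimal sketch of the Group A arithmetic, assuming the ratio is interpreter time divided by JIT time (so values above 1.0 mean the JIT is faster) and that the five-sample median is taken per side before dividing. The sample numbers are made up for illustration, and `median` / `speedup` are hypothetical helper names.

```go
package main

import (
	"fmt"
	"sort"
)

// median returns the median of xs without modifying it. xs must be non-empty.
func median(xs []float64) float64 {
	s := append([]float64(nil), xs...)
	sort.Float64s(s)
	n := len(s)
	if n%2 == 1 {
		return s[n/2]
	}
	return (s[n/2-1] + s[n/2]) / 2
}

// speedup is median interpreter time over median JIT time.
func speedup(interpNs, jitNs []float64) float64 {
	return median(interpNs) / median(jitNs)
}

func main() {
	interp := []float64{100, 104, 98, 101, 250} // ns/op, five samples (illustrative)
	jit := []float64{50, 52, 49, 51, 50}
	s := speedup(interp, jit)
	fmt.Printf("speedup %.2fx, no regression: %v, loop gate (1.5x): %v\n",
		s, s >= 1.0, s >= 1.5)
}
```

Using the median rather than the mean keeps a single outlier sample (the 250 above) from moving the reported ratio.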
Group B: cross-language
For each Group A workload that has a published port:
| Language | Implementation |
|---|---|
| Mochi | vm2 interpreter + this JIT |
| Lua | Stock Lua 5.5 (loadable binary on macOS) |
| LuaJIT | LuaJIT 2.1 (master) |
| Python | CPython 3.14 (default tier-1 interpreter) |
| Go (reference) | Hand-translated, for the theoretical floor |
Reportable: the ratio table, the workload-level analysis, the threats-to-validity section. Not gated on any specific ratio; ship the numbers.
Group C: per-opcode microbenchmarks
For every category in §Background, one microbench per opcode hot path. Reports ns/op for:
- The interpreter handler.
- The JIT'd fast path.
- The JIT'd slow path.
Used internally only, not in the public results MEP. The role is to flag regressions during development (the per-opcode fast path should not regress between phases) and to give the tier-2 MEP its baseline numbers.
Reporting MEP
The numbers land in a new Informational MEP, analogous to MEP-33. Provisional number and title: MEP-35, Full vm2-opcode JIT - Measured Results. The MEP-35 draft is written alongside Phase 3 and merged simultaneously with the JIT flip-default change in vm2.
Risks
- Deopt frequency dominates speedup on allocation-heavy workloads. A loop that allocates every iteration spends most of its time in the interpreter, with the JIT contributing only the arithmetic between deopt points. The honest read is that the JIT does not speed up such loops; it lands at parity. Mitigation: explicit per-workload predictions in §List/String/Map/Set/Struct opcodes. The merge gate is "no regression", not "speedup on every workload".
- The deopt protocol itself becomes the bug surface. Sentinel encoding, live-reg spill list, IP synchronization, and re-entry safepoints have to agree between JIT and interpreter. Mitigation: keep the protocol minimal (one sentinel encoding, one spill convention, no mid-function re-entry in v1); add a deopt_test.go that fuzzes JIT-deopt-interpret-finish for every opcode.
- W^X policy differences between darwin and linux. macOS arm64 requires pthread_jit_write_protect_np; linux/amd64 wants PROT_READ|PROT_WRITE flips around the page write. The MEP-30 prototype handles macOS; the production JIT must abstract this into runtime/jit/vm2jit/arch/{arm64,amd64}/page.go.
- Frame-format drift between interpreter and JIT. The contract in §Frame compatibility is the most subtle correctness hazard, especially across deopt boundaries where the interpreter picks up state the JIT just spilled. Mitigation: a single Go struct (type Frame struct { ... }) shared between both sides; any field reorder must update both lowerings or fail a compile-time assertion.
- MEP-19 quickening coverage drives JIT win size. The JIT's tag-check fast path is good but pays 2 instructions per untyped op. Workloads the quickening pass misses look slower than expected. Mitigation: instrument the JIT to report per-call quickening coverage in the MEP-35 results; if coverage is below 80% on the corpus, file follow-on MEP-19 work.
- Goroutine preemption check granularity. Too frequent costs the JIT its loop-tightness advantage; too rare risks scheduler stalls. Under the deopt model the simplest answer is "every deopt point is a safepoint" plus a per-back-edge check; measure in MEP-35, tune if needed.
- Single-binary distribution. The MEP-30 spec required this; this MEP inherits the constraint. The pure-Go
.strampoline is the only acceptable cross-tier call mechanism; cgo is forbidden after the prototypes. - Phase 3 chases diminishing returns. Once read paths are in JIT, the remaining headroom is allocation, which requires
runtime.systemstacktricks or escape analysis (a tier-2 problem). Phase 3 is explicitly conditional on measured headroom, not a deliverable.
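One way the compile-time assertion for the shared frame struct could look is sketched below. The field names, offsets, and constant names are hypothetical stand-ins, not the actual vm2 frame layout; the technique is the standard Go idiom of subtracting typed `uintptr` constants in both directions, which only compiles when the two values are equal.

```go
package main

import "unsafe"

// Frame is a hypothetical stand-in for the struct shared by the interpreter
// and the JIT. The real field set is defined by §Frame compatibility.
type Frame struct {
	Regs     uint64 // base address of the register file (assumed field)
	PC       uint32 // bytecode PC to resume at after a deopt (assumed field)
	Sentinel uint32 // deopt sentinel slot (assumed field)
}

// Offsets that the emitted machine code would hard-code (illustrative values).
const (
	frameOffRegs     = 0
	frameOffPC       = 8
	frameOffSentinel = 12
	frameSize        = 16
)

// unsafe.Offsetof and unsafe.Sizeof yield uintptr constants here. A uintptr
// constant cannot be negative, so each pair below compiles only if the
// struct layout and the hard-coded offset agree exactly.
const (
	_ = unsafe.Offsetof(Frame{}.Regs) - frameOffRegs
	_ = frameOffRegs - unsafe.Offsetof(Frame{}.Regs)
	_ = unsafe.Offsetof(Frame{}.PC) - frameOffPC
	_ = frameOffPC - unsafe.Offsetof(Frame{}.PC)
	_ = unsafe.Offsetof(Frame{}.Sentinel) - frameOffSentinel
	_ = frameOffSentinel - unsafe.Offsetof(Frame{}.Sentinel)
	_ = unsafe.Sizeof(Frame{}) - frameSize
	_ = frameSize - unsafe.Sizeof(Frame{})
)

// FrameLayout reports the layout at run time, e.g. for a sanity test.
func FrameLayout() (pc, sentinel, size uintptr) {
	var f Frame
	return unsafe.Offsetof(f.PC), unsafe.Offsetof(f.Sentinel), unsafe.Sizeof(f)
}
```

Reordering a field without updating the matching constant turns into a "constant overflows uintptr" build error rather than a silent deopt-time corruption.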
Open questions
- Where does the goroutine preemption check live, and at what granularity? Go's `morestack` check is not exposed and is unsafe for JIT code anyway (see §Phase 1.5 post-mortem). The deopt model gives an easy answer: every deopt point is implicitly a safepoint, since the wrapper resumes the interpreter, which honors preemption normally. Open question is whether we also need a per-back-edge atomic-load preemption check for tight all-JIT loops, or whether deopt-on-overflow at a slow path is sufficient.
- Re-entry strategy after deopt. v1 does not re-enter the JIT mid-function (once deopted, the function finishes interpreted). Phase 2 may add re-entry at function-entry safepoints; full mid-function re-entry would require a JIT entry table keyed by PC, which is the same machinery a tier-2 deopt-into-tier-1 reverse path needs. Defer.
- JIT cache lifetime across long-running processes. Per-Function compiled-code pointer is the simplest answer, lifetime tied to the Function. Cross-Function code sharing for common templates is appealing but unscoped; defer to Phase 3 or later.
- Should `OpNewList` with a known small constant cap deopt, or compile? The JIT could materialize a small list inline using stack-allocated backing storage (since lists are escape-analysis-friendly when scope is bounded), but this requires teaching the JIT about lifetime. Out of scope for Phase 1.5/2; revisit in Phase 3 with measured headroom.
- AMD64 register pressure. AArch64 has 30 GPRs; AMD64 has 14 useable. The JIT's per-opcode templates compile cleanly on both, but a future tier-2 MEP-32 will need an actual register allocator for AMD64. The deopt spill list is shorter on AMD64 (fewer regs to spill) so the deopt-stub size is similar. Spec is fine; allocator is post-this-MEP.
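The first two questions can be made concrete with a plain-Go model of the control flow. This is only a simulation under stated assumptions: `preemptFlag`, `jitSumLoop`, `interpSumLoop`, and `Run` are hypothetical names, the "JIT" loop is ordinary Go rather than emitted machine code, and the real preemption signal may not be a simple flag.

```go
package main

import "sync/atomic"

// Status is what the (simulated) compiled code reports back to its wrapper.
type Status int

const (
	Finished Status = iota // ran to completion in the JIT
	Deopted                // bailed out; the interpreter finished the function
)

// preemptFlag is a hypothetical flag the scheduler side would set to request
// that tight JIT loops yield at their next check.
var preemptFlag atomic.Bool

// jitSumLoop stands in for an all-JIT counting loop. It performs one atomic
// load per iteration (modelling a per-back-edge check). When the flag is set
// it stops and returns the live state (sum, i) that a deopt stub would spill.
func jitSumLoop(n int64) (sum, i int64, st Status) {
	for i = 0; i < n; i++ {
		if preemptFlag.Load() {
			return sum, i, Deopted
		}
		sum += i
	}
	return sum, i, Finished
}

// interpSumLoop models the interpreter picking up the spilled state and
// finishing the loop. In the real system this path honors preemption.
func interpSumLoop(sum, i, n int64) int64 {
	for ; i < n; i++ {
		sum += i
	}
	return sum
}

// Run is the wrapper: try the JIT first; on deopt, finish in the interpreter
// and never re-enter the JIT mid-function (the v1 rule).
func Run(n int64) (int64, Status) {
	sum, i, st := jitSumLoop(n)
	if st == Deopted {
		return interpSumLoop(sum, i, n), Deopted
	}
	return sum, Finished
}
```

The result is the same on both paths; only the reported status differs, which is the property a deopt fuzz test would check.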
Comparison with the three JIT options
The three options (MEP-30, MEP-31, MEP-32) are about strategy: template vs tracing vs tiered. This MEP is about coverage: how the chosen strategy (template, per MEP-30 + MEP-33) is carried across the full vm2 opcode surface.
| Aspect | This MEP (MEP-34) | MEP-30 | MEP-31 | MEP-32 |
|---|---|---|---|---|
| Scope | full vm2 opcode set | 6-opcode toy | hot loops (recorded) | full vm2 + tier-2 opt |
| Tier | tier 1 only | tier 1 (prototype) | tier 2 (orthogonal) | tier 1 + tier 2 |
| Coverage | inline arith/control/move/return; deopt for allocating ops; Phase 2 adds read-only fast paths | arith only | depends on workload | this MEP + IR optimizations |
| Engineering (remaining) | 5-8 weeks (Phase 1.5 + Phase 2) | done (afternoon) | 12+ months | 18+ months |
| Predicted speedup | 5-8x on arith (measured); 1.5-2x on read-heavy post-Phase-2; parity on alloc-heavy | 17x on toy bytecode | 5-15x on loops, abort risk | 4-8x broad |
Predicted MEP-23 numbers
Predictions split by phase. Phase 1 numbers are measured (§Appendix A). Phase 1.5 numbers assume the deopt round-trip is ~20-30 ns; Phase 2 numbers assume inline read-path templates execute at MEP-30-prototype-comparable speed. All numbers are ns/op at N=1024 on Apple M4, predicted unless marked measured:
| Workload | vm2 (now) | + Phase 1.5 | + Phase 2 | LuaJIT 2.1 | Phase 2 / LuaJIT |
|---|---|---|---|---|---|
| arith/fib_iter | 3500 | 700 (meas) | 700 | 600 | 1.17x |
| lists/fill_sum | 6000 | 6000 | 5000 | 1200 | 4.17x |
| lists/sum | 2200 | 2200 | 600 | 700 | 0.86x |
| strings/concat_loop | 8000 | 8000 | 8000 | 2500 | 3.20x |
| strings/equal_loop | 1800 | 1800 | 800 | 400 | 2.00x |
| maps/fill_probe | 11000 | 11000 | 11000 | 4000 | 2.75x |
| maps/keys | 3500 | 3500 | 2000 | 1800 | 1.11x |
| sets/fill_probe | 11500 | 11500 | 11500 | - | - |
| structs/fill_field | 2400 | 2400 | 1000 | - | - |
| structs/equal_loop | 1300 | 1300 | 600 | - | - |
The honest read: Phase 1.5 ships parity-or-faster on every workload (arithmetic improves, others stay flat). Phase 2 brings the read-heavy workloads into 0.85-2x of LuaJIT. The allocation-bound workloads (fill_sum, fill_probe, concat_loop) stay near interpreter speed without tier-2; that is the honest limit of a non-IR JIT.
The CPython 3.14 ratios will be in the 8-25x range across the board, consistent with MEP-33.
If Phase 2 numbers fall short of these by more than 30%, the failure modes and remediations should be the explicit subject of the MEP-35 reporting and may motivate funding MEP-32 sooner than planned.
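The last column of the table and the 30% threshold are simple arithmetic, sketched here. The function names are hypothetical, and the sketch assumes "fall short by more than 30%" means the measured ns/op exceeds the predicted ns/op by more than 30% of the prediction.

```go
package main

// RatioToLuaJIT returns Mochi ns/op divided by LuaJIT ns/op; values below
// 1.0 mean Mochi is faster on that workload.
func RatioToLuaJIT(mochiNs, luajitNs float64) float64 {
	return mochiNs / luajitNs
}

// Shortfall returns how much slower the measurement is than the prediction,
// as a fraction of the prediction (assumed interpretation of "fall short").
func Shortfall(predictedNs, measuredNs float64) float64 {
	return (measuredNs - predictedNs) / predictedNs
}

// NeedsRemediation reports whether a workload misses its prediction by more
// than the 30% threshold.
func NeedsRemediation(predictedNs, measuredNs float64) bool {
	return Shortfall(predictedNs, measuredNs) > 0.30
}
```

For example, `RatioToLuaJIT(600, 700)` reproduces the lists/sum row (about 0.86), and a hypothetical measurement of 800 ns/op against the 600 ns/op prediction would cross the threshold.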
Related work
- MEP-23, the cross-language benchmark methodology this MEP's gates use.
- MEP-24, the vm2 subsystem spec defining the list/string/map/set/struct shapes the JIT must handle.
- MEP-19, MEP-27, the IC infrastructure whose type feedback the JIT reads but does not write.
- MEP-29, the dispatch-strategy measured results that motivate keeping vm2 monomorphic, which this JIT inherits.
- MEP-30, the template / copy-and-patch baseline JIT strategy this MEP carries to full opcode coverage.
- MEP-31, the tracing JIT alternative; remains deferred.
- MEP-32, the tier-2 optimizing JIT; funded after this MEP ships and MEP-35 lands.
- MEP-33, the MEP-30 prototype's measured results that calibrate this MEP's predictions.
Files to add (provisional)
Files already merged from Phase 1 are shown in italics. Files added by Phase 1.5 are bold. Phase 2 adds the remaining per-category template files.
- `runtime/jit/vm2jit/doc.go`, package overview.
- `runtime/jit/vm2jit/compile.go`, per-Function compiler driver.
- `runtime/jit/vm2jit/cache.go`, executable code cache.
- `runtime/jit/vm2jit/lower_arm64.go`, AArch64 lowering for arithmetic/control/move/return.
- `runtime/jit/vm2jit/trampoline/{trampoline_arm64.s,trampoline.go}`, pure-Go cross-tier call.
- `runtime/jit/vm2jit/deopt/{sentinel.go,resume.go}`, sentinel encoding + interpreter-resume wrapper.
- `runtime/jit/vm2jit/lower_deopt_arm64.go`, deopt stub emitter (one stub shape, parameterised by spill list and PC).
- `runtime/jit/vm2jit/lower_arm64_lists.go` etc., Phase 2 per-category inline templates (one file per category).
- `runtime/jit/vm2jit/arch/amd64/{encode.go,page.go,lower.go}`, AMD64 backend (Phase 2 or later).
- `runtime/jit/vm2jit/vm2jit_test.go` plus per-category test files, plus `deopt_test.go` fuzzing JIT-deopt-interpret-finish.
- `runtime/jit/vm2jit/bench/`, microbenches and MEP-23 corpus harness.
- `website/docs/mep/mep-0035.md`, the measured-results MEP (lands with Phase 2).
Appendix A: Phase 1 Measured Results
Hardware: Apple M4, darwin/arm64, Go 1.24.
Command: go test -bench=. -benchtime=5s -count=5 ./runtime/jit/vm2jit/
All numbers are 5-sample medians. The trampoline is CGo-based in Phase 1 (pure-Go .s is Phase 1.5); CGo boundary cost is ~25-30 ns per call and is visible in the per-opcode microbenchmarks but is amortized across loop iterations in the loop benchmarks.
A.1 fib_iter (iterative Fibonacci)
Loop body: OpJumpIfGreaterEqI64, OpAddI64, two OpMove, OpAddI64K, OpJump — 6 opcodes per iteration.
| Benchmark | JIT (ns/op) | Interp (ns/op) | Speedup |
|---|---|---|---|
| fib_iter N=20 | 35 | 205 | 5.9x |
| fib_iter N=100 | 120 | 978 | 8.1x |
Speedup grows with N because the ~30 ns call overhead is amortized. JIT per-iteration cost (total ns/op divided by N): ~1.75 ns at N=20, ~1.2 ns at N=100. Interpreter per-iteration cost: ~10 ns, consistent with MEP-29 dispatch measurements. At large N the steady-state ratio approaches ~8x.
A.2 sum_n (integer accumulation loop)
Loop body: OpAddI64, OpAddI64K, OpJumpIfLessI64 — 3 opcodes per iteration.
| Benchmark | JIT (ns/op) | Interp (ns/op) | Speedup |
|---|---|---|---|
| sum_n N=100 | 130 | 542 | 4.2x |
| sum_n N=1k | 1149 | 5819 | 5.1x |
| sum_n N=10k | 11714 | 62286 | 5.3x |
JIT steady-state cost: ~1.15 ns/iteration. Speedup converges to ~5x: the interpreter spends ~5.4-6.2 ns per iteration on the 3-opcode loop body (about 2 ns per opcode), against ~1.15 ns for the JIT. Both benchmarks (fib_iter and sum_n) exceed the Phase 1 gate of ≥1.5x.
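The speedup and per-iteration figures follow directly from the table rows. A minimal sketch of that arithmetic, with hypothetical type and function names:

```go
package main

// Sample is one benchmark row: loop count N and total ns/op in each mode.
type Sample struct {
	N        int
	JITNs    float64
	InterpNs float64
}

// Speedup is interpreter time divided by JIT time for the same N.
func Speedup(s Sample) float64 {
	return s.InterpNs / s.JITNs
}

// PerIter divides a total ns/op figure by the loop count. Any fixed call
// overhead is included, so the value approaches the steady-state cost only
// as N grows.
func PerIter(totalNs float64, n int) float64 {
	return totalNs / float64(n)
}
```

Applied to the sum_n N=10k row, `Speedup(Sample{N: 10000, JITNs: 11714, InterpNs: 62286})` gives roughly 5.3, and `PerIter(11714, 10000)` gives a little under 1.2 ns.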
A.3 Per-opcode microbenchmarks
One-shot calls: prologue + one opcode + epilogue, called once per benchmark iteration through the CGo trampoline. The dominant cost (~25-30 ns) is the trampoline boundary; the JIT opcode itself costs under 5 ns.
| Opcode | Total (ns/op) |
|---|---|
| OpAdd (I64) | 32 |
| OpMul (I64) | 31 |
| OpDiv (I64) | 30 |
| OpMod (I64) | 32 |
| OpLess (I64) | 30 |
All five cluster at 30-32 ns, confirming flat per-opcode cost. The arithmetic opcodes are bounded by the NaN-box unpack/repack sequence (sbfx + and + op + and + movz + orr = 6 AArch64 instructions), not the operation itself.
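The unpack/op/repack shape can be illustrated in Go. This is a sketch under loud assumptions: the 48-bit payload width and the `0xFFF9` tag are hypothetical and are not taken from `runtime/vm2/cell.go`; the point is only to show why the surrounding bit manipulation, rather than the add itself, accounts for most of the instructions.

```go
package main

const (
	payloadBits = 48                            // assumed payload width
	payloadMask = uint64(1)<<payloadBits - 1    // low 48 bits
	tagInt      = uint64(0xFFF9) << payloadBits // hypothetical integer tag
)

// isInt is the tag check a fast path would inline before unboxing.
func isInt(c uint64) bool {
	return c&^payloadMask == tagInt
}

// unboxInt sign-extends the low 48 bits into an int64, the job a signed
// bitfield extract (sbfx) does in one AArch64 instruction.
func unboxInt(c uint64) int64 {
	return int64(c<<(64-payloadBits)) >> (64 - payloadBits)
}

// boxInt masks the result back to the payload width and merges the tag,
// corresponding to the mask, tag-materialize, and OR steps of the sequence.
// Results outside the 48-bit range wrap in this sketch.
func boxInt(v int64) uint64 {
	return tagInt | (uint64(v) & payloadMask)
}

// addI64 is the fast path: tag-check both operands, unbox, add, rebox.
// ok is false when either operand is not an integer, which is where the
// real JIT would take its slow path.
func addI64(a, b uint64) (result uint64, ok bool) {
	if !isInt(a) || !isInt(b) {
		return 0, false
	}
	return boxInt(unboxInt(a) + unboxInt(b)), true
}
```

Here the add is a single operation surrounded by shifts, masks, and an OR, which matches the observation that the per-opcode cost is flat across Add, Mul, Div, Mod, and Less.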
A.4 Phase 1 scope, the failed Phase 1.5 attempt, and the pivot
Phase 1 covers arithmetic, control flow, Move, and Return. The pure-Go .s trampoline shipped during Phase 1, replacing the prototype's CGo wrapper.
An initial Phase 1.5 attempt added inline lowerings for the five list opcodes that called the corresponding runtime/vm2.JIT*List* Go functions from JIT code via BLR. This attempt is documented in §Phase 1.5 post-mortem; briefly, it crashed in three independent ways (R19 clobber by nanotime_trampoline, JIT frame absence from pclntab, and morestack-induced spill corruption) and the design is structurally unfixable in Go. The work was reverted.
The revised Phase 1.5 design, captured in §Deoptimization protocol, replaces in-place Go callouts with a deopt-and-resume mechanism. The arithmetic loop-body speedup numbers above are unaffected, as they never depended on calling into Go from JIT code.
The Phase 1 results of 5-8x on arithmetic loops exceed the predicted 3-6x range in §Predicted MEP-23 numbers and clear the ≥1.5x Phase 1 benchmark gate with substantial margin.