MEP 18. Bytecode Dispatch in a Go-Hosted VM
| Field | Value |
|---|---|
| MEP | 18 |
| Title | Bytecode Dispatch in a Go-Hosted VM |
| Author | Mochi core |
| Status | Draft |
| Type | Standards Track |
| Created | 2026-05-16 |
Abstract
The interpreter loop is the single hottest piece of code in any non-JIT VM. Wren measures 5-10% from computed gotos alone ([Wren-perf]); CPython 3.11's adaptive interpreter (PEP 659) gives another 10-60% on type-stable code without a JIT ([PEP-659]). Mochi today runs a single `switch ins.Op` over 107 opcodes in `runtime/vm/vm_eval.go` (1728 lines). This MEP describes which of the classic dispatch optimizations are reachable from Go, which are not, and what the realistic ceiling is for an interpreter that must stay portable Go.
The conclusion up front: Go has neither labels-as-values nor a guaranteed tail-call ABI, so the classic direct-threaded and tail-call-threaded interpreters do not port. The wins available without unsafe assembly tricks are: (a) a dense byte-sized opcode space so the Go compiler turns the dispatch switch into a jump table, (b) opcode fusion for hot bigrams, (c) hot-field hoisting of ip and the register array into local variables across the loop body, and (d) the bytecode rewriting infrastructure that MEP-19 will reuse for inline caches. Combined, internal microbenchmarks (see MEP-17) suggest a 1.5-2x speedup is reachable; we should not promise more without numbers.
Motivation
Interpreter dispatch is a five-instruction sequence in every bytecode VM: fetch the next opcode, decode its operands, jump to the handler, run the handler, return to step 1. Of those, the indirect jump is the single instruction that dominates because the branch predictor has to guess the next handler from the last one. A switch dispatch has one shared indirect jump for every opcode, so the predictor sees ~107 destinations from one site and is wrong most of the time.
The literature is consistent on the fix. Ertl and Gregg's 2003 paper ([Ertl-Gregg]) measures direct-threaded dispatch (each handler ends with its own indirect jump) at 1.5-2x over switch dispatch on the same hardware, mostly from BTB locality: the predictor learns common bigrams (LOAD_CONST -> ADD) per handler. Wren reports 5-10% in a much smaller op set. CPython 3.11+ moved from switch to computed goto (USE_COMPUTED_GOTOS) and reported significant wins; CPython 3.13 is experimenting with a tail-call interpreter ([CPython-tailcall]).
Mochi cannot have any of those primitives directly. The Go specification has no labels-as-values, no goto *ptr, and no [[clang::musttail]]. The Go compiler's escape analysis and bounds-checker also fight the kind of hand-tuned loop that wins in C. So the question this MEP answers is: what is the ceiling for a portable Go interpreter, and how do we approach it?
Specification
Constraints of the Go host
Three properties of Go shape every design choice below.
- No labels-as-values. `goto` only takes statically-known label targets. The closest equivalent is `switch` over a dense integer; the Go compiler emits a single jump-table indirect jump for that switch, recovering most (but not all) of computed-goto behavior. Wren in C with computed gotos gets 5-10%; the Go switch retains the per-call indirect jump.
- No guaranteed tail calls. Function-per-op designs that thread via tail calls (Wasm3, CPython 3.13 tier-2) require the compiler to compile `return f(...)` to `jmp f`. The Go compiler does not. A function-per-op dispatcher in Go pays the call/return overhead every opcode, which is worse than the switch.
- Bounds checks and escape analysis. Indexing `regs[i]` or `code[ip]` inserts a bounds check. The Go compiler eliminates the check when it can prove the index is in range; the dispatch loop must be written so it can prove it. Similarly, taking the address of a local can force a heap allocation; the loop must keep its hot variables in registers, not on the heap.
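To make the bounds-check constraint concrete, here is a minimal sketch of the re-slicing idiom that lets the gc compiler prove an index in range and drop the per-iteration check. The function name `sumFirst` is hypothetical; the elimination can be inspected with the compiler's `-d=ssa/check_bce` debug flag.

```go
package main

import "fmt"

// sumFirst re-slices its input to a length the compiler can see, so the
// loop condition i < len(xs) proves every xs[i] access is in range and
// the bounds check is eliminated (verify with
// `go build -gcflags=-d=ssa/check_bce`).
func sumFirst(xs []int64, n int) int64 {
	if n > len(xs) {
		n = len(xs)
	}
	xs = xs[:n] // compiler-visible fact: len(xs) == n from here on
	var s int64
	for i := 0; i < len(xs); i++ {
		s += xs[i] // no bounds check: i < len(xs) is the loop guard
	}
	return s
}

func main() {
	fmt.Println(sumFirst([]int64{1, 2, 3, 4}, 3)) // 6
}
```

The same shape applies to the dispatch loop: a single length assertion at the top buys check-free indexing inside.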
Phase A: Dispatch loop discipline
The first PR of this MEP rewrites the existing 1728-line `vm_eval.go` switch loop with no algorithmic change, only a layout the Go compiler can optimize.
- Opcode type narrowing. `Op` is already `uint8`, which is good. Confirm via `go tool objdump` that the switch compiles to a jump table, not a chain of compares. Today's loop is a single big `switch`; verify it.
- Hoist hot state. Lift `ip`, `code` (the `[]Instr` slice), and `regs` (the register array) into local variables at the top of the loop and write them back to the frame only on call/return. Today's loop reads `fr.fn.Code[fr.ip]` every iteration, which is two pointer dereferences and a bounds check. After: `ins := code[ip]`. Benchstat baseline before the change; verify a measurable win (expect 5-15% on `iter_sum` and `fib`).
- Bounds-check elimination. Where the compiler cannot prove `ip < len(code)`, add a single check at the top of the loop and rely on the proof inside. Use `_ = code[len(code)-1]`-style asserts where helpful.
- Eliminate per-iteration allocation. Audit every handler for accidental heap allocation. `interface{}` conversions are the usual culprit. The numeric path (`OpAddInt`, `OpAddFloat`) must allocate nothing.
- Frame stack reuse. Today's `stack []*frame` grows and shrinks per call. Use a `sync.Pool` for `*frame` and a slab for `regs`. (Pool detail is covered by MEP-20; this MEP just notes the dependency.)
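The hoisting discipline above can be sketched in a toy loop. The `Instr` layout and opcode set here are illustrative, not the real `runtime/vm` encoding: `ip`, `code`, and `regs` are locals for the whole loop, and the dense `uint8` switch is the shape the compiler turns into a jump table.

```go
package main

import "fmt"

// Hypothetical Instr layout for illustration; the real encoding in
// runtime/vm may differ.
type Instr struct {
	Op   uint8
	A, B uint8 // register operands
	C    int64 // immediate / jump target
}

const (
	OpConst  uint8 = iota // regs[A] = C
	OpAddInt              // regs[A] += regs[B]
	OpJumpLt              // if regs[A] < regs[B] { ip = int(C) }
	OpHalt                // return regs[0]
)

// eval keeps ip, code, and regs in locals so the compiler can keep them
// in machine registers instead of re-reading the frame every iteration;
// in the real VM they are written back to the frame only on call/return.
func eval(code []Instr, regs []int64) int64 {
	ip := 0
	for {
		ins := code[ip] // one bounds check on a local index
		ip++
		switch ins.Op { // dense uint8 switch -> jump table
		case OpConst:
			regs[ins.A] = ins.C
		case OpAddInt:
			regs[ins.A] += regs[ins.B]
		case OpJumpLt:
			if regs[ins.A] < regs[ins.B] {
				ip = int(ins.C)
			}
		case OpHalt:
			return regs[0]
		}
	}
}

// sumTo builds "acc = 1 + 2 + ... + n" as bytecode and runs it.
func sumTo(n int64) int64 {
	code := []Instr{
		{Op: OpConst, A: 2, C: n},        // r2 = n (limit)
		{Op: OpConst, A: 3, C: 1},        // r3 = 1
		{Op: OpAddInt, A: 1, B: 3},       // i++
		{Op: OpAddInt, A: 0, B: 1},       // acc += i
		{Op: OpJumpLt, A: 1, B: 2, C: 2}, // loop while i < n
		{Op: OpHalt},
	}
	return eval(code, make([]int64, 4))
}

func main() {
	fmt.Println(sumTo(5)) // 15
}
```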
Expected gain from Phase A alone: 1.3-1.8x geometric mean on the MEP-17 suite, based on the gap between idiomatic and tuned Go interpreters in the literature.
Phase B: Superinstructions
Once Phase A lands, profile the bench suite (`go test -cpuprofile`) and identify the top ten bytecode bigrams. Typical winners in Lua-family VMs ([Casey-superinstr]):
- `OpConst` followed by `OpAdd` (loading a literal and adding it)
- `OpGetGlobal` followed by `OpCall`
- `OpLoad` (register move) followed by any binary op
- `OpJumpIfFalse` after `OpLess` (the loop-back compare)
Each fused pair becomes a single opcode (e.g. `OpAddConstInt`). The compiler emits the fused opcode when the source pattern is detected during bytecode emission. The interpreter gains a handler that does both operations without a dispatch in between.
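The emission-time pattern match can be sketched as a peephole pass over freshly emitted bytecode. The `Instr` layout, the fusion condition, and the function name `fuseBigrams` are illustrative assumptions; a real pass must also fix up jump targets that point past a fused pair, which this sketch omits.

```go
package main

import "fmt"

// Hypothetical instruction layout and opcode values for illustration.
type Instr struct {
	Op   uint8
	A, B uint8
	C    int64
}

const (
	OpConst       uint8 = iota // regs[A] = C
	OpAdd                      // regs[A] += regs[B]
	OpAddConstInt              // fused: regs[A] += C
)

// fuseBigrams rewrites the bigram "OpConst into a temporary immediately
// consumed as OpAdd's right-hand side" into a single OpAddConstInt.
// Jump-target fixup is deliberately omitted from this sketch.
func fuseBigrams(code []Instr) []Instr {
	out := make([]Instr, 0, len(code))
	for i := 0; i < len(code); i++ {
		if i+1 < len(code) &&
			code[i].Op == OpConst &&
			code[i+1].Op == OpAdd &&
			code[i+1].B == code[i].A { // the const feeds the add
			out = append(out, Instr{Op: OpAddConstInt, A: code[i+1].A, C: code[i].C})
			i++ // consumed two instructions
			continue
		}
		out = append(out, code[i])
	}
	return out
}

func main() {
	code := []Instr{
		{Op: OpConst, A: 1, C: 42}, // r1 = 42
		{Op: OpAdd, A: 0, B: 1},    // r0 += r1
	}
	fused := fuseBigrams(code)
	fmt.Println(len(fused), fused[0].Op == OpAddConstInt) // 1 true
}
```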
The implementation cost is one new opcode and one match in the compiler per fusion. The risk is opcode-space inflation; the budget is 32 fused opcodes (the 107-op set has headroom to 256 before the `uint8` overflows). New fusions land only with bench numbers showing a measurable improvement on at least one program.
Expected gain from Phase B: 1.1-1.3x on top of Phase A, concentrated in tight loops.
Phase C: Bytecode rewriting infrastructure
The third PR of this MEP does not change the opcode set or the loop. It adds the ability to rewrite a bytecode instruction in place at runtime. This is the foundation MEP-19 (specialization and inline caches) requires: when the interpreter observes that BINARY_OP always sees two ints, it should be able to rewrite the instruction to BINARY_OP_INT without recompiling the function.
The mechanism:
- `Instr` gains an optional `Quick uint8` byte that holds the quickened opcode. Initially zero.
- The dispatch loop reads `ins.Op` (the original), unless `ins.Quick != 0`, in which case it dispatches on `Quick`. The check is a single compare-and-branch per instruction.
- Handlers may patch `ins.Quick = OpXyzInt` when they observe stable types. The patch is a single byte write; no synchronization is needed in the single-goroutine VM model. A future concurrent VM (MEP-22 or later) would revisit.
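The mechanism can be sketched end to end. The opcode names, the boxed `value` type, and the guard-and-fall-back behavior in the quickened handler are illustrative assumptions (the fallback policy itself belongs to MEP-19); what the sketch shows is the `Quick != 0` compare-and-branch and the single-byte patch.

```go
package main

import "fmt"

// Hypothetical layout: Quick == 0 means "not yet quickened".
type Instr struct {
	Op, Quick uint8
	A, B      uint8
}

const (
	OpNop          uint8 = iota // keeps 0 free as "no quickening"
	OpBinaryAdd                 // generic: type-checks both operands
	OpBinaryAddInt              // quickened: assumes two int64s
	OpHalt
)

type value interface{}

func eval(code []Instr, regs []value) {
	for ip := 0; ; ip++ {
		ins := &code[ip] // pointer so handlers can patch Quick in place
		op := ins.Op
		if ins.Quick != 0 { // single compare-and-branch per instruction
			op = ins.Quick
		}
		switch op {
		case OpBinaryAdd:
			a, aok := regs[ins.A].(int64)
			b, bok := regs[ins.B].(int64)
			if aok && bok {
				regs[ins.A] = a + b
				ins.Quick = OpBinaryAddInt // one byte write, no sync needed
			}
			// non-int paths omitted in this sketch
		case OpBinaryAddInt:
			a, aok := regs[ins.A].(int64)
			b, bok := regs[ins.B].(int64)
			if aok && bok {
				regs[ins.A] = a + b
			} else {
				ins.Quick = 0 // de-quicken (fallback policy is MEP-19's)
				ip--          // retry through the generic handler
			}
		case OpHalt:
			return
		}
	}
}

func main() {
	code := []Instr{{Op: OpBinaryAdd, A: 0, B: 1}, {Op: OpHalt}}
	regs := []value{int64(2), int64(3)}
	eval(code, regs)
	fmt.Println(regs[0], code[0].Quick == OpBinaryAddInt) // 5 true
}
```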
This phase ships with no quickened opcodes. It is purely the plumbing.
Expected gain from Phase C alone: 0%. The gain shows up in MEP-19.
Phase D (deferred): unsafe interpreter loop
A future MEP may revisit an `unsafe.Pointer`-based interpreter that bypasses Go's bounds checks and stores `ip` as a raw pointer into the bytecode. This would close the remaining gap to a hand-tuned C interpreter, at the cost of giving up Go's memory safety in the hot loop. It is not in this MEP because Phase A + B + the MEP-19 specialization should clear the per-suite goal without crossing that line.
Non-goals
- No JIT in this MEP. Copy-and-patch is MEP-22.
- No new value representation. That is MEP-20.
- No inline caches. That is MEP-19.
- No assembly interpreter. Pall-style hand-coded interpreters require giving up portability across the architectures Go supports.
Status at a glance
| Item | Status |
|---|---|
| Phase A: hot-state hoisting, bounds-check elimination | proposed |
| Phase A: dispatch jump-table verification | proposed |
| Phase A: numeric path is allocation-free | proposed |
| Phase B: superinstruction infrastructure (compiler + handlers) | proposed |
| Phase B: first 8 fused opcodes | proposed |
| Phase C: quickening byte on Instr, dispatch indirection | proposed |
| Phase D: unsafe interpreter loop | deferred |
Risks
- The Go compiler may not emit a jump table. A 107-arm switch on `uint8` should; verify in objdump before claiming any of these wins.
- Frame allocation churn. Pool integration is in MEP-20, but if it slips, Phase A's measured wins may be partly eaten by GC noise. Run the bench harness with `GOGC=off` as a sanity check.
- Quickening invalidation. A function compiled for one input may be re-entered with different types after quickening. MEP-19 must explain when quickened ops fall back; this MEP just provides the slot.
Implementation notes
Phase A is one PR, two engineer-weeks. Each rewrite of a handler ships with benchstat numbers against the MEP-17 suite. Phase B is one PR per fusion family (binary-op-with-const, jump-with-compare); each lands only with numbers. Phase C is one PR, plumbing only.
The cumulative ceiling claim (1.5-2x geometric mean) is conservative against the literature. LuaJIT's interpreter gets 2-3x over PUC-Lua's C interpreter, but that is hand-coded assembly. Wren's computed-goto win is 5-10% over its own switch baseline. The Mochi target sits between: more than Wren (because Mochi starts further from optimal) and less than LuaJIT (because Go forecloses the assembly path).
References
- [Wren-perf] Bob Nystrom, Wren Performance. https://wren.io/performance.html
- [PEP-659] Mark Shannon, PEP 659: Specializing Adaptive Interpreter. https://peps.python.org/pep-0659/
- [Ertl-Gregg] Anton Ertl and David Gregg, The Structure and Performance of Efficient Interpreters. JILP 5, 2003. https://www.jilp.org/vol5/v5paper12.pdf
- [CPython-tailcall] Mark Shannon et al., A more efficient Python interpreter via tail calls. CPython 3.13 dev notes, 2024.
- [Casey-superinstr] Kevin Casey, M. Anton Ertl, David Gregg, Optimizing Indirect Branch Prediction Accuracy in Virtual Machine Interpreters. ACM TOPLAS 29(6), 2007.
- [DynASM] Mike Pall, DynASM: A Dynamic Assembler. https://luajit.org/dynasm.html (background on LuaJIT's interpreter approach)
- [Wasm3] Wasm3, the fastest WebAssembly interpreter. https://github.com/wasm3/wasm3 (tail-call-threaded design)