
MEP 18. Bytecode Dispatch in a Go-Hosted VM

MEP: 18
Title: Bytecode Dispatch in a Go-Hosted VM
Author: Mochi core
Status: Draft
Type: Standards Track
Created: 2026-05-16

Abstract

The interpreter loop is the single hottest piece of code in any non-JIT VM. Wren measures 5-10% from computed gotos alone ([Wren-perf]); CPython 3.11's adaptive interpreter (PEP 659) gives another 10-60% on type-stable code without a JIT ([PEP-659]). Mochi today runs a single switch ins.Op over 107 opcodes in runtime/vm/vm_eval.go (1728 lines). This MEP describes which of the classic dispatch optimizations are reachable from Go, which are not, and what the realistic ceiling is for an interpreter that must stay portable Go.

The conclusion up front: Go has neither labels-as-values nor a guaranteed tail-call ABI, so the classic direct-threaded and tail-call-threaded interpreters do not port. The wins available without unsafe assembly tricks are: (a) a dense byte-sized opcode space so the Go compiler turns the dispatch switch into a jump table, (b) opcode fusion for hot bigrams, (c) hot-field hoisting of ip and the register array into local variables across the loop body, and (d) the bytecode rewriting infrastructure that MEP-19 will reuse for inline caches. Combined, internal microbenchmarks (see MEP-17) suggest a 1.5-2x speedup is reachable; we should not promise more without numbers.

Motivation

Interpreter dispatch is a five-step sequence in every bytecode VM: fetch the next opcode, decode its operands, jump to the handler, run the handler, return to step 1. Of those, the indirect jump dominates because the branch predictor has to guess the next handler from the last one. A switch dispatch has one shared indirect jump for every opcode, so the predictor sees ~107 destinations from a single site and mispredicts most of the time.

The literature is consistent on the fix. Ertl and Gregg's 2003 paper ([Ertl-Gregg]) measures direct-threaded dispatch (each handler ends with its own indirect jump) at 1.5-2x over switch dispatch on the same hardware, mostly from BTB locality: the predictor learns common bigrams (LOAD_CONST -> ADD) per handler. Wren reports 5-10% in a much smaller op set. CPython 3.11+ moved from switch to computed goto (USE_COMPUTED_GOTOS) and reported significant wins; CPython 3.13 is experimenting with a tail-call interpreter ([CPython-tailcall]).

Mochi cannot have any of those primitives directly. The Go specification has no labels-as-values, no goto *ptr, and no [[clang::musttail]]. The Go compiler's escape analysis and bounds-checker also fight the kind of hand-tuned loop that wins in C. So the question this MEP answers is: what is the ceiling for a portable Go interpreter, and how do we approach it?

Specification

Constraints of the Go host

Three properties of Go shape every design choice below.

  1. No labels-as-values. goto only takes statically-known label targets. The closest equivalent is switch over a dense integer; the Go compiler emits a single jump-table indirect jump for that switch, recovering most (but not all) of computed-goto behavior. Wren in C gets 5-10% from computed gotos because each handler gets its own indirect jump site; the Go switch still funnels every dispatch through one shared site.
  2. No guaranteed tail calls. Function-per-op designs that thread via tail calls (Wasm3, CPython 3.13 tier-2) require the compiler to compile return f(...) to jmp f. The Go compiler does not. A function-per-op dispatcher in Go pays the call/return overhead every opcode, which is worse than the switch.
  3. Bounds checks and escape analysis. Indexing regs[i] or code[ip] inserts a bounds check. The Go compiler eliminates the check when it can prove the index is in range; the dispatch loop must be written so it can prove it. Similarly, taking the address of a local can force a heap allocation; the loop must keep its hot variables in registers, not on the heap.

Phase A: Dispatch loop discipline

The first PR of this MEP rewrites the existing 1728-line vm_eval.go switch loop with no algorithmic change, only reshaping it into the layout the Go compiler can optimize.

  1. Opcode type narrowing. Op is already uint8, which is good. Confirm via go tool objdump that the switch compiles to a jump table, not a chain of compares. Today's loop is a single big switch; verify it.
  2. Hoist hot state. Lift ip, code (the []Instr slice), and regs (the register array) into local variables at the top of the loop and write them back to the frame only on call/return. Today's loop reads fr.fn.Code[fr.ip] every iteration, which is two pointer dereferences and a bounds check; after the change it reads ins := code[ip]. Record a benchstat baseline before the change and verify a measurable win (expect 5-15% on iter_sum and fib).
  3. Bounds-check elimination. Where the compiler cannot prove ip < len(code), add a single check at the top of the loop and rely on the proof inside. Use _ = code[len(code)-1] style asserts where helpful.
  4. Eliminate per-iteration allocation. Audit every handler for accidental heap allocation. interface{} conversions are the usual culprit. The numeric path (OpAddInt, OpAddFloat) must allocate nothing.
  5. Frame stack reuse. Today's stack []*frame grows and shrinks per call. Use a sync.Pool for *frame and a slab for regs. (Pool detail is covered by MEP-20; this MEP just notes the dependency.)

Expected gain from Phase A alone: 1.3-1.8x geometric mean on the MEP-17 suite, based on the gap between idiomatic and tuned Go interpreters in the literature.

Phase B: Superinstructions

Once Phase A lands, profile the bench suite (go test -cpuprofile) and identify the top ten bytecode bigrams. Typical winners in Lua-family VMs ([Casey-superinstr]):

  • OpConst followed by OpAdd (loading a literal and adding it)
  • OpGetGlobal followed by OpCall
  • OpLoad (register move) followed by any binary op
  • OpJumpIfFalse after OpLess (the loop-back compare)

Each fused pair becomes a single opcode (e.g. OpAddConstInt). The compiler emits the fused opcode when the source pattern is detected during bytecode emission. The interpreter gains a handler that does both operations without a dispatch in between.

The implementation cost is one new opcode and one match in the compiler per fusion. The risk is opcode-space inflation; the budget is 32 fused opcodes (the 107-op set has headroom to 256 before the uint8 overflows). New fusions land only with bench numbers showing a measurable improvement on at least one program.

Expected gain from Phase B: 1.1-1.3x on top of Phase A, concentrated in tight loops.

Phase C: Bytecode rewriting infrastructure

The third PR of this MEP does not change the opcode set or the loop. It adds the ability to rewrite a bytecode instruction in place at runtime. This is the foundation MEP-19 (specialization and inline caches) requires: when the interpreter observes that BINARY_OP always sees two ints, it should be able to rewrite the instruction to BINARY_OP_INT without recompiling the function.

The mechanism:

  • Instr gains an optional Quick uint8 byte that holds the quickened opcode. Initially zero.
  • The dispatch loop reads ins.Op (the original), unless ins.Quick != 0, in which case it dispatches on Quick. The check is a single compare-and-branch per instruction.
  • Handlers may patch ins.Quick = OpXyzInt when they observe stable types. The patch is a single byte write; no synchronization is needed in the single-goroutine VM model. A future concurrent VM (MEP-22 or later) would revisit.

This phase ships with no quickened opcodes. It is purely the plumbing.

Expected gain from Phase C alone: 0%. The gain shows up in MEP-19.

Phase D (deferred): unsafe interpreter loop

A future MEP may revisit an unsafe.Pointer-based interpreter that bypasses Go's bounds checks and stores ip as a raw pointer into the bytecode. This would close the remaining gap to a hand-tuned C interpreter, at the cost of giving up Go's memory safety in the hot loop. It is not in this MEP because Phase A + B + the MEP-19 specialization should clear the per-suite goal without crossing that line.

Non-goals

  • No JIT in this MEP. Copy-and-patch is MEP-22.
  • No new value representation. That is MEP-20.
  • No inline caches. That is MEP-19.
  • No assembly interpreter. Pall-style hand-coded interpreters require giving up portability across the architectures Go supports.

Status at a glance

Item                                                            Status
Phase A: hot-state hoisting, bounds-check elimination           proposed
Phase A: dispatch jump-table verification                       proposed
Phase A: numeric path is allocation-free                        proposed
Phase B: superinstruction infrastructure (compiler + handlers)  proposed
Phase B: first 8 fused opcodes                                  proposed
Phase C: quickening byte on Instr, dispatch indirection         proposed
Phase D: unsafe interpreter loop                                deferred

Risks

  • The Go compiler may not emit a jump table. A 107-arm switch on uint8 should; verify in objdump before claiming any of these wins.
  • Frame allocation churn. Pool integration is in MEP-20, but if it slips, Phase A's measured wins may be partly eaten by GC noise. Run the bench harness with GOGC=off as a sanity check.
  • Quickening invalidation. A function compiled for one input may be re-entered with different types after quickening. MEP-19 must explain when quickened ops fall back; this MEP just provides the slot.

Implementation notes

Phase A is one PR, two engineer-weeks. Each rewrite of a handler ships with benchstat numbers against the MEP-17 suite. Phase B is one PR per fusion family (binary-op-with-const, jump-with-compare); each lands only with numbers. Phase C is one PR, plumbing only.

The cumulative ceiling claim (1.5-2x geometric mean) is conservative against the literature. LuaJIT's interpreter gets 2-3x over PUC-Lua's C interpreter, but that is hand-coded assembly. Wren's computed-goto win is 5-10% over its own switch baseline. The Mochi target sits between: more than Wren (because Mochi starts further from optimal) and less than LuaJIT (because Go forecloses the assembly path).

References

  • [Wren-perf] Bob Nystrom, Wren Performance. https://wren.io/performance.html
  • [PEP-659] Mark Shannon, PEP 659: Specializing Adaptive Interpreter. https://peps.python.org/pep-0659/
  • [Ertl-Gregg] Anton Ertl and David Gregg, The Structure and Performance of Efficient Interpreters. JILP 5, 2003. https://www.jilp.org/vol5/v5paper12.pdf
  • [CPython-tailcall] Mark Shannon et al., A more efficient Python interpreter via tail calls. CPython 3.13 dev notes, 2024.
  • [Casey-superinstr] Kevin Casey, M. Anton Ertl, David Gregg, Optimizing Indirect Branch Prediction Accuracy in Virtual Machine Interpreters. ACM TOPLAS 29(6), 2007.
  • [DynASM] Mike Pall, DynASM: A Dynamic Assembler. https://luajit.org/dynasm.html (background on LuaJIT's interpreter approach)
  • [Wasm3] Wasm3, the fastest WebAssembly interpreter. https://github.com/wasm3/wasm3 (tail-call-threaded design)