
MEP 22. Baseline JIT via Copy-and-Patch (Deferred)

MEP: 22
Title: Baseline JIT via Copy-and-Patch
Author: Mochi core
Status: Deferred
Type: Informational
Created: 2026-05-16

Abstract

This MEP describes Mochi's planned baseline JIT and explicitly defers its implementation. The point of writing it now is to make the design space visible to readers of MEPs 17 through 21, so they can see why the interpreter MEPs are sufficient for the next 18-24 months and which JIT technology Mochi will adopt when it becomes worth the engineering cost.

The chosen technology is copy-and-patch compilation (Xu and Kjolstad, OOPSLA 2021, [Copy-Patch]), the same approach the experimental CPython 3.13+ JIT uses ([PEP-744]). Copy-and-patch generates code by memcpy-ing pre-compiled "stencils" into a code buffer and patching the immediates. There is no register allocator, no instruction selector, and no optimizer at runtime; all of that work happens at build time when the stencils are produced by LLVM. The runtime is a few hundred lines of Go.

This is a baseline tier. The interpreter remains the cold tier (MEPs 18 to 20). A function tiers up to the JIT after a hot threshold; the JIT generates code that performs the same operations the interpreter would. It does not specialize, inline, or eliminate guards. That work is reserved for a hypothetical optimizing tier (a separate, future MEP), which is not currently planned.

Motivation

Three questions decide whether to add a JIT:

  1. Has the interpreter been pushed to its limit? If not, a JIT is premature; interpreter wins are cheaper per unit of speedup. Mochi has not approached this limit. MEPs 18 (dispatch), 19 (specialization + ICs), and 20 (value layout) are projected to compound to a cumulative 2.7-6x speedup (see "Why this is deferred") before a JIT becomes necessary.
  2. Are workloads long-running enough to amortize compile cost? Mochi targets scripting, agent loops, and AI tools. Some workloads are short-lived (CLI invocations); some are long-lived (agent reactors, query servers). For the long-lived case, a JIT is justified eventually. For the short-lived case, compile latency dominates and a JIT must keep compile cost low.
  3. Can the team afford the maintenance cost? Optimizing JITs (TurboFan, Maglev, LuaJIT) are multi-person-year investments. Baseline JITs (Sparkplug, JSC's Baseline JIT, copy-and-patch) are weeks-to-months of work. Mochi can plan for the second class but not the first.

Copy-and-patch hits all three constraints: it is cheaper to build than a register allocator, its compile cost is bounded by memcpy speed (Xu and Kjolstad report 1-2 GB/s on commodity hardware), and it gives meaningful speedups (2-5x over an interpreter at near-zero startup cost). The CPython team chose it for the same combination of reasons ([PEP-744]).

The alternative paths considered and rejected for Mochi:

  • Trace-based JIT (LuaJIT). Highest ceiling, multi-year build, requires deopt/OSR infrastructure and a custom backend. Rejected because the ceiling is unreachable given engineering capacity, and the interpreter MEPs already capture most of the value.
  • Method-based optimizing JIT (V8 TurboFan). Higher ceiling than baseline, much more complex than copy-and-patch. Rejected for the same reason.
  • Sparkplug-style single-pass codegen. Comparable speedup to copy-and-patch but requires writing an assembler for each target ISA. Copy-and-patch sidesteps this by reusing LLVM at build time.
  • Truffle / partial evaluation. Requires a Graal-equivalent in Go, which does not exist.

Specification (sketch)

The specification is intentionally a sketch. A real proposal will write the details after MEPs 18-21 land and the bench harness shows where the interpreter ceiling is.

Build-time stencil generation

For each Op in the bytecode set, a hand-written C function implements the opcode's behavior. The C is compiled by LLVM with pinned calling-convention and position-independent-code settings so the extracted bytes are relocatable. A custom build script extracts each function's object code into a stencil of bytes plus a relocation table. Stencils are checked into runtime/vm/jit/stencils/<arch>/. The build script runs once per release; the runtime never invokes LLVM.
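
To make the extraction concrete, here is a minimal sketch of the section walk, assuming the opcode bodies are compiled with -ffunction-sections into a single ELF object; the names and output format are illustrative, not the proposed tooling (Mach-O targets would use debug/macho instead).

    package main

    import (
        "debug/elf"
        "fmt"
        "log"
        "os"
        "strings"
    )

    func main() {
        // Open the LLVM-produced object file, e.g. stencils.o.
        f, err := elf.Open(os.Args[1])
        if err != nil {
            log.Fatal(err)
        }
        defer f.Close()
        for _, sec := range f.Sections {
            // With -ffunction-sections, each opcode body lands in its
            // own .text.<name> section, so extraction is a section walk.
            name, ok := strings.CutPrefix(sec.Name, ".text.")
            if !ok {
                continue
            }
            code, err := sec.Data()
            if err != nil {
                log.Fatal(err)
            }
            // The matching .rela.text.<name> section (not walked here)
            // lists the holes the runtime must patch: constants,
            // frame-slot offsets, and branch targets.
            fmt.Printf("stencil %-20s %4d bytes\n", name, len(code))
        }
    }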

The stencil set is large but mechanical: roughly one stencil per opcode per type-specialization variant. For Mochi's 107 base + ~32 quickened opcodes, that is ~140 stencils per architecture (x86_64, arm64). Stencil compile time at the LLVM step is measured in minutes; the artifacts are byte arrays in the repo.

Runtime code generation

When a function tiers up (a minimal Go sketch of the flow follows the list):

  1. Allocate a writable, not-yet-executable page via syscall.Mmap with PROT_READ|PROT_WRITE (or VirtualAlloc with PAGE_READWRITE on Windows).
  2. For each instruction in the function, memcpy the matching stencil into the page.
  3. Apply relocations: patch immediates (register offsets, constants, branch targets) into the copied stencil.
  4. Mark the page executable and read-only via mprotect(PROT_READ|PROT_EXEC) (W^X discipline).
  5. Update the function's entry slot to point at the generated code instead of the interpreter trampoline.
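
A minimal Unix-only sketch of steps 1-4, assuming a hypothetical Stencil type produced by the build step; page-size rounding and bounds checks are elided for brevity.

    package jit

    import (
        "encoding/binary"
        "syscall"
    )

    // Reloc marks an 8-byte hole in a stencil's code that must be patched
    // with an immediate (a constant, a frame-slot offset, a branch target).
    type Reloc struct{ Offset int }

    // Stencil is the pre-compiled object code for one opcode, produced at
    // build time by LLVM and checked into the repo as a byte array.
    type Stencil struct {
        Code   []byte
        Relocs []Reloc
    }

    // emit copies one stencil per instruction into a fresh mapping and
    // patches immediates; imms[i] holds one value per relocation of
    // stencils[i].
    func emit(stencils []Stencil, imms [][]uint64) ([]byte, error) {
        size := 0
        for _, s := range stencils {
            size += len(s.Code)
        }
        // Step 1: writable, not yet executable (W^X).
        buf, err := syscall.Mmap(-1, 0, size,
            syscall.PROT_READ|syscall.PROT_WRITE,
            syscall.MAP_PRIVATE|syscall.MAP_ANON)
        if err != nil {
            return nil, err
        }
        // Steps 2-3: memcpy each stencil, then patch its holes in place.
        pos := 0
        for i, s := range stencils {
            copy(buf[pos:], s.Code)
            for j, r := range s.Relocs {
                binary.LittleEndian.PutUint64(buf[pos+r.Offset:], imms[i][j])
            }
            pos += len(s.Code)
        }
        // Step 4: flip to read+execute before anything can jump here.
        if err := syscall.Mprotect(buf, syscall.PROT_READ|syscall.PROT_EXEC); err != nil {
            return nil, err
        }
        return buf, nil
    }

Step 5 is then a single pointer store into the function's entry slot once emit returns.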

Code generation cost is O(instruction count) memcpys plus the patches. On a 1000-instruction function it is well under a millisecond. The trade-off is that the generated code is roughly interpreter-equivalent: no register allocation, each opcode is its own self-contained block with stack-passed operands.

Tier-up heuristic

Functions tier up after their entry counter exceeds a threshold (initial proposal: 10,000 calls). The counter lives on the Function struct and increments on entry. When it crosses the threshold, the VM enqueues the function for JIT compilation; subsequent calls run the JITted code.
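
A sketch of the counter check on the call path; the Function and VM shapes here are assumptions, not the structs the interpreter MEPs define.

    package vm

    const tierUpThreshold = 10000 // initial proposal; tune against the bench suite

    type Frame struct{} // stand-in for the interpreter frame

    type Function struct {
        entryCount int
        entry      func(*Frame) // interpreter trampoline, swapped for JIT code later
    }

    type VM struct {
        jitQueue chan *Function // drained by a background compile goroutine
    }

    func (m *VM) call(fn *Function, f *Frame) {
        fn.entryCount++
        if fn.entryCount == tierUpThreshold {
            m.jitQueue <- fn // enqueue exactly once, at the crossing
        }
        fn.entry(f) // keeps running the interpreter until the compile lands
    }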

The counter and tier-up logic interact with MEP-19 quickening: quickened bytecode produces better stencils (OpAdd_II is a tighter stencil than OpAdd). The JIT consumes the quickened bytecode after the function has warmed up in the interpreter. This is the same staging V8 uses (Ignition warmup feeds Sparkplug).

Deoptimization

Copy-and-patch JIT code can deoptimize back to the interpreter at any instruction boundary because the stack frame layout is identical to the interpreter frame. (This is the Sparkplug trick: keep frames compatible, no on-stack replacement gymnastics.) When a quickened type-check fails inside generated code, control returns to the interpreter at the same ip, which re-quickens or stays generic. Mochi inherits this design choice directly from Sparkplug ([V8-Sparkplug]).
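
A sketch of the bailout protocol this implies, assuming generated code reports the bytecode ip of a failed guard rather than unwinding; all names are hypothetical.

    package vm

    type Frame struct{} // identical layout for interpreter and JIT frames

    type Function struct {
        // jitCode runs the function and returns -1 on a normal return,
        // or the bytecode ip at which a quickened guard failed.
        jitCode func(*Frame) int
    }

    type VM struct{}

    // interpret is a stub for the bytecode loop, resuming at ip.
    func (m *VM) interpret(fn *Function, f *Frame, ip int) {}

    func (m *VM) run(fn *Function, f *Frame) {
        if fn.jitCode != nil {
            if ip := fn.jitCode(f); ip >= 0 {
                // Guard failed: same frame, same ip, no frame rewriting.
                // The interpreter re-quickens the site or stays generic.
                m.interpret(fn, f, ip)
            }
            return
        }
        m.interpret(fn, f, 0)
    }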

Oracle harness

MEP-21 Layer 4 provides the differential harness that compares JIT output to interpreter output across the test corpus. Any divergence is a P1 bug. This is the only correctness mechanism the JIT relies on; there is no separate JIT test suite, only the interpreter suite run twice (once cold, once after tier-up).
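
The harness shape, as a sketch; runInterp, runJIT, and the corpus are placeholders for the MEP-21 Layer 4 hooks, not a real API.

    package jit_test

    import "testing"

    var corpus []struct {
        Name string
        Src  string
    }

    // Stubs for the harness entry points: one cold run, and one run
    // driven past the tier-up threshold before the observed call.
    func runInterp(src string) string { return "" }
    func runJIT(src string) string    { return "" }

    func TestJITMatchesInterpreter(t *testing.T) {
        for _, p := range corpus {
            cold, hot := runInterp(p.Src), runJIT(p.Src)
            if cold != hot {
                t.Fatalf("%s diverged: interp=%q jit=%q", p.Name, cold, hot)
            }
        }
    }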

Cross-platform notes

  • x86_64 Linux/macOS: straightforward mmap+mprotect flow.
  • arm64 macOS: Apple Silicon requires MAP_JIT and per-thread pthread_jit_write_protect_np calls; a sketch of the write-protect toggle follows this list. The flow is well-documented in the JSC and V8 sources.
  • Windows: VirtualAlloc with PAGE_READWRITE, then VirtualProtect to PAGE_EXECUTE_READ once emission is done. CFG / hardware-enforced stack protection may need carve-outs.
  • Other Go targets: The JIT ships disabled by default; the interpreter remains the only required path. Adding a target is one architecture's worth of stencils.
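
The Apple Silicon toggle is per-thread and needs cgo; a sketch under those assumptions (the caller must hold runtime.LockOSThread so the emit happens on the toggled thread).

    //go:build darwin && arm64

    package jit

    /*
    #include <pthread.h>
    */
    import "C"

    // withJITWrite brackets the emit step with Apple Silicon's per-thread
    // write protection; the code pages must have been mapped with MAP_JIT
    // for the toggle to apply.
    func withJITWrite(emit func()) {
        C.pthread_jit_write_protect_np(0)       // this thread: writable
        defer C.pthread_jit_write_protect_np(1) // back to executable
        emit()
    }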

Build-tag gate

The JIT is gated behind -tags jit. Default builds use the interpreter only. This means the JIT cannot break a default build, and a user can opt in if they have a long-running workload that benefits.
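
The gate itself is ordinary build-tag plumbing; the file names here are illustrative.

    // jit_on.go — compiled only with `go build -tags jit`.

    //go:build jit

    package vm

    const jitEnabled = true

    // jit_off.go — the default build; the JIT cannot break it.

    //go:build !jit

    package vm

    const jitEnabled = false

In the default build, code that checks jitEnabled compiles to a constant-false branch the Go compiler eliminates, so the interpreter-only binary carries no JIT code.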

Why this is deferred

The interpreter MEPs (18, 19, 20) are projected to deliver:

  • MEP-18 (dispatch): 1.5-2x geometric mean on the bench suite.
  • MEP-19 (specialization + ICs): another 1.5-2x on type-stable code.
  • MEP-20 (value layout): another 1.2-1.5x, mostly from reduced allocation.

Compounded, that is 2.7x (1.5 × 1.5 × 1.2) to 6x (2 × 2 × 1.5) without leaving the interpreter. Real-world dynamic-language interpreters in this performance class (CPython 3.13, Wren, the LuaJIT interpreter) are within a small constant factor of a baseline JIT for typical workloads.

The decision to start the JIT is data-driven: once the bench suite from MEP-17 shows interpreter improvements plateauing at less than 5% per MEP, and at least one production Mochi workload has measurable headroom against the interpreter ceiling, this MEP gets revisited and promoted from Deferred to Draft.

Open questions for the future draft

  • The stencil generator's place in the build. Is it a go:generate step or a separate make target? Which LLVM version is pinned? How are stencils versioned against opcode-set changes?
  • Code-cache eviction. The current design assumes JIT code lives as long as the function. Long-running agent processes that load many functions may need an LRU eviction policy on the code cache.
  • Interaction with future concurrency. A goroutine-parallel VM (not currently planned) would need per-goroutine entry counters and JIT-state-aware tier-up.
  • Comparative numbers. Once a prototype exists, what is the speedup over MEP-18+19+20? If less than 1.5x, the JIT is not worth shipping. If more than 3x, it justifies a follow-up optimizing tier.
  • Safety story for unsafe.Pointer and W^X. Go is a memory-safe language; a JIT that writes executable pages is by definition outside that safety boundary. The code cache must be the only place where Mochi gives that up, and the boundary must be documented.

References

  • [Copy-Patch] Haoran Xu, Fredrik Kjolstad, Copy-and-Patch Compilation: A fast compilation algorithm for high-level languages and bytecode. OOPSLA 2021. https://arxiv.org/abs/2011.13127
  • [PEP-744] Brandt Bucher, PEP 744: JIT Compilation. https://peps.python.org/pep-0744/
  • [V8-Sparkplug] Leszek Swirski, Sparkplug: a non-optimizing JavaScript compiler. https://v8.dev/blog/sparkplug
  • [V8-Maglev] V8 team, Maglev: V8's Fastest Optimizing JIT. https://v8.dev/blog/maglev
  • [JSC-LLInt] WebKit team, Introducing the LLInt: the WebKit JavaScriptCore Low-Level Interpreter. https://webkit.org/blog/189/
  • [YJIT] Maxime Chevalier-Boisvert et al., YJIT: A Basic Block Versioning JIT Compiler for Ruby. MPLR 2023.
  • [Deegen] Haoran Xu et al., Deegen: A high-performance language-VM generator from a declarative spec. PLDI 2024.
  • [Truffle] Thomas Würthinger et al., One VM to rule them all. Onward! 2013.
  • [Hölzle-Deopt] Urs Hölzle, Craig Chambers, David Ungar, Debugging Optimized Code with Dynamic Deoptimization. PLDI 1992 (deopt foundations).