MEP 22. Baseline JIT via Copy-and-Patch (Deferred)
| Field | Value |
|---|---|
| MEP | 22 |
| Title | Baseline JIT via Copy-and-Patch |
| Author | Mochi core |
| Status | Deferred |
| Type | Informational |
| Created | 2026-05-16 |
Abstract
This MEP describes Mochi's planned baseline JIT and explicitly defers its implementation. The point of writing it now is to make the design space visible to readers of MEPs 17 through 21, so they can see why the interpreter MEPs are sufficient for the next 18-24 months and which JIT technology Mochi will adopt when it becomes worth the engineering cost.
The chosen technology is copy-and-patch compilation (Xu and Kjolstad, OOPSLA 2021, [Copy-Patch]), the same approach the CPython 3.13+ experimental JIT uses ([PEP-744]). Copy-and-patch generates code by memcpy-ing pre-compiled "stencils" into a code buffer and patching the immediates. There is no runtime register allocator, no instruction selector, and no optimizer; all of that work happens at build time when the stencils are produced by LLVM. The runtime is a few hundred lines of Go.
This is a baseline tier. The interpreter remains the cold tier (MEPs 18 to 20). A function tiers up to the JIT after a hot threshold; the JIT generates code that performs the same operations the interpreter would. It does not specialize, inline, or eliminate guards. That work is reserved for a hypothetical optimizing tier (a separate, future MEP), which is not currently planned.
Co-design with MEP-21. The interpreter's typed bytecode (MEP-21) is the JIT's input. Because the type checker has already resolved operand tags, the stencil set is type-monomorphic by construction. Each stencil is a few instructions of straight-line code with no speculation, no inline cache lookup, and no tag dispatch. This is the property that lets a copy-and-patch baseline approach optimizing-JIT quality on type-stable workloads, the same effect Wasmtime achieves on typed Wasm input via Cranelift's "no IR optimizations" path.
Motivation
Three questions decide whether to add a JIT:
- Has the interpreter been pushed to its limit? If not, a JIT is premature; interpreter wins are cheaper per unit speedup. Mochi has not approached this limit. MEPs 18 (dispatch), 19 (specialization + ICs), 20 (value layout), and 21 (typed emit) are projected to give a cumulative speedup of roughly 3.5-8x (the product of the per-MEP estimates listed under "Why this is deferred") before a JIT is necessary.
- Are workloads long-running enough to amortize compile cost? Mochi targets scripting, agent loops, and AI tools. Some workloads are short-lived (CLI invocations); some are long-lived (agent reactors, query servers). For the long-lived case, a JIT is justified eventually. For the short-lived case, compile latency dominates and a JIT must keep compile cost low.
- Can the team afford the maintenance cost? Optimizing JITs (TurboFan, Maglev, LuaJIT, HotSpot C2) are multi-person-year investments. Baseline JITs (Sparkplug, JSC Baseline, copy-and-patch, Cranelift baseline mode) take weeks to months. Mochi can plan for the second class but not the first.
Copy-and-patch hits all three constraints: it is cheaper to build than a register allocator, its compile cost is bounded by memcpy speed (Xu and Kjolstad report 1-2 GB/s on commodity hardware), and it gives meaningful speedups (2-5x over an interpreter at near-zero startup cost). The CPython team chose it for the same combination of reasons ([PEP-744]).
Why typed bytecode (MEP-21) changes the JIT's job
A JIT for a dynamic source language has to do two unrelated jobs: (a) translate bytecode to machine code, and (b) speculate on operand types because the front end could not pin them. Job (b) is where most of the engineering goes in V8 TurboFan, JSC DFG, and HotSpot C2. It demands an SSA IR, type lattice, escape analysis, OSR exit metadata, side-table maintenance, and a deoptimizer.
Mochi's front end (with MEP-21 in place) does job (b) once, at compile time. The JIT only has to do job (a). That puts Mochi in the same architectural neighborhood as:
- Wasmtime/Cranelift baseline mode. Input is already typed (`i32.add` vs `f64.add` are distinct opcodes); the baseline tier is a fast translator with no speculation.
- ART-OAT (Android). dex bytecode carries type info; the AOT compiler is a single-pass code emitter for the common path.
- Sparkplug. Consumes Ignition bytecode after warmup so quickened ops are type-stable; the baseline emits one fixed instruction sequence per op.
- HotSpot C1 (client compiler). Single-pass, frame-compatible with the interpreter, no speculation beyond what the verifier already proved.
None of these baselines need a deoptimizer because the input cannot lie about types. Mochi's JIT inherits the same property.
Why a JIT at all if the interpreter is fast
Even with MEPs 18 to 21, the interpreter's per-instruction cost is dominated by the indirect dispatch branch, the operand decode, and the bounds check on fr.Regs[...]. A copy-and-patch JIT collapses these into straight-line code: dispatch becomes fall-through, decode becomes a load-immediate, and the bounds check becomes a constant compare the CPU folds. The 2-5x figure from the copy-and-patch paper and the CPython 3.13 prototype reflects exactly this collapse. There is no architectural trick that lets an interpreter close that gap without becoming a JIT.
Alternatives considered and rejected for Mochi
| Alternative | Ceiling | Engineering cost | Why rejected |
|---|---|---|---|
| Trace JIT (LuaJIT) | Very high (often within 2x of C) | Multi-year. Needs trace recorder, SSA IR, register allocator, side-exit infra, custom backend per arch. | Ceiling is unreachable given engineering capacity. The trace recorder duplicates work the type checker already does. |
| Method-based optimizing JIT (V8 TurboFan, JSC FTL, HotSpot C2) | High | Multi-year. SSA IR, escape analysis, inliner, type lattice, full deopt. | Same. Mochi has no source of profile-only type info to exploit, because the source language is typed. |
| Maglev (V8 mid-tier) | Medium-high | Person-years. Single-pass-ish but still needs IR + dataflow + ICs. | Closer to the right neighborhood, but still SSA-heavy. Copy-and-patch with typed input matches the steady-state numbers at a fraction of the LOC. |
| Sparkplug-style single-pass codegen | Medium (2-3x) | Months. One hand-written assembler per ISA. | Comparable result to copy-and-patch but requires writing the assembler. Copy-and-patch reuses LLVM at build time and ships byte arrays. |
| Cranelift (Wasmtime) | Medium-high | Existing library, but with a Rust dependency and a heavy IR. | Would force a Rust runtime dependency. Copy-and-patch keeps the runtime pure Go. |
| Truffle / partial evaluation (Graal) | Very high | Requires a Graal-equivalent in Go. | Does not exist. |
| Method JIT in pure Go via cgo to LLVM | High | LLVM at runtime is huge. | Compile latency contradicts a short-running CLI workload. |
Copy-and-patch is the unique design point that gives baseline-tier speedups without a runtime IR or runtime assembler.
Specification (sketch)
The specification is intentionally a sketch. A real proposal will write the details after MEPs 18-21 land and the bench harness shows where the interpreter ceiling is.
Tier architecture
Mochi would ship two tiers in production:
[ Interpreter (vm2, typed bytecode) ]
| 10,000-call threshold
v
[ Baseline JIT (copy-and-patch) ]
Compare against the industry:
| Project | Tiers |
|---|---|
| Mochi (planned) | Interpreter, Baseline |
| HotSpot | Interpreter, C1, C2 |
| V8 | Ignition, Sparkplug, Maglev, TurboFan |
| JSC | LLInt, Baseline, DFG, FTL |
| CoreCLR | Tier 0 RyuJIT, Tier 1 RyuJIT (with opts) |
| Dart | Interpreter (mobile), AOT (production) |
| LuaJIT | Interpreter, Trace JIT |
| Wasmtime | Cranelift Baseline, Cranelift Optimizing |
| CPython 3.13+ | Specializing Interpreter, Copy-Patch JIT (experimental) |
The four-tier designs (V8, JSC) earn their tier count because their source language is dynamically typed and each tier serves a different speculation budget. Mochi does not need that. The two-tier design matches Dart AOT, CoreCLR Tier 0/1, and the planned CPython architecture, all of which serve typed or quasi-typed source.
A third, optimizing tier is not on the roadmap. If the benchmark gap after Baseline ships is wide enough to justify it, the natural choice would be Cranelift via cgo or a hand-written method JIT on top of an SSA-friendly IR for Mochi (which does not yet exist). Both decisions are deferred.
Build-time stencil generation
For each Op in the bytecode set, a hand-written C function implements the opcode's behavior. The C is compiled by LLVM with specific calling conventions (one Go-stack-pointer in, one frame pointer in, no callee-saved register reliance) and PIC settings. A custom build script extracts each function's object code into a stencil of bytes plus a relocation table. Stencils are checked into runtime/vm2/jit/stencils/<arch>/. The build script runs once per release; the runtime never invokes LLVM.
The stencil set is large but mechanical: roughly one stencil per typed opcode. With MEP-21's typed surface (Add_II, Add_FF, Sub_II, ..., Index_List, Index_Map, ...) the count is around 80 stencils per architecture (x86_64, arm64). Stencil compile time at the LLVM step is in minutes; the artifacts are byte arrays in the repo. Generic opcodes (OpAdd, OpAdd_II's slow path) also get stencils, since MEP-19 quickening still applies to the residual polymorphic surface.
This is the same model Deegen ([Deegen]) uses to generate the LuaJIT-Remake interpreter and baseline JIT from a single C source. Mochi does not adopt Deegen wholesale (it is a research artifact) but borrows the build-time-LLVM, runtime-no-LLVM split.
Runtime code generation
When a function tiers up:
- Allocate an executable page via `syscall.Mmap` with `PROT_EXEC` (or `VirtualAlloc` with `PAGE_EXECUTE_READWRITE` on Windows).
- For each instruction in the function, `memcpy` the matching stencil into the page.
- Apply relocations: patch immediates (register offsets, constants, branch targets) into the copied stencil.
- Mark the page executable and read-only via `mprotect(PROT_READ|PROT_EXEC)` (W^X discipline).
- Update the function's entry slot to point at the generated code instead of the interpreter trampoline.
Code generation cost is O(instruction count) memcpys plus the patches. On a 1000-instruction function it is well under a millisecond. The trade-off is that the generated code is roughly interpreter-equivalent: no register allocation, each opcode is its own self-contained block with stack-passed operands.
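The copy and patch steps can be sketched in a few lines of Go. This is a minimal illustration, not the planned runtime: the `Hole`, `Stencil`, and `Instr` types and the `Emit` function are hypothetical names, the stencil bytes are made up, and a plain byte slice stands in for the mmap'd code page so the example runs anywhere.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// Hole marks a 4-byte little-endian immediate inside a stencil.
// Hypothetical layout: a real relocation table would carry more kinds
// (branch targets, 64-bit constants, PC-relative fixups).
type Hole struct {
	Offset int // byte offset of the immediate within Stencil.Code
	Arg    int // index of the instruction operand that fills it
}

// Stencil is pre-compiled machine code plus its relocation table.
type Stencil struct {
	Code  []byte
	Holes []Hole
}

// Instr is a decoded bytecode instruction (hypothetical shape).
type Instr struct {
	Op   int
	Args []uint32
}

// Emit appends one patched stencil per instruction to buf.
func Emit(buf []byte, stencils map[int]Stencil, prog []Instr) ([]byte, error) {
	for _, in := range prog {
		st, ok := stencils[in.Op]
		if !ok {
			return nil, fmt.Errorf("no stencil for op %d", in.Op)
		}
		base := len(buf)
		buf = append(buf, st.Code...) // the "copy" step
		for _, h := range st.Holes {  // the "patch" step
			if h.Arg >= len(in.Args) || h.Offset+4 > len(st.Code) {
				return nil, fmt.Errorf("bad hole in stencil for op %d", in.Op)
			}
			binary.LittleEndian.PutUint32(buf[base+h.Offset:], in.Args[h.Arg])
		}
	}
	return buf, nil
}

func main() {
	// A made-up 6-byte stencil with one 32-bit hole at offset 2.
	stencils := map[int]Stencil{
		1: {Code: []byte{0x90, 0x90, 0, 0, 0, 0}, Holes: []Hole{{Offset: 2, Arg: 0}}},
	}
	out, err := Emit(nil, stencils, []Instr{{Op: 1, Args: []uint32{0x11223344}}})
	fmt.Printf("% x %v\n", out, err)
}
```

The loop does no analysis of the program: each instruction costs one append and a handful of 4-byte stores, which is where the O(instruction count) bound comes from.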
The compile-cost discipline matches Sparkplug (a few microseconds per function), JSC Baseline (similar), and Cranelift Baseline (sub-ms). It is far below the 100ms-to-seconds range of TurboFan/C2/FTL, which is what makes a "JIT for short-lived programs" viable at all.
Tier-up heuristic
Functions tier up after their entry counter exceeds a threshold (initial proposal: 10,000 calls). The counter lives on the Function struct and increments on entry. When it crosses the threshold, the VM enqueues the function for JIT compilation; subsequent calls run the JITted code.
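A minimal sketch of the entry counter follows, assuming a single-threaded VM. The field names, the `OnEntry` method, and the `queued`/`compiled` flags are illustrative and not part of the existing Function struct; only the 10,000-call threshold comes from the proposal.

```go
package main

import "fmt"

// tierUpThreshold is the initial proposal from this MEP.
const tierUpThreshold = 10_000

// Function carries the per-function tiering state (hypothetical fields).
type Function struct {
	Name     string
	calls    uint32
	queued   bool // already handed to the JIT queue
	compiled bool // entry slot already points at generated code
}

// OnEntry bumps the counter and reports whether this call is the one
// that should enqueue the function for compilation. It fires once.
func (f *Function) OnEntry() bool {
	if f.queued || f.compiled {
		return false
	}
	f.calls++
	if f.calls > tierUpThreshold {
		f.queued = true
		return true
	}
	return false
}

func main() {
	f := &Function{Name: "hot"}
	for i := 1; i <= 10_002; i++ {
		if f.OnEntry() {
			fmt.Println("enqueue on call", i)
		}
	}
}
```

Checking `queued` before incrementing keeps the counter from wrapping and guarantees a function is enqueued only once.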
The counter and tier-up logic interact with MEP-19 quickening: quickened bytecode produces better stencils (OpAdd_II is a tighter stencil than OpAdd). The JIT consumes the typed-and-quickened bytecode after the function has warmed up in the interpreter. This is the same staging V8 uses (Ignition warmup feeds Sparkplug). When MEP-21 lands, most ops are typed at emit time, so the warmup-feeds-stencil-quality interaction becomes a smaller win (the typed op was already there); the counter still gates JIT entry to avoid compiling cold code.
Deoptimization
Copy-and-patch JIT code can deoptimize back to the interpreter at any instruction boundary because the stack frame layout is identical to the interpreter frame. (This is the Sparkplug trick: keep frames compatible, no on-stack replacement gymnastics.) When a quickened type-check fails inside generated code (only possible on the residual polymorphic surface that MEP-21 left to quickening), control returns to the interpreter at the same ip, which re-quickens or stays generic. Mochi inherits this design choice directly from Sparkplug ([V8-Sparkplug]) and JSC Baseline ([JSC-Baseline]).
Critically, typed sites do not need a deopt path. OpAdd_II emitted by the compiler (MEP-21) is unconditionally an int-int add; the verifier proved both operands carry TagInt. The corresponding stencil contains no tag-check and no exit. This is the same reason Wasm baseline tiers have no deopt and HotSpot C1 has only the small set of explicit-exception exits the JVM spec requires.
The deopt mechanism that does exist services:
- side-exits on residual quickened sites (rare under MEP-21),
- exceptions / panics inside the generated code (jumped to a per-function landing pad that returns to the interpreter),
- debugger / profiler triggers (future).
The minimal deopt footprint here is the Hölzle-Chambers-Ungar model ([Hölzle-Deopt]) reduced to its irreducible core.
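The frame-compatibility argument can be illustrated with a toy model in Go. Everything here is a stand-in: `Frame`, `ExitKind`, the two opcodes, and `fakeJIT` are hypothetical, and a Go function plays the role of generated machine code. The point it shows is that a side exit only has to store the bytecode ip in the shared frame before handing control back.

```go
package main

import "fmt"

// Frame is shared, unchanged, by the interpreter and generated code.
type Frame struct {
	Regs []int64
	IP   int // bytecode instruction pointer
}

type ExitKind int

const (
	ExitReturn ExitKind = iota // function finished in generated code
	ExitDeopt                  // resume in the interpreter at Frame.IP
)

type Op int

const (
	OpInc    Op = iota // Regs[0]++
	OpDouble           // Regs[0] *= 2
)

// interpret runs prog from fr.IP to the end.
func interpret(prog []Op, fr *Frame) {
	for fr.IP < len(prog) {
		switch prog[fr.IP] {
		case OpInc:
			fr.Regs[0]++
		case OpDouble:
			fr.Regs[0] *= 2
		}
		fr.IP++
	}
}

// fakeJIT stands in for generated code that executes the first two
// instructions and then takes a side exit at bytecode ip 2.
func fakeJIT(fr *Frame) ExitKind {
	fr.Regs[0]++
	fr.Regs[0]++
	fr.IP = 2
	return ExitDeopt
}

// run tries generated code first and falls back to the interpreter
// on the same frame, with no state translation in between.
func run(prog []Op, fr *Frame, jit func(*Frame) ExitKind) {
	if jit != nil && jit(fr) == ExitReturn {
		return
	}
	interpret(prog, fr)
}

func main() {
	prog := []Op{OpInc, OpInc, OpDouble, OpInc}
	fr := &Frame{Regs: make([]int64, 1)}
	run(prog, fr, fakeJIT)
	fmt.Println(fr.Regs[0]) // same value the interpreter alone computes
}
```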
Oracle harness
The JIT runs against a differential harness: every program in the test corpus is executed under the interpreter and under the JIT, and outputs must match. Any divergence is a P1 bug. This is the only correctness mechanism the JIT relies on; there is no separate JIT test suite, only the interpreter suite run twice (once cold, once after tier-up). The harness skeleton is gated by -tags jit and lands as part of this MEP, not as a prerequisite.
This is the same correctness story HotSpot has used since 1999: C1 and C2 are both validated by re-running the JVM TCK with each tier forced on, and any divergence is a tier bug.
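A differential harness of this kind is small. The sketch below assumes each tier can be wrapped as a function from program source to captured output; the `Executor` and `Divergence` types and the `Differential` function are hypothetical names, not the harness that would ship under `-tags jit`.

```go
package main

import (
	"bytes"
	"fmt"
)

// Executor runs one program and returns its captured output.
type Executor func(src string) ([]byte, error)

// Divergence records one program on which the two tiers disagreed.
type Divergence struct {
	Program string
	Interp  string
	JIT     string
}

// Differential runs every program under both executors and collects
// the cases where the output or the success/failure status differs.
func Differential(corpus []string, interp, jit Executor) []Divergence {
	var out []Divergence
	for _, p := range corpus {
		a, errA := interp(p)
		b, errB := jit(p)
		if !bytes.Equal(a, b) || (errA == nil) != (errB == nil) {
			out = append(out, Divergence{Program: p, Interp: string(a), JIT: string(b)})
		}
	}
	return out
}

func main() {
	// Stand-in executors; a real harness would invoke the VM twice.
	echo := func(src string) ([]byte, error) { return []byte(src), nil }
	fmt.Println(len(Differential([]string{"a", "b"}, echo, echo)))
}
```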
Inline cache integration
MEP-19's polymorphic ICs become first-class citizens of generated code. Each OpCallV stencil reserves a few words at a fixed offset for the call-site IC. The runtime IC update (on a cache miss) writes to those words directly; the generated code reads from them on the next call. This is the JSC IC model ([JSC-IC]) ported to copy-and-patch.
Resolved calls (the OpCall path that MEP-21 emits when the target is statically known) skip the IC entirely. The stencil contains a direct call instruction whose target is patched at JIT time. The CPU's indirect predictor never gets involved.
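For the unresolved OpCallV path, a call-site cache might look like the following sketch. It models the reserved words as a small Go struct rather than bytes inside a stencil; the two-entry size, the round-robin replacement, and all names (`CallSiteIC`, `Lookup`, `icWays`) are assumptions for illustration, not MEP-19's actual layout.

```go
package main

import "fmt"

const icWays = 2 // entries per call site (assumed, not from MEP-19)

type icEntry struct {
	key    uint64 // identity of the callee seen at this site
	target int    // where to jump, e.g. a code-cache offset
}

// CallSiteIC models the words a stencil would reserve for its cache.
type CallSiteIC struct {
	entries [icWays]icEntry
	used    int
	next    int // round-robin victim once the cache is full
	Misses  int
}

// Lookup returns the cached target for key. On a miss it asks the
// slow path and writes the answer back so the next call hits.
func (ic *CallSiteIC) Lookup(key uint64, slow func(uint64) int) int {
	for i := 0; i < ic.used; i++ {
		if ic.entries[i].key == key {
			return ic.entries[i].target
		}
	}
	ic.Misses++
	t := slow(key)
	if ic.used < icWays {
		ic.entries[ic.used] = icEntry{key, t}
		ic.used++
	} else {
		ic.entries[ic.next] = icEntry{key, t}
		ic.next = (ic.next + 1) % icWays
	}
	return t
}

func main() {
	ic := &CallSiteIC{}
	resolve := func(k uint64) int { return int(k) * 100 }
	for _, k := range []uint64{1, 1, 2, 1} {
		fmt.Println(ic.Lookup(k, resolve))
	}
	fmt.Println("misses:", ic.Misses)
}
```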
Cross-platform notes
- x86_64 Linux/macOS: straightforward `mmap` + `mprotect` flow.
- arm64 macOS: Apple Silicon requires `MAP_JIT` and per-thread `pthread_jit_write_protect_np` calls. The flow is well-documented in the JSC and v8 source.
- Windows: `VirtualAlloc` with `PAGE_EXECUTE_READWRITE` then `VirtualProtect` to drop write. CFG / hardware-enforced stack protection may need carve-outs.
- arm64 instruction cache. Every page transition needs an `__builtin___clear_cache` equivalent before the first execute; Go does not expose this directly, so the runtime calls a tiny cgo helper. (LuaJIT and Wasmtime do the same.)
- Other Go targets: The JIT ships disabled by default; the interpreter remains the only required path. Adding a target is one architecture's worth of stencils.
Build-tag gate
The JIT is gated behind -tags jit. Default builds use the interpreter only. This means the JIT cannot break a default build, and a user can opt in if they have a long-running workload that benefits.
Why this is deferred
The interpreter MEPs (18, 19, 20, 21) are projected to deliver:
- MEP-18 (dispatch): 1.5-2x geometric mean on the bench suite.
- MEP-19 (specialization + ICs): another 1.5-2x on type-stable code.
- MEP-20 (value layout): another 1.2-1.5x, mostly from reduced allocation. (Already validated: vm2 fib(25) ships at ~4.4x runtime/vm and 232,000x less memory churn.)
- MEP-21 (typed emit): another ≥1.3x cold and ≥5% steady on typed code, plus the elimination of cold-start quickening cost on the entire typed corpus.
Compounded multiplicatively, that is roughly 3.5-8x without leaving the interpreter (1.5 × 1.5 × 1.2 × 1.3 ≈ 3.5 at the low end; 2 × 2 × 1.5 × 1.3 ≈ 7.8 at the high end, counting MEP-21 at its 1.3x floor). Real-world dynamic-language interpreters in this performance class (CPython 3.13, Wren, the LuaJIT interpreter) are within a small constant factor of a baseline JIT for typical workloads, and Wasm interpreters (wasm3, wasmi) similarly approach Wasmtime baseline on typed input.
The decision to start the JIT is data-driven: once the bench suite from MEP-17 shows interpreter improvements plateauing at less than 5% per MEP, and at least one production Mochi workload has measurable headroom against the interpreter ceiling, this MEP gets revisited and promoted from Deferred to Draft.
Open questions for the future draft
- Stencil generator's place in the build. Is it a Go `go:generate` step or a separate make target? Which LLVM version is pinned? How are stencils versioned against opcode-set changes?
- Code-cache eviction. The current design assumes JIT code lives as long as the function. Long-running agent processes that load many functions may need an LRU eviction policy on the code cache. JSC uses an eviction scheme keyed on the global GC; Mochi's GC is Go's, which complicates the integration.
- Interaction with future concurrency. A goroutine-parallel VM (not currently planned) would need per-goroutine entry counters and JIT-state-aware tier-up, plus a code cache visible to all goroutines (read-mostly, so a sync.Map is fine).
- Comparative numbers. Once a prototype exists, what is the speedup over MEP-18+19+20+21? If less than 1.5x, the JIT is not worth shipping. If more than 3x, it justifies a follow-up optimizing tier (Cranelift via cgo is the leading candidate).
- Safety story for `unsafe.Pointer` and W^X. Go is a memory-safe language; a JIT that writes executable pages is by definition outside that safety boundary. The code cache must be the only place where Mochi gives that up, and the boundary must be documented. The same argument applies to every Go-hosted JIT (e.g. wazero's compiler mode), so prior art exists.
- Side-table compactness. Each JITted function needs a side table mapping native PC ranges back to bytecode IPs for deopt and stack traces. JSC's BytecodeIndex is the right model.
References
Core technology
- [Copy-Patch] Haoran Xu, Fredrik Kjolstad. Copy-and-Patch Compilation: A fast compilation algorithm for high-level languages and bytecode. OOPSLA 2021. https://arxiv.org/abs/2011.13127
- [PEP-744] Brandt Bucher. PEP 744: JIT Compilation. https://peps.python.org/pep-0744/
- [Deegen] Haoran Xu, Fredrik Kjolstad. Deegen: A high-performance language-VM generator from a declarative spec. PLDI 2024.
Baseline tier prior art
- [V8-Sparkplug] Leszek Swirski. Sparkplug: a non-optimizing JavaScript compiler. https://v8.dev/blog/sparkplug
- [JSC-LLInt] WebKit team. Introducing the LLInt: the WebKit JavaScriptCore Low-Level Interpreter. https://webkit.org/blog/189/
- [JSC-Baseline] Filip Pizlo. Speculation in JavaScriptCore. https://webkit.org/blog/10308/ (covers the Baseline JIT's role between LLInt and DFG.)
- [JSC-IC] Filip Pizlo. JavaScriptCore's inline caches. WebKit blog series.
- [HotSpot-C1] Cliff Click. HotSpot Client Compiler design. JVM internals talks.
- [Cranelift] Bytecode Alliance. Cranelift compiler backend. https://github.com/bytecodealliance/wasmtime/tree/main/cranelift
- [Wasmtime] Bytecode Alliance. Wasmtime architecture: baseline vs optimizing. https://docs.wasmtime.dev/
Optimizing tier prior art (for comparison only)
- [V8-Maglev] V8 team. Maglev: V8's Fastest Optimizing JIT. https://v8.dev/blog/maglev
- [V8-TurboFan] V8 team. Launching Ignition and TurboFan. https://v8.dev/blog/launching-ignition-and-turbofan
- [HotSpot-C2] Cliff Click, Michael Paleczny. A simple graph-based intermediate representation. IR papers 1995.
- [YJIT] Maxime Chevalier-Boisvert et al. YJIT: A Basic Block Versioning JIT Compiler for Ruby. MPLR 2023.
- [ZJIT] Shopify Ruby team. ZJIT: a method JIT for Ruby. 2025.
Deopt and tiering
- [Hölzle-Deopt] Urs Hölzle, Craig Chambers, David Ungar. Debugging Optimized Code with Dynamic Deoptimization. PLDI 1992.
- [Self] Urs Hölzle, David Ungar. A Third-Generation SELF Implementation: Reconciling Responsiveness with Performance. OOPSLA 1994.
Typed-input baselines (the Mochi analog)
- [Wasm-Spec] WebAssembly Specification 2.0. https://webassembly.github.io/spec/core/
- [Dart-AOT] Vyacheslav Egorov et al. Compiling Dart to native code. https://mrale.ph/dartvm/
- [CoreCLR] Microsoft. RyuJIT overview and tiered compilation. https://github.com/dotnet/runtime/blob/main/docs/design/coreclr/jit/ryujit-overview.md
- [ART] Android Runtime Team. ART and Dalvik. https://source.android.com/docs/core/runtime
- [Truffle] Thomas Würthinger et al. One VM to rule them all. Onward! 2013.
Trace alternative (rejected for Mochi)
- [LuaJIT] Mike Pall. The LuaJIT 2.0 Wiki. http://wiki.luajit.org/Home
- [LuaJIT-Trace] Mike Pall. LuaJIT trace compiler design notes. http://wiki.luajit.org/SSA-IR-2.0