MEP 42. Native Code Emission: copy-and-patch JIT, C-as-target AOT, and a Wasm-first cross-platform story
| Field | Value |
|---|---|
| MEP | 42 |
| Title | Native Code Emission |
| Author | Mochi core |
| Status | Draft |
| Type | Standards Track |
| Created | 2026-05-18 |
| Depends | MEP-23 (Compile-time budget), MEP-40 (vm3 + compiler3), MEP-41 (Memory Safety) |
Abstract
Mochi today ships exactly one execution model: the vm3 bytecode interpreter with a vm2jit-derived golang-asm JIT for hot methods on x86_64 and aarch64. There is no AOT path; mochi build produces a Go binary that embeds the interpreter, not native code emitted from compiler3 IR. There is no Wasm output. There is no Windows native target. There is no story for mochi run script.mochi to compete with python script.py on startup time, and no story for mochi build --portable to compete with go build on distributable artifact size.
This MEP specifies the from-scratch native code emission layer for Mochi. The architecture is dual-backend by design: copy-and-patch JIT for the interpreter tier (sub-millisecond compile, 3-5x faster than vm3 interpreter, inherits Clang -O2 stencil quality), and C-as-target AOT for shipped binaries (covers every target the user's cc supports, including embedded Cortex-M and microcontrollers). The two backends share compiler3's typed IR; neither requires LLVM or cgo at Mochi build time. The naive-emission research substrate (/notes/Spec/5500/) validates this pair: copy-and-patch is the technique CPython 3.13 shipped in October 2024 (PEP 744); C-as-target is the technique that lets Nim, V, Vala, and Cython cover every embedded toolchain on Earth.
Phase 1 targets five host combinations: x86_64 Linux (ELF/SysV), aarch64 Linux (ELF/AAPCS64), aarch64 macOS (Mach-O/Apple ABI), x86_64 macOS (Mach-O), and wasm32 with WasmGC. These cover every CI runner, every modern cloud ARM instance, every Apple Silicon developer, every browser, and every standalone Wasm runtime (Wasmtime, WAMR, Wasmer, WasmEdge, Spin). Phase 2 adds Windows x86_64/aarch64 (PE/COFF with .pdata/.xdata), riscv64 Linux (RVA22/RVA23), an APE bundler for single-artifact polyglot distribution, plus a native Wasm AOT emitter and a QBE backend for users who want sub-MB stripped binaries without a libc dependency.
The performance bet, deduced from the research substrate: copy-and-patch JIT lands Mochi in the same 3-5x-of-Go band that the §6.16 close-out of MEP-39 left as out of reach for vm3 alone. C-as-target AOT lands mochi build artifacts in the 1-10 MB band (Crystal-like) with binary size and code quality bounded by the user's C compiler, not by Mochi. The Wasm path gives Mochi a distribution channel no other Go-hosted language has: browser, edge, and Wasmtime AOT all from a single emit pipeline.
The closest existing architectural analog is Crystal (closed-world, typed IR, managed runtime, same target tier). The case study to learn most from is .NET NativeAOT (mature managed-language AOT pipeline, trim model, source-generator alternative to runtime reflection, single-file deployment UX). Both are documented in ~/notes/Spec/5500/aot/.
This MEP is a Standards Track design document. The phased plan (Phase 0 spec freeze through Phase 8 Wasm AOT) ships incrementally; no phase ships until its gate is green. The MEP and the code ship in the same PR (MEP-spec-in-sync rule). No phase introduces cgo on the Mochi build host.
Motivation
What MEP-40 left on the table
MEP-40 (vm3 + compiler3) produced a typed IR that propagates Mochi's static type system end-to-end. Every SSA value carries a proven type at IR-emit time; every opcode encodes the type in the opcode itself; the three-bank register file (regsI64 / regsF64 / regsCell) reads and writes native machine words without Cell envelope traffic. This is exactly the precondition a code generator needs: no runtime type guards, no fallback paths, no escape valve. The §6.16 close-out of MEP-39 listed four structural ceilings that vm2 could not lift; vm3 lifted all four. What remains is to spend that headroom by emitting native code instead of dispatching bytecode.
The vm3jit method JIT (MEP-40 Phase 5) covers the inner-loop case but inherits vm2jit's golang-asm encoding: no register allocator, no cross-op optimization, no AOT path. The shipping Mochi binary still embeds the vm3 interpreter, and mochi build my_program.mochi still produces a Go binary that runs the interpreter on my_program.mochi. That is a distribution model, not a code-generation model.
What changed in 2024-2026
Four things between OOPSLA 2021 and May 2026 make this MEP unavoidable.
Copy-and-patch shipped in CPython 3.13 (October 2024). PEP 744 enabled the copy-and-patch JIT (Xu+Kjolstad, OOPSLA 2021) behind --enable-experimental-jit in Python 3.13.0. The technique is now production-validated at the scale of CPython: ~1000 lines of Python build-time tooling plus ~100 lines of C runtime per ISA, hand-written C stencils compiled by Clang -O2 at build time, memcpy + patch at runtime. The risk profile is well understood, the macOS arm64 JIT entitlement story is documented, and Brandt Bucher's writeups are now reference material. CPython measured a 2-9% throughput improvement on pyperformance for a first-cut JIT with no register allocation. Mochi, with typed IR and reserved arena-base registers, expects 3-5x on hot loops.
Wasm GC + WASIp2 shipped in 2024. WasmGC reached browser baseline in 2024 (Chrome 119, Firefox 120, Safari 18.2); WASI Preview 2 (component model + WIT bindings) reached stable in Wasmtime 17 (January 2024). Wasm is now a credible AOT target for a managed-runtime language, not a JavaScript fallback. The Mochi handle Cell maps directly to a wasmtime externref or to a typed GC reference under WasmGC; the typed arenas map directly to typed GC structs. The September 2025 Wasm 3.0 release added 64-bit memories and atomic operations, closing the last two gaps for a Mochi Wasm port.
Apple Silicon adoption crossed 50% of developer machines (DeveloperEcosystem 2025 survey). Mochi without an aarch64 macOS native binary is no longer credible for individual developer adoption. The Mach-O writer, the Apple variadic ABI delta, the ad-hoc signing requirement, and the JIT entitlement plist are mandatory work for any "professional language" claim in 2026.
Zig 0.13 + zig cc graduated to "default cross-compiler" status in many teams. Zig's bundled libcs (musl, glibc, mingw) and zero-config cross compilation set the new floor for what users expect from a language's cross-compilation UX. Nim already pairs with zig cc; Crystal users wrap --cross-compile around zig cc; Rust users layer cargo-zigbuild on top of cargo build. Mochi's mochi build --target=aarch64-linux should "just work" from a macOS dev machine, and zig cc is the cheapest way to deliver that.
Why two phase-1 backends, not one
The naive-emission survey (~/notes/Spec/5500/naive/00_naive_summary.md) and the backends survey (~/notes/Spec/5500/backends/00_backends_summary.md) agree on the same conclusion through different lenses: no single backend covers the four MEP-42 priority surfaces (fast JIT, distributable AOT, Wasm, embedded) with acceptable engineering cost and Mochi's pure-Go-no-cgo identity preserved. The two-backend strategy is the pragmatic compromise GHC adopted (NCG + LLVM) and Zig adopted (in-house + LLVM + C). Mochi adopts the same shape with smaller pieces: copy-and-patch + C-as-target in phase 1, Wasm emitter + QBE in phase 2.
The pair is complementary, not redundant. Copy-and-patch is millisecond-compile and runtime-tier; C-as-target is cc-bound compile time and ship-tier. Copy-and-patch produces machine code in an mmap'd executable region; C-as-target produces an ELF/Mach-O/PE on disk. Copy-and-patch covers two ISAs out of the gate; C-as-target covers every ISA the user's cc understands. Neither covers the other's surface, and shipping both costs less than shipping either alone with the gaps patched by ad-hoc tools.
Scope
In scope:
- Complete design and implementation of compiler3/emit/copypatch/ (copy-and-patch JIT: stencil generator, runtime patcher, mmap+W^X manager).
- Complete design and implementation of compiler3/emit/c/ (C-as-target AOT: typed-IR-to-C lowering, runtime header, cc driver).
- Initial implementation of compiler3/emit/wasm/ (Wasm 3.0 + WasmGC emitter, browser + standalone targets).
- Stencil generation tooling (tools/stencilgen/) that invokes Clang at build time and emits a generated Go file per ISA.
- Linker driver (compiler3/link/) that invokes LLD by default with system linker fallback.
- Object file readers/writers (compiler3/objfile/elf/, compiler3/objfile/macho/, compiler3/objfile/pe/) using Go's debug/elf, debug/macho, and a hand-rolled PE writer.
- Cross-compilation support for the five phase-1 targets from any host, optionally via zig cc.
- DWARF 5 line-table emission for native targets (phase 1); full DWARF + optional PDB in phase 2.
- A mochi build UX that produces a single distributable binary, with --target, --portable (musl static-PIE), and --mode={dev,release,embedded} flags.
- A mochi run path that selects copy-and-patch JIT for hot loops when available, falling back to vm3 interpreter when not.
- Bench harness integration: every BG kernel runs under all three execution modes (interpreter, JIT, AOT) on every supported host, with cross-mode parity gates.
Out of scope (deferred to successor MEPs):
- LLVM as a primary backend. Available as a phase-3 opt-in (compiler3/emit/llvmir/ emits .ll text, shells to llc); not required for any phase-1 or phase-2 deliverable.
- MLIR. Reserved for an SIMT / GPU successor MEP.
- libgccjit. Rejected outright: GPL contagion risk.
- iOS / iPadOS / visionOS targets. Provisioning, App Store review, and MH_BUNDLE machinery deserve a dedicated mobile MEP.
- GPU codegen (Metal AIR, CUDA PTX, ROCm, SPIR-V, WGSL). Separate MEP.
- Tracing JIT. vm3jit is a method JIT; tracing is MEP-50+ territory.
- IL2CPU / Bartok-style two-stage AOT through an intermediate C++ pass.
- Profile-guided optimization. Phase 3+ once base AOT and JIT are stable.
Specification
§1 Architecture
The native code emission layer sits between compiler3 (typed IR producer) and the host toolchain (system cc, system linker, host kernel loader). Three emit packages share the typed IR:
                       compiler3/ir (MEP-40)
                               |
               +---------------+----------------+
               |               |                |
               v               v                v
        emit/copypatch      emit/c         emit/wasm (phase 2 AOT)
        (JIT, phase 1)   (AOT, phase 1)    emit/qbe  (AOT, phase 2)
               |               |                |
               v               v                v
           mmap exec      ELF/Mach-O/PE    .wasm module
         memcpy+patch     via system cc    via builtin emit
The boundary is the typed IR. Every emit package consumes the same IR shape, the same SSA value types, the same three-bank register convention, the same Cell ABI. No emit package may add IR ops or modify the type lattice; either the IR already expresses what the backend needs, or the IR change is a separate PR that lands first.
The four-bit arena tag and 12-bit generation encoding of MEP-40 plus the verifier rules of MEP-41 are load-bearing for every backend. Stencils may not mask, shift, or otherwise destructure the generation field; the backend treats it as opaque per MEP-41's Tag Confidentiality Enforcement analog. The C-as-target lowering wraps every handle deref in mochi_deref_T(handle) calls so the C compiler cannot inline gen-extraction; the copy-and-patch stencils never name a register holding a raw gen value.
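To make the "opaque generation" rule concrete, here is a minimal Go sketch of a generational arena. The names Handle, Arena, Alloc, Kill, and Deref are hypothetical and are not the MEP-40 or MEP-41 API; the sketch only illustrates that emitted code can go through a deref routine without ever reading or comparing the generation itself.

```go
package main

import "fmt"

// Handle is a hypothetical opaque handle. Its fields are unexported, so code
// outside the arena cannot mask, shift, or compare the generation.
type Handle struct {
	gen uint16 // generation stamp (12 bits in the encoding described above)
	idx uint32 // slab index
}

type slot struct {
	gen   uint16
	live  bool
	value int64
}

// Arena is a toy typed arena of int64 values.
type Arena struct {
	slots []slot
}

// Alloc stores v, reusing a dead slot if one exists, and returns a handle
// stamped with that slot's current generation.
func (a *Arena) Alloc(v int64) Handle {
	for i := range a.slots {
		if !a.slots[i].live {
			a.slots[i].live = true
			a.slots[i].value = v
			return Handle{gen: a.slots[i].gen, idx: uint32(i)}
		}
	}
	a.slots = append(a.slots, slot{live: true, value: v})
	return Handle{gen: 0, idx: uint32(len(a.slots) - 1)}
}

// Kill frees the slot and bumps its generation (wrapping at 12 bits), so
// every outstanding handle to the slot becomes stale.
func (a *Arena) Kill(h Handle) {
	if int(h.idx) >= len(a.slots) {
		return
	}
	s := &a.slots[h.idx]
	if s.live && s.gen == h.gen {
		s.live = false
		s.gen = (s.gen + 1) & 0xFFF
	}
}

// Deref is the only place where the generation is compared.
func (a *Arena) Deref(h Handle) (int64, bool) {
	if int(h.idx) >= len(a.slots) {
		return 0, false
	}
	s := a.slots[h.idx]
	if !s.live || s.gen != h.gen {
		return 0, false
	}
	return s.value, true
}

func main() {
	var a Arena
	h := a.Alloc(42)
	v, ok := a.Deref(h)
	fmt.Println("live handle:", v, ok)

	a.Kill(h)
	h2 := a.Alloc(7) // reuses the slot under a new generation
	_, stale := a.Deref(h)
	v2, ok2 := a.Deref(h2)
	fmt.Println("old handle valid:", stale, "new handle:", v2, ok2)
}
```

A backend written against this shape has nothing to destructure: the failure branch of Deref is where a slow-path call into the runtime would go.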
§2 Copy-and-patch JIT (phase 1)
Hand-write one C function (a stencil) per vm3 opcode defined in runtime/vm3/op.go. Each stencil takes the vm3 frame, the operand registers, and returns the dispatch target for the next op. Compile each stencil with Clang -O2 -fno-asynchronous-unwind-tables -fno-stack-protector -mno-red-zone at Mochi build time. Extract the resulting machine code and relocations from the .text section; emit them as a generated Go file (compiler3/emit/copypatch/stencils_amd64.go, ..._arm64.go) containing a per-opcode struct: {bytes []byte, holes []Reloc}.
At Mochi runtime, the JIT walks the typed IR for a hot method, picks the stencil for each op, memcpy's the bytes into an mmap'd executable region, and patches the relocations (immediates, jump targets, runtime symbols) in place. The patched code is then jumped to via a Go-friendly entry trampoline that preserves Go's stack invariants. Code-cache management uses a simple bump allocator with a high-water mark; when the cache fills, the JIT falls back to vm3 interpretation for cold ops and recycles the cache on the next GC cycle.
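The copy, patch, and bump-allocate steps can be sketched in Go as follows. This is an illustration under stated assumptions, not the generated stencil format: the Stencil and Reloc field names, the CodeCache type, and ErrCacheFull are hypothetical, and the cache is a plain byte slice that is never executed. The two stencils are hand-encoded x86_64 instructions (movabs rax, imm64 and ret) standing in for Clang output.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// Reloc marks a hole in a stencil: Size bytes at Offset to overwrite.
type Reloc struct {
	Offset int
	Size   int // 4 or 8
}

// Stencil mirrors the {bytes, holes} shape described above.
type Stencil struct {
	Bytes []byte
	Holes []Reloc
}

// CodeCache is a bump allocator over a byte slice. A real JIT would back it
// with an mmap'd executable region; this sketch only builds the bytes.
type CodeCache struct {
	buf  []byte
	next int // high-water mark
}

// ErrCacheFull signals that the caller should fall back to interpretation.
var ErrCacheFull = errors.New("code cache full")

// Emit copies the stencil to the high-water mark and patches each hole with
// the matching value, little-endian. It returns the start offset.
func (c *CodeCache) Emit(s Stencil, vals []uint64) (int, error) {
	if len(vals) != len(s.Holes) {
		return 0, errors.New("hole/value count mismatch")
	}
	if c.next+len(s.Bytes) > len(c.buf) {
		return 0, ErrCacheFull
	}
	start := c.next
	dst := c.buf[start : start+len(s.Bytes)]
	copy(dst, s.Bytes) // the "copy" half
	for i, h := range s.Holes {
		// the "patch" half
		switch h.Size {
		case 4:
			binary.LittleEndian.PutUint32(dst[h.Offset:], uint32(vals[i]))
		case 8:
			binary.LittleEndian.PutUint64(dst[h.Offset:], vals[i])
		default:
			return 0, errors.New("unsupported hole size")
		}
	}
	c.next += len(s.Bytes)
	return start, nil
}

// loadImm64 is x86_64 `movabs rax, imm64`: 48 B8 followed by 8 immediate bytes.
var loadImm64 = Stencil{
	Bytes: []byte{0x48, 0xB8, 0, 0, 0, 0, 0, 0, 0, 0},
	Holes: []Reloc{{Offset: 2, Size: 8}},
}

// ret is x86_64 `ret`.
var ret = Stencil{Bytes: []byte{0xC3}}

func main() {
	cache := &CodeCache{buf: make([]byte, 64)}
	if _, err := cache.Emit(loadImm64, []uint64{0x2A}); err != nil {
		panic(err)
	}
	if _, err := cache.Emit(ret, nil); err != nil {
		panic(err)
	}
	fmt.Printf("% x\n", cache.buf[:cache.next])
}
```

Because Emit never inspects the instruction bytes, the per-ISA knowledge lives entirely in the stencil tables, which is what keeps the runtime half small.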
Register convention (x86_64 SysV; mirrored on arm64 AAPCS64 with x19-x28):
- R12: pointer to current Frame.
- R13: pointer to typed-arena base table.
- R14: pointer to per-VM context (PC stash, deopt sentinel slot).
- R15: scratch.
- RAX/RDI/RSI/RDX/RCX/R8/R9: Cell operand registers (caller-save, follow stencil ABI).
Reserved callee-saves match the MEP-40 three-bank register-file design. The JIT never spills R12-R14 because stencils assume them on entry; the only spill path is when a stencil's internal codegen needs more than R15 of scratch, in which case the stencil uses the red-zone (Linux) or a Mochi-private scratch slab on the frame (macOS arm64, which has no red zone).
W^X is enforced via the dual-mapping pattern: the code cache is mmap'd twice, once RW (for the patcher) and once RX (for the runtime jump), with the kernel guaranteeing the same physical pages. On Apple Silicon, pthread_jit_write_protect_np(0) toggles the per-thread write-permission bit during patching; the JIT thread holds the toggle for the patch window only. On Linux with PaX or grsec, the dual-mapping is required; on stock Linux, mprotect toggling is the fallback path.
PAC and BTI hardening on aarch64: every stencil entry carries a bti j instruction; every cross-stencil call uses blraa with the appropriate PAC modifier. The PAC key is per-Mochi-process, derived at startup from /dev/urandom and stored in a register only the patcher knows. This is the MEP-41 §8 JIT hardening checklist; copy-and-patch satisfies it without any per-stencil logic because Clang -O2 already emits PAC+BTI when targeted at arm64-apple-darwin.
Stencil set scope (phase 1):
- All non-allocating vm3 opcodes: arithmetic (i64/f64), comparison, conditional jumps, register move, frame load/store, typed-array element load/store.
- Inline allocation for short-lived Cells (small int, short string, bool).
- Slow-path call into the vm3 runtime for: handle dereference miss, arena exhaustion, deopt sentinel, MEP-41 verifier rule check failure.
- Branch fusion: chained conditional jumps in a single basic block fold into one stencil where possible (Liftoff-style is phase 2; phase 1 keeps every op as a separate stencil).
Not in phase-1 scope (phase 2): cross-op register allocation, inline caching for first-class function dispatch, SIMD intrinsics, generational write-barrier elision via static analysis.
§3 C-as-target AOT (phase 1)
Lower compiler3 IR to C in compiler3/emit/c/. Strategy follows Nim's: one C function per Mochi function, one C struct per Mochi type, every basic block becomes a labeled statement, control flow via goto. Computed goto (GCC extension) is used for the interpreter tier within AOT'd code (for indirect dispatch on dynamic-typed values that escape the static-type discipline); standard switch is the portable fallback for MSVC.
The Mochi C runtime header (runtime/c/mochi.h) declares:
- mochi_Cell (uint64_t) and the inline NaN-boxing accessors.
- mochi_arena_t and the typed-arena APIs from MEP-40.
- mochi_handle_T(arena, gen, idx) constructors and mochi_deref_T(handle) accessors.
- The verifier-checked operations from MEP-41 (mochi_try_deref_T, mochi_kill, etc.).
- The slow-path callbacks the JIT and AOT'd code share.
The runtime header is C99-portable and depends only on <stdint.h>, <stdlib.h>, <string.h>. On glibc and musl it adds <sys/mman.h> for mmap (for the JIT path; AOT code does not mmap). On Windows it uses <windows.h> for VirtualAlloc. Every implementation file (runtime/c/mochi.c) is built into a static library libmochi.a (or mochi.lib on Windows) that the linker driver bundles into the final executable.
The mochi build driver:
- Parses + type-checks + lowers the program to compiler3 IR.
- Calls compiler3/emit/c/ to produce a temporary .c file (or files, if multi-module).
- Shells out to the user's C compiler: prefers zig cc if available (zero-config cross compilation), falls back to cc, falls back to clang, falls back to gcc.
- Compiles the .c files plus libmochi.a to a single executable using the chosen linker (LLD by default, system ld fallback).
- Strips debug info on release mode; preserves DWARF on dev mode; emits embedded-mode subset on embedded mode.
The C compiler choice is documented but not enforced. mochi build --cc=zig selects zig cc explicitly; mochi build --cc=tcc selects TCC (useful for sub-second build times on small programs); mochi build --cc=clang -- -fsanitize=address passes through C compiler flags. The default is mochi build with no --cc flag, which picks zig cc if installed, else cc.
Cross compilation via zig cc:
mochi build --target=aarch64-linux-musl hello.mochi
mochi build --target=x86_64-windows-gnu hello.mochi
mochi build --target=wasm32-wasi hello.mochi
Each --target lookup maps to a zig cc -target triple. The triple list ships in compiler3/emit/c/triples.go; users can extend it via a mochi.toml config.
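A minimal sketch of that lookup, assuming a plain map and a user-extension map read from mochi.toml. The contents of compiler3/emit/c/triples.go are not shown in this document, so the table below holds only the three targets from the examples above; ZigArgs and its parameters are hypothetical.

```go
package main

import "fmt"

// zigTriples maps a mochi --target value to a zig cc -target triple.
// Only the three targets named in the examples above are listed.
var zigTriples = map[string]string{
	"aarch64-linux-musl": "aarch64-linux-musl",
	"x86_64-windows-gnu": "x86_64-windows-gnu",
	"wasm32-wasi":        "wasm32-wasi",
}

// ZigArgs builds a zig cc command line for a single C file. Entries in
// extra (for example, loaded from mochi.toml) take precedence over the
// built-in table.
func ZigArgs(target, src, out string, extra map[string]string) ([]string, error) {
	triple, ok := extra[target]
	if !ok {
		triple, ok = zigTriples[target]
	}
	if !ok {
		return nil, fmt.Errorf("unknown target %q", target)
	}
	return []string{"zig", "cc", "-target", triple, "-o", out, src}, nil
}

func main() {
	args, err := ZigArgs("aarch64-linux-musl", "hello.c", "hello", nil)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(args)
}
```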
§4 Wasm emit (phase 1 minimal, phase 2 AOT)
Phase 1 ships a minimal Wasm 3.0 + WasmGC emitter in compiler3/emit/wasm/ that handles the BG kernel subset (arithmetic, control flow, typed arrays, simple structs). The output module imports a small Mochi-Wasm host shim (runtime/wasm/host.js for browser, runtime/wasm/host.wat for Wasmtime/standalone) that provides the slow-path callbacks the JIT and AOT both need.
Handle Cell mapping: 64-bit Mochi Cell becomes a Wasm i64 for the inline-encoded variants (small int, float, bool, null) and a (ref $mochi_handle) GC reference for handle variants. Typed arenas become WasmGC (struct ...) types per arena, instantiated lazily. The four-bit arena tag is the WasmGC type index; the 12-bit generation is a struct field; the 32-bit slab index is the struct array index.
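The mapping can be illustrated with a small Go decoder. The bit positions are an assumption made for this sketch (arena tag in bits 47..44, generation in bits 43..32, slab index in bits 31..0); this document gives the field widths but not their order. The Kind enum, WasmRef, PackHandle, ToWasmRef, and WasmValType are hypothetical names.

```go
package main

import "fmt"

// Assumed payload layout: [47..44] arena tag, [43..32] generation,
// [31..0] slab index. Only the widths come from the text above.
const (
	idxBits = 32
	genBits = 12
)

// Kind is a toy classification of Cell variants.
type Kind int

const (
	SmallInt Kind = iota
	Float
	Bool
	Null
	HandleKind
)

// WasmValType returns the Wasm value type a Cell of kind k lowers to.
func WasmValType(k Kind) string {
	if k == HandleKind {
		return "(ref $mochi_handle)"
	}
	return "i64" // inline-encoded variants stay in a 64-bit integer
}

// WasmRef is where a handle lands under the WasmGC mapping.
type WasmRef struct {
	TypeIndex  uint8  // four-bit arena tag -> WasmGC type index
	Generation uint16 // 12-bit generation -> struct field
	ArrayIndex uint32 // 32-bit slab index -> array index
}

// PackHandle builds a 48-bit handle payload under the assumed layout.
func PackHandle(tag uint8, gen uint16, idx uint32) uint64 {
	return uint64(tag&0xF)<<(genBits+idxBits) | uint64(gen&0xFFF)<<idxBits | uint64(idx)
}

// ToWasmRef splits a payload into the three WasmGC coordinates.
func ToWasmRef(payload uint64) WasmRef {
	return WasmRef{
		TypeIndex:  uint8((payload >> (genBits + idxBits)) & 0xF),
		Generation: uint16((payload >> idxBits) & 0xFFF),
		ArrayIndex: uint32(payload),
	}
}

func main() {
	p := PackHandle(0x3, 0xABC, 7)
	fmt.Printf("payload=%#x ref=%+v type=%s\n", p, ToWasmRef(p), WasmValType(HandleKind))
}
```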
Phase 2 promotes the Wasm emitter to full AOT through Wasmtime's wasmtime compile (~/notes/Spec/5500/backends/12_wasmtime_aot.md): Mochi emits .wasm, wasmtime compile lowers to native .cwasm, the Mochi loader maps the .cwasm directly. This gives Mochi a universal IR (Wasm) and reaches every Cranelift-supported target transitively.
Browser DWARF: Wasm modules carry DWARF in custom sections (./custom("name").data) per the Chrome C/C++ DevTools Support extension. Phase 1 emits line tables only; phase 2 adds full type and variable info.
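For reference, a Wasm custom section is encoded as section id 0, a LEB128 byte length, a length-prefixed name, and then the raw payload. The Go sketch below encodes one; the section name .debug_line follows the common DWARF-in-Wasm convention of naming each custom section after the DWARF section it carries, and the function names and three-byte payload are placeholders rather than part of emit/wasm.

```go
package main

import "fmt"

// uleb128 appends v in unsigned LEB128, the integer encoding Wasm uses.
func uleb128(dst []byte, v uint32) []byte {
	for {
		b := byte(v & 0x7F)
		v >>= 7
		if v != 0 {
			dst = append(dst, b|0x80) // more bytes follow
			continue
		}
		return append(dst, b)
	}
}

// CustomSection encodes a Wasm custom section: id 0, the byte length of the
// contents, then the name (length-prefixed UTF-8) followed by the payload.
func CustomSection(name string, payload []byte) []byte {
	body := uleb128(nil, uint32(len(name)))
	body = append(body, name...)
	body = append(body, payload...)

	out := []byte{0x00} // section id 0 = custom
	out = uleb128(out, uint32(len(body)))
	return append(out, body...)
}

func main() {
	// The payload here is placeholder bytes, not a real DWARF line program.
	sec := CustomSection(".debug_line", []byte{0x01, 0x02, 0x03})
	fmt.Printf("% x\n", sec)
}
```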
§5 Linker strategy
Phase 1: LLD by default, system linker fallback.
- Linux: ld.lld (default), ld.bfd or ld.gold (fallback).
- macOS: system ld (which is ld_prime since Xcode 15, default), ld.lld (fallback for cross builds).
- Windows: lld-link (default), system link.exe (fallback if MSVC is installed).
- Wasm: wasm-ld (LLVM).
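The per-platform preference order above is essentially a lookup table. A minimal Go sketch, with hypothetical names (linkerCandidates, PickLinker) and hypothetical map keys for the target OS:

```go
package main

import (
	"fmt"
	"os/exec"
)

// linkerCandidates lists linkers per target in preference order, default
// first, following the phase-1 list above. The keys are illustrative.
var linkerCandidates = map[string][]string{
	"linux":   {"ld.lld", "ld.bfd", "ld.gold"},
	"macos":   {"ld", "ld.lld"},
	"windows": {"lld-link", "link.exe"},
	"wasm":    {"wasm-ld"},
}

// PickLinker returns the first candidate for targetOS that available accepts.
func PickLinker(targetOS string, available func(string) bool) (string, error) {
	cands, ok := linkerCandidates[targetOS]
	if !ok {
		return "", fmt.Errorf("no linker table for %q", targetOS)
	}
	for _, c := range cands {
		if available(c) {
			return c, nil
		}
	}
	return "", fmt.Errorf("no linker available for %q (tried %v)", targetOS, cands)
}

func main() {
	onPath := func(name string) bool {
		_, err := exec.LookPath(name)
		return err == nil
	}
	linker, err := PickLinker("linux", onPath)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("linker:", linker)
}
```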
Bundle LLD inside the Mochi distribution under Apache 2 + LLVM Exception license. The bundled LLD is a single ~25 MB binary covering all four formats (ELF, Mach-O, PE, Wasm). The total Mochi binary size impact is acceptable for desktop installs; mochi build --no-bundle-lld is the opt-out for users on restricted disks.
Phase 2: self-hosted writers for ELF, Mach-O, and PE in compiler3/objfile/. Pattern follows Go's cmd/link: the compiler emits the final image directly without an external linker subprocess on the common path. LLD remains the fallback for the "I need to link against a C library that ships as .a" case. This halves cold-start build time (no fork+exec of the linker) and lets Mochi tune the output for compiler3-specific metadata sections (typed-arena debug info, MEP-40 vm3 metadata, MEP-41 verifier-proof manifests).