Skip to main content

MEP 38. Closing the BG Gap to 2x Go via Byte Views and Native Lowering

FieldValue
MEP38
TitleClosing the BG Gap to 2x Go
AuthorMochi core
StatusDraft
TypeStandards Track
Created2026-05-17

Abstract

MEP-37 set the headline target ("vm2 within 2x of go run on the six BG programs phase 1+2 cover, and within 5x on the suite median") and shipped the data-model lift that makes the comparison possible: float arithmetic ops, typed arrays, packed pairs. The measured ratios after MEP-37 are 115x-171x, not 2x. The gap is not data-model; it is dispatch.

This MEP closes the dispatch gap in three layers, each measurable in isolation:

  1. Byte views (§3.1). The reverse_complement kernel cannot stop allocating a second 1 KB U8Array per iteration without a slice-view subsystem. This is the smallest delivery and the one MEP-37 §3.5 explicitly deferred to a follow-up.
  2. Native lowering of the hot opcodes (§3.2). Each OpAddF64 / OpU8ArrGet / OpBytesGet invocation in the dispatch interpreter costs ~6 ns on M4 (Appendix C measures this). A JIT-compiled fragment of those same opcodes runs at the host's float-add throughput (1 ns) or the host's load-with-bounds-check throughput (2 ns), one to two orders of magnitude faster. vm2 has a working JIT framework (vm2jit) that already supports OpAddI64 + OpJumpIfLessI64; extending its arm64 backend to the typed-array and float families is mechanical.
  3. Leaf-function inline in emit (§3.3). reverse_complement's complement() and n_body's kick() are 5-10 instruction leaves called on the inner loop. Frame push/pop costs ~12 ns on M4 (measured in MEP-36 Phase 1); inlining the body removes that cost entirely. Mochi already has dead-code elimination and tail-call detection in the IR optimizer; an inliner is the next pass.

The bet is that layer 2 is the dominant lift. Layers 1 and 3 are smaller and well-scoped; layer 2 is where the dispatch ratio collapses. Each kernel's path to 2x is enumerated in §3.4 with a per-kernel cost decomposition that ties to the deep-research appendix (Appendix C).

Motivation

MEP-37 measured five BG kernels at the following ratios against go run:

KernelVM2 (ns/op)Go (ns/op)Ratio
spectral_norm (n=100)1,754,27915,276115x
n_body (N=5, 1 step)14,103119118x
mandelbrot (8x8, max 50)~140,0002,49456x
fannkuch_redux (n=7 fixed start)2,92769.842x
binary_trees (depth=8)61,9266,7489.2x
reverse_complement (n=1024)193,1991,132171x

The 9.2x for binary_trees confirms that the data-model lift (packed pairs + arena) is doing its job: the kernel was allocation-bound and is now dispatch-bound. The 115x-171x for the other kernels confirms the same: those kernels are dispatch-bound at the byte-code interpreter level, and no further data-model tweak will close them.

The corresponding profile (M4, Apple Silicon, Go 1.23) shows ~85% of CPU on the dispatch path: register decode, the indirect call through the opcode handler table, the bounds check on the next opcode fetch. The actual arithmetic (a float add, a byte load, a comparison) costs ~5-10% of CPU. Closing the gap requires removing the dispatch from the hot loop, which means JIT lowering. The MEP-34 tiered JIT work (Phase 1 has shipped for OpAddI64 / OpJumpIfLessI64) is the proof-of-concept; this MEP extends it to cover the operand types every BG kernel touches.

Layers 1 and 3 are smaller wins but cover specific failure modes:

  • Without byte views (layer 1), reverse_complement's 1 KB U8Array per iteration is irreducible: even a fully JIT-compiled rcLoop would still spend 30+ µs/iter in the U8Array allocator, leaving the ratio at ~30x.
  • Without leaf inline (layer 3), the per-byte complement() Call/Return overhead is irreducible: even with both U8Arr and complement JIT-compiled, the frame push/pop is ~12 ns × 1024 = 12 µs/iter, leaving the ratio at ~10x.

Both numbers come from the cost decomposition in Appendix C.

Specification

3.1 Byte view subsystem (Layer 1)

3.1.1 The vmBytes runtime type

vmBytes is the byte-view value carried through a tagPtr Cell. The struct is laid out to fit in half a cache line:

type vmBytes struct {
buf []byte // 24 bytes: ptr/len/cap, backing storage retained for GC
off int // 8 bytes: start of view (0 ≤ off ≤ len(buf))
n int // 8 bytes: length of view (off + n ≤ len(buf))
owns bool // 1 byte: true if produced by OpBytesNew; OpBytesSet legal
}

Total 48 bytes including the bool's padding, comfortably under one cache line (64 bytes on M4 + most amd64). A view is logically immutable except through OpBytesSet on an owning view (§3.1.4).

Construction sources:

  • OpBytesNew(n) allocates make([]byte, n), sets off=0, n=n, owns=true.
  • OpBytesFromU8Array(arr) aliases the U8Array's backing slice, owns=false.
  • OpBytesFromStr(s) aliases the vmString's backing bytes, owns=false.
  • OpStdinReadAll slurps stdin into a fresh slice, owns=true.

The runtime never allocates a vmBytes whose backing slice can be reclaimed independently of the view. The Go heap traces through Cell.Obj → *vmBytes → buf; the parent buffer stays alive as long as any view references it.

3.1.2 The OpBytes* opcode family

Nine opcodes cover the corpus needs identified in MEP-37 §3.5:

OpInputsOutputSemanticsOwns required
OpBytesNewB = length (i64)A = *vmBytesfresh allocationn/a (produces)
OpBytesLenB = bytesA = i64n fieldno
OpBytesGetB = bytes, C = indexA = i64buf[off+C]; traps OOBno
OpBytesSetA = bytes, B = idx, C = byteTUnitbuf[off+B]=C; traps OOB; traps !ownsyes
OpBytesSliceB, C = off, D = lenA = *vmBytesview of view; owns=falseno
OpBytesEqualB, C = bytesA = boolbyte-wise compareno
OpBytesHashB = bytesA = i64FNV-1a 64-bitno
OpBytesFromU8ArrayB = u8arrayA = *vmBytesshared aliasing viewn/a (produces)
OpBytesFromStrB = strA = *vmBytesshared aliasing viewn/a (produces)

OpBytesSlice's OOB trap matches OpListGet/OpU8ArrGet. OpBytesSet's !owns trap is a separate error kind so embedders can distinguish view-write attempts from OOB.

3.1.3 Minimal I/O ops

Two opcodes cover streaming I/O:

  • OpStdoutWriteBytes(A): write the view in register A to vm.Stdout (an io.Writer). Result TUnit. No buffering; framing is the caller's job.
  • OpStdinReadAll(A): slurp vm.Stdin into a fresh byte slice, store the view (owns=true) in A.

vm.Stdout defaults to os.Stdout, vm.Stdin to os.Stdin. Test harnesses overwrite both to bytes.Buffer for a deterministic Go-vs-VM2 comparison.

3.1.4 OpBytesSet ownership contract

OpBytesSet is the only mutating op on a view. It is legal only on a view that owns its backing slice (owns=true), which restricts mutation to views constructed by OpBytesNew or OpStdinReadAll. A slicing op clears owns on the result; a OpBytesFromU8Array produces owns=false.

The runtime enforces this with a vmBytes.owns check at OpBytesSet entry. Static enforcement in the verifier is best-effort: the IR can prove ownership through a fresh OpBytesNew chain (no OpBytesSlice or OpBytesFromU8Array between alloc and set), but cannot trace ownership through a function call boundary. Phase 1 ships the runtime trap; verifier work is a follow-up.

The reason for this contract: reverse_complement (and fasta) want to write into a fresh output buffer without going through a U8Array as a staging area. Without OpBytesSet, the program must construct a U8Array, fill it, then convert with OpBytesFromU8Array, which doubles the allocation work. With OpBytesSet restricted to owned views, no aliasing-write surprise can occur: a view derived from another view (via slice or U8Array alias) can be read but not written, so the parent storage's value-stability assumption is preserved.

3.2 Native lowering of typed-array and float ops (Layer 2)

The vm2jit framework (MEP-34) already lowers OpAddI64, OpSubI64, OpMulI64, OpLessI64, OpEqualI64, OpJumpIfLessI64, and OpTailCallSelf to arm64. Extending the coverage requires three families:

3.2.1 Float family lowering

Each of OpAddF64, OpSubF64, OpMulF64, OpDivF64, OpNegF64, OpAbsF64, OpSqrtF64, OpLessF64, OpEqualF64, OpLessEqF64, OpFmaF64, OpI64ToF64, OpF64ToI64, OpLoadConstF lowers to one arm64 NEON/FP instruction plus a Cell encode/decode round-trip. The lowering pattern matches the existing OpAddI64 lowering:

  1. Load operand B's Cell from the register file → V0
  2. Load operand C's Cell from the register file → V1
  3. Issue fadd d2, d0, d1 (or the matching FP op)
  4. Encode V2 as a CFloat-tagged Cell → register A

The Cell encode/decode round-trip is the persistent overhead: float64 sits unboxed in Cell.Bits already (CFloat is the identity bit pattern except for the NaN-box tag), so the encode/decode is one bit-and + one bit-or. ~2 cycles per op on M4.

The expected speedup: an OpAddF64 in the dispatch interpreter measures ~6 ns; the lowered form should hit 1.5-2 ns including the Cell round-trip.

3.2.2 U8Array / Bytes element-op lowering

OpU8ArrGet and OpBytesGet reduce to:

  1. Decode the array/bytes pointer from the Cell
  2. Load the length field (and offset, for bytes)
  3. Bounds-check the index (one CMP + one BCC)
  4. Load the byte (UXTB on arm64)
  5. Encode the result as CInt

The interpreter pays a function-pointer dispatch on top of all this; the native lowering folds the whole sequence into ~8 arm64 instructions. Bounds-check elision is the standard loop-invariant trick: a hot loop with a monotonic index and a loop-invariant length can hoist the bounds check out, leaving 4 instructions per element.

The expected speedup: from ~5 ns per U8Arr access to ~1.5 ns (with bounds-check hoisted, ~0.5 ns).

3.2.3 Specialized fused ops

The reverse_complement and spectral_norm kernels have inner loops shaped like "load element, transform, store element". A fused OpBytesMapInline(buf, off, n, lookup_table) op would compile to a single arm64 loop body without per-element dispatch. This is the BG-specific finishing touch; defer until layers 1+2 land and we know the residual.

3.3 Leaf-function inline in emit (Layer 3)

The IR optimizer (compiler2/opt/) already has ConstFold, DCE, and TailCall. An inline pass would:

  1. Identify leaf functions: no calls, no allocations, body under a threshold (10 IR ops in Phase 1).
  2. At each call site to a leaf function, copy the body into the caller, substituting parameter values for the call's argument values, and replacing OpRet with a OpMove into the caller's destination register.
  3. Re-run DCE on the caller.

Inlining is a known-quantity transform; the implementation lift is ~200 LOC. The win is enabling per-byte transforms like complement() to disappear into rcLoop, removing ~12 ns of frame-push/pop per byte.

Phase 3 ships the inliner; Phase 4 measures the BG ratios after all three layers land together.

3.4 Per-kernel paths to 2x

The 2x target requires a per-kernel cost decomposition. The deep-research appendix (Appendix C) computes a theoretical lower bound for each kernel from M4 issue rates; the table below shows what fraction of that bound each kernel hits at each layer.

KernelToday (MEP-37)After L1 (bytes)After L2 (JIT)After L3 (inline)Theoretical min2x of Go
spectral_norm115xn/a~5x~2.5x1.0x✓ at L2
n_body118xn/a~6x~3x1.2x✓ at L3
mandelbrot56xn/a~3x~2x1.1x✓ at L2
fannkuch_redux42xn/a~4x~2.5x1.5x✓ at L3
binary_trees9.2xn/a~5x~3x2.5x✓ at L2
reverse_complement171x~65x~6x~2.5x1.0x✓ at L3

The predictions are derived from Appendix C; they are upper bounds on the residual speedup at each layer, not guaranteed numbers. Phase gates are quantitative: each layer's PR includes the measurement, and a layer that lands at >2x of its predicted improvement is a regression for spec review.

Binary_trees is special: its 9.2x ratio is dominated by garbage-collection traffic that no JIT can avoid (the pair arena's tail-of-chunks pattern keeps Go's tracer busy). Layer 2 closes the dispatch ratio but the GC residual stays. Reaching 2x on binary_trees may need a separate young-generation arena, tracked separately.

Rationale

4.1 Why this MEP and not "MEP-37 Phase 4-6"

MEP-37 was scoped around the data-model lift. The dispatch lift is structurally different work: it touches the JIT framework, not the bytecode interpreter. Splitting the MEP keeps the review surface tractable and lets the JIT work proceed in parallel with whatever data-model follow-ups the corpus surfaces.

4.2 Why JIT lowering instead of a faster interpreter

A switch-based interpreter (the current shape) costs ~6 ns per op on M4. A computed-goto interpreter (direct-threaded) costs ~3.5 ns. A tail-call dispatch (musttail on Clang, not yet available in Go) costs ~2.5 ns. A JIT-compiled fragment costs ~0.5-1.5 ns per op depending on the op. The factor between the best interpreter and the worst JIT is ~2x; the factor between the best interpreter and Go-native is ~5-6x. The interpreter improvements are real but small; the JIT is where the ratio collapses below 5x.

The MEP-34 tiered JIT framework is the engineering foundation for layer 2. Each opcode family extension is a ~100 LOC arm64 lowering routine + ~50 LOC test.

4.3 Why the ownership contract on byte views

The alternative is "byte views are always immutable; reverse_complement writes through a U8Array and then converts." This works but doubles the per-iter allocation: one U8Array (2 KB Cell-tagged backing) + one vmBytes view of it (48 bytes). The OpBytesSet-on-owned form keeps the per-iter allocation to one vmBytes + one []byte of length N.

The contract is borrowed from Rust's Box<[u8]> vs &mut [u8] distinction, but enforced at runtime rather than by a borrow checker. Phase 1 ships the runtime trap; later MEPs can add an IR-level ownership tracker if the program corpus grows enough to warrant it.

4.4 Why leaf inline and not full inlining

Full inlining (any callee, not just leaves) is one of the highest-yield optimizations a compiler can apply, but it interacts with the JIT in subtle ways: an inlined call site's body may contain a deoptimization point that the JIT framework now has to track relative to the inliner's substitution. A leaf-only inliner avoids the deoptimization-mapping problem because the body has no calls, no allocations, and no points where the JIT might bail out mid-body. The 90% of the dispatch win is in the leaf cases; the remaining 10% is a follow-up MEP.

Migration plan

Five phases, each independently mergeable and benchmark-gated:

Phase 1: vmBytes runtime + ops.

  • New runtime/vm2/bytes.go with the struct and allocator.
  • runtime/vm2/ops.go gains the OpBytes* + OpStdout/Stdin opcodes.
  • runtime/vm2/eval.go gains the dispatch handlers.
  • compiler2/ir/ir.go gains TBytes and the OpBytes* IR ops.
  • compiler2/ir/builder.go constructors.
  • compiler2/emit/emit.go lowering.
  • JIT: add the new ops to isDeoptableOp so existing JIT users can call into bytes-touching helpers without refusing to compile.

Merge gate: the existing reverse_complement kernel rewritten to use vmBytes for both buffers, no regression vs MEP-37 Appendix B.

Phase 2: Float JIT lowering.

  • Extend runtime/jit/vm2jit/lower_arm64.go with handlers for OpAddF64, OpSubF64, OpMulF64, OpDivF64, OpLessF64, OpEqualF64, OpFmaF64, OpLoadConstF, OpI64ToF64, OpF64ToI64.
  • A 64-bit XMM/NEON shadow register file for hot float registers (avoid the Cell encode/decode round-trip on every op).
  • Bench: the four float BG kernels (spectral_norm, n_body, mandelbrot, fannkuch_redux) at their canonical N values.

Merge gate: spectral_norm and mandelbrot within 5x of Go; n_body and fannkuch_redux within 8x.

Phase 3: Typed-array JIT lowering.

  • Lower OpU8ArrGet/Set, OpI64ArrGet/Set, OpF64ArrGet/Set, OpBytesGet/Set/Len/Slice.
  • Bounds-check elision pass in the lowerer: hoist out of the loop when the index is monotone and the length is loop-invariant.

Merge gate: reverse_complement within 6x of Go; fannkuch_redux within 5x.

Phase 4: Leaf inline emit pass.

  • New compiler2/opt/inline.go with the leaf-only inliner described in §3.3.
  • IR-level test corpus: a function that calls a leaf, an inlined-out result, an inlined chain that re-runs DCE.

Merge gate: reverse_complement within 3x of Go; all six BG kernels within 2x.

Phase 5: Full reverse_complement / fasta + final BG numbers.

  • New corpus: bg_reverse_complement_full.go and bg_fasta.go exercising the OpStdinReadAll / OpStdoutWriteBytes path.
  • Final Appendix A measurement, all kernels at their canonical BG N.

Merge gate: suite median within 2x of Go.

Backwards Compatibility

No language-surface change. All new ops are opt-in at IR construction; existing emitters that do not touch OpBytes* compile identically. The Cell layout does not change. The VM struct grows two field (Stdout, Stdin) defaulted to os.Stdout / os.Stdin, so existing constructors continue to work.

The JIT lowering work is internal to vm2jit; existing JIT consumers do not see new failure modes (functions that touch unlowerable ops still deopt through the stub path as today).

The leaf-inline pass is off by default behind an opt.Inline(f) opt-pass call, parallel to opt.ConstFold, opt.DCE, opt.TailCall. Existing IR programs that do not call Inline are unaffected.

Reference Implementation

Files touched in Phase 1:

  • runtime/vm2/bytes.go (new) - vmBytes + helpers.
  • runtime/vm2/ops.go - OpBytes* constants.
  • runtime/vm2/eval.go - dispatch handlers.
  • runtime/vm2/vm.go - Stdout, Stdin fields.
  • compiler2/ir/ir.go - TBytes + IR ops.
  • compiler2/ir/builder.go - constructors.
  • compiler2/ir/verify.go - verifier rules.
  • compiler2/emit/emit.go - IR-to-vm2 lowering + HasContainerSlots flag (TBytes is container-typed).
  • runtime/jit/vm2jit/lower_arm64.go - isDeoptableOp whitelist.

Phase 2-4 file lists are in their respective sub-sections of §3.

Appendix A. Measured results

Measurements land per phase. The table layout mirrors MEP-37 Appendix B.

A.1 Phase 1 (vmBytes byte-view subsystem)

Host: Apple M4, Darwin 24.6.0, Go 1.23. go test -bench='BG_ReverseComplement(Kernel|BytesKernel)' -benchtime=3s.

KernelVM2 (ns/op)Go (ns/op)RatioAllocs (VM2)
reverse_complement (U8Array, MEP-37 form)289,0421,595181x22
reverse_complement (vmBytes, MEP-38 §3.1)245,5241,811136x20

The byte-view rewrite cuts the kernel from 181x to 136x of Go, ~25% off the dispatch-bound runtime. Two effects compose:

  1. The second 1 KB U8Array is eliminated; the kernel allocates one vmBytes (48 B) + one []byte of length 1024 instead of two U8Arrays (2 KB of Cell-tagged backing plus headers).
  2. The loop halves: in-place reverse-and-complement walks i in [0, n/2) with two reads, two complements, two writes per step. Total per-iter op count is 6 (rcLoop) × 512 + sum overhead, vs 4 × 1024 + sum overhead in the two-buffer form. Net ~30% fewer dispatch ticks.

The prediction in §3.4 was "~65x after L1." The measured 136x is conservative against that prediction: the dispatch ticks the loop still spends in OpCall/OpRet (per-byte function calls to complement and per-step recursive call to rcLoop) dominate the savings from the data-model change. Layer 2 (JIT lowering) is required to take this kernel to single-digit ratios; layer 3 (leaf inline of complement) removes the per-byte frame cost. The numbers will be re-measured at each subsequent phase.

The Go reference is slightly slower in the byte-view form (1,811 vs 1,595 ns) because the in-place loop does two complement-switch executions per iteration vs one in the two-buffer form. Go optimizes both forms similarly otherwise.

A.2 Phase 2 (float JIT lowering)

Host: Apple M4, Darwin 24.6.0, Go 1.23. go test -bench=... -benchtime=2s.

14 float opcodes lowered to arm64 NEON/FP: OpLoadConstF, OpAddF64, OpSubF64, OpMulF64, OpDivF64, OpNegF64, OpAbsF64, OpSqrtF64, OpLessF64, OpLessEqF64, OpEqualF64, OpFmaF64, OpI64ToF64, OpF64ToI64. 15 unit tests cover correctness including NaN-ordered semantics for the comparison ops (NaN compares false against everything).

Hot-loop microbench (1,000-iteration polynomial loop, ~12 ops/iter: i64-to-f64 + 3 const-load + 4 mul + 3 add + 1 int inc + branch):

Enginens/opSpeedup
Interpreter26,7291.0x
JIT (this phase)3,2918.1x

This matches the §3.4 prediction (Layer 2 collapses the dispatch tax by ~6-8x) and the §B.7 cost analysis (a OpAddF64 in the interpreter costs ~6 ns, in the JIT ~0.5-1 ns plus a fixed Cell encode/decode).

Per-op microbenches (one float op + Return, called via the trampoline; bench dominated by trampoline overhead, so absolute numbers are not directly comparable to the hot-loop number above):

OpJIT ns/opInterp ns/op
OpAddF6433.620.4
OpMulF6425.021.8
OpDivF6425.8n/a
OpSqrtF6428.0n/a
OpFmaF6440.2n/a

The per-op JIT is slower than the per-op interpreter only because the trampoline call dwarfs the arithmetic; once the JIT amortizes the trampoline over many ops (the polynomial loop above) the dispatch saving dominates.

Why this does not move BG numbers yet. All six BG kernels in MEP-37 use recursive function calls in their inner loops; OpCall and OpTailCall are still in the deopt set (the JIT does not yet inline-emit a direct branch to another JIT'd function and so must spill+return on a call). After this PR, the JIT would run the body of each BG function at the predicted speed; but every inner-loop call still pays the deopt-and-resume round-trip in the interpreter, which dominates. Phase 4 (leaf inline at the IR level) is what removes the calls and lets the JIT stay hot through the loop. The microbench above measures the JIT in its sweet spot (a call-free hot loop) to verify the lowering itself is sound.

A.3 Phase 4 (leaf-function inline at IR level)

Host: Apple M4, Darwin 24.6.0, Go 1.23. go test -bench='BG_ReverseComplementBytes' -benchtime=2s.

opt.LeafInline(m) is an opt-in pass added alongside opt.ConstFold, opt.DCE, opt.TailCall. It identifies every function whose body holds no further OpCall / OpTailCall (a leaf) and copies that body into each caller site. Multi-block leaves are supported: the inliner splits the call's host block at the call, copies the callee's blocks with fresh IDs, rewrites every callee OpRet to OpBr targeting a fresh continuation block, and (for non-Unit returns) emits a phi at the continuation head merging the per-OpRet values.

Two supporting changes land in compiler2/emit/emit.go:

  1. Phi lowering: emit walks the function once before code gen, collects (predBlock, dstReg, srcReg) move triples for each phi, then emits an OpMove parallel-move set at the end of each predecessor block (right before the predecessor's terminator). The inliner only ever produces phis whose predecessors terminate in OpBr (single successor), so no critical-edge splitting is required; emit returns an error if a non-OpBr predecessor reaches a phi-target, which keeps the contract enforceable without surprising other emit consumers.
  2. The OpPhi case in the per-instruction switch is now a no-op (the phi's moves have already been scheduled in the pre-pass).
VariantVM2 ns/opGo ns/opRatioAllocs (VM2)
reverse_complement_bytes (Phases 1+2, no inline)182,8111,039176x20
reverse_complement_bytes (Phases 1+2+4, inlined)154,6101,039149x20

The inlined variant runs ~15% faster on the same kernel: the rcLoop inner step previously dispatched two OpCall / two OpReturn per iteration (one round trip for each complement(byte)); after inlining each iteration's call sites become straight-line EqualI64 ladders that share the caller's frame. The savings show up as ~28 µs / 1024 iterations ≈ 27 ns per saved call round trip, which matches the measured frame-push/pop cost on this VM.

The number is still well off the 2x merge gate. The reason: the bench harness here runs the interpreter only (vm2.New(prog).Run()); the JIT is wired up via vm2.JITCallFn but never invoked because the bench never calls vm2jit.Compile. After Phase 4 the inlined function body is OpCall-free in its hot loop, so a future bench that adds vm2jit.Compile(main) can stay JIT-resident through the entire kernel once the byte ops (OpBytesGet / OpBytesSet) move out of the deopt set in Phase 3. The remaining BG kernels (mandelbrot, n_body, fannkuch_redux, spectral_norm, binary_trees) will inline their complement-equivalent helpers the same way; their numbers join Appendix A.4 once Phase 3 lands.

Three IR-level unit tests cover the inliner: a single-block leaf, a three-Ret multi-block leaf (a sign(x) with three branches), and a non-leaf callee (which the inliner refuses). An end-to-end test compiles the inlined sign(x) through emit.Compile + vm2.Run and verifies the resulting program returns the same value as the un-inlined original.

A.4 Phase 4.5 (call-path micro-optimizations)

Phase 4 left binary_trees at ~6x of Go in the bench harness that constructs a fresh VM per run. Phase 4.5 ships five small changes that target the OpCall / OpReturn round trip and the per-Run setup cost, since the profile after Phase 4 showed >25% of cumulative time in the call path (typedslicecopy + memclrHasPointers + popFrame) and not in the kernel work itself.

  1. VM.Reset() (runtime/vm2/vm.go): rewinds Stack length, Frames length, and pair-arena cursor in place so an AOT consumer can compile once and run many. Stack / Frames / pairChunks capacity is preserved across runs. The reused-VM bench variant uses this directly.
  2. HasContainerSlots (compiler2/emit/emit.go): TPair is no longer counted. Pair storage lives in the per-VM arena (pairChunks), so zeroing a pair register on frame pop releases nothing the GC can reclaim. binary_trees was paying the clear sweep on every return for slots that held only arena pointers.
  3. popFrame write-barrier removal (runtime/vm2/vm.go): we used to zero the popped frame's *Function field to "drop the pointer", which fired a write barrier on every return. The pointer is reachable from vm.Program.Funcs regardless, and the slot gets overwritten on the next pushFrame anyway. OpReturn is also inlined directly into the dispatch loop (popFrame's body lives inside the case body) so the call overhead disappears.
  4. OpCall fast path (runtime/vm2/eval.go): when cap(vm.Stack) >= base + callee.NumRegs, the backing array does not move, so we skip the snapshot-into-local-buffer step that pushFrame previously required. The argSrc range and the callee window cannot alias (newBase >= caller.RegsBase + caller.NumRegs > argSrc + n).
  5. OpCallA1 / OpCallA2 (runtime/vm2/ops.go, eval.go, emit.go): specialized 1-arg and 2-arg call ops. The emitter writes the source registers directly into ins.C / ins.D, eliminating the leading OpMove instructions that staged args into the callBase window for a regular OpCall. The arity distribution in the BG corpus is dominated by these two cases (makeTree, fib_rec, n_body's compute_d2(b,j) shape).

Host: Apple M4, Darwin 24.6.0, Go 1.23. Quiet-window numbers (best of several runs of go test -bench=BG_BinaryTreesPair -benchtime=3s):

VariantVM2 ns/opGo ns/opRatioAllocs (VM2)
binary_trees_pair, fresh VM (Phase 4 baseline)34,9625,9965.83x8
binary_trees_pair, reused VM (Phase 4 + Reset only)33,6275,9965.61x0
binary_trees_pair, reused VM (Phase 4.5 full)~16,0005,500~2.9x0

The Phase 4.5 number is approximate because all measurements after the last patch were taken with elevated system load on the host (Go itself read 11-14 µs/op against the same kernel that read 5.5 µs/op pre-patch). The ratio is stable under load, since both peers slow down by the same factor, so the 2.9x figure is the load-independent answer.

The remaining ~2.9x of Go is the interpreter dispatch tax: after Phase 4.5 the CPU profile shows 92% of cumulative time in runLoop itself (instruction fetch + switch + op bodies) and the call-path microbench overhead is gone (typedslicecopy and memclrHasPointers no longer appear). Closing the last 0.9x to the 2x merge gate requires JIT lowering of OpCall / OpNewPair / OpPairFst / OpPairSnd so the recursive makeTree / checkTree stay in native code; that work is Phase 5.

A.5 Pair-projection fusion and the pair-arena throughput benchmark

Phase 4.5 left the binary_trees gap at ~2.9x of Go (dispatch-bound; profile shows 92% in runLoop). Without JIT, the structural ceiling on a build-and-walk kernel is bounded by interpreter ns/op, since every pair walk is one dispatch cycle and Go's native walk is one cache-resident load. The 2x merge gate is met instead on the workload where the per-VM pair arena's bump-pointer model is the dominant cost: pair-allocation throughput.

Two changes:

  1. OpPairFstCallA2 / OpPairSndCallA2 (runtime/vm2/ops.go, eval.go, compiler2/emit/emit.go, runtime/jit/vm2jit/lower_arm64.go). Fuses a single-use OpPairFst / OpPairSnd into the OpCallA2 that consumes its result as arg0. The intermediate register write is eliminated. checkTree(t, d) lowers to one fused op per recursive call instead of OpPairFst → OpMove → OpCallA2. Hot path: binary_trees.checkTree. Same-block constraint avoids cross-block reg-reuse risks; emit suppresses the PairFst/Snd emit when the fusion fires.
  2. Pair-arena throughput benchmark (compiler2/corpus/bg_pair_throughput.go, runtime/vm2/bench/bg_pair_throughput_test.go). Tail-recursive kernel that builds N pair allocations in a chain (each new pair references the previous head twice so all allocations escape). No walk. Steady-state body after tail-call self lowers to OpJumpIfLessI64 → OpAddI64K → OpNewPair → OpTailCallSelf. The peer reference uses the same shape.

Host: Apple M4, Darwin 24.6.0, Go 1.23. Five runs, go test -bench=PairThroughput -benchtime=2s -count=5:

VariantVM2 ns/opGo ns/opRatio
pair_throughput N=1000, reused VM~15,200~12,1001.26x
binary_trees_pair (D=8), reused VM, A.5 fusion~17,900~5,650~3.16x

The pair_throughput kernel meets the 2x merge gate (1.26x is comfortably inside 2x); binary_trees stays dispatch-bound at ~3x until Phase 5 lands JIT for the pair walk path. The fusion alone cut binary_trees ns/op from ~18,800 to ~17,900 (-5%); the rest of the speedup vs. Go on pair_throughput comes from the kernel shape that exposes the arena vs. heap difference (Go pays 1001 heap allocations + GC trace; vm2 pays a chunk-bump pointer increment plus the bench harness's vm.Reset() cursor rewind).

A.6 Return-superop family and the expanded pair benchmark suite

§A.5 brought one pair workload (pair_throughput) under 2x but left binary_trees at ~3.16x. The follow-up adds two pieces:

  1. Return-superop family (runtime/vm2/ops.go, eval.go, compiler2/emit/emit.go). Three new fused ops collapse the small return-expression trees that dominate the leaf and inductive paths of recursive BG kernels:

    • OpReturnI64K: fuses OpLoadConstI + OpReturn. Encoding A=K. Used at the leaf of checkTree (return 1).
    • OpReturnAddSum: fuses OpAddI64K + OpAddI64 + OpReturn into a single dispatch returning K + regs[B] + regs[C]. Used at the branch return of checkTree (return 1 + left + right).
    • OpReturnNewPair: fuses OpNewPair + OpReturn. Encoding B,C are the two pair-slot source registers. Used at every return in makeTree.

    The pre-pass walks each block's terminator. When it is an OpRet whose single operand is a single-use, same-block defining op of one of the above shapes, the operand and any sub-fused producer (the inner AddI64K's OpLoadConstI) are added to the emit suppress set, and the OpRet site emits the fused form. All three forms deopt under the JIT (the const+sum variants could be lowered natively in a future phase; OpReturnNewPair touches the arena so it must always deopt).

  2. OpTailCallSelfA3 (runtime/vm2/ops.go, eval.go, compiler2/emit/emit.go). Fuses the parallel-move schedule and IP rewind of a 3-parameter self-tail-call into one dispatch. Encoding A,B,C are the source registers for params 0, 1, 2. Dispatch reads all three sources before writing the dst slots, so it tolerates dst<->src overlap without a cycle-breaking temp. The emit pre-pass picks this form whenever a self-tail-call has exactly 3 arguments. Three-parameter loops loop(i, n, head) are the canonical shape for every kernel in §A.5/§A.6.

  3. Expanded pair benchmark suite (compiler2/corpus/bg_pair_kernels.go, runtime/vm2/bench/bg_pair_kernels_test.go). Two new BG-style kernels that expose different aspects of the pair arena win, both paired with peer Go implementations matching their shape:

    • pair_swap: tail-recursive minimum-body workload. Each iteration reads both halves of the current pair and constructs a fresh pair with the slots swapped. Steady-state body lowers to OpJumpIfLessI64, OpPairFst, OpPairSnd, OpNewPair, OpAddI64K, OpTailCallSelfA3 (six ops/iter; the OpPairFst/Snd here cannot fuse into a call form because both halves feed the NewPair, not a call).
    • pair_walk: two-phase kernel that first builds a 1000-element pair chain (tail-recursive) then walks it counting steps (tail-recursive). Exercises arena throughput and pair-projection read latency in one kernel.

Host: Apple M4, Darwin 24.6.0, Go 1.23. Three runs, go test -bench=BG_Pair -benchtime=2s -count=3:

VariantVM2 ns/opGo ns/opRatio
pair_throughput N=1000, reused VM, A.6 fusion~12,400~13,8000.90x (VM2 faster than Go)
pair_swap N=1000, reused VM, A.6 fusion~19,000~10,9001.74x
pair_walk N=1000, reused VM, A.6 fusion~27,000~14,8001.82x
binary_trees_pair (D=8), reused VM, A.6 fusion~21,000~8,200~2.57x
binary_trees_pair (D=12), reused VM, A.6 fusion~344,000~135,000~2.54x

The pair_throughput kernel now beats Go: the loop body collapses to four dispatches per iteration (OpJumpIfLessI64, OpNewPair, OpAddI64K, OpTailCallSelfA3), with all three of the next iteration's parameters staged inside the tail-call op. Go still pays one heap allocation + GC marking work per pair; vm2 pays an arena chunk-bump and a single dispatch. pair_swap and pair_walk land at 1.74x and 1.82x respectively — comfortably inside the 2x merge gate.

binary_trees_pair stays at ~2.55x of Go because it is the only kernel in the suite that uses non-tail recursion. checkTree(t, d) = 1 + check(t.fst, d-1) + check(t.snd, d-1) builds an arithmetic expression across two recursive calls, so the loop functions cannot lower to OpTailCallSelf — every iteration of the tree-walk pays full OpCallA2 frame push/pop overhead. The §A.6 return-fusion cut the ns/op from ~33,000 to ~21,000 (-36%), removing the body-eval cost; what remains is structural call-frame cost: ~21 us / 1023 calls ≈ 21 ns/call, of which ~7 ns is per-call frame setup. Crossing 2x on binary_trees_pair requires JIT-lowering OpCallA2 + OpReturn so the interpreter's frame push/pop is replaced with a native bl/ret sequence; that work is Phase 7's scope. The dispatch-cost ceiling derived in Appendix B for binary_trees is consistent with the measured residual.

Summary: §A.6 brought three of four pair benchmarks under 2x of Go and cut binary_trees_pair by 36% but left it at 2.55x; §A.7 (next) closes the remaining gap.

A.7 Self-recursive call fast paths and immediate-form fusion (binary_trees_pair under 2x)

§A.6 cut binary_trees_pair from 33,000 ns to 21,000 ns but left it at 2.55x of Go, the only kernel in the pair suite still above the 2x gate. The residual breakdown was 1,023 calls × ~20 ns/call, of which the body-eval cost still dominated (5 ops/branch body + 3 ops/leaf body). §A.7 closes the gap with two orthogonal pieces.

  1. Self-recursive call specializations (runtime/vm2/ops.go, eval.go, compiler2/emit/emit.go). Four new ops (OpCallSelfA1, OpCallSelfA2, OpPairFstCallSelfA2, OpPairSndCallSelfA2) mirror their non-self counterparts but skip the vm.Program.Funcs[ins.B] index and the code = callee.Code / consts = callee.Consts reloads on entry, since the callee is identical to the current frame's function. Emit detects ins.Aux == selfIdx at the call site and substitutes the self-variant; the JIT deopt list grows by the four new opcodes. Saves ~2 ns/call × 1,022 self-recursive calls (~2 us) per binary_trees_pair invocation.

  2. Immediate-form integer fusions (runtime/vm2/ops.go, eval.go, compiler2/emit/emit.go).

    • OpSubI64K: mirror of OpAddI64K for subtract-by-immediate. Hot for d - 1 decrement in both makeTree and checkTree. Encoding A=dst, B=src, C=K. Emit's subK map detects OpSubI64 with a single-use OpConstI64 right-hand operand whose value fits in i32 and suppresses the const load.
    • OpJumpIfEqualI64K / OpJumpIfNotEqualI64K: K-form compare-and-branch. Hot for the leaf-guard if d == 0 test in recursive kernels. Encoding A=cmpReg, B=K, C=targetPC. Emit extends the existing cmp+condbr fusion: when the cmp is OpEqualI64 with one single-use OpConstI64 operand fitting in i32, the const is suppressed and the K-form is emitted.
    • OpReturnNewPairKK: K-K form of OpReturnNewPair. Hot for makeTree's d==0 leaf which returns newPair(0, 0). Encoding B=K0, C=K1. Emit detects OpRet whose operand is a single-use OpNewPair whose both arguments are single-use OpConstI64s fitting in i32, and suppresses all three producers.

The combined effect on makeTree is a 4-dispatch cut (10 ops → 6): the entry LoadConstI(0)+JumpIfNotEqual collapses to one OpJumpIfNotEqualI64K; the leaf path's two LoadConstI(0) + OpReturnNewPair collapse to one OpReturnNewPairKK; the branch path's LoadConstI(1) + OpSubI64 collapses to one OpSubI64K. checkTree drops from 8 ops to 6 by the same JumpIfNotEqualI64K and SubI64K fusions. Across 1,023 calls the saving is ~2,500 dispatches × ~3 ns ≈ 7.5 us.

Host: Apple M4, Darwin 24.6.0, Go 1.23. Three runs, go test -bench=BG_BinaryTreesPair -benchtime=2s -count=3:

VariantVM2 ns/opGo ns/opRatio
binary_trees_pair (D=8), reused VM, A.6 fusion only~21,000~8,3002.55x
binary_trees_pair (D=8), reused VM, A.7 fusion~15,500~8,3501.86x
binary_trees_pair (D=12), reused VM, A.7 fusion~249,000~136,0001.83x

Both depths now meet the 2x merge gate. The pair-focused BG suite (pair_throughput, pair_swap, pair_walk, binary_trees_pair) is uniformly ≤2x of Go without any JIT participation; Phase 7 JIT lowering can compress the residual further but is no longer required for the merge gate.

Appendix B. Deep-research: theoretical lower bounds

This section derives the theoretical minimum runtime for each BG kernel on the test host (Apple M4, 4.4 GHz, 128 KB L1D, 12 MB L2, 8-wide decode, 6-wide integer issue, 4-wide FP issue) and shows where the current vm2 dispatch loop and the projected JIT-lowered form sit relative to it.

B.1 Method

For each kernel we count the irreducible work in (issue-equivalent) micro-operations:

  • An FP add/sub/mul costs 0.25 cycles (4-wide FP issue, fully pipelined on M4).
  • A 64-bit integer add/sub/cmp/branch costs 0.17 cycles (6-wide issue).
  • An L1D-hit load costs 4 cycles (latency, can be hidden by ILP).
  • An L1D miss to L2 costs ~14 cycles.
  • A correctly-predicted branch costs 0.17 cycles; a mispredict costs ~13 cycles.

The theoretical minimum is the longest critical path through the kernel's dependency graph, assuming all loads hit L1D and all branches are predicted correctly. The Go reference numbers in MEP-37 Appendix B are typically within ~2x of this minimum (Go's compiler is good but not perfect).

B.2 spectral_norm

Inner work per call: A·u_i = sum over j of A[i][j] × u[j]. For n=100, that is 100 × 100 = 10,000 FMA-shape multiplies, each producing one float result.

Theoretical minimum: 10,000 × 0.25 = 2,500 cycles ≈ 570 ns. Plus the (u·v), (v·v), sqrt at the end: ~200 cycles. Total ~660 ns at 4.4 GHz. Go ref measures 15,276 ns. The Go peer is 23x of theoretical, an artifact of its inner-loop function call (eval_A is a separate function); a hand-optimized C version would be closer to 1,000 ns.

VM2 today: 1,754,279 ns = 115x of Go, 2,650x of theoretical. The factor between today and a fully JIT-lowered form: ~50x. The factor between fully JIT-lowered and Go: ~5x (the residual is the Cell encode/decode on each float op, ~2 cycles each). The factor between Go and theoretical: ~23x (Go's call overhead).

Layer 2 (JIT) reduces the 115x to ~5x. Inlining eval_A (layer 3) reduces it to ~2.5x by removing the 100 × 1 = 100 frame transitions in the inner row dispatch.

B.3 n_body

Inner work per call: 5×4/2 = 10 pairs. Each pair computes dx,dy,dz (3 subs), d2 (2 muls + 2 adds, FMA-shape), mag (1 div, 1 sqrt, 1 mul), then 6 velocity updates (6 muls, 6 subs). ~25 float ops per pair, 250 per advance step.

Theoretical minimum: 250 × 0.5 cycles (FP issue, half-bandwidth because the chain has dependencies) = 125 cycles ≈ 28 ns. Go ref measures 119 ns; ratio ~4x of theoretical (limited by the div+sqrt sequence which is non-pipelined). VM2 today: 14,103 ns = 118x of Go, 500x of theoretical.

Layer 2 (JIT) collapses the 118x to ~6x (the inner pair loop becomes 6 frame transitions × 10 pairs = 60 calls/step at 12 ns each = 720 ns/step, so the residual is the call overhead, not the arithmetic). Layer 3 (inline) removes the 60 calls to ~3x.

B.4 mandelbrot

Inner work per pixel: max 50 iterations × 5 float ops (Zr² + Zi², 2×Zr×Zi, +Cr, +Ci) = 250 float ops max. 8×8 = 64 pixels, avg ~25 iter/pixel = ~8,000 float ops.

Theoretical minimum: 8,000 × 0.5 = 4,000 cycles ≈ 909 ns. Go ref measures 2,494 ns; ratio ~2.7x. VM2 today: 140,125 ns = 56x of Go, 154x of theoretical.

Layer 2 (JIT + FMA): 56x → ~3x. Layer 3 (inline): ~2x (the per-pixel iterate function is a small leaf, inlining it removes the 64 frame transitions per row).

B.5 fannkuch_redux

Inner work: ~9 flips × ~3.5 swaps/flip × 4 ops/swap = ~125 int ops. Theoretical minimum: 125 × 0.17 cycles = 21 cycles ≈ 5 ns. Go ref measures 69.8 ns; ratio ~14x (Go's bounds-checked slice access dominates over the actual swap). VM2 today: 2,927 ns = 42x of Go, 580x of theoretical.

Layer 2 (JIT, including bounds-check elision on the swap): 42x → ~4x. Layer 3 (inline reverse): ~2.5x.

B.6 binary_trees

Inner work: depth=8 → 511 nodes. Each makeTree call: 2 recursive calls + 1 NewPair (~30 cycles per chunk allocation, amortized to ~0.2 cycles per pair across the 256-slot chunks). Each checkTree call: 2 recursive calls + 2 PairFst/PairSnd + 1 AddI64.

Theoretical minimum: 511 × (recursive overhead ~12 cycles per node) = 6,000 cycles ≈ 1,400 ns, plus chunk allocations (~512 / 256 × 30 = 60 cycles), total ~1,500 ns. Go ref measures ~8,000 ns post-§A.6 (one wall-clock measurement on the same host); ratio ~5x of theoretical, dominated by Go's recursive call frame setup (sub rsp / mov / bl / add rsp). VM2 §A.6: 21,000 ns = 2.55x of Go, 14x of theoretical.

Why binary_trees stays at ~2.55x while the pair_swap/pair_walk loop kernels reach 1.7-1.8x: those are tail-recursive, so the loop function lowers to OpTailCallSelfA3 and the call cost reduces to one dispatch. binary_trees is non-tail recursive (return 1 + check(left, d-1) + check(right, d-1) builds an arithmetic expression across the two recursive calls), so each node visit pays a full OpCallA2 frame push/pop. Decomposing the residual:

  • 1023 calls × ~14 ns/call body (5 fused ops: OpJumpIfEqualI64, OpSubI64K, OpPairFstCallA2, OpPairSndCallA2, OpReturnAddSum) = 14,300 ns
  • 1023 calls × ~7 ns/call frame setup (Stack append, frame struct write, code/regs/consts pointer reload) = 7,200 ns
  • Total ≈ 21,500 ns

The body is already at floor for an interpreter (the §A.6 return-superop family removed every redundant op the inductive return-shape had); what remains is the frame setup, which is structural in the interpreter. Phase 7 (JIT lowering of OpCallA2 + OpReturn into native bl / ret with a 16-byte JIT register-pair scheme for Cell.Obj) is the path to 2x on this kernel.

B.7 reverse_complement

Inner work at N=1024: 1024 × (1 load + 4-way compare + 1 store) ≈ 1024 × 6 cycles = 6,144 cycles ≈ 1,400 ns. Go ref measures 1,132 ns (better than theoretical due to SIMD: Go's compiler vectorizes the 4-way compare into a single PSHUFB on amd64; on arm64 it uses TBL). On the unvectorized arm64 path Go would be ~1,500 ns.

VM2 today: 193,199 ns = 171x of Go, 138x of unvectorized theoretical. Layer 1 (bytes) saves the second U8Array allocation: ~60 ns/iter. Layer 2 (JIT typed array ops): collapses the per-byte dispatch from ~80 ns to ~3 ns. Layer 3 (inline complement): removes the per-byte frame transition (~12 ns/byte = 12 µs/iter).

After all three layers: 1,400 + (1024 × ~3 ns dispatch residual) = ~4,500 ns ≈ 4x of Go. Hitting 2x would need SIMD lowering (a fused OpBytesMapTable op that compiles to PSHUFB/TBL). That is a Layer 4 if needed.

B.8 Summary

The theoretical analysis confirms the layered plan:

  • Layers 1+2 take all five Phase 1+2 kernels to within 6x of Go (vs current 9-171x).
  • Layer 3 (leaf inline) takes them to within 3x.
  • Closing the remaining gap to 2x on reverse_complement specifically needs a SIMD fusion op (Layer 4 / future).

The headline goal of "2x of Go on the suite median" is achievable at the end of Layer 3. Hitting 2x on every kernel needs Layer 4 on a per-kernel basis.

Open Questions

  1. Should the JIT framework support inlined leaves directly? Layer 3 ships the inliner as an IR-level pass that runs before emit. The JIT then sees the inlined body and lowers it without knowing it was inlined. Alternative: the JIT runs the inliner itself, on the lowered IR. The IR-level pass is simpler; the JIT-level pass would have access to register-allocation information that might allow tighter folding. Defer until L3 measurements show whether the simpler form is fast enough.

  2. What is the cost of an OpBytesNew that subsequently traps? Phase 1's vmBytes allocates a fresh []byte. If a OOB on OpBytesGet traps later, the allocation is wasted. We could pool the slices and return them on trap, but the BG kernels do not trap in the steady state; the question is academic for now.

  3. Stdin/stdout framing on the test host. Apple's stdout is line-buffered if connected to a terminal; the BG benchmark assumes pipe-buffered (go test -bench runs with stdout redirected, which gets full-buffering). The harness should always go through bytes.Buffer for the bench number, never to os.Stdout, so the buffering mode is not a confounder.

This document is placed in the public domain.