MEP 37. Benchmarks Game Parity with Go

Field	Value
MEP	37
Title	Benchmarks Game Parity with Go
Author	Mochi core
Status	Active (Phase 1 landing)
Type	Standards Track
Created	2026-05-17

Abstract

vm2 today is within ~2x of Lua on the two Benchmarks Game (BG) programs in the cross-language corpus (MEP-36 Appendix D.2: bg/binary_trees n=10 at 0.95x of Lua, bg/nsieve at 4.6x). Against go run it is between 7x and 70x slower depending on the workload. That ratio is not a single missing optimization: it is a stack of representation and dispatch gaps that compound. This MEP enumerates the gaps, ties each gap to the BG programs that surface it, and proposes the smallest concrete set of changes that closes the dominant ones.

The bet is that four data-model changes (typed arrays for float64/int64/uint8, packed two-element tuples, byte-slice views, byte-allocator arena) plus two opcode families (float arithmetic; specialized typed-array load/store) account for the majority of the gap on six of the ten BG programs. The other four (pidigits, regex_redux, fasta, k_nucleotide) need subsystems that are out of scope for a single MEP and are routed to follow-ups.

The goal stated in headline terms: vm2 within 2x of go run on the six BG programs phase 1+2 cover, and within 5x on the suite median, on the same host that runs MEP-23.

Motivation

The MEP-23 cross-language baseline currently ships two BG programs. Both were added opportunistically: bg/binary_trees because it surfaces the popFrame GC contract that drove MEP-36, bg/nsieve because it was the smallest sieve we could agree on. The remaining eight canonical BG programs are not in the corpus, so vm2 has no measured signal on the workloads the rest of the BG world uses to compare runtimes.

This is a problem for three reasons:

No data on FP-dominated workloads. Four canonical BG programs (mandelbrot, n_body, spectral_norm, pidigits) spend nearly all their time in floating-point arithmetic. vm2 has the TF64 IR type and CFloat/Float() Cell helpers, but no float arithmetic opcodes: no OpAddF64, OpMulF64, etc. Any Mochi program with double-typed math falls back to whatever the host language gives the front-end. We cannot measure the gap because we cannot run the program.
No data on byte-array workloads. Three canonical BG programs (nsieve, fasta, reverse_complement) operate over byte arrays. vm2 stores everything as []Cell (16 bytes per element, with an unsafe.Pointer slot that is nil for scalars). A 10,000-entry sieve takes 160 KB of L1/L2 in vm2 versus 10 KB in Go, so the cache miss rate on the inner loop is structurally higher even before dispatch overhead.
No story on the tree workloads. binary_trees builds and tears down ~2 million two-element nodes per run. Each node is a *vmList with a []Cell of length two: one Go heap allocation, one 32-byte tail, GC traces the lot. Go uses inline two-element slice literals plus generational young-gen reclamation. The factor between the two on this workload is largely allocation cost.

Closing 1+2+3 is what this MEP is about. The remaining BG programs (regex_redux, k_nucleotide, fasta, reverse_complement, pidigits) need subsystem work that does not fit a single MEP and is enumerated below as future scope.

Specification

3.1 The full BG catalog

This MEP commits Mochi to maintaining a peer implementation of all ten canonical BG programs once the dependent subsystems land. The catalog below pairs each program with its dominant data structure and the subset of vm2 enhancements it depends on. The "lands in" column tracks the rollout phase.

Program	Dominant work	Dominant data	Gap from Go	Lands in
binary_trees	recursive allocate + traverse	mutable tree, 2-element nodes	per-node heap traffic	Phase 2
fannkuch_redux	array reversal + permutation	`[]int8` of length 12	boxed `[]Cell` storage	Phase 1
fasta	RNG + table lookup	`[]uint8` output buffer	byte-array, stdout framing	Phase 3
k_nucleotide	hash table + substring	`[]byte` input + `map[string]int`	byte-slice views, large stdin	Phase 3
mandelbrot	dense FP inner loop	`[][]uint8` bitmap	FP arithmetic, byte-array	Phase 1
n_body	5-body Verlet integrator	`[]float64` x/y/z/vx/vy/vz	FP arithmetic, float arrays	Phase 1
pidigits	spigot algorithm	big integers	big-int subsystem	Future MEP
regex_redux	regex pattern matching	DNA byte string	regex engine	Future MEP
reverse_complement	per-byte transform	large byte buffer	byte-array, large I/O	Phase 3
spectral_norm	matrix-vector multiply	`[]float64` u / v	FP arithmetic, float arrays	Phase 1

Phase 1 covers the four FP-dominated programs (mandelbrot, n_body, spectral_norm, fannkuch_redux). Phase 2 covers binary_trees with packed tuples. Phase 3 covers the byte/IO programs. Phases 4+ cover the regex and big-int subsystems and are explicit non-goals here.

3.2 Gap analysis: float arithmetic

n_body's inner loop:

dx = b.x - p.x;  dy = b.y - p.y;  dz = b.z - p.z
d2 = dx*dx + dy*dy + dz*dz
mag = dt / (d2 * sqrt(d2))
p.vx -= dx * b.m * mag

Every operation is float64 op float64. Today vm2 dispatches each through Cell-tagged paths or falls back to OpAddI64 (which decodes the Bits as int — wrong). The fix is direct:

New opcodes: OpAddF64, OpSubF64, OpMulF64, OpDivF64, OpNegF64, OpAbsF64, OpSqrtF64, OpLessF64, OpEqualF64, OpLessEqF64. Each is a 3-byte handler reading two Float() operands and writing one via CFloat. The IR has TF64; the IR builder grows matching constructors.
A typed-immediate companion OpLoadConstF for float constants (the const pool can hold a float Cell directly since CFloat already maps float64 into the Bits field).
A fused FMA opcode OpFmaF64(d, a, b, c) computing d = a*b + c. Mandelbrot's iteration is FMA-shaped; the saving is one dispatch per iteration in the inner loop.

No structural runtime change. The handlers are mechanical and the work to land them is len(ops).

3.3 Gap analysis: typed arrays

vm2.vmList stores []Cell. For a []float64 workload this is wasteful:

16 bytes per element vs 8 bytes for a []float64. spectral_norm at n=5500 is 88 KB vs 44 KB. Both fit in L2 today but at n=20000 (the BG canonical N) the difference is 320 KB vs 160 KB and we cross L2.
Two loads per element: one to fetch the Cell, one to read Bits and decode via Float(). A []float64 is one load.
Cell's Obj field is nil for floats but the GC scans it anyway. For a tight FP loop this is wasted bandwidth.

Resolution: typed array primitives.

*vmF64Array wraps []float64. New constructor OpNewF64Array(dst, len). Access via OpF64ArrGet(dst, arr, idx) and OpF64ArrSet(arr, idx, val). The Cell wrapping this still tagPtrs through Cell.Obj, so GC visibility is identical to *vmList.
*vmI64Array mirrors the above for int64.
*vmU8Array mirrors the above for uint8 (the sieves and the mandelbrot bitmap).
IR types: TF64Array, TI64Array, TU8Array. The IR builder grows matching constructors.

The Cell tag space accommodates this because the Obj field is the only discriminator we use for container types; the new array kinds are just additional *T shapes carried in Obj.

3.4 Gap analysis: tree-shaped data

binary_trees walks ~2M nested two-element nodes. Each *vmList is one Go heap allocation plus a []Cell tail (32 bytes for cap=2). Go writes &Tree{makeTree(d-1), makeTree(d-1)} which is one allocation and a 16-byte body (two Tree pointers). The factor is roughly 2-3x just on allocation cost, on top of the dispatch overhead.

Resolution: a packed two-element tuple Cell.

vmPair is a {a, b Cell} value type allocated in a dedicated arena. The arena is a slice of vmPair per VM; allocation is an index bump. Reclamation happens at popFrame / Run boundary (we own the arena, so we can clear it cheaply).
New ops: OpNewPair(dst, a, b), OpPairFst(dst, p), OpPairSnd(dst, p). No OpPairSet: the corpus has no use case and the arena assumption is monotonic-per-Run.
IR type: TPair. Verifier checks the operand types of OpPairFst/Snd produce the right result type.

The packed-pair representation is the BG-specific lift; binary_trees benefits, and future "small fixed struct" workloads do too. A larger tuple (3, 4, ... up to N) is a follow-up; for binary_trees specifically 2 is sufficient.

3.5 Gap analysis: byte strings and I/O

Three programs need byte-string ops vm2 does not have today:

Byte slicing without copy. k_nucleotide hashes substrings of a 5 MB DNA input. Copying the substring each time costs O(N×k) bytes. Resolution: *vmBytes carries a (buf []byte, off, len int) view; new ops OpBytesSlice(dst, src, off, len) and OpBytesEqual operate on the view. The view holds a back-reference to buf so GC keeps the parent alive.
Stdin/stdout streaming. fasta and reverse_complement write hundreds of KB to stdout. Today the only string op is OpConcatStr which allocates a new buffer per concat. Resolution: OpStdoutWriteBytes(src) that writes a byte buffer to the host stdout without intermediate allocation.
Large stdin read. k_nucleotide reads the whole stdin once. New op OpStdinReadAll(dst) produces a *vmBytes.

These are catalogued but not in Phase 3's hard scope; the data-model pieces (byte view + arena) are. Phase 3 ships the data model and a minimal I/O API; the BG-specific glue lands separately.

3.6 Out of scope

Big-int subsystem (pidigits). Mochi has no first-class big integer today. The standard approach is to wrap math/big and pay the function-call cost; pidigits is not bottlenecked on this once that's done. Tracked as a future MEP.
Regex engine (regex_redux). Mochi has no regex primitive. RE2 (Go's regexp package) is the obvious vendor; the work is mostly API surface, not engine. Tracked as a future MEP.

Rationale

4.1 Why Go as the target

The Benchmarks Game's standard pairwise comparisons use C as the floor. We use Go because:

vm2 is hosted in Go. The host language sets the ceiling for the runtime: a vm2 that beats Go on a workload would be hard to explain ("we beat the language that runs us"). 2x is a credible and self-consistent target.
Cross-compile peers in the existing MEP-23 harness already run Go for every Mochi program; the comparison is free in tooling.
Lua 5.x and CPython are the other obvious targets, but the goal here is to push vm2 past them (already true for binary_trees), so they are not the bar.

4.2 Why typed arrays instead of polymorphic list specialization

A polymorphic list specialization (start with []Cell, switch to []float64 when every element happens to be a float) is what V8 and SpiderMonkey do. The implementation cost is much higher than typed-from-creation arrays, because every OpListSet has to check the homogeneity invariant and pay the deopt path when it breaks. The BG programs do not need polymorphism: the user knows at construction time that the array is float64. Typed-from-creation matches the language semantics and gets the cache-locality win with no dispatch cost.

The follow-on "polymorphic list with type adaptation" is interesting for non-BG workloads and is left to a separate MEP if profile evidence demands it.

4.3 Why packed pairs and not packed N-tuples

Two reasons:

binary_trees is the canonical 2-tuple workload. Going to N>2 introduces alignment and size-classing problems for the arena (a vmPair is exactly 32 bytes; a vmTuple3 is 48 and so on, none of which are L1-cache-line-aligned without padding). Solving size classes is its own MEP.
All current MEP-24 container plans treat structs as their own pointer-backed type. Packed pairs sit alongside that, not in place of it.

If a future BG-like workload demonstrates that 3+ element packed tuples are load-bearing, we generalize then, with the size-class problem solved separately.

4.4 Why an arena for pairs and not Go heap

Per-node &vmPair{...} allocations would inherit the same factor we are trying to close. An arena ([]vmPair, indexed by handle) is one allocation amortized over N pair constructions; the dispatch cost is one bounds check plus one struct write per allocation. The trade-off is that pair memory cannot be reclaimed before Run() ends, the same trade vm2's original Objects table made and that MEP-36 reversed for containers. The reason to accept it here is that BG workloads have monotonic allocation profiles (tree construction is a build-then-tear-down, and the tear-down happens at Run boundary), so the trade is sound for this corpus.

A future MEP that lifts the arena to a full nursery+old-gen split is the obvious sequel; for now the arena suffices.

Migration plan

Four phases, each independently mergeable, each gated by the BG corpus subset it adds.

Phase 1: Float arithmetic + typed arrays. (landing in this MEP's PR)

New vm2 ops: OpAddF64, OpSubF64, OpMulF64, OpDivF64, OpNegF64, OpAbsF64, OpSqrtF64, OpLessF64, OpEqualF64, OpLessEqF64, OpLoadConstF, OpFmaF64.
New vm2 container types: *vmF64Array, *vmI64Array, *vmU8Array. New ops: OpNewF64Array, OpF64ArrLen, OpF64ArrGet, OpF64ArrSet (and mirrors for I64/U8).
New IR ops + builder methods, verified end-to-end with emit + run tests.
Corpus programs: bg/spectral_norm, bg/n_body, bg/mandelbrot, bg/fannkuch_redux. Each ships a .go.tmpl peer and a .lua / .py peer (lua/py optional for first cut, gated on what the BG canonical sources have).

Merge gate: each new program produces correct output at the canonical BG N and at a smaller N (for fast CI). Phase 1 vm2 within 5x of Go on the four programs at their CI N.

Phase 2: Packed pairs + arena.

New vm2 type vmPair, per-VM arena, ops OpNewPair, OpPairFst, OpPairSnd.
New IR type TPair, builder method.
Corpus program: rewrite bg/binary_trees to use packed pairs. Keep the existing nested-list form as a separate program for the historical comparison.

Merge gate: binary_trees within 2x of Go at depth 10, no regression on the rest of the suite, RSS within +1 MB.

Phase 3: Byte strings + minimal I/O.

New vm2 type *vmBytes (view of (buf, off, len)), ops OpBytesNew, OpBytesLen, OpBytesAt, OpBytesSlice, OpBytesEqual.
New ops OpStdoutWriteBytes, OpStdinReadAll.
Corpus programs: bg/reverse_complement, partial bg/fasta (RNG part is straightforward, the table-encoding part rides on byte-view).

Merge gate: reverse_complement within 3x of Go (Go is extremely good at this workload because it short-circuits to bufio; 3x is a reasonable bar).

Phase 4+: Big int and regex (separate MEPs). Out of scope here, called out so the catalog is honest about the remaining gap.

Backwards Compatibility

No language-surface change. All new ops are opt-in at IR construction; existing emitters that don't touch the new builders compile identically. The Cell layout does not change (Phase 1's typed arrays carry their pointer in Obj exactly like *vmList does today).

The JIT register-file contract from MEP-34 is unaffected: typed-array container Cells are tagPtr with an Obj pointer, the same shape the JIT already handles for lists and maps. New scalar ops (OpAddF64 etc.) have the same operand shape as their integer cousins; the JIT trace recognizer learns the new opcodes one by one as benchmarks warrant.

Reference Implementation

Files touched in Phase 1:

runtime/vm2/ops.go — new float and typed-array opcodes.
runtime/vm2/eval.go — dispatch handlers.
runtime/vm2/arrays.go — new file housing vmF64Array, vmI64Array, vmU8Array and their accessors.
compiler2/ir/ir.go — new IR ops and types.
compiler2/ir/builder.go — matching builder methods.
compiler2/emit/emit.go — lowering for the new IR ops.
compiler2/corpus/bg_spectral_norm.go, bg_n_body.go, bg_mandelbrot.go, bg_fannkuch_redux.go — hand-written IR corpus programs.
bench/template/bg/spectral_norm/, bg/n_body/, bg/mandelbrot/, bg/fannkuch_redux/ — peer templates (Go / Lua / Python where available).
runtime/vm2/bench/corpus_test.go — register the new programs at their CI N.

A measured-results MEP (Informational, numbered after Phase 1 lands) reports BG numbers head-to-head against go run at the canonical BG N, following the MEP-29 / MEP-33 / MEP-36 Appendix D template.

Appendix A. Phase 1 + 2 baseline measurements

Captured on the same host that runs MEP-23 (Apple M4, go test -bench with -benchtime=2s), committed in the PRs landing each phase. The vm2 build runs through compiler2 IR (opt: ConstFold + DCE + TailCall) → emit → vm2.Run.

Kernels measured:

spectral_norm n=100: build A, compute A·u, return sqrt((u·v) / (v·v)).
n_body N=5: build 5 bodies with positions (i, 2i, 3i), velocities (i/10, i/5, 3i/10), masses (i+1). One advance step at dt=0.01, then return system energy. Recursive helpers exercised through OpTailCallSelf.
mandelbrot 8x8, maxIter=50: per-pixel escape count, FMA on the imaginary axis. Stored as bytes in a U8Array; returns the sum.
fannkuch_redux n=7: count flips for the fixed starting permutation [4,2,1,5,7,3,6] (9 flips). Storage is a single I64Array.
binary_trees depth=8 (Phase 2): build a perfect binary tree of 511 nodes using MEP-37 §3.4 packed pairs as node storage, then walk it counting. Returns 511.

Benchmark	ns/op	B/op	allocs/op	vm2 / Go
BenchmarkVM2_BG_SpectralNormKernel	1,754,279	154,992	17	115x
BenchmarkGo_BG_SpectralNormKernel	15,276	1,792	2	1.0x
BenchmarkVM2_BG_NBodyKernel	14,103	9,400	19	118x
BenchmarkGo_BG_NBodyKernel	119.4	0	0	1.0x
BenchmarkVM2_BG_MandelbrotKernel	140,125	19,608	9	56x
BenchmarkGo_BG_MandelbrotKernel	2,494	64	1	1.0x
BenchmarkVM2_BG_FannkuchReduxKernel	2,927	1,816	5	42x
BenchmarkGo_BG_FannkuchReduxKernel	69.83	0	0	1.0x
BenchmarkVM2_BG_BinaryTreesPairKernel	61,926	23,032	8	9.2x
BenchmarkGo_BG_BinaryTreesPairKernel	6,748	8,176	511	1.0x

These are the launch numbers, not the destination. The headline goal in the abstract is 2x on phase-1/2 covered programs once the JIT lowering and bounds-check elision pieces land. The ratios at the baseline are driven by:

Each helper call (spectral_norm's eval_A, n_body's pairwise inner loop, binary_trees' recursive checkTree) costs one full OpCall round-trip; Go inlines the same helper.
Every float arithmetic op pays one byte-code dispatch (~6 ns on M4) and one Cell encode + decode round-trip; Go's compiled code is one FP register-to-register instruction.
F64Array / U8Array / I64Array Get/Set each pay one OpCall-free dispatch but still do bounds checks the Go peer eliminates.

binary_trees lands at 9.2x rather than the 42-118x of the Phase 1 kernels: the pair arena saves 503 allocations per Run (8 chunk allocations versus Go's 511 per-node ones), so the GC pressure that dominates Go's number is amortized out of vm2's. The Phase 1 kernels do not see a comparable allocation win because their hot path is dispatch-bound, not GC-bound.

n_body lands closer to spectral_norm's ratio than expected because the recursive inner loop (six OpCall round-trips per body pair, 10 pairs per step) is dispatch-bound, not arithmetic-bound. The follow-up phases (JIT lowering for float ops, bounds-check elision on hot array accesses, FMA pattern matching in emit) close the gap; the data above is the starting point.

Appendix B. Phase 3 baseline (reverse_complement kernel)

Same host (Apple M4, -benchtime=3s). The kernel synthesizes 1024 bytes of ACGT repeating, reverses-and-complements into a second buffer (A↔T, C↔G via a 4-way conditional), then sums the output. Two OpNewU8Array allocations and a chain of OpTailCallSelf loops do the work; per-byte we pay one Call/Return into baseFor (fill) and one into complement (rcLoop), plus the OpU8ArrGet / OpU8ArrSet dispatches.

Benchmark	ns/op	B/op	allocs/op	vm2 / Go
BenchmarkVM2_BG_ReverseComplementKernel	193,199	648,816	22	171x
BenchmarkGo_BG_ReverseComplementKernel	1,132	0	0	1.0x

The 171x gap is wider than the phase-1 float ratios for two reasons specific to byte work. First, complement is a per-byte function call in the IR (no inline expansion). Go's compiled peer keeps the same logic in a single switch that the SSA backend turns into a balanced branch tree without a frame transition. Second, the kernel allocates two 1024-byte vmU8Array Cells per Run; Go's reference allocates the equivalent make([]byte, 1024) once and the bench loop's allocator hits the bump-allocator fast path almost entirely.

Both of those gaps are precisely what the Phase 3 follow-ups in §3.5 propose: a vmBytes view subsystem (slice-without-copy, plus OpStdoutWriteBytes for streaming output) and an emit-side fast path that inlines small leaf functions like complement into the calling frame. Those land in a subsequent MEP rather than this one; the data above is the measurement that motivates them.

Open Questions

Inline float Cell vs Obj-carried float64 array Cells when N=1. A single float result fits in CFloat's NaN-box; a *vmF64Array of length 1 pays a heap allocation for nothing. We may want a fast-path OpF64ArrGet that returns CFloat directly (no boxed temporary). Phase 1 picks the naive form; profile evidence decides whether the fast-path is worth adding.
FMA vs unfused multiply-add for mandelbrot. Go does not auto-FMA on amd64 unless the host CPU supports it; vm2's OpFmaF64 would be unconditional. The "fair" comparison is uncertain. Document the choice and provide both forms in the corpus so reviewers can compare apples-to-apples.
Arena lifetime for pairs across nested Runs. If a host embedder calls Run recursively (re-entrant), the pair arena is shared across outer and inner Run. A nested Run that allocates pairs would leak them into the outer's arena. The Phase 2a implementation keeps the arena alive for the full VM lifetime: chunks accumulate, the pairNext index monotonically grows, and the Go GC reclaims the whole pairChunks slice when the VM is dropped. This is the simplest correct contract; a per-Run reset is the obvious sequel once profile evidence shows the chunk-retention cost dominates the dispatch cost.

Copyright

This document is placed in the public domain.

Abstract​

Motivation​

Specification​

3.1 The full BG catalog​

3.2 Gap analysis: float arithmetic​

3.3 Gap analysis: typed arrays​

3.4 Gap analysis: tree-shaped data​

3.5 Gap analysis: byte strings and I/O​

3.6 Out of scope​

Rationale​

4.1 Why Go as the target​

4.2 Why typed arrays instead of polymorphic list specialization​

4.3 Why packed pairs and not packed N-tuples​

4.4 Why an arena for pairs and not Go heap​

Migration plan​

Backwards Compatibility​

Reference Implementation​

Appendix A. Phase 1 + 2 baseline measurements​

Appendix B. Phase 3 baseline (reverse_complement kernel)​

Open Questions​

Copyright​