MEP 37. Benchmarks Game Parity with Go
| Field | Value |
|---|---|
| MEP | 37 |
| Title | Benchmarks Game Parity with Go |
| Author | Mochi core |
| Status | Active (Phase 1 landing) |
| Type | Standards Track |
| Created | 2026-05-17 |
Abstract
vm2 today is within ~2x of Lua on the two Benchmarks Game (BG) programs in the cross-language corpus (MEP-36 Appendix D.2: bg/binary_trees n=10 at 0.95x of Lua, bg/nsieve at 4.6x). Against go run it is between 7x and 70x slower depending on the workload. That ratio is not a single missing optimization: it is a stack of representation and dispatch gaps that compound. This MEP enumerates the gaps, ties each gap to the BG programs that surface it, and proposes the smallest concrete set of changes that closes the dominant ones.
The bet is that four data-model changes (typed arrays for float64/int64/uint8, packed two-element tuples, byte-slice views, byte-allocator arena) plus two opcode families (float arithmetic; specialized typed-array load/store) account for the majority of the gap on six of the ten BG programs. The other four (pidigits, regex_redux, fasta, k_nucleotide) need subsystems that are out of scope for a single MEP and are routed to follow-ups.
The goal stated in headline terms: vm2 within 2x of go run on the six BG programs phase 1+2 cover, and within 5x on the suite median, on the same host that runs MEP-23.
Motivation
The MEP-23 cross-language baseline currently ships two BG programs. Both were added opportunistically: bg/binary_trees because it surfaces the popFrame GC contract that drove MEP-36, bg/nsieve because it was the smallest sieve we could agree on. The remaining eight canonical BG programs are not in the corpus, so vm2 has no measured signal on the workloads the rest of the BG world uses to compare runtimes.
This is a problem for three reasons:
-
No data on FP-dominated workloads. Four canonical BG programs (mandelbrot, n_body, spectral_norm, pidigits) spend nearly all their time in floating-point arithmetic. vm2 has the
TF64IR type andCFloat/Float()Cell helpers, but no float arithmetic opcodes: noOpAddF64,OpMulF64, etc. Any Mochi program with double-typed math falls back to whatever the host language gives the front-end. We cannot measure the gap because we cannot run the program. -
No data on byte-array workloads. Three canonical BG programs (nsieve, fasta, reverse_complement) operate over byte arrays. vm2 stores everything as
[]Cell(16 bytes per element, with anunsafe.Pointerslot that is nil for scalars). A 10,000-entry sieve takes 160 KB of L1/L2 in vm2 versus 10 KB in Go, so the cache miss rate on the inner loop is structurally higher even before dispatch overhead. -
No story on the tree workloads. binary_trees builds and tears down ~2 million two-element nodes per run. Each node is a
*vmListwith a[]Cellof length two: one Go heap allocation, one 32-byte tail, GC traces the lot. Go uses inline two-element slice literals plus generational young-gen reclamation. The factor between the two on this workload is largely allocation cost.
Closing 1+2+3 is what this MEP is about. The remaining BG programs (regex_redux, k_nucleotide, fasta, reverse_complement, pidigits) need subsystem work that does not fit a single MEP and is enumerated below as future scope.
Specification
3.1 The full BG catalog
This MEP commits Mochi to maintaining a peer implementation of all ten canonical BG programs once the dependent subsystems land. The catalog below pairs each program with its dominant data structure and the subset of vm2 enhancements it depends on. The "lands in" column tracks the rollout phase.
| Program | Dominant work | Dominant data | Gap from Go | Lands in |
|---|---|---|---|---|
| binary_trees | recursive allocate + traverse | mutable tree, 2-element nodes | per-node heap traffic | Phase 2 |
| fannkuch_redux | array reversal + permutation | []int8 of length 12 | boxed []Cell storage | Phase 1 |
| fasta | RNG + table lookup | []uint8 output buffer | byte-array, stdout framing | Phase 3 |
| k_nucleotide | hash table + substring | []byte input + map[string]int | byte-slice views, large stdin | Phase 3 |
| mandelbrot | dense FP inner loop | [][]uint8 bitmap | FP arithmetic, byte-array | Phase 1 |
| n_body | 5-body Verlet integrator | []float64 x/y/z/vx/vy/vz | FP arithmetic, float arrays | Phase 1 |
| pidigits | spigot algorithm | big integers | big-int subsystem | Future MEP |
| regex_redux | regex pattern matching | DNA byte string | regex engine | Future MEP |
| reverse_complement | per-byte transform | large byte buffer | byte-array, large I/O | Phase 3 |
| spectral_norm | matrix-vector multiply | []float64 u / v | FP arithmetic, float arrays | Phase 1 |
Phase 1 covers the four FP-dominated programs (mandelbrot, n_body, spectral_norm, fannkuch_redux). Phase 2 covers binary_trees with packed tuples. Phase 3 covers the byte/IO programs. Phases 4+ cover the regex and big-int subsystems and are explicit non-goals here.
3.2 Gap analysis: float arithmetic
n_body's inner loop:
dx = b.x - p.x; dy = b.y - p.y; dz = b.z - p.z
d2 = dx*dx + dy*dy + dz*dz
mag = dt / (d2 * sqrt(d2))
p.vx -= dx * b.m * mag
Every operation is float64 op float64. Today vm2 dispatches each through Cell-tagged paths or falls back to OpAddI64 (which decodes the Bits as int — wrong). The fix is direct:
- New opcodes:
OpAddF64,OpSubF64,OpMulF64,OpDivF64,OpNegF64,OpAbsF64,OpSqrtF64,OpLessF64,OpEqualF64,OpLessEqF64. Each is a 3-byte handler reading twoFloat()operands and writing one viaCFloat. The IR hasTF64; the IR builder grows matching constructors. - A typed-immediate companion
OpLoadConstFfor float constants (the const pool can hold a floatCelldirectly sinceCFloatalready mapsfloat64into the Bits field). - A fused FMA opcode
OpFmaF64(d, a, b, c)computingd = a*b + c. Mandelbrot's iteration is FMA-shaped; the saving is one dispatch per iteration in the inner loop.
No structural runtime change. The handlers are mechanical and the work to land them is len(ops).
3.3 Gap analysis: typed arrays
vm2.vmList stores []Cell. For a []float64 workload this is wasteful:
- 16 bytes per element vs 8 bytes for a
[]float64. spectral_norm at n=5500 is 88 KB vs 44 KB. Both fit in L2 today but at n=20000 (the BG canonical N) the difference is 320 KB vs 160 KB and we cross L2. - Two loads per element: one to fetch the Cell, one to read
Bitsand decode viaFloat(). A[]float64is one load. - Cell's
Objfield is nil for floats but the GC scans it anyway. For a tight FP loop this is wasted bandwidth.
Resolution: typed array primitives.
*vmF64Arraywraps[]float64. New constructorOpNewF64Array(dst, len). Access viaOpF64ArrGet(dst, arr, idx)andOpF64ArrSet(arr, idx, val). The Cell wrapping this still tagPtrs throughCell.Obj, so GC visibility is identical to*vmList.*vmI64Arraymirrors the above for int64.*vmU8Arraymirrors the above for uint8 (the sieves and the mandelbrot bitmap).- IR types:
TF64Array,TI64Array,TU8Array. The IR builder grows matching constructors.
The Cell tag space accommodates this because the Obj field is the only discriminator we use for container types; the new array kinds are just additional *T shapes carried in Obj.
3.4 Gap analysis: tree-shaped data
binary_trees walks ~2M nested two-element nodes. Each *vmList is one Go heap allocation plus a []Cell tail (32 bytes for cap=2). Go writes &Tree{makeTree(d-1), makeTree(d-1)} which is one allocation and a 16-byte body (two Tree pointers). The factor is roughly 2-3x just on allocation cost, on top of the dispatch overhead.
Resolution: a packed two-element tuple Cell.
vmPairis a{a, b Cell}value type allocated in a dedicated arena. The arena is a slice ofvmPairper VM; allocation is an index bump. Reclamation happens at popFrame / Run boundary (we own the arena, so we can clear it cheaply).- New ops:
OpNewPair(dst, a, b),OpPairFst(dst, p),OpPairSnd(dst, p). NoOpPairSet: the corpus has no use case and the arena assumption is monotonic-per-Run. - IR type:
TPair. Verifier checks the operand types ofOpPairFst/Sndproduce the right result type.
The packed-pair representation is the BG-specific lift; binary_trees benefits, and future "small fixed struct" workloads do too. A larger tuple (3, 4, ... up to N) is a follow-up; for binary_trees specifically 2 is sufficient.
3.5 Gap analysis: byte strings and I/O
Three programs need byte-string ops vm2 does not have today:
- Byte slicing without copy. k_nucleotide hashes substrings of a 5 MB DNA input. Copying the substring each time costs O(N×k) bytes. Resolution:
*vmBytescarries a(buf []byte, off, len int)view; new opsOpBytesSlice(dst, src, off, len)andOpBytesEqualoperate on the view. The view holds a back-reference tobufso GC keeps the parent alive. - Stdin/stdout streaming. fasta and reverse_complement write hundreds of KB to stdout. Today the only string op is
OpConcatStrwhich allocates a new buffer per concat. Resolution:OpStdoutWriteBytes(src)that writes a byte buffer to the host stdout without intermediate allocation. - Large stdin read. k_nucleotide reads the whole stdin once. New op
OpStdinReadAll(dst)produces a*vmBytes.
These are catalogued but not in Phase 3's hard scope; the data-model pieces (byte view + arena) are. Phase 3 ships the data model and a minimal I/O API; the BG-specific glue lands separately.
3.6 Out of scope
- Big-int subsystem (pidigits). Mochi has no first-class big integer today. The standard approach is to wrap
math/bigand pay the function-call cost; pidigits is not bottlenecked on this once that's done. Tracked as a future MEP. - Regex engine (regex_redux). Mochi has no regex primitive. RE2 (Go's
regexppackage) is the obvious vendor; the work is mostly API surface, not engine. Tracked as a future MEP.
Rationale
4.1 Why Go as the target
The Benchmarks Game's standard pairwise comparisons use C as the floor. We use Go because:
- vm2 is hosted in Go. The host language sets the ceiling for the runtime: a vm2 that beats Go on a workload would be hard to explain ("we beat the language that runs us"). 2x is a credible and self-consistent target.
- Cross-compile peers in the existing MEP-23 harness already run Go for every Mochi program; the comparison is free in tooling.
- Lua 5.x and CPython are the other obvious targets, but the goal here is to push vm2 past them (already true for binary_trees), so they are not the bar.
4.2 Why typed arrays instead of polymorphic list specialization
A polymorphic list specialization (start with []Cell, switch to []float64 when every element happens to be a float) is what V8 and SpiderMonkey do. The implementation cost is much higher than typed-from-creation arrays, because every OpListSet has to check the homogeneity invariant and pay the deopt path when it breaks. The BG programs do not need polymorphism: the user knows at construction time that the array is float64. Typed-from-creation matches the language semantics and gets the cache-locality win with no dispatch cost.
The follow-on "polymorphic list with type adaptation" is interesting for non-BG workloads and is left to a separate MEP if profile evidence demands it.
4.3 Why packed pairs and not packed N-tuples
Two reasons:
- binary_trees is the canonical 2-tuple workload. Going to N>2 introduces alignment and size-classing problems for the arena (a
vmPairis exactly 32 bytes; avmTuple3is 48 and so on, none of which are L1-cache-line-aligned without padding). Solving size classes is its own MEP. - All current MEP-24 container plans treat structs as their own pointer-backed type. Packed pairs sit alongside that, not in place of it.
If a future BG-like workload demonstrates that 3+ element packed tuples are load-bearing, we generalize then, with the size-class problem solved separately.
4.4 Why an arena for pairs and not Go heap
Per-node &vmPair{...} allocations would inherit the same factor we are trying to close. An arena ([]vmPair, indexed by handle) is one allocation amortized over N pair constructions; the dispatch cost is one bounds check plus one struct write per allocation. The trade-off is that pair memory cannot be reclaimed before Run() ends, the same trade vm2's original Objects table made and that MEP-36 reversed for containers. The reason to accept it here is that BG workloads have monotonic allocation profiles (tree construction is a build-then-tear-down, and the tear-down happens at Run boundary), so the trade is sound for this corpus.
A future MEP that lifts the arena to a full nursery+old-gen split is the obvious sequel; for now the arena suffices.
Migration plan
Four phases, each independently mergeable, each gated by the BG corpus subset it adds.
Phase 1: Float arithmetic + typed arrays. (landing in this MEP's PR)
- New vm2 ops:
OpAddF64,OpSubF64,OpMulF64,OpDivF64,OpNegF64,OpAbsF64,OpSqrtF64,OpLessF64,OpEqualF64,OpLessEqF64,OpLoadConstF,OpFmaF64. - New vm2 container types:
*vmF64Array,*vmI64Array,*vmU8Array. New ops:OpNewF64Array,OpF64ArrLen,OpF64ArrGet,OpF64ArrSet(and mirrors for I64/U8). - New IR ops + builder methods, verified end-to-end with emit + run tests.
- Corpus programs:
bg/spectral_norm,bg/n_body,bg/mandelbrot,bg/fannkuch_redux. Each ships a.go.tmplpeer and a.lua/.pypeer (lua/py optional for first cut, gated on what the BG canonical sources have).
Merge gate: each new program produces correct output at the canonical BG N and at a smaller N (for fast CI). Phase 1 vm2 within 5x of Go on the four programs at their CI N.
Phase 2: Packed pairs + arena.
- New vm2 type
vmPair, per-VM arena, opsOpNewPair,OpPairFst,OpPairSnd. - New IR type
TPair, builder method. - Corpus program: rewrite
bg/binary_treesto use packed pairs. Keep the existing nested-list form as a separate program for the historical comparison.
Merge gate: binary_trees within 2x of Go at depth 10, no regression on the rest of the suite, RSS within +1 MB.
Phase 3: Byte strings + minimal I/O.
- New vm2 type
*vmBytes(view of(buf, off, len)), opsOpBytesNew,OpBytesLen,OpBytesAt,OpBytesSlice,OpBytesEqual. - New ops
OpStdoutWriteBytes,OpStdinReadAll. - Corpus programs:
bg/reverse_complement, partialbg/fasta(RNG part is straightforward, the table-encoding part rides on byte-view).
Merge gate: reverse_complement within 3x of Go (Go is extremely good at this workload because it short-circuits to bufio; 3x is a reasonable bar).
Phase 4+: Big int and regex (separate MEPs). Out of scope here, called out so the catalog is honest about the remaining gap.
Backwards Compatibility
No language-surface change. All new ops are opt-in at IR construction; existing emitters that don't touch the new builders compile identically. The Cell layout does not change (Phase 1's typed arrays carry their pointer in Obj exactly like *vmList does today).
The JIT register-file contract from MEP-34 is unaffected: typed-array container Cells are tagPtr with an Obj pointer, the same shape the JIT already handles for lists and maps. New scalar ops (OpAddF64 etc.) have the same operand shape as their integer cousins; the JIT trace recognizer learns the new opcodes one by one as benchmarks warrant.
Reference Implementation
Files touched in Phase 1:
runtime/vm2/ops.go— new float and typed-array opcodes.runtime/vm2/eval.go— dispatch handlers.runtime/vm2/arrays.go— new file housingvmF64Array,vmI64Array,vmU8Arrayand their accessors.compiler2/ir/ir.go— new IR ops and types.compiler2/ir/builder.go— matching builder methods.compiler2/emit/emit.go— lowering for the new IR ops.compiler2/corpus/bg_spectral_norm.go,bg_n_body.go,bg_mandelbrot.go,bg_fannkuch_redux.go— hand-written IR corpus programs.bench/template/bg/spectral_norm/,bg/n_body/,bg/mandelbrot/,bg/fannkuch_redux/— peer templates (Go / Lua / Python where available).runtime/vm2/bench/corpus_test.go— register the new programs at their CI N.
A measured-results MEP (Informational, numbered after Phase 1 lands) reports BG numbers head-to-head against go run at the canonical BG N, following the MEP-29 / MEP-33 / MEP-36 Appendix D template.
Appendix A. Phase 1 + 2 baseline measurements
Captured on the same host that runs MEP-23 (Apple M4, go test -bench with -benchtime=2s), committed in the PRs landing each phase. The vm2 build runs through compiler2 IR (opt: ConstFold + DCE + TailCall) → emit → vm2.Run.
Kernels measured:
- spectral_norm n=100: build A, compute A·u, return sqrt((u·v) / (v·v)).
- n_body N=5: build 5 bodies with positions (i, 2i, 3i), velocities (i/10, i/5, 3i/10), masses (i+1). One advance step at dt=0.01, then return system energy. Recursive helpers exercised through
OpTailCallSelf. - mandelbrot 8x8, maxIter=50: per-pixel escape count, FMA on the imaginary axis. Stored as bytes in a U8Array; returns the sum.
- fannkuch_redux n=7: count flips for the fixed starting permutation
[4,2,1,5,7,3,6](9 flips). Storage is a single I64Array. - binary_trees depth=8 (Phase 2): build a perfect binary tree of 511 nodes using MEP-37 §3.4 packed pairs as node storage, then walk it counting. Returns 511.
| Benchmark | ns/op | B/op | allocs/op | vm2 / Go |
|---|---|---|---|---|
| BenchmarkVM2_BG_SpectralNormKernel | 1,754,279 | 154,992 | 17 | 115x |
| BenchmarkGo_BG_SpectralNormKernel | 15,276 | 1,792 | 2 | 1.0x |
| BenchmarkVM2_BG_NBodyKernel | 14,103 | 9,400 | 19 | 118x |
| BenchmarkGo_BG_NBodyKernel | 119.4 | 0 | 0 | 1.0x |
| BenchmarkVM2_BG_MandelbrotKernel | 140,125 | 19,608 | 9 | 56x |
| BenchmarkGo_BG_MandelbrotKernel | 2,494 | 64 | 1 | 1.0x |
| BenchmarkVM2_BG_FannkuchReduxKernel | 2,927 | 1,816 | 5 | 42x |
| BenchmarkGo_BG_FannkuchReduxKernel | 69.83 | 0 | 0 | 1.0x |
| BenchmarkVM2_BG_BinaryTreesPairKernel | 61,926 | 23,032 | 8 | 9.2x |
| BenchmarkGo_BG_BinaryTreesPairKernel | 6,748 | 8,176 | 511 | 1.0x |
These are the launch numbers, not the destination. The headline goal in the abstract is 2x on phase-1/2 covered programs once the JIT lowering and bounds-check elision pieces land. The ratios at the baseline are driven by:
- Each helper call (spectral_norm's
eval_A, n_body's pairwise inner loop, binary_trees' recursivecheckTree) costs one full OpCall round-trip; Go inlines the same helper. - Every float arithmetic op pays one byte-code dispatch (~6 ns on M4) and one Cell encode + decode round-trip; Go's compiled code is one FP register-to-register instruction.
- F64Array / U8Array / I64Array Get/Set each pay one OpCall-free dispatch but still do bounds checks the Go peer eliminates.
binary_trees lands at 9.2x rather than the 42-118x of the Phase 1 kernels: the pair arena saves 503 allocations per Run (8 chunk allocations versus Go's 511 per-node ones), so the GC pressure that dominates Go's number is amortized out of vm2's. The Phase 1 kernels do not see a comparable allocation win because their hot path is dispatch-bound, not GC-bound.
n_body lands closer to spectral_norm's ratio than expected because the recursive inner loop (six OpCall round-trips per body pair, 10 pairs per step) is dispatch-bound, not arithmetic-bound. The follow-up phases (JIT lowering for float ops, bounds-check elision on hot array accesses, FMA pattern matching in emit) close the gap; the data above is the starting point.
Appendix B. Phase 3 baseline (reverse_complement kernel)
Same host (Apple M4, -benchtime=3s). The kernel synthesizes 1024 bytes of ACGT repeating, reverses-and-complements into a second buffer (A↔T, C↔G via a 4-way conditional), then sums the output. Two OpNewU8Array allocations and a chain of OpTailCallSelf loops do the work; per-byte we pay one Call/Return into baseFor (fill) and one into complement (rcLoop), plus the OpU8ArrGet / OpU8ArrSet dispatches.
| Benchmark | ns/op | B/op | allocs/op | vm2 / Go |
|---|---|---|---|---|
| BenchmarkVM2_BG_ReverseComplementKernel | 193,199 | 648,816 | 22 | 171x |
| BenchmarkGo_BG_ReverseComplementKernel | 1,132 | 0 | 0 | 1.0x |
The 171x gap is wider than the phase-1 float ratios for two reasons specific to byte work. First, complement is a per-byte function call in the IR (no inline expansion). Go's compiled peer keeps the same logic in a single switch that the SSA backend turns into a balanced branch tree without a frame transition. Second, the kernel allocates two 1024-byte vmU8Array Cells per Run; Go's reference allocates the equivalent make([]byte, 1024) once and the bench loop's allocator hits the bump-allocator fast path almost entirely.
Both of those gaps are precisely what the Phase 3 follow-ups in §3.5 propose: a vmBytes view subsystem (slice-without-copy, plus OpStdoutWriteBytes for streaming output) and an emit-side fast path that inlines small leaf functions like complement into the calling frame. Those land in a subsequent MEP rather than this one; the data above is the measurement that motivates them.
Open Questions
-
Inline float Cell vs
Obj-carried float64 array Cells when N=1. A single float result fits inCFloat's NaN-box; a*vmF64Arrayof length 1 pays a heap allocation for nothing. We may want a fast-pathOpF64ArrGetthat returnsCFloatdirectly (no boxed temporary). Phase 1 picks the naive form; profile evidence decides whether the fast-path is worth adding. -
FMA vs unfused multiply-add for mandelbrot. Go does not auto-FMA on amd64 unless the host CPU supports it; vm2's
OpFmaF64would be unconditional. The "fair" comparison is uncertain. Document the choice and provide both forms in the corpus so reviewers can compare apples-to-apples. -
Arena lifetime for pairs across nested Runs. If a host embedder calls
Runrecursively (re-entrant), the pair arena is shared across outer and inner Run. A nested Run that allocates pairs would leak them into the outer's arena. The Phase 2a implementation keeps the arena alive for the full VM lifetime: chunks accumulate, thepairNextindex monotonically grows, and the Go GC reclaims the wholepairChunksslice when the VM is dropped. This is the simplest correct contract; a per-Run reset is the obvious sequel once profile evidence shows the chunk-retention cost dominates the dispatch cost.
Copyright
This document is placed in the public domain.