
MEP 23. Cross-language Baseline Benchmarks

| Field | Value |
| --- | --- |
| MEP | 23 |
| Title | Cross-language Baseline Benchmarks |
| Author | Mochi core |
| Status | Draft |
| Type | Process |
| Created | 2026-05-16 |

Abstract

MEP 17 gates per-PR performance with Mochi-vs-Mochi regression budgets. It does not answer the question contributors and prospective users actually ask: how does the Mochi VM compare to other small dynamic runtimes? This MEP defines a separate, periodic cross-language baseline that runs hand-written, idiomatic implementations of the same workloads in Mochi, Python, and Lua, and publishes the numbers. It does not gate merges. It tells everyone where the floor is and how far the floor has moved.

The split from MEP 17 is deliberate. MEP 17 measures the VM against itself so optimizations have a target. MEP 23 measures the VM against the world so the project has a story.

Motivation

MEP 17 §Open questions says comparative numbers are "mostly noise" and explicitly leaves them out. That position was right for the per-PR gate (cross-language numbers are too sensitive to host CPU, GC pauses, and idiom drift to merge against), but wrong as a permanent stance on cross-language reporting. Two concrete problems argue for owning a cross-language suite:

  • No baseline at all today. A new contributor asking "is the Mochi VM in a reasonable place" has nothing to point at. The first numbers anyone collects are ad-hoc and unreviewed.
  • Optimization MEPs have no destination. MEPs 18 through 22 (dispatch, inline caches, value representation, JIT) target an unstated goal. "Close enough to Lua on the math suite" is a goal; "30% faster than today" is a budget. Both matter.

The math benchmarks already shipped under bench/template/math/ are useful for this. Today the runner transpiles Mochi to Python, TS, Go, and C, which measures the transpilers, not Mochi-the-runtime against peer runtimes. To measure peer runtimes we need hand-written, idiomatic implementations in each language that solve the same problem the same way. Those existed for Python (unused by the runner); they are added for Lua in this MEP.

Specification

Scope

The baseline suite is hand-written, idiomatic, problem-equivalent programs. It is not:

  • transpiled output (that measures the transpiler)
  • micro-benchmarks of single opcodes (those belong to the MEP 18 dispatch work)
  • the per-PR gate (that is MEP 17)

The suite covers Mochi (mochi run, VM backend), CPython 3 (python3), and Lua 5.x (lua). Other runtimes (PyPy, LuaJIT, Node, Deno-compiled-TS) may be added as separate columns when there is a story to tell.

Programs

The first cut reuses the existing bench/template/math/ programs. Each program directory holds three files: <name>.mochi, <name>.py, <name>.lua. All three:

  • accept the same {{ .N }} template parameter
  • perform the same computation with the same control flow
  • emit one line of JSON to stdout: {"duration_us": <int>, "output": <value>}
  • exclude all I/O and parsing from the timed window; only the inner repeats are timed

| Program | What it stresses |
| --- | --- |
| fact_rec | recursive call, integer multiply |
| fib_iter | tight numeric loop, integer add |
| fib_rec | deep recursion, return-heavy call frames |
| mul_loop | counted loop with integer accumulator |
| sum_loop | counted loop with integer accumulator |
| prime_count | nested loop with branch and modulo |
| matrix_mul | nested loop over allocated 2D lists |
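
As a rough illustration of the contract above, a Python peer could be shaped like the sketch below. It is not the shipped sum_loop.py: the function names, the REPEAT constant, and the literal N are assumptions made for the example, and in the real template N comes from the {{ .N }} parameter.

```python
import json
import time

N = 1000      # stand-in for the rendered {{ .N }} template parameter
REPEAT = 100  # hypothetical inner repeat count, not taken from the shipped templates


def sum_loop(n: int) -> int:
    """Sum the integers 1..n (n excluded) with a plain counted loop."""
    total = 0
    for i in range(1, n):
        total += i
    return total


def run(n: int, repeat: int) -> str:
    """Time only the inner repeats and return the one-line JSON record."""
    output = 0
    start = time.perf_counter_ns()
    for _ in range(repeat):
        output = sum_loop(n)
    elapsed_us = (time.perf_counter_ns() - start) // 1000
    # JSON encoding happens after the clock stops, so it stays outside the timed window.
    return json.dumps({"duration_us": elapsed_us, "output": output})


if __name__ == "__main__":
    print(run(N, REPEAT))
```

Keeping the clock around the repeat loop only is what lets the three peers be compared without interpreter startup or output formatting leaking into duration_us.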

Workload fidelity is enforced by output match: a run that produces a different output field from Mochi for the same N is a workload bug in one of the three implementations, not a perf result. The runner asserts equality of output across the three columns before reporting timing.
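
The equality check can be pictured as follows. This is a minimal sketch, not the runner's code (the runner is written in Go); the function name and the dictionary shape are hypothetical, and only the duration_us and output keys come from the contract above.

```python
import json


def mismatched_outputs(lines_by_lang: dict[str, str], reference: str = "mochi") -> list[str]:
    """Return the languages whose output field differs from the reference's.

    lines_by_lang maps a language label to the single JSON line its program
    printed for the same N. An empty result means timings may be reported.
    """
    records = {lang: json.loads(line) for lang, line in lines_by_lang.items()}
    expected = records[reference]["output"]
    return [lang for lang, rec in records.items() if rec["output"] != expected]
```

A non-empty list would be treated as a workload bug and stop the run before any timing is published.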

Runner integration

The existing bench/runner.go registers two new template kinds:

  • native_py, points at <dir>/<name>.py, renders {{ .N }}, executes with python3
  • native_lua, points at <dir>/<name>.lua, renders {{ .N }}, executes with lua

These are siblings of the existing mochi_py and mochi_ts rows, which transpile Mochi to those languages and remain in place for the transpiler-focused story. The two perspectives report side-by-side: the transpiled row says "how well does our codegen for language X do", and the native row says "how well does language X do on this problem".

Reporting cadence

This is a periodic baseline, not a per-merge gate. The expected cadence:

  • One full run per minor release, attached to the release notes table.
  • One full run when an optimization MEP (18 through 22) lands a tier-2 milestone, attached to that PR for context.
  • Ad-hoc runs by anyone, anytime; the harness is reproducible by design.

Geometric mean is reported alongside the per-program numbers so the headline number is one ratio per peer language ("Mochi VM is currently 14.7x CPython, 45.2x Lua across the math suite"). Per-program numbers prevent that headline from hiding a pathology in one workload.

Reproducibility

The harness inherits MEP 17's reproducibility rules: pinned MOCHI_NOW_SEED, MOCHI_BENCH=1, CPU model and OS recorded in the run header. Cross-language runs additionally record:

  • python3 --version
  • lua -v
  • go version (for the Mochi VM build)
  • the binary path of the Mochi runtime under test (release tag or commit SHA)

Cross-machine comparison is not meaningful and the runner refuses to merge results from different hosts into a single table.

Failure mode

A baseline that drifts is worse than no baseline. The runner treats these as hard failures, not soft warnings:

  • A native program that fails to parse or exits non-zero.
  • An output field that disagrees with Mochi's output for the same N.
  • A missing peer file (<name>.py or <name>.lua not present where <name>.mochi is).

The first failure ends the run with a non-zero exit. Silent skipping is what produced the "we used to be faster than Lua" problem for languages of the LuaJIT-wars era; this project does not repeat it.
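
The missing-peer rule is simple enough to sketch. The helper below is hypothetical (the real check lives in the Go runner) and assumes only the layout described earlier, where <name>.py and <name>.lua sit next to <name>.mochi.

```python
from pathlib import Path

PEER_SUFFIXES = (".py", ".lua")


def missing_peers(program_dir: Path) -> list[Path]:
    """Return every peer file that should sit next to a .mochi file but does not."""
    missing = []
    for mochi in sorted(program_dir.glob("*.mochi")):
        for suffix in PEER_SUFFIXES:
            peer = mochi.with_suffix(suffix)
            if not peer.is_file():
                missing.append(peer)
    return missing
```

A caller following the hard-failure rule would exit non-zero as soon as this list is non-empty rather than dropping the affected row.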

Initial baseline

Recorded on the author's machine (Apple Silicon M4, macOS Darwin 24.6.0). Numbers are wall-clock microseconds for the inner repeat loop only. Lower is better. Mochi runtimes: runtime/vm is the long-standing stack VM; runtime/vm2 is the from-scratch register VM landed under MEP-21 v2. Python is CPython 3.14.5. Lua is 5.5.0.

runtime/vm baseline (original)

The original MEP-23 baseline, taken against runtime/vm at the time this MEP was registered.

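| Program | N | Mochi VM (µs) | CPython (µs) | Lua (µs) | Mochi / CPython | Mochi / Lua |
| --- | --- | --- | --- | --- | --- | --- |
| fact_rec | 15 | 8,350 | 475 | 177 | 17.6x | 47.2x |
| fib_iter | 30 | 7,291 | 445 | 191 | 16.4x | 38.2x |
| fib_rec | 25 | 116,925 | 5,896 | 2,695 | 19.8x | 43.4x |
| mul_loop | 20 | 4,866 | 423 | 79 | 11.5x | 61.6x |
| sum_loop | 10,000 | 2,121,548 | 142,502 | 31,316 | 14.9x | 67.7x |
| prime_count | 300 | 213,004 | 13,966 | 4,612 | 15.3x | 46.2x |
| matrix_mul | 20 | - (broken) | 3,570 | 1,142 | - | - |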

Geometric mean across the six working programs: Mochi (runtime/vm) is 15.7x CPython and 49.7x Lua. matrix_mul is excluded from the headline because the current Mochi runtime produces output: null for it.
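
The headline ratios can be reproduced from the table with a few lines of Python. This is a sketch for checking the arithmetic, not the harness's own aggregation code; the row values are copied from the runtime/vm table and the helper name is an assumption.

```python
import math

# (Mochi VM, CPython, Lua) wall-clock microseconds from the runtime/vm table.
# matrix_mul is left out because Mochi produced no valid output for it.
ROWS = {
    "fact_rec": (8_350, 475, 177),
    "fib_iter": (7_291, 445, 191),
    "fib_rec": (116_925, 5_896, 2_695),
    "mul_loop": (4_866, 423, 79),
    "sum_loop": (2_121_548, 142_502, 31_316),
    "prime_count": (213_004, 13_966, 4_612),
}


def geomean(values) -> float:
    """Geometric mean, computed as exp of the mean of logarithms."""
    values = list(values)
    return math.exp(sum(math.log(v) for v in values) / len(values))


vs_cpython = geomean(mochi / py for mochi, py, _ in ROWS.values())
vs_lua = geomean(mochi / lua for mochi, _, lua in ROWS.values())
print(f"{vs_cpython:.1f}x CPython, {vs_lua:.1f}x Lua")  # 15.7x CPython, 49.7x Lua
```

The geometric mean is used rather than the arithmetic mean so that one slow row (sum_loop at 67.7x Lua) does not dominate the headline.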

runtime/vm2 baseline (MEP-21 v2)

Taken with the cross-language sweep at bench/crosslang (build: go run ./bench/crosslang). Programs fact_rec and mul_loop are capped at n ≤ 16 (resp. n ≤ 17) because larger factorials overflow vm2's 48-bit signed Cell payload; lifting that cap needs boxed-int support in vm2. The remaining programs (fib_iter, fib_rec, prime_count, sum_loop) match Python's and Lua's outputs exactly at every sampled n.
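
The n ≤ 16 cap follows directly from the payload width. The check below is a sketch that assumes the 48-bit signed payload is two's complement, so the largest positive value is 2^47 - 1; the helper name is made up for the example.

```python
import math

PAYLOAD_BITS = 48                           # width stated in the text
MAX_SIGNED = (1 << (PAYLOAD_BITS - 1)) - 1  # assumes two's complement: 2**47 - 1


def largest_factorial_arg(limit: int) -> int:
    """Return the largest n such that n! <= limit."""
    n = 1
    while math.factorial(n + 1) <= limit:
        n += 1
    return n


print(largest_factorial_arg(MAX_SIGNED))  # 16
print(math.factorial(16), MAX_SIGNED)     # 20922789888000 140737488355327
```

17! is about 3.6 × 10^14, which exceeds 2^47 ≈ 1.4 × 10^14. The n ≤ 17 cap on mul_loop would be consistent with that program multiplying over 1..n with n excluded, although that is an inference from the caps rather than something the table states.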

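| Program | N | vm2 (µs) | CPython (µs) | Lua (µs) | vm2 / CPython | vm2 / Lua | vm2 RSS | CPython RSS | Lua RSS |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| fact_rec | 10 | 1,514 | 948 | 396 | 1.60x | 3.82x | 4.5 MB | 13.4 MB | 2.0 MB |
| fact_rec | 13 | 1,937 | 1,203 | 543 | 1.61x | 3.57x | 4.3 MB | 13.3 MB | 2.3 MB |
| fact_rec | 16 | 2,146 | 1,763 | 592 | 1.22x | 3.62x | 4.5 MB | 13.5 MB | 1.8 MB |
| fib_iter | 10 | 896 | 514 | 210 | 1.74x | 4.27x | 4.5 MB | 13.5 MB | 1.8 MB |
| fib_iter | 20 | 1,625 | 996 | 383 | 1.63x | 4.24x | 4.4 MB | 13.4 MB | 1.9 MB |
| fib_iter | 30 | 1,920 | 1,011 | 420 | 1.90x | 4.57x | 4.4 MB | 12.3 MB | 1.5 MB |
| fib_rec | 15 | 160 | 105 | 53 | 1.52x | 3.02x | 4.2 MB | 12.2 MB | 1.5 MB |
| fib_rec | 20 | 1,767 | 1,169 | 546 | 1.51x | 3.24x | 4.1 MB | 11.8 MB | 1.5 MB |
| fib_rec | 25 | 17,772 | 12,605 | 5,892 | 1.41x | 3.02x | 4.1 MB | 11.7 MB | 1.5 MB |
| mul_loop | 10 | 391 | 464 | 98 | 0.84x | 3.99x | 4.1 MB | 12.0 MB | 1.5 MB |
| mul_loop | 13 | 420 | 465 | 109 | 0.90x | 3.85x | 4.1 MB | 12.2 MB | 1.6 MB |
| mul_loop | 17 | 549 | 635 | 130 | 0.86x | 4.22x | 4.1 MB | 11.6 MB | 1.5 MB |
| prime_count | 50 | 1,893 | 1,710 | 486 | 1.11x | 3.90x | 4.3 MB | 12.0 MB | 1.5 MB |
| prime_count | 100 | 5,361 | 4,644 | 1,438 | 1.15x | 3.73x | 4.2 MB | 11.9 MB | 1.5 MB |
| prime_count | 300 | 31,416 | 22,528 | 7,630 | 1.39x | 4.12x | 4.2 MB | 11.7 MB | 1.5 MB |
| sum_loop | 1,000 | 23,757 | 23,498 | 5,204 | 1.01x | 4.57x | 4.1 MB | 11.6 MB | 1.5 MB |
| sum_loop | 10,000 | 235,813 | 240,137 | 52,325 | 0.98x | 4.51x | 4.2 MB | 12.2 MB | 1.5 MB |
| sum_loop | 100,000 | 2,200,486 | 2,387,934 | 430,488 | 0.92x | 5.11x | 4.1 MB | 11.6 MB | 1.5 MB |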

Headline: vm2 is 0.84x – 1.90x CPython (often faster on tight integer loops, e.g. mul_loop and sum_loop at large N) and 3.0x – 5.1x Lua. Resident-set size is ~4 MB across all programs, roughly 1/3 of CPython's ~12 MB; Lua holds onto only ~1.5 MB.

The original MEP-23 baseline above measured runtime/vm against the same Python and Lua programs; vm2 cuts the geometric-mean gap to CPython from 15.7x to ~1.3x and the gap to Lua from 49.7x to ~3.9x. The CPython gap is essentially closed; the remaining headline target is Lua. Closing that requires interpreter-loop micro-optimizations (computed-goto-equivalent dispatch, superinstructions for the hot loop bigram, frame-reuse on cross-fn tailcalls) plus boxed-int support so the suite is not capped at 48-bit results.

Methodology note. Two of the original native baselines (bench/template/math/sum_loop/sum_loop.py and mul_loop.py) iterated range(1, n+1) while the Mochi and Lua templates iterated 1..n (exclusive). Both Python files were corrected to range(1, n) in the same PR as this baseline update so all three runtimes now compute identical outputs and the numbers are directly comparable.

The numbers above are the floor on the day vm2 entered the suite. Every subsequent optimization MEP that lands inherits an obligation to move them.

Optimization deltas

Recorded as PRs land. The floor table above is frozen; this section tracks movement against it. Same harness (bench/crosslang), same machine, same day-of-measurement Python and Lua. Per-program N matches the floor's smallest two sizes per program (largest N was dropped per maintainer feedback after the floor was set, since the big rows ran for tens of seconds without changing the relative ranking).

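| Program | N | vm2 (µs) | CPython (µs) | Lua (µs) | vm2 / CPython | vm2 / Lua |
| --- | --- | --- | --- | --- | --- | --- |
| fact_rec | 10 | 773 | 588 | 272 | 1.31x | 2.84x |
| fact_rec | 13 | 921 | 834 | 325 | 1.10x | 2.83x |
| fib_iter | 10 | 424 | 369 | 189 | 1.15x | 2.24x |
| fib_iter | 20 | 552 | 631 | 321 | 0.87x | 1.72x |
| fib_rec | 15 | 152 | 107 | 56 | 1.41x | 2.71x |
| fib_rec | 20 | 1,476 | 980 | 460 | 1.51x | 3.21x |
| mul_loop | 10 | 278 | 423 | 98 | 0.66x | 2.84x |
| mul_loop | 13 | 342 | 580 | 125 | 0.59x | 2.74x |
| prime_count | 50 | 1,616 | 1,940 | 475 | 0.83x | 3.40x |
| prime_count | 100 | 4,266 | 4,940 | 1,569 | 0.86x | 2.72x |
| sum_loop | 1,000 | 21,663 | 30,658 | 6,875 | 0.71x | 3.15x |
| sum_loop | 10,000 | 211,156 | 313,199 | 61,128 | 0.67x | 3.45x |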

Landed in this delta:

  • OpTailCall: dropped the argument snapshot. emit.go always stages tailcall args at B = np + ra.NumRegs, which is ≥ np, so the source slice [B..B+n) and the destination param slice [0..n) are guaranteed disjoint. The defensive snapshot in OpTailCall was therefore copying twice for no reason; removing it gave ~28% on sum_loop / fib_iter.
  • Fused i64 compare + branch superinstructions. Six new opcodes (OpJumpIfLessI64, OpJumpIfLessEqI64, OpJumpIfGreaterI64, OpJumpIfGreaterEqI64, OpJumpIfEqualI64, OpJumpIfNotEqualI64) collapse the OpLessI64 → OpJumpIfFalse → OpJump triple into one instruction. The emitter picks the form whose fallthrough matches the next layout block so the trailing OpJump is dropped too: 3 instructions → 1 per loop test, plus one fewer bool-register round-trip per iteration.

Headline movement vs the floor: vm2 vs Lua tightened from 3.0x – 5.1x to 1.7x – 3.5x (geomean ~2.8x). Against CPython, vm2 is now faster on every loop-heavy row in this slice except fib_iter at N=10 (1.15x), and within ~1.5x on every recursive one.

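Round 2 (eval-loop and tail-call codegen)

| Program | N | vm2 (µs) | CPython (µs) | Lua (µs) | vm2 / CPython | vm2 / Lua |
| --- | --- | --- | --- | --- | --- | --- |
| fact_rec | 10 | 615 | 492 | 208 | 1.25x | 2.96x |
| fact_rec | 13 | 751 | 600 | 250 | 1.25x | 3.00x |
| fib_iter | 10 | 186 | 254 | 119 | 0.73x | 1.56x |
| fib_iter | 20 | 270 | 502 | 210 | 0.54x | 1.29x |
| fib_rec | 15 | 103 | 75 | 36 | 1.38x | 2.86x |
| fib_rec | 20 | 1,157 | 880 | 413 | 1.31x | 2.80x |
| mul_loop | 10 | 170 | 336 | 76 | 0.51x | 2.24x |
| mul_loop | 13 | 200 | 354 | 89 | 0.56x | 2.25x |
| prime_count | 50 | 822 | 1,373 | 396 | 0.60x | 2.08x |
| prime_count | 100 | 2,393 | 3,851 | 1,190 | 0.62x | 2.01x |
| sum_loop | 1,000 | 8,609 | 23,547 | 5,200 | 0.37x | 1.66x |
| sum_loop | 10,000 | 83,678 | 240,766 | 52,795 | 0.35x | 1.58x |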

Landed in this round:

  • Hoist dispatch state into locals in eval.go. fr.Fn.Code, fr.Regs, and fr.IP are kept in named locals (code, regs, ip) across the dispatch loop and only synced back to the frame on Call / TailCall / Return / error. Removes one pointer chase per register touch; ~20% across the suite.
  • OpTailCallSelf + parallel-move codegen for same-fn tail calls. Before: emit staged args into a scratch range above the regalloc window, then OpTailCall did a runtime.memmove of the args into [0..n). Profiles showed that memmove was 15.6% of total CPU on sum_loop: every loop iteration was paying for a copy that included the unchanged loop bound. After: the emitter runs a parallel-move scheduler that emits direct OpMoves into the param slots (dropping src==dst entries and breaking cycles via the otherwise-unused callBase register), then dispatches the trivial OpTailCallSelf, which only rewinds IP. Cross-fn tail calls keep the staging path. ~2x on sum_loop.
  • OpAddI64K (add-immediate) superinstruction. The emitter detects AddI64 whose operand is a single-use ConstI64 fitting int32 and fuses both into one dispatch; the loop-counter step (i = i + 1) was the highest-volume OpLoadConstI consumer. ~15% on sum_loop.
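
The parallel-move step is a standard scheduling problem, and a generic version fits in a few lines. The sketch below is the textbook algorithm written in Python for illustration, not a transcription of emit.go; the function names and the register numbering are assumptions. It drops src==dst entries, emits every move whose destination is no longer read by a pending move, and breaks a remaining cycle by parking one register in a scratch slot (the role the text gives to callBase).

```python
def schedule_parallel_moves(moves: dict[int, int], scratch: int) -> list[tuple[int, int]]:
    """Turn simultaneous register moves {dst: src} into an ordered (dst, src) list.

    scratch must be a register that appears in neither the keys nor the values.
    """
    pending = {dst: src for dst, src in moves.items() if dst != src}  # drop no-ops
    ordered = []
    while pending:
        still_read = set(pending.values())
        ready = [dst for dst in pending if dst not in still_read]
        if ready:
            # Nothing pending reads these registers, so overwriting them is safe.
            for dst in ready:
                ordered.append((dst, pending.pop(dst)))
            continue
        # Only cycles remain: park one register in scratch and redirect its readers.
        victim = next(iter(pending))
        ordered.append((scratch, victim))
        for dst in list(pending):
            if pending[dst] == victim:
                pending[dst] = scratch
    return ordered


def run_moves(regs: dict, ordered: list[tuple[int, int]]) -> dict:
    """Execute the moves one at a time, as sequential move instructions would."""
    regs = dict(regs)
    for dst, src in ordered:
        regs[dst] = regs[src]
    return regs


# r0 and r1 swap, r2 (say, the loop bound) is unchanged, r9 is the scratch register.
plan = schedule_parallel_moves({0: 1, 1: 0, 2: 2}, scratch=9)
print(plan)  # [(9, 0), (0, 1), (1, 9)]
```

The unchanged register never appears in the plan, which is the point made above about sum_loop: the loop bound no longer costs a copy on every iteration.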

Headline movement vs the floor: vm2 vs Lua tightened to 1.3x – 3.0x across the slice (geomean ~2.0x). fib_iter and sum_loop are within ~1.3x – 1.6x of Lua; prime_count and mul_loop are ~2x; the recursive fib_rec/fact_rec programs are still ~3x because each non-tail call goes through a fresh frame acquisition. vs CPython, vm2 is now 1.4x – 2.9x faster on every iterative program in the slice (sum_loop is 0.35x CPython = 2.86x faster).

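Round 3 (call/return on a contiguous stack)

| Program | N | vm2 (µs) | CPython (µs) | Lua (µs) | vm2 / CPython | vm2 / Lua |
| --- | --- | --- | --- | --- | --- | --- |
| fact_rec | 10 | 279 | 423 | 180 | 0.66x | 1.55x |
| fact_rec | 13 | 337 | 500 | 188 | 0.67x | 1.79x |
| fib_iter | 10 | 118 | 216 | 103 | 0.55x | 1.15x |
| fib_iter | 20 | 191 | 484 | 178 | 0.39x | 1.07x |
| fib_rec | 15 | 51 | 62 | 31 | 0.82x | 1.65x |
| fib_rec | 20 | 529 | 670 | 311 | 0.79x | 1.70x |
| mul_loop | 10 | 101 | 280 | 61 | 0.36x | 1.66x |
| mul_loop | 13 | 144 | 340 | 77 | 0.42x | 1.87x |
| prime_count | 50 | 691 | 1,214 | 353 | 0.57x | 1.96x |
| prime_count | 100 | 1,916 | 3,117 | 982 | 0.61x | 1.95x |
| sum_loop | 1,000 | 7,714 | 19,402 | 4,282 | 0.40x | 1.80x |
| sum_loop | 10,000 | 77,965 | 207,074 | 44,191 | 0.38x | 1.76x |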

Landed in this round:

  • Stack-of-cells frame model. Round-2 still acquired frames and register slabs through sync.Pool. A profile of fib_rec (the recursive call benchmark) showed sync.Pool.Get/Put plus runtime.memclrNoHeapPointers accounting for 56% of total CPU. The redesign moves activation records onto two contiguous slices on the VM itself (Stack []Cell, Frames []frame); OpCall and OpReturn are now an integer SP bump plus a Frames append/slice-reset. No per-call pool round-trip, no per-frame zero-init memclr.
  • Cross-fn OpTailCall rewritten on the same stack model, reusing the caller's Frames slot and shrinking-then-growing the Stack window to the callee's NumRegs.
  • vm2runner reuses one VM across reps. The crosslang harness used to do vm2.New(prog).Run() per rep, which dominated wall time on short programs (mul_loop n=10/13). Reusing the VM matches what Lua and CPython do (the interpreter state is created once).

Headline movement vs the floor: vm2 vs Lua is now 1.07x – 1.96x across the slice (geomean ~1.6x). fib_iter is at Lua parity (1.07x – 1.15x). The recursive programs that round 2 left at ~3x Lua dropped to 1.55x – 1.79x because the frame-pool round trip is gone. Against CPython, vm2 is 1.2x – 2.8x faster on every program in the slice without exception (vm2 / CPython ranges from 0.82x down to 0.36x).

Strings subsystem (MEP-24 §2, first slice)

The first non-integer subsystem lands. bench/template/strings/concat_loop/{mochi,py,lua} exercises acc = acc + "a" repeated N times in a loop, repeated 1000 times by each language harness. Outputs (final byte length) match across all three runtimes.

Baseline (PR #21520):

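| Program | N | vm2 (µs) | CPython (µs) | Lua (µs) | vm2 / CPython | vm2 / Lua |
| --- | --- | --- | --- | --- | --- | --- |
| strings/concat_loop | 10 | 624 | 542 | 441 | 1.15x | 1.41x |
| strings/concat_loop | 30 | 1,807 | 1,121 | 1,625 | 1.61x | 1.11x |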

After tagSStr (this PR):

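| Program | N | vm2 (µs) | CPython (µs) | Lua (µs) | vm2 / CPython | vm2 / Lua |
| --- | --- | --- | --- | --- | --- | --- |
| strings/concat_loop | 10 | 244 | 266 | 239 | 0.92x | 1.02x |
| strings/concat_loop | 30 | 889 | 594 | 839 | 1.50x | 1.06x |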

A new tagSStr Cell tag packs up to 5 bytes inline (length in payload bits 40..47, bytes in bits 0..39), so the literal "a" and all concat results of length ≤ 5 carry zero heap weight. String constants are also pre-materialized into a per-function StrCells []Cell at VM startup so OpLoadStrK is a slice read on the hot path, never an allocation.
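
The bit layout can be made concrete with a small packing sketch. It models only the payload described above (length in bits 40..47, bytes in bits 0..39) and leaves the tag bits out; the function names are hypothetical, and the byte order inside bits 0..39 is an assumption because the text does not specify it.

```python
MAX_INLINE = 5              # inline byte budget stated in the text
BYTES_MASK = (1 << 40) - 1  # payload bits 0..39 hold the bytes


def pack_inline(data: bytes):
    """Return the 48-bit payload for a short string, or None if it needs the heap."""
    if len(data) > MAX_INLINE:
        return None
    # Assumption: byte i sits at bits 8*i .. 8*i+7 (little-endian).
    return (len(data) << 40) | int.from_bytes(data, "little")


def unpack_inline(payload: int) -> bytes:
    """Recover the bytes from a payload produced by pack_inline."""
    length = (payload >> 40) & 0xFF
    return (payload & BYTES_MASK).to_bytes(MAX_INLINE, "little")[:length]


print(hex(pack_inline(b"a")))  # 0x10000000061
print(pack_inline(b"aaaaaa"))  # None: six bytes exceed the inline budget
```

Storing the length explicitly, rather than relying on a terminator, is what lets strings with embedded zero bytes round-trip through the inline form.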

The N=10 row closes to Lua parity (1.02x) and beats CPython (0.92x). At N=30 the result string outgrows the 5-byte inline budget, so the tail of the loop allocates *vmString per concat; vm2 is now at Lua parity (1.06x) and the residual CPython gap (1.50x) comes down to the in-place resize fast path, which vm2 does not have yet. Concat reuse-when-unique remains the next target.

Lists subsystem (MEP-24 §3, first slice)

The second subsystem lands. bench/template/lists/fill_sum/{mochi,py,lua} allocates an empty list, pushes N integers, then reads them back in a sum loop, repeated 1000 times by each language harness. Outputs (the sum n*(n-1)/2) match across all three runtimes.

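| Program | N | vm2 (µs) | CPython (µs) | Lua (µs) | vm2 / CPython | vm2 / Lua |
| --- | --- | --- | --- | --- | --- | --- |
| lists/fill_sum | 10 | 1,093 | 774 | 908 | 1.41x | 1.20x |
| lists/fill_sum | 100 | 6,495 | 6,042 | 4,255 | 1.07x | 1.53x |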

vmList is a thin wrapper over []Cell (no element-type specialization at MVP). The hot path is OpListPush (append) and OpListGet (bounds-checked load). vm2 closes to CPython on the larger row (1.07x at N=100) where amortized growth dominates fixed dispatch cost; small-N is dispatch-bound (1.41x at N=10). Lua's per-call overhead is lower in this workload (table append + numeric index), leaving vm2 1.20x-1.53x behind. Element-type specialization (homogeneous []int64 storage) is the natural next step, gated on profile evidence per the MEP-24 §3 deferral.

Maps subsystem (MEP-24 §4, first slice)

The third subsystem lands. bench/template/maps/fill_sum/{mochi,py,lua} fills a map with N int->int entries then sums the values back, repeated 1000 times by each language harness. Outputs match across all three runtimes.

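| Program | N | vm2 (µs) | CPython (µs) | Lua (µs) | vm2 / CPython | vm2 / Lua |
| --- | --- | --- | --- | --- | --- | --- |
| maps/fill_sum | 10 | 767 | 475 | 438 | 1.61x | 1.75x |
| maps/fill_sum | 100 | 7,455 | 3,626 | 1,533 | 2.06x | 4.86x |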

vmMap is a thin wrapper over Go's map[any]Cell, with a mapKeyOf normalizer that collapses inline + heap string Cells to their byte content and unboxes ints/bools/null to their scalar value. The MVP intentionally inherits Go map performance; an open-addressed table over the mapKey{tag,bits,aux} struct described in MEP-24 §4 is the natural follow-on. The Lua gap at N=100 (4.86x) is dominated by Lua's hand-tuned hash table and the any boxing penalty on every key; the CPython gap (2.06x) is dispatch + mapKeyOf type-switch overhead, both of which a specialized int-key fast path would erase.

Deep dive: Benchmarks Game methodology

The single-run microbenchmark methodology that produced the tables in the previous section has a known failure mode on a thermally constrained dev laptop: the same vm2 binary running the same program twice can produce results that differ by 3-7x. We observed this concretely while measuring MEP-36 Phase 2, where math/sum_loop n=10000 reported 105 ms on one run and 767 ms on the next ten minutes later with no source change. Two Phase-2 runs of the full sweep produced p-values that benchstat declared "no significant difference" on individual rows while still showing geomean shifts of 30%+. The harness was reporting noise, not signal.

This section documents the methodology we switched to and the reasoning behind it. The reference point is the Computer Language Benchmarks Game: a long-running comparison of ~10 small programs across ~30 language implementations that has been publishing stable numbers on the same hardware for over 15 years.

The four rules we adopted from the Benchmarks Game

  1. One program, one workload, one input. The Benchmarks Game runs each program once at a parameter that takes ~10 seconds on the reference machine. There is no inner repeat=1000 loop folded into the program; the program itself does enough work that the harness's wall-clock measurement of the whole process is the headline number. The existing MEP-23 corpus already follows the inner-loop pattern, so this rule is the future direction for new programs: tune N so the program takes a second or more, then drop the inner repeat to 1. Programs added under the Benchmarks Game suite (see "Suite roadmap" below) follow this convention; existing programs keep their repeats for backwards compatibility with prior appendices.
  2. Median of multiple invocations, not single-run wall-clock. Each (program, n, lang) tuple is invoked K times. The harness reports the median; the min and max are kept for noise-floor inspection. K = 5 is the floor; K = 10 or more is appropriate when the deltas under investigation are below 20%. Median is robust to the two-outlier failure mode we hit with single-run methodology (one thermal stall on a 5-second program throws the mean off by 50%+; it moves the median by zero).
  3. Workloads exposed to all peer runtimes, not just one. A vm2-only benchmark is informative for MEP 17 (the per-PR gate) but not for MEP 23 (the cross-language story). Every program under the BG suite has hand-written .mochi, .py, .lua, and .go.tmpl peers that compute the same output. The runner asserts output equality across all four columns before recording timings; a mismatch is a hard failure. The .go.tmpl extension keeps the unrendered template (which contains {{ .N }}) out of go build ./... walks and out of gopls; the harness substitutes {{ .N }} and writes the result as main.go inside a per-tuple temp dir before invoking go build.
  4. Memory is a first-class metric, not a footnote. The Benchmarks Game tables show CPU time, peak resident-set size, and code size side-by-side. We already report MaxRSS via getrusage; the BG harness elevates it to the headline next to wall-clock, because the "memory management" story that MEP 36 cares about is invisible if the only metric is CPU.
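
The claim in rule 2 about means and medians is easy to check by hand. The samples below are made up for illustration (five invocations of a roughly 5-second program, one of which hit a thermal stall); only the K = 5 floor comes from the rule itself.

```python
import statistics

# Hypothetical wall-clock samples in seconds; the last invocation stalled.
samples = [5.0, 5.1, 4.9, 5.0, 18.0]
clean = samples[:4]

mean_inflation = statistics.mean(samples) / statistics.mean(clean)
print(round(mean_inflation, 2))                               # 1.52
print(statistics.median(samples), statistics.median(clean))   # 5.0 5.0
```

One stalled run out of five moves the mean by about 52% and leaves the median where it was, which is why the harness reports the median and keeps min and max only for inspection.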

The fifth Benchmarks Game rule — fixed reference hardware, with results published per host — is out of scope for this MEP. The runner records the host CPU model in the JSON output and refuses to merge results across hosts, but a project-wide reference machine is a separate (and arguably CI-shaped) decision.

Harness changes (this MEP)

bench/crosslang gained one flag, one new aggregation pass, a new JSON output schema, and a Go peer column:

  • -repeat K (default 1): each (program, n, lang) tuple is invoked K times. Setting K=1 preserves the prior behaviour for callers that do not want to pay the K× wall-clock cost.
  • Per-tuple aggregation prints median=… min=… max=… mem=… rather than a single duration. The markdown table column reports the median; the live progress lines include all three so a reader can see at a glance whether the median is meaningful (a min/max spread > 2× the median says "this row is noise-floor; bump K or shorten N").
  • The JSON output schema changed from []result to []aggregate, where each entry carries the run count and the min/median/max. The Markdown table grew two columns (Go (µs) and Go RSS) and one ratio (vm2 / Go); existing consumers that grep on vm2 / Lua still find that ratio in the same position relative to it.
  • A Go peer column was added next to CPython and Lua. Each program ships a <name>.go.tmpl template alongside the existing .py / .lua files; the harness substitutes {{ .N }} and go builds the result once per tuple, then runs the binary K times. The build is amortized across the K runs so it does not pollute the timing window. Go is the right peer for the "interpreter-vs-compiled-AOT" question, the same way CPython and Lua are the right peers for the "interpreter-vs-other-interpreter" question.

Subprocess startup is still per-invocation, not per-K. Reusing the process across K invocations would underreport memory (the high-water mark of run 1 would carry into runs 2..K) and would mask cold-cache effects that the BG explicitly measures. The K× cost is the price of honest noise quantification.
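
A per-tuple aggregate along these lines could look like the sketch below. The field names are hypothetical and do not reproduce the harness's actual JSON schema, and the noise test is one possible reading of the min/max spread rule (max minus min compared against twice the median); the exact comparison the harness uses is not shown here.

```python
import statistics
from dataclasses import dataclass


@dataclass
class Aggregate:
    runs: int
    min_us: int
    median_us: float
    max_us: int

    @property
    def noisy(self) -> bool:
        # Assumed reading of the spread rule: (max - min) > 2 x median.
        return (self.max_us - self.min_us) > 2 * self.median_us


def aggregate(durations_us: list[int]) -> Aggregate:
    """Collapse the K per-invocation durations of one tuple into one record."""
    return Aggregate(
        runs=len(durations_us),
        min_us=min(durations_us),
        median_us=statistics.median(durations_us),
        max_us=max(durations_us),
    )
```

With durations such as [105, 110, 108, 767, 107] the median stays at 108 while the record is flagged as noisy, which is the signal to raise K or change N before trusting the row.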

Suite roadmap

The math/strings/lists/maps suites already in MEP-23 are the floor. The BG suite extends them with programs whose specific job is to exercise a memory-management property the existing suites cannot reach:

| Program | What it stresses (BG analogue) | Status |
| --- | --- | --- |
| binary_trees | Many short-lived tree allocations + traversal, the canonical GC stress test (BG: binary-trees). Tests MEP 36's container-reclamation contract. | landed (bg/binary_trees) |
| nsieve | Large bool array, write-heavy then read-heavy (BG: nsieve / pidigits style). Tests list payload throughput. | landed (bg/nsieve) |
| mandelbrot | Tight floating-point loop, zero allocation (BG: mandelbrot). Tests dispatch and float64 ops, not GC. | blocked on float64 IR ops |
| n_body | Floating-point physics simulation, many array reads, no allocations after init (BG: n-body). | blocked on float64 IR ops |
| fasta | Buffered string output + RNG (BG: fasta). Tests the string subsystem at scale. | deferred (no string builder; concat is O(n^2)) |
| regex_redux | Regex match throughput (BG: regex-redux). Out of scope until vm2 has a regex op-pack. | deferred |

The first two (binary_trees, nsieve) landed in this MEP. The other four are listed here so the methodology change has a public roadmap.

The binary_trees program in particular is the headline benchmark for MEP 36's GC story: it allocates millions of tree nodes and lets them die. Phase 2 of MEP 36 is supposed to make those nodes reclaimable as soon as the frame holding them returns; the binary-trees throughput-vs-RSS ratio is the load-bearing measurement of whether the design landed. The Mochi peer encodes a tree as a nested 2-element list ([left, right], leaf = []) rather than a struct, because vm2's IR has no struct support yet. The encoding is faithful to the BG semantics (each subtree is its own allocation) and exercises the exact code path MEP 36 cares about: many small list allocations dying as their referring frame returns.
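
The list encoding is easiest to see in code. The sketch below mirrors the [left, right] / leaf = [] shape in Python; it is not the shipped bg/binary_trees peer, and the function names are assumptions.

```python
def make_tree(depth: int) -> list:
    """Build a full binary tree as nested lists: leaf = [], node = [left, right]."""
    if depth == 0:
        return []
    return [make_tree(depth - 1), make_tree(depth - 1)]


def count_nodes(tree: list) -> int:
    """Walk the tree and count every node, leaves included."""
    if not tree:
        return 1
    return 1 + count_nodes(tree[0]) + count_nodes(tree[1])


print(count_nodes(make_tree(10)))  # 2**11 - 1 = 2047
```

Every call to make_tree allocates a fresh list, so a depth-10 tree is 2047 separate allocations that all become garbage once the caller drops the root.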

mandelbrot and n_body are blocked on float64 ops in the IR. The compiler2/ir builder currently exposes ConstI64, AddI64, etc., but no float-typed counterparts. Adding them is a separate MEP-shaped work item (new opcodes, new emit cases, new Cell tag for float payload). When that lands, both programs will be straight ports of the canonical BG kernels with no encoding tricks.

fasta is technically expressible with the current string ops (ConcatStr, LenStr) but the only allocation primitive is "make a new string equal to a ++ b", which is O(n) per append and so O(n^2) overall for a buffered-output program. A strings.Builder analogue (append-only buffer with a finalize-to-string op) would make it tractable. Filed as a follow-up to the MEP-24 string subsystem rather than a blocker for the BG suite.
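
The quadratic cost can be counted directly. The model below is a simplification made for illustration: it charges one byte copied per byte of every freshly built string and ignores reallocation inside the builder. Both function names are hypothetical.

```python
def bytes_copied_by_concat(appends: int, chunk: int = 1) -> int:
    """Bytes copied when every append builds a fresh string of the full new length."""
    copied = 0
    length = 0
    for _ in range(appends):
        length += chunk
        copied += length  # the whole a ++ b result is written again
    return copied


def bytes_copied_by_builder(appends: int, chunk: int = 1) -> int:
    """Bytes written by an idealized append-only buffer (growth copies ignored)."""
    return appends * chunk


print(bytes_copied_by_concat(1000))   # 500500 = 1000 * 1001 / 2
print(bytes_copied_by_builder(1000))  # 1000
```

At a thousand one-byte appends the gap is already about 500x, and it grows linearly with the output size, which is why fasta is deferred until a builder-style op exists.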

The first BG-suite numbers, taken with the methodology this section defines (median of 5 over full-process invocations, -repeat 5), appear in the appendix below.

Why this MEP changes the methodology now

Three concrete pressures pushed the change:

  1. MEP 36 needs a stable measurement. The Phase 1 → Phase 2 comparison requires deltas at the +/-10% level. Single-run methodology cannot resolve that band on the dev hardware in use.
  2. go test -bench + benchstat is the wrong tool for cross-language work. benchstat normalizes for in-process b.N adjustment, which is irrelevant when comparing vm2 and CPython (different process, different timer). The BG model (full-process median of K) is directly comparable across runtimes.
  3. The previous tables in this MEP are valuable as a historical record. They are not retracted; they are pinned at the date and commit they were taken. The BG suite is the forward methodology; the existing rows remain so future readers can trace the trajectory.

Rationale

Why a new MEP, not an edit to MEP 17. MEP 17 is a process gate. Folding cross-language reporting into it would either weaken the per-PR rule (peer runtime variance is too high to gate on) or strengthen the cross-language rule into a gate (an over-commitment given how many factors outside the VM author's control move those numbers). Separate documents, separate cadences, no confusion.

Why Python and Lua specifically. They are the two most cited points of reference for small dynamic runtimes. CPython is the floor any non-JIT register-machine interpreter should aspire to clear; Lua 5.x is the ceiling for a pure-C single-pass interpreter without a JIT. Sitting between them is a reasonable target for the next 18 months. JIT-class peers (LuaJIT, PyPy, V8) become relevant when MEP 22 ships.

Why hand-written, not transpiled. A transpiler benchmark measures the transpiler. We have one of those (the existing mochi_py, mochi_ts rows in the runner) and it answers a different question. Conflating the two has historically misled language-launch announcements; we are not doing that.

Why output-equality enforcement. The cheapest way to look fast is to compute less. Output equality catches that automatically.

Backwards Compatibility

None. This is process-only and additive infrastructure. Existing bench/runner.go rows (transpiled mochi_py, mochi_ts, mochi_go, mochi_c) are unchanged.

The hand-written .py files that previously sat next to the .mochi templates and were unused by the runner are now used. Their duration_ms key was renamed to duration_us to match the runner's existing Result.DurationUs field; if anyone was reading the old key by hand, they will notice.

Reference Implementation

  • Runner change: bench/runner.go adds native_py and native_lua template registration in Benchmarks and matching display labels in report / exportMarkdown.
  • New Lua programs: bench/template/math/<name>/<name>.lua for all seven math programs.
  • Updated Python programs: bench/template/math/<name>/<name>.py now emit duration_us.

Open Questions

  • Other suites. The bench/template/join/ directory has nine join workloads that today are not wired into the runner. They are a stronger test of the query algebra than the math suite is of pure compute. Adding native peer implementations for them is the next slice of this MEP, tracked separately so the math baseline can land.
  • JIT peers. PyPy and LuaJIT belong in a follow-up table once MEP 22 ships, so the comparison happens between like categories (interpreter-vs-interpreter, JIT-vs-JIT).
  • CI integration. This is local-developer infrastructure today. A future MEP may wire a cross-language run into a nightly CI job on a fixed instance, with the JSON archived for trend analysis. That is not in scope here.

References

  • MEP 17: VM Performance Methodology and Baseline. The internal Mochi-vs-Mochi gate this MEP is paired with.
  • [Wren-perf]: Bob Nystrom, Wren Performance. https://wren.io/performance.html. The cross-language table format this MEP follows.
  • [LuaPerformance]: Roberto Ierusalimschy, Lua Performance Tips. https://www.lua.org/gems/sample.pdf. Background on why Lua sits where it does on small-interpreter benchmarks.

Appendix B: First BG-suite measurements

Captured 2026-05-17 on macOS arm64 (Apple Silicon dev laptop), median of 5 invocations per (program, n, lang) tuple via bench/crosslang -repeat 5. Output equality enforced across all four peers. RSS is ru_maxrss in bytes (macOS), the high-water mark per invocation. The two new BG-suite rows are listed first; the existing math/strings/lists/maps rows are included so the methodology change is auditable against the same machine state in one pass.

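| Program | N | vm2 (µs) | CPython (µs) | Lua (µs) | Go (µs) | vm2 / CPython | vm2 / Lua | vm2 / Go | vm2 RSS | CPython RSS | Lua RSS | Go RSS | match |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| bg/binary_trees | 8 | 11711 | 9694 | 12072 | 1502 | 1.21x | 0.97x | 7.80x | 9.1 MB | 12.3 MB | 1.9 MB | 7.5 MB | |
| bg/binary_trees | 10 | 184104 | 156494 | 200145 | 16402 | 1.18x | 0.92x | 11.22x | 10.6 MB | 13.1 MB | 2.5 MB | 10.6 MB | |
| bg/nsieve | 1000 | 5117 | 2341 | 1084 | 107 | 2.19x | 4.72x | 47.82x | 5.8 MB | 12.2 MB | 1.7 MB | 4.3 MB | |
| bg/nsieve | 10000 | 55263 | 25754 | 11149 | 818 | 2.15x | 4.96x | 67.56x | 12.9 MB | 12.3 MB | 2.7 MB | 4.6 MB | |
| lists/fill_sum | 10 | 781 | 358 | 389 | 17 | 2.18x | 2.01x | 45.94x | 4.7 MB | 13.2 MB | 1.8 MB | 4.2 MB | |
| lists/fill_sum | 100 | 5274 | 2706 | 1883 | 98 | 1.95x | 2.80x | 53.82x | 6.3 MB | 11.9 MB | 1.5 MB | 4.3 MB | |
| maps/fill_sum | 10 | 1277 | 459 | 455 | 259 | 2.78x | 2.81x | 4.93x | 5.4 MB | 12.4 MB | 1.5 MB | 4.5 MB | |
| maps/fill_sum | 100 | 9593 | 3652 | 1554 | 1221 | 2.63x | 6.17x | 7.86x | 10.1 MB | 12.3 MB | 1.7 MB | 6.5 MB | |
| math/fact_rec | 10 | 332 | 330 | 136 | 6 | 1.01x | 2.44x | 55.33x | 4.4 MB | 12.6 MB | 1.6 MB | 4.3 MB | |
| math/fact_rec | 13 | 437 | 454 | 148 | 8 | 0.96x | 2.95x | 54.62x | 4.5 MB | 12.1 MB | 1.5 MB | 4.2 MB | |
| math/fib_iter | 10 | 178 | 168 | 79 | 5 | 1.06x | 2.25x | 35.60x | 4.4 MB | 12.1 MB | 1.5 MB | 4.4 MB | |
| math/fib_iter | 20 | 311 | 315 | 135 | 15 | 0.99x | 2.30x | 20.73x | 4.3 MB | 12.1 MB | 1.5 MB | 4.1 MB | |
| math/fib_rec | 15 | 80 | 50 | 25 | 5 | 1.62x | 3.20x | 16.00x | 4.3 MB | 11.9 MB | 1.5 MB | 4.2 MB | |
| math/fib_rec | 20 | 799 | 536 | 244 | 33 | 1.49x | 3.27x | 24.21x | 4.5 MB | 12.3 MB | 1.5 MB | 4.1 MB | |
| math/mul_loop | 10 | 190 | 202 | 43 | 10 | 0.94x | 4.42x | 19.00x | 4.3 MB | 11.9 MB | 1.5 MB | 4.3 MB | |
| math/mul_loop | 13 | 236 | 244 | 55 | 10 | 0.97x | 4.29x | 23.60x | 4.3 MB | 12.1 MB | 1.5 MB | 4.2 MB | |
| math/prime_count | 50 | 1074 | 878 | 256 | 39 | 1.22x | 4.20x | 27.54x | 4.5 MB | 11.9 MB | 1.5 MB | 4.2 MB | |
| math/prime_count | 100 | 2959 | 2290 | 704 | 70 | 1.29x | 4.20x | 42.27x | 4.4 MB | 12.2 MB | 1.5 MB | 4.1 MB | |
| math/sum_loop | 1000 | 11243 | 13870 | 3103 | 264 | 0.81x | 3.62x | 42.59x | 4.4 MB | 12.1 MB | 1.7 MB | 4.1 MB | |
| math/sum_loop | 10000 | 111991 | 142044 | 31021 | 3044 | 0.79x | 3.61x | 36.79x | 4.4 MB | 12.0 MB | 1.6 MB | 4.4 MB | |
| strings/concat_loop | 10 | 308 | 255 | 232 | 274 | 1.21x | 1.33x | 1.12x | 4.6 MB | 12.2 MB | 1.5 MB | 4.2 MB | |
| strings/concat_loop | 30 | 1266 | 587 | 844 | 799 | 2.16x | 1.50x | 1.58x | 5.8 MB | 12.2 MB | 1.5 MB | 4.9 MB | |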

Quick reading of the two new rows:

  • bg/binary_trees lands within 18% of CPython (vm2 / CPython = 1.18x) and within 8% of Lua at depth 10 (vm2 / Lua = 0.92x means vm2 is actually faster than Lua here). RSS is below CPython (10.6 MB vs 13.1 MB). This is the Phase 2 container-reclamation contract paying off: tree nodes die as the frame holding them returns, so the high-water mark stays bounded by the working set rather than the cumulative allocation count. Without that contract a 1024-iteration loop over a 2047-node tree would peg RSS at the cumulative number of allocations, not the live set.
  • bg/nsieve is 2-5x off the interpreter peers. The hot loop is xs[i] == 0 followed by a write phase, both of which pay full Cell overhead per element (one tag check + one heap touch per access). CPython's bytearray and Lua's integer-keyed table both hit the same logical workload through a tighter native code path. This row is the right place to point a typed-list specialization MEP at when it is time to write one.
  • The math/strings/lists/maps rows are reproduced here from the same capture run so the BG-suite numbers and the prior-suite numbers are comparable apples-to-apples (same hardware, same kernel revision, same build of the runner). They are not retracted from the earlier appendix; that earlier appendix used a different methodology and is preserved at the date it was taken so future readers can see the trajectory.

This document is placed in the public domain.