MEP 23. Cross-language Baseline Benchmarks

Field	Value
MEP	23
Title	Cross-language Baseline Benchmarks
Author	Mochi core
Status	Draft
Type	Process
Created	2026-05-16

Abstract

MEP 17 gates per-PR performance with Mochi-vs-Mochi regression budgets. It does not answer the question contributors and prospective users actually ask: how does the Mochi VM compare to other small dynamic runtimes? This MEP defines a separate, periodic cross-language baseline that runs hand-written, idiomatic implementations of the same workloads in Mochi, Python, and Lua, and publishes the numbers. It does not gate merges. It tells everyone where the floor is and how far the floor has moved.

The split from MEP 17 is deliberate. MEP 17 measures the VM against itself so optimizations have a target. MEP 23 measures the VM against the world so the project has a story.

Motivation

MEP 17 §Open questions says comparative numbers are "mostly noise" and explicitly leaves them out. That position was right for the per-PR gate (cross-language numbers are too sensitive to host CPU, GC pauses, and idiom drift to merge against), but wrong as a permanent stance on cross-language reporting. Two concrete problems argue for owning a cross-language suite:

No baseline at all today. A new contributor asking "is the Mochi VM in a reasonable place" has nothing to point at. The first numbers anyone collects are ad-hoc and unreviewed.
Optimization MEPs have no destination. MEPs 18 through 22 (dispatch, inline caches, value representation, JIT) target an unstated goal. "Close enough to Lua on the math suite" is a goal; "30% faster than today" is a budget. Both matter.

The math benchmarks already shipped under bench/template/math/ are useful for this. Today the runner transpiles Mochi to Python, TS, Go, and C, which measures the transpilers, not Mochi-the-runtime against peer runtimes. To measure peer runtimes we need hand-written, idiomatic implementations in each language that solve the same problem the same way. Those existed for Python (unused by the runner); they are added for Lua in this MEP.

Specification

Scope

The baseline suite is hand-written, idiomatic, problem-equivalent programs. It is not:

transpiled output (that measures the transpiler)
micro-benchmarks of single opcodes (those belong to the MEP 18 dispatch work)
the per-PR gate (that is MEP 17)

The suite covers Mochi (mochi run, VM backend), CPython 3 (python3), and Lua 5.x (lua). Other runtimes (PyPy, LuaJIT, Node, Deno-compiled-TS) may be added as separate columns when there is a story to tell.

Programs

The first cut reuses the existing bench/template/math/ programs. Each program directory holds three files: <name>.mochi, <name>.py, <name>.lua. All three:

accept the same {{ .N }} template parameter
perform the same computation with the same control flow
emit one line of JSON to stdout: {"duration_us": <int>, "output": <value>}
exclude all I/O and parsing from the timed window; only the inner repeats are timed
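
A minimal sketch of what a Python peer satisfying this contract might look like, using fib_iter as the example. The REPEATS value, the helper names, and the literal N = 30 (standing in for the rendered {{ .N }} parameter) are assumptions for illustration; the actual files under bench/template/math/ may be structured differently.

```python
import json
import time

N = 30          # stands in for the rendered {{ .N }} template parameter
REPEATS = 1000  # hypothetical repeat count; the real suite may use another


def fib_iter(n: int) -> int:
    """Iterative Fibonacci: tight numeric loop, integer add."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a


def run(n: int, repeats: int = REPEATS) -> dict:
    """Time only the inner repeats; JSON encoding and printing stay outside."""
    start = time.perf_counter_ns()
    result = 0
    for _ in range(repeats):
        result = fib_iter(n)
    elapsed_us = (time.perf_counter_ns() - start) // 1000
    return {"duration_us": elapsed_us, "output": result}


if __name__ == "__main__":
    # Exactly one line of JSON on stdout, as the contract requires.
    print(json.dumps(run(N)))
```

Keeping the clock reads immediately around the repeat loop is what keeps interpreter start-up, JSON encoding, and the final print out of the timed window.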

Program	What it stresses
`fact_rec`	recursive call, integer multiply
`fib_iter`	tight numeric loop, integer add
`fib_rec`	deep recursion, return-heavy call frames
`mul_loop`	counted loop with integer accumulator
`sum_loop`	counted loop with integer accumulator
`prime_count`	nested loop with branch and modulo
`matrix_mul`	nested loop over allocated 2D lists

Workload fidelity is enforced by output match: a run that produces a different output field from Mochi for the same N is a workload bug in one of the three implementations, not a perf result. The runner asserts equality of output across the three columns before reporting timing.
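
The output-equality check could be sketched as follows. This is an illustration in Python rather than the runner's actual Go code; the column names ("mochi", "python", "lua") and the function name are hypothetical.

```python
import json


def check_outputs(lines: dict[str, str], reference: str = "mochi") -> dict[str, int]:
    """Parse one JSON line per column and return timings only if outputs agree.

    `lines` maps a column name to the single JSON line that program printed.
    A mismatch is reported as a workload bug, never as a timing result.
    """
    parsed = {name: json.loads(line) for name, line in lines.items()}
    expected = parsed[reference]["output"]
    for name, record in parsed.items():
        if record["output"] != expected:
            raise ValueError(
                f"workload bug: {name} output {record['output']!r} "
                f"!= {reference} output {expected!r}"
            )
    return {name: record["duration_us"] for name, record in parsed.items()}
```

Because timings are returned only after every column agrees with the reference, a program that computes less cannot appear in the table as a faster result.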

Runner integration

The existing bench/runner.go registers two new template kinds:

native_py — points at <dir>/<name>.py, renders {{ .N }}, executes with python3
native_lua — points at <dir>/<name>.lua, renders {{ .N }}, executes with lua

These are siblings of the existing mochi_py and mochi_ts rows, which transpile Mochi to those languages and remain in place for the transpiler-focused story. The two perspectives report side-by-side: the transpiled row says "how well does our codegen for language X do", and the native row says "how well does language X do on this problem".
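
The render-then-execute flow of a native row can be illustrated with a short Python sketch. The real runner is Go and presumably renders {{ .N }} through its own template machinery; the plain string replacement, the function names, and the temporary-file layout below are simplifying assumptions.

```python
import json
import subprocess
import tempfile
from pathlib import Path


def render(template_text: str, n: int) -> str:
    """Substitute the {{ .N }} parameter (simplified: exact-match replace)."""
    return template_text.replace("{{ .N }}", str(n))


def run_native(template_text: str, n: int, interpreter: str, suffix: str) -> dict:
    """Render a peer program, run it with `interpreter`, parse its JSON line.

    `check=True` makes a non-zero exit raise instead of being skipped.
    """
    source = render(template_text, n)
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / f"bench{suffix}"
        path.write_text(source)
        proc = subprocess.run(
            [interpreter, str(path)],
            capture_output=True,
            text=True,
            check=True,
        )
    # The contract is one JSON line on stdout; take the last non-empty line.
    last_line = proc.stdout.strip().splitlines()[-1]
    return json.loads(last_line)
```

Under these assumptions a native_py row would amount to run_native(text, n, "python3", ".py") and a native_lua row to run_native(text, n, "lua", ".lua").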

Reporting cadence

This is a periodic baseline, not a per-merge gate. The expected cadence:

One full run per minor release, attached to the release notes table.
One full run when an optimization MEP (18 through 22) lands a tier-2 milestone, attached to that PR for context.
Ad-hoc runs by anyone, anytime; the harness is reproducible by design.

Geometric mean is reported alongside the per-program numbers so the headline number is one ratio per peer language ("Mochi VM currently takes 15.7x as long as CPython and 49.7x as long as Lua across the math suite"). Per-program numbers prevent that headline from hiding a pathology in one workload.

Reproducibility

The harness inherits MEP 17's reproducibility rules: pinned MOCHI_NOW_SEED, MOCHI_BENCH=1, CPU model and OS recorded in the run header. Cross-language runs additionally record:

python3 --version
lua -v
go version (for the Mochi VM build)
the binary path of the Mochi runtime under test (release tag or commit SHA)

Cross-machine comparison is not meaningful and the runner refuses to merge results from different hosts into a single table.
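
A sketch of how a run header might be collected and how a merge across hosts could be refused. The header field names, the use of the hostname as the host identity, and the function names are assumptions; MEP 17's actual header format is not reproduced here.

```python
import platform
import shutil
import subprocess
from typing import Optional


def tool_version(argv: list[str]) -> Optional[str]:
    """Return the version banner of a tool, or None if it is not installed."""
    if shutil.which(argv[0]) is None:
        return None
    proc = subprocess.run(argv, capture_output=True, text=True)
    # Some tools print their banner on stderr, so fall back to it.
    return (proc.stdout or proc.stderr).strip() or None


def run_header(mochi_ref: str) -> dict:
    """Collect the facts a cross-language run records (field names hypothetical)."""
    return {
        "host": platform.node(),
        "cpu": platform.processor() or platform.machine(),
        "os": f"{platform.system()} {platform.release()}",
        "python": tool_version(["python3", "--version"]),
        "lua": tool_version(["lua", "-v"]),
        "go": tool_version(["go", "version"]),
        "mochi": mochi_ref,  # release tag or commit SHA of the runtime under test
    }


def merge_runs(runs: list[dict]) -> list[dict]:
    """Concatenate result rows, refusing to mix hosts in a single table."""
    hosts = {run["header"]["host"] for run in runs}
    if len(hosts) > 1:
        raise ValueError(f"refusing to merge results from different hosts: {sorted(hosts)}")
    return [row for run in runs for row in run["rows"]]
```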

Failure mode

A baseline that drifts is worse than no baseline. The runner treats these as hard failures, not soft warnings:

A native program that fails to parse or exits non-zero.
An output field that disagrees with Mochi's output for the same N.
A missing peer file (<name>.py or <name>.lua not present where <name>.mochi is).

The first failure ends the run with a non-zero exit. Silent skipping is what produced the "we used to be faster than Lua" problem for languages of the LuaJIT-wars era; the project does not repeat that.
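
The missing-peer rule, for example, could be checked before any timing starts. The sketch below is in Python for illustration only; the function names and the directory walk are assumptions about how such a check might be written, not the runner's code.

```python
import sys
from pathlib import Path
from typing import Optional

PEER_SUFFIXES = (".py", ".lua")


def first_missing_peer(root: Path) -> Optional[Path]:
    """Return the first peer file absent next to a .mochi program, else None."""
    for mochi_file in sorted(root.rglob("*.mochi")):
        for suffix in PEER_SUFFIXES:
            peer = mochi_file.with_suffix(suffix)
            if not peer.exists():
                return peer
    return None


def main(root: str) -> int:
    """Exit status for the check: non-zero on the first failure, no skipping."""
    missing = first_missing_peer(Path(root))
    if missing is not None:
        print(f"missing peer file: {missing}", file=sys.stderr)
        return 1
    return 0
```

A caller would pass the suite directory, for instance sys.exit(main("bench/template/math")), so that a missing file stops the run instead of shrinking the table.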

Initial baseline

Recorded on the author's machine (Apple Silicon, macOS Darwin 24.6.0). Numbers are wall-clock microseconds for the inner repeat loop only. Lower is better. Mochi is run with mochi run (current main, VM backend). Python is CPython 3.14.5. Lua is 5.5.0.

Program	N	Mochi VM (µs)	CPython (µs)	Lua (µs)	Mochi / CPython	Mochi / Lua
`fact_rec`	15	8,350	475	177	17.6x	47.2x
`fib_iter`	30	7,291	445	191	16.4x	38.2x
`fib_rec`	25	116,925	5,896	2,695	19.8x	43.4x
`mul_loop`	20	4,866	423	79	11.5x	61.6x
`sum_loop`	10,000	2,121,548	142,502	31,316	14.9x	67.7x
`prime_count`	300	213,004	13,966	4,612	15.3x	46.2x
`matrix_mul`	20	— (broken)	3,570	1,142	—	—

Geometric mean across the six working programs: Mochi is 15.7x CPython and 49.7x Lua. matrix_mul is excluded from the headline because the current Mochi runtime produces output: null for it; that is a separate correctness issue tracked outside this MEP, but the program stays in the suite so it shows up green the day it is fixed.
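
The headline ratios can be recomputed from the table with a few lines of Python. The timings below are copied from the six working rows; the variable and function names are illustrative.

```python
import math

# (program, Mochi VM µs, CPython µs, Lua µs) for the six working programs.
BASELINE = [
    ("fact_rec", 8_350, 475, 177),
    ("fib_iter", 7_291, 445, 191),
    ("fib_rec", 116_925, 5_896, 2_695),
    ("mul_loop", 4_866, 423, 79),
    ("sum_loop", 2_121_548, 142_502, 31_316),
    ("prime_count", 213_004, 13_966, 4_612),
]


def geometric_mean(ratios: list[float]) -> float:
    """n-th root of the product, computed as exp of the mean of the logs."""
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


vs_cpython = geometric_mean([mochi / py for _, mochi, py, _ in BASELINE])
vs_lua = geometric_mean([mochi / lua for _, mochi, _, lua in BASELINE])

print(f"Mochi / CPython: {vs_cpython:.1f}x")  # 15.7x
print(f"Mochi / Lua: {vs_lua:.1f}x")          # 49.7x
```

The geometric mean is used rather than the arithmetic mean so that no single program's ratio dominates the headline and so that the result does not depend on which runtime is chosen as the denominator.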

The numbers above are the floor on the day MEP 23 was registered. Every subsequent optimization MEP that lands inherits an obligation to move them.

Rationale

Why a new MEP, not an edit to MEP 17. MEP 17 is a process gate. Folding cross-language reporting into it would either weaken the per-PR rule (peer runtime variance is too high to gate on) or strengthen the cross-language rule into a gate (an over-commitment given how many factors outside the VM author's control move those numbers). Separate documents, separate cadences, no confusion.

Why Python and Lua specifically. They are the two most cited points of reference for small dynamic runtimes. CPython is the floor any non-JIT register-machine interpreter should aspire to clear; Lua 5.x is the ceiling for a pure-C single-pass interpreter without a JIT. Sitting between them is a reasonable target for the next 18 months. JIT-class peers (LuaJIT, PyPy, V8) become relevant when MEP 22 ships.

Why hand-written, not transpiled. A transpiler benchmark measures the transpiler. We have one of those (the existing mochi_py, mochi_ts rows in the runner) and it answers a different question. Conflating the two has historically misled language-launch announcements; we are not doing that.

Why output-equality enforcement. The cheapest way to look fast is to compute less. Output equality catches that automatically.

Backwards Compatibility

None. This is process-only and additive infrastructure. Existing bench/runner.go rows (transpiled mochi_py, mochi_ts, mochi_go, mochi_c) are unchanged.

The hand-written .py files that previously sat next to the .mochi templates and were unused by the runner are now used. Their duration_ms key was renamed to duration_us to match the runner's existing Result.DurationUs field; if anyone was reading the old key by hand, they will notice.

Reference Implementation

Runner change: bench/runner.go adds native_py and native_lua template registration in Benchmarks and matching display labels in report / exportMarkdown.
New Lua programs: bench/template/math/<name>/<name>.lua for all seven math programs.
Updated Python programs: bench/template/math/<name>/<name>.py now emit duration_us.

Open Questions

Other suites. The bench/template/join/ directory has nine join workloads that today are not wired into the runner. They are a stronger test of the query algebra than the math suite is of pure compute. Adding native peer implementations for them is the next slice of this MEP, tracked separately so the math baseline can land.
JIT peers. PyPy and LuaJIT belong in a follow-up table once MEP 22 ships, so the comparison happens between like categories (interpreter-vs-interpreter, JIT-vs-JIT).
CI integration. This is local-developer infrastructure today. A future MEP may wire a cross-language run into a nightly CI job on a fixed instance, with the JSON archived for trend analysis. That is not in scope here.

References

MEP 17: VM Performance Methodology and Baseline. The internal Mochi-vs-Mochi gate this MEP is paired with.
[Wren-perf]: Bob Nystrom, Wren Performance. https://wren.io/performance.html. The cross-language table format this MEP follows.
[LuaPerformance]: Roberto Ierusalimschy, Lua Performance Tips. https://www.lua.org/gems/sample.pdf. Background on why Lua sits where it does on small-interpreter benchmarks.

Copyright

This document is placed in the public domain.
