# MEP 17. VM Performance Methodology and Baseline
| Field | Value |
|---|---|
| MEP | 17 |
| Title | VM Performance Methodology and Baseline |
| Author | Mochi core |
| Status | Draft |
| Type | Process |
| Created | 2026-05-16 |
## Abstract
Every optimization MEP that follows (dispatch, inline caches, value representation, JIT) is judged against numbers, not against intuitions. This MEP establishes the harness, the reference suite, the reporting format, and the regression budget that those changes must clear before they land. It is the only VM MEP that ships no runtime code; it ships measurement infrastructure.
The contract is small: every VM change that touches the dispatch loop, the value representation, the GC interaction, or the bytecode encoding must report numbers from this suite in its PR description. A change that does not improve the workload it targets, or that regresses an unrelated workload by more than the budget below, does not merge.
## Motivation
The Mochi VM (`runtime/vm/`) is today a register machine with a 107-opcode set, a single switch dispatch loop, a tagged-union `Value` struct, constant folding, dead-code elimination via liveness, and specialized numeric opcodes (`OpAddInt`, `OpAddFloat`). It is correct, small, and has no inline caches, no adaptive specialization, no JIT, and no benchmark suite that gates merges. The pattern that fails everywhere in interpreter history is the same: a smart change lands without a baseline, a year later someone notices the VM is 30% slower than it was, and nobody can bisect because there were no numbers to bisect against.
Bob Nystrom's Wren performance page ([Wren-perf]) demonstrates the alternative: a small, honest, public benchmark set with reproducible runs and a comparison table that explains every column. Wren's own claim ("close to Lua") is credible because the harness exists in the repo, not because of marketing. LuaJIT publishes its NYI list and its trace stats for the same reason. CPython's PEP 659 ([PEP-659]) and the V8 Speedometer numbers ([Sparkplug]) are likewise reported against fixed suites with stable reporting. The cheapest way to never improve a VM is to never measure it.
There is also a correctness argument. The Mochi conformance suite (golden files under `tests/vm/valid/` and `tests/rosetta/x/Mochi/`) pins behavior but not performance. A 100x slowdown passes the golden tests fine; the user notices. Performance regressions are bugs, and bugs are caught by tests.
## Specification

### Bench harness
A new package `runtime/vm/bench` hosts the harness. Each benchmark is a Go `testing.B` driver that loads a `.mochi` program from `bench/programs/<name>.mochi`, compiles it once, and then runs the VM `b.N` times against the compiled program. The compile step is excluded from timing because compile cost is the subject of MEP-0022, not the interpreter MEPs.
```go
func BenchmarkVM_Fib(b *testing.B) { runBench(b, "fib.mochi") }
```
`runBench` calls `b.ResetTimer()` after compilation, runs the program with a fixed input, and times `b.N` iterations of the full `vm.Run` cycle. It also calls `b.ReportAllocs()` so the allocation count and bytes-allocated-per-op show up next to `ns/op`.
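A minimal sketch of `runBench`, assuming a `vm.Compile` / `vm.Run` API shape; the exact entry points are whatever `runtime/vm` actually exports, and the import path below is illustrative:

```go
package bench

import (
	"os"
	"path/filepath"
	"testing"

	"mochi/runtime/vm" // hypothetical import path for the VM package
)

// runBench loads one benchmark program, compiles it once, and then
// times b.N full interpreter runs. vm.Compile and vm.Run are
// illustrative names; substitute the real runtime/vm entry points.
func runBench(b *testing.B, name string) {
	b.Helper()
	src, err := os.ReadFile(filepath.Join("programs", name))
	if err != nil {
		b.Fatal(err)
	}
	prog, err := vm.Compile(string(src)) // compile cost stays outside the timed region
	if err != nil {
		b.Fatal(err)
	}
	b.ReportAllocs()
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		if _, err := vm.Run(prog); err != nil {
			b.Fatal(err)
		}
	}
}
```

The compile/run split in the sketch is what enforces the exclusion stated above: one-time compile cost never appears in the `ns/op` column.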
### Reference suite
The reference suite is small on purpose: ten programs that together cover the language surface and stress the parts of the VM that optimizations target. Each program lives in `runtime/vm/bench/programs/`.
| Name | Stresses | Source |
|---|---|---|
| `fib.mochi` | recursive call, integer arithmetic, return | classic |
| `iter_sum.mochi` | tight numeric loop, `for x in 1..N`, `OpAddInt` | classic |
| `string_cat.mochi` | string concatenation, `OpStr`, allocation pressure | classic |
| `map_get.mochi` | map lookup in a loop, `OpIndex` on map | classic |
| `list_build.mochi` | append in a loop, list growth, `OpAppend` | classic |
| `struct_field.mochi` | struct field read/write in a loop | original |
| `hof_map.mochi` | higher-order map, closure invocation, `OpCall` | original |
| `query_select.mochi` | `from ... where ... select ...` over a list | from MEP-14 fixtures |
| `agent_emit.mochi` | agent stream emit and intent dispatch | from MEP examples |
| `json_round.mochi` | JSON parse + serialize round trip on a 1 MB fixture | original |
The suite is reviewed once per minor release. Programs are added when a new language feature lands and removed only when the feature is removed. The names are stable; downstream changelogs reference them.
### Reporting format
Every PR that claims a perf change must include a fenced block in the description that pastes the `go test -bench` output for the changed benchmarks, both before and after, on the same machine. The format:

```
benchstat before.txt after.txt
```
The benchstat tool ([benchstat]) is the canonical comparison. It computes the geometric mean delta, the per-benchmark delta, and a statistical significance marker. PRs that quote raw ns/op without benchstat are asked to re-run.
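A typical way to produce the two input files, assuming the package path above (`-count 10` gives benchstat enough samples to mark statistical significance):

```sh
go test ./runtime/vm/bench -bench . -count 10 > before.txt
# apply the VM change, rebuild, then:
go test ./runtime/vm/bench -bench . -count 10 > after.txt
```

Then compare with the benchstat command above.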
### Reproducibility
The harness pins randomness and time. `MOCHI_NOW_SEED` is already wired through the VM (`vm_eval.go:806`); the harness sets it. `MOCHI_BENCH=1` is also set, and the runtime treats that as a hint to disable any wall-clock-derived heuristics (none today, but reserve the flag now).
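A minimal way to pin both variables for every run in the package, as a sketch (the seed value is arbitrary; any fixed value works):

```go
package bench

import (
	"os"
	"testing"
)

// TestMain pins the environment before any benchmark runs, so every
// run in this package sees the same seed and the bench hint.
func TestMain(m *testing.M) {
	os.Setenv("MOCHI_NOW_SEED", "42")
	os.Setenv("MOCHI_BENCH", "1")
	os.Exit(m.Run())
}
```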
The harness reports the Go version, the CPU model (from `/proc/cpuinfo`, or `sysctl -n machdep.cpu.brand_string` on macOS), and the OS and kernel version. PRs reporting numbers from different machines must say so; benchstat across machines is meaningless.
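A best-effort sketch of collecting that context; `reportEnv` is a hypothetical helper name, and the detection paths are the ones named above:

```go
package bench

import (
	"os"
	"os/exec"
	"runtime"
	"strings"
	"testing"
)

// reportEnv logs toolchain and host details so pasted numbers carry
// their context. CPU model detection is best-effort and per-platform.
func reportEnv(b *testing.B) {
	b.Logf("go=%s os=%s arch=%s", runtime.Version(), runtime.GOOS, runtime.GOARCH)
	if raw, err := os.ReadFile("/proc/cpuinfo"); err == nil { // Linux
		for _, line := range strings.Split(string(raw), "\n") {
			if strings.HasPrefix(line, "model name") {
				b.Log(strings.TrimSpace(line))
				return
			}
		}
	}
	if out, err := exec.Command("sysctl", "-n", "machdep.cpu.brand_string").Output(); err == nil { // macOS
		b.Log(strings.TrimSpace(string(out)))
	}
}
```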
### Regression budget
The budget protects against silent decay across many small changes.
- A PR that targets benchmark X may regress benchmark X by 0%. Improvements are reported with `benchstat`; regressions on the target are blockers.
- A PR may regress unrelated benchmarks by up to 3%, provided the geometric-mean delta across the full suite is non-positive. Anything beyond 3% on a single benchmark needs explicit justification in the PR description and reviewer sign-off.
- A PR may regress the geometric mean of the suite by 0%. A perf change that makes the suite slower on average does not merge (a sketch of the geometric-mean arithmetic follows below).
- Memory: a 5% regression in `B/op` or `allocs/op` triggers the same review gate as a 3% time regression.
These numbers are intentionally generous in the early phase (no JIT, no ICs); they tighten as the floor rises. A change that improves the geometric mean by 10% can spend that improvement elsewhere over the next quarter without re-clearing the budget.
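For reviewers who want to sanity-check a benchstat report against the budget, the geometric mean of per-benchmark time ratios (after/before, so 1.03 is a 3% regression) is simple arithmetic; a self-contained sketch with hypothetical deltas:

```go
package main

import (
	"fmt"
	"math"
)

// geomean returns the geometric mean of per-benchmark time ratios
// (after/before). A result above 1.0 means the suite got slower on
// average; the budget requires <= 1.0.
func geomean(ratios []float64) float64 {
	sum := 0.0
	for _, r := range ratios {
		sum += math.Log(r)
	}
	return math.Exp(sum / float64(len(ratios)))
}

func main() {
	// Hypothetical deltas: one targeted 12% win, two small regressions
	// inside the 3% per-benchmark budget, the rest unchanged.
	ratios := []float64{0.88, 1.02, 1.03, 1.00, 1.00, 1.00, 1.00, 1.00, 1.00, 1.00}
	fmt.Printf("geomean ratio: %.4f\n", geomean(ratios)) // ~0.9922: net win, within budget
}
```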
### Conformance interaction
The bench programs are also conformance fixtures. Each has a checked-in expected output under `bench/expected/<name>.out`. A bench run that produces wrong output fails the test, not just the benchmark. This means the performance suite double-duties as a correctness sub-suite, and a regression that breaks the program shows up before the timing numbers do.
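The check can live in the harness itself; a sketch, using the path layout from this MEP and a hypothetical `checkOutput` helper:

```go
package bench

import (
	"bytes"
	"os"
	"path/filepath"
	"strings"
	"testing"
)

// checkOutput compares one run's output to the checked-in golden file,
// so a wrong-answer run fails as a test failure, not a timing anomaly.
func checkOutput(b *testing.B, name string, got []byte) {
	b.Helper()
	base := strings.TrimSuffix(name, ".mochi")
	want, err := os.ReadFile(filepath.Join("expected", base+".out"))
	if err != nil {
		b.Fatal(err)
	}
	if !bytes.Equal(got, want) {
		b.Fatalf("%s: output does not match expected/%s.out", name, base)
	}
}
```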
### Tracking
A new directory `runtime/vm/bench/history/` records every release's benchstat-formatted baseline. The release script appends one file per minor version. The README at `runtime/vm/bench/README.md` reads from that directory to show a perf history table.
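The directory layout this implies (version file names are illustrative):

```
runtime/vm/bench/
  programs/        # the ten .mochi sources
  expected/        # golden outputs, one .out per program
  history/
    v0.1.0.txt     # benchstat-formatted baseline, one per minor release
    v0.2.0.txt
  README.md        # renders the perf history table
```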
## Implementation notes
The smallest viable shipment of this MEP is the harness, the ten programs, the expected outputs, and the README. No optimizer change ships in the same PR. Subsequent MEPs (0018 onward) cite this harness in their PR descriptions.
Implementation cost is one engineer-week. There is no algorithmic content; the value is the discipline.
## Open questions
- Cross-host comparison. A future MEP may add a `vm-arena` CI job that runs the bench suite on a fixed cloud instance after each merge and uploads the JSON to a tracked dashboard. That is not in this MEP; the goal here is the local-developer harness.
- Microbenchmarks vs. program-level. The ten programs are all program-level. Microbenchmarks for individual opcodes (`BenchmarkOpAddInt`) are useful for the dispatch MEP and can be added there. They are not part of the headline geometric mean because per-opcode numbers can mislead about whole-program impact.
- Comparative numbers. Wren publishes numbers vs. Lua, Python, and Ruby. Mochi is unique enough that comparative numbers are mostly noise; the harness reports Mochi-vs-Mochi only.
## References
- [Wren-perf] Bob Nystrom, Wren Performance. https://wren.io/performance.html
- [PEP-659] Mark Shannon, PEP 659: Specializing Adaptive Interpreter. https://peps.python.org/pep-0659/
- [Sparkplug] Leszek Swirski, Sparkplug: a non-optimizing JavaScript compiler. https://v8.dev/blog/sparkplug
- [benchstat] Russ Cox, benchstat. https://pkg.go.dev/golang.org/x/perf/cmd/benchstat
- [Ertl-Gregg] Anton Ertl and David Gregg, The Structure and Performance of Efficient Interpreters. Journal of Instruction-Level Parallelism 5, 2003.