MEP 21. VM Conformance and Differential Testing
| Field | Value |
|---|---|
| MEP | 21 |
| Title | VM Conformance and Differential Testing |
| Author | Mochi core |
| Status | Draft |
| Type | Standards Track |
| Created | 2026-05-16 |
Abstract
MEPs 18 through 22 propose changes to dispatch, specialization, value representation, and (eventually) code generation. Each change is a chance to introduce a correctness regression that the existing golden suite misses because the regression only fires on inputs nobody wrote a fixture for. This MEP adds three correctness layers that close the gap without requiring a fixture for every bug: differential execution between the interpreter and the constant folder, property-based testing for laws the VM must obey, and Go's native testing.F fuzzer for the bytecode compiler.
These layers are independent of any of the perf MEPs; they harden the VM regardless of which optimization lands next. They also enable future MEPs (notably the copy-and-patch baseline JIT in MEP-22) to be tested against an oracle: the interpreter is the reference, the JIT is the candidate, and a differential harness compares outputs over a corpus.
Motivation
The Mochi conformance suite today is two things: golden files (tests/vm/valid/*.mochi + .out pairs) and the Rosetta task corpus (tests/rosetta/x/Mochi/). Both are example-based: a fixture passes if its output matches the recorded .out. This catches the bugs someone wrote a fixture for. It misses the rest.
The patterns this MEP closes are well-known:
- Optimizer-introduced divergence. A constant-folding pass that mis-evaluates 1e308 * 10 produces a wrong value while the runtime correctly evaluates it to +Inf. The golden test passes if 1e308 * 10 is never in a fixture. The differential harness catches it because it asks: "is the folded answer equal to the unfolded answer for every expression in the corpus?"
- Quickened opcode bugs. MEP-19 introduces OpAdd_II. If the deopt path is wrong on overflow, the bug shows up only on overflowing inputs, which goldens may not exercise. A property test on OpAdd (associative-modulo-overflow, commutative) catches it.
- Bytecode compiler crashes. A malformed input that the parser accepts but the compiler panics on. The Go fuzzer can find these in minutes given a seed corpus.
Layers 1 through 3 below each correspond to one of these patterns, in the same order.
Specification
Layer 1: differential execution (folder vs interpreter)
The Mochi compiler does constant folding in runtime/vm/constfold.go. The interpreter evaluates the same expressions at runtime. They must agree.
A new test harness, runtime/vm/diff/, walks every expression in every program in the bench suite (MEP-17), the golden suite, and the Rosetta corpus. For each pure expression (no ! io, no ! fs, etc., per MEP-15 effect inference) it:
- Evaluates the expression at compile time via constfold.Eval.
- Compiles the expression to bytecode with folding disabled and evaluates at runtime via the interpreter.
- Asserts the two values are equal, where equal follows Mochi's structural equality.
A divergence is a P1 bug. The test fails the CI run.
Only pure expressions are in scope because impure expressions can observe or change external state (now(), fetch(...)), so two evaluations of the same expression need not agree. MEP-15's Effects.IsEmpty() is the gate: if the expression is pure, the two evaluators must agree.
This catches the entire class of "the optimizer rewrites this to a value the interpreter does not produce" bugs. It is a one-time investment with permanent value; every optimization MEP gets free oracle testing.
Layer 2: property-based testing
A new package runtime/vm/prop/ defines properties that the VM must obey across every input. Properties express laws, not examples. Implementation uses [gopter] or hand-written quickcheck-style generators (gopter is preferred for breadth).
The starter law set:
- Arithmetic associativity over int: (a + b) + c == a + (b + c) for all int a, b, c. Modulo overflow, which Mochi defines per MEP (TODO: cite the right MEP) as wrapping; the property is on the wrapped result.
- Arithmetic commutativity over int and float: a + b == b + a. Caveat for float: NaN. The property excludes NaN inputs.
- Equality reflexivity: x == x for every value not containing NaN.
- len of append: len(append(xs, y)) == len(xs) + 1.
- Map round-trip: m[k] = v; m[k] == v for any key k and value v.
- Option unwrap-after-guard: for any Option<T>, if o != none then o! returns the wrapped value (MEP-16).
- Pure expression idempotency: for any pure expression e, e == e evaluated twice.
- Effect-set monotonicity (MEP-15): for any function f, inferredEffects(f) ⊆ declaredEffects(f) whenever a declaration is present.
Each property runs a few thousand random inputs by default; CI runs more (configurable via MOCHI_PROP_N). A failed property prints the smallest counterexample (shrinking).
Properties are not fixtures. They do not pin a specific output; they pin a relationship. This is the right tool for the optimizer-vs-interpreter agreement question and for any invariant the type system promises.
Layer 3: bytecode-compiler fuzzing
Go's native testing.F runs fuzzers in CI ([Go-Fuzz]). A new file runtime/vm/fuzz_test.go defines:
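    func FuzzCompile(f *testing.F) {
        // Seed the fuzzer with the existing golden inputs.
        seedFromCorpus(f, "tests/vm/valid")
        f.Fuzz(func(t *testing.T, src string) {
            prog, err := parser.ParseString(src)
            if err != nil {
                return // inputs the parser rejects are out of scope
            }
            // No recover() here: a panic in vm.Compile must propagate so the
            // fuzzer records it as a failure and minimizes the input.
            _, _ = vm.Compile(prog, nil)
        })
    }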
The seed corpus is the existing golden inputs (paths walked at fuzz init). The fuzzer mutates them. Any panic in vm.Compile is a bug; the fuzzer minimizes and reports.
A second fuzzer, FuzzRun, takes the compiled program and runs it under a CPU/memory budget. The harness uses runtime.SetFinalizer and a hard wall-clock cap (5s) to detect non-termination. Any panic, infinite loop, or memory blowup is a bug.
CI runs the fuzzers for a fixed time budget per PR (default 60s each). Nightly runs extend to 30 minutes. Found crashers are checked into runtime/vm/fuzz_corpus/ so they become permanent regression tests.
Layer 4: oracle harness for MEP-22
The copy-and-patch JIT in MEP-22 will need a differential harness against the interpreter. This MEP defines that harness ahead of time so MEP-22 ships behind it.
The shape: for every program in the test corpus, run under the interpreter and under the JIT, compare outputs. Any divergence is a P1 bug. The interpreter is canonical.
The harness is gated by a build tag (-tags jit) and is a no-op until MEP-22 ships. Its skeleton lands in this MEP so MEP-22 has nothing to design about correctness, only to implement.
Coverage measurement
A new CI step runs go test -cover ./runtime/vm/... and reports the line coverage delta per PR. The goal is not a number to chase; it is to make uncovered handlers visible. A new opcode that lands without a test prints as uncovered and is rejected at review.
Status at a glance
| Item | Status |
|---|---|
| Differential exec: folder vs interpreter on pure expressions | proposed |
| Property suite: arithmetic, equality, len, map, option laws | proposed |
| Property suite: shrinkable counterexamples | proposed |
| Fuzzer: FuzzCompile over parser corpus | proposed |
| Fuzzer: FuzzRun with wall-clock and memory budget | proposed |
| Crasher corpus checked in under runtime/vm/fuzz_corpus/ | proposed |
| Coverage delta reported in CI | proposed |
| Oracle harness skeleton for MEP-22 | proposed |
Risks
- Flaky properties on float arithmetic. Floating-point laws fail in edge cases (NaN, signed zero, denormals). The property generators must exclude these inputs explicitly; failures here are usually generator bugs, not VM bugs.
- Fuzzer noise. Fuzzers find inputs that the parser accepts but semantic checking rejects. The fuzzer counts a panic as a bug; it does not count a type error as a bug. The harness treats a type error as an expected outcome and moves on.
- CI time. A 60s fuzz budget per PR is cheap. A 30-minute nightly run is fine. If the property suite grows beyond ~5 minutes total, split into nightly.
- False oracles. The differential harness assumes the interpreter is correct. A bug that the interpreter and the folder share is invisible to it. The property suite narrows this gap because it checks each evaluator against a law rather than against the other evaluator.
Non-goals
- No formal verification. No proof of VM correctness against a denotational semantics. The combination of differential testing, properties, and fuzzing is enough to catch regressions; full verification is decades of work and not on the table.
- No metamorphic testing. A future MEP may add metamorphic relations (e.g. "adding var _ = 1 to any program should not change its output") but the three layers here are sufficient for the perf MEPs that motivate this work.
Implementation notes
Layer 1 (differential exec) is the highest-value, lowest-cost item: a few hundred lines of Go that catch the entire optimizer-divergence class. Land it first.
Layer 2 (properties) is the largest open-ended investment. Start with the eight laws above; add more as bugs are found. The discipline is that every fixed VM bug ships with the property that would have caught it.
Layer 3 (fuzzing) is one PR; the seed corpus is what makes it effective. The corpus is the union of tests/vm/valid/, tests/rosetta/x/Mochi/, and the bench suite from MEP-17. Re-seed nightly from the latest corpus.
Layer 4 (JIT oracle skeleton) is plumbing only; the real value materializes when MEP-22 ships.
References
- [Go-Fuzz] Tutorial: Getting started with fuzzing in Go. https://go.dev/doc/tutorial/fuzz
- [gopter] gopter, a property-based testing library for Go. https://github.com/leanovate/gopter
- [Claessen-QC] Koen Claessen, John Hughes, QuickCheck: A Lightweight Tool for Random Testing of Haskell Programs. ICFP 2000.
- [Csmith] Xuejun Yang et al., Finding and Understanding Bugs in C Compilers. PLDI 2011 (differential testing as bug-finding method).
- [Yarpgen] Yarpgen: a random C/C++ program generator. https://github.com/intel/yarpgen
- [PEP-659] Mark Shannon, PEP 659. https://peps.python.org/pep-0659/ (the deopt model and why it must be testable)