
MEP 21. VM Conformance and Differential Testing

| Field | Value |
| --- | --- |
| MEP | 21 |
| Title | VM Conformance and Differential Testing |
| Author | Mochi core |
| Status | Draft |
| Type | Standards Track |
| Created | 2026-05-16 |

Abstract

MEPs 18 through 22 propose changes to dispatch, specialization, value representation, and (eventually) code generation. Each change is a chance to introduce a correctness regression that the existing golden suite misses because the regression only fires on inputs nobody wrote a fixture for. This MEP adds three correctness layers that close the gap without requiring a fixture for every bug: differential execution between the interpreter and the constant folder, property-based testing for laws the VM must obey, and Go's native testing.F fuzzer for the bytecode compiler.

These layers are independent of any of the perf MEPs; they harden the VM regardless of which optimization lands next. They also enable future MEPs (notably the copy-and-patch baseline JIT in MEP-22) to be tested against an oracle: the interpreter is the reference, the JIT is the candidate, and a differential harness compares outputs over a corpus.

Motivation

The Mochi conformance suite today is two things: golden files (tests/vm/valid/*.mochi + .out pairs) and the Rosetta task corpus (tests/rosetta/x/Mochi/). Both are example-based: a fixture passes if its output matches the recorded .out. This catches the bugs someone wrote a fixture for. It misses the rest.

The patterns this MEP closes are well-known:

  1. Optimizer-introduced divergence. A constant-folding pass that mis-evaluates 1e308 * 10 produces some value other than +Inf, while the runtime correctly evaluates it to +Inf. The golden suite still passes as long as 1e308 * 10 never appears in a fixture. The differential harness catches it because it asks: "is the folded answer equal to the unfolded answer for every expression in the corpus?"
  2. Quickened opcode bugs. MEP-19 introduces OpAdd_II. If the deopt path is wrong on overflow, the bug shows up only on overflowing inputs, which goldens may not exercise. A property test on OpAdd (associative-modulo-overflow, commutative) catches it.
  3. Bytecode compiler crashes. A malformed input can be accepted by the parser and still make the compiler panic. Given a seed corpus, the Go fuzzer can find such inputs in minutes.

Each layer below corresponds to one of these.

Specification

Layer 1: differential execution (folder vs interpreter)

The Mochi compiler does constant folding in runtime/vm/constfold.go. The interpreter evaluates the same expressions at runtime. They must agree.

A new test harness, runtime/vm/diff/, walks every expression in every program in the bench suite (MEP-17), the golden suite, and the Rosetta corpus. For each pure expression (no ! io, no ! fs, etc., per MEP-15 effect inference) it:

  1. Evaluates the expression at compile time via constfold.Eval.
  2. Compiles the expression to bytecode with folding disabled and evaluates at runtime via the interpreter.
  3. Asserts the two values are equal, where equal follows Mochi's structural equality.

A divergence is a P1 bug. The test fails the CI run.
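
The three steps above can be sketched on a toy expression language. This is a minimal illustration, not the Mochi harness: `Expr`, `foldEval`, `compile`, `interpret`, and `sameValue` are hypothetical stand-ins for the AST, `constfold.Eval`, the bytecode compiler, and the interpreter, and treating NaN as equal to NaN is an assumption about what "structural equality" should mean for this comparison.

```go
package main

import (
	"fmt"
	"math"
)

// Expr is a toy pure expression: a literal, or a binary '+' / '*' node.
// It stands in for a Mochi AST node (hypothetical; the real type differs).
type Expr struct {
	Op          byte // 'n' = literal, '+' = add, '*' = multiply
	Val         float64
	Left, Right *Expr
}

func Lit(v float64) *Expr           { return &Expr{Op: 'n', Val: v} }
func Bin(op byte, l, r *Expr) *Expr { return &Expr{Op: op, Left: l, Right: r} }

// foldEval plays the role of the compile-time folder (step 1):
// it evaluates the tree directly.
func foldEval(e *Expr) float64 {
	switch e.Op {
	case '+':
		return foldEval(e.Left) + foldEval(e.Right)
	case '*':
		return foldEval(e.Left) * foldEval(e.Right)
	default:
		return e.Val
	}
}

// instr is one instruction of a toy postfix bytecode.
type instr struct {
	op  byte
	val float64
}

// compile emits postfix code with no folding (first half of step 2).
func compile(e *Expr, code []instr) []instr {
	if e.Op == 'n' {
		return append(code, instr{op: 'n', val: e.Val})
	}
	code = compile(e.Left, code)
	code = compile(e.Right, code)
	return append(code, instr{op: e.Op})
}

// interpret runs the postfix code on a stack machine (second half of step 2).
func interpret(code []instr) float64 {
	var stack []float64
	for _, in := range code {
		if in.op == 'n' {
			stack = append(stack, in.val)
			continue
		}
		r, l := stack[len(stack)-1], stack[len(stack)-2]
		stack = stack[:len(stack)-2]
		if in.op == '+' {
			stack = append(stack, l+r)
		} else {
			stack = append(stack, l*r)
		}
	}
	return stack[0]
}

// sameValue is the equality used by the assertion (step 3). Treating two NaN
// results as equal is an assumption made for this sketch.
func sameValue(a, b float64) bool {
	return a == b || (math.IsNaN(a) && math.IsNaN(b))
}

// diverges reports whether the two evaluators disagree on e.
func diverges(e *Expr) bool {
	return !sameValue(foldEval(e), interpret(compile(e, nil)))
}

func main() {
	corpus := []*Expr{
		Bin('*', Lit(1e308), Lit(10)), // overflows to +Inf at runtime
		Bin('*', Bin('+', Lit(2), Lit(3)), Lit(4)),
	}
	for i, e := range corpus {
		fmt.Printf("expr %d: folded=%v diverges=%v\n", i, foldEval(e), diverges(e))
	}
}
```

In a real harness the loop in `main` would presumably become a test that fails on the first expression for which `diverges` returns true.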

Pure expressions are the only ones in scope because impure expressions can be non-deterministic (now(), fetch(...)), so two evaluations are not guaranteed to match. MEP-15's Effects.IsEmpty() is the gate: if it reports that the expression is pure, the two evaluators must agree.

This catches the entire class of "the optimizer rewrites this to a value the interpreter does not produce" bugs. It is a one-time investment with permanent value; every optimization MEP gets free oracle testing.

Layer 2: property-based testing

A new package runtime/vm/prop/ defines properties that the VM must obey across every input. Properties express laws, not examples. Implementation uses [gopter] or hand-written quickcheck-style generators (gopter is preferred for breadth).

The starter law set:

  • Arithmetic associativity over int: (a + b) + c == a + (b + c) for all int a, b, c. Overflow, which Mochi defines per MEP (TODO: cite the right MEP) as wrapping, does not break the law: the property is stated on the wrapped result.
  • Arithmetic commutativity over int and float: a + b == b + a. Caveat for float: NaN never compares equal to itself. The property excludes NaN inputs and input pairs whose sum is NaN (such as +Inf and -Inf).
  • Equality reflexivity: x == x for every value not containing NaN.
  • len of append: len(append(xs, y)) == len(xs) + 1.
  • Map round-trip: m[k] = v; m[k] == v for any key k and value v.
  • Option unwrap-after-guard: for any Option<T>, if o != none then o! returns the wrapped value (MEP-16).
  • Pure expression idempotency: for any pure expression e, evaluating e twice yields equal results (for results not containing NaN).
  • Effect-set monotonicity (MEP-15): for any function f, inferredEffects(f) ⊆ declaredEffects(f) whenever a declaration is present.

Each property runs a few thousand random inputs by default; CI runs more (configurable via MOCHI_PROP_N). A failed property prints the smallest counterexample (shrinking).
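
A few of the starter laws can be sketched with Go's standard `testing/quick` package. This is a stand-in, not the proposed implementation: the MEP prefers gopter, and `testing/quick` does not shrink counterexamples. The laws are checked here on Go's own `int64` and slices (Go integer addition wraps on overflow) rather than on Mochi values, and the function names and the default of 2000 inputs are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"testing/quick"
)

// Each property is a function returning true when the law holds.
// Go's int64 addition wraps on overflow, matching the "wrapped result" wording.
func addAssociative(a, b, c int64) bool { return (a+b)+c == a+(b+c) }
func addCommutative(a, b int64) bool    { return a+b == b+a }
func appendGrowsByOne(xs []int64, y int64) bool {
	return len(append(xs, y)) == len(xs)+1
}

// propCount reads the number of random inputs from MOCHI_PROP_N,
// falling back to a default (the default value is an assumption).
func propCount() int {
	if n, err := strconv.Atoi(os.Getenv("MOCHI_PROP_N")); err == nil && n > 0 {
		return n
	}
	return 2000
}

// checkAll runs every property and returns the first failure, if any.
func checkAll() error {
	cfg := &quick.Config{MaxCount: propCount()}
	props := []struct {
		name string
		fn   interface{}
	}{
		{"add associative", addAssociative},
		{"add commutative", addCommutative},
		{"len of append", appendGrowsByOne},
	}
	for _, p := range props {
		if err := quick.Check(p.fn, cfg); err != nil {
			return fmt.Errorf("%s: %w", p.name, err)
		}
	}
	return nil
}

func main() {
	if err := checkAll(); err != nil {
		fmt.Println("FAIL:", err)
		os.Exit(1)
	}
	fmt.Println("all properties held")
}
```

For the VM itself, each property body would presumably run the operation through the interpreter rather than through native Go arithmetic, so that a wrong deopt path shows up as a failed law.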

Properties are not fixtures. They do not pin a specific output; they pin a relationship. This is the right tool for the optimizer-vs-interpreter agreement question and for any invariant the type system promises.

Layer 3: bytecode-compiler fuzzing

Go's native testing.F runs fuzzers in CI ([Go-Fuzz]). A new file runtime/vm/fuzz_test.go defines:

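func FuzzCompile(f *testing.F) {
	// Seed with the golden inputs so mutation starts from valid programs.
	seedFromCorpus(f, "tests/vm/valid")
	f.Fuzz(func(t *testing.T, src string) {
		prog, err := parser.ParseString(src)
		if err != nil {
			return // parse errors are expected; only the compiler is under test
		}
		// Deliberately no recover(): a panic in vm.Compile is the bug we hunt,
		// and testing.F reports the panicking input as a crasher.
		_, _ = vm.Compile(prog, nil)
	})
}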

The seed corpus is the existing golden inputs (paths walked at fuzz init). The fuzzer mutates them. Any panic in vm.Compile is a bug; the fuzzer minimizes and reports.
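
The MEP does not define `seedFromCorpus`, so the following is one hypothetical way to write it: walk a directory, read every `.mochi` file, and register each file's contents with `f.Add`. The `collectSeeds` helper and the in-memory `demoCorpus` exist only so the walking logic can be exercised outside a fuzz run; both are assumptions of this sketch.

```go
package main

import (
	"fmt"
	"io/fs"
	"os"
	"path"
	"testing"
	"testing/fstest"
)

// collectSeeds returns the contents of every *.mochi file under root.
// fs.WalkDir visits entries in lexical order, so the result is deterministic.
func collectSeeds(fsys fs.FS, root string) ([]string, error) {
	var seeds []string
	err := fs.WalkDir(fsys, root, func(p string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		if d.IsDir() || path.Ext(p) != ".mochi" {
			return nil // skip directories and .out files
		}
		data, err := fs.ReadFile(fsys, p)
		if err != nil {
			return err
		}
		seeds = append(seeds, string(data))
		return nil
	})
	return seeds, err
}

// seedFromCorpus is a hypothetical implementation of the helper used by
// FuzzCompile: it adds each golden input as a fuzz seed.
func seedFromCorpus(f *testing.F, dir string) {
	f.Helper()
	seeds, err := collectSeeds(os.DirFS(dir), ".")
	if err != nil {
		f.Fatalf("seed corpus %s: %v", dir, err)
	}
	for _, s := range seeds {
		f.Add(s)
	}
}

// demoCorpus is an in-memory directory used only to demonstrate collectSeeds.
func demoCorpus() fs.FS {
	return fstest.MapFS{
		"valid/a.mochi":     {Data: []byte("print(1)")},
		"valid/a.out":       {Data: []byte("1\n")},
		"valid/sub/b.mochi": {Data: []byte("print(2)")},
	}
}

func main() {
	seeds, err := collectSeeds(demoCorpus(), "valid")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("collected %d seeds\n", len(seeds))
}
```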

A second fuzzer, FuzzRun, takes the compiled program and runs it under a CPU/memory budget. The harness uses runtime.SetFinalizer and a hard wall-clock cap (5s) to detect non-termination. Any panic, infinite loop, or memory blowup is a bug.
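
The wall-clock part of that budget can be sketched as follows. This covers only the timeout and panic capture, not memory accounting, and `runWithBudget` and the demo functions are hypothetical names. Go cannot forcibly stop a goroutine, so this sketch only detects that the budget was exceeded; a real harness would presumably also need the VM to stop cooperatively or to run in a separate process.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errBudget is returned when run does not finish within the budget.
var errBudget = errors.New("wall-clock budget exceeded")

// runWithBudget executes run in a goroutine and waits at most budget for it.
// A panic inside run is converted into an error so the caller can report it.
func runWithBudget(budget time.Duration, run func() error) error {
	done := make(chan error, 1) // buffered so a late finish never blocks
	go func() {
		defer func() {
			if r := recover(); r != nil {
				done <- fmt.Errorf("panic: %v", r)
			}
		}()
		done <- run()
	}()
	timer := time.NewTimer(budget)
	defer timer.Stop()
	select {
	case err := <-done:
		return err
	case <-timer.C:
		return errBudget
	}
}

// Demo inputs standing in for compiled programs.
const demoBudget = 50 * time.Millisecond

func demoOK() error    { return nil }
func demoPanic() error { panic("index out of range in handler") }
func demoHang() error {
	for {
		time.Sleep(time.Millisecond) // simulates a program that never terminates
	}
}

func main() {
	fmt.Println("ok:   ", runWithBudget(demoBudget, demoOK))
	fmt.Println("panic:", runWithBudget(demoBudget, demoPanic))
	fmt.Println("hang: ", runWithBudget(demoBudget, demoHang))
}
```

With the 5s cap from the text, the same function would be called as `runWithBudget(5*time.Second, ...)`.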

CI runs the fuzzers for a fixed time budget per PR (default 60s each). Nightly runs extend to 30 minutes. Found crashers are checked into runtime/vm/fuzz_corpus/ so they become permanent regression tests.

Layer 4: oracle harness for MEP-22

The copy-and-patch JIT in MEP-22 will need a differential harness against the interpreter. This MEP defines that harness ahead of time so MEP-22 ships behind it.

The shape: for every program in the test corpus, run under the interpreter and under the JIT, compare outputs. Any divergence is a P1 bug. The interpreter is canonical.

The harness is gated by a build tag (-tags jit) and is a no-op until MEP-22 ships. Its skeleton lands in this MEP so MEP-22 has nothing to design about correctness, only to implement.
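
A minimal sketch of that shape, under stated assumptions: `Engine`, `Divergence`, and `diffCorpus` are hypothetical names, and the two toy engines stand in for the interpreter and the JIT (neither API is specified here). The real file would presumably carry the `jit` build constraint; it is omitted so the sketch compiles on its own.

```go
package main

import (
	"fmt"
	"strings"
)

// Engine runs a program and returns its captured output (hypothetical type).
type Engine func(src string) (string, error)

// Divergence records one program on which the candidate disagrees
// with the reference.
type Divergence struct {
	Src, Want, Got string
}

// describe folds an output/error pair into one comparable string.
func describe(out string, err error) string {
	if err != nil {
		return "error: " + err.Error()
	}
	return out
}

// diffCorpus runs every program under both engines and collects disagreements.
// The reference engine is canonical: Want always comes from ref.
func diffCorpus(ref, cand Engine, corpus []string) []Divergence {
	var out []Divergence
	for _, src := range corpus {
		want := describe(ref(src))
		got := describe(cand(src))
		if want != got {
			out = append(out, Divergence{Src: src, Want: want, Got: got})
		}
	}
	return out
}

// refEngine is a toy "interpreter": it upper-cases its input.
func refEngine(src string) (string, error) { return strings.ToUpper(src), nil }

// buggyEngine is a toy "JIT" with a deliberate bug on inputs over 8 bytes.
func buggyEngine(src string) (string, error) {
	if len(src) > 8 {
		return strings.ToUpper(src[:8]), nil
	}
	return strings.ToUpper(src), nil
}

func main() {
	corpus := []string{"short", "a much longer program"}
	for _, d := range diffCorpus(refEngine, buggyEngine, corpus) {
		fmt.Printf("divergence on %q: want %q, got %q\n", d.Src, d.Want, d.Got)
	}
}
```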

Coverage measurement

A new CI step runs go test -cover ./runtime/vm/... and reports the line coverage delta per PR. The goal is not a number to chase; it is to make uncovered handlers visible. A new opcode that lands without a test prints as uncovered and is rejected at review.

Status at a glance

| Item | Status |
| --- | --- |
| Differential exec: folder vs interpreter on pure expressions | proposed |
| Property suite: arithmetic, equality, len, map, option laws | proposed |
| Property suite: shrinkable counterexamples | proposed |
| Fuzzer: FuzzCompile over parser corpus | proposed |
| Fuzzer: FuzzRun with wall-clock and memory budget | proposed |
| Crasher corpus checked in under runtime/vm/fuzz_corpus/ | proposed |
| Coverage delta reported in CI | proposed |
| Oracle harness skeleton for MEP-22 | proposed |

Risks

  • Flaky properties on float arithmetic. Floating-point laws fail in edge cases (NaN, signed zero, denormals). The property generators must exclude these inputs explicitly; failures here are usually generator bugs, not VM bugs.
  • Fuzzer noise. Fuzzers find inputs that the parser accepts but semantic analysis rejects. The fuzzer counts a panic as a bug; it does not count a type error as a bug. The harness treats type errors as an expected outcome.
  • CI time. A 60s fuzz budget per PR is cheap. A 30-minute nightly run is fine. If the property suite grows beyond ~5 minutes total, split into nightly.
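  • False oracles. The differential harness assumes the interpreter is correct. A bug that the interpreter and the folder share would be invisible to it. The property suite is the backstop: it checks laws directly rather than agreement between two evaluators, so it can catch cases the folder comparison cannot.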

Non-goals

  • No formal verification. No proof of VM correctness against a denotational semantics. The combination of differential testing, properties, and fuzzing is enough to catch regressions; full verification is decades of work and not on the table.
  • No metamorphic testing. A future MEP may add metamorphic relations (e.g. "adding var _ = 1 to any program should not change its output") but the three layers here are sufficient for the perf MEPs that motivate this work.

Implementation notes

Layer 1 (differential exec) is the highest-value, lowest-cost item: a few hundred lines of Go that catch the entire optimizer-divergence class. Land it first.

Layer 2 (properties) is the largest open-ended investment. Start with the eight laws above; add more as bugs are found. The discipline is that every fixed VM bug ships with the property that would have caught it.

Layer 3 (fuzzing) is one PR; the seed corpus is what makes it effective. The corpus is the union of tests/vm/valid/, tests/rosetta/x/Mochi/, and the bench suite from MEP-17. Re-seed nightly from the latest corpus.

Layer 4 (JIT oracle skeleton) is plumbing only; the real value materializes when MEP-22 ships.

References