Testing gates: per-phase fixtures, vm3 byte-equal, type-check strictness, reproducibility
This note defines the test gate plan that MEP-51 must clear at each of its 18 phases. The structure mirrors MEP-50 (Kotlin) and MEP-49 (Swift), with deltas for the Python-specific gates (mypy + pyright) and the build-artifact reproducibility gate that has no equivalent in the JVM or Swift sibling MEPs.
See the shared-decisions anchor for the load-bearing decisions and
[[10-build-system]] for the build pipeline this note tests.
The gate hierarchy
MEP-51 has four ordered gate tiers. A fixture passes Phase N only if it clears every tier in order.
Tier 1 (master): vm3 byte-equal stdout
The Mochi reference interpreter vm3 runs the source .mochi
fixture and captures stdout. The transpiler emits Python source from
the same fixture. CPython 3.12 runs the emitted Python and captures
stdout. The two stdouts must be byte-identical.
This is the only gate that compares observable behaviour. The other three tiers gate the artifact, not the behaviour.
byte-identical means:
- Same UTF-8 bytes.
- Same
\nline endings (POSIX\n, never\r\n, even on Windows). - Same trailing newline presence/absence.
- Same numeric formatting (e.g. floats as
1.5not1.50or1.5e0).
The expect.txt golden file is the byte-equal target. It is
generated once from vm3 and committed; subsequent CI runs compare
the emitted Python's stdout to expect.txt.
Normalised line endings: the CI runner sets git config core.autocrlf false and git config core.eol lf to prevent Windows from rewriting
line endings on checkout. We document this in the contributor guide.
Test runner: tests/transpiler3/python/runner.go (Go test driver,
mirrors tests/transpiler3/c/runner.go).
Tier 2 (secondary): mypy --strict + pyright --strict
Both type checkers run on the emitted Python with strict mode enabled. Both must report zero errors.
mypy invocation:
uv run mypy --strict --python-version 3.12 --no-incremental src/
Flags:
--strictenables all of--check-untyped-defs --disallow-any-generics --disallow-incomplete-defs --disallow-subclassing-any --disallow-untyped-calls --disallow-untyped-decorators --disallow-untyped-defs --no-implicit-optional --no-implicit-reexport --strict-equality --warn-redundant-casts --warn-return-any --warn-unused-configs --warn-unused-ignores.--python-version 3.12pins to the floor.--no-incrementaldisables the.mypy_cachebecause we run on fresh CI containers and incremental cache state varies.
pyright invocation:
uv run pyright --pythonversion 3.12
pyright uses [tool.pyright] in pyproject.toml for config; the
typeCheckingMode = "strict" field in pyproject.toml already
enforces strict mode (see [[10-build-system]]).
Both checkers must report zero errors. Warnings are allowed but
tracked; we audit warnings monthly to escalate persistent ones to
errors (see [[12-risks-and-alternatives]] R2 for the divergence
story).
Why both checkers
mypy and pyright disagree on roughly 15% of strict-mode edge cases. Examples from the pyright issue tracker (cross-referenced with mypy issues):
- PEP 695 type alias variance defaults: mypy infers covariant by default; pyright infers invariant by default for class-bound generics. Mochi's emit must produce code that satisfies both.
TypedDicttotality: mypy is stricter ontotal=Falseaccess; pyright requires explicitgetfor optional keys.Protocolruntime checkability: mypy ignores at type-check time; pyright raises if@runtime_checkableis missing on a Protocol used inisinstance.- Generic class inheritance: pyright allows narrower type arguments; mypy enforces invariance.
Passing both gates narrows the emit to the intersection of correct typed Python. This costs CI time (about 1.5x the type-check budget) but catches real bugs.
Tier 3 (tertiary): ruff check + ruff format fixed-point
Lint and format. Two invocations:
uv run ruff check src/ tests/
uv run ruff format --check src/ tests/
ruff check runs the linter. Zero errors required. The selected
rules are in [tool.ruff.lint] in pyproject.toml (see
[[10-build-system]]).
ruff format --check runs the formatter in check-only mode. If any
file would be reformatted, the exit code is non-zero. The gate fails.
This is a fixed-point check: emit -> format -> emit again must
produce identical output. We test by running ruff format once and
comparing the diff:
$ uv run ruff format src/
$ git diff --quiet src/ # must succeed
If git diff shows changes, the emit is not formatter-stable. We fix
the emitter to produce formatter-stable output.
ruff format is black-compatible; we discuss the relation in
[[10-build-system]]. We do not run black separately.
Tier 4 (quaternary): wheel build + install + smoke test
uv build
uv pip install --system dist/*.whl
python -c "import <pkg>; <smoke test>"
The wheel must build without errors. The wheel must install into a
fresh venv without errors. The smoke test imports the top-level
package and runs a --version or equivalent.
Build errors usually mean pyproject.toml mis-emission (e.g. wrong
package layout, missing py.typed). Install errors usually mean a
dependency declaration mismatch. Smoke-test errors mean the emit
breaks at module import time (most often: typo in a generated
__init__.py).
This gate runs after Tier 1-3 pass. It's deliberately last because it's the slowest (cold install on Windows can take 30 seconds).
Additional gates (per phase)
Some phases add specific gates beyond the four tiers.
PyPI Trusted Publishing dry-run (Phase 18)
uv publish --dry-run --trusted-publishing always
--dry-run verifies the OIDC exchange + upload signature without
actually uploading. We run this in PR CI to catch credential
misconfiguration before release.
Reproducibility (Phase 15+)
Two builds, two hosts, byte-identical wheel SHA256:
HOST_A: SOURCE_DATE_EPOCH=$(git log -1 --pretty=%ct) uv build --wheel
HOST_B: SOURCE_DATE_EPOCH=$(git log -1 --pretty=%ct) uv build --wheel
diff <(shasum -a 256 host_a/dist/*.whl) <(shasum -a 256 host_b/dist/*.whl)
Hosts cover linux x86_64 + linux aarch64 + macos arm64 by Phase 16. Windows is excluded from reproducibility through Phase 16 because of filesystem case-insensitivity deltas; Phase 16.1 adds Windows.
ipykernel notebook execution (Phase 17)
Spawn JupyterLab, open a .ipynb fixture, execute all cells, capture
each cell's outputs, diff against expect.ipynb.
uv run jupyter nbconvert --to notebook --execute fixtures/<name>.ipynb \
--output /tmp/actual.ipynb
diff <(jq '.cells[].outputs' fixtures/<name>.expect.ipynb) \
<(jq '.cells[].outputs' /tmp/actual.ipynb)
The diff must be empty. jq strips notebook-format metadata (cell
ids, execution counts) that vary between runs.
Per-phase gate definitions
Each phase has a Go test wrapper in tests/transpiler3/python/ that
sets up fixtures, runs the transpiler, and walks the gate tiers. The
test wrapper for Phase N is phase<N>_test.go with a top-level test
TestPhase<N>...Python.
The fixture directory layout:
tests/transpiler3/python/
├── runner.go
├── phase1_helloworld_test.go
├── phase2_scalars_test.go
├── phase3_1_lists_test.go
├── phase3_2_maps_test.go
├── phase3_3_sets_test.go
├── phase3_4_list_of_records_test.go
├── phase4_records_test.go
├── ...
├── phase18_pypi_test.go
└── fixtures/
├── phase1/
│ └── helloworld/
│ ├── source.mochi
│ ├── expect.txt
│ └── meta.toml
├── phase2/
│ └── ...
└── ...
Each fixture has:
source.mochi: the Mochi source.expect.txt: the byte-equal stdout target.meta.toml: optional metadata (e.g.skip_pyright = trueif a fixture probes a known pyright divergence).
Phase 1: hello world
Goal: print("hello world") round-trips through emit and runs.
Fixtures: 1.
helloworld: printshello world\n.
Gate: Tiers 1-4. mypy strict on an __init__.py + __main__.py +
generated/foo.py triplet. Wheel builds and installs.
Output:
# generated/foo.py
def main() -> None:
print("hello world")
Test: TestPhase1HelloWorldPython.
Phase 2: scalars
Goal: int, float, bool, str, bytes lower correctly, arithmetic
and comparisons work, formatting matches Mochi's.
Fixtures: 12.
int_arith:1 + 2 * 3int_overflow: arbitrary precision big intsfloat_arith: IEEE 754 additionfloat_special: NaN, +inf, -infbool_logic:and,or,not, short-circuitstring_concat:"a" + "b"string_len: code-point lengthstring_unicode: emoji + combining marksstring_format:f"{x:.2f}"bytes_literal:b"\x00\x01"bytes_concat: byte concatenationprint_mixed: print of mixed types via__str__
Gate: Tiers 1-4. Special attention to float formatting: Python's
default repr(1.5) is "1.5", matching Mochi. repr(0.1 + 0.2) is
"0.30000000000000004"; Mochi prints 0.3. The emitter must use
explicit format strings to align with Mochi's output. See the
float_special fixture for the NaN edge case (Mochi prints NaN,
Python prints nan; we override).
Phase 3.1: lists
Goal: list[T] lowering, including literals, indexing, slicing,
len, append, comprehensions.
Fixtures: 18.
list_literal_int,list_literal_str,list_literal_mixed_into_unionlist_index_positive,list_index_negative,list_index_out_of_boundslist_slice_basic,list_slice_step,list_slice_negative_steplist_len,list_append,list_extend,list_poplist_iter_for,list_iter_indexlist_comprehension_simple,list_comprehension_filter,list_comprehension_nested
Gate: Tiers 1-4. mypy strict requires list[T] annotations on all
list-typed parameters and returns. pyright strict requires the same.
Out-of-bounds indexing: Mochi raises a structured error; Python raises
IndexError. The emitter wraps Python's IndexError in a
MochiResult-aware translator (see Phase 11).
Phase 3.2: maps (dict)
Goal: map<K, V> lowering to dict[K, V], including literal,
indexing, len, keys, values, has, for-each iteration.
Fixtures: 22.
dict_literal_str_int,dict_literal_int_str,dict_literal_str_str,dict_literal_str_list,dict_literal_str_recorddict_index_get,dict_index_missing_raises,dict_index_missing_defaultdict_set,dict_delete,dict_updatedict_len,dict_keys,dict_values,dict_itemsdict_has_keydict_iter_for,dict_iter_keys,dict_iter_values,dict_iter_itemsdict_comprehension,dict_merge
Gate: Tiers 1-4. Mochi's m["k"] on missing key raises a typed
error; Python's d["k"] raises KeyError. Emit uses d["k"]
unchanged (the error translation happens at the Mochi-aware boundary
when crossing into a Result-returning context). For
dict.get("k", default) Mochi has m.get("k", default) lowering
directly.
Phase 3.3: sets
Goal: set<T> lowering to set[T]. Insertion order is NOT
guaranteed by Python's set (unlike dict). For order-sensitive
sets, emit uses dict.fromkeys(...).keys() as an ordered-set
substitute.
Fixtures: 14.
set_literal,set_add,set_remove,set_contains,set_iter,set_len,set_union,set_intersection,set_difference,set_symmetric_difference,set_comprehension,set_from_list,set_ordered_iter(usesdict.fromkeys),set_frozen(usesfrozenset).
Gate: Tiers 1-4. The set_ordered_iter fixture is tricky: Mochi
guarantees insertion-ordered iteration on sets; Python set does
not. The emitter detects when ordering is observed (e.g. via for
iteration printing) and substitutes dict.fromkeys(...). The
detection is done in lowering, not emit (see MEP-51 §6).
Phase 3.4: list of records
Goal: list<Record> where Record is a frozen dataclass.
Fixtures: 16.
list_record_basic,list_record_filter,list_record_sort,list_record_map,list_record_index,list_record_append,list_record_nested,list_record_with_option,list_record_with_list,list_record_with_map,list_record_query(select-from-where),list_record_groupby,list_record_distinct,list_record_aggregate,list_record_join,list_record_serialise_jsonl.
Gate: Tiers 1-4. Dataclass emission uses @dataclass(frozen=True, slots=True) (PEP 557 + PEP 622-ish slots). mypy strict requires
explicit type annotations on all dataclass fields; the emit always
includes them. pyright strict requires field() with
default_factory for mutable defaults; the emit always uses this
pattern.
Phase 4: records (frozen dataclass)
Goal: standalone record types, equality, hashing, __repr__.
Fixtures: 20.
record_basic,record_equality,record_hash,record_repr,record_nested,record_with_optional,record_with_list,record_with_map,record_with_set,record_pattern_match,record_clone_with,record_serialise,record_deserialise,record_generic,record_generic_pep695,record_recursive(linked-list node),record_field_default,record_field_factory,record_inheritance(Mochi disallows; fixture asserts emit-time error),record_slots_size.
Gate: Tiers 1-4. record_slots_size checks that emitted classes
have __slots__ (no __dict__) by sys.getsizeof(instance) being
less than a struct-equivalent.
Phase 5: sum types
Goal: PEP 695 type alias for sum types, dataclass variants,
exhaustive match.
Fixtures: 18.
sum_basic_two_variants,sum_three_variants,sum_variant_with_data,sum_variant_no_data,sum_nested,sum_recursive(tree),sum_generic,sum_match_exhaustive,sum_match_non_exhaustive_error,sum_match_guards,sum_match_wildcard,sum_serialise,sum_deserialise,sum_option_some_none,sum_result_ok_err,sum_either_left_right,sum_complex_records,sum_with_options_in_variants.
Gate: Tiers 1-4. PEP 695 type Foo = A | B | C syntax requires
3.12+. mypy 1.13+ handles PEP 695 well; pyright 1.1.380+ also. We
pin both versions exactly (see "test stability" below).
Exhaustiveness: Python's match does not require exhaustiveness, but
mypy strict (--strict plus --warn-unreachable) flags unreachable
branches. We emit a final case _: that raises a synthetic
assert_never to satisfy both checkers. typing.assert_never (3.11+)
gives a type-narrowing guarantee.
Phase 6: closures + higher-order
Goal: closures capture variables, lambdas have correct inferred types, higher-order functions (map, filter, reduce, fold) work.
Fixtures: 16.
closure_basic,closure_mutable_capture,closure_immutable_capture,closure_late_binding,closure_lambda_one_arg,closure_lambda_multi_arg,closure_higher_order_map,closure_higher_order_filter,closure_higher_order_reduce,closure_higher_order_compose,closure_curry,closure_partial_application,closure_returns_closure,closure_captures_self,closure_recursive,closure_in_method.
Gate: Tiers 1-4. closure_late_binding checks that the
classic Python closure-in-loop gotcha (variable captured by
reference, not value) is handled by emit using a default-argument
trick or functools.partial.
Phase 7: query DSL
Goal: Mochi's from x in xs select x.y where ... order by ...
DSL lowers to list / generator expressions + itertools where
appropriate.
Fixtures: 20.
query_select,query_where,query_orderby,query_groupby,query_distinct,query_join_inner,query_join_left,query_join_right,query_join_full,query_aggregate_count,query_aggregate_sum,query_aggregate_avg,query_aggregate_max,query_aggregate_min,query_nested,query_subquery,query_correlated_subquery,query_lazy_iterator,query_async_iterator,query_pipeline_dataflow.
Gate: Tiers 1-4. query_lazy_iterator checks that the emit uses
a generator (not eager list comp) when the result is iterated only
once. query_async_iterator uses AsyncIterator[T] from
collections.abc.
Phase 8: datalog
Goal: Mochi's datalog blocks compile to seminaive evaluation in Python.
Fixtures: 12.
datalog_transitive_closure,datalog_path_finding,datalog_ancestor,datalog_same_generation,datalog_negation_stratified,datalog_negation_unstratified_error,datalog_arithmetic,datalog_aggregation_count,datalog_aggregation_sum,datalog_recursion_terminates,datalog_recursion_unsound_warning,datalog_large_dataset_perf.
Gate: Tiers 1-4 plus a runtime budget check: each datalog fixture must finish within 5 seconds on the CI host's reference machine (github-hosted ubuntu-24.04, 4-core).
Phase 9: agents
Goal: agent syntax lowers to a class with asyncio.Queue
mailbox and asyncio.TaskGroup supervision.
Fixtures: 18.
agent_basic,agent_state_mutation,agent_message_handler,agent_cast,agent_call_reply,agent_call_timeout,agent_supervisor_one_for_all,agent_supervisor_one_for_one,agent_supervisor_rest_for_one(custom),agent_cancellation_propagation,agent_exception_group_aggregation,agent_spawn_child,agent_child_failure_restart,agent_graceful_shutdown,agent_message_ordering_fifo,agent_backpressure_bounded_queue,agent_two_agents_interact,agent_pool_of_workers.
Gate: Tiers 1-4 plus an asyncio-event-loop debug check
(PYTHONASYNCIODEBUG=1) for unawaited coroutines or
double-cancellation. The check fails the gate on any debug warning.
Phase 10: streams
Goal: stream<T> lowers to AsyncIterator[T]. Stream
combinators (map, filter, fold, take, drop, zip) emit as async
generators.
Fixtures: 14.
stream_basic,stream_finite,stream_infinite_take,stream_map,stream_filter,stream_fold,stream_zip,stream_flatmap,stream_throttle,stream_buffer,stream_close_on_drop,stream_backpressure,stream_error_propagation,stream_two_consumers_split.
Gate: Tiers 1-4. stream_close_on_drop checks aclose() is
called when the iterator is GC'd. Python's gc.callbacks is used to
verify; __del__ on async iterators is unreliable.
Phase 11: async coloring, MochiResult, ExceptionGroup
Goal: every async-capable function returns a coroutine; sync
functions stay sync; errors are MochiResult[T, E] not exceptions
for explicit error paths.
Fixtures: 20.
result_ok,result_err,result_match,result_map,result_chain,result_from_exception,result_into_exception_at_boundary,exception_group_aggregation,taskgroup_one_failure,taskgroup_two_failures,taskgroup_cancellation,taskgroup_nested,taskgroup_timeout_outer,taskgroup_timeout_inner,async_function_definition,async_function_call,async_function_concurrency_limit,async_generator_yield,async_generator_close,async_with_context_manager.
Gate: Tiers 1-4. async_with_context_manager tests async with
__aenter__/__aexit__. mypy strict requires the protocol; pyright strict checks the same.
Phase 12: FFI
Goal: Mochi extern declarations lower to ctypes calls for
native libraries and to direct Python imports for pure-Python deps.
Fixtures: 14.
ffi_ctypes_libc_strlen,ffi_ctypes_libc_qsort,ffi_ctypes_struct_pack,ffi_ctypes_callback,ffi_pure_python_import,ffi_typed_stub_only,ffi_cffi_alternative(skipped if cffi not installed),ffi_error_propagation,ffi_memory_safety,ffi_thread_safety,ffi_release_gil_decorator,ffi_platform_specific_linux,ffi_platform_specific_macos,ffi_platform_specific_windows.
Gate: Tiers 1-4. Platform-specific fixtures use @pytest.mark.skipif
based on sys.platform. The runner uses --platform=<plat>
matching to gate which fixtures count per host.
Phase 13: LLM provider dispatch
Goal: Mochi's llm.chat lowers to a mochi-runtime[ai] call
that dispatches to the right provider (Anthropic, OpenAI, local
Ollama).
Fixtures: 8 (with --llm=mock to avoid network).
llm_basic_completion,llm_streaming_completion,llm_tool_use,llm_multi_turn,llm_provider_anthropic,llm_provider_openai,llm_provider_local_ollama,llm_error_handling.
Gate: Tiers 1-4. Real network calls are gated to a nightly run
with --llm=real and credentials from secrets; the PR CI uses the
mock provider.
Phase 14: fetch (httpx)
Goal: Mochi fetch (HTTP GET/POST) lowers to httpx calls.
Fixtures: 10.
fetch_get_text,fetch_get_json,fetch_post_json,fetch_redirect,fetch_timeout,fetch_auth_basic,fetch_auth_bearer,fetch_streaming_response,fetch_error_status,fetch_proxy.
Gate: Tiers 1-4. Fixtures hit a local httptest server
(httpx.MockTransport); no real network in CI.
Phase 15: wheel + sdist build via uv
Goal: uv build produces wheel + sdist that install and run.
Fixtures: 4.
wheel_basic_install,wheel_with_extras,sdist_basic_install,sdist_with_extras.
Gate: Tier 4 dominates. The fixture runs uv build, installs the
wheel into a fresh venv, runs a smoke test, then repeats with the
sdist. The build is gated on dist/*.whl and dist/*.tar.gz both
existing with the expected name.
Phase 16: reproducible build
Goal: byte-identical wheel SHA256 across hosts.
Fixtures: 2.
reproducibility_basic,reproducibility_with_extras.
Gate: Build on linux-x86_64, linux-aarch64, macos-arm64; compare SHA256. Windows reproducibility deferred to Phase 16.1.
Sub-phase 16.1 (Windows reproducibility): adds a third fixture and extends the SHA comparison to windows-x86_64. Outstanding issues: filesystem case sensitivity in zip entries, CRLF/LF handling in generated sources (we already normalise but the gate verifies).
Phase 17: Jupyter ipykernel
Goal: Mochi code runs cell-by-cell in JupyterLab.
Fixtures: 10.
notebook_helloworld,notebook_variable_persistence,notebook_function_redefinition,notebook_import_in_cell,notebook_error_handling,notebook_async_cell,notebook_plot_matplotlib,notebook_query_dsl,notebook_record_definition,notebook_multi_cell_workflow.
Gate: Tiers 1-3 do not apply directly (the source is a .ipynb, not a .mochi file). Tier 4 is replaced by the notebook execution diff:
jupyter nbconvert --execute fixtures/<name>.ipynb --output /tmp/actual.ipynb
diff (filtered) /tmp/actual.ipynb fixtures/<name>.expect.ipynb
Filter: remove execution_count, id, cell metadata; preserve
outputs text/plain entries.
Phase 18: PyPI Trusted Publishing
Goal: end-to-end publish flow including OIDC and PEP 740 attestation.
Fixtures: 1.
publish_dryrun_trusted_publishing.
Gate: uv publish --dry-run --trusted-publishing always exits 0
when the OIDC token claims match the configured PyPI trust. Real
publish runs only on release tags, not PR CI.
Total fixture count target
Approximate target by Phase 18: 400 fixtures.
Running total (cumulative):
- After Phase 1: 1
- After Phase 2: 13
- After Phase 3.1: 31
- After Phase 3.2: 53
- After Phase 3.3: 67
- After Phase 3.4: 83
- After Phase 4: 103
- After Phase 5: 121
- After Phase 6: 137
- After Phase 7: 157
- After Phase 8: 169
- After Phase 9: 187
- After Phase 10: 201
- After Phase 11: 221
- After Phase 12: 235
- After Phase 13: 243
- After Phase 14: 253
- After Phase 15: 257
- After Phase 16: 259
- After Phase 17: 269
- After Phase 18: 270
The 400 target leaves room for ad-hoc fixtures added post-phase (regression captures, user bug reports, edge cases). We expect about 130 such fixtures to land between Phase 18 ratification and v1 release.
Go test wrappers
Following the existing tests/transpiler3/c/ precedent, each phase
gets a Go test file:
// tests/transpiler3/python/phase1_helloworld_test.go
package python_test
import (
"testing"
"mochi/tests/transpiler3/python/runner"
)
func TestPhase1HelloWorldPython(t *testing.T) {
runner.RunPhase(t, "phase1", "helloworld")
}
The runner.RunPhase helper:
- Loads the fixture directory under
fixtures/<phase>/<name>/. - Reads
meta.tomlfor skip flags. - Runs vm3 on
source.mochito capture the reference stdout. - Compares vm3 stdout to
expect.txt(sanity check). - Invokes the Mochi transpiler with
--target=python. - Writes emitted Python to a temp directory.
- Runs Tier 2:
uv run mypy --strict src/. - Runs Tier 2:
uv run pyright. - Runs Tier 3:
uv run ruff check src/+uv run ruff format --check src/. - Runs the Python:
python -m <pkg>. - Compares Python stdout to
expect.txt(Tier 1 master gate). - Runs Tier 4:
uv build+uv pip install+ smoke test.
Failures at any tier print a diff and exit with a tier-specific code so CI can surface which tier failed.
The runner is parallelised: each fixture runs in its own goroutine
with its own temp dir. The Tier 4 build step is the bottleneck (~5
seconds per fixture); we cap parallelism at GOMAXPROCS.
CI matrix
The full per-release CI matrix:
| OS | Python | Arch | Notes |
|---|---|---|---|
| ubuntu-24.04 | 3.12.0 | x86_64 | floor |
| ubuntu-24.04 | 3.12.7 | x86_64 | latest patch |
| ubuntu-24.04 | 3.13.0 | x86_64 | next-floor candidate |
| ubuntu-24.04-arm | 3.12.7 | aarch64 | ARM verification |
| macos-14 | 3.12.7 | arm64 | Apple Silicon |
| windows-2022 | 3.12.7 | x86_64 | Windows |
6 cells. Tier 1-4 gates run on every cell. Reproducibility runs on the three non-Windows cells. ipykernel runs only on ubuntu-24.04 + 3.12.7 x86_64 (Phase 17).
Each cell takes about 8 minutes to clear all four tiers for the full fixture set as of Phase 18 (270 fixtures, 400 with regression). Total CI wall-clock per release: about 50 minutes for the test job, plus 5 minutes for build / reproducibility / publish.
We do NOT test on macOS x86_64. GitHub deprecated x86_64 macOS runners in 2024; ARM is the default. Users running x86_64 macOS (still sold as "Intel Macs" through 2023) get a documented best-effort support level; we cannot run gates there.
Test stability: pinned tool versions
The gate must not drift due to checker upgrades. We pin exact versions:
| Tool | Version | Reason |
|---|---|---|
| mypy | 1.13.0 | PEP 695 stable, no flagged regressions |
| pyright | 1.1.380 | strict mode parity with mypy 1.13 |
| ruff | 0.6.9 | format stable + PEP 695 lint complete |
| pytest | 8.3.x | latest stable |
| uv | 0.7.0 | first stable lockfile format |
The pins live in pyproject.toml [project.optional-dependencies].dev
and in uv.lock (auto-pinned from the spec).
We bump pins quarterly in a dedicated PR with the diff of new warnings (any new strict-mode error becomes a fix-in-PR; any new warning becomes an audit ticket).
mypy plugin: no
Some projects ship a mypy plugin to teach mypy about emitted code patterns (Pydantic does this, attrs does this until type narrowing matured). We do not ship a Mochi mypy plugin. Reasons:
- Plugins are mypy-specific. pyright has no plugin API. We'd have to double-implement.
- PEP 695 plus PEP 698 plus
Self(PEP 673) gives us the type primitives we need to express Mochi's type system directly. No inference gaps. - Plugin maintenance burden: every mypy major release breaks plugin ABI. Pydantic 1's plugin was a perennial source of bug reports until Pydantic 2 reduced its scope.
Mochi-emitted code stands alone in mypy strict mode without a plugin. This is a hard requirement; if we hit a pattern that mypy can't infer without a plugin, we change the emit, not add a plugin.
Golden file management
Each fixture has expect.txt (Tier 1 master gate) and optionally
expect.ipynb (Phase 17). Conventions:
- Line endings: LF (
\n), not CRLF. Enforced via.gitattributes* text=auto eol=lf. - Trailing newline: file ends with
\n. Mochi's print adds a trailing newline by default. - UTF-8 encoding, no BOM.
- For floats: emit uses Python's
f"{x:.6g}"to truncate representation, then matches vm3's identical truncation.
Regenerating goldens: mochi tests regen --phase=<n> runs vm3 on
every fixture in the phase and overwrites expect.txt. This is a
last-resort tool; in normal flow goldens are stable.
Error path testing
Not every fixture tests the success path. Some test the error path:
- Phase 3.1:
list_index_out_of_boundstests PythonIndexErrorbecoming a Mochi-aware error message. - Phase 5:
sum_match_non_exhaustive_errortests that the emit catches a non-exhaustive match at compile time (Mochi exhaustiveness checker) and emits anassert_neverthat mypy + pyright enforce. - Phase 9:
agent_child_failure_restarttests cancellation + restart paths.
Error-path fixtures have expect.txt matching the expected error
message verbatim. Error message stability is a contract: changing the
emitted error text breaks fixtures. We bump fixtures intentionally
when the error format is improved.
Platform-specific tests
Some fixtures test platform-specific behaviour. Phase 12 (FFI) has
ffi_platform_specific_linux, ffi_platform_specific_macos,
ffi_platform_specific_windows. The runner reads meta.toml:
platforms = ["linux"]
and skips the fixture on other platforms. Skipping is logged in CI output, not silent.
Phase 16 reproducibility excludes Windows by similar mechanism: the
reproducibility job has if: runner.os != 'Windows' in the workflow.
Performance gates
Some phases enforce a wall-clock budget per fixture:
- Phase 8 (datalog): 5 seconds per fixture on github-hosted ubuntu-24.04.
- Phase 9 (agents): 2 seconds per fixture (longer ones use asyncio.timeout(2)).
- Phase 13 (LLM with mock): 1 second per fixture.
Wall-clock gates are noisy on shared runners. We use a 2x tolerance
(timeout * 2 triggers a flake re-run before failing). After two
consecutive fails the test is marked failed.
Snapshot stability
The emitted Python source itself is not a gate. We do not snapshot emit output. Reasons:
- Emit output changes frequently as the emitter is refactored.
- The gate is on observable behaviour (Tier 1) and the static artifact (Tier 4), not on the intermediate source.
- Snapshotting emit output creates churn: every refactor breaks thousands of golden files.
If a developer wants to inspect the emit, mochi transpile --target=python --print-source prints to stdout without writing
files. No CI gate compares the print.
Comparison to MEP-45 (C) and MEP-50 (Kotlin) test gates
| Concern | MEP-45 (C) | MEP-50 (Kotlin) | MEP-51 (Python) |
|---|---|---|---|
| Master gate | vm3 byte-equal stdout | vm3 byte-equal stdout | vm3 byte-equal stdout |
| Compile gate | clang -fsyntax-only | kotlinc -Werror | mypy + pyright (both strict) |
| Lint gate | clang-tidy subset | ktlint | ruff check |
| Format gate | clang-format --dry-run | ktlint --format check | ruff format --check |
| Build gate | make + executable run | gradle build + jar | uv build + wheel install |
| Reproducibility | SOURCE_DATE_EPOCH | gradle --reproducible | SOURCE_DATE_EPOCH + sort |
| Notebook gate | n/a | Kotlin Notebook | ipykernel |
| OIDC publish | n/a (no central registry) | central-portal OIDC | PyPI Trusted Publishing |
Three differences worth calling out:
- Two type checkers. Only MEP-51 runs two type checkers. MEP-50
has one (
kotlinc), MEP-45 has none beyond clang. The dual gate costs us about 30% of total CI time but catches edge cases neither checker would alone. - Notebook gate. MEP-50 has Kotlin Notebook (Jupyter via the Kotlin kernel); MEP-51 has ipykernel via direct Python. Same shape, different plumbing.
- Reproducibility scope. MEP-45 reproducibility is per-platform (the wheel-equivalent is an executable; cross-platform identical binaries are not achievable). MEP-50 and MEP-51 have a cross-platform artifact (jar / wheel), so identical-bytes is achievable and gated.
References
- PEP 484, 526, 561, 585, 591, 612, 646, 647, 673, 692, 695, 698 (typing)
- PEP 654 (ExceptionGroup), PEP 678 (BaseException.add_note)
- mypy docs,
mypy.readthedocs.io - pyright docs,
microsoft.github.io/pyright/ - ruff docs,
docs.astral.sh/ruff/ the shared-decisions anchorfor load-bearing decisions[[10-build-system]]for the build pipeline this gate tests[[12-risks-and-alternatives]]for divergence + tool-version risks