MEP-45 research note 03, Prior art: language → C compilation (2024–2026)
Author: research pass for MEP-45. Date: 2026-05-22 (GMT+7). Method: structured web research by a delegated agent; verbatim report preserved below with light section-anchor edits.
The report is the canonical survey for the MEP body's "Rationale" and "Prior Art" sections. References at the foot are the authoritative source list.
Survey: State of the Art in Compiling High-Level Languages to C (2024-2026)
This survey covers production transpilers, research compilers, and seminal/recent papers relevant to designing a Mochi-to-C transpiler in 2026. It is structured by system (sections 1-11) and then by technique (sections 12-21), closing with a distilled set of design lessons (section 22).
1. Nim and its C backend
Nim has historically used C and C++ as its primary backends; in Nim 2.x ARC/ORC is the default memory model (Nim Blog 2020, "Introduction to ARC/ORC"). The classic pipeline lives in injectdestructors.nim, lambdalifting.nim, closureiters.nim, and transf.nim. ARC injects =destroy, =copy, =sink hooks via a per-scope owned-temporary table plus data-flow last-read analysis; when a variable is last-read, the compiler rewrites the use as =sink and skips the destructor at scope exit. Closures are eliminated by lambda lifting into explicit (env*, fn*) pairs; iterators are transformed by closureiters into resumable state machines (a switch-on-resume-PC closure environment, similar to C# iterators). Exceptions can be lowered three ways (setjmp, cpp, or --exceptions:goto); the goto mode produces ANSI-C-only output where every potentially-throwing call sets a thread-local error flag and jumps to a generated dispatch label, which is what most cross-compilation users pick.
The 2024-2026 architecture is a clean rewrite called Nimony organised around NIF (Nim Intermediate Format) and its C-shaped dialect NIFC (nim-lang/RFCs #556, "Nim Roadmap 2025"; nim-lang/nifspec). NIF is essentially S-expressions with separate tag and identifier namespaces, designed to be the on-disk module format and the lingua franca between compiler stages. Lowering ("Gear 2") performs eleven explicit passes in order: parse, sem1 (lookup/type), sem2 (effect inference), iterator inlining, lambda lifting, deref injection, dup injection, control-flow expression lowering, destructor injection, builtin mapping, exception translation; only then does "Gear 3" generate NIFC. The key lesson is that the old monolithic ccg* generator hit a complexity wall (issues #21579, #23897 show painful interactions between --exceptions:goto, ARC, and C++ goto-past-initialiser rules), and the team explicitly decided to desugar everything into low-level Nim before touching C.
Concrete code-shape: each Nim proc emits a flat C function whose locals all sit at the top, with structured if/while/goto for control flow. Closure environments are heap-allocated tyObject_envXX structs whose fields are the captured locals; refcounted types use a hidden header word holding the count plus type info pointer (TNimType*). Generics are monomorphised eagerly, which causes the documented compile-time explosion when many type parameters meet ARC's per-type hook generation.
2. Crystal
Crystal is LLVM-only, there is no C textual backend, but its codegen IR shape is instructive. The compiler emits LLVM IR that treats every reference type as a tagged pointer, with method dispatch realised as inlined virtual-call sequences after Crystal's whole-program type inference has narrowed union types. Memory is managed by bdwgc (Boehm-Demers-Weiser) linked as a runtime library (src/gc/boehm.cr), which avoids the need for @llvm.gcroot/stack maps. Allocations split into GC_malloc (may contain pointers) and GC_malloc_atomic (pointer-free, e.g. String payload, Bytes), exploiting Boehm's atomic-block optimisation to avoid scanning. Finalisers route through GC_register_finalizer_ignore_self.
A continuing pain point is multithreading: Boehm versions before 8.2.0 require an upstream patch, and Boehm "does not guarantee to scan thread-local storage" (issue tracker), so Crystal stores any pointer destined for TLS also on the thread's stack. Migration to Immix has been discussed since issue #5271 (2017) but blocked by the lack of LLVM-level safepoints and the fact that conservative scanning would have to be replaced by precise stack maps, a major investment. The lesson for a transpiler-style compiler is that Boehm gives you 80% of a GC for one library link, and that the "atomic vs non-atomic allocation" distinction is a cheap, high-impact optimisation when you control the type system.
3. Vala
Vala is the canonical "transpile to readable C" success story (valac man page, Vala FAQ, Wikipedia). It targets GLib/GObject; classes become GTypeInstance subtypes with generated _class_init, _instance_init, _get_type functions; properties become g_param_spec_* plus generated getters/setters; signals are g_signal_new calls plus generated marshalers, so the Vala signal foo (int x) line becomes one C macro registration and a stub foo_emit wrapper. Memory uses GObject reference counting; weak annotations turn into raw pointers, and Vala inserts g_object_ref/g_object_unref along all paths via a simple scope analysis (Vala's analysis is admittedly less precise than Nim's, hence the FAQ note "Vala uses reference counting in more places than most GObject/C-based applications do").
Two profile knobs matter: gobject (default, depends on libgobject) and posix (no GLib, no runtime types, useful for microcontrollers and minimal containers). Cross-language interop is essentially free because every Vala class is a GObject; GObject-Introspection metadata is emitted automatically. ABI stability requires --abi-stability to force public members in source order. The retrospective lesson: piggy-backing on a existing object system (GObject in Vala's case) saved years of design work but locked the language into GLib's worldview. A Mochi-to-C compiler that does not want that lock-in must invent its own minimal runtime, see §15.
4. OCaml and Multicore
OCaml's primary backend is ocamlopt to native, not C; ocamlc produces bytecode. The headline 2021-2026 work is Multicore OCaml (Sivaramakrishnan et al., PLDI 2021, "Retrofitting Effect Handlers onto OCaml"; arXiv:2104.00250) which retrofitted effect handlers with three hard requirements: (R1) backward compatibility, (R2) DWARF/debugger compatibility, and (R3) millions of live continuations. The implementation uses segmented fiber stacks where each handler frame contains a handler_info struct pointing back to its parent fiber plus three closures (value, exn, effect cases). Context blocks emit DWARF callbacks so unwinders stitch parent fibers transparently. The benchmark result, 1% mean overhead on non-effect code, set the bar for "effects without regret."
OCaml 5.0 shipped this in 2022; 5.2 (May 2024) restored POWER64 and added ThreadSanitizer; 5.3 (Jan 2025) added dedicated syntax for deep effect handlers (Leo White, Tom Kelly, et al.). The 2024 Oxidizing OCaml with Modal Memory Management paper (Lorenzen, White, Dolan, Eisenberg, Lindley, ICFP 2024, PACMPL 8/ICFP/253) introduced modes on three axes, affinity, uniqueness, locality, with backwards-compatible inference; field-level modalities, regions (a function or loop body is a region), and the exclave construct for escape, enabling stack-allocated closures and in-place updates of immutable structures without unsafe. The interaction with tail calls is the subtle part: a region must close before the tail jump, so locals in that region cannot be passed as tail-call arguments. OCaml is also the first industrial-strength language targeting the WASM-GC extension (ocaml.org/releases).
5. Roc
Roc compiles via LLVM, but its design notes are the most-cited modern reference on platform/host split and refcount elision. The platform (which a Roc app embeds against) provides all I/O primitives, Stdout.line lives in the platform, not in std. The Roc runtime exports a tiny C ABI surface (roc__mainForHost_1_exposed_generic, etc.); platforms can be written in Zig, Rust, Swift, or C and link the Roc-compiled app object in. This is exactly the "two-tier compilation" model a Mochi-to-C transpiler should consider: emit the app as a .o with a fixed C ABI, let the host write int main().
Roc uses Perceus (see §13) for refcount elision plus Morphic mutation analysis to discover in-place updates on persistent data structures (roc-lang.org/fast). Closures are compiled as {void* fn; void* env} fat pointers (rwx.com/blog/how-roc-compiles-closures), but the compiler aggressively converts them to tagged unions over known function sets when whole-program analysis can enumerate the closure values, eliminating the indirect call. Defunctionalisation is a stated long-term goal (still gappy in 2025). The 2024 Utrecht thesis "Reference Counting with Reuse in Roc" formalises Roc's variant of the λ1n calculus.
6. Cone, Koka, Lean 4, Eff
Koka (Daan Leijen, Microsoft Research) is the most relevant single system to study. It compiles directly to portable C99 with no GC and no runtime system, using compiler-guided reference counting (Perceus; Reinking, Xie, de Moura, Leijen, PLDI 2021, distinguished paper; doi 10.1145/3453483.3454032). Effects compile via the line in Schuster, Brachthäuser et al.'s "Compiling effect handlers in capability-passing style" (ICFP 2020, doi 10.1145/3408975) and Xie & Leijen's "Generalized evidence passing for effect handlers" (ICFP 2021, doi 10.1145/3473576), where each handler in scope is passed as an evidence vector and yield-bubbling unwinds to the matching handler frame. The C library libhandler (github.com/koka-lang/libhandler) implements the same scheme as a portable C99 library using setjmp/portable stack copying.
The Koka research line through 2024 stacks: Frame-Limited Reuse (Lorenzen & Leijen, PACMPL 6/ICFP 2022, doi 10.1145/3547634) generalised reuse analysis with bounded-extra-memory soundness; FP² Fully in-Place Functional Programming (Lorenzen, Leijen, Swierstra, ICFP 2023, PACMPL 7/ICFP/198) introduced the fip keyword and the λfip calculus with unboxed tuples, borrowed parameters, and a proof that FIP functions run in constant stack and zero allocation when args are unique. The Koka binary-trees benchmark beats C++ by ~20%, mainly thanks to TRMC (tail-recursion modulo cons) and 8-byte vs C++'s 16-byte alignment.
Lean 4 (de Moura & Ullrich 2021, CADE) ships both a C and LLVM backend. The pipeline is Lean → LCNF (A-normal form, Flanagan 1993; with join-point insertion à la Maurer 2017) → IR (explicit inc/dec/reuse/isShared) → C or LLVM (EmitC.lean, EmitLLVM.lean). Perceus is implemented at the IR layer; the compiler marks parameters borrowed to elide refcount traffic and detects destructive-update opportunities via isShared. Threading uses sign-bit-tagged counts: positive = unsynchronised (thread-local), negative = atomic (shared); once a block goes shared it never goes back.
Eff (Bauer & Pretnar) and the optimising Eff compiler (Karachalias, Pretnar et al., PACMPL 5/ICFP 2021, doi 10.1145/3485479) compile Eff to OCaml using type-and-effect-directed rewrites that aggressively reduce handler applications, then erase effect info and emit "tight" OCaml. "Eff Directly in OCaml" (Kiselyov & Sivaramakrishnan, arXiv:1812.11664) instead embeds Eff using delimited continuations or Multicore OCaml's native effects.
Cone (Jonathan Goodwin; github.com/jondgoodwin/cone) targets LLVM but its design, gradual memory management mixing single-owner, refcounted, traced, lifetime-bound, and arena strategies in the same program plus lockless/locked permissions for sharing, is a useful menu for what a Mochi could expose. Cone is research-grade, not production-ready, but the design notes are the clearest articulation of "let each allocation pick its strategy."
POPL/PLDI 2024-2025 effect-handler progress: Soundly Handling Linearity (POPL 2024) combined linear types with multi-shot handlers via control-flow linearity; Affect (POPL 2025) uses affine types on continuations so the compiler knows which optimisations are sound; Handling the Selection Monad (PLDI 2025) extended handlers with choice continuations; "Tracing JIT for Effects and Handlers" (Gaißert, Bolz-Tereick, Brachthäuser, OOPSLA2 2025) is the first JIT-focused work. The actionable takeaway: tracking continuation linearity in the type system pays for itself in code quality.
7. Hare, Zig, Odin
These are C-replacement systems languages, not C-targeting transpilers, but they teach portability lessons. Zig bundles LLVM + Clang + headers for ~97 libcs (~130 MiB uncompressed) and exposes zig cc as a turn-key cross-compiler (ziglang.org/learn/overview). Its compile-time evaluation (comptime) emulates the target architecture, which is the same trick a transpiler can use to fold target-specific intrinsics at compile time. Hare uses QBE as its backend (small enough to fit on a floppy), the lesson is that you do not need LLVM if you accept the optimisation hit; QBE is auditable in days. Odin has no methods, leading to C-style parser_parse_statement(Parser *p) prefixed-function code; this is the cost of avoiding method dispatch and an argument for keeping method syntax in Mochi even at the C-generation layer.
Common design discipline across all three: no hidden control flow, no hidden allocations, explicit allocators passed as arguments. Zig's "comptime emulates target" plus "no preprocessor / no macros" is the cleanest demonstration that you can have rich metaprogramming without textual hacks.
8. MLton
MLton (mlton.org) is the canonical whole-program SML compiler and the classic reference for the defunctorize → monomorphise → defunctionalize pipeline. Defunctorisation duplicates each functor at every application and renames variables to eliminate structures. Monomorphisation duplicates each polymorphic type/function at every instantiation. Defunctionalisation (Reynolds 1972) replaces higher-order functions with a tagged record of free variables plus a top-level apply function dispatching on the tag. The output IR is simply-typed first-order SSA; both a native and a C backend exist.
Reported code-size blowup is bounded, empirically ~30% max in MLton. Recent work (Lambda Set Specialization, Brandon et al.) extends defunctionalisation to lift restrictions on how function values may be used, recasting it as type monomorphisation over a polymorphic function-flow type system; evaluations target MLton, OCaml, and Morphic. The implication for Mochi: whole-program defunctionalisation is a credible, implementable-by-a-small-team alternative to LLVM-style polymorphic optimisation if you accept whole-program compilation.
9. Cosmopolitan libc
Cosmopolitan (jart/cosmopolitan; ape/specification.md; justine.lol/cosmopolitan/) reconfigures stock GCC/Clang to emit APE, a polyglot of ELF, Mach-O, PE, and bootable disk image where the magic bytes are valid x86 instructions in all modes. A single binary then runs on Linux, macOS, Windows, FreeBSD, OpenBSD, NetBSD, and BIOS on x86-64; macOS arm64 was added recently. The cosmocc wrapper (tool/cosmocc/) is essentially gcc with a different linker script and cosmopolitan.a. The apelink tool weaves multiple native binaries plus tiny APE loaders into a self-extracting shell-script header so each platform extracts and re-execs the right binary at first run. As of v3.5.8 (mid-2024) hello-world bytes are byte-identical across all platforms, that is the reproducibility win.
The major limitation: no cross-platform dlopen (DLL+DYLIB+SO ABIs cannot be polyglotted), worked around in projects like llamafile by manually loading a platform-specific binary that then uses the OS's own dlopen. For Mochi: if you emit ANSI C99 and link with cosmocc, you get a single binary artifact for free, extraordinary value-per-line.
10. emscripten and wasi-libc
Emscripten (emscripten.org) is LLVM-clang-to-WASM with an emulated POSIX layer. The 2019 switch to the upstream LLVM Wasm backend (Emscripten 1.39) removed fastcomp. Two output modes: standard (.wasm + .js glue, depends on Emscripten's JS runtime for longjmp, C++ exceptions, time, console, WebGL) and STANDALONE_WASM (no JS, uses WASI APIs, runnable in Wasmer/Wasmtime/WAVM). wasi-libc is the alternative, pure musl-derived libc with WASI syscalls, usable directly with clang --target=wasm32-wasi --sysroot=$WASI_SYSROOT.
The WASI ecosystem in 2024-2025 underwent a major iteration: WASI Preview 2 / 0.2 (Bytecode Alliance, early 2024) added the Component Model and "worlds" (wasi-cli, wasi-sockets, wasi-clocks, wasi-random); WASI 0.3 / Preview 3 is expected in 2025 with native async I/O via the Component Model (eunomia.dev WASI status, Feb 2025). Wasmtime was the first runtime with full WASI 0.2 support. Threading remains experimental (wasi-threads proposal; not in Preview 2 mainstream). For Mochi-to-C-to-WASM, the cleanest path is to emit ANSI C with no exotic libc dependencies and let clang --target=wasm32-wasi handle the rest.
11. TigerBeetle, Bun, zig cc in practice
zig cc is now mainstream as a portable cross-compiler. TigerBeetle uses zig cc for its Go client library CI on Windows and Linux (tigerbeetle/src/clients/go/ci.zig) because gcc is not available on Windows out of the box. Bun is built largely on top of Zig + WebKit. Uber uses Zig as the cross-compiler for its Go monorepo (Jakštys blog). A July 2024 guide (jcbhmr.com) walks through CMake + zig cc cross-compilation; a small zig-ar wrapper is needed because CMake's CMAKE_AR does not accept commands with arguments.
The reproducibility angle: reproducible-builds.org reports through 2024 (March, September, October) document toolchain hermeticity gains; the September 2024 Android D8 bug ("DEX output depends on core count") is a cautionary tale for Mochi, any parallelism in the build can poison reproducibility unless ordering is forced. cosmocc and zig cc together cover the entire matrix: zig cc for cross-arch, cosmocc for single-binary multi-OS, both deterministic. A 2026 Mochi-to-C transpiler should target clang -std=c99 -pedantic as the canonical compiler and validate against zig cc and cosmocc as portability oracles.
12. Conservative GC in C: bdwgc and what's new
bdwgc (github.com/bdwgc/bdwgc) is at 8.2.10 (Oct 2025). The core algorithm, mark-sweep over the heap, conservative pointer identification on stacks and registers, has not changed in a decade, but incremental and generational modes (under VM support) plus prefetching strategies do meaningful work. Multithreading is supported with two caveats: full signal-safety costs two syscalls per malloc (usually unacceptable, default-off), and TLS is not scanned, pointers in pthread_getspecific must be duplicated onto the thread stack. CMake is the cross-platform build path (cmake.md).
Practical experience (jank Clojure dialect, HN comment 2024): "stupid simple to integrate and surprisingly fast … MPS, MMTk, others are worlds apart in dev work." For a Mochi-to-C transpiler the calculus is: bdwgc gives you a working memory manager in one link line; you trade ~10-30% throughput vs precise/generational and lose moving/compacting; you keep ANSI-C portability.
13. Precise reference counting: Perceus and successors
Perceus (Reinking, Xie, de Moura, Leijen, PLDI 2021) is the modern canonical algorithm. The compiler inserts dup/drop operations such that cycle-free programs are garbage free, objects are freed at the exact moment of last use. Key rules: delay dup to leaves, generate drop immediately after the binding goes dead; the linear resource calculus λ1 proves soundness. Reuse analysis pairs a drop of size-S with a fresh allocation of size-S into a single in-place update when refcount is 1 at runtime, which enables the FBIP (functional but in-place) programming style, purely-functional code that runs imperative-fast. Frame-Limited Reuse (Lorenzen & Leijen, ICFP 2022) generalises this to "drop-guided reuse" with provable bounded extra memory per call; FP² (Lorenzen, Leijen, Swierstra, ICFP 2023) adds the static fip keyword guaranteeing constant stack + zero allocation; Oxidizing OCaml (ICFP 2024) brings the same affinity/uniqueness/locality modes to OCaml.
Companion techniques: Lobster (Wouter van Oortmerssen) reportedly elides 95% of refcount ops via compile-time flow-sensitive lifetime analysis (strlen.com/lobster), closer to move semantics than to traditional RC. Swift ARC uses an SSA-with-ownership IR (OSSA) where the compiler inserts retain/release after type checking; ARC optimisations include retain-release pairing, retain motion across barriers, and "observed object lifetimes are emergent" (WWDC 2021 ARC session). Lean 4 uses the same Perceus algorithm but exposed via the borrowed attribute and isShared IR primitive (lean-lang.org/doc/reference/latest/Run-Time-Code/Reference-Counting).
Cycle collection: Bacon-Rajan (ECOOP 2001, "Concurrent Cycle Collection in Reference Counted Systems") is the standard solution. Two insights: (1) cycles can only form when a decrement leaves a non-zero count, so candidate roots are exactly those decrements, and (2) inside a cycle all RCs are internal, so subtracting them via local DFS finds the cycle. The synchronous version is used in PHP 5.3+; the concurrent version achieves 6 ms max pause in Jalapeño. PHP-style "generational" cycle collection (Python) is a heuristic on top. For Mochi: Perceus + FBIP-style reuse + Bacon-Rajan cycle collection as a backup gives a garbage-free in the common case system with a safety net.
14. Region/arena inference
Tofte-Talpin (TOPLAS 1997, "A region inference algorithm", doi 10.1145/291891.291894; implemented in the MLKit) is the seminal type-and-effect system inferring region lifetimes for ML; the runtime stack pushes/pops regions with all objects deallocated together. The big practical complaint: small program changes can cause "drastic, unintuitive effects on object lifetimes."
Cyclone (Grossman, Morrisett et al., PLDI 2002, "Region-Based Memory Management in Cyclone") adapted regions to a low-level explicit-region language. Key contributions: region subtyping (LIFO scoping induces an "outlives" relation), simplified effects without effect variables, dynamic regions (not tied to a lexical scope) for data that needs to escape, and default effects computed from function prototypes to preserve separate compilation. "Linear Regions Are All You Need" (Fluet, Morrisett, Ahmed, ESOP 2006) shows λrgnUL, a substructural type system that can encode Tofte-Talpin AND Cyclone dynamic regions, removing the LIFO restriction.
Recent: Spegion (arXiv:2506.02182, 2025) introduces implicit, non-lexical regions with sized allocations; Elsman's "Double-Ended Bit-Stealing for Algebraic Data Types" (ICFP 2024); Koparkar et al., "Garbage Collection for Mostly Serialized Heaps" (ISMM 2024); "Optimistic Stack Allocation and Dynamic Heapification for Managed Runtimes" (Anand et al., PLDI 2024), JITs optimistically stack-allocate based on escape analysis and dynamically heapify if the optimism fails, with a stack-ordering analysis to amortise the check. The lesson for Mochi: region inference is brittle as a primary memory strategy, but arena-per-request or arena-per-handler is the cheapest possible "GC" and pairs beautifully with Roc-style platform/host split.
15. Closures in C
The exhaustive treatment is Matt Might's "Compiling Scheme to C with flat closure conversion," Hokstad's "How to implement closures," and the Roc post "How Roc Compiles Closures." Four strategies, each with concrete trade-offs:
- Fat pointer (
struct { void* fn; void* env; }passed everywhere). Portable, simple; cost is one extra register/argument per call site and an explicit unpack at the call. Roc uses this with the optimisation thatenvisunionover known environments and can live on the stack when the closure does not escape. - Thunk (mmap a small RWX page, write machine code that pushes the env then jumps to the body, return a plain
void(*)()). Callers see a regular function pointer; cost is W^X-unfriendly, architecture-specific code generation. Apple Silicon and many hardened Linuxes now forbid this withoutmprotectdance plusPT_GNU_STACKexemptions. - Object dispatch (closure is an object with a
callmethod). Uniform with an object system; full vtable cost. - Tagged-union dispatch (when whole-program analysis enumerates the closure values, emit
switch(tag)calling the right top-level function directly). No indirect call; Roc's preferred path when defunctionalisation succeeds.
For Mochi: pick fat-pointer as the default but implement tagged-union dispatch where the analysis is trivial (e.g. callbacks visible in scope), and never emit thunks (they break W^X and reproducibility).
16. Coroutines and stack switching in C
Three viable approaches in 2026 (Wikipedia "Coroutine," Boost.Context, libdill issue #30, Wikipedia "Protothread"):
ucontextfamily (makecontext,swapcontext) is obsolete on macOS (deprecated POSIX 1.2008) and not in many embedded libcs.- Boost.Context / libdill / libco style, hand-written asm per architecture (x86-64 SysV, x86-64 Win64, aarch64, riscv64) that swaps callee-saved registers + stack pointer. libdill benchmarks ~13-16ns context switch, ~25-44ns coroutine creation. Caveats: signal-mask preservation costs a
rt_sigprocmasksyscall (kills perf), and signal handlers run in the current coroutine's stack. - Stackless / Protothreads / Duff's device / computed-goto, uses
switchto fake resumption from yield points; locals do not survive across yield (must be in an explicit state struct). Very cheap (bytes of memory per coroutine); cannot yield from within a nested switch unless using GCC computed gotos. This is the model C++20 coroutines chose, and what async/await ends up being.
For Mochi: the stackless model is the right default for async/generators because it survives sandboxes (WASM, BIOS, embedded), and the per-target asm switching can be added later as an optimisation for true threads.
17. Effect handlers compiled to C
The literature breaks into three families. Direct stack capture, what Multicore OCaml does, copy or segment the native stack on perform, resume by jumping back to a saved context. Best performance at runtime, requires DWARF cooperation and per-arch asm. Capability-passing / evidence-passing (Schuster, Brachthäuser, Ostermann, ICFP 2020, PACMPL 4/ICFP/93; Xie & Leijen, ICFP 2021, PACMPL 5/ICFP/127), pass the handler implementation as a runtime value (a capability) so calls to effect operations are direct calls into the handler; combine with iterated CPS to enable optimisation across effect calls. This is what Koka's C backend uses and what makes the Koka benchmarks competitive with hand-written C++. Monadic translation, Eff compiles to OCaml via type-and-effect-directed rewrites (Karachalias et al. ICFP 2021), erases effect info, emits tight code.
For Mochi-to-C, evidence passing is the right pick: pure C99, no asm, predictable perf, integrates naturally with refcounted environments. The Koka libhandler library (github.com/koka-lang/libhandler) is a literal blueprint. Recent POPL/PLDI/OOPSLA work (Soundly Handling Linearity, POPL 2024; Affect, POPL 2025; Tracing JIT for Effects, OOPSLA2 2025) consistently shows that tracking continuation linearity in the type system (one-shot vs multi-shot) enables much better codegen, Mochi should expose this distinction.
18. Sum types in C
The portable C lowering of an ADT is a discriminated union: struct { tag_t tag; union { variant_A a; variant_B b; ... } u; }. Size = sizeof(largest_variant) + sizeof(tag) + padding. Three optimisations matter:
- Niche-filling (Rust's
Option<&T>taking zero extra bytes by using the null pointer as theNonetag; RFC 2195), detect "forbidden" bit patterns in payload types and reuse them as discriminators. Implement by tracking per-typenichemetadata:&Thas one niche (null),boolhas 254,enum Direction { N, E, S, W }as u8 has 252. - Pointer tagging, exploit alignment: on 64-bit, an 8-byte-aligned pointer has the low 3 bits free. OCaml's classic value representation uses bit 0 as the int tag. Lean 4 uses tagged pointers for small constructors.
- Repr-C escape hatch, for FFI, emit a fixed-layout
{int tag; union body;}, lose niche optimisation, gain ABI stability.
For Mochi: implement niche-filling for the common cases (Option<&T>, Option<Box<T>>, single-niche enums), it is high-impact and the analysis is local.
19. Pattern matching compilation
The standard algorithm is Luc Maranget's decision-tree algorithm (Maranget, ML Workshop 2008, "Compiling Pattern Matching to Good Decision Trees," doi 10.1145/1411304.1411311). The compiler maintains a pattern matrix + occurrence + action vectors; recursively chooses a column to split on using a necessity heuristic ("number of rows for which this column is needed"); shares the resulting DAG to avoid code blowup; produces a tree that never tests the same subterm twice. The alternative, backtracking automata (Le Fessant & Maranget 2001), is more compact but may re-test. Modern decision-tree compilers (e.g. Tobin-Hochstadt's Racket match) combine both.
For Mochi: a single-pass Maranget decision tree, emitted as a chain of switch/if over discriminator fields, gives O(1)-per-test code that the C compiler can further optimise. Crumbles.blog (Nov 2025, "Tour of a pattern matcher: decision trees") has a current implementation walkthrough.
20. Other stream / dataflow / Datalog targets
Lustre (Halbwachs et al. 1991) is the synchronous-dataflow language behind SCADE; its classic compiler synthesises a finite-state machine in C where the "step" function is called once per clock tick. Vélus (Bourke et al., PLDI 2017, "A formally verified compiler for Lustre," doi 10.1145/3062341.3062358) is a CompCert-verified compiler from normalised Lustre to Clight, handling sampling, nodes, and delays, with the dataflow→imperative transformation proven correct. The CoCompiler (miniKanren 2025, arXiv:2510.00210) re-encodes Vélus in the Walrus relational language so the Lustre↔C compilation runs both directions, interesting for round-tripping codegen.
Soufflé (souffle-lang.github.io) translates Datalog to heavily-templated parallel C++ via a relational-algebra machine (Futamura projection on semi-naive evaluation). Magic-set transformation is available behind -m; data-structure choices (B-trees vs tries) are auto-selected via VLDB-published index analysis. Compiled mode beats interpreted by 10-100× for large workloads.
Pony / ORCA (Clebsch et al., OOPSLA 2017, "Orca: GC and Type System Co-Design for Actor Languages," doi 10.1145/3133896), fully concurrent per-actor GC with no global synchronisation, requiring only that (i) actor behaviours are atomic and (ii) messages are causally delivered. Each actor owns its heap; cross-actor references use weighted reference counting via INC/DEC messages sent only to the owning actor. Actor collection itself handles dead-cycle detection. This is the right reference for any Mochi actor extension.
21. Escape analysis for C codegen
Standard textbook material (Wikipedia "Escape analysis"; Park & Goldberg 1992 "Higher order escape analysis"), track pointer flow, prove a pointer does not outlive the procedure, stack-allocate. The 2024 PLDI "Optimistic Stack Allocation and Dynamic Heapification for Managed Runtimes" paper (CSE IIT Bombay preprint) is the recent SOTA: combine static escape analysis with JIT-time optimistic stack allocation plus a "stack walk" run-time check that heapifies objects (moving them to the heap and patching references) if the optimism turns out to be wrong. Microsoft's 2026 HotSpot JEP proposes true stack allocation that preserves object shape (vs scalar replacement) for objects at control-flow merges; claims up to 15% heap allocation reduction on Renaissance and Scala DaCapo.
For Mochi: a simple intra-procedural escape analysis (does any reference flow into a global, a return value, a heap field, or a closure environment?) catches the easy wins; combine with the FBIP / Perceus reuse path for the hard cases.
22. Distilled lessons for a Mochi-to-C transpiler in 2026
-
Lower in stages, not in the C emitter. Nim's painful experience and the Nimony rewrite agree: do iterator inlining, lambda lifting, deref injection, destructor injection, exception lowering as separate passes over your own IR before you ever touch C. NIF's separation of tag and identifier namespaces is a clean trick worth borrowing.
-
Pick Perceus + FBIP as the default memory model, with Bacon-Rajan cycle collection as the safety net and bdwgc as the "off switch" for users who do not care. This matches Koka's proven design, integrates with effect handlers via evidence passing, and gives you predictable, sub-millisecond pause behaviour without a moving GC.
-
Closures as fat pointers, defunctionalise where you can. Default to
{fn*, env*}; emit MLton/Roc-style defunctionalisation when whole-program analysis succeeds; never generate thunks (W^X, code-signing, WASM all hate them). -
Effect handlers via evidence passing, not stack capture. Pure C99 output, no per-architecture asm, blueprint exists in Koka's libhandler. Track continuation linearity (one-shot vs multi-shot) in the type system per POPL 2024/2025, the optimisation payoff is large.
-
Sum types: niche-fill
Option<&T>,Option<Box<T>>, and small enums; otherwise lay out as{tag; union}. Pattern match via Maranget decision trees emitted as cascadedswitch. -
Coroutines: stackless by default (computed-goto state machines in the Protothread / C++20 style), stack-switching as an opt-in for true threads, preserves WASM and embedded portability.
-
Two-tier compilation à la Roc. Emit the program as a
.owith a small, documented C ABI; ship a default "host" platform written in C that doesmain()+ I/O; let users replace the host. This is the cleanest model for embedding Mochi in larger systems. -
Target
clang -std=c99 -pedanticas canonical, validate withzig ccandcosmocc. This gives you (a) cross-architecture cross-compilation viazig cc -target ..., (b) single-binary multi-OS viacosmocc, (c) WASM viaclang --target=wasm32-wasi, and (d) byte-reproducible builds if you discipline the codegen (deterministic name mangling, sorted iteration, no time/PID in output). -
Region/arena allocation as a programmer-visible escape hatch, not as the primary strategy. Tofte-Talpin region inference is too brittle for a primary model (per MLKit experience); but per-request arenas + Spegion-style sized regions integrate cleanly with Roc-style host platforms.
-
Generate readable, debuggable C. Vala's success rests on its C output being legible and using familiar idioms (GObject,
g_signal_emit, etc.). Mochi-to-C should emit code that a curious user can read, with#linedirectives back to Mochi source sogdband sanitisers work. -
Reproducibility is a first-class feature. The 2024 Reproducible Builds reports (Android D8 core-count bug, Bazel hermeticity issues) prove that any nondeterminism, iteration order, time, PID, parallelism, will eventually bite. Bake hermetic codegen in from day one.
-
Borrow Lean 4's pipeline diagram: source → A-normal form → IR with explicit
inc/dec/reuse/isShared→ C/LLVM. Make the IR introspectable (trace.compiler.ir.result) so users can debug refcount surprises.
Sources
(URLs are preserved from the source agent; cited inline above.)
Nim & NIF: Nim Roadmap 2025; Nim ARC/ORC intro; Nim compilation pipeline (DeepWiki); nifspec; NIF spec doc; issues #23897, #21579.
Crystal: Crystal Wikipedia; src/gc/boehm.cr; Crystal GC API; Immix issue #5271.
Vala: valac man; Vala FAQ; Vala Signals tutorial; Vala Wikipedia.
OCaml / Multicore / Modes: arXiv:2104.00250 (Retrofitting Effect Handlers onto OCaml); ACM PLDI 2021; OCaml 5.3 release / changelog; "Oxidizing OCaml with Modal Memory Management" ICFP 2024 (Lorenzen, White, Dolan, Eisenberg, Lindley).
Roc: Roc Fast / Platforms pages; "How Roc Compiles Closures" (rwx.com); Utrecht thesis "Reference Counting with Reuse in Roc."
Koka / Perceus / FBIP / FIP / Evidence: Perceus PLDI 2021 PDF and ACM; Frame-Limited Reuse ICFP 2022 (PACMPL 6/ICFP); FP² ICFP 2023 (PACMPL 7/ICFP/198); Schuster et al. ICFP 2020 (PACMPL 4/ICFP/93); Xie & Leijen ICFP 2021 (PACMPL 5/ICFP/127); Koka GitHub; libhandler GitHub.
Lean 4: Lean 4 paper PDF (CADE 2021); Lean 4 reference, Reference Counting; Lean DeepWiki; Lean 4.23 release notes.
Cone: github.com/jondgoodwin/cone; PLDB Cone entry.
Eff: github.com/matijapretnar/eff; "Eff Directly in OCaml" arXiv:1812.11664; "Efficient Compilation of Algebraic Effect Handlers" ICFP 2021 (PACMPL 5/ICFP).
Hare / Zig / Odin: Zig overview; colinsblog.net systems-languages II; zig cc OSDev; jcbhmr.com zig cc + CMake 2024.
MLton: mlton.org; "Whole-Program Compilation in MLton" (Weeks); SIGPLAN blog "Defunctionalization Everybody Does It, Nobody Talks About It."
Cosmopolitan: github.com/jart/cosmopolitan; ape/specification.md; justine.lol/cosmopolitan/; cosmocc README; APE DeepWiki.
Emscripten / WASI: emscripten.org/docs/compiling/WebAssembly.html; WASI Component Model status 2025 (eunomia.dev); Standalone WASM wiki.
TigerBeetle / zig cc / reproducibility: tigerbeetle Go client ci.zig; papercompute.com Cross-Compiling CGo; reproducible-builds.org reports 2024-09/10.
GC, RC, regions: bdwgc GitHub; Boehm tutorial PDF; Boehm GC Wikipedia; MMTk about; Bacon-Rajan ECOOP 2001 PDF; Tofte-Talpin TOPLAS 1997; "Linear Regions Are All You Need" ESOP 2006; "Optimistic Stack Allocation and Dynamic Heapification for Managed Runtimes" PLDI 2024 (CSE IIT Bombay preprint); Spegion arXiv:2506.02182.
Lobster / Swift ARC: strlen.com/lobster; aardappel.github.io/lobster reference; WWDC 2021 ARC session.
Closures, coroutines: Matt Might "Compiling Scheme to C"; Hokstad closures; Wikipedia Coroutine and Protothread; Boost.Context; libdill issue #30.
Tagged unions, niche: Rust RFC 2195; Wikipedia "Tagged union"; patshaughnessy.net on tagged unions.
Pattern matching: Maranget ML 2008 PDF + ACM doi 10.1145/1411304.1411311; crumbles.blog Nov 2025.
Datalog, dataflow, actors: Soufflé docs and magic-set; Vélus PLDI 2017; Lustre Verimag; CoCompiler arXiv:2510.00210; Orca OOPSLA 2017 (Clebsch et al., doi 10.1145/3133896); Pony GC tutorial.
Effects 2024-2025: Soundly Handling Linearity POPL 2024; Affect POPL 2025; Handling the Selection Monad PLDI 2025; "Simplifying explicit subtyping coercions" arXiv 2024.