MEP 3. Abstract Syntax Tree
| Field | Value |
|---|---|
| MEP | 3 |
| Title | Abstract Syntax Tree |
| Author | Mochi core |
| Status | Informational |
| Type | Informational |
| Created | 2026-05-08 |
| Revised | 2026-05-11 |
Abstract
Mochi's abstract syntax tree is a closed family of tagged unions. Every choice point in the language (a statement, a type reference, a postfix operator, a primary expression, a literal) is encoded as a struct whose fields are mutually exclusive: exactly one is populated on a well-formed parse. The tags themselves are the field names; their identity is checked post-parse by a normalisation pass and asserted by a contract validator before any consumer sees the tree.
Grammar productions resolve to a smaller post-normalised shape, and that shape is the AST. MEP 2 fixes what can be parsed; this MEP fixes what consumers (type checker, interpreter, code generators) are allowed to see.
The companion documents are MEP 1 (Lexer) and MEP 2 (Grammar).
Motivation
An AST is the interface between the parser and every later pass. The harder that interface is to specify, the more places a language can quietly grow corner cases. Three forces shape this revision:
- One legal shape per node. A node that admits two simultaneous interpretations forces every consumer to disambiguate. Each pass picks a different precedence, the language drifts, and a "fix" in one pass becomes a regression in another. A tagged union with exactly one arm set has only one interpretation by construction.
- Soundness is parser-checkable. Whatever the type checker, optimiser, or codegen needs to assume about the tree, the parser must guarantee. Assumptions that hold "in practice" but not "by construction" are bugs waiting on the right input. The post-parse validator turns those assumptions into runtime contracts that fail loudly the first time they break.
- No dead arms. The AST is small enough that every field carries weight. A field that was once optional but is now always populated, or always populated for one variant but never for another, is documentation rot. The shape consumers see must match the shape the documentation describes, literally.
The previous revision of this MEP catalogued the node set without committing to which arms were mutually exclusive, which were sometimes both populated, or what the validator was allowed to assume. The catalogue was a description of the implementation rather than a contract for it. This revision states the contract.
Specification
Encoding
Every choice point is a struct with one nullable field per alternative. The field name is the tag. After parsing and normalisation, exactly one field on the struct is non-nil; if zero or two are non-nil the parser raises a P070 diagnostic before returning the tree.
Choice = struct {
Pos lexer.Position
Arm₁ *Variant₁ // exactly one of Arm₁..Armₙ is non-nil
Arm₂ *Variant₂
...
Armₙ *Variantₙ
}
There is no enum tag, no discriminator integer, and no shared-base interface. The non-nil field is the discriminator. This shape is what the golden JSON corpus pins (each arm carries omitempty, so the JSON view of a well-formed tree shows exactly one arm per choice node).
A choice node also carries a Pos lexer.Position so any later pass can attach diagnostics to source.
Programs
Program {
Pos lexer.Position
Package string // empty for files without `package`
PackageDoc string // doc comment block immediately preceding `package`
Statements []*Statement
}
A program is an ordered list of statements; order is meaningful (statements may shadow earlier bindings). Doc comments are attached by a post-parse pass over the source, described under §"Doc-comment attribution" below.
Statements
Statement is a 29-arm tagged union. Each arm corresponds to one keyword-led or shape-led production from MEP 2.
Statement {
Pos lexer.Position
// declarations
Import *ImportStmt Type *TypeDecl
ExternType *ExternTypeDecl ExternVar *ExternVarDecl
ExternFun *ExternFunDecl ExternObject *ExternObjectDecl
// bindings
Let *LetStmt Var *VarStmt
Assign *AssignStmt Fun *FunStmt
// control flow
Return *ReturnStmt If *IfStmt
While *WhileStmt For *ForStmt
Break *BreakStmt Continue *ContinueStmt
// observability
Test *TestBlock Bench *BenchBlock
Expect *ExpectStmt
// streams, agents, logic
Agent *AgentDecl Stream *StreamDecl
Model *ModelDecl On *OnHandler
Emit *EmitStmt Fact *FactStmt
Rule *RuleStmt
// I/O
Fetch *FetchStmt Update *UpdateStmt
// expression as statement
Expr *ExprStmt
}
The validator asserts exactly one arm is set. Pos is the position of the statement's first token, regardless of which arm is set.
Doc-comment attribution
Doc comments are written as a contiguous block of /// line comments. Two rules define attribution:
- A
///block ending on lineLattaches to the first non-blank, non-comment line at or afterL+1. If that line is blank, the block detaches and is dropped. - The block attaches to a declaration's
Docfield. The eligible declaration kinds areFunStmt,LetStmt,VarStmt,TypeDecl(and eachTypeMember),AgentDecl, andStreamDecl(and eachStreamField). A block above any other shape is dropped.
Attribution runs after parsing as a deterministic second pass over the source, not as a grammar production. The pass is observationally part of the parser contract: a conforming implementation that returns a Program returns one whose Doc fields are populated by this rule. Two consumers reading the same source agree on Doc strings.
The grammar does not carry doc comments because attaching them inside productions would force the participle struct tags to grow a field for every comment-eligible site. The post-pass keeps the grammar small and the attribution rule explicit.
Type declarations
TypeDecl {
Pos lexer.Position
Name string
Doc string
Members []*TypeMember // struct form
Variants []*TypeVariant // single- or multi-variant union
Alias *TypeRef // function-type or generic-type alias
}
Exactly one of {Members, Variants, Alias} is populated. The grammar admits four surface forms (type T { ... }, type T = { ... }, type T = V₁ | V₂ | ..., type T = SomeTypeRef), which the normalisation pass collapses onto these three shape arms.
Members is the struct form (with or without the leading =).
Variants is the union form. A single-variant declaration such as type Pair = P(a: int, b: int) is lifted from a SingleVariant parser artifact into a one-element Variants slice. The artifact field is not reachable from a normalised tree; the validator rejects any tree that still carries it.
Alias is a type-reference alias (type F = fun(int):int, type L = list<int>). A bare-identifier alias such as type Id = int resolves through Simple on the right-hand TypeRef.
A type checker distinguishes a true alias from a single-variant declaration by whether the sole variant carries fields. The decision lives in the type system, not in the AST shape, because both forms are valid for different roles.
Type references
TypeRef {
Fun *FunType // fun(P₁, P₂, ...) [: R]
Generic *GenericType // Name<A₁, A₂, ...>
Struct *InlineStructType // { f₁: T₁, ... }
Simple *string // bare identifier
}
Exactly one of {Fun, Generic, Struct, Simple} is populated. Simple is *string because the parser captures the identifier directly; using *string keeps the choice tagged-by-presence with the other three pointer-typed arms.
There is no Union arm. A union type by name resolves through Simple to a UnionType in the type environment. There is no inline union literal: let x: int | nil = ... does not parse today. Keeping union types nominal is a soundness decision: structural unions would force every type position to carry a normalisation pass over its alternatives, and would surface to programmers as a second way to spell Maybe<T>.
Expressions
Expr { Pos lexer.Position, Binary *BinaryExpr }
BinaryExpr { Left *Unary, Right []*BinaryOp }
BinaryOp { Pos lexer.Position, Op string, All bool, Right *Unary }
Unary { Pos lexer.Position, Ops []string, Value *PostfixExpr }
PostfixExpr { Target *Primary, Ops []*PostfixOp }
PostfixOp { Pos lexer.Position,
Call *CallOp, Index *IndexOp, Field *FieldOp, Cast *CastOp }
A binary expression is flat: a leading Unary and a list of (operator, right-hand Unary) pairs. Both operands of every binary level are the same Unary type. This is the operand-symmetry property fixed by MEP 2 §Layered precedence with operand symmetry. The AST does not carry a precedence tree; that tree is built by the type checker per the language-defined ladder. A future implementation that parses directly into a precedence tree would accept the same set of programs and produce an observationally equivalent AST under the same conformance.
Unary.Ops is a slice of strings ("-" or "!"), applied right to left. The slice is empty for a Unary with no prefix.
PostfixOp is a four-arm tagged union (Call, Index, Field, Cast) and carries its own Pos, so a diagnostic on "the third subscript in a chain" can point at the operator rather than the operand. Exactly one arm is set per op. A postfix chain is a flat list of these ops applied left to right.
Primary expressions
Primary {
Pos lexer.Position
// composite constructors
Struct *StructLiteral
List *ListLiteral
Map *MapLiteral
// expressions
Call *CallExpr
Query *QueryExpr
LogicQuery *LogicQueryExpr
If *IfExpr
Match *MatchExpr
Generate *GenerateExpr
// I/O expressions
Fetch *FetchExpr
Load *LoadExpr
Save *SaveExpr
// values
Selector *SelectorExpr // identifier with optional dotted tail
FunExpr *FunExpr // lambda
Lit *Literal
Group *Expr // parenthesised
}
Exactly one arm is populated per primary. The grammar's ordered choice gives StructLiteral priority over Selector: the lookahead Ident '{' commits to a struct literal before the identifier could be reread as a bare selector followed by a block.
FunExpr carries two mutually exclusive bodies: BlockBody []*Statement for fun() { ... } and ExprBody *Expr for fun() => expr. Exactly one is set after normalisation; the validator asserts this.
Literals
Literal {
Pos lexer.Position
Int *IntLit
Float *float64
Bool *boolLit
Str *string
Null bool // true iff `null`
}
Exactly one of {Int, Float, Bool, Str, Null} is set. The Null arm is a bool flag rather than a pointer because the value carries no payload; the validator treats Null = true as the populated arm. IntLit is a struct (not a bare *int64) so it can carry the original base (decimal, hex, binary, octal) and original literal text for diagnostics.
Patterns
There is no Pattern AST. A MatchCase.Pattern is an *Expr. The type checker classifies the expression at check time:
- A
Literalmatches by equality. - A bare
Selectorwith no tail acts as a wildcard binding. - A
Callof the formTag(a, b, c)matches a tagged-union variant by name and binds field arguments. - The identifier
_is the discard wildcard.
The reuse trades surface generality against parse-time precision (see §"Patterns reuse Expr"). The same reuse is a known soundness gap (see §"Soundness, remaining gaps").
Control flow
IfStmt { Pos, Cond *Expr, Then []*Statement,
ElseIf *IfStmt, Else []*Statement }
WhileStmt{ Pos, Cond *Expr, Body []*Statement }
ForStmt { Pos, Name string, Source *Expr, RangeEnd *Expr, Body []*Statement }
IfStmt.ElseIf and IfStmt.Else are mutually exclusive at the same level: an if either continues with another if chained through ElseIf or terminates with a flat Else block. The validator asserts this exclusion (P070); the grammar admits the impossible co-occurrence only as a participle-level artifact.
ForStmt.RangeEnd is non-nil for for i in 0..10 and nil for for x in xs. The two forms share an AST shape so loop lowering does not need a separate node kind.
I/O nodes
LoadExpr { Pos, Path *string, Type *TypeRef, With *Expr }
SaveExpr { Pos, Src *Expr, Path *string, With *Expr }
FetchExpr { Pos, URL *Expr, With *Expr }
FetchStmt { Pos, URL *Expr, Target string, With *Expr }
UpdateStmt{ Pos, Target string, Set *MapLiteral, Where *Expr }
EmitStmt { Pos, Stream string, Fields []*StructLitField }
Path is *string, not string. Three values matter and the type must distinguish them:
Path == nil(noto ...clause): the runtime uses stdin (load) or stdout (save).Path == &"some/file": the runtime reads from or writes to the given path.Path == &"": rejected at parse time asP062. The grammar admits it because the literal is independently optional, but""is indistinguishable from the absent form at runtime, and silently accepting it produces unrelated I/O failures downstream. The normalisation pass converts this corner case into a positioned diagnostic.
Logic and streams
StreamDecl { Pos, Name, Doc, Fields []*StreamField }
ModelDecl { Pos, Name string, Fields []*ModelField }
OnHandler { Pos, Stream, Alias string, Body []*Statement }
AgentDecl { Pos, Name, Doc, Body []*AgentBlock }
AgentBlock { Pos lexer.Position,
Let *LetStmt, Var *VarStmt, Assign *AssignStmt,
On *OnHandler, Intent *IntentDecl }
IntentDecl { Pos, Name string, Params []*Param, Return *TypeRef,
Body []*Statement }
FactStmt { Pos, Pred *LogicPredicate }
RuleStmt { Pos, Head *LogicPredicate, Body []*LogicCond }
LogicCond { Pos lexer.Position, Pred *LogicPredicate, Neq *LogicNeq }
AgentBlock and LogicCond are tagged unions like Statement, scoped to their containers. Both carry their own Pos, so diagnostics on a rule body or an agent body can point at the body itself rather than at the first contained child. The validator asserts the same exactly-one-arm rule on both. ModelDecl carries no Doc field; the asymmetry with StreamDecl is intentional (models are anonymous descriptors of an external resource, streams are first-class language entities).
Design principles
Tagged union by nullable field
Mochi encodes choice nodes as Go structs with mutually exclusive nullable fields rather than as interface types with a node() method or as integer-discriminated unions. The encoding earns three properties:
- JSON shape is the contract. The golden test corpus diffs the JSON view of every parse.
omitemptyon each arm produces a JSON where only the populated arm appears. A discriminated union with a tag integer would either bloat the JSON (every node carries its tag) or require a custom marshaller (every node must know its tag map). The nullable-field encoding is the smallest shape that the standardencoding/jsonproduces correctly. - No
node()dispatch. With interfaces, every consumer pays for a virtual call per node. With nullable fields, the type switch isswitch { case s.Let != nil: ... }: branch prediction is exact, the compiler inlines the field test, and a node's shape is visible from its definition without chasing an interface satisfaction. - Single source of truth. The participle parser tag and the JSON tag sit next to each other on the same field. Add a new arm and both views update in lockstep. There is no separate enum to keep in sync.
The cost is that "exactly one arm set" is not statically enforced. The validator (§"Soundness") closes that gap at runtime.
AST owned by the parser package
There is no separate ast package. The AST structs live in parser/parser.go, sharing memory with the participle tags. The trade-off is direct: a separate package would let tooling depend on the AST without pulling in the parser, but it would also force every AST change to ripple through two packages and would create a moment in time where one package describes the AST and the other still uses the old shape. The language is small enough that this ripple matters more than the dependency.
If the AST grows to support multiple front ends, the calculus changes. See §"Open questions".
Flat binary expressions, precedence in the checker
The expression spine is a flat Left + Right* list rather than a precedence tree. The flatness has two justifications:
- The grammar is right-recursive. Participle's
participle/v2does not implement left-recursion memoisation. Writing the grammar in the right-recursiveOperand (op Operand)*shape is the natural translation. Building a precedence tree at parse time would require an extra pass; pushing it into the checker keeps the parser linear. - Operand symmetry is structural. Both sides of every binary operator are
Unary. A reader of the AST sees the symmetry immediately. A precedence tree would hide it behind whichever node happened to be the root.
The flatness is a parse-time choice. A consumer that wants a tree builds one (the type checker does, applying the precedence ladder from the language definition). The AST and the tree are observationally equivalent under that translation.
Patterns reuse Expr
Pattern syntax overlaps almost entirely with expression syntax: literals match by equality, identifiers bind, calls match constructors. A dedicated Pattern AST would duplicate the call, literal, and selector grammars and force the parser to know context. Today the parser parses an Expr and the checker classifies it. The cost is that the AST does not statically forbid a pattern that has no equivalent meaning (e.g. match x { (a + b) => ... }); the gain is a smaller grammar and a smaller AST.
This is a known soundness gap (§"Soundness, remaining gaps"). It is preserved for now because the gain currently outweighs the cost.
Nullable scalar for tri-state semantics
Path *string carries three states the language distinguishes: nil, "", and "some/path" mean three different things at runtime. The pointer type forces every consumer to decide which case it is handling. The P062 rejection of Path == &"" then collapses the language to two states (nil for stdio, non-empty string for a named target) before any consumer sees the tree. The grammar admits the third state only because the literal is independently optional in PEG; the normalisation pass closes the corner.
Soundness
Soundness in this MEP means: every assumption an AST consumer is allowed to make is enforced before the tree leaves the parser.
Enforced after this revision
A successful parser.Parse or parser.ParseString returns a tree that satisfies all of the following. Each is asserted by parser.assertProgramInvariants, run after parser.normalizeProgram. A violation is reported as a P070 diagnostic with the position of the offending node and a pointer to the contract-violation issue tracker; it never returns to a caller as a corrupt tree.
- Statement exactly-one-of. A
Statementhas exactly one of its 29 arms set. - Primary exactly-one-of. A
Primaryhas exactly one of its 16 arms set. - TypeRef exactly-one-of. A
TypeRefhas exactly one of{Fun, Generic, Struct, Simple}set. - TypeDecl shape exclusion. A
TypeDeclhas exactly one of{Members, Variants, Alias}set.SingleVariantis nil (it is a parser artifact that normalisation lifts intoVariants). - PostfixOp exactly-one-of. A
PostfixOphas exactly one of{Call, Index, Field, Cast}set. - Literal exactly-one-of. A
Literalhas exactly one of{Int, Float, Bool, Str, Null}set, treatingNull = trueas populated. - FunExpr body exactly-one-of. A
FunExprhas exactly one of{BlockBody, ExprBody}set. - If-stmt else exclusion. An
IfStmtdoes not have bothElseIfandElseset at the same level. - Load/Save path non-empty.
LoadExpr.PathandSaveExpr.Path, when non-nil, dereference to a non-empty string. The empty-string case is rejected at normalisation time asP062. - LogicCond exactly-one-of. A
LogicCondhas exactly one of{Pred, Neq}set. - AgentBlock exactly-one-of. An
AgentBlockhas exactly one of{Let, Var, Assign, On, Intent}set. - MatchCase pattern shape. A
MatchCase.Patternis a literal, a bare identifier, a parenthesised pattern, or a variant constructor call whose arguments are themselves patterns. Compound expressions (1 + 2), indexing (xs[0]), field access (p.x), casts (x as T), and other non-pattern shapes are rejected at normalisation time asP063. - ImportStmt language registry. An
ImportStmt.Lang, when set, is one of{go, python, typescript, zig, lua, clj, clojure}. An unknown value (e.g.import pythn "math" as math) is rejected at normalisation time asP064. - ExprStmt observability. An expression used as a statement must be observable: the expression tree must reach a call, fetch, save, load, or generate node along an unconditional path that does not pass through a function-literal body. A bare value (
x), arithmetic (1 + 2), or pure indexing/field access (xs[0],p.x) is rejected at normalisation time asP065. - FunExpr default return resolves to
unit. A block-bodied lambda or function withReturn == niland no value-yielding expression body resolves to the canonicalunittype (types.UnitType) at inference. Consumers readingFunExpr.Return == nilmay treat it asunitrather than special-casing "infer vs. void". The previous "void" spelling was renamed tounitin the type system; target-language transpilers that emitvoidas a keyword (Go, Java, C++, C#, Dart) continue to do so unchanged. - Recursive transparency. Sub-expressions (the value of a
LetStmt, the body of a block, the operand of a postfix op, etc.) are themselves validated. A corrupt arm anywhere in the tree fails at the closest enclosing position.
A grammar change that breaks any of these invariants fails the validator on the first input that exercises the break, with a positioned diagnostic. The validator is cheap (one deterministic traversal) and runs unconditionally.
Remaining gaps
No known soundness gaps remain in this revision. The list above is the full set of contracts the parser asserts; if a future revision discovers a new gap it will be enumerated here.
Conformance
A conforming consumer of the AST may assume every invariant enumerated under §"Enforced after this revision" holds on the tree it receives from parser.Parse or parser.ParseString. A conforming alternative parser must produce a tree that satisfies the same invariants, reachable through the same node types, and reject the same diagnostics (P062, P063, P064, P065, P070) for the same inputs. Diagnostic wording is not normative; diagnostic codes and positions are.
A consumer that walks the tree must:
- Treat the tagged-union encoding as exhaustive. A
default:in a switch overStatementarms indicates dead code in this MEP; if a future revision adds an arm, every consumer must opt into the new shape rather than silently ignoring it. - Apply binary-operator precedence as defined by the language ladder (MEP 2 §Layered precedence with operand symmetry), not as implied by tree shape. A consumer that walks
BinaryExpr.Rightleft to right without applying precedence will mistypea + b * c. - Preserve the order of
Statementsin aProgramand ofBodyin any block. Reordering changes meaning.
Open questions
- Separate
astpackage. Splitting the AST out ofparserwould let downstream tooling depend on the AST without pulling in participle and lexer machinery. The cost is a translation step and a duplicate type definition; the gain is a smaller dependency surface for code generators. Defer until a second front end (or a serialised AST distribution) becomes a concrete requirement. - Pattern AST. A dedicated
Patternunion would move match-shape checks from the type checker into the parser, surfacing more errors with positions. The cost is grammar duplication for literals, calls, and selectors. See MEP 13. - Structural unions in
TypeRef. First-classint | nilin aTypeRefarm would let optional types use surface syntax instead of nominal aliases. The decision turns on whether structural and nominal unions can coexist without producing two ways to spell the same type. See MEP 10. - Doc-comment ownership. Whether
Docis parsed (grammar production) or post-attached (text scan) changes nothing observable today, but a grammar production would make the attachment a parser contract rather than a normalisation contract.
References
- Aho, Sethi, Ullman. Compilers: Principles, Techniques, and Tools. §4.3 on AST design and §5 on syntax-directed translation.
- Appel, Andrew. Modern Compiler Implementation in ML. Chapter 4 on AST representations and the choice between class hierarchies and tagged unions.
- Wadler, Philip. The Expression Problem. 1998. The trade-off the tagged-union-by-nullable-field encoding navigates.
- Ford, Bryan. Parsing Expression Grammars: A Recognition-Based Syntactic Foundation. POPL 2004. PEG semantics that this AST's parser is built on.
- participle/v2. The struct-tag parser dialect whose tags share fields with the AST.
- MEP 1 (Lexer).
- MEP 2 (Grammar).
- MEP 10 (Type system roadmap).
- MEP 13 (Pattern matching).
Copyright
This document is placed in the public domain.