MEP 1. Lexer
| Field | Value |
|---|---|
| MEP | 1 |
| Title | Lexer |
| Author | Mochi core |
| Status | Informational |
| Type | Informational |
| Created | 2026-05-08 |
| Revised | 2026-05-11 |
Abstract
Mochi has a single-pass lexer that produces a flat token stream consumed by the grammar in MEP 2. There are nine token classes: Comment, Bool, Keyword, Ident, Float, Int, String, Punct, Whitespace. Each class has one canonical production. Token-class ordering is part of the specification: Bool matches before Keyword, Keyword matches before Ident. Every shape the lexer accepts is well-formed by construction; every shape the lexer rejects has a stable diagnostic code in the P040 band.
The companion document is MEP 2 (Grammar). This MEP fixes the token stream; MEP 2 fixes the productions consumed from it.
Motivation
A lexer is the most expensive layer to get wrong, because every later phase inherits its mistakes. A new keyword silently shadows an identifier the rest of the language depends on. A new operator collides with the prefix of an existing one. A numeric literal silently splits into two adjacent tokens and the user discovers it three error messages later. Three principles drive this revision:
- One source, one token stream. The lexer is deterministic: given the same bytes it always produces the same tokens. Rule ordering is part of the spec; no implementation detail leaks through.
- No corner cases. Anything the lexer accepts is well-formed. Shapes that are ambiguous or surprising (`1e` as `1` + `e`, `1_000` as `1` + `_000`, `0x` with no digits, an unterminated string, a raw newline inside a string literal) are rejected with a positioned diagnostic. There is no "this looked weird but the lexer split it for you" path.
- One diagnostic per failure mode. Every lexer-level rejection has a unique `P0xx` code. A tool consuming the lexer's output can predict the diagnostic shape without parsing the message text.
Notation
The lexer is specified in the same PEG dialect used by MEP 2, restricted to single-character primitives where convenient:
| Operator | Meaning |
|---|---|
| `e1 e2` | Sequence: match e1, then e2. |
| `e1 \| e2` | Ordered choice: try e1; if it fails, try e2. |
| `e?` | Optional: 0 or 1 occurrence. |
| `e*` | Repetition: 0 or more occurrences. |
| `e+` | Repetition: 1 or more occurrences. |
| `&e` | Positive lookahead: succeed only if e matches; consume no input. |
| `!e` | Negative lookahead: succeed only if e does not match; consume no input. |
| `'lit'` | Match a literal character or string. |
| `[abc]` | Match one of the listed characters. |
| `'a'..'z'` | Match a character in the inclusive range. |
| `.` | Match any single Unicode scalar value. |
Character classes that name a Unicode category use Go's unicode package names: unicode.L for any letter, unicode.So for "Other Symbol", unicode.N for any number character (a superset of the decimal digits).
Specification
Source encoding
Mochi source is UTF-8. The lexer accepts at most one byte-order mark (U+FEFF) at offset 0 and strips it before lexing. A BOM at any other position is error[P047].
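The BOM rule can be sketched in a few lines of Go. This is a minimal illustration under stated assumptions, not the Mochi implementation: the function name `stripBOM` and its return shape are hypothetical; only the code `P047` and the "offset 0 only" rule come from this MEP.

```go
package main

import (
	"bytes"
	"fmt"
)

// bom is the UTF-8 encoding of U+FEFF.
var bom = []byte{0xEF, 0xBB, 0xBF}

// stripBOM removes at most one leading BOM. If another BOM remains, it
// returns the code "P047" and that BOM's byte offset in the original file.
// Hypothetical helper: name and return shape are assumptions.
func stripBOM(src []byte) (rest []byte, code string, offset int) {
	rest = bytes.TrimPrefix(src, bom)
	skipped := len(src) - len(rest) // 3 if a leading BOM was stripped, else 0
	if i := bytes.Index(rest, bom); i >= 0 {
		return nil, "P047", skipped + i
	}
	return rest, "", 0
}

func main() {
	rest, code, _ := stripBOM([]byte("\uFEFFlet x = 1"))
	fmt.Printf("%q %q\n", rest, code)

	_, code, off := stripBOM([]byte("let \uFEFFx = 1"))
	fmt.Println(code, off)
}
```

Reporting the offset relative to the original file keeps the diagnostic position consistent with the byte offsets defined for tokens.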
Newlines are recognised as \n, \r\n, or \r. Line numbers start at 1. Column numbers are 1-based and count Unicode scalar values from the start of the current line; multi-byte runes count as one column at the position of their first byte. Tab is one column. Offsets are byte offsets from the start of the file, starting at 0.
Token classes
The lexer is a longest-match scanner with ordered alternatives. Rules are tried top-to-bottom; ties are broken by the longest match.
Token        = Comment
             | Bool
             | Keyword
             | Ident
             | Float
             | Int
             | String
             | Punct
             | Whitespace
Comment      = LineComment | BlockComment
LineComment  = ('//' | '#') (!Newline .)*
BlockComment = '/*' (!'*/' .)* '*/'
Bool         = 'true' | 'false'
Keyword      = HardKeyword & !IdentCont
Ident        = IdentStart IdentCont*
IdentStart   = unicode.L | unicode.So | '_'
IdentCont    = IdentStart | unicode.N
Float        = Digit+ '.' Digit+ Exponent?
             | Digit+ Exponent
Exponent     = ('e' | 'E') ('+' | '-')? Digit+
Int          = HexInt | BinInt | OctInt | DecInt
HexInt       = '0' ('x' | 'X') HexDigit+
BinInt       = '0' ('b' | 'B') BinDigit+
OctInt       = '0' ('o' | 'O') OctDigit+
DecInt       = Digit+
String       = '"' StringChar* '"'
StringChar   = '\\' EscapeChar
             | !('"' | '\\' | Newline) .
EscapeChar   = ['"\\abfnrtv]
             | 'x' HexDigit HexDigit
             | 'u' HexDigit HexDigit HexDigit HexDigit
             | 'U' HexDigit HexDigit HexDigit HexDigit
                   HexDigit HexDigit HexDigit HexDigit
Punct        = '==' | '!=' | '<=' | '>=' | '&&' | '||'
             | '=>' | ':-' | '..'
             | [-+*/%=<>!|{}[\](),.:]
Whitespace   = (' ' | '\t' | Newline | ';')+
Newline      = '\r\n' | '\n' | '\r'
Digit        = '0'..'9'
HexDigit     = '0'..'9' | 'a'..'f' | 'A'..'F'
BinDigit     = '0' | '1'
OctDigit     = '0'..'7'
Three properties of the rule list are part of the contract:
- `Bool` is tried before `Keyword`, `Keyword` is tried before `Ident`. Otherwise `true`, `false`, or any reserved word would lex as identifiers and the parser would never see them as keywords.
- `Keyword` uses negative lookahead `!IdentCont` at the trailing position. Without it, the prefix `if` would consume the first two characters of `ifte` and emit a keyword followed by an identifier `te`. With it, `ifte` lexes as a single `Ident`.
- Numeric literals do not include a leading `-`. Negation is an operator, parsed in MEP 2 §Expressions. The lexer always produces `-` and `1` as two tokens for the input `-1`. This is what lets `xs[len(xs)-1]` parse: the `-1` inside the subscript is a subtraction, not a literal.
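The Bool, Keyword, Ident ordering and the trailing `!IdentCont` lookahead can be illustrated with a small Go sketch. It uses an equivalent formulation rather than literal lookahead: scan the maximal `IdentStart IdentCont*` run first, then classify the whole word. The function `scanWord` and the map `hardKeywords` are hypothetical names, and the sketch assumes valid UTF-8 input.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
	"unicode/utf8"
)

// hardKeywords holds the 33 reserved words listed in this MEP.
var hardKeywords = func() map[string]bool {
	m := map[string]bool{}
	for _, w := range strings.Fields(`all agent break continue export else emit
		expect extern fact fetch for fun generate if import in intent let load
		match none on package return rule save stream test then type var while`) {
		m[w] = true
	}
	return m
}()

func isIdentStart(r rune) bool {
	return r == '_' || unicode.Is(unicode.L, r) || unicode.Is(unicode.So, r)
}

func isIdentCont(r rune) bool { return isIdentStart(r) || unicode.Is(unicode.N, r) }

// scanWord consumes the longest identifier-shaped prefix of src and then
// classifies it as Bool, Keyword, or Ident, in that order. Because the whole
// run is consumed first, "ifte" can never split into "if" + "te".
// It returns empty strings when src does not start with IdentStart.
func scanWord(src string) (class, text string) {
	end := 0
	for end < len(src) {
		r, size := utf8.DecodeRuneInString(src[end:])
		if (end == 0 && !isIdentStart(r)) || (end > 0 && !isIdentCont(r)) {
			break
		}
		end += size
	}
	text = src[:end]
	switch {
	case text == "":
		return "", ""
	case text == "true" || text == "false":
		return "Bool", text
	case hardKeywords[text]:
		return "Keyword", text
	}
	return "Ident", text
}

func main() {
	for _, src := range []string{"if (x)", "ifte", "true", "from", "let"} {
		class, text := scanWord(src)
		fmt.Printf("%-8q %s %q\n", src, class, text)
	}
}
```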
Reserved words
There are 33 hard keywords. Each appears in HardKeyword; lexing one yields a Keyword token, never an Ident. A program that uses any of them as a bare identifier is rejected with a parse error.
all agent break continue export else emit
expect extern fact fetch for fun generate
if import in intent let load match
none on package return rule save stream
test then type var while
Soft keywords are reserved only inside specific productions. They lex as Ident and become significant only where MEP 2 looks for them. Outside that production they are ordinary identifiers; let from = 1 is well-formed.
| Bucket | Words | Where significant |
|---|---|---|
| Declaration | bench, model, update | BenchBlock, ModelDecl, UpdateStmt |
| Query clauses | from, where, select, group, by, into, having, sort, order, skip, take, distinct, join, left, right, outer | QueryExpr, JoinClause |
| Modifiers | as, to, with | load, save, fetch, cast, import |
| Set operators | union, except, intersect | RelExpr in MEP 2 |
Numeric literals
| Form | Examples | Notes |
|---|---|---|
| `DecInt` | 0, 42, 007 | Leading zeros are decimal, not octal. 007 is 7. |
| `HexInt` | 0xFF, 0XaB | At least one hex digit after the prefix. |
| `BinInt` | 0b1010, 0B1 | At least one binary digit after the prefix. |
| `OctInt` | 0o7, 0O17 | At least one octal digit after the prefix. |
| `Float` | 1.0, 3.14, 1e10, 1.5e-2 | Fractional form requires digits on both sides of `.`. Exponent requires at least one digit. |
Three shapes are explicitly rejected by the lexer:
- Underscores in numeric literals (`1_000`, `0xFF_AA`) match no alternative. They produce `error[P043]`.
- Bare base prefixes (`0x`, `0b`, `0o`) match no alternative because the digit list is `+`, not `*`. They produce `error[P046]`.
- Numeric literal immediately followed by an identifier-start character (`1e`, `3.14abc`, `0xFFG`) is `error[P043]`. The pre-lex pass detects this before the simple lexer would split the input into two adjacent tokens.
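The accepted and rejected numeric shapes above can be checked with a handful of regular expressions. The following Go sketch is an illustration only: `classifyNumber` is a hypothetical helper that takes one whitespace-delimited word, and it reports nothing for shapes this MEP does not assign a code to (such as `1.`).

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	barePrefix = regexp.MustCompile(`^0[xXbBoO]$`)
	intLit     = regexp.MustCompile(`^(0[xX][0-9a-fA-F]+|0[bB][01]+|0[oO][0-7]+|[0-9]+)$`)
	floatLit   = regexp.MustCompile(`^([0-9]+\.[0-9]+([eE][+-]?[0-9]+)?|[0-9]+[eE][+-]?[0-9]+)$`)
	// A word that starts with a digit and contains an identifier-start
	// character (letter, Other Symbol, or underscore).
	adjacent = regexp.MustCompile(`^[0-9].*[\p{L}\p{So}_]`)
)

// classifyNumber returns the token class for a well-formed literal, or the
// diagnostic code for the rejected shapes listed in this MEP.
func classifyNumber(word string) (class, code string) {
	switch {
	case barePrefix.MatchString(word):
		return "", "P046"
	case intLit.MatchString(word):
		return "Int", ""
	case floatLit.MatchString(word):
		return "Float", ""
	case adjacent.MatchString(word):
		return "", "P043"
	}
	return "", "" // not a shape this sketch knows about
}

func main() {
	for _, w := range []string{"007", "0XaB", "1.5e-2", "0x", "1_000", "1e", "0xFFG"} {
		class, code := classifyNumber(w)
		fmt.Printf("%-8s class=%q code=%q\n", w, class, code)
	}
}
```

Valid literals are tested before the adjacency pattern so that hex digits and exponent markers inside a well-formed literal are not mistaken for a trailing identifier.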
Int values are 64-bit signed. Literals outside the int64 range produce error[P045] at the literal's start. Float values are 64-bit IEEE-754; overflow during conversion is error[P048].
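A minimal Go sketch of the range checks, assuming the literal has already matched the Int or Float grammar. The helper names are hypothetical. The integer path passes an explicit base to `strconv.ParseInt` so that a leading zero stays decimal, as the table above requires; base 0 would read `010` as octal.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
)

// parseIntLiteral converts an Int literal to int64, returning "P045" when
// the value does not fit. Assumes lit already matches the Int grammar.
func parseIntLiteral(lit string) (int64, string) {
	base, digits := 10, lit
	if len(lit) > 2 && lit[0] == '0' {
		switch lit[1] {
		case 'x', 'X':
			base, digits = 16, lit[2:]
		case 'b', 'B':
			base, digits = 2, lit[2:]
		case 'o', 'O':
			base, digits = 8, lit[2:]
		}
	}
	v, err := strconv.ParseInt(digits, base, 64)
	if errors.Is(err, strconv.ErrRange) {
		return 0, "P045"
	}
	return v, ""
}

// parseFloatLiteral converts a Float literal to float64, returning "P048"
// when strconv reports a range error.
func parseFloatLiteral(lit string) (float64, string) {
	v, err := strconv.ParseFloat(lit, 64)
	if errors.Is(err, strconv.ErrRange) {
		return 0, "P048"
	}
	return v, ""
}

func main() {
	fmt.Println(parseIntLiteral("010"))
	fmt.Println(parseIntLiteral("0xFF"))
	fmt.Println(parseIntLiteral("9223372036854775808"))
	fmt.Println(parseFloatLiteral("1e999"))
}
```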
String literals
Strings are double-quoted single-line literals. StringChar excludes raw Newline characters, so a " followed by an unescaped line break is error[P044], not a silent multi-line string. Multi-line text uses concatenation:
let msg = "first line\n" +
"second line\n"
After lexing, the parser passes every String token through participle.Unquote (which calls strconv.Unquote). The accepted escape sequences are those of Go's strconv.Unquote:
| Escape | Meaning |
|---|---|
| `\\` | Backslash |
| `\"` | Double quote |
| `\a` `\b` `\f` `\n` `\r` `\t` `\v` | Alert, backspace, form feed, newline, carriage return, tab, vertical tab |
| `\xHH` | One-byte hex escape, exactly two hex digits |
| `\uHHHH` | 16-bit Unicode escape, exactly four hex digits |
| `\UHHHHHHHH` | 32-bit Unicode escape, exactly eight hex digits |
Any other backslash-prefixed character is error[P041]. The brace forms used by Rust and JavaScript (\u{...}) are not supported; the fixed-width form \uHHHH is the canonical spelling.
An unterminated string literal is error[P040] at the position of the opening quote. There is no raw-string form, no triple-quoted string, and no string interpolation. Concatenation is +.
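Escape decoding can be tried directly against Go's `strconv.Unquote`, which is the function this section names. The wrapper `decodeString` below is a hypothetical helper that maps any decode failure to `P041`; it assumes the input is a complete double-quoted token that the lexer has already delimited.

```go
package main

import (
	"fmt"
	"strconv"
)

// decodeString decodes the escapes in a double-quoted literal. Any failure
// is reported as "P041". Assumes lit includes its surrounding quotes.
func decodeString(lit string) (string, string) {
	s, err := strconv.Unquote(lit)
	if err != nil {
		return "", "P041"
	}
	return s, ""
}

func main() {
	for _, lit := range []string{`"a\tb"`, `"\x41"`, `"\u00e9"`, `"\q"`, `"\u{e9}"`} {
		s, code := decodeString(lit)
		fmt.Printf("%-10s value=%q code=%q\n", lit, s, code)
	}
}
```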
Comments
LineComment = ('//' | '#') (!Newline .)*
BlockComment = '/*' (!'*/' .)* '*/'
Block comments do not nest. The first */ ends the comment. In /* outer /* inner */ tail */ the inner */ closes the comment and the trailing tail */ is parsed as ordinary source, which yields a parse error on the */.
An unterminated block comment is error[P042] at the position of the opening /*.
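The non-nesting rule amounts to "search for the first `*/` after the opener". A minimal Go sketch, with `blockCommentEnd` as a hypothetical helper name:

```go
package main

import (
	"fmt"
	"strings"
)

// blockCommentEnd returns the byte length of the block comment at the start
// of src. The first "*/" after the opening "/*" closes the comment, so
// comments do not nest. An unterminated comment yields the code "P042".
// It returns (0, "") when src does not start with "/*".
func blockCommentEnd(src string) (end int, code string) {
	if !strings.HasPrefix(src, "/*") {
		return 0, ""
	}
	i := strings.Index(src[2:], "*/")
	if i < 0 {
		return 0, "P042"
	}
	return 2 + i + 2, ""
}

func main() {
	src := "/* outer /* inner */ tail */"
	end, _ := blockCommentEnd(src)
	fmt.Printf("comment=%q rest=%q\n", src[:end], src[end:])

	_, code := blockCommentEnd("/* never closed")
	fmt.Println(code)
}
```

Searching from offset 2 matters: it stops `/*/` from being read as a complete comment whose closer overlaps the opener.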
The doc-comment attachment pass (parser/docs.go) re-reads the source after parsing and attaches each preceding line comment to the following declaration's Doc field. Line comments are otherwise elided before parsing.
Position fidelity
Every emitted lexer.Token carries a Pos with Filename, Offset, Line, Column. Positions point at the start of the token. Multi-byte runes count as one column at the position of their first byte. Tabs count as one column. CRLF newlines advance the line counter at the \n; the \r is consumed without resetting the column.
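The line and column rules can be sketched as a single pass over the bytes that precede a token. The helper `position` is hypothetical and is not part of the Mochi API. One detail is an assumption: the text says the `\r` of a CRLF pair does not reset the column, and this sketch additionally treats it as not advancing the column.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// position returns the 1-based line and column of the token that starts at
// the given byte offset in src.
func position(src string, offset int) (line, col int) {
	line, col = 1, 1
	for i := 0; i < offset && i < len(src); {
		r, size := utf8.DecodeRuneInString(src[i:])
		switch {
		case r == '\r' && strings.HasPrefix(src[i+size:], "\n"):
			// CRLF: the line counter advances at the '\n', not here.
			// Assumption: the '\r' leaves the column unchanged.
		case r == '\n' || r == '\r':
			line, col = line+1, 1
		default:
			col++ // one column per rune; a tab is one column too
		}
		i += size
	}
	return line, col
}

func main() {
	fmt.Println(position("let x\nlet y", 6)) // start of the second "let"
	fmt.Println(position("é\tx", 3))         // 'x' after a 2-byte rune and a tab
	fmt.Println(position("a\r\nb", 3))       // 'b' after a CRLF
}
```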
Pre-lex and post-lex passes
The lexer is the participle simple table bracketed by two side-effect-free passes:
- Pre-lex strips a leading BOM (or rejects a BOM elsewhere with `P047`) and performs a byte-level scan for shapes the simple regex cannot describe:
  - unterminated string literals (`P040`),
  - unterminated block comments (`P042`),
  - numeric literals adjacent to identifier characters (`P043`),
  - raw newline inside a string (`P044`),
  - bare base prefixes with no digits (`P046`).
- Post-capture translates participle's opaque error messages into the stable code surface: `strconv.ParseInt: ... value out of range` becomes `P045`, `invalid quoted string ...` becomes `P041`, `strconv.ParseFloat: ... value out of range` becomes `P048`.
Both passes share their diagnostic templates with parser/errors.go so the codes are stable across releases.
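The post-capture translation described above can be pictured as substring matching on the underlying error text. This Go sketch is illustrative: `codeFor` is a hypothetical name, and the matched fragments are the ones quoted in this section, not an exhaustive list of what the implementation inspects.

```go
package main

import (
	"fmt"
	"strings"
)

// codeFor maps an opaque error message to a stable diagnostic code, or ""
// when the message is not one of the three shapes named in this MEP.
func codeFor(msg string) string {
	outOfRange := strings.Contains(msg, "value out of range")
	switch {
	case outOfRange && strings.Contains(msg, "strconv.ParseInt"):
		return "P045"
	case outOfRange && strings.Contains(msg, "strconv.ParseFloat"):
		return "P048"
	case strings.Contains(msg, "invalid quoted string"):
		return "P041"
	}
	return ""
}

func main() {
	fmt.Println(codeFor(`strconv.ParseInt: parsing "99999999999999999999": value out of range`))
	fmt.Println(codeFor(`strconv.ParseFloat: parsing "1e999": value out of range`))
	fmt.Println(codeFor(`invalid quoted string "\q"`))
}
```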
Diagnostic codes
| Code | Class | Trigger |
|---|---|---|
| `P040` | Lexer | Unterminated string literal |
| `P041` | Lexer | Invalid escape sequence in string literal |
| `P042` | Lexer | Unterminated block comment |
| `P043` | Lexer | Missing whitespace between numeric literal and identifier (or underscore separator) |
| `P044` | Lexer | Raw newline inside string literal |
| `P045` | Lexer | Integer literal out of range for int64 |
| `P046` | Lexer | Incomplete numeric literal: missing digits after base prefix |
| `P047` | Lexer | Byte-order mark at a position other than offset 0 |
| `P048` | Lexer | Float literal out of range for IEEE-754 binary64 |
P049 is the next free code in the lexer band. Codes P060+ belong to the grammar's post-parse layer (see MEP 2).
Tokenize API
Mochi exposes parser.Tokenize(filename, src) ([]lexer.Token, error) for tooling: formatters, linters, editors, language servers. It returns the same token stream the parser consumes. Token kinds are the nine class names from §Token classes. The pre-lex validator runs first; ill-formed input returns an error of the same shape and code the parser would produce.
Design principles
One source, one token stream
Token-class ordering, longest-match resolution, and rule-list order are part of the spec. A consumer can predict the exact token stream from the source bytes without consulting the implementation.
No corner cases
A lexer that splits ambiguous input into two adjacent tokens hides the user's mistake. Every shape that is meaningful matches exactly one rule. Every shape that is not meaningful is a diagnostic with a stable code. The pre-lex pass exists to make this true for shapes the simple regex table cannot describe (unterminated comments, numeric-adjacency, BOM-at-offset-N), not to relax the rules.
Identifier discipline
Hard keywords are reserved everywhere; soft keywords are reserved only at the production that consumes them. The hard list is short (33 words) and the soft list is bounded by the grammar shapes that need it. New surface vocabulary defaults to soft unless it appears in a top-level statement.
Stable diagnostic codes
Every lexer rejection has a unique P0xx code. A downstream consumer (a code action, a quick-fix, a CI assertion) can match on the code without parsing the message text. Codes are append-only: a retired code is never reused.
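A consumer that matches on the code alone might look like the Go sketch below. It assumes diagnostics are rendered with the `error[Pnnn]` marker used throughout this MEP; the surrounding line layout in the example (file, line, column, message) is hypothetical, as is the helper name `diagCode`.

```go
package main

import (
	"fmt"
	"regexp"
)

// codeRE captures the stable code from an "error[Pnnn]" marker.
var codeRE = regexp.MustCompile(`\berror\[(P[0-9]{3})\]`)

// diagCode extracts the diagnostic code from a rendered diagnostic line,
// ignoring the message wording. It returns "" when no code is present.
func diagCode(line string) string {
	if m := codeRE.FindStringSubmatch(line); m != nil {
		return m[1]
	}
	return ""
}

func main() {
	// Hypothetical rendering; only the error[...] marker comes from this MEP.
	line := "main.mochi:3:9: error[P043]: missing whitespace after numeric literal"
	switch diagCode(line) {
	case "P043":
		fmt.Println("suggest: insert a space or remove the separator")
	default:
		fmt.Println("no quick-fix")
	}
}
```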
Negation lives in the parser
-1 lexes as two tokens. A single-token negative literal would break subtraction in postfix contexts (xs[len(xs)-1]); the expression grammar in MEP 2 handles unary minus at the operand level, with structural symmetry on both sides of every binary operator.
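A toy tokenizer makes the consequence concrete. The Go sketch below is a deliberately reduced stand-in for the real lexer: it handles only ASCII identifiers, decimal integers, whitespace, and the Punct forms from the grammar (two-character forms tried first). The name `tokens` is hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

var twoChar = map[string]bool{
	"==": true, "!=": true, "<=": true, ">=": true, "&&": true,
	"||": true, "=>": true, ":-": true, "..": true,
}

const oneChar = "-+*/%=<>!|{}[](),.:"

func isDigit(c byte) bool  { return c >= '0' && c <= '9' }
func isLetter(c byte) bool { return c == '_' || (c|0x20 >= 'a' && c|0x20 <= 'z') }

// tokens splits src into token texts. Numbers never absorb a leading '-',
// so "-1" is always Punct followed by Int.
func tokens(src string) ([]string, error) {
	var out []string
	for i := 0; i < len(src); {
		c, j := src[i], i+1
		switch {
		case strings.IndexByte(" \t\r\n;", c) >= 0:
			i++
			continue
		case isDigit(c):
			for j < len(src) && isDigit(src[j]) {
				j++
			}
		case isLetter(c):
			for j < len(src) && (isLetter(src[j]) || isDigit(src[j])) {
				j++
			}
		case i+2 <= len(src) && twoChar[src[i:i+2]]:
			j = i + 2
		case strings.IndexByte(oneChar, c) >= 0:
			// single-character Punct; j is already i+1
		default:
			return nil, fmt.Errorf("unexpected byte %q at offset %d", c, i)
		}
		out = append(out, src[i:j])
		i = j
	}
	return out, nil
}

func main() {
	fmt.Println(tokens("xs[len(xs)-1]"))
	fmt.Println(tokens("-1"))
}
```

Because `-` is always its own token, the parser sees the same stream for `len(xs)-1` and `len(xs) - 1`, and decides between subtraction and unary minus from context.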
Conformance
A conforming lexer must produce the documented token stream for every shape this MEP marks as accepted and reject every shape it marks as rejected with the documented diagnostic code. The obligations below are stated in terms of the language; any test corpus that exercises them satisfies this MEP.
Token coverage
A conforming lexer must produce the expected token class for at least one example of each of the following:
- Hard keywords. Each of the 33 reserved words lexes to `Keyword`, never to `Ident`.
- Soft keywords. Each soft keyword lexes to `Ident` outside its production, and the resulting token round-trips as a bare identifier in any context where an identifier is accepted.
- Integer literals. `DecInt`, `HexInt`, `BinInt`, `OctInt`, including the boundary cases: zero, a literal with leading zeros (`007` is `7`), a single digit after a base prefix (`0x1`, `0b1`, `0o1`), and mixed-case base prefixes (`0X`, `0B`, `0O`).
- Float literals. `1.0`, scientific notation with positive and negative exponents (`1e10`, `1.5e-2`, `0.5e+5`), and rejection of `1.` and `.1`.
- String literals. Each of the eight escape forms, the empty string `""`, and rejection of raw newlines inside a string.
- Punctuation. Every two-character form (`==`, `!=`, `<=`, `>=`, `&&`, `||`, `=>`, `:-`, `..`) and every single-character form.
- Boolean literals. `true` and `false` lex to `Bool`, not to `Keyword` or `Ident`.
- Comments. Both line forms (`//`, `#`) and the block form (`/* ... */`). The block form must accept `/***/`, `/*a**/`, and /* /, */ */ as single tokens. Block comments do not nest: the first `*/` closes.
Position fidelity
A conforming lexer must report token positions that satisfy:
- Line numbers advance correctly across `\n`, `\r\n`, and `\r`.
- A tab counts as exactly one column.
- A multi-byte rune counts as exactly one column at the position of its first byte.
Byte-order mark
- A BOM at offset 0 is stripped and the rest of the source is lexed normally.
- A source containing only a BOM is a valid empty program.
- A BOM at any other position is rejected with `P047`.
Diagnostic obligations
A conforming lexer must emit a positioned diagnostic with the documented code for each of the rejected shapes:
| Input shape | Code |
|---|---|
| Unterminated string literal | P040 |
| Invalid escape sequence | P041 |
| Unterminated block comment | P042 |
| Numeric literal adjacent to an identifier char | P043 |
| Raw newline inside a string literal | P044 |
| Integer literal out of int64 range | P045 |
| Base prefix with no following digits | P046 |
| Byte-order mark at offset other than 0 | P047 |
| Float literal out of IEEE-754 binary64 range | P048 |
The position must point at the start of the offending construct. The wording of the diagnostic message is not normative; only the code and position are.
Robustness obligations
A conforming lexer must terminate on every input. For any byte sequence it must produce either a valid token stream or one of the diagnostics above; it must not panic, hang, or return a null result alongside a null error. Random-byte fuzzing for the duration the language community standardises on (currently ten seconds per release) must surface no counterexamples to this invariant.
Open questions
- Nested block comments. A literal `*/` inside a block comment terminates it early. Switching to a hand-written tokenizer would let comments nest. Low priority; the current behaviour is pinned by `block_comment_no_nesting`.
- Triple-quoted strings. A `"""` literal that allows raw newlines and ends only at the matching `"""` would let users write long strings without `+` concatenation. The cost is a new escape surface and a third string form to learn.
- Numeric separators. `1_000_000` is rejected today by `P043`. Adding underscores between digits as a readability aid is cheap; the question is whether the surface complexity is worth it for a language that already supports hex, binary, and octal literals.
References
- Python lexical analysis reference. A worked example of a fully-specified lexer with stable token classes.
- participle/v2 lexer documentation. The implementation framework.
- Go `strconv.Unquote` semantics. The escape decoding rules.
- Unicode TR #31. Identifier and pattern syntax. Mochi's `IdentStart` and `IdentCont` are the practical subset described there.
- Cooper, Keith and Linda Torczon. Engineering a Compiler. Chapter 2 (Scanners).
- MEP 2 (Grammar).
- MEP 3 (Abstract Syntax Tree).
Copyright
This document is placed in the public domain.