MEP 1. Lexer

Field	Value
MEP	1
Title	Lexer
Author	Mochi core
Status	Informational
Type	Informational
Created	2026-05-08
Revised	2026-05-11

Abstract

Mochi has a single-pass lexer that produces a flat token stream consumed by the grammar in MEP 2. The token classes are nine: Comment, Bool, Keyword, Ident, Float, Int, String, Punct, Whitespace. Each class has one canonical production. Token-class ordering is part of the specification: Bool matches before Keyword, Keyword matches before Ident. Every shape the lexer accepts is well-formed by construction; every shape the lexer rejects has a stable diagnostic code in the P040 band.

The companion document is MEP 2 (Grammar). This MEP fixes the token stream; MEP 2 fixes the productions consumed from it.

Motivation

A lexer is the most expensive layer to get wrong, because every later phase inherits its mistakes. A new keyword silently shadows an identifier the rest of the language depends on. A new operator collides with the prefix of an existing one. A numeric literal silently splits into two adjacent tokens and the user discovers it three error messages later. Three principles drive this revision:

One source, one token stream. The lexer is deterministic: given the same bytes it always produces the same tokens. Rule ordering is part of the spec; no implementation detail leaks through.
No corner cases. Anything the lexer accepts is well-formed. Shapes that are ambiguous or surprising (1e as 1 + e, 1_000 as 1 + _000, 0x with no digits, an unterminated string, a raw newline inside a string literal) are rejected with a positioned diagnostic. There is no "this looked weird but the lexer split it for you" path.
One diagnostic per failure mode. Every lexer-level rejection has a unique P0xx code. A tool consuming the lexer's output can predict the diagnostic shape without parsing the message text.

Notation

The lexer is specified in the same PEG dialect used by MEP 2, restricted to single-character primitives where convenient:

Operator	Meaning
`e1 e2`	Sequence: match `e1`, then `e2`.
`e1 \| e2`	Ordered choice: try `e1`; if it fails, try `e2`.
`e?`	Optional: 0 or 1 occurrence.
`e*`	Repetition: 0 or more occurrences.
`e+`	Repetition: 1 or more occurrences.
`&e`	Positive lookahead: match without consuming input.
`!e`	Negative lookahead: fail without consuming input.
`'lit'`	Match a literal character or string.
`[abc]`	Match one of the listed characters.
`'a'..'z'`	Match a character in the inclusive range.
`.`	Match any single Unicode scalar value.

Character classes that name a Unicode category use Go's unicode package names: unicode.L for any letter, unicode.So for "Other Symbol", unicode.N for any digit.

Specification

Source encoding

Mochi source is UTF-8. The lexer accepts at most one byte-order mark (U+FEFF) at offset 0 and strips it before lexing. A BOM at any other position is error[P047].

Newlines are recognised as \n, \r\n, or \r. Line numbers start at 1. Column numbers are 1-based and count UTF-8 scalar values from the start of the current line; multi-byte runes count as one column at the position of their first byte. Tab is one column. Offsets are byte offsets from the start of the file, starting at 0.

Token classes

The lexer is a longest-match scanner with ordered alternatives. Rules are tried top-to-bottom; ties are broken by the longest match.

Token       = Comment
            | Bool
            | Keyword
            | Ident
            | Float
            | Int
            | String
            | Punct
            | Whitespace

Comment     = LineComment | BlockComment
LineComment = ('//' | '#') (!Newline .)*
BlockComment= '/*' (!'*/' .)* '*/'

Bool        = 'true' | 'false'
Keyword     = HardKeyword & !IdentCont
Ident       = IdentStart IdentCont*
IdentStart  = unicode.L | unicode.So | '_'
IdentCont   = IdentStart | unicode.N

Float       = Digit+ '.' Digit+ Exponent?
            | Digit+ Exponent
Exponent    = ('e' | 'E') ('+' | '-')? Digit+
Int         = HexInt | BinInt | OctInt | DecInt
HexInt      = '0' ('x' | 'X') HexDigit+
BinInt      = '0' ('b' | 'B') BinDigit+
OctInt      = '0' ('o' | 'O') OctDigit+
DecInt      = Digit+

String      = '"' StringChar* '"'
StringChar  = '\\' EscapeChar
            | !('"' | '\\' | Newline) .
EscapeChar  = ['"\\abfnrtv]
            | 'x' HexDigit HexDigit
            | 'u' HexDigit HexDigit HexDigit HexDigit
            | 'U' HexDigit HexDigit HexDigit HexDigit
                  HexDigit HexDigit HexDigit HexDigit

Punct       = '==' | '!=' | '<=' | '>=' | '&&' | '||'
            | '=>' | ':-' | '..'
            | [-+*/%=<>!|{}[\](),.:]

Whitespace  = (' ' | '\t' | Newline | ';')+
Newline     = '\r\n' | '\n' | '\r'

Digit       = '0'..'9'
HexDigit    = '0'..'9' | 'a'..'f' | 'A'..'F'
BinDigit    = '0' | '1'
OctDigit    = '0'..'7'

Three properties of the rule list are part of the contract:

Bool is tried before Keyword, Keyword is tried before Ident. Otherwise true, false, or any reserved word would lex as identifiers and the parser would never see them as keywords.
Keyword uses negative lookahead !IdentCont at the trailing position. Without it, the prefix if would consume the first two characters of ifte and emit a keyword followed by an identifier te. With it, ifte lexes as a single Ident.
Numeric literals do not include a leading -. Negation is an operator, parsed in MEP 2 §Expressions. The lexer always produces - 1 as two tokens for the input -1. This is what lets xs[len(xs)-1] parse: the -1 inside the subscript is a subtraction, not a literal.

Reserved words

There are 33 hard keywords. Each appears in HardKeyword; lexing one yields a Keyword token, never an Ident. Every program in which any of these appears as a bare identifier is a parse error.

all      agent    break    continue export   else     emit
expect   extern   fact     fetch    for      fun      generate
if       import   in       intent   let      load     match
none     on       package  return   rule     save     stream
test     then     type     var      while

Soft keywords are reserved only inside specific productions. They lex as Ident and become significant only where MEP 2 looks for them. Outside that production they are ordinary identifiers; let from = 1 is well-formed.

Bucket	Words	Where significant
Declaration	`bench`, `model`, `update`	`BenchBlock`, `ModelDecl`, `UpdateStmt`
Query clauses	`from`, `where`, `select`, `group`, `by`, `into`, `having`, `sort`, `order`, `skip`, `take`, `distinct`, `join`, `left`, `right`, `outer`	`QueryExpr`, `JoinClause`
Modifiers	`as`, `to`, `with`	`load`, `save`, `fetch`, cast, import
Set operators	`union`, `except`, `intersect`	`RelExpr` in MEP 2

Numeric literals

Form	Examples	Notes
`DecInt`	`0`, `42`, `007`	Leading zeros are decimal, not octal. `007` is `7`.
`HexInt`	`0xFF`, `0XaB`	At least one hex digit after the prefix.
`BinInt`	`0b1010`, `0B1`	At least one binary digit after the prefix.
`OctInt`	`0o7`, `0O17`	At least one octal digit after the prefix.
`Float`	`1.0`, `3.14`, `1e10`, `1.5e-2`	Fractional form requires digits on both sides of `.`. Exponent requires at least one digit.

Three shapes are explicitly rejected by the lexer:

Underscores in numeric literals (1_000, 0xFF_AA) match no alternative. They produce error[P043].
Bare base prefixes (0x, 0b, 0o) match no alternative because the digit list is +, not *. They produce error[P046].
Numeric literal immediately followed by an identifier-start character (1e, 3.14abc, 0xFFG) is error[P043]. The pre-lex pass detects this before the simple lexer would split the input into two adjacent tokens.

Int values are 64-bit signed. Literals outside the int64 range produce error[P045] at the literal's start. Float values are 64-bit IEEE-754; overflow during conversion is error[P048].

String literals

Strings are double-quoted single-line literals. StringChar excludes raw Newline characters, so a " followed by an unescaped line break is error[P044], not a silent multi-line string. Multi-line text uses concatenation:

let msg = "first line\n" +
          "second line\n"

After lexing, the parser passes every String token through participle.Unquote (which calls strconv.Unquote). The accepted escape sequences are those of Go's strconv.Unquote:

Escape	Meaning
`\\`	Backslash
`\"`	Double quote
`\a` `\b` `\f` `\n` `\r` `\t` `\v`	Alert, backspace, form feed, newline, carriage return, tab, vertical tab
`\xHH`	One-byte hex escape, exactly two hex digits
`\uHHHH`	16-bit Unicode escape, exactly four hex digits
`\UHHHHHHHH`	32-bit Unicode escape, exactly eight hex digits

Any other backslash-prefixed character is error[P041]. The brace forms used by Rust and JavaScript (\u{...}) are not supported; the full-width form \uHHHH is the canonical spelling.

An unterminated string literal is error[P040] at the position of the opening quote. There is no raw-string form, no triple-quoted string, and no string interpolation. Concatenation is +.

Comments

LineComment  = ('//' | '#') (!Newline .)*
BlockComment = '/*' (!'*/' .)* '*/'

Block comments do not nest. The first */ ends the comment. In /* outer /* inner */ tail */ the inner */ closes the comment and the trailing tail */ is parsed as ordinary source, which yields a parse error on the */.

An unterminated block comment is error[P042] at the position of the opening /*.

The doc-comment attachment pass (parser/docs.go) re-reads the source after parsing and attaches each preceding line comment to the following declaration's Doc field. Line comments are otherwise elided before parsing.

Position fidelity

Every emitted lexer.Token carries a Pos with Filename, Offset, Line, Column. Positions point at the start of the token. Multi-byte runes count as one column at the position of their first byte. Tabs count as one column. CRLF newlines advance the line counter at the \n; the \r is consumed without resetting the column.

Pre-lex and post-lex passes

The lexer is the participle simple table bracketed by two side-effect-free passes:

Pre-lex strips a leading BOM (or rejects a BOM elsewhere with P047) and performs a byte-level scan for shapes the simple regex cannot describe:
- unterminated string literals (P040),
- unterminated block comments (P042),
- numeric literals adjacent to identifier characters (P043),
- raw newline inside a string (P044),
- bare base prefixes with no digits (P046).
Post-capture translates participle's opaque error messages into the stable code surface: strconv.ParseInt: ... value out of range becomes P045, invalid quoted string ... becomes P041, strconv.ParseFloat: ... value out of range becomes P048.

Both passes share their diagnostic templates with parser/errors.go so the codes are stable across releases.

Diagnostic codes

Code	Class	Trigger
`P040`	Lexer	Unterminated string literal
`P041`	Lexer	Invalid escape sequence in string literal
`P042`	Lexer	Unterminated block comment
`P043`	Lexer	Missing whitespace between numeric literal and identifier (or underscore separator)
`P044`	Lexer	Raw newline inside string literal
`P045`	Lexer	Integer literal out of range for `int64`
`P046`	Lexer	Incomplete numeric literal: missing digits after base prefix
`P047`	Lexer	Byte-order mark at a position other than offset 0
`P048`	Lexer	Float literal out of range for IEEE-754 binary64

P049 is the next free code in the lexer band. Codes P060+ belong to the grammar's post-parse layer (see MEP 2).

Tokenize API

Mochi exposes parser.Tokenize(filename, src) ([]lexer.Token, error) for tooling: formatters, linters, editors, language servers. It returns the same token stream the parser consumes. Token kinds are the nine class names from §Token classes. The pre-lex validator runs first; ill-formed input returns an error of the same shape and code the parser would produce.

Design principles

One source, one token stream

Token-class ordering, longest-match resolution, and rule-list order are part of the spec. A consumer can predict the exact token stream from the source bytes without consulting the implementation.

No corner cases

A lexer that splits ambiguous input into two adjacent tokens hides the user's mistake. Every shape that is meaningful matches exactly one rule. Every shape that is not meaningful is a diagnostic with a stable code. The pre-lex pass exists to make this true for shapes the simple regex table cannot describe (unterminated comments, numeric-adjacency, BOM-at-offset-N), not to relax the rules.

Identifier discipline

Hard keywords are reserved everywhere; soft keywords are reserved only at the production that consumes them. The hard list is short (33 words) and the soft list is bounded by the grammar shapes that need it. New surface vocabulary defaults to soft unless it appears in a top-level statement.

Stable diagnostic codes

Every lexer rejection has a unique P0xx code. A downstream consumer (a code action, a quick-fix, a CI assertion) can match on the code without parsing the message text. Codes are append-only: a retired code is never reused.

Negation lives in the parser

-1 lexes as two tokens. A single-token negative literal would break subtraction in postfix contexts (xs[len(xs)-1]); the expression grammar in MEP 2 handles unary minus at the operand level, with structural symmetry on both sides of every binary operator.

Conformance

A conforming lexer must produce the documented token stream for every shape this MEP marks as accepted and reject every shape it marks as rejected with the documented diagnostic code. The obligations below are stated in terms of the language; any test corpus that exercises them satisfies this MEP.

Token coverage

A conforming lexer must produce the expected token class for at least one example of each of the following:

Hard keywords. Each of the 33 reserved words lexes to Keyword, never to Ident.
Soft keywords. Each soft keyword lexes to Ident outside its production, and the resulting token round-trips as a bare identifier in any context where an identifier is accepted.
Integer literals. DecInt, HexInt, BinInt, OctInt, including the boundary cases: zero, a literal with leading zeros (007 is 7), a single digit after a base prefix (0x1, 0b1, 0o1), and mixed-case base prefixes (0X, 0B, 0O).
Float literals. 1.0, scientific notation with positive and negative exponents (1e10, 1.5e-2, 0.5e+5), and rejection of 1. and .1.
String literals. Each of the eight escape forms, the empty string "", and rejection of raw newlines inside a string.
Punctuation. Every two-character form (==, !=, <=, >=, &&, ||, =>, :-, ..) and every single-character form.
Boolean literals. true and false lex to Bool, not to Keyword or Ident.
Comments. Both line forms (//, #) and the block form (/* ... */). The block form must accept /***/, /*a**/, and /* /, */ */ as single tokens. Block comments do not nest: the first */ closes.

Position fidelity

A conforming lexer must report token positions that satisfy:

Line numbers advance correctly across \n, \r\n, and \r.
A tab counts as exactly one column.
A multi-byte rune counts as exactly one column at the position of its first byte.

Byte-order mark

A BOM at offset 0 is stripped and the rest of the source is lexed normally.
A source containing only a BOM is a valid empty program.
A BOM at any other position is rejected with P047.

Diagnostic obligations

A conforming lexer must emit a positioned diagnostic with the documented code for each of the rejected shapes:

Input shape	Code
Unterminated string literal	`P040`
Invalid escape sequence	`P041`
Unterminated block comment	`P042`
Numeric literal adjacent to an identifier char	`P043`
Raw newline inside a string literal	`P044`
Integer literal out of `int64` range	`P045`
Base prefix with no following digits	`P046`
Byte-order mark at offset other than 0	`P047`
Float literal out of IEEE-754 binary64 range	`P048`

The position must point at the start of the offending construct. The wording of the diagnostic message is not normative; only the code and position are.

Robustness obligations

A conforming lexer must terminate on every input. For any byte sequence it must produce either a valid token stream or one of the diagnostics above; it must not panic, hang, or return a null result alongside a null error. Random-byte fuzzing for the duration the language community standardises on (currently ten seconds per release) must surface no counterexamples to this invariant.

Open questions

Nested block comments. A literal */ inside a block comment terminates it early. Switching to a hand-written tokenizer would let comments nest. Low priority; the current behaviour is pinned by block_comment_no_nesting.
Triple-quoted strings. A """ literal that allows raw newlines and ends only at the matching """ would let users write long strings without + concatenation. The cost is a new escape surface and a third string form to learn.
Numeric separators. 1_000_000 is rejected today by P043. Adding underscores between digits as a readability aid is cheap; the question is whether the surface complexity is worth it for a language that already supports hex, binary, and octal literals.

References

Python lexical analysis reference. A worked example of a fully-specified lexer with stable token classes.
participle/v2 lexer documentation. The implementation framework.
Go strconv.Unquote semantics. The escape decoding rules.
Unicode TR #31. Identifier and pattern syntax. Mochi's IdentStart and IdentCont are the practical subset described there.
Cooper, Keith and Linda Torczon. Engineering a Compiler. Chapter 2 (Scanners).
MEP 2 (Grammar).
MEP 3 (Abstract Syntax Tree).

Copyright

This document is placed in the public domain.

Abstract​

Motivation​

Notation​

Specification​

Source encoding​

Token classes​

Reserved words​

Numeric literals​

String literals​

Comments​

Position fidelity​

Pre-lex and post-lex passes​

Diagnostic codes​

Tokenize API​

Design principles​

One source, one token stream​

No corner cases​

Identifier discipline​

Stable diagnostic codes​

Negation lives in the parser​

Conformance​

Token coverage​

Position fidelity​

Byte-order mark​

Diagnostic obligations​

Robustness obligations​

Open questions​

References​

Copyright​

Abstract

Motivation

Notation

Specification

Source encoding

Token classes

Reserved words

Numeric literals

String literals

Comments

Position fidelity

Pre-lex and post-lex passes

Diagnostic codes

Tokenize API

Design principles

One source, one token stream

No corner cases

Identifier discipline

Stable diagnostic codes

Negation lives in the parser

Conformance

Token coverage

Position fidelity

Byte-order mark

Diagnostic obligations

Robustness obligations

Open questions

References

Copyright