MEP 1. Lexer

Field    Value
-----    -----
MEP      1
Title    Lexer
Author   Mochi core
Status   Informational
Type     Informational
Created  2026-05-08
Revised  2026-05-11

Abstract

Mochi has a single-pass lexer that produces a flat token stream consumed by the grammar in MEP 2. There are nine token classes: Comment, Bool, Keyword, Ident, Float, Int, String, Punct, Whitespace. Each class has one canonical production. Token-class ordering is part of the specification: Bool matches before Keyword, and Keyword matches before Ident. Every shape the lexer accepts is well-formed by construction; every shape the lexer rejects has a stable diagnostic code in the P040 band.

The companion document is MEP 2 (Grammar). This MEP fixes the token stream; MEP 2 fixes the productions consumed from it.

Motivation

A lexer is the most expensive layer to get wrong, because every later phase inherits its mistakes. A new keyword silently shadows an identifier the rest of the language depends on. A new operator collides with the prefix of an existing one. A numeric literal silently splits into two adjacent tokens and the user discovers it three error messages later. Three principles drive this revision:

  1. One source, one token stream. The lexer is deterministic: given the same bytes it always produces the same tokens. Rule ordering is part of the spec; no implementation detail leaks through.
  2. No corner cases. Anything the lexer accepts is well-formed. Shapes that are ambiguous or surprising (1e as 1 + e, 1_000 as 1 + _000, 0x with no digits, an unterminated string, a raw newline inside a string literal) are rejected with a positioned diagnostic. There is no "this looked weird but the lexer split it for you" path.
  3. One diagnostic per failure mode. Every lexer-level rejection has a unique P0xx code. A tool consuming the lexer's output can predict the diagnostic shape without parsing the message text.

Notation

The lexer is specified in the same PEG dialect used by MEP 2, restricted to single-character primitives where convenient:

Operator    Meaning
--------    -------
e1 e2       Sequence: match e1, then e2.
e1 | e2     Ordered choice: try e1; if it fails, try e2.
e?          Optional: 0 or 1 occurrence.
e*          Repetition: 0 or more occurrences.
e+          Repetition: 1 or more occurrences.
&e          Positive lookahead: succeed if e matches, without consuming input.
!e          Negative lookahead: succeed if e does not match, without consuming input.
'lit'       Match a literal character or string.
[abc]       Match one of the listed characters.
'a'..'z'    Match a character in the inclusive range.
.           Match any single Unicode scalar value.

Character classes that name a Unicode category use Go's unicode package names: unicode.L for any letter, unicode.So for "Other Symbol", unicode.N for any number character (category N, which includes decimal digits but is not limited to them).

Specification

Source encoding

Mochi source is UTF-8. The lexer accepts at most one byte-order mark (U+FEFF) at offset 0 and strips it before lexing. A BOM at any other position is error[P047].
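
The BOM rule can be sketched in a few lines of Go. This is an illustration only: stripBOM is a hypothetical helper name, and reporting the offset relative to the original file is an assumption, not something this MEP states.

```go
package main

import (
	"bytes"
	"fmt"
)

// utf8BOM is the UTF-8 encoding of U+FEFF.
var utf8BOM = []byte{0xEF, 0xBB, 0xBF}

// stripBOM (hypothetical name) removes one BOM at offset 0 and rejects
// a BOM anywhere else. The reported offset is relative to the original
// source bytes, which is an assumption of this sketch.
func stripBOM(src []byte) ([]byte, error) {
	skipped := 0
	if bytes.HasPrefix(src, utf8BOM) {
		src = src[len(utf8BOM):]
		skipped = len(utf8BOM)
	}
	if i := bytes.Index(src, utf8BOM); i >= 0 {
		return nil, fmt.Errorf("error[P047]: byte-order mark at offset %d", i+skipped)
	}
	return src, nil
}

func main() {
	out, err := stripBOM([]byte("\uFEFFlet x = 1"))
	fmt.Printf("%q %v\n", out, err)

	_, err = stripBOM([]byte("let \uFEFFx = 1"))
	fmt.Println(err)
}
```

A source that consists of a BOM and nothing else comes back as an empty slice with no error.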

Newlines are recognised as \n, \r\n, or \r. Line numbers start at 1. Column numbers are 1-based and count Unicode scalar values from the start of the current line; multi-byte runes count as one column at the position of their first byte. Tab is one column. Offsets are byte offsets from the start of the file, starting at 0.

Token classes

The lexer tries the token rules top-to-bottom as ordered alternatives, and the first rule that matches at the current position wins. Within a rule, repetition is greedy, so the winning rule consumes the longest text it can match.

Token = Comment
| Bool
| Keyword
| Ident
| Float
| Int
| String
| Punct
| Whitespace

Comment = LineComment | BlockComment
LineComment = ('//' | '#') (!Newline .)*
BlockComment= '/*' (!'*/' .)* '*/'

Bool = 'true' | 'false'
Keyword = HardKeyword !IdentCont
Ident = IdentStart IdentCont*
IdentStart = unicode.L | unicode.So | '_'
IdentCont = IdentStart | unicode.N

Float = Digit+ '.' Digit+ Exponent?
| Digit+ Exponent
Exponent = ('e' | 'E') ('+' | '-')? Digit+
Int = HexInt | BinInt | OctInt | DecInt
HexInt = '0' ('x' | 'X') HexDigit+
BinInt = '0' ('b' | 'B') BinDigit+
OctInt = '0' ('o' | 'O') OctDigit+
DecInt = Digit+

String = '"' StringChar* '"'
StringChar = '\\' EscapeChar
| !('"' | '\\' | Newline) .
EscapeChar = ["\\abfnrtv]
| 'x' HexDigit HexDigit
| 'u' HexDigit HexDigit HexDigit HexDigit
| 'U' HexDigit HexDigit HexDigit HexDigit
HexDigit HexDigit HexDigit HexDigit

Punct = '==' | '!=' | '<=' | '>=' | '&&' | '||'
| '=>' | ':-' | '..'
| [-+*/%=<>!|{}[\](),.:]

Whitespace = (' ' | '\t' | Newline | ';')+
Newline = '\r\n' | '\n' | '\r'

Digit = '0'..'9'
HexDigit = '0'..'9' | 'a'..'f' | 'A'..'F'
BinDigit = '0' | '1'
OctDigit = '0'..'7'
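
Why alternative order matters can be seen with Go's regexp package, whose alternation prefers the leftmost alternative and so behaves like PEG ordered choice in these simple cases. The sketch below is not the Mochi lexer; the pattern variables are illustrative names, and each pattern covers only a fragment of the Punct and numeric rules.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	// Two-character forms first, as in the Punct rule.
	punctOrdered = regexp.MustCompile(`^(?:==|=>|=)`)
	// Same alternatives with the one-character form first.
	punctReversed = regexp.MustCompile(`^(?:=|==|=>)`)

	// Float before Int, as in the Token rule.
	numOrdered = regexp.MustCompile(`^(?:[0-9]+\.[0-9]+|[0-9]+)`)
	// Int before Float.
	numReversed = regexp.MustCompile(`^(?:[0-9]+|[0-9]+\.[0-9]+)`)
)

func main() {
	fmt.Println(punctOrdered.FindString("=="))  // ==
	fmt.Println(punctReversed.FindString("==")) // =
	fmt.Println(numOrdered.FindString("1.5"))   // 1.5
	fmt.Println(numReversed.FindString("1.5"))  // 1
}
```

With the reversed order the shorter alternative wins and the remainder would surface as a second token, which is the silent split the rule order is meant to prevent.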

Three properties of the rule list are part of the contract:

  • Bool is tried before Keyword, Keyword is tried before Ident. Otherwise true, false, or any reserved word would lex as identifiers and the parser would never see them as keywords.
  • Keyword uses negative lookahead !IdentCont at the trailing position. Without it, the prefix if would consume the first two characters of ifte and emit a keyword followed by an identifier te. With it, ifte lexes as a single Ident.
  • Numeric literals do not include a leading -. Negation is an operator, parsed in MEP 2 §Expressions. The lexer always produces - 1 as two tokens for the input -1. This is what lets xs[len(xs)-1] parse: the -1 inside the subscript is a subtraction, not a literal.
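
A minimal Go sketch of the Bool, Keyword, Ident ordering follows. It scans a whole word first and then classifies it, which gives the same result as the !IdentCont lookahead for keywords. The function names are hypothetical. Applying the whole-word rule to true and false is an assumption of the sketch, because the Bool production as written carries no lookahead.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
	"unicode/utf8"
)

// hardKeywords holds the 33 reserved words from this MEP.
var hardKeywords = func() map[string]bool {
	m := map[string]bool{}
	for _, w := range strings.Fields(`all agent break continue export else emit
		expect extern fact fetch for fun generate if import in intent let load
		match none on package return rule save stream test then type var while`) {
		m[w] = true
	}
	return m
}()

func isIdentStart(r rune) bool {
	return r == '_' || unicode.Is(unicode.L, r) || unicode.Is(unicode.So, r)
}

func isIdentCont(r rune) bool {
	return isIdentStart(r) || unicode.Is(unicode.N, r)
}

// scanWord (hypothetical name) returns the token class and text of the
// word at the start of src, or two empty strings if no word starts there.
func scanWord(src string) (class, text string) {
	end := 0
	for end < len(src) {
		r, size := utf8.DecodeRuneInString(src[end:])
		if r == utf8.RuneError && size == 1 {
			break // invalid UTF-8 byte: not part of a word
		}
		if (end == 0 && !isIdentStart(r)) || !isIdentCont(r) {
			break
		}
		end += size
	}
	text = src[:end]
	switch {
	case text == "":
		return "", ""
	case text == "true" || text == "false":
		return "Bool", text
	case hardKeywords[text]:
		return "Keyword", text
	default:
		return "Ident", text
	}
}

func main() {
	for _, src := range []string{"if x", "ifte", "true", "from"} {
		class, text := scanWord(src)
		fmt.Println(class, text)
	}
}
```

The soft keyword from comes back as Ident, matching the rule that soft keywords become significant only in the grammar.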

Reserved words

There are 33 hard keywords. Each appears in HardKeyword; lexing one yields a Keyword token, never an Ident. Using any of them as a bare identifier is a parse error.

all agent break continue export else emit
expect extern fact fetch for fun generate
if import in intent let load match
none on package return rule save stream
test then type var while

Soft keywords are reserved only inside specific productions. They lex as Ident and become significant only where MEP 2 looks for them. Outside those productions they are ordinary identifiers; let from = 1 is well-formed.

Bucket         Words                                                                                                           Where significant
------         -----                                                                                                           -----------------
Declaration    bench, model, update                                                                                            BenchBlock, ModelDecl, UpdateStmt
Query clauses  from, where, select, group, by, into, having, sort, order, skip, take, distinct, join, left, right, outer      QueryExpr, JoinClause
Modifiers      as, to, with                                                                                                    load, save, fetch, cast, import
Set operators  union, except, intersect                                                                                        RelExpr in MEP 2

Numeric literals

Form    Examples                  Notes
----    --------                  -----
DecInt  0, 42, 007                Leading zeros are decimal, not octal. 007 is 7.
HexInt  0xFF, 0XaB                At least one hex digit after the prefix.
BinInt  0b1010, 0B1               At least one binary digit after the prefix.
OctInt  0o7, 0O17                 At least one octal digit after the prefix.
Float   1.0, 3.14, 1e10, 1.5e-2   Fractional form requires digits on both sides of the '.'. Exponent requires at least one digit.

Three shapes are explicitly rejected by the lexer:

  • Underscores in numeric literals (1_000, 0xFF_AA) match no alternative. They produce error[P043].
  • Bare base prefixes (0x, 0b, 0o) match no alternative because the digit list is +, not *. They produce error[P046].
  • Numeric literal immediately followed by an identifier-start character (1e, 3.14abc, 0xFFG) is error[P043]. The pre-lex pass detects this before the simple lexer would split the input into two adjacent tokens.
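
The three rejected shapes can be sketched as a check that runs where a numeric literal starts. This is a sketch under stated assumptions, not the actual pre-lex pass: checkNumber and numberRE are hypothetical names, and the regular expression is a transcription of the Int and Float rules into Go syntax.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
	"unicode"
	"unicode/utf8"
)

// numberRE mirrors HexInt | BinInt | OctInt | Float | DecInt.
var numberRE = regexp.MustCompile(
	`^(?:0[xX][0-9a-fA-F]+|0[bB][01]+|0[oO][0-7]+` +
		`|[0-9]+\.[0-9]+(?:[eE][+-]?[0-9]+)?|[0-9]+[eE][+-]?[0-9]+|[0-9]+)`)

// checkNumber (hypothetical name) returns the literal at the start of
// src, or a diagnostic code if the shape is one this MEP rejects.
func checkNumber(src string) (lit, code string) {
	lit = numberRE.FindString(src)
	if lit == "" {
		return "", ""
	}
	rest := src[len(lit):]
	// "0" followed by a base letter means the prefix had no digits.
	if lit == "0" && rest != "" && strings.ContainsRune("xXbBoO", rune(rest[0])) {
		return "", "P046"
	}
	r, size := utf8.DecodeRuneInString(rest)
	valid := size > 0 && !(r == utf8.RuneError && size == 1)
	if valid && (r == '_' || unicode.Is(unicode.L, r) || unicode.Is(unicode.So, r)) {
		return "", "P043" // literal runs straight into an identifier
	}
	return lit, ""
}

func main() {
	for _, src := range []string{"1_000", "0x", "1e", "0xFFG", "1.5e-2)"} {
		lit, code := checkNumber(src)
		fmt.Printf("%-8s lit=%q code=%q\n", src, lit, code)
	}
}
```

The bare-prefix test runs before the adjacency test so that 0x reports P046 rather than P043.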

Int values are 64-bit signed. Literals outside the int64 range produce error[P045] at the literal's start. Float values are 64-bit IEEE-754; overflow during conversion is error[P048].
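
Converting an Int token to a value can be sketched with strconv.ParseInt and an explicit base. The helper name parseIntLiteral is hypothetical. Passing the base explicitly, rather than base 0, matters here: with base 0 Go would read 007 as octal and would accept underscores, neither of which matches this MEP.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
)

// parseIntLiteral (hypothetical name) converts a token that has already
// matched the Int rule. Out-of-range values map to P045.
func parseIntLiteral(lit string) (int64, error) {
	base, digits := 10, lit
	if len(lit) > 2 && lit[0] == '0' {
		switch lit[1] {
		case 'x', 'X':
			base, digits = 16, lit[2:]
		case 'b', 'B':
			base, digits = 2, lit[2:]
		case 'o', 'O':
			base, digits = 8, lit[2:]
		}
	}
	v, err := strconv.ParseInt(digits, base, 64)
	if errors.Is(err, strconv.ErrRange) {
		return 0, fmt.Errorf("error[P045]: integer literal %s out of range for int64", lit)
	}
	return v, err
}

func main() {
	for _, lit := range []string{"007", "0xFF", "0b1010", "0o17", "9223372036854775808"} {
		v, err := parseIntLiteral(lit)
		fmt.Println(lit, v, err)
	}
}
```

The last input is one past the largest int64, so it takes the P045 path.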

String literals

Strings are double-quoted single-line literals. StringChar excludes raw Newline characters, so a " followed by an unescaped line break is error[P044], not a silent multi-line string. Multi-line text uses concatenation:

let msg = "first line\n" +
"second line\n"

After lexing, the parser passes every String token through participle.Unquote (which calls strconv.Unquote). The accepted escape sequences are those of Go's strconv.Unquote:

Escape                 Meaning
------                 -------
\\                     Backslash
\"                     Double quote
\a \b \f \n \r \t \v   Alert, backspace, form feed, newline, carriage return, tab, vertical tab
\xHH                   One-byte hex escape, exactly two hex digits
\uHHHH                 16-bit Unicode escape, exactly four hex digits
\UHHHHHHHH             32-bit Unicode escape, exactly eight hex digits

Any other backslash-prefixed character is error[P041]. The brace forms used by Rust and JavaScript (\u{...}) are not supported; the fixed-width forms \uHHHH and \UHHHHHHHH are the canonical spellings.

An unterminated string literal is error[P040] at the position of the opening quote. There is no raw-string form, no triple-quoted string, and no string interpolation. Concatenation is +.
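
The escape rules in this section can be exercised directly against strconv.Unquote. The wrapper below is a sketch: unquoteLiteral is a hypothetical name, and it assumes the token has already matched the String rule, because strconv.Unquote on its own also accepts inputs that rule does not (backquoted literals, for example).

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
)

// unquoteLiteral (hypothetical name) decodes a String token and maps a
// syntax failure to P041.
func unquoteLiteral(lit string) (string, error) {
	s, err := strconv.Unquote(lit)
	if errors.Is(err, strconv.ErrSyntax) {
		return "", fmt.Errorf("error[P041]: invalid escape sequence in %s", lit)
	}
	return s, err
}

func main() {
	lits := []string{
		`"tab\there"`,
		`"\x41\u00e9\U0001F600"`,
		`"bad \q"`,  // unknown escape
		`"\u{41}"`,  // brace form is not supported
	}
	for _, lit := range lits {
		s, err := unquoteLiteral(lit)
		fmt.Printf("%s -> %q, %v\n", lit, s, err)
	}
}
```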

Comments

LineComment = ('//' | '#') (!Newline .)*
BlockComment = '/*' (!'*/' .)* '*/'

Block comments do not nest. The first */ ends the comment. In /* outer /* inner */ tail */ the inner */ closes the comment and the trailing tail */ is parsed as ordinary source, which yields a parse error on the */.

An unterminated block comment is error[P042] at the position of the opening /*.
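
The non-nesting rule amounts to a search for the first */ after the opener. A minimal sketch, with scanBlockComment as a hypothetical name:

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// scanBlockComment (hypothetical name) returns the block comment at the
// start of src. The first */ after the opening /* closes it, so
// comments do not nest.
func scanBlockComment(src string) (string, error) {
	if !strings.HasPrefix(src, "/*") {
		return "", errors.New("not a block comment")
	}
	end := strings.Index(src[2:], "*/")
	if end < 0 {
		return "", errors.New("error[P042]: unterminated block comment")
	}
	return src[:2+end+2], nil
}

func main() {
	c, err := scanBlockComment("/* outer /* inner */ tail */")
	fmt.Printf("%q %v\n", c, err)

	_, err = scanBlockComment("/* never closed")
	fmt.Println(err)
}
```

Searching from offset 2 keeps the opener's own * from pairing with a following /, so /*/ is unterminated rather than a complete comment.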

The doc-comment attachment pass (parser/docs.go) re-reads the source after parsing and attaches each preceding line comment to the following declaration's Doc field. Line comments are otherwise elided before parsing.

Position fidelity

Every emitted lexer.Token carries a Pos with Filename, Offset, Line, Column. Positions point at the start of the token. Multi-byte runes count as one column at the position of their first byte. Tabs count as one column. CRLF newlines advance the line counter at the \n; the \r is consumed without resetting the column.
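
The position rules can be sketched as a function from a byte offset to a line and column. The Pos struct here is a simplified stand-in for the real one (it omits Filename), posAt is a hypothetical name, and the sketch assumes the offset falls on a rune boundary. In a CRLF pair it leaves the column unchanged at the \r, which is one reading of "consumed without resetting the column".

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// Pos is a simplified stand-in for the lexer's position type.
type Pos struct {
	Offset, Line, Column int
}

// posAt (hypothetical name) computes the position of the byte at offset.
func posAt(src string, offset int) Pos {
	p := Pos{Offset: offset, Line: 1, Column: 1}
	for i := 0; i < offset && i < len(src); {
		r, size := utf8.DecodeRuneInString(src[i:])
		switch {
		case r == '\n':
			p.Line++
			p.Column = 1
		case r == '\r' && i+1 < len(src) && src[i+1] == '\n':
			// CRLF: the line advances at the \n that follows.
		case r == '\r':
			p.Line++ // a lone \r is a newline on its own
			p.Column = 1
		default:
			p.Column++ // tabs and multi-byte runes are one column each
		}
		i += size
	}
	return p
}

func main() {
	src := "a\tb\r\nc\u00e9 d\rz"
	for _, off := range []int{2, 5, 9, 11} {
		fmt.Printf("%+v\n", posAt(src, off))
	}
}
```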

Pre-lex and post-lex passes

The lexer is the participle simple table bracketed by two side-effect-free passes:

  1. Pre-lex strips a leading BOM (or rejects a BOM elsewhere with P047) and performs a byte-level scan for shapes the simple regex cannot describe:
    • unterminated string literals (P040),
    • unterminated block comments (P042),
    • numeric literals adjacent to identifier characters (P043),
    • raw newline inside a string (P044),
    • bare base prefixes with no digits (P046).
  2. Post-capture translates participle's opaque error messages into the stable code surface: strconv.ParseInt: ... value out of range becomes P045, invalid quoted string ... becomes P041, strconv.ParseFloat: ... value out of range becomes P048.

Both passes share their diagnostic templates with parser/errors.go so the codes are stable across releases.

Diagnostic codes

Code  Class  Trigger
----  -----  -------
P040  Lexer  Unterminated string literal
P041  Lexer  Invalid escape sequence in string literal
P042  Lexer  Unterminated block comment
P043  Lexer  Missing whitespace between numeric literal and identifier (or underscore separator)
P044  Lexer  Raw newline inside string literal
P045  Lexer  Integer literal out of range for int64
P046  Lexer  Incomplete numeric literal: missing digits after base prefix
P047  Lexer  Byte-order mark at a position other than offset 0
P048  Lexer  Float literal out of range for IEEE-754 binary64

P049 is the next free code in the lexer band. Codes P060+ belong to the grammar's post-parse layer (see MEP 2).

Tokenize API

Mochi exposes parser.Tokenize(filename, src) ([]lexer.Token, error) for tooling: formatters, linters, editors, language servers. It returns the same token stream the parser consumes. Token kinds are the nine class names from §Token classes. The pre-lex validator runs first; ill-formed input returns an error of the same shape and code the parser would produce.

Design principles

One source, one token stream

Token-class ordering, longest-match resolution, and rule-list order are part of the spec. A consumer can predict the exact token stream from the source bytes without consulting the implementation.

No corner cases

A lexer that splits ambiguous input into two adjacent tokens hides the user's mistake. Every shape that is meaningful matches exactly one rule. Every shape that is not meaningful is a diagnostic with a stable code. The pre-lex pass exists to make this true for shapes the simple regex table cannot describe (unterminated comments, numeric-adjacency, BOM-at-offset-N), not to relax the rules.

Identifier discipline

Hard keywords are reserved everywhere; soft keywords are reserved only at the production that consumes them. The hard list is short (33 words) and the soft list is bounded by the grammar shapes that need it. New surface vocabulary defaults to soft unless it appears in a top-level statement.

Stable diagnostic codes

Every lexer rejection has a unique P0xx code. A downstream consumer (a code action, a quick-fix, a CI assertion) can match on the code without parsing the message text. Codes are append-only: a retired code is never reused.

Negation lives in the parser

-1 lexes as two tokens. A single-token negative literal would break subtraction in postfix contexts (xs[len(xs)-1]); the expression grammar in MEP 2 handles unary minus at the operand level, with structural symmetry on both sides of every binary operator.

Conformance

A conforming lexer must produce the documented token stream for every shape this MEP marks as accepted and reject every shape it marks as rejected with the documented diagnostic code. The obligations below are stated in terms of the language; any test corpus that exercises them satisfies this MEP.

Token coverage

A conforming lexer must produce the expected token class for at least one example of each of the following:

  1. Hard keywords. Each of the 33 reserved words lexes to Keyword, never to Ident.
  2. Soft keywords. Each soft keyword lexes to Ident outside its production, and the resulting token round-trips as a bare identifier in any context where an identifier is accepted.
  3. Integer literals. DecInt, HexInt, BinInt, OctInt, including the boundary cases: zero, a literal with leading zeros (007 is 7), a single digit after a base prefix (0x1, 0b1, 0o1), and mixed-case base prefixes (0X, 0B, 0O).
  4. Float literals. 1.0, scientific notation with positive and negative exponents (1e10, 1.5e-2, 0.5e+5), and rejection of 1. and .1.
  5. String literals. Each escape form listed in §String literals, the empty string "", and rejection of raw newlines inside a string.
  6. Punctuation. Every two-character form (==, !=, <=, >=, &&, ||, =>, :-, ..) and every single-character form.
  7. Boolean literals. true and false lex to Bool, not to Keyword or Ident.
  8. Comments. Both line forms (//, #) and the block form (/* ... */). The block form must accept /***/, /*a**/, and /* /, */ */ as single tokens. Block comments do not nest: the first */ closes.

Position fidelity

A conforming lexer must report token positions that satisfy:

  • Line numbers advance correctly across \n, \r\n, and \r.
  • A tab counts as exactly one column.
  • A multi-byte rune counts as exactly one column at the position of its first byte.

Byte-order mark

  • A BOM at offset 0 is stripped and the rest of the source is lexed normally.
  • A source containing only a BOM is a valid empty program.
  • A BOM at any other position is rejected with P047.

Diagnostic obligations

A conforming lexer must emit a positioned diagnostic with the documented code for each of the rejected shapes:

Input shape                                      Code
-----------                                      ----
Unterminated string literal                      P040
Invalid escape sequence                          P041
Unterminated block comment                       P042
Numeric literal adjacent to an identifier char   P043
Raw newline inside a string literal              P044
Integer literal out of int64 range               P045
Base prefix with no following digits             P046
Byte-order mark at offset other than 0           P047
Float literal out of IEEE-754 binary64 range     P048

The position must point at the start of the offending construct. The wording of the diagnostic message is not normative; only the code and position are.

Robustness obligations

A conforming lexer must terminate on every input. For any byte sequence it must produce either a valid token stream or one of the diagnostics above; it must not panic, hang, or return a null result alongside a null error. Random-byte fuzzing, run for the duration the language community standardises on (currently ten seconds per release), must surface no counterexamples to this invariant.

Open questions

  • Nested block comments. A literal */ inside a block comment terminates it early. Switching to a hand-written tokenizer would let comments nest. Low priority; the current behaviour is pinned by block_comment_no_nesting.
  • Triple-quoted strings. A """ literal that allows raw newlines and ends only at the matching """ would let users write long strings without + concatenation. The cost is a new escape surface and a third string form to learn.
  • Numeric separators. 1_000_000 is rejected today by P043. Adding underscores between digits as a readability aid is cheap; the question is whether the surface complexity is worth it for a language that already supports hex, binary, and octal literals.

References

This document is placed in the public domain.