August 22, 2021
A lexer takes the source code, a sequence of characters, and group them into tokens. e.g., it makes the first decision on how to process the strings 100-10, -100-10, and -100--100 into groups. I’m going to call this grouping “tokenization” even though I may be misusing the term.
Tokenizing source code is hard. How should -100--100 be tokenized? Should it be a literal -100 followed by the minus token, followed by another -100?