Token Classes (Classes)
In programming languages: Identifiers, Keywords, "(", ")", Numbers, Operators, ...
Each class corresponds to a set of strings:
- Identifiers: a string of letters or digits (starting with a letter)
- Integer: a non-empty string of digits
- Keywords: a fixed set of reserved words
- Whitespace: a non-empty string of blanks, newlines, tabs, ...
Each punctuation mark is usually a token class by itself, e.g. the classes "(", ")", ";"
Goal of lexical analysis: classify substrings of the program according to their roles (token classes) and communicate tokens (the pairs shown below) to the parser
The words in the program are called lexemes, so the analyzer classifies lexemes into their corresponding token classes.
Everything in lexical analysis is a string. The analyzer sends the parser pairs such as <Id, "foo"> (Id stands for identifier); each such pair is called a token.
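The classification step can be sketched in Python. This is only an illustration, not a full lexer: the name tokenize, the class names, the keyword set {if, else, while} and the operator set are all assumptions made for the example. Python's re module writes union as | where these notes write +.

```python
import re

# Token classes as (class name, pattern) pairs.
# Keywords are listed before identifiers so that "if" is not classified as Id.
# The keyword and operator sets are assumptions for illustration only.
TOKEN_SPEC = [
    ("Keyword",    r"(?:if|else|while)\b"),
    ("Id",         r"[A-Za-z][A-Za-z0-9]*"),
    ("Integer",    r"[0-9]+"),
    ("Whitespace", r"[ \t\n]+"),
    ("LParen",     r"\("),
    ("RParen",     r"\)"),
    ("Semicolon",  r";"),
    ("Operator",   r"==|[=+\-*/<>]"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))


def tokenize(source):
    """Return a list of (token class, lexeme) pairs for source."""
    tokens = []
    pos = 0
    while pos < len(source):
        m = MASTER.match(source, pos)
        if m is None:
            raise ValueError(f"unexpected character {source[pos]!r} at {pos}")
        # m.lastgroup is the name of the alternative that matched.
        tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens
```

Calling tokenize("foo") yields the single pair ("Id", "foo"), which corresponds to the token <Id, "foo">. A real lexer would normally drop Whitespace tokens before handing the stream to the parser.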
LA Examples
LA is a left-to-right scan.
- The lexical analyzer sometimes has to look ahead (FORTRAN, where blanks are insignificant)
DO 5 I = 1,25 -> a loop in FORTRAN
DO 4 I = 1.25 -> an assignment to the variable DO4I
Both statements look the same up to the "," or ".", so the LA cannot tell whether DO is a keyword until it has looked ahead that far. Some lookahead is therefore unavoidable, but it should be minimized and kept bounded to reduce the analyzer's workload.
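A rough sketch of that lookahead decision in Python. The function name classify_do_statement is hypothetical, and the rule it applies (a comma outside parentheses after "=" means a loop) is a simplification of what a real FORTRAN lexer has to do.

```python
def classify_do_statement(line):
    """Classify a FORTRAN-style statement that starts with DO.

    Blanks are insignificant in fixed-form FORTRAN, so they are removed first.
    Returns "loop", "assignment", or "other".
    """
    text = line.replace(" ", "")
    if not text.startswith("DO") or "=" not in text:
        return "other"
    # Look ahead past '=': a comma outside parentheses separates loop bounds.
    rhs = text.split("=", 1)[1]
    depth = 0
    for ch in rhs:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        elif ch == "," and depth == 0:
            return "loop"
    return "assignment"
```

The point of the sketch is that the decision about the first two characters (is DO a keyword?) depends on a character that may appear arbitrarily far to the right.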
Regular Languages
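Regular expressions specify regular languages
- Atomic: 'c' = { "c" } (single-character string), ε = { "" } (empty string)
- Compound
  - Union: A + B = { s | s in A or s in B }
  - Concatenation: AB = { ab | a in A and b in B }
  - Iteration: A* = union of A^i for all i >= 0, where A^0 = { "" }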
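The three compound operations can be tried out on small finite languages represented as Python sets of strings. The function names are made up for this sketch, and because A* is infinite in general, iterate only builds the approximation A^0 ∪ A^1 ∪ ... ∪ A^max_reps.

```python
def union(a, b):
    """A + B: every string that is in A or in B."""
    return a | b


def concat(a, b):
    """AB: every string xy with x in A and y in B."""
    return {x + y for x in a for y in b}


def iterate(a, max_reps):
    """Bounded stand-in for A*: the union of A^0 .. A^max_reps."""
    result = {""}   # A^0 contains only the empty string
    power = {""}
    for _ in range(max_reps):
        power = concat(power, a)   # A^(i+1) = A^i A
        result |= power
    return result
```

For example, concat({"a", "b"}, {"c"}) gives {"ac", "bc"}, and iterate({"a"}, 2) gives {"", "a", "aa"}.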
A language is a set of strings over some alphabet, so a language is always defined relative to an alphabet.
Formal Languages
A regular language is one kind of formal language.
Meaning function L(x): maps notation (expressions) to meanings
a.k.a. mapping syntax to semantics (a many-to-one mapping: several expressions can have the same meaning)
This is the basis for optimization, because we can replace a harder expression with an easier one (faster to run, etc.) that has the same meaning.
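The many-to-one property can be checked experimentally. The sketch below uses a hypothetical helper, meaning, as a bounded stand-in for L: it enumerates every string up to a fixed length over a small alphabet and keeps those that the expression matches. Python's re syntax is used, so union is written | instead of +.

```python
import re
from itertools import product


def meaning(regex, alphabet="01", max_len=4):
    """Bounded stand-in for L(regex): strings up to max_len that fully match."""
    strings = (
        "".join(chars)
        for n in range(max_len + 1)
        for chars in product(alphabet, repeat=n)
    )
    return {s for s in strings if re.fullmatch(regex, s)}


# Two different expressions (syntax) with the same meaning (semantics).
same = meaning("00*") == meaning("0+")
```

Here same is True: 00* and 0+ are different notations for the same set of strings, so either one can stand in for the other. Agreement up to max_len is evidence rather than a proof of equality.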
Lexical Specifications
A^+ = AA* (one or more repetitions of A)
letter = [a-zA-Z]
digit = [0-9]
Identifier = letter(letter + digit)*
Integers = digit^+
Whitespace = ( ' ' + '\n' + '\t' )^+
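These definitions carry over almost directly to Python's re syntax, which writes union as | rather than +. The variable names and the matches helper below are choices made for this sketch only.

```python
import re

# Named fragments, composed the same way as in the specification.
letter = r"[a-zA-Z]"
digit = r"[0-9]"
identifier = rf"{letter}(?:{letter}|{digit})*"   # letter(letter + digit)*
integer = rf"{digit}+"                           # digit^+
whitespace = r"(?:[ ]|\n|\t)+"                   # (' ' + '\n' + '\t')^+


def matches(spec, s):
    """True if the whole string s belongs to the language of spec."""
    return re.fullmatch(spec, s) is not None
```

For instance, matches(identifier, "foo42") holds, while matches(identifier, "4foo") does not, because an identifier must start with a letter.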
Numbers in Pascal:
digits = digit^+
opt_fraction = ('.' digits) + ε // "+ ε" marks the part as optional; this can also be written with "?"
             = ('.' digits)?
opt_exponent = ( 'E' ( '+' + '-' + ε ) digits ) + ε
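A sketch of these definitions in Python's re syntax. It assumes that a complete number is the concatenation digits opt_fraction opt_exponent, which these notes do not state explicitly. The names num and is_number are introduced only for this example.

```python
import re

digits = r"[0-9]+"
opt_fraction = rf"(?:\.{digits})?"        # ('.' digits) + ε
opt_exponent = rf"(?:E[+-]?{digits})?"    # ('E' ('+' + '-' + ε) digits) + ε

# Assumption: a number is the three parts concatenated in this order.
num = rf"{digits}{opt_fraction}{opt_exponent}"


def is_number(s):
    """True if the whole string s matches the num pattern."""
    return re.fullmatch(num, s) is not None
```

Under this assumption "42", "3.14" and "6.02E+23" are accepted, while ".5" and "1." are rejected, because digits must appear both before the fraction and after the '.'.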