CSCE 531 Compiler Construction Ch 4 Lexical Analysis (2024)

CSCE 531 Compiler Construction Ch 4 Lexical Analysis (1)

CSCE 531 Compiler Construction Ch. 4: Lexical Analysis Spring 2010 Marco Valtorta mgv@cse. sc. edu UNIVERSITY OF SOUTH CAROLINA Department of Computer Science and Engineering

CSCE 531 Compiler Construction Ch 4 Lexical Analysis (2)

Acknowledgment • The slides are based on the textbook and other sources, including slides from Bent Thomsen’s course at the University of Aalborg in Denmark and several other fine textbooks • The three main other compiler textbooks I considered are: – Aho, Alfred V. , Monica S. Lam, Ravi Sethi, and Jeffrey D. Ullman. Compilers: Principles, Techniques, & Tools, 2 nd ed. Addison-Welsey, 2007. (The “dragon book”) – Appel, Andrew W. Modern Compiler Implementation in Java, 2 nd ed. Cambridge, 2002. (Editions in ML and C also available; the “tiger books”) – Grune, Dick, Henri E. Bal, Ceriel J. H. Jacobs, and UNIVERSITY OF SOUTH CAROLINA Department of Computer Science and Engineering Koen G. Langendoen. Modern Compiler Design.

CSCE 531 Compiler Construction Ch 4 Lexical Analysis (3)

Quick review • Syntactic analysis – Prepare the grammar • Grammar transformations – Left-factoring – Left-recursion removal – Substitution – (Lexical analysis) • This lecture – Parsing - Phrase structure analysis • • Group words into sentences, paragraphs and complete programs Top-Down and Bottom-Up Recursive Decent Parser Construction of AST Note: You will need (at least) two grammars • One for Humans to read and understand • (may be ambiguous, left recursive, have more productions than necessary, …) • One for constructing the parser UNIVERSITY OF SOUTH CAROLINA Department of Computer Science and Engineering

CSCE 531 Compiler Construction Ch 4 Lexical Analysis (4)

Textbook vs. Handout • The textbook [Watt and Brown, 2000] does not take advantage of the fact that the lexical structure of a language is described by a regular grammar, but it does lexical analysis just like parsing, i. e. building a parser for a context-free grammar • These slides are a good complement to the Appel’s chapter 2 (handout) UNIVERSITY OF SOUTH CAROLINA Department of Computer Science and Engineering

CSCE 531 Compiler Construction Ch 4 Lexical Analysis (5)

The “Phases” of a Compiler Source Program Syntax Analysis Error Reports Abstract Syntax Tree Contextual Analysis Error Reports Decorated Abstract Syntax Tree Code Generation Object Code UNIVERSITY OF SOUTH CAROLINA Department of Computer Science and Engineering

CSCE 531 Compiler Construction Ch 4 Lexical Analysis (6)

Syntax Analysis: Scanner Dataflow chart Source Program Stream of Characters Scanner Error Reports Stream of “Tokens” Parser Error Reports Abstract Syntax Tree UNIVERSITY OF SOUTH CAROLINA Department of Computer Science and Engineering

CSCE 531 Compiler Construction Ch 4 Lexical Analysis (7)

1) Scan: Divide Input into Tokens An example mini Triangle source program: let var y: Integer in !new year y : = y+1 Tokens are “words” in the input, for example keywords, operators, identifiers, literals, etc. scanner let . . . ident. y var ident. y becomes : = colon : ident. y UNIVERSITY OF SOUTH CAROLINA op. + ident. Integer in in intlit 1 eot . . . Department of Computer Science and Engineering

CSCE 531 Compiler Construction Ch 4 Lexical Analysis (8)

Developing RD Parser for Mini Triangle Last Lecture we just said: • The following non-terminals are recognized by the scanner • They will be returned as tokens by the scanner Identifier : = Letter (Letter|Digit)* Integer-Literal : : = Digit* Operator : : = + | - | * | / | < | > | = Comment : : = ! Graphic* eol Assume scanner produces instances of: public class Token { byte kind; String spelling; final static byte IDENTIFIER = 0, INTLITERAL = 1; UNIVERSITY OF SOUTH CAROLINA . . . Department of Computer Science and Engineering

CSCE 531 Compiler Construction Ch 4 Lexical Analysis (9)

And this is where we need it public class Parser { private Token current. Token; private void accept(byte expected. Kind) { if (current. Token. kind == expected. Kind) current. Token = scanner. scan(); else report syntax error } private void accept. It() { current. Token = scanner. scan(); } public void parse() { accept. It(); //Get the first token parse. Program(); if (current. Token. kind != Token. EOT) report syntax error } . . . UNIVERSITY OF SOUTH CAROLINA Department of Computer Science and Engineering

CSCE 531 Compiler Construction Ch 4 Lexical Analysis (10)

Steps for Developing a Scanner 1) Express the “lexical” grammar in EBNF (do necessary transformations) 2) Implement Scanner based on this grammar (details explained later) 3) Refine scanner to keep track of spelling and kind of currently scanned token. To save some time we’ll do step 2 and 3 at once this time UNIVERSITY OF SOUTH CAROLINA Department of Computer Science and Engineering

CSCE 531 Compiler Construction Ch 4 Lexical Analysis (11)

Systematic Development of Scanner (1) Express (lexical) grammar in EBNF, performing the needed transformations (2) Create a scanning method scan. N for each non terminal N (3) Create a scanner class with – private variable current. Char – private methods : take and take. It – the private scanning methods implemented in step (2) – add private parse. N method for each non terminal N, enhanced to record each token’s kind and spelling – public scan method that scans Separator* Token, discarding any separators but returning the token UNIVERSITY OF SOUTH CAROLINA Department of Computer Science and Engineering that follows them

CSCE 531 Compiler Construction Ch 4 Lexical Analysis (12)

Developing a Scanner Token : : = Identifier | Integer-Literal | Operator | ; | : = | ~ | ( | ) | eot • Express the “lexical” grammar in EBNF Identifier : : = Letter (Letter | Digit)* Integer-Literal : : = Digit* Operator : : = + | - | * | / | < | > | = Separator : : = Comment | space | eol Comment : : = ! Graphic* eol Now perform substitution and left factorization. . . Token : : = Letter (Letter | Digit)* | Digit* | + | - | * | / | < | > | = | ; | : (=|e) | ~ | ( | ) | eot Separator : : = ! Graphic* eol | space | eol Substitutions reduce the number of needed methods for efficiency. . . UNIVERSITY OF SOUTH CAROLINA Department of Computer Science and Engineering

CSCE 531 Compiler Construction Ch 4 Lexical Analysis (13)

Developing a Scanner Implementation of the scanner public class Scanner { private char current. Char; private String. Buffer current. Spelling; private byte current. Kind; private char take(char expected. Char) {. . . } private char take. It() {. . . } // other private auxiliary methods and scanning // methods here. public Token scan() {. . . } } UNIVERSITY OF SOUTH CAROLINA Department of Computer Science and Engineering

CSCE 531 Compiler Construction Ch 4 Lexical Analysis (14)

Developing a Scanner The scanner will return instances of Token: public class Token { byte kind; String spelling; final static byte IDENTIFIER = 0; INTLITERAL = 1; OPERATOR = 2; BEGIN = 3; CONST = 4; . . . public Token(byte kind, String spelling) { this. kind = kind; this. spelling = spelling; if spelling matches a keyword change my kind automatically } . . . } UNIVERSITY OF SOUTH CAROLINA Department of Computer Science and Engineering

CSCE 531 Compiler Construction Ch 4 Lexical Analysis (15)

Developing a Scanner public class Scanner { private char current. Char = get first source char; private String. Buffer current. Spelling; private byte current. Kind; private char take(char expected. Char) { if (current. Char == expected. Char) { current. Spelling. append(current. Char); current. Char = get next source char; } else report lexical error } private char take. It() { current. Spelling. append(current. Char); current. Char = get next source char; }. . . UNIVERSITY OF SOUTH CAROLINA Department of Computer Science and Engineering

CSCE 531 Compiler Construction Ch 4 Lexical Analysis (16)

Developing a Scanner. . . public Token scan() { // Get rid of potential separators before // scanning a token while ( (current. Char == ‘!’) || (current. Char == ‘n’ ) ) scan. Separator(); current. Spelling = new String. Buffer(); current. Kind = scan. Token(); return new Token(currentkind, current. Spelling. to. String()); } private void scan. Separator() {. . . } private byte scan. Token() {. . . } . . . UNIVERSITY OF SOUTH CAROLINA Developed much in the same way as parsing methods Department of Computer Science and Engineering

CSCE 531 Compiler Construction Ch 4 Lexical Analysis (17)

Developing a Scanner Token : : = Letter (Letter | Digit)* | Digit* | + | - | * | / | < | > | = | ; | : (=|e) | ~ | ( | ) | eot private byte scan. Token() { switch (current. Char) { case ‘a’: case ‘b’: . . . case ‘z’: case ‘A’: case ‘B’: . . . case ‘Z’: scan Letter (Letter | Digit)* return Token. IDENTIFIER; case ‘ 0’: . . . case ‘ 9’: scan Digit* return Token. INTLITERAL ; case ‘+’: case ‘-’: . . . : case ‘=’: take. It(); return Token. OPERATOR; . . . etc. . . } UNIVERSITY OF SOUTH CAROLINA Department of Computer Science and Engineering

CSCE 531 Compiler Construction Ch 4 Lexical Analysis (18)

Developing a Scanner Let’s look at the identifier case in more detail . . . return. . . case ‘a’: case ‘b’: . . . case ‘z’: case ‘A’: case ‘B’: . . . case ‘Z’: scan Letter (Letter | Digit)* scan Letter accept. It(); return(Letter | Digit)* scan while (is. Letter(current. Char) Token. IDENTIFIER; case ‘ 0’: . . . case ‘ 9’: return || is. Digit(current. Char) ) Token. IDENTIFIER; . . . case ‘ 0’: . . . case ‘ 9’: scan accept. It(); (Letter | Digit) . . . return Token. IDENTIFIER; case ‘ 0’: . . . case ‘ 9’: . . . Thus developing a scanner is a mechanical task. But before we look at doing that, we need some theory! UNIVERSITY OF SOUTH CAROLINA Department of Computer Science and Engineering

CSCE 531 Compiler Construction Ch 4 Lexical Analysis (19)

Developing a Scanner The scanner will return instances of Token: public class Token { byte kind; String spelling; final static byte IDENTIFIER = 0; INTLITERAL = 1; OPERATOR = 2; BEGIN = 3; CONST = 4; . . . public Token(byte kind, String spelling) { this. kind = kind; this. spelling = spelling; if spelling matches a keyword change my kind automatically } . . . } UNIVERSITY OF SOUTH CAROLINA Department of Computer Science and Engineering

CSCE 531 Compiler Construction Ch 4 Lexical Analysis (20)

Developing a Scanner The scanner will return instances of Token. The implementation b is the one in the Triangle source code. It is more complex that the on pp. 123 -124 of the textbook. public class Token {. . . public Token(byte kind, String spelling) { if (kind == Token. IDENTIFIER) { int current. Kind = first. Reserved. Word; boolean searching = true; while (searching) { int comparison = token. Table[current. Kind]. compare. To(spelling); if (comparison == 0) { this. kind = current. Kind; searching = false; } else if (comparison > 0 || current. Kind == last. Reserved. Word) { this. kind = Token. IDENTIFIER; searching = false; } else { current. Kind ++; } } } else this. kind = kind; . . . } UNIVERSITY OF SOUTH CAROLINA Department of Computer Science and Engineering

CSCE 531 Compiler Construction Ch 4 Lexical Analysis (21)

Developing a Scanner The scanner will return instances of Token: public class Token {. . . private static String[] token. Table = new String[] { "<int>", "<char>", "<identifier>", "<operator>", "array", "begin", "const", "do", "else", "end", "func", "if", "in", "let", "of", "proc", "record", "then", "type", "var", "while", ": ", "; ", ", ", ": =", "~", "(", ")", "[", "]", "{", "}", "<error>" }; private final static int first. Reserved. Word = Token. ARRAY, last. Reserved. Word = Token. WHILE; . . . } UNIVERSITY OF SOUTH CAROLINA Department of Computer Science and Engineering

CSCE 531 Compiler Construction Ch 4 Lexical Analysis (22)

Generating Scanners Generation of scanners is based on • Regular Expressions: to describe the tokens to be recognized • Finite State Machines: an execution model to which REs Recap: Regular Expressions are “compiled” e The empty string t Generates only the string t XY Generates any string xy such that x is genera and y is generated by Y X | Y Generates any string which generated either by X or by Y X* The concatenation of zero or more strings generate by X (X) For grouping UNIVERSITY OF SOUTH CAROLINA Department of Computer Science and Engineering

CSCE 531 Compiler Construction Ch 4 Lexical Analysis (23)

Generating Scanners • Regular Expressions can be recognized by a finite state machine. (often used synonyms: finite automaton (acronym FA)) Definition: A finite state machine is an N-tuple (States, S, start, d , End) States A finite set of “states” S An “alphabet”: a finite set of symbols from which the strings we want to recognize are formed (for example: the ASCII char set) start A “start state” Start States d Transition relation d States x S. These are “arrows” between states labeled by a letter from the alphabet. End A set of final states. End States UNIVERSITY OF SOUTH CAROLINA Department of Computer Science and Engineering

CSCE 531 Compiler Construction Ch 4 Lexical Analysis (24)

Generating Scanners • Finite state machine: the easiest way to describe a Finite State Machine is by means Example: an FA that recognizes M r | M s of a picture: = initial state M M r s UNIVERSITY OF SOUTH CAROLINA = final state = non-final state Department of Computer Science and Engineering

CSCE 531 Compiler Construction Ch 4 Lexical Analysis (25)

Deterministic, and non-deterministic DFA • A FA is called deterministic (acronym: DFA) if for every state and every possible input symbol, there is only one possible transition to chose from. Otherwise it is called nondeterministic (NDFA or NFA). Q: Is this FSM deterministic or non-deterministic: M M r s UNIVERSITY OF SOUTH CAROLINA Department of Computer Science and Engineering

CSCE 531 Compiler Construction Ch 4 Lexical Analysis (26)

• Deterministic, and non-deterministic FA Theorem: every NDFA can be converted into an equivalent DFA. M M DFA ? M UNIVERSITY OF SOUTH CAROLINA r s Department of Computer Science and Engineering

CSCE 531 Compiler Construction Ch 4 Lexical Analysis (27)

Deterministic, and non-deterministic FA • Theorem: every NDFA can be converted into an equivalent DFA. Algorithm: The basic idea: DFA is defined as a machine that does a “parallel simulation” of the NDFA. • The states of the DFA are subsets of the states of the NDFA (i. e. every state of the DFA is a set of states of the NDFA) => This state can be interpreted as meaning “the simulated DFA is now in any of these states” UNIVERSITY OF SOUTH CAROLINA Department of Computer Science and Engineering

CSCE 531 Compiler Construction Ch 4 Lexical Analysis (28)

Deterministic, and non-deterministic FA Conversion algorithm example: M 1 r 2 M 3 r 4 r, s r r M {1} {2, 4} {3, 4} is a final state because 3 is a final state s {3, 4} r s {4} s UNIVERSITY OF SOUTH CAROLINA {2, 4} --r-->{3, 4} because: 2 --r--> 3 4 --r--> 4 Department of Computer Science and Engineering

CSCE 531 Compiler Construction Ch 4 Lexical Analysis (29)

FA with e moves (N)DFA-e automata are like (N)DFA. In an (N)DFA-e we are allowed to have transitions which are “e-moves”. Example: M r (M r)* M r e Theorem: every (N)DFA-e can be converted into an equivalent NDFA (without e-moves). M r r UNIVERSITY OF SOUTH CAROLINA M Department of Computer Science and Engineering

CSCE 531 Compiler Construction Ch 4 Lexical Analysis (30)

FA with e moves Theorem: every (N)DFA-e can be converted into an equivalent NDFA (without e-moves). convert into a final state e Algorithm: 1) converting states into final states: if a final state can be reached from a state S using an e-transition convert it into a final state. Repeat this rule until no more states can be converted. For example: convert into a final state e e 2 UNIVERSITY OF SOUTH CAROLINA 1 Department of Computer Science and Engineering

CSCE 531 Compiler Construction Ch 4 Lexical Analysis (31)

FA with e moves Algorithm: 1) converting states into final states. 2) adding transitions (repeat until no more can be added) a) for every transition followed by e-transition e t t add transition b) for every transition preceded by e-transition t e t add transition 3) delete all e-transitions UNIVERSITY OF SOUTH CAROLINA Department of Computer Science and Engineering

CSCE 531 Compiler Construction Ch 4 Lexical Analysis (32)

Converting a RE into an NDFA -e RE: e FA: RE: t FA: RE: XY FA: t X e UNIVERSITY OF SOUTH CAROLINA Y Department of Computer Science and Engineering

CSCE 531 Compiler Construction Ch 4 Lexical Analysis (33)

Converting a RE into an NDFA -e RE: X|Y FA: RE: X* FA: e X e e Y e e X e UNIVERSITY OF SOUTH CAROLINA Department of Computer Science and Engineering

CSCE 531 Compiler Construction Ch 4 Lexical Analysis (34)

FA and the implementation of Scanners • Regular expressions, (N)DFA-e and NDFA and DFA’s are all equivalent formalism in terms of what languages can be defined with them. • Regular expressions are a convenient notation for describing the “tokens” of programming languages. • Regular expressions can be converted into FA’s (the algorithm for conversion into NDFAe is straightforward) • DFA’s can be easily implemented as computer programs. UNIVERSITY OF SOUTH CAROLINA Department of Computer Science and Engineering

CSCE 531 Compiler Construction Ch 4 Lexical Analysis (35)

FA and the implementation of Scanners What a typical scanner generator does: Token definitions Regular expressions Scanner Generator A possible algorithm: - Convert RE into NDFA-e - Convert NDFA-e into NDFA - Convert NDFA into DFA - generate Java/C/. . . code UNIVERSITY OF SOUTH CAROLINA Scanner DFA Java or C or. . . note: In practice this exact algorithm is not used. For reasons of performance, sophisticated optimizations are used. • direct conversion from RE to DFA • minimizing the DFA Department of Computer Science and Engineering

CSCE 531 Compiler Construction Ch 4 Lexical Analysis (36)

Implementing a DFA Definition: A finite state machine is an N-tuple (States, S, start, d, End) States N different states => integers {0, . . , N-1} => int data type S byte or char data type. start An integer number d Transition relation d States x States. For a DFA this is a function States x S -> States Represented by a two dimensional array (one dimension for the current state, another for the current character. The contents of the array is the next state. End A set of final states. Represented (for example) by an array of Booleans (mark final state by true and other states by false) UNIVERSITY OF SOUTH CAROLINA Department of Computer Science and Engineering

CSCE 531 Compiler Construction Ch 4 Lexical Analysis (37)

Implementing a DFA public class Recognizer { static boolean[ ] final. State = final state table ; static int[ ][ ] delta = transition table ; private byte current. Char. Code = get first char ; private int current. State = start state ; public boolean recognize() { while (current. Char. Code is not end of file) && (current. State is not error state ) { current. State = delta[current. State][current. Char. Code]; current. Char. Code = get next char ; return final. State[current. State]; } } UNIVERSITY OF SOUTH CAROLINA Department of Computer Science and Engineering

CSCE 531 Compiler Construction Ch 4 Lexical Analysis (38)

Implementing a Scanner as a DFA Slightly different from previously shown implementation (but similar in spirit): • Not the goal to match entire input => when to stop matching? Match longest possible token before reaching error state. • How to identify matched token class (not just true|false) Final state determines matched token class UNIVERSITY OF SOUTH CAROLINA Department of Computer Science and Engineering

CSCE 531 Compiler Construction Ch 4 Lexical Analysis (39)

Implementing a Scanner as a DFA public class Scanner { static int[ ] matched. Token = maps state to token class static int[ ][ ] delta = transition table ; private byte current. Char. Code = get first char ; private int current. State = start state ; private int tokbegin = begining of current token ; private int tokend = end of current token private int token. Kind ; . . . UNIVERSITY OF SOUTH CAROLINA Department of Computer Science and Engineering

CSCE 531 Compiler Construction Ch 4 Lexical Analysis (40)

Implementing a Scanner as a DFA public Token scan() { skip separator (implemented as DFA as well) tokbegin = current source position token. Kind = error code while (current. State is not error state ) { if (current. State is final state ) { tokend = current source location ; token. Kind = matched. Token[current. State]; current. State = delta[current. State][current. Char. Code]; current. Char. Code = get next source char ; } if (token. Kind == error code ) report lexical error ; move current source position to tokend return new Token(token. Kind, source chars from tokbegin to tokend-1 ); } UNIVERSITY OF SOUTH CAROLINA Department of Computer Science and Engineering

CSCE 531 Compiler Construction Ch 4 Lexical Analysis (41)

We don’t do this by hand anymore! Writing scanners is a rather “robotic” activity which can be automated. • JLex (JFlex) – input: • a set of REs and action code – output • a fast lexical analyzer (scanner) – based on a DFA • Or the lexer is built into the parser generator as in Java. CC UNIVERSITY OF SOUTH CAROLINA Department of Computer Science and Engineering

CSCE 531 Compiler Construction Ch 4 Lexical Analysis (42)

JLex Lexical Analyzer Generator for Java Definition of tokens Regular Expressions JLex We will look at an example JLex specification (adapted from the manual). Consult the manual on the handout for details on how to write your own JLex specifications. Java File: Scanner Class Recognizes Tokens UNIVERSITY OF SOUTH CAROLINA Department of Computer Science and Engineering

CSCE 531 Compiler Construction Ch 4 Lexical Analysis (43)

The JLex tool Layout of JLex file: user code (added to start of generated file) User code is copied directly into the output class %% options JLex directives allow you to include code in the lexical analysis class, change names of various components, switch on character counting, line counting, manage EOF, etc. %{ user code (added inside the scanner class declaration) %} macro definitions Macro definitions gives names for useful regular expressions (regexps) %% lexical declaration Regular expression rules define the tokens to be recognised and actions to be taken UNIVERSITY OF SOUTH CAROLINA Department of Computer Science and Engineering

CSCE 531 Compiler Construction Ch 4 Lexical Analysis (44)

JLex Regular Expressions • Regular expressions are expressed using ASCII characters (0 – 127). • The following characters are metacharacters. ? * + | ( ) ^ $. [ ] { } “ • Metacharacters have special meaning; they do not represent themselves. • All other characters represent themselves. UNIVERSITY OF SOUTH CAROLINA Department of Computer Science and Engineering

CSCE 531 Compiler Construction Ch 4 Lexical Analysis (45)

JLex Regular Expressions • • • Let r and s be regular expressions. r? matches zero or one occurrences of r. r* matches zero or more occurrences of r. r+ matches one or more occurrences of r. r|s matches r or s. rs matches r concatenated with s. UNIVERSITY OF SOUTH CAROLINA Department of Computer Science and Engineering

CSCE 531 Compiler Construction Ch 4 Lexical Analysis (46)

JLex Regular Expressions • Parentheses are used for grouping. ("+"|"-")? • If a regular expression begins with ^, then it is matched only at the beginning of a line. • If a regular expression ends with $, then it is matched only at the end of a line. • The dot. matches any non-newline character. UNIVERSITY OF SOUTH CAROLINA Department of Computer Science and Engineering

CSCE 531 Compiler Construction Ch 4 Lexical Analysis (47)

JLex Regular Expressions • Brackets [ ] match any single character listed within the brackets. – [abc] matches a or b or c. – [A-Za-z] matches any letter. • If the first character after [ is ^, then the brackets match any character except those listed. – [^A-Za-z] matches any nonletter. UNIVERSITY OF SOUTH CAROLINA Department of Computer Science and Engineering

CSCE 531 Compiler Construction Ch 4 Lexical Analysis (48)

JLex Regular Expressions • A single character within double quotes " " represents itself. • Metacharacters lose their special meaning and represent themselves when they stand alone within single quotes. – "? " matches ? . UNIVERSITY OF SOUTH CAROLINA Department of Computer Science and Engineering

CSCE 531 Compiler Construction Ch 4 Lexical Analysis (49)

JLex Escape Sequences • Some escape sequences. – n matches newline. – b matches backspace. – r matches carriage return. – t matches tab. – f matches formfeed. • If c is not a special escape-sequence character, then c matches c. UNIVERSITY OF SOUTH CAROLINA Department of Computer Science and Engineering

CSCE 531 Compiler Construction Ch 4 Lexical Analysis (50)

The JLex tool: Example An example: import java_cup. runtime. *; %% %class Lexer %unicode %cup %line %column %state STRING %{. . . UNIVERSITY OF SOUTH CAROLINA Department of Computer Science and Engineering

CSCE 531 Compiler Construction Ch 4 Lexical Analysis (51)

The JLex tool %state STRING %{ String. Buffer string = new String. Buffer(); private Symbol symbol(int type) { return new Symbol(type, yyline, yycolumn); } private Symbol symbol(int type, Object value) { return new Symbol(type, yyline, yycolumn, value); } %}. . . UNIVERSITY OF SOUTH CAROLINA Department of Computer Science and Engineering

CSCE 531 Compiler Construction Ch 4 Lexical Analysis (52)

The JLex tool %} Line. Terminator = r|n|rn Input. Character = [^rn] White. Space = {Line. Terminator} | [ tf] /* comments */ Comment = {Traditional. Comment} | {End. Of. Line. Comment} | Traditional. Comment = "/*" {Comment. Content} "*"+ "/" End. Of. Line. Comment= "//"{Input. Character}* {Line. Terminator} Comment. Content = ( [^*] | *+ [^/*] )* Identifier = [: jletter: ] [: jletterdigit: ]* Dec. Integer. Literal = 0 | [1 -9][0 -9]* %%. . . UNIVERSITY OF SOUTH CAROLINA Department of Computer Science and Engineering

CSCE 531 Compiler Construction Ch 4 Lexical Analysis (53)

The JLex tool. . . %% <YYINITIAL> "abstract" { return symbol(sym. ABSTRACT); } <YYINITIAL> "boolean" { return symbol(sym. BOOLEAN); } <YYINITIAL> "break" { return symbol(sym. BREAK); } <YYINITIAL> { /* identifiers */ {Identifier} { return symbol(sym. IDENTIFIER); } /* literals */ {Dec. Integer. Literal} { return symbol(sym. INT_LITERAL); } . . . UNIVERSITY OF SOUTH CAROLINA Department of Computer Science and Engineering

CSCE 531 Compiler Construction Ch 4 Lexical Analysis (54)

The JLex tool . . . /* literals */ {Dec. Integer. Literal} { return symbol(sym. INT_LITERAL); } " { string. set. Length(0); yybegin(STRING); } /* operators */ "=" { return symbol(sym. EQ); } "==" { return symbol(sym. EQEQ); } "+" { return symbol(sym. PLUS); } /* comments */ {Comment} { /* ignore */ } /* whitespace */ {White. Space} { /* ignore */ } }. . . UNIVERSITY OF SOUTH CAROLINA Department of Computer Science and Engineering

CSCE 531 Compiler Construction Ch 4 Lexical Analysis (55)

The JLex tool. . . <STRING> { " { yybegin(YYINITIAL); return symbol(sym. STRINGLITERAL, string. to. String()); } [^nr"]+ { string. append( yytext() ); } \t { string. append('t'); } \n { string. append('n'); } \r { string. append('r'); } \" { string. append('"'); } \ { string. append(''); } } UNIVERSITY OF SOUTH CAROLINA Department of Computer Science and Engineering

CSCE 531 Compiler Construction Ch 4 Lexical Analysis (56)

JLex generated Lexical Analyser • Class Yylex – Name can be changed with %class directive – Default construction with one arg – the input stream • You can add your own constructors – The method performing lexical analysis is yylex() • Public Yytoken yylex() which return the next token • You can change the name of yylex() with %function directive – String yytext() returns the matched token string – Int yylenght() returns the length of the token – Int yychar is the index of the first matched char (if %char used) • Class Yytoken – Returned by yylex() – you declare it or supply one already defined – You can supply one with %type directive • Java_cup. runtime. Symbol is useful – Actions typically written to return Yytoken(…) UNIVERSITY OF SOUTH CAROLINA Department of Computer Science and Engineering

CSCE 531 Compiler Construction Ch 4 Lexical Analysis (57)

Java. io. Stream. Tokenizer • An alternative to JLex is to use the class Stream. Tokenizer from java. io • The class recognizes 4 types of lexical elements (tokens): • number (sequence of decimal numbers eventually starting with the –(minus) sign and/or containing the decimal point) • word (sequence of characters and digits starting with a character) • line separator • end of file UNIVERSITY OF SOUTH CAROLINA Department of Computer Science and Engineering

CSCE 531 Compiler Construction Ch 4 Lexical Analysis (58)

Java. io. Stream. Tokenizer tokens = new Stream. Tokenizer( input File); next. Token() method move a tokenizer to the next token_variable. next. Token() returns the token type as its value Stream. Tokenizer. TT_EOF : end-of-file reached Stream. Tokenizer. TT_NUMBER : a number was scanned; the value is saved in nval(double); if it is an integer, it needs to be typecasted into int ((int)tokens. nval) Stream. Tokenizer. TT_WORD : a word was scanned; the value is saved in sval(String) UNIVERSITY OF SOUTH CAROLINA Department of Computer Science and Engineering

CSCE 531 Compiler Construction Ch 4 Lexical Analysis (59)

Java. io. Stream. Tokenizer UNIVERSITY OF SOUTH CAROLINA Department of Computer Science and Engineering

CSCE 531 Compiler Construction Ch 4 Lexical Analysis (60)

UNIVERSITY OF SOUTH CAROLINA Department of Computer Science and Engineering

CSCE 531 Compiler Construction Ch 4 Lexical Analysis (61)

UNIVERSITY OF SOUTH CAROLINA Department of Computer Science and Engineering

CSCE 531 Compiler Construction Ch 4 Lexical Analysis (62)

UNIVERSITY OF SOUTH CAROLINA Department of Computer Science and Engineering

CSCE 531 Compiler Construction Ch 4 Lexical Analysis (63)

Conclusions • Don’t worry too much about DFAs • You do need to understand how to specify regular expressions • Note that different tools have different notations for regular expressions • You would probably only need to use JLex (Lex) if you use also use CUP (or Yacc or SML-Yacc) • The textbook [Watt and Brown, 2000] does not take advantage of the fact that the lexical structure of a language is described by a regular grammar, but it does lexical analysis just like parsing, i. e. building a parser for a context-free grammar • These slides are a good complement to the Appel’s chapter 2 (handout) UNIVERSITY OF SOUTH CAROLINA Department of Computer Science and Engineering

CSCE 531 Compiler Construction Ch 4 Lexical Analysis (2024)

References

Top Articles
Latest Posts
Recommended Articles
Article information

Author: Foster Heidenreich CPA

Last Updated:

Views: 5257

Rating: 4.6 / 5 (56 voted)

Reviews: 95% of readers found this page helpful

Author information

Name: Foster Heidenreich CPA

Birthday: 1995-01-14

Address: 55021 Usha Garden, North Larisa, DE 19209

Phone: +6812240846623

Job: Corporate Healthcare Strategist

Hobby: Singing, Listening to music, Rafting, LARPing, Gardening, Quilting, Rappelling

Introduction: My name is Foster Heidenreich CPA, I am a delightful, quaint, glorious, quaint, faithful, enchanting, fine person who loves writing and wants to share my knowledge and understanding with you.