Lexical Analysis Notes 1
Course: Compiler Design
Institution: PES University


Unit 1

Lexical Analysis

There are two methods to construct a scanner (lexical analyzer).

1. By hand, beginning with a diagram of what lexemes look like, then writing code to follow the diagram and return the corresponding token and possibly other information.
2. Feed the patterns describing the lexemes to a “lexer-generator”, which then produces the scanner. The historical lexer-generator is Lex; a more modern one is flex.

Note that the speed (of the lexer, not of the code generated by the compiler) and the error reporting/correction are typically much better for a handwritten lexer. As a result, most production-level compiler projects write their own lexers.

3.1: Role of the Lexical Analyzer

The lexer is called by the parser when the latter is ready to process another token. The lexer also might do some housekeeping such as eliminating whitespace and comments. Some call these tasks scanning, but others call the entire task scanning. After the lexer, individual characters are no longer examined by the compiler; instead tokens (the output of the lexer) are used.

3.1.1: Lexical Analysis Versus Parsing

Why separate lexical analysis from parsing? The reasons are basically software engineering concerns.

1. Simplicity of design. When one detects a well-defined subtask (produce the next token), it is often good to separate out the task (modularity).
2. Efficiency. With the task separated it is easier to apply specialized techniques.
3. Portability. Only the lexer need communicate with the outside.

3.1.2: Tokens, Patterns, and Lexemes

• A token is a pair: a token name and an optional attribute value. These are what the parser processes. The attribute might actually be a tuple of several attributes.
• A pattern describes the character strings for the lexemes of the token. For example, “a letter followed by a (possibly empty) sequence of letters and digits”.
• A lexeme for a token is a sequence of characters that matches the pattern for the token.
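A quick illustration of the three notions, using Python's re module to stand in for a pattern (a sketch; the pattern is the identifier pattern quoted above):

```python
import re

# Pattern for the token id: a letter followed by a (possibly empty)
# sequence of letters and digits.
id_pattern = re.compile(r"[A-Za-z][A-Za-z0-9]*")

# Every match of the pattern is a lexeme for the token id; the parser
# would see pairs such as ("id", attribute).
lexemes = id_pattern.findall("x1 = y + count2")
print(lexemes)   # ['x1', 'y', 'count2']
```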

Note the circularity of the definitions for lexeme and pattern. Common token classes.

1. One for each keyword. The pattern is trivial.
2. One for each operator or class of operators. A typical class is the comparison operators. Note that these have the same precedence. We might have + and - as the same token, but not + and *.
3. One for all identifiers (e.g. variables, user-defined type names, etc).
4. Constants (i.e., manifest constants) such as 6 or “hello”, but not a constant identifier such as “quantum” in the Java statement “static final int quantum = 3;”. There might be one token for integer constants, one for real, one for string, etc.
5. One for each punctuation symbol.

3.1.3: Attributes for Tokens

We saw an example of attributes in the last chapter. For tokens corresponding to keywords, attributes are not needed since the name of the token tells everything. But consider the token corresponding to integer constants. Just knowing that we have a constant is not enough; subsequent stages of the compiler need to know the value of the constant. Similarly for the token identifier we need to distinguish one identifier from another. The normal method is for the attribute to specify the symbol table entry for this identifier.

3.1.4: Lexical Errors

We saw in this movie an example where parsing got “stuck” because we reduced the wrong part of the input string. We also learned about FIRST sets that enabled us to determine which production to apply when we are operating left to right on the input. For predictive parsers the FIRST sets for a given nonterminal are disjoint and so we know which production to apply. In general the FIRST sets might not be disjoint, so we have to try all the productions whose FIRST set contains the lookahead symbol.

All the above assumed that the input was error free, i.e. that the source was a sentence in the language. What should we do when the input is erroneous and we get to a point where no production can be applied?
The simplest solution is to abort the compilation, stating that the program is wrong, perhaps giving the line number and location where the parser could not proceed. We would like to do better and at least find other errors. We could perhaps skip input up to a point where we can begin anew (e.g. after a statement-ending semicolon), or perhaps make a small change to the input around the lookahead so that we can proceed.

3.2: Input Buffering

Determining the next lexeme often requires reading the input beyond the end of that lexeme. For example, to determine the end of an identifier normally requires reading the first whitespace character after it. Also, just reading > does not determine the lexeme, as it could also be >=. When you determine the current lexeme, the characters you read beyond it may need to be read again to determine the next lexeme.
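The lookahead-and-push-back behavior just described can be sketched in a few lines of Python (the class and method names are mine, not the book's):

```python
# Sketch of lookahead with retraction: to delimit >= vs >, the scanner
# may read one character too many and must push it back.

class Input:
    def __init__(self, text):
        self.text, self.pos = text, 0

    def next_char(self):
        # "" acts as an end-of-input marker
        ch = self.text[self.pos] if self.pos < len(self.text) else ""
        self.pos += 1
        return ch

    def retract(self):
        self.pos -= 1

def scan_gt(inp):
    """Called after '>' was read: decide between > and >=."""
    if inp.next_char() == "=":
        return ">="
    inp.retract()      # the extra character belongs to the next lexeme
    return ">"

buf = Input("=1")      # input remaining after the '>' has been consumed
print(scan_gt(buf))    # >=
```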

3.2.1: Buffer Pairs

The book illustrates the standard programming technique of using two (sizable) buffers to solve this problem.

3.2.2: Sentinels

A useful programming improvement that combines testing for the end of a buffer with determining the character read.

3.3: Specification of Tokens

The chapter turns formal and, in some sense, the course begins. The book is fairly careful about finite vs infinite sets and also uses (without a definition!) the notion of a countable set. (A countable set is either a finite set or one whose elements can be put into one-to-one correspondence with the positive integers. That is, it is a set whose elements can be counted. The set of rational numbers, i.e., fractions in lowest terms, is countable; the set of real numbers is uncountable, because it is strictly bigger, i.e., it cannot be counted.) We should be careful to distinguish the empty set φ from the empty string ε. Formal language theory is a beautiful subject, but I shall suppress my urge to do it right and try to go easy on the formalism.

3.3.1: Strings and Languages

We will need a bunch of definitions.

Definition: An alphabet is a finite set of symbols.
Example: {0,1}, presumably φ (uninteresting), ascii, unicode, ebcdic, latin-1.

Definition: A string over an alphabet is a finite sequence of symbols from that alphabet. Strings are often called words or sentences.
Example: Strings over {0,1}: ε, 0, 1, 111010. Strings over ascii: ε, sysy, the string consisting of 3 blanks.

Definition: The length of a string is the number of symbols (counting duplicates) in the string.
Example: The length of allan, written |allan|, is 5.

Definition: A language over an alphabet is a countable set of strings over the alphabet.
Example: All grammatical English sentences with five, eight, or twelve words is a language over ascii. It is also a language over unicode.

Definition: The concatenation of strings s and t is the string formed by appending the string t to s. It is written st.

Example: εs = sε = s for any string s.

We view concatenation as a product (see Monoid in wikipedia: http://en.wikipedia.org/wiki/Monoid). It is thus natural to define s^0 = ε and s^(i+1) = s^i s.
Example: s^1 = s, s^4 = ssss.

More string terminology

A prefix of a string is a portion starting from the beginning, and a suffix is a portion ending at the end. More formally,

Definitions: A prefix of s is any string obtained from s by removing (possibly zero) characters from the end of s. A suffix is defined analogously, and a substring of s is obtained by deleting a prefix and a suffix.
Example: If s is 123abc, then (1) s itself and ε are each a prefix, suffix, and substring; (2) 12 and 123a are prefixes; (3) 3abc is a suffix; (4) 23a is a substring.

Definitions: A proper prefix of s is a prefix of s other than ε and s itself. Similarly, proper suffixes and proper substrings of s do not include ε and s.

Definition: A subsequence of s is formed by deleting (possibly zero) positions from s. We say positions rather than characters since s may for example contain 5 occurrences of the character Q and we only want to delete a certain 3 of them.
Example: issssii is a subsequence of Mississippi.

3.3.2: Operations on Languages

Definition: The union of L1 and L2 is simply the set-theoretic union, i.e., it consists of all words (strings) in either L1 or L2.
Example: The union of {grammatical English sentences with one, three, or five words} and {grammatical English sentences with two or four words} is {grammatical English sentences with five or fewer words}.

Definition: The concatenation of L1 and L2 is the set of all strings st, where s is a string of L1 and t is a string of L2. We again view concatenation as a product and write LM for the concatenation of L and M.
Examples: The concatenation of {a,b,c} and {1,2} is {a1,a2,b1,b2,c1,c2}. The concatenation of {a,b,c} and {1,2,ε} is {a1,a2,b1,b2,c1,c2,a,b,c}.
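The string powers and the language operations above can be modeled directly with Python strings and finite sets of strings (a small sketch, not from the text):

```python
# Modeling the definitions: strings as Python strings, languages as sets.

def power(s, i):
    """s^0 = ε, s^(i+1) = s^i s."""
    return s * i          # Python's repetition operator does exactly this

def concat(L, M):
    """LM: all strings st with s in L and t in M."""
    return {s + t for s in L for t in M}

print(power("s", 4))                                     # ssss
print(sorted(concat({"a", "b", "c"}, {"1", "2"})))       # the 6 strings a1..c2
print(sorted(concat({"a", "b", "c"}, {"1", "2", ""})))   # ε in M keeps a, b, c
```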

Definition: As with strings, it is natural to define powers of a language L: L^0 = {ε} (which is not φ), and L^(i+1) = L^i L.

Definition: The (Kleene) closure of L, denoted L*, is L^0 ∪ L^1 ∪ L^2 ∪ ...

Definition: The positive closure of L, denoted L+, is L^1 ∪ L^2 ∪ ...

Example: {0,1,2,3,4,5,6,7,8,9}+ gives all unsigned integers, but with some ugly versions. It has 3, 03, 000003. {0} ∪ ( {1,2,3,4,5,6,7,8,9} {0,1,2,3,4,5,6,7,8,9}* ) seems better.

In these notes I may write the closure operators inline as * and + rather than as superscripts, but that is strictly speaking wrong and I will not do it on the board or on exams or on lab assignments.

Example: {a,b}* is {ε,a,b,aa,ab,ba,bb,aaa,aab,aba,abb,baa,bab,bba,bbb,...}. {a,b}+ is {a,b,aa,ab,ba,bb,aaa,aab,aba,abb,baa,bab,bba,bbb,...}. {ε,a,b}* is {ε,a,b,aa,ab,ba,bb,...}. {ε,a,b}+ is the same as {ε,a,b}*.

The book gives other examples based on L={letters} and D={digits}, which you should read.

3.3.3: Regular Expressions

The idea is that the regular expressions over an alphabet consist of ε, the alphabet, and expressions using union, concatenation, and *, but it takes more words to say it right. For example, I didn't include (). Note that (A ∪ B)* is definitely not A* ∪ B* (* does not distribute over ∪), so we need the parentheses. The book's definition includes many () and is more complicated than I think is necessary. However, it has the crucial advantages of being correct and precise. The wikipedia entry doesn't seem to be as precise. I will try a slightly different approach, but note again that there is nothing wrong with the book's approach (which appears in both first and second editions, essentially unchanged).

Definition: The regular expressions and associated languages over an alphabet consist of

1. ε, the empty string; the associated language L(ε) is {ε}.
2. Each symbol x in the alphabet; L(x) is {x}.
3. rs for all regular expressions (REs) r and s; L(rs) is L(r)L(s).
4. r|s for all REs r and s; L(r|s) is L(r) ∪ L(s).
5. r* for all REs r; L(r*) is (L(r))*.
6. (r) for all REs r; L((r)) is L(r).

Parentheses, if present, control the order of operations. Without parentheses the following precedence rules apply.

The postfix unary operator * has the highest precedence. The book mentions that it is left associative. (I don't see how a postfix unary operator can be right associative, or how a prefix unary operator such as unary - could be left associative.) Concatenation has the second highest precedence and is left associative. | has the lowest precedence and is left associative.

The book gives various algebraic laws (e.g., associativity) concerning these operators. The reason we don't include the positive closure is that for any RE, r+ = rr*.

Homework: 3.6 a and b.

3.3.4: Regular Definitions

These will look like the productions of a context-free grammar we saw previously, but there are differences. Let Σ be an alphabet; then a regular definition is a sequence of definitions

d1 → r1
d2 → r2
...
dn → rn

where the d's are unique and not in Σ, and ri is a regular expression over Σ ∪ {d1,...,d(i-1)}. Note that each di can depend on all the previous d's.

Example: C identifiers can be described by the following regular definition.

letter_ → A | B | ... | Z | a | b | ... | z | _
digit → 0 | 1 | ... | 9
CId → letter_ ( letter_ | digit )*

Homework: 3.7 a, b (c is optional).

3.3.5: Extensions of Regular Expressions

There are many extensions of the basic regular expressions given above. The following three will be frequently used in this course, as they are particularly useful for lexical analyzers as opposed to text editors or string-oriented programming languages, which have more complicated regular expressions. All three are simply shorthand. That is, the set of possible languages generated using the extensions is the same as the set of possible languages generated without using the extensions.

1. One or more instances. This is the positive closure operator + mentioned above.

2. Zero or one instance. The unary postfix operator ? defined by r? = r | ε for any RE r.
3. Character classes. If a1, a2, ..., an are symbols in the alphabet, then [a1a2...an] = a1 | a2 | ... | an. In the special case where all the a's are consecutive, we can simplify the notation further to just [a1-an].

Examples:

C-language identifiers:
letter_ → [A-Za-z_]
digit → [0-9]
Id → letter_ ( letter_ | digit )*
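The character-class shorthand maps directly onto Python's re syntax, so the identifier definition can be checked mechanically (a sketch; the names come from the regular definition above):

```python
import re

# letter_ → [A-Za-z_], digit → [0-9], Id → letter_ ( letter_ | digit )*
letter_ = "[A-Za-z_]"
digit   = "[0-9]"
Id      = re.compile(f"{letter_}({letter_}|{digit})*")

print(bool(Id.fullmatch("_tmp42")))   # True
print(bool(Id.fullmatch("42tmp")))    # False: may not start with a digit
```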

Unsigned integer or floating point numbers:
digit → [0-9]
digits → digit+
number → digits (. digits)? (E [+-]? digits)?

3.4: Recognition of Tokens

The goal is to perform the lexical analysis needed for the following grammar.

stmt → if expr then stmt
     | if expr then stmt else stmt
     | ε
expr → term relop term    // relop is a relational operator: =, >, etc.
     | term
term → id
     | number

Recall that the terminals are the tokens, and the nonterminals produce terminals.

Lexeme          Token    Attribute
Whitespace      ws       —
if              if       —
then            then     —
else            else     —
An identifier   id       Pointer to table entry
A number        number   Pointer to table entry
<               relop    LT
<=              relop    LE
=               relop    EQ
<>              relop    NE
>               relop    GT
>=              relop    GE

We also want the lexer to remove whitespace, so we define a new token

ws → ( blank | tab | newline )+

where blank, tab, and newline are symbols used to represent the corresponding ascii characters.
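The patterns of the running example translate almost verbatim into Python regular expressions (my transcription, for experimentation; E is the exponent marker in number):

```python
import re

# ws → ( blank | tab | newline )+
ws     = re.compile(r"[ \t\n]+")
# number → digits (. digits)? (E [+-]? digits)?
number = re.compile(r"[0-9]+(\.[0-9]+)?(E[+-]?[0-9]+)?")
# relop: longer alternatives listed first so <= is preferred over <
relop  = re.compile(r"<=|<>|>=|<|>|=")

print(bool(number.fullmatch("3.14E-2")))  # True
print(relop.match("<=x").group())         # <=
```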

Recall that the lexer will be called by the parser when the latter needs a new token. If the lexer then recognizes the token ws, it does not return it to the parser but instead goes on to recognize the next token, which is then returned. Note that you can't have two consecutive ws tokens in the input because, for a given token, the lexer will match the longest lexeme starting at the current position that yields this token.

The table on the right summarizes the situation. For the parser all the relational ops are to be treated the same, so they are all the same token, relop. Naturally, other parts of the compiler will need to distinguish between the various relational ops so that appropriate code is generated. Hence, they have distinct attribute values.

3.4.1: Transition Diagrams

A transition diagram is similar to a flowchart for (a part of) the lexer. We draw one for each possible token. It shows the decisions that must be made based on the input seen. The two main components are circles representing states (think of them as decision points of the lexer) and arrows representing edges (think of them as the decisions made). The transition diagram (3.12 in the 1st edition, 3.13 in the 2nd) for relop is shown on the right.

1. The double circles represent accepting or final states, at which point a lexeme has been found. There is often an action to be done (e.g., returning the token), which is written to the right of the double circle.
2. If we have moved one (or more) characters too far in finding the token, one (or more) stars are drawn.
3. An imaginary start state exists and has an arrow coming from it to indicate where to begin the process.

It is fairly clear how to write code corresponding to this diagram. You look at the first character; if it is <, you read the next character. If that character is =, you return (relop,LE); if it is >, you return (relop,NE). If it is any other character, you return (relop,LT) and adjust the input buffer so that you will read this character again, since you have not used it for the current lexeme. If the first character was =, you return (relop,EQ).
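That paragraph transcribes directly into code. A sketch (returning the number of characters consumed stands in for the buffer adjustment; the > cases are filled in the same way):

```python
def relop(text):
    """Return (token, attribute, chars_consumed), or None if no relop here."""
    if not text:
        return None
    if text[0] == "<":
        if text[1:2] == "=":
            return ("relop", "LE", 2)
        if text[1:2] == ">":
            return ("relop", "NE", 2)
        return ("relop", "LT", 1)   # read one char too far: consume only the <
    if text[0] == "=":
        return ("relop", "EQ", 1)
    if text[0] == ">":
        if text[1:2] == "=":
            return ("relop", "GE", 2)
        return ("relop", "GT", 1)   # again, the extra char is re-read later
    return None

print(relop("<>x"))   # ('relop', 'NE', 2)
```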

3.4.2: Recognition of Reserved Words and Identifiers

The transition diagram below corresponds to the regular definition given previously.

Note again the star affixed to the final state. Two questions remain.

1. How do we distinguish between identifiers and keywords such as then, which also match the pattern in the transition diagram?
2. What is (gettoken(), installID())?

We will continue to assume that the keywords are reserved, i.e., may not be used as identifiers. (What if this is not the case, as in PL/I, which had no reserved words? Then the lexer does not distinguish between keywords and identifiers, and the parser must.)

We will use the method mentioned last chapter and have the keywords installed into the symbol table prior to any invocation of the lexer. The symbol table entry will indicate that the entry is a keyword.

installID() checks if the lexeme is already in the table. If it is not present, the lexeme is installed as an id token. In either case a pointer to the entry is returned.

gettoken() examines the lexeme and returns the token name, either id or a name corresponding to a reserved keyword.

Both installID() and gettoken() access the buffer to obtain the lexeme of interest. The text also gives another method to distinguish between identifiers and keywords.

3.4.3: Completion of the Running Example

So far we have transition diagrams for identifiers (this diagram also handles keywords) and the relational operators. What remain are whitespace and numbers, which are the simplest and most complicated diagrams seen so far.

Recognizing Whitespace

The diagram itself is quite simple, reflecting the simplicity of the corresponding regular expression.

• The delim in the diagram represents any of the whitespace characters, say space, tab, and newline.
• The final star is there because we needed to find a non-whitespace character in order to know when the whitespace ends; this character begins the next token.
• There is no action performed at the accepting state. Indeed, the lexer does not return to the parser, but starts again from its beginning as it still must find the next token.
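Before moving on, the installID()/gettoken() scheme of 3.4.2 can be sketched as follows (only the two function names come from the text; the table layout is illustrative):

```python
# Keywords are installed in the symbol table before the lexer runs,
# marked as such; identifiers are added on first sight.

symtab = []        # list of entries
index_of = {}      # lexeme -> index into symtab

def install(lexeme, token):
    index_of[lexeme] = len(symtab)
    symtab.append({"lexeme": lexeme, "token": token})

for kw in ("if", "then", "else"):   # done once, prior to any lexing
    install(kw, kw)

def installID(lexeme):
    if lexeme not in index_of:
        install(lexeme, "id")       # not present: install as an id
    return index_of[lexeme]         # pointer (index) to the entry

def gettoken(entry):
    return symtab[entry]["token"]   # 'id' or the reserved keyword

e = installID("then")
print(gettoken(e))                  # then  (a keyword, not an id)
```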

Recognizing Numbers

The diagram below is from the second edition. It is essentially a combination of the three diagrams in the first edition.

This certainly looks formidable, but it is not that bad; it follows from the regular expression. In class we will go over the regular expression and show the corresponding parts in the diagram.

When an accepting state is reached, action is required but is not shown on the diagram. Just as identifiers are stored in a symbol table and a pointer is returned, there is a corresponding number table in which numbers are stored. These numbers are needed when code is generated. Depending on the source language, we may wish to indicate in the table whether this is a real or an integer. A similar, but more complicated, transition diagram could be produced if the language permitted complex numbers as well.

3.4.4: Architecture of a Transition-Diagram-Based Lexical Analyzer

The idea is that we write a piece of code for each decision diagram. I will show the one for relational operations below (from the 2nd edition). This piece of code contains a case for each

state, which typically reads a character and then goes to the next case depending on the character read. The numbers in the circles are the names of the cases.

Accepting states often need to take some action and return to the parser. Many of these accepting states (the ones with stars) need to restore one character of input. This is called retract() in the code.

What should the code for a particular diagram do if, at one state, the character read is not one of those for which a next state has been defined? That is, what if the character read is not the label of any of the outgoing arcs? This means that we have failed to find the token corresponding to this diagram. The code calls fail(). This is not an error case. It simply means that the current input does not match this particular token. So we need to go to the code section for another diagram, after restoring the input pointer so that we start the next ...
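A sketch of this case-per-state style for the relop diagram (the state numbers follow the book's figure only loosely; the Fail exception models fail()):

```python
class Fail(Exception):
    """Raised when the current input does not match this diagram."""

def relop_diagram(text):
    state, pos = 0, 0

    def nextchar():
        nonlocal pos
        ch = text[pos] if pos < len(text) else ""
        pos += 1
        return ch

    def retract():
        nonlocal pos
        pos -= 1

    while True:
        if state == 0:                 # start state
            c = nextchar()
            if c == "<":   state = 1
            elif c == "=": return ("relop", "EQ")
            elif c == ">": state = 6
            else: raise Fail()         # fail(): try another diagram
        elif state == 1:               # have seen <
            c = nextchar()
            if c == "=":   return ("relop", "LE")
            elif c == ">": return ("relop", "NE")
            else:
                retract()              # starred accepting state
                return ("relop", "LT")
        elif state == 6:               # have seen >
            c = nextchar()
            if c == "=":   return ("relop", "GE")
            else:
                retract()
                return ("relop", "GT")

print(relop_diagram("<y"))   # ('relop', 'LT')
```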

