03. Lexical Analysis - Lecture notes 3
Course: Computing Theory and Programming Languages
Institution: California State University Sacramento



Lexical Analysis

Here is an overview of a compiler:

[Figure: overview of a compiler — a High-level Language Program enters the front end (analysis), which consists of the Scanner, the Parser (syntactic analysis), and the Semantic Analyser, and produces an Intermediate Representation; the back end (synthesis) performs Intermediate Code Generation, Optimization, and Code Generation, and produces a Machine Language Program.]

And a more detailed view:

Source Code (sequence of characters) ➔ Pre-processor ➔ Modified Source Code

Analysis (source-code dependent):
Modified Source Code ➔ Scanner (lexical analysis) ➔ Token Stream (sequence of tokens) ➔ Parser (syntactic analysis) ➔ Syntax Tree (parse tree) ➔ Semantic Analyzer ➔ Annotated Syntax Tree

Synthesis (target-code dependent):
Annotated Syntax Tree ➔ Intermediate Code Generator ➔ Intermediate Representation ➔ Machine-Independent Code Optimizer ➔ Intermediate Representation ➔ Code Generator ➔ Target-machine Code ➔ Machine-Dependent Code Optimizer ➔ Target-machine Code

The various steps in the compiling process are called phases.

Scanning: converting the programmer's original source code file, which is typically a sequence of ASCII characters, into a sequence of tokens.

Token: the smallest element of a programming language. It is made up of characters, but it is "atomic" in the sense that it cannot be broken down into smaller meaningful pieces. Tokens include at least the following:
• reserved words,
• punctuation marks,
• identifiers,
• literals.

Lexical Analysis/Scanning: the first and simplest phase of a compiler. It assembles the characters into tokens. For example, consider the following piece of Java source code:

if (i==12) age> 1

This code fragment contains 17 characters. It would be very difficult to formally specify the syntactic structure of a language such as Java entirely in terms of ASCII characters. Instead, the syntax of Java is described in terms of "tokens" in the language, and a separate lexical definition defines how the characters in the source code translate into tokens.

In the above case, the lexical analyzer will produce the following token sequence:

Keyword    IF
Operator   LPAREN
Identifier i
Operator   EQUAL
Number     12
Operator   RPAREN
Identifier age
Operator   GT
Number     1

The sequence of tokens may be represented as a list, where each token is encoded in a fixed format that makes subsequent processing easy, usually a number, a pair of numbers, or equivalent:
• for fixed elements of the language (e.g., keywords, operators) a single number may be used. For example, the token "if" might be denoted by 22, while the token "{" might be denoted by 16.
• for user-defined categories (e.g., identifiers, literals) a pair may be used: one item identifies the category, and the other identifies the specific element within the category. For example, the identifier xyz might be denoted by 57 (used for all identifiers) and 45 (specific to xyz, usually indicating where the specific identifier information is to be found, typically in a symbol table).
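To make this encoding concrete, here is a minimal Java sketch (the class names, category codes, and symbol-table layout are illustrative assumptions, not something these notes prescribe):

    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical token representation: a category code, plus a
    // symbol-table index for user-defined categories such as identifiers.
    class Token {
        final int category;  // e.g. 22 for the keyword "if", 57 for any identifier
        final int index;     // symbol-table slot for identifiers/literals; -1 for fixed tokens

        Token(int category, int index) {
            this.category = category;
            this.index = index;
        }

        @Override
        public String toString() {
            return index < 0 ? "<" + category + ">" : "<" + category + "," + index + ">";
        }
    }

    // A minimal symbol table: each identifier is stored once and referred to by index.
    class SymbolTable {
        private final List<String> names = new ArrayList<>();

        int intern(String name) {
            int i = names.indexOf(name);
            if (i >= 0) return i;      // already known: reuse its index
            names.add(name);
            return names.size() - 1;   // first occurrence: new slot
        }
    }

With this representation the keyword "if" is emitted as the single code <22>, while the identifier xyz is emitted as a pair such as <57,45>, where 45 is simply its slot in the symbol table.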

Since the input is a stream of characters, it is sometimes difficult to determine where one token finishes and another starts. Clearly one could require that the tokens be separated by spaces. Most programming languages allow this but, as you know, do not require it. For example, the code:

foo=55

is an acceptable assignment, and a scanner will be able to distinguish the three tokens "foo", "=", and "55", even without spaces between them. However, there are situations where it can be trickier. For example:

ifp=5

Should the scanner consider "ifp" to be a single token (i.e., a variable identifier), or should it look at it as the keyword "if" followed by the identifier "p"? This dilemma complicated the lexical specifications of early programming languages such as FORTRAN and COBOL.

Principle of the Longest String

In modern programming languages this principle usually applies. It says that, given a choice between two interpretations, we always choose the longest one. In the above example the scanner will choose the "ifp" interpretation, since it is longer than "if". In a case where the intent is to have two tokens "if" and "p", they must be separated by one or more spaces.

In summary, the scanner accumulates characters one by one, adding them to the current token, until one of the following occurs:
• a white space is reached (i.e., space, tab, or newline),
• the token becomes illegal.

At this point the scanner terminates the current token and starts on a new one. The process continues until the entire source file is converted to tokens.

Example: Consider the following lexical requirements for a very simple language:
• the alphabet includes the digits 0-9, the characters P, R, I, N, T, and the period (.),
• there is one keyword: "PRINT",
• identifiers can be any other sequence of the P, R, I, N, T characters,
• numbers can be any sequence of digits, with an optional decimal point; if there is a decimal point, there must be at least one digit before or after it,
• the principle of the longest substring applies.

Show the tokens resulting from a lexical scan of the following input string: PRINT03

30PRINTT.3

PRINTPRINT

PRI

NT

I

The answer would be as follows:

1st token: PRINT      (keyword)
2nd token: 03         (number)
3rd token: 30         (number)
4th token: PRINTT     (identifier)
5th token: .3         (number)
6th token: PRINTPRINT (identifier)
7th token: PRI        (identifier)
8th token: NT         (identifier)
9th token: I          (identifier)

It is this stream of 9 tokens that would then be sent to the parser.
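For illustration, here is one way such a scanner could be hand-written in Java; the class and method names are invented, and the longest-string rule is applied by consuming the longest possible run for each token class:

    import java.util.ArrayList;
    import java.util.List;

    public class PrintScanner {

        // The toy alphabet: the only letters are P, R, I, N, T.
        static boolean isPrintLetter(char c) { return "PRINT".indexOf(c) >= 0; }

        public static List<String> scan(String src) {
            List<String> tokens = new ArrayList<>();
            int i = 0;
            while (i < src.length()) {
                char c = src.charAt(i);
                if (Character.isWhitespace(c)) { i++; continue; }  // whitespace separates tokens
                int start = i;
                if (isPrintLetter(c)) {
                    // Longest-string rule: take the longest run of P,R,I,N,T;
                    // it is the keyword only if the whole run is exactly "PRINT".
                    while (i < src.length() && isPrintLetter(src.charAt(i))) i++;
                    String s = src.substring(start, i);
                    tokens.add((s.equals("PRINT") ? "keyword " : "identifier ") + s);
                } else if (Character.isDigit(c) || c == '.') {
                    // Numbers: digits with an optional decimal point,
                    // and at least one digit before or after the point.
                    while (i < src.length() && Character.isDigit(src.charAt(i))) i++;
                    if (i < src.length() && src.charAt(i) == '.') {
                        i++;
                        while (i < src.length() && Character.isDigit(src.charAt(i))) i++;
                    }
                    String s = src.substring(start, i);
                    if (s.equals(".")) throw new IllegalArgumentException("a lone '.' is not a token");
                    tokens.add("number " + s);
                } else {
                    throw new IllegalArgumentException("illegal character: " + c);
                }
            }
            return tokens;
        }

        public static void main(String[] args) {
            // The input from the worked example; prints the nine tokens listed above.
            scan("PRINT03\n30PRINTT.3\nPRINTPRINT\nPRI\nNT\nI").forEach(System.out::println);
        }
    }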

Imagine you have a language that has only operators, identifiers, and two reserved words, and the following characteristics:
• the only legal characters are the digits 0 and 1, the letter K, and the character +,
• there are only two operators, + and ++,
• there are only two keywords, K0 and K1 (keywords have precedence over identifiers),
• identifiers can be of any length, and can only include K's, 0's, and 1's,
• the principle of the longest string applies,
• spaces are legal, but have no other significance.

Show the final sequence of tokens resulting from a lexical scan of the following character sequence:

001K111+++K0 1K1K0K1

Tokens are usually specified using regular expressions and can be identified using finite automata (or equivalent). Examples: Here are some automata which represent the analysis of various tokens.

[Figure: automaton for identifiers — from the start state s0, a letter leads to state s1; s1 loops on letters or digits; a whitespace leads to the accepting state s2.]

[Figure: automaton for unsigned integers — from the start state s0, a digit leads to state s1; s1 loops on digits; a non-digit leads to the accepting state s2.]
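As a sketch of how such automata might be coded directly (a minimal illustration with invented names; the corresponding regular expressions would be letter (letter | digit)* for identifiers and digit digit* for unsigned integers):

    public class IdentifierDfa {

        // s0 --letter--> s1; s1 --letter or digit--> s1; any other character
        // ends the token (state s2). Returns the identifier's length, or 0.
        static int matchIdentifier(String input, int pos) {
            int i = pos;
            if (i >= input.length() || !Character.isLetter(input.charAt(i))) return 0; // stuck in s0
            i++;                                           // s0 -> s1
            while (i < input.length()
                    && (Character.isLetter(input.charAt(i)) || Character.isDigit(input.charAt(i)))) {
                i++;                                       // s1 -> s1
            }
            return i - pos;                                // s1 -> s2: accept
        }

        // Same idea for unsigned integers: s0 --digit--> s1; s1 --digit--> s1.
        static int matchUnsignedInteger(String input, int pos) {
            int i = pos;
            while (i < input.length() && Character.isDigit(input.charAt(i))) i++;
            return i - pos;                                // 0 means no integer starts here
        }

        public static void main(String[] args) {
            System.out.println(matchIdentifier("count42+x", 0));    // 7  ("count42")
            System.out.println(matchUnsignedInteger("123abc", 0));  // 3  ("123")
        }
    }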

[Figure: automaton for relational operators — from the start state s0: on '<' go to s1; from s1, '=' leads to s2 (return relop LE), '>' leads to s3 (return relop NE), and any other character leads to s4 (return relop LT, putting the character back). On '=' go to s5 (return relop EQ). On '>' go to s6; from s6, '=' leads to s7 (return relop GE), and any other character leads to s8 (return relop GT, putting the character back).]

Note that to represent an actual scanner all these automata would have to be combined. Notice that in some cases the character should not be consumed and should be put back in the stream of characters (i.e., the "other" transitions in the relop case). This generally requires the use of a "buffer" which can be emptied if the character is consumed, but which otherwise holds the character for use on the next token. Scanners can be written by hand, modeling a finite state automaton, or one can use scanner generators such as Lex, Flex, or JLex. All these scanners rely on regular expressions to describe the syntax of the tokens.
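For illustration, here is a hypothetical Java rendering of the relational-operator automaton; java.io.PushbackReader supplies the one-character buffer described above, and the state names in the comments refer to the figure:

    import java.io.IOException;
    import java.io.PushbackReader;
    import java.io.StringReader;

    public class RelopScanner {

        // Returns "LE", "NE", "LT", "EQ", "GE", or "GT", or null if no
        // relational operator starts at the current position.
        static String relop(PushbackReader in) throws IOException {
            int c = in.read();                        // state s0
            switch (c) {
                case '<': {                           // state s1
                    int d = in.read();
                    if (d == '=') return "LE";        // state s2
                    if (d == '>') return "NE";        // state s3
                    if (d != -1) in.unread(d);        // state s4: put "other" back
                    return "LT";
                }
                case '=':                             // state s5
                    return "EQ";
                case '>': {                           // state s6
                    int d = in.read();
                    if (d == '=') return "GE";        // state s7
                    if (d != -1) in.unread(d);        // state s8: put "other" back
                    return "GT";
                }
                default:
                    if (c != -1) in.unread(c);        // not a relop: leave the stream untouched
                    return null;
            }
        }

        public static void main(String[] args) throws IOException {
            PushbackReader in = new PushbackReader(new StringReader("<=x"));
            System.out.println(relop(in));  // LE; 'x' stays in the stream for the next token
        }
    }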
