Breaking Down Words: The Art of Lexical Analysis in Compiler Design

Peter K Joseph
6 min read · Mar 11, 2023


This is a continuation of my previous blog; if you are an absolute beginner, it is advised to read that post first.

The first step in the compilation process is lexical analysis, which involves breaking the source code into a sequence of tokens. Tokens are the basic building blocks of a programming language, and they represent keywords, identifiers, operators, and other elements of the code. The lexical analyzer, also known as a scanner, reads the source code character by character and groups them into tokens. The output of this stage is a stream of tokens that will be used in the next stage.

To understand this concept better, let’s take a simple C code.

#include <stdio.h>

int main() {
    printf("Hello, World!");
    return 0;
}

During lexical analysis, the lexical analyzer performs the following operations. For this blog, I plan to keep things simple and will introduce a more detailed version at a later stage.

Input Buffering: The first step in lexical analysis is to read the source code and store it in a buffer. The input buffer is a contiguous block of memory that contains the source code to be analyzed.
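
As a concrete illustration, here is a minimal sketch of how a lexer might load a source file into such a buffer. The function name read_source and the error handling are my own assumptions for this sketch, not part of any particular compiler:

#include <stdio.h>
#include <stdlib.h>

/* Read an entire source file into a heap-allocated, NUL-terminated buffer.
   Returns NULL on failure. */
char *read_source(const char *path) {
    FILE *fp = fopen(path, "rb");
    if (!fp) return NULL;

    fseek(fp, 0, SEEK_END);              /* find the file size */
    long size = ftell(fp);
    rewind(fp);

    char *buf = malloc((size_t)size + 1); /* one extra byte for '\0' */
    if (buf) {
        size_t n = fread(buf, 1, (size_t)size, fp);
        buf[n] = '\0';                   /* terminate so we can scan with pointers */
    }
    fclose(fp);
    return buf;
}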

Scanning: Next comes scanning, also known as tokenization. The scanner reads the source code character by character and groups the characters into tokens based on the language’s rules. It removes whitespace and comments and generates a stream of tokens that will be used in the next step.

The code above, after the scanning process, would look like this (a minimal scanning loop in C is sketched after the list):

Token: #include
Token: <
Token: stdio.h
Token: >
Token: int
Token: main
Token: (
Token: )
Token: {
Token: printf
Token: (
Token: "Hello, World!"
Token: )
Token: ;
Token: return
Token: 0
Token: ;
Token: }
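
Below is a minimal sketch of such a scanning loop. It is a deliberately simplified toy that handles only the token shapes appearing in our example (words, numbers, string literals, and single punctuation characters), not a full C lexer:

#include <ctype.h>
#include <stdio.h>

/* Walk the buffer and print one token per line, roughly as shown above.
   Words, numbers, string literals, and single punctuation only. */
void scan(const char *src) {
    const char *p = src;
    while (*p) {
        if (isspace((unsigned char)*p)) { p++; continue; }   /* drop whitespace */
        const char *start = p;
        if (isalpha((unsigned char)*p) || *p == '_' || *p == '#') {
            /* word: identifier, keyword, directive, or header name */
            while (isalnum((unsigned char)*p) || *p == '_' || *p == '#' || *p == '.')
                p++;
        } else if (isdigit((unsigned char)*p)) {
            while (isdigit((unsigned char)*p)) p++;          /* number */
        } else if (*p == '"') {                              /* string literal */
            p++;
            while (*p && *p != '"') p++;
            if (*p) p++;                                     /* closing quote */
        } else {
            p++;                                             /* single punctuation */
        }
        printf("Token: %.*s\n", (int)(p - start), start);
    }
}

int main(void) {
    scan("#include <stdio.h>\n"
         "int main() { printf(\"Hello, World!\"); return 0; }\n");
    return 0;
}

Run against our example program, this loop prints exactly the token list shown above.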

Identifying and Creating Tokens: The next step is to identify the tokens generated by the scanner. Tokens can be keywords, identifiers, operators, literals, or punctuation marks. Keywords are reserved words in the programming language, such as “if” and “while”. Identifiers are user-defined names used to represent variables, functions, and other entities in the program. Operators are symbols used to perform operations, such as addition and subtraction. Literals are constants, such as numbers and strings. Punctuation marks include braces, commas, semicolons, and parentheses.

For example: in the stream of tokens generated, “#include” is identified as a preprocessor directive, the header “stdio.h” is identified as a header name, and the string “Hello, World!” is identified as a string literal.

After this, we assign each of these tokens a type and a value. The token type specifies the category of the token, such as keyword or identifier, and the token value is the actual text of the token.

So, for the example given, the parsed token stream would look like this:

<keyword, "#include"> <operator, "<"> <header, "stdio.h"> <operator, ">">
<keyword, "int"> <identifier, "main"> <operator, "("> <operator, ")">
<operator, "{">
<keyword, "printf"><operator, "("><string, "Hello, World!"><operator, ")"><operator, ";">
<keyword, "return"> <number, "0"> <operator, ";">
<operator, "}">

Filtering out Invalid Tokens: The final step is to filter out any invalid tokens. Invalid tokens are tokens that do not conform to the rules of the programming language, such as misspelled keywords or identifiers.

Example: Consider code that has a for statement written as “fr (i = 0; i < n; i++)”. Unless the language defines “fr” as its keyword for “for”, which C does not, this statement violates the rules of the language, and the compiler will warn about it or, as in this case, throw an error. Strictly speaking, a C scanner would happily tokenize “fr” as an ordinary identifier; the mistake typically surfaces slightly later, when the parser finds an identifier where it expected a statement. The lexical analyzer proper rejects only character sequences that cannot form any valid token at all, such as a stray “@” outside a string.
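
What the scanner does do is check each word against a keyword table and classify anything unknown as an identifier. A minimal sketch of that lookup, with a deliberately short keyword table of my own choosing, might look like this:

#include <string.h>
#include <stdio.h>

/* A deliberately short keyword table for illustration. */
static const char *keywords[] = { "int", "return", "for", "if", "while" };

/* Classify a word: keyword if it appears in the table, identifier otherwise. */
const char *classify(const char *word) {
    for (size_t i = 0; i < sizeof keywords / sizeof keywords[0]; i++)
        if (strcmp(word, keywords[i]) == 0)
            return "keyword";
    return "identifier";
}

int main(void) {
    printf("for -> %s\n", classify("for"));   /* keyword */
    printf("fr  -> %s\n", classify("fr"));    /* identifier: the typo passes the lexer */
    return 0;
}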

Now, let’s level up a bit. What has been explained so far is an oversimplified version of what a lexical analyzer does.

In reality, the lexical analyzer reads the source code and stores it in the system’s input buffer, a contiguous block of memory holding the source code to be analyzed (input buffering). The scanning stage happens after this: the scanner walks the input buffer and identifies each token in the source code. The rest proceeds as explained under Scanning, taking the code and returning the same stream of tokens shown earlier.

After this, we pair each of the found tokens with its lexeme, the actual sequence of characters matched in the source. During scanning, the scanner identifies each lexeme and creates a token based on the lexeme’s category. For the example code we are using, the lexeme/token pairs look like this:

Lexeme: #include, Token: Preprocessor Directive
Lexeme: <, Token: Less Than Operator
Lexeme: stdio.h, Token: Header File Name
Lexeme: >, Token: Greater Than Operator
Lexeme: int, Token: Keyword
Lexeme: main, Token: Identifier
Lexeme: (, Token: Left Parenthesis
Lexeme: ), Token: Right Parenthesis
Lexeme: {, Token: Left Curly Brace
Lexeme: printf, Token: Identifier
Lexeme: (, Token: Left Parenthesis
Lexeme: "Hello, World!", Token: String Literal
Lexeme: ), Token: Right Parenthesis
Lexeme: ;, Token: Semicolon
Lexeme: return, Token: Keyword
Lexeme: 0, Token: Numeric Literal
Lexeme: ;, Token: Semicolon
Lexeme: }, Token: Right Curly Brace

Next, we enter the stage of eliminating whitespace and comments. The scanner strips all whitespace and comments from the source code before creating tokens, which simplifies the tokenization process and makes the resulting token stream more manageable (a sketch of this skipping step follows the token list below).

Once that is done, we move on to generating tokens. After scanning and creating lexemes, the scanner generates tokens based on each lexeme’s category; each token represents a single element of the programming language, such as a keyword, identifier, operator, or symbol. For the example code we are using, the resulting token categories would be as follows.

Token: Preprocessor Directive
Token: Less Than Operator
Token: Header File Name
Token: Greater Than Operator
Token: Keyword
Token: Identifier
Token: Left Parenthesis
Token: Right Parenthesis
Token: Left Curly Brace
Token: Identifier
Token: Left Parenthesis
Token: String Literal
Token: Right Parenthesis
Token: Semicolon
Token: Keyword
Token: Numeric Literal
Token: Semicolon
Token: Right Curly Brace
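
Here is a minimal sketch of the whitespace-and-comment elimination step described above, assuming C-style // and /* */ comments:

#include <ctype.h>

/* Advance past whitespace and C-style comments, returning a pointer to the
   first character that belongs to a real token. */
const char *skip_blanks(const char *p) {
    for (;;) {
        while (isspace((unsigned char)*p)) p++;            /* whitespace */
        if (p[0] == '/' && p[1] == '/') {                  /* line comment */
            while (*p && *p != '\n') p++;
        } else if (p[0] == '/' && p[1] == '*') {           /* block comment */
            p += 2;
            while (*p && !(p[0] == '*' && p[1] == '/')) p++;
            if (*p) p += 2;                                /* closing * and / */
        } else {
            return p;                                      /* start of a token */
        }
    }
}

A scanner would call a helper like this before matching each token, so that whitespace and comments never reach the token stream.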

After this stage is done, we enter the final step of lexical analysis: constructing the token stream. A token stream is a sequence of tokens generated by the scanner that represents the source code. The token stream is then passed on to the next phase of the compiler, which is syntax analysis. For our example, the stream of token categories looks like this:

Preprocessor Directive Less Than Operator Header File Name Greater Than Operator
Keyword Identifier Left Parenthesis Right Parenthesis Left Curly Brace
Identifier Left Parenthesis String Literal Right Parenthesis Semicolon
Keyword Numeric Literal Semicolon
Right Curly Brace
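
Concretely, a token stream can be little more than a growable array of the Token struct sketched earlier (repeated here so the snippet stands alone); the names and doubling growth policy below are illustrative assumptions:

#include <stdlib.h>

typedef enum { TK_KEYWORD, TK_IDENTIFIER, TK_OPERATOR,
               TK_STRING, TK_NUMBER } TokenType;

typedef struct { TokenType type; const char *value; } Token;

typedef struct {
    Token *items;    /* the tokens, in source order */
    size_t count;    /* how many are stored */
    size_t cap;      /* allocated capacity */
} TokenStream;

/* Append a token, doubling the array whenever it fills up. */
void push_token(TokenStream *s, Token t) {
    if (s->count == s->cap) {
        s->cap = s->cap ? s->cap * 2 : 16;
        s->items = realloc(s->items, s->cap * sizeof *s->items);
    }
    s->items[s->count++] = t;
}

The parser then consumes this array (or an equivalent iterator) one token at a time during syntax analysis.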

After all these steps, the lexical analyzer produces the stream of tokens as <type, lexeme> pairs. An example is given below:

<keyword, "#include"> <operator, "<"> <header, "stdio.h"> <operator, ">">
<keyword, "int"> <identifier, "main"> <operator, "("> <operator, ")">
<operator, "{">
<keyword, "printf"><operator, "("><string, "Hello, World!"><operator, ")"><operator, ";">
<keyword, "return"> <number, "0"> <operator, ";">
<operator, "}">

A token is a sequence of characters that represents a particular element of the programming language, such as a keyword, identifier, operator, or symbol.

A token stream is a sequence of tokens generated by the lexical analysis phase of the compilation process. During lexical analysis, the scanner reads the source code character by character, identifies lexemes, and assigns them to corresponding token types. These tokens are then assembled into a token stream, which represents the source code in a more abstract and structured way.

A lexeme is a sequence of characters in the source code that matches the pattern for a single token; it is the raw text from which that token is produced.
