Compiler (I) Lex Introduction

In this post, I will introduce what Lex is and how to use it.

Compiler Structure

What is Lex?

Lex is a computer program that generates lexical analyzers (“scanners” or “lexers”). Lex is commonly used with the yacc parser generator. Lex reads an input stream specifying the lexical analyzer and outputs source code implementing the lexer in the C programming language.
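To make "lexical analyzer" concrete, here is a minimal hand-written sketch in C of the kind of scanning loop that Lex saves you from writing yourself. It is not what Lex actually emits (a generated lexer is table-driven); the function name scan_integers is a hypothetical helper chosen for illustration. It prints every maximal run of digits in a string and skips everything else.

```c
#include <ctype.h>
#include <stdio.h>

/* Hypothetical helper (not part of Lex): scan `input`, print every
   maximal run of digits to `out`, and return how many runs were seen. */
static int scan_integers(const char *input, FILE *out)
{
    int count = 0;
    const char *p = input;

    while (*p != '\0') {
        if (isdigit((unsigned char)*p)) {
            const char *start = p;
            /* Consume digits for as long as possible. */
            while (isdigit((unsigned char)*p)) {
                p++;
            }
            fprintf(out, "Saw an integer: %.*s\n", (int)(p - start), start);
            count++;
        } else {
            p++; /* Any other character is ignored. */
        }
    }
    return count;
}
```

Calling scan_integers("abc 123 x45", stdout) reports 123 and 45. With Lex, the same behavior is described by a pattern and an action instead of explicit pointer handling.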

How to write a Lex file?

The following is an example Lex file for the flex version of Lex. It recognizes strings of digits (non-negative integers) in the input and simply prints them out.

/* Definition Section (optional, may be empty) */

%{
    #include <stdio.h>
    /* Everything between %{ and %} is copied verbatim
       to the top of the generated C program.
       Include header files and declare variables here. */
%}

POS_INTEGER   ([+]?[0-9]+)
NEG_INTEGER   ([-][0-9]+)
INTEGER       ([-+]?[0-9]+)

/* Named definitions like the ones above are a more elegant way
   to write regular expressions. */

/* The next line separates the definition section from the rules section. */
%%

    /* Rules Section (required) */
    /* The Rules Section is for writing regular
       expressions that recognize tokens. When a pattern
       is matched, its action is executed.
       [Regular expression rule] { The things you want to do; }
       Comments in this section must be indented so that flex
       does not read them as patterns. */

[0-9]+  { printf("Saw an integer: %s\n", yytext); }

    /* {POS_INTEGER} { printf("Saw an integer: %s\n", yytext); }
       We can also write the rule like this; it additionally
       accepts a leading +. */

.|\n  { /* Ignore and do nothing */ }

    /* "." is a wildcard that matches any character except the line feed \n */

    /* The next line separates the rules section from the C code section. */
%%

/* C code section */
/* The C Code Section will be copied to the
   bottom of generated C program. */
   
int main(void)
{
    /* Call the lexer, then quit. */
    yylex();
    return 0;
}

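When more than one rule can match the input, the lexer resolves the conflict as follows:

  • It always chooses the rule that matches the longest string.
  • If several rules match the same length, it chooses the rule that appears first in the file.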
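These two rules can be sketched in plain C. This is only an illustration under simplifying assumptions: a real generated lexer uses a state machine instead of trying each rule in turn, and the names matcher_fn, match_if, match_ident, and pick_rule are hypothetical, not part of Lex.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* A rule reports how many leading characters of `s` it matches (0 = no match). */
typedef size_t (*matcher_fn)(const char *s);

/* Hypothetical rule 0: the keyword "if". */
static size_t match_if(const char *s)
{
    return strncmp(s, "if", 2) == 0 ? 2 : 0;
}

/* Hypothetical rule 1: an identifier, roughly [a-z]+ */
static size_t match_ident(const char *s)
{
    size_t n = 0;
    while (islower((unsigned char)s[n])) {
        n++;
    }
    return n;
}

/* Return the index of the winning rule, or -1 if no rule matches.
   The longest match wins. On a tie the earlier rule wins, because
   only a strictly longer match replaces the current best. */
static int pick_rule(const matcher_fn *rules, size_t nrules,
                     const char *s, size_t *match_len)
{
    int best = -1;
    size_t best_len = 0;

    for (size_t i = 0; i < nrules; i++) {
        size_t len = rules[i](s);
        if (len > best_len) {
            best = (int)i;
            best_len = len;
        }
    }
    *match_len = best_len;
    return best;
}
```

With the rules listed as { match_if, match_ident }, the input "iffy" is taken by the identifier rule (4 characters beat 2), while the input "if" is taken by the keyword rule: both match 2 characters, and the keyword rule comes first.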

Lex predefined variables

Name              Function
char* yytext      Pointer to the matched string.
int yyleng        Length of the matched string.
int yylex(void)   Invokes the lexer and returns the next token.
int yywrap(void)  Called at end of input; returns 1 if there are no more files to read.
yymore()          Append the next matched string to the current yytext instead of replacing it.
yyless(n)         Keep the first n characters in yytext and return the rest to the input stream.
FILE* yyin        Input stream pointer.
FILE* yyout       Output stream pointer.
ECHO              Print yytext to yyout.
BEGIN             Switch the start condition.
REJECT            Fall back to the next-best matching rule.

Condition

// TODO

How to compile a Lex file

flex scanner.l 
gcc -o scanner lex.yy.c -lfl
./scanner < <input_C_file>