Lexical analyzer
The language has a finite set of valid string/token/lexeme. The job of the lexical analyzer is to scan and identify the string/token/lexeme. It also performs searching for the specified pattern by the language rules.
Regular expressions can express finite languages in the definition of pattern for finite strings of symbols. The grammar specified by the regular expressions is named Regular grammar. Regular grammar defines a language that is known as Regular language.
To specify the patterns, the system needs to follow an important notation which is the regular expression. The regular expressions act as names for a set of strings that are matched by the pattern. Regular languages explain the tokens of the programming language.
The regular expression has the specification which is an example of a recursive definition. The implementation of regular languages is productive and easy to interpret.
Regular expressions follow a set of algebraic laws which are used to handle regular expressions into equivalent forms.
Operations
The multiple variations of languages are the following:
L and M on Union of two languages can be written as
L U M = { s | s is in L or s is in M }
L and M on Concatenation of two languages can be written as
LM = {st | s is in L and t is in M}
L on Kleene closure of a language can be written as,
L* = Zero or more occurrence of language L
Notations
If the regular expressions r and s denote the languages L(r) and L(s) then the following.
Union : (r)|(s) is a regular expression indicating L(r) U L(s)
Concatenation : (r)(s) is a regular expression specifying L(r)L(s)
Kleene closure : (r)* is a regular expression indication (L(r))*
(r)is a regular expression specifying L(r)
Precedence and associativity
* , concatenation(.), and | (pipe sign) are left-associative
The highest precedence is *
Concatenation (.) has the second-highest precedence.
| (pipe sign) has the lowest precedence.
The valid tokens of a language in a regular expression represent the following.
If x is a regular expression, then:
-
x* means zero or more occurrence of x.
For example, it can create {e, x, xx, xxx, xxxx,....}
-
x+ means one or more occurrences of x.
For example, it can create {x, xx, xxx, xxxx,....} or x.x*
-
x? means at most one occurrence of x.
For example, it can create either {x} or {e}.
All lower-case alphabets of the English language can be denoted by [a-z].
All upper-case alphabets of the English language can be denoted by [A-Z].
All-natural digits of Mathematics can be denoted by [0-9].
The regular expressions represent the occurrence of symbols used.
letter = [a - z] or [A - Z]
digit = 0|1|2|3|4|5|6|7|8|9 or [0 - 9]
Sign = [+ | -]
The regular expressions represent the language tokens.
Decimal = (sign) ? (digit) +
Identifier = (leter)(letter | digit)*
But the lexical analyzer is unable to verify the validity of a regular expression in defining the patterns of keywords of a language. Finite automata is an authorized way for the verification process.