Formal grammars - regular expressions, CFGs, formal language


This is more of an overview for my own use than for teaching or exercises.


📃 These are primarily notes, intended to be a collection of useful fragments, that will probably never be complete in any sense.


Regular Expressions and Finite State Automata

Regular expressions can be compiled into FSAs.

...well, except that some programming languages insist on hooking non-regex, non-regular-language features into their regular expressions, because it's useful.

Patterns using those features no longer conform to a regular grammar, so they can no longer be compiled to an FSA (and evaluating some of them can explode in memory/CPU for the same reason).

Both DFA's and NFA's (Deterministic and Nondeterministic Finite Automata) are usable in this context. (One is not necessarily a better choice. NFAs are sometimes more minimal, and parallelizable; DFAs are often easier to read and slightly faster to evaluate)
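Evaluating an NFA without backtracking can be sketched by tracking the set of currently active states. This is only a sketch; the machine below and its dict encoding are illustrative choices, not a standard API.

```python
# Evaluate an NFA by tracking the set of active states.
# edges: dict mapping (state, symbol) -> set of next states.

def nfa_accepts(start, finals, edges, s):
    active = {start}
    for ch in s:
        active = set().union(*(edges.get((q, ch), set()) for q in active))
        if not active:          # no state survived this symbol
            return False
    return bool(active & finals)

# NFA for the regexp [ab]*b : stay in state 0 on a/b, or jump to accepting 1 on b.
edges = {
    (0, 'a'): {0},
    (0, 'b'): {0, 1},
}
print(nfa_accepts(0, {1}, edges, 'aab'))   # True
print(nfa_accepts(0, {1}, edges, 'aba'))   # False
```

Note that this set-of-states approach is essentially the subset construction done lazily, which is why it avoids the exponential backtracking a naive path-at-a-time NFA evaluation can hit.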

For tasks that aren't quite 'accept text', it is sometimes easier to build an FSA directly than to go via regexp syntax as an intermediate.

Applications include working with conceptual units such as graphemes, morphemes, and phonemes - for example conjugation of a verb, pluralisation of a noun, estimating where syllable breaks are, and so on. That is, recognizing whether a form makes sense, rather than generating forms (at least not just yet).

To do this, one usually needs a fairly strict rule system, usually abstractly designed. This may introduce a lot of categories and properties for words, because you will probably need to know what a word is (e.g. an irregular plural noun) to know what to do with it. After all, we are trying to avoid storing every word, and while you can do a fair amount with just the last few letters, you can only do so much.
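As a toy illustration, here is a directly specified DFA for an invented rule: a word 'might be a plural' if it ends in a single 's'. The rule itself is made up for illustration; real morphology needs far richer categories, as noted above.

```python
# A hand-built DFA over a toy rule: accept lowercase words ending in a
# single 's' (crude "might be a plural" check; invented for illustration).

OTHER, S, SS = 0, 1, 2   # last chars seen: non-s, exactly one s, two or more s

def step(state, ch):
    if ch != 's':
        return OTHER
    return S if state == OTHER else SS

def looks_plural(word):
    state = OTHER
    for ch in word:
        state = step(state, ch)
    return state == S        # S is the only accepting state

print(looks_plural('cats'))    # True
print(looks_plural('glass'))   # False (ends in ss)
print(looks_plural('cat'))     # False
```

The point is mainly that for rules like this, writing the states and transitions down directly can be clearer than forcing them through regexp syntax first.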


Examples / exercises

Regular expressions

Examples: (generally assuming alphabet is only lowercase a to z, sometimes general ASCII)

  • All lowercase strings ending in 'b': [a-z]*b
  • Repeated words: \b([a-zA-Z]+) \1
  • Capitalized alliterations: \b([A-Z])[A-Za-z]* \1
  • A string without an e: ^[a-df-z]*$ (assuming presence of anchors)
  • A string with at least two vowels: ^[b-df-hj-np-tv-z]*[aeiou]+[b-df-hj-np-tv-z]*[aeiou]+[a-z]*$

Note that there are usually various solutions, some clearer and shorter than others, though sometimes with somewhat different results. For example, consonants could also be written [^aeiou], but only when your alphabet is restricted to a-z -- with full ASCII that negation covers every other character (well over a hundred), not just the 21 consonants you want.
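The example patterns above can be checked with Python's re module (a sketch; other engines differ slightly, e.g. in \b and backreference support):

```python
# Exercising the example patterns from the list above.
import re

assert re.fullmatch(r'[a-z]*b', 'crab')
assert not re.fullmatch(r'[a-z]*b', 'crag')

assert re.search(r'\b([a-zA-Z]+) \1', 'it was was good')   # repeated word
assert re.search(r'\b([A-Z])[A-Za-z]* \1', 'Peter Piper')  # alliteration

assert re.fullmatch(r'[a-df-z]*', 'thrust')                # no e
assert not re.fullmatch(r'[a-df-z]*', 'three')

two_vowels = r'[b-df-hj-np-tv-z]*[aeiou]+[b-df-hj-np-tv-z]*[aeiou]+[a-z]*'
assert re.fullmatch(two_vowels, 'onion')
assert not re.fullmatch(two_vowels, 'dry')
```

re.fullmatch plays the role of the ^ and $ anchors mentioned above.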


  • Make a regexp that accepts a word containing two vowels, ending with a consonant, and having an odd number of letters. (Note that this would be easier with a regexp/FSA team)
  • Explore different implementations, how they deal with greediness, grouping, multiple matches, overlapping matches, allow backreferences, forward references and substitution.
  • Think of and possibly implement a KISS ELIZA, eg. with a number of rules like:
cat | tr '.' '\n' | tr -s '\n' | \
sed -e "s/I'm/I am/g" | sed -e "s/I've/I have/g" | \
sed -e "s/I am sad/I'm sorry to hear that./g" | \
sed -e "s/I am \([a-zA-Z\ ]*\)/Why do you say you are \1?/g" | \
sed -e "s/I have \([a-zA-Z\ ]*\)/How long have you had \1?/g"

...and other such rules. (Think of marking your output so it won't get processed again. You may also want sed's -r/-E extended-regex mode, in which case groups are written (...) rather than \(...\).)
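On the greediness point in the exercise above, a quick look at greedy versus lazy quantifiers and overlapping matches in Python's re (behaviour varies per engine; this is just Python's):

```python
# Greedy vs lazy quantifiers, and overlapping matches, in Python's re.
import re

html = '<b>bold</b> and <i>italic</i>'
print(re.findall(r'<.*>',  html))   # greedy: one match spanning everything
print(re.findall(r'<.*?>', html))   # lazy: each tag separately

# findall does not report overlapping matches...
print(re.findall(r'\w\w', 'abcd'))          # ['ab', 'cd']
# ...but a zero-width lookahead with a capture group recovers them:
print(re.findall(r'(?=(\w\w))', 'abcd'))    # ['ab', 'bc', 'cd']
```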


Examples: (to be added when I get around to drawing these with graphviz)


  • Write a regular expression for a given machine. Algorithms exist for this. If you can't simplify it in one step, remember that a disjunction of all possible paths is correct, if not very minimal.
  • Describe an algorithm for negating a DFA, i.e. one that builds a machine that accepts exactly the strings the original rejects.

Concepts and Algorithms

Regular Grammar, Regular Expression, Finite State Machine (FSM, or ~Automaton, FSA) (Deterministic FSA (DFA or DFSA), Nondeterministic FSA (NFA or NFSA)).

An FSA is usually a five-tuple: (Q,\Sigma,S,E,F) where:

  • Q is the set of states
  • \Sigma is the set of symbols (the alphabet)
  • S is the start state or states \subset Q
  • E is the final accepting state or states \subset Q
  • F is the set of edges, seen as triples \subset (Q,(\Sigma \cup \epsilon),Q)

Whether it is deterministic (exactly one state active at any time) can be seen from S and F. If there is one start state, no epsilon edges, and no two triples in F that match in their first two elements (i.e. from a single state, a symbol points to at most one other state), it is deterministic.
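That determinism check can be sketched directly from the definition (encoding edges as plain triples, and None for an epsilon edge, are illustrative choices):

```python
# Deciding determinism from S and the edge triples.

def is_deterministic(starts, edges):
    """starts: set of start states; edges: set of (state, symbol, state)."""
    if len(starts) != 1:
        return False
    seen = set()
    for (q, sym, _) in edges:
        if sym is None:          # an epsilon edge means nondeterminism
            return False
        if (q, sym) in seen:     # same state+symbol leads to two places
            return False
        seen.add((q, sym))
    return True

print(is_deterministic({0}, {(0, 'a', 0), (0, 'b', 1)}))   # True
print(is_deterministic({0}, {(0, 'b', 0), (0, 'b', 1)}))   # False
```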

There is an algorithm that minimises a machine into a more compact form. Interestingly, for a given language there is exactly one minimal DFA, so if you build a machine in several ways (e.g. once from state transitions, once compiled from a regexp), minimizing both and checking they are equivalent proves the accepted languages are the same.

State space search: the problem definition builds a set of possible answers that is looked through, either by pre-building it or by lazily evaluating it, and either returning on the first answer or exhaustively returning all. (Evaluating an NFA for a string is an example of such an algorithm, as is Perl's regexp implementation, which uses backtracking)

Finite State Transducers

Examples / Exercises

  • Implement the Soundex algorithm in a transducer.
  • Make a transducer that replaces the c in words with an s or k according to its pronunciation.
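A sketch of the second exercise as a plain rewrite function: 'c' is (very roughly) pronounced 's' before e/i/y and 'k' otherwise. This toy rule ignores the many exceptions ('cello', 'ocean', ...) a real transducer would need to handle.

```python
# Toy c -> s/k respelling: s before e/i/y, k otherwise.
# The rule is a deliberately crude approximation.

def respell_c(word):
    out = []
    for i, ch in enumerate(word):
        if ch == 'c':
            nxt = word[i + 1] if i + 1 < len(word) else ''
            out.append('s' if nxt in ('e', 'i', 'y') else 'k')
        else:
            out.append(ch)
    return ''.join(out)

print(respell_c('cat'))      # kat
print(respell_c('cent'))     # sent
print(respell_c('circus'))   # sirkus
```

A proper FST version would express the same thing as states and (input, output) edge labels, which also composes with other transducers.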

Weighted Finite State Automata

Context Free Grammars

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

CFGs are grammars described by a set of rules of the form

N -> s


  • N is a nonterminal and
  • s is a list consisting of
    • terminals (which are unit strings),
    • nonterminals (i.e. the left-hand side of a rule), or
    • an empty string.

For example:

S -> A B
A -> a A
A -> a  #a common shorthand would combine these into "A -> A | a"
B -> b
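The example grammar can be exercised by randomly deriving strings from it and checking they all match the regexp a+b. Encoding the rules as a dict of alternatives is an illustrative choice.

```python
# Randomly derive strings from the example grammar and check them.
import random
import re

rules = {
    'S': [['A', 'B']],
    'A': [['a', 'A'], ['a']],
    'B': [['b']],
}

def generate(symbol='S'):
    if symbol not in rules:                 # a terminal
        return symbol
    expansion = random.choice(rules[symbol])
    return ''.join(generate(sym) for sym in expansion)

for _ in range(20):
    s = generate()
    assert re.fullmatch(r'a+b', s), s
    print(s)
```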


  • CFGs are more expressive than regular expressions in that they allow recursion (and not just a limited form of direct iteration).
  • Using recursion is not necessarily more expressive than regular languages (the above is roughly equivalent to the regexp "a+b"), but it can be.
  • There are formal languages that are not context free
...but many things that are close are made to be CFGs because it makes parsing simpler.
  • The context freeness comes from the way CFGs generate/accept a string: at any point, a nonterminal N can be replaced by any of its right-hand sides s, regardless of context (as opposed to context-sensitive grammars, which can be a pain).

There are various subclassifications of CFG's, usually restrictive properties related to ease of parsing. These are usually descriptions of what forms rules can take, which translates into certain inherent guarantees in their use. Most of the common ones:

Usually (for all of these?(verify)) a CFG can be transformed into such a restricted form - though the result is often not a nicely readable set of rules.

Note that these grammars are not necessarily unambiguous.

When applied more formally, we design grammars to be unambiguous - for example for markup and protocol descriptions; you see EBNF in a number of RFCs.

In terms of expressiveness, the following can be said:

  • RE is a subset of CFG
  • CFG isn't a subset of RE
  • a CFG intersected with a RE is a CFG


  • Taking the intersection of two CFGs does not necessarily give a CFG
  • Nor does taking the complement (if CFLs were closed under complement, closure under intersection would follow via De Morgan, since they are closed under union)
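The intersection claim has a classic counterexample, standard in formal language theory:

```latex
% Each language below is context-free (each only needs to match one
% pair of counts), but their intersection is the well-known
% non-context-free language a^n b^n c^n.
L_1 = \{ a^n b^n c^m \mid n, m \ge 0 \}, \qquad
L_2 = \{ a^m b^n c^n \mid n, m \ge 0 \}
\\
L_1 \cap L_2 = \{ a^n b^n c^n \mid n \ge 0 \}
```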

BNF and derivations


This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

Meta, more abstract:

  • abstract syntax tree, AST - see syntax tree
  • bottom-up parsing - see shift-reduce parsing
  • compiler compiler, compiler generator: Create a compiler for a given grammar
  • concrete syntax tree - see syntax tree
  • lexer, lexical analyser - tokenizes input according to the terminals in some grammar(verify).
A necessary step in parsing most grammars
Note that a parser generator often also creates one of these (see lexer generator)(verify)
  • lexer generator, lexical analyser generator: creates a lexical analyser based on some specification (e.g. all the terminals), which can save a lot of writing and rewriting bother

  • syntax tree
    • concrete syntax tree, parse tree - showing the syntactic structure of a string
    • abstract syntax tree - shows meaning rather than every detail in the input string
  • shift-reduce parsing - bottom-up parsing that shifts input tokens onto a stack and reduces the stack top according to grammar rules
  • top-down parsing (cf. bottom-up/shift-reduce) [1]

A parser (top-down / bottom-up?) often consists of

  • an input buffer, containing tokens from the input string
  • a stack for the terminals and non-terminals from the grammar yet to be parsed
  • a parsing table of grammar rules, often in terms of the stack top and the next input token
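The three parts listed above can be sketched as a naive shift-reduce parser. The grammar (E -> E + n | n) and the greedy "reduce whenever possible, in rule order" strategy are illustrative simplifications; real LR parsers consult a generated table instead.

```python
# Naive shift-reduce parsing for E -> E + n | n.

RULES = [('E', ('E', '+', 'n')),   # E -> E + n   (tried first, on purpose)
         ('E', ('n',))]            # E -> n

def parse(tokens):
    buf = list(tokens)      # input buffer
    stack = []              # terminals and nonterminals seen so far
    while True:
        # try to reduce: does the top of the stack match a rule's right side?
        for lhs, rhs in RULES:
            if tuple(stack[-len(rhs):]) == rhs:
                del stack[-len(rhs):]
                stack.append(lhs)
                break
        else:
            if not buf:
                break
            stack.append(buf.pop(0))    # otherwise shift the next token
    return stack == ['E']               # accepted iff reduced to start symbol

print(parse(['n', '+', 'n', '+', 'n']))   # True
print(parse(['n', '+', '+']))             # False
```

Note the rule order matters here: trying E -> E + n before E -> n is a stand-in for the shift/reduce decisions an LR table would make from lookahead.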

  • LALR grammar
  • LALR parser - look-ahead LR parser; see LR parsers
  • LR parser - reads input from left to right, produces rightmost derivation.
  • LR grammar
  • LL grammar
  • LL parser - reads input from left to right, produces leftmost derivation
  • k, as in 'LL(k) parser', 'LR(k)' and such, refers to the number of unconsumed tokens of lookahead. If such a parser exists for a given grammar, it need not backtrack: it can unambiguously decide the parse with k tokens of lookahead. Higher-k parsers are hard to write by hand, although parser generators can deal with that for you. Theoretically, LR(k) for k>1 can be transformed into LR(1)
  • SLR grammar
  • SLR parser - Simple LR parser. Parser-generated LR(0) state machine. Not practical for various real-world languages[2]

Apparently there is a mild preference for LL parsers in Europe and LR in America, presumably because of how parsing theory was presented along the way.

Parsers are often one of:

  • LALR(1) - somewhat less powerful than LR(1), but simpler to implement. Various parser generators (e.g. yacc derivatives) create LALR(1) parsers
  • LR(0) - no lookahead, often too simple for use
  • SLR(1)
  • LR(1) - general, but complex to implement (...yourself). Many programming language grammars are LR(1)

Ordering of expressiveness of grammars, more to less: (currently approximate)

  • LR
  • LALR
  • SLR


  • yacc and derivatives
    • yacc (AT&T) - builds LALR parser for a CFG in C
    • GNU bison - builds LALR parser for a CFG in C, C++, or Java
    • Berkeley Yacc [3]
    • PLY ('Python Lex-Yacc') [4]
    • racc (ruby yacc) [5]
    • MKS yacc [6]
    • Abraxas pcyacc [7]
  • Common lexers
    • lex - lexer generator (often seen with yacc) [8]
    • flex, flex++ - lexer generator (often seen with bison) [9]
  • antlr

See also:


Parsing expression grammar (PEG)

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

Parsing Expression Grammars (PEGs) resemble CFGs but will e.g. accept the first of several choices rather than any of them, which replaces ambiguous/multiple parses with a clear preference.

While CFGs may be more flexible where you care about evaluating all parses (e.g. in natural language), PEGs may be preferable (also over regexps) when you want to express things like "this pattern if it matches, otherwise that one" or "as much of that pattern as is there", and/or care that the result will often be an AST.
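PEG-style ordered choice can be sketched with tiny parser combinators: each parser takes (text, pos) and returns the new position or None. The combinator names are invented for illustration.

```python
# Minimal PEG-style combinators demonstrating ordered choice.

def lit(s):
    """Match the literal string s at the current position."""
    def p(text, pos):
        return pos + len(s) if text.startswith(s, pos) else None
    return p

def choice(*parsers):
    """PEG '/': commit to the FIRST alternative that matches."""
    def p(text, pos):
        for parser in parsers:
            r = parser(text, pos)
            if r is not None:
                return r
        return None
    return p

ident = choice(lit('in'), lit('int'), lit('integer'))
# Ordered choice never reconsiders: 'in' wins even though 'integer'
# would match more of the input.
print(ident('integer', 0))   # 2
```

This is exactly the behaviour that removes ambiguity (a CFG with the same three alternatives would admit all three parses) and also exactly the behaviour that can surprise you if you order alternatives carelessly.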

Context-sensitive grammars

The formal side of informal language

Formal language here means taking such a structured approach to natural language (NLP).

Grammar formalisms and grammar frameworks describe varied approaches of expressing grammars in more formal, structured terms.

Often used for computational linguistics, specifically interests such as language modeling or machine translation. ...though note that translation and text analysis includes various methods that are more free-form than going via a structured grammar.

This couldn't go without mention of the Chomskian tradition from the 60s, perhaps the first formal, almost mathematical take on language processing. Though not practiced the same way today, it made great strides in the ability to reason about expressiveness, complexity, and the like.

The more basic models here resemble CFGs, but with various practical alterations.

For example, unification-based grammar is a step more general than CFGs: it can model CFGs (by not having features) and can do more than CFGs (by using them). UGs are in fact extremely general, and as such harder to write fast parsers for(verify).

More widely, formal language work is still fairly open-ended: there is no singular best parse, meaning this becomes about modeling rather than representation. Nor is there a singular best representation of a parse (in general; there may be one within a framework's restrictions).

In practice there are many variations and inventions, so you rarely talk about plain CFGs and such at all. See e.g. Grammar formalisms.

So there are a lot of approaches, styles of markup, and more.

This also interplays with habits and choices at tagsets and dependency tree level.

Tagsets by themselves are the lexical categories you annotate text with.

These tend to be heavily tied to the particular formalism/implementation that you analyse text with, often via a parse tree / dependency tree of some sort.

This is one of a few reasons that the tags from one tagset may not translate directly to another, and parse trees much less so - even within a formalism, consistency may come only from convention.

There are developments, like Universal Dependencies, that try to create treebanks in one more universal dependency-tree style.


Some more terms from parsers