# Formal grammars - regular expressions, CFGs, formal language

This is more for overview of my own than for teaching or exercise.

These are primarily notes. They won't be complete in any sense; they exist to contain fragments of useful information.

# Regular Expressions and Finite State Automata

Regular expressions (excluding some complex extensions) can be compiled into FSAs. Regexps can be useful even with their somewhat limited expressiveness; they support unbounded repetition, meaning they can model iteration in natural language to a limited degree, and besides that do optionality, alternatives, and a few other things.

Relevant to the non-formal-language purpose, basic POSIX regexps have been used to look for simple words, match several variations of a word, find alternative patterns in any text, do information extraction, and more - though at least occasionally their output will be handled more exactly by something else. Linguistically, you would often work on a letter basis, for example to try to spell-correct, perhaps conjugate verbs or pluralize nouns, or on phonemes or syllables, for various purposes.

Sometimes it is even conceptually easier to build a machine directly instead of a regular expression. Both DFAs and NFAs (Deterministic and Nondeterministic Finite Automata) are usable in this context. (One is not necessarily a better choice. NFAs are sometimes smaller, and parallelizable; DFAs are often easier to read and slightly faster to evaluate.) Note that in their basic form, for each input string they transition according to its symbols, and either end up in an accepting state or not; they are acceptor machines.
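To make the acceptor idea concrete, here is a minimal DFA runner; the representation (a dict keyed on (state, symbol)) is just one arbitrary choice for a sketch:

```python
def dfa_accepts(start, transitions, accepting, s):
    """Run a DFA over string s; a missing edge means rejection."""
    state = start
    for ch in s:
        if (state, ch) not in transitions:
            return False          # no edge for this symbol: reject
        state = transitions[(state, ch)]
    return state in accepting

# Example: strings over {a, b} that end in 'b'
trans = {('n', 'a'): 'n', ('n', 'b'): 'y',
         ('y', 'a'): 'n', ('y', 'b'): 'y'}
print(dfa_accepts('n', trans, {'y'}, 'aab'))   # True
print(dfa_accepts('n', trans, {'y'}, 'aba'))   # False
```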

Applications of this include things that deal with conceptual units, such as graphemes, morphemes, and phonemes - the aforementioned conjugation of a verb, pluralisation of a noun, estimating where syllable breaks are, and so on. That is, recognizing whether a form makes sense, rather than generating forms, at least just yet.

To do this, one usually needs a fairly strict rule system, usually designed abstractly. This may introduce a lot of categories and properties for words, because you will probably want to know what a word is (eg. an irregular plural noun) to be able to know what to do with it. After all, we're trying to avoid storing every word, and while you can do a fair bit with just the last couple of letters, you can only do so much.

Many regexp implementations compile to automata. Given that, it is interesting to be able to play with the resulting machines, or with both in interaction - after all, there are formal operations to combine machines, which means you can re-use and combine them. More about that in the transducer section.
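One of those formal operations is the product construction, which builds a DFA accepting the intersection of two DFAs' languages. A sketch (the (start, transitions, accepting) representation is an arbitrary choice for illustration):

```python
def product(d1, d2):
    """Product construction: result accepts strings both DFAs accept.
    A DFA here is (start, {(state, sym): state}, accepting_set)."""
    (s1, t1, f1), (s2, t2, f2) = d1, d2
    start, trans = (s1, s2), {}
    todo, seen = [(s1, s2)], {(s1, s2)}
    while todo:
        a, b = todo.pop()
        for (st, sym), nxt in t1.items():
            if st == a and (b, sym) in t2:
                pair = (nxt, t2[(b, sym)])
                trans[((a, b), sym)] = pair
                if pair not in seen:
                    seen.add(pair)
                    todo.append(pair)
    return start, trans, {p for p in seen if p[0] in f1 and p[1] in f2}

def accepts(dfa, s):
    state, trans, acc = dfa
    for ch in s:
        if (state, ch) not in trans:
            return False
        state = trans[(state, ch)]
    return state in acc

# Machine 1: even number of a's.  Machine 2: ends in b.
even_a = (0, {(0, 'a'): 1, (1, 'a'): 0, (0, 'b'): 0, (1, 'b'): 1}, {0})
ends_b = ('n', {('n', 'a'): 'n', ('n', 'b'): 'y',
                ('y', 'a'): 'n', ('y', 'b'): 'y'}, {'y'})
both = product(even_a, ends_b)
print(accepts(both, 'aab'))   # True: two a's, ends in b
print(accepts(both, 'ab'))    # False: only one a
```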

## Storage

One somewhat specialist use is data storage for efficient lookup.

• Tuple Dictionaries
• Perfect Hash

## Examples / exercises

Regular expressions

Examples: (generally assuming alphabet is only lowercase a to z, sometimes general ASCII)

• All lowercase strings ending in 'b':
[a-z]*b
• Repeated words:
\b([a-zA-Z]+) \1
• Capitalized alliterations:
\b([A-Z])[A-Za-z]* \1
• A string without an e (assuming presence of anchors):
^[a-df-z]*$
• A string with at least two vowels:
^[b-df-hj-np-tv-z]*[aeiou]+[b-df-hj-np-tv-z]*[aeiou]+[a-z]*$

Note that there are usually various solutions, some clearer and shorter than others, though sometimes with somewhat different results. For example, consonants could also be written [^aeiou], but only when your alphabet only allows a-z -- with full ASCII that negation matches every non-vowel character, not just the 21 consonants you want.
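The examples above are quick to check in any regexp-capable language; here in Python's re module:

```python
import re

# Repeated words
assert re.search(r"\b([a-zA-Z]+) \1\b", "it was the the cat")
assert not re.search(r"\b([a-zA-Z]+) \1\b", "it was the cat")

# A string without an e (fullmatch plays the role of the anchors)
no_e = r"[a-df-z]*"
assert re.fullmatch(no_e, "world")
assert not re.fullmatch(no_e, "hello")

# At least two vowels
two_vowels = r"[b-df-hj-np-tv-z]*[aeiou]+[b-df-hj-np-tv-z]*[aeiou]+[a-z]*"
assert re.fullmatch(two_vowels, "banana")
assert not re.fullmatch(two_vowels, "dry")
```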

Exercises:

• Make a regexp that accepts a word containing two vowels, ending with a consonant, and having an odd number of letters. (Note that this would be easier with a combination of regexps/FSAs)
• Explore different implementations, how they deal with greediness, grouping, multiple matches, overlapping matches, allow backreferences, forward references and substitution.
• Think of and possibly implement a KISS ELIZA, eg. with a number of rules like:
cat | tr '.' '\n' | tr -s '\n' | \
sed -e "s/I'm/I am/g" | sed -e "s/I've/I have/g" | \
sed -e "s/I am sad/I'm sorry to hear that./g" | \
sed -e "s/I am \([a-zA-Z ]*\)/Why do you say you are \1?/g" | \
sed -e "s/I have \([a-zA-Z ]*\)/How long have you had \1?/g"


...and other such rules. (Think of marking your output so it won't get processed again. You may also want sed's -r, for extended regexps in which groups are written without backslashes.)
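The same idea in Python, for comparison; the rules are the ones from the pipeline above, and the rewritten output is returned immediately so it isn't processed by later rules:

```python
import re

def eliza(line):
    """Tiny ELIZA-ish rewriter: first matching rule wins."""
    line = line.replace("I'm", "I am").replace("I've", "I have")
    rules = [
        (r"I am sad", "I'm sorry to hear that."),
        (r"I am ([a-zA-Z ]*)", r"Why do you say you are \1?"),
        (r"I have ([a-zA-Z ]*)", r"How long have you had \1?"),
    ]
    for pattern, replacement in rules:
        rewritten = re.sub(pattern, replacement, line)
        if rewritten != line:
            return rewritten       # stop, so output isn't reprocessed
    return line

print(eliza("I'm sad"))        # I'm sorry to hear that.
print(eliza("I've a cold"))    # How long have you had a cold?
```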

Automata

Examples: (when I can be bothered to draw them with graphviz)

Exercises:

• Write a regular expression for a machine. Algorithms exist for this. If you can't simplify it in one step, remember that a disjunction of all possible paths is correct, if not very minimal.
• Describe an algorithm for negating a DFA, i.e. one that builds a machine that accepts exactly the strings the original rejects.

## Concepts and Algorithms

Regular Grammar, Regular Expression, Finite State Machine (FSM, or ~Automaton, FSA) (Deterministic FSA (DFA or DFSA), Nondeterministic FSA (NFA or NFSA)).

A FSA is usually a five-tuple: (Q,\Sigma,S,E,F) where:

• Q is the set of states
• \Sigma is the set of symbols (the alphabet)
• S is the start state or states, \subset Q
• E is the final (accepting) state or states, \subset Q
• F is the set of edges, seen as triples, \subset Q \times (\Sigma \cup \{\epsilon\}) \times Q

Whether it is deterministic (always exactly one active state) or not can be seen from S and F. If there is a single start state, no \epsilon-edges, and no two triples in F share the same first two elements (ie. from a given state, a symbol points to at most one other state), it is deterministic.

There is an algorithm that minimises a machine to a more compact form. Interestingly, for a given language there is exactly one minimal DFA, so if you build machines in several ways (eg. once from state transitions, once compiled from a regexp) and they are identical after minimization, that proves the accepted languages are the same.

State space search: The problem definition builds a set of possible answers that is looked through, either by pre-building it, or by lazily evaluating it, and either returning on the first answer, or exhaustively returning all. (Evaluating a NFA for a string is an example of such an algorithm, as is Perl's regexp implementation, which uses backtracking)
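Evaluating an NFA this way is short to write: track the set of currently active states, which is effectively an on-the-fly subset construction (\epsilon-edges omitted for brevity):

```python
def nfa_accepts(starts, transitions, accepting, s):
    """Simulate an NFA by tracking the set of active states.
    transitions maps (state, symbol) to a set of next states."""
    states = set(starts)
    for ch in s:
        states = {nxt for st in states
                  for nxt in transitions.get((st, ch), ())}
        if not states:            # no active states left: dead
            return False
    return bool(states & accepting)

# Example: strings over {a, b} containing 'ab'
trans = {(0, 'a'): {0, 1}, (0, 'b'): {0},
         (1, 'b'): {2},
         (2, 'a'): {2}, (2, 'b'): {2}}
print(nfa_accepts({0}, trans, {2}, 'aab'))  # True
print(nfa_accepts({0}, trans, {2}, 'ba'))   # False
```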

# Finite State Transducers

## Examples / Exercises

• Implement the Soundex algorithm as a transducer.
• Make a transducer that replaces the c in words by an s or k according to its pronunciation.
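As a starting point for the first exercise, here is a basic Soundex variant written as a plain function (not yet an FST), just to pin down the behaviour to reproduce. It uses the classic rules - keep the first letter, map consonants to digit classes, collapse adjacent equal codes, pad to four characters - with the simplification that vowels, y, h, and w all just separate codes (full Soundex treats h/w specially):

```python
def soundex(word):
    """Basic Soundex: first letter plus three digits.
    Simplified: vowels, y, h, w all act as code separators."""
    codes = {}
    for group, digit in [('bfpv', '1'), ('cgjkqsxz', '2'), ('dt', '3'),
                         ('l', '4'), ('mn', '5'), ('r', '6')]:
        for ch in group:
            codes[ch] = digit
    word = word.lower()
    result = word[0].upper()
    prev = codes.get(word[0], '')
    for ch in word[1:]:
        code = codes.get(ch, '')      # '' for vowels/y/h/w
        if code and code != prev:     # skip adjacent duplicates
            result += code
        prev = code
    return (result + '000')[:4]       # pad with zeros, cut to 4

print(soundex('Robert'))   # R163
print(soundex('Tymczak'))  # T522
```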

# Context Free Grammars

CFGs are grammars described by a set of rules of the form

N -> s


where N is a nonterminal and s is a list consisting of terminals (which are unit strings), nonterminals (i.e. the left-hand side of a rule), or an empty string. For example:

S -> A B
A -> a A
A -> a  #a common shorthand would combine these into "A -> A | a"
B -> b


It is more expressive than regular expressions in that it allows recursion (and not just a limited form of direct iteration). A particular CFG is not necessarily beyond regular languages (the above is equivalent to the regexp a+b), but it can be. There are also formal languages that are not context free, but CFGs are useful for quite a few purposes.
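A quick sanity check of that equivalence: generate random derivations from the example grammar and match each against the regexp (one or more a, then b):

```python
import random
import re

# The example grammar: S -> A B ; A -> a A | a ; B -> b
rules = {'S': [['A', 'B']],
         'A': [['a', 'A'], ['a']],
         'B': [['b']]}

def generate(sym='S'):
    """Expand a symbol by picking a random right-hand side;
    anything without a rule is a terminal."""
    if sym not in rules:
        return sym
    return ''.join(generate(s) for s in random.choice(rules[sym]))

for _ in range(100):
    assert re.fullmatch(r'a+b', generate())
```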

The context freeness comes from the way CFGs generate/accept a string: at any point, a nonterminal N can be replaced by any of its right-hand sides s, regardless of context (as opposed to context-sensitive grammars, which can be a pain).

There are various subclassifications of CFGs, usually restrictive properties related to ease of parsing. These are usually descriptions of what forms rules can take, which translates into certain inherent guarantees in their use. Common normal forms include Chomsky Normal Form (CNF) and Greibach Normal Form (GNF).

Usually (for all of these?(verify)) any CFG can be transformed into any of these forms - though the result is often not a nicely readable set of rules.

There are also simpler properties about input, such as:

• Nonambiguous - determines that there will be only one parse for any given acceptable string. Obviously useful and used in programming language definitions.
• LL(k) grammar, LR(k) grammar - CFGs that can be parsed with an LL(k) or LR(k) parser (see below). The k refers to the amount of lookahead needed to know which rule to choose. (Why was this again? resolving k-deep ambiguity?(verify)) LL(1) languages are quite simple to parse.

Usual parsers are:

• LL and LR parsers, give leftmost and rightmost parse, respectively.
• CYK (Cocke-Younger-Kasami), made for CNF, can give all possible parses. (O(n³))
• GHR (look up)
• Earley, which is attractive because it can parse all CFGs. (O(n³))
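Of these, CYK is the easiest to write down. A minimal recognizer over a CNF grammar (rules as (lhs, rhs) tuples; the example grammar is a CNF version of the a+b grammar above):

```python
def cyk(word, rules, start='S'):
    """CYK recognizer for a grammar in Chomsky Normal Form.
    table[i][l] holds the nonterminals deriving word[i : i+l+1]."""
    n = len(word)
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, ch in enumerate(word):                  # length-1 spans
        for lhs, rhs in rules:
            if rhs == (ch,):
                table[i][0].add(lhs)
    for length in range(2, n + 1):                 # longer spans
        for i in range(n - length + 1):
            for split in range(1, length):
                for lhs, rhs in rules:
                    if (len(rhs) == 2
                            and rhs[0] in table[i][split - 1]
                            and rhs[1] in table[i + split][length - split - 1]):
                        table[i][length - 1].add(lhs)
    return start in table[0][n - 1]

# CNF grammar for a+b: S -> A B, A -> C A | a, C -> a, B -> b
rules = [('S', ('A', 'B')), ('A', ('C', 'A')), ('A', ('a',)),
         ('C', ('a',)), ('B', ('b',))]
print(cyk('aaab', rules))  # True
print(cyk('ba', rules))    # False
```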

In terms of expressiveness, the following can be said:

• The regular languages are a subset of the context-free languages
• The context-free languages are not a subset of the regular languages
• A context-free language intersected with a regular language is context-free

Also:

• The context-free languages are not closed under intersection
• Nor under complement (closure under complement would imply closure under intersection, via De Morgan)

## Probabilistic/Statistical CFGs

Usually named a Probabilistic CFG, PCFG, but sometimes also an SCFG (Stochastic CFG).

The idea: a (mildly) trained parser chooses the most-seen partial derivations, hoping to end up at a more natural parse tree, rather than just the first or leftmost one.

## BNF and derivations

In the form of a list of

<symbol> ::= expression


Basic BNF makes it awkward to specify optional and repeated elements; EBNF adds notation for these.

## Parsers

Parsing-related concepts:

• abstract syntax tree, AST - see syntax tree
• Bottom-up parsing - see shift-reduce parsing
• compiler compiler, compiler generator: Create a compiler for a given grammar
• concrete syntax tree - see syntax tree
• lexer, lexical analyzer, lexical scanner - tokenizes input, according to the terminals in some grammar(verify). A necessary step in parsing most grammars
• lexer generator, lexical analyser generator: creates a lexical analyser based on some specification (e.g. all the terminals), which can save a lot of writing and rewriting bother. Note that a parser generator often also creates one of these(verify)
• Shift-reduce parsing -
• Top-down parsing (cf. bottom-up/shift-reduce) 
• LALR grammar
• LALR parser - look-ahead LR parser; see LR parsers
• LR parser - reads input from left to right, produces rightmost derivation.
• LR grammar
• LL grammar
• LL parser - reads input from left to right, produces leftmost derivation
• k, as in 'LL(k) parser', 'LR(k)' and such, refers to the number of unconsumed tokens of lookahead. If such a parser exists for a given grammar, it need not backtrack: it can unambiguously decide the parse with k tokens of lookahead. Higher-k parsers are hard to write by hand, although parser generators can deal with that for you. Theoretically, LR(k) for k>1 can be transformed into LR(1)
• SLR grammar
• SLR parser - Simple LR parser. Parser-generated LR(0) state machine. Not practical for various real-world languages
• syntax tree
• concrete syntax tree, parse tree - showing the syntactic structure of a string
• abstract syntax tree - shows meaning rather than every detail in the input string

A parser (top-down / bottom-up?) often consists of

• an input buffer, containing tokens from the input string
• a stack for the terminals and non-terminals from the grammar yet to be parsed
• a parsing table of grammar rules, often in terms of the stack top and the next input token
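Those three parts are easy to see in a toy table-driven LL(1) parser (the grammar S → aS | b is an arbitrary illustration; the table maps (nonterminal, lookahead token) to a right-hand side):

```python
def ll1_parse(tokens, table, start='S'):
    """Table-driven LL(1) parsing: a stack of grammar symbols plus
    one token of lookahead decide every step."""
    stack = [start]
    tokens = list(tokens) + ['$']        # end-of-input marker
    i = 0
    while stack:
        top = stack.pop()
        if top in table:                 # nonterminal: consult the table
            rhs = table[top].get(tokens[i])
            if rhs is None:
                return False             # no applicable rule
            stack.extend(reversed(rhs))  # push right-hand side
        elif top == tokens[i]:           # terminal: match the input
            i += 1
        else:
            return False
    return tokens[i] == '$'              # all input consumed?

# Grammar: S -> a S | b
table = {'S': {'a': ['a', 'S'], 'b': ['b']}}
print(ll1_parse('aaab', table))  # True
print(ll1_parse('aa', table))    # False
```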

Apparently there is a mild preference for LL parsers in Europe and LR parsers in the USA, supposedly because of how parsing theory was presented in each.

Parsers are often one of:

• LALR(1) - somewhat less powerful than LR(1), but simpler to implement. Various parser generators (e.g. yacc derivations) create LALR(1) parsers
• LR(0) - no lookahead, often too simple for use
• SLR(1)
• LR(1) - general, but complex to implement (...yourself). Many programming language grammars are LR(1)

Ordering of expressiveness of grammars, more to less: (currently approximate)

• LR
• LALR
• SLR

Software:

• yacc and derivatives
• yacc (AT&T) - builds LALR parser for a CFG in C
• GNU bison - builds LALR parser for a CFG in C, C++, or Java
• Berkeley Yacc 
• PLY ('Python Lex-Yacc') 
• racc (ruby yacc) 
• MKS yacc 
• Abraxas pcyacc 
• Common lexers
• lex - lexer generator (often seen with yacc) 
• flex, flex++ - lexer generator (often seen with bison) 
• antlr


# Formal language

'Formal language' here refers to taking such a structured approach to natural language (NLP).

This often resembles CFGs, but with various practical alterations.

For example, unification-based grammar is a step more general than CFGs: it can model CFGs (by not using features) and can do more than CFGs (by using them). UGs are in fact extremely general, and as such harder to write fast parsers for(verify).
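A toy illustration of what 'features' and unification mean here: flat feature structures as dicts, with None signalling failure. (A real UG implementation also needs variables and reentrancy; this is only the core merge-or-clash idea.)

```python
def unify(a, b):
    """Unify two feature structures (nested dicts or atoms).
    Returns the merged structure, or None on a clash."""
    if a == b:
        return a
    if isinstance(a, dict) and isinstance(b, dict):
        out = dict(a)
        for key, val in b.items():
            if key in out:
                merged = unify(out[key], val)
                if merged is None:
                    return None          # feature clash
                out[key] = merged
            else:
                out[key] = val           # feature only in b: just add
        return out
    return None                          # two differing atoms

print(unify({'num': 'sg'}, {'per': '3'}))   # {'num': 'sg', 'per': '3'}
print(unify({'num': 'sg'}, {'num': 'pl'}))  # None
```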

It is also much more open-ended: for many structures there is no single best parse or single best representation, so there are a lot of approaches.