# Formal grammars - regular expressions, CFGs, formal language


These are primarily notes. It won't be complete in any sense. It exists to contain fragments of useful information.

# Regular Expressions and Finite State Automata

Regular expressions (excluding some complex extensions) can be compiled into FSAs. Regexps can be useful even in their somewhat limited expressiveness; they can contain unbounded repetition, meaning they can model iteration in natural language to a limited degree, and besides that express optionality, alternatives and a few other things.

Relevant to the non-formal-language purpose, basic POSIX regexps have been used to look for simple words, to match several word variations or alternative patterns in any text, for information extraction, and more - though at least occasionally their output will be handled more exactly by something else. Linguistically, you would often work on a letter basis - for example to spell-correct, conjugate verbs, or pluralize nouns - or on phonemes or syllables, for various purposes.
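As a small illustration of matching word variations, a Python sketch (the word and text are made up for the example):

```python
import re

# One pattern covering several spelling variants of a single word:
# color, colors, colour, colours.
pattern = re.compile(r"\bcolou?rs?\b")

text = "The colours here are brighter than the color there."
print(pattern.findall(text))  # ['colours', 'color']
```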

Sometimes it is even conceptually easier to build a machine instead of a regular expression. Both DFAs and NFAs (Deterministic and Nondeterministic Finite Automata) are usable in this context. (One is not necessarily a better choice: NFAs are sometimes more minimal, and parallelizable; DFAs are often easier to read and slightly faster to evaluate.) Note that in their basic form, for each input string they transition according to it, and either end up in an accepting state or not; they are acceptor machines.
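A minimal sketch of such an acceptor machine, with the transition table as a plain dict (the machine itself is an assumption for illustration: it accepts strings over {a, b} that end in 'b'):

```python
# DFA as a dict from (state, symbol) to next state.
transitions = {
    ("q0", "a"): "q0", ("q0", "b"): "q1",
    ("q1", "a"): "q0", ("q1", "b"): "q1",
}
start, accepting = "q0", {"q1"}

def accepts(s):
    state = start
    for symbol in s:
        if (state, symbol) not in transitions:
            return False          # no edge for this symbol: reject
        state = transitions[(state, symbol)]
    return state in accepting     # accept only if we end in an accepting state

print(accepts("aab"))  # True
print(accepts("aba"))  # False
```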

Applications of this are things that deal with conceptual units such as graphemes, morphemes and phonemes - such as the aforementioned conjugation of a verb, pluralisation of a noun, estimating where syllable breaks are, and so on. That is, recognizing whether a form makes sense, rather than generating forms, at least just yet.

To do this, one usually needs a fairly strict rule system, usually abstractly designed. This may introduce a lot of categories and properties for words, because you will probably want to know what a word is (eg. an irregular plural noun) to be able to know what to do with it. After all, we're trying to avoid storing every word, and while you can do a fair amount with just the last couple of letters, you can only do so much.
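A toy sketch of such last-couple-of-letters rules plus a stored exception list (the rules and the irregular words here are illustrative assumptions, not a real system):

```python
import re

# Irregular forms must simply be stored.
irregular = {"child": "children", "mouse": "mice"}

# Suffix rules, tried in order; each looks only at the last letters.
rules = [
    (re.compile(r"(s|x|z|ch|sh)$"), r"\1es"),  # bus -> buses, church -> churches
    (re.compile(r"([^aeiou])y$"), r"\1ies"),   # city -> cities
    (re.compile(r"$"), r"s"),                  # default: just add s
]

def pluralize(noun):
    if noun in irregular:
        return irregular[noun]
    for pattern, repl in rules:
        if pattern.search(noun):
            return pattern.sub(repl, noun)

print(pluralize("church"))  # churches
print(pluralize("city"))    # cities
print(pluralize("mouse"))   # mice
```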

Many regexp implementations compile to automata. Given that, it is interesting to be able to play with the resultant machines, or both in interaction, like you can with eg. this - after all, there are formal operations to combine machines, which means you can re-use and combine them. More about that in the transducer section.

## Storage

One somewhat specialist use is data storage for efficient lookup.

• Tuple Dictionaries

• Perfect Hash

## Examples / exercises

Regular expressions

Examples: (generally assuming alphabet is only lowercase a to z, sometimes general ASCII)

• All lowercase strings ending in 'b':
[a-z]*b
• Repeated words:
\b([a-zA-Z]+) \1
• Capitalized alliterations:
\b([A-Z])[A-Za-z]* \1
• A string without an e:
^[a-df-z]*$ (assuming presence of anchors)
• A string with at least two vowels:
^[b-df-hj-np-tv-z]*[aeiou]+[b-df-hj-np-tv-z]*[aeiou]+[a-z]*$

Note that there are usually various solutions, some clearer and shorter than others, though sometimes with somewhat different results. For example, consonants could also be written <tt>[^aeiou]</tt>, but only when your alphabet only allows a-z - with ASCII, that negation also matches every non-letter character, not just the 21 consonants you want.
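The example patterns above can be checked mechanically, eg. with Python's re module (fullmatch supplies the anchoring):

```python
import re

# 'A string without an e' and 'at least two vowels', from the examples above.
no_e = re.compile(r"[a-df-z]*")
two_vowels = re.compile(r"[b-df-hj-np-tv-z]*[aeiou]+[b-df-hj-np-tv-z]*[aeiou]+[a-z]*")

print(bool(no_e.fullmatch("grammar")))     # True
print(bool(no_e.fullmatch("sentence")))    # False: contains an e
print(bool(two_vowels.fullmatch("vowel"))) # True
print(bool(two_vowels.fullmatch("word")))  # False: only one vowel
```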

Exercises:

• Make a regexp that accepts a word containing two vowels, ending with a consonant, and having an odd number of letters. (Note that this would be easier with a combination of regexps/FSAs)
• Explore different implementations: how they deal with greediness, grouping, multiple matches and overlapping matches, and whether they allow backreferences, forward references and substitution.
• Think of and possibly implement a KISS ELIZA, eg. with a number of rules like:
cat | tr '.' '\n' | tr -s '\n' | \
sed -e "s/I'm/I am/g" | sed -e "s/I've/I have/g" | \
sed -e "s/I am sad/I'm sorry to hear that./g" | \
sed -e "s/I am \([a-zA-Z ]*\)/Why do you say you are \1?/g" | \
sed -e "s/I have \([a-zA-Z ]*\)/How long have you had \1?/g"


...and other such rules. (Think of marking your output so it won't get processed again. You also probably want to use -r)

Automata

Examples: (when I get bothered with graphviz)

Exercises:

• Write a regular expression for a machine. Algorithms exist for this. If you can't simplify it in one step, remember that a disjunction of all possible paths is correct, if not very minimal.
• Describe an algorithm for negating a DFA, i.e. one that builds a machine that accepts all strings except the ones the original accepts.
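One sketch of an answer to the negation exercise: make the DFA total by adding a sink state for missing edges, then swap accepting and non-accepting states. The example machine is an assumption for illustration (it accepts strings over {a, b} ending in 'b'):

```python
def complement(states, alphabet, transitions, start, accepting):
    states = set(states) | {"sink"}            # assume "sink" is a fresh name
    total = dict(transitions)
    for q in states:
        for a in alphabet:
            total.setdefault((q, a), "sink")   # route missing edges to the sink
    # Flip which states accept; the sink stays non-accepting in the original,
    # so it becomes accepting here.
    return states, alphabet, total, start, states - set(accepting)

def run(transitions, start, accepting, s):
    state = start
    for a in s:
        state = transitions[(state, a)]
    return state in accepting

dfa = ({"q0", "q1"}, {"a", "b"},
       {("q0", "a"): "q0", ("q0", "b"): "q1",
        ("q1", "a"): "q0", ("q1", "b"): "q1"},
       "q0", {"q1"})

neg = complement(*dfa)
print(run(neg[2], neg[3], neg[4], "ab"))  # False: the original accepted it
print(run(neg[2], neg[3], neg[4], "ba"))  # True: the original rejected it
```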

## Concepts and Algorithms

Regular Grammar, Regular Expression, Finite State Machine (FSM, or ~Automaton, FSA) (Deterministic FSA (DFA or DFSA), Nondeterministic FSA (NFA or NFSA)).

An FSA is usually a five-tuple (Q, \Sigma, S, E, F), where:

• Q is the set of states
• \Sigma is the set of symbols (the alphabet)
• S is the start state or states \subset Q
• E is the final accepting state or states \subset Q
• F is the set of edges, seen as triples \subset Q \times (\Sigma \cup \{\epsilon\}) \times Q

Whether it is deterministic or not (always has exactly one state active) can be seen from S and F. If there is one start state, no epsilon edges, and no two triples in F whose first two elements match (ie. from a single state, a symbol points to at most one other node), it is deterministic.

There is an algorithm that minimises a machine to a more compact form. Interestingly, for a given language there is exactly one minimal DFA (up to renaming of states), so if you build a machine several ways (eg. once from state transitions, once compiled from a regexp) and the minimized results are identical, that proves the accepted languages are the same.

State space search: the problem definition induces a set of possible answers that is searched through, either by pre-building it or by lazily evaluating it, and either returning on the first answer or exhaustively returning all. (Evaluating an NFA on a string is an example of such an algorithm, as is Perl's regexp implementation, which uses backtracking.)
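NFA evaluation can also be done without backtracking, by lazily tracking the whole set of active states at once. A sketch, with an example machine that is an assumption for illustration (it accepts strings over {a, b} whose second-to-last symbol is 'a'):

```python
# NFA as a dict from (state, symbol) to a *set* of next states.
nfa = {
    ("q0", "a"): {"q0", "q1"},
    ("q0", "b"): {"q0"},
    ("q1", "a"): {"q2"},
    ("q1", "b"): {"q2"},
}
start, accepting = {"q0"}, {"q2"}

def nfa_accepts(s):
    active = set(start)
    for symbol in s:
        # Step every active state at once; dead branches just drop out.
        active = set().union(*(nfa.get((q, symbol), set()) for q in active))
    return bool(active & accepting)

print(nfa_accepts("bab"))  # True: second-to-last symbol is 'a'
print(nfa_accepts("ba"))   # False
```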

# Finite State Transducers

Of course, just accepting strings isn't very interesting for natural language processing. Much more interesting, and more complex as well, is making transducers, where an accepting edge also emits output. While there are other ways of using them (including imitating simple DFAs), their main use is probably as a translator: they alter the string in a way determined by the machine. For example, one that maps each letter of a word to whether it is a consonant or vowel, eg. 'word' to 'cvcc'.
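The 'word' to 'cvcc' example as a one-state transducer sketch, with each edge mapping an input letter to an output letter (generating the 26 edges in a loop is an implementation shortcut):

```python
# Edges map (state, input letter) to (next state, output letter).
edges = {("q0", ch): ("q0", "v" if ch in "aeiou" else "c")
         for ch in "abcdefghijklmnopqrstuvwxyz"}

def transduce(word):
    state, out = "q0", []
    for ch in word:
        state, symbol = edges[(state, ch)]  # follow the edge, emit its output
        out.append(symbol)
    return "".join(out)

print(transduce("word"))  # cvcc
```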

To give an example of a practical purpose (copied from [1], like other examples)

The intuitive choice for representation is a Mealy machine, with edges containing the input/output pairs. There is no really common notation, but a/b and a:b seem common. Shorthanding is possible, such as writing a common self-map like a:a as just a, summarizing ranges ([a..z]:b, which may be 26 edges in the actual machine, or one, depending on implementation), and even ?:b (? in the sense of 'any single character').

You can combine machines if you so wish. If you have one machine that tests for even length, one that tests for an ending consonant, and one that tests whether a word has two vowels, an earlier exercise would be much easier to solve than with the regexp it asked for.

Transducers are preferably DFAs. When they are nondeterministic, semantics and implementational details come into play. I would suppose the solution is to make a list of all solutions (based on each solution path, in the backtracking sense). However, you cannot simply chain machines when you do that, because the next machine would only accept one string at a time.

To put it slightly more generally than 'various combinations': we are talking about closure properties. For the less mathematical, like me:

Closure (see also wikipedia): a set S is said to be closed under an operation if applying that operation to members of S always yields a member of S. For example, the reals are closed under addition: the sum of two reals is still a real. More practical operations, such as union, intersection and subtraction of sets, are also closure operations in this sense. Actually, this isn't horribly important, but interesting in a technical way.

The usual operations (closure properties) are:

• Union

Given two machines, this simply puts them inside the same context. That is, their node sets, edge sets, start state sets and end state sets are unioned, after making sure their naming doesn't overlap. It always creates a nondeterministic FSA, on account of introducing two start states. It is often assumed that the alphabets are identical; if not, they are unioned as well.
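The union step above can be sketched directly; machines here are (states, edges, starts, ends) tuples with edges as triples, which is an assumed representation:

```python
def rename(machine, tag):
    # Prefix every state name so the two machines cannot clash.
    states, edges, starts, ends = machine
    r = lambda q: f"{tag}:{q}"
    return ({r(q) for q in states},
            {(r(a), sym, r(b)) for (a, sym, b) in edges},
            {r(q) for q in starts},
            {r(q) for q in ends})

def union(m1, m2):
    a, b = rename(m1, "A"), rename(m2, "B")
    # Componentwise union of states, edges, starts and ends.
    return tuple(x | y for x, y in zip(a, b))

ends_in_b = ({"q0", "q1"},
             {("q0", "a", "q0"), ("q0", "b", "q1"),
              ("q1", "a", "q0"), ("q1", "b", "q1")},
             {"q0"}, {"q1"})

u = union(ends_in_b, ends_in_b)
print(len(u[2]))  # 2 start states: the result is nondeterministic
```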

• Concatenation (feeds output into the next machine)

This sounds boring, but it means you can prepare for more interesting operations by replacing or inserting tokens - for example marking the start and end of a string with special tokens, or re-using your to-vc machine, followed by something that uses that output. A simple method is to put an epsilon transition (or lambda transition, depending on where you learned state machines) between the first machine's end states and the second's start state, and strip those nodes of their end/start status.

• Intersection

Takes two machines and accepts only the strings that both accept. This is typically built as a product construction that runs both machines in parallel; regular languages are closed under intersection, so the result is again an FSA.

• Inversion

If before it mapped string I to O, it now maps from O to I. This means you can switch between parser and generator.
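Inversion is cheap on the edge representation: swap the input and output labels on every edge. Edges here are (state, input, output, state) tuples, an assumed representation:

```python
def invert(edges):
    # Swap each edge's input and output label; states stay as they were.
    return {(q1, out, inp, q2) for (q1, inp, out, q2) in edges}

# A hypothetical two-edge transducer mapping a->x and b->y.
edges = {("q0", "a", "x", "q0"), ("q0", "b", "y", "q1")}
print(sorted(invert(edges)))  # now maps x->a and y->b
```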

• Kleene star / Kleene closure

Cyclic repetition: in regular expressions, * is called the Kleene star. This wraps some epsilon transitions around the machine and adds a combined start-and-end state, which has the effect that you can 'do' the machine any number of times (including zero).

Kleene closure is a unary operation on a set of strings (often characters, depending on the application) S, denoted as S*. See wikipedia for more information.
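The wrapping described above can be sketched as follows; epsilon is written as "" and the (states, edges, start, ends) layout is an assumption:

```python
def kleene_star(machine):
    states, edges, start, ends = machine
    new = "s*"                        # assume a fresh state name
    edges = set(edges)
    edges.add((new, "", start))       # epsilon into the old start
    for e in ends:
        edges.add((e, "", new))       # epsilon back, allowing repeats
    # The new state is both start and end, so zero repetitions also accept.
    return (states | {new}, edges, new, {new})

m = ({"q0", "q1"}, {("q0", "a", "q1")}, "q0", {"q1"})
starred = kleene_star(m)
print(starred[2] in starred[3])  # True: the empty string is now accepted
```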

You can store various things in an automaton's structure. With simple recognizers, for example, you can store a lexicon, in that each path is a word. The simplest form is a tree; merging it so that it isn't in fact a tree but shares structure and a single end state is a little harder, but doable and often noticeably more space-efficient. With transducers you can store a similar structure even including certain transformations, like how to transform a word's stem into various other forms. The book I'm reading [1] lists examples that have letters as transitions, followed by various things to do with that word, eg. make it a plural noun, or a singular noun - a process which is similar for regular words, so can similarly be joined for space efficiency - though for an entire language, the machine would still be huge.
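The tree form of such a lexicon (a trie) can be sketched with nested dicts, where shared prefixes are stored once and a marker flags complete words (the marker and layout are implementation assumptions):

```python
END = object()  # sentinel marking 'a word ends here'

def build(words):
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})  # shared prefixes reuse nodes
        node[END] = True
    return root

def lookup(trie, word):
    node = trie
    for ch in word:
        if ch not in node:
            return False
        node = node[ch]
    return END in node  # distinguish full words from mere prefixes

lex = build(["cat", "cats", "car"])
print(lookup(lex, "cats"))  # True
print(lookup(lex, "ca"))    # False: prefix only
```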

## Examples / Exercises

• Implement the Soundex algorithm in a transducer.
• Make a transducer that replaces the c in words with an s or k according to its pronunciation.