Formal grammars - regular expressions, CFGs, formal language
Revision as of 22:21, 4 May 2022
These are primarily notes, more an overview for my own use than for teaching or exercise.
It won't be complete in any sense.
It exists to contain fragments of useful information.
- 1 Introduction
- 2 Regular Expressions and Finite State Automata
- 3 Finite State Transducers
- 4 Weighted Finite State Automata
- 5 Context Free Grammars
- 6 Context-sensitive grammars
- 7 The formal side of informal language
Regular Expressions and Finite State Automata
Regular expressions (excluding some complex extensions) can be compiled into FSAs. Regexps can be useful even with their somewhat limited expressiveness; they can contain infinitely iterable cycles, meaning they can model iteration in natural language to a limited degree, and besides that they do optionality, alternatives, and a few other things.
Relevant to the non-formal-language purpose, basic POSIX regexps have been used to look for simple words, match several variations of a word, find alternative patterns in any text, do information extraction, and more - though at least occasionally their output will be handled more exactly by something else. Linguistically, you would often work on a letter basis, for example to try to spell-correct, perhaps conjugate verbs or pluralize nouns, or on phonemes or syllables, for various purposes.
Sometimes it is even conceptually easier to build a machine instead of a regular expression. Both DFAs and NFAs (Deterministic and Nondeterministic Finite Automata) are usable in this context. (One is not necessarily a better choice. NFAs are sometimes more minimal, and parallelizable; DFAs are often easier to read and slightly faster to evaluate.) Note that in their basic form, for each input string they transition according to its symbols, and either end up in an end state or not; they are acceptor machines.
Applications of this are things that deal with conceptual units, such as graphemes, morphemes and phonemes - say, the aforementioned conjugation of a verb, pluralisation of a noun, estimating where syllable breaks are, and so on. That is, recognizing whether a form makes sense, rather than generating forms, at least just yet.
To do this, one usually needs a fairly strict rule system, which is usually abstractly designed. This may introduce a lot of categories and properties for words, because you will probably want to know what a word is (e.g. an irregular plural noun) to be able to know what to do with it. After all, we're trying to avoid storing every word, and while you can do a fair amount, you can only do so much with the last couple of letters.
Many regexp implementations compile to automata. Given that, it is interesting to be able to play with the resultant machines, or with both in interaction - after all, there are formal operations to combine machines, which means you can re-use and combine them. More about that in the transducer section.
One somewhat specialist use is data storage for efficient lookup.
Examples / exercises
Examples: (generally assuming alphabet is only lowercase a to z, sometimes general ASCII)
- All lowercase strings ending in 'b':
- Repeated words:
- Capitalized alliterations:
- A string without an e: (assuming presence of anchors)
- A string with at least two vowels:
Note that there are usually various solutions, some clearer and shorter than others, though sometimes with somewhat different results. For example, consonants could also be written <tt>[^aeiou]</tt>, but only when your alphabet allows only a-z -- with full ASCII that negation matches far more characters than the 21 consonants you want.
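Possible answers (among many) to the examples above, sketched in Python's re syntax rather than plain POSIX; note that the repeated-words one needs a backreference, which strictly goes beyond regular languages:

```python
import re

# One possible answer per example above, assuming lowercase a-z words
# unless the example says otherwise.
ends_in_b     = re.compile(r'^[a-z]*b$')         # all lowercase strings ending in 'b'
repeated_word = re.compile(r'\b([a-z]+) \1\b')   # repeated words (backreference!)
alliteration  = re.compile(r'\b([A-Z])[a-z]* \1[a-z]*')  # capitalized alliterations
no_e          = re.compile(r'^[^e]*$')           # a string without an e (anchored)
two_vowels    = re.compile(r'^[a-z]*[aeiou][a-z]*[aeiou][a-z]*$')  # at least two vowels

assert ends_in_b.match('grab')
assert not ends_in_b.match('grip')
assert repeated_word.search('the the cat')
assert alliteration.search('Peter Piper picked')
assert no_e.match('function')
assert two_vowels.match('audio')
```

These are deliberately unoptimized to stay readable; shorter equivalents exist.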
- Make a regexp that accepts a word containing two vowels, ending with a consonant, and having an odd number of letters. (Note that this would be easier with a regexp/FSA team)
- Explore different implementations, how they deal with greediness, grouping, multiple matches, overlapping matches, allow backreferences, forward references and substitution.
- Think of and possibly implement a KISS ELIZA, eg. with a number of rules like:
cat | tr '.' '\n' | tr -s '\n' | \
  sed -e "s/I'm/I am/g" | sed -e "s/I've/I have/g" | \
  sed -e "s/I am sad/I'm sorry to hear that./g" | \
  sed -e "s/I am \([a-zA-Z\ ]*\)/Why do you say you are \1?/g" | \
  sed -e "s/I have \([a-zA-Z\ ]*\)/How long have you had \1?/g"
...and other such rules. (Think of marking your output so it won't get processed again. You also probably want to use -r)
Examples: (to be added when I get around to bothering with graphviz)
- Write a regular expression for a machine. Algorithms exist for this. If you can't simplify it in one step, remember that a disjunction of all possible paths is correct, if not very minimal.
- Describe an algorithm for negating a DFA, i.e. one that builds a machine that accepts exactly the strings the original rejects.
Concepts and Algorithms
A FSA is usually a five-tuple (Q, \Sigma, S, E, F) where:
- Q is the set of states
- \Sigma is the set of symbols (the alphabet)
- S is the start state or states, \subset Q
- E is the final (accepting) state or states, \subset Q
- F is the set of edges, seen as triples \subset (Q, (\Sigma \cup \epsilon), Q)
Whether it is deterministic or not (always has exactly one state active) can be seen from S and F. If there is one start state, no \epsilon edges, and no two triples in F whose first two elements match (i.e. from a single state, a symbol points to at most one other node), it is deterministic.
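The tuple can be written down directly. A minimal Python sketch, following the (Q, \Sigma, S, E, F) naming above; the state names and the example language (lowercase strings ending in 'b') are made up for illustration:

```python
# A DFA in the (Q, Sigma, S, E, F) style above: E is the set of accepting
# states, F the edges. This machine accepts strings over {a,b} ending in 'b'.
Q     = {'other', 'sawb'}
Sigma = set('ab')
S     = 'other'                        # a single start state, so deterministic
E     = {'sawb'}
F     = {('other', 'a'): 'other', ('other', 'b'): 'sawb',
         ('sawb',  'a'): 'other', ('sawb',  'b'): 'sawb'}

def accepts(string):
    state = S
    for symbol in string:
        if (state, symbol) not in F:   # no edge for this symbol: reject
            return False
        state = F[(state, symbol)]
    return state in E                  # accept iff we end in a final state

assert accepts('aab') and accepts('bb')
assert not accepts('aba') and not accepts('')
```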
There is an algorithm that minimises a machine to a more compact form. Interestingly, for a given language there is only one minimal DFA (up to state naming), so if you make a machine in several ways (e.g. once from state transitions, once compiled from a regexp) and the machines are identical after minimization, that proves the accepted languages are the same.
State space search: The problem definition builds a set of possible answers that is looked through, either by pre-building it, or by lazily evaluating it, and either returning on the first answer, or exhaustively returning all. (Evaluating a NFA for a string is an example of such an algorithm, as is Perl's regexp implementation, which uses backtracking)
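As a contrast to backtracking, an NFA can also be evaluated without any search, by tracking the set of all currently active states at once. A sketch, assuming an epsilon-free NFA with edges given as (source, symbol, target) triples (the example machine is made up):

```python
def nfa_accepts(edges, starts, finals, string):
    """Evaluate an epsilon-free NFA by tracking all active states at once,
    instead of backtracking through alternatives one at a time."""
    active = set(starts)
    for symbol in string:
        active = {dst for (src, sym, dst) in edges
                  if src in active and sym == symbol}
        if not active:                  # no live paths left; reject early
            return False
    return bool(active & set(finals))

# Example NFA: strings over {a,b} whose second-to-last symbol is 'a'
edges = [(0, 'a', 0), (0, 'b', 0),     # loop on anything in the start state
         (0, 'a', 1),                  # guess "this a is second-to-last"
         (1, 'a', 2), (1, 'b', 2)]

assert nfa_accepts(edges, [0], [2], 'bab')
assert not nfa_accepts(edges, [0], [2], 'abb')
```

This is effectively the subset construction done lazily, for one string at a time.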
Finite State Transducers
Examples / Exercises
- Implement the Soundex algorithm in a transducer.
- Make a transducer that replaces the c in words with an s or k according to its pronunciation.
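For reference while building that Soundex transducer, a plain-function version of the common American Soundex variant (not itself a transducer, just something to check your machine's output against) might look like:

```python
def soundex(word):
    """American Soundex: first letter plus three digits coding the consonants."""
    codes = {**dict.fromkeys('bfpv', '1'), **dict.fromkeys('cgjkqsxz', '2'),
             **dict.fromkeys('dt', '3'), 'l': '4', **dict.fromkeys('mn', '5'),
             'r': '6'}
    word = word.lower()
    result = word[0].upper()
    prev = codes.get(word[0], '')
    for ch in word[1:]:
        if ch in 'hw':                 # h and w do not separate equal codes
            continue
        code = codes.get(ch, '')       # vowels get no code, and reset prev
        if code and code != prev:
            result += code
        prev = code
    return (result + '000')[:4]        # pad/truncate to letter + three digits

assert soundex('Robert') == 'R163'
assert soundex('Tymczak') == 'T522'
assert soundex('Ashcraft') == 'A261'
```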
Weighted Finite State Automata
Context Free Grammars
CFGs are those grammars described by a set of rules of the form
N -> s
- N is a nonterminal and
- s is a list consisting of
- terminals (which are unit strings),
- nonterminals (i.e. the left-hand side of a rule), or
- an empty string.
S -> A B
A -> a A
A -> a     # a common shorthand would combine the two A rules into "A -> a A | a"
B -> b
- CFGs are more expressive than regular expressions in that they allow recursion (and not just a limited form of direct iteration).
- Using recursion is not necessarily more expressive than regular languages are (the above is roughly equivalent to the regexp "aa*b"), but it can be.
- There are formal languages that are not context free
- ...but many things that are close are made to be CFGs because it makes parsing simpler.
- The context-freeness comes from the way CFGs generate/accept a string: at any point, a nonterminal N can be rewritten to any of its right-hand sides s, regardless of the surrounding context. (as opposed to Context-sensitive_grammars, which can be a pain)
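That rewriting process fits in a few lines. A sketch that randomly generates strings from the example grammar earlier in this section (S -> A B, A -> a A | a, B -> b), expanding each nonterminal without ever looking at its context:

```python
import random

# The example grammar from above, as nonterminal -> list of right-hand sides.
grammar = {'S': [['A', 'B']],
           'A': [['a', 'A'], ['a']],
           'B': [['b']]}

def generate(symbol):
    """Rewrite a symbol to a string: pick any production for a nonterminal,
    regardless of context - which is exactly the context-free part."""
    if symbol not in grammar:          # terminal: emit as-is
        return symbol
    production = random.choice(grammar[symbol])
    return ''.join(generate(s) for s in production)

random.seed(0)
samples = {generate('S') for _ in range(50)}
# Every generated string is one or more a's followed by a single b.
assert all(s.endswith('b') and set(s) <= {'a', 'b'} for s in samples)
assert all(len(s) >= 2 for s in samples)   # the shortest string is 'ab'
```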
There are various subclassifications of CFG's, usually restrictive properties related to ease of parsing. These are usually descriptions of what forms rules can take, which translates into certain inherent guarantees in their use. Most of the common ones:
- CNF (Chomsky Normal Form),
- GNF (Greibach Normal Form),
- KNF (Kuroda Normal Form),
- BNF (Backus-Naur Form),
- EBNF (Extended Backus-Naur Form).
Usually (for all of these?(verify)) any CFG can be transformed into any other form - though it often doesn't result in a nicely readable set of rules.
There are also simpler properties about input, such as:
- Unambiguous - guarantees that there will be only one parse for any given acceptable string. Obviously useful, and used in programming language definitions, markup and protocol definitions (e.g. you see EBNF in a bunch of RFCs)
- LL(k) grammar, LR(k) grammar - CFGs that can be parsed with an LL(k) or LR(k) parser (see below).
- The k refers to what amount of lookahead is needed to know which rule to choose, which you could call k-deep ambiguity. For formal languages we try to keep that as low as sensible. LL(1) languages are quite simple to parse, but also less expressive.
Usual parsers are:
- LL and LR parsers, which give a leftmost and rightmost derivation, respectively.
- CYK (Cocke-Younger-Kasami), made for CNF, can give all possible parses. (O(n^3))
- GHR (Graham-Harrison-Ruzzo)
- Earley, which is attractive because it can parse all CFGs. (O(n^3) in the worst case)
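To make CYK concrete, a sketch of the recognizer for a made-up CNF grammar of the language a^n b^n (S -> A B | A X, X -> S B, A -> a, B -> b):

```python
from itertools import product

# CNF grammar for a^n b^n (n >= 1): rules are either N -> A B or N -> 'a'.
terminal_rules = {'a': {'A'}, 'b': {'B'}}
binary_rules   = {('A', 'B'): {'S'}, ('A', 'X'): {'S'}, ('S', 'B'): {'X'}}

def cyk(string, start='S'):
    n = len(string)
    if n == 0:
        return False
    # table[i][l-1] holds the nonterminals deriving the substring
    # of length l starting at position i.
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, ch in enumerate(string):
        table[i][0] = set(terminal_rules.get(ch, ()))
    for length in range(2, n + 1):          # span length
        for i in range(n - length + 1):     # span start
            for split in range(1, length):  # where to split the span in two
                left  = table[i][split - 1]
                right = table[i + split][length - split - 1]
                for lt, rt in product(left, right):
                    table[i][length - 1] |= binary_rules.get((lt, rt), set())
    return start in table[0][n - 1]

assert cyk('ab') and cyk('aaabbb')
assert not cyk('aabbb') and not cyk('ba')
```

The three nested loops over span length, start, and split point are where the O(n^3) comes from.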
In terms of expressiveness, the following can be said:
- RE is a subset of CFG
- CFG isn't a subset of RE
- a CFG intersected with a RE is a CFG
- the intersection of two CFGs is not necessarily a CFG
- neither is the complement of a CFG (CFGs are closed under union, so by De Morgan, closure under complement would imply closure under intersection)
BNF and derivations
In the form of a list of
<symbol> ::= expression
Basic BNF makes it hard to specify options and repetitions; EBNF adds notation for both.
Regular languages are often parsed by lexers, context-free languages by parsers. Related concepts:
- abstract syntax tree, AST - see syntax tree
- Bottom-up parsing - see shift-reduce parsing
- compiler compiler, compiler generator: Create a compiler for a given grammar
- concrete syntax tree - see syntax tree
- lexer, lexical analyzer, lexical scanner - tokenizes input according to the terminals in some grammar(verify). A necessary step in parsing most grammars. Note that a parser generator often also creates one of these (see lexer generator)(verify)
- lexer generator, lexical analyser generator: creates a lexical analyser based on some specification (e.g. all the terminals), which can save a lot of writing and rewriting bother
- Shift-reduce parsing - bottom-up technique: shift input tokens onto a stack, and reduce the top of the stack according to grammar rules
- Top-down parsing - expands from the start symbol towards the input (cf. bottom-up/shift-reduce parsing)
- LALR grammar - a grammar for which an LALR parser exists
- LALR parser - look-ahead LR parser; see LR parsers
- LR parser - reads input from left to right, produces rightmost derivation.
- LR grammar - a grammar for which an LR parser exists
- LL grammar - a grammar for which an LL parser exists
- LL parser - reads input from left to right, produces leftmost derivation
- k, as in 'LL(k) parser', 'LR(k)' and such, refers to the unconsumed-token lookahead. If such a parser exists for a given grammar, it need not backtrack: the parser can unambiguously decide the parse with k tokens of lookahead. Higher-k parsers are hard to write, although parser generators can deal with that for you. Theoretically, LR(k) for k>1 can be transformed into LR(1)
- SLR grammar - a grammar for which an SLR parser exists
- SLR parser - Simple LR parser. Parser-generated LR(0) state machine. Not practical for various real-world languages
- syntax tree
- concrete syntax tree, parse tree - showing the syntactic structure of a string
- abstract syntax tree - shows meaning rather than every detail in the input string
A parser (top-down / bottom-up?) often consists of
- an input buffer, containing tokens from the input string
- a stack for the terminals and non-terminals from the grammar yet to be parsed
- a parsing table of grammar rules, often in terms of the stack top and the next input token
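Those three parts can be seen in a small table-driven LL(1) recognizer; the grammar (S -> a S b | empty, i.e. the language a^n b^n) and all names here are made up for illustration:

```python
# Table-driven LL(1) recognizer for S -> a S b | <empty>  (i.e. a^n b^n).
END = '$'
# Parsing table: (stack-top nonterminal, next input token) -> production.
parse_table = {('S', 'a'): ['a', 'S', 'b'],  # lookahead 'a': expand S -> a S b
               ('S', 'b'): [],               # lookahead 'b': take the empty production
               ('S', END): []}

def ll1_accepts(string):
    buffer = list(string) + [END]       # input buffer, with an end marker
    stack = [END, 'S']                  # stack of symbols yet to be matched
    pos = 0
    while stack:
        top = stack.pop()
        lookahead = buffer[pos]
        if top == lookahead:            # terminal (or end marker): consume it
            pos += 1
        elif (top, lookahead) in parse_table:   # nonterminal: expand via table
            stack.extend(reversed(parse_table[(top, lookahead)]))
        else:
            return False                # no table entry: parse error
    return pos == len(buffer)

assert ll1_accepts('') and ll1_accepts('aabb')
assert not ll1_accepts('aab') and not ll1_accepts('ba')
```

One token of lookahead (the k in LL(1)) is enough here because each (nonterminal, token) pair selects exactly one production.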
Apparently there is a mild preference for LL parsers in Europe and LR in the USA, presumably because of how parsing theory was presented in each place along the way.
Parsers are often one of:
- LALR(1) - somewhat less powerful than LR(1), but simpler to implement. Various parser generators (e.g. yacc and its derivatives) create LALR(1) parsers
- LR(0) - no lookahead, often too simple for use
- LR(1) - general, but complex to implement (...yourself). Many programming language grammars are LR(1)
Ordering of expressiveness of grammars, from more to less: (currently approximate)
- yacc and derivatives
- common lexers
The formal side of informal language
Formal language describes taking such a structured approach to natural language (NLP).
This couldn't go without mention of the Chomskian tradition from the 60s, which was perhaps the first formal, almost mathematical take on language processing, and which, though not practiced the same way today, made strides in the ability to reason about expressiveness, complexity, and such.
The more basic models here resemble CFGs, but with various practical alterations.
For example, unification-based grammar is a step more general than CFGs. It can model CFGs (by not having features) and can do more than CFGs (...by using them). UGs are in fact extremely general, and as such harder to write fast parsers for(verify).
More widely, natural language is still fairly open-ended, with there being no singular best parse, meaning this becomes about modeling rather than representation. Nor is there a singular best representation of a parse (in general; there may be within a framework's restrictions).
In practice there are many variations and inventions, so you don't really talk about CFGs and such at all. See e.g. Grammar formalisms.
So there are a lot of approaches, styles of markup, and more.
Tagsets are the sets of lexical categories you annotate text with.
These tend to be heavily tied to the particular formalism / implementation that you analyse text with, often via a parse tree / dependency tree of some sort.
That is one of a few reasons that the tags from a tagset may not translate directly to other tagsets, and the parse trees much less so - even within a formalism, consistency may come from conventions.
There are developments, like Universal Dependencies, that try to create treebanks in one more-universal dependency-tree style.