Formal grammars - regular expressions, CFGs, formal language


This is more an overview for my own use than for teaching or exercise.




These are primarily notes
It won't be complete in any sense.
It exists to contain fragments of useful information.

Introduction

Regular Expressions and Finite State Automata

Regular expressions (excluding some of the more complex extensions) can be compiled into FSAs. Regexps are useful even with their somewhat limited expressiveness: they can contain unbounded iteration (star and plus), so they can model iteration in natural language to a limited degree, and beyond that handle optionality, alternatives, and a few other things.

Outside the formal-language purpose, basic POSIX regexps are used to look for simple words, match several variations of a word, find alternative patterns in arbitrary text, do information extraction, and more - though their output is often handled more precisely by something else afterwards. Linguistically, you would often work on a letter basis, for example to spell-correct, conjugate verbs, or pluralize nouns, or on phonemes or syllables, for various purposes.


Sometimes it is even conceptually easier to build a machine directly instead of a regular expression. Both DFAs and NFAs (Deterministic and Nondeterministic Finite Automata) are usable in this context. (Neither is necessarily the better choice: NFAs are sometimes smaller and parallelizable; DFAs are often easier to read and slightly faster to evaluate.) Note that in their basic form they are acceptor machines: for each input string they transition according to its symbols, and either end up in an accepting state or not.


Applications of this are things that deal with conceptual units such as graphemes, morphemes, and phonemes - the aforementioned conjugation of a verb, pluralisation of a noun, estimating where syllable breaks are, and so on. That is, recognizing whether a form makes sense, rather than generating forms, at least for now.

To do this, one usually needs a fairly strict rule system, usually designed abstractly. This may introduce a lot of categories and properties for words, because you will probably want to know what a word is (e.g. an irregular plural noun) to be able to know what to do with it. After all, we are trying to avoid storing every word, and while you can do a fair amount with just the last couple of letters, you can only do so much.


Many regexp implementations compile to automata. Given that, it is interesting to be able to play with the resulting machines, or with both in interaction - after all, there are formal operations to combine machines, which means you can re-use and combine them. More about that in the transducer section.
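
For instance, one of those operations is intersection, via the product construction. A minimal sketch in Python (the dict-of-edges encoding, the state names, and the example machines are just illustration choices):

def intersect_dfas(trans1, start1, accept1, trans2, start2, accept2):
    # product construction: states of the new machine are pairs of old states,
    # and it accepts exactly the strings that both original machines accept
    start = (start1, start2)
    trans, accept = {}, set()
    seen, todo = {start}, [start]
    while todo:
        state = todo.pop()
        s1, s2 = state
        if s1 in accept1 and s2 in accept2:
            accept.add(state)
        for (q, sym) in list(trans1):
            if q != s1 or (s2, sym) not in trans2:
                continue
            nxt = (trans1[(s1, sym)], trans2[(s2, sym)])
            trans[(state, sym)] = nxt
            if nxt not in seen:
                seen.add(nxt)
                todo.append(nxt)
    return trans, start, accept

# example: strings over {a, b} that end in 'b' AND have even length
ends_in_b = ({('p', 'a'): 'p', ('p', 'b'): 'q',
              ('q', 'a'): 'p', ('q', 'b'): 'q'}, 'p', {'q'})
even_len = ({('e', 'a'): 'o', ('e', 'b'): 'o',
             ('o', 'a'): 'e', ('o', 'b'): 'e'}, 'e', {'e'})
trans, start, accept = intersect_dfas(*ends_in_b, *even_len)
# the result accepts 'ab' but not 'b' or 'aab'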


Storage

One somewhat specialist use is data storage for efficient lookup.

Tuple Dictionaries

Perfect Hash


Examples / exercises

Regular expressions

Examples: (generally assuming alphabet is only lowercase a to z, sometimes general ASCII)

  • All lowercase strings ending in 'b':
    [a-z]*b
  • Repeated words:
    \b([a-zA-Z]+) \1
  • Capitalized alliterations:
    \b([A-Z])[A-Za-z]* \1
  • A string without an e:
    ^[a-df-z]*$
    (assuming presence of anchors)
  • A string with at least two vowels:
    ^[b-df-hj-np-tv-z]*[aeiou]+[b-df-hj-np-tv-z]*[aeiou]+[a-z]*$

Note that there are usually various solutions, some clearer and shorter than others, though sometimes with somewhat different results. For example, consonants could also be written [^aeiou], but only when your alphabet allows only a-z - with an 8-bit character set that negation matches around 250 characters, not the 21 you want.
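
As a quick illustration of a few of the above, a sketch using Python's re module (the specific test strings are just examples):

import re

# repeated words, via a backreference to the captured group
print(re.search(r'\b([a-zA-Z]+) \1\b', 'it happened in in May'))    # matches 'in in'

# a string without an 'e' (anchored; assumes only lowercase a-z appears)
print(bool(re.fullmatch(r'[a-df-z]*', 'python')))    # True
print(bool(re.fullmatch(r'[a-df-z]*', 'regexp')))    # False

# at least two vowels
two_vowels = re.compile(r'^[b-df-hj-np-tv-z]*[aeiou]+[b-df-hj-np-tv-z]*[aeiou]+[a-z]*$')
print(bool(two_vowels.match('formal')))   # True
print(bool(two_vowels.match('b')))        # False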


Exercises:

  • Make a regexp that accepts a word containing two vowels, ending with a consonant, and having an odd number of letters. (Note that this would be easier with a regexp/FSA team)
  • Explore different implementations: how they deal with greediness, grouping, multiple matches, and overlapping matches, and whether they allow backreferences, forward references, and substitution.
  • Think of and possibly implement a KISS ELIZA, eg. with a number of rules like:
cat | tr '.' '\n' | tr -s '\n' | \
sed -e "s/I'm/I am/g" | sed -e "s/I've/I have/g" | \
sed -e "s/I am sad/I'm sorry to hear that./g" | \
sed -e "s/I am \([a-zA-Z\ ]*\)/Why do you say you are \1?/g" | \
sed -e "s/I have \([a-zA-Z\ ]*\)/How long have you had \1?/g"

...and other such rules. (Think of marking your output so it won't get processed again. You also probably want to use -r)


Automata

Examples: (to come, whenever I get around to doing the graphviz diagrams)

Exercises:

  • Write a regular expression for a machine. Algorithms exist for this. If you can't simplify it in one step, remember that a disjunction of all possible paths is correct, if not very minimal.
  • Describe an algorithm for negating a DFA, i.e. one that builds a machine accepting exactly the strings the original does not accept.

Concepts and Algorithms

Regular Grammar, Regular Expression, Finite State Machine (FSM, or ~Automaton, FSA) (Deterministic FSA (DFA or DFSA), Nondeterministic FSA (NFA or NFSA)).


An FSA is usually described as a five-tuple (Q, Σ, S, E, F) where:

  • Q is the set of states
  • Σ is the set of symbols (the alphabet)
  • S is the start state (or states), S ⊆ Q
  • E is the set of final, accepting states, E ⊆ Q
  • F is the set of edges, seen as triples from Q × (Σ ∪ {ε}) × Q

Whether it is deterministic (always has exactly one state active) can be seen from S and F: if there is a single start state, there are no ε-edges, and no two triples in F share the same first two elements (i.e. from any single state, a given symbol points to at most one other node), then it is deterministic.
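
A minimal sketch of such an acceptor, here a DFA in Python; encoding states as strings, edges as a dict, and the 'ends in b' language are all just illustration choices:

import string

# DFA accepting lowercase strings that end in 'b' (compare the regexp [a-z]*b)
start = 'other'
accepting = {'sawb'}
edges = {}
for c in string.ascii_lowercase:
    edges[('other', c)] = 'sawb' if c == 'b' else 'other'
    edges[('sawb', c)] = 'sawb' if c == 'b' else 'other'

def accepts(s):
    state = start
    for c in s:
        if (state, c) not in edges:     # symbol outside the alphabet: reject
            return False
        state = edges[(state, c)]
    return state in accepting

print(accepts('cab'))    # True
print(accepts('cabs'))   # False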


There are algorithms that minimise a machine to its most compact form. Interestingly, for a given language there is only one minimal DFA (up to renaming of states), so if you build a machine in several ways (e.g. once from state transitions, once compiled from a regexp) and they are equal after minimization, that proves the accepted languages are the same.


State space search: the problem definition implies a set of possible answers that is searched through, either by pre-building it or by lazily evaluating it, and either returning on the first answer or exhaustively returning all of them. (Evaluating an NFA for a string is an example of such a search, as is Perl's regexp implementation, which uses backtracking.)
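
Besides backtracking, an NFA can also be evaluated by tracking the set of all states it could currently be in (the subset construction, done lazily). A minimal sketch, using the same sort of dict encoding as above but allowing several targets per (state, symbol); the example machine is made up:

def nfa_accepts(edges, starts, accepting, s):
    # edges maps (state, symbol) -> set of possible next states (no epsilon edges here)
    current = set(starts)
    for c in s:
        current = set().union(*(edges.get((q, c), set()) for q in current))
        if not current:      # no live states left, reject early
            return False
    return bool(current & accepting)

# NFA for strings over {a, b} that contain 'ab': nondeterministically guess where it starts
edges = {
    ('q0', 'a'): {'q0', 'q1'},
    ('q0', 'b'): {'q0'},
    ('q1', 'b'): {'q2'},
    ('q2', 'a'): {'q2'},
    ('q2', 'b'): {'q2'},
}
print(nfa_accepts(edges, {'q0'}, {'q2'}, 'baab'))   # True
print(nfa_accepts(edges, {'q0'}, {'q2'}, 'bba'))    # False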


Finite State Transducers

Examples / Exercises

  • Implement the Soundex algorithm in a transducer.
  • Make a transducer that replaces the c in words by an s or k according to its pronunciation.
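
To give the flavour of the encoding (not a serious answer to either exercise): a transducer looks much like the acceptors above, except each edge also carries output. A minimal sketch in Python, with a toy c-to-s/k rule that is far simpler than real pronunciation:

def transduce(edges, start, s):
    # edges maps (state, input symbol) -> (output string, next state)
    out, state = [], start
    for c in s:
        if (state, c) not in edges:
            return None               # input not covered by the machine
        piece, state = edges[(state, c)]
        out.append(piece)
    return ''.join(out)

# toy rule: write 'c' as 's' before e/i, as 'k' otherwise (one state of memory;
# note a word-final 'c' would be dropped here - real FSTs also handle end-of-input)
letters = 'abcdefghijklmnopqrstuvwxyz'
edges = {}
for c in letters:
    edges[('plain', c)] = (c, 'plain')
    edges[('sawc', c)] = (('s' if c in 'ei' else 'k') + c, 'plain')
edges[('plain', 'c')] = ('', 'sawc')   # delay output until we see what follows
edges[('sawc', 'c')] = ('k', 'sawc')

print(transduce(edges, 'plain', 'cider'))   # 'sider'
print(transduce(edges, 'plain', 'cat'))     # 'kat'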


Weighted Finite State Automata

Context Free Grammars

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)


CFGs are grammars described by a set of rules of the form

N -> s

where

  • N is a nonterminal and
  • s is a list consisting of
    • terminals (which are unit strings),
    • nonterminals (i.e. the left-hand side of a rule), or
    • an empty string.


For example:

S -> A B
A -> a A
A -> a  #a common shorthand would combine these two into "A -> a A | a"
B -> b
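
To make the generative reading concrete, a minimal sketch that randomly expands this grammar (Python; the dict encoding of the rules is just one convenient choice):

import random

rules = {
    'S': [['A', 'B']],
    'A': [['a', 'A'], ['a']],
    'B': [['b']],
}

def generate(symbol='S'):
    # nonterminals appear in the rule table; anything else is a terminal
    if symbol not in rules:
        return symbol
    production = random.choice(rules[symbol])
    return ''.join(generate(s) for s in production)

print(generate())   # 'ab', 'aab', 'aaab', ... - one or more a's followed by a b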


Notes:

  • CFGs are more expressive than regular expressions in that they allow recursion (and not just a limited form of direct iteration).
  • Using recursion does not necessarily take you beyond regular languages (the above is equivalent to the regexp "aa*b"), but it can.
  • There are formal languages that are not context-free
...but many things that are close are designed as CFGs because that makes parsing simpler.
  • The context-freeness comes from the way CFGs generate/accept a string: at any point, a nonterminal N can be replaced by any of its right-hand sides s, regardless of context (as opposed to context-sensitive grammars, which can be a pain).


There are various subclassifications of CFGs, usually restrictive properties related to ease of parsing. These are usually descriptions of what forms rules can take, which translates into certain inherent guarantees in their use; common examples include Chomsky Normal Form and Greibach Normal Form.

Usually (for all of these?(verify)) any CFG can be transformed into any other form - though it often doesn't result in a nicely readable set of rules.



Note that these grammars are not necessarily unambiguous.

When applied more mathematically, we design grammars to be unambiguous - for example for markup and protocol descriptions; you see EBNF in a bunch of RFCs.


In terms of expressiveness, the following can be said:

  • RE is a subset of CFG
  • CFG isn't a subset of RE
  • a CFG intersected with a RE is a CFG

Also:

  • Taking the intersection of two CFGs doesn't necessarily give a CFG (context-free languages are not closed under intersection)
  • Taking the complement also doesn't (if it did, closure under union plus De Morgan would give closure under intersection)



BNF and derivations

Parsers

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)


  • abstract syntax tree, AST - see syntax tree
  • Bottom-up parsing - see shift-reduce parsing
  • compiler compiler, compiler generator: Create a compiler for a given grammar
  • concrete syntax tree - see syntax tree
  • lexer, lexical analyzer, lexical scanner - tokenizes input, typically according to the terminals of some grammar(verify). A necessary first step in parsing most grammars. Note that a parser generator often also creates one of these (see lexer generator)(verify)
  • lexer generator, lexical analyser generator: creates a lexical analyser based on some specification (e.g. all the terminals), which can save a lot of writing and rewriting bother
  • Shift-reduce parsing - bottom-up parsing that repeatedly shifts input tokens onto a stack and reduces the top of the stack according to grammar rules
  • Top-down parsing (cf. bottom-up/shift-reduce) [1]
  • LALR grammar
  • LALR parser - look-ahead LR parser; see LR parsers
  • LR parser - reads input from left to right, produces a rightmost derivation (in reverse).
  • LR grammar
  • LL grammar
  • LL parser - reads input from left to right, produces a leftmost derivation
  • k, as in 'LL(k) parser', 'LR(k)', and such, refers to the number of unconsumed tokens of lookahead. If such a parser exists for a given grammar, it need not backtrack: it can unambiguously decide each parsing step with k tokens of lookahead. Higher-k parsers are hard to write by hand, although parser generators can deal with that for you. Theoretically, any LR(k) grammar for k>1 can be transformed into an LR(1) grammar
  • SLR grammar
  • SLR parser - Simple LR parser; drives the LR(0) state machine with FOLLOW-set information to decide reductions. Not powerful enough for various real-world languages[2]
  • syntax tree
    • concrete syntax tree, parse tree - showing the syntactic structure of a string
    • abstract syntax tree - shows meaning rather than every detail in the input string


A table-driven parser (whether top-down or bottom-up) often consists of

  • an input buffer, containing tokens from the input string
  • a stack for the terminals and non-terminals from the grammar yet to be parsed
  • a parsing table of grammar rules, often in terms of the stack top and the next input token
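
A minimal sketch of that structure, as a table-driven LL(1) parser in Python. The grammar is a left-factored version of the earlier example (S -> A B, A -> a A', A' -> a A' | ε, B -> b), since A -> a A | a is not directly LL(1); the table maps (nonterminal, next token) to the right-hand side to push:

table = {
    ('S', 'a'): ['A', 'B'],
    ('A', 'a'): ['a', "A'"],
    ("A'", 'a'): ['a', "A'"],
    ("A'", 'b'): [],           # epsilon
    ('B', 'b'): ['b'],
}
nonterminals = {'S', 'A', "A'", 'B'}

def parse(tokens):
    stack = ['$', 'S']                 # symbols yet to be matched
    tokens = list(tokens) + ['$']      # input buffer with end marker
    pos = 0
    while stack:
        top = stack.pop()
        if top not in nonterminals:    # terminal (or end marker): must match the input
            if top != tokens[pos]:
                return False
            pos += 1
        else:                          # nonterminal: consult the parsing table
            rule = table.get((top, tokens[pos]))
            if rule is None:
                return False
            stack.extend(reversed(rule))    # push the right-hand side, leftmost on top
    return pos == len(tokens)

print(parse('aab'))   # True
print(parse('ab'))    # True
print(parse('b'))     # False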


Apparently there is a mild preference for LL parsers in Europe and LR parsers in the USA, presumably because of how parsing theory was presented in each place.


Parsers are often one of:

  • LALR(1) - somewhat less powerful than LR(1), but simpler to implement. Various parser generators (e.g. yacc derivatives) create LALR(1) parsers
  • LR(0) - no lookahead, often too simple for use
  • SLR(1)
  • LR(1) - general, but complex to implement (...yourself). Many programming language grammars are LR(1)


Ordering of expressiveness of grammars, more to less: (currently approximate)

  • LR
  • LALR
  • SLR


Software:

  • yacc and derivatives
    • yacc (AT&T) - builds LALR parser for a CFG in C
    • GNU bison - builds LALR parser for a CFG in C, C++, or Java
    • Berkeley Yacc [3]
    • PLY ('Python Lex-Yacc') [4]
    • racc (ruby yacc) [5]
    • MKS yacc [6]
    • Abraxas pcyacc [7]
  • Common lexers
    • lex - lexer generator (often seen with yacc) [8]
    • flex, flex++ - lexer generator (often seen with bison) [9]
  • antlr
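
For a sense of what using one of these looks like, a small sketch with PLY (mentioned above; assumes the ply package is installed). It follows PLY's convention of t_/p_ prefixed names with the token regexp or grammar rule in the docstring, and builds an LALR(1) parser for a toy expression grammar:

import ply.lex as lex
import ply.yacc as yacc

# lexer part: token names, plus one regexp (or function) per token
tokens = ('NUMBER', 'PLUS', 'TIMES')
t_PLUS = r'\+'
t_TIMES = r'\*'
t_ignore = ' \t'

def t_NUMBER(t):
    r'\d+'
    t.value = int(t.value)
    return t

def t_error(t):
    t.lexer.skip(1)

# parser part: one function per grammar rule, rule in the docstring
def p_expr_plus(p):
    'expr : expr PLUS term'
    p[0] = p[1] + p[3]

def p_expr_term(p):
    'expr : term'
    p[0] = p[1]

def p_term_times(p):
    'term : term TIMES NUMBER'
    p[0] = p[1] * p[3]

def p_term_number(p):
    'term : NUMBER'
    p[0] = p[1]

def p_error(p):
    print('syntax error at', p)

lexer = lex.lex()
parser = yacc.yacc()
print(parser.parse('2 + 3 * 4'))   # 14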


See also:


Unsorted:

Context-sensitive grammars

The formal side of informal language

Formal language describes taking such a structured approach to natural language (NLP).


Grammar formalisms and grammar frameworks describe varied approaches to expressing grammars in more formal, structured terms.

These are often used in computational linguistics, specifically for interests such as language modeling or machine translation - though note that translation and text analysis also include various methods that are more free-form than going via a structured grammar.



This can't go without mention of the Chomskian tradition from the 60s, perhaps the first formal, almost mathematical take on language, which - though not practiced the same way today - made great strides in the ability to reason about expressiveness, complexity, and such.


The more basic models here resemble CFGs, but with various practical alterations.


For example, unification-based grammars are a step more general than CFGs: they can model CFGs (by not using features) and can do more (by using them). UGs are in fact extremely general, and as such it is harder to write fast parsers for them(verify).
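
The 'unification' here refers to merging feature structures, which you can picture as nested attribute-value maps: two structures unify if they do not conflict, and the result carries the combined information. A minimal sketch of just that operation (Python dicts standing in for feature structures; real systems add variables, reentrancy, typed features, and so on):

def unify(a, b):
    # returns the merged feature structure, or None if the two conflict
    if isinstance(a, dict) and isinstance(b, dict):
        out = dict(a)
        for key, value in b.items():
            if key in out:
                merged = unify(out[key], value)
                if merged is None:
                    return None
                out[key] = merged
            else:
                out[key] = value
        return out
    return a if a == b else None

np = {'cat': 'NP', 'agr': {'num': 'sg'}}
verb_3sg = {'agr': {'num': 'sg', 'per': 3}}
verb_pl = {'agr': {'num': 'pl'}}

print(unify(np, verb_3sg))   # {'cat': 'NP', 'agr': {'num': 'sg', 'per': 3}}
print(unify(np, verb_pl))    # None - number disagreement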


More widely, the formal treatment of natural language is still fairly open-ended, with there being no singular best parse, meaning this becomes about modeling rather than representation. Nor is there a singular best representation of a parse (in general; there may be one within a framework's restrictions).


In practice there are many variations and inventions, so you don't really talk about CFGs and such at all. See e.g. Grammar formalisms.

So there are a lot of approaches, styles of markup, and more.


This also interplays with habits and choices at the tagset and dependency-tree level.

Tagsets by themselves are the lexical categories you annotate text with.

These tend to be heavily tied to the particular formalism / implementation that you analyse text with, often via a parse tree / dependency tree of some sort.

This is one of a few reasons that the tags from one tagset may not translate directly to another, and the parse trees much less so - even within a formalism, consistency may come from conventions.

There are developments, like Universal Dependencies, that try to create treebanks in a single, more universal dependency-tree style.
