Regular expressions

From Helpful
Jump to: navigation, search
This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)
These are primarily notes
It won't be complete in any sense.
It exists to contain fragments of useful information.

This is not exhaustive (see various manuals).

There are better tutorials out there.


Introduction

Regular expressions (regexp, regex) conditionally match a string, according to the structure described in another string.


Regexps are commonly used to:

  • Match: Accept strings only if they are in the described form
  • Search: Try to look for parts of a string that conform -- e.g. to match a line by something it contains (e.g. grep)
  • Substitute/replace: Alter text patterns (regularly using grouping in some way)
  • Extract pieces of information (via grouping)
  • In linguistics and computer science, they are sometimes used to build (FSA-style) transducers


For example, say you have structural directories like

2012-12_proj1
2014-05_bio
temp

Then you may match the first two with something like:

[0-9]{4}-[0-9]{1,2}_[A-Za-z0-9]*

And take out the year, month, and rest by saying what you want to catch in distinct groups:

([0-9]{4})-([0-9]{1,2})_([A-Za-z0-9]*)


Part of the attraction is that a regexp is itself a piece of plain text, and are therefore easily written, transportable, and moderately easy to read -- though have a habit of becoming complex, so rewriting and even understanding can be hard. (A little documentation can go a long way).


Regexps are limited. CFGs can fundamentally describe/accept more types of input. For example, you never want to parse HTML with regexps.

Note that regular expressions apply easiest when the tests involved are mostly sequences, and are quickly less applicable when there are encompassing structures. This is related to the 'regular' referring to a somewhat restrictive formal class of expressiveness (see e.g. regular language).



Flavours

These are primarily notes
It won't be complete in any sense.
It exists to contain fragments of useful information.

Traditional regexps were useful, but minimal, so there are many expansions that are varying features and varying syntax (most don't change the basics, though).

when you do unusually funky stuff, you should probably consider what implementations will evaluate it' Many people only use fairly basic features, though, so it's not necessarily an issue.


Short lists often mention traditional unix, extended posix, perl, emacs.

A longer list:

  • Traditional unix-style (quite restrictive)
  • basic regular expression (BRE)
  • Extended regular expression (ERE), a.k.a. 'Extended' or POSIX modern (obsoletes 'traditional')
  • regexes in Perl code (some extensions, and can hook into code)
  • PCRE (Perl-Compatible Regular Expressions), a library that understands Perl regexp syntax, without the ability to embed perl code
  • Python (some extensions, see [1], maybe examples at Python usage notes/Regexp stuff)
  • Emacs
  • Tcl

...and it's much longer if you consider any feature distinct-making.


Various *nix utilities support multiple styles, so when something doesn't work, it can pay off to check the style you're using (before you decide your gargantuan regexp was probably wrong and rewrite it).

For example

  • grep does BRE by default (or with -G), ERE with -E, PCRE with -P (or unparsed substrings with -F)
  • find uses emacs style be default




Shell globs, e.g.
ls *.jpg
are easier to type, but not as expressive. They only support "any one character here" and "any series of characters here", e.g.
ls 201?-??-??_*.jpg
.

Playing around

Data to play with

For a while, the examples below assume our data is:

ape
boogie
apb
ap
onomatopoetic
arctic

Ways to play

Ways you can play with regular expressions include:

  • a webpage like regexpal.com (or one of many others, such as [2] or [3])
  • a bash shell, using
    cat filename | egrep 'a'
    , but be warned that shell escaping will probably bite you after a while.
    • (Note: you often want egrep (which is grep -E) for extended regexp; basic grep does only basic regexps)
  • the javascript console in your browser
  • langauges with built-in support and an interactive shell, such as Python, Ruby, or Perl.

When it serves readability, I follow the convention of putting a regexp in /slashes/, otherwise I don't. (Implementations vary in how and whether they accept/require slashes; most don't, as they are just delimiters)

Ignore things after # symbols, they're human comments.

Basics

Matching letters

Regular letters (a to z, A to Z, 0 through 9) match exactly themselves. For example:

cat example | grep arctic
cat example | grep ar

will both match 'arctic'/the last line of the test data, because the grep utility prints the whole line if your expression matches anywhere in it; this is one common use of matching. Grep is convenient to coders, and for other people handling text files.



Wildcards, cardinality and ranges

For example, there are wildcards.

There are a few different things you could call wildcards. For starters, there is the period, which accepts any single character. For example,
a..
will match ape and apb, and generally any string that has an a and two characters following it (basically anything as long as the buffer/line doesn't end before that).

(Note: Most implementations will match everything except newlines, unless you can tell them to do so, and do so. This can be useful to extract data from a multi-line string in one go.)


There are also:

  • ?
    , zero or one of the preceding object ('optional'),
  • *
    , zero or more of the preceding object, and
  • +
    , one or more of the preceding object.

For example:

a*      # 'zero or more "a" characters' (primarily useful in combinations)
a.*     # 'an "a" followed by zero or more arbitrary characters', 
        #  for example 'adwfg', 'agg', 'ab', and 'a'
a.+     # Similar, except means that there has to be at least one letter after the "a"
        # (e.g. 'adwfg', 'agg', 'ab', but not 'a') 
ap*     # 'an "a" followed any any number of ps'. In the example words only matches ap


More exact cardinality can often also be specified, using curly brackets (accolades): ((verify) which flavours accept this)

ble{1,2}d  # bleed or bled. (same effect as blee?d)
a{2,}      # two or more 'a' characters
a{,2}      # at most two 'a' characters


The square bracket range, like [abc], specifies a set of things to accept. In itelf, it will match a single character when it is one of the described characters.

ap[be]                # will match apb and ape
ap[a-e]               # you can also have a range (equivalent to [abcde])
ap[b-df-hj-np-tv-z]   # or more than one range. (basic latin consonants)
ap[^aeiou]            # ^ means negation, 
                      # so this means every character except basic latin vowels.
                      # (note that this is *not* the same as the last as 'every letter'
                      #  refers to anything, not just consonants)
[A-Z][A-Za-z]+        # Will match words (two letters or longer) starting with a capital


Notes:

  • To match these wildcard characters literally, escape them with a backslash, like
    \.
    \?
    \*
    \+
    , or use them inside square brackets.
  • to match a literal minus (-) inside square brackets, you can either escape it (e.g. {{inlinecode|[\-A-Za-z0-9_]), or make it the last character in the square brackets (e.g. {{inlinecode|[A-Za-z0-9_-])
  • Capitals are different from non-capitalized letters, but note that implementations will often offer the option to make the whole regexp case-insensitive -- which in some cases may be made unicode-smart.

Anchors

There are a few things that will be matched not by consuming a character / moving the cursor past a letter, but by the context of said cursor.

There are ^ and $, the start-of-buffer and end-of-buffer anchors, which will match when the cursor is at these positions.

^Subject:     # Only things starting with (instead of containing) "Subject:"
^food$        # matches only if there is exactly this in the buffer,
              #                              ...usually meaning 'alone on a line'
^$            # empty line

Grouping

Grouping is the concept of seeing part of a match as a unit. This an then be referenced later, either to the end of fething that data or to use it in a substitution. For example, to get the elements out of the ISO international date format, YYYY-MM-DD, you can use something like:

([0-9]{4})-([01][0-9])-([0123][0-9])

If the data were 1983-03-04, the groups would contain 1983, 03 and 04. Regexps can be useful as simple tokenizers and parsers this way (but only when the language is regular, which not many things are. For many things, a BNF-based parser is the way to go)


They can also do simple transformations. For example,

echo "1983-03-04" | 
  sed -r "s/([0-9]{4})-0?([1]?[0-9])-0?([123]?[0-9])/\3rd day in month \2 of the year \1/"

...would output 4rd day in month 3 of the year 1983. Note the changes that allow this to be fuzzier with zeroes.


Be careful when using regular expressions this way, because you will watch substrings. A classical mistake is obscenity filters using s/ass/butt/g, which makes for sentences like "People who make buttumptions about their search and replace options, will be embarbutted when they repeat this clbuttic mistake. " [4] [5]

Lookahead and lookbehind

refers to using the direct context as conditions, without consuming them. You can both at past characters and the next characters, and apply a logical not to the conditions.

See e.g. these examples in python.

The ability, exact requirements as well as syntax may differ between regexp implementations.


References

Backreferences are references to caught groups within the same regular expression.

For example, to match a single HTML tag and its closing counterpart:

"<([A-Za-z][^>]*)>.*</\1>"

To catch repeated words in typed text:

"\b(\w+)\s+\1\b" 


Forward references are references to something that will be captured later. This is harder to evaluate efficiently, and harder to evaluate at all. Not everything supports these, and not many people use them.

Controlling behaviour

Case insensitivity

Multiline

Commenting



Side notes

...about escaping

A problem you will likely meet is that doing regexps through the shell means you are dealing with two levels of string escaping, where it will interpret some backslashes and not others, partly depending on which type of string quoting you used.

There is obviously a system to it, but it gives most people I know headaches. Languages too will have this effect, but usually less so. If you want even less bother, use something like python for its raw strings which does not at all interpret backslashes (e.g.
r'\n'
for the literal backslash followed by an n characters).

searching vs. matching

When you tell the regexp library to do so, you can use regexps in slighty varying ways. Possible functionality:

  • Filter many lines; return only lines that contain a match somewhere - example: grep
  • Inquire whether a string has a match anywhere, for example looking for abbreviations
  • Inquire whether a string has a match starting at a given position - this is basically the same functionality, but takes less evaluation so is a little faster when you know what you're doing.
  • substitute, like already seen.

Support overview

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)
Support for traditional unix extended perl
* y y y
. y y y
+ n y y
? n y y
[chars], [^chars] y y y
^, $ y y y
{n,m} y y y
(group)ing  ?  ? y
| as alternator n y y
backreferences y y y

Many implementations don't fit into this because they augment some features they consider useful.

? means unchecked.

(TODO: add forward references and other irregularities)

On speed

Expressions in regular languages can in theory be compiled to Finite State Automata (FSAs) (both DFA and NFA style, both of which have somewhat varying merit in different regexp situations), which means that most can be executed very efficiently, and their worst-case running time may be a degree of complexity better than what almost any implementation currently does.

Current real-world implementations often rely on backtracking in some form - Perl, .NET, Python, Java, etc. That is, they often use FSAs for local patterns, but use backtracing for some features and for complex wholes, which means efficiency varies between these implementations, complexity class doesn't seem to. (verify)

Apparently particularly for Perl's implementation (but likely for all) there exist few expression/data combinations that take explosively/unexpectedly/unnecessarily long, or lead it to terminate the evaluation. (See also 'catastrophic backtracking').


It seems that the most important reason backtracking is used instead of a state machine implementation is that only the very basic core of regexps (character matches, +, *, and ?, and a few others) are simple to implement this way. Grep and such can get away with that approach because they can get away with not supporting much more, but regexp libraries often provide so many features (which are themselves often the main reason to choose to do something via a regular expression) that a number of them are hard and/or inefficient in a state machine. Perl's regular expressions mean it doesn't even conform to a regular language anymore (though not in a very significant way).

Even so, it does seem that you can certainly do better in terms of speed than what most regexp libraries now do, even if it's in a subset of regexp cases.

See also

About

Cheat sheets

Try yourself

Tutorial-style

Thoughts

Unsorted