Regular expressions

From Helpful
(Redirected from Regexp)
Jump to navigation Jump to search
📃 These are primarily notes, intended to be a collection of useful fragments, that will probably never be complete in any sense.

This is not exhaustive (see various manuals).

There are better tutorials out there.


Also:

Some people, when confronted with a problem, think "I know, I'll use regular expressions." 
Now they have two problems. [1]


Introduction

Regular expressions (regexp, regex) let you express a pattern in a string, which you can then use to do one of:

  • Match: Accept strings only if they are in the described form
  • Search: Try to look for parts of a string that conform -- e.g. to match a line by something it contains (e.g. grep)
which is like basically matching but can start later in the string
  • Extract pieces of information (via grouping)
  • Substitute/replace: Alter text patterns (regularly using grouping in some way)
  • they are sometimes used to build (FSA-style) transducers - state machines


For example, say you have structural directories like

2012-12_proj1
2014-05_bio
temp

You can match looking for the year-month_name pattern, with something like:

[0-9]{4}-[0-9]{1,2}_[A-Za-z0-9]*


More useful than just matching is extracting, e.g. the year, month, and name/rest:

([0-9]{4})-([0-9]{1,2})_([A-Za-z0-9]*)


Part of the attraction is that a regexp is itself a piece of plain text, and are therefore easily written, transportable, and generally simple to set up in code.

And moderately easy to read, though they have a habit of becoming complex, so rewriting and even understanding can be time consuming. (A little documentation can go a long way).


In general, regular expressions apply best when you are dealing with sequences of things.

This is related to the 'regular' referring to a a regular language), which is a fairly restricted formal grammar. Which is also why you can compile them into state machine, which is efficient when you want to match a lot of strings the same way, e.g. used in linguistics, also virus scanners.

They're pretty poor when dealing with encompassing structures. That's where you probably want a (simple) CFGs, which are more expressive but more work to set up, to consume, have more edge cases and are sometimes slower to parse.


Flavours

📃 These are primarily notes, intended to be a collection of useful fragments, that will probably never be complete in any sense.

Traditional regexps were useful, but minimal.

There are many later expansions that are varying features and varying syntax.

The basics rarely change, but some features (and the syntax of some features) can be specific to a flavour. Many people only use fairly basic features, though, so it's not necessarily an issue.


Shorter lists often mention traditional unix, extended posix, perl, emacs.

Various *nix utilities support a few styles, so when something doesn't work, it can pay off to check the style you're using. For example, grep does BRE by default (or with -G), ERE with -E, PCRE with -P (or unparsed substrings with -F)


A longer list:

  • Traditional unix-style (quite restrictive)
  • basic regular expression (BRE)
  • Extended regular expression (ERE), a.k.a. 'Extended' or POSIX modern (obsoletes 'traditional')
  • regexes in Perl code (some extensions, and can hook into code)
  • PCRE (Perl-Compatible Regular Expressions), a library that understands Perl regexp syntax, without the ability to embed perl code
  • Python (some extensions, see [2], maybe examples at Python usage notes/Regexp stuff)
  • Emacs
  • Tcl



Playing around

Data to play with

For a while, the examples below assume our data is:

ape
boogie
apb
ap
onomatopoetic
arctic

Ways to play

Ways you can play with regular expressions include:

  • the javascript console in your browser
  • langauges with built-in support and an interactive shell, such as Python, Ruby, or Perl.
  • a bash shell, using cat filename | egrep 'a', but be warned that shell escaping makes this harder than necessary
    • (Note: you often want egrep (which is grep -E) for extended regexp; basic grep does only basic regexps)


In the below,

when it serves readability, I follow the convention of putting a regexp in /slashes/, otherwise I don't. (Implementations vary in how and whether they accept/require slashes; most don't, as they are just delimiters)
Ignore things after # symbols, they're human comments.

Basics

Matching letters

Regular letters (a to z, A to Z, 0 through 9) match exactly themselves. For example:

cat example | grep arctic
cat example | grep ar

will both match 'arctic'/the last line of the test data, because the grep utility prints the whole line if your expression matches anywhere in it; this is one common use of matching. Grep is convenient to coders, and for other people handling text files.



Wildcards, cardinality and ranges

For example, there are wildcards.

There are a few different things you could call wildcards. For starters, there is the period, which accepts any single character. For example, a.. will match ape and apb, and generally any string that has an a and two characters following it (basically anything as long as the buffer/line doesn't end before that).


In most implementations, . will match everything except newlines.

If you want to match everything including newlines, there is often a flag you can set that explicitly means 'dot should match everything' .
If there is no such flag, or you want a trick that doesn't depend on that, consider tricks like like [\s\S])}}


There are also:

  • ?, zero or one of the preceding object
  • *, zero or more of the preceding object
  • +, one or more of the preceding object

For example:

a*      # 'zero or more "a" characters' (primarily useful in combinations)
a.*     # 'an "a" followed by zero or more arbitrary characters', 
        #  for example 'adwfg', 'agg', 'ab', and 'a'
a.+     # Similar, except means that there has to be at least one letter after the "a"
        # (e.g. 'adwfg', 'agg', 'ab', but not 'a') 
ap*     # 'an "a" followed any any number of ps'. In the example words only matches ap


More exact cardinality can often also be specified, using curly brackets (accolades): ((verify) which flavours accept this)

ble{1,2}d  # bleed or bled. (same effect as blee?d)
a{2,}      # two or more 'a' characters
a{,2}      # at most two 'a' characters


The square bracket range, like [abc], specifies a set of things to accept. In itelf, it will match a single character when it is one of the described characters.

ap[be]                # will match apb and ape
ap[a-e]               # you can also have a range (equivalent to [abcde])
ap[b-df-hj-np-tv-z]   # or more than one range. (basic latin consonants)
ap[^aeiou]            # ^ means negation, 
                      # so this means every character except basic latin vowels.
                      # (note that this is *not* the same as the last as 'every letter'
                      #  refers to anything, not just consonants)
[A-Z][A-Za-z]+        # Will match words (two letters or longer) starting with a capital


Notes:

  • To match these wildcard characters literally, escape them with a backslash, like \. \? \* \+, or use them inside square brackets.
  • to match a literal minus (-) inside square brackets, you can either escape it (e.g. {{inlinecode|[\-A-Za-z0-9_]), or make it the last character in the square brackets (e.g. {{inlinecode|[A-Za-z0-9_-])
  • Capitals are different from non-capitalized letters, but note that implementations will often offer the option to make the whole regexp case-insensitive -- which in some cases may be made unicode-smart.

Anchors

There are a few things that will be matched not by consuming a character / moving the cursor past a letter, but by the context of said cursor.

There are ^ and $, the start-of-buffer and end-of-buffer anchors, which will match when the cursor is at these positions.

^Subject:     # Only things starting with (instead of containing) "Subject:"
^food$        # matches only if there is exactly this in the buffer,
              #                              ...usually meaning 'alone on a line'
^$            # empty line

Grouping

Grouping is the concept of seeing part of a match as a unit. This an then be referenced later, either to the end of fething that data or to use it in a substitution. For example, to get the elements out of the ISO international date format, YYYY-MM-DD, you can use something like:

([0-9]{4})-([01][0-9])-([0123][0-9])

If the data were 1983-03-04, the groups would contain 1983, 03 and 04. Regexps can be useful as simple tokenizers and parsers this way (but only when the language is regular, which not many things are. For many things, a BNF-based parser is the way to go)


They can also do simple transformations. For example,

echo "1983-03-04" | 
  sed -r "s/([0-9]{4})-0?([1]?[0-9])-0?([123]?[0-9])/Day \3 in month \2 of the year \1/"

...would output 4rd day in month 3 of the year 1983. Note the changes that allow this to be fuzzier with zeroes.


Be careful when using regular expressions this way, because you will watch substrings. A classical mistake is obscenity filters using s/ass/butt/g, which makes for sentences like "People who make buttumptions about their search and replace options, will be embarbutted when they repeat this clbuttic mistake. " [5] [6]

Lookahead and lookbehind

refers to using the direct context as conditions, without consuming them. You can both at past characters and the next characters, and apply a logical not to the conditions.

See e.g. these examples in python.

The ability, exact requirements as well as syntax may differ between regexp implementations.


References

Backreferences are references to caught groups within the same regular expression.

For example, to match a single HTML tag and its closing counterpart:

"<([A-Za-z][^>]*)>.*</\1>"

To catch repeated words in typed text:

"\b(\w+)\s+\1\b" 


Forward references are references to something that will be captured later. This is harder to evaluate efficiently, and harder to evaluate at all. Not everything supports these, and not many people use them.

Controlling behaviour

Case insensitivity

Multiline

Commenting



Side notes

...about escaping

A problem you will likely meet is that doing regexps through the shell means you are dealing with two levels of string escaping, where it will interpret some backslashes and not others, partly depending on which type of string quoting you used.

There is obviously a system to it, but it gives most people I know headaches. Languages too will have this effect, but usually less so. If you want even less bother, use something like python for its raw strings which does not at all interpret backslashes (e.g. r'\n' for the literal backslash followed by an n characters).

searching vs. matching

When you tell the regexp library to do so, you can use regexps in slighty varying ways. Possible functionality:

  • Filter many lines; return only lines that contain a match somewhere - example: grep
  • Inquire whether a string has a match anywhere, for example looking for abbreviations
  • Inquire whether a string has a match starting at a given position - this is basically the same functionality, but takes less evaluation so is a little faster when you know what you're doing.
  • substitute, like already seen.

Support overview

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.
Support for traditional unix extended perl
* y y y
. y y y
+ n y y
? n y y
[chars], [^chars] y y y
^, $ y y y
{n,m} y y y
(group)ing ? ? y
| as alternator n y y
backreferences y y y

Many implementations don't fit into this because they augment some features they consider useful.

? means unchecked.

(TODO: add forward references and other irregularities)

On speed

Expressions in regular languages can in theory be compiled to Finite State Automata (FSAs) (both DFA and NFA style, both of which have somewhat varying merit in different regexp situations), which means that most can be executed very efficiently, and their worst-case running time may be a degree of complexity better than what almost any implementation currently does.

Current real-world implementations often rely on backtracking in some form - Perl, .NET, Python, Java, etc. That is, they often use FSAs for local patterns, but use backtracing for some features and for complex wholes, which means efficiency varies between these implementations, complexity class doesn't seem to. (verify)

Apparently particularly for Perl's implementation (but likely for all) there exist few expression/data combinations that take explosively/unexpectedly/unnecessarily long, or lead it to terminate the evaluation. (See also 'catastrophic backtracking').


It seems that the most important reason backtracking is used instead of a state machine implementation is that only the very basic core of regexps (character matches, +, *, and ?, and a few others) are simple to implement this way. Grep and such can get away with that approach because they can get away with not supporting much more, but regexp libraries often provide so many features (which are themselves often the main reason to choose to do something via a regular expression) that a number of them are hard and/or inefficient in a state machine. Perl's regular expressions mean it doesn't even conform to a regular language anymore (though not in a very significant way).

Even so, it does seem that you can certainly do better in terms of speed than what most regexp libraries now do, even if it's in a subset of regexp cases.

See also

About

Cheat sheets

Try yourself

Tutorial-style

Thoughts

Unsorted