Python usage notes/Regexp stuff
Frequently used functions
Finding patterns is usually done with one of:
- re.search() - find anywhere within string
- returns a match object (or None if nothing matched)
- re.match() - match at start of string only (presumably slightly faster than search?)
- returns a match object (or None if nothing matched)
- re.findall() - find all matches in string
- returns a list of captured strings - short to write when that's all you wanted
- re.finditer() - find all matches in string
- returns an iterator of match objects - flexible if you want to do more with each hit than just return its text
You will often want to capture specific parts of a string, by putting parentheses around that part of the pattern.
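As a quick side-by-side of the four calls above, a minimal sketch (the sample string and the pattern are made up for illustration):

```python
import re

text = 'id 12, id 345'        # made-up sample input
pattern = r'id ([0-9]+)'      # one capturing group: the digits

first = re.search(pattern, text)          # first hit anywhere in the string
at_start = re.match(pattern, text)        # only tries at position 0
no_match = re.match(pattern, ' ' + text)  # leading space, so nothing at position 0

captured = re.findall(pattern, text)      # list of the captured strings
spans = [m.span() for m in re.finditer(pattern, text)]  # one match object per hit

print(first.group(1))   # 12
print(no_match)         # None
print(captured)         # ['12', '345']
print(spans)            # [(0, 5), (7, 13)]
```

Since search() and match() return None when nothing matches, you will usually want to test the result before calling methods on it.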
Match objects
Match objects are how you get the matched text, and its position, out of a search.
A match object has things like:
- .group() - string contents of each captured group
- note: If a single group matches multiple times, only the last match is accessible (you're probably doing something regexps aren't fit for)
- .group(1) is the first captured group, .group(2) is the second, etc. (more technically, .group(n) is m.string[m.start(n):m.end(n)])
- there is one exception: .group(0) is not related to capturing at all, it is all text involved in the match by your regex
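- .groups() - returns a tuple of the contents of all explicitly captured groups - all parts of the text that matched a parenthesized part of your regexp
- .start(), .end() - offsets of the match within the searched string, so that the matched text is m.string[m.start():m.end()] (it sometimes saves some typing to know that .span() returns the (start,end) tuple)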
- of the whole if the argument is 0 / not given -- the default
- of a captured group if the argument is >= 1
- return -1 if the group did not contribute to the match
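To make the offset rules above concrete, a small sketch (the input string and pattern are invented for this illustration; the third group is optional so that it does not take part in the match):

```python
import re

s = 'x=1'
m = re.search(r'([a-z])=([0-9])(;)?', s)   # third group is optional and absent here

whole = (m.start(), m.end())   # no argument: offsets of the whole match
second = m.span(2)             # offsets of the second captured group
absent = m.span(3)             # group 3 did not contribute to the match

print(whole)                     # (0, 3)
print(second, s[second[0]:second[1]])   # (2, 3) 1
print(absent, m.group(3))        # (-1, -1) None
```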
For an example of group() versus groups() -- say that
m = re.search(r'f(oo) ([0-9]+) ([0-9]+)', ' foo 1 2 ')
Then
m.groups() == ('oo', '1', '2')
and, basically-equivalently,
m.group(1) == 'oo'
m.group(2) == '1'
m.group(3) == '2'
And the special case of "all text involved in the match"
m.group(0) == 'foo 1 2'
Note that groups(0) means a completely different kind of thing
- There is an argument there, but it's not an index, it's the default value for optional captures (the default default is None)
- for example, to hijack an example from the docs:
m = re.match(r"(\d+)\.?(\d+)?", "24")
m.groups() == ('24', None)
m.groups('0') == ('24', '0') # passing 0 instead of '0' would put an integer there, which is often not what you want
...and more.
Examples
If you just want to know the piece of text that was matched overall (regardless of capturing)
>>> s = '   a 11   '
>>> m = re.search(r'\b(a|b|c) [0-9]+\b', s)
>>> m.groups()
('a',)
>>> m.group(1) # the only captured group; the number wasn't captured
'a'
>>> m.group(0) # that number was still part of the overall match (0 is a special case)
'a 11'
>>> st,en = m.span() # alternatively... (e.g. handy if you want to show the match in context again)
>>> st, en
(3, 7)
>>> s[st:en]
'a 11'
If you want all matches in a string
If you run the above on
s = '   a 11   c 22   '
you will find that search() returns only the first match.
findall() will give you the captured strings
re.findall(r'\b(a|b|c) [0-9]+\b', s) == ['a', 'c']
...which may be close to what you want (with a few changes of the capturing), but if not, you probably want a match object for each, e.g.
list( re.finditer(r'\b(a|b|c) [0-9]+\b', s) )
== [ <re.Match object; span=(3, 7), match='a 11'>, <re.Match object; span=(10, 14), match='c 22'> ]
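As a sketch of why match objects can be more useful than findall()'s strings, here each hit is shown bracketed within its original context (the input string is made up, and the bracket marking is just one arbitrary way of showing context):

```python
import re

s = 'x a 11 y c 22 z'   # made-up input

marked = []
for m in re.finditer(r'\b(a|b|c) [0-9]+\b', s):
    st, en = m.span()
    # rebuild the string with this particular hit wrapped in brackets
    marked.append(s[:st] + '[' + m.group(0) + ']' + s[en:])

for line in marked:
    print(line)
# x [a 11] y c 22 z
# x a 11 y [c 22] z
```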
Other stuff
You can match a repeated group, but you cannot extract each individual repetition: only the last one is kept. For example,
m = re.match('([0-9][0-9])+', '12345678') # with the idea of matching two digits at a time
m.groups() == ('78',)
m.group(0) == '12345678'
m.group(1) == '78'
Workarounds
- repeatedly match (though in various cases, regexps then add very little)
- consider re.findall()
- not equivalent at all - it will skip over characters it doesn't match
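A rough sketch of both workarounds for the two-digits-at-a-time example. The validate-then-slice variant is just one possible way to do the "repeatedly match" idea, not the only one:

```python
import re

digits = '12345678'

# findall() returns every non-overlapping hit of the two-digit pattern
pairs = re.findall(r'[0-9][0-9]', digits)

# alternatively: check that the whole string fits the repetition, then slice it
pairs_checked = []
if re.fullmatch(r'(?:[0-9][0-9])+', digits):
    pairs_checked = [digits[i:i + 2] for i in range(0, len(digits), 2)]

# the caveat: findall() silently skips over whatever it cannot match
skipped = re.findall(r'[0-9][0-9]', '12x345')

print(pairs)          # ['12', '34', '56', '78']
print(pairs_checked)  # ['12', '34', '56', '78']
print(skipped)        # ['12', '34']
```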
Split details
re.split() can include or exclude what you split on, basically depending on whether capturing parentheses are used:
re.split('(-)', '1-800-1234') == ['1', '-', '800', '-', '1234']
re.split('-', '1-800-1234') == ['1', '800', '1234']
# Note that because separators will always be every second string, it's easy to capture and selectively ignore them, e.g.:
m = re.split('([ -])', '1-800 1234')
m == ['1', '-', '800', ' ', '1234']
m[::2] == ['1', '800', '1234']
m[1::2] == ['-', ' ']
Lookahead and lookbehind
Lookahead and lookbehind are useful to check something without consuming it.
Note that, depending on the case, they can mean less efficient processing, so avoid them when you easily can.
Positive lookbehind (?<=), positive lookahead (?=):
#remove dashes only if inside a number (that is: digit on both sides)
re.sub(r'(?<=[0-9])[\-](?=[0-9])', '', '-1-800-1234-') == '-18001234-'
#Somewhat trickier use: force a space after each letter-dot pair
re.sub(r'(?<=[A-Z])\.(?!\s)', '. ', 'A.N.Smith.') == 'A. N. Smith.'
# Match 'Asimov' only if preceded by 'Isaac' (numbers to see which one it matches)
re.findall('((?<=Isaac )Asimov[0-9])', 'Asimov1 Isaac Asimov2 Asimov3')
# gives:
['Asimov2']
Note: Lookbehind requires fixed-length expressions, so (?<=Isaac\s+)Asimov is invalid.
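A small sketch of that restriction, plus one common way around it: match the prefix normally and capture only the part you want. Note the workaround does consume the prefix, so it is not a drop-in replacement in every case. The sample text (with two spaces after Isaac) is made up:

```python
import re

text = 'Asimov1 Isaac  Asimov2 Asimov3'   # note the two spaces after Isaac

try:
    re.compile(r'(?<=Isaac\s+)Asimov[0-9]')
    compiled_ok = True
except re.error:          # variable-width lookbehind is rejected at compile time
    compiled_ok = False

# workaround: no lookbehind, just capture the interesting part
found = re.findall(r'Isaac\s+(Asimov[0-9])', text)

print(compiled_ok)   # False
print(found)         # ['Asimov2']
```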
#Try to find entity names in raw, possibly badly formed HTML,
# but avoid things where the & seems to be part of a URL query string
re.findall(r'&([\#A-Za-z0-9]+)(?=[;\s\.\,])', '&amp; &bad. &#234; link?a=1&b=2&coo=3')
# gives:
['amp', 'bad', '#234']
Negative lookbehind (?<!), negative lookahead (?!):
#remove dashes not at all adjacent to numbers:
re.sub(r'(?<![0-9])[\-](?![0-9])', '', '- -1-800-1234- - -1')
# gives:
' -1-800-1234-  -1'
Non-capturing parentheses (?:...) still group, but do not capture. They are useful when you want to avoid nested results, or avoid capturing certain chunks as results at all (but still use ?, *, +, {} or such on them):
re.findall('(&(#x|x|#)?[A-Za-z0-9]+;)', '&x33; &#33; &#x33; &33;')
# gives:
[('&x33;', 'x'), ('&#33;', '#'), ('&#x33;', '#x'), ('&33;', '')]
#while:
re.findall('(&(?:#x|x|#)?[A-Za-z0-9]+;)', '&x33; &#33; &#x33; &33;')
# gives:
['&x33;', '&#33;', '&#x33;', '&33;']
Raw strings
Because
- backslash is used for escape sequences in strings (in python and many other languages)
- combinations of a backslash and another character have special meaning to regular expressions
...you will run into some cases of confusion. Raw strings (an r prefix on the string literal) turn off the string-level escape processing.
While raw strings are not purely for regexps (they are also useful for paths on windows), regexps are one of their main uses.
- ...to the point that some editors will syntax-highlight raw strings as regexps
For example,
'\b'
is the backspace character (0x08), while if you wanted to test for a word boundary, you meant the two characters backslash and b. You can do:
'\\b'
When using raw strings, you can also write that as:
r'\b'
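There are only some backslash escapes that Python's string syntax interprets, so everything else is kept as the two separate characters (though newer Python versions warn about such unrecognized escapes: a DeprecationWarning since 3.6, a SyntaxWarning since 3.12).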
As such, you can get away with it in either form for cases like:
- '\s' (=='\\s')
- escaped brackets, e.g. '\(' (=='\\(')
There are things that are not safe, but are not often used in regexps so you probably won't make them into problems:
- \t
- \f
- \n, except most people expect that to be newline so it's not generally an issue
The main cases you should worry about are probably
- \" and \' (also re)
- \r
- \b
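To see the \b case in action, a minimal sketch (the sample sentence is made up): without the raw string, the pattern looks for literal backspace characters and quietly finds nothing.

```python
import re

plain = '\b'      # one character: backspace (0x08)
raw = r'\b'       # two characters: backslash and b
escaped = '\\b'   # the same two characters, written without a raw string

word_hit = re.search(r'\bcat\b', 'a cat sat')      # word boundaries
backspace_hit = re.search('\bcat\b', 'a cat sat')  # looks for backspace characters

print(len(plain), len(raw), raw == escaped)   # 1 2 True
print(word_hit.span())                        # (2, 5)
print(backspace_hit)                          # None
```

The quiet failure is the annoying part: there is no error, just no match.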