Python usage notes/Regexp stuff


Frequently used functions

Finding patterns is usually done with one of:

  • re.search() - find anywhere within string
returns a match object (or None)
  • re.match() - match at start of string only (it is anchored at index 0 and does not scan ahead)
returns a match object (or None)
  • re.findall() - find all matches in string
returns a list of captured strings - short to write when that's all you wanted
  • re.finditer() - find all matches in string
returns an iterator of match objects - flexible if you want to do more with each hit than just return its text
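
A minimal sketch contrasting re.search(), re.match(), re.findall() and re.finditer() on one string. The sample text and pattern are made up for illustration:

```python
import re

s = 'see x=1, y=22'          # made-up sample text
pat = r'([a-z])=([0-9]+)'    # made-up pattern with two capture groups

first = re.search(pat, s)      # first hit anywhere in the string
at_start = re.match(pat, s)    # None here: the pattern does not match at index 0
pairs = re.findall(pat, s)     # captured strings; tuples because there are two groups
spans = [m.span() for m in re.finditer(pat, s)]   # one match object per hit

print(first.group(0))   # x=1
print(at_start)         # None
print(pairs)            # [('x', '1'), ('y', '22')]
print(spans)            # [(4, 7), (9, 13)]
```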


You will often want to capture specific parts of a string, by putting parentheses around that part of the pattern.

Match objects

This also relates to how to get stuff out.

A match object has things like:

  • .group() - string contents of a captured group
.group(1) is the first captured group, .group(2) is the second, etc.
.group(0) (also the default) is a special case - it is the entire substring that's matched by your regex, regardless of capture
.group(g) is m.string[m.start(g):m.end(g)] for a group that took part in the match
note: If a single group matches multiple times, only the last match is accessible (if you need all of them, you're probably doing something regexps aren't fit for)


  • .groups() - returns a tuple of the explicitly captured groups' contents
compare to group(0) (see above)


  • .start(), .end() (it sometimes saves some typing to know that .span() returns the (start, end) tuple)
the start and end index (by default of the overall match (0), you can ask for specific groups >=1)
return -1 if the group did not contribute to the match
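
To illustrate the "did not contribute" case, here is a small sketch using an alternation where only one of two groups can take part in the match. The pattern and input are invented for the example:

```python
import re

m = re.search(r'(a)|(b)', 'xxb')   # only the second alternative can match here

whole = m.group(0)            # 'b'
unused = m.group(1)           # None: group 1 did not take part in the match
unused_start = m.start(1)     # -1
used_span = m.span(2)         # (2, 3)
all_groups = m.groups()       # (None, 'b')
filled = m.groups(default='') # ('', 'b'), replacing None with a default

print(whole, unused, unused_start, used_span, all_groups, filled)
```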


...and more.


Examples

If you just want to know the piece of text that was matched overall (regardless of capturing)

>>> import re
>>> s = '   a 11   '
>>> m = re.search(r'\b(a|b|c) [0-9]+\b', s)
>>> m.groups()
('a',)
>>> m.group(1)    # the number wasn't captured, but is still part of the match
'a'
>>> m.group(0)    # that number was still part of the overall match (0 is a special case)
'a 11'
>>> st,en = m.span() # alternatively...   (e.g. handy if you want to show the match in context again)
>>> st, en
(3, 7)
>>> s[st:en]
'a 11'


If you want all matches in a string

If you run the above on

s = '   a 11   c 22 '

you will find that search() only returns the first match.

findall() will give you the captured strings

re.findall(r'\b(a|b|c) [0-9]+\b', s) == ['a', 'c']

...which may be close to what you want (with a few changes to the capturing), but if not, you probably want a match object for each, e.g.

list( re.finditer(r'\b(a|b|c) [0-9]+\b', s) )
# gives:
[<re.Match object; span=(3, 7), match='a 11'>, <re.Match object; span=(10, 14), match='c 22'>]
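
In practice you would usually loop over the iterator rather than build a list. A sketch of that, assuming you also want the number, so the pattern here adds a second capture group that the pattern above does not have:

```python
import re

s = '   a 11   c 22 '

hits = []
for m in re.finditer(r'\b(a|b|c) ([0-9]+)\b', s):
    # group(1) is the letter, group(2) the digits, span() the position of the whole match
    hits.append((m.group(1), int(m.group(2)), m.span()))

print(hits)   # [('a', 11, (3, 7)), ('c', 22, (10, 14))]
```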



Other stuff

You can repeat a capturing group, but you cannot extract each individual repetition - only the last one is kept. For example, with the idea of matching two digits at a time:

m = re.match('([0-9][0-9])+', '12345678')
m.groups() == ('78',)
m.group(0) == '12345678'
m.group(1) == '78'

Workarounds

  • repeatedly match (though for various cases, regexps add very little)
  • consider re.findall()
not equivalent at all - it will skip over characters it doesn't match
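
A rough sketch of both workarounds for the two-digits-at-a-time case, including the skipping caveat. The second input string is made up to show that caveat, and checking the overall shape with re.fullmatch() before slicing is just one possible approach:

```python
import re

s = '12345678'

# Workaround 1: findall with the unrepeated pattern
pairs = re.findall(r'[0-9][0-9]', s)

# Workaround 2: check the overall shape first, then cut the string up without a regexp
if re.fullmatch(r'(?:[0-9][0-9])+', s):
    sliced = [s[i:i + 2] for i in range(0, len(s), 2)]
else:
    sliced = []

# The caveat: findall silently skips what it cannot match (the 'a' here)
skipping = re.findall(r'[0-9][0-9]', '12a3456')

print(pairs)     # ['12', '34', '56', '78']
print(sliced)    # ['12', '34', '56', '78']
print(skipping)  # ['12', '34', '56']
```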

Split details

re.split() can include or exclude what you split on, basically depending on whether capturing parentheses are used:

re.split('(-)', '1-800-1234') == ['1', '-', '800', '-', '1234']
re.split('-',   '1-800-1234') == ['1', '800', '1234']

# Note that because the separators will always be every second string, it's easy to capture them and selectively ignore them, e.g.:
m = re.split('([ -])',   '1-800 1234')   
m       == ['1', '-', '800', ' ', '1234']
m[::2]  == ['1', '800', '1234']
m[1::2] == ['-', ' ']


Lookahead and lookbehind

Lookahead and lookbehind are useful to check something without consuming it.

Note that, depending on the case, they can mean less efficient processing, so avoid them when you easily can.


Positive lookbehind (?<=), positive lookahead (?=):

#remove dashes only if inside a number (that is: digit on both sides)
re.sub(r'(?<=[0-9])[\-](?=[0-9])', '', '-1-800-1234-') == '-18001234-'

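#Somewhat trickier use: force a space after each capital-dot pair that isn't already followed by whitespace
# (this combines a positive lookbehind with a negative lookahead)
re.sub(r'(?<=[A-Z])\.(?!\s)', '. ', 'A.N.Smith.') == 'A. N. Smith.'

# Match 'Asimov' only if preceded by 'Isaac ' (numbers to see which one it matches)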
re.findall('((?<=Isaac )Asimov[0-9])', 'Asimov1 Isaac Asimov2 Asimov3')
# gives:
['Asimov2']

Note: In Python's re module, lookbehind requires fixed-width expressions, so (?<=Isaac\s+)Asimov is invalid.
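
A sketch of what that restriction looks like in practice, plus one possible way around it: consume the variable-width prefix normally and capture only the part you want. The extra spaces in the sample text are added here to make the variable width matter:

```python
import re

text = 'Asimov1 Isaac   Asimov2 Asimov3'

# A variable-width lookbehind is rejected when the pattern is compiled
try:
    re.compile(r'(?<=Isaac\s+)Asimov[0-9]')
    lookbehind_ok = True
except re.error:
    lookbehind_ok = False

# Alternative: match the prefix as usual, capture only the name
found = re.findall(r'Isaac\s+(Asimov[0-9])', text)

print(lookbehind_ok)  # False
print(found)          # ['Asimov2']
```

Unlike the lookbehind version, this consumes the 'Isaac' part, which matters if you need the match span or if matches could overlap.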


#Try to find entity names in raw, possibly badly formed HTML,
# but avoid things where the & seems to be part of a URL query string
re.findall(r'&([#A-Za-z0-9]+)(?=[;\s.,])', '&amp; &bad. &#234 link?a=1&b=2&coo=3')
# gives:
['amp', 'bad', '#234']


Negative lookbehind (?<!), negative lookahead (?!):

#remove dashes that have no digit directly before or after them:
re.sub(r'(?<![0-9])[\-](?![0-9])', '', '- -1-800-1234- - -1')
# gives:
' -1-800-1234-  -1'


Non-capturing parentheses (?:) are useful when you want to avoid nested results, or avoid capturing certain chunks as results at all (but still use ?, *, +, {} or such on them):

re.findall('(&(#x|x|#)?[A-Za-z0-9]+;)', '&x33; &#33; &#x33; &33;')
# gives:
[('&x33;', 'x'), ('&#33;', '#'), ('&#x33;', '#x'), ('&33;', '')]

#while:

re.findall('(&(?:#x|x|#)?[A-Za-z0-9]+;)', '&x33; &#33; &#x33; &33;')
# gives:
['&x33;', '&#33;', '&#x33;', '&33;']


Raw strings

Because

  • backslash is used for escape sequences in strings (in python and many other languages)
  • combinations of a backslash character and another character have special meaning to regular expressions

...you will run into some cases of confusion.

Raw strings (r'...') avoid this, because Python leaves their backslashes alone. They are not purely for regexps (they're also useful for paths on windows), but that's one of their main uses.

...to the point that some editors will treat a raw string as "syntax highlight this as a regexp"


For example,

'\b' 

is the backspace character (0x08), while if you wanted to test for a word boundary, you meant the two characters backslash and b. You can do:

'\\b'

When using raw strings, you can also write that as:

r'\b'


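Python only interprets certain backslash escapes in ordinary strings, so everything else will be seen as the two separate characters (though newer Python versions warn about such unrecognized escapes: a DeprecationWarning since 3.6, a SyntaxWarning since 3.12). As such, you can get away with it in either form for cases like:

  • '\s' (=='\\s')
  • escaped brackets, e.g. '\(' (=='\\(')

There are things that are not safe in this sense, but are not often used in regexps, so you probably won't make them into problems: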

  • \t
  • \f
  • \n, except most people expect that to be newline so it's not generally an issue

The main cases you should worry about are probably

  • \" and \' (also re)
  • \r
  • \b
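
A short sketch of the \b case, showing why the raw string matters. The sample text is made up:

```python
import re

plain = '\b'     # one character: backspace (0x08)
raw = r'\b'      # two characters: backslash and b, which re reads as a word boundary

lengths = (len(plain), len(raw))
same_as_escaped = (raw == '\\b')

text = 'a cat sat'
with_raw = re.search(r'\bcat\b', text)    # word boundaries on both sides: matches
without_raw = re.search('\bcat\b', text)  # looks for backspace characters: no match

print(lengths)            # (1, 2)
print(same_as_escaped)    # True
print(with_raw.span())    # (2, 5)
print(without_raw)        # None
```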