Python usage notes/Regexp stuff


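In general, you can use

  • re.match() - match only at the start of the string; returns a match object, or None
  • re.search() - find the first match anywhere within the string; returns a match object, or None
  • re.finditer() - find all non-overlapping matches in the string; returns an iterator of match objects
  • re.findall() - find all non-overlapping matches in the string; returns a list of captured strings (the whole match if the pattern has no groups, tuples if it has more than one)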
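
As a rough illustration of how these four behave on the same input (the pattern and the sample string are made up for this sketch):

```python
import re

text = 'x1 y22 z333'
pattern = r'[a-z]([0-9]+)'

# match() only tries at position 0
at_start = re.match(pattern, text)
# match() gives None when the start of the string does not fit
no_match = re.match(r'y([0-9]+)', text)
# search() scans forward to the first position that does fit
first_y = re.search(r'y([0-9]+)', text)

# finditer() yields one match object per non-overlapping match
whole_matches = [m.group(0) for m in re.finditer(pattern, text)]
# findall() with exactly one capturing group returns just that group's text
captured = re.findall(pattern, text)

print(at_start.group(0), no_match, first_y.group(1))
print(whole_matches, captured)
```

Since match() and search() can return None, code that uses their result usually wants to test for that first.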


A match object has things like

  • .group()
group(0) is a special case - it is the entire substring matched by your regex, regardless of capturing
group(1) is the first captured group, group(2) is the second, etc.
If a group matches multiple times, only its last match is accessible
  • .groups()
returns only the explicitly captured groups, as a tuple
compare to group(0) (see above)
  • .span(), .start(), .end()
start and end offsets (of the overall match by default; pass a group number >=1 to get those of a specific group)

...and more.
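
A small sketch of those accessors side by side, including the per-group offsets (the key=value pattern and the input string are just invented for the example):

```python
import re

m = re.search(r'([a-z]+)=([0-9]+)', 'set width=120 now')

whole = m.group(0)         # the entire matched substring
key, value = m.groups()    # only the captured groups, as a tuple
both = m.group(1, 2)       # several groups at once, also a tuple
overall_span = m.span()    # offsets of the whole match
value_span = m.span(2)     # offsets of group 2 only
key_start, key_end = m.start(1), m.end(1)

print(whole, key, value, overall_span, value_span)
```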


If you just want to know the piece of text that was matched overall (regardless of capturing):

>>> import re
>>> s = '   a 11   '
>>> m = re.search(r'\b(a|b|c) [0-9]+\b', s)
>>> m.groups()
('a',)
>>> m.group(0)
'a 11'
>>> st,en = m.span() # alternatively...   (e.g. handy if you want to show the match in context again)
>>> s[st:en]
'a 11'


If you want all matches in a string

If you run the above on

s = '   a 11   c 22 '

you will find that search() still matches only the first one. findall() does find both, but only gives you the captured strings -

re.findall(r'\b(a|b|c) [0-9]+\b', s) == ['a', 'c']

which may be close to what you want (with a few changes of the capturing), but if not, you probably want a match object for each, e.g.

list( re.finditer(r'\b(a|b|c) [0-9]+\b', s) ) 
        == [ <re.Match object; span=(3, 7), match='a 11'>,    <re.Match object; span=(10, 14), match='c 22'> ]
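
Both options can be sketched as follows, assuming the same pattern and string as above. The second variant is one way of "changing the capturing": it wraps the whole expression in a capturing group and makes the inner alternation non-capturing with (?:...), so findall() returns the full text of each match.

```python
import re

s = '   a 11   c 22 '

# One match object per match: overall text, first group, and offsets
found = [(m.group(0), m.group(1), m.span())
         for m in re.finditer(r'\b(a|b|c) [0-9]+\b', s)]

# Capture the whole thing instead, so findall() returns the full matches
full_texts = re.findall(r'\b((?:a|b|c) [0-9]+)\b', s)

print(found)
print(full_texts)
```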



Other stuff

You can repeat a capturing group, but you cannot extract each individual repetition; only the last one is kept. For example,

m = re.match('([0-9][0-9])+', '12345678') # with the idea of matching two digits at a time

m.groups() == ('78',)
m.group(0) == '12345678'
m.group(1) == '78'

Workarounds

  • match repeatedly, one piece at a time (though in various cases regexps then add very little)
  • consider re.findall()
not equivalent at all - it will skip over characters it doesn't match
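
One possible way to combine the two, as a sketch: check the overall shape first with re.fullmatch(), then pull out the pieces with re.findall(). The second call shows why findall() alone is not equivalent.

```python
import re

digits = '12345678'

# Validate the overall shape first, then extract the pieces separately
if re.fullmatch(r'(?:[0-9][0-9])+', digits):
    pairs = re.findall(r'[0-9][0-9]', digits)
else:
    pairs = []

# Without the validation step, findall() quietly skips whatever does not fit:
# here the 'x' and the trailing lone '5' simply disappear
skipped = re.findall(r'[0-9][0-9]', '12x345')

print(pairs)
print(skipped)
```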



Split details

re.split() can include or exclude what you split on, basically depending on whether capturing parentheses are used:

re.split('(-)', '1-800-1234') == ['1', '-', '800', '-', '1234']
re.split('-',   '1-800-1234') == ['1', '800', '1234']
 
# Note that because the separators will always be every second string, it's easy to capture and selectively ignore them, e.g.:
m = re.split('([ -])',   '1-800 1234')   
m       == ['1', '-', '800', ' ', '1234']
m[::2]  == ['1', '800', '1234']
m[1::2] == ['-', ' ']
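
Two related details, sketched on the same input: with a capturing separator nothing is thrown away, so joining the pieces reproduces the original string, and the maxsplit argument limits how many splits are made.

```python
import re

s = '1-800 1234'

parts = re.split(r'([ -])', s)
fields = parts[::2]        # the pieces between separators
separators = parts[1::2]   # the separators themselves

# Nothing was discarded, so joining everything gives back the input
rebuilt = ''.join(parts)

# maxsplit limits the number of splits; the rest stays in the last item
first_only = re.split(r'[ -]', s, maxsplit=1)

print(fields, separators, rebuilt == s, first_only)
```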


Lookahead and lookbehind

Lookahead and lookbehind are useful to check something without consuming it.

Note that, depending on the case, they can mean less efficient processing, so avoid them when you easily can.


Positive lookbehind (?<=), positive lookahead (?=):

# remove dashes only if inside a number (that is: digit on both sides)
re.sub(r'(?<=[0-9])-(?=[0-9])', '', '-1-800-1234-') == '-18001234-'
 
# Somewhat trickier use: force a space after each capital-dot pair that isn't already followed by whitespace
# (this combines a positive lookbehind with a negative lookahead)
re.sub(r'(?<=[A-Z])\.(?!\s)', '. ', 'A.N.Smith.') == 'A. N. Smith.'
# Match 'Asimov' only if preceded by 'Isaac' (numbers to see which one it matches)
re.findall('((?<=Isaac )Asimov[0-9])', 'Asimov1 Isaac Asimov2 Asimov3')
# gives:
['Asimov2']

Note: Lookbehind requires fixed-length expressions, so (?<=Isaac\s+)Asimov is invalid.
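
A sketch of what that looks like in practice, plus one common workaround: match the variable-length prefix normally and capture only the part you are after. Note that this is not quite the same thing, since the prefix is now consumed as part of the match.

```python
import re

# The variable-width lookbehind is rejected when the pattern is compiled
try:
    re.compile(r'(?<=Isaac\s+)Asimov')
    compiled = True
except re.error:
    compiled = False

# Workaround: consume the prefix, capture only the interesting part
names = re.findall(r'Isaac\s+(Asimov[0-9])', 'Asimov1 Isaac   Asimov2 Asimov3')

print(compiled, names)
```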


# Try to find entity names in raw, possibly badly formed HTML,
# but avoid things where the & seems to be part of a URL query string
re.findall(r'&([#A-Za-z0-9]+)(?=[;\s.,])', '&amp; &bad. &#234 link?a=1&b=2&coo=3')
# gives:
['amp', 'bad', '#234']


Negative lookbehind (?<!), negative lookahead (?!):

# remove dashes that have no digit on either side:
re.sub(r'(?<![0-9])-(?![0-9])', '', '- -1-800-1234- - -1')
# gives:
' -1-800-1234-  -1'


Non-capturing parentheses (?:...) are useful when you want to avoid nested results, or avoid capturing certain chunks as results at all (but still want to use ?, *, +, {} or such on them):

re.findall(r'(&(#x|x|#)?[A-Za-z0-9]+;)', '&x33; &#33; &#x33; &33;')
# gives:
[('&x33;', 'x'), ('&#33;', '#'), ('&#x33;', '#x'), ('&33;', '')]
 
# while:
 
re.findall(r'(&(?:#x|x|#)?[A-Za-z0-9]+;)', '&x33; &#33; &#x33; &33;')
# gives:
['&x33;', '&#33;', '&#x33;', '&33;']