Python usage notes/Regexp stuff

From Helpful


In general, you can use

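  • re.search() - find the first match anywhere within the string
returns a match object, or None if nothing matches
  • re.match() - match only at the start of the string
returns a match object, or None if nothing matches
  • re.findall() - find all non-overlapping matches in the string
returns a list of strings (a list of tuples if the pattern has more than one capturing group)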
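
A minimal sketch of how the three differ on the same input. The sample string and variable names are made up for illustration.

```python
import re

text = 'tel: 1-800-1234'

# search() scans the whole string and stops at the first hit
found = re.search(r'[0-9]+', text)
first_number = found.group(0) if found else None

# match() only tries at position 0, so it fails here (the string starts with 't')
at_start = re.match(r'[0-9]+', text)

# findall() returns plain strings, not match objects
all_numbers = re.findall(r'[0-9]+', text)

print(first_number, at_start, all_numbers)
```

Since search() and match() can return None, test the result before calling .group() on it.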



Split details

re.split() can include or exclude what you split on, basically depending on whether capturing parentheses are used:

re.split('(-)', '1-800-1234') == ['1', '-', '800', '-', '1234']
re.split('-',   '1-800-1234') == ['1', '800', '1234']
 
# Note that because the separators always end up at the odd indices, it's easy to capture and selectively ignore them, e.g.:
m = re.split('([ -])',   '1-800 1234')   
m       == ['1', '-', '800', ' ', '1234']
m[::2]  == ['1', '800', '1234']
m[1::2] == ['-', ' ']
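
If you want each field together with the separator that followed it, the alternating layout can be zipped back together. A rough sketch: split_with_separators is a hypothetical helper name, and it assumes the pattern you pass in has no capturing groups of its own (those would add extra items to the list).

```python
import re

def split_with_separators(pattern, text):
    """Return (field, separator_after_it) pairs; the last separator is ''."""
    # Wrap the pattern in a capturing group so re.split() keeps the separators
    parts = re.split('(' + pattern + ')', text)
    fields = parts[::2]
    separators = parts[1::2] + ['']   # pad: there is one more field than separators
    return list(zip(fields, separators))

pairs = split_with_separators(r'[ -]', '1-800 1234')
print(pairs)
```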


Lookahead and lookbehind

Lookahead and lookbehind are useful to check something without consuming it.

Note that, depending on the case, they can mean less efficient processing, so avoid them when you easily can.
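
One way to see what "without consuming" means is overlapping matches. A small sketch, with a made-up input string:

```python
import re

text = 'aaaa'

# Ordinary matching consumes what it matched, so matches cannot overlap
consuming = re.findall(r'aa', text)

# A lookahead matches an empty string at each position and consumes nothing;
# the capturing group inside it still reports what was seen there
overlapping = re.findall(r'(?=(aa))', text)

print(consuming, overlapping)
```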


Positive lookbehind (?<=), positive lookahead (?=):

# remove dashes only if inside a number (that is: digit on both sides)
re.sub(r'(?<=[0-9])-(?=[0-9])', '', '-1-800-1234-') == '-18001234-'
 
# Somewhat trickier use: force a space after each letter-dot pair
# (this one also uses a negative lookahead, (?!\s), to skip dots already followed by whitespace)
re.sub(r'(?<=[A-Z])\.(?!\s)', '. ', 'A.N.Smith.') == 'A. N. Smith.'
 
# Match 'Asimov' only if preceded by 'Isaac ' (numbers to see which one it matches)
re.findall(r'((?<=Isaac )Asimov[0-9])', 'Asimov1 Isaac Asimov2 Asimov3')
# gives:
['Asimov2']

Note: In Python's re module, lookbehind requires a fixed-width expression, so (?<=Isaac\s+)Asimov is invalid and fails when the pattern is compiled.
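
A sketch of what that failure looks like, plus one possible workaround: match the variable-width prefix normally and capture only the part you want. The sample text is made up. Note that the workaround does consume the prefix, which can matter if matches need to overlap.

```python
import re

text = 'Asimov1 Isaac   Asimov2 Asimov3'

# A variable-width lookbehind is rejected when the pattern is compiled
try:
    re.compile(r'(?<=Isaac\s+)Asimov[0-9]')
    compiled_ok = True
except re.error:
    compiled_ok = False

# Workaround: match the prefix as ordinary pattern text, capture the rest
names = re.findall(r'Isaac\s+(Asimov[0-9])', text)

print(compiled_ok, names)
```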


# Try to find entity names in raw, possibly badly formed HTML,
# but avoid things where the & seems to be part of a URL query string
re.findall(r'&([#A-Za-z0-9]+)(?=[;\s.,])', '&amp; &bad. &#234 link?a=1&b=2&coo=3')
# gives:
['amp', 'bad', '#234']


Negative lookbehind (?<!), negative lookahead (?!):

# remove dashes not at all adjacent to digits:
re.sub(r'(?<![0-9])-(?![0-9])', '', '- -1-800-1234- - -1')
# gives:
' -1-800-1234-  -1'


Non-capturing parentheses (?:) still group, but do not capture. They are useful when you want to avoid nested results, or avoid capturing certain chunks as results at all (but still use ?, *, +, {} or such on them):

re.findall(r'(&(#x|x|#)?[A-Za-z0-9]+;)', '&x33; &#33; &#x33; &33;')
# gives:
[('&x33;', 'x'), ('&#33;', '#'), ('&#x33;', '#x'), ('&33;', '')]
 
# while:
 
re.findall(r'(&(?:#x|x|#)?[A-Za-z0-9]+;)', '&x33; &#33; &#x33; &33;')
# gives:
['&x33;', '&#33;', '&#x33;', '&33;']
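
If you are unsure how many groups a pattern captures, a compiled pattern can tell you. And if you only ever want the whole match, finditer() avoids the tuple question altogether. A small sketch along those lines (variable names are made up):

```python
import re

with_capture    = re.compile(r'(&(#x|x|#)?[A-Za-z0-9]+;)')
without_capture = re.compile(r'(&(?:#x|x|#)?[A-Za-z0-9]+;)')

# .groups is the number of capturing groups in a compiled pattern
group_counts = (with_capture.groups, without_capture.groups)

# finditer() yields match objects; group(0) is always the whole match,
# however many capturing groups the pattern has
whole_matches = [m.group(0) for m in with_capture.finditer('&x33; &#33; &33;')]

print(group_counts, whole_matches)
```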