Python usage notes/Regexp stuff



re.split() can keep or drop the separators it splits on. Wrap the pattern in a capturing group to keep them:

re.split('(-)', '1-800-1234') == ['1', '-', '800', '-', '1234']
re.split('-',   '1-800-1234') == ['1', '800', '1234']
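
One reason to keep the separators is that the pieces can be joined back into the original string. The following is a minimal sketch of that idea; the sample timestamp and the variable names are made up for illustration.

```python
import re

stamp = '2024-01-15T10:30'

# No capturing group: the separators are dropped.
parts = re.split(r'[-T:]', stamp)
print(parts)           # ['2024', '01', '15', '10', '30']

# Capturing group: each separator becomes its own list item.
parts_with_sep = re.split(r'([-T:])', stamp)
print(parts_with_sep)  # ['2024', '-', '01', '-', '15', 'T', '10', ':', '30']

# Because nothing was thrown away, joining restores the input.
print(''.join(parts_with_sep) == stamp)  # True
```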


Lookahead/lookbehind: These check the text around a position without consuming it. Note that they can make matching less efficient, so avoid them where a simpler pattern does the job.
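
To illustrate what "without consuming" means in practice, here is a small sketch (sample text and variable names are invented). A plain pattern cannot produce overlapping matches, while the same pattern wrapped in a lookahead can, because each match is empty and the search moves on by one position.

```python
import re

text = 'aaaa'

# A plain pattern consumes what it matches, so matches cannot overlap.
consuming = re.findall(r'aa', text)
print(consuming)    # ['aa', 'aa']

# Inside a lookahead nothing is consumed: the match itself is empty and
# the capturing group reports what was seen at each position.
overlapping = re.findall(r'(?=(aa))', text)
print(overlapping)  # ['aa', 'aa', 'aa']
```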

Positive lookbehind (?<=...), positive lookahead (?=...):

# Remove dashes only if inside a number (that is: digit on both sides)
re.sub(r'(?<=[0-9])[\-](?=[0-9])', '', '-1-800-1234-') == '-18001234-'

# Somewhat trickier use: force a space after each initial-dot pair
re.sub(r'(?<=[A-Z])\.(?!\s)', '. ', 'A.N.Smith.') == 'A. N. Smith.'

# Match 'Asimov' only if preceded by 'Isaac ' (the numbers show which one matched)
re.findall(r'((?<=Isaac )Asimov[0-9])', 'Asimov1 Isaac Asimov2 Asimov3')
# gives:
['Asimov2']

Note: Python's re module requires lookbehind to be fixed-width; something like (?<=Isaac\s+)Asimov is rejected when the pattern is compiled.
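
The restriction, and one common way around it, can be sketched as follows. This assumes the standard library re module; the sample text and variable names are made up. Instead of a lookbehind, the variable-width prefix is matched normally and only the wanted part is captured.

```python
import re

text = 'Asimov1 Isaac Asimov2 Isaac   Asimov3'

# A variable-width lookbehind fails at compile time in the re module.
try:
    re.compile(r'(?<=Isaac\s+)Asimov[0-9]')
    compiled_ok = True
except re.error:
    compiled_ok = False
print(compiled_ok)  # False

# Workaround: match the prefix as ordinary pattern text and capture
# only the part you are interested in.
names = re.findall(r'Isaac\s+(Asimov[0-9])', text)
print(names)        # ['Asimov2', 'Asimov3']
```

The trade-off is that the prefix is now consumed, so it cannot also take part in a neighbouring match.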


# Try to find entity names in raw, possibly badly formed HTML,
# but avoid things where the & seems to be part of a URL query string
re.findall(r'&([\#A-Za-z0-9]+)(?=[;\s\.\,])', '&amp; &bad. &#234 link?a=1&b=2&coo=3')
# gives:
['amp', 'bad', '#234']


Negative lookbehind (?<!...), negative lookahead (?!...):

# Remove dashes that have no digit on either side:
re.sub(r'(?<![0-9])[\-](?![0-9])', '', '- -1-800-1234- - -1')
# gives:
' -1-800-1234-  -1'
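
A pitfall with negative lookbehind is that the regex engine is free to start the match one character later, where the condition no longer applies. A minimal sketch, with invented sample text and variable names, that picks out numbers not preceded by a dollar sign:

```python
import re

text = 'price: $40, qty: 12, total: $480'

# Naive attempt: the '4' in '$40' is rejected, but the match then
# simply starts at the next digit, which is preceded by '4', not '$'.
naive = re.findall(r'(?<!\$)[0-9]+', text)
print(naive)       # ['0', '12', '80']

# Also forbidding a digit before the match stops it from starting
# in the middle of a number.
quantities = re.findall(r'(?<![$0-9])[0-9]+', text)
print(quantities)  # ['12']
```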


Non-capturing parentheses (?:...) are useful when you want to avoid nested results, or want to keep certain chunks out of the results altogether, while still being able to apply ?, *, +, {} and the like to the group:

re.findall(r'(&(#x|x|#)?[A-Za-z0-9]+;)', '&x33; &#33; &#x33; &33;')
# gives:
[('&x33;', 'x'), ('&#33;', '#'), ('&#x33;', '#x'), ('&33;', '')]

# while:
re.findall(r'(&(?:#x|x|#)?[A-Za-z0-9]+;)', '&x33; &#33; &#x33; &33;')
# gives:
['&x33;', '&#33;', '&#x33;', '&33;']
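
The difference matters most when a quantifier is applied to the group, since re.findall() returns group contents rather than the whole match as soon as the pattern contains a capturing group. A small sketch with made-up sample text and variable names:

```python
import re

text = 'hosts: 10.0.0.1 and 192.168.1.20'

# Capturing group with a repeat: findall returns only the group, and a
# repeated group keeps just its last repetition.
captured = re.findall(r'([0-9]+\.){3}[0-9]+', text)
print(captured)  # ['0.', '1.']

# Non-capturing group: the repeat still applies, but findall now
# returns the whole match.
whole = re.findall(r'(?:[0-9]+\.){3}[0-9]+', text)
print(whole)     # ['10.0.0.1', '192.168.1.20']
```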