Python usage notes/Regexp stuff

From Helpful
Revision as of 23:23, 30 August 2024 by Helpful (talk | contribs) (→‎Match objects)


Frequently used functions

Finding patterns is usually done with one of:

  • re.search() - find anywhere within the string
returns a match object (or None if nothing matched)
  • re.match() - match only at the start of the string (presumably slightly faster than search)
returns a match object (or None if nothing matched)
  • re.findall() - find all matches in the string
returns a list of captured strings - shorter to write when that's all you wanted
  • re.finditer() - find all matches in the string
returns an iterator of match objects - more flexible if you want to do more with each hit than just return its text
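
As a rough illustration of how these four differ, here is a minimal sketch. The sample string and pattern are made up for this example.

```python
import re

text = 'x=1, y=22'               # made-up sample input
pattern = r'([a-z])=([0-9]+)'

first = re.search(pattern, text)            # first hit anywhere in the string
at_start = re.match(pattern, ' ' + text)    # None: that string starts with a space
captured = re.findall(pattern, text)        # only the captured strings
spans = [m.span() for m in re.finditer(pattern, text)]   # one match object per hit

print(first.group(0))   # x=1
print(at_start)         # None
print(captured)         # [('x', '1'), ('y', '22')]
print(spans)            # [(0, 3), (5, 9)]
```

Note that with more than one capturing group, findall() gives a tuple per hit rather than a single string.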


You will often want to capture specific parts of a string, by putting parentheses around that part of the pattern.

Match objects

Match objects are also how you get the captured parts out.

A match object has things like:

  • .group() - string contents of a captured group
note: If a single group matches multiple times, only the last match is accessible (if you need them all, you're probably doing something regexps aren't fit for)
.group(1) is the first captured group, .group(2) is the second, etc. (more technically, .group(n) is m.string[m.start(n):m.end(n)])
there is one exception: .group(0) is not related to capturing at all, it is all text involved in the match by your regexp

  • .groups() - returns a tuple of all explicitly captured groups' contents - all parts of the text that matched a parenthesized part of your regexp

  • .start() and .end() - the offsets of the match within the string
note: the matched text is m.string[m.start():m.end()] (it sometimes saves some typing to know that .span() returns the (start, end) tuple)
of the whole match if the argument is 0 / not given -- the default
of a captured group if the argument is >= 1
these return -1 if the group did not contribute to the match
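
A small sketch of the offset methods, including the -1 case. The input string and pattern are made up for illustration.

```python
import re

s = 'id 24'                                  # made-up sample input
m = re.search(r'([0-9]+)(\.[0-9]+)?', s)     # the second group is optional, and absent here

whole = m.span()                   # same as m.span(0)
first = (m.start(1), m.end(1))     # offsets of the first captured group
missing = m.start(2)               # group 2 did not take part in the match

print(whole)      # (3, 5)
print(first)      # (3, 5)
print(missing)    # -1
print(s[m.start(1):m.end(1)] == m.group(1))   # True
```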


For an example of group() versus groups() -- say that

m = re.search(r'f(oo) ([0-9]+) ([0-9]+)', ' foo 1 2 ')

Then

m.groups() == ('oo', '1', '2')

and, basically-equivalently,

m.group(1) == 'oo'
m.group(2) == '1'
m.group(3) == '2'

And the special case of "all text involved in the match"

m.group(0) == 'foo 1 2'


Note that groups(0) means a completely different kind of thing.

groups() does take an argument, but it's not an index: it's the default value for optional captures that did not match (the default default is None).
For example, to hijack an example from the docs:
m = re.match(r"(\d+)\.?(\d+)?", "24")
m.groups()    == ('24', None)
m.groups('0') == ('24', '0')    # note that 0 instead of '0' would put an integer there, which is often not what you want


...and more.


Examples

If you just want to know the piece of text that was matched overall (regardless of capturing)

>>> s = '   a 11   '
>>> m = re.search(r'\b(a|b|c) [0-9]+\b', s)
>>> m.groups()
('a',)
>>> m.group(1)    # only the letter was captured
'a'
>>> m.group(0)    # the number was still part of the overall match (0 is a special case)
'a 11'
>>> st,en = m.span() # alternatively...   (e.g. handy if you want to show the match in context again)
>>> st, en
(3, 7)
>>> s[st:en]
'a 11'


If you want all matches in a string

If you run the above on

s = '   a 11   c 22 '

you will find that search() matches only the first one.

findall() will give you the captured strings

re.findall(r'\b(a|b|c) [0-9]+\b', s) == ['a', 'c']

...which may be close to what you want (with a few changes of the capturing), but if not, you probably want a match object for each, e.g.

list( re.finditer(r'\b(a|b|c) [0-9]+\b', s) )
# gives:
[<re.Match object; span=(3, 7), match='a 11'>, <re.Match object; span=(10, 14), match='c 22'>]



Other stuff

You can match a repeated group, but you cannot extract each individual repetition. For example, with the idea of matching two digits at a time:

m = re.match('([0-9][0-9])+', '12345678')

m.groups() == ('78',)
m.group(0) == '12345678'
m.group(1) == '78'

Workarounds

  • match repeatedly yourself (though in various cases, regexps then add very little)
  • consider re.findall()
not equivalent at all - it will skip over characters it doesn't match
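
One way these workarounds might look, as a sketch rather than a recommendation. Checking the overall shape with re.fullmatch() before splitting is an assumption of this example, not something the notes above prescribe.

```python
import re

s = '12345678'

# Check the overall shape first, then pull out the pieces separately
if re.fullmatch(r'(?:[0-9][0-9])+', s):
    pairs = re.findall(r'[0-9][0-9]', s)
else:
    pairs = []
print(pairs)      # ['12', '34', '56', '78']

# For fixed-width pieces like this, plain slicing does the same without regexps
sliced = [s[i:i + 2] for i in range(0, len(s), 2)]
print(sliced)     # ['12', '34', '56', '78']

# findall() on its own does no validation: it skips whatever does not match
skipped = re.findall(r'[0-9][0-9]', '12x345')
print(skipped)    # ['12', '34']
```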

Split details

re.split() can include or exclude what you split on, basically depending on whether capturing parentheses are used:

re.split('(-)', '1-800-1234') == ['1', '-', '800', '-', '1234']
re.split('-',   '1-800-1234') == ['1', '800', '1234']

# Note that because they'll always be every second string, it's easy to capture and selectively ignore separators, e.g.:
m = re.split('([ -])',   '1-800 1234')   
m       == ['1', '-', '800', ' ', '1234']
m[::2]  == ['1', '800', '1234']
m[1::2] == ['-', ' ']


Lookahead and lookbehind

Lookahead and lookbehind are useful to check something without consuming it.

Note that, depending on the case, they can mean less efficient processing, so avoid them when you easily can.


Positive lookbehind (?<=), positive lookahead (?=):

#remove dashes only if inside a number (that is: digit on both sides)
re.sub(r'(?<=[0-9])[\-](?=[0-9])', '', '-1-800-1234-') == '-18001234-'

#Somewhat trickier use: force a space after each letter-dot pair
re.sub(r'(?<=[A-Z])\.(?!\s)', '. ', 'A.N.Smith.') == 'A. N. Smith.'
# Match 'Asimov' only if preceded by 'Isaac' (numbers to see which one it matches)
re.findall('((?<=Isaac )Asimov[0-9])', 'Asimov1 Isaac Asimov2 Asimov3')
# gives:
['Asimov2']

Note: Lookbehind requires fixed-length expressions, so (?<=Isaac\s+)Asimov is invalid.
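
A sketch of what that restriction looks like in practice, plus one possible alternative: consume the variable-length prefix and capture only the part you care about. The input string is made up, with several spaces after Isaac.

```python
import re

text = 'Asimov1 Isaac   Asimov2 Asimov3'   # made-up input

try:
    re.compile(r'(?<=Isaac\s+)Asimov[0-9]')
    compiled = True
except re.error:          # the re module rejects variable-width lookbehind
    compiled = False
print(compiled)           # False

# Alternative: match the prefix as well, but capture only the name
hits = re.findall(r'Isaac\s+(Asimov[0-9])', text)
print(hits)               # ['Asimov2']
```

The difference is that the prefix is now consumed, which matters for re.sub() and for overlapping matches.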


#Try to find entity names in raw, possibly badly formed HTML,
# but avoid things where the & seems to be part of a URL query string
re.findall(r'&([\#A-Za-z0-9]+)(?=[;\s\.\,])', '&amp; &bad. &#234 link?a=1&b=2&coo=3') 
# gives:
['amp', 'bad', '#234']


Negative lookbehind (?<!), negative lookahead (?!):

#remove dashes not at all adjacent to numbers:
re.sub(r'(?<![0-9])[\-](?![0-9])', '', '- -1-800-1234- - -1') 
# gives:
' -1-800-1234-  -1'


Non-capturing parentheses (?:) are useful when you want to avoid nested results, or avoid capturing certain chunks as results at all (but still use ?, *, +, {} or such on them):

re.findall('(&(#x|x|#)?[A-Za-z0-9]+;)', '&x33; &#33; &#x33; &33;')
# gives:
[('&x33;', 'x'), ('&#33;', '#'), ('&#x33;', '#x'), ('&33;', '')]

#while:

re.findall('(&(?:#x|x|#)?[A-Za-z0-9]+;)', '&x33; &#33; &#x33; &33;')
# gives:
['&x33;', '&#33;', '&#x33;', '&33;']


Raw strings

Because

  • backslash is used for escape sequences in strings (in python and many other languages)
  • combinations of a backslash character and another character have special meaning to regular expressions

...you will run into some cases of confusion.

Raw strings avoid much of this. They are not purely for regexps (they are also useful for paths on Windows), but regexps are one of their main uses.

...to the point that some editors will treat raw strings as "syntax highlight this as regexps"


For example,

'\b' 

is the backspace character (0x08), while if you wanted to test for a word boundary, you meant the two characters backslash and b. You can do:

'\\b'

When using raw strings, you can also write that as:

r'\b'
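
A quick sketch of the difference between the one-character and two-character forms, using \b. The sample string is made up for this example.

```python
import re

one_char = '\b'      # Python turns this into a single backspace character (0x08)
two_chars = r'\b'    # backslash followed by b, which is what the regexp engine needs

print(len(one_char), len(two_chars))   # 1 2
print('\\b' == two_chars)              # True

words = 'cat concat cats'              # made-up sample input
with_raw = re.findall(r'\bcat\b', words)
without_raw = re.findall('\bcat\b', words)
print(with_raw)       # ['cat']
print(without_raw)    # [] because it looks for literal backspace characters
```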


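There are only some backslash escapes that Python string literals interpret specially, so everything else is kept as the two separate characters (though newer Python versions warn about such unrecognized escapes in non-raw strings). As such, you can get away with it in either form for cases like:

  • '\s' (=='\\s')
  • escaped brackets, e.g. '\(' (=='\\(')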

There are things that are not safe, but are not used much in regexps, so you probably won't run into problems with them:

  • \t
  • \f
  • \n, except most people expect that to be newline so it's not generally an issue

The main cases you should worry about are probably

  • \" and \' (also re)
  • \r
  • \b