Python parsing stuff

From Helpful

Command line argument parsing

getopt is the simplest form, takes few lines of code, though is not as helpful as...


optparse was historically the more flexible/helpful thing

(≥py2.3, no development since 2.7 because...)

argparse is what development moved to (≥py2.7).


Note that these are mostly for command line options that adhere fairly closely to the POSIX-recommended argument syntax (short and long styles; getopt is the most basic form), not necessarily your own creative definitions.




optparse

This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)

Example:

from optparse import OptionParser

p = OptionParser()   # Has some arguments, but we're ignoring them here

p.add_option("-o", "--output",  dest="outputfile", action="store", 
             help="write output to named file")

p.add_option("-s", "--show",    dest="show",       action="store_true", default=False, 
             help="show output image in a window")

options, args = p.parse_args()    # defaults to parsing sys.argv[1:]
# note that errors in arguments means this call exits the program.

print('options:  %r' % (options,))
print('args:     %r' % (args,))



Basic notes:

  • help generation, basic store-what logic, and basic error handling are done for you.
...so you mostly just specify how each argument should be handled
typically you still want to do some checking of sensible values - e.g. in the example you may want to check whether the filename is valid, make it absolute, and check that it doesn't already exist
  • you can specify a short (e.g. -s) and/or a long form (e.g. --show)
  • -h and --help are registered by default
  • the things you name in dest will sit in an attribute
here options.outputfile and options.show
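
To make the notes above concrete, here is a small sketch that hands parse_args() an explicit list instead of letting it read sys.argv[1:]. The filenames are made up for illustration.

```python
from optparse import OptionParser

p = OptionParser()
p.add_option("-o", "--output", dest="outputfile", action="store",
             help="write output to named file")
p.add_option("-s", "--show", dest="show", action="store_true", default=False,
             help="show output image in a window")

# Hypothetical command line, passed explicitly rather than taken from sys.argv[1:]
options, args = p.parse_args(["-o", "out.png", "-s", "in.png"])

print(options.outputfile)   # value stored under dest="outputfile"
print(options.show)         # set to True because -s was present
print(args)                 # whatever was not consumed as an option
```

Things that are not options (here in.png) end up in args, and it is up to you to decide what they mean.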


On add_option():

  • you can specify a default value for each attribute to take when no storage action is taken.
the default default value is None
  • dest specifies the attribute on the options object that is returned.
you can have the same dest from multiple options (e.g. for something that can have a handful of values)
but frankly, in most cases it's more predictable/readable to do this in your own logic afterwards
  • type requests conversion to a specific type.
Built-in: "string", "int", "long", "float", "complex", "choice", (and you can specify your own)


  • action can be one of the following: (the first few are probably most common)
    • "store": takes the next string on the argument list and stores it
    • "store_true", "store_false": specific cases of store_const for True and False. Useful for toggling things.
    • "store_const": store a pre-set value (from argument called const). No value is taken from the user arguments
    • "append": like store, but values are collected in a list, so repeated use of the option appends instead of overwrites
    • "append_const": append, but with a configured value instead of a user value
    • "count": count the number of times something is mentioned. You can use it to handle something like -v, -vv, -vvv, etc. as different levels of verbosity.
    • "callback": Call a named function. Mostly useful for hacking your own functionality on top, such as additional checks (e.g. checking whether one option was already set, used after another, reaction to ' -- ' meaning 'no more processing', etc.)
    • "help": react to use of this argument by printing help and exiting (You'll probably rarely do this yourself, because it's registered under -h and --help by default)
    • "version": react to use of this argument by printing the version string given to the OptionParser constructor (version=...) and exiting. You'll rarely do this yourself.
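
A rough sketch of a few of these types and actions together. The option names and values are invented for the example, and the argument list is passed explicitly so that it runs without a real command line.

```python
from optparse import OptionParser

p = OptionParser()
# type="int" converts the string for us (and errors out on non-integers)
p.add_option("-n", "--number", dest="number", type="int", default=1)
# type="choice" only accepts one of the listed strings
p.add_option("-m", "--mode", dest="mode", type="choice",
             choices=["fast", "slow"], default="fast")
# append collects repeated values in a list (the attribute stays None if never used)
p.add_option("-i", "--include", dest="includes", action="append")
# count increments each time the option is seen, so -vvv gives 3
p.add_option("-v", dest="verbosity", action="count", default=0)

options, args = p.parse_args(
    ["-n", "3", "-m", "slow", "-i", "a", "-i", "b", "-vvv", "rest"])

print(options.number, options.mode, options.includes, options.verbosity, args)
```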



On errors:

optparse's response to errors consists mostly of printing an error message and exiting. There is no optparse-specific exception to catch, and no option to ignore errors: error() prints usage and the message to stderr and calls exit(), which ends in sys.exit() with status 2.
If you want different behaviour, subclass OptionParser and override its exit() and/or error() (or, more bluntly, catch SystemExit around parse_args()).
If you want to report an error during your own sanitizing
look at p.error()
and possibly play with p.print_help(), sometimes change p.usage, etc.
This is a little finicky, and one reason to use argparse, or maybe docopt
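
One way the subclassing could look, as a minimal sketch. ArgumentParsingError and RaisingOptionParser are names made up here, not part of optparse.

```python
from optparse import OptionParser


class ArgumentParsingError(Exception):
    """Hypothetical exception type, defined here only for this example."""


class RaisingOptionParser(OptionParser):
    def error(self, msg):
        # The stock error() prints usage plus the message and exits with status 2;
        # here we raise instead, so the caller can decide what to do.
        raise ArgumentParsingError(msg)


p = RaisingOptionParser()
p.add_option("-n", dest="number", type="int")

caught = None
try:
    p.parse_args(["-n", "notanumber"])
except ArgumentParsingError as e:
    caught = str(e)
    print("bad arguments: %s" % caught)
```

Note that -h/--help and --version still exit through exit(), so override that as well if you want no exits at all.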


getopt


Less powerful than optparse, also less code to write.

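Separates/extracts options without values, options with values, and things not part of the options (e.g. filenames meant to be passed in). Note that getopt.getopt() stops processing at the first non-option argument, so options placed after arguments are returned as plain arguments; getopt.gnu_getopt() allows options and arguments to be intermixed.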
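
A small sketch of how getopt.getopt() and getopt.gnu_getopt() treat options that come after arguments. The argument list is made up, and this assumes the POSIXLY_CORRECT environment variable is not set (if it is, gnu_getopt also stops at the first non-option).

```python
import getopt

# Hypothetical command line where an option (-b) follows a non-option (file1)
example = ['-a', 'file1', '-b', 'file2']

# Plain getopt stops at the first non-option; everything after it is an argument
opts, args = getopt.getopt(example, 'ab')
print(opts, args)

# gnu_getopt keeps scanning, so -b is still recognized as an option
gopts, gargs = getopt.gnu_getopt(example, 'ab')
print(gopts, gargs)
```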

See http://docs.python.org/library/getopt.html


Example (mostly from python docs): getopt is generally applied to sys.argv[1:], i.e. getopt.getopt(sys.argv[1:], ...).

For code-example's sake, a list is literally supplied here:

import getopt

#In the real world you would use sys.argv[1:]
example = ['-a', '-b', '-cfoo', '-d', 'bar', 'file1', 'file2'] 

optlist, args = getopt.getopt(example, 'abc:d:')

Now optlist is [('-a', ''), ('-b', ''), ('-c', 'foo'), ('-d', 'bar')], and args is ['file1', 'file2']


Can also take long options. Example (mostly from python docs):

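import getopt
import sys

def usage():
    # placeholder: print a usage summary of your own
    print("usage: %s [-h] [-v] [-o FILE] [args...]" % sys.argv[0])

try:
    opts, args = getopt.getopt(sys.argv[1:], 
                               "ho:v",
                               ["help", "output="])
except getopt.GetoptError as err:
    print(str(err)) # will print something like "option -a not recognized"
    usage()
    sys.exit(2)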

# default values, can be overwritten by the actual options
output = None
verbose = False

#iterate over the things we got
for o, a in opts:
    if o == "-v":
        verbose = True
    elif o in ("-h", "--help"):
        usage()
        sys.exit()
    elif o in ("-o", "--output"):
        output = a
    # ...
    else:
        raise RuntimeError("unhandled command line option")

argparse

TODO: detail it
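
Until then, a minimal sketch of roughly the same thing as the optparse example, done with argparse. The description, option names, and filenames are made up, and the argument list is passed explicitly (by default parse_args() reads sys.argv[1:]).

```python
import argparse

parser = argparse.ArgumentParser(description="hypothetical image tool")
parser.add_argument("-o", "--output", dest="outputfile",
                    help="write output to named file")
parser.add_argument("-s", "--show", action="store_true",
                    help="show output image in a window")
parser.add_argument("-v", "--verbose", action="count", default=0,
                    help="repeat for more verbosity")
# Unlike optparse, positional arguments are declared too
parser.add_argument("inputs", nargs="*", help="input files")

ns = parser.parse_args(["-o", "out.png", "-s", "-vv", "a.png", "b.png"])

print(ns.outputfile, ns.show, ns.verbose, ns.inputs)
```

The main visible difference from optparse is that you get a single namespace back, with positional arguments in it under the name you gave them.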


click

Uses decorators to declare options and arguments on a function; the parsed values are then passed to that function as parameters.


https://click.palletsprojects.com/


Byte parsing

Simpler

Very regular patterns may be most easily parsed with struct.

...or numpy if that's where you want it to end up anyway.
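
For a sense of what that looks like with struct, a sketch with an invented record layout (4-byte magic, unsigned 16-bit version, unsigned 32-bit length, all little-endian). The layout is an assumption for illustration, not any real format.

```python
import struct

# "<" means little-endian with no padding; 4s = 4 raw bytes, H = uint16, I = uint32
RECORD_FORMAT = "<4sHI"

# Build some example bytes to parse (in practice these come from a file or socket)
record = b"DEMO" + struct.pack("<HI", 2, 1024)

size = struct.calcsize(RECORD_FORMAT)
magic, version, length = struct.unpack(RECORD_FORMAT, record[:size])

print(size, magic, version, length)
```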

construct


construct is a library that takes a declarative description of a binary format and can both parse bytes according to it and build bytes from data in that same format.

Going both ways means it is a declarative language of its own that can do complex things, once you grasp how it works.



Text parsing

pyparsing


Natural language

nltk, spacy