Python usage notes - Numpy, scipy

Syntaxish: syntax and language · changes and py2/3 · decorators · importing, modules, packages · iterable stuff · concurrency

IO: networking and web · filesystem

Data: Numpy, scipy · pandas, dask · struct, buffer, array, bytes, memoryview · Python database notes

Image, Visualization: PIL · Matplotlib, pylab · seaborn · bokeh · plotly

Tasky: Concurrency (threads, processes, more) · joblib · pty and pexpect

Stringy: strings, unicode, encodings · regexp · command line argument parsing · XML

date and time

Notebooks

speed, memory, debugging, profiling · Python extensions · semi-sorted

Intro

numpy is a C extension that does n-dimensional arrays - a relatively generic basis that other things can build on.

Numpy itself mostly does basic matrix operations, and some linear algebra, and interfaces with BLAS and LAPACK, so is fairly fast (certainly much preferable ver number crunching in pure-python code. It can still help to build against a specific BLAS).

It also defines the array interface, which allows sharing of data without knowing each other's APIs, so there are various other libraries that build on it with more specific goals, or just cooperate nicely, including:

scipy is an important one, as are its various optional:
- scikits supporting many specific goals

pandas - for various data analysis, for a good part easing the pragmatic sides of loading and sellecting and summarizing and other massaging

PIL / Pillow does a bunch of image loading, saving and altering -- and lets you view images as arrays, making it easy and fast to go back and forth between other libraries mentioned here

matplotlib and mayavi for visualization (see also matplotlib notes)

Just because you may be interested, others include:

pandas, seaborn, plot.ly and ggplot seem to wrap matplotlib (verify) and make them easier and/or expand functionality

bokeh is targeted towards browsers

pygal creates SVG (so also browsers)

plot.ly and highchart are online-only (verify)

(most also integrate with ipython one way or another)

...etc.

There will be overlap between these (and they will often have subtle differences in behaviour or capabilities), e.g. between numpy's SVD[1] and scipy's SVD[2], between FFTs in different places, between scipy's and PIL's image functions.

numpy

Some concepts you may care about

On sizes/shapes and axes

The shape of matrices are funny to some people.

This mainly depends on whether you come from a math background, which typically describes matrices as y-by-x ,row-by-column), so:

array([ [1, 2, 3],
        [2, 4, 5]] )

would be a 2-by-3 matrix, and its shape is (2,3)

(...whereas many other fields (e.g. image processing) may prefer width-by-height, so if you do those things in numpy, things get occasionally confusing)

Since you're going to commonly select from axes, it helps to have a mental model of the way numpy deals with dimensions.

On one side there is no true interpretation, it's mainly just self-consistent convention,

On the other side there are numpy functions where it does matter (avoids reordering, sometimes faster), and it's handy to have the same conventions as most people.

As such, unless you have a reason otherwise, think of:

axis 0 as rows
axis 1 as columns
axis 2 as depth in a 3D box, slices in a stack of same-sized images, or such.

(If you dig deeper e.g. because you want efficient transfers to other software, you also end up having to know about C order versus Fortran order as well, but in general use you often wouldn't care about this.)

Reshaping

Since array data is a contiguous block, the shape is actually just metadata, that controls how you and functions view that data.

This also means you can change the shape as long as the total size stays constant.

This has a few use cases, mostly flattening away and introducing dimensions.

a=np.arange(15)
# now a is
array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14])
# and a.shape is
(15,)

a.reshape(3,5)
# means a is
array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14]])

# while 
a.reshape(3,5,1)
# would give an array like
array([[[ 0], [ 1], [ 2], [ 3], [ 4]],
       [[ 5], [ 6], [ 7], [ 8], [ 9]],
       [[10], [11], [12], [13], [14]]])

Indexing / slicing

It's not just you, this stuff takes getting used to.

It's quite consistent, though, so eventually becomes natural enough.

To steal and shorten some examples from here, given:

q = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

Numpy's array slicing syntax is similar to python's.

q[1:7]     == array([1, 2, 3, 4, 5, 6])

# with a non-1 step size:
q[1:7:2]   == array([1, 3, 5])

# negative indices interpreted as counts from the end (i.e. length plus it).
q[3:-4]    == array([3, 4, 5])

# negative stepping
q[-4:3:-1] == array([6, 5, 4])

More-dimensional slicing:

Wishes per-dimension are separated with commas and yields a same-or-lower-dimensional array, depending on whether you ask for a range or index.

Say you have this test data (with shape (2, 3, 1))

r = np.array([ [[1],[2],[3]],
               [[4],[5],[6]]])

For example, selecting the first row:

r[0,:,:] == array([[1], [2], [3]])

And to go from shape (2,3,1) to (2,3), you would probably go for:

s = r[:,:,0]

There are some shorthands you may care to know about.

I personally avoid them, because using just one explicit style makes it clearer when skimming code.

But sometimes they are more flexible. For example, when you don't want to assume a specific-dimension array, you can use ..., basically meaning ": for the rest"

# For example
r[0,...]  
# instead of the 3D-specific r[0,:,:]

r[...,0] 
# instead of the 3D-specific r[:,:,0]

# Similarly,
r[0] 
# does the same as r[0,...] 
# and is possible only because/when your selectivity lies in the first dimension(s)

Slicing is also useful to selectively apply operations (ufuncs). For example:

a[: , 1] /= 2      # divide the second column of a 2D matrix by two

a[0::3 , :] /= 2   # divide every third row by two

...and if the array was 3D, you effectively omitted the third dimensional wish, so you selected and altered all of it.

ufuncs and vectorization

Consider you can e.g. add or multiply by scalars:

>>> a = np.array([1,2,3,4])
>>> a * 2
array([2, 4, 6, 8])

This actually requires some glue to work (particularly because it mixes a python scalar with a numpy array), but it's so basic and useful that many such things are special-cased to just work (and that glue is often relatively simple, because...).

Numpy models the concept of a per-element function, called a 'universal function', better known as a ufunc.

This lets you mix many different things (generating, transforming, etc) and allows for things like:

>>> S = np.array([.25, .5, .75, 1])
>>> np.sin( np.pi * S )
array([  7.07106781e-01,   1.00000000e+00,   7.07106781e-01,   1.22464680e-16  ])

>>> B = np.linspace(1,10, 10)            # (which is array([1,2,3,4,5,6,7,8,9,10]))
>>> 1 + np.exp( np.cos( np.pi * B ) )

Notes:

Applying ufuncs is usually faster than alternatives - not because it's parallel or such, but because the iteration and evaluation can often happen in C

...occasionally with vectorized code

ufuncs include arithmetic, algebra, comparison logic, bitwise logic, things like conjugates.

combinations of ufuncs are effectively also ufuncs(verify)

Generating 1D arrays

Your options to generate such matrices include:

numpy.arange(), similar to python's range() but produces a numpy array

evenly spaced

Your basic start,stop,step

deals with floats

>>> numpy.arange(1, 5, 0.3)
array([ 1. ,  1.3,  1.6,  1.9,  2.2,  2.5,  2.8,  3.1,  3.4,  3.7,  4. ,
        4.3,  4.6,  4.9])

numpy.linspace() [3]

Evenly spaced

start, stop, and amount_of_steps in which to get there

avoids rounding errors at the endpoints (verify) so the last value will always be stop

>>> numpy.linspace(1,5,14)
array([ 1.        ,  1.30769231,  1.61538462,  1.92307692,  2.23076923,
        2.53846154,  2.84615385,  3.15384615,  3.46153846,  3.76923077,
        4.07692308,  4.38461538,  4.69230769,  5.        ])

Generating higher-D grids

It is regularly useful to evaluate functions for a given set of values, e.g. on a grid. Options include:

reshape something, often one of the above, e.g.

# numpy.arange(1,10).reshape(3,3)
array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

numpy.meshgrid()

n-dimensional coordinate matrices, from 1D coordinate vectors

modelled after matlab's meshgrid

intuitively (for two): assume these two things need to be in the X and Y axis, so broadcast them according to the shape of the other

# e.g. a 2D
x = numpy.array([1,  2,  3,  4])     # using linspace here is often more flexible
y = numpy.array([10, 20, 30])
XX, YY = numpy.meshgrid(x, y)

XX
# array([[1, 2, 3, 4],
#        [1, 2, 3, 4],
#        [1, 2, 3, 4]])

YY
# array([[10, 10, 10, 10],
#        [20, 20, 20, 20],
#        [30, 30, 30, 30]])

XX + YY 
# array([[11, 12, 13, 14],
#        [21, 22, 23, 24],
#        [31, 32, 33, 34]])


# This is helpful e.g. when trying to write cleanish ufunc combinations that use both, e.g
x = numpy.array([0.1, 0.2, 0.3, 0.4, 0.5])
y = numpy.array([0.1, 0.2, 0.3, 0.4, 0.5])
XX, YY = numpy.meshgrid(x, y)
numpy.sqrt( XX*XX + YY*YY )
#array([[0.141, 0.224, 0.316, 0.412, 0.51 ],
#       [0.224, 0.283, 0.361, 0.447, 0.539],
#       [0.316, 0.361, 0.424, 0.5  , 0.583],
#       [0.412, 0.447, 0.5  , 0.566, 0.64 ],
#       [0.51 , 0.539, 0.583, 0.64 , 0.707]])

When it is enough to use integer based matrices, the above use has some shorter alternatives:

numpy.mgrid[] (Grid-shaped arrays of evenly spaced numbers in N-dimensions)

XX, YY = numpy.mgrid[10:40:10, 1:5] # same as the first XX, YY above

ogrid[] ('o'pen mesh grid) doesn't fill out the separate matrices, just sets up the shapes so they would broadcast the way you want when you combine them.

XX, YY = numpy.ogrid[10:40:10, 1:5]
XX
# array([[10],
#        [20],
#        [30]])
YY
# array([[1, 2, 3, 4]])

XX+YY
# array([[11, 12, 13, 14],
#        [21, 22, 23, 24],
#        [31, 32, 33, 34]])

numpy.indices() is like the combination of arange() and meshgrid(), counting in each direction.

This can be useful when you're working with images, or other pixel/voxel style data. For example, say you want the radius from an arbitrary center point.

# Some setup     (you may want to inspect intermediates yourself to see what's happening)
height, width = (5,5)
# subtract half: fractional pixels and zero-based numbers
#  intuitively: draw a 3x3 and 4x4 matrix and figure out where the center really is
cx = -0.5 + (0.5*width) 
cy = -0.5 + (0.5*height)
yarr,xarr = np.indices([height,width])  # this creates incrementing indices in each direction, so:
#   xarr is [0 1 2 3 4], [0 1 2 3 4], ...
#   yarr is  [0 0 0 0 0], [1 1 1 1 1], ...

# ...which makes the following easy to express (note: in a vectorized way):
numpy.sqrt((xarr-cx)**2+(yarr-cy)**2)
#array([[2.828, 2.236, 2.   , 2.236, 2.828],
#       [2.236, 1.414, 1.   , 1.414, 2.236],
#       [2.   , 1.   , 0.   , 1.   , 2.   ],
#       [2.236, 1.414, 1.   , 1.414, 2.236],
#       [2.828, 2.236, 2.   , 2.236, 2.828]])

# Or more interestingly, something like a gaussian (here intended for a fourier-space filter)
import matplotlib.pyplot, numpy as np

h = 10 # in pixels
height, width = (20,20)
cx = -0.5 + (0.5*width) 
cy = -0.5 + (0.5*height)
yarr,xarr = np.indices([height,width])  # this creates incrementing indices in each direction, so:
gaussian_lowpass = np.exp( -(  ( (xarr-cx)**2 / h**2 ) +((yarr-cy)**2 / h**2)) )

matplotlib.pyplot.imshow(gaussian_lowpass, cmap='gray')
matplotlib.pyplot.show()

There is also:

scitools.numpyutils.meshgrid - extended version of numpy.meshgrid to 1D, 2D and 3D grids, with sparse or dense coordinate arrays and matrix or grid indexing
scitools.numpyutils.ndgrid - same as calling meshgrid with indexing=’ij’ (matrix indexing)

scitools.UniformBoxGrid
scitools.BoxGrid

...and a number of tools around them.

Broadcasting

When combining matrices with operations that are element-wise, the most obvious requirement would be identical shapes.

Combining non-identical shapes works by following some specific rules. Broadcasting basically means one array will be repeated to fit (much as if there's some code looping it for you)

a = array([[ 1., 2.], 
           [ 3., 4.], 
           [ 5., 6.]]) 

b = array([-1., 3.]) 

a + b 
# will yield:
array([[ 0., 5.], 
       [ 2., 7.], 
       [ 4., 9.]]) 
# ...because b was repeated downwards

This works when relevant axis sizes match, or are 1. For example:

array([[1, 2, 3], 
       [4, 5, 6]]) + array([10])

array([[1, 2, 3], 
       [4, 5, 6]]) + array([[10],
                            [100]])

array([[1, 2, 3], 
       [4, 5, 6]]) + array([10,20,30])

Cases may intuitively feel somewhat ambiguous when there are several axes onto which you could combine. For example:

a = array([[ 1., 2.], 
           [ 3., 4 ]]) 
b = array([-1., 3.]) 

a + b  
# array([[ 0.,  5.],
#        [ 2.,  7.]])

# It's not that it chooses the first axis rather than the second, 
#  it's that b's shape effectively a row vector
# If you want it to apply to the second axis
a + b[:,np.newaxis] 
array([[ 0.,  1.],
       [ 6.,  7.]])
# Transposing wouldn't work - b is 1D with shape (2,). 
#  you could introduce the new dimension (now shape (1,2)) and then transpose it, yes.
a + b[numpy.newaxis,:].T

See also:

Structured arrays / record arrays

You can load/save/use mixed data types, in a C-struct-like way.

You can also give fields names, which lets you access them that way. (This is handled within the numpy.recarray class, a subclass of numpy.ndarray).

Some notes:

There are (confusingly) many different ways of telling numpy about the format. See: http://docs.scipy.org/doc/numpy/user/basics.rec.html#defining-structured-arrays

You can mix them, but that's just confusing

I prefer the long-names-with-bit-size variant (e.g. float32, uint16) for readability

each type can be prefixed by a repetition, or a shape, e.g. 3int8, (2,3)float64

You can define a data type with names, which gives you an array that you can also address with those names:

pt = np.dtype([
  ('chunk_id',   'a1'),   
  ('chunk_good', 'bool'),   # size of a bool is one byte
  ('chunk_used', 'bool'),   
  ('chunk_size', '<u4'),    # little-endian unsigned 32-bit integer
  ('format',     'S4'),     # 4-byte string
])
a = np.frombuffer('A\x00\x01\x01\x02\x03\x04qqhrB\x01\x00\x03\x06\x11\x12qqlr', dtype=pt)

a[0]['format'] 
# will be 'qhrr'

print repr(a)  # will show something like:
# array([('A', False, True, 67305985L, 'qqhr'),
#        ('B', True, False, 303105539L, 'qqlr')], # ...plus the dtype

See also:

http://docs.scipy.org/doc/numpy/user/basics.rec.html#structured-arrays

Data types

NOTE: much of this needs to be cleaned, completed, or at least double-checked.

The below oversimplifies things, omits some details on implicit conversions, some serialized forms (such as 16-bit floats), and more. For completeness, check the actual documentation.

Most of the interesting predefined types you want to hand to dtype arguments are on the left:

numpy dtype object                     numpy string spec     string spec for python module struct
                                                             (only when using the non-default "standard mode"!)
                                                             (see note below)

numpy.float32                           f4 or float32              f             
numpy.float64                           f8 or float64              d            
numpy.int64                             i8 or int64                q             
numpy.uint64                            u8 or uint64               Q           
numpy.int32                             i4 or int32             i or l
numpy.uint32                            u4 or uint32            I or L
numpy.int16                             i2 or int16                h          
numpy.uint16                            u2 or uint16               H           
numpy.int8                              i1 or int8                 b            
numpy.uint8                             u1 or uint8                B           
numpy.bool        (uses a byte)         b1 or bool                 ?                 
numpy.complex64   (two 32-bit floats)   c4 or complex64            
numpy.complex128  (two 64-bit floats)   c8 or complex128                  

                                                    < for little-endian
                                                    > for big-endian
                                                    = for native       
                                                    | for not applicable
                                                    (default is often = or |(verify))
fixed-size list-like things: 
 # being the length you want
'S#' / 'a#'    bytestring                                       s or p
'U#'           unicode string  (Py_UNICODE, so apparently UCS2 or UCS4 depending on the platform)
'V#'           raw data

Notes:

Defaults:
- The default float is often f8 (float64) (also when converting from python float)
- 'complex' defaults to complex128.
- the default integer may be i4 or i8 (platform-sized?(verify))

I mention the struct module, I left out array.

array's int sizes are platform-native, and there's no ability to control it like there is with struct, so it doesn't really fit in the above table.

struct is better, but still artuably more bother than necessary if you can use numpy

for the struct column, these are only valid when not using standard size (when using '<', '>', '!' or '='), not in native size (most int types can vary) See types. When you want portability, use standard mode.

Type hierarchy, testing

numpy has a hierarchy to its dtypes, which is useful when you e.g. want to test something in broader strokes - is it any variant of integer, is it signed or not, is it any sort of inexact number, etc.

Note that

the supertype of integer types is numpy.integer, which is the supertype of numpy.unsignedinteger and numpy.signedinteger

np.int and np.uint exist but are concrete types, so will not be a supertype of anything. Specifically, they are native-sized ints/uints

the supertype of float and complex is numpy.inexact

the supertype of float is numpy.floating

the superclass of string types and void* is numpy.flexible

For the tree, look for the docstring of numpy.core.numerictypes

So, for example, to see if something is any sort of integer you can use:

np.issubdtype(some_dtype, np.integer)
# or use the kind
#  (most commonly you are interested in f, u, i, or c)
ary.dtype.kind in ('u','i')

Or going further:

numpy.issubclass_(numpy.float32, numpy.inexact) == True
numpy.issubclass_(numpy.complex64, numpy.inexact) == True

numpy.issubclass_(numpy.int32,  numpy.integer) == True
numpy.issubclass_(numpy.int32,  numpy.signedinteger) == True
numpy.issubclass_(numpy.int32,  numpy.unsignedinteger) == False

numpy.issubclass_(numpy.uint32, numpy.integer) == True
numpy.issubclass_(numpy.uint32, numpy.signedinteger) == False
numpy.issubclass_(numpy.uint32, numpy.unsignedinteger) == True

On byte order (endianness)

Data's endianness is listed in dtype.byteorder, but that'll often be = for native (numpy.dtype.isnative would say True) (or | where it doesn't matter, e.g. 8-bit types)

dtype.descr seems to be more useful (verify)

changing endianness

ary.byteswap(inplace=True) # in-place meaning the backing bytes are re-ordered
ary.byteswap(inplace=False) # new array?(verify)
# as a view:     ( alternative: arr.view(arr.dtype.newbytorder(thatchar)) )
swapped   = o.newbyteorder('S') # swapped from its current endianness
as_little = o.newbyteorder('<')
as_big    = o.newbyteorder('>')
as_native = o.newbyteorder('=')

To find platform endianness', look at sys.byteorder, or do something like: (this should be clear, there are briefer ways)

oneone       = numpy.ones(1, dtype='=i2')
littleoneone = oneone.newbyteorder('<')
native_is_little = (oneone[0] == littleoneone[0])

See also:

http://docs.scipy.org/doc/numpy/reference/generated/numpy.dtype.html

http://docs.scipy.org/doc/numpy/user/basics.types.html

http://docs.scipy.org/doc/numpy/reference/generated/numpy.dtype.byteorder.html

Filtering, searching, counting

Iterating

✎ This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

http://docs.scipy.org/doc/numpy/reference/arrays.nditer.html

Sorting

✎ This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

Sorting along an axis, independently

To do a sort along an axis, each separately, use sort() without order.

# new array:
numpy.sort(a, [axis, kind])
# in-place
a.sort([axis, kind])

axis - the axis to sort over

defaults to -1, the last axis (which seems to be the only variant that does not keep temporary data(verify), so is faster)

axis=None flattens the array

kind is one of: [4]

'quicksort' (fast, not stable)

'mergesort' (stable, extra work space)

'heapsort' (not stable)

Examples:

# given:
a = np.array(                 [[3,  1,  2],
                               [4,  9,  6],
                               [0,  2,  5]] )

# To sort within rows: (implied axis=-1, here the columns, and also equivalent to axis=1)
np.sort(a)  ==  array(        [[1,  2,  3],
                               [4,  6,  9],
                               [0,  2,  5]])

# To sort within columns:
np.sort(a, axis=0)  ==  array([[0,  1,  2],
                               [3,  2,  5],
                               [4,  9,  6]])


# Flatten (to give a 1D sorted array of all values)
np.sort(a, axis=None)  ==  array([1, 1, 2, 2, 3, 3, 4, 6, 9])

Sorting a table

...like you're sorting a table by one of its columns: reorder all columns the same way, by the values in one.

✎ This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

If you do not have fields defined (not a structured array), then argsort() is probably the most flexible but it's not in-place, which can matter to you for large arrays. If it does, see below.

("argsort returns an array of indices of the same shape as a that index data along the given axis in sorted order."[5])

# sort columns, by the values in the second column
>>> a[ a[:,1].argsort() ]
array([[3, 1, 2],
       [1, 2, 3],
       [4, 9, 6]])

# sort rows, by the values in the third row
>>> a[ a[2,:].argsort() ]
array([[3, 1, 2],
       [4, 9, 6],
       [1, 2, 3]])

When sorting a structured array' (i.e. when you have fields defined):

>>> a=np.array([[(3, 2, 1)],
                [(9, 7, 8)],
                [(4, 5, 6)]],
               dtype=[('FOO', '<i8'), ('BAR', '<i8'), ('QUU', '<i8')])

...then you can sort by those fields (columns), in-place or not:

>>> a.sort(order=['FOO'], axis=0) # it seems you have to explicitly mention the right axis
>>> a                                   # even though it's the only one that makes sense
array([[(3, 2, 1)],
       [(4, 5, 6)],
       [(9, 7, 8)]],
      dtype=[('FOO', '<i8'), ('BAR', '<i8'), ('QUU', '<i8')])

You can order by one field, then another for cases where the first are equal. (no example; the data above does not have such a case).

To column-sort field-less arrays in-place (can make sense for very big arrays which you are using as tables), you have to fake fields.

You can use a view to add fields onto an existing array. Then call the view's .sort() to sort the backing data in-place.

Caveats:

you can still only really sort by columns.

You have to get the dtype right. Getting a view via a hardcoded dtype, e.g. a.view('i4,i4,i4') (or i, or similar) won't be remotely portable, unless you also created the array with that same hardcoded type (below, numpy chooses a's type, which could be i4 or i8 depending on context)

>>> a = np.array([(3, 1, 2),
                  (4, 9, 6),
                  (1, 2, 3)])

>>> b = a.view( ','.join( [str(a.dtype)]*a.shape[-1]) )  # TODO: check this
>>> b
array([[(3, 1, 2)],
       [(4, 9, 6)],
       [(1, 2, 3)]],
      dtype=[('f0', '<i8'), ('f1', '<i8'), ('f2', '<i8')])

>>> b.sort(order=['f0'], axis=0)
>>> a
array([[(1, 2, 3)]
       [(3, 1, 2)]
       [(4, 9, 6)]])

Serializing

✎ This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

....often to/from disk, sometimes memory.

Note that if instead of a local back-and-forth to storage you want to use these for interchange, then in most cases you have to do the extra bookkeeping/metadata stuff related to shape and data type (including signedness and endianness).

pickle

You can use pickle on numpy objects.

You probably want to explicitly use pickle.HIGHEST_PROTOCOL as it's smaller on disk and writes and reads a lot faster than the simple-and-stupid default. I regularly see a factor 10 or 20 faster for writing (more when using cPickle over pickle) and reading (with little difference between pickle/cPickle).

Text files

savetxt() saves to a space-(or-tab-or-such)-separated text file, which can be nice for trivial-to-code interchange with other things.

loadtxt() and genfromtxt() are relatively simple text-file parsers. The latter seems slightly more flexible.

fromregex() is like loadtxt()/genfromtxt(), but uses a regexps-capable parser to take out data.

Raw bytestrings (specify dtype)(verify)

Conversion to and from bytestrings:

tostring() basically constructs a python bytestring copy of just the array's byte storage

you'll have to keep track of shape and data type (including endianness) yourself, meaning this isn't hugely portable.

There is a similar numpy.fromstring(a,type), though chances you'll only pair it with numpy.tostring(a)

tofile() and fromfile() work on a file object instead.

You may consider numpy.memmap(), which lets you back an array by a mmap()ed file. This can be a convenient way to read in known raw structures, avoiding allocation when backing files are large. Note, however, that these are limited to 2GB even on 64-bit archs.

frombuffer() deals with any python buffer interface source (see also Python_usage_notes_-_struct,_buffer,_array,_bytes,_memoryview). When compared to fromstring it can help avoid double allocation.

fromiter() reads from any iterable, which can sometimes be a useful way to convert from python to numpy.

Matlab matrix data

There are some things added by scipy and by matplotlib, including scipy.io.savemat() and scipy.io.loadmat(). Again, this can be nice for interchange with Matlab (or other things that know this format). (the saving speed, loading speed, and space used is pretty good.)

npy binary format

numpy.save() and numpy.load() use numpy's own binary+metadata data format. which is one of the few options that does store enough metadata for portability(verify), allows C-order and Fortran order, and allows multiple arrays.

numpy.savez() stores a compressed variant (zip archive), which load() also understands.

For details of the file format used, look for npy format

Note that you can save to a StringIO/BytesIO, which can be nice when passing this into a database.

Going between python objects and numpy

In general, work in numpy, and minimize the amount of conversions you do.

Numpy to python

You're probably looking for tolist(), which creates a python list(-of-lists) filled with conversions to the closest applicable types.

Avoid going to python until/unless necessary - the copy to python is slowish, and working in python is slow compared to numpy

Python to numpy

Since numpy allocates its own contiguous array, this will also be a copy.

...although objects that implement the array protocol (and satisfy some other requirements?{{verify}) may be viewed instead.

Manipulating shape, size, and axes

Reshaping, reordering

These two do not change the backing data, it just looks at the same data differently (and so requires that new size fits exactly).

It returns a different-shaped view on the same backing data.

For example, say that you have:

>>> a = array([ 1, 2,  3, 4, 5, 6, 7, 8, 9, 10, 11, 12 ]) # e.g. from linspace
>>> a.reshape( (6,2) )
array([[  1,   2 ],
       [  3,   4 ],
       [  5,   6 ],
       [  7,   8 ],
       [  9,  10 ],
       [ 11,  12 ]])

>>> a.reshape(  (2,6) )
array([[  1,   2,   3,   4,   5,   6  ],
       [  7,   8,   9,  10,  11,  12  ]])

Notes:

One of the lengths can be -1, which means "it's an exact multiple and you figure it out".

Note that when you reshape an array, the order= argument does not mean 'output should be this ordering' (as it usually does), but (?). If a copy is necessary, it will be C-ordered. If it can be done without a copy, the ordering depends on other factors. If you need the output in a specific ordering, you may need a .copy() with order= argument.

1D flattening
- numpy.ravel() is a reshape into 1D, and is essentially reshape(-1, order=order)
- ndarray.flatten() seems to always returns a copy (verify)
- ndarray.flat() gives a numpy iterator (relatively like a numpy iterator)

For certain specific conversions, allocation as (or readout as, which can be differnt) Fortran order or C order indexing can be relevant.

Reminder:

C-style index ordering: the last axis index changing fastest, back to the first axis index changing slowest.

Fortran-style index order: with the first index changing fastest, and the last index changing slowest

ndarray.flags tells you the array's current properties

numpy.asfortranarray()
numpy.ascontiguousarray() (C order)

numpy.require() can require ordering among other things (alignment, writeability, that we own the data) and will make a copy if necessary

http://stackoverflow.com/questions/17558388/reordering-of-numpy-arrays

Notes:

Iteration is always / defaults to C order (verify)

Resizing (in the resampling/interpolation sense)

✎ This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

(Note that most of these will return copied data)

Integer-factor reductions

decimating by taking every nth pixel via indexing, and throwing away the rest

size down only, integer multiples only

e.g. image[::2,::2]

skimage.measure.block_reduce

size down only, integer multiples only

(in image processing contexts this is sometimes called binning or coarsening)

does something per non-overlapping rectangle, e.g.

block_reduce(image, block_size=(2,2)) (defaults to np.sum())

block_reduce(image, block_size=(2,2), func=np.mean)

block_reduce(image, block_size=(2,3,1), func=np.max)

Presumably the code here does the same thing (haven't really looked yet)

scipy.signal.decimate()

size down only, integer factor only

uses interpolation, specifically order-8 Chebyshev type I, or 30 point FIR

Not very sorted yet

one of the functions in scipy.interpolate

e.g. map_coordinates(), which is actually way more general

scipy.signal.resample - resample using fourier space (zero-padding in frequency domain)

in one dimension only?(verify)

scipy.signal.resample_poly - polyphase and FIR filter

Somewhat image-specific

scipy.ndimage.imresize()

nearest, bilinear, bicubic, cubic, lanczos

meant for PIL interaction so coerces results to integer(verify)

scipy.ndimage.interpolation

e.g. zoom(), using spline interpolation of, by default, order 3

combining data

splitting

chunk up along a particular axis, return them as separate arrays (views?). (Sort of the counterpart of stacking)

numpy.split(axis=something) splits an array into a list of multiple sub-arrays of equal size.

numpy.hsplit() is split() with axis=1, i.e. horizontally, column-wise

numpy.vsplit() is split() with axis=0, i.e. vertically, row-wise

numpy.dsplit() is split() with axis=2, i.e. depth-wise

numpy.array_split() is similar, but when splitting into a given amount of items it does not require equal division.

>>> c = numpy.array([[ 1.,  2.],
                     [ 3.,  4.],
                     [ 5.,  6.],
                     [ 7.,  8.]])
>>> numpy.vsplit(c, 2)
[array([[ 1.,  2.],  [ 3.,  4.]]), 
 array([[ 5.,  6.],  [ 7.,  8.]])]

>>> numpy.hsplit(c, 2)
[array([[ 1.],
        [ 3.],
        [ 5.],
        [ 7.]]),
 array([[ 2.],
        [ 4.],
        [ 6.],
        [ 8.]])
]

>>> x = np.arange(9.0)
>>> np.split(x, 3)
[array([ 0.,  1.,  2.]), array([ 3.,  4.,  5.]), array([ 6.,  7.,  8.])]
>>>
>>> np.split(x, [3, 5, 6, 10])
[array([ 0.,  1.,  2.]), array([ 3.,  4.]), array([ 5.]), array([ 6.,  7.,  8.]), array([], dtype=float64)]
>>> np.array_split(x, 4)
[array([ 0.,  1.,  2.]), array([ 3.,  4.]), array([ 5.,  6.]), array([ 7.,  8.])]

See also:

Growing

You can't really change the size of a numpy array without reallocating and copying.

a basic implication of it being backed contiguously in memory, which is good for speed

(could the realloc expand the existing allocation?(verify))

yes, this implies that numpy arrays that are ≥ 30% or so of RAM will often be trouble in practice. (You may wish to consider better-informed C code, or some well-informed numpy calls)

If you know the eventual size of the array, allocate that size immediately and assign data as you go.

This is probably the most memory-efficient way to go

If you don't need numpy operations on the axis of growth, consider making a python list of numpy matrices.

(pretty much because that isn't contiguous)

If you don't know the final size, but do want to end up with a single numpy array,

...consider first making the python list of numpy arrays, and eventually join or stack them

it will, for a short time, mean you use twice the memory that the data needs.

...but so will attempting to grow an existing array

Padding

delete, insert

newaxis

More axis manipulation

Unsorted

0D array, scalars

✎ This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

Histograms, Binning as in dealing with values

✎ This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

(cf. binning as in block-means coarsening -- see resizing)

numpy.digitize [6]

scipy.stats.binned_statistic [7]

numpy.bincount [8]

bin_means = numpy.histogram(data, bins, weights=data)[0] / numpy.histogram(data, bins)[0]

http://stackoverflow.com/questions/23594262/binning-values-of-a-function-in-python-numpy

Semi-sorted

dot versus matmul

Correlation

distance calculation

None

Converting a python array containing None

becomes NaN under float/double

is a TypeError casting to int

without explicit dtype can force the array to object

Bits

Bitmasks or other booleans

numpy.unpackbits converts 8-bits-in-an-uint8 to 1-bit-per-uint8

>>> np.unpackbits(array([[2], [7], [23]], dtype=np.uint8), axis=1)
array([[0, 0, 0, 0, 0, 0, 1, 0],
       [0, 0, 0, 0, 0, 1, 1, 1],
       [0, 0, 0, 1, 0, 1, 1, 1]], dtype=uint8)

numpy.packbits converts 1-bit-per-someinteger to uint8, 8 bits at a time

padded to 8 with zeroes on the right side per element

so when you specify fewer than 8 you actually specify the leftmost bits - a bit odd perhaps, and meaning you sometimes need to do trickery like ([0]*(8-len(a)%8))+a

For example:

>>> np.packbits(array([  [ [1,0,1], [0,1,0] ],  
                         [ [1,1,0], [0,0,1] ]  ])
array( [  [ [160],[64] ]  ,  
          [ [192],[32]] ], dtype=uint8)

integers with unusual edges

Say you want to load from unusually-sized integers that imply packing across byte boundaries, often raw image data, e.g. 4, 10, 12, or 14 bits per pixel, 24 bits per sample.

Since there is no native data format for that, you will probably be loading that into uint8, uint16, uint16, uint16, and uint32 respectively.

If the data isn't huge, and this isn't a bottleneck in your app, then it's probably easiest to load it into a covneniently-sized integer type, then bit-massage it into multiple arrays with the dtype you want, and probably merge them. (As much in numpy code as possible so it's native)

If data is large, you probably want to avoid the intermediate allocations, and do the described array-to-array conversion in one step. See e.g. the cython example here.

Numpy and native code isn't quite as scary as it sounds.

For example, cython is aware of numpy

write a small numpy C extension

use an existing library via ctypes

Consider Fortran, in simpler cases it's somewhat less hairy than

To avoid any intermediate storage (relevant when working on huge data), you may to do the conversion during file load, probably bypassing numpy's loading completely. (Note you'll have to consider endianness and such yourself)

Spherical/polar coordinates

Making masks

Warnings and errors

RuntimeWarning: invalid value encountered in less_equal

...or in another comparison function (less, greater, etc)

This is most likely about np.nan or nan.infinity.

Those values can be avoidable when they come from

a div by zero, or log of zero
conversion from python None

In other cases they are useful (e.g. it's a matrix or timeseries that is missing data points)

...but functions statistics on them may not like these values.

check whether there are nan-aware functions (e.g. nanmedian, nanpercentile, nanmedian)

failing that, make a (flat) array with just the non-NAN values like:

x = x[numpy.logical_not(numpy.isnan(x))]

or the shorthand

x = x[~numpy.isnan(x)]

ValueError: data type must provide an itemsize

Means the type you handed in can't be coerced to the given dtype (regularly the implied default, double).

For example, you handed in an array of integer-as-strings, or an array mixing types where just one fails.

ValueError: zero-size array to some function without identity

Usually just means you're handing along an empty array.

There are a bunch of utility functions that assume they won't be handed an empty array, usually because it makes no sense to do so, so you just want a few more guards in your higher-level code.

In my case, I had an off-by-one error in some slicing logic.

RuntimeWarning: numpy.ufunc size changed, may indicate binary incompatibility

These warnings come from something compiled against an older numpy than is installed, and signaling that the ABI might have changed.

It often comes from scipy. Which is typically built against the lowest supported numpy version, in part to be ABI-compatible, so it's typically fine.

If your distro packaging doesn't fix this, you can usually also fix this by one of:

downgrading numpy
getting pip to compile scipy against your installed numpy.
filter out these warnings.

See: https://stackoverflow.com/questions/40845304/runtimewarning-numpy-dtype-size-changed-may-indicate-binary-incompatibility

underflow encountered in exp

The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

✎ This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

The simplest case is when you think ary==None when you want to test whether the variable has None assigned.

What you want in that cases is ary is None

A more interesting case is when you actually want to write a 'has nonzero values' test such as if peaks:.

The idea is that the array could be coerced to boolean through context. The issue is whether an array should reduce to true or false is often not well-defined: should all element tests be true for the whole thing to be considered true? Is one enough? (This is why it mentions np.all() and np.any(). It says 'with more than one element' because this actually is special-cased for single-value matrices, because that's useful)

...and the code you want may be something like

if np.any(peaks>0):

I get the impression that older code allowed some vaguer cases(verify), so you may run into old code that fails now.

Similarly, when you do things like:

extremes = data[ (data<0.2 or data>0.8) ]

That test inside is an expression that coerces both arguments to boolean, which reduces to the earlier problem (twice). (I think because it's a python operator? not sure(verify))

In this case, the better-defined way is a piece-wise logical or of two same-sized boolean arrays, e.g.

extremes = data[numpy.logical_or(data<0.2, data>0.8)]

Using a non-tuple sequence for multidimensional indexing is deprecated

More fully:

FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; 
use `arr[tuple(seq)]` instead of `arr[seq]`.
In the future this will be interpreted as an array index, `arr[np.array(seq)]`,
which will result either in an error or a different result.

The actual warning comes from numpy.

This actually is about the behaviour of slicing, and specifically numpy ~1.15.0 deprecating (multidimensional) indexing/slicing with anything other than tuple arguments.

As https://github.com/numpy/numpy/pull/9686 mentions, this because there used to be distinct behaviours depending on how you handed the same-valued arguments in, which is was probably a confusing/bad idea; your code probably should be changed updated if only to check whether or not it relied on the old behaviour.

If triggered from scipy, it's an older version of scipy, and not an issue. Since it has been fixed in scipy since, upgrading it gets rid of the warning.

elementwise == comparison failed and returning scalar instead; this will raise an error or perform elementwise comparison in the future

https://github.com/numpy/numpy/issues/9210

scipy

✎ This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

Scipy builds more interesting stuff on top of the general-purpose numpy.

Scipy and the scikits may make it easier to quickly manipulate to and to visualize data (also via matplotlib). There are other packages you may like for visualisation, such as mayavi.

See also:

http://www.scipy.org/

scikits (scipy toolkits)

Separately installable scipy toolsets, often because they are specific specializations that just happen to build on scipy/numpy, or cannot be part of scipy because of licensing details.

See also:

http://scikits.appspot.com/scikits

Includes:

sound, DSP, and speech things:
- audiolab [9] (a.k.a. pyaudiolab, previously pyaudio), which wraps sndfile for various file format reading and writing
- samplerate (a.k.a. pysamplerate)
- Talkbox (signal/speech processing)

(Unsorted):
- (Sci)py-to-Matlab bridge (mlabwrap) [10]
- pymachine, machine learning (unfinished?) [11]
- Numerical optimization frameworks ([12] and [13])
- Approximate Nearest Neighbor Library (a wrapper for ANN)
- delaunay (...trangulation; see also voronoi)
- TimeSeries [14]
- RSFormats (reads common remote-sensing data formats)

As all Wiki pages, this page is editable, this means that you can modify the contents of this page simply by using your web-browser. Simply click on the "Edit this page" link at the bottom of the page. WikiFormatting will give you a detailed description of available Wiki formatting commands.

TracGuide is a good place to start.

http://scipy.org/scipy/scikits/wiki

Duplication between numpy and scipy

✎ This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

Since numpy already has some useful functionality, there is occasionally some duplication between numpy and scipy, e.g. in some linear algebra, some FFT.

In general, choose the scipy code. The stuff in numpy is often there for historical and backwards-compatibility reasons. The scipy code will often be a little more fleshed out, and sometimes faster. In a few cases it's just different implementations of the same idea.

Note that scipy starts with importing all basics from numpy (from numpy import *) so you can think of scipy as the main interface to work from.

See also:

http://stackoverflow.com/questions/6200910/relationship-between-scipy-and-numpy

Some specifics:

Scipy FFT versus numpy FFT

note that pyfftw.interfaces provides drop-in functions for both

numpy.fft and scipy.fftpack both use FFTPACK so should perform comparably

in practice scipy can be cleverer, e.g. runs checks whether your data is real and uses rfft if so. Yes you can do that yourself, but lazier programmers may find scipy isn't "randomly" slower.

scipy interleaves the result (verify)'s real and imaginary parts, though, which is slower to work with

Use of BLAS/LAPACK

It seems that packaged numpy and scipy are typically built against ATLAS

If you want to build them against e.g. OpenBLAS, you will need some work.

In theory it's little more than a pip install numpy/scipy after you have openblas, in practice it is more.

http://osdf.github.io/blog/numpyscipy-with-openblas-for-ubuntu-1204.html http://stackoverflow.com/questions/7496547/does-python-scipy-need-blas

Python usage notes - Numpy, scipy

Intro

numpy

Some concepts you may care about

On sizes/shapes and axes

Indexing / slicing

ufuncs and vectorization

Generating 1D arrays

Generating higher-D grids

Broadcasting

Structured arrays / record arrays

Data types

Type hierarchy, testing

On byte order (endianness)

Filtering, searching, counting

Iterating

Sorting

Sorting along an axis, independently

Sorting a table

See also (sorting)

Serializing

Going between python objects and numpy

Manipulating shape, size, and axes

Reshaping, reordering

Resizing (in the resampling/interpolation sense)

combining data

splitting

Growing

Padding

delete, insert

newaxis

More axis manipulation

Unsorted

0D array, scalars

Histograms, Binning as in dealing with values

Semi-sorted

dot versus matmul

Correlation

distance calculation

None

Bits

Bitmasks or other booleans

integers with unusual edges

Spherical/polar coordinates

Making masks

Warnings and errors

RuntimeWarning: invalid value encountered in less_equal

ValueError: data type must provide an itemsize

ValueError: zero-size array to some function without identity

RuntimeWarning: numpy.ufunc size changed, may indicate binary incompatibility

underflow encountered in exp

The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

Using a non-tuple sequence for multidimensional indexing is deprecated

elementwise == comparison failed and returning scalar instead; this will raise an error or perform elementwise comparison in the future

scipy

scikits (scipy toolkits)

Duplication between numpy and scipy

On speed

See also

Navigation menu