Python usage notes - Numpy, scipy

From Helpful
Jump to: navigation, search
Various things have their own pages, see Category:Python. Some of the pages that collect various practical notes include:

Intro

numpy is a C extension that does n-dimensional arrays - a relatively generic basis that other things can build on.

Numpy itself mostly does basic matrix operations, and some linear algebra, and interfaces with BLAS and LAPACK, so is fairly fast (certainly much preferable ver number crunching in pure-python code. It can still help to build against a specific BLAS).


It also defines the array interface, which allows sharing of data without knowing each other's APIs, so there are various other libraries that build on it with more specific goals, or just cooperate nicely, including:

  • scipy is an important one, as are its various optional:
    • scikits supporting many specific goals
  • pandas - for various data analysis, for a good part easing the pragmatic sides of loading and sellecting and summarizing and other massaging
  • PIL / Pillow does a bunch of image loading, saving and altering -- and lets you view images as arrays, making it easy and fast to go back and forth between other libraries mentioned here
Just because you may be interested, others include:
pandas, seaborn, plot.ly and ggplot seem to wrap matplotlib (verify) and make them easier and/or expand functionality
bokeh is targeted towards browsers
pygal creates SVG (so also browsers)
plot.ly and highchart are online-only (verify)
(most also integrate with ipython one way or another)

...etc.


There will be overlap between these (and they will often have subtle differences in behaviour or capabilities), e.g. between numpy and scipy's SVD, between FFTs in different places, between scipy's and PIL's image functions.


numpy

Some concepts you may care about

On sizes/shapes and axes

It helps to have a mental model of the way numpy deals with dimensions, yet it's also too flexible to have a singular one, so this axis stuff is partially just your own self-consistent convention.

However, it helps to have the same conventions as other people, and that of numpy functions('s default axis behaviour) where it has them.

As such, unless you have a reason otherwise, you may wish to think of:

  • axis 0 as rows
  • axis 1 as columns
  • axis 2 as depth in a 3D box, slices in a stack of same-sized images, or such.


Sometimes for ease, sometimes for performance. And you end up having to know about C order versus Fortran order and such, which is why pages like this tend to go into this in more depth than you end up using. It is good to know about this, though not central to use.


This is also why shape is a little funny to some people, depending on whether you come from a math background or not.

Math often describes matrices as y-by-x (row-by-column), so:

array([ [1, 2, 3],
        [2, 4, 5]] )

would be a 2-by-3 matrix, and its shape is (2,3)


Many other fields (e.g. images) may prefer width-by-height, and are going to be more confused.



Reshaping

Since numpy data is usually a contiguous block, the shape is just a way you and functions view/use it, and you can change just as long as the total size stays constant.


This has only a few direct use cases, mostly introducing dimensions.

a=np.arange(15)
# now a is
array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14])
# and a.shape is
(15,)
 
a.reshape(3,5)
# means a is
array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14]])
 
# while 
a.reshape(3,5,1)
# would give an array like
array([[[ 0], [ 1], [ 2], [ 3], [ 4]],
       [[ 5], [ 6], [ 7], [ 8], [ 9]],
       [[10], [11], [12], [13], [14]]])

Indexing / slicing

It's not just you, this stuff takes getting used to. It's quite consistent, though, so eventually becomes natural enough.


To steal and shorten some examples from here, given:

q = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

Numpy's array slicing syntax is similar to python's.

q[1:7]     == array([1, 2, 3, 4, 5, 6])
 
# with a non-1 step size:
q[1:7:2]   == array([1, 3, 5])
 
# negative indices interpreted as counts from the end (i.e. length plus it).
q[3:-4]    == array([3, 4, 5])
 
q[-4:3:-1] == array([6, 5, 4])



More-dimensional slicing:

Per-dimension wishes are separated with commas and yields a same-or-lower-dimensional array, depending on whether you ask for a range or specific index.

Also, shorthands that select from fewer axes implicitly ask to leave any unmentioned ones in.

# Say you have:
>>> r = np.array([ [[1],[2],[3]],
                   [[4],[5],[6]]])
# Which has shape (2, 3, 1)

For example, selecting the first row:

r[0,:,:] == array([[1], [2], [3]]) 
 
# r[0] does the same, and is a shorthand mostly because your selectivity lies in the first dimension(s)
 
# To go from shape (2,3,1) to (2,3), you would probably go for:
s = r[:,:,0]
 
 
# Note that wen you don't want to assume selecting from 3 dimensional arrays, you can use
r[0,...]  
 
r[...,0] 
# ...respectively, essentially meaning "{{inlinecode|:}} for the rest"


Aside from views on the data, indexing is also useful to selectively apply operations (ufuncs) to a matrix. For example:

# divide the second column of a 2D matrix
a[: , 1] /= 2
# or every third row from the same
a[0::3 , :] /= 2

...and if the array was 3D, you effectively omitted the third dimensional wish, so you selected all of it.

ufuncs and vectorization

Consider you can e.g. add or multiply by scalars:

>>> a = np.array([1,2,3,4])
>>> a * 2
array([2, 4, 6, 8])

This actually requires some glue to work (mixing python scalar with a numpy array), but it's so common and useful it's not at all surprising that this is special-cased to just work.


Numpy actually does this in a more generic way, by modeling the concept of a per-element function: a 'universal function', a.k.a. ufunc.

This lets you mix many different things (generating, transforming, etc) and allows for things like:

>>> S = np.array([.25, .5, .75, 1])
>>> np.sin( np.pi * S )
array([  7.07106781e-01,   1.00000000e+00,   7.07106781e-01,   1.22464680e-16  ])
 
>>> B = np.linspace(1,10, 10)
>>> 1 + np.exp( np.cos( np.pi * B ) )

Notes:

  • Applying ufuncs is usually faster than alternatives - not because it's parallel or so, but because the iteration and evaluation can happen in C (with vectorized computation where possible)
  • ufuncs include arithmetic, algebra, comparison logic, bitwise logic, things like conjugates.
  • combinations of ufuncs are effectively also ufuncs(verify)

Generating 1D arrays

Your options to generate such matrices include:

  • numpy.arange()
    , like range() but produces a numpy array
evenly spaced
Your basic start,stop,step
>>> numpy.arange(1,5,0.3)
array([ 1. ,  1.3,  1.6,  1.9,  2.2,  2.5,  2.8,  3.1,  3.4,  3.7,  4. ,
        4.3,  4.6,  4.9])
  • numpy.linspace()
Evenly spaced
start, stop, and amount_of_steps in which to get there
so the last value will always be stop
also avoids rounding errors at the endpoints (verify)
>>> numpy.linspace(1,5,14)
array([ 1.        ,  1.30769231,  1.61538462,  1.92307692,  2.23076923,
        2.53846154,  2.84615385,  3.15384615,  3.46153846,  3.76923077,
        4.07692308,  4.38461538,  4.69230769,  5.        ])


Generating higher-D grids

It is regularly useful to evaluate functions for a given set of values, e.g. on a grid. Options include:


  • reshape something, often one of the above, e.g.
# numpy.arange(1,10).reshape(3,3)
array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])


  • numpy.meshgrid()
n-dimensional coordinate matrices, from 1D coordinate vectors
modelled after matlab's meshgrid
Essentially: assume each of the inputs needs to go on one of the output dimensions, and broadcast them out.
# e.g. a 2D
x = numpy.array([1,  2,  3,  4])     # using linspace here is often more flexible
y = numpy.array([10, 20, 30])
XX, YY = numpy.meshgrid(x, y)
 
XX
# array([[1, 2, 3, 4],
#        [1, 2, 3, 4],
#        [1, 2, 3, 4]])
 
YY
# array([[10, 10, 10, 10],
#        [20, 20, 20, 20],
#        [30, 30, 30, 30]])
 
XX + YY 
# array([[11, 12, 13, 14],
#        [21, 22, 23, 24],
#        [31, 32, 33, 34]])
 
 
# This can e.g. be helpful when trying to write cleanish ufunc combinations that use both, e.g
DD = numpy.sqrt( XX*XX + YY*YY )


When it is enough to use integer based matrices, the above use has some shorter alternatives:

numpy.mgrid[]
(Grid-shaped arrays of evenly spaced numbers in N-dimensions)
XX, YY = numpy.mgrid[10:40:10, 1:5]
# same XX, YY as above


ogrid[]
('o'pen mesh grid) doesn't fill out the separate matrices, just sets up the shapes so they would broadcast the way you want.
XX, YY = numpy.ogrid[10:40:10, 1:5]
XX
# array([[10],
#        [20],
#        [30]])
YY
# array([[1, 2, 3, 4]])
 
XX+YY
# array([[11, 12, 13, 14],
#        [21, 22, 23, 24],
#        [31, 32, 33, 34]])


numpy.indices()
is like the combination of arange() and meshgrid(), counting in each direction.

This can be useful when you're working with images, or other pixel/voxel style data. For example, say you want the radius from an arbitrary center point.

# Some setup     (you may want to inspect intermediates yourself to see what's happening)
height, width = (5,5)
# subtract half: fractional pixels and zero-based numbers
#  intuitively: draw a 3x3 and 4x4 matrix and figure out where the center really is
cx = -0.5 + (0.5*width) 
cy = -0.5 + (0.5*height)
yarr,xarr = np.indices([height,width])  # this creates incrementing indices in each direction, so:
#   xarr is [0 1 2 3 4], [0 1 2 3 4], ...
#   yarr is  [0 0 0 0 0], [1 1 1 1 1], ...
 
# ...which makes the following easy to express (note: in a vectorized way):
radius = numpy.sqrt((xarr-cx)**2+(yarr-cy)**2)
# array([[ 2.82842712,  2.23606798,  2.        ,  2.23606798,  2.82842712],
#        [ 2.23606798,  1.41421356,  1.        ,  1.41421356,  2.23606798],
#        [ 2.        ,  1.        ,  0.        ,  1.        ,  2.        ],
#        [ 2.23606798,  1.41421356,  1.        ,  1.41421356,  2.23606798],
#        [ 2.82842712,  2.23606798,  2.        ,  2.23606798,  2.82842712]])
# Or more interestingly, something like a gaussian (here intended for a fourier-space filter)
import matplotlib.pyplot, numpy as np
 
h = 10 # in pixels
height, width = (20,20)
cx = -0.5 + (0.5*width) 
cy = -0.5 + (0.5*height)
yarr,xarr = np.indices([height,width])  # this creates incrementing indices in each direction, so:
gaussian_lowpass = np.exp( -(  ( (xarr-cx)**2 / h**2 ) +((yarr-cy)**2 / h**2)) )
 
matplotlib.pyplot.imshow(gaussian_lowpass, cmap='gray')
matplotlib.pyplot.show()



There is also:

  • scitools.numpyutils.meshgrid
    - extended version of numpy.meshgrid to 1D, 2D and 3D grids, with sparse or dense coordinate arrays and matrix or grid indexing
  • scitools.numpyutils.ndgrid
    - same as calling meshgrid with indexing=’ij’ (matrix indexing)
  • scitools.UniformBoxGrid
  • scitools.BoxGrid

...and a number of tools around them.

Broadcasting

When combining matrices with operations that are element-wise, the most obvious requirement would be identical shapes.


Combining non-identical shapes works by following some specific rules. Broadcasting basically means one array will be repeated to fit (much as if there's some code looping it for you)

a = array([[ 1., 2.], 
           [ 3., 4.], 
           [ 5., 6.]]) 
 
b = array([-1., 3.]) 
 
a + b 
# will yield:
array([[ 0., 5.], 
       [ 2., 7.], 
       [ 4., 9.]]) 
# ...because b was repeated downwards


This works when relevant axis sizes match, or are 1. For example:

array([[1, 2, 3], 
       [4, 5, 6]]) + array([10])
 
array([[1, 2, 3], 
       [4, 5, 6]]) + array([[10],
                            [100]])
 
array([[1, 2, 3], 
       [4, 5, 6]]) + array([10,20,30])


Cases may intuitively feel somewhat ambiguous when there are several axes onto which you could combine. For example:

a = array([[ 1., 2.], 
           [ 3., 4 ]]) 
b = array([-1., 3.]) 
 
a + b  
# array([[ 0.,  5.],
#        [ 2.,  7.]])
 
# It's not that it chooses the first axis rather than the second, 
#  it's that b's shape effectively a row vector
# If you want it to apply to the second axis
a + b[:,np.newaxis] 
array([[ 0.,  1.],
       [ 6.,  7.]])
# Transposing wouldn't work - b is 1D with shape (2,). 
#  you could introduce the new dimension (now shape (1,2)) and then transpose it, yes.
a + b[numpy.newaxis,:].T


See also:


Structured arrays / record arrays

You can load/save/use mixed data types, in a C-struct-like way.

You can also give fields names, which lets you access them that way. (This is handled within the numpy.recarray class, a subclass of numpy.ndarray).


Some notes:

You can mix them, but that's just confusing
I prefer the long-names-with-bit-size variant (e.g.
float32, uint16
) for readability
  • each type can be prefixed by a repetition, or a shape, e.g.
    3int8, (2,3)float64
  • You can define a data type with names, which gives you an array that you can also address with those names:
pt = np.dtype([
  ('chunk_id',   'a1'),   
  ('chunk_good', 'bool'),   # size of a bool is one byte
  ('chunk_used', 'bool'),   
  ('chunk_size', '<u4'),    # little-endian unsigned 32-bit integer
  ('format',     'S4'),     # 4-byte string
])
a = np.frombuffer('A\x00\x01\x01\x02\x03\x04qqhrB\x01\x00\x03\x06\x11\x12qqlr', dtype=pt)
 
a[0]['format'] 
# will be 'qhrr'
 
print repr(a)  # will show something like:
# array([('A', False, True, 67305985L, 'qqhr'),
#        ('B', True, False, 303105539L, 'qqlr')], # ...plus the dtype


See also:

Data types

NOTE: much of this needs to be cleaned, completed, or at least double-checked.


The below oversimplifies things, omits some details on implicit conversions, some serialized forms (such as 16-bit floats), and more. For completeness, check the actual documentation.

Most of the interesting predefined types you want to hand to dtype arguments are on the left:

numpy dtype object                     numpy string spec     string spec for python module struct
                                                             (only when using the non-default "standard mode"!)
                                                             (see note below)

numpy.float32                           f4 or float32              f             
numpy.float64                           f8 or float64              d            
numpy.int64                             i8 or int64                q             
numpy.uint64                            u8 or uint64               Q           
numpy.int32                             i4 or int32             i or l
numpy.uint32                            u4 or uint32            I or L
numpy.int16                             i2 or int16                h          
numpy.uint16                            u2 or uint16               H           
numpy.int8                              i1 or int8                 b            
numpy.uint8                             u1 or uint8                B           
numpy.bool        (uses a byte)         b1 or bool                 ?                 
numpy.complex64   (two 32-bit floats)   c4 or complex64            
numpy.complex128  (two 64-bit floats)   c8 or complex128                  

                                                    < for little-endian
                                                    > for big-endian
                                                    = for native       
                                                    | for not applicable
                                                    (default is often = or |(verify))
fixed-size list-like things: 
 # being the length you want
'S#' / 'a#'	bytestring                                       s or p
'U#'	        unicode string  (Py_UNICODE, so apparently UCS2 or UCS4 depending on the platform)
'V#'  	        raw data                                             


Notes:

  • Defaults:
    • The default float is often f8 (float64) (also when converting from python float)
    • 'complex' defaults to complex128.
    • the default integer may be i4 or i8 (platform-sized?(verify))


  • I mention the struct module, I left out array.
array's int sizes are platform-native, and there's no ability to control it like there is with struct, so it doesn't really fit in the above table.
struct is better, but still artuably more bother than necessary if you can use numpy
for the struct column, these are only valid when not using standard size (when using '<', '>', '!' or '='), not in native size (most int types can vary) See types. When you want portability, use standard mode.


Type hierarchy, testing

numpy has a hierarchy to its dtypes, which is useful when you e.g. want to test something in broader strokes - is it any variant of integer, is it signed or not, is it any sort of inexact number, etc.


Note that

the supertype of integer types is numpy.integer, which is the supertype of numpy.unsignedinteger and numpy.signedinteger
np.int and np.uint exist but are concrete types, so will not be a supertype of anything. Specifically, they are native-sized ints/uints
the supertype of float and complex is numpy.inexact
the supertype of float is numpy.floating
the superclass of string types and void* is numpy.flexible
For the tree, look for the docstring of numpy.core.numerictypes


So, for example, to see if something is any sort of integer you can use:

np.issubdtype(some_dtype, np.integer)
# or use the kind
#  (most commonly you are interested in f, u, i, or c)
ary.dtype.kind in ('u','i')


Or going further:

numpy.issubclass_(numpy.float32, numpy.inexact) == True
numpy.issubclass_(numpy.complex64, numpy.inexact) == True

numpy.issubclass_(numpy.int32,  numpy.integer) == True
numpy.issubclass_(numpy.int32,  numpy.signedinteger) == True
numpy.issubclass_(numpy.int32,  numpy.unsignedinteger) == False

numpy.issubclass_(numpy.uint32, numpy.integer) == True
numpy.issubclass_(numpy.uint32, numpy.signedinteger) == False
numpy.issubclass_(numpy.uint32, numpy.unsignedinteger) == True
On byte order (endianness)
Data's endianness is listed in dtype.byteorder, but that'll often be
=
for native (numpy.dtype.isnative would say True) (or
|
where it doesn't matter, e.g. 8-bit types)
  • dtype.descr seems to be more useful (verify)


changing endianness

ary.byteswap(inplace=True) # in-place meaning the backing bytes are re-ordered
ary.byteswap(inplace=False) # new array?(verify)
# as a view:     ( alternative: arr.view(arr.dtype.newbytorder(thatchar)) )
swapped   = o.newbyteorder('S') # swapped from its current endianness
as_little = o.newbyteorder('<')
as_big    = o.newbyteorder('>')
as_native = o.newbyteorder('=')


To find platform endianness', look at sys.byteorder, or do something like: (this should be clear, there are briefer ways)

oneone       = numpy.ones(1, dtype='=i2')
littleoneone = oneone.newbyteorder('<')
native_is_little = (oneone[0] == littleoneone[0])


See also:

Filtering, searching, counting

Iterating

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)


http://docs.scipy.org/doc/numpy/reference/arrays.nditer.html

Sorting

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

Sorting along an axis, independently

To do a sort along an axis, each separately, use sort() without order.

# new array:
numpy.sort(a, [axis, kind])
# in-place
a.sort([axis, kind])
  • axis - the axis to sort over
defaults to -1, the last axis (which seems to be the only variant that does not keep temporary data(verify), so is faster)
axis=None flattens the array
  • kind is one of: [1]
'quicksort' (fast, not stable)
'mergesort' (stable, extra work space)
'heapsort' (not stable)


Examples:

# given:
a = np.array(                 [[3,  1,  2],
                               [4,  9,  6],
                               [0,  2,  5]] )
 
# To sort within rows: (implied axis=-1, here the columns, and also equivalent to axis=1)
np.sort(a)  ==  array(        [[1,  2,  3],
                               [4,  6,  9],
                               [0,  2,  5]])
 
# To sort within columns:
np.sort(a, axis=0)  ==  array([[0,  1,  2],
                               [3,  2,  5],
                               [4,  9,  6]])
 
 
# Flatten (to give a 1D sorted array of all values)
np.sort(a, axis=None)  ==  array([1, 1, 2, 2, 3, 3, 4, 6, 9])

Sorting a table

...like you're sorting a table by one of its columns: reorder all columns the same way, by the values in one.

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)


If you do not have fields defined (not a structured array), then argsort() is probably the most flexible but it's not in-place, which can matter to you for large arrays. If it does, see below.

("argsort returns an array of indices of the same shape as a that index data along the given axis in sorted order."[2])


# sort columns, by the values in the second column
>>> a[ a[:,1].argsort() ]
array([[3, 1, 2],
       [1, 2, 3],
       [4, 9, 6]])
 
# sort rows, by the values in the third row
>>> a[ a[2,:].argsort() ]
array([[3, 1, 2],
       [4, 9, 6],
       [1, 2, 3]])



When sorting a structured array' (i.e. when you have fields defined):

>>> a=np.array([[(3, 2, 1)],
                [(9, 7, 8)],
                [(4, 5, 6)]],
               dtype=[('FOO', '<i8'), ('BAR', '<i8'), ('QUU', '<i8')])

...then you can sort by those fields (columns), in-place or not:

>>> a.sort(order=['FOO'], axis=0) # it seems you have to explicitly mention the right axis
>>> a                                   # even though it's the only one that makes sense
array([[(3, 2, 1)],
       [(4, 5, 6)],
       [(9, 7, 8)]],
      dtype=[('FOO', '<i8'), ('BAR', '<i8'), ('QUU', '<i8')])

You can order by one field, then another for cases where the first are equal. (no example; the data above does not have such a case).



To column-sort field-less arrays in-place (can make sense for very big arrays which you are using as tables), you have to fake fields.

You can use a view to add fields onto an existing array. Then call the view's .sort() to sort the backing data in-place.

Caveats:

  • you can still only really sort by columns.
  • You have to get the dtype right. Getting a view via a hardcoded dtype, e.g.
    a.view('i4,i4,i4')
    (or i, or similar) won't be remotely portable, unless you also created the array with that same hardcoded type (below, numpy chooses a's type, which could be i4 or i8 depending on context)


>>> a = np.array([(3, 1, 2),
                  (4, 9, 6),
                  (1, 2, 3)])
 
>>> b = a.view( ','.join( [str(a.dtype)]*a.shape[-1]) )  # TODO: check this
>>> b
array([[(3, 1, 2)],
       [(4, 9, 6)],
       [(1, 2, 3)]],
      dtype=[('f0', '<i8'), ('f1', '<i8'), ('f2', '<i8')])
 
>>> b.sort(order=['f0'], axis=0)
>>> a
array([[(1, 2, 3)]
       [(3, 1, 2)]
       [(4, 9, 6)]])

See also

Serializing

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

....often to/from disk, sometimes memory.

Note that if instead of a local back-and-forth to storage you want to use these for interchange, then in most cases you have to do the extra bookkeeping/metadata stuff related to shape and data type (including signedness and endianness).



pickle

You can use pickle on numpy objects.

You probably want to explicitly use pickle.HIGHEST_PROTOCOL as it's smaller on disk and writes and reads a lot faster than the simple-and-stupid default. I regularly see a factor 10 or 20 faster for writing (more when using cPickle over pickle) and reading (with little difference between pickle/cPickle).



Text files


savetxt()
saves to a space-(or-tab-or-such)-separated text file, which can be nice for trivial-to-code interchange with other things.
loadtxt()
and
genfromtxt()
are relatively simple text-file parsers. The latter seems slightly more flexible.


fromregex()
is like loadtxt()/genfromtxt(), but uses a regexps-capable parser to take out data.



Raw bytestrings (specify dtype)(verify)

Conversion to and from bytestrings:

  • tostring()
    basically constructs a python bytestring copy of just the array's byte storage
you'll have to keep track of shape and data type (including endianness) yourself, meaning this isn't hugely portable.
  • There is a similar
    numpy.fromstring(a,type)
    , though chances you'll only pair it with numpy.tostring(a)


tofile()
and
fromfile()
work on a file object instead.


You may consider
numpy.memmap()
, which lets you back an array by a mmap()ed file.

This can be a convenient way to read in known raw structures, avoiding allocation when backing files are large. Note, however, that these are limited to 2GB even on 64-bit archs.


frombuffer()
deals with any python buffer interface source (see also Python_usage_notes_-_struct,_buffer,_array,_bytes,_memoryview).

When compared to fromstring it can help avoid double allocation.


fromiter()
reads from any iterable, which can sometimes be a useful way to convert from python to numpy.




Matlab matrix data
There are some things added by scipy and by matplotlib, including
scipy.io.savemat()
and
scipy.io.loadmat()
.

Again, this can be nice for interchange with Matlab (or other things that know this format). (the saving speed, loading speed, and space used is pretty good.)




npy binary format
numpy.save()
and
numpy.load()
use numpy's own binary+metadata data format.

which is one of the few options that does store enough metadata for portability(verify), allows C-order and Fortran order, and allows multiple arrays.

numpy.savez()
stores a compressed variant (zip archive), which load() also understands.

For details of the file format used, look for npy format

Note that you can save to a StringIO/BytesIO, which can be nice when passing this into a database.

Going between python objects and numpy

In general, work in numpy, and minimize the amount of conversions you do.


Numpy to python

You're probably looking for
tolist()
, which creates a python list(-of-lists) filled with conversions to the closest applicable types.
Avoid going to python until/unless necessary - the copy to python is slowish, and working in python is slow compared to numpy


Python to numpy

Since numpy allocates its own contiguous array, this will also be a copy.
...although objects that implement the array protocol (and satisfy some other requirements?{{verify}) may be viewed instead.

Manipulating shape, size, and axes

Reshaping, reordering

These two do not change the backing data, it just looks at the same data differently (and so requires that new size fits exactly).

It returns a different-shaped view on the same backing data.

For example, say that you have:

>>> a = array([ 1, 2,  3, 4, 5, 6, 7, 8, 9, 10, 11, 12 ]) # e.g. from linspace
>>> a.reshape( (6,2) )
array([[  1,   2 ],
       [  3,   4 ],
       [  5,   6 ],
       [  7,   8 ],
       [  9,  10 ],
       [ 11,  12 ]])

>>> a.reshape(  (2,6) )
array([[  1,   2,   3,   4,   5,   6  ],
       [  7,   8,   9,  10,  11,  12  ]])

Notes:

  • One of the lengths can be -1, which means "it's an exact multiple and you figure it out".
  • Note that when you reshape an array, the order= argument does not mean 'output should be this ordering' (as it usually does), but (?). If a copy is necessary, it will be C-ordered. If it can be done without a copy, the ordering depends on other factors. If you need the output in a specific ordering, you may need a .copy() with order= argument.
  • 1D flattening
    • numpy.ravel() is a reshape into 1D, and is essentially
      {{{1}}}
    • ndarray.flatten() seems to always returns a copy (verify)
    • ndarray.flat() gives a numpy iterator (relatively like a numpy iterator)


For certain specific conversions, allocation as (or readout as, which can be differnt) Fortran order or C order indexing can be relevant.

Reminder:

C-style index ordering: the last axis index changing fastest, back to the first axis index changing slowest.
Fortran-style index order: with the first index changing fastest, and the last index changing slowest

ndarray.flags tells you the array's current properties

  • numpy.asfortranarray()</tt>
  • numpy.ascontiguousarray()</tt> (C order)

numpy.require() can require ordering among other things (alignment, writeability, that we own the data) and will make a copy if necessary


http://stackoverflow.com/questions/17558388/reordering-of-numpy-arrays


Notes:

  • Iteration is always / defaults to C order (verify)


Resizing (in the interpolation sense)

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

(Note that most of these will return copied data)



decimating via indexing (fast, not accurate)

  • e.g.
    image[::2,::2]
    just selects every second pixel


Block means (in image processing contexts this is sometimes called coarsening, or binning)

  • skimage.measure.block_reduce
does something per non-overlapping rectangle, e.g.
block_reduce(image, block_size=(2,2),   func=np.mean)  # default func is np.sum
block_reduce(image, block_size=(2,3,1), func=np.max)
  • Presumably the code here does the same thing (haven't looked yet)


reduction of size by integer factor -- but using interpolation (order-8 Chebyshev type I, or 30 point FIR)


e.g. map_coordinates(), which is actually way more general


nearest, bilinear, bicubic, cubic, lanczos
meant for PIL interaction so coerces results to integer(verify)
e.g. zoom(), using spline interpolation of, by default, order 3

combining data

splitting

chunk up along a particular axis, return them as separate arrays (views?). (Sort of the counterpart of stacking)


  • numpy.split(axis=something) splits an array into a list of multiple sub-arrays of equal size.
numpy.hsplit() is split() with axis=1, i.e. horizontally, column-wise
numpy.vsplit() is split() with axis=0, i.e. vertically, row-wise
numpy.dsplit() is split() with axis=2, i.e. depth-wise
  • numpy.array_split() is similar, but when splitting into a given amount of items it does not require equal division.
>>> c = numpy.array([[ 1.,  2.],
                     [ 3.,  4.],
                     [ 5.,  6.],
                     [ 7.,  8.]])
>>> numpy.vsplit(c, 2)
[array([[ 1.,  2.],  [ 3.,  4.]]), 
 array([[ 5.,  6.],  [ 7.,  8.]])]
 
>>> numpy.hsplit(c, 2)
[array([[ 1.],
        [ 3.],
        [ 5.],
        [ 7.]]),
 array([[ 2.],
        [ 4.],
        [ 6.],
        [ 8.]])
]
>>> x = np.arange(9.0)
>>> np.split(x, 3)
[array([ 0.,  1.,  2.]), array([ 3.,  4.,  5.]), array([ 6.,  7.,  8.])]
>>>
>>> np.split(x, [3, 5, 6, 10])
[array([ 0.,  1.,  2.]), array([ 3.,  4.]), array([ 5.]), array([ 6.,  7.,  8.]), array([], dtype=float64)]
>>> np.array_split(x, 4)
[array([ 0.,  1.,  2.]), array([ 3.,  4.]), array([ 5.,  6.]), array([ 7.,  8.])]


See also:

Growing

You can't really change the size of a numpy array without reallocating and copying.

a basic implication of it being backed contiguously in memory, which is good for speed
(could the realloc expand the existing allocation?(verify))
yes, this implies that numpy arrays that are ~30% of memory may not make you happy. (You may wish to consider better-informed C code, or some well-informed numpy calls)


If you know the eventual size of the array, allocate that size immediately and assign data as you go.

This is probably the most memory-efficient way to go


If you don't need numpy operations on the axis of growth, consider making a python list of numpy matrices.

(pretty much because that isn't contiguous)
  • If you don't know the final size, but do want to end up with a single numpy array,
...consider first making the python list of numpy arrays, and eventually join or stack them
it will, for a short time, mean you use twice the memory that the data needs.
...but so will attempting to grow an existing array


delete, insert

newaxis

More axis manipulation

Unsorted

0D array, scalars=

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)


Histograms, Binning as in dealing with values

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

(cf. binning as in block-means coarsening -- see resizing)


numpy.digitize [3]


scipy.stats.binned_statistic [4]


numpy.bincount [5]


bin_means = numpy.histogram(data, bins, weights=data)[0] / numpy.histogram(data, bins)[0]


http://stackoverflow.com/questions/23594262/binning-values-of-a-function-in-python-numpy

Semi-sorted

Bits

  • numpy.unpackbits converts 8-bits-in-an-uint8 to 1-bit-per-uint8
>>> np.unpackbits(array([[2], [7], [23]], dtype=np.uint8), axis=1)
array([[0, 0, 0, 0, 0, 0, 1, 0],
       [0, 0, 0, 0, 0, 1, 1, 1],
       [0, 0, 0, 1, 0, 1, 1, 1]], dtype=uint8)
  • numpy.packbits converts 1-bit-per-someinteger to uint8, 8 bits at a time
Does not need the axis to be a unit of 8 long; for example:
>>> np.packbits(array([  [ [1,0,1], [0,1,0] ],  [ [1,1,0], [0,0,1] ]  ])
array( [  [ [160],[64] ]  ,  [ [192],[32]] ], dtype=uint8) 


If you want to load from an integer format packed some less usual way, often raw image data (4, 10, 12 bits per pixel, etc.)

If you want to load it in one go, efficiently, you may want to go with C, e.g.
write a small numpy C extension
use ctypes to interface with some known shared object
write it in fortran (actually seems somewhat less hairy than C(verify))
some other variant
if the data isn't huge, it is easiest to load it into bytes, then bit-massage it into an existing array with the dtype you want


Spherical/polar coordinates

Making masks

Warnings and errors

ValueError: data type must provide an itemsize

Often means you're mixing things that don't go into an array (with the given data type - or the default, float) in a natural way - say, a bunch of numbers and a string as-is.


ValueError: zero-size array to some function without identity

Really just means you're handing along an empty array.

There are a bunch of utility functions that assume they won't be handed an empty array, usually because it makes no sense.


In my case, I had an off-by-one error in some slicing logic.


scipy

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

Scipy builds more interesting stuff on top of the general-purpose numpy.

Scipy and the scikits may make it easier to quickly manipulate to and to visualize data (also via matplotlib). There are other packages you may like for visualisation, such as mayavi.


See also:


scikits (scipy toolkits)

Separately installable scipy toolsets, often because they are specific specializations that just happen to build on scipy/numpy, or cannot be part of scipy because of licensing details.

See also:

Includes:

  • sound, DSP, and speech things:
    • audiolab [6] (a.k.a. pyaudiolab, previously pyaudio), which wraps sndfile for various file format reading and writing
    • samplerate (a.k.a. pysamplerate)
    • Talkbox (signal/speech processing)
  • (Unsorted):
    • (Sci)py-to-Matlab bridge (mlabwrap) [7]
    • pymachine, machine learning (unfinished?) [8]
    • Numerical optimization frameworks ([9] and [10])
    • Approximate Nearest Neighbor Library (a wrapper for ANN)
    • delaunay (...trangulation; see also voronoi)
    • TimeSeries [11]
    • RSFormats (reads common remote-sensing data formats)

As all Wiki pages, this page is editable, this means that you can modify the contents of this page simply by using your web-browser. Simply click on the "Edit this page" link at the bottom of the page. WikiFormatting will give you a detailed description of available Wiki formatting commands.

TracGuide is a good place to start.


http://scipy.org/scipy/scikits/wiki

Duplication between numpy and scipy

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

Since numpy already has some useful functionality, there is occasionally some duplication between numpy and scipy, e.g. in some linear algebra, some FFT.


In general, choose the scipy code. The stuff in numpy is often there for historical and backwards-compatibility reasons. The scipy code will often be a little more fleshed out, and sometimes faster. In a few cases it's just different implementations of the same idea.


Note that scipy starts with importing all basics from numpy (
from numpy import *
) so you can think of scipy as the main interface to work from.


See also:


Some specifics:

  • Scipy FFT versus numpy FFT
note that pyfftw.interfaces provides drop-in functions for both
numpy.fft and scipy.fftpack both use FFTPACK so should very comparable
in practice scipy can be cleverer, e.g. runs checks whether your data is real and uses rfft if so. Yes you can do that yourself, but lazier programmers may find scipy isn't "randomly" slower.
scipy interleaves the result (verify)'s real and imaginary parts, though, which is slower to work with
See also http://stackoverflow.com/questions/6363154/what-is-the-difference-between-numpy-fft-and-scipy-fftpack


It seems that packaged numpy and scipy are typically built against ATLAS
If you want to build them against e.g. OpenBLAS, you will need some work.
In theory it's little more than a pip install numpy/scipy after you have openblas, in practice it is more.

http://osdf.github.io/blog/numpyscipy-with-openblas-for-ubuntu-1204.html http://stackoverflow.com/questions/7496547/does-python-scipy-need-blas


On speed

See also

The array interface used by numpy and related:


Related in some way or other: