Difference between revisions of "Python usage notes - Numpy, scipy"
m (→Correlation) |
m (→Semi-sorted) |
||
Line 1,582: | Line 1,582: | ||
===Semi-sorted=== | ===Semi-sorted=== | ||
+ | |||
+ | <!-- | ||
+ | |||
+ | http://wiki.scipy.org/Tentative_NumPy_Tutorial | ||
+ | |||
+ | |||
+ | http://www.engr.ucsb.edu/~shell/che210d/numpy.pdf | ||
+ | |||
+ | |||
+ | http://pages.physics.cornell.edu/~myers/teaching/ComputationalMethods/python/arrays.html | ||
+ | |||
+ | |||
+ | --> | ||
+ | |||
====Correlation==== | ====Correlation==== | ||
Line 1,592: | Line 1,606: | ||
covariance(X, Y) / (stdv(X) * stdv(Y) | covariance(X, Y) / (stdv(X) * stdv(Y) | ||
: which makes a number of assumptions about the distribution, basically that it's gaussian | : which makes a number of assumptions about the distribution, basically that it's gaussian | ||
− | |||
* Spearman’s correlation | * Spearman’s correlation | ||
: similar idea, based on rank of each sample, basically to not assume much about the distribution {{verify}} | : similar idea, based on rank of each sample, basically to not assume much about the distribution {{verify}} | ||
− | |||
* Kendall's tau is something you'ld use on [[ordinal data]] | * Kendall's tau is something you'ld use on [[ordinal data]] | ||
Line 1,605: | Line 1,617: | ||
--> | --> | ||
− | |||
<!-- | <!-- | ||
− | + | numpy.corrcoef - pearson's r | |
− | + | ||
− | numpy.corrcoef | + | |
* if running on matrices, it will give a correlation matrix | * if running on matrices, it will give a correlation matrix | ||
* if you meant overall correlation piecewise, then .flatten() may be the simplest solution | * if you meant overall correlation piecewise, then .flatten() may be the simplest solution | ||
+ | |||
+ | |||
+ | numpy.correlate - is cross-correlation | ||
+ | : \Sum a[n+k] * conj(v[n]) | ||
Revision as of 17:03, 5 December 2019
Syntaxish: syntax and language · importing, modules, packages · iterable stuff · concurrency
IO: networking and web · filesystem Data: Numpy, scipy · pandas · struct, buffer, array, bytes, memoryview · Python database notes Image, Visualization: PIL · Matplotlib, pylab · seaborn · bokeh · plotly
Processes: threading · subprocess · multiprocessing · joblib · pty and pexpect Stringy: strings, unicode, encodings · regexp · command line argument parsing · XML |
Contents
- 1 Intro
- 2 numpy
- 2.1 Some concepts you may care about
- 2.2 Filtering, searching, counting
- 2.3 Iterating
- 2.4 Sorting
- 2.5 Serializing
- 2.6 Going between python objects and numpy
- 2.7 Manipulating shape, size, and axes
- 2.8 Histograms, Binning as in dealing with values
- 2.9 Semi-sorted
- 2.10 Warnings and errors
- 2.10.1 RuntimeWarning: invalid value encountered in less_equal
- 2.10.2 ValueError: data type must provide an itemsize
- 2.10.3 ValueError: zero-size array to some function without identity
- 2.10.4 RuntimeWarning: numpy.ufunc size changed, may indicate binary incompatibility
- 2.10.5 underflow encountered in exp
- 2.10.6 The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
- 2.10.7 Using a non-tuple sequence for multidimensional indexing is deprecated
- 2.10.8 elementwise == comparison failed and returning scalar instead; this will raise an error or perform elementwise comparison in the future
- 3 scipy
- 4 On speed
- 5 See also
Intro
numpy is a C extension that does n-dimensional arrays - a relatively generic basis that other things can build on.
Numpy itself mostly does basic matrix operations, and some linear algebra, and interfaces with BLAS and LAPACK, so is fairly fast (certainly much preferable ver number crunching in pure-python code. It can still help to build against a specific BLAS).
It also defines the array interface, which allows sharing of data without knowing each other's APIs, so there are various other libraries that build on it with more specific goals, or just cooperate nicely, including:
- pandas - for various data analysis, for a good part easing the pragmatic sides of loading and sellecting and summarizing and other massaging
- PIL / Pillow does a bunch of image loading, saving and altering -- and lets you view images as arrays, making it easy and fast to go back and forth between other libraries mentioned here
- matplotlib and mayavi for visualization (see also matplotlib notes)
- Just because you may be interested, others include:
...etc.
There will be overlap between these (and they will often have subtle differences in behaviour or capabilities), e.g.
between numpy and scipy's SVD,
between FFTs in different places,
between scipy's and PIL's image functions.
numpy
Some concepts you may care about
On sizes/shapes and axes
It helps to have a mental model of the way numpy deals with dimensions, yet it's also too flexible to have one true interpretation.
So this axis stuff is partially just your own self-consistent convention, but it often pays to have the same conventions as other people, and that of numpy functions ('s default axis behaviour) where it has them.
As such, unless you have a reason otherwise, you may wish to think of:
- axis 0 as rows
- axis 1 as columns
- axis 2 as depth in a 3D box, slices in a stack of same-sized images, or such.
Doing this can avoid confusion later, and sometimes helps performance.
If you dig deeper e.g. because you want efficient transfers to other software, you also end up having to know about C order versus Fortran order and such. In general use you often wouldn't care about this.
This is also why shape is a little funny to some people, depending on whether you come from a math background or not.
Math often describes matrices as y-by-x ,row-by-column), so:
array([ [1, 2, 3], [2, 4, 5]] )
would be a 2-by-3 matrix, and its shape is (2,3)
Many other fields (e.g. image processing) may prefer width-by-height, and are going to be more confused.
- Reshaping
Since array data is a contiguous block, the shape is actually just metadata thae control how you and functions view/use that data.
Which means you can change the shape as long as the total size stays constant. This has few direct use cases, mostly flattening away and introducing dimensions.
a=np.arange(15) # now a is array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]) # and a.shape is (15,) a.reshape(3,5) # means a is array([[ 0, 1, 2, 3, 4], [ 5, 6, 7, 8, 9], [10, 11, 12, 13, 14]]) # while a.reshape(3,5,1) # would give an array like array([[[ 0], [ 1], [ 2], [ 3], [ 4]], [[ 5], [ 6], [ 7], [ 8], [ 9]], [[10], [11], [12], [13], [14]]])
Indexing / slicing
It's not just you, this stuff takes getting used to.
It's quite consistent, though, so eventually becomes natural enough.
To steal and shorten some examples from here, given:
q = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
Numpy's array slicing syntax is similar to python's.
q[1:7] == array([1, 2, 3, 4, 5, 6]) # with a non-1 step size: q[1:7:2] == array([1, 3, 5]) # negative indices interpreted as counts from the end (i.e. length plus it). q[3:-4] == array([3, 4, 5]) # negative stepping is also allowed q[-4:3:-1] == array([6, 5, 4])
- More-dimensional slicing:
Per-dimension wishes are separated with commas and yields a same-or-lower-dimensional array, depending on whether you ask for a range or specific index.
# Say you have: >>> r = np.array([ [[1],[2],[3]], [[4],[5],[6]]]) # Which has shape (2, 3, 1)
For example, selecting the first row:
r[0,:,:] == array([[1], [2], [3]]) # To go from shape (2,3,1) to (2,3), you would probably go for: s = r[:,:,0]
There are some shorthands you may care to know about.
Aside from views on the data, slicing is also useful to selectively apply operations (ufuncs) to a matrix. For example:
a[: , 1] /= 2 # divide the second column of a 2D matrix by two a[0::3 , :] /= 2 # divide every third row by two
...and if the array was 3D, you effectively omitted the third dimensional wish, so you selected all of it.
ufuncs and vectorization
Consider you can e.g. add or multiply by scalars:
>>> a = np.array([1,2,3,4]) >>> a * 2 array([2, 4, 6, 8])
This actually requires some glue to work (particularly because it mixes a python scalar with a numpy array), but it's so basic and useful that many such things are special-cased to just work (and that glue is often relatively simple, because...).
Numpy models the concept of a per-element function, called a 'universal function', better known as a ufunc.
This lets you mix many different things (generating, transforming, etc) and allows for things like:
>>> S = np.array([.25, .5, .75, 1]) >>> np.sin( np.pi * S ) array([ 7.07106781e-01, 1.00000000e+00, 7.07106781e-01, 1.22464680e-16 ]) >>> B = np.linspace(1,10, 10) # (which is array([1,2,3,4,5,6,7,8,9,10])) >>> 1 + np.exp( np.cos( np.pi * B ) )
Notes:
- Applying ufuncs is usually faster than alternatives - not because it's parallel or such, but because the iteration and evaluation can often happen in C (sometimes with vectorized computation, where possible)
- ufuncs include arithmetic, algebra, comparison logic, bitwise logic, things like conjugates.
- combinations of ufuncs are effectively also ufuncs(verify)
Generating 1D arrays
Your options to generate such matrices include:
- numpy.arange(), similar to python's range() but produces a numpy array
- evenly spaced
- Your basic start,stop,step
- deals with floats
>>> numpy.arange(1, 5, 0.3) array([ 1. , 1.3, 1.6, 1.9, 2.2, 2.5, 2.8, 3.1, 3.4, 3.7, 4. , 4.3, 4.6, 4.9])
- numpy.linspace()[1]
- Evenly spaced
- start, stop, and amount_of_steps in which to get there
- avoids rounding errors at the endpoints (verify) so the last value will always be stop
>>> numpy.linspace(1,5,14) array([ 1. , 1.30769231, 1.61538462, 1.92307692, 2.23076923, 2.53846154, 2.84615385, 3.15384615, 3.46153846, 3.76923077, 4.07692308, 4.38461538, 4.69230769, 5. ])
Generating higher-D grids
It is regularly useful to evaluate functions for a given set of values, e.g. on a grid. Options include:
- reshape something, often one of the above, e.g.
# numpy.arange(1,10).reshape(3,3) array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
- numpy.meshgrid()
- n-dimensional coordinate matrices, from 1D coordinate vectors
- modelled after matlab's meshgrid
- intuitively (for two): assume these two things need to be in the X and Y axis, so broadcast them according to the shape of the other
# e.g. a 2D x = numpy.array([1, 2, 3, 4]) # using linspace here is often more flexible y = numpy.array([10, 20, 30]) XX, YY = numpy.meshgrid(x, y) XX # array([[1, 2, 3, 4], # [1, 2, 3, 4], # [1, 2, 3, 4]]) YY # array([[10, 10, 10, 10], # [20, 20, 20, 20], # [30, 30, 30, 30]]) XX + YY # array([[11, 12, 13, 14], # [21, 22, 23, 24], # [31, 32, 33, 34]]) # This is helpful e.g. when trying to write cleanish ufunc combinations that use both, e.g x = numpy.array([0.1, 0.2, 0.3, 0.4, 0.5]) y = numpy.array([0.1, 0.2, 0.3, 0.4, 0.5]) XX, YY = numpy.meshgrid(x, y) numpy.sqrt( XX*XX + YY*YY ) #array([[0.141, 0.224, 0.316, 0.412, 0.51 ], # [0.224, 0.283, 0.361, 0.447, 0.539], # [0.316, 0.361, 0.424, 0.5 , 0.583], # [0.412, 0.447, 0.5 , 0.566, 0.64 ], # [0.51 , 0.539, 0.583, 0.64 , 0.707]])
When it is enough to use integer based matrices, the above use has some shorter alternatives:
XX, YY = numpy.mgrid[10:40:10, 1:5] # same as the first XX, YY above
XX, YY = numpy.ogrid[10:40:10, 1:5] XX # array([[10], # [20], # [30]]) YY # array([[1, 2, 3, 4]]) XX+YY # array([[11, 12, 13, 14], # [21, 22, 23, 24], # [31, 32, 33, 34]])
This can be useful when you're working with images, or other pixel/voxel style data. For example, say you want the radius from an arbitrary center point.
# Some setup (you may want to inspect intermediates yourself to see what's happening) height, width = (5,5) # subtract half: fractional pixels and zero-based numbers # intuitively: draw a 3x3 and 4x4 matrix and figure out where the center really is cx = -0.5 + (0.5*width) cy = -0.5 + (0.5*height) yarr,xarr = np.indices([height,width]) # this creates incrementing indices in each direction, so: # xarr is [0 1 2 3 4], [0 1 2 3 4], ... # yarr is [0 0 0 0 0], [1 1 1 1 1], ... # ...which makes the following easy to express (note: in a vectorized way): numpy.sqrt((xarr-cx)**2+(yarr-cy)**2) #array([[2.828, 2.236, 2. , 2.236, 2.828], # [2.236, 1.414, 1. , 1.414, 2.236], # [2. , 1. , 0. , 1. , 2. ], # [2.236, 1.414, 1. , 1.414, 2.236], # [2.828, 2.236, 2. , 2.236, 2.828]])
# Or more interestingly, something like a gaussian (here intended for a fourier-space filter) import matplotlib.pyplot, numpy as np h = 10 # in pixels height, width = (20,20) cx = -0.5 + (0.5*width) cy = -0.5 + (0.5*height) yarr,xarr = np.indices([height,width]) # this creates incrementing indices in each direction, so: gaussian_lowpass = np.exp( -( ( (xarr-cx)**2 / h**2 ) +((yarr-cy)**2 / h**2)) ) matplotlib.pyplot.imshow(gaussian_lowpass, cmap='gray') matplotlib.pyplot.show()
There is also:
- scitools.numpyutils.meshgrid- extended version of numpy.meshgrid to 1D, 2D and 3D grids, with sparse or dense coordinate arrays and matrix or grid indexing
- scitools.numpyutils.ndgrid- same as calling meshgrid with indexing=’ij’ (matrix indexing)
- scitools.UniformBoxGrid
- scitools.BoxGrid
...and a number of tools around them.
Broadcasting
When combining matrices with operations that are element-wise, the most obvious requirement would be identical shapes.
Combining non-identical shapes works by following some specific rules.
Broadcasting basically means one array will be repeated to fit (much as if there's some code looping it for you)
a = array([[ 1., 2.], [ 3., 4.], [ 5., 6.]]) b = array([-1., 3.]) a + b # will yield: array([[ 0., 5.], [ 2., 7.], [ 4., 9.]]) # ...because b was repeated downwards
This works when relevant axis sizes match, or are 1. For example:
array([[1, 2, 3], [4, 5, 6]]) + array([10]) array([[1, 2, 3], [4, 5, 6]]) + array([[10], [100]]) array([[1, 2, 3], [4, 5, 6]]) + array([10,20,30])
Cases may intuitively feel somewhat ambiguous when there are several axes onto which you could combine.
For example:
a = array([[ 1., 2.], [ 3., 4 ]]) b = array([-1., 3.]) a + b # array([[ 0., 5.], # [ 2., 7.]]) # It's not that it chooses the first axis rather than the second, # it's that b's shape effectively a row vector # If you want it to apply to the second axis a + b[:,np.newaxis] array([[ 0., 1.], [ 6., 7.]]) # Transposing wouldn't work - b is 1D with shape (2,). # you could introduce the new dimension (now shape (1,2)) and then transpose it, yes. a + b[numpy.newaxis,:].T
See also:
- http://scipy-lectures.github.io/intro/numpy/numpy.html#broadcasting
- http://www.scipy.org/EricsBroadcastingDoc
Structured arrays / record arrays
You can load/save/use mixed data types, in a C-struct-like way.
You can also give fields names, which lets you access them that way. (This is handled within the numpy.recarray class, a subclass of numpy.ndarray).
Some notes:
- There are (confusingly) many different ways of telling numpy about the format. See: http://docs.scipy.org/doc/numpy/user/basics.rec.html#defining-structured-arrays
- You can mix them, but that's just confusing
- I prefer the long-names-with-bit-size variant (e.g. float32, uint16) for readability
- each type can be prefixed by a repetition, or a shape, e.g. 3int8, (2,3)float64
- You can define a data type with names, which gives you an array that you can also address with those names:
pt = np.dtype([ ('chunk_id', 'a1'), ('chunk_good', 'bool'), # size of a bool is one byte ('chunk_used', 'bool'), ('chunk_size', '<u4'), # little-endian unsigned 32-bit integer ('format', 'S4'), # 4-byte string ]) a = np.frombuffer('A\x00\x01\x01\x02\x03\x04qqhrB\x01\x00\x03\x06\x11\x12qqlr', dtype=pt) a[0]['format'] # will be 'qhrr' print repr(a) # will show something like: # array([('A', False, True, 67305985L, 'qqhr'), # ('B', True, False, 303105539L, 'qqlr')], # ...plus the dtype
See also:
Data types
NOTE: much of this needs to be cleaned, completed, or at least double-checked.
The below oversimplifies things, omits some details on implicit conversions, some serialized forms (such as 16-bit floats), and more. For completeness, check the actual documentation.
Most of the interesting predefined types you want to hand to dtype arguments are on the left:
numpy dtype object numpy string spec string spec for python module struct (only when using the non-default "standard mode"!) (see note below) numpy.float32 f4 or float32 f numpy.float64 f8 or float64 d numpy.int64 i8 or int64 q numpy.uint64 u8 or uint64 Q numpy.int32 i4 or int32 i or l numpy.uint32 u4 or uint32 I or L numpy.int16 i2 or int16 h numpy.uint16 u2 or uint16 H numpy.int8 i1 or int8 b numpy.uint8 u1 or uint8 B numpy.bool (uses a byte) b1 or bool ? numpy.complex64 (two 32-bit floats) c4 or complex64 numpy.complex128 (two 64-bit floats) c8 or complex128 < for little-endian > for big-endian = for native | for not applicable (default is often = or |(verify)) fixed-size list-like things: # being the length you want 'S#' / 'a#' bytestring s or p 'U#' unicode string (Py_UNICODE, so apparently UCS2 or UCS4 depending on the platform) 'V#' raw data
Notes:
- Defaults:
- The default float is often f8 (float64) (also when converting from python float)
- 'complex' defaults to complex128.
- the default integer may be i4 or i8 (platform-sized?(verify))
- I mention the struct module, I left out array.
- array's int sizes are platform-native, and there's no ability to control it like there is with struct, so it doesn't really fit in the above table.
- struct is better, but still artuably more bother than necessary if you can use numpy
- for the struct column, these are only valid when not using standard size (when using '<', '>', '!' or '='), not in native size (most int types can vary) See types. When you want portability, use standard mode.
Type hierarchy, testing
numpy has a hierarchy to its dtypes, which is useful when you e.g. want to test something in broader strokes - is it any variant of integer, is it signed or not, is it any sort of inexact number, etc.
Note that
- the supertype of integer types is numpy.integer, which is the supertype of numpy.unsignedinteger and numpy.signedinteger
- np.int and np.uint exist but are concrete types, so will not be a supertype of anything. Specifically, they are native-sized ints/uints
- the supertype of float and complex is numpy.inexact
- the supertype of float is numpy.floating
- the superclass of string types and void* is numpy.flexible
- For the tree, look for the docstring of numpy.core.numerictypes
So, for example, to see if something is any sort of integer you can use:
np.issubdtype(some_dtype, np.integer) # or use the kind # (most commonly you are interested in f, u, i, or c) ary.dtype.kind in ('u','i')
Or going further:
numpy.issubclass_(numpy.float32, numpy.inexact) == True numpy.issubclass_(numpy.complex64, numpy.inexact) == True numpy.issubclass_(numpy.int32, numpy.integer) == True numpy.issubclass_(numpy.int32, numpy.signedinteger) == True numpy.issubclass_(numpy.int32, numpy.unsignedinteger) == False numpy.issubclass_(numpy.uint32, numpy.integer) == True numpy.issubclass_(numpy.uint32, numpy.signedinteger) == False numpy.issubclass_(numpy.uint32, numpy.unsignedinteger) == True
On byte order (endianness)
Data's endianness is listed in dtype.byteorder, but that'll often be- dtype.descr seems to be more useful (verify)
changing endianness
ary.byteswap(inplace=True) # in-place meaning the backing bytes are re-ordered ary.byteswap(inplace=False) # new array?(verify) # as a view: ( alternative: arr.view(arr.dtype.newbytorder(thatchar)) ) swapped = o.newbyteorder('S') # swapped from its current endianness as_little = o.newbyteorder('<') as_big = o.newbyteorder('>') as_native = o.newbyteorder('=')
To find platform endianness', look at sys.byteorder, or do something like: (this should be clear, there are briefer ways)
oneone = numpy.ones(1, dtype='=i2') littleoneone = oneone.newbyteorder('<') native_is_little = (oneone[0] == littleoneone[0])
See also:
- http://docs.scipy.org/doc/numpy/reference/arrays.scalars.html
- http://docs.scipy.org/doc/numpy/reference/arrays.dtypes.html
Filtering, searching, counting
Iterating
This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me) |
http://docs.scipy.org/doc/numpy/reference/arrays.nditer.html
Sorting
This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me) |
Sorting along an axis, independently
To do a sort along an axis, each separately, use sort() without order.
# new array: numpy.sort(a, [axis, kind]) # in-place a.sort([axis, kind])
- axis - the axis to sort over
- defaults to -1, the last axis (which seems to be the only variant that does not keep temporary data(verify), so is faster)
- axis=None flattens the array
- kind is one of: [2]
- 'quicksort' (fast, not stable)
- 'mergesort' (stable, extra work space)
- 'heapsort' (not stable)
Examples:
# given: a = np.array( [[3, 1, 2], [4, 9, 6], [0, 2, 5]] ) # To sort within rows: (implied axis=-1, here the columns, and also equivalent to axis=1) np.sort(a) == array( [[1, 2, 3], [4, 6, 9], [0, 2, 5]]) # To sort within columns: np.sort(a, axis=0) == array([[0, 1, 2], [3, 2, 5], [4, 9, 6]]) # Flatten (to give a 1D sorted array of all values) np.sort(a, axis=None) == array([1, 1, 2, 2, 3, 3, 4, 6, 9])
Sorting a table
...like you're sorting a table by one of its columns: reorder all columns the same way, by the values in one.
This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me) |
If you do not have fields defined (not a structured array), then argsort() is probably the most flexible but it's not in-place, which can matter to you for large arrays. If it does, see below.
("argsort returns an array of indices of the same shape as a that index data along the given axis in sorted order."[3])
# sort columns, by the values in the second column >>> a[ a[:,1].argsort() ] array([[3, 1, 2], [1, 2, 3], [4, 9, 6]]) # sort rows, by the values in the third row >>> a[ a[2,:].argsort() ] array([[3, 1, 2], [4, 9, 6], [1, 2, 3]])
When sorting a structured array' (i.e. when you have fields defined):
>>> a=np.array([[(3, 2, 1)], [(9, 7, 8)], [(4, 5, 6)]], dtype=[('FOO', '<i8'), ('BAR', '<i8'), ('QUU', '<i8')])
...then you can sort by those fields (columns), in-place or not:
>>> a.sort(order=['FOO'], axis=0) # it seems you have to explicitly mention the right axis >>> a # even though it's the only one that makes sense array([[(3, 2, 1)], [(4, 5, 6)], [(9, 7, 8)]], dtype=[('FOO', '<i8'), ('BAR', '<i8'), ('QUU', '<i8')])
You can order by one field, then another for cases where the first are equal. (no example; the data above does not have such a case).
To column-sort field-less arrays in-place (can make sense for very big arrays which you are using as tables), you have to fake fields.
You can use a view to add fields onto an existing array. Then call the view's .sort() to sort the backing data in-place.
Caveats:
- you can still only really sort by columns.
- You have to get the dtype right. Getting a view via a hardcoded dtype, e.g. a.view('i4,i4,i4')(or i, or similar) won't be remotely portable, unless you also created the array with that same hardcoded type (below, numpy chooses a's type, which could be i4 or i8 depending on context)
>>> a = np.array([(3, 1, 2), (4, 9, 6), (1, 2, 3)]) >>> b = a.view( ','.join( [str(a.dtype)]*a.shape[-1]) ) # TODO: check this >>> b array([[(3, 1, 2)], [(4, 9, 6)], [(1, 2, 3)]], dtype=[('f0', '<i8'), ('f1', '<i8'), ('f2', '<i8')]) >>> b.sort(order=['f0'], axis=0) >>> a array([[(1, 2, 3)] [(3, 1, 2)] [(4, 9, 6)]])
See also
- http://docs.scipy.org/doc/numpy/reference/generated/numpy.sort.html#numpy-sort
- http://stackoverflow.com/questions/2828059/sorting-arrays-in-numpy-by-column
- http://stackoverflow.com/questions/5047407/python-numpy-array-sorting
Serializing
This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me) |
....often to/from disk, sometimes memory.
Note that if instead of a local back-and-forth to storage you want to use these for interchange, then in most cases you have to do the extra bookkeeping/metadata stuff related to shape and data type (including signedness and endianness).
- pickle
You can use pickle on numpy objects.
You probably want to explicitly use pickle.HIGHEST_PROTOCOL as it's smaller on disk and writes and reads a lot faster than the simple-and-stupid default. I regularly see a factor 10 or 20 faster for writing (more when using cPickle over pickle) and reading (with little difference between pickle/cPickle).
- Text files
- Raw bytestrings (specify dtype)(verify)
Conversion to and from bytestrings:
- tostring()basically constructs a python bytestring copy of just the array's byte storage
- you'll have to keep track of shape and data type (including endianness) yourself, meaning this isn't hugely portable.
- There is a similar numpy.fromstring(a,type), though chances you'll only pair it with numpy.tostring(a)
This can be a convenient way to read in known raw structures, avoiding allocation when backing files are large. Note, however, that these are limited to 2GB even on 64-bit archs.
When compared to fromstring it can help avoid double allocation.
- Matlab matrix data
Again, this can be nice for interchange with Matlab (or other things that know this format). (the saving speed, loading speed, and space used is pretty good.)
- npy binary format
which is one of the few options that does store enough metadata for portability(verify), allows C-order and Fortran order, and allows multiple arrays.
For details of the file format used, look for npy format
Note that you can save to a StringIO/BytesIO, which can be nice when passing this into a database.
Going between python objects and numpy
In general, work in numpy, and minimize the amount of conversions you do.
Numpy to python
- You're probably looking for tolist(), which creates a python list(-of-lists) filled with conversions to the closest applicable types.
- Avoid going to python until/unless necessary - the copy to python is slowish, and working in python is slow compared to numpy
Python to numpy
- Since numpy allocates its own contiguous array, this will also be a copy.
- ...although objects that implement the array protocol (and satisfy some other requirements?{{verify}) may be viewed instead.
Manipulating shape, size, and axes
Reshaping, reordering
These two do not change the backing data, it just looks at the same data differently (and so requires that new size fits exactly).
It returns a different-shaped view on the same backing data.
For example, say that you have:
>>> a = array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12 ]) # e.g. from linspace >>> a.reshape( (6,2) ) array([[ 1, 2 ], [ 3, 4 ], [ 5, 6 ], [ 7, 8 ], [ 9, 10 ], [ 11, 12 ]]) >>> a.reshape( (2,6) ) array([[ 1, 2, 3, 4, 5, 6 ], [ 7, 8, 9, 10, 11, 12 ]])
Notes:
- One of the lengths can be -1, which means "it's an exact multiple and you figure it out".
- Note that when you reshape an array, the order= argument does not mean 'output should be this ordering' (as it usually does), but (?). If a copy is necessary, it will be C-ordered. If it can be done without a copy, the ordering depends on other factors. If you need the output in a specific ordering, you may need a .copy() with order= argument.
- 1D flattening
- numpy.ravel() is a reshape into 1D, and is essentially reshape(-1, order=order)
- ndarray.flatten() seems to always returns a copy (verify)
- ndarray.flat() gives a numpy iterator (relatively like a numpy iterator)
- numpy.ravel() is a reshape into 1D, and is essentially
For certain specific conversions, allocation as (or readout as, which can be differnt) Fortran order or C order indexing can be relevant.
Reminder:
- C-style index ordering: the last axis index changing fastest, back to the first axis index changing slowest.
- Fortran-style index order: with the first index changing fastest, and the last index changing slowest
ndarray.flags tells you the array's current properties
- numpy.asfortranarray()</tt>
- numpy.ascontiguousarray()</tt> (C order)
numpy.require() can require ordering among other things (alignment, writeability, that we own the data) and will make a copy if necessary
http://stackoverflow.com/questions/17558388/reordering-of-numpy-arrays
Notes:
- Iteration is always / defaults to C order (verify)
Resizing (in the resampling/interpolation sense)
This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me) |
(Note that most of these will return copied data)
Integer-factor reductions
- decimating by taking every nth pixel via indexing, and throwing away the rest
- size down only, integer multiples only
- e.g. image[::2,::2]
- with default arguments, this is 'block means', which in image processing contexts this is sometimes called binning or coarsening
- size down only, integer multiples only
- does something per non-overlapping rectangle, e.g.
block_reduce(image, block_size=(2,2), func=np.mean) # default func is np.sum block_reduce(image, block_size=(2,3,1), func=np.max)
- Presumably the code here does the same thing (haven't really looked yet)
- size down only, integer factor only
- uses interpolation, specifically order-8 Chebyshev type I, or 30 point FIR
Not very sorted yet
- one of the functions in scipy.interpolate
- e.g. map_coordinates(), which is actually way more general
- scipy.signal.resample - resample using fourier space (zero-padding in frequency domain)
- in one dimension only?(verify)
- scipy.signal.resample_poly - polyphase and FIR filter
Somewhat image-specific
- nearest, bilinear, bicubic, cubic, lanczos
- meant for PIL interaction so coerces results to integer(verify)
- e.g. zoom(), using spline interpolation of, by default, order 3
combining data
splitting
chunk up along a particular axis, return them as separate arrays (views?). (Sort of the counterpart of stacking)
- numpy.split(axis=something) splits an array into a list of multiple sub-arrays of equal size.
- numpy.hsplit() is split() with axis=1, i.e. horizontally, column-wise
- numpy.vsplit() is split() with axis=0, i.e. vertically, row-wise
- numpy.dsplit() is split() with axis=2, i.e. depth-wise
- numpy.array_split() is similar, but when splitting into a given amount of items it does not require equal division.
>>> c = numpy.array([[ 1., 2.], [ 3., 4.], [ 5., 6.], [ 7., 8.]]) >>> numpy.vsplit(c, 2) [array([[ 1., 2.], [ 3., 4.]]), array([[ 5., 6.], [ 7., 8.]])] >>> numpy.hsplit(c, 2) [array([[ 1.], [ 3.], [ 5.], [ 7.]]), array([[ 2.], [ 4.], [ 6.], [ 8.]]) ]
>>> x = np.arange(9.0) >>> np.split(x, 3) [array([ 0., 1., 2.]), array([ 3., 4., 5.]), array([ 6., 7., 8.])] >>> >>> np.split(x, [3, 5, 6, 10]) [array([ 0., 1., 2.]), array([ 3., 4.]), array([ 5.]), array([ 6., 7., 8.]), array([], dtype=float64)] >>> np.array_split(x, 4) [array([ 0., 1., 2.]), array([ 3., 4.]), array([ 5., 6.]), array([ 7., 8.])]
See also:
- http://docs.scipy.org/doc/numpy/reference/generated/numpy.split.html#numpy.split
- http://docs.scipy.org/doc/numpy/reference/generated/numpy.array_split.html#numpy.array_split
- http://docs.scipy.org/doc/numpy/reference/generated/numpy.hsplit.html#numpy.hsplit
- http://docs.scipy.org/doc/numpy/reference/generated/numpy.vsplit.html#numpy.vsplit
- http://docs.scipy.org/doc/numpy/reference/generated/numpy.dsplit.html#numpy.dsplit
Growing
You can't really change the size of a numpy array without reallocating and copying.
- a basic implication of it being backed contiguously in memory, which is good for speed
- (could the realloc expand the existing allocation?(verify))
- yes, this implies that numpy arrays that are ~30% of memory may not make you happy. (You may wish to consider better-informed C code, or some well-informed numpy calls)
If you know the eventual size of the array, allocate that size immediately and assign data as you go.
- This is probably the most memory-efficient way to go
If you don't need numpy operations on the axis of growth, consider making a python list of numpy matrices.
- (pretty much because that isn't contiguous)
- If you don't know the final size, but do want to end up with a single numpy array,
- ...consider first making the python list of numpy arrays, and eventually join or stack them
- it will, for a short time, mean you use twice the memory that the data needs.
- ...but so will attempting to grow an existing array
Padding
delete, insert
newaxis
More axis manipulation
Unsorted
0D array, scalars
This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me) |
Histograms, Binning as in dealing with values
This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me) |
(cf. binning as in block-means coarsening -- see resizing)
numpy.digitize [4]
scipy.stats.binned_statistic [5]
numpy.bincount [6]
bin_means = numpy.histogram(data, bins, weights=data)[0] / numpy.histogram(data, bins)[0]
http://stackoverflow.com/questions/23594262/binning-values-of-a-function-in-python-numpy
Semi-sorted
Correlation
distance calculation
None
Converting a python array containing None
- becomes NaN under float/double
- is a TypeError casting to int
- without explicit dtype can force the array to object
Bits
Bitmasks or other booleans
numpy.unpackbits converts 8-bits-in-an-uint8 to 1-bit-per-uint8
>>> np.unpackbits(array([[2], [7], [23]], dtype=np.uint8), axis=1) array([[0, 0, 0, 0, 0, 0, 1, 0], [0, 0, 0, 0, 0, 1, 1, 1], [0, 0, 0, 1, 0, 1, 1, 1]], dtype=uint8)
numpy.packbits converts 1-bit-per-someinteger to uint8, 8 bits at a time
- padded to 8 with zeroes on the right side per element
- so when you specify fewer than 8 you actually specify the leftmost bits - a bit odd perhaps, and meaning you sometimes need to do trickery like ([0]*(8-len(a)%8))+a
- For example:
>>> np.packbits(array([ [ [1,0,1], [0,1,0] ], [ [1,1,0], [0,0,1] ] ]) array( [ [ [160],[64] ] , [ [192],[32]] ], dtype=uint8)
integers with unusual edges
Say you want to load from unusually-sized integers that imply packing across byte boundaries, often raw image data, e.g. 4, 10, 12, or 14 bits per pixel, 24 bits per sample.
Since there is no native data format for that, you will probably be loading that into uint8, uint16, uint16, uint16, and uint32 respectively.
If the data isn't huge, and this isn't a bottleneck in your app, then it's probably easiest to load it into a covneniently-sized integer type, then bit-massage it into multiple arrays with the dtype you want, and probably merge them. (As much in numpy code as possible so it's native)
If data is large, you probably want to avoid the intermediate allocations, and do the described array-to-array conversion in one step.
See e.g. the cython example here.
Numpy and native code isn't quite as scary as it sounds.
- For example, cython is aware of numpy
- write a small numpy C extension
- use an existing library via ctypes
- Consider Fortran, in simpler cases it's somewhat less hairy than
To avoid any intermediate storage (relevant when working on huge data),
you may to do the conversion during file load, probably bypassing numpy's loading completely.
(Note you'll have to consider endianness and such yourself)
Spherical/polar coordinates
Making masks
Warnings and errors
RuntimeWarning: invalid value encountered in less_equal
...or in another comparison function (less, greater, etc)
This is most likely about np.nan or nan.infinity.
Those values can be avoidable when they come from
- a div by zero, or log of zero
- conversion from python None
In other cases they are useful (e.g. it's a matrix or timeseries missing data points), but are doing statistics on them, then
- check whether there are nan-aware functions (e.g. nanmedian, nanpercentile, nanmedian)
- failing that, make a (flat) array with just the non-NAN values like:
x[numpy.logical_not(numpy.isnan(x))]
or the shorthand
x = x[~numpy.isnan(x)]
ValueError: data type must provide an itemsize
Means the type you handed in can't be coerced to the given dtype (which my well be the default, often double).
For example, you handed in an array of integer-as-strings,
or an array mixing types where just one fails.
ValueError: zero-size array to some function without identity
Really just means you're handing along an empty array.
There are a bunch of utility functions that assume they won't be handed an empty array, usually because it makes no sense.
In my case, I had an off-by-one error in some slicing logic.
RuntimeWarning: numpy.ufunc size changed, may indicate binary incompatibility
These warnings come from something compiled against an older numpy than is installed, and signaling that the ABI might have changed.
It often comes from scipy.
Which is typically built against the lowest supported numpy version, in part to be ABI-compatible, so it's typically fine.
If your distro packaging doesn't fix this,
you can usually also fix this by one of:
- downgrading numpy
- getting pip to compile scipy against your installed numpy.
- filter out these warnings.
underflow encountered in exp
The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
Using a non-tuple sequence for multidimensional indexing is deprecated
More fully:
FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
The actual warning comes from numpy.
This actually is about the behaviour of slicing, and specifically numpy ~1.15.0 deprecating (multidimensional) indexing/slicing with anything other than tuple arguments.
As https://github.com/numpy/numpy/pull/9686 mentions, this because there used to be distinct behaviours depending on how you handed the same-valued arguments in, which is was probably a confusing/bad idea; your code probably should be changed updated if only to check whether or not it relied on the old behaviour.
If triggered from scipy, it's an older version of scipy, and not an issue.
Since it has been fixed in scipy since, upgrading it gets rid of the warning.
elementwise == comparison failed and returning scalar instead; this will raise an error or perform elementwise comparison in the future
https://github.com/numpy/numpy/issues/9210
scipy
This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me) |
Scipy builds more interesting stuff on top of the general-purpose numpy.
Scipy and the scikits may make it easier to quickly manipulate to and to visualize data (also via matplotlib). There are other packages you may like for visualisation, such as mayavi.
See also:
scikits (scipy toolkits)
Separately installable scipy toolsets, often because they are specific specializations that just happen to build on scipy/numpy, or cannot be part of scipy because of licensing details.
See also:
Includes:
- sound, DSP, and speech things:
- audiolab [7] (a.k.a. pyaudiolab, previously pyaudio), which wraps sndfile for various file format reading and writing
- samplerate (a.k.a. pysamplerate)
- Talkbox (signal/speech processing)
- (Unsorted):
- (Sci)py-to-Matlab bridge (mlabwrap) [8]
- pymachine, machine learning (unfinished?) [9]
- Numerical optimization frameworks ([10] and [11])
- Approximate Nearest Neighbor Library (a wrapper for ANN)
- delaunay (...trangulation; see also voronoi)
- TimeSeries [12]
- RSFormats (reads common remote-sensing data formats)
As all Wiki pages, this page is editable, this means that you can modify the contents of this page simply by using your web-browser. Simply click on the "Edit this page" link at the bottom of the page. WikiFormatting will give you a detailed description of available Wiki formatting commands.
TracGuide is a good place to start.
http://scipy.org/scipy/scikits/wiki
Duplication between numpy and scipy
This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me) |
Since numpy already has some useful functionality, there is occasionally some duplication between numpy and scipy, e.g. in some linear algebra, some FFT.
In general, choose the scipy code. The stuff in numpy is often there for historical and backwards-compatibility reasons.
The scipy code will often be a little more fleshed out, and sometimes faster.
In a few cases it's just different implementations of the same idea.
See also:
Some specifics:
- Scipy FFT versus numpy FFT
- note that pyfftw.interfaces provides drop-in functions for both
- numpy.fft and scipy.fftpack both use FFTPACK so should perform comparably
- in practice scipy can be cleverer, e.g. runs checks whether your data is real and uses rfft if so. Yes you can do that yourself, but lazier programmers may find scipy isn't "randomly" slower.
- scipy interleaves the result (verify)'s real and imaginary parts, though, which is slower to work with
- See also http://stackoverflow.com/questions/6363154/what-is-the-difference-between-numpy-fft-and-scipy-fftpack
- It seems that packaged numpy and scipy are typically built against ATLAS
- If you want to build them against e.g. OpenBLAS, you will need some work.
- In theory it's little more than a pip install numpy/scipy after you have openblas, in practice it is more.
http://osdf.github.io/blog/numpyscipy-with-openblas-for-ubuntu-1204.html http://stackoverflow.com/questions/7496547/does-python-scipy-need-blas
On speed
See also
The array interface used by numpy and related:
Related in some way or other:
- http://wiki.python.org/moin/NumericAndScientific (some other scientific-style)
- http://dirac.cnrs-orleans.fr/plone/software/scientificpython/