Python usage notes - pandas, dask
Pandas
This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)
Pandas is, in part, a bunch of convenience in
- reading and writing data
- structuring data,
- applying numpy-style things to data (it builds on ndarray, so numpy is applicable for things that are numeric); a rough sketch follows this list
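A rough sketch of those three (the file name and column names here are made up, not from any particular dataset):

import numpy as np
import pandas as pd

df = pd.read_csv('measurements.csv')              # reading: one of many pd.read_* functions
df['log_value'] = np.log(df['value'])             # numpy ufuncs apply elementwise to a column (a Series)
print(df.groupby('group')['log_value'].mean())    # structuring/summarizing per group
df.to_csv('measurements_with_log.csv', index=False)   # writing it back out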
Data
This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)
pandas...
Notes:
- series, and by extension DFs, are backed by ndarray
- which means you can fairly easily interact with numpy, scipy, matplotlib, etc.
- note there's some basic matplotlib integration, see e.g. df.plot(), and variants via e.g. df.plot.area() / df.plot(kind='area') and similar for line, pie, scatter, bar, barh, box, density, hexbin, hist, and more
- though strings work a little differently
- as do mixes of types
- Dataframes (and the pandas API in general) imitate R data.frame
- which also deals with somewhat more flexible data than python's numpy does, something that comes up e.g. in typing details
- There are some more things that act much like a DF, but represent more complex state
- see e.g. the grouping example below.
- series has
- a .shape
- a .dtype
- dataframe has
- a .shape
- a .dtypes (a Series rather than a list of dtypes, for some reason)
- both show up in the short example below the construction snippet
- While you may often read DFs from files
- Series can e.g. be constructed from lists
- DataFrames can e.g. be from dicts, like (example from docs):
ser = pd.Series([1,2,3])

frame = pd.DataFrame({
    "Name": ["Braund, Mr. Owen Harris",
             "Allen, Mr. William Henry",
             "Bonnell, Miss. Elizabeth"],
    "Age": [22, 35, 58],
    "Sex": ["male", "male", "female"],
})
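Continuing that snippet, the attributes mentioned above look like this (output assuming a typical 64-bit build):

print(ser.shape)     # (3,)
print(ser.dtype)     # int64
print(frame.shape)   # (3, 3)
print(frame.dtypes)  # a Series: Name object, Age int64, Sex object
# frame.plot.bar(x='Name', y='Age')   # the matplotlib integration mentioned earlier (needs matplotlib installed)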
On the index
On types
Because each series is basically an ndarray:
- typing is largely just numpy's
- you may not be used to seeing object (more on that below)
- plus some of pandas's own extension types, which exist
- to add a few useful types,
- to help deal with missing/NA handling (see below).
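A quick sketch of what those dtypes look like in practice (the values are arbitrary):

import pandas as pd

print(pd.Series([1, 2, 3]).dtype)                     # int64  - a plain numpy dtype
print(pd.Series(['a', 'b']).dtype)                    # object - arbitrary python objects, e.g. strings
print(pd.Series([1, 2, None], dtype='Int64').dtype)   # Int64  - pandas's nullable-integer extension type
print(pd.Series(['a', 'b'], dtype='string').dtype)    # string - pandas's string extension type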
On object
extension types
On missing data
This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)
It's not unusual to have some data be missing, so the ability to mark it as such, and be able to work with such data, is useful.
How best to mark it, and how best to deal with it, varies per use and per type.
The forms of missing data - within numpy, things are more interesting.
- ndarrays store a single type, and as a result have no general way of representing missing data.
- It can be practical to (ab)use NaN in float arrays - it doesn't need a special type, and some processing can be told to ignore NaNs (a small sketch follows this list).
- In many other types, e.g. integer, there is no reasonable special value
- other than maybe case-specific ones like -1, but this is going to give wrong answers, rather than errors, if you forget
- You can use object arrays, but they're just pointing at a mix of types, and a lot of numeric processing won't like that.
- There are things you could do with masked arrays
- but it amounts to extra state you have to remember to manage.
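A small sketch of the NaN and masked-array options just mentioned (arbitrary values):

import numpy as np

a = np.array([1.0, np.nan, 3.0])   # float array, NaN standing in for 'missing'
print(a.mean())                    # nan  - regular reductions see the NaN
print(np.nanmean(a))               # 2.0  - the nan-aware variants ignore it

m = np.ma.masked_invalid(a)        # masked array: keep a separate mask instead of a magic value
print(m.mean())                    # 2.0  - works, but you now carry that mask state around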
The forms of missing data - within pandas it's mostly the numpy part, but with some extra details
- Pandas introduces nullable ints (see #On_types), e.g. Int64, which differs from numpy's int64
- Historically, pandas has represented missing data with
- (numpy.)NaN for float types
- NaT for datetime (verify)
- object columns could end up with any of these considered-NA things (NaN, NA, None), but you may not want such a mix for other reasons
- such as your confusion
- and calculations not particularly working
- Since around 1.0.0 pandas is trying to move to its (pandas.)NA instead (verify)
- NA itself is a type that does act similarly to nan (e.g. NA + 1 is NA, NA**0==1)
- but cannot be used easily/sensibly in a lot of arithmetic or logic. "Compared to np.nan, pd.NA behaves differently in certain operations. In addition to arithmetic operations, pd.NA also propagates as “missing” or “unknown” in comparison operations:"
- meaning it will either propagate NA, or complain, but not do nonsense - which is probably what you want
- isna(), notna() help detect these consistently across dtypes -- except integer (see below)
- isna() and isnull() are identical. Both exist for the benefit of people coming from R (where both exist but are not the same)
- You're given functions to help (a combined sketch follows this list)
- .fillna() to replace NA values with a specific value,
- .dropna() to drop rows with NA in any column, or in specific columns
- ... or drop columns with NA, if you give it axis=1
- There are things that you may want to count as missing, depending on context. Consider
- empty strings, e.g. with df.replace('', replacement) (where that replacement could be pandas.NA)
- things like inf and -inf
- the difference between a numpy int and pandas nullable int can be subtle. It's easy to not notice the difference between:
Age2    int64
Age3    Int64
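A combined sketch of the points above (column names are made up; exact behavior is worth verifying on your pandas version):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': [1.0, np.nan, 3.0],                      # float column, NaN for missing
    'b': pd.array([1, None, 3], dtype='Int64'),   # nullable integer column, <NA> for missing
})

print(df.isna())           # True where a value counts as missing, whatever the dtype
print(df['a'].fillna(0))   # replace missing values with a specific value
print(df.dropna())         # drop rows with a missing value in any column
print(df.dropna(axis=1))   # ...or drop such columns instead

print(pd.NA + 1)           # <NA> - arithmetic propagates missing
print(pd.NA == 1)          # <NA> - comparisons propagate too, rather than quietly returning False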
"Except integer?"
A numpy integer cannot represent None/NaN/missing.
This is probably the main reason why pandas, given data that it knows are integers, will easily use float64 (where it can use NaN)
- (keep in mind float64 stores integers accurately within -9007199254740991..9007199254740991 range)
- and you can explicitly use its own nullable Int64 (convenient, but with a few footnotes)
Relatedly, introducing NAs can promote an integer into a float64, and a boolean to an object. [1]
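For example, reindexing introduces missing values for new labels, which shows that promotion (a small sketch):

>>> pandas.Series([1, 2, 3]).reindex([0, 1, 2, 3])   # int64 becomes float64 once NaN appears
0    1.0
1    2.0
2    3.0
3    NaN
dtype: float64
>>> pandas.Series([True, False]).reindex([0, 1, 2])  # bool becomes object
0     True
1    False
2      NaN
dtype: object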
Some examples that may help
None becomes NaN in float64:
>>> pandas.Series([1,2,None])
0    1.0
1    2.0
2    NaN
dtype: float64
None becomes NA in Int64 and string
>>> pandas.Series([1,2,None], dtype='Int64')
0       1
1       2
2    <NA>
dtype: Int64
>>> pandas.Series(['1','2',None], dtype='string')
0       1
1       2
2    <NA>
dtype: string
None becomes NaT in datetime64
>>> pandas.Series([1,2,None], dtype='datetime64[ns]')
0   1970-01-01 00:00:00.000000001
1   1970-01-01 00:00:00.000000002
2                             NaT
dtype: datetime64[ns]
None stays None in object (and things can become object for somewhat-uncontrolled reasons)
>>> pandas.Series(['1','2',None])
0       1
1       2
2    None
dtype: object
...but that's just because you gave it mixed types, not because object prefers None.
Consider that in the following it first becomes NA by force, then can stay NA:
>>> pandas.Series(['1','2',None], dtype='string').astype('object')
0       1
1       2
2    <NA>
dtype: object
Things happen during conversions
>>> frame = pd.DataFrame({
...     "Name": ["Braund, Mr. Owen Harris",
...              "Allen, Mr. William Henry",
...              "Bonnell, Miss. Elizabeth"],
...     "Age": [22, 35, None],
...     "Sex": ["male", "male", "female"],
... })
>>> frame['Age2'] = frame['Age'].astype('Int64')
>>> frame.dtypes
Name     object
Age     float64
Sex      object
Age2      Int64
>>> frame
                       Name   Age     Sex  Age2
0   Braund, Mr. Owen Harris  22.0    male    22
1  Allen, Mr. William Henry  35.0    male    35
2  Bonnell, Miss. Elizabeth   NaN  female  <NA>
See also: