Python usage notes - pandas, dask
Pandas
Pandas is, in part, a bunch of conveniences for
- reading and writing data
- structuring data,
- applying numpy-style things to data (it extends ndarray so numpy is applicable for things that are numeric)
Data
pandas imitates R's data.frame: a table of named columns (Series), each with its own type.
Notes:
- This works out as more flexible around tabular data than e.g. just numpy would be
- in part because where numpy has a matrix of one type, each series in a dataframe has its own type, so it can group more interesting data in a single structured object
- series, and by extension DFs, do implement ndarray, which means you can still interact with numpy, scipy, matplotlib, etc.
- though strings work a little differently
- as do mixes of types (also consider pandas's tendency to go for the object type)
- there's some basic matplotlib integration (see e.g. df.plot(), and variants via e.g. df.plot.area() / df.plot(kind='area') and similar for line, pie, scatter, bar, barh, box, density, hexbin, hist, and more)
- There are some more things that act much like a DF, but represent more complex state
- see e.g. the grouping example below.
- series has
- a .shape
- a .dtype
- dataframe has
- a .shape
- a .dtypes (a Series rather than a list of dtypes, for some reason)
- DFs are often read from files
- ...but you can also
- construct Series e.g. from lists
- construct DataFrames e.g. from dicts
- like (example from docs):
ser = pd.Series([1, 2, 3])
frame = pd.DataFrame({
    "Name": ["Braund, Mr. Owen Harris", "Allen, Mr. William Henry", "Bonnell, Miss. Elizabeth"],
    "Age": [22, 35, 58],
    "Sex": ["male", "male", "female"],
})
On the index
On types
Because each series is basically ndarray
- typing is largely just numpy's
- ...but there's a little more, and you're probably not used to seeing object
- plus some of its own extension types, which it does
- to add a few useful types,
- to help deal with missing/NA handling (see below).
On object
extension types
On missing data
It's not unusual to have some data be missing.
So it's useful to mark that specifically.
How best to mark it, and how best to deal with it, varies per use and per type.
The forms of missing data - python side there's things like float('nan') (a.k.a. math.nan), numpy.nan, and None, and in python data structures you can mix and match.
The forms of missing data - within numpy things are more interesting.
- ndarrays work on arrays of a single type - and as a result whether there is already special value to represent missing data depends on the type
- It can be practical to (ab)use NaN in float arrays - doesn't need a special type, and some processing can be told to ignore and/or specifically complain about NaNs already
- In many other types, e.g. integer, there is no reasonable special value
- other than maybe case-specific ones like -1, but this is going to give wrong answers, rather than errors, if you forget
- You can use object arrays, but they're just pointing at a mix of types, and a lot of numeric processing won't like that.
- There are things you could do with masked arrays
- but it amounts to extra state you have to remember to manage.
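To make the NaN-in-float-arrays point concrete, a small sketch (plain numpy, no pandas):

```python
import numpy as np

# NaN only exists for float dtypes; integer arrays have no missing-value marker.
a = np.array([1.0, np.nan, 3.0])
print(np.isnan(a))       # [False  True False] - detect NaN element-wise
print(np.nansum(a))      # 4.0 - the nan-aware variants skip NaNs
print(a.sum())           # nan - plain operations propagate it
```

The nan-aware functions (nansum, nanmean, nanmax, ...) are what "some processing can be told to ignore NaNs" refers to.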
The forms of missing data - within pandas it's mostly the numpy part, but with some extra details
- Pandas introduced nullable ints (see #On_types), e.g. Int64 which differs from e.g. numpy's int64
- object columns could end up with any of these considered-NA things (NaN, NA, None), but you may not want such a mix for other reasons
- such as your confusion
- and calculations not particularly working
- Historically, pandas has represented missing data with
- (numpy.)NaN for float types
- NaT for datetime (verify)
- Since around 1.0.0 pandas is trying to move to its (pandas.)NA instead (verify)
- NA itself is a type that does act similarly to nan (e.g. NA + 1 is NA, NA**0==1)
- but cannot be used easily/sensibly in a lot of arithmetic or logic. "Compared to np.nan, pd.NA behaves differently in certain operations. In addition to arithmetic operations, pd.NA also propagates as “missing” or “unknown” in comparison operations:"
- meaning it will either propagate NA, or complain, but not do nonsense - which is probably what you want
- isna(), notna() help detect these consistently across dtypes -- except integer (see below)
- isna() and isnull() are identical. Both exist for the sake of people coming from R (where both exist but are not the same)
- You're given functions to help
- .fillna() to replace NA values with a specific value,
- .dropna() to drop rows with NA in any column, or in specific columns
- ... or drop columns with NA, if you give it axis=1
- There are things that you may want to count as missing, depending on context. Consider
- empty strings, e.g. with df.replace('', replacement) (where that replacement could be pandas.NA)
- things like inf and -inf
- the difference between a numpy int and pandas nullable int can be subtle. It's easy to not notice the difference between:
Age2      int64
Age3      Int64
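A small self-contained sketch of how the nullable Int64 dtype and pd.NA behave in practice:

```python
import pandas as pd

s = pd.Series([1, 2, None], dtype="Int64")   # nullable integer, capital I
print(s.isna().tolist())         # [False, False, True]
print((s + 1).dtype)             # Int64 - arithmetic propagates NA, keeps the dtype
print(s.fillna(0).tolist())      # [1, 2, 0]
print(len(s.dropna()))           # 2
# pd.NA propagates through comparisons instead of answering nonsense:
print(pd.NA == 1)                # <NA>
```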
"Except integer?"
A numpy integer cannot represent None/NaN/missing.
This is probably the main reason why pandas, given data that it knows are integers, will easily use float64 (where it can use NaN)
- (keep in mind float64 stores integers accurately within -9007199254740991..9007199254740991 range)
- and you can explicitly use its own nullable, Int64 (convenient but has a few footnotes)
Relatedly, introducing NAs can promote an integer into a float64, and a boolean to an object. [1]
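One way to see that promotion without touching assignment: reindex() introduces missing entries for new labels, which triggers the upcasting:

```python
import pandas as pd

ints = pd.Series([1, 2, 3])        # int64
bools = pd.Series([True, False])   # bool
# reindexing to labels that have no value introduces missing entries
ints2 = ints.reindex([0, 1, 2, 3])
bools2 = bools.reindex([0, 1, 2])
print(ints2.dtype)    # float64 - int64 was promoted to hold NaN
print(bools2.dtype)   # object  - bool was promoted
```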
Some examples that may help
None becomes NaN in float64:
>>> pandas.Series([1,2,None])
0 1.0
1 2.0
2 NaN
dtype: float64
None becomes NA in Int64 and string
>>> pandas.Series([1,2,None], dtype='Int64')
0 1
1 2
2 <NA>
dtype: Int64
>>> pandas.Series(['1','2',None], dtype='string')
0 1
1 2
2 <NA>
dtype: string
None becomes NaT in datetime64
>>> pandas.Series([1,2,None], dtype='datetime64[ns]')
0 1970-01-01 00:00:00.000000001
1 1970-01-01 00:00:00.000000002
2 NaT
dtype: datetime64[ns]
None stays None in object (and things can become object for somewhat-uncontrolled reasons)
>>> pandas.Series(['1','2',None])
0 1
1 2
2 None
dtype: object
...but that's just because you gave it mixed types, not because object prefers None.
Consider that in the following it first becomes NA by force, then can stay NA:
>>> pandas.Series(['1','2',None], dtype='string').astype('object')
0 1
1 2
2 <NA>
dtype: object
Things happen during conversions
>>> frame = pd.DataFrame({
"Name": ["Braund, Mr. Owen Harris", "Allen, Mr. William Henry", "Bonnell, Miss. Elizabeth"],
"Age": [22, 35, None],
"Sex": ["male", "male", "female"],
})
>>> frame['Age2'] = frame['Age'].astype('Int64')
>>> frame.dtypes
Name object
Age float64
Sex object
Age2 Int64
>>> frame
Name Age Sex Age2
0 Braund, Mr. Owen Harris 22.0 male 22
1 Allen, Mr. William Henry 35.0 male 35
2 Bonnell, Miss. Elizabeth NaN female <NA>
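Relatedly, convert_dtypes() asks pandas to re-infer columns as its nullable extension types, which can undo the float64 detour (a sketch using data shaped like the example above):

```python
import pandas as pd

df = pd.DataFrame({"Age": [22.0, 35.0, None], "Sex": ["male", "male", "female"]})
print(df.dtypes.tolist())        # float64, object
df2 = df.convert_dtypes()        # re-infer as nullable extension types
print(df2["Age"].dtype)          # Int64 - all values integral or missing
print(df2["Sex"].dtype)          # string
```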
See also:
Input and output
Extra parsing
Large datasets
Inspecting, cleaning, selecting, filtering
Some poking of data new to you
If using a parser like read_csv, you might care to know what type pandas guessed
df.dtypes
Cardinality of each column (amount of distinct values in it)
df.apply( pandas.Series.nunique )
How many bad/missing values (NaN, None, NaT), per column
df.isna().sum()
Amount of empty strings, per column
df.isin(['']).sum() # note: isin has some more complex uses
Show all rows with NA somewhere
df[df.isna().any(axis=1)]
Common and uncommon values per column, to see if there's specific weirdness to fish out
for column in df.columns:
print( df[column].value_counts() )
...note that that doesn't count NA. If you really need that, try something like
for column in df.columns:
print( df[[column]].groupby(column, dropna=False, as_index=False).size() )
Similarly, if you want to figure out what kind of things an object column is pointing to, something like (TODO: see if there is a more direct way of doing that)
pandas.Series( [ type(val) for val in df['rating']] ).value_counts()
Look at distribution of numeric rows
df.describe()
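The recipes above, run against a tiny made-up frame so the outputs are visible (column names invented for illustration). Note that Series.map(type) is a fairly direct way to see what an object column holds:

```python
import pandas as pd

df = pd.DataFrame({"rating": ["PG", "PG", None, "R"],
                   "year":   [1999, 2001, 2001, None]})
print(df.isna().sum().tolist())              # [1, 1] - missing values per column
print(df.apply(pd.Series.nunique).tolist())  # [2, 2] - cardinality (NA not counted)
print(df["rating"].value_counts().to_dict()) # {'PG': 2, 'R': 1}
# what kinds of objects does an object column actually point at?
print(df["rating"].map(type).value_counts().to_dict())
```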
Look at common and uncommon values for a column with value_counts()
df['colname'].value_counts()
Say the above revealed some weird substrings in a specific column, and you want to see the records they came from
You might write
bads = df[ df['rating'].str.contains('badstr') ]
You'll likely find it saying:
ValueError: cannot index with vector containing NA / NaN values
...because the contains() will pass NA or NaN through (and things like it may too), which doesn't work in the selection around it. In this case, what you'd want is
bads = df[ df['rating'].str.contains('badstr',na=False) ]
To invert a df/series item-wise, you can't use a not (that's a python boolean context), but ~ works, e.g.
df = df[ ~df['rating'].str.contains('min',na=False) ]
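A minimal sketch of the na= and ~ behavior, using invented values:

```python
import pandas as pd

s = pd.Series(["minty", None, "plain"])
mask = s.str.contains("min", na=False)   # na=False: missing counts as no-match
print(mask.tolist())                     # [True, False, False]
print(s[~mask].tolist())                 # keep the rows that do NOT match
```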
See if Dataframe dtypes seem reasonable (e.g. see whether any alterations changed things as expected)
df.dtypes
Since it is generally more useful to select columns than rows, [] selects columns, say
df["Age"]
row selection / ranges via the index: loc and iloc
loc gets rows (and/or columns) from the index by label/value.
- ...and its ranges happen to be inclusive
iloc gets rows (and/or columns) from the index by integer position.
- whose ranges are exclusive, because integer ranges
to basically steal an example from here, if you have a series like
s = pd.Series(list("abcdefg"), index=[49, 48, 47, 0, 1, 2, 2])
>>> s.iloc[0:3]   # first three, by position (note: head() is a little simpler)
49 a
48 b
47 c
>>> s.loc[1:2] # index values 1 through 2
1 e
2 f
2 g
Note that handing a single value to loc/iloc makes it return a Series, while a range or list returns a DataFrame
If you want a DataFrame always, say something like df.iloc[ [i] ] instead of df.iloc[ i ]
If you want individual records, you might actually want that Series form, because it prints one field per line. To get that form while iterating over rows:
for labels, row_series in df.head(4).iterrows():   # itertuples() yields plain tuples instead
    print(row_series)
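The scalar-versus-list distinction, shown directly:

```python
import pandas as pd

df = pd.DataFrame({"Age": [22, 35, 58]})
print(type(df.iloc[0]).__name__)     # Series    - single position
print(type(df.iloc[[0]]).__name__)   # DataFrame - list of positions
print(type(df.iloc[0:2]).__name__)   # DataFrame - range
```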
row selection / ranges via series values:
There's also a bunch of operator overloading, so you can e.g. do
df["Age"] > 35
and get a bool Series back, so (like in numpy) you can e.g. select subsets like
df[ df["Age"] > 35 ]
When you want to filter on both rows and columns at once, you will need loc or iloc. Consider:
df.loc[df["Age"] > 35, "Name"]
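A self-contained version of that row-and-column selection (invented data, since the titanic frame isn't constructed here):

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Braund", "Allen", "Bonnell"],
                   "Age":  [22, 35, 58]})
# boolean Series picks the rows, the second argument picks the column(s)
older_names = df.loc[df["Age"] > 35, "Name"]
print(older_names.tolist())     # ['Bonnell']
```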
By rows
- .head(n)
- .tail(n)
Adding, transforming
Consider:
air_quality["london_mg_per_cubic"] = air_quality["station_london"] * 1.882
air_quality["ratio_paris_antwerp"] = air_quality["station_paris"] / air_quality["station_antwerp"]
Rename columns:
air_quality_renamed = air_quality.rename(columns={
    "station_antwerp": "BETR801",
    "station_paris":   "FR04014",
    "station_london":  "London Westminster",
})
You can also apply() a function (which gets faster when it's C rather than pure python)
apply
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.apply.html#pandas.DataFrame.apply
pipe
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.pipe.html
https://pandas.pydata.org/pandas-docs/stable/user_guide/cookbook.html#cookbook
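A small sketch of both; with_double is a hypothetical helper just for illustration:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})

# apply: run a function over each column (or each row, with axis=1)
print(df.apply(lambda col: col.max()).tolist())    # [3]

# pipe: thread the whole frame through a function, keeping chains readable
def with_double(frame):   # hypothetical helper for illustration
    return frame.assign(b=frame["a"] * 2)

print(df.pipe(with_double)["b"].tolist())          # [2, 4, 6]
```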
Grouping
Average age of passengers
titanic["Age"].mean()
Can work independently for multiple columns:
titanic[["Age", "Fare"]].mean()
Average age of passengers, split by sex:
titanic[["Sex", "Age"]].groupby("Sex").mean()
The parts:
- select the two relevant columns (to keep clutter and calculation down)
- groupby creates a specific type of object (that conceptually contains the two groups in the data - and sometimes it's useful to iterate through the groups yourself)
- mean on that object works that out for each group
If you want to summarize by something other than what you're grouping by, you might groupby more of the data and only select the relevant columns, e.g.
titanic.groupby(["Sex", "Pclass"])["Fare"].mean()
- that groupby object still contains all the data, but grouped
- selecting fare from that has just that column, still grouped the same way
(You could also do
titanic[["Fare", "Sex", "Pclass"]].groupby(["Sex", "Pclass"])["Fare"].mean()
...but it's mainly just more typing.)
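The same grouping pattern on a tiny invented frame, so it runs without the titanic data:

```python
import pandas as pd

df = pd.DataFrame({"Sex": ["male", "female", "male", "female"],
                   "Age": [20, 30, 40, 50]})
# group the two relevant columns, then take the mean per group
means = df[["Sex", "Age"]].groupby("Sex")["Age"].mean()
print(means.to_dict())     # {'female': 40.0, 'male': 30.0}
```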
---
More complex count
grp = df.groupby( ['release_year', 'rating'] ) #, as_index=False
#grp is a DataFrameGroupBy object - which seems to contain the DF, plus grouping data
# To inspect, do something like:
#for key, item in grp:
# print(grp.get_group(key), "\n")
print( "\n- group .size() (~= tuple count) -")
grp_size = grp.size() # grp_size is a series
# For a sense of what it did:
for key, val in grp_size.head(5).items():
print('%r -> %r'%(key,val))
# because it's a series, we don't (need to) hand anything to sort_values
t = grp_size.sort_values()
#reset_index makes this a DF again, (by) setting a new 0-to-#rows index
t.reset_index().rename(columns={0:'count'})
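A more compact variant of the same count: reset_index() on the size() Series accepts a name for the counts column (tiny invented data):

```python
import pandas as pd

df = pd.DataFrame({"year":   [1999, 1999, 2001],
                   "rating": ["PG", "PG", "R"]})
counts = (df.groupby(["year", "rating"])
            .size()                        # Series with a MultiIndex
            .reset_index(name="count"))    # back to a flat DataFrame
print(counts["count"].tolist())            # [2, 1]
```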
Plotting
Sorting
Structure changes
Series.ravel() [2] - returns as (1D) ndarray
stack() [pandas.DataFrame.stack]
unstack() [3]
https://www.datasciencemadesimple.com/reshape-using-stack-unstack-function-pandas-python/
pivot() [4]
pivot_table() [5]
DataFrame.melt() [6]
DataFrame.explode() [7]
DataFrame.squeeze() [8]
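As a taste of reshaping, melt() turns a wide frame into a long one (invented data):

```python
import pandas as pd

wide = pd.DataFrame({"city": ["A", "B"], "2020": [1, 2], "2021": [3, 4]})
# id_vars stay as identifying columns; the rest become (variable, value) pairs
long_df = wide.melt(id_vars="city", var_name="year", value_name="value")
print(len(long_df))              # 4 rows: one per (city, year)
print(sorted(long_df.columns))   # ['city', 'value', 'year']
```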
Sorting rows
titanic.sort_values("Age")
- does not renumber the index - the original labels travel with the rows (verify)
- doesn't affect statistics; it's mostly useful for presentation
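A quick check of that index behavior:

```python
import pandas as pd

df = pd.DataFrame({"Age": [35, 22, 58]})
out = df.sort_values("Age")
print(out.index.tolist())   # [1, 0, 2] - original labels travel with the rows
print(df.index.tolist())    # [0, 1, 2] - the original frame is untouched
```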
pivot()
stack() and unstack()
https://pandas.pydata.org/pandas-docs/stable/user_guide/reshaping.html#reshaping-reshaping
Merging, combining
Chaining functions or not
Most operations on a dataframe return a dataframe (altered, subset, etc.), so you can write a bunch of (pre)processing pretty succinctly.
(pd.read_csv(fp)
   .rename(columns=str.lower)
   .drop('unnamed: 36', axis=1)
   .pipe(extract_city_name)
   .pipe(time_to_datetime, ['dep_time', 'arr_time', 'crs_arr_time', 'crs_dep_time'])
   .assign(fl_date=lambda x: pd.to_datetime(x['fl_date']))
   .groupby(['unique_carrier', pd.Grouper(freq="H")]))
Chaining does not make debugging any simpler,
and sometimes you want one of the intermediate objects because you want to do multiple things with it and that's the easier way to write it.
https://tomaugspurger.github.io/method-chaining
In-place operations
Many alterations can also be done in-place, which can be preferable on huge dataframes
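For example, many methods accept inplace=True, in which case they alter the frame and return None:

```python
import pandas as pd

df = pd.DataFrame({"a": [1.0, None, 3.0]})
df.fillna(0, inplace=True)   # modifies df itself, returns None
print(df["a"].tolist())      # [1.0, 0.0, 3.0]
```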
Handling details
Type specifics
Time
Strings
Unsorted
Dask
You can see Dask as pandas plus...
- lazy evaluation
- parallelization, in that
- partitioning data and evaluation to multiple processes
- parallelized reading/writing,
- parallelized implementation of parts of numpy/scipy,
- multi-core or multi-node[9].
It's more of a scheduler than anything else,
so how much faster it is - or isn't - and how much harder it may be to write (and whether something is parallelizable at all)
depends entirely on what you're doing.
Where parallelization is a poor fit and the dataset easily fits in memory, dask may be a little slower than pandas.
That said, if you can design your case for dask, then it should scale better - and the small fry aren't your real issue anyway.