Python usage notes - pandas, dask


Pandas

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

Pandas is, in part, a bunch of convenience in

reading and writing data,
structuring data,
applying numpy-style things to data (its columns are backed by ndarrays, so numpy is applicable for things that are numeric)
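
A minimal sketch of that combination (the filename and column names here are made up):

import pandas as pd

# reading: parse a CSV into a DataFrame (hypothetical file and columns)
df = pd.read_csv('people.csv')    # e.g. columns: Name, Age

# structuring: each column is a Series with its own dtype
ages = df['Age']

# numpy-style things: vectorized arithmetic and reductions
print((ages * 2).mean())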



Data

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.


pandas imitates R data.frame, and

puts any one variable in a Series
collects Series in a DataFrame


Notes:

  • This works out as more flexible around tabular data than e.g. just numpy would be
in part because, while a numpy matrix has one type, each series in a dataframe has its own type, so it can group more interesting data in a single structured object
  • series, and by extension DFs, act much like ndarrays, which means you can still interact with numpy, scipy, matplotlib, etc.
though strings work a little differently
as do mixes of types (also consider pandas's tendency to go for the object type)
there's some basic matplotlib integration (see e.g. df.plot(), and variants via e.g. df.plot.area() / df.plot(kind='area') and similar for line, pie, scatter, bar, barh, box, density, hexbin, hist, and more; a small sketch below)
  • There are some more things that act much like a DF, but represent more complex state
see e.g. the grouping example below.
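
That matplotlib integration, as a minimal sketch (assumes matplotlib is installed; the tiny frame is made up):

import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({'a': [1, 2, 3], 'b': [3, 2, 1]})
df.plot(kind='bar')    # equivalently df.plot.bar(); returns a matplotlib Axes
plt.show()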


  • series has
a .shape
a .dtype
  • dataframe has
a .shape
a .dtypes (a Series rather than a list of dtypes, for some reason)
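
For example:

import pandas as pd

ser = pd.Series([1, 2, 3])
print(ser.shape)     # (3,)
print(ser.dtype)     # int64

frame = pd.DataFrame({'a': [1, 2, 3], 'b': ['x', 'y', 'z']})
print(frame.shape)   # (3, 2)
print(frame.dtypes)  # a Series:  a int64,  b object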


  • DFs are often read from files
...but you can also
construct Series e.g. from lists
construct DataFrames e.g. from dicts
like (example from docs):
ser   = pd.Series([1,2,3])

frame = pd.DataFrame({
  "Name": ["Braund, Mr. Owen Harris",  "Allen, Mr. William Henry",   "Bonnell, Miss. Elizabeth"],
  "Age":  [22, 35, 58],
  "Sex": ["male", "male", "female"]
})


On the index

On types

Because each series is basically an ndarray, most dtypes are just numpy dtypes

...but there's a little more, and you're probably not used to seeing object
  • plus pandas has some extension types of its own, which it adds
to provide a few useful types,
to help deal with missing/NA handling (see below).
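
For example, contrast the fallback object dtype with the string extension dtype:

import pandas as pd

s_obj = pd.Series(['a', 'b', None])                  # falls back to object
s_str = pd.Series(['a', 'b', None], dtype='string')  # pandas extension type
print(s_obj.dtype)   # object   (None stays None)
print(s_str.dtype)   # string   (None becomes <NA>)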


On object

extension types

On missing data

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

It's not unusual to have some data be missing.

So it's useful to mark that specifically.


How best to mark it, and how best to deal with it, varies per use and per type.



The forms of missing data - on the python side there are things like python's float('nan'), numpy.NaN, and None, and in python data structures you can mix and match.


The forms of missing data - within numpy things are more interesting.

  • ndarrays work with a single type per array - and as a result, whether there is already a special value to represent missing data depends on the type
  • It can be practical to (ab)use NaN in float arrays - it doesn't need a special type, and some processing can already be told to ignore and/or specifically complain about NaNs (a quick sketch below)
  • In many other types, e.g. integer, there is no reasonable special value
other than maybe case-specific sentinels like -1, but this is going to give wrong answers, rather than errors, if you forget
  • You can use object arrays, but they're just pointing at a mix of types, a lot of numeric processing won't like that, and it amounts to extra state you have to remember to manage.
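
For example, on the numpy side:

import numpy as np

a = np.array([1.0, 2.0, np.nan])
print(a.mean())        # nan  - NaN propagates by default
print(np.nanmean(a))   # 1.5  - the nan-ignoring variants skip it
print(np.isnan(a))     # [False False  True]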


The forms of missing data - within pandas it's mostly the numpy part, but with some extra details

  • Pandas introduced nullable ints (see #On_types), e.g. Int64 which differs from e.g. numpy's int64
  • object columns could end up with any of these considered-NA things (NaN, NA, None), but you may not want such a mix for other reasons
such as your confusion
and calculations not particularly working
  • Historically, pandas has represented missing data with
(numpy.)NaN for float types
NaT for datetime (verify)
  • Since around 1.0.0 pandas is trying to move to its (pandas.)NA instead (verify)
NA itself is a type that does act similarly to nan (e.g. NA + 1 is NA, NA**0==1)
but cannot be used easily/sensibly in a lot of arithmetic or logic. "Compared to np.nan, pd.NA behaves differently in certain operations. In addition to arithmetic operations, pd.NA also propagates as “missing” or “unknown” in comparison operations:"
meaning it will either propagate NA, or complain, but not do nonsense - which is probably what you want
  • isna(), notna() help detect these consistently across dtypes -- except integer (see below)
isna() and isnull() are identical. Both exist for the benefit of people coming from R (where both exist but are not the same)
  • You're given functions to help (a short sketch follows this list)
.fillna() to replace NA values with a specific value,
.dropna() to drop rows with NA in any column, or in specific columns
... or drop columns with NA, if you give it axis=1
  • There are things that you may want to count as missing, depending on context. Consider
    • empty strings, e.g. with df.replace('', replacement) (where that replacement could be pandas.NA)
    • things like inf and -inf
  • the difference between a numpy int and pandas nullable int can be subtle. It's easy to not notice the difference between:
Age2       int64
Age3       Int64
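
A short sketch of those NA helpers, on a made-up frame:

import pandas as pd

df = pd.DataFrame({'a': [1.0, None, 3.0], 'b': ['x', 'y', None]})

print(df.isna().sum())          # NA count per column
print(df.fillna({'a': 0}))      # fill NA in column a with 0
print(df.dropna())              # drop rows with NA in any column
print(df.dropna(subset=['a']))  # drop rows with NA in specific columns
print(df.dropna(axis=1))        # drop columns that contain NA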


"Except integer?"

A numpy integer cannot represent None/NaN/missing.

This is probably the main reason why pandas, given data that it knows are integers, will easily use float64 (where it can use NaN)
(keep in mind float64 stores integers accurately within the -9007199254740991..9007199254740991 range)

...and you can explicitly use pandas's own nullable integer type, Int64 (convenient, but it has a few footnotes)


Relatedly, introducing NAs can promote an integer column into a float64, and a boolean into an object. [1]



Some examples that may help


None becomes NaN in float64:

>>> pandas.Series([1,2,None])
0    1.0
1    2.0
2    NaN
dtype: float64

None becomes NA in Int64 and string

>>> pandas.Series([1,2,None], dtype='Int64')
0       1
1       2
2    <NA>
dtype: Int64

>>> pandas.Series(['1','2',None], dtype='string')
0       1
1       2
2    <NA>
dtype: string

None becomes NaT in datetime64

>>> pandas.Series([1,2,None], dtype='datetime64[ns]')
0   1970-01-01 00:00:00.000000001
1   1970-01-01 00:00:00.000000002
2                             NaT
dtype: datetime64[ns]


None stays None in object (and things can become object for somewhat-uncontrolled reasons)

>>> pandas.Series(['1','2',None])
0       1
1       2
2    None
dtype: object

...but that's just because you gave it mixed types, not because object prefers None.

Consider that in the following it first becomes NA by force, then can stay NA:

>>> pandas.Series(['1','2',None], dtype='string').astype('object')
0       1
1       2
2    <NA>
dtype: object


Things happen during conversions

>>> frame = pd.DataFrame({
    "Name": ["Braund, Mr. Owen Harris",  "Allen, Mr. William Henry",   "Bonnell, Miss. Elizabeth"],
    "Age":  [22, 35, None],

    "Sex": ["male", "male", "female"],
})

>>> frame['Age2'] = frame['Age'].astype('Int64')

>>> frame.dtypes
Name     object
Age     float64
Sex      object
Age2      Int64

>>> frame
                       Name   Age     Sex  Age2
0   Braund, Mr. Owen Harris  22.0    male    22
1  Allen, Mr. William Henry  35.0    male    35
2  Bonnell, Miss. Elizabeth   NaN  female  <NA>



See also:


https://stackoverflow.com/questions/34913590/fillna-in-multiple-columns-in-place-in-python-pandas/34916691

Input and output

Extra parsing

Large datasets

Inspecting, cleaning, selecting, filtering

Some poking of data new to you

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

If using a parser like read_csv, you might care to know what type pandas guessed

df.dtypes

Cardinality of each column (amount of distinct values in it)

df.apply( pandas.Series.nunique )

How many bad/missing values (NaN, None, NaT), per column

df.isna().sum()

Amount of empty strings, per column

df.isin(['']).sum()         # note: isin has some more complex uses


Show all rows with NA somewhere

df[df.isna().any(axis=1)]



Common and uncommon values per column, to see if there's specific weirdness to fish out

for column in df.columns:
    print( df[column].value_counts() )

...note that that doesn't count NA. If you really need that, try something like

for column in df.columns:
    print( df[[column]].groupby(column, dropna=False, as_index=False).size() )

Similarly, if you want to figure out what kind of things an object column is pointing to, something like (TODO: see if there is a more direct way of doing that)

pandas.Series( [ type(val) for val in df['rating']] ).value_counts()


Look at distribution of numeric rows

df.describe()




Look at common and uncommon values for a single column with value_counts

df['colname'].value_counts()

(or per column for all columns, using the loop shown earlier)


Say the above revealed some weird substrings in a specific column, and you want to see the records they came from

You might write

bads = df[ df['rating'].str.contains('badstr') ]

You'll likely find it saying:

ValueError: cannot index with vector containing NA / NaN values

...because contains() will pass through NA or NaN (and functions like it may too), which doesn't work in the selection around it. In this case, what you'd want is

bads = df[ df['rating'].str.contains('badstr',na=False) ]


To invert a df/series item-wise, you can't use a not (that's a python boolean context), but ~ works, e.g.

df = df[ ~df['rating'].str.contains('min',na=False) ]


See if Dataframe dtypes seem reasonable (e.g. see whether any alterations changed things as expected)

df.dtypes


Since it is generally more useful to select columns than rows, [] selects columns, say

df["Age"]


row selection / ranges via the index: loc and iloc


loc gets rows (and/or columns) from the index by label value.

...and happens to be inclusive

iloc gets rows (and/or columns) from the index by integer position.

...which will be exclusive, because integer ranges

to basically steal an example from here, if you have a series like

s = pd.Series(list("abcdefg"), index=[49, 48, 47, 0, 1, 2,2])

>>> s.iloc[0:3] # first three, by position    note: head() is a little simpler
49    a
48    b
47    c

>>> s.loc[1:2] # index values 1 through 2
1    e
2    f
2    g

Note that handing a single value to loc/iloc will make it return a Series, while a range returns a DataFrame

If you want a DataFrame always, say something like df.iloc[ [i] ] instead of df.iloc[ i ]
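
For example:

import pandas as pd

df = pd.DataFrame({'Age': [22, 35, 58]})
print(type(df.iloc[0]))    # Series     (single position)
print(type(df.iloc[0:2]))  # DataFrame  (range)
print(type(df.iloc[[0]]))  # DataFrame  (list of positions, even just one)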


If you want individual records, you might actually want that Series form, because of how it prints. To get that longer form you might like something like

for labels, row_series in df.head(4).iterrows():  # same as iteritems?    You can also get plain tuples via itertuples()
  print(row_series)



row selection / ranges via series values:


There's also a bunch of operator overloading, so you can e.g. do

df["Age"] > 35

and get a bool Series back, so (like in numpy) you can e.g. select subsets like

df[ df["Age"] > 35 ]



https://pandas.pydata.org/docs/reference/api/pandas.Series.str.contains.html#pandas-series-str-contains




When you want to filter on rows and select columns at the same time, you will need loc or iloc. Consider:

df.loc[df["Age"] > 35, "Name"]


By rows

.head(n)
.tail(n)




Adding, transforming

Consider:

air_quality["london_mg_per_cubic"] = air_quality["station_london"] * 1.882

air_quality["ratio_paris_antwerp"] = air_quality["station_paris"] / air_quality["station_antwerp"]


Rename columns:

air_quality_renamed = air_quality.rename(
 columns={"station_antwerp": "BETR801",
 "station_paris": "FR04014",
 "station_london": "London Westminster"})


You can also apply() a function (which gets faster when it's C rather than pure python)


apply

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.apply.html#pandas.DataFrame.apply
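
A minimal sketch of apply, on a made-up frame (per column by default, per row with axis=1, element-wise on a Series):

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})

print(df.apply(np.sum))                                   # per column: a 6, b 60
print(df.apply(lambda row: row['a'] + row['b'], axis=1))  # per row
print(df['a'].apply(lambda v: v * 2))                     # element-wise on a Series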


pipe

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.pipe.html
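
And a small sketch of pipe, which mostly exists so that your own functions read naturally in a method chain (the helper here is made up):

import pandas as pd

def keep_adults(df, min_age=18):
    # made-up helper: keep rows at or above min_age
    return df[df['Age'] >= min_age]

df = pd.DataFrame({'Age': [12, 22, 35]})
print(df.pipe(keep_adults, min_age=18))   # same as keep_adults(df, min_age=18)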



https://pandas.pydata.org/pandas-docs/stable/user_guide/cookbook.html#cookbook



Grouping

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.



Average age of passengers

titanic["Age"].mean()

Can work independently for multiple columns:

titanic[["Age", "Fare"]].mean()


Average age of passengers, split by sex:

titanic[["Sex", "Age"]].groupby("Sex").mean()

The parts:

select the two relevant columns (to keep clutter and calculation down)
groupby creates a specific type of object (that conceptually contains the two groups in the data - and sometimes it's useful to iterate through the groups yourself)
mean on that object works that out for each group



If you want to summarize something other than what you're grouping by, you might groupby more of the data and only select the relevant columns, e.g.

titanic.groupby(["Sex", "Pclass"])["Fare"].mean()
that groupby object still contains all the data, but grouped
selecting fare from that has just that column, still grouped the same way


(You could also do

titanic[["Fare", "Sex", "Pclass"]].groupby(["Sex", "Pclass"])["Fare"].mean()

...but it's mainly just more typing.)



https://pandas.pydata.org/pandas-docs/stable/getting_started/intro_tutorials/06_calculate_statistics.html

---

More complex count

grp = df.groupby( ['release_year', 'rating'] ) #, as_index=False 

#grp is a DataFrameGroupBy object - which seems to contain the DF, plus grouping data
# To inspect, do something like:
#for key, item in grp: 
#    print(grp.get_group(key), "\n")
 
print( "\n- group .size()   (~= tuple count) -")
 
grp_size = grp.size()  # grp_size is a series
# For a sense of what it did:
for key, val in grp_size.head(5).items():
    print('%r ->  %r'%(key,val))
     
# because it's a series, we don't (need to) hand anything to  sort_values
t = grp_size.sort_values()
#reset_index makes this a DF again, (by) setting a new 0-to-#rows index
t.reset_index().rename(columns={0:'count'})



Plotting

Sorting

Structure changes

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.


Series.ravel() [2] - returns as (1D) ndarray


stack() [pandas.DataFrame.stack]

unstack() [3]

https://www.datasciencemadesimple.com/reshape-using-stack-unstack-function-pandas-python/


pivot() [4]

pivot_table() [5]


DataFrame.melt() [6]

DataFrame.explode() [7]

DataFrame.squeeze() [8]
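
Quick sketches of two of the less obvious ones - melt goes from wide to long, explode expands list-like cells into rows (made-up frames):

import pandas as pd

wide = pd.DataFrame({'name': ['a', 'b'], 'x': [1, 2], 'y': [3, 4]})
print(wide.melt(id_vars='name', value_vars=['x', 'y']))
#   name variable  value
# 0    a        x      1
# 1    b        x      2
# 2    a        y      3
# 3    b        y      4

lists = pd.DataFrame({'name': ['a', 'b'], 'tags': [['x', 'y'], ['z']]})
print(lists.explode('tags'))
#   name tags
# 0    a    x
# 0    a    y
# 1    b    z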


Sorting rows

titanic.sort_values(by="Age")

Does not renumber the index - the original index values travel with the rows(verify)

Doesn't affect statistics; it's mostly for presentation
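
For example (made-up frame):

import pandas as pd

df = pd.DataFrame({'Name': ['a', 'b', 'c'], 'Age': [35, 22, 58]})
print(df.sort_values(by='Age'))                   # ascending by default
print(df.sort_values(by='Age', ascending=False))
# the original index values travel with the rows; use .reset_index(drop=True) if you want 0..n-1 again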


pivot()


stack() and unstack() 



https://pandas.pydata.org/pandas-docs/stable/getting_started/intro_tutorials/07_reshape_table_layout.html

https://pandas.pydata.org/pandas-docs/stable/user_guide/reshaping.html#reshaping-reshaping

Merging, combining

Chaining functions or not

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.


Most operations on a dataframe return a dataframe (altered, subset, etc.), so you can write a bunch of (pre)processing pretty succinctly.

(pd.read_csv(fp)
   .rename(columns=str.lower)
   .drop('unnamed: 36', axis=1)
   .pipe(extract_city_name)
   .pipe(time_to_datetime, ['dep_time', 'arr_time', 'crs_arr_time', 'crs_dep_time'])
   .assign(fl_date=lambda x: pd.to_datetime(x['fl_date']))
   .groupby(['unique_carrier', pd.TimeGrouper("H")]))


Chaining does not make debugging any simpler, and sometimes you want one of the intermediate objects because you want to do multiple things with it and that's the easier way to write it.


https://tomaugspurger.github.io/method-chaining


In-place operations

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

Many alterations can also be done in-place (many methods take inplace=True), which can be preferable on huge dataframes.
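
For example:

import pandas as pd

df = pd.DataFrame({'a': [1.0, None, 3.0]})

df2 = df.fillna(0)           # returns a new DataFrame; df itself is unchanged
df.fillna(0, inplace=True)   # alters df itself, and returns None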

Handling details

Type specifics

Time

Strings

Unsorted

Dask

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.


You can see Dask as pandas plus...

  • lazy evaluation
  • parallelization, in that it
    • partitions data and evaluation over multiple processes
    • parallelizes reading/writing,
    • has parallelized implementations of parts of numpy/scipy,
whether multi-core or multi-node[9].


It's more of a scheduler than anything else, so how much faster it is - or isn't - and how much harder it may be to write (and whether something is parallelizable at all) depends entirely on what you're doing.


Where parallelization is a poor fit and datasets easily fit in memory, dask may be a little slower than pandas.

That said, if you can design your case for dask, then it should scale better - and the small fry aren't your real issue anyway.
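
A minimal sketch (the file pattern and column names are made up); note that nothing is read or computed until .compute():

import dask.dataframe as dd

# many CSVs as one lazy, partitioned dataframe
ddf = dd.read_csv('logs-2024-*.csv')

# builds a task graph, using a mostly pandas-like API
result = ddf.groupby('status')['bytes'].mean()

# only now is work actually scheduled and run
print(result.compute())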



https://docs.dask.org/en/stable/