Python usage notes - pandas, dask

From Helpful

Pandas

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

Pandas is, in part, a bunch of convenience in

  • reading and writing data
  • structuring data
  • applying numpy-style things to data (it builds on ndarray, so numpy is applicable for things that are numeric)



Data


pandas...

  • puts things in Series
  • collects Series in DataFrames


Notes:

  • Series, and by extension DataFrames, are backed by ndarrays and implement numpy's array interface
which means you can fairly easily interact with numpy, scipy, matplotlib, etc.
note there's some basic matplotlib integration, see e.g. df.plot(), and variants via e.g. df.plot.area() / df.plot(kind='area') and similar for line, pie, scatter, bar, barh, box, density, hexbin, hist, and more
though strings work a little differently
as do mixes of types
  • DataFrames (and the pandas API in general) imitate R's data.frame
which also deals with somewhat more flexible data than numpy does - which comes up e.g. in typing details
  • There are some more things that act much like a DataFrame, but represent more complex state
see e.g. the grouping example below.
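
As a rough illustration of the numpy interaction mentioned above, here is a minimal sketch. The Series contents are made up for the example, and it only shows a few of the ways in.

```python
import numpy as np
import pandas as pd

ser = pd.Series([1.0, 4.0, 9.0])

# numpy ufuncs apply elementwise and hand back a Series (index preserved)
roots = np.sqrt(ser)

# explicit conversion to a plain ndarray
arr = ser.to_numpy()

# numpy reductions accept a Series directly
total = np.sum(ser)

print(type(roots).__name__, roots.tolist())
print(type(arr).__name__, arr.dtype)
print(total)
```

This works smoothly for numeric data. For strings and mixed types you would typically end up with object arrays, which a lot of numpy functions handle less gracefully.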


  • a Series has a .shape and a .dtype
  • a DataFrame has a .shape and .dtypes (a Series, indexed by column name, rather than a list of dtypes)
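
A small sketch of what those attributes look like in practice (the column names a and b are just placeholders for this example):

```python
import pandas as pd

ser = pd.Series([1, 2, 3])
frame = pd.DataFrame({"a": [1, 2, 3], "b": [0.5, 1.5, 2.5]})

print(ser.shape)     # a 1-tuple: (number of rows,)
print(ser.dtype)     # a single dtype for the whole Series
print(frame.shape)   # (rows, columns)
print(frame.dtypes)  # a Series: index is the column names, values are dtypes

# because .dtypes is a Series, you can look up a column's dtype by name
b_type = frame.dtypes["b"]
```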


  • While you may often read DFs from files
Series can e.g. be constructed from lists
DataFrames can e.g. be constructed from dicts, like (example from docs):
import pandas as pd

ser   = pd.Series([1, 2, 3])

# each dict key becomes a column name, each list that column's values
frame = pd.DataFrame({
  "Name": ["Braund, Mr. Owen Harris",  "Allen, Mr. William Henry",   "Bonnell, Miss. Elizabeth"],
  "Age":  [22, 35, 58],
  "Sex":  ["male", "male", "female"],
})


Series

DataFrame


On the index

On types

Because each Series is basically an ndarray, its types are mostly numpy's dtypes

  • including object, which you may not be used to seeing
  • plus some of pandas's own extension types, which it has
to add a few useful types,
to help deal with missing/NA handling (see below).


On object

extension types

On missing data


It's not unusual to have some data be missing, so the ability to mark it as such, and be able to work with such data, is useful.

How best to mark missing data, and how best to deal with it, varies per use and per type.



The forms of missing data - python side there's things like python's float nan (float('nan'), math.nan), numpy.nan (also spelled numpy.NaN before numpy 2.0 removed that alias), and None, and in python data structures you can mix and match.
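
To make the python-side behaviour concrete, a minimal sketch. The helper is_missing is a hypothetical name for this example, not something from the standard library.

```python
import math

nan = float("nan")
values = [1.0, nan, None, 3.0]   # plain lists happily mix these

# NaN never compares equal, not even to itself, so == is the wrong test
print(nan == nan)
print(math.isnan(nan))

def is_missing(v):
    """Treat both None and float NaN as missing (hypothetical helper)."""
    return v is None or (isinstance(v, float) and math.isnan(v))

present = [v for v in values if not is_missing(v)]
print(present)
```

Note that math.isnan(None) raises a TypeError, which is why the helper checks for None first.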


The forms of missing data - within numpy things are more interesting.

  • ndarrays hold a single type - and as a result can't necessarily represent missing data.
  • It can be practical to (ab)use NaN in float arrays - doesn't need a special type, and some processing can be told to ignore NaNs.
  • In many other types, e.g. integer, there is no reasonable special value
other than maybe case-specific ones like -1, but this is going to give wrong answers, rather than errors, if you forget
  • You can use object arrays, but they're just pointing at a mix of types, and a lot of numeric processing won't like that.
but it amounts to extra state you have to remember to manage.
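
The numpy points above can be sketched as follows. The values are made up, and -1 as an "unknown age" marker is just an assumed convention for the example.

```python
import numpy as np

# NaN in a float array: plain reductions propagate it, nan-aware ones skip it
floats = np.array([1.0, np.nan, 3.0])
plain_mean = np.mean(floats)    # nan
nan_mean = np.nanmean(floats)   # mean of 1.0 and 3.0

# integer arrays have no slot for NaN: assigning one raises
ints = np.array([1, 2, 3])
try:
    ints[0] = np.nan
    int_accepts_nan = True
except ValueError:
    int_accepts_nan = False

# a sentinel like -1 is silently treated as data if you forget to filter it
ages = np.array([22, 35, -1])
wrong_mean = ages.mean()             # includes the -1
right_mean = ages[ages >= 0].mean()  # only the real values

print(plain_mean, nan_mean, int_accepts_nan, wrong_mean, right_mean)
```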


The forms of missing data - within pandas it's mostly the numpy part, but with some extra details

  • Pandas introduces nullable ints (see #On_types), e.g. Int64 which differs from e.g. numpy's int64
  • Historically, pandas has represented missing data with
(numpy.)NaN for float types
NaT for datetime and timedelta types
  • object columns could end up with any of these considered-NA things (NaN, NA, None), but you may not want such a mix for other reasons
such as your confusion
and calculations not particularly working
  • Since 1.0.0 pandas has been moving towards its own (pandas.)NA instead (introduced as experimental)
NA is a singleton value that acts similarly to NaN (e.g. NA + 1 is NA, though NA**0 == 1)
but cannot be used easily/sensibly in a lot of arithmetic or logic. "Compared to np.nan, pd.NA behaves differently in certain operations. In addition to arithmetic operations, pd.NA also propagates as “missing” or “unknown” in comparison operations:"
meaning it will either propagate NA, or complain, but not do nonsense - which is probably what you want
  • isna() and notna() help detect these consistently across dtypes -- except integer (see below)
isna() and isnull() are identical. Both exist for the sake of people coming from R (where is.na and is.null both exist but are not the same)
  • You're given functions to help
.fillna() to replace NA values with a specific value,
.dropna() to drop rows with NA in any column, or in specific columns (via subset=)
... or drop columns with NA, if you give it axis=1
  • There are things that you may want to count as missing, depending on context. Consider
    • empty strings, e.g. with df.replace('', replacement) (where that replacement could be pandas.NA)
    • things like inf and -inf
  • the difference between a numpy int and pandas nullable int can be subtle. It's easy to not notice the difference between:
Age2       int64
Age3       Int64
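
A short sketch of how pandas.NA behaves, and of isna() across a few dtypes. This assumes pandas 1.0 or later. The series are made up for illustration.

```python
import numpy as np
import pandas as pd

plus = pd.NA + 1       # arithmetic propagates NA
power = pd.NA ** 0     # one of the few cases with a definite answer
compare = pd.NA == 1   # comparisons propagate NA too, unlike NaN

# NA has no truth value, so using it in an if complains rather than guessing
try:
    bool(pd.NA)
    truthy_ok = True
except TypeError:
    truthy_ok = False

# logic follows "unknown" semantics: the result is only definite when it must be
either = True | pd.NA
both = False & pd.NA

# isna() gives the same answer whichever marker a dtype uses
ser_float = pd.Series([1.0, np.nan])
ser_int = pd.Series([1, None], dtype="Int64")
ser_time = pd.Series([pd.Timestamp("2020-01-01"), pd.NaT])
flags = [s.isna().tolist() for s in (ser_float, ser_int, ser_time)]
print(plus, power, compare, truthy_ok, either, both, flags)
```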


"Except integer?"

A numpy integer cannot represent None/NaN/missing.

This is probably the main reason why pandas, given data that it knows are integers, will easily use float64 (where it can use NaN)

(keep in mind float64 stores integers accurately within the -9007199254740991..9007199254740991 range, i.e. up to 2**53 - 1)
  • and you can explicitly use its own nullable integer type, Int64 (convenient but has a few footnotes)

Relatedly, introducing NAs can promote an integer into a float64, and a boolean to an object. [1]
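
A minimal sketch of that promotion, using reindex() with a label that does not exist as one way of introducing a missing value (other operations can do the same). It also shows where float64 stops being exact for integers.

```python
import pandas as pd

ints = pd.Series([1, 2, 3])
bools = pd.Series([True, False, True])

# label 5 does not exist, so reindex has to put a missing value there
ints_na = ints.reindex([0, 1, 5])     # integer column becomes float64
bools_na = bools.reindex([0, 1, 5])   # boolean column becomes object

# the nullable Int64 type can hold the missing value without changing dtype
nullable = ints.astype("Int64").reindex([0, 1, 5])

print(ints_na.dtype, bools_na.dtype, nullable.dtype)

# beyond 2**53, neighbouring integers collapse to the same float64
limit = 2 ** 53
collides = float(limit + 1) == float(limit)
print(collides)
```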



Some examples that may help


None becomes NaN in float64:

>>> pandas.Series([1,2,None])
0    1.0
1    2.0
2    NaN
dtype: float64

None becomes NA in Int64 and string

>>> pandas.Series([1,2,None], dtype='Int64')
0       1
1       2
2    <NA>
dtype: Int64
 
>>> pandas.Series(['1','2',None], dtype='string')
0       1
1       2
2    <NA>
dtype: string

None becomes NaT in datetime64

>>> pandas.Series([1,2,None], dtype='datetime64[ns]')
0   1970-01-01 00:00:00.000000001
1   1970-01-01 00:00:00.000000002
2                             NaT
dtype: datetime64[ns]


None stays None in object (and things can become object for somewhat-uncontrolled reasons)

>>> pandas.Series(['1','2',None])
0       1
1       2
2    None
dtype: object

...but that's just because you gave it mixed types, not because object prefers None.

Consider that in the following it first becomes NA by force, then can stay NA:

>>> pandas.Series(['1','2',None], dtype='string').astype('object')
0       1
1       2
2    <NA>
dtype: object


Things happen during conversions

>>> import pandas as pd
>>> frame = pd.DataFrame({
...     "Name": ["Braund, Mr. Owen Harris",  "Allen, Mr. William Henry",   "Bonnell, Miss. Elizabeth"],
...     "Age":  [22, 35, None],
...     "Sex":  ["male", "male", "female"],
... })
>>> frame['Age2'] = frame['Age'].astype('Int64')
 
>>> frame.dtypes
Name     object
Age     float64
Sex      object
Age2      Int64
dtype: object
 
>>> frame
                       Name   Age     Sex  Age2
0   Braund, Mr. Owen Harris  22.0    male    22
1  Allen, Mr. William Henry  35.0    male    35
2  Bonnell, Miss. Elizabeth   NaN  female  <NA>
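
The helper functions mentioned earlier (.fillna(), .dropna(), .replace()) could be used on a frame like this roughly as follows. This is a sketch on a variation of the frame above: the empty Name and the Fare column with an inf are assumptions added so that every case has something to act on.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({
    "Name": ["Braund, Mr. Owen Harris", "Allen, Mr. William Henry", ""],
    "Age":  [22, 35, None],
    "Fare": [7.25, np.inf, 8.05],   # hypothetical column for this example
})

# replace NA with a specific value, here per column
filled = frame.fillna({"Age": 0})

# drop rows with NA in any column, or only looking at specific columns
rows_any = frame.dropna()
rows_age = frame.dropna(subset=["Age"])

# drop columns that contain NA
cols = frame.dropna(axis=1)

# empty strings and infinities are not NA until you say so
cleaned = frame.assign(
    Name=frame["Name"].replace("", pd.NA),
    Fare=frame["Fare"].replace([np.inf, -np.inf], np.nan),
)
print(cleaned.isna().sum())
```

All of these return new objects and leave frame itself unchanged.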



See also:


https://stackoverflow.com/questions/34913590/fillna-in-multiple-columns-in-place-in-python-pandas/34916691

Input and output

Extra parsing

Large datasets

Inspecting, cleaning, selecting, filtering

Some poking of data new to you

Sorting

Structure changes

Merging, combining

Chaining / in place

Grouping and summarizing

Handling details

Type specifics

Time

Strings

Unsorted

Dask