Commit c56cd197 authored by Joachim Krois
updated lessons for workshop

parent 75bd5425
%% Cell type:markdown id: tags:
# Applied Data Analysis I - The Basics
%% Cell type:markdown id: tags:
* _function_ $\to$ `OBJECT = pd.function_name(arg1, arg2, ...)`
* _method_ $\to$ `OBJECT.method_name(arg1, arg2, ...)`
* _attribute_ $\to$ `OBJECT.attribute` $\qquad$ _Note that the attribute is accessed without parentheses_
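%% Cell type:markdown id: tags:
A minimal sketch of the three call patterns listed above. The name `s_demo` and its values are made up for illustration and are not part of the lesson data.
%% Cell type:code id: tags:
```python
import pandas as pd

# function: called on the library, returns a new object
s_demo = pd.Series([3, 1, 2], name="demo")  # illustrative data only

# method: called on the object, with parentheses
total = s_demo.sum()

# attribute: read from the object, without parentheses
n_values = s_demo.size

print(total, n_values)
```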
%% Cell type:markdown id: tags:
# The `pandas` library
%% Cell type:code id: tags:
``` python
import pandas as pd
```
%% Cell type:markdown id: tags:
`pandas` is like `numpy`, but with labeled rows and columns
%% Cell type:markdown id: tags:
`pandas` provides two core data structures: the one-dimensional `pd.Series` object and the two-dimensional `pd.DataFrame` object
%% Cell type:markdown id: tags:
***
## The `pd.Series` object
* _function_ $\to$ `OBJECT = pd.function_name(arg1, arg2, ...)`
%% Cell type:raw id: tags:
??pd.Series
%% Cell type:markdown id: tags:
`pd.Series(data=None, index=None, dtype=None, name=None, copy=False, fastpath=False)`
%% Cell type:code id: tags:
``` python
from numpy import random
random.seed(123)
my_data = random.randint(low=-10, high=10, size=26)  # 26 integers from -10 up to (but not including) 10
my_data
```
%% Output
array([ 3, -8, -8, -4, 7, 9, 0, -9, -10, 7, 5, -1, -10,
4, -10, 5, 9, 4, -6, -10, 6, -6, 7, -7, -8, -3])
%% Cell type:code id: tags:
``` python
s = pd.Series(data=my_data, name="my_pandas_series")
s
```
%% Output
0 3
1 -8
2 -8
3 -4
4 7
5 9
6 0
7 -9
8 -10
9 7
10 5
11 -1
12 -10
13 4
14 -10
15 5
16 9
17 4
18 -6
19 -10
20 6
21 -6
22 7
23 -7
24 -8
25 -3
Name: my_pandas_series, dtype: int32
%% Cell type:markdown id: tags:
**Element-wise arithmetic**
%% Cell type:code id: tags:
``` python
s*0.1
```
%% Output
0 0.3
1 -0.8
2 -0.8
3 -0.4
4 0.7
5 0.9
6 0.0
7 -0.9
8 -1.0
9 0.7
10 0.5
11 -0.1
12 -1.0
13 0.4
14 -1.0
15 0.5
16 0.9
17 0.4
18 -0.6
19 -1.0
20 0.6
21 -0.6
22 0.7
23 -0.7
24 -0.8
25 -0.3
Name: my_pandas_series, dtype: float64
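%% Cell type:markdown id: tags:
Element-wise arithmetic also works between two `pd.Series` objects, in which case values are matched by index label rather than by position. The sketch below uses two small, made-up series (`a` and `b` are illustrative names) to show both cases.
%% Cell type:code id: tags:
```python
import pandas as pd

# two illustrative series with the same labels in a different order
a = pd.Series([1, 2, 3], index=["x", "y", "z"])
b = pd.Series([10, 20, 30], index=["z", "y", "x"])

scaled = a * 0.1   # the scalar is applied to every element
summed = a + b     # values are aligned on their labels before adding

print(summed["x"], summed["y"], summed["z"])
```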
%% Cell type:markdown id: tags:
***
### `pd.Series` attributes
* _attribute_ $\to$ `OBJECT.attribute` $\qquad$ _Note that the attribute is accessed without parentheses_
%% Cell type:code id: tags:
``` python
s.dtypes
```
%% Output
dtype('int32')
%% Cell type:code id: tags:
``` python
s.index
```
%% Output
RangeIndex(start=0, stop=26, step=1)
%% Cell type:markdown id: tags:
***
### Selection and slicing by index
%% Cell type:code id: tags:
``` python
s[2]
```
%% Output
-8
%% Cell type:code id: tags:
``` python
s[2:6]
```
%% Output
2 -8
3 -4
4 7
5 9
Name: my_pandas_series, dtype: int32
%% Cell type:markdown id: tags:
#### Challenge:
> Change the index to (arbitrary) letters of the alphabet
%% Cell type:code id: tags:
``` python
import string
letters = string.ascii_uppercase
letters
```
%% Output
'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
%% Cell type:code id: tags:
``` python
s.index = list(letters)
s
```
%% Output
A 3
B -8
C -8
D -4
E 7
F 9
G 0
H -9
I -10
J 7
K 5
L -1
M -10
N 4
O -10
P 5
Q 9
R 4
S -6
T -10
U 6
V -6
W 7
X -7
Y -8
Z -3
Name: my_pandas_series, dtype: int32
%% Cell type:code id: tags:
``` python
s.index
```
%% Output
Index(['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N',
'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z'],
dtype='object')
%% Cell type:code id: tags:
``` python
s["C"]
```
%% Output
-8
%% Cell type:code id: tags:
``` python
s["C":"K"]
```
%% Output
C -8
D -4
E 7
F 9
G 0
H -9
I -10
J 7
K 5
Name: my_pandas_series, dtype: int32
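%% Cell type:markdown id: tags:
Note the difference between the two slices shown above: a positional slice excludes its stop value, while a label slice includes it. A small sketch with made-up data (`s_demo` is an illustrative name), using the explicit `.iloc[]` and `.loc[]` indexers:
%% Cell type:code id: tags:
```python
import pandas as pd

s_demo = pd.Series([3, -8, -8, -4, 7], index=list("ABCDE"))  # illustrative data

by_position = s_demo.iloc[1:3]   # positions 1 and 2; the stop position is excluded
by_label = s_demo.loc["B":"D"]   # labels B, C and D; the stop label is included

print(list(by_position.index), list(by_label.index))
```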
%% Cell type:markdown id: tags:
***
### `pd.Series` methods
* _method_ $\to$ `OBJECT.method_name(arg1, arg2, ...)`
%% Cell type:code id: tags:
``` python
s
```
%% Output
A 3
B -8
C -8
D -4
E 7
F 9
G 0
H -9
I -10
J 7
K 5
L -1
M -10
N 4
O -10
P 5
Q 9
R 4
S -6
T -10
U 6
V -6
W 7
X -7
Y -8
Z -3
Name: my_pandas_series, dtype: int32
%% Cell type:code id: tags:
``` python
s.sum()
```
%% Output
-34
%% Cell type:code id: tags:
``` python
s.mean()
```
%% Output
-1.3076923076923077
%% Cell type:code id: tags:
``` python
s.max()
```
%% Output
9
%% Cell type:code id: tags:
``` python
s.min()
```
%% Output
-10
%% Cell type:code id: tags:
``` python
s.median()
```
%% Output
-2.0
%% Cell type:code id: tags:
``` python
s.quantile(q=0.5)
```
%% Output
-2.0
%% Cell type:code id: tags:
``` python
s.quantile(q=[0.25, 0.5, 0.75])
```
%% Output
0.25 -8.0
0.50 -2.0
0.75 5.0
Name: my_pandas_series, dtype: float64
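%% Cell type:markdown id: tags:
The individual statistics above can also be requested together. The sketch below, on made-up data (`s_demo` is an illustrative name), uses `describe()` for a standard summary and `agg()` with a list of method names for a custom one.
%% Cell type:code id: tags:
```python
import pandas as pd

s_demo = pd.Series([1, 2, 3, 4, 5], name="demo")  # illustrative data

summary = s_demo.describe()                     # count, mean, std, min, quartiles, max
picked = s_demo.agg(["min", "median", "max"])   # several statistics selected by name

print(summary["50%"], picked["median"])
```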
%% Cell type:markdown id: tags:
***
## The `pd.DataFrame` object
%% Cell type:code id: tags:
``` python
from IPython.display import IFrame
IFrame("http://duelingdata.blogspot.de/2016/01/the-beatles.html", width="100%", height=400)
```
%% Output
<IPython.lib.display.IFrame at 0x8ca6ba8>
%% Cell type:markdown id: tags:
* _function_ $\to$ `OBJECT = pd.function_name(arg1, arg2, ...)`
%% Cell type:code id: tags:
``` python
df = pd.DataFrame({"id" : range(1,5),
"Name" : ["John", "Paul", "George", "Ringo"],
"Last Name" : ["Lennon", "McCartney", "Harrison", "Star"],
"dead" : [True, False, True, False],
"year_born" : [1940, 1942, 1943, 1940],
"no_of_songs" : [62, 58, 24, 3]
})
df
```
%% Output
Last Name Name dead id no_of_songs year_born
0 Lennon John True 1 62 1940
1 McCartney Paul False 2 58 1942
2 Harrison George True 3 24 1943
3 Star Ringo False 4 3 1940
%% Cell type:markdown id: tags:
***
### `pd.DataFrame` attributes
* _attribute_ $\to$ `OBJECT.attribute`
%% Cell type:code id: tags:
``` python
df.dtypes
```
%% Output
Last Name object
Name object
dead bool
id int64
no_of_songs int64
year_born int64
dtype: object
%% Cell type:code id: tags:
``` python
# axis 0: the row index
df.index
```
%% Output
RangeIndex(start=0, stop=4, step=1)
%% Cell type:code id: tags:
``` python
df.set_index("id")
```
%% Output
Last Name Name dead no_of_songs year_born
id
1 Lennon John True 62 1940
2 McCartney Paul False 58 1942
3 Harrison George True 24 1943
4 Star Ringo False 3 1940
%% Cell type:code id: tags:
``` python
df
```
%% Output
Last Name Name dead id no_of_songs year_born
0 Lennon John True 1 62 1940
1 McCartney Paul False 2 58 1942
2 Harrison George True 3 24 1943
3 Star Ringo False 4 3 1940
%% Cell type:markdown id: tags:
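`df.set_index("id")` returns a new `DataFrame` and leaves `df` unchanged. To keep the new index, either modify `df` in place with
`df.set_index("id", inplace=True)`
or reassign the result:
`df = df.set_index("id")`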
%% Cell type:code id: tags:
``` python
df.set_index("id", inplace=True)
```
%% Cell type:code id: tags:
``` python
df
```
%% Output
Last Name Name dead no_of_songs year_born
id
1 Lennon John True 62 1940
2 McCartney Paul False 58 1942
3 Harrison George True 24 1943
4 Star Ringo False 3 1940
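%% Cell type:markdown id: tags:
A minimal sketch of why the plain `set_index` call earlier did not change `df`: the method returns a new object. The frame `df_demo` is a made-up, shortened stand-in for the lesson data.
%% Cell type:code id: tags:
```python
import pandas as pd

df_demo = pd.DataFrame({"id": [1, 2], "Name": ["John", "Paul"]})  # illustrative data

indexed = df_demo.set_index("id")   # new DataFrame; df_demo keeps "id" as a column
restored = indexed.reset_index()    # moves the index back into a regular column

print(list(df_demo.columns), list(indexed.columns), list(restored.columns))
```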
%% Cell type:code id: tags:
``` python
# axis 1: the column index
df.columns
```
%% Output
Index(['Last Name', 'Name', 'dead', 'no_of_songs', 'year_born'], dtype='object')
%% Cell type:markdown id: tags:
***
### Selection and slicing by indices
%% Cell type:markdown id: tags:
**Column index**
%% Cell type:code id: tags:
``` python
df["Name"]
```
%% Output
id
1 John
2 Paul
3 George
4 Ringo
Name: Name, dtype: object
%% Cell type:code id: tags:
``` python
df[["Name", "Last Name"]]
```
%% Output
Name Last Name
id
1 John Lennon
2 Paul McCartney
3 George Harrison
4 Ringo Star
%% Cell type:code id: tags:
``` python
df.dead
```
%% Output
id
1 True
2 False
3 True
4 False
Name: dead, dtype: bool
%% Cell type:markdown id: tags:
**Row index**
`.loc[]` selects rows by index label, `.iloc[]` selects rows by integer position
%% Cell type:code id: tags:
``` python
df.loc[1]
```
%% Output
Last Name Lennon
Name John
dead True
no_of_songs 62
year_born 1940
Name: 1, dtype: object
%% Cell type:code id: tags:
``` python
df.iloc[0]
```
%% Output
Last Name Lennon
Name John
dead True
no_of_songs 62
year_born 1940
Name: 1, dtype: object
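%% Cell type:markdown id: tags:
Both calls above return the same row because label `1` sits at position `0`. The sketch below, on a made-up frame (`df_demo` is an illustrative name), passes the same number to both indexers to make the difference visible.
%% Cell type:code id: tags:
```python
import pandas as pd

# illustrative data with an index that starts at 1, as in the lesson
df_demo = pd.DataFrame({"Name": ["John", "Paul", "George"]}, index=[1, 2, 3])

by_label = df_demo.loc[1, "Name"]    # row whose label is 1
by_position = df_demo.iloc[1, 0]     # second row, first column

print(by_label, by_position)
```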
%% Cell type:markdown id: tags:
**Row and column indices**
`df.loc[row, col]`
%% Cell type:code id: tags:
``` python
df.loc[1, "Last Name"]
```
%% Output
'Lennon'
%% Cell type:code id: tags:
``` python
df.loc[2:4, ["Name", "dead"]]
```
%% Output
Name dead
id
2 Paul False
3 George True
4 Ringo False
%% Cell type:markdown id: tags:
**Logical indexing**
%% Cell type:code id: tags:
``` python
df
```
%% Output
Last Name Name dead no_of_songs year_born
id
1 Lennon John True 62 1940
2 McCartney Paul False 58 1942
3 Harrison George True 24 1943
4 Star Ringo False 3 1940
%% Cell type:code id: tags:
``` python
df.no_of_songs > 50
```
%% Output
id
1 True
2 True
3 False
4 False
Name: no_of_songs, dtype: bool
%% Cell type:code id: tags:
``` python
df.loc[df.no_of_songs > 50]
```
%% Output
Last Name Name dead no_of_songs year_born
id
1 Lennon John True 62 1940
2 McCartney Paul False 58 1942
%% Cell type:code id: tags:
``` python
df.loc[(df.no_of_songs > 50) & (df.year_born >= 1942)]
```
%% Output
Last Name Name dead no_of_songs year_born
id
2 McCartney Paul False 58 1942
%% Cell type:code id: tags:
``` python
df.loc[(df.no_of_songs > 50) & (df.year_born >= 1942), ["Last Name", "Name"]]
```
%% Output
Last Name Name
id
2 McCartney Paul
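%% Cell type:markdown id: tags:
Besides `&` (and), boolean masks can be combined with `|` (or), negated with `~`, and built from a list of values with `isin()`. A small sketch on a made-up copy of the relevant columns (`df_demo` is an illustrative name); each comparison needs its own parentheses:
%% Cell type:code id: tags:
```python
import pandas as pd

df_demo = pd.DataFrame({
    "Name": ["John", "Paul", "George", "Ringo"],
    "no_of_songs": [62, 58, 24, 3],
    "year_born": [1940, 1942, 1943, 1940],
})

# rows where at least one condition holds
either = df_demo.loc[(df_demo.no_of_songs > 50) | (df_demo.year_born == 1943)]

# rows where the condition does not hold
negated = df_demo.loc[~(df_demo.no_of_songs > 50)]

# rows whose value appears in a list
members = df_demo.loc[df_demo.Name.isin(["John", "Ringo"])]

print(list(either.Name), list(negated.Name), list(members.Name))
```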
%% Cell type:markdown id: tags:
***
### Manipulating columns, rows and particular entries
%% Cell type:markdown id: tags:
**Add a row to the data set**
%% Cell type:code id: tags:
``` python
from numpy import nan
df.loc[5] = ["Mouse", "Mickey", nan, nan, 1928]
df
```
%% Output
Last Name Name dead no_of_songs year_born
id
1 Lennon John 1.0 62.0 1940
2 McCartney Paul 0.0 58.0 1942
3 Harrison George 1.0 24.0 1943
4 Star Ringo 0.0 3.0 1940
5 Mouse Mickey NaN NaN 1928
%% Cell type:code id: tags:
``` python
df.dtypes
```
%% Output
Last Name object
Name object
dead float64
no_of_songs float64
year_born int64
dtype: object
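%% Cell type:markdown id: tags:
The `dead` and `no_of_songs` columns changed to `float64` because `NaN` is a floating-point value. The sketch below reproduces this on made-up series and, as one possible alternative, uses the nullable `"Int64"` dtype, which assumes a pandas version that provides it (0.24 or later).
%% Cell type:code id: tags:
```python
import numpy as np
import pandas as pd

ints = pd.Series([62, 58, 24, 3])                           # integer dtype
with_gap = pd.Series([62, 58, 24, 3, np.nan])               # NaN forces float64
nullable = pd.Series([62, 58, 24, 3, None], dtype="Int64")  # stays integer, gap shown as <NA>

print(ints.dtype, with_gap.dtype, nullable.dtype)
```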
%% Cell type:markdown id: tags:
**Add a column to the data set**
%% Cell type:code id: tags:
``` python
now = pd.Timestamp.today().year  # pd.datetime is no longer available in current pandas versions
now
```
%% Output
2017
%% Cell type:code id: tags:
``` python
df["age"] = now - df.year_born
df
```
%% Output
Last Name Name dead no_of_songs year_born age
id
1 Lennon John 1.0 62.0 1940 77
2 McCartney Paul 0.0 58.0 1942 75
3 Harrison George 1.0 24.0 1943 74
4 Star Ringo 0.0 3.0 1940 77
5 Mouse Mickey NaN NaN 1928 89
%% Cell type:markdown id: tags:
**Change a particular entry**
%% Cell type:code id: tags:
``` python
df.loc[5, "Name"] = "Mini"
```
%% Cell type:code id: tags:
``` python
df
```
%% Output
Last Name Name dead no_of_songs year_born age
id
1 Lennon John 1.0 62.0 1940 77
2 McCartney Paul 0.0 58.0 1942 75
3 Harrison George 1.0 24.0 1943 74
4 Star Ringo 0.0 3.0 1940 77
5 Mouse Mini NaN NaN 1928 89
%% Cell type:markdown id: tags:
***
### `pd.DataFrame` methods
* _method_ $\to$ `OBJECT.method_name(agrs1, arg2, ...)`
%% Cell type:code id: tags:
``` python
df.info()
```
%% Output
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5 entries, 1 to 5
Data columns (total 6 columns):
Last Name 5 non-null object
Name 5 non-null object
dead 4 non-null float64
no_of_songs 4 non-null float64
year_born 5 non-null int64
age 5 non-null int64
dtypes: float64(2), int64(2), object(2)
memory usage: 440.0+ bytes
%% Cell type:code id: tags:
``` python
df.describe()
```
%% Output
dead no_of_songs year_born age
count 4.00000 4.000000 5.0000 5.0000
mean 0.50000 36.750000 1938.6000 78.4000
std 0.57735 28.229712 6.0663 6.0663
min 0.00000 3.000000 1928.0000 74.0000
25% 0.00000 18.750000 1940.0000 75.0000
50% 0.50000 41.000000 1940.0000 77.0000
75% 1.00000 59.000000 1942.0000 77.0000
max 1.00000 62.000000 1943.0000 89.0000
%% Cell type:code id: tags:
``` python
df.describe(include="all")
```
%% Output
Last Name Name dead no_of_songs year_born age
count 5 5 4.00000 4.000000 5.0000 5.0000
unique 5 5 NaN NaN NaN NaN
top McCartney Paul NaN NaN NaN NaN
freq 1 1 NaN NaN NaN NaN
mean NaN NaN 0.50000 36.750000 1938.6000 78.4000
std NaN NaN 0.57735 28.229712 6.0663 6.0663
min NaN NaN 0.00000 3.000000 1928.0000 74.0000
25% NaN NaN 0.00000 18.750000 1940.0000 75.0000
50% NaN NaN 0.50000 41.000000 1940.0000 77.0000
75% NaN NaN 1.00000 59.000000 1942.0000 77.0000
max NaN NaN 1.00000 62.000000 1943.0000 89.0000
%% Cell type:code id: tags:
``` python
df
```
%% Output
Last Name Name dead no_of_songs year_born age
id
1 Lennon John 1.0 62.0 1940 77
2 McCartney Paul 0.0 58.0 1942 75
3 Harrison George 1.0 24.0 1943 74
4 Star Ringo 0.0 3.0 1940 77
5 Mouse Mini NaN NaN 1928 89
%% Cell type:code id: tags:
``` python
# sum down the rows (axis=0, the default): one total per column
df.sum()
```
%% Output
Last Name LennonMcCartneyHarrisonStarMouse
Name JohnPaulGeorgeRingoMini
dead 2
no_of_songs 147
year_born 9693
age 392
dtype: object
%% Cell type:code id: tags:
``` python
# sum across the columns (axis=1): one total per row, numeric columns only
df.sum(axis=1, numeric_only=True)
```
%% Output
id
1 2080.0
2 2075.0
3 2042.0
4 2020.0
5 2017.0
dtype: float64
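%% Cell type:markdown id: tags:
The `axis` argument names the axis that is collapsed by the operation. A minimal sketch on a made-up numeric frame (`df_demo` is an illustrative name):
%% Cell type:code id: tags:
```python
import pandas as pd

df_demo = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})  # illustrative data

per_column = df_demo.sum()        # axis=0: rows are collapsed, one total per column
per_row = df_demo.sum(axis=1)     # axis=1: columns are collapsed, one total per row

print(per_column.tolist(), per_row.tolist())
```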
%% Cell type:markdown id: tags:
#### `groupby` method
%% Cell type:code id: tags:
``` python
df
```
%% Output
Last Name Name dead no_of_songs year_born age
id
1 Lennon John 1.0 62.0 1940 77
2 McCartney Paul 0.0 58.0 1942 75
3 Harrison George 1.0 24.0 1943 74
4 Star Ringo 0.0 3.0 1940 77
5 Mouse Mini NaN NaN 1928 89
%% Cell type:code id: tags:
``` python
df.groupby("dead")
```
%% Output
<pandas.core.groupby.DataFrameGroupBy object at 0x0000000007CB89B0>
%% Cell type:code id: tags:
``` python
df.groupby("dead").sum(numeric_only=True)
```
%% Output
no_of_songs year_born age
dead
0.0 61.0 3882 152
1.0 86.0 3883 151
%% Cell type:code id: tags:
``` python
df.groupby("dead")["no_of_songs"].sum()
```
%% Output
dead
0.0 61.0
1.0 86.0
Name: no_of_songs, dtype: float64
%% Cell type:code id: tags:
``` python
df.groupby("dead")["no_of_songs"].mean()
```
%% Output
dead
0.0 30.5
1.0 43.0
Name: no_of_songs, dtype: float64
%% Cell type:code id: tags:
``` python
df.groupby("dead")["no_of_songs"].agg(["mean", "max", "min"])
```
%% Output
mean max min
dead
0.0 30.5 58.0 3.0
1.0 43.0 62.0 24.0
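%% Cell type:markdown id: tags:
`groupby` splits the rows into groups, applies a function to each group, and combines the results. The sketch below, on a made-up frame (`df_demo` is an illustrative name), first does this by hand by iterating over the groups and then with named aggregation, which assumes pandas 0.25 or later.
%% Cell type:code id: tags:
```python
import pandas as pd

df_demo = pd.DataFrame({
    "dead": [True, False, True, False],
    "no_of_songs": [62, 58, 24, 3],
})

# split-apply-combine by hand: iterate over (key, sub-frame) pairs
manual = {key: group["no_of_songs"].sum() for key, group in df_demo.groupby("dead")}

# the same idea with named aggregation: output column = (input column, function)
named = df_demo.groupby("dead").agg(
    total=("no_of_songs", "sum"),
    largest=("no_of_songs", "max"),
)

print(manual)
print(named)
```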
%% Cell type:markdown id: tags:
#### `plot` method
%% Cell type:code id: tags:
``` python
%matplotlib inline
```
%% Cell type:code id: tags:
``` python
df
```
%% Output
Last Name Name dead no_of_songs year_born age
id
1 Lennon John 1.0 62.0 1940 77
2 McCartney Paul 0.0 58.0 1942 75
3 Harrison George 1.0 24.0 1943 74
4 Star Ringo 0.0 3.0 1940 77
5 Mouse Mini NaN NaN 1928 89
%% Cell type:code id: tags:
``` python
df[["no_of_songs", "age"]].plot()
```
%% Output
<matplotlib.axes._subplots.AxesSubplot at 0x8563b70>
%% Cell type:code id: tags:
``` python
df["dead"].plot.hist()
```
%% Output
<matplotlib.axes._subplots.AxesSubplot at 0x8861b38>
%% Cell type:code id: tags:
``` python
df["age"].plot.bar()
```
%% Output
<matplotlib.axes._subplots.AxesSubplot at 0x88d1be0>
%% Cell type:code id: tags:
``` python
ax = df["age"].plot.bar()
ax.set_xticklabels(df.Name);
```
%% Output
%% Cell type:code id: tags:
``` python
ax = df["age"].plot.bar(rot=0)
ax.set_xticklabels(df.Name);
```
%% Output
%% Cell type:code id: tags:
``` python
ax = df["age"].plot.bar(rot=0)
ax.set_xticklabels(df.Name)
ax.set_title("The Beatles and ... something else", size=18);
```
%% Output
%% Cell type:code id: tags:
``` python
ax = df["age"].plot.bar(rot=0)
ax.set_xticklabels(df.Name)
ax.set_title("The Beatles and ... something else", size=18)
ax.set_xlabel("")
ax.set_ylabel("age", size=12);
```
%% Output