Commit c56cd197 authored by Joachim Krois

updated lessons for workshop

parent 75bd5425
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
# Applied Data Analysis I # Applied Data Analysis I - The Basics
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
* _function_ $\to$ `OBJECT = pd.function_name(agrs1, arg2, ...)` * _function_ $\to$ `OBJECT = pd.function_name(arg1, arg2, ...)`
* _method_ $\to$ `OBJECT.method_name(agrs1, arg2, ...)` * _method_ $\to$ `OBJECT.method_name(arg1, arg2, ...)`
* _attribute_ $\to$ `OBJECT.attribute` $\qquad$ _Note that the attribute is accessed without parentheses_ * _attribute_ $\to$ `OBJECT.attribute` $\qquad$ _Note that the attribute is accessed without parentheses_
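The three call patterns listed above can be tried on a tiny object. This is only a minimal sketch: the name `prices` and its values are made up for illustration, and `pd.Series` is strictly a class constructor that is called like a function.

```python
import pandas as pd

# function-style call on the library: builds and returns a new object
prices = pd.Series([1.5, 2.0, 3.5], name="prices")

# method: called on the object, with parentheses
total = prices.sum()

# attribute: read from the object, without parentheses
n_elements = prices.size

print(total, n_elements)
```

Writing `prices.size()` would raise a `TypeError`, because `size` is a plain value and not something callable.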
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
# The `pandas` library # The `pandas` library
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
import pandas as pd import pandas as pd
``` ```
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
`numpy` but with labeled rows and columns `numpy` but with labeled rows and columns
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
one-dimensional `pd.Series` object and two-dimensional `pd.DataFrame` object one-dimensional `pd.Series` object and two-dimensional `pd.DataFrame` object
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
*** ***
## The `pd.Series` object ## The `pd.Series` object
* _function_ $\to$ `OBJECT = pd.function_name(arg1, arg2, ...)` * _function_ $\to$ `OBJECT = pd.function_name(arg1, arg2, ...)`
%% Cell type:raw id: tags: %% Cell type:raw id: tags:
??pd.Series ??pd.Series
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
`pd.Series(data=None, index=None, dtype=None, name=None, copy=False, fastpath=False)` `pd.Series(data=None, index=None, dtype=None, name=None, copy=False, fastpath=False)`
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
from numpy import random from numpy import random
random.seed(123) random.seed(123)
my_data = random.randint(low=-10, high=10, size=26,) my_data = random.randint(low=-10, high=10, size=26,)
my_data my_data
``` ```
%% Output %% Output
array([ 3, -8, -8, -4, 7, 9, 0, -9, -10, 7, 5, -1, -10, array([ 3, -8, -8, -4, 7, 9, 0, -9, -10, 7, 5, -1, -10,
4, -10, 5, 9, 4, -6, -10, 6, -6, 7, -7, -8, -3]) 4, -10, 5, 9, 4, -6, -10, 6, -6, 7, -7, -8, -3])
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
s = pd.Series(data=my_data, name="my_pandas_series") s = pd.Series(data=my_data, name="my_pandas_series")
s s
``` ```
%% Output %% Output
0 3 0 3
1 -8 1 -8
2 -8 2 -8
3 -4 3 -4
4 7 4 7
5 9 5 9
6 0 6 0
7 -9 7 -9
8 -10 8 -10
9 7 9 7
10 5 10 5
11 -1 11 -1
12 -10 12 -10
13 4 13 4
14 -10 14 -10
15 5 15 5
16 9 16 9
17 4 17 4
18 -6 18 -6
19 -10 19 -10
20 6 20 6
21 -6 21 -6
22 7 22 7
23 -7 23 -7
24 -8 24 -8
25 -3 25 -3
Name: my_pandas_series, dtype: int32 Name: my_pandas_series, dtype: int32
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
**Element-wise arithmetic** **Element-wise arithmetic**
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
s*0.1 s*0.1
``` ```
%% Output %% Output
0 0.3 0 0.3
1 -0.8 1 -0.8
2 -0.8 2 -0.8
3 -0.4 3 -0.4
4 0.7 4 0.7
5 0.9 5 0.9
6 0.0 6 0.0
7 -0.9 7 -0.9
8 -1.0 8 -1.0
9 0.7 9 0.7
10 0.5 10 0.5
11 -0.1 11 -0.1
12 -1.0 12 -1.0
13 0.4 13 0.4
14 -1.0 14 -1.0
15 0.5 15 0.5
16 0.9 16 0.9
17 0.4 17 0.4
18 -0.6 18 -0.6
19 -1.0 19 -1.0
20 0.6 20 0.6
21 -0.6 21 -0.6
22 0.7 22 0.7
23 -0.7 23 -0.7
24 -0.8 24 -0.8
25 -0.3 25 -0.3
Name: my_pandas_series, dtype: float64 Name: my_pandas_series, dtype: float64
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
*** ***
### `pd.Series` attributes ### `pd.Series` attributes
* _attribute_ $\to$ `OBJECT.attribute` $\qquad$ _Note that the attribute is accessed without parentheses_ * _attribute_ $\to$ `OBJECT.attribute` $\qquad$ _Note that the attribute is accessed without parentheses_
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
s.dtypes s.dtypes
``` ```
%% Output %% Output
dtype('int32') dtype('int32')
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
s.index s.index
``` ```
%% Output %% Output
RangeIndex(start=0, stop=26, step=1) RangeIndex(start=0, stop=26, step=1)
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
*** ***
### Selection and slicing by index ### Selection and slicing by index
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
s[2] s[2]
``` ```
%% Output %% Output
-8 -8
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
s[2:6] s[2:6]
``` ```
%% Output %% Output
2 -8 2 -8
3 -4 3 -4
4 7 4 7
5 9 5 9
Name: my_pandas_series, dtype: int32 Name: my_pandas_series, dtype: int32
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
#### Challenge: #### Challenge:
> Change the index to (arbitrary) letters of the alphabet > Change the index to (arbitrary) letters of the alphabet
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
import string import string
letters = string.ascii_uppercase letters = string.ascii_uppercase
letters letters
``` ```
%% Output %% Output
'ABCDEFGHIJKLMNOPQRSTUVWXYZ' 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
s.index = list(letters) s.index = list(letters)
s s
``` ```
%% Output %% Output
A 3 A 3
B -8 B -8
C -8 C -8
D -4 D -4
E 7 E 7
F 9 F 9
G 0 G 0
H -9 H -9
I -10 I -10
J 7 J 7
K 5 K 5
L -1 L -1
M -10 M -10
N 4 N 4
O -10 O -10
P 5 P 5
Q 9 Q 9
R 4 R 4
S -6 S -6
T -10 T -10
U 6 U 6
V -6 V -6
W 7 W 7
X -7 X -7
Y -8 Y -8
Z -3 Z -3
Name: my_pandas_series, dtype: int32 Name: my_pandas_series, dtype: int32
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
s.index s.index
``` ```
%% Output %% Output
Index(['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', Index(['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N',
'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z'], 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z'],
dtype='object') dtype='object')
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
s["C"] s["C"]
``` ```
%% Output %% Output
-8 -8
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
s["C":"K"] s["C":"K"]
``` ```
%% Output %% Output
C -8 C -8
D -4 D -4
E 7 E 7
F 9 F 9
G 0 G 0
H -9 H -9
I -10 I -10
J 7 J 7
K 5 K 5
Name: my_pandas_series, dtype: int32 Name: my_pandas_series, dtype: int32
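One detail in the two slicing examples is easy to miss: the positional slice `s[2:6]` stops before position 6, while the label slice `s["C":"K"]` includes `"K"`. The sketch below makes the difference explicit with `.loc` and `.iloc`; `s_demo` is a hypothetical shortened series, not the one used in the notebook.

```python
import pandas as pd

# a short stand-in series with letter labels
s_demo = pd.Series([3, -8, -8, -4, 7], index=list("ABCDE"))

# label-based slice: both end points are included
by_label = s_demo.loc["B":"D"]

# position-based slice: the stop position is excluded
by_position = s_demo.iloc[1:3]

print(list(by_label.index))     # ['B', 'C', 'D']
print(list(by_position.index))  # ['B', 'C']
```

Spelling out `.loc` or `.iloc` also avoids ambiguity when the index itself consists of integers.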
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
*** ***
### `pd.Series` methods ### `pd.Series` methods
* _method_ $\to$ `OBJECT.method_name(arg1, arg2, ...)` * _method_ $\to$ `OBJECT.method_name(arg1, arg2, ...)`
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
s s
``` ```
%% Output %% Output
A 3 A 3
B -8 B -8
C -8 C -8
D -4 D -4
E 7 E 7
F 9 F 9
G 0 G 0
H -9 H -9
I -10 I -10
J 7 J 7
K 5 K 5
L -1 L -1
M -10 M -10
N 4 N 4
O -10 O -10
P 5 P 5
Q 9 Q 9
R 4 R 4
S -6 S -6
T -10 T -10
U 6 U 6
V -6 V -6
W 7 W 7
X -7 X -7
Y -8 Y -8
Z -3 Z -3
Name: my_pandas_series, dtype: int32 Name: my_pandas_series, dtype: int32
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
s.sum() s.sum()
``` ```
%% Output %% Output
-34 -34
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
s.mean() s.mean()
``` ```
%% Output %% Output
-1.3076923076923077 -1.3076923076923077
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
s.max() s.max()
``` ```
%% Output %% Output
9 9
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
s.min() s.min()
``` ```
%% Output %% Output
-10 -10
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
s.median() s.median()
``` ```
%% Output %% Output
-2.0 -2.0
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
s.quantile(q=0.5) s.quantile(q=0.5)
``` ```
%% Output %% Output
-2.0 -2.0
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
s.quantile(q=[0.25, 0.5, 0.75]) s.quantile(q=[0.25, 0.5, 0.75])
``` ```
%% Output %% Output
0.25 -8.0 0.25 -8.0
0.50 -2.0 0.50 -2.0
0.75 5.0 0.75 5.0
Name: my_pandas_series, dtype: float64 Name: my_pandas_series, dtype: float64
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
*** ***
## The `pd.DataFrame` object ## The `pd.DataFrame` object
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
from IPython.display import IFrame from IPython.display import IFrame
IFrame("http://duelingdata.blogspot.de/2016/01/the-beatles.html", width="100%", height=400) IFrame("http://duelingdata.blogspot.de/2016/01/the-beatles.html", width="100%", height=400)
``` ```
%% Output %% Output
<IPython.lib.display.IFrame at 0x8ca6ba8> <IPython.lib.display.IFrame at 0x8ca6ba8>
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
* _function_ $\to$ `OBJECT = pd.function_name(arg1, arg2, ...)` * _function_ $\to$ `OBJECT = pd.function_name(arg1, arg2, ...)`
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
df = pd.DataFrame({"id" : range(1,5), df = pd.DataFrame({"id" : range(1,5),
"Name" : ["John", "Paul", "George", "Ringo"], "Name" : ["John", "Paul", "George", "Ringo"],
"Last Name" : ["Lennon", "McCartney", "Harrison", "Star"], "Last Name" : ["Lennon", "McCartney", "Harrison", "Star"],
"dead" : [True, False, True, False], "dead" : [True, False, True, False],
"year_born" : [1940, 1942, 1943, 1940], "year_born" : [1940, 1942, 1943, 1940],
"no_of_songs" : [62, 58, 24, 3] "no_of_songs" : [62, 58, 24, 3]
}) })
df df
``` ```
%% Output %% Output
Last Name Name dead id no_of_songs year_born Last Name Name dead id no_of_songs year_born
0 Lennon John True 1 62 1940 0 Lennon John True 1 62 1940
1 McCartney Paul False 2 58 1942 1 McCartney Paul False 2 58 1942
2 Harrison George True 3 24 1943 2 Harrison George True 3 24 1943
3 Star Ringo False 4 3 1940 3 Star Ringo False 4 3 1940
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
*** ***
### `pd.DataFrame` attributes ### `pd.DataFrame` attributes
* _attribute_ $\to$ `OBJECT.attribute` * _attribute_ $\to$ `OBJECT.attribute`
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
df.dtypes df.dtypes
``` ```
%% Output %% Output
Last Name object Last Name object
Name object Name object
dead bool dead bool
id int64 id int64
no_of_songs int64 no_of_songs int64
year_born int64 year_born int64
dtype: object dtype: object
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
# axis 0 (row labels) # axis 0 (row labels)
df.index df.index
``` ```
%% Output %% Output
RangeIndex(start=0, stop=4, step=1) RangeIndex(start=0, stop=4, step=1)
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
df.set_index("id") df.set_index("id")
``` ```
%% Output %% Output
Last Name Name dead no_of_songs year_born Last Name Name dead no_of_songs year_born
id id
1 Lennon John True 62 1940 1 Lennon John True 62 1940
2 McCartney Paul False 58 1942 2 McCartney Paul False 58 1942
3 Harrison George True 24 1943 3 Harrison George True 24 1943
4 Star Ringo False 3 1940 4 Star Ringo False 3 1940
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
df df
``` ```
%% Output %% Output
Last Name Name dead id no_of_songs year_born Last Name Name dead id no_of_songs year_born
0 Lennon John True 1 62 1940 0 Lennon John True 1 62 1940
1 McCartney Paul False 2 58 1942 1 McCartney Paul False 2 58 1942
2 Harrison George True 3 24 1943 2 Harrison George True 3 24 1943
3 Star Ringo False 4 3 1940 3 Star Ringo False 4 3 1940
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
`df.set_index("id", inplace=True)` `df.set_index("id", inplace=True)`
or or
`df = df.set_index("id")` `df = df.set_index("id")`
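The reason both spellings exist is that `set_index` returns a new data frame by default and leaves the original untouched. A minimal sketch with a hypothetical two-row frame called `band`:

```python
import pandas as pd

band = pd.DataFrame({"id": [1, 2], "Name": ["John", "Paul"]})

# set_index returns a new frame; `band` itself is unchanged
indexed = band.set_index("id")
name_before = band.index.name    # None: still the default RangeIndex
name_indexed = indexed.index.name

# reassigning the result makes the change stick
band = band.set_index("id")
name_after = band.index.name

print(name_before, name_indexed, name_after)
```

Reassignment is usually the easier form to read and to chain with other methods; `inplace=True` has the same effect on `band` but returns `None`.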
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
df.set_index("id", inplace=True) df.set_index("id", inplace=True)
``` ```
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
df df
``` ```
%% Output %% Output
Last Name Name dead no_of_songs year_born Last Name Name dead no_of_songs year_born
id id
1 Lennon John True 62 1940 1 Lennon John True 62 1940
2 McCartney Paul False 58 1942 2 McCartney Paul False 58 1942
3 Harrison George True 24 1943 3 Harrison George True 24 1943
4 Star Ringo False 3 1940 4 Star Ringo False 3 1940
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
# axis 1 (column labels) # axis 1 (column labels)
df.columns df.columns
``` ```
%% Output %% Output
Index(['Last Name', 'Name', 'dead', 'no_of_songs', 'year_born'], dtype='object') Index(['Last Name', 'Name', 'dead', 'no_of_songs', 'year_born'], dtype='object')
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
*** ***
### Selection and slicing by indices ### Selection and slicing by indices
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
**Column index** **Column index**
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
df["Name"] df["Name"]
``` ```
%% Output %% Output
id id
1 John 1 John
2 Paul 2 Paul
3 George 3 George
4 Ringo 4 Ringo
Name: Name, dtype: object Name: Name, dtype: object
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
df[["Name", "Last Name"]] df[["Name", "Last Name"]]
``` ```
%% Output %% Output
Name Last Name Name Last Name
id id
1 John Lennon 1 John Lennon
2 Paul McCartney 2 Paul McCartney
3 George Harrison 3 George Harrison
4 Ringo Star 4 Ringo Star
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
df.dead df.dead
``` ```
%% Output %% Output
id id
1 True 1 True
2 False 2 False
3 True 3 True
4 False 4 False
Name: dead, dtype: bool Name: dead, dtype: bool
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
**Row index** **Row index**
`.loc[]`, `.iloc[]` `.loc[]`, `.iloc[]`
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
df.loc[1] df.loc[1]
``` ```
%% Output %% Output
Last Name Lennon Last Name Lennon
Name John Name John
dead True dead True
no_of_songs 62 no_of_songs 62
year_born 1940 year_born 1940
Name: 1, dtype: object Name: 1, dtype: object
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
df.iloc[0] df.iloc[0]
``` ```
%% Output %% Output
Last Name Lennon Last Name Lennon
Name John Name John
dead True dead True
no_of_songs 62 no_of_songs 62
year_born 1940 year_born 1940
Name: 1, dtype: object Name: 1, dtype: object
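Both calls above return John Lennon's row, but for different reasons: `.loc[1]` looks up the index label `1`, while `.iloc[0]` takes the first row by position. A small sketch, assuming a reduced hypothetical frame `band` with the same `id` index:

```python
import pandas as pd

band = pd.DataFrame(
    {"Name": ["John", "Paul", "George", "Ringo"]},
    index=pd.Index([1, 2, 3, 4], name="id"),
)

first_by_label = band.loc[1, "Name"]       # label 1
first_by_position = band.iloc[0, 0]        # row position 0, column position 0
second_by_position = band.iloc[1, 0]       # row position 1 carries label 2

# label 0 does not exist in this index, so .loc raises a KeyError
try:
    band.loc[0]
    label_zero_exists = True
except KeyError:
    label_zero_exists = False

print(first_by_label, first_by_position, second_by_position, label_zero_exists)
```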
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
**Row and column indices** **Row and column indices**
`df.loc[row, col]` `df.loc[row, col]`
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
df.loc[1, "Last Name"] df.loc[1, "Last Name"]
``` ```
%% Output %% Output
'Lennon' 'Lennon'
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
df.loc[2:4, ["Name", "dead"]] df.loc[2:4, ["Name", "dead"]]
``` ```
%% Output %% Output
Name dead Name dead
id id
2 Paul False 2 Paul False
3 George True 3 George True
4 Ringo False 4 Ringo False
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
**Logical indexing** **Logical indexing**
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
df df
``` ```
%% Output %% Output
Last Name Name dead no_of_songs year_born Last Name Name dead no_of_songs year_born
id id
1 Lennon John True 62 1940 1 Lennon John True 62 1940
2 McCartney Paul False 58 1942 2 McCartney Paul False 58 1942
3 Harrison George True 24 1943 3 Harrison George True 24 1943
4 Star Ringo False 3 1940 4 Star Ringo False 3 1940
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
df.no_of_songs > 50 df.no_of_songs > 50
``` ```
%% Output %% Output
id id
1 True 1 True
2 True 2 True
3 False 3 False
4 False 4 False
Name: no_of_songs, dtype: bool Name: no_of_songs, dtype: bool
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
df.loc[df.no_of_songs > 50] df.loc[df.no_of_songs > 50]
``` ```
%% Output %% Output
Last Name Name dead no_of_songs year_born Last Name Name dead no_of_songs year_born
id id
1 Lennon John True 62 1940 1 Lennon John True 62 1940
2 McCartney Paul False 58 1942 2 McCartney Paul False 58 1942
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
df.loc[(df.no_of_songs > 50) & (df.year_born >= 1942)] df.loc[(df.no_of_songs > 50) & (df.year_born >= 1942)]
``` ```
%% Output %% Output
Last Name Name dead no_of_songs year_born Last Name Name dead no_of_songs year_born
id id
2 McCartney Paul False 58 1942 2 McCartney Paul False 58 1942
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
df.loc[(df.no_of_songs > 50) & (df.year_born >= 1942), ["Last Name", "Name"]] df.loc[(df.no_of_songs > 50) & (df.year_born >= 1942), ["Last Name", "Name"]]
``` ```
%% Output %% Output
Last Name Name Last Name Name
id id
2 McCartney Paul 2 McCartney Paul
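Boolean masks can also be combined with `|` (or) and negated with `~` (not), in the same way `&` is used above. The sketch below rebuilds a hypothetical reduced frame `band` so that it runs on its own; each comparison needs its own parentheses or a named variable, because `&` and `|` bind more tightly than `>` and `==`.

```python
import pandas as pd

band = pd.DataFrame({
    "Name": ["John", "Paul", "George", "Ringo"],
    "no_of_songs": [62, 58, 24, 3],
    "year_born": [1940, 1942, 1943, 1940],
})

# naming the masks keeps the selection readable
many_songs = band["no_of_songs"] > 50
born_1940 = band["year_born"] == 1940

either = band.loc[many_songs | born_1940, "Name"].tolist()
neither = band.loc[~(many_songs | born_1940), "Name"].tolist()

print(either)   # ['John', 'Paul', 'Ringo']
print(neither)  # ['George']
```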
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
*** ***
### Manipulating columns, rows and particular entries ### Manipulating columns, rows and particular entries
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
**Add a row to the data set** **Add a row to the data set**
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
from numpy import nan from numpy import nan
df.loc[5] = ["Mouse", "Mickey", nan, nan, 1928] df.loc[5] = ["Mouse", "Mickey", nan, nan, 1928]
df df
``` ```
%% Output %% Output
Last Name Name dead no_of_songs year_born Last Name Name dead no_of_songs year_born
id id
1 Lennon John 1.0 62.0 1940 1 Lennon John 1.0 62.0 1940
2 McCartney Paul 0.0 58.0 1942 2 McCartney Paul 0.0 58.0 1942
3 Harrison George 1.0 24.0 1943 3 Harrison George 1.0 24.0 1943
4 Star Ringo 0.0 3.0 1940 4 Star Ringo 0.0 3.0 1940
5 Mouse Mickey NaN NaN 1928 5 Mouse Mickey NaN NaN 1928
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
df.dtypes df.dtypes
``` ```
%% Output %% Output
Last Name object Last Name object
Name object Name object
dead float64 dead float64
no_of_songs float64 no_of_songs float64
year_born int64 year_born int64
dtype: object dtype: object
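The `dtypes` output above shows `no_of_songs` changing from `int64` to `float64` once the row with `NaN` is added. NumPy integer arrays cannot store missing values, so pandas falls back to floats. The sketch below reproduces that effect on a standalone series and shows one possible alternative, the nullable `"Int64"` extension dtype, assuming a pandas version that provides it. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# integers plus a missing value are stored as floats
with_missing = pd.Series([62, 58, 24, 3, np.nan])
print(with_missing.dtype)   # float64

# the nullable integer dtype keeps whole numbers and marks the gap as <NA>
nullable = with_missing.astype("Int64")
print(nullable.dtype)       # Int64
print(nullable.isna().sum())
```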
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
**Add a column to the data set** **Add a column to the data set**
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
now = pd.Timestamp.today().year now = pd.Timestamp.today().year
now now
``` ```
%% Output %% Output
2017 2017
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
df["age"] = now - df.year_born df["age"] = now - df.year_born
df df
``` ```
%% Output %% Output
Last Name Name dead no_of_songs year_born age Last Name Name dead no_of_songs year_born age
id id
1 Lennon John 1.0 62.0 1940 77 1 Lennon John 1.0 62.0 1940 77
2 McCartney Paul 0.0 58.0 1942 75 2 McCartney Paul 0.0 58.0 1942 75
3 Harrison George 1.0 24.0 1943 74 3 Harrison George 1.0 24.0 1943 74
4 Star Ringo 0.0 3.0 1940 77 4 Star Ringo 0.0 3.0 1940 77
5 Mouse Mini NaN NaN 1928 89 5 Mouse Mini NaN NaN 1928 89
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
**Change a particular entry** **Change a particular entry**
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
df.loc[5, "Name"] = "Mini" df.loc[5, "Name"] = "Mini"
``` ```
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
df df
``` ```
%% Output %% Output
Last Name Name dead no_of_songs year_born age Last Name Name dead no_of_songs year_born age
id id
1 Lennon John 1.0 62.0 1940 77 1 Lennon John 1.0 62.0 1940 77
2 McCartney Paul 0.0 58.0 1942 75 2 McCartney Paul 0.0 58.0 1942 75
3 Harrison George 1.0 24.0 1943 74 3 Harrison George 1.0 24.0 1943 74
4 Star Ringo 0.0 3.0 1940 77 4 Star Ringo 0.0 3.0 1940 77
5 Mouse Mini NaN NaN 1928 89 5 Mouse Mini NaN NaN 1928 89
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
*** ***
### `pd.DataFrame` methods ### `pd.DataFrame` methods
* _method_ $\to$ `OBJECT.method_name(arg1, arg2, ...)` * _method_ $\to$ `OBJECT.method_name(arg1, arg2, ...)`
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
df.info() df.info()
``` ```
%% Output %% Output
<class 'pandas.core.frame.DataFrame'> <class 'pandas.core.frame.DataFrame'>
Int64Index: 5 entries, 1 to 5 Int64Index: 5 entries, 1 to 5
Data columns (total 6 columns): Data columns (total 6 columns):
Last Name 5 non-null object Last Name 5 non-null object
Name 5 non-null object Name 5 non-null object
dead 4 non-null float64 dead 4 non-null float64
no_of_songs 4 non-null float64 no_of_songs 4 non-null float64
year_born 5 non-null int64 year_born 5 non-null int64
age 5 non-null int64 age 5 non-null int64
dtypes: float64(2), int64(2), object(2) dtypes: float64(2), int64(2), object(2)
memory usage: 440.0+ bytes memory usage: 440.0+ bytes
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
df.describe() df.describe()
``` ```
%% Output %% Output
dead no_of_songs year_born age dead no_of_songs year_born age
count 4.00000 4.000000 5.0000 5.0000 count 4.00000 4.000000 5.0000 5.0000
mean 0.50000 36.750000 1938.6000 78.4000 mean 0.50000 36.750000 1938.6000 78.4000
std 0.57735 28.229712 6.0663 6.0663 std 0.57735 28.229712 6.0663 6.0663
min 0.00000 3.000000 1928.0000 74.0000 min 0.00000 3.000000 1928.0000 74.0000
25% 0.00000 18.750000 1940.0000 75.0000 25% 0.00000 18.750000 1940.0000 75.0000
50% 0.50000 41.000000 1940.0000 77.0000 50% 0.50000 41.000000 1940.0000 77.0000
75% 1.00000 59.000000 1942.0000 77.0000 75% 1.00000 59.000000 1942.0000 77.0000
max 1.00000 62.000000 1943.0000 89.0000 max 1.00000 62.000000 1943.0000 89.0000
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
df.describe(include="all") df.describe(include="all")
``` ```
%% Output %% Output
Last Name Name dead no_of_songs year_born age Last Name Name dead no_of_songs year_born age
count 5 5 4.00000 4.000000 5.0000 5.0000 count 5 5 4.00000 4.000000 5.0000 5.0000
unique 5 5 NaN NaN NaN NaN unique 5 5 NaN NaN NaN NaN
top McCartney Paul NaN NaN NaN NaN top McCartney Paul NaN NaN NaN NaN
freq 1 1 NaN NaN NaN NaN freq 1 1 NaN NaN NaN NaN
mean NaN NaN 0.50000 36.750000 1938.6000 78.4000 mean NaN NaN 0.50000 36.750000 1938.6000 78.4000
std NaN NaN 0.57735 28.229712 6.0663 6.0663 std NaN NaN 0.57735 28.229712 6.0663 6.0663
min NaN NaN 0.00000 3.000000 1928.0000 74.0000 min NaN NaN 0.00000 3.000000 1928.0000 74.0000
25% NaN NaN 0.00000 18.750000 1940.0000 75.0000 25% NaN NaN 0.00000 18.750000 1940.0000 75.0000
50% NaN NaN 0.50000 41.000000 1940.0000 77.0000 50% NaN NaN 0.50000 41.000000 1940.0000 77.0000
75% NaN NaN 1.00000 59.000000 1942.0000 77.0000 75% NaN NaN 1.00000 59.000000 1942.0000 77.0000
max NaN NaN 1.00000 62.000000 1943.0000 89.0000 max NaN NaN 1.00000 62.000000 1943.0000 89.0000
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
df df
``` ```
%% Output %% Output
Last Name Name dead no_of_songs year_born age Last Name Name dead no_of_songs year_born age
id id
1 Lennon John 1.0 62.0 1940 77 1 Lennon John 1.0 62.0 1940 77
2 McCartney Paul 0.0 58.0 1942 75 2 McCartney Paul 0.0 58.0 1942 75
3 Harrison George 1.0 24.0 1943 74 3 Harrison George 1.0 24.0 1943 74
4 Star Ringo 0.0 3.0 1940 77 4 Star Ringo 0.0 3.0 1940 77
5 Mouse Mini NaN NaN 1928 89 5 Mouse Mini NaN NaN 1928 89
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
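# sum over the rows: one value per column (axis=0, the default) # sum over the rows: one value per column (axis=0, the default)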
df.sum() df.sum()
``` ```
%% Output %% Output
Last Name LennonMcCartneyHarrisonStarMouse Last Name LennonMcCartneyHarrisonStarMouse
Name JohnPaulGeorgeRingoMini Name JohnPaulGeorgeRingoMini
dead 2 dead 2
no_of_songs 147 no_of_songs 147
year_born 9693 year_born 9693
age 392 age 392
dtype: object dtype: object
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
# sum over the columns: one value per row (axis=1) # sum over the columns: one value per row (axis=1)
df.sum(axis=1) df.sum(axis=1)
``` ```
%% Output %% Output
id id
1 2080.0 1 2080.0
2 2075.0 2 2075.0
3 2042.0 3 2042.0
4 2020.0 4 2020.0
5 2017.0 5 2017.0
dtype: float64 dtype: float64
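The `axis` argument is easiest to remember on a purely numeric frame. In this minimal sketch (`scores` is a made-up example), `axis=0` collapses the rows and yields one value per column, while `axis=1` collapses the columns and yields one value per row.

```python
import pandas as pd

scores = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

per_column = scores.sum()        # axis=0 is the default
per_row = scores.sum(axis=1)

print(per_column.tolist())  # [6, 60]
print(per_row.tolist())     # [11, 22, 33]
```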
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
#### `groupby` method #### `groupby` method
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
df df
``` ```
%% Output %% Output
Last Name Name dead no_of_songs year_born age Last Name Name dead no_of_songs year_born age
id id
1 Lennon John 1.0 62.0 1940 77 1 Lennon John 1.0 62.0 1940 77
2 McCartney Paul 0.0 58.0 1942 75 2 McCartney Paul 0.0 58.0 1942 75
3 Harrison George 1.0 24.0 1943 74 3 Harrison George 1.0 24.0 1943 74
4 Star Ringo 0.0 3.0 1940 77 4 Star Ringo 0.0 3.0 1940 77
5 Mouse Mickey NaN NaN 1928 89 5 Mouse Mickey NaN NaN 1928 89
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
df.groupby("dead") df.groupby("dead")
``` ```
%% Output %% Output
<pandas.core.groupby.DataFrameGroupBy object at 0x0000000007CB89B0> <pandas.core.groupby.DataFrameGroupBy object at 0x0000000007CB89B0>
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
df.groupby("dead").sum() df.groupby("dead").sum()
``` ```
%% Output %% Output
no_of_songs year_born age no_of_songs year_born age
dead dead
0.0 61.0 3882 152 0.0 61.0 3882 152
1.0 86.0 3883 151 1.0 86.0 3883 151
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
df.groupby("dead")["no_of_songs"].sum() df.groupby("dead")["no_of_songs"].sum()
``` ```
%% Output %% Output
dead dead
0.0 61.0 0.0 61.0
1.0 86.0 1.0 86.0
Name: no_of_songs, dtype: float64 Name: no_of_songs, dtype: float64
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
df.groupby("dead")["no_of_songs"].mean() df.groupby("dead")["no_of_songs"].mean()
``` ```
%% Output %% Output
dead dead
0.0 30.5 0.0 30.5
1.0 43.0 1.0 43.0
Name: no_of_songs, dtype: float64 Name: no_of_songs, dtype: float64
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
df.groupby("dead")["no_of_songs"].agg(["mean", "max", "min"]) df.groupby("dead")["no_of_songs"].agg(["mean", "max", "min"])
``` ```
%% Output %% Output
mean max min mean max min
dead dead
0.0 30.5 58.0 3.0 0.0 30.5 58.0 3.0
1.0 43.0 62.0 24.0 1.0 43.0 62.0 24.0
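Note that the fifth row does not appear in any of the grouped results: by default `groupby` drops rows whose group key is `NaN`. The sketch below shows this on a reduced, hypothetical copy of the data, and assumes a pandas version (1.1 or later) in which `groupby` accepts `dropna=False` to keep such rows as their own group.

```python
import numpy as np
import pandas as pd

band = pd.DataFrame({
    "dead": [1.0, 0.0, 1.0, 0.0, np.nan],
    "no_of_songs": [62.0, 58.0, 24.0, 3.0, np.nan],
})

# default: the row with a missing key is silently left out
default_sum = band.groupby("dead")["no_of_songs"].sum()

# dropna=False keeps the missing key as a separate group
group_sizes = band.groupby("dead", dropna=False).size()

print(default_sum.tolist())  # [61.0, 86.0]
print(len(group_sizes))      # 3
```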
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
#### `plot` method #### `plot` method
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
%matplotlib inline %matplotlib inline
``` ```
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
df df
``` ```
%% Output %% Output
Last Name Name dead no_of_songs year_born age Last Name Name dead no_of_songs year_born age
id id
1 Lennon John 1.0 62.0 1940 77 1 Lennon John 1.0 62.0 1940 77
2 McCartney Paul 0.0 58.0 1942 75 2 McCartney Paul 0.0 58.0 1942 75
3 Harrison George 1.0 24.0 1943 74 3 Harrison George 1.0 24.0 1943 74
4 Star Ringo 0.0 3.0 1940 77 4 Star Ringo 0.0 3.0 1940 77
5 Mouse Mickey NaN NaN 1928 89 5 Mouse Mickey NaN NaN 1928 89
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
df[["no_of_songs", "age"]].plot() df[["no_of_songs", "age"]].plot()
``` ```
%% Output %% Output
<matplotlib.axes._subplots.AxesSubplot at 0x8563b70> <matplotlib.axes._subplots.AxesSubplot at 0x8563b70>
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
df["dead"].plot.hist() df["dead"].plot.hist()
``` ```
%% Output %% Output
<matplotlib.axes._subplots.AxesSubplot at 0x8861b38> <matplotlib.axes._subplots.AxesSubplot at 0x8861b38>
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
df["age"].plot.bar() df["age"].plot.bar()
``` ```
%% Output %% Output
<matplotlib.axes._subplots.AxesSubplot at 0x88d1be0> <matplotlib.axes._subplots.AxesSubplot at 0x88d1be0>
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
ax = df["age"].plot.bar() ax = df["age"].plot.bar()
ax.set_xticklabels(df.Name); ax.set_xticklabels(df.Name);
``` ```
%% Output %% Output
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
ax = df["age"].plot.bar(rot=0) ax = df["age"].plot.bar(rot=0)
ax.set_xticklabels(df.Name); ax.set_xticklabels(df.Name);
``` ```
%% Output %% Output
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
ax = df["age"].plot.bar(rot=0) ax = df["age"].plot.bar(rot=0)
ax.set_xticklabels(df.Name) ax.set_xticklabels(df.Name)
ax.set_title("The Beatles and ... something else", size=18); ax.set_title("The Beatles and ... something else", size=18);
``` ```
%% Output %% Output
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
ax = df["age"].plot.bar(rot=0) ax = df["age"].plot.bar(rot=0)
ax.set_xticklabels(df.Name) ax.set_xticklabels(df.Name)
ax.set_title("The Beatles and ... something else", size=18) ax.set_title("The Beatles and ... something else", size=18)
ax.set_xlabel("") ax.set_xlabel("")
ax.set_ylabel("age", size=12); ax.set_ylabel("age", size=12);
``` ```
%% Output %% Output
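As an alternative to relabelling the ticks with `set_xticklabels`, the names can be passed as the x column directly. This is a sketch rather than the notebook's own approach: `band` is a hypothetical reduced frame, and the `matplotlib.use("Agg")` line is only there so the example also runs as a plain script without a display. Inside a notebook it can be dropped.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, assumed for script use
import pandas as pd

band = pd.DataFrame({
    "Name": ["John", "Paul", "George", "Ringo"],
    "age": [77, 75, 74, 77],
})

# x="Name" puts the names on the axis, so no manual tick labels are needed
ax = band.plot.bar(x="Name", y="age", rot=0, legend=False)
ax.set_title("The Beatles", size=18)
ax.set_xlabel("")
ax.set_ylabel("age", size=12)
```

Because the labels come from the same rows as the bar heights, they stay aligned even if the frame is sorted or filtered before plotting.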
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
``` ```
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
``` ```