swc-bb
swc-lessons
python-notebooks
Commits
c56cd197
Commit c56cd197, authored Jan 04, 2018 by Joachim Krois
updated lessons for workshop
parent 75bd5425
Changes
2
SWC-2018-02-22-Applied-Data-Analysis-I-Bascis.ipynb
View file @
c56cd197
...
...
@@ -4,15 +4,15 @@
"cell_type": "markdown",
"metadata": {},
"source": [
-"# Applied Data Analysis I"
+"# Applied Data Analysis I - The Basics"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
-"* _function_ $\\to$ `OBJECT = pd.function_name(agrs1, arg2, ...)`\n",
-"* _method_ $\\to$ `OBJECT.method_name(agrs1, arg2, ...)`\n",
+"* _function_ $\\to$ `OBJECT = pd.function_name(arg1, arg2, ...)`\n",
+"* _method_ $\\to$ `OBJECT.method_name(arg1, arg2, ...)`\n",
"* _attribute_ $\\to$ `OBJECT.attribute` $\\qquad$ _Note that the attribute is called without parenthesis_"
]
},
...
...
%% Cell type:markdown id: tags:
# Applied Data Analysis I - The Basics
%% Cell type:markdown id: tags:
* _function_ $\to$ `OBJECT = pd.function_name(arg1, arg2, ...)`
* _method_ $\to$ `OBJECT.method_name(arg1, arg2, ...)`
* _attribute_ $\to$ `OBJECT.attribute` $\qquad$ _Note that an attribute is accessed without parentheses_
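%% Cell type:markdown id: tags:
The three call patterns listed above can be told apart in a few lines. This is a minimal sketch; the variable names `obj`, `total` and `n_elements` are chosen here for illustration and are not part of the lesson.
%% Cell type:code id: tags:
```python
import pandas as pd

# function: called on the library, returns a new object
obj = pd.Series([1, 2, 3])

# method: called on the object, always with parentheses
total = obj.sum()

# attribute: read from the object, no parentheses
n_elements = obj.size

print(total, n_elements)
```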
%% Cell type:markdown id: tags:
# The `pandas` library
%% Cell type:code id: tags:
```python
import pandas as pd
```
%% Cell type:markdown id: tags:
`pandas` works like `numpy`, but with labeled rows and columns.
%% Cell type:markdown id: tags:
The one-dimensional `pd.Series` object and the two-dimensional `pd.DataFrame` object
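%% Cell type:markdown id: tags:
As a rough illustration of the one- versus two-dimensional distinction, the sketch below builds one object of each kind from made-up values (the names `series`, `frame` and `column` are assumptions) and inspects `ndim` and `shape`. Selecting a single column of a `DataFrame` gives back a `Series`.
%% Cell type:code id: tags:
```python
import pandas as pd

# Illustrative data, not taken from the lesson
series = pd.Series([10, 20, 30])
frame = pd.DataFrame({"a": [10, 20, 30], "b": [1.5, 2.5, 3.5]})

print(series.ndim, series.shape)  # one axis: the index
print(frame.ndim, frame.shape)    # two axes: index and columns

# A single column of a DataFrame is itself a Series
column = frame["a"]
print(type(column).__name__)
```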
%% Cell type:markdown id: tags:
***
## The `pd.Series` object
* _function_ $\to$ `OBJECT = pd.function_name(arg1, arg2, ...)`
%% Cell type:raw id: tags:
??pd.Series
%% Cell type:markdown id: tags:
`pd.Series(data=None, index=None, dtype=None, name=None, copy=False, fastpath=False)`
%% Cell type:code id: tags:
```python
from numpy import random

random.seed(123)
my_data = random.randint(low=-10, high=10, size=26)
my_data
```
%% Output
array([ 3, -8, -8, -4, 7, 9, 0, -9, -10, 7, 5, -1, -10,
4, -10, 5, 9, 4, -6, -10, 6, -6, 7, -7, -8, -3])
%% Cell type:code id: tags:
```python
s = pd.Series(data=my_data, name="my_pandas_series")
s
```
%% Output
0 3
1 -8
2 -8
3 -4
4 7
5 9
6 0
7 -9
8 -10
9 7
10 5
11 -1
12 -10
13 4
14 -10
15 5
16 9
17 4
18 -6
19 -10
20 6
21 -6
22 7
23 -7
24 -8
25 -3
Name: my_pandas_series, dtype: int32
%% Cell type:markdown id: tags:
**Element-wise arithmetic**
%% Cell type:code id: tags:
```python
s * 0.1
```
%% Output
0 0.3
1 -0.8
2 -0.8
3 -0.4
4 0.7
5 0.9
6 0.0
7 -0.9
8 -1.0
9 0.7
10 0.5
11 -0.1
12 -1.0
13 0.4
14 -1.0
15 0.5
16 0.9
17 0.4
18 -0.6
19 -1.0
20 0.6
21 -0.6
22 0.7
23 -0.7
24 -0.8
25 -0.3
Name: my_pandas_series, dtype: float64
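%% Cell type:markdown id: tags:
Element-wise arithmetic is not limited to multiplying by a scalar. The sketch below uses a small made-up Series (`values` is a name assumed for this example) to show a few other operators, including arithmetic between two Series, which pandas matches up by index label.
%% Cell type:code id: tags:
```python
import pandas as pd

# Small illustrative Series, not the lesson's random data
values = pd.Series([3, -8, 7], name="demo")

scaled = values * 2     # every element multiplied by 2
shifted = values + 10   # every element shifted by 10
squared = values ** 2   # every element squared

# Two Series are combined element by element, matched on the index labels
combined = values + scaled

print(combined)
```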
%% Cell type:markdown id: tags:
***
### `pd.Series` attributes
* _attribute_ $\to$ `OBJECT.attribute` $\qquad$ _Note that an attribute is accessed without parentheses_
%% Cell type:code id: tags:
```python
s.dtypes
```
%% Output
dtype('int32')
%% Cell type:code id: tags:
```python
s.index
```
%% Output
RangeIndex(start=0, stop=26, step=1)
%% Cell type:markdown id: tags:
***
### Selection and slicing by index
%% Cell type:code id: tags:
```python
s[2]
```
%% Output
-8
%% Cell type:code id: tags:
```python
s[2:6]
```
%% Output
2 -8
3 -4
4 7
5 9
Name: my_pandas_series, dtype: int32
%% Cell type:markdown id: tags:
#### Challenge:
> Change the index to (arbitrary) letters of the alphabet
%% Cell type:code id: tags:
```python
import string

letters = string.ascii_uppercase
letters
```
%% Output
'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
%% Cell type:code id: tags:
```python
# A string is iterable, so list() yields one index label per letter
s.index = list(letters)
s
```
%% Output
A 3
B -8
C -8
D -4
E 7
F 9
G 0
H -9
I -10
J 7
K 5
L -1
M -10
N 4
O -10
P 5
Q 9
R 4
S -6
T -10
U 6
V -6
W 7
X -7
Y -8
Z -3
Name: my_pandas_series, dtype: int32
%% Cell type:code id: tags:
```python
s.index
```
%% Output
Index(['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N',
'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z'],
dtype='object')
%% Cell type:code id: tags:
```python
s["C"]
```
%% Output
-8
%% Cell type:code id: tags:
```python
s["C":"K"]
```
%% Output
C -8
D -4
E 7
F 9
G 0
H -9
I -10
J 7
K 5
Name: my_pandas_series, dtype: int32
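%% Cell type:markdown id: tags:
Note that the label slice `s["C":"K"]` above includes the end label `K`, whereas the earlier positional slice `s[2:6]` stopped before position 6. A minimal sketch of that difference, using a short made-up Series and the explicit `.iloc` and `.loc` accessors (the names `letters_series`, `by_position` and `by_label` are assumptions):
%% Cell type:code id: tags:
```python
import pandas as pd

# Illustrative Series with letter labels
letters_series = pd.Series([3, -8, -8, -4, 7], index=list("ABCDE"))

# Positional slice: the end position is excluded (positions 1 and 2)
by_position = letters_series.iloc[1:3]

# Label slice: the end label is included (B, C and D)
by_label = letters_series.loc["B":"D"]

print(list(by_position.index))
print(list(by_label.index))
```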
%% Cell type:markdown id: tags:
***
### `pd.Series` methods
* _method_ $\to$ `OBJECT.method_name(arg1, arg2, ...)`
%% Cell type:code id: tags:
```python
s
```
%% Output
A 3
B -8
C -8
D -4
E 7
F 9
G 0
H -9
I -10
J 7
K 5
L -1
M -10
N 4
O -10
P 5
Q 9
R 4
S -6
T -10
U 6
V -6
W 7
X -7
Y -8
Z -3
Name: my_pandas_series, dtype: int32
%% Cell type:code id: tags:
```python
s.sum()
```
%% Output
-34
%% Cell type:code id: tags:
```python
s.mean()
```
%% Output
-1.3076923076923077
%% Cell type:code id: tags:
```python
s.max()
```
%% Output
9
%% Cell type:code id: tags:
```python
s.min()
```
%% Output
-10
%% Cell type:code id: tags:
```python
s.median()
```
%% Output
-2.0
%% Cell type:code id: tags:
```python
s.quantile(q=0.5)
```
%% Output
-2.0
%% Cell type:code id: tags:
```python
s.quantile(q=[0.25, 0.5, 0.75])
```
%% Output
0.25 -8.0
0.50 -2.0
0.75 5.0
Name: my_pandas_series, dtype: float64
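%% Cell type:markdown id: tags:
The outputs above suggest that the median is simply the 0.5 quantile. The sketch below checks this on a tiny made-up Series (`values` is an assumed name) and relies on the default linear interpolation of `quantile` for positions that fall between two data points.
%% Cell type:code id: tags:
```python
import pandas as pd

# Four illustrative values; the median lies between 2 and 3
values = pd.Series([1, 2, 3, 4])

median = values.median()
q50 = values.quantile(q=0.5)

# Passing a list of quantiles returns a Series indexed by the quantile
quartiles = values.quantile(q=[0.25, 0.5, 0.75])

print(median, q50)
print(quartiles)
```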
%% Cell type:markdown id: tags:
***
## The `pd.DataFrame` object
%% Cell type:code id: tags:
```python
from IPython.display import IFrame

IFrame("http://duelingdata.blogspot.de/2016/01/the-beatles.html", width="100%", height=400)
```
%% Output
<IPython.lib.display.IFrame at 0x8ca6ba8>
%% Cell type:markdown id: tags:
* _function_ $\to$ `OBJECT = pd.function_name(arg1, arg2, ...)`
%% Cell type:code id: tags:
```python
df = pd.DataFrame({
    "id": range(1, 5),
    "Name": ["John", "Paul", "George", "Ringo"],
    "Last Name": ["Lennon", "McCartney", "Harrison", "Star"],
    "dead": [True, False, True, False],
    "year_born": [1940, 1942, 1943, 1940],
    "no_of_songs": [62, 58, 24, 3],
})
df
```
%% Output
Last Name Name dead id no_of_songs year_born
0 Lennon John True 1 62 1940
1 McCartney Paul False 2 58 1942
2 Harrison George True 3 24 1943
3 Star Ringo False 4 3 1940
%% Cell type:markdown id: tags:
***
### `pd.DataFrame` attributes
* _attribute_ $\to$ `OBJECT.attribute`
%% Cell type:code id: tags:
```python
df.dtypes
```
%% Output
Last Name object
Name object
dead bool
id int64
no_of_songs int64
year_born int64
dtype: object
%% Cell type:code id: tags:
```python
# axis 0: the row labels
df.index
```
%% Output
RangeIndex(start=0, stop=4, step=1)
%% Cell type:code id: tags:
```python
df.set_index("id")
```
%% Output
Last Name Name dead no_of_songs year_born
id
1 Lennon John True 62 1940
2 McCartney Paul False 58 1942
3 Harrison George True 24 1943
4 Star Ringo False 3 1940
%% Cell type:code id: tags:
```python
df
```
%% Output
Last Name Name dead id no_of_songs year_born
0 Lennon John True 1 62 1940
1 McCartney Paul False 2 58 1942
2 Harrison George True 3 24 1943
3 Star Ringo False 4 3 1940
%% Cell type:markdown id: tags:
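`set_index` returns a new `DataFrame` and leaves `df` itself unchanged, as the output above shows. To keep the new index, use `df.set_index("id", inplace=True)` or `df = df.set_index("id")`.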
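%% Cell type:markdown id: tags:
A minimal sketch of the reassignment variant on a two-row stand-in table (`people`, `returned` and `columns_before` are names assumed for this example). The first call produces a new object, and only the reassignment changes what `people` refers to.
%% Cell type:code id: tags:
```python
import pandas as pd

# Two-row stand-in for the lesson's table
people = pd.DataFrame({"id": [1, 2], "Name": ["John", "Paul"]})

# The call returns a new DataFrame; `people` still has "id" as a column
returned = people.set_index("id")
columns_before = list(people.columns)

# Reassigning the name keeps the result
people = people.set_index("id")

print(columns_before)
print(list(people.columns), people.index.name)
```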
%% Cell type:code id: tags:
```python
df.set_index("id", inplace=True)
```
%% Cell type:code id: tags:
```python
df
```
%% Output
Last Name Name dead no_of_songs year_born
id
1 Lennon John True 62 1940
2 McCartney Paul False 58 1942
3 Harrison George True 24 1943
4 Star Ringo False 3 1940
%% Cell type:code id: tags:
```python
# axis 1: the column labels
df.columns
```
%% Output
Index(['Last Name', 'Name', 'dead', 'no_of_songs', 'year_born'], dtype='object')
%% Cell type:markdown id: tags:
***
### Selection and slicing by indices
%% Cell type:markdown id: tags:
**Column index**
%% Cell type:code id: tags:
```python
df["Name"]
```
%% Output
id
1 John
2 Paul
3 George
4 Ringo
Name: Name, dtype: object
%% Cell type:code id: tags:
```python
df[["Name", "Last Name"]]
```
%% Output
Name Last Name
id
1 John Lennon
2 Paul McCartney
3 George Harrison
4 Ringo Star
%% Cell type:code id: tags:
```python
df.dead
```
%% Output
id
1 True
2 False
3 True
4 False
Name: dead, dtype: bool
%% Cell type:markdown id: tags:
**Row index** `.loc[]`, `.iloc[]`
%% Cell type:code id: tags:
```python
df.loc[1]
```
%% Output
Last Name Lennon
Name John
dead True
no_of_songs 62
year_born 1940
Name: 1, dtype: object
%% Cell type:code id: tags:
```python
df.iloc[0]
```
%% Output
Last Name Lennon
Name John
dead True
no_of_songs 62
year_born 1940
Name: 1, dtype: object
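%% Cell type:markdown id: tags:
Both cells above return the same row because `.loc[]` looks up the index label `1`, while `.iloc[]` looks up integer position `0`. The sketch below repeats this on a three-row stand-in table whose index starts at 1 (the name `people` and the result variables are assumptions for illustration).
%% Cell type:code id: tags:
```python
import pandas as pd

# Stand-in table whose index labels start at 1, as in the lesson
people = pd.DataFrame({"Name": ["John", "Paul", "George"]}, index=[1, 2, 3])
people.index.name = "id"

# .loc uses labels: row labelled 1, column labelled "Name"
first_by_label = people.loc[1, "Name"]

# .iloc uses integer positions: first row, first column
first_by_position = people.iloc[0, 0]

# Position 1 is the second row, which carries the label 2
second_by_position = people.iloc[1, 0]

print(first_by_label, first_by_position, second_by_position)
```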
%% Cell type:markdown id: tags:
**Row and column indices** `df.loc[row, col]`
%% Cell type:code id: tags:
```python
df.loc[1, "Last Name"]
```
%% Output
'Lennon'
%% Cell type:code id: tags:
```python
df.loc[2:4, ["Name", "dead"]]
```
%% Output
Name dead
id
2 Paul False
3 George True
4 Ringo False
%% Cell type:markdown id: tags:
**Logical indexing**
%% Cell type:code id: tags:
```python
df
```
%% Output
Last Name Name dead no_of_songs year_born
id
1 Lennon John True 62 1940
2 McCartney Paul False 58 1942
3 Harrison George True 24 1943
4 Star Ringo False 3 1940
%% Cell type:code id: tags:
```python
df.no_of_songs > 50
```
%% Output
id
1 True
2 True
3 False
4 False
Name: no_of_songs, dtype: bool
%% Cell type:code id: tags:
```python
df.loc[df.no_of_songs > 50]
```
%% Output
Last Name Name dead no_of_songs year_born
id
1 Lennon John True 62 1940
2 McCartney Paul False 58 1942
%% Cell type:code id: tags:
```python
df.loc[(df.no_of_songs > 50) & (df.year_born >= 1942)]
```
%% Output
Last Name Name dead no_of_songs year_born
id
2 McCartney Paul False 58 1942
%% Cell type:code id: tags:
```python
df.loc[(df.no_of_songs > 50) & (df.year_born >= 1942), ["Last Name", "Name"]]
```
%% Output
Last Name Name
id
2 McCartney Paul
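%% Cell type:markdown id: tags:
The cells above combine conditions with `&` (and). As a sketch of the other element-wise logical operators, the example below uses `|` (or) and `~` (not) on a reduced copy of the lesson's data; the names `songs` and `mask` are assumed here. Each condition needs its own parentheses because `&`, `|` and `~` bind more tightly than comparisons.
%% Cell type:code id: tags:
```python
import pandas as pd

# Reduced copy of the lesson's table
songs = pd.DataFrame({
    "Name": ["John", "Paul", "George", "Ringo"],
    "no_of_songs": [62, 58, 24, 3],
    "year_born": [1940, 1942, 1943, 1940],
})

# A boolean mask with one True/False per row; | means "or"
mask = (songs["no_of_songs"] > 50) | (songs["year_born"] >= 1943)

either = songs.loc[mask, "Name"].tolist()

# ~ negates the mask and selects the remaining rows
neither = songs.loc[~mask, "Name"].tolist()

print(either, neither)
```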
%% Cell type:markdown id: tags:
***
### Manipulating columns, rows and particular entries
%% Cell type:markdown id: tags:
**Add a row to the data set**
%% Cell type:code id: tags:
```python
from numpy import nan

df.loc[5] = ["Mouse", "Mickey", nan, nan, 1928]
df
```
%% Output
Last Name Name dead no_of_songs year_born
id
1 Lennon John 1.0 62.0 1940
2 McCartney Paul 0.0 58.0 1942
3 Harrison George 1.0 24.0 1943
4 Star Ringo 0.0 3.0 1940
5 Mouse Mickey NaN NaN 1928
%% Cell type:code id: tags:
```python
df.dtypes
```
%% Output
Last Name object
Name object
dead float64
no_of_songs float64
year_born int64
dtype: object
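%% Cell type:markdown id: tags:
In the output above, `dead` and `no_of_songs` are now `float64`: NaN is a floating-point value, so a plain integer column cannot hold it. The sketch below reproduces the same upcasting on an integer Series. It uses `reindex` instead of row assignment, which is a substitution made for this example, and the names `counts` and `extended` are assumptions.
%% Cell type:code id: tags:
```python
import pandas as pd

# Integer Series with labels 1..4, mirroring the song counts
counts = pd.Series([62, 58, 24, 3], index=[1, 2, 3, 4])
print(counts.dtype)

# Asking for a label that does not exist yet introduces a missing value,
# and the values are upcast to float64 to make room for NaN
extended = counts.reindex([1, 2, 3, 4, 5])
print(extended.dtype)
print(extended)
```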
%% Cell type:markdown id: tags:
**Add a column to the data set**
%% Cell type:code id: tags:
```python
# pd.datetime was removed in pandas 2.0; pd.Timestamp.today() is the replacement
now = pd.Timestamp.today().year
now
```
%% Output
2017
%% Cell type:code id: tags:
```python
df["age"] = now - df.year_born
df
```
%% Output
Last Name Name dead no_of_songs year_born age
id
1 Lennon John 1.0 62.0 1940 77
2 McCartney Paul 0.0 58.0 1942 75
3 Harrison George 1.0 24.0 1943 74
4 Star Ringo 0.0 3.0 1940 77
5 Mouse Mickey NaN NaN 1928 89
%% Cell type:markdown id: tags:
**Change a particular entry**
%% Cell type:code id: tags:
```python
df.loc[5, "Name"] = "Mini"
```
%% Cell type:code id: tags:
```python
df
```
%% Output
Last Name Name dead no_of_songs year_born age
id
1 Lennon John 1.0 62.0 1940 77
2 McCartney Paul 0.0 58.0 1942 75
3 Harrison George 1.0 24.0 1943 74
4 Star Ringo 0.0 3.0 1940 77
5 Mouse Mini NaN NaN 1928 89
%% Cell type:markdown id: tags:
***
### `pd.DataFrame` methods
* _method_ $\to$ `OBJECT.method_name(arg1, arg2, ...)`
%% Cell type:code id: tags:
```python
df.info()
```
%% Output
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5 entries, 1 to 5
Data columns (total 6 columns):
Last Name 5 non-null object
Name 5 non-null object
dead 4 non-null float64
no_of_songs 4 non-null float64
year_born 5 non-null int64
age 5 non-null int64
dtypes: float64(2), int64(2), object(2)
memory usage: 440.0+ bytes
%% Cell type:code id: tags:
```python
df.describe()
```
%% Output
dead no_of_songs year_born age
count 4.00000 4.000000 5.0000 5.0000
mean 0.50000 36.750000 1938.6000 78.4000
std 0.57735 28.229712 6.0663 6.0663
min 0.00000 3.000000 1928.0000 74.0000
25% 0.00000 18.750000 1940.0000 75.0000
50% 0.50000 41.000000 1940.0000 77.0000
75% 1.00000 59.000000 1942.0000 77.0000
max 1.00000 62.000000 1943.0000 89.0000
%% Cell type:code id: tags:
```python
df.describe(include="all")
```
%% Output
Last Name Name dead no_of_songs year_born age
count 5 5 4.00000 4.000000 5.0000 5.0000
unique 5 5 NaN NaN NaN NaN
top McCartney Paul NaN NaN NaN NaN
freq 1 1 NaN NaN NaN NaN
mean NaN NaN 0.50000 36.750000 1938.6000 78.4000
std NaN NaN 0.57735 28.229712 6.0663 6.0663
min NaN NaN 0.00000 3.000000 1928.0000 74.0000
25% NaN NaN 0.00000 18.750000 1940.0000 75.0000
50% NaN NaN 0.50000 41.000000 1940.0000 77.0000
75% NaN NaN 1.00000 59.000000 1942.0000 77.0000
max NaN NaN 1.00000 62.000000 1943.0000 89.0000
%% Cell type:code id: tags:
```python
df
```
%% Output
Last Name Name dead no_of_songs year_born age
id
1 Lennon John 1.0 62.0 1940 77
2 McCartney Paul 0.0 58.0 1942 75
3 Harrison George 1.0 24.0 1943 74
4 Star Ringo 0.0 3.0 1940 77
5 Mouse Mini NaN NaN 1928 89
%% Cell type:code id: tags:
```python
# default axis=0: sum down the rows, one result per column
df.sum()
```
%% Output
Last Name LennonMcCartneyHarrisonStarMouse
Name JohnPaulGeorgeRingoMini
dead 2
no_of_songs 147
year_born 9693
age 392
dtype: object
%% Cell type:code id: tags:
```python
# axis=1: sum across the columns, one result per row
df.sum(axis=1)
```
%% Output
id
1 2080.0
2 2075.0
3 2042.0
4 2020.0
5 2017.0
dtype: float64
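%% Cell type:markdown id: tags:
The `axis` argument names the axis that is collapsed by the sum. A minimal sketch on a purely numeric made-up table (`table`, `per_column` and `per_row` are assumed names) makes the two directions easier to see than on the mixed-type table above.
%% Cell type:code id: tags:
```python
import pandas as pd

# Illustrative numeric table
table = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# axis=0 (default): the rows are collapsed, one total per column
per_column = table.sum()

# axis=1: the columns are collapsed, one total per row
per_row = table.sum(axis=1)

print(per_column)
print(per_row)
```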
%% Cell type:markdown id: tags:
#### `groupby` method
%% Cell type:code id: tags:
```python
df
```
%% Output
Last Name Name dead no_of_songs year_born age
id
1 Lennon John 1.0 62.0 1940 77
2 McCartney Paul 0.0 58.0 1942 75
3 Harrison George 1.0 24.0 1943 74
4 Star Ringo 0.0 3.0 1940 77
5 Mouse Mini NaN NaN 1928 89
%% Cell type:code id: tags:
```python
df.groupby("dead")
```
%% Output
<pandas.core.groupby.DataFrameGroupBy object at 0x0000000007CB89B0>
%% Cell type:code id: tags:
```python
df.groupby("dead").sum()
```
%% Output
no_of_songs year_born age
dead
0.0 61.0 3882 152
1.0 86.0 3883 151
%% Cell type:code id: tags:
```python
df.groupby("dead")["no_of_songs"].sum()
```
%% Output
dead
0.0 61.0
1.0 86.0
Name: no_of_songs, dtype: float64
%% Cell type:code id: tags:
```python
df.groupby("dead")["no_of_songs"].mean()
```
%% Output
dead
0.0 30.5
1.0 43.0
Name: no_of_songs, dtype: float64
%% Cell type:code id: tags:
```python
df.groupby("dead")["no_of_songs"].agg(["mean", "max", "min"])
```
%% Output
mean max min
dead
0.0 30.5 58.0 3.0
1.0 43.0 62.0 24.0
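%% Cell type:markdown id: tags:
`groupby` splits the rows by the values of a key column, applies the aggregation to each group and combines the results into one table. The sketch below shows this on a reduced copy of the song counts. The string column `status` is a hypothetical stand-in for the lesson's `dead` column, and the manual check filters one group by hand to compare against the grouped result.
%% Cell type:code id: tags:
```python
import pandas as pd

# Reduced copy of the lesson's data; "status" is a hypothetical key column
songs = pd.DataFrame({
    "status": ["dead", "alive", "dead", "alive"],
    "no_of_songs": [62, 58, 24, 3],
})

# Split by status, then compute several statistics per group
summary = songs.groupby("status")["no_of_songs"].agg(["mean", "max", "min"])

# The same mean for one group, computed by filtering manually
manual_mean_dead = songs.loc[songs["status"] == "dead", "no_of_songs"].mean()

print(summary)
print(manual_mean_dead)
```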
%% Cell type:markdown id: tags:
#### `plot` method
%% Cell type:code id: tags:
```python
%matplotlib inline
```
%% Cell type:code id: tags:
```python
df
```
%% Output
Last Name Name dead no_of_songs year_born age
id
1 Lennon John 1.0 62.0 1940 77
2 McCartney Paul 0.0 58.0 1942 75
3 Harrison George 1.0 24.0 1943 74
4 Star Ringo 0.0 3.0 1940 77
5 Mouse Mini NaN NaN 1928 89
%% Cell type:code id: tags:
```python
df[["no_of_songs", "age"]].plot()
```
%% Output
<matplotlib.axes._subplots.AxesSubplot at 0x8563b70>
%% Cell type:code id: tags:
```python
df["dead"].plot.hist()
```
%% Output
<matplotlib.axes._subplots.AxesSubplot at 0x8861b38>
%% Cell type:code id: tags:
```python
df["age"].plot.bar()
```
%% Output
<matplotlib.axes._subplots.AxesSubplot at 0x88d1be0>
%% Cell type:code id: tags:
```python
ax = df["age"].plot.bar()
ax.set_xticklabels(df.Name);
```
%% Output
%% Cell type:code id: tags:
```python
ax = df["age"].plot.bar(rot=0)
ax.set_xticklabels(df.Name);
```
%% Output
%% Cell type:code id: tags:
```python
ax = df["age"].plot.bar(rot=0)
ax.set_xticklabels(df.Name)
ax.set_title("The Beatles and ... something else", size=18);
```
%% Output
%% Cell type:code id: tags:
```python
ax = df["age"].plot.bar(rot=0)
ax.set_xticklabels(df.Name)
ax.set_title("The Beatles and ... something else", size=18)
ax.set_xlabel("")
ax.set_ylabel("age", size=12);
```
%% Output
...
...
SWC-2018-02-22-Applied-Data-Analysis-I-Time-Series-Analysis.ipynb
0 → 100644
View file @
c56cd197
This source diff could not be displayed because it is too large. You can view the blob instead.