import matplotlib.pyplot as plt
import numpy as np
import pandas as pdMcKinney Chapter 5 - Getting Started with pandas
FINA 6333 for Spring 2024
1 Introduction
Chapter 5 of Wes McKinney’s Python for Data Analysis discusses the fundamentals of pandas, which will be our main tool for the rest of the semester. pandas is an abbreviation for panel data, which provide time-stamped data for multiple individuals or firms.
Note: Indented block quotes are from McKinney unless otherwise indicated. The section numbers here differ from McKinney because we will only discuss some topics.
%precision 4
pd.options.display.float_format = '{:.4f}'.format
%config InlineBackend.figure_format = 'retina'pandas will be a major tool of interest throughout much of the rest of the book. It contains data structures and data manipulation tools designed to make data cleaning and analysis fast and easy in Python. pandas is often used in tandem with numerical computing tools like NumPy and SciPy, analytical libraries like statsmodels and scikit-learn, and data visualization libraries like matplotlib. pandas adopts significant parts of NumPy’s idiomatic style of array-based computing, especially array-based functions and a preference for data processing without for loops.
While pandas adopts many coding idioms from NumPy, the biggest difference is that pandas is designed for working with tabular or heterogeneous data. NumPy, by contrast, is best suited for working with homogeneous numerical array data.
We will use pandas—a wrapper for NumPy that helps us manipulate and combine data—every day for the rest of the course.
2 Introduction to pandas Data Structures
To get started with pandas, you will need to get comfortable with its two workhorse data structures: Series and DataFrame. While they are not a universal solution for every problem, they provide a solid, easy-to-use basis for most applications.
2.1 Series
A Series is a one-dimensional array-like object containing a sequence of values (of similar types to NumPy types) and an associated array of data labels, called its index. The simplest Series is formed from only an array of data.
The early examples use integer and string labels, but date-time labels are most useful.
obj = pd.Series([4, 7, -5, 3])
obj0 4
1 7
2 -5
3 3
dtype: int64
Contrast obj with a NumPy array equivalent:
np.array([4, 7, -5, 3])array([ 4, 7, -5, 3])
obj.valuesarray([ 4, 7, -5, 3], dtype=int64)
obj.index # similar to range(4)RangeIndex(start=0, stop=4, step=1)
We did not explicitly assign an index to obj, so obj has an integer index that starts at 0. We can explicitly assign an index with the index= argument.
obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])
obj2d 4
b 7
a -5
c 3
dtype: int64
obj2.indexIndex(['d', 'b', 'a', 'c'], dtype='object')
obj2['a']-5
obj2.iloc[2]-5
obj2['d'] = 6
obj2d 6
b 7
a -5
c 3
dtype: int64
obj2[['c', 'a', 'd']]c 3
a -5
d 6
dtype: int64
A pandas series behaves like a NumPy array. We can use Boolean filters and perform vectorized mathematical operations.
obj2 > 0d True
b True
a False
c True
dtype: bool
obj2[obj2 > 0]d 6
b 7
c 3
dtype: int64
obj2 * 2d 12
b 14
a -10
c 6
dtype: int64
'b' in obj2True
'e' in obj2False
We can create a pandas series from a dictionary. The dictionary labels become the series index.
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
obj3 = pd.Series(sdata)
obj3Ohio 35000
Texas 71000
Oregon 16000
Utah 5000
dtype: int64
We can create a pandas series from a list, too. Note that pandas respects the order of the assigned index. Also, pandas keeps California with NaN (not a number or missing value) and drops Utah because it was not in the index.
states = ['California', 'Ohio', 'Oregon', 'Texas']
obj4 = pd.Series(sdata, index=states)
obj4California NaN
Ohio 35000.0000
Oregon 16000.0000
Texas 71000.0000
dtype: float64
Indices are one of pandas’ super powers. When we perform mathematical operations, pandas aligns series by their indices. Here NaN is “not a number”, which indicates missing values. NaN is considered a float, so the data type switches from int64 to float64.
obj3 + obj4California NaN
Ohio 70000.0000
Oregon 32000.0000
Texas 142000.0000
Utah NaN
dtype: float64
2.2 DataFrame
A pandas data frame is like a worksheet in an Excel workbook with row and columns that provide fast indexing.
A DataFrame represents a rectangular table of data and contains an ordered collection of columns, each of which can be a different value type (numeric, string, boolean, etc.). The DataFrame has both a row and column index; it can be thought of as a dict of Series all sharing the same index. Under the hood, the data is stored as one or more two-dimensional blocks rather than a list, dict, or some other collection of one-dimensional arrays. The exact details of DataFrame’s internals are outside the scope of this book.
There are many ways to construct a DataFrame, though one of the most common is from a dict of equal-length lists or NumPy arrays:
data = {
'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
'year': [2000, 2001, 2002, 2001, 2002, 2003],
'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]
}
frame = pd.DataFrame(data)
frame| state | year | pop | |
|---|---|---|---|
| 0 | Ohio | 2000 | 1.5000 |
| 1 | Ohio | 2001 | 1.7000 |
| 2 | Ohio | 2002 | 3.6000 |
| 3 | Nevada | 2001 | 2.4000 |
| 4 | Nevada | 2002 | 2.9000 |
| 5 | Nevada | 2003 | 3.2000 |
We did not specify an index, so frame has the default index of integers starting at 0.
frame2 = pd.DataFrame(
data,
columns=['year', 'state', 'pop', 'debt'],
index=['one', 'two', 'three', 'four', 'five', 'six']
)
frame2| year | state | pop | debt | |
|---|---|---|---|---|
| one | 2000 | Ohio | 1.5000 | NaN |
| two | 2001 | Ohio | 1.7000 | NaN |
| three | 2002 | Ohio | 3.6000 | NaN |
| four | 2001 | Nevada | 2.4000 | NaN |
| five | 2002 | Nevada | 2.9000 | NaN |
| six | 2003 | Nevada | 3.2000 | NaN |
If we extract one column, via either df.column or df['column'], the result is a series. We can use either the df.colname or the df['colname'] syntax to extract a column from a data frame as a series. However, we must use the df['colname'] syntax to add* a column to a data frame.* Also, we must use the df['colname'] syntax to extract or add a column whose name contains a whitespace.
frame2['state']one Ohio
two Ohio
three Ohio
four Nevada
five Nevada
six Nevada
Name: state, dtype: object
frame2.stateone Ohio
two Ohio
three Ohio
four Nevada
five Nevada
six Nevada
Name: state, dtype: object
Similarly, if we extract one row. via either df.loc['rowlabel'] or df.iloc[rownumber], the result is a series.
frame2| year | state | pop | debt | |
|---|---|---|---|---|
| one | 2000 | Ohio | 1.5000 | NaN |
| two | 2001 | Ohio | 1.7000 | NaN |
| three | 2002 | Ohio | 3.6000 | NaN |
| four | 2001 | Nevada | 2.4000 | NaN |
| five | 2002 | Nevada | 2.9000 | NaN |
| six | 2003 | Nevada | 3.2000 | NaN |
frame2.loc['one']year 2000
state Ohio
pop 1.5000
debt NaN
Name: one, dtype: object
Data frame have two dimensions, so we have to slice data frames more precisely than series.
- The
.loc[]method slices by row labels and column names - The
.iloc[]method slices by integer row and label indices
frame2.loc['three']year 2002
state Ohio
pop 3.6000
debt NaN
Name: three, dtype: object
frame2.iloc[2]year 2002
state Ohio
pop 3.6000
debt NaN
Name: three, dtype: object
We can use NumPy’s [row, column] syntanx with .loc[] and .iloc[].
frame2.loc['three', 'state'] # row, column'Ohio'
frame2.loc['three', ['state', 'pop']] # row, columnstate Ohio
pop 3.6000
Name: three, dtype: object
We can assign either scalars or arrays to data frame columns.
- Scalars will broadcast to every row in the data frame
- Arrays must have the same length as the column
frame2['debt'] = 16.5
frame2| year | state | pop | debt | |
|---|---|---|---|---|
| one | 2000 | Ohio | 1.5000 | 16.5000 |
| two | 2001 | Ohio | 1.7000 | 16.5000 |
| three | 2002 | Ohio | 3.6000 | 16.5000 |
| four | 2001 | Nevada | 2.4000 | 16.5000 |
| five | 2002 | Nevada | 2.9000 | 16.5000 |
| six | 2003 | Nevada | 3.2000 | 16.5000 |
frame2['debt'] = np.arange(6.)
frame2| year | state | pop | debt | |
|---|---|---|---|---|
| one | 2000 | Ohio | 1.5000 | 0.0000 |
| two | 2001 | Ohio | 1.7000 | 1.0000 |
| three | 2002 | Ohio | 3.6000 | 2.0000 |
| four | 2001 | Nevada | 2.4000 | 3.0000 |
| five | 2002 | Nevada | 2.9000 | 4.0000 |
| six | 2003 | Nevada | 3.2000 | 5.0000 |
If we assign a series to a data frame column, pandas will use the index to align it with the data frame. Data frame rows not in the series will be missing values NaN.
val = pd.Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])
valtwo -1.2000
four -1.5000
five -1.7000
dtype: float64
frame2['debt'] = val
frame2| year | state | pop | debt | |
|---|---|---|---|---|
| one | 2000 | Ohio | 1.5000 | NaN |
| two | 2001 | Ohio | 1.7000 | -1.2000 |
| three | 2002 | Ohio | 3.6000 | NaN |
| four | 2001 | Nevada | 2.4000 | -1.5000 |
| five | 2002 | Nevada | 2.9000 | -1.7000 |
| six | 2003 | Nevada | 3.2000 | NaN |
We can add columns to our data frame, then delete them with del.
frame2['eastern'] = (frame2.state == 'Ohio')
frame2| year | state | pop | debt | eastern | |
|---|---|---|---|---|---|
| one | 2000 | Ohio | 1.5000 | NaN | True |
| two | 2001 | Ohio | 1.7000 | -1.2000 | True |
| three | 2002 | Ohio | 3.6000 | NaN | True |
| four | 2001 | Nevada | 2.4000 | -1.5000 | False |
| five | 2002 | Nevada | 2.9000 | -1.7000 | False |
| six | 2003 | Nevada | 3.2000 | NaN | False |
del frame2['eastern']
frame2| year | state | pop | debt | |
|---|---|---|---|---|
| one | 2000 | Ohio | 1.5000 | NaN |
| two | 2001 | Ohio | 1.7000 | -1.2000 |
| three | 2002 | Ohio | 3.6000 | NaN |
| four | 2001 | Nevada | 2.4000 | -1.5000 |
| five | 2002 | Nevada | 2.9000 | -1.7000 |
| six | 2003 | Nevada | 3.2000 | NaN |
2.3 Index Objects
obj = pd.Series(range(3), index=['a', 'b', 'c'])
index = obj.index
indexIndex(['a', 'b', 'c'], dtype='object')
Index objects are immutable!
# index[1] = 'd' # TypeError: Index does not support mutable operationsIndices can contain duplicates, so an index does not guarantee our data are duplicate-free.
dup_labels = pd.Index(['foo', 'foo', 'bar', 'bar'])
dup_labelsIndex(['foo', 'foo', 'bar', 'bar'], dtype='object')
3 Essential Functionality
This section provides the most import pandas operations. It is difficult to provide an exhaustive reference, but this section provides a head start on the core pandas functionality.
3.1 Dropping Entries from an Axis
Dropping one or more entries from an axis is easy if you already have an index array or list without those entries. As that can require a bit of munging and set logic, the drop method will return a new object with the indicated value or values deleted from an axis.
obj = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])
obja 0.0000
b 1.0000
c 2.0000
d 3.0000
e 4.0000
dtype: float64
obj_without_d_and_c = obj.drop(['d', 'c'])
obj_without_d_and_ca 0.0000
b 1.0000
e 4.0000
dtype: float64
The .drop() method works on data frames, too.
data = pd.DataFrame(
np.arange(16).reshape((4, 4)),
index=['Ohio', 'Colorado', 'Utah', 'New York'],
columns=['one', 'two', 'three', 'four']
)
data| one | two | three | four | |
|---|---|---|---|---|
| Ohio | 0 | 1 | 2 | 3 |
| Colorado | 4 | 5 | 6 | 7 |
| Utah | 8 | 9 | 10 | 11 |
| New York | 12 | 13 | 14 | 15 |
data.drop(['Colorado', 'Ohio']) # implied ", axis=0"| one | two | three | four | |
|---|---|---|---|---|
| Utah | 8 | 9 | 10 | 11 |
| New York | 12 | 13 | 14 | 15 |
data.drop(['Colorado', 'Ohio'], axis=0)| one | two | three | four | |
|---|---|---|---|---|
| Utah | 8 | 9 | 10 | 11 |
| New York | 12 | 13 | 14 | 15 |
data.drop(index=['Colorado', 'Ohio'])| one | two | three | four | |
|---|---|---|---|---|
| Utah | 8 | 9 | 10 | 11 |
| New York | 12 | 13 | 14 | 15 |
The .drop() method accepts an axis argument and the default is axis=0 to drop rows based on labels. To drop columns, we use axis=1 or axis='columns'.
data.drop('two', axis=1)| one | three | four | |
|---|---|---|---|
| Ohio | 0 | 2 | 3 |
| Colorado | 4 | 6 | 7 |
| Utah | 8 | 10 | 11 |
| New York | 12 | 14 | 15 |
data.drop(columns='two')| one | three | four | |
|---|---|---|---|
| Ohio | 0 | 2 | 3 |
| Colorado | 4 | 6 | 7 |
| Utah | 8 | 10 | 11 |
| New York | 12 | 14 | 15 |
3.2 Indexing, Selection, and Filtering
Indexing, selecting, and filtering will be among our most-used pandas features.
obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])
obja 0.0000
b 1.0000
c 2.0000
d 3.0000
dtype: float64
obj['b']1.0000
obj.iloc[1]1.0000
obj.iloc[1:3]b 1.0000
c 2.0000
dtype: float64
When we slice with labels, the left and right endpoints are inclusive.
obj['b':'c']b 1.0000
c 2.0000
dtype: float64
obj['b':'c'] = 5
obja 0.0000
b 5.0000
c 5.0000
d 3.0000
dtype: float64
data = pd.DataFrame(
np.arange(16).reshape((4, 4)),
index=['Ohio', 'Colorado', 'Utah', 'New York'],
columns=['one', 'two', 'three', 'four']
)
data| one | two | three | four | |
|---|---|---|---|---|
| Ohio | 0 | 1 | 2 | 3 |
| Colorado | 4 | 5 | 6 | 7 |
| Utah | 8 | 9 | 10 | 11 |
| New York | 12 | 13 | 14 | 15 |
Indexing one column returns a series.
data['two']Ohio 1
Colorado 5
Utah 9
New York 13
Name: two, dtype: int32
Indexing two or more columns returns a data frame.
data[['three', 'one']]| three | one | |
|---|---|---|
| Ohio | 2 | 0 |
| Colorado | 6 | 4 |
| Utah | 10 | 8 |
| New York | 14 | 12 |
If we want a one-column data frame, we can use [[]]:
data['three']Ohio 2
Colorado 6
Utah 10
New York 14
Name: three, dtype: int32
data[['three']]| three | |
|---|---|
| Ohio | 2 |
| Colorado | 6 |
| Utah | 10 |
| New York | 14 |
data.iloc[:2]| one | two | three | four | |
|---|---|---|---|---|
| Ohio | 0 | 1 | 2 | 3 |
| Colorado | 4 | 5 | 6 | 7 |
We can index a data frame with Booleans, as we did with NumPy arrays.
data < 5| one | two | three | four | |
|---|---|---|---|---|
| Ohio | True | True | True | True |
| Colorado | True | False | False | False |
| Utah | False | False | False | False |
| New York | False | False | False | False |
data| one | two | three | four | |
|---|---|---|---|---|
| Ohio | 0 | 1 | 2 | 3 |
| Colorado | 4 | 5 | 6 | 7 |
| Utah | 8 | 9 | 10 | 11 |
| New York | 12 | 13 | 14 | 15 |
data[data < 5] = 0
data| one | two | three | four | |
|---|---|---|---|---|
| Ohio | 0 | 0 | 0 | 0 |
| Colorado | 0 | 5 | 6 | 7 |
| Utah | 8 | 9 | 10 | 11 |
| New York | 12 | 13 | 14 | 15 |
Finally, we can chain slices.
data.iloc[:, :3][data.three > 5]| one | two | three | |
|---|---|---|---|
| Colorado | 0 | 5 | 6 |
| Utah | 8 | 9 | 10 |
| New York | 12 | 13 | 14 |
Table 5-4 summarizes data frame indexing and slicing options:
df[val]: Select single column or sequence of columns from the DataFrame; special case conveniences: boolean array (filter rows), slice (slice rows), or boolean DataFrame (set values based on some criterion)df.loc[val]: Selects single row or subset of rows from the DataFrame by labeldf.loc[:, val]: Selects single column or subset of columns by labeldf.loc[val1, val2]: Select both rows and columns by labeldf.iloc[where]: Selects single row or subset of rows from the DataFrame by integer positiondf.iloc[:, where]: Selects single column or subset of columns by integer positiondf.iloc[where_i, where_j]: Select both rows and columns by integer positiondf.at[label_i, label_j]: Select a single scalar value by row and column labeldf.iat[i, j]: Select a single scalar value by row and column position (integers) reindex method Select either rows or columns by labelsget_value,set_valuemethods: Select single value by row and column label
pandas is powerful and these options can be overwhelming! We will typically use df[val] to select columns (here val is either a string or list of strings), df.loc[val] to select rows (here val is a row label), and df.loc[val1, val2] to select both rows and columns. The other options add flexibility, and we may occasionally use them. However, our data will be large enough that counting row and column number will be tedious, making .iloc[] impractical.
3.3 Arithmetic and Data Alignment
An important pandas feature for some applications is the behavior of arithmetic between objects with different indexes. When you are adding together objects, if any index pairs are not the same, the respective index in the result will be the union of the index pairs. For users with database experience, this is similar to an automatic outer join on the index labels.
s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])
s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1], index=['a', 'c', 'e', 'f', 'g'])s1a 7.3000
c -2.5000
d 3.4000
e 1.5000
dtype: float64
s2a -2.1000
c 3.6000
e -1.5000
f 4.0000
g 3.1000
dtype: float64
s1 + s2a 5.2000
c 1.1000
d NaN
e 0.0000
f NaN
g NaN
dtype: float64
df1 = pd.DataFrame(np.arange(9.).reshape((3, 3)), columns=list('bcd'), index=['Ohio', 'Texas', 'Colorado'])
df2 = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'), index=['Utah', 'Ohio', 'Texas', 'Oregon'])df1| b | c | d | |
|---|---|---|---|
| Ohio | 0.0000 | 1.0000 | 2.0000 |
| Texas | 3.0000 | 4.0000 | 5.0000 |
| Colorado | 6.0000 | 7.0000 | 8.0000 |
df2| b | d | e | |
|---|---|---|---|
| Utah | 0.0000 | 1.0000 | 2.0000 |
| Ohio | 3.0000 | 4.0000 | 5.0000 |
| Texas | 6.0000 | 7.0000 | 8.0000 |
| Oregon | 9.0000 | 10.0000 | 11.0000 |
df1 + df2| b | c | d | e | |
|---|---|---|---|---|
| Colorado | NaN | NaN | NaN | NaN |
| Ohio | 3.0000 | NaN | 6.0000 | NaN |
| Oregon | NaN | NaN | NaN | NaN |
| Texas | 9.0000 | NaN | 12.0000 | NaN |
| Utah | NaN | NaN | NaN | NaN |
df1 = pd.DataFrame({'A': [1, 2]})
df2 = pd.DataFrame({'B': [3, 4]})df1| A | |
|---|---|
| 0 | 1 |
| 1 | 2 |
df2| B | |
|---|---|
| 0 | 3 |
| 1 | 4 |
df1 - df2| A | B | |
|---|---|---|
| 0 | NaN | NaN |
| 1 | NaN | NaN |
Always check your output!
3.3.1 Arithmetic methods with fill values
df1 = pd.DataFrame(np.arange(12.).reshape((3, 4)), columns=list('abcd'))
df2 = pd.DataFrame(np.arange(20.).reshape((4, 5)), columns=list('abcde'))
df2.loc[1, 'b'] = np.nandf1| a | b | c | d | |
|---|---|---|---|---|
| 0 | 0.0000 | 1.0000 | 2.0000 | 3.0000 |
| 1 | 4.0000 | 5.0000 | 6.0000 | 7.0000 |
| 2 | 8.0000 | 9.0000 | 10.0000 | 11.0000 |
df2| a | b | c | d | e | |
|---|---|---|---|---|---|
| 0 | 0.0000 | 1.0000 | 2.0000 | 3.0000 | 4.0000 |
| 1 | 5.0000 | NaN | 7.0000 | 8.0000 | 9.0000 |
| 2 | 10.0000 | 11.0000 | 12.0000 | 13.0000 | 14.0000 |
| 3 | 15.0000 | 16.0000 | 17.0000 | 18.0000 | 19.0000 |
df1 + df2| a | b | c | d | e | |
|---|---|---|---|---|---|
| 0 | 0.0000 | 2.0000 | 4.0000 | 6.0000 | NaN |
| 1 | 9.0000 | NaN | 13.0000 | 15.0000 | NaN |
| 2 | 18.0000 | 20.0000 | 22.0000 | 24.0000 | NaN |
| 3 | NaN | NaN | NaN | NaN | NaN |
We can specify a fill value for NaN values. Note that pandas fills would-be NaN values in each data frame before the arithmetic operation.
df1.add(df2, fill_value=0)| a | b | c | d | e | |
|---|---|---|---|---|---|
| 0 | 0.0000 | 2.0000 | 4.0000 | 6.0000 | 4.0000 |
| 1 | 9.0000 | 5.0000 | 13.0000 | 15.0000 | 9.0000 |
| 2 | 18.0000 | 20.0000 | 22.0000 | 24.0000 | 14.0000 |
| 3 | 15.0000 | 16.0000 | 17.0000 | 18.0000 | 19.0000 |
3.3.2 Operations between DataFrame and Series
arr = np.arange(12.).reshape((3, 4))
arrarray([[ 0., 1., 2., 3.],
[ 4., 5., 6., 7.],
[ 8., 9., 10., 11.]])
arr[0]array([0., 1., 2., 3.])
arr - arr[0]array([[0., 0., 0., 0.],
[4., 4., 4., 4.],
[8., 8., 8., 8.]])
Arithmetic operations between series and data frames behave the same as the example above.
frame = pd.DataFrame(
np.arange(12.).reshape((4, 3)),
columns=list('bde'),
index=['Utah', 'Ohio', 'Texas', 'Oregon']
)
series = frame.iloc[0]frame| b | d | e | |
|---|---|---|---|
| Utah | 0.0000 | 1.0000 | 2.0000 |
| Ohio | 3.0000 | 4.0000 | 5.0000 |
| Texas | 6.0000 | 7.0000 | 8.0000 |
| Oregon | 9.0000 | 10.0000 | 11.0000 |
seriesb 0.0000
d 1.0000
e 2.0000
Name: Utah, dtype: float64
frame - series| b | d | e | |
|---|---|---|---|
| Utah | 0.0000 | 0.0000 | 0.0000 |
| Ohio | 3.0000 | 3.0000 | 3.0000 |
| Texas | 6.0000 | 6.0000 | 6.0000 |
| Oregon | 9.0000 | 9.0000 | 9.0000 |
series2 = pd.Series(range(3), index=['b', 'e', 'f'])frame| b | d | e | |
|---|---|---|---|
| Utah | 0.0000 | 1.0000 | 2.0000 |
| Ohio | 3.0000 | 4.0000 | 5.0000 |
| Texas | 6.0000 | 7.0000 | 8.0000 |
| Oregon | 9.0000 | 10.0000 | 11.0000 |
series2b 0
e 1
f 2
dtype: int64
frame + series2| b | d | e | f | |
|---|---|---|---|---|
| Utah | 0.0000 | NaN | 3.0000 | NaN |
| Ohio | 3.0000 | NaN | 6.0000 | NaN |
| Texas | 6.0000 | NaN | 9.0000 | NaN |
| Oregon | 9.0000 | NaN | 12.0000 | NaN |
series3 = frame['d']frame.sub(series3, axis='index')| b | d | e | |
|---|---|---|---|
| Utah | -1.0000 | 0.0000 | 1.0000 |
| Ohio | -1.0000 | 0.0000 | 1.0000 |
| Texas | -1.0000 | 0.0000 | 1.0000 |
| Oregon | -1.0000 | 0.0000 | 1.0000 |
3.4 Function Application and Mapping
np.random.seed(42)
frame = pd.DataFrame(
np.random.randn(4, 3),
columns=list('bde'),
index=['Utah', 'Ohio', 'Texas', 'Oregon']
)
frame| b | d | e | |
|---|---|---|---|
| Utah | 0.4967 | -0.1383 | 0.6477 |
| Ohio | 1.5230 | -0.2342 | -0.2341 |
| Texas | 1.5792 | 0.7674 | -0.4695 |
| Oregon | 0.5426 | -0.4634 | -0.4657 |
frame.abs()| b | d | e | |
|---|---|---|---|
| Utah | 0.4967 | 0.1383 | 0.6477 |
| Ohio | 1.5230 | 0.2342 | 0.2341 |
| Texas | 1.5792 | 0.7674 | 0.4695 |
| Oregon | 0.5426 | 0.4634 | 0.4657 |
Another frequent operation is applying a function on one-dimensional arrays to each column or row. DataFrame’s apply method does exactly this:
Note that we can use anonymous (lambda) functions “on the fly”:
1.5792 - 0.49671.0825
frame.apply(lambda x: x.max() - x.min()) # implied axis=0b 1.0825
d 1.2309
e 1.1172
dtype: float64
frame.apply(lambda x: x.max() - x.min(), axis=1)Utah 0.7860
Ohio 1.7572
Texas 2.0487
Oregon 1.0083
dtype: float64
However, under the hood, the .apply() is basically a for loop and much slowly than optimized, built-in methods. Here is an example of the speed costs of .apply():
%timeit frame['e'].abs()15.2 µs ± 4.5 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
%timeit frame['e'].apply(np.abs)32.6 µs ± 9.67 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
4 Summarizing and Computing Descriptive Statistics
df = pd.DataFrame(
[[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
index=['a', 'b', 'c', 'd'],
columns=['one', 'two']
)
df| one | two | |
|---|---|---|
| a | 1.4000 | NaN |
| b | 7.1000 | -4.5000 |
| c | NaN | NaN |
| d | 0.7500 | -1.3000 |
df.sum() # implied axis=0one 9.2500
two -5.8000
dtype: float64
df.sum(axis=1)a 1.4000
b 2.6000
c 0.0000
d -0.5500
dtype: float64
df.mean(axis=1, skipna=False)a NaN
b 1.3000
c NaN
d -0.2750
dtype: float64
The .idxmax() method returns the label for the maximum observation.
df.idxmax()one b
two d
dtype: object
The .describe() returns summary statistics for each numerical column in a data frame.
df.describe()| one | two | |
|---|---|---|
| count | 3.0000 | 2.0000 |
| mean | 3.0833 | -2.9000 |
| std | 3.4937 | 2.2627 |
| min | 0.7500 | -4.5000 |
| 25% | 1.0750 | -3.7000 |
| 50% | 1.4000 | -2.9000 |
| 75% | 4.2500 | -2.1000 |
| max | 7.1000 | -1.3000 |
For non-numerical data, .describe() returns alternative summary statistics.
obj = pd.Series(['a', 'a', 'b', 'c'] * 4)
obj.describe()count 16
unique 3
top a
freq 8
dtype: object
4.1 Correlation and Covariance
import yfinance as yftickers = yf.Tickers('AAPL IBM MSFT GOOG')prices = tickers.history(period='max', auto_adjust=False)[*********************100%%**********************] 4 of 4 completed
prices.tail()| Adj Close | Close | Dividends | ... | Open | Stock Splits | Volume | |||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AAPL | GOOG | IBM | MSFT | AAPL | GOOG | IBM | MSFT | AAPL | GOOG | ... | IBM | MSFT | AAPL | GOOG | IBM | MSFT | AAPL | GOOG | IBM | MSFT | |
| Date | |||||||||||||||||||||
| 2024-01-19 | 191.5600 | 147.9700 | 171.4800 | 398.6700 | 191.5600 | 147.9700 | 171.4800 | 398.6700 | 0.0000 | 0.0000 | ... | 170.5900 | 395.7600 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 68741000.0000 | 27170900.0000 | 6925800 | 29272000.0000 |
| 2024-01-22 | 193.8900 | 147.7100 | 172.8300 | 396.5100 | 193.8900 | 147.7100 | 172.8300 | 396.5100 | 0.0000 | 0.0000 | ... | 172.8200 | 400.0200 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 60133900.0000 | 21829200.0000 | 4926000 | 27016900.0000 |
| 2024-01-23 | 195.1800 | 148.6800 | 173.9400 | 398.9000 | 195.1800 | 148.6800 | 173.9400 | 398.9000 | 0.0000 | 0.0000 | ... | 172.9000 | 395.7500 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 42355600.0000 | 14113600.0000 | 3983500 | 20525900.0000 |
| 2024-01-24 | 194.5000 | 150.3500 | 173.9300 | 402.5600 | 194.5000 | 150.3500 | 173.9300 | 402.5600 | 0.0000 | 0.0000 | ... | 174.7600 | 401.5400 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 53631300.0000 | 19245000.0000 | 9097800 | 24867000.0000 |
| 2024-01-25 | 194.1700 | 153.6400 | 190.4300 | 404.8700 | 194.1700 | 153.6400 | 190.4300 | 404.8700 | 0.0000 | 0.0000 | ... | 184.9600 | 404.3200 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 54460179.0000 | 21334519.0000 | 28357994 | 20529365.0000 |
5 rows × 32 columns
prices['Adj Close'].tail()| AAPL | GOOG | IBM | MSFT | |
|---|---|---|---|---|
| Date | ||||
| 2024-01-19 | 191.5600 | 147.9700 | 171.4800 | 398.6700 |
| 2024-01-22 | 193.8900 | 147.7100 | 172.8300 | 396.5100 |
| 2024-01-23 | 195.1800 | 148.6800 | 173.9400 | 398.9000 |
| 2024-01-24 | 194.5000 | 150.3500 | 173.9300 | 402.5600 |
| 2024-01-25 | 194.1700 | 153.6400 | 190.4300 | 404.8700 |
The prices data frames contains daily data for AAPL, IBM, MSFT, and GOOG. The Adj Close column provides a reverse-engineered daily closing price that accounts for dividends paid and stock splits (and reverse splits). As a result, the .pct_change() in Adj Close considers both price changes (i.e., capital gains) and dividends, so \(R_t = \frac{(P_t + D_t) - P_{t-1}}{P_{t-1}} = \frac{\text{Adj Close}_t - \text{Adj Close}_{t-1}}{\text{Adj Close}_{t-1}}.\)
returns = prices['Adj Close'].pct_change().dropna()
returns| AAPL | GOOG | IBM | MSFT | |
|---|---|---|---|---|
| Date | ||||
| 2004-08-20 | 0.0029 | 0.0794 | 0.0042 | 0.0029 |
| 2004-08-23 | 0.0091 | 0.0101 | -0.0070 | 0.0044 |
| 2004-08-24 | 0.0280 | -0.0414 | 0.0007 | 0.0000 |
| 2004-08-25 | 0.0344 | 0.0108 | 0.0042 | 0.0114 |
| 2004-08-26 | 0.0487 | 0.0180 | -0.0045 | -0.0040 |
| ... | ... | ... | ... | ... |
| 2024-01-19 | 0.0155 | 0.0206 | 0.0278 | 0.0122 |
| 2024-01-22 | 0.0122 | -0.0018 | 0.0079 | -0.0054 |
| 2024-01-23 | 0.0067 | 0.0066 | 0.0064 | 0.0060 |
| 2024-01-24 | -0.0035 | 0.0112 | -0.0001 | 0.0092 |
| 2024-01-25 | -0.0017 | 0.0219 | 0.0949 | 0.0057 |
4891 rows × 4 columns
We multiply by 252 to annualize mean daily returns because means grow linearly with time and there are (about) 252 trading days per year.
returns.mean().mul(252)AAPL 0.3648
GOOG 0.2591
IBM 0.0990
MSFT 0.2005
dtype: float64
We multiply by \(\sqrt{252}\) to annualize the standard deviation of daily returns because variances grow linearly with time, there are (about) 252 trading days per year, and the standard deviation is the square root of the variance.
returns.std().mul(np.sqrt(252))AAPL 0.3277
GOOG 0.3071
IBM 0.2272
MSFT 0.2722
dtype: float64
The best explanation I have found on why stock return volatility (the standard deviation of stocks returns) grows with the square root of time is at the bottom of page 7 of chapter 8 of Ivo Welch’s free corporate finance textbook.
We can calculate pairwise correlations.
returns['MSFT'].corr(returns['IBM'])0.4926
We can also calculate correlation matrices.
returns.corr()| AAPL | GOOG | IBM | MSFT | |
|---|---|---|---|---|
| AAPL | 1.0000 | 0.5189 | 0.4275 | 0.5235 |
| GOOG | 0.5189 | 1.0000 | 0.3953 | 0.5626 |
| IBM | 0.4275 | 0.3953 | 1.0000 | 0.4926 |
| MSFT | 0.5235 | 0.5626 | 0.4926 | 1.0000 |
returns.corr().loc['MSFT', 'IBM']0.4926