McKinney Chapter 5 - Getting Started with pandas

FINA 6333 for Spring 2024

Author

Richard Herron

1 Introduction

Chapter 5 of Wes McKinney’s Python for Data Analysis discusses the fundamentals of pandas, which will be our main tool for the rest of the semester. pandas is an abbreviation for panel data, which provide time-stamped data for multiple individuals or firms.

Note: Indented block quotes are from McKinney unless otherwise indicated. The section numbers here differ from McKinney because we will only discuss some topics.

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
%precision 4
pd.options.display.float_format = '{:.4f}'.format
%config InlineBackend.figure_format = 'retina'

pandas will be a major tool of interest throughout much of the rest of the book. It contains data structures and data manipulation tools designed to make data cleaning and analysis fast and easy in Python. pandas is often used in tandem with numerical computing tools like NumPy and SciPy, analytical libraries like statsmodels and scikit-learn, and data visualization libraries like matplotlib. pandas adopts significant parts of NumPy’s idiomatic style of array-based computing, especially array-based functions and a preference for data processing without for loops.

While pandas adopts many coding idioms from NumPy, the biggest difference is that pandas is designed for working with tabular or heterogeneous data. NumPy, by contrast, is best suited for working with homogeneous numerical array data.

We will use pandas—a wrapper for NumPy that helps us manipulate and combine data—every day for the rest of the course.

2 Introduction to pandas Data Structures

To get started with pandas, you will need to get comfortable with its two workhorse data structures: Series and DataFrame. While they are not a universal solution for every problem, they provide a solid, easy-to-use basis for most applications.

2.1 Series

A Series is a one-dimensional array-like object containing a sequence of values (of similar types to NumPy types) and an associated array of data labels, called its index. The simplest Series is formed from only an array of data.

The early examples use integer and string labels, but date-time labels are most useful.

obj = pd.Series([4, 7, -5, 3])
obj
0    4
1    7
2   -5
3    3
dtype: int64

Contrast obj with a NumPy array equivalent:

np.array([4, 7, -5, 3])
array([ 4,  7, -5,  3])
obj.values
array([ 4,  7, -5,  3], dtype=int64)
obj.index  # similar to range(4)
RangeIndex(start=0, stop=4, step=1)

We did not explicitly assign an index to obj, so obj has an integer index that starts at 0. We can explicitly assign an index with the index= argument.

obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])
obj2
d    4
b    7
a   -5
c    3
dtype: int64
obj2.index
Index(['d', 'b', 'a', 'c'], dtype='object')
obj2['a']
-5
obj2.iloc[2]
-5
obj2['d'] = 6
obj2
d    6
b    7
a   -5
c    3
dtype: int64
obj2[['c', 'a', 'd']]
c    3
a   -5
d    6
dtype: int64

A pandas series behaves like a NumPy array. We can use Boolean filters and perform vectorized mathematical operations.

obj2 > 0
d     True
b     True
a    False
c     True
dtype: bool
obj2[obj2 > 0]
d    6
b    7
c    3
dtype: int64
obj2 * 2
d    12
b    14
a   -10
c     6
dtype: int64
'b' in obj2
True
'e' in obj2
False

We can create a pandas series from a dictionary. The dictionary labels become the series index.

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
obj3 = pd.Series(sdata)
obj3
Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64

We can create a pandas series from a list, too. Note that pandas respects the order of the assigned index. Also, pandas keeps California with NaN (not a number or missing value) and drops Utah because it was not in the index.

states = ['California', 'Ohio', 'Oregon', 'Texas']
obj4 = pd.Series(sdata, index=states)
obj4
California          NaN
Ohio         35000.0000
Oregon       16000.0000
Texas        71000.0000
dtype: float64

Indices are one of pandas’ super powers. When we perform mathematical operations, pandas aligns series by their indices. Here NaN is “not a number”, which indicates missing values. NaN is considered a float, so the data type switches from int64 to float64.

obj3 + obj4
California           NaN
Ohio          70000.0000
Oregon        32000.0000
Texas        142000.0000
Utah                 NaN
dtype: float64

2.2 DataFrame

A pandas data frame is like a worksheet in an Excel workbook with row and columns that provide fast indexing.

A DataFrame represents a rectangular table of data and contains an ordered collection of columns, each of which can be a different value type (numeric, string, boolean, etc.). The DataFrame has both a row and column index; it can be thought of as a dict of Series all sharing the same index. Under the hood, the data is stored as one or more two-dimensional blocks rather than a list, dict, or some other collection of one-dimensional arrays. The exact details of DataFrame’s internals are outside the scope of this book.

There are many ways to construct a DataFrame, though one of the most common is from a dict of equal-length lists or NumPy arrays:

data = {
    'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
    'year': [2000, 2001, 2002, 2001, 2002, 2003],
    'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]
}
frame = pd.DataFrame(data)

frame
state year pop
0 Ohio 2000 1.5000
1 Ohio 2001 1.7000
2 Ohio 2002 3.6000
3 Nevada 2001 2.4000
4 Nevada 2002 2.9000
5 Nevada 2003 3.2000

We did not specify an index, so frame has the default index of integers starting at 0.

frame2 = pd.DataFrame(
    data, 
    columns=['year', 'state', 'pop', 'debt'],
    index=['one', 'two', 'three', 'four', 'five', 'six']
)

frame2
year state pop debt
one 2000 Ohio 1.5000 NaN
two 2001 Ohio 1.7000 NaN
three 2002 Ohio 3.6000 NaN
four 2001 Nevada 2.4000 NaN
five 2002 Nevada 2.9000 NaN
six 2003 Nevada 3.2000 NaN

If we extract one column, via either df.column or df['column'], the result is a series. We can use either the df.colname or the df['colname'] syntax to extract a column from a data frame as a series. However, we must use the df['colname'] syntax to add* a column to a data frame.* Also, we must use the df['colname'] syntax to extract or add a column whose name contains a whitespace.

frame2['state']
one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
six      Nevada
Name: state, dtype: object
frame2.state
one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
six      Nevada
Name: state, dtype: object

Similarly, if we extract one row. via either df.loc['rowlabel'] or df.iloc[rownumber], the result is a series.

frame2
year state pop debt
one 2000 Ohio 1.5000 NaN
two 2001 Ohio 1.7000 NaN
three 2002 Ohio 3.6000 NaN
four 2001 Nevada 2.4000 NaN
five 2002 Nevada 2.9000 NaN
six 2003 Nevada 3.2000 NaN
frame2.loc['one']
year      2000
state     Ohio
pop     1.5000
debt       NaN
Name: one, dtype: object

Data frame have two dimensions, so we have to slice data frames more precisely than series.

  1. The .loc[] method slices by row labels and column names
  2. The .iloc[] method slices by integer row and label indices
frame2.loc['three']
year      2002
state     Ohio
pop     3.6000
debt       NaN
Name: three, dtype: object
frame2.iloc[2]
year      2002
state     Ohio
pop     3.6000
debt       NaN
Name: three, dtype: object

We can use NumPy’s [row, column] syntanx with .loc[] and .iloc[].

frame2.loc['three', 'state'] # row, column
'Ohio'
frame2.loc['three', ['state', 'pop']] # row, column
state     Ohio
pop     3.6000
Name: three, dtype: object

We can assign either scalars or arrays to data frame columns.

  1. Scalars will broadcast to every row in the data frame
  2. Arrays must have the same length as the column
frame2['debt'] = 16.5
frame2
year state pop debt
one 2000 Ohio 1.5000 16.5000
two 2001 Ohio 1.7000 16.5000
three 2002 Ohio 3.6000 16.5000
four 2001 Nevada 2.4000 16.5000
five 2002 Nevada 2.9000 16.5000
six 2003 Nevada 3.2000 16.5000
frame2['debt'] = np.arange(6.)
frame2
year state pop debt
one 2000 Ohio 1.5000 0.0000
two 2001 Ohio 1.7000 1.0000
three 2002 Ohio 3.6000 2.0000
four 2001 Nevada 2.4000 3.0000
five 2002 Nevada 2.9000 4.0000
six 2003 Nevada 3.2000 5.0000

If we assign a series to a data frame column, pandas will use the index to align it with the data frame. Data frame rows not in the series will be missing values NaN.

val = pd.Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])
val
two    -1.2000
four   -1.5000
five   -1.7000
dtype: float64
frame2['debt'] = val
frame2
year state pop debt
one 2000 Ohio 1.5000 NaN
two 2001 Ohio 1.7000 -1.2000
three 2002 Ohio 3.6000 NaN
four 2001 Nevada 2.4000 -1.5000
five 2002 Nevada 2.9000 -1.7000
six 2003 Nevada 3.2000 NaN

We can add columns to our data frame, then delete them with del.

frame2['eastern'] = (frame2.state == 'Ohio')
frame2
year state pop debt eastern
one 2000 Ohio 1.5000 NaN True
two 2001 Ohio 1.7000 -1.2000 True
three 2002 Ohio 3.6000 NaN True
four 2001 Nevada 2.4000 -1.5000 False
five 2002 Nevada 2.9000 -1.7000 False
six 2003 Nevada 3.2000 NaN False
del frame2['eastern']
frame2
year state pop debt
one 2000 Ohio 1.5000 NaN
two 2001 Ohio 1.7000 -1.2000
three 2002 Ohio 3.6000 NaN
four 2001 Nevada 2.4000 -1.5000
five 2002 Nevada 2.9000 -1.7000
six 2003 Nevada 3.2000 NaN

2.3 Index Objects

obj = pd.Series(range(3), index=['a', 'b', 'c'])
index = obj.index
index
Index(['a', 'b', 'c'], dtype='object')

Index objects are immutable!

# index[1] = 'd'  # TypeError: Index does not support mutable operations

Indices can contain duplicates, so an index does not guarantee our data are duplicate-free.

dup_labels = pd.Index(['foo', 'foo', 'bar', 'bar'])
dup_labels
Index(['foo', 'foo', 'bar', 'bar'], dtype='object')

3 Essential Functionality

This section provides the most import pandas operations. It is difficult to provide an exhaustive reference, but this section provides a head start on the core pandas functionality.

3.1 Dropping Entries from an Axis

Dropping one or more entries from an axis is easy if you already have an index array or list without those entries. As that can require a bit of munging and set logic, the drop method will return a new object with the indicated value or values deleted from an axis.

obj = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])
obj
a   0.0000
b   1.0000
c   2.0000
d   3.0000
e   4.0000
dtype: float64
obj_without_d_and_c = obj.drop(['d', 'c'])
obj_without_d_and_c
a   0.0000
b   1.0000
e   4.0000
dtype: float64

The .drop() method works on data frames, too.

data = pd.DataFrame(
    np.arange(16).reshape((4, 4)),
    index=['Ohio', 'Colorado', 'Utah', 'New York'],
    columns=['one', 'two', 'three', 'four']
)

data
one two three four
Ohio 0 1 2 3
Colorado 4 5 6 7
Utah 8 9 10 11
New York 12 13 14 15
data.drop(['Colorado', 'Ohio']) # implied ", axis=0"
one two three four
Utah 8 9 10 11
New York 12 13 14 15
data.drop(['Colorado', 'Ohio'], axis=0)
one two three four
Utah 8 9 10 11
New York 12 13 14 15
data.drop(index=['Colorado', 'Ohio'])
one two three four
Utah 8 9 10 11
New York 12 13 14 15

The .drop() method accepts an axis argument and the default is axis=0 to drop rows based on labels. To drop columns, we use axis=1 or axis='columns'.

data.drop('two', axis=1)
one three four
Ohio 0 2 3
Colorado 4 6 7
Utah 8 10 11
New York 12 14 15
data.drop(columns='two')
one three four
Ohio 0 2 3
Colorado 4 6 7
Utah 8 10 11
New York 12 14 15

3.2 Indexing, Selection, and Filtering

Indexing, selecting, and filtering will be among our most-used pandas features.

obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])
obj
a   0.0000
b   1.0000
c   2.0000
d   3.0000
dtype: float64
obj['b']
1.0000
obj.iloc[1]
1.0000
obj.iloc[1:3]
b   1.0000
c   2.0000
dtype: float64

When we slice with labels, the left and right endpoints are inclusive.

obj['b':'c']
b   1.0000
c   2.0000
dtype: float64
obj['b':'c'] = 5
obj
a   0.0000
b   5.0000
c   5.0000
d   3.0000
dtype: float64
data = pd.DataFrame(
    np.arange(16).reshape((4, 4)),
    index=['Ohio', 'Colorado', 'Utah', 'New York'],
    columns=['one', 'two', 'three', 'four']
)

data
one two three four
Ohio 0 1 2 3
Colorado 4 5 6 7
Utah 8 9 10 11
New York 12 13 14 15

Indexing one column returns a series.

data['two']
Ohio         1
Colorado     5
Utah         9
New York    13
Name: two, dtype: int32

Indexing two or more columns returns a data frame.

data[['three', 'one']]
three one
Ohio 2 0
Colorado 6 4
Utah 10 8
New York 14 12

If we want a one-column data frame, we can use [[]]:

data['three']
Ohio         2
Colorado     6
Utah        10
New York    14
Name: three, dtype: int32
data[['three']]
three
Ohio 2
Colorado 6
Utah 10
New York 14
data.iloc[:2]
one two three four
Ohio 0 1 2 3
Colorado 4 5 6 7

We can index a data frame with Booleans, as we did with NumPy arrays.

data < 5
one two three four
Ohio True True True True
Colorado True False False False
Utah False False False False
New York False False False False
data
one two three four
Ohio 0 1 2 3
Colorado 4 5 6 7
Utah 8 9 10 11
New York 12 13 14 15
data[data < 5] = 0
data
one two three four
Ohio 0 0 0 0
Colorado 0 5 6 7
Utah 8 9 10 11
New York 12 13 14 15

Finally, we can chain slices.

data.iloc[:, :3][data.three > 5]
one two three
Colorado 0 5 6
Utah 8 9 10
New York 12 13 14

Table 5-4 summarizes data frame indexing and slicing options:

  • df[val]: Select single column or sequence of columns from the DataFrame; special case conveniences: boolean array (filter rows), slice (slice rows), or boolean DataFrame (set values based on some criterion)
  • df.loc[val]: Selects single row or subset of rows from the DataFrame by label
  • df.loc[:, val]: Selects single column or subset of columns by label
  • df.loc[val1, val2]: Select both rows and columns by label
  • df.iloc[where]: Selects single row or subset of rows from the DataFrame by integer position
  • df.iloc[:, where]: Selects single column or subset of columns by integer position
  • df.iloc[where_i, where_j]: Select both rows and columns by integer position
  • df.at[label_i, label_j]: Select a single scalar value by row and column label
  • df.iat[i, j]: Select a single scalar value by row and column position (integers) reindex method Select either rows or columns by labels
  • get_value, set_value methods: Select single value by row and column label

pandas is powerful and these options can be overwhelming! We will typically use df[val] to select columns (here val is either a string or list of strings), df.loc[val] to select rows (here val is a row label), and df.loc[val1, val2] to select both rows and columns. The other options add flexibility, and we may occasionally use them. However, our data will be large enough that counting row and column number will be tedious, making .iloc[] impractical.

3.3 Arithmetic and Data Alignment

An important pandas feature for some applications is the behavior of arithmetic between objects with different indexes. When you are adding together objects, if any index pairs are not the same, the respective index in the result will be the union of the index pairs. For users with database experience, this is similar to an automatic outer join on the index labels.

s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])
s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1], index=['a', 'c', 'e', 'f', 'g'])
s1
a    7.3000
c   -2.5000
d    3.4000
e    1.5000
dtype: float64
s2
a   -2.1000
c    3.6000
e   -1.5000
f    4.0000
g    3.1000
dtype: float64
s1 + s2
a   5.2000
c   1.1000
d      NaN
e   0.0000
f      NaN
g      NaN
dtype: float64
df1 = pd.DataFrame(np.arange(9.).reshape((3, 3)), columns=list('bcd'), index=['Ohio', 'Texas', 'Colorado'])
df2 = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'), index=['Utah', 'Ohio', 'Texas', 'Oregon'])
df1
b c d
Ohio 0.0000 1.0000 2.0000
Texas 3.0000 4.0000 5.0000
Colorado 6.0000 7.0000 8.0000
df2
b d e
Utah 0.0000 1.0000 2.0000
Ohio 3.0000 4.0000 5.0000
Texas 6.0000 7.0000 8.0000
Oregon 9.0000 10.0000 11.0000
df1 + df2
b c d e
Colorado NaN NaN NaN NaN
Ohio 3.0000 NaN 6.0000 NaN
Oregon NaN NaN NaN NaN
Texas 9.0000 NaN 12.0000 NaN
Utah NaN NaN NaN NaN
df1 = pd.DataFrame({'A': [1, 2]})
df2 = pd.DataFrame({'B': [3, 4]})
df1
A
0 1
1 2
df2
B
0 3
1 4
df1 - df2
A B
0 NaN NaN
1 NaN NaN

Always check your output!

3.3.1 Arithmetic methods with fill values

df1 = pd.DataFrame(np.arange(12.).reshape((3, 4)), columns=list('abcd'))
df2 = pd.DataFrame(np.arange(20.).reshape((4, 5)), columns=list('abcde'))
df2.loc[1, 'b'] = np.nan
df1
a b c d
0 0.0000 1.0000 2.0000 3.0000
1 4.0000 5.0000 6.0000 7.0000
2 8.0000 9.0000 10.0000 11.0000
df2
a b c d e
0 0.0000 1.0000 2.0000 3.0000 4.0000
1 5.0000 NaN 7.0000 8.0000 9.0000
2 10.0000 11.0000 12.0000 13.0000 14.0000
3 15.0000 16.0000 17.0000 18.0000 19.0000
df1 + df2
a b c d e
0 0.0000 2.0000 4.0000 6.0000 NaN
1 9.0000 NaN 13.0000 15.0000 NaN
2 18.0000 20.0000 22.0000 24.0000 NaN
3 NaN NaN NaN NaN NaN

We can specify a fill value for NaN values. Note that pandas fills would-be NaN values in each data frame before the arithmetic operation.

df1.add(df2, fill_value=0)
a b c d e
0 0.0000 2.0000 4.0000 6.0000 4.0000
1 9.0000 5.0000 13.0000 15.0000 9.0000
2 18.0000 20.0000 22.0000 24.0000 14.0000
3 15.0000 16.0000 17.0000 18.0000 19.0000

3.3.2 Operations between DataFrame and Series

arr = np.arange(12.).reshape((3, 4))
arr
array([[ 0.,  1.,  2.,  3.],
       [ 4.,  5.,  6.,  7.],
       [ 8.,  9., 10., 11.]])
arr[0]
array([0., 1., 2., 3.])
arr - arr[0]
array([[0., 0., 0., 0.],
       [4., 4., 4., 4.],
       [8., 8., 8., 8.]])

Arithmetic operations between series and data frames behave the same as the example above.

frame = pd.DataFrame(
    np.arange(12.).reshape((4, 3)),
    columns=list('bde'),
    index=['Utah', 'Ohio', 'Texas', 'Oregon']
)

series = frame.iloc[0]
frame
b d e
Utah 0.0000 1.0000 2.0000
Ohio 3.0000 4.0000 5.0000
Texas 6.0000 7.0000 8.0000
Oregon 9.0000 10.0000 11.0000
series
b   0.0000
d   1.0000
e   2.0000
Name: Utah, dtype: float64
frame - series
b d e
Utah 0.0000 0.0000 0.0000
Ohio 3.0000 3.0000 3.0000
Texas 6.0000 6.0000 6.0000
Oregon 9.0000 9.0000 9.0000
series2 = pd.Series(range(3), index=['b', 'e', 'f'])
frame
b d e
Utah 0.0000 1.0000 2.0000
Ohio 3.0000 4.0000 5.0000
Texas 6.0000 7.0000 8.0000
Oregon 9.0000 10.0000 11.0000
series2
b    0
e    1
f    2
dtype: int64
frame + series2
b d e f
Utah 0.0000 NaN 3.0000 NaN
Ohio 3.0000 NaN 6.0000 NaN
Texas 6.0000 NaN 9.0000 NaN
Oregon 9.0000 NaN 12.0000 NaN
series3 = frame['d']
frame.sub(series3, axis='index')
b d e
Utah -1.0000 0.0000 1.0000
Ohio -1.0000 0.0000 1.0000
Texas -1.0000 0.0000 1.0000
Oregon -1.0000 0.0000 1.0000

3.4 Function Application and Mapping

np.random.seed(42)
frame = pd.DataFrame(
    np.random.randn(4, 3), 
    columns=list('bde'),
    index=['Utah', 'Ohio', 'Texas', 'Oregon']
)

frame
b d e
Utah 0.4967 -0.1383 0.6477
Ohio 1.5230 -0.2342 -0.2341
Texas 1.5792 0.7674 -0.4695
Oregon 0.5426 -0.4634 -0.4657
frame.abs()
b d e
Utah 0.4967 0.1383 0.6477
Ohio 1.5230 0.2342 0.2341
Texas 1.5792 0.7674 0.4695
Oregon 0.5426 0.4634 0.4657

Another frequent operation is applying a function on one-dimensional arrays to each column or row. DataFrame’s apply method does exactly this:

Note that we can use anonymous (lambda) functions “on the fly”:

1.5792 - 0.4967
1.0825
frame.apply(lambda x: x.max() - x.min()) # implied axis=0
b   1.0825
d   1.2309
e   1.1172
dtype: float64
frame.apply(lambda x: x.max() - x.min(), axis=1)
Utah     0.7860
Ohio     1.7572
Texas    2.0487
Oregon   1.0083
dtype: float64

However, under the hood, the .apply() is basically a for loop and much slowly than optimized, built-in methods. Here is an example of the speed costs of .apply():

%timeit frame['e'].abs()
15.2 µs ± 4.5 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
%timeit frame['e'].apply(np.abs)
32.6 µs ± 9.67 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

4 Summarizing and Computing Descriptive Statistics

df = pd.DataFrame(
    [[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
    index=['a', 'b', 'c', 'd'],
    columns=['one', 'two']
)

df
one two
a 1.4000 NaN
b 7.1000 -4.5000
c NaN NaN
d 0.7500 -1.3000
df.sum() # implied axis=0
one    9.2500
two   -5.8000
dtype: float64
df.sum(axis=1)
a    1.4000
b    2.6000
c    0.0000
d   -0.5500
dtype: float64
df.mean(axis=1, skipna=False)
a       NaN
b    1.3000
c       NaN
d   -0.2750
dtype: float64

The .idxmax() method returns the label for the maximum observation.

df.idxmax()
one    b
two    d
dtype: object

The .describe() returns summary statistics for each numerical column in a data frame.

df.describe()
one two
count 3.0000 2.0000
mean 3.0833 -2.9000
std 3.4937 2.2627
min 0.7500 -4.5000
25% 1.0750 -3.7000
50% 1.4000 -2.9000
75% 4.2500 -2.1000
max 7.1000 -1.3000

For non-numerical data, .describe() returns alternative summary statistics.

obj = pd.Series(['a', 'a', 'b', 'c'] * 4)
obj.describe()
count     16
unique     3
top        a
freq       8
dtype: object

4.1 Correlation and Covariance

import yfinance as yf
tickers = yf.Tickers('AAPL IBM MSFT GOOG')
prices = tickers.history(period='max', auto_adjust=False)
[*********************100%%**********************]  4 of 4 completed
prices.tail()
Adj Close Close Dividends ... Open Stock Splits Volume
AAPL GOOG IBM MSFT AAPL GOOG IBM MSFT AAPL GOOG ... IBM MSFT AAPL GOOG IBM MSFT AAPL GOOG IBM MSFT
Date
2024-01-19 191.5600 147.9700 171.4800 398.6700 191.5600 147.9700 171.4800 398.6700 0.0000 0.0000 ... 170.5900 395.7600 0.0000 0.0000 0.0000 0.0000 68741000.0000 27170900.0000 6925800 29272000.0000
2024-01-22 193.8900 147.7100 172.8300 396.5100 193.8900 147.7100 172.8300 396.5100 0.0000 0.0000 ... 172.8200 400.0200 0.0000 0.0000 0.0000 0.0000 60133900.0000 21829200.0000 4926000 27016900.0000
2024-01-23 195.1800 148.6800 173.9400 398.9000 195.1800 148.6800 173.9400 398.9000 0.0000 0.0000 ... 172.9000 395.7500 0.0000 0.0000 0.0000 0.0000 42355600.0000 14113600.0000 3983500 20525900.0000
2024-01-24 194.5000 150.3500 173.9300 402.5600 194.5000 150.3500 173.9300 402.5600 0.0000 0.0000 ... 174.7600 401.5400 0.0000 0.0000 0.0000 0.0000 53631300.0000 19245000.0000 9097800 24867000.0000
2024-01-25 194.1700 153.6400 190.4300 404.8700 194.1700 153.6400 190.4300 404.8700 0.0000 0.0000 ... 184.9600 404.3200 0.0000 0.0000 0.0000 0.0000 54460179.0000 21334519.0000 28357994 20529365.0000

5 rows × 32 columns

prices['Adj Close'].tail()
AAPL GOOG IBM MSFT
Date
2024-01-19 191.5600 147.9700 171.4800 398.6700
2024-01-22 193.8900 147.7100 172.8300 396.5100
2024-01-23 195.1800 148.6800 173.9400 398.9000
2024-01-24 194.5000 150.3500 173.9300 402.5600
2024-01-25 194.1700 153.6400 190.4300 404.8700

The prices data frames contains daily data for AAPL, IBM, MSFT, and GOOG. The Adj Close column provides a reverse-engineered daily closing price that accounts for dividends paid and stock splits (and reverse splits). As a result, the .pct_change() in Adj Close considers both price changes (i.e., capital gains) and dividends, so \(R_t = \frac{(P_t + D_t) - P_{t-1}}{P_{t-1}} = \frac{\text{Adj Close}_t - \text{Adj Close}_{t-1}}{\text{Adj Close}_{t-1}}.\)

returns = prices['Adj Close'].pct_change().dropna()
returns
AAPL GOOG IBM MSFT
Date
2004-08-20 0.0029 0.0794 0.0042 0.0029
2004-08-23 0.0091 0.0101 -0.0070 0.0044
2004-08-24 0.0280 -0.0414 0.0007 0.0000
2004-08-25 0.0344 0.0108 0.0042 0.0114
2004-08-26 0.0487 0.0180 -0.0045 -0.0040
... ... ... ... ...
2024-01-19 0.0155 0.0206 0.0278 0.0122
2024-01-22 0.0122 -0.0018 0.0079 -0.0054
2024-01-23 0.0067 0.0066 0.0064 0.0060
2024-01-24 -0.0035 0.0112 -0.0001 0.0092
2024-01-25 -0.0017 0.0219 0.0949 0.0057

4891 rows × 4 columns

We multiply by 252 to annualize mean daily returns because means grow linearly with time and there are (about) 252 trading days per year.

returns.mean().mul(252)
AAPL   0.3648
GOOG   0.2591
IBM    0.0990
MSFT   0.2005
dtype: float64

We multiply by \(\sqrt{252}\) to annualize the standard deviation of daily returns because variances grow linearly with time, there are (about) 252 trading days per year, and the standard deviation is the square root of the variance.

returns.std().mul(np.sqrt(252))
AAPL   0.3277
GOOG   0.3071
IBM    0.2272
MSFT   0.2722
dtype: float64

The best explanation I have found on why stock return volatility (the standard deviation of stocks returns) grows with the square root of time is at the bottom of page 7 of chapter 8 of Ivo Welch’s free corporate finance textbook.

We can calculate pairwise correlations.

returns['MSFT'].corr(returns['IBM'])
0.4926

We can also calculate correlation matrices.

returns.corr()
AAPL GOOG IBM MSFT
AAPL 1.0000 0.5189 0.4275 0.5235
GOOG 0.5189 1.0000 0.3953 0.5626
IBM 0.4275 0.3953 1.0000 0.4926
MSFT 0.5235 0.5626 0.4926 1.0000
returns.corr().loc['MSFT', 'IBM']
0.4926