McKinney Chapter 5 - Getting Started with pandas

FINA 6333 for Spring 2024

Author

Richard Herron

1 Introduction

Chapter 5 of Wes McKinney’s Python for Data Analysis discusses the fundamentals of pandas, which will be our main tool for the rest of the semester. pandas is an abbreviation for panel data, which provide time-stamped data for multiple individuals or firms.

Note: Indented block quotes are from McKinney unless otherwise indicated. The section numbers here differ from McKinney because we will only discuss some topics.

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

%precision 4
pd.options.display.float_format = '{:.4f}'.format
%config InlineBackend.figure_format = 'retina'

pandas will be a major tool of interest throughout much of the rest of the book. It contains data structures and data manipulation tools designed to make data cleaning and analysis fast and easy in Python. pandas is often used in tandem with numerical computing tools like NumPy and SciPy, analytical libraries like statsmodels and scikit-learn, and data visualization libraries like matplotlib. pandas adopts significant parts of NumPy’s idiomatic style of array-based computing, especially array-based functions and a preference for data processing without for loops.

While pandas adopts many coding idioms from NumPy, the biggest difference is that pandas is designed for working with tabular or heterogeneous data. NumPy, by contrast, is best suited for working with homogeneous numerical array data.

We will use pandas—a wrapper for NumPy that helps us manipulate and combine data—every day for the rest of the course.

2 Introduction to pandas Data Structures

To get started with pandas, you will need to get comfortable with its two workhorse data structures: Series and DataFrame. While they are not a universal solution for every problem, they provide a solid, easy-to-use basis for most applications.

2.1 Series

A Series is a one-dimensional array-like object containing a sequence of values (of similar types to NumPy types) and an associated array of data labels, called its index. The simplest Series is formed from only an array of data.

The early examples use integer and string labels, but date-time labels are most useful.

obj = pd.Series([4, 7, -5, 3])
obj

0    4
1    7
2   -5
3    3
dtype: int64

Contrast obj with a NumPy array equivalent:

np.array([4, 7, -5, 3])

array([ 4,  7, -5,  3])

obj.values

array([ 4,  7, -5,  3], dtype=int64)

obj.index  # similar to range(4)

RangeIndex(start=0, stop=4, step=1)

We did not explicitly assign an index to obj, so obj has an integer index that starts at 0. We can explicitly assign an index with the index= argument.

obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])
obj2

d    4
b    7
a   -5
c    3
dtype: int64

obj2.index

Index(['d', 'b', 'a', 'c'], dtype='object')

obj2['a']

-5

obj2.iloc[2]

-5

obj2['d'] = 6
obj2

d    6
b    7
a   -5
c    3
dtype: int64

obj2[['c', 'a', 'd']]

c    3
a   -5
d    6
dtype: int64

A pandas series behaves like a NumPy array. We can use Boolean filters and perform vectorized mathematical operations.

obj2 > 0

d     True
b     True
a    False
c     True
dtype: bool

obj2[obj2 > 0]

d    6
b    7
c    3
dtype: int64

obj2 * 2

d    12
b    14
a   -10
c     6
dtype: int64

'b' in obj2

True

'e' in obj2

False

We can create a pandas series from a dictionary. The dictionary labels become the series index.

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
obj3 = pd.Series(sdata)
obj3

Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64

We can create a pandas series from a list, too. Note that pandas respects the order of the assigned index. Also, pandas keeps California with NaN (not a number or missing value) and drops Utah because it was not in the index.

states = ['California', 'Ohio', 'Oregon', 'Texas']
obj4 = pd.Series(sdata, index=states)
obj4

California          NaN
Ohio         35000.0000
Oregon       16000.0000
Texas        71000.0000
dtype: float64

Indices are one of pandas’ super powers. When we perform mathematical operations, pandas aligns series by their indices. Here NaN is “not a number”, which indicates missing values. NaN is considered a float, so the data type switches from int64 to float64.

obj3 + obj4

California           NaN
Ohio          70000.0000
Oregon        32000.0000
Texas        142000.0000
Utah                 NaN
dtype: float64

2.2 DataFrame

A pandas data frame is like a worksheet in an Excel workbook with row and columns that provide fast indexing.

A DataFrame represents a rectangular table of data and contains an ordered collection of columns, each of which can be a different value type (numeric, string, boolean, etc.). The DataFrame has both a row and column index; it can be thought of as a dict of Series all sharing the same index. Under the hood, the data is stored as one or more two-dimensional blocks rather than a list, dict, or some other collection of one-dimensional arrays. The exact details of DataFrame’s internals are outside the scope of this book.

There are many ways to construct a DataFrame, though one of the most common is from a dict of equal-length lists or NumPy arrays:

data = {
    'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
    'year': [2000, 2001, 2002, 2001, 2002, 2003],
    'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]
}
frame = pd.DataFrame(data)

frame

	state	year	pop
0	Ohio	2000	1.5000
1	Ohio	2001	1.7000
2	Ohio	2002	3.6000
3	Nevada	2001	2.4000
4	Nevada	2002	2.9000
5	Nevada	2003	3.2000

We did not specify an index, so frame has the default index of integers starting at 0.

frame2 = pd.DataFrame(
    data, 
    columns=['year', 'state', 'pop', 'debt'],
    index=['one', 'two', 'three', 'four', 'five', 'six']
)

frame2

	year	state	pop	debt
one	2000	Ohio	1.5000	NaN
two	2001	Ohio	1.7000	NaN
three	2002	Ohio	3.6000	NaN
four	2001	Nevada	2.4000	NaN
five	2002	Nevada	2.9000	NaN
six	2003	Nevada	3.2000	NaN

If we extract one column, via either df.column or df['column'], the result is a series. We can use either the df.colname or the df['colname'] syntax to extract a column from a data frame as a series. However, we must use the df['colname'] syntax to add* a column to a data frame.* Also, we must use the df['colname'] syntax to extract or add a column whose name contains a whitespace.

frame2['state']

one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
six      Nevada
Name: state, dtype: object

frame2.state

one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
six      Nevada
Name: state, dtype: object

Similarly, if we extract one row. via either df.loc['rowlabel'] or df.iloc[rownumber], the result is a series.

frame2

	year	state	pop	debt
one	2000	Ohio	1.5000	NaN
two	2001	Ohio	1.7000	NaN
three	2002	Ohio	3.6000	NaN
four	2001	Nevada	2.4000	NaN
five	2002	Nevada	2.9000	NaN
six	2003	Nevada	3.2000	NaN

frame2.loc['one']

year      2000
state     Ohio
pop     1.5000
debt       NaN
Name: one, dtype: object

Data frame have two dimensions, so we have to slice data frames more precisely than series.

The .loc[] method slices by row labels and column names
The .iloc[] method slices by integer row and label indices

frame2.loc['three']

year      2002
state     Ohio
pop     3.6000
debt       NaN
Name: three, dtype: object

frame2.iloc[2]

year      2002
state     Ohio
pop     3.6000
debt       NaN
Name: three, dtype: object

We can use NumPy’s [row, column] syntanx with .loc[] and .iloc[].

frame2.loc['three', 'state'] # row, column

'Ohio'

frame2.loc['three', ['state', 'pop']] # row, column

state     Ohio
pop     3.6000
Name: three, dtype: object

We can assign either scalars or arrays to data frame columns.

Scalars will broadcast to every row in the data frame
Arrays must have the same length as the column

frame2['debt'] = 16.5
frame2

	year	state	pop	debt
one	2000	Ohio	1.5000	16.5000
two	2001	Ohio	1.7000	16.5000
three	2002	Ohio	3.6000	16.5000
four	2001	Nevada	2.4000	16.5000
five	2002	Nevada	2.9000	16.5000
six	2003	Nevada	3.2000	16.5000

frame2['debt'] = np.arange(6.)
frame2

	year	state	pop	debt
one	2000	Ohio	1.5000	0.0000
two	2001	Ohio	1.7000	1.0000
three	2002	Ohio	3.6000	2.0000
four	2001	Nevada	2.4000	3.0000
five	2002	Nevada	2.9000	4.0000
six	2003	Nevada	3.2000	5.0000

If we assign a series to a data frame column, pandas will use the index to align it with the data frame. Data frame rows not in the series will be missing values NaN.

val = pd.Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])
val

two    -1.2000
four   -1.5000
five   -1.7000
dtype: float64

frame2['debt'] = val
frame2

	year	state	pop	debt
one	2000	Ohio	1.5000	NaN
two	2001	Ohio	1.7000	-1.2000
three	2002	Ohio	3.6000	NaN
four	2001	Nevada	2.4000	-1.5000
five	2002	Nevada	2.9000	-1.7000
six	2003	Nevada	3.2000	NaN

We can add columns to our data frame, then delete them with del.

frame2['eastern'] = (frame2.state == 'Ohio')
frame2

	year	state	pop	debt	eastern
one	2000	Ohio	1.5000	NaN	True
two	2001	Ohio	1.7000	-1.2000	True
three	2002	Ohio	3.6000	NaN	True
four	2001	Nevada	2.4000	-1.5000	False
five	2002	Nevada	2.9000	-1.7000	False
six	2003	Nevada	3.2000	NaN	False

del frame2['eastern']
frame2

	year	state	pop	debt
one	2000	Ohio	1.5000	NaN
two	2001	Ohio	1.7000	-1.2000
three	2002	Ohio	3.6000	NaN
four	2001	Nevada	2.4000	-1.5000
five	2002	Nevada	2.9000	-1.7000
six	2003	Nevada	3.2000	NaN

2.3 Index Objects

obj = pd.Series(range(3), index=['a', 'b', 'c'])
index = obj.index
index

Index(['a', 'b', 'c'], dtype='object')

Index objects are immutable!

# index[1] = 'd'  # TypeError: Index does not support mutable operations

Indices can contain duplicates, so an index does not guarantee our data are duplicate-free.

dup_labels = pd.Index(['foo', 'foo', 'bar', 'bar'])
dup_labels

Index(['foo', 'foo', 'bar', 'bar'], dtype='object')

3 Essential Functionality

This section provides the most import pandas operations. It is difficult to provide an exhaustive reference, but this section provides a head start on the core pandas functionality.

3.1 Dropping Entries from an Axis

Dropping one or more entries from an axis is easy if you already have an index array or list without those entries. As that can require a bit of munging and set logic, the drop method will return a new object with the indicated value or values deleted from an axis.

obj = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])
obj

a   0.0000
b   1.0000
c   2.0000
d   3.0000
e   4.0000
dtype: float64

obj_without_d_and_c = obj.drop(['d', 'c'])
obj_without_d_and_c

a   0.0000
b   1.0000
e   4.0000
dtype: float64

The .drop() method works on data frames, too.

data = pd.DataFrame(
    np.arange(16).reshape((4, 4)),
    index=['Ohio', 'Colorado', 'Utah', 'New York'],
    columns=['one', 'two', 'three', 'four']
)

data

	one	two	three	four
Ohio	0	1	2	3
Colorado	4	5	6	7
Utah	8	9	10	11
New York	12	13	14	15

data.drop(['Colorado', 'Ohio']) # implied ", axis=0"

	one	two	three	four
Utah	8	9	10	11
New York	12	13	14	15

data.drop(['Colorado', 'Ohio'], axis=0)

	one	two	three	four
Utah	8	9	10	11
New York	12	13	14	15

data.drop(index=['Colorado', 'Ohio'])

	one	two	three	four
Utah	8	9	10	11
New York	12	13	14	15

The .drop() method accepts an axis argument and the default is axis=0 to drop rows based on labels. To drop columns, we use axis=1 or axis='columns'.

data.drop('two', axis=1)

	one	three	four
Ohio	0	2	3
Colorado	4	6	7
Utah	8	10	11
New York	12	14	15

data.drop(columns='two')

	one	three	four
Ohio	0	2	3
Colorado	4	6	7
Utah	8	10	11
New York	12	14	15

3.2 Indexing, Selection, and Filtering

Indexing, selecting, and filtering will be among our most-used pandas features.

obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])
obj

a   0.0000
b   1.0000
c   2.0000
d   3.0000
dtype: float64

obj['b']

1.0000

obj.iloc[1]

1.0000

obj.iloc[1:3]

b   1.0000
c   2.0000
dtype: float64

When we slice with labels, the left and right endpoints are inclusive.

obj['b':'c']

b   1.0000
c   2.0000
dtype: float64

obj['b':'c'] = 5
obj

a   0.0000
b   5.0000
c   5.0000
d   3.0000
dtype: float64

data = pd.DataFrame(
    np.arange(16).reshape((4, 4)),
    index=['Ohio', 'Colorado', 'Utah', 'New York'],
    columns=['one', 'two', 'three', 'four']
)

data

	one	two	three	four
Ohio	0	1	2	3
Colorado	4	5	6	7
Utah	8	9	10	11
New York	12	13	14	15

Indexing one column returns a series.

data['two']

Ohio         1
Colorado     5
Utah         9
New York    13
Name: two, dtype: int32

Indexing two or more columns returns a data frame.

data[['three', 'one']]

	three	one
Ohio	2	0
Colorado	6	4
Utah	10	8
New York	14	12

If we want a one-column data frame, we can use [[]]:

data['three']

Ohio         2
Colorado     6
Utah        10
New York    14
Name: three, dtype: int32

data[['three']]

	three
Ohio	2
Colorado	6
Utah	10
New York	14

data.iloc[:2]

	one	two	three	four
Ohio	0	1	2	3
Colorado	4	5	6	7

We can index a data frame with Booleans, as we did with NumPy arrays.

data < 5

	one	two	three	four
Ohio	True	True	True	True
Colorado	True	False	False	False
Utah	False	False	False	False
New York	False	False	False	False

data

	one	two	three	four
Ohio	0	1	2	3
Colorado	4	5	6	7
Utah	8	9	10	11
New York	12	13	14	15

data[data < 5] = 0
data

	one	two	three	four
Ohio	0	0	0	0
Colorado	0	5	6	7
Utah	8	9	10	11
New York	12	13	14	15

Finally, we can chain slices.

data.iloc[:, :3][data.three > 5]

	one	two	three
Colorado	0	5	6
Utah	8	9	10
New York	12	13	14

Table 5-4 summarizes data frame indexing and slicing options:

df[val]: Select single column or sequence of columns from the DataFrame; special case conveniences: boolean array (filter rows), slice (slice rows), or boolean DataFrame (set values based on some criterion)
df.loc[val]: Selects single row or subset of rows from the DataFrame by label
df.loc[:, val]: Selects single column or subset of columns by label
df.loc[val1, val2]: Select both rows and columns by label
df.iloc[where]: Selects single row or subset of rows from the DataFrame by integer position
df.iloc[:, where]: Selects single column or subset of columns by integer position
df.iloc[where_i, where_j]: Select both rows and columns by integer position
df.at[label_i, label_j]: Select a single scalar value by row and column label
df.iat[i, j]: Select a single scalar value by row and column position (integers) reindex method Select either rows or columns by labels
get_value, set_value methods: Select single value by row and column label

pandas is powerful and these options can be overwhelming! We will typically use df[val] to select columns (here val is either a string or list of strings), df.loc[val] to select rows (here val is a row label), and df.loc[val1, val2] to select both rows and columns. The other options add flexibility, and we may occasionally use them. However, our data will be large enough that counting row and column number will be tedious, making .iloc[] impractical.

3.3 Arithmetic and Data Alignment

An important pandas feature for some applications is the behavior of arithmetic between objects with different indexes. When you are adding together objects, if any index pairs are not the same, the respective index in the result will be the union of the index pairs. For users with database experience, this is similar to an automatic outer join on the index labels.

s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])
s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1], index=['a', 'c', 'e', 'f', 'g'])

s1

a    7.3000
c   -2.5000
d    3.4000
e    1.5000
dtype: float64

s2

a   -2.1000
c    3.6000
e   -1.5000
f    4.0000
g    3.1000
dtype: float64

s1 + s2

a   5.2000
c   1.1000
d      NaN
e   0.0000
f      NaN
g      NaN
dtype: float64

df1 = pd.DataFrame(np.arange(9.).reshape((3, 3)), columns=list('bcd'), index=['Ohio', 'Texas', 'Colorado'])
df2 = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'), index=['Utah', 'Ohio', 'Texas', 'Oregon'])

df1

	b	c	d
Ohio	0.0000	1.0000	2.0000
Texas	3.0000	4.0000	5.0000
Colorado	6.0000	7.0000	8.0000

df2

	b	d	e
Utah	0.0000	1.0000	2.0000
Ohio	3.0000	4.0000	5.0000
Texas	6.0000	7.0000	8.0000
Oregon	9.0000	10.0000	11.0000

df1 + df2

	b	c	d	e
Colorado	NaN	NaN	NaN	NaN
Ohio	3.0000	NaN	6.0000	NaN
Oregon	NaN	NaN	NaN	NaN
Texas	9.0000	NaN	12.0000	NaN
Utah	NaN	NaN	NaN	NaN

df1 = pd.DataFrame({'A': [1, 2]})
df2 = pd.DataFrame({'B': [3, 4]})

df1

	A
0	1
1	2

df2

	B
0	3
1	4

df1 - df2

	A	B
0	NaN	NaN
1	NaN	NaN

Always check your output!

3.3.1 Arithmetic methods with fill values

df1 = pd.DataFrame(np.arange(12.).reshape((3, 4)), columns=list('abcd'))
df2 = pd.DataFrame(np.arange(20.).reshape((4, 5)), columns=list('abcde'))
df2.loc[1, 'b'] = np.nan

df1

	a	b	c	d
0	0.0000	1.0000	2.0000	3.0000
1	4.0000	5.0000	6.0000	7.0000
2	8.0000	9.0000	10.0000	11.0000

df2

	a	b	c	d	e
0	0.0000	1.0000	2.0000	3.0000	4.0000
1	5.0000	NaN	7.0000	8.0000	9.0000
2	10.0000	11.0000	12.0000	13.0000	14.0000
3	15.0000	16.0000	17.0000	18.0000	19.0000

df1 + df2

	a	b	c	d	e
0	0.0000	2.0000	4.0000	6.0000	NaN
1	9.0000	NaN	13.0000	15.0000	NaN
2	18.0000	20.0000	22.0000	24.0000	NaN
3	NaN	NaN	NaN	NaN	NaN

We can specify a fill value for NaN values. Note that pandas fills would-be NaN values in each data frame before the arithmetic operation.

df1.add(df2, fill_value=0)

	a	b	c	d	e
0	0.0000	2.0000	4.0000	6.0000	4.0000
1	9.0000	5.0000	13.0000	15.0000	9.0000
2	18.0000	20.0000	22.0000	24.0000	14.0000
3	15.0000	16.0000	17.0000	18.0000	19.0000

3.3.2 Operations between DataFrame and Series

arr = np.arange(12.).reshape((3, 4))
arr

array([[ 0.,  1.,  2.,  3.],
       [ 4.,  5.,  6.,  7.],
       [ 8.,  9., 10., 11.]])

arr[0]

array([0., 1., 2., 3.])

arr - arr[0]

array([[0., 0., 0., 0.],
       [4., 4., 4., 4.],
       [8., 8., 8., 8.]])

Arithmetic operations between series and data frames behave the same as the example above.

frame = pd.DataFrame(
    np.arange(12.).reshape((4, 3)),
    columns=list('bde'),
    index=['Utah', 'Ohio', 'Texas', 'Oregon']
)

series = frame.iloc[0]

frame

	b	d	e
Utah	0.0000	1.0000	2.0000
Ohio	3.0000	4.0000	5.0000
Texas	6.0000	7.0000	8.0000
Oregon	9.0000	10.0000	11.0000

series

b   0.0000
d   1.0000
e   2.0000
Name: Utah, dtype: float64

frame - series

	b	d	e
Utah	0.0000	0.0000	0.0000
Ohio	3.0000	3.0000	3.0000
Texas	6.0000	6.0000	6.0000
Oregon	9.0000	9.0000	9.0000

series2 = pd.Series(range(3), index=['b', 'e', 'f'])

frame

	b	d	e
Utah	0.0000	1.0000	2.0000
Ohio	3.0000	4.0000	5.0000
Texas	6.0000	7.0000	8.0000
Oregon	9.0000	10.0000	11.0000

series2

b    0
e    1
f    2
dtype: int64

frame + series2

	b	d	e	f
Utah	0.0000	NaN	3.0000	NaN
Ohio	3.0000	NaN	6.0000	NaN
Texas	6.0000	NaN	9.0000	NaN
Oregon	9.0000	NaN	12.0000	NaN

series3 = frame['d']

frame.sub(series3, axis='index')

	b	e
Utah	-1.0000	1.0000
Ohio	-1.0000	1.0000
Texas	-1.0000	1.0000
Oregon	-1.0000	1.0000

3.4 Function Application and Mapping

np.random.seed(42)
frame = pd.DataFrame(
    np.random.randn(4, 3), 
    columns=list('bde'),
    index=['Utah', 'Ohio', 'Texas', 'Oregon']
)

frame

	b	d	e
Utah	0.4967	-0.1383	0.6477
Ohio	1.5230	-0.2342	-0.2341
Texas	1.5792	0.7674	-0.4695
Oregon	0.5426	-0.4634	-0.4657

frame.abs()

	b	d	e
Utah	0.4967	0.1383	0.6477
Ohio	1.5230	0.2342	0.2341
Texas	1.5792	0.7674	0.4695
Oregon	0.5426	0.4634	0.4657

Another frequent operation is applying a function on one-dimensional arrays to each column or row. DataFrame’s apply method does exactly this:

Note that we can use anonymous (lambda) functions “on the fly”:

1.5792 - 0.4967

1.0825

frame.apply(lambda x: x.max() - x.min()) # implied axis=0

b   1.0825
d   1.2309
e   1.1172
dtype: float64

frame.apply(lambda x: x.max() - x.min(), axis=1)

Utah     0.7860
Ohio     1.7572
Texas    2.0487
Oregon   1.0083
dtype: float64

However, under the hood, the .apply() is basically a for loop and much slowly than optimized, built-in methods. Here is an example of the speed costs of .apply():

%timeit frame['e'].abs()

15.2 µs ± 4.5 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

%timeit frame['e'].apply(np.abs)

32.6 µs ± 9.67 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

4 Summarizing and Computing Descriptive Statistics

df = pd.DataFrame(
    [[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
    index=['a', 'b', 'c', 'd'],
    columns=['one', 'two']
)

df

	one	two
a	1.4000	NaN
b	7.1000	-4.5000
c	NaN	NaN
d	0.7500	-1.3000

df.sum() # implied axis=0

one    9.2500
two   -5.8000
dtype: float64

df.sum(axis=1)

a    1.4000
b    2.6000
c    0.0000
d   -0.5500
dtype: float64

df.mean(axis=1, skipna=False)

a       NaN
b    1.3000
c       NaN
d   -0.2750
dtype: float64

The .idxmax() method returns the label for the maximum observation.

df.idxmax()

one    b
two    d
dtype: object

The .describe() returns summary statistics for each numerical column in a data frame.

df.describe()

	one	two
count	3.0000	2.0000
mean	3.0833	-2.9000
std	3.4937	2.2627
min	0.7500	-4.5000
25%	1.0750	-3.7000
50%	1.4000	-2.9000
75%	4.2500	-2.1000
max	7.1000	-1.3000

For non-numerical data, .describe() returns alternative summary statistics.

obj = pd.Series(['a', 'a', 'b', 'c'] * 4)
obj.describe()

count     16
unique     3
top        a
freq       8
dtype: object

4.1 Correlation and Covariance

import yfinance as yf

tickers = yf.Tickers('AAPL IBM MSFT GOOG')

prices = tickers.history(period='max', auto_adjust=False)

[*********************100%%**********************]  4 of 4 completed

prices.tail()

	Adj Close				Close				Dividends		...	Open		Stock Splits				Volume
	AAPL	GOOG	IBM	MSFT	AAPL	GOOG	IBM	MSFT	AAPL	GOOG	...	IBM	MSFT	AAPL	GOOG	IBM	MSFT	AAPL	GOOG	IBM	MSFT
Date
2024-01-19	191.5600	147.9700	171.4800	398.6700	191.5600	147.9700	171.4800	398.6700	0.0000	0.0000	...	170.5900	395.7600	0.0000	0.0000	0.0000	0.0000	68741000.0000	27170900.0000	6925800	29272000.0000
2024-01-22	193.8900	147.7100	172.8300	396.5100	193.8900	147.7100	172.8300	396.5100	0.0000	0.0000	...	172.8200	400.0200	0.0000	0.0000	0.0000	0.0000	60133900.0000	21829200.0000	4926000	27016900.0000
2024-01-23	195.1800	148.6800	173.9400	398.9000	195.1800	148.6800	173.9400	398.9000	0.0000	0.0000	...	172.9000	395.7500	0.0000	0.0000	0.0000	0.0000	42355600.0000	14113600.0000	3983500	20525900.0000
2024-01-24	194.5000	150.3500	173.9300	402.5600	194.5000	150.3500	173.9300	402.5600	0.0000	0.0000	...	174.7600	401.5400	0.0000	0.0000	0.0000	0.0000	53631300.0000	19245000.0000	9097800	24867000.0000
2024-01-25	194.1700	153.6400	190.4300	404.8700	194.1700	153.6400	190.4300	404.8700	0.0000	0.0000	...	184.9600	404.3200	0.0000	0.0000	0.0000	0.0000	54460179.0000	21334519.0000	28357994	20529365.0000

5 rows × 32 columns

prices['Adj Close'].tail()

	AAPL	GOOG	IBM	MSFT
Date
2024-01-19	191.5600	147.9700	171.4800	398.6700
2024-01-22	193.8900	147.7100	172.8300	396.5100
2024-01-23	195.1800	148.6800	173.9400	398.9000
2024-01-24	194.5000	150.3500	173.9300	402.5600
2024-01-25	194.1700	153.6400	190.4300	404.8700

The prices data frames contains daily data for AAPL, IBM, MSFT, and GOOG. The Adj Close column provides a reverse-engineered daily closing price that accounts for dividends paid and stock splits (and reverse splits). As a result, the .pct_change() in Adj Close considers both price changes (i.e., capital gains) and dividends, so \(R_t = \frac{(P_t + D_t) - P_{t-1}}{P_{t-1}} = \frac{\text{Adj Close}_t - \text{Adj Close}_{t-1}}{\text{Adj Close}_{t-1}}.\)

returns = prices['Adj Close'].pct_change().dropna()
returns

	AAPL	GOOG	IBM	MSFT
Date
2004-08-20	0.0029	0.0794	0.0042	0.0029
2004-08-23	0.0091	0.0101	-0.0070	0.0044
2004-08-24	0.0280	-0.0414	0.0007	0.0000
2004-08-25	0.0344	0.0108	0.0042	0.0114
2004-08-26	0.0487	0.0180	-0.0045	-0.0040
...	...	...	...	...
2024-01-19	0.0155	0.0206	0.0278	0.0122
2024-01-22	0.0122	-0.0018	0.0079	-0.0054
2024-01-23	0.0067	0.0066	0.0064	0.0060
2024-01-24	-0.0035	0.0112	-0.0001	0.0092
2024-01-25	-0.0017	0.0219	0.0949	0.0057

4891 rows × 4 columns

We multiply by 252 to annualize mean daily returns because means grow linearly with time and there are (about) 252 trading days per year.

returns.mean().mul(252)

AAPL   0.3648
GOOG   0.2591
IBM    0.0990
MSFT   0.2005
dtype: float64

We multiply by \(\sqrt{252}\) to annualize the standard deviation of daily returns because variances grow linearly with time, there are (about) 252 trading days per year, and the standard deviation is the square root of the variance.

returns.std().mul(np.sqrt(252))

AAPL   0.3277
GOOG   0.3071
IBM    0.2272
MSFT   0.2722
dtype: float64

The best explanation I have found on why stock return volatility (the standard deviation of stocks returns) grows with the square root of time is at the bottom of page 7 of chapter 8 of Ivo Welch’s free corporate finance textbook.

We can calculate pairwise correlations.

returns['MSFT'].corr(returns['IBM'])

0.4926

We can also calculate correlation matrices.

returns.corr()

	AAPL	GOOG	IBM	MSFT
AAPL	1.0000	0.5189	0.4275	0.5235
GOOG	0.5189	1.0000	0.3953	0.5626
IBM	0.4275	0.3953	1.0000	0.4926
MSFT	0.5235	0.5626	0.4926	1.0000

returns.corr().loc['MSFT', 'IBM']

0.4926