import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pandas_datareader as pdr
import yfinance as yf
McKinney Chapter 10 - Data Aggregation and Group Operations
FINA 6333 for Spring 2024
1 Introduction
Chapter 10 of Wes McKinney’s Python for Data Analysis discusses groupby operations, which are the pandas equivalent of Excel pivot tables. Pivot tables help us calculate statistics (e.g., sum, mean, and median) for one set of variables by groups of other variables (e.g., weekday or ticker). For example, we could use a pivot table to calculate mean daily stock returns by weekday.
We will focus on:
- Using .groupby() to group by columns, indexes, and functions
- Using .agg() to aggregate multiple functions
- Using pivot tables as an alternative to .groupby()
Note: Indented block quotes are from McKinney unless otherwise indicated. The section numbers here differ from McKinney because we will only discuss some topics.
%precision 4
pd.options.display.float_format = '{:.4f}'.format
%config InlineBackend.figure_format = 'retina'
2 GroupBy Mechanics
“Split-apply-combine” is an excellent way to describe and visualize pandas groupby operations.
Hadley Wickham, an author of many popular packages for the R programming language, coined the term split-apply-combine for describing group operations. In the first stage of the process, data contained in a pandas object, whether a Series, DataFrame, or otherwise, is split into groups based on one or more keys that you provide. The splitting is performed on a particular axis of an object. For example, a DataFrame can be grouped on its rows (axis=0) or its columns (axis=1). Once this is done, a function is applied to each group, producing a new value. Finally, the results of all those function applications are combined into a result object. The form of the resulting object will usually depend on what’s being done to the data. See Figure 10-1 for a mockup of a simple group aggregation.
Figure 10-1 visualizes a split-apply-combine operation that:
- Splits by the key column (i.e., “groups by key”)
- Applies the sum operation to the data column (i.e., “and sums data”)
- Combines the grouped sums
I describe this operation as “sum the data column by groups formed on the key column.”
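As a minimal sketch of Figure 10-1’s operation (the key and data values below are made up for illustration, not taken from the figure), the whole split-apply-combine runs in two statements:
fig_df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                       'data': [0, 5, 10, 5, 10, 15]})
fig_df.groupby('key')['data'].sum()  # split on key, sum each group, combine into a series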
np.random.seed(42)
df = pd.DataFrame({'key1' : ['a', 'a', 'b', 'b', 'a'],
'key2' : ['one', 'two', 'one', 'two', 'one'],
'data1' : np.random.randn(5),
'data2' : np.random.randn(5)})
df
|  | key1 | key2 | data1 | data2 |
|---|---|---|---|---|
| 0 | a | one | 0.4967 | -0.2341 |
| 1 | a | two | -0.1383 | 1.5792 |
| 2 | b | one | 0.6477 | 0.7674 |
| 3 | b | two | 1.5230 | -0.4695 |
| 4 | a | one | -0.2342 | 0.5426 |
Here is one way to calculate the means of data1 by groups formed on key1.
df.loc[df['key1'] == 'a', 'data1'].mean()
0.0414
df.loc[df['key1'] == 'b', 'data1'].mean()
1.0854
We can do this calculation more quickly!
- Use the .groupby() method to group by key1
- Use the .mean() method to average data1 within each value of key1
Note that without the .mean() method, pandas only sets up the grouped object; the calculation happens when we call an aggregation method such as .mean() on it.
grouped = df['data1'].groupby(df['key1'])
grouped
<pandas.core.groupby.generic.SeriesGroupBy object at 0x000001B3D7023290>
grouped.mean()
key1
a 0.0414
b 1.0854
Name: data1, dtype: float64
We can chain the .groupby() and .mean() methods!
df['data1'].groupby(df['key1']).mean()
key1
a 0.0414
b 1.0854
Name: data1, dtype: float64
If we prefer our result as a dataframe instead of a series, we can wrap data1 with two sets of square brackets.
df[['data1']].groupby(df['key1']).mean()
|  | data1 |
|---|---|
| key1 | |
| a | 0.0414 |
| b | 1.0854 |
We can group by more than one variable, in which case we get a hierarchical row index (or row multi-index).
means = df['data1'].groupby([df['key1'], df['key2']]).mean()
means
key1 key2
a one 0.1313
two -0.1383
b one 0.6477
two 1.5230
Name: data1, dtype: float64
We can use the .unstack() method if we want to use both rows and columns to organize data. Recall that .unstack() un-stacks the inner index level (i.e., level=-1) by default, so the key2 values become the columns.
means.unstack()
| key2 | one | two |
|---|---|---|
| key1 | ||
| a | 0.1313 | -0.1383 |
| b | 0.6477 | 1.5230 |
We can also pass column names directly to the .groupby() method when the grouping variables are columns of the data frame we want to group. Because our grouping variables typically are columns of that data frame, this syntax is more compact and easier to read.
df
|  | key1 | key2 | data1 | data2 |
|---|---|---|---|---|
| 0 | a | one | 0.4967 | -0.2341 |
| 1 | a | two | -0.1383 | 1.5792 |
| 2 | b | one | 0.6477 | 0.7674 |
| 3 | b | two | 1.5230 | -0.4695 |
| 4 | a | one | -0.2342 | 0.5426 |
However, we need to make sure that all the columns we pass to the aggregation method (e.g., .mean()) are numerical. Otherwise, pandas will give us a difficult-to-decipher error about methods and data types. For example, in the following code, pandas tries to calculate the mean of column key2, which is a string.
# df.groupby('key1').mean() # TypeError: agg function failed [how->mean,dtype->object]
To avoid this error, we need to either:
- Slice the numerical columns (e.g., ['data1', 'data2'])
- Group on all the non-numerical columns so pandas does not pass them to the aggregation function (e.g., df.groupby(['key1', 'key2']))
df.groupby('key1')[['data1', 'data2']].mean()
|  | data1 | data2 |
|---|---|---|
| key1 | ||
| a | 0.0414 | 0.6292 |
| b | 1.0854 | 0.1490 |
df.groupby(['key1', 'key2']).mean()
|  |  | data1 | data2 |
|---|---|---|---|
| key1 | key2 | ||
| a | one | 0.1313 | 0.1542 |
| two | -0.1383 | 1.5792 | |
| b | one | 0.6477 | 0.7674 |
| two | 1.5230 | -0.4695 |
We can use tab completion to remind ourselves of the methods we can apply to grouped series and data frames.
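If tab completion is unavailable (e.g., in a plain script), a quick sketch with Python’s built-in dir() lists the same methods; this is a convenience trick, not from McKinney:
grouped = df.groupby('key1')
[m for m in dir(grouped) if not m.startswith('_')][:10]  # first ten public methods/attributes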
2.1 Iterating Over Groups
We can iterate over groups, too, because the .groupby() method generates a sequence of tuples. Each tuple contains the value(s) of the grouping variable(s) and its chunk of the dataframe. McKinney provides two loops to show how to iterate over groups.
for k1, group in df.groupby('key1'):
    print(k1, group, sep='\n')
a
key1 key2 data1 data2
0 a one 0.4967 -0.2341
1 a two -0.1383 1.5792
4 a one -0.2342 0.5426
b
key1 key2 data1 data2
2 b one 0.6477 0.7674
3 b two 1.5230 -0.4695
for (k1, k2), group in df.groupby(['key1', 'key2']):
    print((k1, k2), group, sep='\n')
('a', 'one')
key1 key2 data1 data2
0 a one 0.4967 -0.2341
4 a one -0.2342 0.5426
('a', 'two')
key1 key2 data1 data2
1 a two -0.1383 1.5792
('b', 'one')
key1 key2 data1 data2
2 b one 0.6477 0.7674
('b', 'two')
key1 key2 data1 data2
3 b two 1.5230 -0.4695
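A related recipe from McKinney materializes the groups as a dictionary, which lets us grab any one group’s chunk by its key:
pieces = dict(list(df.groupby('key1')))  # {group key: chunk of df}
pieces['b']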
2.2 Grouping with Functions
We can also group with functions. Below, we group with the len function, which calculates the length of the first names in the row index. We could instead add a helper column to people (see the sketch after the table below), but it is easier to pass a function to .groupby().
np.random.seed(42)
people = pd.DataFrame(
data=np.random.randn(5, 5),
columns=['a', 'b', 'c', 'd', 'e'],
index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis']
)
people
|  | a | b | c | d | e |
|---|---|---|---|---|---|
| Joe | 0.4967 | -0.1383 | 0.6477 | 1.5230 | -0.2342 |
| Steve | -0.2341 | 1.5792 | 0.7674 | -0.4695 | 0.5426 |
| Wes | -0.4634 | -0.4657 | 0.2420 | -1.9133 | -1.7249 |
| Jim | -0.5623 | -1.0128 | 0.3142 | -0.9080 | -1.4123 |
| Travis | 1.4656 | -0.2258 | 0.0675 | -1.4247 | -0.5444 |
people.groupby(len).sum()
|  | a | b | c | d | e |
|---|---|---|---|---|---|
| 3 | -0.5290 | -1.6168 | 1.2039 | -1.2983 | -3.3714 |
| 5 | -0.2341 | 1.5792 | 0.7674 | -0.4695 | 0.5426 |
| 6 | 1.4656 | -0.2258 | 0.0675 | -1.4247 | -0.5444 |
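For comparison, here is the helper-column approach mentioned above; a minimal sketch in which the name_len column name is our own choice:
people_2 = people.assign(name_len=people.index.map(len))  # name length as a column
people_2.groupby('name_len').sum()  # same result as people.groupby(len).sum()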
We can mix the functions, lists, dictionaries, etc. that we pass to .groupby(). Note that rows whose index values are missing from a grouping dictionary are dropped from the result, as the d example below shows.
key_list = ['one', 'one', 'one', 'two', 'two']
people.groupby([len, key_list]).min()
|  |  | a | b | c | d | e |
|---|---|---|---|---|---|---|
| 3 | one | -0.4634 | -0.4657 | 0.2420 | -1.9133 | -1.7249 |
| two | -0.5623 | -1.0128 | 0.3142 | -0.9080 | -1.4123 | |
| 5 | one | -0.2341 | 1.5792 | 0.7674 | -0.4695 | 0.5426 |
| 6 | two | 1.4656 | -0.2258 | 0.0675 | -1.4247 | -0.5444 |
d = {'Joe': 'a', 'Jim': 'b'}
people.groupby([len, d]).min()
|  |  | a | b | c | d | e |
|---|---|---|---|---|---|---|
| 3 | a | 0.4967 | -0.1383 | 0.6477 | 1.5230 | -0.2342 |
| b | -0.5623 | -1.0128 | 0.3142 | -0.9080 | -1.4123 |
d_2 = {'Joe': 'Cool', 'Jim': 'Nerd', 'Travis': 'Cool'}
people.groupby([len, d_2]).min()
|  |  | a | b | c | d | e |
|---|---|---|---|---|---|---|
| 3 | Cool | 0.4967 | -0.1383 | 0.6477 | 1.5230 | -0.2342 |
| Nerd | -0.5623 | -1.0128 | 0.3142 | -0.9080 | -1.4123 | |
| 6 | Cool | 1.4656 | -0.2258 | 0.0675 | -1.4247 | -0.5444 |
2.3 Grouping by Index Levels
We can also group by index levels. We can specify index levels by either level number or name.
columns = pd.MultiIndex.from_arrays([['US', 'US', 'US', 'JP', 'JP'],
[1, 3, 5, 1, 3]],
names=['cty', 'tenor'])
hier_df = pd.DataFrame(np.random.randn(4, 5), columns=columns)
hier_df.T
|  |  | 0 | 1 | 2 | 3 |
|---|---|---|---|---|---|
| cty | tenor | ||||
| US | 1 | 0.1109 | -0.6017 | -1.2208 | 0.7385 |
| 3 | -1.1510 | 1.8523 | 0.2089 | 0.1714 | |
| 5 | 0.3757 | -0.0135 | -1.9597 | -0.1156 | |
| JP | 1 | -0.6006 | -1.0577 | -1.3282 | -0.3011 |
| 3 | -0.2917 | 0.8225 | 0.1969 | -1.4785 |
hier_df.T.groupby(level='cty').count()
|  | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| cty | ||||
| JP | 2 | 2 | 2 | 2 |
| US | 3 | 3 | 3 | 3 |
hier_df.T.groupby(level='tenor').count()
|  | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| tenor | ||||
| 1 | 2 | 2 | 2 | 2 |
| 3 | 2 | 2 | 2 | 2 |
| 5 | 1 | 1 | 1 | 1 |
3 Data Aggregation
Table 10-1 provides the optimized groupby methods:
- count: Number of non-NA values in the group
- sum: Sum of non-NA values
- mean: Mean of non-NA values
- median: Arithmetic median of non-NA values
- std, var: Unbiased (n – 1 denominator) standard deviation and variance
- min, max: Minimum and maximum of non-NA values
- prod: Product of non-NA values
- first, last: First and last non-NA values
These optimized methods are fast and efficient, but pandas lets us use other, non-optimized methods. First, any series method is available.
df
|  | key1 | key2 | data1 | data2 |
|---|---|---|---|---|
| 0 | a | one | 0.4967 | -0.2341 |
| 1 | a | two | -0.1383 | 1.5792 |
| 2 | b | one | 0.6477 | 0.7674 |
| 3 | b | two | 1.5230 | -0.4695 |
| 4 | a | one | -0.2342 | 0.5426 |
df.groupby('key1')['data1'].quantile(0.9)
key1
a 0.3697
b 1.4355
Name: data1, dtype: float64
Second, we can write our own functions and pass them to the .agg() method. These functions should accept an array and return a single value.
def max_minus_min(arr):
    return arr.max() - arr.min()
df.sort_values(by=['key1', 'data1'])
|  | key1 | key2 | data1 | data2 |
|---|---|---|---|---|
| 4 | a | one | -0.2342 | 0.5426 |
| 1 | a | two | -0.1383 | 1.5792 |
| 0 | a | one | 0.4967 | -0.2341 |
| 2 | b | one | 0.6477 | 0.7674 |
| 3 | b | two | 1.5230 | -0.4695 |
df.groupby('key1')['data1'].agg(max_minus_min)
key1
a 0.7309
b 0.8753
Name: data1, dtype: float64
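For a one-off statistic like this, we can also pass an anonymous function to .agg() instead of defining max_minus_min; this sketch should return the same result:
df.groupby('key1')['data1'].agg(lambda x: x.max() - x.min())  # range of data1 within each group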
Some other methods work, too, even if they do not aggregate an array to a single value.
df.groupby('key1')['data1'].describe()
|  | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| key1 | ||||||||
| a | 3.0000 | 0.0414 | 0.3972 | -0.2342 | -0.1862 | -0.1383 | 0.1792 | 0.4967 |
| b | 2.0000 | 1.0854 | 0.6190 | 0.6477 | 0.8665 | 1.0854 | 1.3042 | 1.5230 |
3.1 Column-Wise and Multiple Function Application
The .agg() method provides two more handy features:
- We can pass multiple functions to operate on all of the columns
- We can pass specific functions to operate on specific columns
Here is an example with multiple functions:
df.groupby('key1')['data1'].agg(['mean', 'median', 'min', 'max'])
|  | mean | median | min | max |
|---|---|---|---|---|
| key1 | ||||
| a | 0.0414 | -0.1383 | -0.2342 | 0.4967 |
| b | 1.0854 | 1.0854 | 0.6477 | 1.5230 |
df.groupby('key1')[['data1', 'data2']].agg(['mean', 'median', 'min', 'max'])
|  | data1 |  |  |  | data2 |  |  |  |
|---|---|---|---|---|---|---|---|---|
|  | mean | median | min | max | mean | median | min | max |
| key1 |  |  |  |  |  |  |  |  |
| a | 0.0414 | -0.1383 | -0.2342 | 0.4967 | 0.6292 | 0.5426 | -0.2341 | 1.5792 |
| b | 1.0854 | 1.0854 | 0.6477 | 1.5230 | 0.1490 | 0.1490 | -0.4695 | 0.7674 |
What if I wanted to calculate the mean of data1 and the median of data2 by key1?
df.groupby('key1').agg({'data1': 'mean', 'data2': 'median'})
|  | data1 | data2 |
|---|---|---|
| key1 | ||
| a | 0.0414 | 0.5426 |
| b | 1.0854 | 0.1490 |
What if I wanted to calculate the mean and standard deviation of data1 and the median of data2 by key1?
df.groupby('key1').agg({'data1': ['mean', 'std'], 'data2': 'median'})
|  | data1 |  | data2 |
|---|---|---|---|
|  | mean | std | median |
| key1 |  |  |  |
| a | 0.0414 | 0.3972 | 0.5426 |
| b | 1.0854 | 0.6190 | 0.1490 |
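If we prefer flat, descriptive column names in the output, pandas also supports named aggregation with keyword arguments; a sketch in which the output names (data1_mean, etc.) are our own choices:
df.groupby('key1').agg(
    data1_mean=('data1', 'mean'),      # each keyword maps to a (column, function) pair
    data1_std=('data1', 'std'),
    data2_median=('data2', 'median')
)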
4 Apply: General split-apply-combine
The .agg() method aggregates an array to a single value. We can use the .apply() method for more general calculations.
We can combine the .groupby() and .apply() methods to:
- Split a dataframe by grouping variables
- Call the applied function on each chunk of the original dataframe
- Recombine the output of the applied function
def top(x, col, n=1):
    # sorts ascending, so this returns the n rows with the smallest values of col
    return x.sort_values(col).head(n)
df
|  | key1 | key2 | data1 | data2 |
|---|---|---|---|---|
| 0 | a | one | 0.4967 | -0.2341 |
| 1 | a | two | -0.1383 | 1.5792 |
| 2 | b | one | 0.6477 | 0.7674 |
| 3 | b | two | 1.5230 | -0.4695 |
| 4 | a | one | -0.2342 | 0.5426 |
df.groupby('key1').apply(top, col='data1')
|  |  | key1 | key2 | data1 | data2 |
|---|---|---|---|---|---|
| key1 | |||||
| a | 4 | a | one | -0.2342 | 0.5426 |
| b | 2 | b | one | 0.6477 | 0.7674 |
df.groupby('key1').apply(top, col='data1', n=2)
|  |  | key1 | key2 | data1 | data2 |
|---|---|---|---|---|---|
| key1 | |||||
| a | 4 | a | one | -0.2342 | 0.5426 |
| 1 | a | two | -0.1383 | 1.5792 | |
| b | 2 | b | one | 0.6477 | 0.7674 |
| 3 | b | two | 1.5230 | -0.4695 |
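If we do not want key1 repeated as both an index level and a column, we can pass group_keys=False to .groupby(); a minimal sketch assuming the same df and top as above:
df.groupby('key1', group_keys=False).apply(top, col='data1', n=2)  # drops the outer key1 index level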
5 Pivot Tables and Cross-Tabulation
Above we manually made pivot tables with the .groupby(), .agg(), .apply(), and .unstack() methods. pandas provides a literal interpretation of Excel-style pivot tables with the .pivot_table() method and the pandas.pivot_table() function. These also provide row and column totals via “margins” (demonstrated below). It is worthwhile to read through the .pivot_table() docstring several times.
ind = (
yf.download(tickers='^GSPC ^DJI ^IXIC ^FTSE ^N225 ^HSI')
.rename_axis(columns=['Variable', 'Index'])
.stack()
)
ind.head()
| Variable |  | Adj Close | Close | High | Low | Open | Volume |
|---|---|---|---|---|---|---|---|
| Date | Index | ||||||
| 1927-12-30 | ^GSPC | 17.6600 | 17.6600 | 17.6600 | 17.6600 | 17.6600 | 0.0000 |
| 1928-01-03 | ^GSPC | 17.7600 | 17.7600 | 17.7600 | 17.7600 | 17.7600 | 0.0000 |
| 1928-01-04 | ^GSPC | 17.7200 | 17.7200 | 17.7200 | 17.7200 | 17.7200 | 0.0000 |
| 1928-01-05 | ^GSPC | 17.5500 | 17.5500 | 17.5500 | 17.5500 | 17.5500 | 0.0000 |
| 1928-01-06 | ^GSPC | 17.6600 | 17.6600 | 17.6600 | 17.6600 | 17.6600 | 0.0000 |
The default aggregation function for .pivot_table() is mean.
ind.loc['2015':].pivot_table(index='Index')
| Variable | Adj Close | Close | High | Low | Open | Volume |
|---|---|---|---|---|---|---|
| Index | ||||||
| ^DJI | 26474.0396 | 26474.0396 | 26614.0050 | 26317.6183 | 26469.7020 | 290895360.8696 |
| ^FTSE | 7045.8903 | 7045.8903 | 7087.2046 | 7003.6069 | 7045.3863 | 804146245.6710 |
| ^GSPC | 3152.9679 | 3152.9679 | 3169.6184 | 3133.9186 | 3152.3787 | 4016121530.4348 |
| ^HSI | 24334.1870 | 24334.1870 | 24498.6004 | 24164.3352 | 24350.3348 | 2026656088.8050 |
| ^IXIC | 9133.9826 | 9133.9826 | 9194.6083 | 9064.2100 | 9132.1387 | 3273306814.4285 |
| ^N225 | 23519.0408 | 23519.0408 | 23642.2996 | 23385.7755 | 23518.4156 | 95240957.9230 |
ind.loc['2015':].pivot_table(index='Index', aggfunc='median')
| Variable | Adj Close | Close | High | Low | Open | Volume |
|---|---|---|---|---|---|---|
| Index | ||||||
| ^DJI | 26065.6494 | 26065.6494 | 26199.2002 | 25921.8994 | 26083.4893 | 297655000.0000 |
| ^FTSE | 7190.5000 | 7190.5000 | 7226.2000 | 7148.0000 | 7190.0000 | 757404150.0000 |
| ^GSPC | 2902.8950 | 2902.8950 | 2912.7450 | 2891.8750 | 2906.6550 | 3821900000.0000 |
| ^HSI | 24615.1309 | 24615.1309 | 24771.5996 | 24476.1992 | 24645.1992 | 1868937800.0000 |
| ^IXIC | 8000.2300 | 8000.2300 | 8048.5801 | 7953.6699 | 8011.6802 | 2447750000.0000 |
| ^N225 | 22470.3242 | 22470.3242 | 22555.8008 | 22338.5400 | 22460.9902 | 82750000.0000 |
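Here is a minimal sketch of the margins feature mentioned above; margins=True appends an All row with the aggregate across every group:
ind.loc['2015':].pivot_table(index='Index', margins=True)  # default aggfunc is mean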
We can use values to select specific variables, pd.Grouper() to group dates into windows of a given frequency, and aggfunc to select specific aggregation functions.
(
ind
.loc['2015':]
.reset_index()
.pivot_table(
values='Close',
index=pd.Grouper(key='Date', freq='A'),
columns='Index',
aggfunc=['min', 'max']
)
)
|  | min |  |  |  |  |  | max |  |  |  |  |  |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Index | ^DJI | ^FTSE | ^GSPC | ^HSI | ^IXIC | ^N225 | ^DJI | ^FTSE | ^GSPC | ^HSI | ^IXIC | ^N225 |
| Date | ||||||||||||
| 2015-12-31 | 15666.4404 | 5874.1001 | 1867.6100 | 20556.5996 | 4506.4902 | 16795.9609 | 18312.3906 | 7104.0000 | 2130.8201 | 28442.7500 | 5218.8599 | 20868.0293 |
| 2016-12-31 | 15660.1797 | 5537.0000 | 1829.0800 | 18319.5801 | 4266.8398 | 14952.0195 | 19974.6191 | 7142.7998 | 2271.7200 | 24099.6992 | 5487.4399 | 19494.5293 |
| 2017-12-31 | 19732.4004 | 7099.2002 | 2257.8301 | 22134.4707 | 5429.0801 | 18335.6309 | 24837.5098 | 7687.7998 | 2690.1599 | 30003.4902 | 6994.7598 | 22939.1797 |
| 2018-12-31 | 21792.1992 | 6584.7002 | 2351.1001 | 24585.5293 | 6192.9199 | 19155.7402 | 26828.3906 | 7877.5000 | 2930.7500 | 33154.1211 | 8109.6899 | 24270.6191 |
| 2019-12-31 | 22686.2207 | 6692.7002 | 2447.8899 | 25064.3594 | 6463.5000 | 19561.9609 | 28645.2598 | 7686.6001 | 3240.0200 | 30157.4902 | 9022.3896 | 24066.1191 |
| 2020-12-31 | 18591.9297 | 4993.8999 | 2237.3999 | 21696.1309 | 6860.6699 | 16552.8301 | 30606.4805 | 7674.6001 | 3756.0701 | 29056.4199 | 12899.4199 | 27568.1504 |
| 2021-12-31 | 29982.6191 | 6407.5000 | 3700.6499 | 22744.8594 | 12609.1602 | 27013.2500 | 36488.6289 | 7420.7002 | 4793.0601 | 31084.9395 | 16057.4404 | 30670.0996 |
| 2022-12-31 | 28725.5098 | 6826.2002 | 3577.0300 | 14687.0195 | 10213.2900 | 24717.5293 | 36799.6484 | 7672.3999 | 4796.5601 | 24965.5508 | 15832.7998 | 29332.1602 |
| 2023-12-31 | 31819.1406 | 7256.8999 | 3808.1001 | 16201.4902 | 10305.2402 | 25716.8594 | 37710.1016 | 8014.2998 | 4783.3501 | 22688.9004 | 15099.1797 | 33753.3281 |
| 2024-12-31 | 37266.6719 | 7446.2998 | 4688.6802 | 14961.1797 | 14510.2998 | 33288.2891 | 39069.1094 | 7728.5000 | 5087.0298 | 16788.5508 | 16041.6201 | 39098.6797 |
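Finally, the cross-tabulation in this section’s title: pd.crosstab() is a pivot-table special case that counts observations per group. A minimal sketch using the small df from above:
pd.crosstab(index=df['key1'], columns=df['key2'], margins=True)  # frequency counts with All totals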