McKinney Chapter 10 - Data Aggregation and Group Operations

FINA 6333 for Spring 2024

Author

Richard Herron

1 Introduction

Chapter 10 of Wes McKinney’s Python for Data Analysis discusses groupby operations, which are the pandas equivalent of Excel pivot tables. Pivot tables help us calculate statistics (e.g., sum, mean, and median) for one set of variables by groups of other variables (e.g., weekday or ticker). For example, we could use a pivot table to calculate mean daily stock returns by weekday.

We will focus on:

  1. Using .groupby() to group by columns, indexes, and functions
  2. Using .agg() to aggregate multiple functions
  3. Using pivot tables as an alternative to .groupby()

Note: Indented block quotes are from McKinney unless otherwise indicated. The section numbers here differ from McKinney because we will only discuss some topics.

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pandas_datareader as pdr
import yfinance as yf
%precision 4
pd.options.display.float_format = '{:.4f}'.format
%config InlineBackend.figure_format = 'retina'

2 GroupBy Mechanics

“Split-apply-combine” is an excellent way to describe and visualize pandas groupby operations.

Hadley Wickham, an author of many popular packages for the R programming language, coined the term split-apply-combine for describing group operations. In the first stage of the process, data contained in a pandas object, whether a Series, DataFrame, or otherwise, is split into groups based on one or more keys that you provide. The splitting is performed on a particular axis of an object. For example, a DataFrame can be grouped on its rows (axis=0) or its columns (axis=1). Once this is done, a function is applied to each group, producing a new value. Finally, the results of all those function applications are combined into a result object. The form of the resulting object will usually depend on what’s being done to the data. See Figure 10-1 for a mockup of a simple group aggregation.

Figure 10-1 visualizes a split-apply-combine operation that:

  1. Splits by the key column (i.e., “groups by key”)
  2. Applies the sum operation to the data column (i.e., “and sums data”)
  3. Combines the grouped sums

I describe this operation as “sum the data column by groups formed on the key column.”
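To make the three stages concrete, here is a minimal sketch that performs the Figure 10-1 operation by hand and then with .groupby(). The data frame df_fig is hypothetical (it only mimics the figure's layout), and the output is omitted.

df_fig = pd.DataFrame({'key': ['a', 'b', 'c', 'a', 'b', 'c'],
                       'data': [0, 5, 10, 5, 10, 15]})

# split: one chunk of the data column per value of key
chunks = {key: chunk['data'] for key, chunk in df_fig.groupby('key')}

# apply: sum each chunk
sums = {key: chunk.sum() for key, chunk in chunks.items()}

# combine: assemble the per-group sums into one series
pd.Series(sums, name='data')

# the one-line equivalent
df_fig.groupby('key')['data'].sum()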

np.random.seed(42)
df = pd.DataFrame({'key1' : ['a', 'a', 'b', 'b', 'a'],
                   'key2' : ['one', 'two', 'one', 'two', 'one'],
                   'data1' : np.random.randn(5),
                   'data2' : np.random.randn(5)})

df
key1 key2 data1 data2
0 a one 0.4967 -0.2341
1 a two -0.1383 1.5792
2 b one 0.6477 0.7674
3 b two 1.5230 -0.4695
4 a one -0.2342 0.5426

Here is one way to calculate the means of data1 by groups formed on key1.

df.loc[df['key1'] == 'a', 'data1'].mean()
0.0414
df.loc[df['key1'] == 'b', 'data1'].mean()
1.0854

We can do this calculation more quickly!

  1. Use the .groupby() method to group by key1
  2. Use the .mean() method to average data1 within each value of key1

Note that .groupby() alone only sets up the grouped object; pandas does not calculate anything until we call an aggregation method, such as .mean(), on the grouped object.

grouped = df['data1'].groupby(df['key1'])
grouped
<pandas.core.groupby.generic.SeriesGroupBy object at 0x000001B3D7023290>
grouped.mean()
key1
a   0.0414
b   1.0854
Name: data1, dtype: float64

We can chain the .groupby() and .mean() methods!

df['data1'].groupby(df['key1']).mean()
key1
a   0.0414
b   1.0854
Name: data1, dtype: float64

If we prefer our result as a dataframe instead of a series, we can wrap data1 with two sets of square brackets.

df[['data1']].groupby(df['key1']).mean()
data1
key1
a 0.0414
b 1.0854

We can group by more than one variable, in which case we get a hierarchical row index (or row multi-index).

means = df['data1'].groupby([df['key1'], df['key2']]).mean()
means
key1  key2
a     one     0.1313
      two    -0.1383
b     one     0.6477
      two     1.5230
Name: data1, dtype: float64

We can use the .unstack() method if we want to use both rows and columns to organize data. Recall that the .unstack() method un-stacks the inner index level (i.e., level=-1) by default, so the key2 values become the columns.

means.unstack()
key2 one two
key1
a 0.1313 -0.1383
b 0.6477 1.5230

We can also pass column names directly to the .groupby() method when the grouping variables are columns in the data frame we want to group. Our grouping variables typically are columns in that data frame, so this syntax is more compact and easier to understand.

df
key1 key2 data1 data2
0 a one 0.4967 -0.2341
1 a two -0.1383 1.5792
2 b one 0.6477 0.7674
3 b two 1.5230 -0.4695
4 a one -0.2342 0.5426

However, we need to make sure that all the columns we pass to the aggregation method (e.g., .mean()) are numerical. Otherwise, pandas gives us a difficult-to-decipher error about methods and data types. For example, in the following code, pandas tries to calculate the mean of column key2, which contains strings.

# df.groupby('key1').mean() # TypeError: agg function failed [how->mean,dtype->object]

To avoid this error, we need to either:

  1. Slice the numerical columns (e.g., ['data1', 'data2'])
  2. Group on all the non-numerical columns so pandas does not pass them to the aggregation function (e.g., df.groupby(['key1', 'key2']))
df.groupby('key1')[['data1', 'data2']].mean()
data1 data2
key1
a 0.0414 0.6292
b 1.0854 0.1490
df.groupby(['key1', 'key2']).mean()
data1 data2
key1 key2
a one 0.1313 0.1542
two -0.1383 1.5792
b one 0.6477 0.7674
two 1.5230 -0.4695
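
In recent versions of pandas, a third option is to pass numeric_only=True to the aggregation method so pandas drops the non-numerical columns for us. A minimal sketch, which should match the sliced version above (output omitted):

df.groupby('key1').mean(numeric_only=True)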

We can use tab completion to remind ourselves of methods we can apply to grouped series and data frames.
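
If tab completion is not available (e.g., in a plain script), a minimal sketch that lists some of the public methods and attributes of a grouped object:

grouped = df.groupby('key1')
[method for method in dir(grouped) if not method.startswith('_')][:10]  # first ten, for brevity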

2.1 Iterating Over Groups

We can iterate over groups, too, because the grouped object is iterable and generates a sequence of tuples. Each tuple contains the value(s) of the grouping variable(s) and its chunk of the data frame. McKinney provides two loops to show how to iterate over groups.

for k1, group in df.groupby('key1'):
    print(k1, group, sep='\n')
a
  key1 key2   data1   data2
0    a  one  0.4967 -0.2341
1    a  two -0.1383  1.5792
4    a  one -0.2342  0.5426
b
  key1 key2  data1   data2
2    b  one 0.6477  0.7674
3    b  two 1.5230 -0.4695
for (k1, k2), group in df.groupby(['key1', 'key2']):
    print((k1, k2), group, sep='\n')
('a', 'one')
  key1 key2   data1   data2
0    a  one  0.4967 -0.2341
4    a  one -0.2342  0.5426
('a', 'two')
  key1 key2   data1  data2
1    a  two -0.1383 1.5792
('b', 'one')
  key1 key2  data1  data2
2    b  one 0.6477 0.7674
('b', 'two')
  key1 key2  data1   data2
3    b  two 1.5230 -0.4695
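
One handy recipe built on this iteration is to materialize the group chunks as a dictionary keyed by group value (output omitted):

# dict of {group value: chunk of df}
pieces = dict(list(df.groupby('key1')))
pieces['b']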

2.2 Grouping with Functions

We can also group with functions. Below, we group with the len function, which calculates the length of each first name in the row index. We could instead add a helper column to people, but it is easier to pass a function to .groupby() (we sketch the helper-column alternative after the output below).

np.random.seed(42)
people = pd.DataFrame(
    data=np.random.randn(5, 5), 
    columns=['a', 'b', 'c', 'd', 'e'], 
    index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis']
)

people
a b c d e
Joe 0.4967 -0.1383 0.6477 1.5230 -0.2342
Steve -0.2341 1.5792 0.7674 -0.4695 0.5426
Wes -0.4634 -0.4657 0.2420 -1.9133 -1.7249
Jim -0.5623 -1.0128 0.3142 -0.9080 -1.4123
Travis 1.4656 -0.2258 0.0675 -1.4247 -0.5444
people.groupby(len).sum()
a b c d e
3 -0.5290 -1.6168 1.2039 -1.2983 -3.3714
5 -0.2341 1.5792 0.7674 -0.4695 0.5426
6 1.4656 -0.2258 0.0675 -1.4247 -0.5444
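
For comparison, here is a sketch of the helper-column approach mentioned above; name_len is a hypothetical column name, and the result should match people.groupby(len).sum() up to the index name:

# add the name length as a column, then group on it
people.assign(name_len=people.index.str.len()).groupby('name_len').sum()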

We can mix functions, lists, dictionaries, etc. in the list we pass to .groupby(). Note that when a dictionary does not map every index value, pandas excludes the unmapped rows from the result, which is why some names disappear below.

key_list = ['one', 'one', 'one', 'two', 'two']
people.groupby([len, key_list]).min()
a b c d e
3 one -0.4634 -0.4657 0.2420 -1.9133 -1.7249
two -0.5623 -1.0128 0.3142 -0.9080 -1.4123
5 one -0.2341 1.5792 0.7674 -0.4695 0.5426
6 two 1.4656 -0.2258 0.0675 -1.4247 -0.5444
d = {'Joe': 'a', 'Jim': 'b'}
people.groupby([len, d]).min()
a b c d e
3 a 0.4967 -0.1383 0.6477 1.5230 -0.2342
b -0.5623 -1.0128 0.3142 -0.9080 -1.4123
d_2 = {'Joe': 'Cool', 'Jim': 'Nerd', 'Travis': 'Cool'}
people.groupby([len, d_2]).min()
a b c d e
3 Cool 0.4967 -0.1383 0.6477 1.5230 -0.2342
Nerd -0.5623 -1.0128 0.3142 -0.9080 -1.4123
6 Cool 1.4656 -0.2258 0.0675 -1.4247 -0.5444

2.3 Grouping by Index Levels

We can also group by index levels, specifying a level by either its number or its name.

columns = pd.MultiIndex.from_arrays([['US', 'US', 'US', 'JP', 'JP'],
                                    [1, 3, 5, 1, 3]],
                                    names=['cty', 'tenor'])
hier_df = pd.DataFrame(np.random.randn(4, 5), columns=columns)

hier_df.T
0 1 2 3
cty tenor
US 1 0.1109 -0.6017 -1.2208 0.7385
3 -1.1510 1.8523 0.2089 0.1714
5 0.3757 -0.0135 -1.9597 -0.1156
JP 1 -0.6006 -1.0577 -1.3282 -0.3011
3 -0.2917 0.8225 0.1969 -1.4785
hier_df.T.groupby(level='cty').count()
0 1 2 3
cty
JP 2 2 2 2
US 3 3 3 3
hier_df.T.groupby(level='tenor').count()
0 1 2 3
tenor
1 2 2 2 2
3 2 2 2 2
5 1 1 1 1
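
Since cty is the outer index level (level 0), grouping by level number should give the same result as level='cty' above:

hier_df.T.groupby(level=0).count()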

3 Data Aggregation

Table 10-1 provides the optimized groupby methods:

  • count: Number of non-NA values in the group
  • sum: Sum of non-NA values
  • mean: Mean of non-NA values
  • median: Arithmetic median of non-NA values
  • std, var: Unbiased (n – 1 denominator) standard deviation and variance
  • min, max: Minimum and maximum of non-NA values
  • prod: Product of non-NA values
  • first, last: First and last non-NA values

These optimized methods are fast and efficient, but pandas lets us use other, non-optimized methods. First, any series method is available.

df
key1 key2 data1 data2
0 a one 0.4967 -0.2341
1 a two -0.1383 1.5792
2 b one 0.6477 0.7674
3 b two 1.5230 -0.4695
4 a one -0.2342 0.5426
df.groupby('key1')['data1'].quantile(0.9)
key1
a   0.3697
b   1.4355
Name: data1, dtype: float64

Second, we can write our own functions and pass them to the .agg() method. These functions should accept an array and return a single value.

def max_minus_min(arr):
    return arr.max() - arr.min()
df.sort_values(by=['key1', 'data1'])
key1 key2 data1 data2
4 a one -0.2342 0.5426
1 a two -0.1383 1.5792
0 a one 0.4967 -0.2341
2 b one 0.6477 0.7674
3 b two 1.5230 -0.4695
df.groupby('key1')['data1'].agg(max_minus_min)
key1
a   0.7309
b   0.8753
Name: data1, dtype: float64
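
A lambda works, too, if we do not want to name the function; this sketch should match the max_minus_min result above:

df.groupby('key1')['data1'].agg(lambda x: x.max() - x.min())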

Some other methods work, too, even if they do not aggregate an array to a single value.

df.groupby('key1')['data1'].describe()
count mean std min 25% 50% 75% max
key1
a 3.0000 0.0414 0.3972 -0.2342 -0.1862 -0.1383 0.1792 0.4967
b 2.0000 1.0854 0.6190 0.6477 0.8665 1.0854 1.3042 1.5230

3.1 Column-Wise and Multiple Function Application

The .agg() method provides two more handy features:

  1. We can pass multiple functions to operate on all of the columns
  2. We can pass specific functions to operate on specific columns

Here is an example with multiple functions:

df.groupby('key1')['data1'].agg(['mean', 'median', 'min', 'max'])
mean median min max
key1
a 0.0414 -0.1383 -0.2342 0.4967
b 1.0854 1.0854 0.6477 1.5230
df.groupby('key1')[['data1', 'data2']].agg(['mean', 'median', 'min', 'max'])
data1 data2
mean median min max mean median min max
key1
a 0.0414 -0.1383 -0.2342 0.4967 0.6292 0.5426 -0.2341 1.5792
b 1.0854 1.0854 0.6477 1.5230 0.1490 0.1490 -0.4695 0.7674

What if I wanted to calculate the mean of data1 and the median of data2 by key1?

df.groupby('key1').agg({'data1': 'mean', 'data2': 'median'})
data1 data2
key1
a 0.0414 0.5426
b 1.0854 0.1490

What if I wanted to calculate the mean and standard deviation of data1 and the median of data2 by key1?

df.groupby('key1').agg({'data1': ['mean', 'std'], 'data2': 'median'})
data1 data2
mean std median
key1
a 0.0414 0.3972 0.5426
b 1.0854 0.6190 0.1490
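
As an aside, recent versions of pandas (0.25 and later) also support named aggregation, which lets us name the output columns in the same call; mean_data1, std_data1, and median_data2 are hypothetical names:

df.groupby('key1').agg(
    mean_data1=('data1', 'mean'),
    std_data1=('data1', 'std'),
    median_data2=('data2', 'median')
)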

4 Apply: General split-apply-combine

The .agg() method aggregates an array to a single value. We can use the .apply() method for more general calculations.

We can combine the .groupby() and .apply() methods to:

  1. Split a dataframe by grouping variables
  2. Call the applied function on each chunk of the original dataframe
  3. Recombine the output of the applied function
def top(x, col, n=1):
    # sort x by col (ascending) and keep the first n rows (i.e., the n smallest values of col)
    return x.sort_values(col).head(n)
df
key1 key2 data1 data2
0 a one 0.4967 -0.2341
1 a two -0.1383 1.5792
2 b one 0.6477 0.7674
3 b two 1.5230 -0.4695
4 a one -0.2342 0.5426
df.groupby('key1').apply(top, col='data1')
key1 key2 data1 data2
key1
a 4 a one -0.2342 0.5426
b 2 b one 0.6477 0.7674
df.groupby('key1').apply(top, col='data1', n=2)
key1 key2 data1 data2
key1
a 4 a one -0.2342 0.5426
1 a two -0.1383 1.5792
b 2 b one 0.6477 0.7674
3 b two 1.5230 -0.4695
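
As an aside, if we do not want the group values added to the row index, we can pass group_keys=False to .groupby(); a minimal sketch (output omitted):

df.groupby('key1', group_keys=False).apply(top, col='data1')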

5 Pivot Tables and Cross-Tabulation

Above, we manually made pivot tables with the .groupby(), .agg(), .apply(), and .unstack() methods. pandas provides a literal interpretation of Excel-style pivot tables with the .pivot_table() method and the pandas.pivot_table() function. These also provide row and column totals via “margins”. It is worthwhile to read through the .pivot_table() docstring several times.

ind = (
    yf.download(tickers='^GSPC ^DJI ^IXIC ^FTSE ^N225 ^HSI')
    .rename_axis(columns=['Variable', 'Index'])
    .stack()
)

ind.head()
Variable Adj Close Close High Low Open Volume
Date Index
1927-12-30 ^GSPC 17.6600 17.6600 17.6600 17.6600 17.6600 0.0000
1928-01-03 ^GSPC 17.7600 17.7600 17.7600 17.7600 17.7600 0.0000
1928-01-04 ^GSPC 17.7200 17.7200 17.7200 17.7200 17.7200 0.0000
1928-01-05 ^GSPC 17.5500 17.5500 17.5500 17.5500 17.5500 0.0000
1928-01-06 ^GSPC 17.6600 17.6600 17.6600 17.6600 17.6600 0.0000

The default aggregation function for .pivot_table() is mean.

ind.loc['2015':].pivot_table(index='Index')
Variable Adj Close Close High Low Open Volume
Index
^DJI 26474.0396 26474.0396 26614.0050 26317.6183 26469.7020 290895360.8696
^FTSE 7045.8903 7045.8903 7087.2046 7003.6069 7045.3863 804146245.6710
^GSPC 3152.9679 3152.9679 3169.6184 3133.9186 3152.3787 4016121530.4348
^HSI 24334.1870 24334.1870 24498.6004 24164.3352 24350.3348 2026656088.8050
^IXIC 9133.9826 9133.9826 9194.6083 9064.2100 9132.1387 3273306814.4285
^N225 23519.0408 23519.0408 23642.2996 23385.7755 23518.4156 95240957.9230
ind.loc['2015':].pivot_table(index='Index', aggfunc='median')
Variable Adj Close Close High Low Open Volume
Index
^DJI 26065.6494 26065.6494 26199.2002 25921.8994 26083.4893 297655000.0000
^FTSE 7190.5000 7190.5000 7226.2000 7148.0000 7190.0000 757404150.0000
^GSPC 2902.8950 2902.8950 2912.7450 2891.8750 2906.6550 3821900000.0000
^HSI 24615.1309 24615.1309 24771.5996 24476.1992 24645.1992 1868937800.0000
^IXIC 8000.2300 8000.2300 8048.5801 7953.6699 8011.6802 2447750000.0000
^N225 22470.3242 22470.3242 22555.8008 22338.5400 22460.9902 82750000.0000
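
The “margins” mentioned above add an All row (and column, when there are column groupers) of grand totals. A minimal sketch using the same data, with the output omitted:

(
    ind
    .loc['2015':]
    .reset_index()
    .pivot_table(values='Close', index='Index', aggfunc='mean', margins=True)
)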

We can use values to select specific variables, pd.Grouper() to group dates into windows of different frequencies (here, annual), and aggfunc to select specific aggregation functions.

(
    ind
    .loc['2015':]
    .reset_index()
    .pivot_table(
        values='Close',
        index=pd.Grouper(key='Date', freq='A'),
        columns='Index',
        aggfunc=['min', 'max']
    )
)
min max
Index ^DJI ^FTSE ^GSPC ^HSI ^IXIC ^N225 ^DJI ^FTSE ^GSPC ^HSI ^IXIC ^N225
Date
2015-12-31 15666.4404 5874.1001 1867.6100 20556.5996 4506.4902 16795.9609 18312.3906 7104.0000 2130.8201 28442.7500 5218.8599 20868.0293
2016-12-31 15660.1797 5537.0000 1829.0800 18319.5801 4266.8398 14952.0195 19974.6191 7142.7998 2271.7200 24099.6992 5487.4399 19494.5293
2017-12-31 19732.4004 7099.2002 2257.8301 22134.4707 5429.0801 18335.6309 24837.5098 7687.7998 2690.1599 30003.4902 6994.7598 22939.1797
2018-12-31 21792.1992 6584.7002 2351.1001 24585.5293 6192.9199 19155.7402 26828.3906 7877.5000 2930.7500 33154.1211 8109.6899 24270.6191
2019-12-31 22686.2207 6692.7002 2447.8899 25064.3594 6463.5000 19561.9609 28645.2598 7686.6001 3240.0200 30157.4902 9022.3896 24066.1191
2020-12-31 18591.9297 4993.8999 2237.3999 21696.1309 6860.6699 16552.8301 30606.4805 7674.6001 3756.0701 29056.4199 12899.4199 27568.1504
2021-12-31 29982.6191 6407.5000 3700.6499 22744.8594 12609.1602 27013.2500 36488.6289 7420.7002 4793.0601 31084.9395 16057.4404 30670.0996
2022-12-31 28725.5098 6826.2002 3577.0300 14687.0195 10213.2900 24717.5293 36799.6484 7672.3999 4796.5601 24965.5508 15832.7998 29332.1602
2023-12-31 31819.1406 7256.8999 3808.1001 16201.4902 10305.2402 25716.8594 37710.1016 8014.2998 4783.3501 22688.9004 15099.1797 33753.3281
2024-12-31 37266.6719 7446.2998 4688.6802 14961.1797 14510.2998 33288.2891 39069.1094 7728.5000 5087.0298 16788.5508 16041.6201 39098.6797
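
For reference, here is a sketch that builds the same table with the .groupby(), .agg(), and .unstack() methods discussed above, assuming the same ind data (output omitted):

(
    ind
    .loc['2015':]
    .reset_index()
    .groupby([pd.Grouper(key='Date', freq='A'), 'Index'])
    ['Close']
    .agg(['min', 'max'])
    .unstack()
)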