McKinney Chapter 10 - Data Aggregation and Group Operations

FINA 6333 for Spring 2024

Author

Richard Herron

1 Introduction

Chapter 10 of Wes McKinney’s Python for Data Analysis discusses groupby operations, which are the pandas equivalent of Excel pivot tables. Pivot tables help us calculate statistics (e.g., sum, mean, and median) for one set of variables by groups of other variables (e.g., weekday or ticker). For example, we could use a pivot table to calculate mean daily stock returns by weekday.

We will focus on:

  1. Using .groupby() to group by columns, indexes, and functions
  2. Using .agg() to aggregate multiple functions
  3. Using pivot tables as an alternative to .groupby()

Note: Indented block quotes are from McKinney unless otherwise indicated. The section numbers here differ from McKinney because we will only discuss some topics.

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pandas_datareader as pdr
import yfinance as yf
%precision 4
pd.options.display.float_format = '{:.4f}'.format
%config InlineBackend.figure_format = 'retina'

2 GroupBy Mechanics

“Split-apply-combine” is an excellent way to describe and visualize pandas groupby operations.

Hadley Wickham, an author of many popular packages for the R programming language, coined the term split-apply-combine for describing group operations. In the first stage of the process, data contained in a pandas object, whether a Series, DataFrame, or otherwise, is split into groups based on one or more keys that you provide. The splitting is performed on a particular axis of an object. For example, a DataFrame can be grouped on its rows (axis=0) or its columns (axis=1). Once this is done, a function is applied to each group, producing a new value. Finally, the results of all those function applications are combined into a result object. The form of the resulting object will usually depend on what’s being done to the data. See Figure 10-1 for a mockup of a simple group aggregation.

Figure 10-1 visualizes a split-apply-combine operation that:

  1. Splits by the key column (i.e., “groups by key”)
  2. Applies the sum operation to the data column (i.e., “and sums data”)
  3. Combines the grouped sums

I describe this operation as “sum the data column by groups formed on the key column.”
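To make the three stages concrete, here is a minimal sketch that performs the Figure 10-1 operation by hand and then with .groupby(). The data frame df_fig is hypothetical (it only mimics the figure's layout), and the output is omitted.

df_fig = pd.DataFrame({'key': ['a', 'b', 'c', 'a', 'b', 'c'],
                       'data': [0, 5, 10, 5, 10, 15]})

# split: one chunk of the data column per value of key
chunks = {key: chunk['data'] for key, chunk in df_fig.groupby('key')}

# apply: sum each chunk
sums = {key: chunk.sum() for key, chunk in chunks.items()}

# combine: assemble the per-group sums into one series
pd.Series(sums, name='data')

# the one-line equivalent
df_fig.groupby('key')['data'].sum()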

np.random.seed(42)
df = pd.DataFrame({'key1' : ['a', 'a', 'b', 'b', 'a'],
                   'key2' : ['one', 'two', 'one', 'two', 'one'],
                   'data1' : np.random.randn(5),
                   'data2' : np.random.randn(5)})

df
key1 key2 data1 data2
0 a one 0.4967 -0.2341
1 a two -0.1383 1.5792
2 b one 0.6477 0.7674
3 b two 1.5230 -0.4695
4 a one -0.2342 0.5426

Here is one way to calculate the means of data1 by groups formed on key1.

df.loc[df['key1'] == 'a', 'data1'].mean()
0.0414
df.loc[df['key1'] == 'b', 'data1'].mean()
1.0854

We can do this calculation more quickly!

  1. Use the .groupby() method to group by key1
  2. Use the .mean() method to average data1 within each value of key1

Note that .groupby() alone only sets up the grouped object; pandas does not calculate anything until we call an aggregation method, such as .mean(), on the grouped object.

grouped = df['data1'].groupby(df['key1'])
grouped
<pandas.core.groupby.generic.SeriesGroupBy object at 0x000001B3D7023290>
grouped.mean()
key1
a   0.0414
b   1.0854
Name: data1, dtype: float64

We can chain the .groupby() and .mean() methods!

df['data1'].groupby(df['key1']).mean()
key1
a   0.0414
b   1.0854
Name: data1, dtype: float64

If we prefer our result as a dataframe instead of a series, we can wrap data1 with two sets of square brackets.

df[['data1']].groupby(df['key1']).mean()
data1
key1
a 0.0414
b 1.0854

We can group by more than one variable, in which case we get a hierarchical row index (or row multi-index).

means = df['data1'].groupby([df['key1'], df['key2']]).mean()
means
key1  key2
a     one     0.1313
      two    -0.1383
b     one     0.6477
      two     1.5230
Name: data1, dtype: float64

We can use the .unstack() method if we want to use both rows and columns to organize data. Recall that the .unstack() method un-stacks the inner index level (i.e., level=-1) by default, so the key2 values become the columns.

means.unstack()
key2 one two
key1
a 0.1313 -0.1383
b 0.6477 1.5230

We can also pass column names directly to the .groupby() method when the grouping variables are columns in the data frame we want to group. Our grouping variables typically are columns in that data frame, so this syntax is more compact and easier to understand.

df
key1 key2 data1 data2
0 a one 0.4967 -0.2341
1 a two -0.1383 1.5792
2 b one 0.6477 0.7674
3 b two 1.5230 -0.4695
4 a one -0.2342 0.5426

However, we need to make sure that all the columns we pass to the aggregation method (e.g., .mean()) are numerical. Otherwise, pandas gives us a difficult-to-decipher error about methods and data types. For example, in the following code, pandas tries to calculate the mean of column key2, which contains strings.

# df.groupby('key1').mean() # TypeError: agg function failed [how->mean,dtype->object]

To avoid this error, we need to either:

  1. Slice the numerical columns (e.g., ['data1', 'data2'])
  2. Group on all the non-numerical columns so pandas does not pass them to the aggregation function (e.g., df.groupby(['key1', 'key2']))
df.groupby('key1')[['data1', 'data2']].mean()
data1 data2
key1
a 0.0414 0.6292
b 1.0854 0.1490
df.groupby(['key1', 'key2']).mean()
data1 data2
key1 key2
a one 0.1313 0.1542
two -0.1383 1.5792
b one 0.6477 0.7674
two 1.5230 -0.4695
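
In recent versions of pandas, a third option is to pass numeric_only=True to the aggregation method so pandas drops the non-numerical columns for us. A minimal sketch, which should match the sliced version above (output omitted):

df.groupby('key1').mean(numeric_only=True)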

We can use tab completion to remind ourselves of methods we can apply to grouped series and data frames.
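
If tab completion is not available (e.g., in a plain script), a minimal sketch that lists some of the public methods and attributes of a grouped object:

grouped = df.groupby('key1')
[method for method in dir(grouped) if not method.startswith('_')][:10]  # first ten, for brevity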

2.1 Iterating Over Groups

We can iterate over groups, too, because the grouped object is iterable and generates a sequence of tuples. Each tuple contains the value(s) of the grouping variable(s) and its chunk of the data frame. McKinney provides two loops to show how to iterate over groups.

for k1, group in df.groupby('key1'):
    print(k1, group, sep='\n')
a
  key1 key2   data1   data2
0    a  one  0.4967 -0.2341
1    a  two -0.1383  1.5792
4    a  one -0.2342  0.5426
b
  key1 key2  data1   data2
2    b  one 0.6477  0.7674
3    b  two 1.5230 -0.4695
for (k1, k2), group in df.groupby(['key1', 'key2']):
    print((k1, k2), group, sep='\n')
('a', 'one')
  key1 key2   data1   data2
0    a  one  0.4967 -0.2341
4    a  one -0.2342  0.5426
('a', 'two')
  key1 key2   data1  data2
1    a  two -0.1383 1.5792
('b', 'one')
  key1 key2  data1  data2
2    b  one 0.6477 0.7674
('b', 'two')
  key1 key2  data1   data2
3    b  two 1.5230 -0.4695
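
One handy recipe built on this iteration is to materialize the group chunks as a dictionary keyed by group value (output omitted):

# dict of {group value: chunk of df}
pieces = dict(list(df.groupby('key1')))
pieces['b']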

2.2 Grouping with Functions

We can also group with functions. Below, we group with the len function, which calculates the length of each first name in the row index. We could instead add a helper column to people, but it is easier to pass a function to .groupby() (we sketch the helper-column alternative after the output below).

np.random.seed(42)
people = pd.DataFrame(
    data=np.random.randn(5, 5), 
    columns=['a', 'b', 'c', 'd', 'e'], 
    index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis']
)

people
a b c d e
Joe 0.4967 -0.1383 0.6477 1.5230 -0.2342
Steve -0.2341 1.5792 0.7674 -0.4695 0.5426
Wes -0.4634 -0.4657 0.2420 -1.9133 -1.7249
Jim -0.5623 -1.0128 0.3142 -0.9080 -1.4123
Travis 1.4656 -0.2258 0.0675 -1.4247 -0.5444
people.groupby(len).sum()
a b c d e
3 -0.5290 -1.6168 1.2039 -1.2983 -3.3714
5 -0.2341 1.5792 0.7674 -0.4695 0.5426
6 1.4656 -0.2258 0.0675 -1.4247 -0.5444
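
For comparison, here is a sketch of the helper-column approach mentioned above; name_len is a hypothetical column name, and the result should match people.groupby(len).sum() up to the index name:

# add the name length as a column, then group on it
people.assign(name_len=people.index.str.len()).groupby('name_len').sum()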

We can mix functions, lists, dictionaries, etc. in the list we pass to .groupby(). Note that when a dictionary does not map every index value, pandas excludes the unmapped rows from the result, which is why some names disappear below.

key_list = ['one', 'one', 'one', 'two', 'two']
people.groupby([len, key_list]).min()
a b c d e
3 one -0.4634 -0.4657 0.2420 -1.9133 -1.7249
two -0.5623 -1.0128 0.3142 -0.9080 -1.4123
5 one -0.2341 1.5792 0.7674 -0.4695 0.5426
6 two 1.4656 -0.2258 0.0675 -1.4247 -0.5444
d = {'Joe': 'a', 'Jim': 'b'}
people.groupby([len, d]).min()
a b c d e
3 a 0.4967 -0.1383 0.6477 1.5230 -0.2342
b -0.5623 -1.0128 0.3142 -0.9080 -1.4123
d_2 = {'Joe': 'Cool', 'Jim': 'Nerd', 'Travis': 'Cool'}
people.groupby([len, d_2]).min()
a b c d e
3 Cool 0.4967 -0.1383 0.6477 1.5230 -0.2342
Nerd -0.5623 -1.0128 0.3142 -0.9080 -1.4123
6 Cool 1.4656 -0.2258 0.0675 -1.4247 -0.5444

2.3 Grouping by Index Levels

We can also group by index levels, specifying a level by either its number or its name.

columns = pd.MultiIndex.from_arrays([['US', 'US', 'US', 'JP', 'JP'],
                                    [1, 3, 5, 1, 3]],
                                    names=['cty', 'tenor'])
hier_df = pd.DataFrame(np.random.randn(4, 5), columns=columns)

hier_df.T
0 1 2 3
cty tenor
US 1 0.1109 -0.6017 -1.2208 0.7385
3 -1.1510 1.8523 0.2089 0.1714
5 0.3757 -0.0135 -1.9597 -0.1156
JP 1 -0.6006 -1.0577 -1.3282 -0.3011
3 -0.2917 0.8225 0.1969 -1.4785
hier_df.T.groupby(level='cty').count()
0 1 2 3
cty
JP 2 2 2 2
US 3 3 3 3
hier_df.T.groupby(level='tenor').count()
0 1 2 3
tenor
1 2 2 2 2
3 2 2 2 2
5 1 1 1 1
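
Since cty is the outer index level (level 0), grouping by level number should give the same result as level='cty' above:

hier_df.T.groupby(level=0).count()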

3 Data Aggregation

Table 10-1 provides the optimized groupby methods:

  • count: Number of non-NA values in the group
  • sum: Sum of non-NA values
  • mean: Mean of non-NA values
  • median: Arithmetic median of non-NA values
  • std, var: Unbiased (n – 1 denominator) standard deviation and variance
  • min, max: Minimum and maximum of non-NA values
  • prod: Product of non-NA values
  • first, last: First and last non-NA values

These optimized methods are fast and efficient, but pandas lets us use other, non-optimized methods. First, any series method is available.

df
key1 key2 data1 data2
0 a one 0.4967 -0.2341
1 a two -0.1383 1.5792
2 b one 0.6477 0.7674
3 b two 1.5230 -0.4695
4 a one -0.2342 0.5426
df.groupby('key1')['data1'].quantile(0.9)
key1
a   0.3697
b   1.4355
Name: data1, dtype: float64

Second, we can write our own functions and pass them to the .agg() method. These functions should accept an array and return a single value.

def max_minus_min(arr):
    return arr.max() - arr.min()
df.sort_values(by=['key1', 'data1'])
key1 key2 data1 data2
4 a one -0.2342 0.5426
1 a two -0.1383 1.5792
0 a one 0.4967 -0.2341
2 b one 0.6477 0.7674
3 b two 1.5230 -0.4695
df.groupby('key1')['data1'].agg(max_minus_min)
key1
a   0.7309
b   0.8753
Name: data1, dtype: float64
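
A lambda works, too, if we do not want to name the function; this sketch should match the max_minus_min result above:

df.groupby('key1')['data1'].agg(lambda x: x.max() - x.min())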

Some other methods work, too, even if they do not aggregate an array to a single value.

df.groupby('key1')['data1'].describe()
count mean std min 25% 50% 75% max
key1
a 3.0000 0.0414 0.3972 -0.2342 -0.1862 -0.1383 0.1792 0.4967
b 2.0000 1.0854 0.6190 0.6477 0.8665 1.0854 1.3042 1.5230

3.1 Column-Wise and Multiple Function Application

The .agg() method provides two more handy features:

  1. We can pass multiple functions to operate on all of the columns
  2. We can pass specific functions to operate on specific columns

Here is an example with multiple functions:

df.groupby('key1')['data1'].agg(['mean', 'median', 'min', 'max'])
mean median min max
key1
a 0.0414 -0.1383 -0.2342 0.4967
b 1.0854 1.0854 0.6477 1.5230
df.groupby('key1')[['data1', 'data2']].agg(['mean', 'median', 'min', 'max'])
data1 data2
mean median min max mean median min max
key1
a 0.0414 -0.1383 -0.2342 0.4967 0.6292 0.5426 -0.2341 1.5792
b 1.0854 1.0854 0.6477 1.5230 0.1490 0.1490 -0.4695 0.7674

What if I wanted to calculate the mean of data1 and the median of data2 by key1?

df.groupby('key1').agg({'data1': 'mean', 'data2': 'median'})
data1 data2
key1
a 0.0414 0.5426
b 1.0854 0.1490

What if I wanted to calculate the mean and standard deviation of data1 and the median of data2 by key1?

df.groupby('key1').agg({'data1': ['mean', 'std'], 'data2': 'median'})
data1 data2
mean std median
key1
a 0.0414 0.3972 0.5426
b 1.0854 0.6190 0.1490
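
As an aside, recent versions of pandas (0.25 and later) also support named aggregation, which lets us name the output columns in the same call; mean_data1, std_data1, and median_data2 are hypothetical names:

df.groupby('key1').agg(
    mean_data1=('data1', 'mean'),
    std_data1=('data1', 'std'),
    median_data2=('data2', 'median')
)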

4 Apply: General split-apply-combine

The .agg() method aggregates an array to a single value. We can use the .apply() method for more general calculations.

We can combine the .groupby() and .apply() methods to:

  1. Split a dataframe by grouping variables
  2. Call the applied function on each chunk of the original dataframe
  3. Recombine the output of the applied function
def top(x, col, n=1):
    # sort x by col (ascending) and keep the first n rows (i.e., the n smallest values of col)
    return x.sort_values(col).head(n)
df
key1 key2 data1 data2
0 a one 0.4967 -0.2341
1 a two -0.1383 1.5792
2 b one 0.6477 0.7674
3 b two 1.5230 -0.4695
4 a one -0.2342 0.5426
df.groupby('key1').apply(top, col='data1')
key1 key2 data1 data2
key1
a 4 a one -0.2342 0.5426
b 2 b one 0.6477 0.7674
df.groupby('key1').apply(top, col='data1', n=2)
key1 key2 data1 data2
key1
a 4 a one -0.2342 0.5426
1 a two -0.1383 1.5792
b 2 b one 0.6477 0.7674
3 b two 1.5230 -0.4695
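
As an aside, if we do not want the group values added to the row index, we can pass group_keys=False to .groupby(); a minimal sketch (output omitted):

df.groupby('key1', group_keys=False).apply(top, col='data1')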

5 Pivot Tables and Cross-Tabulation

Above, we manually made pivot tables with the .groupby(), .agg(), .apply(), and .unstack() methods. pandas provides a literal interpretation of Excel-style pivot tables with the .pivot_table() method and the pandas.pivot_table() function. These also provide row and column totals via “margins”. It is worthwhile to read through the .pivot_table() docstring several times.

ind = (
    yf.download(tickers='^GSPC ^DJI ^IXIC ^FTSE ^N225 ^HSI')
    .rename_axis(columns=['Variable', 'Index'])
    .stack()
)

ind.head()
Variable Adj Close Close High Low Open Volume
Date Index
1927-12-30 ^GSPC 17.6600 17.6600 17.6600 17.6600 17.6600 0.0000
1928-01-03 ^GSPC 17.7600 17.7600 17.7600 17.7600 17.7600 0.0000
1928-01-04 ^GSPC 17.7200 17.7200 17.7200 17.7200 17.7200 0.0000
1928-01-05 ^GSPC 17.5500 17.5500 17.5500 17.5500 17.5500 0.0000
1928-01-06 ^GSPC 17.6600 17.6600 17.6600 17.6600 17.6600 0.0000

The default aggregation function for .pivot_table() is mean.

ind.loc['2015':].pivot_table(index='Index')
Variable Adj Close Close High Low Open Volume
Index
^DJI 26474.0396 26474.0396 26614.0050 26317.6183 26469.7020 290895360.8696
^FTSE 7045.8903 7045.8903 7087.2046 7003.6069 7045.3863 804146245.6710
^GSPC 3152.9679 3152.9679 3169.6184 3133.9186 3152.3787 4016121530.4348
^HSI 24334.1870 24334.1870 24498.6004 24164.3352 24350.3348 2026656088.8050
^IXIC 9133.9826 9133.9826 9194.6083 9064.2100 9132.1387 3273306814.4285
^N225 23519.0408 23519.0408 23642.2996 23385.7755 23518.4156 95240957.9230
ind.loc['2015':].pivot_table(index='Index', aggfunc='median')
Variable Adj Close Close High Low Open Volume
Index
^DJI 26065.6494 26065.6494 26199.2002 25921.8994 26083.4893 297655000.0000
^FTSE 7190.5000 7190.5000 7226.2000 7148.0000 7190.0000 757404150.0000
^GSPC 2902.8950 2902.8950 2912.7450 2891.8750 2906.6550 3821900000.0000
^HSI 24615.1309 24615.1309 24771.5996 24476.1992 24645.1992 1868937800.0000
^IXIC 8000.2300 8000.2300 8048.5801 7953.6699 8011.6802 2447750000.0000
^N225 22470.3242 22470.3242 22555.8008 22338.5400 22460.9902 82750000.0000
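
The “margins” mentioned above add an All row (and column, when there are column groupers) of grand totals. A minimal sketch using the same data, with the output omitted:

(
    ind
    .loc['2015':]
    .reset_index()
    .pivot_table(values='Close', index='Index', aggfunc='mean', margins=True)
)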

We can use values to select specific variables, pd.Grouper() to group dates into windows of different frequencies (here, annual), and aggfunc to select specific aggregation functions.

(
    ind
    .loc['2015':]
    .reset_index()
    .pivot_table(
        values='Close',
        index=pd.Grouper(key='Date', freq='A'),
        columns='Index',
        aggfunc=['min', 'max']
    )
)
min max
Index ^DJI ^FTSE ^GSPC ^HSI ^IXIC ^N225 ^DJI ^FTSE ^GSPC ^HSI ^IXIC ^N225
Date
2015-12-31 15666.4404 5874.1001 1867.6100 20556.5996 4506.4902 16795.9609 18312.3906 7104.0000 2130.8201 28442.7500 5218.8599 20868.0293
2016-12-31 15660.1797 5537.0000 1829.0800 18319.5801 4266.8398 14952.0195 19974.6191 7142.7998 2271.7200 24099.6992 5487.4399 19494.5293
2017-12-31 19732.4004 7099.2002 2257.8301 22134.4707 5429.0801 18335.6309 24837.5098 7687.7998 2690.1599 30003.4902 6994.7598 22939.1797
2018-12-31 21792.1992 6584.7002 2351.1001 24585.5293 6192.9199 19155.7402 26828.3906 7877.5000 2930.7500 33154.1211 8109.6899 24270.6191
2019-12-31 22686.2207 6692.7002 2447.8899 25064.3594 6463.5000 19561.9609 28645.2598 7686.6001 3240.0200 30157.4902 9022.3896 24066.1191
2020-12-31 18591.9297 4993.8999 2237.3999 21696.1309 6860.6699 16552.8301 30606.4805 7674.6001 3756.0701 29056.4199 12899.4199 27568.1504
2021-12-31 29982.6191 6407.5000 3700.6499 22744.8594 12609.1602 27013.2500 36488.6289 7420.7002 4793.0601 31084.9395 16057.4404 30670.0996
2022-12-31 28725.5098 6826.2002 3577.0300 14687.0195 10213.2900 24717.5293 36799.6484 7672.3999 4796.5601 24965.5508 15832.7998 29332.1602
2023-12-31 31819.1406 7256.8999 3808.1001 16201.4902 10305.2402 25716.8594 37710.1016 8014.2998 4783.3501 22688.9004 15099.1797 33753.3281
2024-12-31 37266.6719 7446.2998 4688.6802 14961.1797 14510.2998 33288.2891 39069.1094 7728.5000 5087.0298 16788.5508 16041.6201 39098.6797
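
For reference, here is a sketch that builds the same table with the .groupby(), .agg(), and .unstack() methods discussed above, assuming the same ind data (output omitted):

(
    ind
    .loc['2015':]
    .reset_index()
    .groupby([pd.Grouper(key='Date', freq='A'), 'Index'])
    ['Close']
    .agg(['min', 'max'])
    .unstack()
)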