McKinney Chapter 8 - Data Wrangling: Join, Combine, and Reshape
FINA 6333 for Spring 2024
Author
Richard Herron
1 Introduction
Chapter 8 of Wes McKinney’s Python for Data Analysis introduces a few important pandas concepts:
Joining or merging is combining 2+ data frames on 1+ indexes or columns into 1 data frame
Reshaping is rearranging data frames so it has fewer columns and more rows (wide to long) or more columns and fewer rows (long to wide); we can also reshape a series to a data frame and vice versa
Note: Indented block quotes are from McKinney unless otherwise indicated. The section numbers here differ from McKinney because we will only discuss some topics.
import matplotlib.pyplot as pltimport numpy as npimport pandas as pdimport pandas_datareader as pdrimport yfinance as yf
We need to learn about hierarchical indexing before we learn about combining and reshaping data. A hierarchical index gives two or more index levels to an axis. For example, we could index rows by ticker and date. Or we could index columns by variable and ticker. Hierarchical indexing helps us work with high-dimensional data in a low-dimensional form.
a 1 0.4967
2 -0.1383
3 0.6477
b 1 1.5230
3 -0.2342
c 1 -0.2341
2 1.5792
d 2 0.7674
3 -0.4695
dtype: float64
We can partially index this series to concisely subset data.
data['b']
1 1.5230
3 -0.2342
dtype: float64
data['b':'c']
b 1 1.5230
3 -0.2342
c 1 -0.2341
2 1.5792
dtype: float64
data.loc[['b', 'd']]
b 1 1.5230
3 -0.2342
d 2 0.7674
3 -0.4695
dtype: float64
We can subset on the index inner level, too. Here the first : slices all values in the outer index.
data.loc[:, 2]
a -0.1383
c 1.5792
d 0.7674
dtype: float64
Here data has a stacked format. For each outer index level (the letters), we have multiple observations based on the inner index level (the numbers). We can un-stack data to convert the inner index level to columns.
data.unstack()
1
2
3
a
0.4967
-0.1383
0.6477
b
1.5230
NaN
-0.2342
c
-0.2341
1.5792
NaN
d
NaN
0.7674
-0.4695
data.unstack().stack()
a 1 0.4967
2 -0.1383
3 0.6477
b 1 1.5230
3 -0.2342
c 1 -0.2341
2 1.5792
d 2 0.7674
3 -0.4695
dtype: float64
We can create a data frame with hieracrhical indexes or multi-indexes on rows and columns.
Recall that df[val] selects the val column. Here frame has a multi-index for the columns, so frame['Ohio'] selects all columns with Ohio as the outer index level.
frame['Ohio']
color
Green
Red
key1
key2
a
1
0
1
2
3
4
b
1
6
7
2
9
10
We can pass a tuple if we only want one column.
frame[[('Ohio', 'Green')]]
state
Ohio
color
Green
key1
key2
a
1
0
2
3
b
1
6
2
9
We have to do a more work to slice the inner level of the column index.
frame.loc[:, (slice(None), 'Green')]
state
Ohio
Colorado
color
Green
Green
key1
key2
a
1
0
2
2
3
5
b
1
6
8
2
9
11
We can use pd.IndexSlice[:, 'Green'] an alternative to (slice(None), 'Green').
frame.loc[:, pd.IndexSlice[:, 'Green']]
state
Ohio
Colorado
color
Green
Green
key1
key2
a
1
0
2
2
3
5
b
1
6
8
2
9
11
2.1 Reordering and Sorting Levels
We can swap index levels with the .swaplevel() method. The default arguments are i=-2 and j=-1, which swap the two innermost index levels.
frame
state
Ohio
Colorado
color
Green
Red
Green
key1
key2
a
1
0
1
2
2
3
4
5
b
1
6
7
8
2
9
10
11
frame.swaplevel()
state
Ohio
Colorado
color
Green
Red
Green
key2
key1
1
a
0
1
2
2
a
3
4
5
1
b
6
7
8
2
b
9
10
11
We can use index names, too.
frame.swaplevel('key1', 'key2')
state
Ohio
Colorado
color
Green
Red
Green
key2
key1
1
a
0
1
2
2
a
3
4
5
1
b
6
7
8
2
b
9
10
11
We can also sort on an index (or list of indexes). After we swap levels, we may want to sort our data.
frame
state
Ohio
Colorado
color
Green
Red
Green
key1
key2
a
1
0
1
2
2
3
4
5
b
1
6
7
8
2
9
10
11
frame.sort_index(level=1)
state
Ohio
Colorado
color
Green
Red
Green
key1
key2
a
1
0
1
2
b
1
6
7
8
a
2
3
4
5
b
2
9
10
11
Again, we can give index names, too.
frame.sort_index(level='key2')
state
Ohio
Colorado
color
Green
Red
Green
key1
key2
a
1
0
1
2
b
1
6
7
8
a
2
3
4
5
b
2
9
10
11
We can sort by two or more index levels by passing a list of index levels or names.
frame.sort_index(level=[0, 1])
state
Ohio
Colorado
color
Green
Red
Green
key1
key2
a
1
0
1
2
2
3
4
5
b
1
6
7
8
2
9
10
11
We can chain these methods, too.
frame.swaplevel(0, 1).sort_index(level=0)
state
Ohio
Colorado
color
Green
Red
Green
key2
key1
1
a
0
1
2
b
6
7
8
2
a
3
4
5
b
9
10
11
2.2 Indexing with a DataFrame’s columns
We can convert a column into an index and an index into a column with the .set_index() and .reset_index() methods.
The .set_index() method converts columns to indexes, and removes the columns from the data frame by default.
frame2 = frame.set_index(['c', 'd'])frame2
a
b
c
d
one
0
0
7
1
1
6
2
2
5
two
0
3
4
1
4
3
2
5
2
3
6
1
The .reset_index() method removes the indexes, adds them as columns, and sets in integer index.
frame2.reset_index()
c
d
a
b
0
one
0
0
7
1
one
1
1
6
2
one
2
2
5
3
two
0
3
4
4
two
1
4
3
5
two
2
5
2
6
two
3
6
1
3 Combining and Merging Datasets
pandas provides several methods and functions to combine and merge data. We can typically create the same output with any of these methods or functions, but one may be more efficient than the others. If I want to combine data frames with similar indexes, I try the .join() method first. The .join() also lets use can combine more than two data frames at once. Otherwise, I try the .merge() method, which has a function pd.merge(), too. The pd.merge() function is more general than the .join() method, so we will start with pd.merge().
Merge or join operations combine datasets by linking rows using one or more keys. These operations are central to relational databases (e.g., SQL-based). The merge function in pandas is the main entry point for using these algorithms on your data.
We will start with the pd.merge() syntax, but pandas also has .merge() and .join() methods. Learning these other syntaxes is easy once we understand the pd.merge() syntax.
The default how is how='inner', so pd.merge() inner joins left and right data frames by default, keeping only rows that appear in both. We can specify how='outer', so pd.merge() outer joins left and right data frames, keeping all rows that appear in either.
pd.merge(df1, df2, how='outer')
key
data1
data2
0
b
0.0000
1.0000
1
b
1.0000
1.0000
2
b
6.0000
1.0000
3
a
2.0000
0.0000
4
a
4.0000
0.0000
5
a
5.0000
0.0000
6
c
3.0000
NaN
7
d
NaN
2.0000
A left merge keeps only rows that appear in the left data frame.
pd.merge(df1, df2, how='left')
key
data1
data2
0
b
0
1.0000
1
b
1
1.0000
2
a
2
0.0000
3
c
3
NaN
4
a
4
0.0000
5
a
5
0.0000
6
b
6
1.0000
A rights merge keeps only rows that appear in the right data frame.
pd.merge(df1, df2, how='right')
key
data1
data2
0
a
2.0000
0
1
a
4.0000
0
2
a
5.0000
0
3
b
0.0000
1
4
b
1.0000
1
5
b
6.0000
1
6
d
NaN
2
By default, pd.merge() merges on all columns that appear in both data frames.
on : label or list Column or index level names to join on. These must be found in both DataFrames. If on is None and not merging on indexes then this defaults to the intersection of the columns in both DataFrames.
Here key is the only common column between df1 and df2. We should specify on to avoid unexpected results.
pd.merge(df1, df2, on='key')
key
data1
data2
0
b
0
1
1
b
1
1
2
b
6
1
3
a
2
0
4
a
4
0
5
a
5
0
We must specify left_on and right_on if our left and right data frames do not have a common column.
# pd.merge(df3, df4) # this code fails/errors because there are not common columns# MergeError: No common columns to perform merge on. Merge options: left_on=None, right_on=None, left_index=False, right_index=False
Here pd.merge() dropped row c from df3 and row d from df4. Rows c and d dropped because pd.merge()inner joins be default. An inner join keeps the intersection of the left and right data frame keys. Further, rows a and b from df4 appear three times to match df3. If we want to keep rows c and d, we can outer join df3 and df4 with how='outer'.
pd.merge(df1, df2, how='outer')
key
data1
data2
0
b
0.0000
1.0000
1
b
1.0000
1.0000
2
b
6.0000
1.0000
3
a
2.0000
0.0000
4
a
4.0000
0.0000
5
a
5.0000
0.0000
6
c
3.0000
NaN
7
d
NaN
2.0000
Many-to-many merges have well-defined, though not necessarily intuitive, behavior.
Many-to-many joins form the Cartesian product of the rows. Since there were three b rows in the left DataFrame and two in the right one, there are six b rows in the result. The join method only affects the distinct key values appearing in the result.
Be careful with many-to-many joins! In finance, we do not expect many-to-many joins because we expect at least one of the data frames to have unique observations. pandas will not warn us if we accidentally perform a many-to-many join instead of a one-to-one or many-to-one join.
We can merge on more than one key. For example, we may merge two data sets on ticker-date pairs or industry-date pairs.
When column names overlap between the left and right data frames, pd.merge() appends _x and _y to the left and right versions of the overlapping column names.
pd.merge(left, right, on='key1')
key1
key2_x
lval
key2_y
rval
0
foo
one
1
one
4
1
foo
one
1
one
5
2
foo
two
2
one
4
3
foo
two
2
one
5
4
bar
one
3
one
6
5
bar
one
3
two
7
I typically specify suffixes to avoid later confusion.
I read the pd.merge() docstring whenever I am in doubt. Table 8-2 lists the most commonly used arguments for pd.merge().
left: DataFrame to be merged on the left side.
right: DataFrame to be merged on the right side.
how: One of ‘inner’, ‘outer’, ‘left’, or ‘right’; defaults to ‘inner’.
on: Column names to join on. Must be found in both DataFrame objects. If not specified and no other join keys given will use the intersection of the column names in left and right as the join keys.
left_on: Columns in left DataFrame to use as join keys.
right_on: Analogous to left_on for left DataFrame.
left_index: Use row index in left as its join key (or keys, if a MultiIndex).
right_index: Analogous to left_index.
sort: Sort merged data lexicographically by join keys; True by default (disable to get better performance in some cases on large datasets).
suffixes: Tuple of string values to append to column names in case of overlap; defaults to (’_x’, ’_y’) (e.g., if ‘data’ in both DataFrame objects, would appear as ‘data_x’ and ‘data_y’ in result).
copy: If False, avoid copying data into resulting data structure in some exceptional cases; by default always copies.
indicator: Adds a special column _merge that indicates the source of each row; values will be ‘left_only’, ‘right_only’, or ‘both’ based on the origin of the joined data in each row.
3.2 Merging on Index
If we want to use pd.merge() to join on row indexes, we can use the left_index and right_index arguments.
DataFrame has a convenient join instance for merging by index. It can also be used to combine together many DataFrame objects having the same or similar indexes but non-overlapping columns.
If we have matching indexes on left and right, we can use .join().
left2
Ohio
Nevada
a
1.0000
2.0000
c
3.0000
4.0000
e
5.0000
6.0000
right2
Missouri
Alabama
b
7.0000
8.0000
c
9.0000
10.0000
d
11.0000
12.0000
e
13.0000
14.0000
left2.join(right2, how='outer')
Ohio
Nevada
Missouri
Alabama
a
1.0000
2.0000
NaN
NaN
b
NaN
NaN
7.0000
8.0000
c
3.0000
4.0000
9.0000
10.0000
d
NaN
NaN
11.0000
12.0000
e
5.0000
6.0000
13.0000
14.0000
The .join() method left joins by default. The .join() method uses indexes, so it requires few arguments and accepts a list of data frames.
Above, we briefly explore reshaping data with .stack() and .unstack(). Here we explore reshaping data more deeply.
4.1 Reshaping with Hierarchical Indexing
Hierarchical indexes (multi-indexes) help reshape data.
There are two primary actions: - stack: This “rotates” or pivots from the columns in the data to the rows - unstack: This pivots from the rows into the columns
If we un-stack a data frame, the un-stacked level becomes the innermost level in the resulting index.
df.unstack('state')
side
left
right
state
Ohio
Colorado
Ohio
Colorado
number
one
0
3
5
8
two
1
4
6
9
three
2
5
7
10
We can chain .stack() and .unstack() to rearrange our data.
df.unstack('state').stack('side')
state
Ohio
Colorado
number
side
one
left
0
3
right
5
8
two
left
1
4
right
6
9
three
left
2
5
right
7
10
McKinney provides two more subsections on reshaping data with the .pivot() and .melt() methods. Unlike, the stacking methods, the pivoting methods can aggregate data and do not require an index. We will skip these additional aggregation methods for now.