McKinney Chapter 8 - Data Wrangling: Join, Combine, and Reshape

FINA 6333 for Spring 2024

Author

Richard Herron

1 Introduction

Chapter 8 of Wes McKinney’s Python for Data Analysis introduces a few important pandas concepts:

  1. Joining or merging combines two or more data frames into one on one or more indexes or columns
  2. Reshaping rearranges a data frame so it has fewer columns and more rows (wide to long) or more columns and fewer rows (long to wide); we can also reshape a series into a data frame and vice versa

Note: Indented block quotes are from McKinney unless otherwise indicated. The section numbers here differ from McKinney because we will only discuss some topics.

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pandas_datareader as pdr
import yfinance as yf
%precision 4
pd.options.display.float_format = '{:.4f}'.format
%config InlineBackend.figure_format = 'retina'

2 Hierarchical Indexing

We need to learn about hierarchical indexing before we learn about combining and reshaping data. A hierarchical index gives two or more index levels to an axis. For example, we could index rows by ticker and date. Or we could index columns by variable and ticker. Hierarchical indexing helps us work with high-dimensional data in a low-dimensional form.
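As a sketch of the ticker-and-date example, we can build a two-level row index with pd.MultiIndex.from_product(). The tickers and dates below are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical tickers and dates, used only to illustrate a two-level index
tickers = ['AAPL', 'MSFT']
dates = pd.date_range('2024-01-02', periods=3, freq='D')
index = pd.MultiIndex.from_product([tickers, dates], names=['ticker', 'date'])

np.random.seed(42)
returns = pd.Series(np.random.randn(6), index=index, name='ret')

# Partial indexing on the outer level selects one ticker's dates
print(returns.loc['AAPL'])
```
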

np.random.seed(42)
data = pd.Series(
    data=np.random.randn(9),
    index=[
        ['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'],
        [1, 2, 3, 1, 3, 1, 2, 2, 3]
    ]
)

data
a  1    0.4967
   2   -0.1383
   3    0.6477
b  1    1.5230
   3   -0.2342
c  1   -0.2341
   2    1.5792
d  2    0.7674
   3   -0.4695
dtype: float64

We can partially index this series to concisely subset data.

data['b']
1    1.5230
3   -0.2342
dtype: float64
data['b':'c']
b  1    1.5230
   3   -0.2342
c  1   -0.2341
   2    1.5792
dtype: float64
data.loc[['b', 'd']]
b  1    1.5230
   3   -0.2342
d  2    0.7674
   3   -0.4695
dtype: float64

We can subset on the index inner level, too. Here the first : slices all values in the outer index.

data.loc[:, 2]
a   -0.1383
c    1.5792
d    0.7674
dtype: float64

Here data has a stacked format. For each outer index level (the letters), we have multiple observations based on the inner index level (the numbers). We can un-stack data to convert the inner index level to columns.

data.unstack()
1 2 3
a 0.4967 -0.1383 0.6477
b 1.5230 NaN -0.2342
c -0.2341 1.5792 NaN
d NaN 0.7674 -0.4695
data.unstack().stack()
a  1    0.4967
   2   -0.1383
   3    0.6477
b  1    1.5230
   3   -0.2342
c  1   -0.2341
   2    1.5792
d  2    0.7674
   3   -0.4695
dtype: float64

We can create a data frame with hierarchical indexes or multi-indexes on rows and columns.

frame = pd.DataFrame(
    data=np.arange(12).reshape((4, 3)),
    index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
    columns=[['Ohio', 'Ohio', 'Colorado'], ['Green', 'Red', 'Green']]
)

frame
Ohio Colorado
Green Red Green
a 1 0 1 2
2 3 4 5
b 1 6 7 8
2 9 10 11

We can name these multi-indexes but names are not required.

frame.index.names = ['key1', 'key2']
frame.columns.names = ['state', 'color']
frame
state Ohio Colorado
color Green Red Green
key1 key2
a 1 0 1 2
2 3 4 5
b 1 6 7 8
2 9 10 11

Recall that df[val] selects the val column. Here frame has a multi-index for the columns, so frame['Ohio'] selects all columns with Ohio as the outer index level.

frame['Ohio']
color Green Red
key1 key2
a 1 0 1
2 3 4
b 1 6 7
2 9 10

We can pass a tuple if we only want one column.

frame[[('Ohio', 'Green')]]
state Ohio
color Green
key1 key2
a 1 0
2 3
b 1 6
2 9

We have to do more work to slice the inner level of the column index.

frame.loc[:, (slice(None), 'Green')]
state Ohio Colorado
color Green Green
key1 key2
a 1 0 2
2 3 5
b 1 6 8
2 9 11

We can use pd.IndexSlice[:, 'Green'] as an alternative to (slice(None), 'Green').

frame.loc[:, pd.IndexSlice[:, 'Green']]
state Ohio Colorado
color Green Green
key1 key2
a 1 0 2
2 3 5
b 1 6 8
2 9 11

2.1 Reordering and Sorting Levels

We can swap index levels with the .swaplevel() method. The default arguments are i=-2 and j=-1, which swap the two innermost index levels.

frame
state Ohio Colorado
color Green Red Green
key1 key2
a 1 0 1 2
2 3 4 5
b 1 6 7 8
2 9 10 11
frame.swaplevel()
state Ohio Colorado
color Green Red Green
key2 key1
1 a 0 1 2
2 a 3 4 5
1 b 6 7 8
2 b 9 10 11

We can use index names, too.

frame.swaplevel('key1', 'key2')
state Ohio Colorado
color Green Red Green
key2 key1
1 a 0 1 2
2 a 3 4 5
1 b 6 7 8
2 b 9 10 11

We can also sort on an index (or list of indexes). After we swap levels, we may want to sort our data.

frame
state Ohio Colorado
color Green Red Green
key1 key2
a 1 0 1 2
2 3 4 5
b 1 6 7 8
2 9 10 11
frame.sort_index(level=1)
state Ohio Colorado
color Green Red Green
key1 key2
a 1 0 1 2
b 1 6 7 8
a 2 3 4 5
b 2 9 10 11

Again, we can give index names, too.

frame.sort_index(level='key2')
state Ohio Colorado
color Green Red Green
key1 key2
a 1 0 1 2
b 1 6 7 8
a 2 3 4 5
b 2 9 10 11

We can sort by two or more index levels by passing a list of index levels or names.

frame.sort_index(level=[0, 1])
state Ohio Colorado
color Green Red Green
key1 key2
a 1 0 1 2
2 3 4 5
b 1 6 7 8
2 9 10 11

We can chain these methods, too.

frame.swaplevel(0, 1).sort_index(level=0)
state Ohio Colorado
color Green Red Green
key2 key1
1 a 0 1 2
b 6 7 8
2 a 3 4 5
b 9 10 11

2.2 Indexing with a DataFrame’s columns

We can convert a column into an index and an index into a column with the .set_index() and .reset_index() methods.

frame = pd.DataFrame({
    'a': range(7), 
    'b': range(7, 0, -1),
    'c': ['one', 'one', 'one', 'two', 'two','two', 'two'],
    'd': [0, 1, 2, 0, 1, 2, 3]
})

frame
a b c d
0 0 7 one 0
1 1 6 one 1
2 2 5 one 2
3 3 4 two 0
4 4 3 two 1
5 5 2 two 2
6 6 1 two 3

The .set_index() method converts columns to indexes, and removes the columns from the data frame by default.

frame2 = frame.set_index(['c', 'd'])
frame2
a b
c d
one 0 0 7
1 1 6
2 2 5
two 0 3 4
1 4 3
2 5 2
3 6 1

The .reset_index() method removes the indexes, adds them as columns, and sets an integer index.

frame2.reset_index()
c d a b
0 one 0 0 7
1 one 1 1 6
2 one 2 2 5
3 two 0 3 4
4 two 1 4 3
5 two 2 5 2
6 two 3 6 1

3 Combining and Merging Datasets

pandas provides several methods and functions to combine and merge data. We can typically create the same output with any of these methods or functions, but one may be more efficient than the others. If I want to combine data frames with similar indexes, I try the .join() method first. The .join() method also lets us combine more than two data frames at once. Otherwise, I try the .merge() method, which also exists as the function pd.merge(). The pd.merge() function is more general than the .join() method, so we will start with pd.merge().

The pandas website provides helpful visualizations.

3.1 Database-Style DataFrame Joins

Merge or join operations combine datasets by linking rows using one or more keys. These operations are central to relational databases (e.g., SQL-based). The merge function in pandas is the main entry point for using these algorithms on your data.

We will start with the pd.merge() syntax, but pandas also has .merge() and .join() methods. Learning these other syntaxes is easy once we understand the pd.merge() syntax.
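As a quick sketch of the equivalence, the .merge() method is the same operation as the pd.merge() function with the calling data frame as the left argument. The frames df_a and df_b below are illustration-only names.

```python
import pandas as pd

# Small hypothetical frames (df_a and df_b are illustration-only names)
df_a = pd.DataFrame({'key': ['b', 'b', 'a'], 'data1': range(3)})
df_b = pd.DataFrame({'key': ['a', 'b', 'd'], 'data2': range(3)})

# The function and method forms produce identical results
via_function = pd.merge(df_a, df_b, on='key')
via_method = df_a.merge(df_b, on='key')
print(via_method)
```
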

df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'], 'data1': range(7)})
df2 = pd.DataFrame({'key': ['a', 'b', 'd'], 'data2': range(3)})
df1
key data1
0 b 0
1 b 1
2 a 2
3 c 3
4 a 4
5 a 5
6 b 6
df2
key data2
0 a 0
1 b 1
2 d 2
pd.merge(df1, df2)
key data1 data2
0 b 0 1
1 b 1 1
2 b 6 1
3 a 2 0
4 a 4 0
5 a 5 0

The default is how='inner', so pd.merge() inner joins the left and right data frames by default, keeping only rows whose keys appear in both. We can specify how='outer', so pd.merge() outer joins the left and right data frames, keeping rows whose keys appear in either.

pd.merge(df1, df2, how='outer')
key data1 data2
0 b 0.0000 1.0000
1 b 1.0000 1.0000
2 b 6.0000 1.0000
3 a 2.0000 0.0000
4 a 4.0000 0.0000
5 a 5.0000 0.0000
6 c 3.0000 NaN
7 d NaN 2.0000

A left merge keeps all rows that appear in the left data frame, even if they have no match in the right data frame.

pd.merge(df1, df2, how='left')
key data1 data2
0 b 0 1.0000
1 b 1 1.0000
2 a 2 0.0000
3 c 3 NaN
4 a 4 0.0000
5 a 5 0.0000
6 b 6 1.0000

A right merge keeps all rows that appear in the right data frame, even if they have no match in the left data frame.

pd.merge(df1, df2, how='right')
key data1 data2
0 a 2.0000 0
1 a 4.0000 0
2 a 5.0000 0
3 b 0.0000 1
4 b 1.0000 1
5 b 6.0000 1
6 d NaN 2

By default, pd.merge() merges on all columns that appear in both data frames.

on : label or list Column or index level names to join on. These must be found in both DataFrames. If on is None and not merging on indexes then this defaults to the intersection of the columns in both DataFrames.

Here key is the only common column between df1 and df2. We should specify on to avoid unexpected results.

pd.merge(df1, df2, on='key')
key data1 data2
0 b 0 1
1 b 1 1
2 b 6 1
3 a 2 0
4 a 4 0
5 a 5 0

We must specify left_on and right_on if our left and right data frames do not have a common column.

df3 = pd.DataFrame({'lkey': ['b', 'b', 'a', 'c', 'a', 'a', 'b'], 'data1': range(7)})
df4 = pd.DataFrame({'rkey': ['a', 'b', 'd'], 'data2': range(3)})
df3
lkey data1
0 b 0
1 b 1
2 a 2
3 c 3
4 a 4
5 a 5
6 b 6
df4
rkey data2
0 a 0
1 b 1
2 d 2
# pd.merge(df3, df4) # this code fails because there are no common columns
# MergeError: No common columns to perform merge on. Merge options: left_on=None, right_on=None, left_index=False, right_index=False
pd.merge(df3, df4, left_on='lkey', right_on='rkey')
lkey data1 rkey data2
0 b 0 b 1
1 b 1 b 1
2 b 6 b 1
3 a 2 a 0
4 a 4 a 0
5 a 5 a 0

Here pd.merge() dropped row c from df3 and row d from df4 because pd.merge() inner joins by default, and an inner join keeps only the intersection of the left and right data frame keys. Further, rows a and b from df4 appear three times to match df3. If we want to keep rows c and d, we can outer join with how='outer' (shown here for df1 and df2).

pd.merge(df1, df2, how='outer')
key data1 data2
0 b 0.0000 1.0000
1 b 1.0000 1.0000
2 b 6.0000 1.0000
3 a 2.0000 0.0000
4 a 4.0000 0.0000
5 a 5.0000 0.0000
6 c 3.0000 NaN
7 d NaN 2.0000

Many-to-many merges have well-defined, though not necessarily intuitive, behavior.

df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'], 'data1': range(6)})
df2 = pd.DataFrame({'key': ['a', 'b', 'a', 'b', 'd'], 'data2': range(5)})
df1
key data1
0 b 0
1 b 1
2 a 2
3 c 3
4 a 4
5 b 5
df2
key data2
0 a 0
1 b 1
2 a 2
3 b 3
4 d 4
pd.merge(df1, df2, on='key')
key data1 data2
0 b 0 1
1 b 0 3
2 b 1 1
3 b 1 3
4 b 5 1
5 b 5 3
6 a 2 0
7 a 2 2
8 a 4 0
9 a 4 2

Many-to-many joins form the Cartesian product of the rows. Since there were three b rows in the left DataFrame and two in the right one, there are six b rows in the result. The join method only affects the distinct key values appearing in the result.

Be careful with many-to-many joins! In finance, we do not expect many-to-many joins because we expect at least one of the data frames to have unique observations. pandas will not warn us if we accidentally perform a many-to-many join instead of a one-to-one or many-to-one join.
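Although pandas will not warn us by default, pd.merge() accepts a validate argument that checks the merge type for us. A minimal sketch with the df1 and df2 from above:

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'], 'data1': range(6)})
df2 = pd.DataFrame({'key': ['a', 'b', 'a', 'b', 'd'], 'data2': range(5)})

# validate='one_to_many' asks pandas to verify that the left keys are unique;
# because 'b' repeats in df1, pandas raises a MergeError instead of silently
# forming a many-to-many Cartesian product
try:
    pd.merge(df1, df2, on='key', validate='one_to_many')
    print('merge succeeded')
except pd.errors.MergeError as error:
    print('MergeError:', error)
```
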

We can merge on more than one key. For example, we may merge two data sets on ticker-date pairs or industry-date pairs.

left = pd.DataFrame({'key1': ['foo', 'foo', 'bar'],
                     'key2': ['one', 'two', 'one'],
                     'lval': [1, 2, 3]})
right = pd.DataFrame({'key1': ['foo', 'foo', 'bar', 'bar'],
                      'key2': ['one', 'one', 'one', 'two'],
                      'rval': [4, 5, 6, 7]})
left
key1 key2 lval
0 foo one 1
1 foo two 2
2 bar one 3
right
key1 key2 rval
0 foo one 4
1 foo one 5
2 bar one 6
3 bar two 7
pd.merge(left, right, on=['key1', 'key2'], how='outer')
key1 key2 lval rval
0 foo one 1.0000 4.0000
1 foo one 1.0000 5.0000
2 foo two 2.0000 NaN
3 bar one 3.0000 6.0000
4 bar two NaN 7.0000

When column names overlap between the left and right data frames, pd.merge() appends _x and _y to the left and right versions of the overlapping column names.

pd.merge(left, right, on='key1')
key1 key2_x lval key2_y rval
0 foo one 1 one 4
1 foo one 1 one 5
2 foo two 2 one 4
3 foo two 2 one 5
4 bar one 3 one 6
5 bar one 3 two 7

I typically specify suffixes to avoid later confusion.

pd.merge(left, right, on='key1', suffixes=('_left', '_right'))
key1 key2_left lval key2_right rval
0 foo one 1 one 4
1 foo one 1 one 5
2 foo two 2 one 4
3 foo two 2 one 5
4 bar one 3 one 6
5 bar one 3 two 7

I read the pd.merge() docstring whenever I am in doubt. Table 8-2 lists the most commonly used arguments for pd.merge().

  • left: DataFrame to be merged on the left side.
  • right: DataFrame to be merged on the right side.
  • how: One of ‘inner’, ‘outer’, ‘left’, or ‘right’; defaults to ‘inner’.
  • on: Column names to join on. Must be found in both DataFrame objects. If not specified and no other join keys given will use the intersection of the column names in left and right as the join keys.
  • left_on: Columns in left DataFrame to use as join keys.
  • right_on: Analogous to left_on for the right DataFrame.
  • left_index: Use row index in left as its join key (or keys, if a MultiIndex).
  • right_index: Analogous to left_index.
  • sort: Sort merged data lexicographically by join keys; False by default in current pandas (sorting can hurt performance on large datasets).
  • suffixes: Tuple of string values to append to column names in case of overlap; defaults to (’_x’, ’_y’) (e.g., if ‘data’ in both DataFrame objects, would appear as ‘data_x’ and ‘data_y’ in result).
  • copy: If False, avoid copying data into resulting data structure in some exceptional cases; by default always copies.
  • indicator: Adds a special column _merge that indicates the source of each row; values will be ‘left_only’, ‘right_only’, or ‘both’ based on the origin of the joined data in each row.
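The indicator argument is handy for auditing a merge. A minimal sketch (df_a and df_b are illustration-only names):

```python
import pandas as pd

# Small hypothetical frames with partially overlapping keys
df_a = pd.DataFrame({'key': ['a', 'b', 'c'], 'lval': [1, 2, 3]})
df_b = pd.DataFrame({'key': ['b', 'c', 'd'], 'rval': [4, 5, 6]})

# indicator=True adds a _merge column tagging each row's origin
merged = pd.merge(df_a, df_b, on='key', how='outer', indicator=True)
print(merged['_merge'].value_counts())
```
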

3.2 Merging on Index

If we want to use pd.merge() to join on row indexes, we can use the left_index and right_index arguments.

left1 = pd.DataFrame({'key': ['a', 'b', 'a', 'a', 'b', 'c'], 'value': range(6)})
right1 = pd.DataFrame({'group_val': [3.5, 7]}, index=['a', 'b'])
left1
key value
0 a 0
1 b 1
2 a 2
3 a 3
4 b 4
5 c 5
right1
group_val
a 3.5000
b 7.0000
pd.merge(left1, right1, left_on='key', right_index=True, how='outer')
key value group_val
0 a 0 3.5000
2 a 2 3.5000
3 a 3 3.5000
1 b 1 7.0000
4 b 4 7.0000
5 c 5 NaN

The index arguments work for hierarchical indexes (multi-indexes), too.

lefth = pd.DataFrame({'key1': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
                      'key2': [2000, 2001, 2002, 2001, 2002],
                      'data': np.arange(5.)})
righth = pd.DataFrame(np.arange(12).reshape((6, 2)),
                      index=[['Nevada', 'Nevada', 'Ohio', 'Ohio', 'Ohio', 'Ohio'],
                             [2001, 2000, 2000, 2000, 2001, 2002]],
                      columns=['event1', 'event2'])
pd.merge(lefth, righth, left_on=['key1', 'key2'], right_index=True, how='outer')
key1 key2 data event1 event2
0 Ohio 2000 0.0000 4.0000 5.0000
0 Ohio 2000 0.0000 6.0000 7.0000
1 Ohio 2001 1.0000 8.0000 9.0000
2 Ohio 2002 2.0000 10.0000 11.0000
3 Nevada 2001 3.0000 0.0000 1.0000
4 Nevada 2002 4.0000 NaN NaN
4 Nevada 2000 NaN 2.0000 3.0000
left2 = pd.DataFrame([[1., 2.], [3., 4.], [5., 6.]],
                     index=['a', 'c', 'e'],
                     columns=['Ohio', 'Nevada'])
right2 = pd.DataFrame([[7., 8.], [9., 10.], [11., 12.], [13, 14]],
                      index=['b', 'c', 'd', 'e'],
                      columns=['Missouri', 'Alabama'])

If we use both left and right indexes, pd.merge() will keep the index.

pd.merge(left2, right2, how='outer', left_index=True, right_index=True)
Ohio Nevada Missouri Alabama
a 1.0000 2.0000 NaN NaN
b NaN NaN 7.0000 8.0000
c 3.0000 4.0000 9.0000 10.0000
d NaN NaN 11.0000 12.0000
e 5.0000 6.0000 13.0000 14.0000

DataFrame has a convenient join instance for merging by index. It can also be used to combine together many DataFrame objects having the same or similar indexes but non-overlapping columns.

If we have matching indexes on left and right, we can use .join().

left2
Ohio Nevada
a 1.0000 2.0000
c 3.0000 4.0000
e 5.0000 6.0000
right2
Missouri Alabama
b 7.0000 8.0000
c 9.0000 10.0000
d 11.0000 12.0000
e 13.0000 14.0000
left2.join(right2, how='outer')
Ohio Nevada Missouri Alabama
a 1.0000 2.0000 NaN NaN
b NaN NaN 7.0000 8.0000
c 3.0000 4.0000 9.0000 10.0000
d NaN NaN 11.0000 12.0000
e 5.0000 6.0000 13.0000 14.0000

The .join() method left joins by default. The .join() method uses indexes, so it requires few arguments and accepts a list of data frames.

another = pd.DataFrame(
    data=[[7., 8.], [9., 10.], [11., 12.], [16., 17.]],
    index=['a', 'c', 'e', 'f'],
    columns=['New York', 'Oregon']
)

another
New York Oregon
a 7.0000 8.0000
c 9.0000 10.0000
e 11.0000 12.0000
f 16.0000 17.0000
left2.join([right2, another])
Ohio Nevada Missouri Alabama New York Oregon
a 1.0000 2.0000 NaN NaN 7.0000 8.0000
c 3.0000 4.0000 9.0000 10.0000 9.0000 10.0000
e 5.0000 6.0000 13.0000 14.0000 11.0000 12.0000
left2.join([right2, another], how='outer')
Ohio Nevada Missouri Alabama New York Oregon
a 1.0000 2.0000 NaN NaN 7.0000 8.0000
c 3.0000 4.0000 9.0000 10.0000 9.0000 10.0000
e 5.0000 6.0000 13.0000 14.0000 11.0000 12.0000
b NaN NaN 7.0000 8.0000 NaN NaN
d NaN NaN 11.0000 12.0000 NaN NaN
f NaN NaN NaN NaN 16.0000 17.0000

3.3 Concatenating Along an Axis

The pd.concat() function provides a flexible way to combine data frames and series along either axis. I typically use pd.concat() to combine:

  1. A list of data frames with similar layouts
  2. A list of series because series do not have .join() or .merge() methods

The first is handy if we have to read and combine a directory of .csv files.
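A minimal sketch of the directory use case, writing two small CSV files to a temporary directory and then stacking them with pd.concat(). The file names and contents are made up for illustration.

```python
import glob
import os
import tempfile

import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    # Write two hypothetical CSV files with the same layout
    pd.DataFrame({'ticker': ['AAPL'], 'ret': [0.01]}).to_csv(
        os.path.join(tmp, 'aapl.csv'), index=False)
    pd.DataFrame({'ticker': ['MSFT'], 'ret': [0.02]}).to_csv(
        os.path.join(tmp, 'msft.csv'), index=False)

    # Read every CSV in the directory and stack the rows
    files = sorted(glob.glob(os.path.join(tmp, '*.csv')))
    combined = pd.concat([pd.read_csv(f) for f in files], ignore_index=True)

print(combined)
```
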

s1 = pd.Series([0, 1], index=['a', 'b'])
s2 = pd.Series([2, 3, 4], index=['c', 'd', 'e'])
s3 = pd.Series([5, 6], index=['f', 'g'])
pd.concat([s1, s2, s3])
a    0
b    1
c    2
d    3
e    4
f    5
g    6
dtype: int64
pd.concat([s1, s2, s3], axis=1)
0 1 2
a 0.0000 NaN NaN
b 1.0000 NaN NaN
c NaN 2.0000 NaN
d NaN 3.0000 NaN
e NaN 4.0000 NaN
f NaN NaN 5.0000
g NaN NaN 6.0000
result = pd.concat([s1, s2, s3], keys=['one', 'two', 'three'])

result
one    a    0
       b    1
two    c    2
       d    3
       e    4
three  f    5
       g    6
dtype: int64
result.unstack()
a b c d e f g
one 0.0000 1.0000 NaN NaN NaN NaN NaN
two NaN NaN 2.0000 3.0000 4.0000 NaN NaN
three NaN NaN NaN NaN NaN 5.0000 6.0000
pd.concat([s1, s2, s3], axis=1, keys=['one', 'two', 'three'])
one two three
a 0.0000 NaN NaN
b 1.0000 NaN NaN
c NaN 2.0000 NaN
d NaN 3.0000 NaN
e NaN 4.0000 NaN
f NaN NaN 5.0000
g NaN NaN 6.0000
df1 = pd.DataFrame(np.arange(6).reshape(3, 2), index=['a', 'b', 'c'], columns=['one', 'two'])
df2 = pd.DataFrame(5 + np.arange(4).reshape(2, 2), index=['a', 'c'], columns=['three', 'four'])
pd.concat([df1, df2], axis=1, keys=['level1', 'level2'])
level1 level2
one two three four
a 0 1 5.0000 6.0000
b 2 3 NaN NaN
c 4 5 7.0000 8.0000
pd.concat([df1, df2], axis=1, keys=['level1', 'level2'], names=['upper', 'lower'])
upper level1 level2
lower one two three four
a 0 1 5.0000 6.0000
b 2 3 NaN NaN
c 4 5 7.0000 8.0000

4 Reshaping and Pivoting

Above, we briefly explored reshaping data with .stack() and .unstack(). Here we explore reshaping in more depth.

4.1 Reshaping with Hierarchical Indexing

Hierarchical indexes (multi-indexes) help reshape data.

There are two primary actions:

  • stack: This “rotates” or pivots from the columns in the data to the rows
  • unstack: This pivots from the rows into the columns

data = pd.DataFrame(np.arange(6).reshape((2, 3)),
                    index=pd.Index(['Ohio', 'Colorado'], name='state'),
                    columns=pd.Index(['one', 'two', 'three'],
                    name='number'))

data
number one two three
state
Ohio 0 1 2
Colorado 3 4 5
result = data.stack()
result
state     number
Ohio      one       0
          two       1
          three     2
Colorado  one       3
          two       4
          three     5
dtype: int32
result.unstack()
number one two three
state
Ohio 0 1 2
Colorado 3 4 5
s1 = pd.Series([0, 1, 2, 3], index=['a', 'b', 'c', 'd'])
s2 = pd.Series([4, 5, 6], index=['c', 'd', 'e'])
data2 = pd.concat([s1, s2], keys=['one', 'two'])
data2
one  a    0
     b    1
     c    2
     d    3
two  c    4
     d    5
     e    6
dtype: int64
data2.unstack()
a b c d e
one 0.0000 1.0000 2.0000 3.0000 NaN
two NaN NaN 4.0000 5.0000 6.0000

Un-stacking may introduce missing values because data frames are rectangular. By default, stacking drops these missing values.

data2.unstack().stack()
one  a   0.0000
     b   1.0000
     c   2.0000
     d   3.0000
two  c   4.0000
     d   5.0000
     e   6.0000
dtype: float64

However, we can keep missing values with dropna=False.

data2.unstack().stack(dropna=False)
one  a   0.0000
     b   1.0000
     c   2.0000
     d   3.0000
     e      NaN
two  a      NaN
     b      NaN
     c   4.0000
     d   5.0000
     e   6.0000
dtype: float64
df = pd.DataFrame({
    'left': result, 
    'right': result + 5
    },
    columns=pd.Index(['left', 'right'], name='side')
)

df
side left right
state number
Ohio one 0 5
two 1 6
three 2 7
Colorado one 3 8
two 4 9
three 5 10

If we un-stack a data frame, the un-stacked level becomes the innermost level of the resulting column index.

df.unstack('state')
side left right
state Ohio Colorado Ohio Colorado
number
one 0 3 5 8
two 1 4 6 9
three 2 5 7 10

We can chain .stack() and .unstack() to rearrange our data.

df.unstack('state').stack('side')
state Ohio Colorado
number side
one left 0 3
right 5 8
two left 1 4
right 6 9
three left 2 5
right 7 10

McKinney provides two more subsections on reshaping data with the .pivot() and .melt() methods. Unlike the stacking methods, the pivoting methods do not require an index, and the related pd.pivot_table() function can aggregate data. We will skip these methods for now.