McKinney Chapter 8 - Data Wrangling: Join, Combine, and Reshape

FINA 6333 for Spring 2024

Author

Richard Herron

1 Introduction

Chapter 8 of Wes McKinney’s Python for Data Analysis introduces a few important pandas concepts:

Joining or merging is combining 2+ data frames on 1+ indexes or columns into 1 data frame
Reshaping is rearranging data frames so it has fewer columns and more rows (wide to long) or more columns and fewer rows (long to wide); we can also reshape a series to a data frame and vice versa

Note: Indented block quotes are from McKinney unless otherwise indicated. The section numbers here differ from McKinney because we will only discuss some topics.

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pandas_datareader as pdr
import yfinance as yf

%precision 4
pd.options.display.float_format = '{:.4f}'.format
%config InlineBackend.figure_format = 'retina'

2 Hierarchical Indexing

We need to learn about hierarchical indexing before we learn about combining and reshaping data. A hierarchical index gives two or more index levels to an axis. For example, we could index rows by ticker and date. Or we could index columns by variable and ticker. Hierarchical indexing helps us work with high-dimensional data in a low-dimensional form.

np.random.seed(42)
data = pd.Series(
    data=np.random.randn(9),
    index=[
        ['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'],
        [1, 2, 3, 1, 3, 1, 2, 2, 3]
    ]
)

data

a  1    0.4967
   2   -0.1383
   3    0.6477
b  1    1.5230
   3   -0.2342
c  1   -0.2341
   2    1.5792
d  2    0.7674
   3   -0.4695
dtype: float64

We can partially index this series to concisely subset data.

data['b']

1    1.5230
3   -0.2342
dtype: float64

data['b':'c']

b  1    1.5230
   3   -0.2342
c  1   -0.2341
   2    1.5792
dtype: float64

data.loc[['b', 'd']]

b  1    1.5230
   3   -0.2342
d  2    0.7674
   3   -0.4695
dtype: float64

We can subset on the index inner level, too. Here the first : slices all values in the outer index.

data.loc[:, 2]

a   -0.1383
c    1.5792
d    0.7674
dtype: float64

Here data has a stacked format. For each outer index level (the letters), we have multiple observations based on the inner index level (the numbers). We can un-stack data to convert the inner index level to columns.

data.unstack()

	1	2	3
a	0.4967	-0.1383	0.6477
b	1.5230	NaN	-0.2342
c	-0.2341	1.5792	NaN
d	NaN	0.7674	-0.4695

data.unstack().stack()

a  1    0.4967
   2   -0.1383
   3    0.6477
b  1    1.5230
   3   -0.2342
c  1   -0.2341
   2    1.5792
d  2    0.7674
   3   -0.4695
dtype: float64

We can create a data frame with hieracrhical indexes or multi-indexes on rows and columns.

frame = pd.DataFrame(
    data=np.arange(12).reshape((4, 3)),
    index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
    columns=[['Ohio', 'Ohio', 'Colorado'], ['Green', 'Red', 'Green']]
)

frame

		Ohio		Colorado
		Green	Red	Green
a	1	0	1	2
a	2	3	4	5
b	1	6	7	8
b	2	9	10	11

We can name these multi-indexes but names are not required.

frame.index.names = ['key1', 'key2']
frame.columns.names = ['state', 'color']
frame

	state	Ohio		Colorado
	color	Green	Red	Green
key1	key2
a	1	0	1	2
a	2	3	4	5
b	1	6	7	8
b	2	9	10	11

Recall that df[val] selects the val column. Here frame has a multi-index for the columns, so frame['Ohio'] selects all columns with Ohio as the outer index level.

frame['Ohio']

	color	Green	Red
key1	key2
a	1	0	1
a	2	3	4
b	1	6	7
b	2	9	10

We can pass a tuple if we only want one column.

frame[[('Ohio', 'Green')]]

	state	Ohio
	color	Green
key1	key2
a	1	0
a	2	3
b	1	6
b	2	9

We have to do a more work to slice the inner level of the column index.

frame.loc[:, (slice(None), 'Green')]

	state	Ohio	Colorado
	color	Green	Green
key1	key2
a	1	0	2
a	2	3	5
b	1	6	8
b	2	9	11

We can use pd.IndexSlice[:, 'Green'] an alternative to (slice(None), 'Green').

frame.loc[:, pd.IndexSlice[:, 'Green']]

	state	Ohio	Colorado
	color	Green	Green
key1	key2
a	1	0	2
a	2	3	5
b	1	6	8
b	2	9	11

2.1 Reordering and Sorting Levels

We can swap index levels with the .swaplevel() method. The default arguments are i=-2 and j=-1, which swap the two innermost index levels.

frame

	state	Ohio		Colorado
	color	Green	Red	Green
key1	key2
a	1	0	1	2
a	2	3	4	5
b	1	6	7	8
b	2	9	10	11

frame.swaplevel()

	state	Ohio		Colorado
	color	Green	Red	Green
key2	key1
1	a	0	1	2
2	a	3	4	5
1	b	6	7	8
2	b	9	10	11

We can use index names, too.

frame.swaplevel('key1', 'key2')

	state	Ohio		Colorado
	color	Green	Red	Green
key2	key1
1	a	0	1	2
2	a	3	4	5
1	b	6	7	8
2	b	9	10	11

We can also sort on an index (or list of indexes). After we swap levels, we may want to sort our data.

frame

	state	Ohio		Colorado
	color	Green	Red	Green
key1	key2
a	1	0	1	2
a	2	3	4	5
b	1	6	7	8
b	2	9	10	11

frame.sort_index(level=1)

	state	Ohio		Colorado
	color	Green	Red	Green
key1	key2
a	1	0	1	2
b	1	6	7	8
a	2	3	4	5
b	2	9	10	11

Again, we can give index names, too.

frame.sort_index(level='key2')

	state	Ohio		Colorado
	color	Green	Red	Green
key1	key2
a	1	0	1	2
b	1	6	7	8
a	2	3	4	5
b	2	9	10	11

We can sort by two or more index levels by passing a list of index levels or names.

frame.sort_index(level=[0, 1])

	state	Ohio		Colorado
	color	Green	Red	Green
key1	key2
a	1	0	1	2
a	2	3	4	5
b	1	6	7	8
b	2	9	10	11

We can chain these methods, too.

frame.swaplevel(0, 1).sort_index(level=0)

	state	Ohio		Colorado
	color	Green	Red	Green
key2	key1
1	a	0	1	2
1	b	6	7	8
2	a	3	4	5
2	b	9	10	11

2.2 Indexing with a DataFrame’s columns

We can convert a column into an index and an index into a column with the .set_index() and .reset_index() methods.

frame = pd.DataFrame({
    'a': range(7), 
    'b': range(7, 0, -1),
    'c': ['one', 'one', 'one', 'two', 'two','two', 'two'],
    'd': [0, 1, 2, 0, 1, 2, 3]
})

frame

	a	b	c	d
0	0	7	one	0
1	1	6	one	1
2	2	5	one	2
3	3	4	two	0
4	4	3	two	1
5	5	2	two	2
6	6	1	two	3

The .set_index() method converts columns to indexes, and removes the columns from the data frame by default.

frame2 = frame.set_index(['c', 'd'])
frame2

		a	b
c	d
one	0	0	7
	1	1	6
	2	2	5
two	0	3	4
	1	4	3
	2	5	2
	3	6	1

The .reset_index() method removes the indexes, adds them as columns, and sets in integer index.

frame2.reset_index()

	c	d	a	b
0	one	0	0	7
1	one	1	1	6
2	one	2	2	5
3	two	0	3	4
4	two	1	4	3
5	two	2	5	2
6	two	3	6	1

3 Combining and Merging Datasets

pandas provides several methods and functions to combine and merge data. We can typically create the same output with any of these methods or functions, but one may be more efficient than the others. If I want to combine data frames with similar indexes, I try the .join() method first. The .join() also lets use can combine more than two data frames at once. Otherwise, I try the .merge() method, which has a function pd.merge(), too. The pd.merge() function is more general than the .join() method, so we will start with pd.merge().

The pandas website provides helpful visualizations.

3.1 Database-Style DataFrame Joins

Merge or join operations combine datasets by linking rows using one or more keys. These operations are central to relational databases (e.g., SQL-based). The merge function in pandas is the main entry point for using these algorithms on your data.

We will start with the pd.merge() syntax, but pandas also has .merge() and .join() methods. Learning these other syntaxes is easy once we understand the pd.merge() syntax.

df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'], 'data1': range(7)})
df2 = pd.DataFrame({'key': ['a', 'b', 'd'], 'data2': range(3)})

df1

	key	data1
0	b	0
1	b	1
2	a	2
3	c	3
4	a	4
5	a	5
6	b	6

df2

	key	data2
0	a	0
1	b	1
2	d	2

pd.merge(df1, df2)

	key	data1	data2
0	b	0	1
1	b	1	1
2	b	6	1
3	a	2	0
4	a	4	0
5	a	5	0

The default how is how='inner', so pd.merge() inner joins left and right data frames by default, keeping only rows that appear in both. We can specify how='outer', so pd.merge() outer joins left and right data frames, keeping all rows that appear in either.

pd.merge(df1, df2, how='outer')

	key	data1	data2
0	b	0.0000	1.0000
1	b	1.0000	1.0000
2	b	6.0000	1.0000
3	a	2.0000	0.0000
4	a	4.0000	0.0000
5	a	5.0000	0.0000
6	c	3.0000	NaN
7	d	NaN	2.0000

A left merge keeps only rows that appear in the left data frame.

pd.merge(df1, df2, how='left')

	key	data1	data2
0	b	0	1.0000
1	b	1	1.0000
2	a	2	0.0000
3	c	3	NaN
4	a	4	0.0000
5	a	5	0.0000
6	b	6	1.0000

A rights merge keeps only rows that appear in the right data frame.

pd.merge(df1, df2, how='right')

	key	data1	data2
0	a	2.0000	0
1	a	4.0000	0
2	a	5.0000	0
3	b	0.0000	1
4	b	1.0000	1
5	b	6.0000	1
6	d	NaN	2

By default, pd.merge() merges on all columns that appear in both data frames.

on : label or list Column or index level names to join on. These must be found in both DataFrames. If on is None and not merging on indexes then this defaults to the intersection of the columns in both DataFrames.

Here key is the only common column between df1 and df2. We should specify on to avoid unexpected results.

pd.merge(df1, df2, on='key')

	key	data1	data2
0	b	0	1
1	b	1	1
2	b	6	1
3	a	2	0
4	a	4	0
5	a	5	0

We must specify left_on and right_on if our left and right data frames do not have a common column.

df3 = pd.DataFrame({'lkey': ['b', 'b', 'a', 'c', 'a', 'a', 'b'], 'data1': range(7)})
df4 = pd.DataFrame({'rkey': ['a', 'b', 'd'], 'data2': range(3)})

df3

	lkey	data1
0	b	0
1	b	1
2	a	2
3	c	3
4	a	4
5	a	5
6	b	6

df4

	rkey	data2
0	a	0
1	b	1
2	d	2

# pd.merge(df3, df4) # this code fails/errors because there are not common columns
# MergeError: No common columns to perform merge on. Merge options: left_on=None, right_on=None, left_index=False, right_index=False

pd.merge(df3, df4, left_on='lkey', right_on='rkey')

	lkey	data1	rkey	data2
0	b	0	b	1
1	b	1	b	1
2	b	6	b	1
3	a	2	a	0
4	a	4	a	0
5	a	5	a	0

Here pd.merge() dropped row c from df3 and row d from df4. Rows c and d dropped because pd.merge() inner joins be default. An inner join keeps the intersection of the left and right data frame keys. Further, rows a and b from df4 appear three times to match df3. If we want to keep rows c and d, we can outer join df3 and df4 with how='outer'.

pd.merge(df1, df2, how='outer')

	key	data1	data2
0	b	0.0000	1.0000
1	b	1.0000	1.0000
2	b	6.0000	1.0000
3	a	2.0000	0.0000
4	a	4.0000	0.0000
5	a	5.0000	0.0000
6	c	3.0000	NaN
7	d	NaN	2.0000

Many-to-many merges have well-defined, though not necessarily intuitive, behavior.

df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'], 'data1': range(6)})
df2 = pd.DataFrame({'key': ['a', 'b', 'a', 'b', 'd'], 'data2': range(5)})

df1

	key	data1
0	b	0
1	b	1
2	a	2
3	c	3
4	a	4
5	b	5

df2

	key	data2
0	a	0
1	b	1
2	a	2
3	b	3
4	d	4

pd.merge(df1, df2, on='key')

	key	data1	data2
0	b	0	1
1	b	0	3
2	b	1	1
3	b	1	3
4	b	5	1
5	b	5	3
6	a	2	0
7	a	2	2
8	a	4	0
9	a	4	2

Many-to-many joins form the Cartesian product of the rows. Since there were three b rows in the left DataFrame and two in the right one, there are six b rows in the result. The join method only affects the distinct key values appearing in the result.

Be careful with many-to-many joins! In finance, we do not expect many-to-many joins because we expect at least one of the data frames to have unique observations. pandas will not warn us if we accidentally perform a many-to-many join instead of a one-to-one or many-to-one join.

We can merge on more than one key. For example, we may merge two data sets on ticker-date pairs or industry-date pairs.

left = pd.DataFrame({'key1': ['foo', 'foo', 'bar'],
                     'key2': ['one', 'two', 'one'],
                     'lval': [1, 2, 3]})
right = pd.DataFrame({'key1': ['foo', 'foo', 'bar', 'bar'],
                      'key2': ['one', 'one', 'one', 'two'],
                      'rval': [4, 5, 6, 7]})

left

	key1	key2	lval
0	foo	one	1
1	foo	two	2
2	bar	one	3

right

	key1	key2	rval
0	foo	one	4
1	foo	one	5
2	bar	one	6
3	bar	two	7

pd.merge(left, right, on=['key1', 'key2'], how='outer')

	key1	key2	lval	rval
0	foo	one	1.0000	4.0000
1	foo	one	1.0000	5.0000
2	foo	two	2.0000	NaN
3	bar	one	3.0000	6.0000
4	bar	two	NaN	7.0000

When column names overlap between the left and right data frames, pd.merge() appends _x and _y to the left and right versions of the overlapping column names.

pd.merge(left, right, on='key1')

	key1	key2_x	lval	key2_y	rval
0	foo	one	1	one	4
1	foo	one	1	one	5
2	foo	two	2	one	4
3	foo	two	2	one	5
4	bar	one	3	one	6
5	bar	one	3	two	7

I typically specify suffixes to avoid later confusion.

pd.merge(left, right, on='key1', suffixes=('_left', '_right'))

	key1	key2_left	lval	key2_right	rval
0	foo	one	1	one	4
1	foo	one	1	one	5
2	foo	two	2	one	4
3	foo	two	2	one	5
4	bar	one	3	one	6
5	bar	one	3	two	7

I read the pd.merge() docstring whenever I am in doubt. Table 8-2 lists the most commonly used arguments for pd.merge().

left: DataFrame to be merged on the left side.

right: DataFrame to be merged on the right side.

how: One of ‘inner’, ‘outer’, ‘left’, or ‘right’; defaults to ‘inner’.

on: Column names to join on. Must be found in both DataFrame objects. If not specified and no other join keys given will use the intersection of the column names in left and right as the join keys.

left_on: Columns in left DataFrame to use as join keys.

right_on: Analogous to left_on for left DataFrame.

left_index: Use row index in left as its join key (or keys, if a MultiIndex).

right_index: Analogous to left_index.

sort: Sort merged data lexicographically by join keys; True by default (disable to get better performance in some cases on large datasets).

suffixes: Tuple of string values to append to column names in case of overlap; defaults to (’_x’, ’_y’) (e.g., if ‘data’ in both DataFrame objects, would appear as ‘data_x’ and ‘data_y’ in result).

copy: If False, avoid copying data into resulting data structure in some exceptional cases; by default always copies.

indicator: Adds a special column _merge that indicates the source of each row; values will be ‘left_only’, ‘right_only’, or ‘both’ based on the origin of the joined data in each row.

3.2 Merging on Index

If we want to use pd.merge() to join on row indexes, we can use the left_index and right_index arguments.

left1 = pd.DataFrame({'key': ['a', 'b', 'a', 'a', 'b', 'c'], 'value': range(6)})
right1 = pd.DataFrame({'group_val': [3.5, 7]}, index=['a', 'b'])

left1

	key	value
0	a	0
1	b	1
2	a	2
3	a	3
4	b	4
5	c	5

right1

	group_val
a	3.5000
b	7.0000

pd.merge(left1, right1, left_on='key', right_index=True, how='outer')

	key	value	group_val
0	a	0	3.5000
2	a	2	3.5000
3	a	3	3.5000
1	b	1	7.0000
4	b	4	7.0000
5	c	5	NaN

The index arguments work for hierarchical indexes (multi indexes), too.

lefth = pd.DataFrame({'key1': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
                      'key2': [2000, 2001, 2002, 2001, 2002],
                      'data': np.arange(5.)})
righth = pd.DataFrame(np.arange(12).reshape((6, 2)),
                      index=[['Nevada', 'Nevada', 'Ohio', 'Ohio', 'Ohio', 'Ohio'],
                             [2001, 2000, 2000, 2000, 2001, 2002]],
                      columns=['event1', 'event2'])

pd.merge(lefth, righth, left_on=['key1', 'key2'], right_index=True, how='outer')

	key1	key2	data	event1	event2
0	Ohio	2000	0.0000	4.0000	5.0000
0	Ohio	2000	0.0000	6.0000	7.0000
1	Ohio	2001	1.0000	8.0000	9.0000
2	Ohio	2002	2.0000	10.0000	11.0000
3	Nevada	2001	3.0000	0.0000	1.0000
4	Nevada	2002	4.0000	NaN	NaN
4	Nevada	2000	NaN	2.0000	3.0000

left2 = pd.DataFrame([[1., 2.], [3., 4.], [5., 6.]],
                     index=['a', 'c', 'e'],
                     columns=['Ohio', 'Nevada'])
right2 = pd.DataFrame([[7., 8.], [9., 10.], [11., 12.], [13, 14]],
                      index=['b', 'c', 'd', 'e'],
                      columns=['Missouri', 'Alabama'])

If we use both left and right indexes, pd.merge() will keep the index.

pd.merge(left2, right2, how='outer', left_index=True, right_index=True)

	Ohio	Nevada	Missouri	Alabama
a	1.0000	2.0000	NaN	NaN
b	NaN	NaN	7.0000	8.0000
c	3.0000	4.0000	9.0000	10.0000
d	NaN	NaN	11.0000	12.0000
e	5.0000	6.0000	13.0000	14.0000

DataFrame has a convenient join instance for merging by index. It can also be used to combine together many DataFrame objects having the same or similar indexes but non-overlapping columns.

If we have matching indexes on left and right, we can use .join().

left2

	Ohio	Nevada
a	1.0000	2.0000
c	3.0000	4.0000
e	5.0000	6.0000

right2

	Missouri	Alabama
b	7.0000	8.0000
c	9.0000	10.0000
d	11.0000	12.0000
e	13.0000	14.0000

left2.join(right2, how='outer')

	Ohio	Nevada	Missouri	Alabama
a	1.0000	2.0000	NaN	NaN
b	NaN	NaN	7.0000	8.0000
c	3.0000	4.0000	9.0000	10.0000
d	NaN	NaN	11.0000	12.0000
e	5.0000	6.0000	13.0000	14.0000

The .join() method left joins by default. The .join() method uses indexes, so it requires few arguments and accepts a list of data frames.

another = pd.DataFrame(
    data=[[7., 8.], [9., 10.], [11., 12.], [16., 17.]],
    index=['a', 'c', 'e', 'f'],
    columns=['New York', 'Oregon']
)

another

	New York	Oregon
a	7.0000	8.0000
c	9.0000	10.0000
e	11.0000	12.0000
f	16.0000	17.0000

left2.join([right2, another])

	Ohio	Nevada	Missouri	Alabama	New York	Oregon
a	1.0000	2.0000	NaN	NaN	7.0000	8.0000
c	3.0000	4.0000	9.0000	10.0000	9.0000	10.0000
e	5.0000	6.0000	13.0000	14.0000	11.0000	12.0000

left2.join([right2, another], how='outer')

	Ohio	Nevada	Missouri	Alabama	New York	Oregon
a	1.0000	2.0000	NaN	NaN	7.0000	8.0000
c	3.0000	4.0000	9.0000	10.0000	9.0000	10.0000
e	5.0000	6.0000	13.0000	14.0000	11.0000	12.0000
b	NaN	NaN	7.0000	8.0000	NaN	NaN
d	NaN	NaN	11.0000	12.0000	NaN	NaN
f	NaN	NaN	NaN	NaN	16.0000	17.0000

3.3 Concatenating Along an Axis

The pd.concat() function provides a flexible way to combine data frames and series along either axis. I typically use pd.concat() to combine:

A list of data frames with similar layouts
A list of series because series do not have .join() or .merge() methods

The first is handy if we have to read and combine a directory of .csv files.

s1 = pd.Series([0, 1], index=['a', 'b'])
s2 = pd.Series([2, 3, 4], index=['c', 'd', 'e'])
s3 = pd.Series([5, 6], index=['f', 'g'])

pd.concat([s1, s2, s3])

a    0
b    1
c    2
d    3
e    4
f    5
g    6
dtype: int64

pd.concat([s1, s2, s3], axis=1)

	0	1	2
a	0.0000	NaN	NaN
b	1.0000	NaN	NaN
c	NaN	2.0000	NaN
d	NaN	3.0000	NaN
e	NaN	4.0000	NaN
f	NaN	NaN	5.0000
g	NaN	NaN	6.0000

result = pd.concat([s1, s2, s3], keys=['one', 'two', 'three'])

result

one    a    0
       b    1
two    c    2
       d    3
       e    4
three  f    5
       g    6
dtype: int64

result.unstack()

	a	b	c	d	e	f	g
one	0.0000	1.0000	NaN	NaN	NaN	NaN	NaN
two	NaN	NaN	2.0000	3.0000	4.0000	NaN	NaN
three	NaN	NaN	NaN	NaN	NaN	5.0000	6.0000

pd.concat([s1, s2, s3], axis=1, keys=['one', 'two', 'three'])

	one	two	three
a	0.0000	NaN	NaN
b	1.0000	NaN	NaN
c	NaN	2.0000	NaN
d	NaN	3.0000	NaN
e	NaN	4.0000	NaN
f	NaN	NaN	5.0000
g	NaN	NaN	6.0000

df1 = pd.DataFrame(np.arange(6).reshape(3, 2), index=['a', 'b', 'c'], columns=['one', 'two'])
df2 = pd.DataFrame(5 + np.arange(4).reshape(2, 2), index=['a', 'c'], columns=['three', 'four'])

pd.concat([df1, df2], axis=1, keys=['level1', 'level2'])

	level1		level2
	one	two	three	four
a	0	1	5.0000	6.0000
b	2	3	NaN	NaN
c	4	5	7.0000	8.0000

pd.concat([df1, df2], axis=1, keys=['level1', 'level2'], names=['upper', 'lower'])

upper	level1		level2
lower	one	two	three	four
a	0	1	5.0000	6.0000
b	2	3	NaN	NaN
c	4	5	7.0000	8.0000

4 Reshaping and Pivoting

Above, we briefly explore reshaping data with .stack() and .unstack(). Here we explore reshaping data more deeply.

4.1 Reshaping with Hierarchical Indexing

Hierarchical indexes (multi-indexes) help reshape data.

There are two primary actions: - stack: This “rotates” or pivots from the columns in the data to the rows - unstack: This pivots from the rows into the columns

data = pd.DataFrame(np.arange(6).reshape((2, 3)),
                    index=pd.Index(['Ohio', 'Colorado'], name='state'),
                    columns=pd.Index(['one', 'two', 'three'],
                    name='number'))

data

number	one	two	three
state
Ohio	0	1	2
Colorado	3	4	5

result = data.stack()
result

state     number
Ohio      one       0
          two       1
          three     2
Colorado  one       3
          two       4
          three     5
dtype: int32

result.unstack()

number	one	two	three
state
Ohio	0	1	2
Colorado	3	4	5

s1 = pd.Series([0, 1, 2, 3], index=['a', 'b', 'c', 'd'])
s2 = pd.Series([4, 5, 6], index=['c', 'd', 'e'])
data2 = pd.concat([s1, s2], keys=['one', 'two'])
data2

one  a    0
     b    1
     c    2
     d    3
two  c    4
     d    5
     e    6
dtype: int64

data2.unstack()

	a	b	c	d	e
one	0.0000	1.0000	2.0000	3.0000	NaN
two	NaN	NaN	4.0000	5.0000	6.0000

Un-stacking may introduce missing values because data frames are rectangular. By default, stacking drops these missing values.

data2.unstack().stack()

one  a   0.0000
     b   1.0000
     c   2.0000
     d   3.0000
two  c   4.0000
     d   5.0000
     e   6.0000
dtype: float64

However, we can keep missing values with dropna=False.

data2.unstack().stack(dropna=False)

one  a   0.0000
     b   1.0000
     c   2.0000
     d   3.0000
     e      NaN
two  a      NaN
     b      NaN
     c   4.0000
     d   5.0000
     e   6.0000
dtype: float64

df = pd.DataFrame({
    'left': result, 
    'right': result + 5
    },
    columns=pd.Index(['left', 'right'], name='side')
)

df

	side	left	right
state	number
Ohio	one	0	5
	two	1	6
	three	2	7
Colorado	one	3	8
	two	4	9
	three	5	10

If we un-stack a data frame, the un-stacked level becomes the innermost level in the resulting index.

df.unstack('state')

side	left		right
state	Ohio	Colorado	Ohio	Colorado
number
one	0	3	5	8
two	1	4	6	9
three	2	5	7	10

We can chain .stack() and .unstack() to rearrange our data.

df.unstack('state').stack('side')

	state	Ohio	Colorado
number	side
one	left	0	3
one	right	5	8
two	left	1	4
two	right	6	9
three	left	2	5
three	right	7	10

McKinney provides two more subsections on reshaping data with the .pivot() and .melt() methods. Unlike, the stacking methods, the pivoting methods can aggregate data and do not require an index. We will skip these additional aggregation methods for now.