McKinney Chapter 4 - NumPy Basics: Arrays and Vectorized Computation

FINA 6333 for Spring 2024

Author

Richard Herron

1 Introduction

Chapter 4 of Wes McKinney’s Python for Data Analysis discusses the NumPy package (an abbreviation of numerical Python), which is the foundation for numerical computing in Python, including pandas.

We will focus on:

Creating arrays
Slicing arrays
Applying functions and methods to arrays
Using conditional logic with arrays (i.e., np.where() and np.select())

Note: Indented block quotes are from McKinney unless otherwise indicated. The section numbers here differ from McKinney because we will only discuss some topics.

The typical abbreviation for NumPy is np.

import numpy as np

The “magic” function %precision 4 tells JupyterLab to print NumPy arrays to 4 decimals. This magic function only changes the printed precision and does not change the stored precision of the underlying values.

%precision 4

'%.4f'

McKinney thoroughly discuesses the history on NumPy, as well as its technical advantages. But here is a simple illustration of the speed and syntax advantages of NumPy of Python’s built-in data structures. First, we create a list and array with values from 0 to 999,999.

my_list = list(range(1_000_000))

my_arr = np.arange(1_000_000)

my_list[:5]

[0, 1, 2, 3, 4]

my_arr[:5]

array([0, 1, 2, 3, 4])

If we want to double each value in my_list we have to use a for loop or a list comprehension.

len(my_list * 2) # concatenates two copies of my_list

# [2 * x for x in my_list] # list comprehension to double each value

However, we can multiply my_arr by two because math “just works” with NumPy.

my_arr * 2

array([      0,       2,       4, ..., 1999994, 1999996, 1999998])

We can use the “magic” function %timeit to time these two calculations.

%timeit [x * 2 for x in my_list]

61.2 ms ± 2.09 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit my_arr * 2

1.18 ms ± 47.9 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

The NumPy version is a hundred times faster than the list version. The NumPy version is also faster to type, read, and troubleshoot, which are typically more important. Our time is more valuable than the computer time!

2 The NumPy ndarray: A Multidimensional Array Object

One of the key features of NumPy is its N-dimensional array object, or ndarray, which is a fast, flexible container for large datasets in Python. Arrays enable you to perform mathematical operations on whole blocks of data using similar syntax to the equivalent operations between scalar elements.

We generate random data to explore NumPy arrays. Whenever we generate random data, we should set the random number seed with np.random.seed(42), which makes our random numbers repeatable. If we use the same random number seed, our random numbers will be the same.

np.random.seed(42)
data = np.random.randn(2, 3)
data

array([[ 0.4967, -0.1383,  0.6477],
       [ 1.523 , -0.2342, -0.2341]])

Multiplying data by 10 multiplies each element in data by 10, and adding data to itself adds each element to itself (i.e., element-wise addition). To achieve this common-sense behavior, NumPy arrays must contain homogeneous data types (e.g., all floats or all integers).

data * 10

array([[ 4.9671, -1.3826,  6.4769],
       [15.2303, -2.3415, -2.3414]])

data_2 = data + data
data_2

array([[ 0.9934, -0.2765,  1.2954],
       [ 3.0461, -0.4683, -0.4683]])

NumPy arrays have attributes. Recall that Jupyter Notebooks provides tab completion.

data.ndim

data.shape

(2, 3)

data.dtype

dtype('float64')

We access or slice elements in a NumPy array using [], the same as we slice lists and tuples.

data[0]

array([ 0.4967, -0.1383,  0.6477])

As with list and tuples, we can chain []s.

data[0][0]

0.4967

However, with NumPy arrays. we can replace \(n\) chained []s with one pair of []s containing \(n\) values separated by commas. For example, [i][j] becomes [i, j], [i][j][k] becomes [i, j, k]. This abbreviated notation is similar to what you see in your math and econometrics courses.

data[0, 0] # zero row, zero column

0.4967

data[0][0] == data[0, 0]

True

2.1 Creating ndarrays

The easiest way to create an array is to use the array function. This accepts any sequence-like object (including other arrays) and produces a new NumPy array containing the passed data

data1 = [6, 7.5, 8, 0, 1]
arr1 = np.array(data1)
arr1

array([6. , 7.5, 8. , 0. , 1. ])

arr1.dtype

dtype('float64')

Here np.array() casts the values in data1 to floats becuase NumPy arrays must have homogenous data types. We could coerce these values to integers but would lose information.

np.array(data1, dtype=np.int64)

array([6, 7, 8, 0, 1], dtype=int64)

We can coerce or cast a list-of-lists to a two-dimensional NumPy array.

data2 = [[1, 2, 3, 4], [5, 6, 7, 8]]
arr2 = np.array(data2)
arr2

array([[1, 2, 3, 4],
       [5, 6, 7, 8]])

arr2.ndim

arr2.shape

(2, 4)

arr2.dtype

dtype('int32')

There are several other ways to create NumPy arrays.

np.zeros(10)

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])

np.zeros((3, 6))

array([[0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0.]])

The np.arange() function is similar to Python’s built-in range() but creates an array directly.

list(range(15))

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]

np.array(range(15))

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14])

np.arange(15)

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14])

Table 4-1 from McKinney lists some NumPy array creation functions.

array: Convert input data (list, tuple, array, or other sequence type) to an ndarray either by inferring a dtype or explicitly specifying a dtype; copies the input data by default
asarray: Convert input to ndarray, but do not copy if the input is already an ndarray
arange: Like the built-in range but returns an ndarray instead of a list
ones, ones_like: Produce an array of all 1s with the given shape and dtype; ones_like takes another array and produces a ones array of the - same shape and dtype
zeros, zeros_like: Like ones and ones_like but producing arrays of 0s instead
empty, empty_like: Create new arrays by allocating new memory, but do not populate with any values like ones and zeros
full, full_like: Produce an array of the given shape and dtype with all values set to the indicated “fill value”
eye, identity: Create a square N-by-N identity matrix (1s on the diagonal and 0s elsewhere)

2.2 Arithmetic with NumPy Arrays

Arrays are important because they enable you to express batch operations on data without writing any for loops. NumPy users call this vectorization. Any arithmetic operations between equal-size arrays applies the operation element-wise

arr = np.array([[1., 2., 3.], [4., 5., 6.]])
arr

array([[1., 2., 3.],
       [4., 5., 6.]])

arr.shape

(2, 3)

NumPy array addition is elementwise.

arr + arr

array([[ 2.,  4.,  6.],
       [ 8., 10., 12.]])

NumPy array multiplication is elementwise.

arr * arr

array([[ 1.,  4.,  9.],
       [16., 25., 36.]])

NumPy array division is elementwise.

1 / arr

array([[1.    , 0.5   , 0.3333],
       [0.25  , 0.2   , 0.1667]])

NumPy powers are elementwise, too.

arr ** 2

array([[ 1.,  4.,  9.],
       [16., 25., 36.]])

We can also raise a single value to an array!

2 ** arr

array([[ 2.,  4.,  8.],
       [16., 32., 64.]])

2.3 Basic Indexing and Slicing

One-dimensional array index and slice the same as lists.

arr = np.arange(10)
arr

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

arr[5]

arr[5:8]

array([5, 6, 7])

equiv_list = list(range(10))
equiv_list

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

equiv_list[5:8]

[5, 6, 7]

We have to jump through some hoops if we want to replace elements 5, 6, and 7 in equiv_list with the value 12.

# TypeError: can only assign an iterable
# equiv_list[5:8] = 12

equiv_list[5:8] = [12] * 3
equiv_list

[0, 1, 2, 3, 4, 12, 12, 12, 8, 9]

However, this operation is easy with the NumPy array arr!

arr[5:8] = 12
arr

array([ 0,  1,  2,  3,  4, 12, 12, 12,  8,  9])

“Broadcasting” is the name for this behavior.

As you can see, if you assign a scalar value to a slice, as in arr[5:8] = 12, the value is propagated (or broadcasted henceforth) to the entire selection. An important first distinction from Python’s built-in lists is that array slices are views on the original array. This means that the data is not copied, and any modifications to the view will be reflected in the source array.

arr_slice = arr[5:8]
arr_slice

array([12, 12, 12])

x = arr_slice
x

array([12, 12, 12])

x is arr_slice

True

y = x.copy()

y is arr_slice

False

arr_slice[1] = 12345
arr_slice

array([   12, 12345,    12])

arr

array([    0,     1,     2,     3,     4,    12, 12345,    12,     8,
           9])

The : slices every element in arr_slice.

arr_slice[:] = 64
arr_slice

array([64, 64, 64])

arr

array([ 0,  1,  2,  3,  4, 64, 64, 64,  8,  9])

If you want a copy of a slice of an ndarray instead of a view, you will need to explicitly copy the array-for example, arr[5:8].copy().

arr_slice_2 = arr[5:8].copy()
arr_slice_2

array([64, 64, 64])

arr_slice_2[:] = 2_001
arr_slice_2

array([2001, 2001, 2001])

arr

array([ 0,  1,  2,  3,  4, 64, 64, 64,  8,  9])

2.4 Indexing with slices

We can slice across two or more dimensions, including the [i, j] notation.

arr2d = np.array([[1,2,3], [4,5,6], [7,8,9]])
arr2d

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

arr2d[:2]

array([[1, 2, 3],
       [4, 5, 6]])

arr2d[:2, 1:]

array([[2, 3],
       [5, 6]])

A colon (:) by itself selects the entire dimension and is necessary to slice higher dimensions.

arr2d[:, :1]

array([[1],
       [4],
       [7]])

arr2d[:2, 1:] = 0
arr2d

array([[1, 0, 0],
       [4, 0, 0],
       [7, 8, 9]])

Slicing multi-dimension arrays is tricky! Always check your output!

2.5 Boolean Indexing

We can use Booleans (Trues and Falses) to slice arrays, too. Boolean indexing in Python is like combining index() and match() in Excel.

names = np.array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'])
np.random.seed(42)
data = np.random.randn(7, 4)

names

array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'], dtype='<U4')

data

array([[ 0.4967, -0.1383,  0.6477,  1.523 ],
       [-0.2342, -0.2341,  1.5792,  0.7674],
       [-0.4695,  0.5426, -0.4634, -0.4657],
       [ 0.242 , -1.9133, -1.7249, -0.5623],
       [-1.0128,  0.3142, -0.908 , -1.4123],
       [ 1.4656, -0.2258,  0.0675, -1.4247],
       [-0.5444,  0.1109, -1.151 ,  0.3757]])

Here names provides seven names for the seven rows in data.

names == 'Bob'

array([ True, False, False,  True, False, False, False])

data[names == 'Bob']

array([[ 0.4967, -0.1383,  0.6477,  1.523 ],
       [ 0.242 , -1.9133, -1.7249, -0.5623]])

We can combine Boolean slicing with : slicing.

data[names == 'Bob', 2:]

array([[ 0.6477,  1.523 ],
       [-1.7249, -0.5623]])

We can use ~ to invert a Boolean.

cond = names == 'Bob'
data[~cond]

array([[-0.2342, -0.2341,  1.5792,  0.7674],
       [-0.4695,  0.5426, -0.4634, -0.4657],
       [-1.0128,  0.3142, -0.908 , -1.4123],
       [ 1.4656, -0.2258,  0.0675, -1.4247],
       [-0.5444,  0.1109, -1.151 ,  0.3757]])

For NumPy arrays, we must use & and | instead of and and or.

cond = (names == 'Bob') | (names == 'Will')
data[cond]

array([[ 0.4967, -0.1383,  0.6477,  1.523 ],
       [-0.4695,  0.5426, -0.4634, -0.4657],
       [ 0.242 , -1.9133, -1.7249, -0.5623],
       [-1.0128,  0.3142, -0.908 , -1.4123]])

We can also create a Boolean for each element.

data

array([[ 0.4967, -0.1383,  0.6477,  1.523 ],
       [-0.2342, -0.2341,  1.5792,  0.7674],
       [-0.4695,  0.5426, -0.4634, -0.4657],
       [ 0.242 , -1.9133, -1.7249, -0.5623],
       [-1.0128,  0.3142, -0.908 , -1.4123],
       [ 1.4656, -0.2258,  0.0675, -1.4247],
       [-0.5444,  0.1109, -1.151 ,  0.3757]])

data < 0

array([[False,  True, False, False],
       [ True,  True, False, False],
       [ True, False,  True,  True],
       [False,  True,  True,  True],
       [ True, False,  True,  True],
       [False,  True, False,  True],
       [ True, False,  True, False]])

data[data < 0] = 0
data

array([[0.4967, 0.    , 0.6477, 1.523 ],
       [0.    , 0.    , 1.5792, 0.7674],
       [0.    , 0.5426, 0.    , 0.    ],
       [0.242 , 0.    , 0.    , 0.    ],
       [0.    , 0.3142, 0.    , 0.    ],
       [1.4656, 0.    , 0.0675, 0.    ],
       [0.    , 0.1109, 0.    , 0.3757]])

3 Universal Functions: Fast Element-Wise Array Functions

A universal function, or ufunc, is a function that performs element-wise operations on data in ndarrays. You can think of them as fast vectorized wrappers for simple functions that take one or more scalar values and produce one or more scalar results.

arr = np.arange(10)
arr

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

np.sqrt(4)

2.0000

np.sqrt(arr)

array([0.    , 1.    , 1.4142, 1.7321, 2.    , 2.2361, 2.4495, 2.6458,
       2.8284, 3.    ])

Like above, we can raise a single value to a NumPy array of powers.

2**arr

array([  1,   2,   4,   8,  16,  32,  64, 128, 256, 512], dtype=int32)

np.exp(x) is \(e^x\).

np.exp(arr)

array([1.0000e+00, 2.7183e+00, 7.3891e+00, 2.0086e+01, 5.4598e+01,
       1.4841e+02, 4.0343e+02, 1.0966e+03, 2.9810e+03, 8.1031e+03])

The functions above accept one argument. These “unary” functions operate on one array and return a new array with the same shape. There are also “binary” functions that operate on two arrays and return one array.

np.random.seed(42)
x = np.random.randn(8)
y = np.random.randn(8)

array([ 0.4967, -0.1383,  0.6477,  1.523 , -0.2342, -0.2341,  1.5792,
        0.7674])

array([-0.4695,  0.5426, -0.4634, -0.4657,  0.242 , -1.9133, -1.7249,
       -0.5623])

np.maximum(x, y)

array([ 0.4967,  0.5426,  0.6477,  1.523 ,  0.242 , -0.2341,  1.5792,
        0.7674])

Be careful! Function names are not the whole story. Check your output and read the docstring! For example, np.max() returns the maximum of an array, instead of the elementwise maximum of two arrays for np.maximum().

np.max(x)

1.5792

Table 4-4 from McKinney lists some fast, element-wise unary functions:

abs, fabs: Compute the absolute value element-wise for integer, oating-point, or complex values
sqrt: Compute the square root of each element (equivalent to arr ** 0.5)
square: Compute the square of each element (equivalent to arr ** 2)
exp: Compute the exponent \(e^x\) of each element
log, log10, log2, log1p: Natural logarithm (base e), log base 10, log base 2, and log(1 + x), respectively
sign: Compute the sign of each element: 1 (positive), 0 (zero), or –1 (negative)
ceil: Compute the ceiling of each element (i.e., the smallest integer greater than or equal to thatnumber)
floor: Compute the oor of each element (i.e., the largest integer less than or equal to each element)
rint: Round elements to the nearest integer, preserving the dtype
modf: Return fractional and integral parts of array as a separate array
isnan: Return boolean array indicating whether each value is NaN (Not a Number)
isfinite, isinf: Return boolean array indicating whether each element is finite (non-inf, non-NaN) or infinite, respectively
cos, cosh, sin, sinh, tan, tanh: Regular and hyperbolic trigonometric functions
arccos, arccosh, arcsin, arcsinh, arctan, arctanh: Inverse trigonometric functions
logical_not: Compute truth value of not x element-wise (equivalent to ~arr).

Table 4-5 from McKinney lists some fast, element-wise binary functions:

add: Add corresponding elements in arrays
subtract: Subtract elements in second array from first array
multiply: Multiply array elements
divide, floor_divide: Divide or floor divide (truncating the remainder)
power: Raise elements in first array to powers indicated in second array
maximum, fmax: Element-wise maximum; fmax ignores NaN
minimum, fmin: Element-wise minimum; fmin ignores NaN
mod: Element-wise modulus (remainder of division)
copysign: Copy sign of values in second argument to values in first argument
greater, greater_equal, less, less_equal, equal, not_equal: Perform element-wise comparison, yielding boolean array (equivalent to infix operators >, >=, <, <=, ==, !=)
logical_and, logical_or, logical_xor: Compute element-wise truth value of logical operation (equivalent to infix operators & |, ^)

4 Array-Oriented Programming with Arrays

Using NumPy arrays enables you to express many kinds of data processing tasks as concise array expressions that might otherwise require writing loops. This practice of replacing explicit loops with array expressions is commonly referred to as vectorization. In general, vectorized array operations will often be one or two (or more) orders of magnitude faster than their pure Python equivalents, with the biggest impact in any kind of numerical computations. Later, in Appendix A, I explain broadcasting, a powerful method for vectorizing computations.

4.1 Expressing Conditional Logic as Array Operations

The numpy.where function is a vectorized version of the ternary expression x if condition else y.

NumPy’s where() is an if-else statement that operates like Excel’s if().

xarr = np.array([1.1, 1.2, 1.3, 1.4, 1.5])
yarr = np.array([2.1, 2.2, 2.3, 2.4, 2.5])
cond = np.array([True, False, True, True, False])

np.where(cond, xarr, yarr)

array([1.1, 2.2, 1.3, 1.4, 2.5])

We could use a list comprehension, instead, but the list comprehension is takes longer to type, read, and troubleshoot.

np.array([(x if c else y) for x, y, c in zip(xarr, yarr, cond)])

array([1.1, 2.2, 1.3, 1.4, 2.5])

We could also use np.select() here, but it is overkill to test one condition. np.select() lets us test more more than one condition and provides a default value if no condition is met.

np.select(
    condlist=[cond==True, cond==False],
    choicelist=[xarr, yarr]
)

array([1.1, 2.2, 1.3, 1.4, 2.5])

4.2 Mathematical and Statistical Methods

A set of mathematical functions that compute statistics about an entire array or about the data along an axis are accessible as methods of the array class. You can use aggregations (often called reductions) like sum, mean, and std (standard deviation) either by calling the array instance method or using the top-level NumPy function.

We will use these aggregations extensively in pandas.

np.random.seed(42)
arr = np.random.randn(5, 4)
arr

array([[ 0.4967, -0.1383,  0.6477,  1.523 ],
       [-0.2342, -0.2341,  1.5792,  0.7674],
       [-0.4695,  0.5426, -0.4634, -0.4657],
       [ 0.242 , -1.9133, -1.7249, -0.5623],
       [-1.0128,  0.3142, -0.908 , -1.4123]])

arr.mean()

-0.1713

arr.sum()

-3.4260

The aggregation methods above aggregated the whole array. We can use the axis argument to aggregate columns (axis=0) and rows (axis=1).

arr.mean(axis=1)

array([ 0.6323,  0.4696, -0.214 , -0.9896, -0.7547])

arr[0].mean()

0.6323

arr.mean(axis=0)

array([-0.1956, -0.2858, -0.1739, -0.03  ])

arr[:, 0].mean()

-0.1956

The .cumsum() method returns the sum of all previous elements.

arr = np.array([0, 1, 2, 3, 4, 5, 6, 7]) # same output as np.arange(8)
arr.cumsum()

array([ 0,  1,  3,  6, 10, 15, 21, 28])

We can use the .cumsum() method along the axis of a multi-dimension array, too.

arr = np.array([[0, 1, 2], [3, 4, 5], [6, 7, 8]])
arr

array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])

arr.cumsum(axis=0)

array([[ 0,  1,  2],
       [ 3,  5,  7],
       [ 9, 12, 15]])

arr.cumprod(axis=1)

array([[  0,   0,   0],
       [  3,  12,  60],
       [  6,  42, 336]])

Table 4-6 from McKinney lists some basic statistical methods:

sum: Sum of all the elements in the array or along an axis; zero-length arrays have sum 0
mean: Arithmetic mean; zero-length arrays have NaN mean
std, var: Standard deviation and variance, respectively, with optional degrees of freedom adjustment (default denominator \(n\))
min, max: Minimum and maximum
argmin, argmax: Indices of minimum and maximum elements, respectively
cumsum: Cumulative sum of elements starting from 0
cumprod: Cumulative product of elements starting from 1

4.3 Methods for Boolean Arrays

np.random.seed(42)
arr = np.random.randn(100)
arr

array([ 0.4967, -0.1383,  0.6477,  1.523 , -0.2342, -0.2341,  1.5792,
        0.7674, -0.4695,  0.5426, -0.4634, -0.4657,  0.242 , -1.9133,
       -1.7249, -0.5623, -1.0128,  0.3142, -0.908 , -1.4123,  1.4656,
       -0.2258,  0.0675, -1.4247, -0.5444,  0.1109, -1.151 ,  0.3757,
       -0.6006, -0.2917, -0.6017,  1.8523, -0.0135, -1.0577,  0.8225,
       -1.2208,  0.2089, -1.9597, -1.3282,  0.1969,  0.7385,  0.1714,
       -0.1156, -0.3011, -1.4785, -0.7198, -0.4606,  1.0571,  0.3436,
       -1.763 ,  0.3241, -0.3851, -0.6769,  0.6117,  1.031 ,  0.9313,
       -0.8392, -0.3092,  0.3313,  0.9755, -0.4792, -0.1857, -1.1063,
       -1.1962,  0.8125,  1.3562, -0.072 ,  1.0035,  0.3616, -0.6451,
        0.3614,  1.538 , -0.0358,  1.5646, -2.6197,  0.8219,  0.087 ,
       -0.299 ,  0.0918, -1.9876, -0.2197,  0.3571,  1.4779, -0.5183,
       -0.8085, -0.5018,  0.9154,  0.3288, -0.5298,  0.5133,  0.0971,
        0.9686, -0.7021, -0.3277, -0.3921, -1.4635,  0.2961,  0.2611,
        0.0051, -0.2346])

arr > 0

array([ True, False,  True,  True, False, False,  True,  True, False,
        True, False, False,  True, False, False, False, False,  True,
       False, False,  True, False,  True, False, False,  True, False,
        True, False, False, False,  True, False, False,  True, False,
        True, False, False,  True,  True,  True, False, False, False,
       False, False,  True,  True, False,  True, False, False,  True,
        True,  True, False, False,  True,  True, False, False, False,
       False,  True,  True, False,  True,  True, False,  True,  True,
       False,  True, False,  True,  True, False,  True, False, False,
        True,  True, False, False, False,  True,  True, False,  True,
        True,  True, False, False, False, False,  True,  True,  True,
       False])

(arr > 0).sum() # Number of positive values

(arr > 0).mean() # percentage of positive values

0.4600

bools = np.array([False, False, True, False])
bools

array([False, False,  True, False])

bools.any()

True

bools.all()

False