McKinney Chapter 4 - NumPy Basics: Arrays and Vectorized Computation

FINA 6333 for Spring 2024

Author

Richard Herron

1 Introduction

Chapter 4 of Wes McKinney’s Python for Data Analysis discusses the NumPy package (an abbreviation of numerical Python), which is the foundation for numerical computing in Python, including pandas.

We will focus on:

  1. Creating arrays
  2. Slicing arrays
  3. Applying functions and methods to arrays
  4. Using conditional logic with arrays (i.e., np.where() and np.select())

Note: Indented block quotes are from McKinney unless otherwise indicated. The section numbers here differ from McKinney because we will only discuss some topics.

The typical abbreviation for NumPy is np.

import numpy as np

The “magic” function %precision 4 tells JupyterLab to print NumPy arrays to 4 decimals. This magic function only changes the printed precision and does not change the stored precision of the underlying values.

%precision 4
'%.4f'

McKinney thoroughly discuesses the history on NumPy, as well as its technical advantages. But here is a simple illustration of the speed and syntax advantages of NumPy of Python’s built-in data structures. First, we create a list and array with values from 0 to 999,999.

my_list = list(range(1_000_000))
my_arr = np.arange(1_000_000)
my_list[:5]
[0, 1, 2, 3, 4]
my_arr[:5]
array([0, 1, 2, 3, 4])

If we want to double each value in my_list we have to use a for loop or a list comprehension.

len(my_list * 2) # concatenates two copies of my_list
2000000
# [2 * x for x in my_list] # list comprehension to double each value

However, we can multiply my_arr by two because math “just works” with NumPy.

my_arr * 2
array([      0,       2,       4, ..., 1999994, 1999996, 1999998])

We can use the “magic” function %timeit to time these two calculations.

%timeit [x * 2 for x in my_list]
61.2 ms ± 2.09 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit my_arr * 2
1.18 ms ± 47.9 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

The NumPy version is a hundred times faster than the list version. The NumPy version is also faster to type, read, and troubleshoot, which are typically more important. Our time is more valuable than the computer time!

2 The NumPy ndarray: A Multidimensional Array Object

One of the key features of NumPy is its N-dimensional array object, or ndarray, which is a fast, flexible container for large datasets in Python. Arrays enable you to perform mathematical operations on whole blocks of data using similar syntax to the equivalent operations between scalar elements.

We generate random data to explore NumPy arrays. Whenever we generate random data, we should set the random number seed with np.random.seed(42), which makes our random numbers repeatable. If we use the same random number seed, our random numbers will be the same.

np.random.seed(42)
data = np.random.randn(2, 3)
data
array([[ 0.4967, -0.1383,  0.6477],
       [ 1.523 , -0.2342, -0.2341]])

Multiplying data by 10 multiplies each element in data by 10, and adding data to itself adds each element to itself (i.e., element-wise addition). To achieve this common-sense behavior, NumPy arrays must contain homogeneous data types (e.g., all floats or all integers).

data * 10
array([[ 4.9671, -1.3826,  6.4769],
       [15.2303, -2.3415, -2.3414]])
data_2 = data + data
data_2
array([[ 0.9934, -0.2765,  1.2954],
       [ 3.0461, -0.4683, -0.4683]])

NumPy arrays have attributes. Recall that Jupyter Notebooks provides tab completion.

data.ndim
2
data.shape
(2, 3)
data.dtype
dtype('float64')

We access or slice elements in a NumPy array using [], the same as we slice lists and tuples.

data[0]
array([ 0.4967, -0.1383,  0.6477])

As with list and tuples, we can chain []s.

data[0][0]
0.4967

However, with NumPy arrays. we can replace \(n\) chained []s with one pair of []s containing \(n\) values separated by commas. For example, [i][j] becomes [i, j], [i][j][k] becomes [i, j, k]. This abbreviated notation is similar to what you see in your math and econometrics courses.

data[0, 0] # zero row, zero column
0.4967
data[0][0] == data[0, 0]
True

2.1 Creating ndarrays

The easiest way to create an array is to use the array function. This accepts any sequence-like object (including other arrays) and produces a new NumPy array containing the passed data

data1 = [6, 7.5, 8, 0, 1]
arr1 = np.array(data1)
arr1
array([6. , 7.5, 8. , 0. , 1. ])
arr1.dtype
dtype('float64')

Here np.array() casts the values in data1 to floats becuase NumPy arrays must have homogenous data types. We could coerce these values to integers but would lose information.

np.array(data1, dtype=np.int64)
array([6, 7, 8, 0, 1], dtype=int64)

We can coerce or cast a list-of-lists to a two-dimensional NumPy array.

data2 = [[1, 2, 3, 4], [5, 6, 7, 8]]
arr2 = np.array(data2)
arr2
array([[1, 2, 3, 4],
       [5, 6, 7, 8]])
arr2.ndim
2
arr2.shape
(2, 4)
arr2.dtype
dtype('int32')

There are several other ways to create NumPy arrays.

np.zeros(10)
array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
np.zeros((3, 6))
array([[0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0.]])

The np.arange() function is similar to Python’s built-in range() but creates an array directly.

list(range(15))
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]
np.array(range(15))
array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14])
np.arange(15)
array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14])

Table 4-1 from McKinney lists some NumPy array creation functions.

  • array: Convert input data (list, tuple, array, or other sequence type) to an ndarray either by inferring a dtype or explicitly specifying a dtype; copies the input data by default
  • asarray: Convert input to ndarray, but do not copy if the input is already an ndarray
  • arange: Like the built-in range but returns an ndarray instead of a list
  • ones, ones_like: Produce an array of all 1s with the given shape and dtype; ones_like takes another array and produces a ones array of the - same shape and dtype
  • zeros, zeros_like: Like ones and ones_like but producing arrays of 0s instead
  • empty, empty_like: Create new arrays by allocating new memory, but do not populate with any values like ones and zeros
  • full, full_like: Produce an array of the given shape and dtype with all values set to the indicated “fill value”
  • eye, identity: Create a square N-by-N identity matrix (1s on the diagonal and 0s elsewhere)

2.2 Arithmetic with NumPy Arrays

Arrays are important because they enable you to express batch operations on data without writing any for loops. NumPy users call this vectorization. Any arithmetic operations between equal-size arrays applies the operation element-wise

arr = np.array([[1., 2., 3.], [4., 5., 6.]])
arr
array([[1., 2., 3.],
       [4., 5., 6.]])
arr.shape
(2, 3)

NumPy array addition is elementwise.

arr + arr
array([[ 2.,  4.,  6.],
       [ 8., 10., 12.]])

NumPy array multiplication is elementwise.

arr * arr
array([[ 1.,  4.,  9.],
       [16., 25., 36.]])

NumPy array division is elementwise.

1 / arr
array([[1.    , 0.5   , 0.3333],
       [0.25  , 0.2   , 0.1667]])

NumPy powers are elementwise, too.

arr ** 2
array([[ 1.,  4.,  9.],
       [16., 25., 36.]])

We can also raise a single value to an array!

2 ** arr
array([[ 2.,  4.,  8.],
       [16., 32., 64.]])

2.3 Basic Indexing and Slicing

One-dimensional array index and slice the same as lists.

arr = np.arange(10)
arr
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
arr[5]
5
arr[5:8]
array([5, 6, 7])
equiv_list = list(range(10))
equiv_list
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
equiv_list[5:8]
[5, 6, 7]

We have to jump through some hoops if we want to replace elements 5, 6, and 7 in equiv_list with the value 12.

# TypeError: can only assign an iterable
# equiv_list[5:8] = 12
equiv_list[5:8] = [12] * 3
equiv_list
[0, 1, 2, 3, 4, 12, 12, 12, 8, 9]

However, this operation is easy with the NumPy array arr!

arr[5:8] = 12
arr
array([ 0,  1,  2,  3,  4, 12, 12, 12,  8,  9])

“Broadcasting” is the name for this behavior.

As you can see, if you assign a scalar value to a slice, as in arr[5:8] = 12, the value is propagated (or broadcasted henceforth) to the entire selection. An important first distinction from Python’s built-in lists is that array slices are views on the original array. This means that the data is not copied, and any modifications to the view will be reflected in the source array.

arr_slice = arr[5:8]
arr_slice
array([12, 12, 12])
x = arr_slice
x
array([12, 12, 12])
x is arr_slice
True
y = x.copy()
y is arr_slice
False
arr_slice[1] = 12345
arr_slice
array([   12, 12345,    12])
arr
array([    0,     1,     2,     3,     4,    12, 12345,    12,     8,
           9])

The : slices every element in arr_slice.

arr_slice[:] = 64
arr_slice
array([64, 64, 64])
arr
array([ 0,  1,  2,  3,  4, 64, 64, 64,  8,  9])

If you want a copy of a slice of an ndarray instead of a view, you will need to explicitly copy the array-for example, arr[5:8].copy().

arr_slice_2 = arr[5:8].copy()
arr_slice_2
array([64, 64, 64])
arr_slice_2[:] = 2_001
arr_slice_2
array([2001, 2001, 2001])
arr
array([ 0,  1,  2,  3,  4, 64, 64, 64,  8,  9])

2.4 Indexing with slices

We can slice across two or more dimensions, including the [i, j] notation.

arr2d = np.array([[1,2,3], [4,5,6], [7,8,9]])
arr2d
array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])
arr2d[:2]
array([[1, 2, 3],
       [4, 5, 6]])
arr2d[:2, 1:]
array([[2, 3],
       [5, 6]])

A colon (:) by itself selects the entire dimension and is necessary to slice higher dimensions.

arr2d[:, :1]
array([[1],
       [4],
       [7]])
arr2d[:2, 1:] = 0
arr2d
array([[1, 0, 0],
       [4, 0, 0],
       [7, 8, 9]])

Slicing multi-dimension arrays is tricky! Always check your output!

2.5 Boolean Indexing

We can use Booleans (Trues and Falses) to slice arrays, too. Boolean indexing in Python is like combining index() and match() in Excel.

names = np.array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'])
np.random.seed(42)
data = np.random.randn(7, 4)
names
array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'], dtype='<U4')
data
array([[ 0.4967, -0.1383,  0.6477,  1.523 ],
       [-0.2342, -0.2341,  1.5792,  0.7674],
       [-0.4695,  0.5426, -0.4634, -0.4657],
       [ 0.242 , -1.9133, -1.7249, -0.5623],
       [-1.0128,  0.3142, -0.908 , -1.4123],
       [ 1.4656, -0.2258,  0.0675, -1.4247],
       [-0.5444,  0.1109, -1.151 ,  0.3757]])

Here names provides seven names for the seven rows in data.

names == 'Bob'
array([ True, False, False,  True, False, False, False])
data[names == 'Bob']
array([[ 0.4967, -0.1383,  0.6477,  1.523 ],
       [ 0.242 , -1.9133, -1.7249, -0.5623]])

We can combine Boolean slicing with : slicing.

data[names == 'Bob', 2:]
array([[ 0.6477,  1.523 ],
       [-1.7249, -0.5623]])

We can use ~ to invert a Boolean.

cond = names == 'Bob'
data[~cond]
array([[-0.2342, -0.2341,  1.5792,  0.7674],
       [-0.4695,  0.5426, -0.4634, -0.4657],
       [-1.0128,  0.3142, -0.908 , -1.4123],
       [ 1.4656, -0.2258,  0.0675, -1.4247],
       [-0.5444,  0.1109, -1.151 ,  0.3757]])

For NumPy arrays, we must use & and | instead of and and or.

cond = (names == 'Bob') | (names == 'Will')
data[cond]
array([[ 0.4967, -0.1383,  0.6477,  1.523 ],
       [-0.4695,  0.5426, -0.4634, -0.4657],
       [ 0.242 , -1.9133, -1.7249, -0.5623],
       [-1.0128,  0.3142, -0.908 , -1.4123]])

We can also create a Boolean for each element.

data
array([[ 0.4967, -0.1383,  0.6477,  1.523 ],
       [-0.2342, -0.2341,  1.5792,  0.7674],
       [-0.4695,  0.5426, -0.4634, -0.4657],
       [ 0.242 , -1.9133, -1.7249, -0.5623],
       [-1.0128,  0.3142, -0.908 , -1.4123],
       [ 1.4656, -0.2258,  0.0675, -1.4247],
       [-0.5444,  0.1109, -1.151 ,  0.3757]])
data < 0
array([[False,  True, False, False],
       [ True,  True, False, False],
       [ True, False,  True,  True],
       [False,  True,  True,  True],
       [ True, False,  True,  True],
       [False,  True, False,  True],
       [ True, False,  True, False]])
data[data < 0] = 0
data
array([[0.4967, 0.    , 0.6477, 1.523 ],
       [0.    , 0.    , 1.5792, 0.7674],
       [0.    , 0.5426, 0.    , 0.    ],
       [0.242 , 0.    , 0.    , 0.    ],
       [0.    , 0.3142, 0.    , 0.    ],
       [1.4656, 0.    , 0.0675, 0.    ],
       [0.    , 0.1109, 0.    , 0.3757]])

3 Universal Functions: Fast Element-Wise Array Functions

A universal function, or ufunc, is a function that performs element-wise operations on data in ndarrays. You can think of them as fast vectorized wrappers for simple functions that take one or more scalar values and produce one or more scalar results.

arr = np.arange(10)
arr
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
np.sqrt(4)
2.0000
np.sqrt(arr)
array([0.    , 1.    , 1.4142, 1.7321, 2.    , 2.2361, 2.4495, 2.6458,
       2.8284, 3.    ])

Like above, we can raise a single value to a NumPy array of powers.

2**arr
array([  1,   2,   4,   8,  16,  32,  64, 128, 256, 512], dtype=int32)

np.exp(x) is \(e^x\).

np.exp(arr)
array([1.0000e+00, 2.7183e+00, 7.3891e+00, 2.0086e+01, 5.4598e+01,
       1.4841e+02, 4.0343e+02, 1.0966e+03, 2.9810e+03, 8.1031e+03])

The functions above accept one argument. These “unary” functions operate on one array and return a new array with the same shape. There are also “binary” functions that operate on two arrays and return one array.

np.random.seed(42)
x = np.random.randn(8)
y = np.random.randn(8)
x
array([ 0.4967, -0.1383,  0.6477,  1.523 , -0.2342, -0.2341,  1.5792,
        0.7674])
y
array([-0.4695,  0.5426, -0.4634, -0.4657,  0.242 , -1.9133, -1.7249,
       -0.5623])
np.maximum(x, y)
array([ 0.4967,  0.5426,  0.6477,  1.523 ,  0.242 , -0.2341,  1.5792,
        0.7674])

Be careful! Function names are not the whole story. Check your output and read the docstring! For example, np.max() returns the maximum of an array, instead of the elementwise maximum of two arrays for np.maximum().

np.max(x)
1.5792

Table 4-4 from McKinney lists some fast, element-wise unary functions:

  • abs, fabs: Compute the absolute value element-wise for integer, „oating-point, or complex values
  • sqrt: Compute the square root of each element (equivalent to arr ** 0.5)
  • square: Compute the square of each element (equivalent to arr ** 2)
  • exp: Compute the exponent \(e^x\) of each element
  • log, log10, log2, log1p: Natural logarithm (base e), log base 10, log base 2, and log(1 + x), respectively
  • sign: Compute the sign of each element: 1 (positive), 0 (zero), or –1 (negative)
  • ceil: Compute the ceiling of each element (i.e., the smallest integer greater than or equal to thatnumber)
  • floor: Compute the „oor of each element (i.e., the largest integer less than or equal to each element)
  • rint: Round elements to the nearest integer, preserving the dtype
  • modf: Return fractional and integral parts of array as a separate array
  • isnan: Return boolean array indicating whether each value is NaN (Not a Number)
  • isfinite, isinf: Return boolean array indicating whether each element is finite (non-inf, non-NaN) or infinite, respectively
  • cos, cosh, sin, sinh, tan, tanh: Regular and hyperbolic trigonometric functions
  • arccos, arccosh, arcsin, arcsinh, arctan, arctanh: Inverse trigonometric functions
  • logical_not: Compute truth value of not x element-wise (equivalent to ~arr).

Table 4-5 from McKinney lists some fast, element-wise binary functions:

  • add: Add corresponding elements in arrays
  • subtract: Subtract elements in second array from first array
  • multiply: Multiply array elements
  • divide, floor_divide: Divide or floor divide (truncating the remainder)
  • power: Raise elements in first array to powers indicated in second array
  • maximum, fmax: Element-wise maximum; fmax ignores NaN
  • minimum, fmin: Element-wise minimum; fmin ignores NaN
  • mod: Element-wise modulus (remainder of division)
  • copysign: Copy sign of values in second argument to values in first argument
  • greater, greater_equal, less, less_equal, equal, not_equal: Perform element-wise comparison, yielding boolean array (equivalent to infix operators >, >=, <, <=, ==, !=)
  • logical_and, logical_or, logical_xor: Compute element-wise truth value of logical operation (equivalent to infix operators & |, ^)

4 Array-Oriented Programming with Arrays

Using NumPy arrays enables you to express many kinds of data processing tasks as concise array expressions that might otherwise require writing loops. This practice of replacing explicit loops with array expressions is commonly referred to as vectorization. In general, vectorized array operations will often be one or two (or more) orders of magnitude faster than their pure Python equivalents, with the biggest impact in any kind of numerical computations. Later, in Appendix A, I explain broadcasting, a powerful method for vectorizing computations.

4.1 Expressing Conditional Logic as Array Operations

The numpy.where function is a vectorized version of the ternary expression x if condition else y.

NumPy’s where() is an if-else statement that operates like Excel’s if().

xarr = np.array([1.1, 1.2, 1.3, 1.4, 1.5])
yarr = np.array([2.1, 2.2, 2.3, 2.4, 2.5])
cond = np.array([True, False, True, True, False])
np.where(cond, xarr, yarr)
array([1.1, 2.2, 1.3, 1.4, 2.5])

We could use a list comprehension, instead, but the list comprehension is takes longer to type, read, and troubleshoot.

np.array([(x if c else y) for x, y, c in zip(xarr, yarr, cond)])
array([1.1, 2.2, 1.3, 1.4, 2.5])

We could also use np.select() here, but it is overkill to test one condition. np.select() lets us test more more than one condition and provides a default value if no condition is met.

np.select(
    condlist=[cond==True, cond==False],
    choicelist=[xarr, yarr]
)
array([1.1, 2.2, 1.3, 1.4, 2.5])

4.2 Mathematical and Statistical Methods

A set of mathematical functions that compute statistics about an entire array or about the data along an axis are accessible as methods of the array class. You can use aggregations (often called reductions) like sum, mean, and std (standard deviation) either by calling the array instance method or using the top-level NumPy function.

We will use these aggregations extensively in pandas.

np.random.seed(42)
arr = np.random.randn(5, 4)
arr
array([[ 0.4967, -0.1383,  0.6477,  1.523 ],
       [-0.2342, -0.2341,  1.5792,  0.7674],
       [-0.4695,  0.5426, -0.4634, -0.4657],
       [ 0.242 , -1.9133, -1.7249, -0.5623],
       [-1.0128,  0.3142, -0.908 , -1.4123]])
arr.mean()
-0.1713
arr.sum()
-3.4260

The aggregation methods above aggregated the whole array. We can use the axis argument to aggregate columns (axis=0) and rows (axis=1).

arr.mean(axis=1)
array([ 0.6323,  0.4696, -0.214 , -0.9896, -0.7547])
arr[0].mean()
0.6323
arr.mean(axis=0)
array([-0.1956, -0.2858, -0.1739, -0.03  ])
arr[:, 0].mean()
-0.1956

The .cumsum() method returns the sum of all previous elements.

arr = np.array([0, 1, 2, 3, 4, 5, 6, 7]) # same output as np.arange(8)
arr.cumsum()
array([ 0,  1,  3,  6, 10, 15, 21, 28])

We can use the .cumsum() method along the axis of a multi-dimension array, too.

arr = np.array([[0, 1, 2], [3, 4, 5], [6, 7, 8]])
arr
array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])
arr.cumsum(axis=0)
array([[ 0,  1,  2],
       [ 3,  5,  7],
       [ 9, 12, 15]])
arr.cumprod(axis=1)
array([[  0,   0,   0],
       [  3,  12,  60],
       [  6,  42, 336]])

Table 4-6 from McKinney lists some basic statistical methods:

  • sum: Sum of all the elements in the array or along an axis; zero-length arrays have sum 0
  • mean: Arithmetic mean; zero-length arrays have NaN mean
  • std, var: Standard deviation and variance, respectively, with optional degrees of freedom adjustment (default denominator \(n\))
  • min, max: Minimum and maximum
  • argmin, argmax: Indices of minimum and maximum elements, respectively
  • cumsum: Cumulative sum of elements starting from 0
  • cumprod: Cumulative product of elements starting from 1

4.3 Methods for Boolean Arrays

np.random.seed(42)
arr = np.random.randn(100)
arr
array([ 0.4967, -0.1383,  0.6477,  1.523 , -0.2342, -0.2341,  1.5792,
        0.7674, -0.4695,  0.5426, -0.4634, -0.4657,  0.242 , -1.9133,
       -1.7249, -0.5623, -1.0128,  0.3142, -0.908 , -1.4123,  1.4656,
       -0.2258,  0.0675, -1.4247, -0.5444,  0.1109, -1.151 ,  0.3757,
       -0.6006, -0.2917, -0.6017,  1.8523, -0.0135, -1.0577,  0.8225,
       -1.2208,  0.2089, -1.9597, -1.3282,  0.1969,  0.7385,  0.1714,
       -0.1156, -0.3011, -1.4785, -0.7198, -0.4606,  1.0571,  0.3436,
       -1.763 ,  0.3241, -0.3851, -0.6769,  0.6117,  1.031 ,  0.9313,
       -0.8392, -0.3092,  0.3313,  0.9755, -0.4792, -0.1857, -1.1063,
       -1.1962,  0.8125,  1.3562, -0.072 ,  1.0035,  0.3616, -0.6451,
        0.3614,  1.538 , -0.0358,  1.5646, -2.6197,  0.8219,  0.087 ,
       -0.299 ,  0.0918, -1.9876, -0.2197,  0.3571,  1.4779, -0.5183,
       -0.8085, -0.5018,  0.9154,  0.3288, -0.5298,  0.5133,  0.0971,
        0.9686, -0.7021, -0.3277, -0.3921, -1.4635,  0.2961,  0.2611,
        0.0051, -0.2346])
arr > 0
array([ True, False,  True,  True, False, False,  True,  True, False,
        True, False, False,  True, False, False, False, False,  True,
       False, False,  True, False,  True, False, False,  True, False,
        True, False, False, False,  True, False, False,  True, False,
        True, False, False,  True,  True,  True, False, False, False,
       False, False,  True,  True, False,  True, False, False,  True,
        True,  True, False, False,  True,  True, False, False, False,
       False,  True,  True, False,  True,  True, False,  True,  True,
       False,  True, False,  True,  True, False,  True, False, False,
        True,  True, False, False, False,  True,  True, False,  True,
        True,  True, False, False, False, False,  True,  True,  True,
       False])
(arr > 0).sum() # Number of positive values
46
(arr > 0).mean() # percentage of positive values
0.4600
bools = np.array([False, False, True, False])
bools
array([False, False,  True, False])
bools.any()
True
bools.all()
False