McKinney Chapter 3 - Built-In Data Structures, Functions, and Files

FINA 6333 for Spring 2024

Author

Richard Herron

1 Introduction

We must understand Python’s core functionality to fully use NumPy and pandas. Chapter 3 of Wes McKinney’s Python for Data Analysis discusses Python’s core functionality. We will focus on the following:

  1. Data structures
    1. tuples
    2. lists
    3. dicts (also known as dictionaries)
    4. we will ignore sets
  2. List comprehensions
  3. Functions
    1. Returning multiple values
    2. Using anonymous functions

Note: Indented block quotes are from McKinney unless otherwise indicated. The section numbers here differ from McKinney because we will only discuss some topics.

2 Data Structures and Sequences

Python’s data structures are simple but powerful. Mastering their use is a critical part of becoming a proficient Python programmer.

2.1 Tuple

A tuple is a fixed-length, immutable sequence of Python objects.

We cannot change a tuple after we create it because tuples are immutable. A tuple is ordered, so we can subset or slice it with a numerical index. We will surround tuples with parentheses, but the parentheses are not always required.

tup = (4, 5, 6)

Python is zero-indexed, so zero accesses the first element in tup!

tup[0]
4
tup[1]
5
nested_tup = ((4, 5, 6), (7, 8))

Python is zero-indexed!

nested_tup[0]
(4, 5, 6)
nested_tup[0][0]
4
tup = tuple('string')
tup
('s', 't', 'r', 'i', 'n', 'g')
tup[0]
's'
tup = tuple(['foo', [1, 2], True])
tup
('foo', [1, 2], True)
# tup[2] = False # gives an error, because tuples are immutable (unchangeable)

If an object inside a tuple is mutable, such as a list, you can modify it in-place.

tup
('foo', [1, 2], True)
tup[1].append(3)
tup
('foo', [1, 2, 3], True)
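Here is a minimal sketch of the two behaviors above. The names demo_tup and error_message are hypothetical and used only for illustration. Assigning to a position in a tuple should raise a TypeError, while a list inside the tuple can still change in place.

```python
# A tuple that holds a string, a list, and a boolean
demo_tup = ('foo', [1, 2], True)

# Assigning to a position in a tuple raises a TypeError
try:
    demo_tup[2] = False
    error_message = None
except TypeError as err:
    error_message = str(err)

print(error_message)

# The list inside the tuple is mutable, so we can modify it in place
demo_tup[1].append(3)
print(demo_tup)  # ('foo', [1, 2, 3], True)
```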

You can concatenate tuples using the + operator to produce longer tuples:

Tuples are immutable, but we can combine two tuples into a new tuple.

(1, 2) + (1, 2)
(1, 2, 1, 2)
(4, None, 'foo') + (6, 0) + ('bar',)
(4, None, 'foo', 6, 0, 'bar')

Multiplying a tuple by an integer, as with lists, has the effect of concatenating together that many copies of the tuple:

This multiplication behavior is the logical extension of the addition behavior above. The output of tup + tup should be the same as the output of 2 * tup.

('foo', 'bar') * 2
('foo', 'bar', 'foo', 'bar')
('foo', 'bar') + ('foo', 'bar')
('foo', 'bar', 'foo', 'bar')

2.1.1 Unpacking tuples

If you try to assign to a tuple-like expression of variables, Python will attempt to unpack the value on the righthand side of the equals sign.

tup = (4, 5, 6)
a, b, c = tup
(d, e, f) = (7, 8, 9) # the parentheses are optional but helpful!

We can unpack nested tuples!

tup = 4, 5, (6, 7)
a, b, (c, d) = tup
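Two common uses of unpacking are swapping names and capturing the remaining elements with a starred name. The sketch below uses hypothetical names (values, first, second, and rest) for illustration.

```python
# Swap two names without a temporary variable
a, b = 1, 2
a, b = b, a
print(a, b)  # 2 1

# A starred name collects the remaining elements into a list
values = (1, 2, 3, 4, 5)
first, second, *rest = values
print(first, second, rest)  # 1 2 [3, 4, 5]
```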

2.1.2 Tuple methods

Since the size and contents of a tuple cannot be modified, it is very light on instance methods. A particularly useful one (also available on lists) is count, which counts the number of occurrences of a value.

a = (1, 2, 2, 2, 3, 4, 2)
a.count(2)
4

2.2 List

In contrast with tuples, lists are variable-length and their contents can be modified in-place. You can define them using square brackets [ ] or using the list type function.

a_list = [2, 3, 7, None]
tup = ('foo', 'bar', 'baz')
b_list = list(tup)

Python is zero-indexed!

a_list[0]
2

2.2.1 Adding and removing elements

Elements can be appended to the end of the list with the append method.

The .append() method appends an element to the list in place without reassigning the list.

b_list.append('dwarf')

Using insert you can insert an element at a specific location in the list. The insertion index must be between 0 and the length of the list, inclusive.

b_list.insert(1, 'red')
b_list.index('red')
1
b_list[b_list.index('red')] = 'blue'

The inverse operation to insert is pop, which removes and returns an element at a particular index.

b_list.pop(2)
'bar'
b_list
['foo', 'blue', 'baz', 'dwarf']

Note that .pop(2) removes and returns the element at index 2, which is the third element. If we want to access that element without removing it, we should use [2] instead.

Elements can be removed by value with remove, which locates the first such value and removes it from the list.

b_list.append('foo')
b_list.remove('foo')
'dwarf' in b_list
True
'dwarf' not in b_list
False
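As a small check of the remove behavior described above, the sketch below uses a hypothetical list named colors with two copies of 'foo'. Only the first copy should be removed.

```python
# A list with 'foo' at both ends
colors = ['foo', 'blue', 'baz', 'dwarf', 'foo']

# .remove() deletes only the first matching value, in place
colors.remove('foo')
print(colors)  # ['blue', 'baz', 'dwarf', 'foo']

# The second 'foo' is still in the list
print('foo' in colors)  # True
```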

2.2.2 Concatenating and combining lists

Similar to tuples, adding two lists together with + concatenates them.

[4, None, 'foo'] + [7, 8, (2, 3)]
[4, None, 'foo', 7, 8, (2, 3)]

The .append() method adds its argument as the last element in a list.

xx = [4, None, 'foo']
xx.append([7, 8, (2, 3)])

If you have a list already defined, you can append multiple elements to it using the extend method.

x = [4, None, 'foo']
x.extend([7, 8, (2, 3)])

Check your output! It will take you time to understand all these methods!
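One way to check the difference between .append() and .extend() is to compare the lengths of the results. This is a sketch with hypothetical names (base, appended, and extended). The .append() method adds the whole list as one element, while .extend() adds each element of the list.

```python
base = [4, None, 'foo']

# .append() adds its argument as ONE new element (here, a nested list)
appended = base.copy()
appended.append([7, 8, (2, 3)])
print(len(appended))  # 4
print(appended[-1])   # [7, 8, (2, 3)]

# .extend() adds EACH element of its argument
extended = base.copy()
extended.extend([7, 8, (2, 3)])
print(len(extended))  # 6

# .extend() gives the same result as concatenating with +
print(extended == base + [7, 8, (2, 3)])  # True
```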

2.2.3 Sorting

You can sort a list in-place (without creating a new object) by calling its sort function.

a = [7, 2, 5, 1, 3]
a.sort()

sort has a few options that will occasionally come in handy. One is the ability to pass a secondary sort key—that is, a function that produces a value to use to sort the objects. For example, we could sort a collection of strings by their lengths.

Before you write your own solution to a problem, read the docstring (help file) of the built-in function. The built-in function may already solve your problem faster with fewer bugs.

b = ['saw', 'small', 'He', 'foxes', 'six']
b.sort()

Python’s string comparisons are case sensitive, and uppercase letters sort before lowercase letters, so “He” sorts before “foxes”!

b.sort(key=len)
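The built-in sorted() function accepts the same key argument as the .sort() method but returns a new list and leaves the original unchanged. The sketch below uses a hypothetical list named words and passes str.lower as the key to ignore case.

```python
words = ['saw', 'small', 'He', 'foxes', 'six']

# Default sort: uppercase letters sort before lowercase letters
default_order = sorted(words)
print(default_order)  # ['He', 'foxes', 'saw', 'six', 'small']

# key=str.lower compares lowercase versions, so case is ignored
ignore_case = sorted(words, key=str.lower)
print(ignore_case)  # ['foxes', 'He', 'saw', 'six', 'small']

# key=len sorts by length; ties keep their original order
by_length = sorted(words, key=len)
print(by_length)  # ['He', 'saw', 'six', 'small', 'foxes']

# sorted() does not modify the original list
print(words)  # ['saw', 'small', 'He', 'foxes', 'six']
```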

2.2.4 Slicing

Slicing is very important!

You can select sections of most sequence types by using slice notation, which in its basic form consists of start:stop passed to the indexing operator [ ].

Recall that Python is zero-indexed, so the first element has an index of 0. Slices are inclusive on the left edge (start) and exclusive on the right edge (stop), so start:stop selects stop - start elements.

seq = [7, 2, 3, 7, 5, 6, 0, 1]
seq
[7, 2, 3, 7, 5, 6, 0, 1]
seq[5]
6
seq[:5]
[7, 2, 3, 7, 5]
seq[1:5]
[2, 3, 7, 5]
seq[3:5]
[7, 5]

Either the start or stop can be omitted, in which case they default to the start of the sequence and the end of the sequence, respectively.

seq[:5]
[7, 2, 3, 7, 5]
seq[3:]
[7, 5, 6, 0, 1]

Negative indices slice the sequence relative to the end.

seq[-1:]
[1]
seq[-4:]
[5, 6, 0, 1]
seq[-4:-1]
[5, 6, 0]
seq[-6:-2]
[3, 7, 5, 6]

A step can also be used after a second colon to, say, take every other element.

seq[:]
[7, 2, 3, 7, 5, 6, 0, 1]
seq[::2]
[7, 3, 5, 0]
seq[1::2]
[2, 7, 6, 1]

I remember the trick above as :2 is “count by 2”.

A clever use of this is to pass -1, which has the useful effect of reversing a list or tuple.

seq[::-1]
[1, 0, 6, 5, 7, 3, 2, 7]

We will use slicing (subsetting) all semester, so it is worth a few minutes to understand the examples above.
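The sketch below checks a few consequences of the inclusive-start, exclusive-stop rule. The names start and stop are hypothetical values chosen for illustration.

```python
seq = [7, 2, 3, 7, 5, 6, 0, 1]
start, stop = 3, 5

# A slice start:stop contains stop - start elements
print(len(seq[start:stop]) == stop - start)  # True

# Splitting at any index and concatenating recovers the original list
print(seq[:start] + seq[start:] == seq)  # True

# A negative index counts from the end, so -3 is the same as len(seq) - 3
print(seq[-3:] == seq[len(seq) - 3:])  # True
```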

2.3 dict

dict is likely the most important built-in Python data structure. A more common name for it is hash map or associative array. It is a flexibly sized collection of key-value pairs, where key and value are Python objects. One approach for creating one is to use curly braces {} and colons to separate keys and values.

Elements in dictionaries have names, while elements in tuples and lists have numerical indices. Dictionaries are handy for passing named arguments and returning named results.

empty_dict = {}
empty_dict
{}

A dictionary is a set of key-value pairs.

d1 = {'a': 'some value', 'b': [1, 2, 3, 4]}
d1['a']
'some value'
d1[7] = 'an integer'
d1
{'a': 'some value', 'b': [1, 2, 3, 4], 7: 'an integer'}

We access dictionary values by key names instead of key positions.

d1['b']
[1, 2, 3, 4]
'b' in d1
True

You can delete values either using the del keyword or the pop method (which simultaneously returns the value and deletes the key).

d1[5] = 'some value'
d1['dummy'] = 'another value'
d1
{'a': 'some value',
 'b': [1, 2, 3, 4],
 7: 'an integer',
 5: 'some value',
 'dummy': 'another value'}
del d1[5]
d1
{'a': 'some value',
 'b': [1, 2, 3, 4],
 7: 'an integer',
 'dummy': 'another value'}
ret = d1.pop('dummy')
ret
'another value'
d1
{'a': 'some value', 'b': [1, 2, 3, 4], 7: 'an integer'}

The keys and values methods give you iterators of the dict’s keys and values, respectively. Dicts preserve insertion order (guaranteed since Python 3.7), and these methods output the keys and values in the same respective order.

d1.keys()
dict_keys(['a', 'b', 7])
d1.values()
dict_values(['some value', [1, 2, 3, 4], 'an integer'])
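Because the keys and values methods return their results in the same order, pairing them position by position should recover the key-value pairs. Here is a minimal sketch with hypothetical names (d, keys, values, and pairs).

```python
d = {'a': 'some value', 'b': [1, 2, 3, 4], 7: 'an integer'}

# Convert the dict views to lists so we can inspect them
keys = list(d.keys())
values = list(d.values())

# zip() pairs the first key with the first value, and so on
pairs = list(zip(keys, values))
print(pairs)

# The .items() method returns the same key-value pairs directly
print(pairs == list(d.items()))  # True
```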

You can merge one dict into another using the update method.

d1
{'a': 'some value', 'b': [1, 2, 3, 4], 7: 'an integer'}
d1.update({'b': 'foo', 'c': 12})
d1
{'a': 'some value', 'b': 'foo', 7: 'an integer', 'c': 12}

3 List, Set, and Dict Comprehensions

We will focus on list comprehensions.

List comprehensions are one of the most-loved Python language features. They allow you to concisely form a new list by filtering the elements of a collection, transforming the elements passing the filter in one concise expression. They take the basic form:

[expr for val in collection if condition]

This is equivalent to the following for loop:

result = []
for val in collection:
    if condition:
        result.append(expr)

The filter condition can be omitted, leaving only the expression.

List comprehensions are very Pythonic.

strings = ['a', 'as', 'bat', 'car', 'dove', 'python']

We could use a for loop to capitalize the strings in strings and keep only strings with lengths greater than two.

caps = []
for x in strings:
    if len(x) > 2:
        caps.append(x.upper())

caps
['BAT', 'CAR', 'DOVE', 'PYTHON']

A list comprehension is a more Pythonic solution and replaces four lines of code with one. The general format for a list comprehension is [operation on x for x in list if condition].

[x.upper() for x in strings if len(x) > 2]
['BAT', 'CAR', 'DOVE', 'PYTHON']

Here is another example. Write a for-loop and the equivalent list comprehension that squares the integers from 1 to 10.

squares = []
for i in range(1, 11):
    squares.append(i ** 2)
    
squares
[1, 4, 9, 16, 25, 36, 49, 64, 81, 100]
[i**2 for i in range(1, 11)]
[1, 4, 9, 16, 25, 36, 49, 64, 81, 100]
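We can also combine the operation with a filter condition. As a sketch, the hypothetical example below squares only the even integers from 1 to 10, first with a for loop and then with the equivalent list comprehension.

```python
# For loop with an if statement
even_squares_loop = []
for i in range(1, 11):
    if i % 2 == 0:
        even_squares_loop.append(i ** 2)

# Equivalent list comprehension with a filter condition
even_squares = [i ** 2 for i in range(1, 11) if i % 2 == 0]

print(even_squares)                       # [4, 16, 36, 64, 100]
print(even_squares == even_squares_loop)  # True
```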

4 Functions

Functions are the primary and most important method of code organization and reuse in Python. As a rule of thumb, if you anticipate needing to repeat the same or very similar code more than once, it may be worth writing a reusable function. Functions can also help make your code more readable by giving a name to a group of Python statements.

Functions are declared with the def keyword and returned from with the return keyword:

def my_function(x, y, z=1.5):
    if z > 1:
        return z * (x + y)
    else:
        return z / (x + y)

There is no issue with having multiple return statements. If Python reaches the end of a function without encountering a return statement, None is returned automatically.

Each function can have positional arguments and keyword arguments. Keyword arguments are most commonly used to specify default values or optional arguments. In the preceding function, x and y are positional arguments while z is a keyword argument. This means that the function can be called in any of these ways:

my_function(5, 6, z=0.7)
my_function(3.14, 7, 3.5)
my_function(10, 20)

The main restriction on function arguments is that the keyword arguments must follow the positional arguments (if any). You can specify keyword arguments in any order; this frees you from having to remember which order the function arguments were specified in and only what their names are.
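The sketch below repeats the function so the block runs on its own, then calls it with the default z, with z as a keyword, and with every argument as a keyword in a different order. The result names are hypothetical.

```python
def my_function(x, y, z=1.5):
    if z > 1:
        return z * (x + y)
    else:
        return z / (x + y)

# z falls back to its default value of 1.5
result_default = my_function(10, 20)
print(result_default)  # 45.0

# Positional x and y, followed by keyword z
result_keyword = my_function(5, 6, z=0.7)

# Keyword arguments can appear in any order
result_any_order = my_function(z=0.7, y=6, x=5)
print(result_keyword == result_any_order)  # True
```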

Here is the basic syntax for a function:

def mult_by_two(x):
    return 2*x

4.1 Returning Multiple Values

We can write Python functions that return multiple objects. In reality, the function f() below returns one object, a tuple, that we can unpack to multiple objects.

def f():
    a = 5
    b = 6
    c = 7
    return (a, b, c)
f()
(5, 6, 7)
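Because f() returns a tuple, we can unpack its result into separate names on the left side of the equals sign. Here is a minimal sketch that repeats the function so the block runs on its own.

```python
def f():
    a = 5
    b = 6
    c = 7
    return a, b, c  # the parentheses around the tuple are optional

# Unpack the returned tuple into three names
a, b, c = f()
print(a, b, c)  # 5 6 7

# Or keep the tuple and index it
result = f()
print(result[0])  # 5
```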

If we want to return multiple objects with names or labels, we can return a dictionary.

def f():
    a = 5
    b = 6
    c = 7
    return {'a' : a, 'b' : b, 'c' : c}
f()
{'a': 5, 'b': 6, 'c': 7}
f()['a']
5

4.2 Anonymous (Lambda) Functions

Python has support for so-called anonymous or lambda functions, which are a way of writing functions consisting of a single statement, the result of which is the return value. They are defined with the lambda keyword, which has no meaning other than “we are declaring an anonymous function.”

I usually refer to these as lambda functions in the rest of the book. They are especially convenient in data analysis because, as you’ll see, there are many cases where data transformation functions will take functions as arguments. It’s often less typing (and clearer) to pass a lambda function as opposed to writing a full-out function declaration or even assigning the lambda function to a local variable.

Lambda functions are very Pythonic and let us write simple functions on the fly. For example, we could use a lambda function to sort strings by their last letters.

strings = ['foo', 'card', 'bar', 'aaaa', 'abab']
strings.sort()
strings
['aaaa', 'abab', 'bar', 'card', 'foo']
strings.sort(key=len)
strings
['bar', 'foo', 'aaaa', 'abab', 'card']
strings.sort(key=lambda x: x[-1])
strings
['aaaa', 'abab', 'card', 'foo', 'bar']

How can I sort by the second letter in each string?

strings.sort(key=lambda x: x[1])
strings
['aaaa', 'card', 'bar', 'abab', 'foo']
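Another key we could pass is the number of unique letters in each string. The sketch below uses sorted(), so the original list is unchanged, and a lambda function that converts each string to a set and counts its elements. The name by_unique is hypothetical.

```python
strings = ['foo', 'card', 'bar', 'aaaa', 'abab']

# set(x) keeps one copy of each letter, so len(set(x)) counts unique letters
by_unique = sorted(strings, key=lambda x: len(set(x)))
print(by_unique)  # ['aaaa', 'foo', 'abab', 'bar', 'card']
```

Strings with the same number of unique letters, such as 'foo' and 'abab', keep their original relative order because Python’s sort is stable.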