#!/usr/bin/env python # coding: utf-8 # # Fancy Indexing # The previous chapters discussed how to access and modify portions of arrays using simple indices (e.g., `arr[0]`), slices (e.g., `arr[:5]`), and Boolean masks (e.g., `arr[arr > 0]`). # In this chapter, we'll look at another style of array indexing, known as *fancy* or *vectorized* indexing, in which we pass arrays of indices in place of single scalars. # This allows us to very quickly access and modify complicated subsets of an array's values. # ## Exploring Fancy Indexing # # Fancy indexing is conceptually simple: it means passing an array of indices to access multiple array elements at once. # For example, consider the following array: # In[1]: import numpy as np rng = np.random.default_rng(seed=1701) x = rng.integers(100, size=10) print(x) # Suppose we want to access three different elements. We could do it like this: # In[2]: [x[3], x[7], x[2]] # Alternatively, we can pass a single list or array of indices to obtain the same result: # In[3]: ind = [3, 7, 4] x[ind] # When using arrays of indices, the shape of the result reflects the shape of the *index arrays* rather than the shape of the *array being indexed*: # In[4]: ind = np.array([[3, 7], [4, 5]]) x[ind] # Fancy indexing also works in multiple dimensions. Consider the following array: # In[5]: X = np.arange(12).reshape((3, 4)) X # Like with standard indexing, the first index refers to the row, and the second to the column: # In[6]: row = np.array([0, 1, 2]) col = np.array([2, 1, 3]) X[row, col] # Notice that the first value in the result is `X[0, 2]`, the second is `X[1, 1]`, and the third is `X[2, 3]`. # The pairing of indices in fancy indexing follows all the broadcasting rules that were mentioned in [Computation on Arrays: Broadcasting](02.05-Computation-on-arrays-broadcasting.ipynb). # So, for example, if we combine a column vector and a row vector within the indices, we get a two-dimensional result: # In[7]: X[row[:, np.newaxis], col] # Here, each row value is matched with each column vector, exactly as we saw in broadcasting of arithmetic operations. # For example: # In[8]: row[:, np.newaxis] * col # It is always important to remember with fancy indexing that the return value reflects the *broadcasted shape of the indices*, rather than the shape of the array being indexed. # ## Combined Indexing # # For even more powerful operations, fancy indexing can be combined with the other indexing schemes we've seen. For example, given the array `X`: # In[9]: print(X) # We can combine fancy and simple indices: # In[10]: X[2, [2, 0, 1]] # We can also combine fancy indexing with slicing: # In[11]: X[1:, [2, 0, 1]] # And we can combine fancy indexing with masking: # In[12]: mask = np.array([True, False, True, False]) X[row[:, np.newaxis], mask] # All of these indexing options combined lead to a very flexible set of operations for efficiently accessing and modifying array values. # ## Example: Selecting Random Points # # One common use of fancy indexing is the selection of subsets of rows from a matrix. # For example, we might have an $N$ by $D$ matrix representing $N$ points in $D$ dimensions, such as the following points drawn from a two-dimensional normal distribution: # In[13]: mean = [0, 0] cov = [[1, 2], [2, 5]] X = rng.multivariate_normal(mean, cov, 100) X.shape # Using the plotting tools we will discuss in [Introduction to Matplotlib](04.00-Introduction-To-Matplotlib.ipynb), we can visualize these points as a scatter plot (see the following figure): # In[14]: get_ipython().run_line_magic('matplotlib', 'inline') import matplotlib.pyplot as plt plt.style.use('seaborn-whitegrid') plt.scatter(X[:, 0], X[:, 1]); # Let's use fancy indexing to select 20 random points. We'll do this by first choosing 20 random indices with no repeats, and using these indices to select a portion of the original array: # In[15]: indices = np.random.choice(X.shape[0], 20, replace=False) indices # In[16]: selection = X[indices] # fancy indexing here selection.shape # Now to see which points were selected, let's overplot large circles at the locations of the selected points (see the following figure): # In[17]: plt.scatter(X[:, 0], X[:, 1], alpha=0.3) plt.scatter(selection[:, 0], selection[:, 1], facecolor='none', edgecolor='black', s=200); # This sort of strategy is often used to quickly partition datasets, as is often needed in train/test splitting for validation of statistical models (see [Hyperparameters and Model Validation](05.03-Hyperparameters-and-Model-Validation.ipynb)), and in sampling approaches to answering statistical questions. # ## Modifying Values with Fancy Indexing # # Just as fancy indexing can be used to access parts of an array, it can also be used to modify parts of an array. # For example, imagine we have an array of indices and we'd like to set the corresponding items in an array to some value: # In[18]: x = np.arange(10) i = np.array([2, 1, 8, 4]) x[i] = 99 print(x) # We can use any assignment-type operator for this. For example: # In[19]: x[i] -= 10 print(x) # Notice, though, that repeated indices with these operations can cause some potentially unexpected results. Consider the following: # In[20]: x = np.zeros(10) x[[0, 0]] = [4, 6] print(x) # Where did the 4 go? This operation first assigns `x[0] = 4`, followed by `x[0] = 6`. # The result, of course, is that `x[0]` contains the value 6. # # Fair enough, but consider this operation: # In[21]: i = [2, 3, 3, 4, 4, 4] x[i] += 1 x # You might expect that `x[3]` would contain the value 2 and `x[4]` would contain the value 3, as this is how many times each index is repeated. Why is this not the case? # Conceptually, this is because `x[i] += 1` is meant as a shorthand of `x[i] = x[i] + 1`. `x[i] + 1` is evaluated, and then the result is assigned to the indices in `x`. # With this in mind, it is not the augmentation that happens multiple times, but the assignment, which leads to the rather nonintuitive results. # # So what if you want the other behavior where the operation is repeated? For this, you can use the `at` method of ufuncs and do the following: # In[22]: x = np.zeros(10) np.add.at(x, i, 1) print(x) # The `at` method does an in-place application of the given operator at the specified indices (here, `i`) with the specified value (here, 1). # Another method that is similar in spirit is the `reduceat` method of ufuncs, which you can read about in the [NumPy documentation](https://numpy.org/doc/stable/reference/ufuncs.html). # ## Example: Binning Data # # You could use these ideas to efficiently do custom binned computations on data. # For example, imagine we have 100 values and would like to quickly find where they fall within an array of bins. # We could compute this using `ufunc.at` like this: # In[23]: rng = np.random.default_rng(seed=1701) x = rng.normal(size=100) # compute a histogram by hand bins = np.linspace(-5, 5, 20) counts = np.zeros_like(bins) # find the appropriate bin for each x i = np.searchsorted(bins, x) # add 1 to each of these bins np.add.at(counts, i, 1) # The counts now reflect the number of points within each bin—in other words, a histogram (see the following figure): # In[24]: # plot the results plt.plot(bins, counts, drawstyle='steps'); # Of course, it would be inconvenient to have to do this each time you want to plot a histogram. # This is why Matplotlib provides the `plt.hist` routine, which does the same in a single line: # # ```python # plt.hist(x, bins, histtype='step'); # ``` # # This function will create a nearly identical plot to the one just shown. # To compute the binning, Matplotlib uses the `np.histogram` function, which does a very similar computation to what we did before. Let's compare the two here: # In[25]: print(f"NumPy histogram ({len(x)} points):") get_ipython().run_line_magic('timeit', 'counts, edges = np.histogram(x, bins)') print(f"Custom histogram ({len(x)} points):") get_ipython().run_line_magic('timeit', 'np.add.at(counts, np.searchsorted(bins, x), 1)') # Our own one-line algorithm is twice as fast as the optimized algorithm in NumPy! How can this be? If you dig into the `np.histogram` source code (you can do this in IPython by typing `np.histogram??`), you'll see that it's quite a bit more involved than the simple search-and-count that we've done; this is because NumPy's algorithm is more flexible, and particularly is designed for better performance when the number of data points becomes large: # In[26]: x = rng.normal(size=1000000) print(f"NumPy histogram ({len(x)} points):") get_ipython().run_line_magic('timeit', 'counts, edges = np.histogram(x, bins)') print(f"Custom histogram ({len(x)} points):") get_ipython().run_line_magic('timeit', 'np.add.at(counts, np.searchsorted(bins, x), 1)') # What this comparison shows is that algorithmic efficiency is almost never a simple question. An algorithm efficient for large datasets will not always be the best choice for small datasets, and vice versa (see [Big-O Notation](02.08-Sorting.ipynb#Big-O-Notation)). # But the advantage of coding this algorithm yourself is that with an understanding of these basic methods, the sky is the limit: you're no longer constrained to built-in routines, but can create your own approaches to exploring the data. # Key to efficiently using Python in data-intensive applications is not only knowing about general convenience routines like `np.histogram` and when they're appropriate, but also knowing how to make use of lower-level functionality when you need more pointed behavior.