Beyond the Bootcamp: Memory Part 2

Welcome back to the second edition of the Beyond the Bootcamp memory module!

Last week, we talked about how what caches were along with what spatial and temporal locality were. Let’s take a look at an example where not using spatial locality really hurts performance!

But first, a quick refresher on 2D Arrays

Pictorially, we can represent this as so:

Summing elements in a grid

But does it really make a difference?

Let’s take a look:

Uhh what gives? Why is it faster to compute the sum traversing over the rows of the grid compared to the columns?

How is this related to caching?

Below is the same code traversing through each row just with intermediary variable names to aid in the explanation:

With regards to spatial locality, each time nums[i] is referenced, the other elements are brought along for the ride into the cache because the computer is making an educated guess that if you’ve referenced nums[i], you’ll probably reference nums[i + 1]. Let’s take a look at an example:

When we go through our first iteration of the outer loop to perform the variable assignment of row = nums[0], we first see that it’s not present in the cache. We then go ahead and fetch the first row from main memory to put it in the cache.

Upon modifying the ‘sum’ variable, we’ll then request for row[0]. It’s already present in the cache! No need to go to main memory now. On the next iteration, we’ll request for row[1]. We see that’s in our cache as well, and by doing so, we’ve saved ourselves an expensive trip to main memory.

By traversing row by row, the computer is efficiently able to leverage both temporal and spatial locality. The computer constantly refers to the sum variable and keeps it in a register to leverage temporal locality. Simultaneously, the computer makes use of spatial locality by readily pulling in elements that it thinks the program will access next.

Let’s now take a look at the column based strategy to see what went wrong…

Traversing over columns

On the first iteration, we set the for loop bounds and evaluate nums[0].length:

We first request for nums[0] and initially don’t find it in the cache. As a result, we’ll go to main memory to fetch nums[0] along with all of its elements because of spatial locality.

The next line of code we execute is sum += nums[0][0]:

Nice! nums[0][0] is in our cache. We then go ahead and add nums[0][0] to our current sum.

On the next iteration, we execute the line:

sum += nums[1][0]

Which results in the following execution:

Since we’re referencing a different row and not adjacent elements, when requesting for nums[1][0], we don’t find it in the cache. As a result, we have to go do an expensive lookup in main memory to figure out what nums[1][0] resolves to. We also have to evict our entire cache and replace elements with the adjacent elements in the row nums[1].

On the next iteration, we then request for nums[2][0]:

Once again, this results in a cache miss which means we have to go to main memory, evict the current cache, replace elements, and return nums[2][0]. This is a costly operation every iteration which is why we saw such a large difference traversing the grid via rows compared to via columns! Formally, this phenomenon is referred to as cache thrashing. We should always traverse the grid row wise to leverage spatial locality when we can/

Wrapping up

CS @ UW • SDE @ Amazon • http://www.justinharjanto.com