Programming L-caches
Long time no post...
So, this time we are going to talk a bit low-level and get close to computers (cause my girlfriend won't let me, and sometimes a guy just needs to obsess over something else, you know?).
The Buildup: So, one of my friends, bless his heart, hit me up asking for laptop advice. He had narrowed it down to two choices, and after some intense deliberation (read: a quick look at the specs and whatever looked cooler), I leaned towards choice 2. My reasoning? "Better processor, probably." Classic me.
But here's where things got... maybe... interesting. I was scrolling through the Flipkart specs, minding "his" business, when I saw this one little line: "L3 cache size".

It said "24MB". And my brain went, "Hmm, that sounds...big?"
Naturally, given the type of person I am, my curiosity was piqued. His was 24MB, so I thought, why not check mine? I fired up my terminal and ran:
```bash
lscpu
```
To be a bit more specific (and look cooler, obviously):
```bash
lscpu | grep "cache"
```
And then the cold, hard truth. My laptop's L3 cache? 🤢
```
L3 cache: 4 MiB (1 instance)
```
Four. Whole. Megabytes.
I mean, 4MB compared to 24MB? No wonder my life is so... slow?
Now that I know how slow my laptop is, today we're going to break down some legit low-level sh*t I picked up, no computer science degree needed.
Today you will learn:
- RAM is the slowest link in the memory chain.
- Your CPU can sometimes be smarter than a neural network
- Why arrays are often the better choice of linear data structure
- The efficient way of writing loops
- I write the best blogs 🤥

So, What Even Is This "Cache" Thing? 🧐
Imagine you are a super-speedy chef (that's your CPU), and you're cooking (obviously).
- Your Hands (Registers): This is like the knife you're holding or the spice you're actively sprinkling: insanely fast, right there, zero lag.
- Your Kitchen Table (L1 cache): This is where you put the ingredients you're just about to use - diced onions, flour. Super quick to grab.
- The Pantry Right Behind You (L2 cache): This has stuff you know you'll need soon, maybe the butter or the main thing. A quick swirl, and you've got it.
- The Fridge (L3 cache): This is where you keep all the bulk ingredients. A bit further, takes a few more steps, but still in the kitchen.
- The Supermarket (RAM/Main Memory): This is where you go when you realize you're out of stock, and you have to shut down your cooking operation, drive there, buy it, drive back, and then resume. This is the sloooooowest, most painful part of the process.
Turns out RAM is the slowest of them all when it comes to the CPU grabbing data. Every time your CPU has to run to the "supermarket", your computer basically goes 🥱 and slows down. The caches (L1, L2, L3) are there to prevent those painful trips by keeping the most-likely-needed stuff as close as possible. My 4MB L3 cache means my CPU is making way more trips to the market than my friend's would. Sad life.
Your CPU and Cache-Friendly Code
Here's the mind-blowing part: your CPU can sometimes be smarter than a neural network when it comes to predicting what data you'll need next. Seriously. It's not AI, but it's actually good at guessing.
It relies on something called "Locality of Reference."
- Temporal Locality (Time): If you just used a specific ingredient(data), you are probably going to use it again soon. The CPU keeps it in the cache because, statistics.
- Spatial Locality (Space): If you just grabbed one specific potato, chances are you'll need the other potatoes right next to it in the bag. So, when the CPU fetches data from RAM, it doesn't just get that one piece. Oh no. It grabs a whole chunk of surrounding data (a "cache line", usually 64 bytes) and stuffs it into the cache. Why? Because it thinks you'll need the adjacent stuff next.
This "smart guessing" is what we can totally exploit as programmers.
TIP
Read https://en.algorithmica.org/hpc/cpu-cache/cache-lines/ to learn more about cache lines, how they work, and what padding is. It's an interesting e-book on performance engineering.
Arrays are better
This spatial locality thing? It makes a huge difference in how we store data.
Think about it:
- Arrays (`std::vector` in C++): All your items are lined up neatly in memory, one after the other, like books on a shelf. When the CPU goes to grab `book[0]`, it loads `book[0]` through `book[10]` (or whatever fits in a cache line) into its quick-access zone. So when you ask for `book[1]`, BOOM! Instant access, already there. This is why arrays are a better choice for linear data structures if you care about speed (there's a quick benchmark right after this list).
- Linked Lists (`std::list` in C++): This is like a treasure hunt. Each item tells you where the next item is hidden, somewhere else entirely in memory. So to find `item[1]`, the CPU has to read `item[0]`, get its pointer, then go hunting for `item[1]`. There's no guarantee `item[1]` is anywhere near `item[0]`. This is called pointer chasing, and it's like sending your CPU to the "supermarket" over and over again for each single item. A major cache-miss fiesta!
Level Up Your Loops
This is probably one of the coolest, most practical things I learned. The efficient way of writing loops can seriously boost performance, all thanks to how caches work.
Imagine you have a big spreadsheet (a 2D array, or matrix) `matrix[ROWS][COLS]`. In C++ (and many other languages), these are stored in "row-major" order. This means all of `matrix[0][...]` is stored contiguously, then all of `matrix[1][...]`, and so on.

[Figure: row-major vs. column-major array layout. By Cmglee, own work, CC BY-SA 4.0.]
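You can see that layout with your own eyes. Here's a tiny sketch (just illustrative; the byte counts assume a plain C-style `int` array, where `int` is 4 bytes):
```cpp
#include <iostream>

int main() {
    const int ROWS = 3, COLS = 4;
    int matrix[ROWS][COLS] = {};

    // Stepping down a column skips an entire row's worth of memory...
    std::cout << "down one row:   "
              << ((char*)&matrix[1][0] - (char*)&matrix[0][0]) << " bytes\n"; // COLS * sizeof(int) = 16

    // ...while stepping along a row moves just one element over.
    std::cout << "across one col: "
              << ((char*)&matrix[0][1] - (char*)&matrix[0][0]) << " bytes\n"; // sizeof(int) = 4
    return 0;
}
```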
The Slow Way (Column-Major Traversal):
```cpp
// This is bad for cache!
for (int col = 0; col < COLS; ++col) {
    for (int row = 0; row < ROWS; ++row) {
        sum += matrix[row][col]; // Going down a column: each step jumps a whole row's worth of memory
    }
}
```
Here, in the inner loop, you're accessing `matrix[0][col]`, then `matrix[1][col]`, then `matrix[2][col]`. You're basically skipping across memory by an entire row's length in each step. This means every single access is likely a cache miss, forcing your CPU to run to the "supermarket" constantly. Ouch.
The Fast Way (Row-Major Traversal):
```cpp
// This is super cache-friendly!
for (int row = 0; row < ROWS; ++row) {
    for (int col = 0; col < COLS; ++col) {
        sum += matrix[row][col]; // We're moving along the row
    }
}
```
Now, in the inner loop, you're accessing `matrix[row][0]`, then `matrix[row][1]`, then `matrix[row][2]`. This is perfect! As the CPU loads `matrix[row][0]`, it grabs the whole cache line (remember the "bag of potatoes"?). So when you ask for `matrix[row][1]`, it's already there, waiting. Cache hit, baby! This is how you make your CPU happy and your code zoom.
Don't trust me?
Try running this:
```cpp
#include <iostream>
#include <vector>
#include <chrono> // For timing the code

int main() {
    // We define a large matrix size to make the time difference noticeable.
    // WARNING: This will use a lot of RAM (10000 * 10000 * 4 bytes ≈ 400 MB)
    const int ROWS = 10000;
    const int COLS = 10000;

    // Create a 2D vector (matrix) and fill it with some data.
    std::vector<std::vector<int>> matrix(ROWS, std::vector<int>(COLS));
    for (int i = 0; i < ROWS; ++i) {
        for (int j = 0; j < COLS; ++j) {
            matrix[i][j] = i + j; // Just some dummy data
        }
    }

    // We use 'volatile' here to ensure the compiler doesn't get clever
    // and optimize away our loops because it sees the 'sum' isn't used.
    volatile long long sum = 0;

    std::cout << "Starting tests with a " << ROWS << "x" << COLS << " matrix...\n\n";

    // --- Test 1: The Bad Way (Column-Major Traversal) ---
    auto start_bad = std::chrono::high_resolution_clock::now();

    for (int col = 0; col < COLS; ++col) {
        for (int row = 0; row < ROWS; ++row) {
            sum += matrix[row][col];
        }
    }

    auto end_bad = std::chrono::high_resolution_clock::now();
    std::chrono::duration<double, std::milli> elapsed_bad = end_bad - start_bad;

    std::cout << "Bad (Column-Major) Traversal Time: " << elapsed_bad.count() << " ms\n";

    // --- Test 2: The Good Way (Row-Major Traversal) ---
    sum = 0; // Reset sum for a fair test
    auto start_good = std::chrono::high_resolution_clock::now();

    for (int row = 0; row < ROWS; ++row) {
        for (int col = 0; col < COLS; ++col) {
            sum += matrix[row][col];
        }
    }

    auto end_good = std::chrono::high_resolution_clock::now();
    std::chrono::duration<double, std::milli> elapsed_good = end_good - start_good;

    std::cout << "Good (Row-Major) Traversal Time: " << elapsed_good.count() << " ms\n";

    return 0;
}
```
```bash
# Compile the code with optimizations
g++ -O3 -o cache_test cache_test.cpp

# Run the executable
./cache_test
```
Avoid False Sharing in Multithreading
This is a more advanced topic. A cache line is the smallest unit of data that CPU cores keep in sync (coherent) with each other.
False sharing happens when:
- Two threads on different cores are working.
- They are modifying different variables.
- But those two variables happen to be located on the same cache line.
When Core 1 writes to its variable, the entire cache line is marked "dirty." This forces Core 2's cache to invalidate its copy of the line and re-fetch it from main memory, even though its own variable wasn't changed. This causes constant, unnecessary cache synchronization and kills performance.
Solution: Pad your data structures with extra bytes to ensure that variables modified by different threads are on different cache lines.
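Here's a minimal sketch of that padding idea, assuming the usual 64-byte cache line (where available, you could also use `std::hardware_destructive_interference_size` from `<new>` instead of hard-coding 64). Each thread's counter gets its own cache line via `alignas(64)`, so the cores stop fighting over the same line:
```cpp
#include <atomic>
#include <iostream>
#include <thread>

// Each counter is aligned to (and therefore padded out to) 64 bytes,
// so two counters can never land on the same cache line.
struct PaddedCounter {
    alignas(64) std::atomic<long long> value{0};
};

int main() {
    PaddedCounter counters[2];

    // Two threads hammer two *different* counters.
    auto hammer = [&counters](int idx) {
        for (int i = 0; i < 10'000'000; ++i)
            counters[idx].value.fetch_add(1, std::memory_order_relaxed);
    };

    std::thread t1(hammer, 0);
    std::thread t2(hammer, 1);
    t1.join();
    t2.join();

    std::cout << counters[0].value << " + " << counters[1].value << "\n";
    return 0;
}
```
Compile with `-pthread`. Try removing the `alignas(64)` and compare the runtimes: with both counters squeezed onto one cache line, the cores keep invalidating each other's copy even though they never touch the same variable.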
Finally
So here we are. Who knew a casual laptop recommendation could lead to such an interesting dive into CPU architecture? It's pretty wild to think about how these low-level hardware details impact the code we write every day. So don't go mindlessly solving every Leetcode question; think about each line you type, because even the order of your loops can make a huge difference.
And while my laptop might have a tiny L3 cache, at least I know how to write code that makes the most of it now. My friend just wanted a solid laptop, but I ended up with a whole new perspective on optimizing my code. Totally worth it.
Stay curious, stay coding.
References
Some places I wandered to learn these things:
- Temporal and Spatial locality: https://medium.com/@adamzerner/spatial-and-temporal-locality-for-dummies-b080f2799dd
- Temporal Locality: https://arxiv.org/pdf/2102.09413
- Algorithmica Performance Engineering, Cache Lines: https://en.algorithmica.org/hpc/cpu-cache/cache-lines/
- And Wikipedia as always.