Unexpected Cache Performance

LWN, an excellent weekly publication covering Linux and the OSS world in general, has started running a series of articles on the subtleties of memory from a programmer’s point of view:

What every programmer should know about memory, Part 1

Memory part 2: CPU caches

The analysis of cache performance in multithreaded code surprised me.  On a multicore CPU with a shared L2 cache, threads writing to the same data will actually perform terribly if the working set fits within the L1 d-cache.  Figure 3.27: Core 2 Bandwidth with 2 Threads shows this very nicely.  Performance improves again once the working set spills into the L2, where cache line locking isn’t an issue anymore.  Figure 3.29: AMD Fam 10h Bandwidth with 2 Threads shows that this isn’t limited to the Core 2 either.  There the L2 isn’t shared, so the L3 is the sweet spot for multithreaded performance.
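
The effect is easy to try yourself.  Here’s a minimal sketch of the experiment (my own, not code from the paper), assuming a 64-byte cache line and POSIX threads: two threads increment counters that either share one cache line or sit on separate lines.

    /* false_sharing.c -- build with: gcc -O2 -pthread false_sharing.c */
    #include <pthread.h>
    #include <stdalign.h>
    #include <stdio.h>
    #include <time.h>

    #define ITERS 100000000UL
    #define LINE  64                /* assumed cache line size in bytes */

    /* Both counters in one cache line: each write forces the other
     * core's copy of the line to be invalidated. */
    static alignas(LINE) volatile long same_line[2];

    /* Counters 64 bytes apart, i.e. on separate cache lines. */
    static alignas(LINE) volatile long apart[2 * LINE / sizeof(long)];

    static void *bump(void *slot)
    {
        volatile long *p = slot;
        for (unsigned long i = 0; i < ITERS; i++)
            (*p)++;
        return NULL;
    }

    static double run(volatile long *a, volatile long *b)
    {
        pthread_t t1, t2;
        struct timespec s, e;

        clock_gettime(CLOCK_MONOTONIC, &s);
        pthread_create(&t1, NULL, bump, (void *)a);
        pthread_create(&t2, NULL, bump, (void *)b);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        clock_gettime(CLOCK_MONOTONIC, &e);
        return (e.tv_sec - s.tv_sec) + (e.tv_nsec - s.tv_nsec) / 1e9;
    }

    int main(void)
    {
        printf("same cache line:      %.2fs\n",
               run(&same_line[0], &same_line[1]));
        printf("separate cache lines: %.2fs\n",
               run(&apart[0], &apart[LINE / sizeof(long)]));
        return 0;
    }

On a multicore machine the shared-line run should come out several times slower, even though the two threads never touch the same variable.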


3 thoughts on “Unexpected Cache Performance”

  1. Dan says:

    When does that scenario happen in the real world? The graph shows that two threads fighting to write data to the same set of addresses will have performance problems, but I don’t know when you’d want to do that.

    The same figure shows that in the common case (two threads reading the same data) the effective bandwidth is monotonic in working set size.

    Since he’s comparing AMD vs. Intel parts, it would be interesting to see a MOESI vs. MESI comparison, and maybe an analysis of particular bus protocol features, e.g., fast TRDY.

  2. breadthfirst says:

    I imagine this could happen if two threads are randomly accessing (writing) the same data structure. This in itself is not dangerous, but if we assume the data structure is small enough that there is a non-negligible probability of both threads holding the same dirty cache line in L1, I think you might see some of this performance hit in the real world. The locking involved could also make a few cache lines very hot.

    I always assumed that the performance hit for locking/synchronization would mostly go away with a shared L2. This makes the situation look somewhat worse.

  3. P O'Grady says:

    I found this situation when I had an object which was allocated on the stack of one thread, then passed to (and used by) another thread. I know what you’re thinking: that new() should be used in these cases. But in real-time programming, the only time-bounded way to allocate an object is to allocate it on a stack. In my particular case, the object was a network receiver buffer. Another thread was managing socket connections and I/O (i.e., calling select); so my worker thread would create a receiver buffer, hand it over to the network thread, then go on processing other stuff. The address proximity was enough to cause thrashing, and because of this, it turned out that my system ran much slower on a dual Pentium.
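
Both breadthfirst’s hot-lock point and P O’Grady’s stack-buffer story come down to data layout.  Here is a minimal sketch of the usual mitigation (mine, not from the article; the struct names are made up, and a 64-byte line and a typical x86-64 Linux build are assumed): pad or align the contended pieces onto cache lines of their own.

    /* padding.c -- build with: gcc -O2 -pthread padding.c */
    #include <pthread.h>
    #include <stdalign.h>
    #include <stddef.h>
    #include <stdio.h>

    #define LINE 64                 /* assumed cache line size in bytes */

    /* The hot-lock scenario: the mutex, a write-hot counter, and
     * read-mostly metadata land on one cache line (56 bytes total on
     * typical x86-64 builds), so every lock acquisition invalidates
     * the readers' cached copy of the metadata. */
    struct table_hot {
        pthread_mutex_t lock;
        long            writes;     /* write-hot */
        size_t          nbuckets;   /* read-mostly */
    };

    /* Padded variant: each field starts its own cache line, so lock
     * traffic no longer disturbs readers of nbuckets. */
    struct table_padded {
        alignas(LINE) pthread_mutex_t lock;
        alignas(LINE) long            writes;
        alignas(LINE) size_t          nbuckets;
    };

    /* The stack-buffer scenario: a receiver buffer aligned to a line
     * boundary and sized to a multiple of LINE occupies whole cache
     * lines of its own, so no other local variable on the creating
     * thread's stack can share a line with it while another thread
     * writes into it. */
    struct rx_buffer {
        alignas(LINE) char data[2048];
    };

    int main(void)
    {
        printf("packed:  lock@%zu writes@%zu nbuckets@%zu size=%zu\n",
               offsetof(struct table_hot, lock),
               offsetof(struct table_hot, writes),
               offsetof(struct table_hot, nbuckets),
               sizeof(struct table_hot));
        printf("padded:  lock@%zu writes@%zu nbuckets@%zu size=%zu\n",
               offsetof(struct table_padded, lock),
               offsetof(struct table_padded, writes),
               offsetof(struct table_padded, nbuckets),
               sizeof(struct table_padded));
        printf("rx_buffer: align=%zu size=%zu\n",
               alignof(struct rx_buffer), sizeof(struct rx_buffer));
        return 0;
    }

The padded layouts trade memory for isolation: each alignas(LINE) field begins a new line, which is exactly the property that keeps two cores from fighting over one line in the figures above.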
