LWN, an excellent weekly publication about Linux and the OSS world in general, has started running a series of articles about the subtleties of memory from a programmer's point of view.
The analysis of cache performance in multithreaded code surprised me. On a multicore CPU with a shared L2 cache, threads writing to the same data actually perform terribly when the working set fits within the L1 d-cache. Figure 3.27: Core 2 Bandwidth with 2 Threads shows this very nicely. Performance improves again once the working set sits in the L2, where cache line locking isn’t an issue anymore. Figure 3.29: AMD Fam 10h Bandwidth with 2 Threads shows that this isn’t limited to the Core 2 either. On that processor the L2 isn’t shared, so the L3 is the sweet spot for multithreaded performance.
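To get a feel for the effect, here is a minimal sketch of the kind of experiment involved. It is not the benchmark from the article. Two threads repeatedly overwrite a small buffer, and you compare the case where both write the same buffer with the case where each has its own. The names `Buffer`, `fill_repeatedly` and `run_two_writers` are my own, the stores are relaxed atomics only so that the shared case is not a data race in C++, and nothing pins the threads to separate cores, so the numbers will vary by machine.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstdint>
#include <functional>
#include <thread>
#include <vector>

// A small buffer of 32-bit slots. With 4096 slots it is 16 KiB, which is
// assumed (not guaranteed) to fit in a typical L1 d-cache.
using Buffer = std::vector<std::atomic<std::uint32_t>>;

// Overwrite every slot of the buffer `passes` times with `value`.
void fill_repeatedly(Buffer& buf, std::uint32_t value, int passes) {
    for (int p = 0; p < passes; ++p) {
        for (auto& slot : buf) {
            slot.store(value, std::memory_order_relaxed);
        }
    }
}

// Run two writer threads and return the elapsed wall-clock time in
// milliseconds. Pass the same buffer twice to make both threads write the
// same data, or two different buffers to give each thread private data.
double run_two_writers(Buffer& first, Buffer& second, int passes) {
    assert(passes > 0);
    const auto start = std::chrono::steady_clock::now();
    std::thread t1(fill_repeatedly, std::ref(first), 1u, passes);
    std::thread t2(fill_repeatedly, std::ref(second), 2u, passes);
    t1.join();
    t2.join();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Calling `run_two_writers(buf, buf, passes)` and then `run_two_writers(a, b, passes)` from a `main` with a large `passes` value, and printing the two results, gives a rough comparison. If the threads land on different cores I would expect the shared case to be noticeably slower, but treat that as something to measure rather than a promise.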