Alois Kraus

blog

I see a lot of different code and issues. One interesting bug was a change that only removed a few lines of code, yet the regression test suite consistently reported a 100 ms slowdown. Luckily the regression test suite was using ETW by default, so I could compare the good baseline with the bad run and also take a look at the code change. The profiling diff did not make much sense: there was a slowdown, but for no apparent reason the CultureInfo.CurrentCulture.DisplayName property had become ca. 100 ms slower.

How can that be? To make things even more mysterious, when they changed some other unrelated code the numbers returned back to normal. After looking into it more deeply I found that the basic application logic had not slowed down at all. Instead, some unrelated methods had simply become much more CPU hungry, most notably the internal CLR method COMInterlocked::CompareExchange64. The interesting thing is that it happened only under 32 bit; under 64 bit the problem went away. If you are totally confused by now you are in good company. But there is hope. I had encountered a similar problem already over a year ago, so I knew that it had something to do with the interlocked intrinsics for 64 bit operands in 32 bit code. The most prominent one on 32 bit is

lock cmpxchg8b qword ptr [some register pointing to a memory location] 

which is heavily used by the CLR interlocked methods. To reproduce the problem cleanly I wrote a little C program in which I played around a bit to see what the real issue is. It turns out it is ……

                                                                                Memory Alignment

A picture will tell more than 1000 words:

[Image: two 64 byte cache lines divided into 8 byte slots; a 64 bit variable (variable 7) allocated on a 4 byte boundary straddles the cache line boundary]

The CPU cache is organized in cache lines which are usually 64 bytes wide. You can find out the cache line size of your CPU with the nice Sysinternals tool Coreinfo. On my Haswell home machine it prints something like this:

Logical Processor to Cache Map:
**------  Data Cache          0, Level 1,   32 KB, Assoc   8, LineSize  64
**------  Instruction Cache   0, Level 1,   32 KB, Assoc   8, LineSize  64
**------  Unified Cache       0, Level 2,  256 KB, Assoc   8, LineSize  64
********  Unified Cache       1, Level 3,    8 MB, Assoc  16, LineSize  64

The most important number for the following is the LineSize of 64, which tells us how big the smallest memory unit is that the CPU cache controller manages. Now back to our slow lock cmpxchg8b instruction. The effect of the lock prefix is that one core gets exclusive access to a memory location. This is usually implemented on the CPU by locking one cache line, which is quite fast. But what happens if the variable spans two cache lines? In that case the CPU seems to fall back to a much coarser lock, which is far more expensive: the operation becomes at least 10-20 times slower. It turned out that our .NET application in x86 had allocated a 64 bit variable on a 4 byte (Int32) boundary at an address that crossed two cache lines (see picture above). If by bad luck we use variable 7 for a 64 bit interlocked operation, we cause an expensive global cache lock.

Since under 64 bit the class layout is usually 8 byte aligned, we practically never see variables spanning two cache lines, which makes all cache line related problems go away; our application was working as expected under 64 bit. The issue is still there, but the class layout makes it much harder to get into this situation. Under 32 bit, however, we frequently find data structures with 4 byte alignment, which can cause sudden slowdowns if the memory location we are hitting sits on a cache line boundary. Now it is easy to write a repro for the issue:

using System;
using System.Diagnostics;
using System.Globalization;

namespace InterlockedFun
{
    class Program
    {
        static void Main(string[] args)
        {
            int len = 1;
            if (args.Length == 1)
            {
                len = int.Parse(args[0]);
            }
            // Allocating len bytes shifts the addresses of all subsequent
            // heap allocations, so some internal 64 bit variable may end
            // up crossing a cache line boundary.
            var b = new byte[len];

            var sw = Stopwatch.StartNew();

            var name = CultureInfo.CurrentCulture.DisplayName;

            sw.Stop();
            Console.WriteLine("DisplayName property did take {0:F0}ms", sw.Elapsed.TotalMilliseconds);
        }
    }
}

That is all. You only need to allocate enough data on the managed heap so that the data structures allocated afterwards will at some point hit a cache line boundary. To force this you can try different byte counts with a simple for loop on the command line:

for /L %I in (1,1,64) do InterlockedFun.exe %I

At some point the measured times will change quite a lot:

InterlockedFun.exe 29
DisplayName property did take 17ms

InterlockedFun.exe 30
DisplayName property did take 17ms

InterlockedFun.exe 31
DisplayName property did take 17ms

InterlockedFun.exe 32
DisplayName property did take 17ms

InterlockedFun.exe 33
DisplayName property did take 128ms

InterlockedFun.exe 34
DisplayName property did take 93ms

InterlockedFun.exe 35
DisplayName property did take 77ms

You can play with the little sample yourself to find the worst performing byte count on your machine. If you now look at the runs in WPA with a differential view, you will find that CompareExchange64 is responsible for the measured difference:

[Image: WPA differential view showing COMInterlocked::CompareExchange64 accounting for the measured CPU difference]

Since that was such a nice problem, here is the actual C code I used to verify that the issue only pops up at cache line boundaries:

#include "stdafx.h"
#include <windows.h>
#include <chrono>

const int Iterations = 1000 * 1000;  // yeah heavy locking
size_t  Alignment = 4; // simulate DWORD alignment

int main(int argc, char **argv)
{
    if (argc == 2)
    {
        Alignment = atoi(argv[1]);
        _tprintf(L"Alignment: %Iu", Alignment); // %Iu is the MSVC format specifier for size_t
    }

    auto pAligned = (LONG64 *)_aligned_malloc(10 * sizeof(LONG64), 64);

    auto v1 = (LONG64 *)(((byte *)pAligned) + Alignment) + 7; // Shift our 64 byte aligned allocation by Alignment bytes and then go 7
                                                              // int64 slots to the right to land on the border of two cache lines

    auto start = std::chrono::high_resolution_clock::now();

    for (int k = 0; k < Iterations; k++)  // simulate many interlocked operations on a variable which crosses two cache lines
    {
        _InterlockedCompareExchange64(v1, 100, 100);
    }
    auto stop = std::chrono::high_resolution_clock::now();
    auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(stop - start).count();
    _tprintf(L"\nInterlocked %d iterations did take %I64dms. Average duration interlocked operation: %f us", Iterations,  ms, (ms*1000.0f)/Iterations);

    _aligned_free(pAligned);
    return 0;
}

With the bad 4 byte alignment this prints

Interlocked 1000000 iterations did take 1104ms. Average duration interlocked operation: 1.104000 us

but with 8 byte alignment

Interlocked 1000000 iterations did take 26ms. Average duration interlocked operation: 0.026000 us

That is a whopping factor of 42 faster. No wonder the Intel manual recommends aligning the variables, on page 258 of the Intel 64 and IA-32 Architectures Software Developer's Manual (System Programming Guide):

… The integrity of a bus lock is not affected by the alignment of the memory field. The LOCK semantics are followed
for as many bus cycles as necessary to update the entire operand. However, it is recommend that locked accesses
be aligned on their natural boundaries for better system performance:
• Any boundary for an 8-bit access (locked or otherwise).
• 16-bit boundary for locked word accesses.
• 32-bit boundary for locked doubleword accesses.
• 64-bit boundary for locked quadword accesses.  …

The word better should be written in big red letters. Unfortunately, due to the memory layout of some data structures, 32 bit code has a much higher probability of causing random performance issues in real world applications than 64 bit code. And this is not an issue which only makes your own application slower. If you execute the C version concurrently

start  cmpxchg.exe && cmpxchg.exe

then you will get not 1 s but 1.5 s of runtime because of the processor bus locking. In reality it is not as bad as this test suggests, because if the other application uses correctly aligned variables it will operate at normal speed. But if two applications exhibit the same error they will slow each other down.

If you use an allocator which does not care about natural variable alignment rules, such as the GC allocator, you can run into issues which can be pretty hard to find. 64 bit code can also be plagued by such issues, because there are also 128 bit interlocked intrinsics. With the AVX2 SIMD extensions memory alignment is becoming mainstream again. If people tell you that memory alignment and CPU caches play no role in today's high level programming languages, you can prove them wrong with a simple 8 line C# application. To come to an end and to answer the question of the headline: No, it is not a CPU bug, but an important detail of how CPU performance is affected if you use interlocked intrinsics on variables which span more than one cache line. Performance is an implementation detail. To find out how bad it gets you need to measure for yourself in your scenario.

posted on Thursday, November 12, 2015 10:51 AM