Alois Kraus




With the new Windows Performance Toolkit you can now identify bottlenecks in your system which did not show up in any profiler. Let's, for example, query some system properties via WMI.

    // Needs: using System; System.Diagnostics; System.Linq; System.Management
    static void Main(string[] args)
    {
        var sw = Stopwatch.StartNew();
        string[] queries = new string[] { "Win32_Processor", "Win32_Printer", "Win32_ComputerSystem", "Win32_VideoController", "Win32_PhysicalMemory", "Win32_LogicalDisk" };

        // Run each WMI query and touch every property of every returned object.
        foreach(var wmiColl in queries.Select(x=> new ManagementObjectSearcher(new SelectQuery(x)))
                                      .Select(query => query.Get()))
            foreach(var prop in wmiColl.Cast<ManagementBaseObject>()
                                       .SelectMany(obj => obj.Properties.Cast<PropertyData>()))
            {
                //Console.WriteLine("{0} - {1}", prop.Name, prop.Value);
            }
        Console.WriteLine("Time: {0:F1}s", sw.Elapsed.TotalSeconds);
    }

This takes around 1.6 s on my machine. When I look at my process (ConsoleApplication1.exe), the CPU consumption is rather low.



So where is the time spent? It could be some disk I/O happening during the queries. To check for disk I/O we need the Disk Usage “Utilization by Process, IO Type” graph. As you can see, there was no disk I/O at all during the selected time. A flat line does not mean that there were no CreateFile/ReadFile calls at all: if you call ReadFile on a file whose contents are already cached by the OS, reading from it is only a matter of a CopyMemory call. This graph shows you only real disk accesses.


In the CPU graph you can see that ConsoleApplication1.exe was not the process with the highest CPU consumption. Most CPU cycles were spent in WmiPrvSE, spoolsv and svchost, so let's graph them as well. You can enable or disable items in the graph by left-clicking on the color box or via the context menu, which has Enable … and Disable … entries to enable or disable all or only the selected processes or threads.


We see that although our process consumed only 2.8% of the CPU for all WMI queries, we did max out one core with a WMI worker process (50% is one core on a dual core machine; on an 8 core machine one core would be 12.5%). In total we consumed 45% CPU time, which means that nearly one core was fully utilized to process the WMI queries! At least this is the working hypothesis: our WMI queries caused the load in the other processes. How can we verify this? First we can have a look at the call stacks in our process.
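The per-core percentages above are simply 100 divided by the number of logical cores. A minimal sketch of that arithmetic follows; the `CoreShare` class and `OneCorePercent` method are names made up for this illustration and are not part of WPT.

```csharp
using System;

static class CoreShare
{
    // Percentage of total machine CPU that one fully busy core accounts for.
    // Hypothetical helper for illustration only.
    public static double OneCorePercent(int logicalCores)
    {
        if (logicalCores <= 0) throw new ArgumentOutOfRangeException("logicalCores");
        return 100.0 / logicalCores;
    }

    static void Main()
    {
        // The dual core and 8 core machines from the text, plus the current machine.
        foreach (int cores in new[] { 2, 8, Environment.ProcessorCount })
            Console.WriteLine("{0} cores: one saturated core = {1:F1}% of total CPU",
                              cores, OneCorePercent(cores));
    }
}
```

Keep this in mind when you read the CPU graph: a process that sits at a constant 100/N percent is most likely single threaded and CPU bound.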


Here we nicely see the full managed call stacks, at least with .NET 4.5. If you have .NET 4.0 you will see only the first managed stack frame and then nothing. It is better than nothing, but it should be improved for .NET 4.0 as well.


Full managed stacks do work with .NET 4.0 as well for x86 processes. It only fails when you want to profile 64 bit processes under Windows Vista or Windows 7. There is a bug in the stack walking algorithm which stops when it encounters the first dynamically generated stack frame whose address is not inside a loaded module. To work around this issue you can NGen your assemblies to get full call stacks as well. This is a bit cumbersome to work with but it does work very nicely. The only issue is that truly dynamic code like WCF proxies will still not show up as expected.

The other fix to this issue is to use Windows 8.

The ManagementObjectCollection makes some COM calls to communicate with the WMI service, and our thread waits for a response from a COM object. When our thread gives up its time slice by calling some wait method, the OS schedules the next thread in the ready queue. This topic is covered in greater detail here, which is part of the WPT documentation. Here is a copy of the relevant part:


Each thread exists in a particular execution state at any given time. Windows uses three states that are relevant to performance; these are: Running, Ready, and Waiting.

Threads that are currently being executed are in the Running state. Threads that can execute but are currently not running are in the Ready state. Threads that cannot run because they are waiting for a particular event are in the Waiting state.

A state to state transition is shown in Figure 1 Thread State Transitions:

Figure 1 Thread State Transitions

Figure 1 Thread State Transitions is explained as follows:

  1. A thread in the Running state initiates a transition to the Waiting state by calling a wait function such as WaitForSingleObject or Sleep(> 0).
  2. A running thread or kernel operation readies a thread in the Waiting state (for example, SetEvent or timer expiration). If a processor is idle or if the readied thread has a higher priority than a currently running thread, the readied thread can switch directly to the Running state. Otherwise, it is put into the Ready state.
  3. A thread in the Ready state is scheduled for processing by the dispatcher when a running thread waits, yields (Sleep(0)), or reaches the end of its quantum.
  4. A thread in the Running state is switched out and placed into the Ready state by the dispatcher when it is preempted by a higher priority thread, yields (Sleep(0)), or when its quantum ends.

A thread that exists in the Waiting state does not necessarily indicate a performance problem. Most threads spend significant time in the Waiting state, which allows processors to enter idle states and save energy. Thread state becomes an important factor in performance only when a user is waiting for a thread to complete an operation.
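The four transitions quoted above can be written down as a tiny state machine. This is only a sketch to make the rules easier to follow: the `SchedulerState` and `SchedulerEvent` enums and the event names are my own illustrative assumptions, not Windows or WPT identifiers.

```csharp
using System;

// The three states from the quoted documentation.
enum SchedulerState { Running, Ready, Waiting }

// Illustrative event names (assumptions, not Windows identifiers).
enum SchedulerEvent
{
    Wait,                     // WaitForSingleObject, Sleep(> 0)
    ReadiedCpuAvailable,      // readied while a processor is idle or the thread outranks a running one
    ReadiedCpuBusy,           // readied but no processor is available for it
    Dispatched,               // picked from the ready queue by the dispatcher
    YieldPreemptOrQuantumEnd  // Sleep(0), preemption, or end of quantum
}

static class ThreadStateModel
{
    // Returns the state a thread moves to; the comments refer to the numbered list above.
    public static SchedulerState Next(SchedulerState state, SchedulerEvent evt)
    {
        switch (state)
        {
            case SchedulerState.Running:
                if (evt == SchedulerEvent.Wait) return SchedulerState.Waiting;                   // transition 1
                if (evt == SchedulerEvent.YieldPreemptOrQuantumEnd) return SchedulerState.Ready; // transition 4
                break;
            case SchedulerState.Waiting:
                if (evt == SchedulerEvent.ReadiedCpuAvailable) return SchedulerState.Running;    // transition 2
                if (evt == SchedulerEvent.ReadiedCpuBusy) return SchedulerState.Ready;           // transition 2
                break;
            case SchedulerState.Ready:
                if (evt == SchedulerEvent.Dispatched) return SchedulerState.Running;             // transition 3
                break;
        }
        throw new InvalidOperationException(state + " cannot handle " + evt);
    }

    static void Main()
    {
        // A thread waits on a COM call, gets readied while all cores are busy, then is dispatched.
        var state = SchedulerState.Running;
        var events = new[] { SchedulerEvent.Wait, SchedulerEvent.ReadiedCpuBusy, SchedulerEvent.Dispatched };
        foreach (var evt in events)
        {
            var next = Next(state, evt);
            Console.WriteLine("{0} --{1}--> {2}", state, evt, next);
            state = next;
        }
    }
}
```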


This means that we can see where our thread was switched out by the OS to execute the next non-waiting thread. To see the next executing thread, right-click on the column header of the list and add ReadyThreadStack. This allows you to follow the trail of your thread from one process to the next. It is not a 100% correlation, but the chances are good that the next ready thread is the one that the OS schedules next. Since WPT records the call stacks after every context switch, we can get a deep understanding of how the threads interact with each other. When we follow the ReadyThreadStack until it gives up control by itself, we see that we ended up in svchost (1028). It would be interesting to see which system services were running inside it to confirm that our process has something to do with it.


Unfortunately there is no column for the command line or the running services available in this view. But there is a view called Processes “Lifetime By Process” which gives us the command line of all processes and the hosted services, if any. This way we see that our call ended up in the process which hosts the WMI (Winmgmt) service, which sounds reasonable.


When we look into the call stacks of our WMI service host process, we find our console application many times in the ReadyThreadStack, which is not surprising since it needs to send the results back to our process. There are only a few calls to WmiPrvSE.exe (1516), the WMI worker process which executes the WMI queries and where the actual WMI provider objects live. When we inspect the WMI worker process more closely, we find that it does some work with svchost (700), which is responsible for DCOM calls. Now we have a pretty good understanding of what is actually going on.

  1. ConsoleApplication1 issues some WMI query calls.
  2. These calls are processed by the WMI service (svchost 1028), which hands them to a WMI worker process.
  3. WmiPrvSE.exe (1516) executes the actual queries.
  4. The execution of the “Win32_Printer” query causes some heavy load in the print spooling service (spoolsv 1648).
  5. Some WMI call (the fetching of display device info) requires the activation of some remote COM object, which causes some load in the DCOM system service (svchost 700).

From this we can conclude that a 2.8% CPU load in our process caused a total load of nearly 50% (that means one core was fully utilized). The load in our process caused a 15 times higher load in other processes in the system!
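The factor of 15 follows from the numbers measured earlier: 45% total CPU, of which 2.8% was spent in our own process. Here is a small sketch of the calculation; `LoadAmplification` and `Factor` are hypothetical names used only for this example.

```csharp
using System;
using System.Globalization;

static class LoadAmplification
{
    // CPU consumed in all other processes divided by the CPU consumed in our own process.
    // Hypothetical helper for illustration only.
    public static double Factor(double totalPercent, double ownPercent)
    {
        return (totalPercent - ownPercent) / ownPercent;
    }

    static void Main()
    {
        double total = 45.0; // total CPU during the WMI queries, as measured above
        double own = 2.8;    // share of ConsoleApplication1.exe, as measured above
        // (45 - 2.8) / 2.8 is roughly 15.1
        Console.WriteLine(Factor(total, own).ToString("F1", CultureInfo.InvariantCulture));
    }
}
```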

I really love this tool since it allows you to analyze cross process correlations, which no other profiler I have encountered so far can do. It seems that the initial target audience of this tool was kernel developers, but over the years XPerf has reached the state of a full system wide profiler for managed and unmanaged applications. The managed part works best with .NET 4.5, but you can get useful insights with .NET 4.0 already. Now we have found that the WMI calls in our process were rather cheap but the whole system was under quite high load due to the WMI queries. Be careful not to overuse WMI since it might hurt overall system performance for no good reason.

There is good content in the WPT documentation online to guide you to more CPU scenarios. I found the thread starvation analysis especially interesting. If you see high Ready (us) times for your thread then you have a problem, because this means that your thread was sitting too long in the thread ready queue before it was scheduled for execution. Now you can not only see that your thread was treated unfairly by the OS but you can also find out which other thread got in your way (could be another process, higher thread priority …).
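To make the idea of Ready (us) more concrete, here is a rough sketch. It assumes that the ready time is the gap between the moment a thread was readied and the moment it was actually switched in. The `SwitchIn` class, its fields and the threshold are hypothetical and not part of any WPT API, and the sample numbers are made up purely for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical record of one switch-in of a thread; not a WPT type.
sealed class SwitchIn
{
    public int ThreadId;
    public long ReadyTimeUs;     // when the thread was readied
    public long SwitchInTimeUs;  // when it actually started running
    // Assumed definition: time spent in the ready queue.
    public long ReadyUs { get { return SwitchInTimeUs - ReadyTimeUs; } }
}

static class StarvationCheck
{
    // Returns the switch-ins that waited longer than thresholdUs, worst first.
    public static List<SwitchIn> LongReadyWaits(IEnumerable<SwitchIn> events, long thresholdUs)
    {
        return events.Where(e => e.ReadyUs > thresholdUs)
                     .OrderByDescending(e => e.ReadyUs)
                     .ToList();
    }

    static void Main()
    {
        // Made-up sample values, not taken from a real trace.
        var events = new[]
        {
            new SwitchIn { ThreadId = 1, ReadyTimeUs = 100, SwitchInTimeUs = 120 },
            new SwitchIn { ThreadId = 1, ReadyTimeUs = 500, SwitchInTimeUs = 9500 },
        };
        foreach (var e in LongReadyWaits(events, 1000))
            Console.WriteLine("Thread {0} sat in the ready queue for {1} us", e.ThreadId, e.ReadyUs);
    }
}
```

In the tool itself you would simply sort the table by the Ready (us) column; the sketch only shows what that number stands for under the assumption above.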

I am still shocked to see how many professional developers do not use profiling tools at all. I am not sure how they tackle tricky performance or memory issues. But I do think the answer is: they do not. If performance problems are tackled by “normal” devs only on demand by management because customers complained about crappy performance, they will only be able to iron out the big performance bottlenecks. The medium and small ones will be left as an exercise for the next dev. There may still be big issues in the product, but they will not be able to analyze them down to the root cause. Part of the issue was that there was no good tool available to analyze managed and unmanaged code at the same time.

The tool problem is gone now.

There is no longer an excuse to say: I do not know how to dig deeper.

But there is another excuse left: I do not know how to read kernel stacks…

Now you can dig from your process and thread into the kernel and back to another process in another thread with a few clicks. Yes, this tool is complex, but so is the software we write today. And there are times when all these nice abstractions of a managed execution environment, automated garbage collection, dynamic code generation, kernel mode transitions and OS thread scheduling need to become white boxes to see what is actually going on. Some problems are only solvable when you have such detailed information at hand.

posted on Friday, December 28, 2012 12:38 PM