I first started blogging about latches and some of the deeper parts of SQL Server internals last year (see Advanced performance troubleshooting: waits, latches, spinlocks) and now I'd like to pick up that thread (no scheduling pun intended :-)) and blog some more about some of the common latches that could be a performance bottleneck.

To that end, I've got some code below (plus example output) that will show the most common latch waits that have occurred on your system.

WITH Latches AS
    (SELECT
        latch_class,
        wait_time_ms / 1000.0 AS WaitS,
        waiting_requests_count AS WaitCount,
        100.0 * wait_time_ms / SUM (wait_time_ms) OVER() AS Percentage,
        ROW_NUMBER() OVER(ORDER BY wait_time_ms DESC) AS RowNum
    FROM sys.dm_os_latch_stats
    WHERE latch_class NOT IN (
        'BUFFER')
    AND wait_time_ms > 0

    )
SELECT
    W1.latch_class AS LatchClass,
    CAST (W1.WaitS AS DECIMAL(14, 2)) AS Wait_S,
    W1.WaitCount AS WaitCount,
    CAST (W1.Percentage AS DECIMAL(14, 2)) AS Percentage,
    CAST ((W1.WaitS / W1.WaitCount) AS DECIMAL (14, 4)) AS AvgWait_S
FROM Latches AS W1
INNER JOIN Latches AS W2
    ON W2.RowNum <= W1.RowNum
WHERE W1.WaitCount > 0
GROUP BY W1.RowNum, W1.latch_class, W1.WaitS, W1.WaitCount, W1.Percentage
HAVING SUM (W2.Percentage) - W1.Percentage < 95; -- percentage threshold
GO

LatchClass                        Wait_S  WaitCount  Percentage  AvgWait_S
--------------------------------- ------- ---------- ----------- ----------
LOG_MANAGER                       221.43  4659       45.81       0.0475
ACCESS_METHODS_HOBT_VIRTUAL_ROOT  199.56  7017       41.28       0.0284
FGCB_ADD_REMOVE                   35.17   1047       7.27        0.0336
DBCC_OBJECT_METADATA              26.85   256490     5.55        0.0001

I'd like you to run the code and send me the output (either as a comment or in email). I'll collate all your output and do some blogging for your enjoyment.

Thanks!

This is a performance tuning post that's been on my to-do list for quite a while. Wait stats analysis is a great way of looking at the symptoms of performance problems (see my Wait Stats category for more posts on this) but using the sys.dm_os_wait_stats DMV shows everything that's happening on a server. If you want to see what waits are occuring because of a single query or operation on a production system (e.g. the effect of MAXDOP hints on number and length of CXPACKET waits for a query) then using the DMV isn't usually feasible - you'd need to clear the wait stats and then have nothing else running on the system except the query/operation you want to investigate. The way to do it is with extended events.

Whenever a thread state changes to SUSPENDED because of a wait, the wait_info event from the sqlos event package fires to mark the beginning of the wait. When the thread gets back on the CPU again, the wait_info event fires again, noting the total wait time and the signal wait time (spent on the RUNNABLE queue). Just like with the DMV, we need to calculate the resource wait time (spent being SUSPENDED) ourselves.

Here's a simple example for you - imagine we want to investigate the waits that occur during an index rebuild. First thing is to create the event session:

-- Drop the session if it exists.
IF EXISTS (SELECT * FROM sys.server_event_sessions WHERE name = 'MonitorWaits')
    DROP EVENT SESSION MonitorWaits ON SERVER
GO

CREATE EVENT SESSION MonitorWaits ON SERVER
ADD EVENT sqlos.wait_info
    (WHERE sqlserver.session_id = 53 /* session_id of connection to monitor */)
ADD TARGET package0.asynchronous_file_target
    (SET FILENAME = N'C:\SQLskills\EE_WaitStats.xel',
    METADATAFILE = N'C:\SQLskills\EE_WaitStats.xem')
WITH (max_dispatch_latency = 1 seconds);
GO

You'll need to plug in the session ID of the SQL Server connection where you'll run the operation under investigation. I created a dummy production database with a table and some data so I can monitor the waits from rebuilding its clustered index.

You can also filter out any wait types you're not interested in by adding another predicate to the session that filters on the sqlos.wait_type. You'll need to specify numbers, and you can list the mapping between numbers and human-understandable wait types using this code:

SELECT xmv.map_key, xmv.map_value
FROM sys.dm_xe_map_values xmv
JOIN sys.dm_xe_packages xp
    ON xmv.object_package_guid = xp.guid
WHERE xmv.name = 'wait_types'
    AND xp.name = 'sqlos';
GO

Now I turn on the session and run my rebuild.

-- Start the session
ALTER EVENT SESSION MonitorWaits ON SERVER STATE = START;
GO

-- Go do the query

-- Stop the event session
ALTER EVENT SESSION MonitorWaits ON SERVER STATE = STOP;
GO

And then I can see how many events fired:

SELECT COUNT (*)
FROM sys.fn_xe_file_target_read_file
    ('C:\SQLskills\EE_WaitStats*.xel',
    'C:\SQLskills\EE_WaitStats*.xem', null, null);
GO

13324

And then pull them into a temporary table and aggregate them (various ways of doing this, I prefer this one):

-- Create intermediate temp table for raw event data
CREATE TABLE #RawEventData (
    Rowid  INT IDENTITY PRIMARY KEY,
    event_data XML);
 
GO

-- Read the file data into intermediate temp table
INSERT INTO #RawEventData (event_data)
SELECT
    CAST (event_data AS XML) AS event_data
FROM sys.fn_xe_file_target_read_file (
    'C:\SQLskills\EE_WaitStats*.xel',
    'C:\SQLskills\EE_WaitStats*.xem', null, null);
GO

SELECT
    waits.[Wait Type],
    COUNT (*) AS [Wait Count],
    SUM (waits.[Duration]) AS [Total Wait Time (ms)],
    SUM (waits.[Duration]) - SUM (waits.[Signal Duration]) AS [Total Resource Wait Time (ms)],
    SUM (waits.[Signal Duration]) AS [Total Signal Wait Time (ms)]
FROM
    (SELECT
        event_data.value ('(
/event/@timestamp)[1]', 'DATETIME') AS [Time],
        event_data.value ('(/event/data[@name=''wait_type'']/text)[1]', 'VARCHAR(100)') AS [Wait Type],
        event_data.value ('(/event/data[@name=''opcode'']/text)[1]', 'VARCHAR(100)') AS [Op],
        event_data.value ('(/event/data[@name=''duration'']/value)[1]', 'BIGINT') AS [Duration],
        event_data.value ('(/event/data[@name=''signal_duration'']/value)[1]', 'BIGINT') AS [Signal Duration]
     FROM #RawEventData
    ) AS waits
WHERE waits.[op] = 'End'
GROUP BY waits.[Wait Type]
ORDER BY [Total Wait Time (ms)] DESC;
GO

-- Cleanup
DROP TABLE #RawEventData;
GO

Wait Type             Wait Count  Total Wait Time (ms) Total Resource Wait Time (ms) Total Signal Wait Time (ms)
--------------------- ----------- -------------------- ----------------------------- ---------------------------
LATCH_EX              4919        15783                15693                         90
LATCH_SH              18          4399                 4398                          1
CXPACKET              16          3039                 3038                          1
PAGEIOLATCH_SH        813         468                  457                           11
PAGELATCH_SH          246         113                  109                           4
WRITELOG              99          71                   66                            5
PAGELATCH_UP          203         65                   63                            2
LOGBUFFER             123         58                   57                            1
WRITE_COMPLETION      82          55                   55                            0
PAGEIOLATCH_UP        35          27                   27                            0
IO_COMPLETION         80          25                   25                            0
SQLTRACE_LOCK         1           6                    6                             0
PAGEIOLATCH_EX        4           3                    3                             0
CMEMTHREAD            17          1                    0                             1
SOS_SCHEDULER_YIELD   6           0                    0                             0

You can easily change this to give average times and add filters to remove uninteresting waits.

If you're interested in getting more involved with extended events, check out Jonathan's excellent series from last December.

Happy waiting!

Over the last few months I've been blogging occasionally about some pretty deep performance tuning topics, namely latches and spinlocks (see my blog categories Wait Stats, Latches, and Spinlocks).

Ewan Fairweather and Mike Ruthruff of the SQLCAT team have written some really excellent whitepapers on interpreting and dealing with latch and spinlock issues, which I helped tech review, and they've now been published.

I strongly recommend reading these whitepapers if you want to improve your perf tuning skills beyond looking at wait statistics.

You can get them at:

Enjoy!

A month ago I kicked off a survey about MAXDOP setting - see here for the survey. I received results for 700 servers around the world! Here they are:

 

The X-axis format is X-Y-Z, where X = number of cores, Y = number of NUMA nodes, Z = MAXDOP setting. I didn't differentiate between hyper-threading or not, or soft vs. hard NUMA.

The results are striking - 75% of all systems out there use a server MAXDOP of zero. Now, this doesn't show whether individual queries are using MAXDOP to override that, or resource governor in 2008+ to override that either and I don't have info on the workload for all these servers - so it's not a result we can draw any concrete conclusions from. However, I do find it interesting that such a high proportion are running fine with MAXDOP 0 - my expectation was that there would be a higher proportion of servers with a non-zero MAXDOP setting.

There are quite a few 'black and white' configuration rules out there - for instance:

  1. Set MAXDOP to 1 if you're seeing CXPACKET waits as the prevalent wait type.
  2. Set MAXDOP to 1 for OLTP systems, and don't do anything else.
  3. Old Microsoft guidance to set MAXDOP to half the number of physical processors.
  4. Set MAXDOP to the number of cores in the NUMA node.

These are all incorrect as *rules*. There is no one-size-fits-all rule for MAXDOP - there are only general guidelines. For instance:

  • For OLTP systems, it can often be beneficial to set MAXDOP to 1 and then use the MAXDOP query hint to override the server-wide setting for queries that can benefit from parallelism.
  • For mixed-workload systems, you need to be careful how you set MAXDOP so you don't inadvertently penalize one of the workloads. Judicious use of the MAXDOP query hint can help here. A more powerful solution for mixed workloads is to use resource governor and have a workload group for each portion of the workload, with a different MAXDOP for each workload group.
  • For systems with high CXPACKET waits, investigate why this is the case before reducing MAXDOP. It's easy to come up with a demo where there are lots of CXPACKET waits, and while reducing MAXDOP (for the server or the query) reduces the CXPACKET waits, it also makes the query take a lot longer. CXPACKET waits can be because the statistics are incorrect and the query execution system divides up the work by the out-of-date statistics
  • Consider using the cost threshold for parallelism setting - see Jonathan's recent post here.

Using the resource governor as I described above can be a very easy way to mess around with the MAXDOP setting - especially for applications with legacy code that you cannot change, and you don't want to set a server-wide MAXDOP setting. This even works if the legacy code uses MAXDOP query hints, because the resource governor workload group MAXDOP setting *cannot* be overridden.

The key point when making any change for performance tuning is to test the change before putting it into production and work out which setting works best for your workload on your system - rather than blindly following 'this is the best way' rules for settings that people publish.

In other words, what should your MAXDOP be set to? It depends! :-)

One thing to be aware of: if you change the MAXDOP setting for the server, it will flush the plan cache when you do. It shouldn't, but it does. Be careful when doing this on a production server...

Thanks to all those who replied to the original survey!

It's been a long time since the last blog post on SSD benchmarking - I've been busy! I'm starting up my benchmarking activities again and hope to post more frequently. You can see the whole progression of benchmarking posts here.

You can see my benchmarking hardware setup here, with the addition of the Fusion-io ioDrive Duo 640GB drives that Fusion-io were nice enough to lend me. My test systems now have 16GB each and all tests were performed with the buffer pool ramped up, so memory allocation didn't figure into the performance numbers.

In this recent set of tests I wanted to explore three questions:

  1. What kind of performance gain do I get upgrading from Fusion-io's v1.2 driver to the v2.2 driver?
  2. What is the sweet spot for the number of files on an SSD?
  3. Does a 4Kb block size give any gains in performance for my test?

To keep it simple I'm using one half of the 640GB drive (it's two 320GB drives under the covers). To do this my test harness does the following:

  • Formats the SSDs in one of three ways:
    • Fusion-io basic format (the 320GB drive has 300GB capacity)
    • Fusion-io improved write performance format (the 320GB drive has only 210GB capacity, 70% of normal)
    • Fusion-io maximum write performance format (the 320GB drive has only 151GB capacity, 50% of normal)
  • The SSD format is performed using Fusion-io's ioManager tool
  • Creates 1, 2, 4, 8, 16, 32, 64 or 128 data files, with the file sizes calculated to fill the SSDs
  • My table structure is:

CREATE TABLE MyBigTable (
    c1 UNIQUEIDENTIFIER ROWGUIDCOL DEFAULT NEWID (),
    c2 DATETIME DEFAULT GETDATE (),
    c3 CHAR (111) DEFAULT 'a',
    c4 INT DEFAULT 1,
    c5 INT DEFAULT 2,
    c6 BIGINT DEFAULT 42); 
GO

CREATE CLUSTERED INDEX MyBigTable_cl ON MyBigTable (c1);
GO

  • I have 64 connections each inserting 2.5 million records into the table (with the loop code running server-side) for a total of 160 million records inserted, in batches of 1000 records per transaction. This works out to be about 37 GB of allocated space from the database.

The clustered index on a random GUID is the easiest way to generate random reads and writes, and is a very common design pattern out in the field (even though it performs poorly) - for my purposes it's perfect.

I tested each of the eight data file layouts on the following configurations (all using 1MB partition offsets, 64k NTFS allocation unit size, RAID was not involved):

  • Log and data on a single 320GB SSD with the old v1.2.7 Fusion-io driver (each of the 3 ways of formatting)
  • Log and data on a single 320GB SSD with the new v2.2.3 Fusion-io driver (each of the 3 ways of formatting)
  • Log and data on a single 320GB SSD with the new v2.2.3 Fusion-io driver and a 4Kb block size (each of the 3 ways of formatting)
That's a total of 9 configurations, with 8 data file layouts in each configuration - making 72 separate configurations. I ran each test 5 times and then took an average of the results - so altogether I ran 360 tests, for a cumulative test time of just over 1.43 million seconds (16.5 days) during April.

The test harness takes care of all of this except reformatting the drives, and also captures the wait stats for each test, making note of the most prevalent waits that make up the top 95% of all waits during the test. This uses the script from this blog post.

On to the results... the wait stats are *really* interesting!

Note: the y-axes in the graphs below do not start at zero. All the graphs have the same axes so there is nothing to misunderstand. They are not misleading - and I make no apologies for my choice of axes - I want to show the difference between the various formats more clearly.

v1.2.7 Fusion-io driver and 512-byte sector size

 

The best performance I could get with the old driver was 3580 seconds for test completion, with 4 data files and the Improved Write format - a tiny amount less than the time for 8 data files.

v2.2.3 Fusion-io driver and 512-byte sector size

 

The best performance I could get with the new driver was 2993 seconds for test completion, with 8 data files and the Max Write format - a tiny amount less than the time for other formats for 8 data files, and very close to the times for 4 data files.

On average across all the tests the new v2.2 Fusion-io driver gives a 20.5% performance boost over the old v1.2 driver, and for the regular format, the new v2.2 Fusion-io driver gives a 24% performance boost over the old v1.2 driver. It also (but I didn't measure) reduces the amount of system memory required to use the SSDs. Good stuff!

v2.2.3 Fusion-io driver and 4-Kb sector size

 

The performance using a 4-Kb sector size is roughly the same for my test compared to traditional 512-byte sector size. The most performance gain I got was 3% over using a 512-byte sector size, but on average across all tests the performance was very slightly (0.5%) lower.

Wait Stats

The wait stats were very interesting.

The wait stats are presented in the following format:

WaitType        Wait_S     Resource_S  Signal_S  WaitCount  Percentage  AvgWait_S  AvgRes_S  AvgSig_S
--------------  ---------  ----------  --------  ---------  ----------  ---------  --------  --------
PAGEIOLATCH_EX  154611.39  128056.51   26554.88  45295507   71.83       0.0034     0.0028    0.0006
PAGELATCH_UP    37948.31   36988.52    959.79    2314370    17.63       0.0164     0.016     0.0004
PAGELATCH_SH    16976      13823.71    3152.3    3751811    7.89        0.0045     0.0037    0.0008

The columns are:

  • WaitType - kind of obvious
  • Wait_S - cumulative wait time in seconds, from a thread being RUNNING, going through SUSPENDED, back to RUNNABLE and then RUNNING again
  • Resource_S - cumulative wait time in seconds while a thread was SUSPENDED (called the resource wait time)
  • Signal_S - cumulative wait time in seconds while a thread was RUNNABLE (i.e. after being signalled that the resource wait has ended and waiting on the runnable queue to get the CPU again - called the signal wait time)
  • WaitCount - number of waits of this type during the test
  • Percentage - percentage of all waits during the test that had this type
  • AvgWait_S - average cumulative wait time in seconds
  • AvgRes_S - average resource wait time in seconds
  • AvgSig_S - average signal wait time in seconds

For a single file, the wait stats for all the various formatting options look like:

WaitType        Wait_S     Resource_S  Signal_S  WaitCount  Percentage  AvgWait_S  AvgRes_S  AvgSig_S
--------------  ---------  ----------  --------  ---------  ----------  ---------  --------  --------
PAGEIOLATCH_EX  226362.54  193378.56   32983.98  45742117   78.66       0.0049     0.0042    0.0007
PAGELATCH_UP    36701.66   35760.67    940.99    2144533    12.75       0.0171     0.0167    0.0004
PAGELATCH_SH    16775.05   13644.99    3130.06   3542549    5.83        0.0047     0.0039    0.0009

With more files, the percentage of PAGEIOLATCH_EX waits increases, and by the time we get to 8 files, SOS_SCHEDULER_YIELD has started to appear. At 8 files, the wait stats for all the various formatting options look like:

WaitType             Wait_S     Resource_S  Signal_S  WaitCount  Percentage  AvgWait_S  AvgRes_S  AvgSig_S
-------------------  ---------  ----------  --------  ---------  ----------  ---------  --------  --------
PAGEIOLATCH_EX       244703.77  210859.63   33844.14  45169863   89.47       0.0054     0.0047    0.0007
SOS_SCHEDULER_YIELD  12500.15   0.99        12499.16  823658     4.57        0.0152     0         0.0152
PAGELATCH_SH         5777.54    3749.18     2028.36   476618     2.11        0.0121     0.0079    0.0043

By 16 files, the PAGELATCH waits have disappeared from the top 95%. As the number of files increases to 128, the PAGEIOLATCH_EX waits increase to just over 91% and the wait stats look like this for regular format:

WaitType             Wait_S     Resource_S  Signal_S  WaitCount  Percentage  AvgWait_S  AvgRes_S  AvgSig_S
-------------------  ---------  ----------  --------  ---------  ----------  ---------  --------  --------
PAGEIOLATCH_EX       304106.32  273671      30435.32  43478489   91.25       0.007      0.0063    0.0007
SOS_SCHEDULER_YIELD  16733      1.35        16731.65  889147     5.02        0.0188     0         0.0188

What does this mean? It's obvious from the wait stats that as I increase the number of data files on the drive, the average resource wait time for each PAGEIOLATCH_EX wait increases from 4.2ms for 1 file up to 6.3ms for 128 files - 50% worse, with the signal wait time static at 0.7ms.

But look at the wait stats for 128 files using the Maximum Write format:

WaitType      Wait_S Resource_S  Signal_s  WaitCount  Percentage  AvgWait_S  AvgRes_S  AvgSig_S
-------------------  ---------  ----------  --------  ---------  ----------  ---------  --------  --------
PAGEIOLATCH_EX      196322.97 166602.88   29720.09  45435521  88.66      0.0043 0.0037   0.0007
SOS_SCHEDULER_YIELD  16182.48 1.1     16181.38  828595  7.31      0.0195 0   0.0195

The average resource wait time for the PAGEIOLATCH_EX waits has dropped from 6.3ms to 3.7ms! But isn't PAGEIOLATCH_EX a wait type that's for a page *read*? Well, yes, but what I think is happening is that the buffer pool is having to force pages out to disk to make space for the pages being read in to be inserted into (which I believe is included in the PAGEIOLATCH_EX wait time) - and when the SSD is formatted with the improved write algorithm, this is faster and so the PAGEIOLATCH_EX resource wait time decreases.

But why the gradual decrease in PAGELATCH waits and increase in SOS_SCHEDULER_YIELD waits as the number of files increases?

I went back and ran a single file test and used the sys.dm_os_waiting_tasks DMV (see this blog post) to see what the various threads are waiting for. Here's some example output:

session_id  wait_duration_ms  wait_type     resource_description
----------  ----------------  ------------  --------------------
79          11                PAGELATCH_UP  5:1:1520544
80          14                PAGELATCH_UP  5:1:1520544
93          16                PAGELATCH_UP  5:1:1520544
94          0                 PAGELATCH_UP  5:1:1520544
101         15                PAGELATCH_UP  5:1:1520544
111         25                PAGELATCH_UP  5:1:1520544
110         17                PAGELATCH_UP  5:1:1520544
75          6                 PAGELATCH_UP  5:1:1520544
78          17                PAGELATCH_UP  5:1:1520544
88          8                 PAGELATCH_UP  5:1:1520544
107         12                PAGELATCH_UP  5:1:1520544
109         25                PAGELATCH_UP  5:1:1520544
113         20                PAGELATCH_UP  5:1:1520544
.
.

Dividing 1520544 by 8088 gives exactly 188. Running the same query a few seconds later gives most of the threads waiting on resource 5:1:1544808, another exact multiple of 8088. These resources are PFS pages! What we're seeing is PFS page contention, just like you can get in tempdb with lots of concurrent threads creating and dropping temp tables. In this case, I have 64 concurrent threads doing inserts that are causing page splits, which requires page allocations. As the number of files increases, the amount of PFS page contention decreases. It disappears after 8 files because I've only got 8 cores, so there can only be 8 threads running at once (one per SQLOS scheduler, with the others SUSPENDED on the waiter list or waiting in the RUNNABLE queue).

From 16 to 128 files, the wait stats hardly change and the performance (in the Improved Write and Max Write formats) only slightly degrades (5%) with each doubling of the number of files. Without deeper investigation, I'm putting this down to increased amounts of metadata to deal with - maybe with more to do when searching the allocation cache for the allocation unit of the clustered index. If I have time I'll dig in and investigate exactly why.

The SOS_SCHEDULER_YIELD waits are just because the threads are able to do more before having to wait, and so they're hitting voluntary yield points in the code - the workload is becoming more CPU bound.

Summary

I've clearly shown that the new Fusion-io driver gives a nice boost compared to the older one - very cool.

I've also shown that the number of files on the SSD does have an affect on performance too - with the sweet spot appearing to be the number of processor cores. I'd love to see someone do similar tests on a 16-way, 32-way or higher (or lend me one to play with :-)

[Edit: I discussed the results with my friend Thomas Kejser on the SQLCAT team and he sees the same behavior on a 64-way running the same benchmark (in fact we screen-shared on a 64-way system with 2TB and 4 640GB Fusion-io cards this weekend). He posted some more investigations on his blog - see here.]

And finally, I showed that for my workload, using a 4-Kb sector size did not improve performance.

I'd call that a successful test - *really* interesting!

Continuing my series on advanced performance troubleshooting - see these two posts for the scripts I'll be using and an introduction to the series:

In this blog post I'd like to show you an example of SOS_SCHEDULER_YIELD waits occurring and how it can seem like a spinlock is the cause.

I originally published this blog post and then had a discussion with good friend Bob Ward from Product Support who questioned my conclusions given what he's seen (thanks Bob!). After digging in further, I found that my original post was incorrect, so this is the corrected version.

The SOS_SCHEDULER_YIELD wait means that an executing thread voluntarily gave up the CPU to allow other threads to execute. The SQL Server code is sprinkled with "voluntary yields" in places where high CPU usage may occur.

One such place where a thread will sleep but not explicitly yield is when backing off after a spinlock collision waiting to see if it can get access to the spinlock. A spinlock is a very lightweight synchronization mechanism deep inside SQL Server that protects access to a data structure (not database data itself). See the end of the second blog post above for more of an explanation on spinlocks.

When a thread voluntarily yields, it does not go onto the waiting tasks list - as it's not waiting for anything - but instead goes to the bottom of the RUNNABLE queue. SOS_SCHEDULER_YIELD waits by themselves are not cause for concern unless they are the majority of waits on the system, and performance is suffering.

To set up the test, I'll create a simple database and table:

USE master;
GO
DROP DATABASE YieldTest;
GO

CREATE DATABASE YieldTest;
GO

USE YieldTest;
GO

CREATE TABLE SampleTable (c1 INT IDENTITY);
GO
CREATE NONCLUSTERED INDEX SampleTable_NC ON SampleTable (c1);
GO

SET NOCOUNT ON;
GO
INSERT INTO SampleTable DEFAULT VALUES;
GO 100

Then I'll clear out wait stats and latch stats:

DBCC SQLPERF ('sys.dm_os_latch_stats', CLEAR);
GO

DBCC SQLPERF ('sys.dm_os_wait_stats', CLEAR);
GO

And then fire up 50 clients running the following code (I just have a CMD script that fires up 50 CMD windows, each running the T-SQL code):

USE YieldTest;
GO

SET NOCOUNT ON;
GO

DECLARE @a int
WHILE (1=1)
BEGIN 
    SELECT @a = COUNT (*) FROM YieldTest..SampleTable WHERE c1 = 1;
END;
GO

And the CPUs in my laptop are jammed solid (snapped from my desktop and rotated 90 degrees to save space):

 

Wow!

Looking at perfmon:

 

The CPU usage is not Windows (% User Time), and of all the counters I usually monitor (I've cut them all out here for clarity) I can see a sustained very high Lock Requests/sec for Object and Page locks (almost 950 thousand requests per second for both types! Gotta love my laptop :-)

So let's dig in. First off, looking at wait stats (using the script in the wait stats post referenced above):

WaitType            Wait_S  Resource_S Signal_S WaitCount Percentage AvgWait_S AvgRes_S AvgSig_S
------------------- ------- ---------- -------- --------- ---------- --------- -------- --------
SOS_SCHEDULER_YIELD 4574.77 0.20       4574.57  206473    99.43      0.0222    0.0000   0.0222

SOS_SCHEDULER_YIELD at almost 100% of the waits on the system means that I've got CPU pressure - as I saw from the CPU graphs above. The fact that nothing else is showing up makes me suspect this is a spinlock issue.

Checking in the sys.dm_os_waiting_tasks output (see script in the second blog post referenced above), I see nothing waiting, and if I refresh a few times I see the occasional ASYNC_NETWORK_IO and/or PREEMPTIVE_OS_WAITFORSINGLEOBJECT wait type - which I'd expect from the CMD windows running the T-SQL code.

Checking the latch stats (again, see script in the second blog post referenced above) shows no latch waits apart from a few BUFFER latch waits.

So now let's look at spinlocks. First off I'm going to take a snapshot of the spinlocks on the system:

-- Baseline
DROP TABLE ##TempSpinlockStats1;
GO
SELECT * INTO ##TempSpinlockStats1 FROM sys.dm_os_spinlock_stats
WHERE [collisions] > 0
ORDER BY [name];
GO

(On 2005 you'll need to use DBCC SQLPERF ('spinlockstats') and use INSERT/EXEC to get the results into a table - see the post above for example code.)

Then wait 10 seconds or so for the workload to continue... and then grab another snapshot of the spinlocks:

-- Capture updated stats
DROP TABLE ##TempSpinlockStats2;
GO
SELECT * INTO ##TempSpinlockStats2 FROM sys.dm_os_spinlock_stats
WHERE [collisions] > 0
ORDER BY [name];
GO

Now running the code I came up with to show the difference between the two snapshots:

-- Diff them
SELECT
    '***' AS [New],
    ts2.[name] AS [Spinlock],
    ts2.[collisions] AS [DiffCollisions],
    ts2.[spins] AS [DiffSpins],
    ts2.[spins_per_collision] AS [SpinsPerCollision],
    ts2.[sleep_time] AS [DiffSleepTime],
    ts2.[backoffs] AS [DiffBackoffs]
FROM ##TempSpinlockStats2 ts2
LEFT OUTER JOIN ##TempSpinlockStats1 ts1
    ON ts2.[name] = ts1.[name]
WHERE ts1.[name] IS NULL
UNION
SELECT
    '' AS [New],
    ts2.[name] AS [Spinlock],
    ts2.[collisions] - ts1.[collisions] AS [DiffCollisions],
    ts2.[spins] - ts1.[spins] AS [DiffSpins],
    CASE (ts2.[spins] - ts1.[spins]) WHEN 0 THEN 0
        ELSE (ts2.[spins] - ts1.[spins]) / -- > 0 spins = > 0 collisions
        (ts2.[collisions] - ts1.[collisions]) END AS [SpinsPerCollision],
    ts2.[sleep_time] - ts1.[sleep_time] AS [DiffSleepTime],
    ts2.[backoffs] - ts1.[backoffs] AS [DiffBackoffs]
    --, ts2.*
FROM ##TempSpinlockStats2 ts2
LEFT OUTER JOIN ##TempSpinlockStats1 ts1
    ON ts2.[name] = ts1.[name]
WHERE ts1.[name] IS NOT NULL
    AND ts2.[collisions] - ts1.[collisions] > 0
ORDER BY [New] DESC, [Spinlock] ASC;
GO

New  Spinlock                        DiffCollisions DiffSpins  SpinsPerCollision DiffSleepTime DiffBackoffs
---- ------------------------------- -------------- ---------- ----------------- ------------- ------------
     LOCK_HASH                       6191134        4005774890 647               686           1601383
     OPT_IDX_STATS                   1164849        126549245  108               57            7555
     SOS_OBJECT_STORE                73             305        4                 0             0
     SOS_WAITABLE_ADDRESS_HASHBUCKET 115            44495      386               0             3
     XDESMGR                         1              0          0                 0             0

This is telling me that there is massive contention for the LOCK_HASH spinlock, further backed up by the SOS_WAITABLE_ADDRESS_HASHBUCKET spinlock. The LOCK_HASH spinlock protects access to the hash buckets used by the lock manager to efficiently keep track of the lock resources for locks held inside SQL Server and to allow efficient searching for lock hash collisions (i.e. does someone hold the lock we want in an incompatible mode?). In this case the contention is so bad that instead of just spinning, the threads are actually backing off and letting other threads execute to allow progress to be made.

And that makes perfect sense because of what my workload is doing - 50 concurrent connections all trying to read the same row on the same page in the same nonclustered index.

But is that the cause of the SOS_SCHEDULER_YIELDs? To prove it one way or the other, I created an XEvent session that would capture callstacks when a wait occcured:

CREATE EVENT SESSION MonitorWaits ON SERVER
ADD EVENT sqlos.wait_info
 (ACTION (package0.callstack))
      ADD TARGET package0.asynchronous_bucketizer (
            SET filtering_event_name = 'sqlos.wait_info',
            source_type = 1, source = 'package0.callstack')        
      WITH (MAX_MEMORY=50MB, MEMORY_PARTITION_MODE = PER_NODE,
      max_dispatch_latency=5 seconds)
GO

I also made sure to have the correct symbol files in the \binn directory (see How to download a sqlservr.pdb symbol file). After running the workload and examining the callstacks, I found the majority of the waits were coming from voluntary yields deep in the Access Methods code. An example call stack is:

IndexPageManager::GetPageWithKey+ef [ @ 0+0x0
GetRowForKeyValue+146 [ @ 0+0x0
IndexRowScanner::EstablishInitialKeyOrderPosition+10a [ @ 0+0x0
IndexDataSetSession::GetNextRowValuesInternal+7d7 [ @ 0+0x0
RowsetNewSS::FetchNextRow+12a [ @ 0+0x0
CQScanRangeNew::GetRow+6a1 [ @ 0+0x0
CQScanCountStarNew::GetRowHelper+44 [ @ 0+0x0
CQScanStreamAggregateNew::Open+70 [ @ 0+0x0
CQueryScan::Uncache+32f [ @ 0+0x0
CXStmtQuery::SetupQueryScanAndExpression+2a2 [ @ 0+0x0
CXStmtQuery::ErsqExecuteQuery+2f8 [ @ 0+0x0
CMsqlExecContext::ExecuteStmts&lt;1,1&gt;+cca [ @ 0+0x0
CMsqlExecContext::FExecute+58b [ @ 0+0x0
CSQLSource::Execute+319

[Edit: check out how to do this in the spinlock whitepaper

This is clearly (to me) nothing to do with the LOCK_HASH spinlock, so that's a red herring. In this case, I'm just CPU bound. When a thread goes to sleep when backing off from a spinlock, it directly calls Windows Sleep() - so it does not show up as a SQL Server wait type at all, even though the sleep call is made from the SQLOS layer. Which is a bummer.

How to get around that and reduce CPU usage? This is really contrived workload, but this can occur for real. Even if I try using WITH (NOLOCK), the NOLOCK seek will take a table SCH_S (schema-stability) lock to make sure the table structure doesn't change while the seek is occurring, so that only gets rid of the page locks, not the object locks and doesn't help CPU usage. With this (arguably weird) workload, there are a few things I could do (just off the top of my head):

  • Scale-out the connections to a few copies of the database, updated using replication
  • Do some mid-tier or client-side caching, with some notification mechanism of data changes (e.g. a DDL trigger firing a Service Broker message)

So this was an example of where wait stats lead to having to look at spinlock stats, but that the spinlock, on even *deeper* investigation, wasn't the issue at all. Cool stuff.

Next time we'll look at another cause of SOS_SCHEDULER_YIELD waits.

Hope you enjoyed this - let me know your thoughts!

Cheers

PS In the latest SQLskills Insider Quick Tips newsletter from last week, I did a video demo of looking at transaction log IOs and waits - you can get it here.

It's all very well having whizz-bang 3rd-party performance monitoring and troubleshooting tools, but sometimes you have to get deeper into what's going on with SQL Server than any of these tools can go. Or you have to call Customer Support or Premier Support so *they* can dive in deeper.

Typically you or they are going to make use of four DMVs that give increasingly advanced information about what's going on for use in performance troubleshooting:

A few weeks ago I kicked off a survey to find out whether you've heard of or used these DMVs - see here for the survey.

In this post I'm going to present the survey results and explain a bit about these DMVs, focusing the most attention on latches and spinlocks. This started out as a small post but grew into a 10-page, 2500 word article :-)

Here are the results (in each of the Other values, a few people asked what DMVs are - see Dynamic Management Views and Functions in BOL):

 

Other values:

  • 12 x Yes, more than once or twice but not routinely.
  • 2 x Only because of your wait statistics post.

The survey results are not surprising, especially amongst readers of my blog.

Wait statistics are the bread-and-butter of performance tuning. SQL Server is keeping track of what resources threads need to wait for, and how long they need to wait. By analyzing which resources (and combinations of resource) are being waited for the most, you can get an idea of where to start digging in further. An example might be that if most of the waits are PAGEIOLATCH_SH waits, and this wasn't the case in your wait stats baseline, you might look at the I/O subsystem performance using the sys.dm_io_virtual_file_stats DMV (which I blogged about here).

Last December I wrote a long blog post introducing wait statistics, showing how to use the sys.dm_os_wait_stats DMV, giving links to resources, and explaining the most common ones that people see in the field based on data from more than 1800 SQL Servers - see Wait statistics, or please tell me where it hurts.

 

Other values:

  • 6 x Yes, More than once or twice but not routinely.

I'm surprised that these results don't tie in more closely with the results for sys.dm_os_wait_stats, but they're reasonably close.

The sys.dm_os_waiting_tasks DMV shows you what is currently being waited on by everything running on the system.

I created a scenario with 200 clients creating and dropping small temp tables to creat tempdb latch contention. Using the DMV, I can see what's being waited on (I've removed the columns describing blocking from the output in this case to make it fit on screen)

SELECT * FROM sys.dm_os_waiting_tasks;
GO

waiting_task_address session_id exec_context_id wait_duration_ms     wait_type          resource_address   resource_description
-------------------- ---------- --------------- -------------------- ------------------ ------------------ --------------------
0x000000000044E508   2          0               4091305              XE_DISPATCHER_WAIT NULL               NULL
0x000000000044E988   9          0               4121252              FSAGENT            NULL               NULL
0x000000000044EBC8   20         0               4121251              BROKER_TRANSMITTER NULL               NULL
0x000000000044F4C8   63         0               53                   PAGELATCH_EX       0x0000000088FFEED8 2:1:1139
0x000000000044EE08   64         0               8                    PAGELATCH_UP       0x0000000080FE8AD8 2:1:1
0x000000000044F288   87         0               0                    PAGELATCH_UP       0x0000000080FE8AD8 2:1:1
0x000000000044F048   91         0               53                   PAGELATCH_EX       0x0000000088FFEED8 2:1:1139
0x000000000044F948   92         0               61                   PAGELATCH_EX       0x0000000088FFEED8 2:1:1139
0x000000000044F708   101        0               10                   PAGELATCH_EX       0x0000000080FEEC58 2:1:120
0x000000000044FDC8   103        0               37                   PAGELATCH_UP       0x0000000080FE8AD8 2:1:1
0x000000008744E088   118        0               11                   PAGELATCH_EX       0x0000000080FEEC58 2:1:120
0x000000008744E2C8   121        0               66                   PAGELATCH_UP       0x0000000080FE8AD8 2:1:1
0x000000008744E508   122        0               33                   PAGELATCH_EX       0x0000000080FEEC58 2:1:120
0x000000008744E748   155        0               32                   PAGELATCH_EX       0x0000000080FEEC58 2:1:120
0x000000008744E988   158        0               27                   PAGELATCH_EX       0x0000000080FEEC58 2:1:120
0x000000008744EBC8   163        0               34                   PAGELATCH_EX       0x0000000080FEEC58 2:1:120
0x000000008744EE08   168        0               66                   PAGELATCH_UP       0x0000000080FE8AD8 2:1:1
0x000000008744F048   179        0               26                   PAGELATCH_UP       0x0000000080FE8AD8 2:1:1
.
.

As you can see, the classic tempdb latch contention is showing - page ID (2:1:1) - the first PFS page in tempdb. (See here for more on tempdb contention, and here for more on PFS pages.)

Joe Sack (blog|twitter) created a script that pulls in data from a bunch of other DMVs to make the sys.dm_os_waiting_tasks output more useful, which I've modified into the following:

SELECT
    owt.session_id,
    owt.wait_duration_ms,
    owt.wait_type,
    owt.blocking_session_id,
    owt.resource_description,
    es.program_name,
    est.text,
    est.dbid,
    eqp.query_plan,
    es.cpu_time,
    es.memory_usage
FROM sys.dm_os_waiting_tasks owt
INNER JOIN sys.dm_exec_sessions es
    ON owt.session_id = es.session_id
INNER JOIN sys.dm_exec_requests er
    ON es.session_id = er.session_id
OUTER APPLY sys.dm_exec_sql_text (er.sql_handle) est
OUTER APPLY sys.dm_exec_query_plan (er.plan_handle) eqp
WHERE es.is_user_process = 1;
GO

There's too much information in the output to usefully show in this post, but I can see the actual T-SQL statements being run (in this case a lot of DROP TABLE and SELECT * INTO of global temp tables) and the XML query plans. Clicking on one of them in 2008 SSMS gives me the actual plan - very cool:

 

This means I can see from the sys.dm_os_wait_stats DMV what the prevalent resource waits are, then use the sys.dm_os_waiting_tasks DMV to see which queries are waiting for those resources - and then dive in deeper to see why.

 

As I suspected most readers have heard of latches, but 75% of respondents haven't used the DMV or have only used it once or twice.

A latch is a lightweight syncronization mechanism that protects access to read and change in-memory structures - for instance, 8KB page buffers in the buffer pool (latch type = BUFFER), or the data structure that represents a database's data and log files (latch type = FGCB_ADD_REMOVE). A latch is only held for the duration of the operation, unlike a lock which may be held until a transaction commits. One example of locks and latches - imagine a table where an update query has caused lock escalation so that a table X lock is held on the table. As the query continues updating more records in the table, it won't acquire any more locks, but any data and index pages that are updated in memory must be EX (exclusive) latched before the update can occur. The latch acts as the synchronization mechanism to prevent two threads updating the page at the same time, or a thread reading the page while another is in the middle of updating it. Another example is if you run a select query using NOLOCK - although the query will not acquire SH (share) locks at any level in the table, the threads must acquire SH latches on pages before they can be read - to synchronize with possible concurrent updaters.

If a thread requires a latch it will be moved from RUNNING to SUSPENDED and put on the waiter list to await notification that the latch has been acquired in the requested mode.

Latch waits correspond to LATCH_XX waits in the output from the sys.dm_os_wait_stats DMV, so digging into to which latches are accounting for most waits can show where a bottleneck is on the system.

You can reset latch wait statistics just like regular wait statistics using:

DBCC SQLPERF('sys.dm_os_latch_stats', CLEAR);
GO

An example set of output from the DMV is:

SELECT * FROM sys.dm_os_latch_stats
WHERE [waiting_requests_count] > 0
ORDER BY [wait_time_ms] DESC;
GO

latch_class                       waiting_requests_count wait_time_ms         max_wait_time_ms
--------------------------------- ---------------------- -------------------- --------------------
BUFFER                            113181121              466697044            1233
ACCESS_METHODS_HOBT_COUNT         66676                  331193               577
ACCESS_METHODS_HOBT_VIRTUAL_ROOT  15018                  68865                125
LOG_MANAGER                       130                    5610                 234
FGCB_ADD_REMOVE                   299                    5073                 32
TRACE_CONTROLLER                  1                      0                    0
VERSIONING_TRANSACTION_LIST       1                      0                    0
ACCESS_METHODS_HOBT_FACTORY       64                     0                    0

You can also aggregate them in the same way as I described in my big wait stats blog post, using code below:

WITH Latches AS
    (SELECT
        latch_class,
        wait_time_ms / 1000.0 AS WaitS,
        waiting_requests_count AS WaitCount,
        100.0 * wait_time_ms / SUM (wait_time_ms) OVER() AS Percentage,
        ROW_NUMBER() OVER(ORDER BY wait_time_ms DESC) AS RowNum
    FROM sys.dm_os_latch_stats
    WHERE latch_class NOT IN ('BUFFER')
)
SELECT
    W1.latch_class AS LatchClass,
    CAST (W1.WaitS AS DECIMAL(14, 2)) AS Wait_S,
    W1.WaitCount AS WaitCount,
    CAST (W1.Percentage AS DECIMAL(14, 2)) AS Percentage,
    CAST ((W1.WaitS / W1.WaitCount) AS DECIMAL (14, 4)) AS AvgWait_S
FROM Latches AS W1
INNER JOIN Latches AS W2
    ON W2.RowNum <= W1.RowNum
GROUP BY W1.RowNum, W1.latch_class, W1.WaitS, W1.WaitCount, W1.Percentage
HAVING SUM (W2.Percentage) - W1.Percentage < 95; -- percentage threshold
GO

Here's an example after clearing the latch stats and running the tempdb contention test (I described above) for 30 seconds:

LatchClass                        Wait_S  WaitCount  Percentage  AvgWait_S
--------------------------------- ------- ---------- ----------- ----------
ACCESS_METHODS_HOBT_COUNT         7.92    1471       75.41       0.0054
ACCESS_METHODS_HOBT_VIRTUAL_ROOT  1.38    393        13.15       0.0035
LOG_MANAGER                       1.20    12         11.44       0.1001

Most of the latch classes are undocumented, but I'll be shedding light on them as I blog more about latch stats.

 

Other values:

  • 1 x Just learned of this at SQLskills training!
  • 1 x Learned about it, after you tweeted about it on 3/23 to a co-worker.

This is really cool that more than 40% of respondents have never heard of this DMV or spinlocks - education time!

A spinlock is another lightweight synchronization mechanism used to control access to certain data structures in the engine - used when the time that the spinlock will be held is very short. They are different from latches because a thread waiting for a latch will yield the scheduler and go onto the waiter list whereas a thread waiting to acquire a spinlock will burn some CPU "spinning" to see if it can get the CPU before giving up and backing off (yielding the scheduler) before trying again. This may allow another thread to execute that is holding the spinlock and eventually release it, allowing the system to proceed (yes, a thread can yield the scheduler and move to the waiter list while holding a spinlock!) because another thread can then acquire the spinlock.

It is perfectly normal for spinlock collisions and spins to occur on a busy system, but sometimes a bottleneck can occur on systems with larger numbers of CPUs where collisions are more likely - this can drain CPU resources while many threads are spinning trying to acquire the spinlock.

Running the DMV shows you the list of all spinlocks on the system (all of which are undocumented - but I'll be working on that going forward) - here is some partial output:

SELECT * FROM sys.dm_os_spinlock_stats
ORDER BY [spins] DESC;
GO

name               collisions           spins                spins_per_collision sleep_time           backoffs
------------------ -------------------- -------------------- ------------------- -------------------- -----------
LOCK_HASH          3629624              4402099957           1212.825            561                  817819
SOS_CACHESTORE     11992297             3352117666           279.5226            6093                 71948
OPT_IDX_MISS_KEY   63329610             2036811058           32.16207            15830                180845
SOS_TLIST          9769744              574740437            58.82861            320                  3619
SOS_SCHEDULER      2137875              107392996            50.23352            557                  7753
MUTEX              676406               83493780             123.4374            340                  3300
LOGCACHE_ACCESS    2210697              83204315             37.63714            0                    252366
SOS_RW             264489               70122059             265.1228            14                   799
XDESMGR            346005               61031449             176.3889            216                  3788
SOS_SUSPEND_QUEUE  3397384              53752545             15.82174            120                  2384
OPT_IDX_STATS      129814               19800952             152.5332            27                   356
BACKUP_CTX         29730                16784471             564.5635            192                  1645
LOCK_RESOURCE_ID   17558                4363116              248.4973            20                   375
SOS_TASK           206597               1898063              9.187273            16                   171
XVB_LIST           266112               882691               3.316991            1                    63
.
.

[

On 2005 you'll need to use DBCC SQLPERF ('spinlockstats') and use INSERT/EXEC to get the results into a table. Eric Humphrey (blog|twitter) put the code together:

-- Baseline
IF OBJECT_ID('tempdb..#TempSpinlockStats1') IS NOT NULL
DROP TABLE #TempSpinlockStats1;
GO
CREATE TABLE #TempSpinlockStats1(
[name] NVARCHAR(30) NOT NULL,
[collisions] BIGINT NOT NULL,
[spins] BIGINT NOT NULL,
[spins_per_collision] FLOAT NOT NULL,
[sleep_time] BIGINT NOT NULL,
[backoffs] BIGINT NOT NULL
)
INSERT INTO #TempSpinlockStats1
EXEC ('DBCC SQLPERF(''spinlockstats'')')
GO

]

The LOCK_HASH spinlock, for instance, is used by the lock manager to look at one of the hash buckets holding lock resource hashes to tell whether lock can be granted or not.

The sleep_time is an aggregate of how much time is spent sleeping between spin cycles when a backoff occurs.

I've put together some code that will allow you to see what spinlock activity occurs between two times. The code captures the output from the DMV into two temp tables, with whatever time period you want in between (to allow you to run a command), and then shows the difference between the two sets of data. I'll show an example of running DBCC CHECKDB.

-- Baseline
DROP TABLE ##TempSpinlockStats1;
GO
SELECT * INTO ##TempSpinlockStats1 FROM sys.dm_os_spinlock_stats
WHERE [collisions] > 0
ORDER BY [name];
GO

-- Do some stuff
DBCC CHECKDB (SalesDB) WITH NO_INFOMSGS;
GO

-- Capture updated stats
DROP TABLE ##TempSpinlockStats2;
GO
SELECT * INTO ##TempSpinlockStats2 FROM sys.dm_os_spinlock_stats
WHERE [collisions] > 0
ORDER BY [name];
GO

-- Diff them
SELECT
    '***' AS [New],
    ts2.[name] AS [Spinlock],
    ts2.[collisions] AS [DiffCollisions],
    ts2.[spins] AS [DiffSpins],
    ts2.[spins_per_collision] AS [SpinsPerCollision],
    ts2.[sleep_time] AS [DiffSleepTime],
    ts2.[backoffs] AS [DiffBackoffs]
FROM ##TempSpinlockStats2 ts2
LEFT OUTER JOIN ##TempSpinlockStats1 ts1
    ON ts2.[name] = ts1.[name]
WHERE ts1.[name] IS NULL
UNION
SELECT
    '' AS [New],
    ts2.[name] AS [Spinlock],
    ts2.[collisions] - ts1.[collisions] AS [DiffCollisions],
    ts2.[spins] - ts1.[spins] AS [DiffSpins],
    CASE (ts2.[spins] - ts1.[spins]) WHEN 0 THEN 0
        ELSE (ts2.[spins] - ts1.[spins]) / -- > 0 spins = > 0 collisions
        (ts2.[collisions] - ts1.[collisions]) END AS [SpinsPerCollision],
    ts2.[sleep_time] - ts1.[sleep_time] AS [DiffSleepTime],
    ts2.[backoffs] - ts1.[backoffs] AS [DiffBackoffs]
    --, ts2.*
FROM ##TempSpinlockStats2 ts2
LEFT OUTER JOIN ##TempSpinlockStats1 ts1
    ON ts2.[name] = ts1.[name]
WHERE ts1.[name] IS NOT NULL
    AND ts2.[collisions] - ts1.[collisions] > 0
ORDER BY [New] DESC, [Spinlock] ASC;
GO

And the output is as follows:

New  Spinlock           DiffCollisions       DiffSpins            SpinsPerCollision DiffSleepTime        DiffBackoffs
---- ------------------ -------------------- -------------------- ----------------- -------------------- ------------
***  DBCC_CHECK         4                    24                   6                 0                    0
***  DIAG_MANAGER       1                    0                    0                 0                    0
***  FCB_REPLICA_SYNC   10                   16147                1614.7            0                    0
***  LSID               9                    0                    0                 0                    0
***  QUERYEXEC          5                    0                    0                 0                    0
***  X_PACKET_LIST      2                    0                    0                 0                    0
***  X_PORT             8                    227                  28.375            0                    0
***  XACT_WORKSPACE     11                   0                    0                 0                    0
***  XID_ARRAY          7                    0                    0                 0                    0
     BUF_FREE_LIST      2                    0                    0                 0                    0
     HOBT_HASH          2                    1                    0                 0                    0
     LOCK_HASH          3                    1818                 606               0                    0
     SOS_RW             2                    500                  250               0                    0
     SOS_SCHEDULER      5                    6                    1                 0                    0
     SOS_SUSPEND_QUEUE  11                   39                   3                 0                    0
     SOS_TASK           1                    0                    0                 0                    0
     SOS_TLIST          1                    0                    0                 0                    0

You can see here which spinlocks were acquired to run the DBCC CHECKDB commands - those marked with *** did not appear in the 'before' set of spinlock stats. More on all of these in future posts.

You can also investigate spinlocks using extended events - again, more on that in future.

Summary

It's possible to dive really deeply into what's happening inside SQL Server using these four DMVs. Spinlocks in particular - what each means, what each controls and what contention on each them means (plus what you can do about it) - involve a lot of knowledge of what's going on inside the engine, and I'm planning to spread some of that knowledge going forward - there's an enormous amount of information to be published about latches and spinlocks.

Hope you'll join me to learn about these - let me know if you find this stuff interesting and useful.

Thanks!

In this week's survey I've got four mini-surveys for you, all to do with in-depth performance analysis.

I'd like to know whether you've ever used each of four DMVs that look progressively more deeply into the workings of the database engine.

I'll report on the results in a week or two and start blogging about using these - let me know if that would be interesting to you.

Thanks!

(Be sure to join our community to get our monthly newsletter with exclusive content, advance notice of classes with discount codes, and other SQL Server goodies!)  

How many times have you walked up to a SQL Server that has a performance problem and wondered where to start looking? 

One of the most under-utilized performance troubleshooting methodologies in the SQL Server world is one called "waits and queues" (also known simply as "wait stats"). The basic premise is that SQL Server is permanently tracking why execution threads have to wait. You can ask SQL Server for this information and then use the results to narrow down where to start digging to unearth the cause of performance issues. The "waits" are what SQL Server tracks. The "queues" are the resources that the threads are waiting for. There are a myriad of waits in the system and they all indicate different resources being waited for. For example, a PAGEIOLATCH_EX wait means a thread is waiting for a data page to be read into the buffer pool from disk. A LCK_M_X wait means a thread is waiting to be granted an exclusive lock on something.

The great thing about all of this is the SQL Server *knows* where the performance issues are, and you just need to ask it.... and then interpret what it tells you, which can be a little tricky.

Now - where people sometimes get hung up is trying to track down every last wait and figure out what's causing it. Waits *always* occur. It's the way SQL Server's scheduling system works.

A thread is using the CPU (called RUNNING) until it needs to wait for a resource. It then moves to an unordered list of threads that are SUSPENDED. In the meantime, the next thread on the FIFO (first-in-first-out) queue of threads waiting for the CPU (called being RUNNABLE) is given the CPU and becomes RUNNING. If a thread on the SUSPENDED list is notified that it's resource is available, it becomes RUNNABLE and is put on the bottom of the RUNNABLE queue. Threads continue this clockwise movement from RUNNING to SUSPENDED to RUNNABLE to RUNNING again until the task is completed. You can see processes in these states using the sys.dm_exec_requests DMV. 

SQL Server keeps track of the time that elapses between leaving the RUNNING state and becoming RUNNING again (called the "wait time") and the time spent on the RUNNABLE queue (called the "signal wait time" - i.e. how long does the thread need to wait for the CPU after being signaled that its resource is available). We need to work out the time spent waiting on the SUSPENDED list (called the "resource wait time") by subtracting the signal wait time from the overall wait time.

Tom Davidson of Microsoft wrote a fantastic and very accessible whitepaper on "waits and queues" which I encourage you to read: Performance Tuning Using Waits and Queues. My good friend Joe Sack (blog|twitter) who runs the MCM program also published an excellent slide deck on the subject that you can download here. And of course Books Online has a section on the sys.dm_os_wait_stats DMV that gives info on some of the newer wait types. PSS is putting together a repository of info on all the wait types but not much progress has been made. And there's a free video+demo recording as part of the MCM online training we recorded for Microsoft - see here.

You can ask SQL Server for the cumulative wait statistics using the sys.dm_os_wait_stats DMV, and many people prefer to wrap the DMV call in some aggregation code. I use code based on a query that I got from fellow-MVP Glenn Berry (blog|twitter) and then modified quite a bit. See below for the version updated to take account of the results discussed below:

WITH Waits AS
    (SELECT
        wait_type,
        wait_time_ms / 1000.0 AS WaitS,
        (wait_time_ms - signal_wait_time_ms) / 1000.0 AS ResourceS,
        signal_wait_time_ms / 1000.0 AS SignalS,
        waiting_tasks_count AS WaitCount,
        100.0 * wait_time_ms / SUM (wait_time_ms) OVER() AS Percentage,
        ROW_NUMBER() OVER(ORDER BY wait_time_ms DESC) AS RowNum
    FROM sys.dm_os_wait_stats
    WHERE wait_type NOT IN (
        'CLR_SEMAPHORE', 'LAZYWRITER_SLEEP', 'RESOURCE_QUEUE', 'SLEEP_TASK',
        'SLEEP_SYSTEMTASK', 'SQLTRACE_BUFFER_FLUSH', 'WAITFOR', 'LOGMGR_QUEUE',
        'CHECKPOINT_QUEUE', 'REQUEST_FOR_DEADLOCK_SEARCH', 'XE_TIMER_EVENT', 'BROKER_TO_FLUSH',
        'BROKER_TASK_STOP', 'CLR_MANUAL_EVENT', 'CLR_AUTO_EVENT', 'DISPATCHER_QUEUE_SEMAPHORE',
        'FT_IFTS_SCHEDULER_IDLE_WAIT', 'XE_DISPATCHER_WAIT', 'XE_DISPATCHER_JOIN', 'BROKER_EVENTHANDLER',
        'TRACEWRITE', 'FT_IFTSHC_MUTEX', 'SQLTRACE_INCREMENTAL_FLUSH_SLEEP',
        'BROKER_RECEIVE_WAITFOR', 'ONDEMAND_TASK_QUEUE', 'DBMIRROR_EVENTS_QUEUE',
        'DBMIRRORING_CMD', 'BROKER_TRANSMITTER', 'SQLTRACE_WAIT_ENTRIES',
        'SLEEP_BPOOL_FLUSH', 'SQLTRACE_LOCK')
    )
SELECT
    W1.wait_type AS WaitType,
    CAST (W1.WaitS AS DECIMAL(14, 2)) AS Wait_S,
    CAST (W1.ResourceS AS DECIMAL(14, 2)) AS Resource_S,
    CAST (W1.SignalS AS DECIMAL(14, 2)) AS Signal_S,
    W1.WaitCount AS WaitCount,
    CAST (W1.Percentage AS DECIMAL(4, 2)) AS Percentage,
    CAST ((W1.WaitS / W1.WaitCount) AS DECIMAL (14, 4)) AS AvgWait_S,
    CAST ((W1.ResourceS / W1.WaitCount) AS DECIMAL (14, 4)) AS AvgRes_S,
    CAST ((W1.SignalS / W1.WaitCount) AS DECIMAL (14, 4)) AS AvgSig_S
FROM Waits AS W1
    INNER JOIN Waits AS W2 ON W2.RowNum <= W1.RowNum
GROUP BY W1.RowNum, W1.wait_type, W1.WaitS, W1.ResourceS, W1.SignalS, W1.WaitCount, W1.Percentage
HAVING SUM (W2.Percentage) - W1.Percentage < 95; -- percentage threshold
GO

This will show the waits grouped together as a percentage of all waits on the system, in decreasing order. The waits to be concerned about (potentially) are those at the top of the list as this represents the majority of where SQL Server is spending it's time waiting. You can see that a bunch of waits are being filtered out of consideration - as I said above, waits happen all the time and these are the benign ones we can usually ignore.

You can also reset the aggregated statistics using this code:

DBCC SQLPERF ('sys.dm_os_wait_stats', CLEAR);

And of course you can very easily come up with a way to persist the results every few hours or every day and do some time-series analysis to figure out trends or automatically spot problems as they start to happen. You can also use Performance Dashboard to see these graphically in 2005 and Data Collector in 2008. On SQL Server 2000 you can use DBCC SQLPERF ('waitstats').

Once you get the results, you then start figuring out how to interpret them and where to go looking. The whitepaper I referenced above has a ton of good info on most of the wait types (except those added in 2008). There are various ways you can dig in deeper to this information that I'll go into in later posts.

I'm going to start blogging about wait stats analysis, either as standalone posts or as part of other things - and I've already done so in the last post (at time of writing this) in my benchmarking series.

For now, I want to report on the results of the wait stats survey I posted a couple of months back. I asked people to run the code above and let me know the results. I received results for a whopping 1823 SQL Servers out there - thank you!

Here's a graphical view of the results:

 

I'm not surprised at all by the top four results as I see these over and over on client systems.

For the remainder of this post, I'm going to list all the top wait types reported by survey respondents, in descending order, and give a few words about what they might mean if they are the most prevalent wait on your system. The list format shows the number of systems with that wait type as the most prevalent, and then the wait type.

  • 505: CXPACKET
    • This is commonly where a query is parallelized and the parallel threads are not given equal amounts of work to do, or one thread blocks. One thread may have a lot more to do than the others, and so the whole query is blocked while the long-running thread completes. If this is combined with a high number of PAGEIOLATCH_XX waits, it could be large parallel table scans going on because of incorrect non-clustered indexes, or out-of-date statistics causing a bad query plan. If neither of these are the issue, you might want to try setting MAXDOP to 4, 2, or 1 for the offending queries (or possibly the whole instance). Make sure that if you have a NUMA system that you try setting MAXDOP to the number of cores in a single NUMA node first to see if that helps the problem. You also need to consider the MAXDOP effect on a mixed-load system.
  • 304: PAGEIOLATCH_XX
    • This is where SQL Server is waiting for a data page to be read from disk into memory. It commonly indicates a bottleneck at the IO subsystem level, but could also indicate buffer pool pressure (i.e. not enough memory for the workload).
  • 275: ASYNC_NETWORK_IO
    • This is commonly where SQL Server is waiting for a client to finish consuming data. It could be that the client has asked for a very large amount of data or just that it's consuming it reeeeeally slowly because of poor programming.
  • 112: WRITELOG
    • This is the log management system waiting for a log flush to disk. It commonly indicates a problem with the IO subsystem where the log is, but on very high-volume systems it could also be caused by waiting for the LOGCACHE_ACCESS spinlock (which you can't do anything about). To be sure it's the IO subsystem, use the DMV sys.dm_io_virtual_file_stats to examine the IO latency for the log file.
  • 109: BROKER_RECEIVE_WAITFOR
    • This is just Service Broker waiting around for new messages to receive. I would add this to the list of waits to filter out and re-run the wait stats query. 
  • 086: MSQL_XP
    • This is SQL Server waiting for an extended stored-proc to finish. This could indicate a problem in your XP code. 
  • 074: OLEDB
    • As its name suggests, this is a wait for something communicating using OLEDB - e.g. a linked server. It could be that the linked server has a performance issue.
  • 054: BACKUPIO
    • This shows up when you're backing up directly to tape, which is slooooow. I'd be tempted to filter this out.
  • 041: LCK_M_XX
    • This is simply the thread waiting for a lock to be granted and indicates blocking problems. These could be caused by unwanted lock escalation or bad programming, but could also be from IOs taking a long time causing locks to be held for longer than usual. Look at the resource associated with the lock using the DMV sys.dm_os_waiting_tasks.
  • 032: ONDEMAND_TASK_QUEUE
    • This is normal and is part of the background task system (e.g. deferred drop, ghost cleanup).  I would add this to the list of waits to filter out and re-run the wait stats query.
  • 031: BACKUPBUFFER
    • This shows up when you're backing up directly to tape, which is slooooow. I'd be tempted to filter this out.
  • 027: IO_COMPLETION
    • This is SQL Server waiting for IOs to complete and is a sure indication of IO subsystem problems.
  • 024: SOS_SCHEDULER_YIELD
    • If this is a very high percentage of all waits (had to say, but Joe suggests 80%) then this is likely indicative of CPU pressure.
  • 022: DBMIRROR_EVENTS_QUEUE
  • 022: DBMIRRORING_CMD
    •  These two are database mirroring just sitting around waiting for something to do. I would add these to the list of waits to filter out and re-run the wait stats query.
  • 018: PAGELATCH_XX
    • This is contention for access to in-memory copies of pages. The most well-known cases of these are the PFS, SGAM, and GAM contention that can occur in tempdb under certain workloads. To find out what page the contention is on, you'll need to use the DMV sys.dm_os_waiting_tasks to figure out what page the latch is for. For tempdb issues, my friend Robert Davis (blog|twitter) has a good post showing how to do this. Another common cause I've seen is an index hot-spot with concurrent inserts into an index with an identity value key.
  • 016: LATCH_XX
    • This is contention for some non-page structure inside SQL Server - so not related to IO or data at all. These can be hard to figure out and you're going to be using the DMV sys.dm_os_latch_stats. More on this in future posts.
  • 013: PREEMPTIVE_OS_PIPEOPS
    • This is SQL Server switching to pre-emptive scheduling mode to call out to Windows for something. These were added for 2008 and haven't been documented yet (anywhere) so I don't know exactly what it means.
  • 013: THREADPOOL
    • This says that there aren't enough worker threads on the system to satisfy demand. You might consider raising the max worker threads setting.
  • 009: BROKER_TRANSMITTER
    • This is just Service Broker waiting around for new messages to send. I would add this to the list of waits to filter out and re-run the wait stats query. 
  • 006: SQLTRACE_WAIT_ENTRIES
    • Part of SQL Trace. I would add this to the list of waits to filter out and re-run the wait stats query.
  • 005: DBMIRROR_DBM_MUTEX
    •  This one is undocumented and is contention for the send buffer that database mirroring shares between all the mirroring sessions on a server. It could indicate that you've got too many mirroring sessions.
  • 005: RESOURCE_SEMAPHORE
    • This is queries waiting for execution memory (the memory used to process the query operators - like a sort). This could be memory pressure or a very high concurrent workload. 
  • 003: PREEMPTIVE_OS_AUTHENTICATIONOPS
  • 003: PREEMPTIVE_OS_GENERICOPS
    • These are SQL Server switching to pre-emptive scheduling mode to call out to Windows for something. These were added for 2008 and haven't been documented yet (anywhere) so I don't know exactly what it means.
  • 003: SLEEP_BPOOL_FLUSH
    • This is normal to see and indicates that checkpoint is throttling itself to avoid overloading the IO subsystem. I would add this to the list of waits to filter out and re-run the wait stats query.
  • 002: MSQL_DQ
    • This is SQL Server waiting for a distributed query to finish. This could indicate a problem with the distributed query, or it could just be normal.
  • 002: RESOURCE_SEMAPHORE_QUERY_COMPILE
    • When there are too many concurrent query compilations going on, SQL Server will throttle them. I don't remember the threshold, but this can indicate excessive recompilation, or maybe single-use plans.
  • 001: DAC_INIT
    • I've never seen this one before and BOL says it's because the dedicated admin connection is initializing. I can't see how this is the most common wait on someone's system... 
  • 001: MSSEARCH
    • This is normal to see for full-text operations.  If this is the highest wait, it could mean your system is spending most of its time doing full-text queries. You might want to consider adding this to the filter list. 
  • 001: PREEMPTIVE_OS_FILEOPS
  • 001: PREEMPTIVE_OS_LIBRARYOPS
  • 001: PREEMPTIVE_OS_LOOKUPACCOUNTSID
  • 001: PREEMPTIVE_OS_QUERYREGISTRY
    • These are SQL Server switching to pre-emptive scheduling mode to call out to Windows for something. These were added for 2008 and haven't been documented yet (anywhere) so I don't know exactly what it means. 
  • 001: SQLTRACE_LOCK
    • Part of SQL Trace. I would add this to the list of waits to filter out and re-run the wait stats query.

I hope you found this interesting! Let me know if there's anything in particular you're interested in seeing or just that you're following along and enjoying the ride!

I've recently been creating some content about wait stats analysis and I think it would be really interesting to see what kind of waits people are seeing out there in the wild. Hopefully it'll also introduce a bunch of people to the waits-and-queues performance troubleshooting methodology and how it can be really useful to them.

[Edit: comments are closed on this post as I have all the info I need - thanks!]

Here's what I'd like you to do:

  • Run the T-SQL query from the end of this post. It's completely benign and only reports on statistics that SQL Server is already gathering. It will not cause any perf issues on your production systems.
  • Look at the output - it'll be something like (from random anonymous client system):

 

  • Look at the top WaitType and fill in the simple survey below. If your top wait isn't listed, click on Other and cut-and-paste the top WaitType into the box provided.

The free survey system only allows a single vote per IP address - if you have any other results, send them in email (mailto:paul@SQLskills.com?Subject=Wait stats) or attach a comment below. 

In a week or two I'll report on the results. It would be great to get a few hundred responses. 

Thanks!

T-SQL code to run (works on 2005 onwards) - based on code by Glenn Berry:

WITH Waits AS
    (SELECT
        wait_type,
        wait_time_ms / 1000.0 AS WaitS,
        (wait_time_ms - signal_wait_time_ms) / 1000.0 AS ResourceS,
        signal_wait_time_ms / 1000.0 AS SignalS,
        waiting_tasks_count AS WaitCount,
        100.0 * wait_time_ms / SUM (wait_time_ms) OVER() AS Percentage,
        ROW_NUMBER() OVER(ORDER BY wait_time_ms DESC) AS RowNum
    FROM sys.dm_os_wait_stats
    WHERE wait_type NOT IN (
        'CLR_SEMAPHORE', 'LAZYWRITER_SLEEP', 'RESOURCE_QUEUE', 'SLEEP_TASK',
        'SLEEP_SYSTEMTASK', 'SQLTRACE_BUFFER_FLUSH', 'WAITFOR', 'LOGMGR_QUEUE',
        'CHECKPOINT_QUEUE', 'REQUEST_FOR_DEADLOCK_SEARCH', 'XE_TIMER_EVENT', 'BROKER_TO_FLUSH',
        'BROKER_TASK_STOP', 'CLR_MANUAL_EVENT', 'CLR_AUTO_EVENT', 'DISPATCHER_QUEUE_SEMAPHORE',
        'FT_IFTS_SCHEDULER_IDLE_WAIT', 'XE_DISPATCHER_WAIT', 'XE_DISPATCHER_JOIN', 'BROKER_EVENTHANDLER',
        'TRACEWRITE', 'FT_IFTSHC_MUTEX', 'SQLTRACE_INCREMENTAL_FLUSH_SLEEP')
     )
SELECT
     W1.wait_type AS WaitType,
     CAST (W1.WaitS AS DECIMAL(14, 2)) AS Wait_S,
     CAST (W1.ResourceS AS DECIMAL(14, 2)) AS Resource_S,
     CAST (W1.SignalS AS DECIMAL(14, 2)) AS Signal_S,
     W1.WaitCount AS WaitCount,
     CAST (W1.Percentage AS DECIMAL(4, 2)) AS Percentage
FROM Waits AS W1
INNER JOIN Waits AS W2
     ON W2.RowNum <= W1.RowNum
GROUP BY W1.RowNum, W1.wait_type, W1.WaitS, W1.ResourceS, W1.SignalS, W1.WaitCount, W1.Percentage
HAVING SUM (W2.Percentage) - W1.Percentage < 95; -- percentage threshold
GO

Theme design by Nukeation based on Jelle Druyts