There's a lot of controversy about the Buffer Manager performance object counter Page Life Expectancy - mostly around people continuing to quote 300 as a threshold for starting to worry about there being a problem (which is just utter nonsense these days). That's far too *low* to be the point at which to start worrying if your PLE dips and stays there. Jonathan came up with a better number to use - based on the size of your buffer pool - see the bottom of his post here.

But that's not why I'm writing today: I want to explain why most of the time now Page Life Expectancy is really not giving you useful information.

Most new systems today use NUMA, and so the buffer pool is split up and managed per NUMA node, with each NUMA node getting it's own lazy writer thread, managing it's own buffer free list, and dealing with node-local memory allocations. Think of each of these as a mini buffer pool.

The Buffer Manager:Page Life Expectancy counter is calculated by adding the PLE of each mini buffer pool and then getting the average.

What does this mean? It means that the overall PLE is not giving you a true sense of what is happening on your machine as one NUMA node could be under memory pressure but the *overall* PLE would only dip slightly. One of my friends who's a Premier Field Engineer and MCM just had this situation today, which prompted this blog post. The conundrum was how can there be 100+ lazy writes/sec occuring when overall PLE is relatively static - and this was the issue.

For instance, for a machine with 8 NUMA nodes, with PLE of each being 4000, the overall PLE is 4000. If one of them drops to 1000, the overall PLE only drops to 3625, which likely wouldn't trigger your alerting as it hasn't even dropped 10%.

On NUMA machines, you need to be looking at the Buffer Node:Page Life Expectancy counters for all NUMA nodes otherwise you're not getting an accurate view of buffer pool memory pressure and so could be missing performance issues. And adjust Jonathan's threshold value according to the number of NUMA nodes you have.

Unfortunately there's no easy way to monitor per-NUMA node lazy writes - maybe in future.

[Edit: thanks to an email thread with Trayce Jordan from Microsoft and Jonathan, you can see the lazywriter activity for each NUMA node by looking for the lazywriter threads in sys.dm_exec_requests.]

Hope this helps!

Back in April I kicked off a survey where I asked you all to send me some information about your buffer pools - how much memory is being used for data file pages and how much of that memory is storing empty space. I got back data from 1394 servers around the world - thanks!

The reason I'm interested in this, and you should be too, is that memory is one of the most important resources that SQL Server uses. If you don't have anough memory, your workload will suffer because:

  • You'll be driving more read I/Os because more of the workload can't fit in the buffer pool.
  • You'll be driving more write I/Os because the lazywriter will have to be tossing dirty pages from the buffer pool.
  • You may encounter RESOURCE_SEMAPHORE waits because queries can't get the query execution memory grants that they need.
  • You may cause excessive plan recompilations if the plan cache is too constrained.

And a bunch of other things.

One of the memory problems that Kimberly discussed in depth last year (and teaches in depth in our Performance Tuning classes) is single-use plan cache bloat - where a large proportion of the plan cache is filled with single-use plans that don't ever get used again. You can read about it in the three blog posts in her Plan Cache category, along with how to identify plan cache bloat and what you can do about it.

This post is about the memory the buffer pool is using to store data file pages, and whether good use is being made from it.

The sys.dm_os_buffer_descriptors DMV gives the information stored by the buffer pool for each data file page in memory (called a BUF structure in the code). One of the things that this structure keeps track of is the free_space_in_bytes for each page. This metric is updated in real-time as changes are made to the page in memory (you can easily prove this for yourself) and so is a reliable view of the data density of the used portion of the buffer pool.

Data density? Think of this as how packed full or data, index, or LOB rows a data file page is. The more free space on the page, the lower the data density.

Low data density pages are caused by:

  • Very wide data rows (e.g. a table with a 5000-byte fixed-size row will only ever fit one row per page, wasting roughly 3000 bytes per page).
  • Page splits, from random inserts into full pages or updates to rows on full pages. These kind of page splits result in logical fragmentation that affects range scan performance, low data density in data/index pages, and increased transaction log overhead (see How expensive are page splits in terms of transaction log?).
  • Row deletions where the space freed up by the deleted row will not be reused because of the insert pattern into the table/index.

Low data density pages can be detrimental to SQL Server performance, because the lower the density of records on the pages in a table:

  • The higher the amount of disk space necessary to store the data (and back it up).
  • The more I/Os are needed to read the data into memory.
  • The higher the amount of buffer pool memory needed to store the extra pages in the buffer pool.

From the survey results I took all the SQL Servers that were using at least one GB of buffer pool memory for data file page storage (900 servers) and plotted that amount of memory against the percentage of that memory that was storing free space in the data file pages.

Wow! That's a lot of servers with a lot of buffer pool memory storing nothing useful.

So what can you do about it? There are a number of solutions to low page density including:

  • Change the table schema (e.g. vertical partitioning, using smaller data types).
  • Change the index key columns (usually only applicable to clustered indexes - e.g. changing the leading cluster key from a random value like a non-sequential GUID to a sequential GUID or identity column).
  • Use index FILLFACTOR to reduce page splits, and...
  • Periodically rebuild problem indexes.
  • Consider enabling data compression on some tables and indexes.

From the graph above, bear in mind that some of the 'wasted' space on these servers could be from proper index management where data and index pages have a low FILLFACTOR set to alleviate page splits. But I suspect that only accounts for a small portion of what we're seeing in this data.

The purpose of my survey and this post is not to explain how to make all the changes to reduce the amount of free space being stored in memory, but to educate you that this is a problem. Very often PAGEIOLATCH waits are prevalent on systems because more I/O than necessary is being driven to the I/O subsystem because of things like bad plans causing table scans or low data density. If you can figure out that it's not an I/O subsystem problem, then you as the DBA can do something about it.

Below is a script to analyze the buffer pool and break down by database the amount of space being taken up in the buffer pool and how much of that space is empty space. For systems with a 100s of GB of memory in use, this query may take a while to run:

SELECT
   (CASE WHEN ([database_id] = 32767)
       THEN 'Resource Database'
       ELSE DB_NAME ([database_id]) END) AS [DatabaseName],
   COUNT (*) * 8 / 1024 AS [MBUsed],
   SUM (CAST ([free_space_in_bytes] AS BIGINT)) / (1024 * 1024) AS [MBEmpty]
FROM sys.dm_os_buffer_descriptors
GROUP BY [database_id];
GO

And here's some sample output from a client system (made anonymous, of course):

DatabaseName        MBUsed   MBEmpty
------------------- -------- ---------
Resource Database   51       11
ProdDB              71287    9779
master              2        1
msdb                481      72
ProdDB2             106      17
model               0        0
tempdb              2226     140

Below is a script that will break things down by table and index across all databases that are using space in the buffer pool. I'm filtering out system objects plus indexes where the amount of space used in the buffer pool is less than 100MB. You can use this to identify tables and indexes that need some work on them to allow your buffer pool memory to be used more effectively by SQL Server and increase your workload performance.

[Edit: There was a bug in the script - fixed now 6/10/2011] 

EXEC sp_MSforeachdb
    N'IF EXISTS (SELECT 1 FROM (SELECT DISTINCT DB_NAME ([database_id]) AS [name]
    FROM sys.dm_os_buffer_descriptors) AS names WHERE [name] = ''?'')
BEGIN
USE [?]
SELECT
    ''?'' AS [Database],
    OBJECT_NAME (p.[object_id]) AS [Object],
    p.[index_id],
    i.[name] AS [Index],
    i.[type_desc] AS [Type],
    --au.[type_desc] AS [AUType],
    --DPCount AS [DirtyPageCount],
    --CPCount AS [CleanPageCount],
    --DPCount * 8 / 1024 AS [DirtyPageMB],
    --CPCount * 8 / 1024 AS [CleanPageMB],
    (DPCount + CPCount) * 8 / 1024 AS [TotalMB],
    --DPFreeSpace / 1024 / 1024 AS [DirtyPageFreeSpace],
    --CPFreeSpace / 1024 / 1024 AS [CleanPageFreeSpace],
    ([DPFreeSpace] + [CPFreeSpace]) / 1024 / 1024 AS [FreeSpaceMB],
    CAST (ROUND (100.0 * (([DPFreeSpace] + [CPFreeSpace]) / 1024) / (([DPCount] + [CPCount]) * 8), 1) AS DECIMAL (4, 1)) AS [FreeSpacePC]
FROM
    (SELECT
        allocation_unit_id,
        SUM (CASE WHEN ([is_modified] = 1)
            THEN 1 ELSE 0 END) AS [DPCount],
        SUM (CASE WHEN ([is_modified] = 1)
            THEN 0 ELSE 1 END) AS [CPCount],
        SUM (CASE WHEN ([is_modified] = 1)
            THEN CAST ([free_space_in_bytes] AS BIGINT) ELSE 0 END) AS [DPFreeSpace],
        SUM (CASE WHEN ([is_modified] = 1)
            THEN 0 ELSE CAST ([free_space_in_bytes] AS BIGINT) END) AS [CPFreeSpace]
    FROM sys.dm_os_buffer_descriptors
    WHERE [database_id] = DB_ID (''?'')
    GROUP BY [allocation_unit_id]) AS buffers
INNER JOIN sys.allocation_units AS au
    ON au.[allocation_unit_id] = buffers.[allocation_unit_id]
INNER JOIN sys.partitions AS p
    ON au.[container_id] = p.[partition_id]
INNER JOIN sys.indexes AS i
    ON i.[index_id] = p.[index_id] AND p.[object_id] = i.[object_id]
WHERE p.[object_id] > 100 AND ([DPCount] + [CPCount]) > 12800 -- Taking up more than 100MB
ORDER BY [FreeSpacePC] DESC;
END';

And here's some sample output from the same client system with the more comprehensive script:

Database Object index_id Index        Type         TotalMB FreeSpaceMB FreeSpacePC
-------- ------ -------- ------------ ------------ ------- ----------- -----------
ProdDB   TableG 1        TableG_IX_1  CLUSTERED    531     130         24.5
ProdDB   TableI 1        TableI_IX_1  CLUSTERED    217     48          22.2
ProdDB   TableG 2        TableG_IX_2  NONCLUSTERED 127     27          21.8
ProdDB   TableC 1        TableC_IX_1  CLUSTERED    224     47          21.4
ProdDB   TableD 3        TableD_IX_3  NONCLUSTERED 1932    393         20.4
ProdDB   TableH 1        TableH_IX_1  CLUSTERED    162     33          20.4
ProdDB   TableF 5        TableF_IX_5  NONCLUSTERED 3128    616         19.7
ProdDB   TableG 9        TableG_IX_9  NONCLUSTERED 149     28          19.1
ProdDB   TableO 10       TableO_IX_10 NONCLUSTERED 1003    190         19
ProdDB   TableF 6        TableF_IX_6  NONCLUSTERED 3677    692         18.8
.
.

This is cool because it's a lot less intrusive way to figure out which tables and indexes have data density issues than running sys.dm_db_index_physical_stats (see this post for an in depth view of this DMV). You can mess around with the script to make it create a table to hold all the data for all databases and slice+dice however you want. I have a much more comprehensive script that I use on client systems but this one will provide you with lots of relevant data.

Have fun! Let me know if this is useful!

This is a quick post inspired by a question I was sent in email (thanks Marcos!) which very neatly lets me show a DMV I've been meaning to blog about for a while. And the weather here in Redmond really sucks right now so I can't go outside - blogging will serve as my work-avoidance strategy this afternoon :-).

The (paraphrased) question is:

A checkpoint is a process that writes all dirty pages to disk, and is per-database. So, if the data cache can hold a page from any database, how does checkpoint know which pages to check for a dirty status? Does it scan through buffer pool looking for pages for database X and process only those? Or is data cache somehow partitioned by database? I'd like to know a bit better how it works under the covers.

The answer is that pages are stored in buffers in the buffer pool (aka buffer cache or data cache), and the buffers are indeed hashed so they can easily be found by database. You can see what pages are currently in the buffer pool, and their status using the sys.dm_os_buffer_descriptors DMV in 2005:

SELECT * FROM sys.dm_os_buffer_descriptors;
GO

database_id file_id  page_id  page_level  allocation_unit_id   page_type      row_count   free_space_in_bytes is_modified
----------- -------- -------- ----------- -------------------- -------------- ----------- ------------------- -----------
1           1        9        0           6488064              BOOT_PAGE      1           7362                0
1           1        6        0           6488064              DIFF_MAP_PAGE  2           6                   0
1           1        7        0           6488064              ML_MAP_PAGE    2           6                   0
1           1        104      0           262144               DATA_PAGE      100         4196                0
1           1        105      0           851968               DATA_PAGE      65          5041                0
1           1        106      0           262144               DATA_PAGE      197         413                 0
1           1        107      0           262144               DATA_PAGE      207         23                  0
1           1        108      1           262144               INDEX_PAGE     7           7949                0
.
.

I cut off the output rather than list all 3258 pages in the buffer pool on my laptop. The DMV gives you back some info from the pages themselves as well as you can see(remember all this is in memory so it's quick to find).

I played around with the DMV a little bit and came up with a neat script that will tell you may many clean and dirty pages there are in the buffer pool per-database.

SELECT
   
(CASE WHEN ([is_modified] = 1) THEN 'Dirty' ELSE 'Clean' END) AS 'Page State',
   
(CASE WHEN ([database_id] = 32767) THEN 'Resource Database' ELSE DB_NAME (database_id) END) AS 'Database Name',
   
COUNT (*) AS 'Page Count'
FROM sys.dm_os_buffer_descriptors
   
GROUP BY [database_id], [is_modified]
   
ORDER BY [database_id], [is_modified];
GO

Page State Database Name             Page Count
---------- ------------------------- -----------
Clean      master                    302
Dirty      master                    1
Clean      tempdb                    88
Dirty      tempdb                    52
Clean      model                     56
Clean      msdb                      622
Dirty      msdb                      5
Clean      adventureworks            110
Clean      DemoRestoreOrRepair       64
Clean      DBMaint2008               88
Clean      DemoFatalCorruption1      64
Clean      DemoFatalCorruption2      64
Clean      broken                    64
Clean      DemoFatalCorruption3      64
Clean      DemoCorruptMetadata       111
Clean      DemoDataPurity            88
Clean      SalesDB                   123
Clean      DemoNCIndex               88
Clean      shrinktest                88
Clean      DemoRestoreOrRepairCopy   64
Clean      DemoSuspect               64
Clean      FileHeaderTest            96
Clean      MultiFileDB               96
Clean      HA2008                    88
Clean      SalesDB_Snapshot          21
Clean      BootPageTest              88
Clean      Resource Database         599

Later this week I'll try to blog a script that can tell you how much of a particular table is in memory ([Edit 12/8/11: that script is here]). Enjoy!

Theme design by Nukeation based on Jelle Druyts