There's a lot of confusion around duplicate indexes and I think a lot of it has to do with how sp_helpindex (or even SSMS) display what's in an index. What's actually in the index - and how it's structured - are not always what they seem. This was the original motivation behind my rewrite of sp_helpindex but even since then, I've still seen a lot of confusion. In today's blog post I'm going to first explain EXACTLY which indexes are the same and which aren't - as well as the faults in the tools.

So, starting with index structures... (it all starts with internals :) )

The clustered index IS the data. The KEY to the clustered index (what I often refer to as the clustering key) defines the way that the data is ordered (not necessarily truly, physically ordered on disk - but ordered LOGICALLY). And, no, I'm not going to rehash every bit of internals here... just a few reminders.

A nonclustered index is duplicated data (similar to an index in the back of a book). This duplicated data can be used to help reference the actual data (exactly like an index in the back of a book) OR can be used to actually respond to requests (for example, if you're solely looking for a count of people with a name that begins with 'S' then an index that has LastName in the index could be used to count them - without actually "looking up" the actual data). So, there are some very powerful uses to indexes. But, alas, this isn't a post on using indexes or indexing strategies - this is ALL about internals (and understanding the structure of an index). So, I'm going to cut to the chase!

A nonclustered index always has:

  • Key (this is what defines the order of the index)
  • A leaf-level entry (this is the actual data stored in the index + the lookup value* + any included columns) - however, ALL of these columns are only stored once (and they've ALWAYS only been stored once here so, even if you reference a column that's part of the lookup value, SQL Server will NOT duplicate it again).

*So, what is this lookup value?

The lookup value is what SQL Server uses to reference the actual data row. If a table has a clustered index then the lookup value is the clustering key (and EVERY column defined by it). If a table does not have a clustered index (it's then a HEAP) then SQL Server uses something called a RID. A RID is an 8-byte structure consisting of 2:4:2 bytes which breakdown into 2 for the FileID, 4 for the PageID and 2 for the slot number. And while RID's (and their history) are very interesting - it's not entirely relevant here (how they specifically work and/or their structures) but if/when they're in an index, I'll list is as RID.

Let's now put all of this together with an example (or two) starting with a relatively easy one.

USE JunkDB
go

CREATE TABLE Test
(
    TestID  int identity,
    [Name]  char(16)
)
go

CREATE UNIQUE CLUSTERED INDEX TestCL ON Test (TestID)
go

CREATE INDEX TestName ON Test([Name])
go

sp_helpindex Test
go

The output shows:

index_name      index_description                               index_keys
TestCL          clustered, unique located on PRIMARY            TestID
TestName        nonclustered located on PRIMARY            
     Name

And, while that looks right, it's horribly misleading. The index that's on TestName also includes TestID in the index as well. And, not just in the leaf level but in the tree (for ordering purposes). So, that should really show Name, TestID. But, it's a bit more confusing if you throw in this:

CREATE UNIQUE INDEX TestNameUnique ON Test([Name])
go

sp_helpindex Test
go

The output shows:

index_name      index_description                               index_keys
TestCL          clustered, unique located on PRIMARY            TestID
TestName        nonclustered located on PRIMARY                 Name
TestNameUnique  nonclustered, unique located on PRIMARY         Name

And, at this point, it doesn't look like there's any difference at all between the second and third indexes (well, except that the third index requires that the values be unique - shown in the description). But, as for the "index_keys" they look the same. However, they're actually not the same (in all parts of the tree). So, this is why I tend to differentiate between the "leaf" and the non-leaf levels of an index (when I describe them). And, it only gets more complicated when you throw in included columns (2005+).

So, how to do you tell the difference? Unfortunately, there are NO tools within SQL Server (or even any 3rd party tools that I know of) that display this through the UI, etc. But, you could start by using my replacement to sp_helpindex. I've updated and rewritten it a few times but the latest one will always be in the sp_helpindex reqrites category here: http://www.sqlskills.com/BLOGS/KIMBERLY/category/sp_helpindex-rewrites.aspx. I personally think that's the easiest (and is the reason why I originally wrote it!). Using it you can see how the output is more detailed.

The output shows (specifically showing ONLY the last 2 columns):

columns_in_tree              columns_in_leaf
[TestID]                     All columns "included" - the leaf level IS the data row.
[Name], [TestID]             [Name], [TestID]
[Name]                       [Name], [TestID]

Now, we're getting somewhere. We can definitely see that these two indexes ARE different. A nonunique nonclustered needs to have the lookup value pushed up into the tree (for navigation). A unique nonclustered index does not.

Next, let's go with a [just slightly more challenging] example:

USE JunkDB
go

CREATE TABLE Member
(
    MemberNo    int     identity,
    FirstName   varchar(30)     NOT NULL,
    LastName    varchar(30)     NOT NULL,
    RegionNo    int
)
go

CREATE UNIQUE CLUSTERED INDEX MemberCL ON Member(MemberNo)
go
   
CREATE INDEX MemberIndex1
ON Member(FirstName, RegionNo, MemberNo)
INCLUDE (LastName)
go

CREATE INDEX MemberIndex2
ON Member(FirstName, RegionNo)
INCLUDE (LastName)
go

CREATE INDEX MemberIndex3
ON Member(FirstName, RegionNo)
INCLUDE (MemberNo, LastName)
go

CREATE UNIQUE INDEX MemberIndex4
ON Member(Firstname, RegionNo)
INCLUDE (MemberNo, lastname)
go

First, let's review with sp_helpindex:

sp_helpindex Member
go

The output shows:

index_name      index_description                           index_keys
MemberCL        clustered, unique located on PRIMARY        MemberNo
MemberIndex1    nonclustered located on PRIMARY             FirstName, RegionNo, MemberNo
MemberIndex2    nonclustered located on PRIMARY             FirstName, RegionNo
MemberIndex3    nonclustered located on PRIMARY             FirstName, RegionNo
MemberIndex4    nonclustered, unique located on PRIMARY     FirstName, RegionNo

Looking solely at sp_helpindex it LOOKS like the first nonclustered index is different from the others and that the 2nd, 3rd and 4th nonclustered indexes are the same. However, that's actually NOT true. Next, let's use my version of sp_helpindex.

sp_SQLskills_SQL2008_helpindex Member
go

The output (again, just showing the last two columns of output):

columns_in_tree                           columns_in_leaf
[MemberNo]                                All columns "included" - the leaf level IS the data row.
[FirstName], [RegionNo], [MemberNo]       [FirstName], [RegionNo], [MemberNo], [LastName]
[FirstName], [RegionNo], [MemberNo]       [FirstName], [RegionNo], [MemberNo], [LastName]
[FirstName], [RegionNo], [MemberNo]       [FirstName], [RegionNo], [MemberNo], [LastName]
[FirstName], [RegionNo]                   [FirstName], [RegionNo], [MemberNo], [LastName]

From this, you can see that all 4 indexes have the same leaf level but index 4 has a slightly different tree structure. Ultimately, it's indexes 1, 2 and 3 that are the same and index 4 is actually [slightly] different. How they're different (outside of the fact that the 4th index guarantees uniqueness) is a bit beyond the scope of this post. But, yes, there are some [relatively minor in this case] differences. And, since I'm ONLY looking for indexes that are the same then only 1, 2 and 3 fit that requirement.

And, things get MUCH more complicated if you have multicolumn clustering key and/or more complicated INCLUDEs.

Having said that - how do you find duplicate indexes?

Well... I had started this by putting together a simple way for you to check for dupes with my version of sp_helpindex but then I figured out an issue with included columns. I'm showing the structures as they're defined (and stored). But, in terms of usage - the order of the columns in the INCLUDE does not matter. As a result, two indexes with different orders for their included columns will show as two different indexes (technically they are and they are stored differently). However, there are absolutely NO differences (in terms of usage) so I'll need to write code to adjust for that (to truly find duplicates).

For now, here's some quick code to get you closer. I had written some of this when we discussed this in our London Immersion Event. However, I've tweaked it even further here after thinking about some interesting exceptions. This code will find absolute duplicates (where the order of the structure is completely identical). To use it you'll need to enter in the schemaname and tablename (in bold below).

IF (SELECT count(*) FROM #FindDupes) IS NOT NULL
    DROP TABLE #FindDupes
go

CREATE TABLE #FindDupes
(
 index_id           int,
 is_disabled        bit,
 index_name         sysname,
 index_description  varchar(210),
 index_keys         nvarchar(2126),
 included_columns   nvarchar(max),
 filter_definition  nvarchar(max),
 columns_in_tree    nvarchar(2126),
 columns_in_leaf    nvarchar(max)
)
go

DECLARE @SchemaName sysname,
        @TableName  sysname,
        @ExecStr    nvarchar(max)

SELECT @SchemaName = N'schemaname',     -- MODIFY
       @TableName = N'tablename'        -- MODIFY
       
SELECT @ExecStr = 'EXEC sp_SQLskills_SQL2008_helpindex '''
                    + QUOTENAME(@SchemaName)
                    + N'.'
                    + QUOTENAME(@TableName)
                    + N''''

INSERT #FindDupes
EXEC (@ExecStr)

SELECT t1.index_id, COUNT(*) AS 'Duplicate Indexes w/Lower Index_ID',
        N'DROP INDEX '
            + QUOTENAME(@SchemaName, N']')
            + N'.'
            + QUOTENAME(@TableName, N']')
            + N'.'
            + t1.index_name AS 'Drop Index Statement'
FROM #FindDupes AS t1
    JOIN #FindDupes AS t2
        ON t1.columns_in_tree = t2.columns_in_tree
            AND t1.columns_in_leaf = t2.columns_in_leaf
            AND ISNULL(t1.filter_definition, 1) = ISNULL(t2.filter_definition, 1)
           
AND PATINDEX('%unique%', t1.index_description) = PATINDEX('%unique%', t2.index_description)
           
AND t1.index_id > t2.index_id
GROUP BY t1.index_id, N'DROP INDEX ' + QUOTENAME(@SchemaName, N']')
                    + N'.'
                    + QUOTENAME(@TableName, N']')
                    + N'.' + t1.index_name
go

I'm planning to bulid a sproc around this next as well as get rid of cases where the included columns are just in different orders next. It will be another sp_ for master and I do hope to wrap it up shortly! That code will REALLY find ALL unnecessary duplicates and help you to drop them! Stay tuned!!

Cheers and thanks for reading!!
kt

OK, I've talked about the clustering key many times. Here, I thought I'd bring together one final series of details (and links) to help you really understand why I'm so adamant about DESIGNING a clustering key and not just letting SQL Server pick it for you (for example when it defaults to making the primary key clustered). Just because SQL Server defaults to making the primary key clustered doesn't actually mean it's a GOOD clustering key!

The key things that I've always recommended about a GOOD clustering key is that it is: unique, narrow, static and ever-increasing. For more details on the reasons behind this, check out these posts:

And, today, there are two additional items that I want to add to this list: your clustering key should be non-nullable and fixed-width!

First, why non-nullable?

In a regular data row you will ALWAYS have a null block. This is 2 bytes for a column count (in that row) and 1 bit per column (to store the actual null values). If you want to get more details on the internals of a data row, see Paul's post: Inside the Storage Engine: Anatomy of a record. However, the btree of the clustered index and the nonclustered index leaf/non-leaf levels do NOT have to have this "null block" of information if the columns in the index do NOT allow nulls. While this may be only 3 or 4 bytes (depending on the number of columns), it's still 3 or 4 bytes that would have to be added to EVERY nonclustered index's leaf level (for every row). And, it just doesn't have to be there. Why waste space when you don't have to!

Second, and along the same lines, is the need to use fixed-width columns!

Just like in data rows, index rows only have a variable block (offsets + end of row) when there are variable-width columns. If a clustered index is chosen that has one or more variable-width columns then you are adding at least 4 bytes to every row that might not have otherwise needed to be there. In fact, often people forget that uniquifiers are stored in the variable block portion of the row and as a result, a uniquifier really needs 8 bytes (4 bytes for the int + 2 bytes in the variable block for the offset + 2 bytes for the end of row marker).

To show you this, I've created four tables each with the same 3 columns DATA TYPES (table 4 has col1 as an int that's nullable and you can't do that with an identity so I used INSERT/SELECT to copy the data over):

col1 int identity not null,
col2 datetime2(7) not null default sysdatetime(),
col3 datetime2(7) null default sysdatetime()

And, each of the tables has these four nonclustered indexes:

  • Non-unique nonclustered on col2
  • Unique nonclustered on col2
  • Non-unique nonclustered on col3
  • Unique nonclustered on col3

The difference is solely within the definition of the clustering key as well as weather or not the values are unique (or nullable)!

CLTable1 is clustered and non-nullable but NOT defined as unique (the values are unique)

CLTable2 is clustered and non-nullable but there are duplicate values in the clustering key

CLTable3 is defined as a unique clustered index

CLTable4 is clustered, nullable and there are duplicate values in the clustering key

The end result is the sizes of all of the keys and the wasted space from different keys!

For CLTable1, CLTable2 and CLTable3 - all of the minimums are the same. For CLTable1 and CLTable3 - the maximums are the same as the minimums. In CLTable3 it's because the records MUST be unique. In CLTable1 it's because the records just happen to be unique right now (meaning there's no overhead unless there are actual dupes).

        record_size_in_bytes
index_id           min          max          avg
1 27 27 27
1 11 11 11
2 13 13 13
2 19 19 19
3 13 13 13
3 15 15 15
4 16 16 16
4 22 22 22
5 16 16 16
5 18 18 18

For CLTable2 the maximums are each 8 bytes larger (highlighted in yellow) EXCEPT where in the btrees of UNIQUE nonclustered indexes where SQL Server does NOT need to duplicate the clustering key (pale pink):

        record_size_in_bytes
index_id         min         max          avg
1 27 35 27.73
1 11 19 11.76
2 13 21 13.73
2 19 27 20.09
3 13 21 13.73
3 15 15 15.00
4 16 24 16.73
4 22 30 22.92
5 16 24 16.73
5 18 18 18.00

NOTE: If you're not familar with index internals and when/why SQL Server duplicates the clustering key in nonclustered indexes, check out my Index Internals chapter from the SQL Server 2008 Internals title (here's the Amazon link: http://www.amazon.com/Microsoft-SQL-Server-2008-Internals/dp/0735626243/ref=sr_1_1?ie=UTF8&qid=1291001371&sr=8-1) and my companion content from Chapter 6 here: Companion content for Chapter 6 (Index Internals) of SQL Server 2008 Internals. And, if you really want to see what's in your indexes, check out my updated versions of sp_helpindex here: http://www.sqlskills.com/BLOGS/KIMBERLY/category/sp_helpindex-rewrites.aspx.

Finally, CLTable4 has the most interesting results...

       record_size_in_bytes
index_id         min         max          avg
1 27 35 27.73
1 14 22 14.78
2 16 24 16.73
2 22 30 22.92
3 16 24 16.73
3 15 15 15
4 16 24 16.73
4 22 30 22.92
5 16 24 16.73
5 18 18 18

Nonclustered indexes 2 and 3 (and the clustered index's btree) are all 3 bytes larger... why? Because the clustering key allows nulls and each of these structures has only 2-3 columns. As a result, the space needed for the null block is 3 bytes (2 bytes for the column count [NCol] and 1 byte for the actual null bitmap [less than 9 columns]).

The btree for index id 3 only needs the nonclustered key column (col2) which does not allow NULLs and the index is unique so the tree is the same as before.

Nonclustered indexes 4 and 5 already had a nullable column and therefore already had a null block.

My point - it all adds up and if it's not truly necessary, then avoid it!

Thanks for reading!
kt

NOTE: I've released other versions of sp_helpindex since this post. Check out the category: sp_helpindex rewrites for all of the options!

OK, I first blogged about re-writing sp_helpindex here as sp_helpindex2 (April 2008). Shortly thereafter (Aug 2008), a reader found a bug and when I went digging, so did I (a couple of others). I did a few tweaks and the version that's stuck for the past couple of years is in this blog post: Updates (fixes) to sp_helpindex2.

However, as much as I've used that - I've always been frustrated by its output. And, so, over the months+, I wrote sp_helpindex3 (this solely added the index_id as the first column and then ordered the output; I've always hated the weird way in which sp_helpindex/2 ordered the output). Then, I started writing sp_helpindex4 which listed the clustering key columns with each index so that I could more easily identify columns on large tables without a lot of scrolling. Ultimately, I didn't like this because the clustering key (or some of the columns in a multi-column clustering key - might already be in the nonclustered index and so I still had to figure things out more manually). And, so I finally sat down and created what's jokingly being called sp_helpindex8.

Geeky note: Er, you might wonder with how many geeky SQL people I've joked about sp_helpindex8 but in fact, in my last few events (both onsite with a couple of customers as well as when teaching the Indexing Strategies and Performance Tuning portion of the Microsoft Certified Masters - SQL Server course (two weeks ago)), I've actually joked - and promised - that I'd have this blogged before this particular 3 week MCM rotation ends. And, well, that's this week. So, I've actually joked with quite a few folks on this and I'm close to the end of time on blogging it. So, here I am - sitting in a hotel room in Houston, after having presented at the Houston Area User Group (7-9pm) (yes, with Paul) and after having been onsite all day with a customer - blogging this darn thing. I just have to get it out there. ;-)

Having said that - I'm pretty happy with this version but it was pretty painful to write. And, before I get into what it does - and why I wanted this version - let me explain what I didn't do and why. There are a lot of other things that could be added. And, quite a few folks have said - why don't you do this or why don't you do that... Specific suggestions have been made such as adding index usage stats (from the current state shown by the DMV) or add index operational stats (to show if an index is suffering from latch IO waits, for example) or add index physical stats (to show if an index is fragmented), etc. And while many of these suggestions would make for lots of interesting uses of this output, I've stayed away from these. Why? Some are expensive to run - such as physical stats - even a limited scan can be painful on a really large table. Or, the results might not be able to be truly trusted  - such as with index_operational_stats. Why? Because the information from that DMV might only cover a short period of time. The information help by them is cleared whenever SQL Server is restarted (or if a database is taken offline or restored, etc.). So, when were those usage or operational stats last cleared? Moreover, is this single "look" at the DMV information really the best on which to base our decisions (even just in general).

In the end, I think that physical stats, operational stats and even usage stats - should be used wisely and carefully (definitely used but not by this). In fact, we often recommend that the best way to use these is by studying them over time and then analyzing their trends not just the point in time values. And, there are LOTS of posts that can help you with understanding this information; here's just a couple to get you started:

So, having said that - I've kept my sp_helpindex8 more pure ;-). What I want to see is EXACTLY what's stored in that index. In fact, this has always been my main goal. The first version (sp_helpindex2) added INCLUDED columns (2005+) as well as filter definitions (2008) to the output. So, what was missing? The added columns that SQL Server might have to add if a nonclustered is nonunique and/or when the clustering key is nonunique. The reason this is so important to me is that there are many things coming together and recommending indexes (DTA, the missing index DMVs, application tools, etc.) and while many of these tools are solely there to "help" us get better performance and "help" us create more effective indexing strategies - these tools sometimes recommend the same index over and over again (see Paul's recent blog post: Missing index DMVs bug that could cost your sanity... about the bug in the missing index DMVs) OR the tools might recommend similar indexes to what we already have.

Additionally, the output from sp_helpindex/2 only shows the columns that are EXPLICITLY defined in the key and in the leaf (using INCLUDE) but they don't show what SQL adds for you. Most folks know that SQL Server adds the clustering key to the nonclustered indexes and when you have a simple single-column clustering key it's not difficult to know that this IS "there" in your index even though it doesn't show up. But, even with that, I think there are still a few things that are lesser known. For example, if your nonclustered index is nonunique - did you know that the clustering key is not only added to the leaf level of the index (like INCLUDE does) but that the clustering key is added into the tree (as part of the key and ordering?). And, if have one index that lists this clustering key explicitly and another that doesn't - are you actually going to be able to easily tell that these are in fact the SAME. Ultimately, I want you to be able to better recognize what's REALLY in that index. What you might end up seeing is that two indexes are more similar than you first thought and as a result - maybe you can drop one of them. Or, you might be able to combine them to reduce your overall indexes (I call this index consolidation). And, all of this becomes significantly more challenging when you have more complex clustering keys (and, yes, they do exist!). The other thing I did is delimit the columns and index names properly using brackets. This is also nice because you might have identifiers that make the output of sp_helpindex/2 hard to read. This should help. However, I do have to admit that you probably have other problems if your index/column names have commas, spaces or brackets in them. But, this should still help/work.

And, that's why I wrote "sp_helpindex8" it's a MUCH better version of sp_helpindex but it requires a few other components. And, it was a lot more challenging to write. Basically, this information is more internal to SQL Server and not directly exposed anywhere. So, we had to do a bit of digging. Paul started the process (thanks!) but, I ended up spending about 5 hours getting it exactly the way I wanted it. And, I also ended up breaking it up into two parts: one sp that gets column definitions and the other that builds the final output. And, I've actually named it with a longer name (sorry!). But, I rarely type in the name anyway. What I typically do is go to SSMS, Tools | Options, then under Environment | Keyboard, I set the keyboard shortcut for Ctrl+F1 to my sp_SQLskills_SQL2008_helpindex.

Now, when I want to see the true defintions of indexes on a table - I highlight the tablename and hit Ctrl+F1. But, if you want to rename this to sp_helpindex8 - I won't be offended. ;-)

So, how do you create this?

  1. You need to create sp_SQLskills_ExposeColsInIndexLevels (sp_SQLskills_ExposeColsInIndexLevels.sql (6.73 kb)). This is what gives us the tree/leaf definitions. And, this works for both SQL Server 2005 and SQL Server 2008.
  2. You need to create the version specific sp_helpindex8. These are named: sp_SQLskills_SQL2005_helpindex (sp_SQLskills_SQL2005_helpindex.sql (10.50 kb)) and  sp_SQLskills_SQL2008_helpindex (sp_SQLskills_SQL2008_helpindex.sql (10.73 kb)). Again, I know they're long names but using the keyboard shortcuts can help. And, if you do rename these make sure you do a replace ALL. There are multiple places where there are comments and/or statements that need to execute to get everything to work (for example the sp_MS_marksystemobject execution).

NOTE: The very first version of the 2005 script had a bug in it (thanks Calvin!). Funnily one I already knew about and thought I had caught (it was fixed in the 2008 version). Anyway, as of 3:15pm CT on May 19, it's fixed in the scripts here. There could be others but so far - it's looking good!

Enjoy!
kt

Time flies when you're having fun, eh? This week Paul and I are in Dublin, Ireland delivering a deep technical "SQL Immersion Event" on Internals, Performance Tuning and Maintenance with our partner ProData (and having a blast - we have a really fun and very interactive group). And, today's an even more interesting day to be in Dublin as it's the 250th anniversary of the Guinness Brewery...everyone around Dublin will be toasting the most popular drink in Ireland today (Sept 24) at 5:59pm GMT. Unfortunately, our class runs until 6pm but Paul tells me that we're still going to celebrate. ;-)

For now, we'll just have to celebrate some SQL (I can hear minor grumblings everywhere :)...

Filtered indexes and filtered stats are an incredibly powerful feature in SQL Server 2008 - one that I've posted on before and will again. Today's post is about some of the gotchas with filtered index stats and filtered stats. Cutting to the punchline - you might have some trouble with their accuracy as your data changes. The SIMPLE solution is to update these specific stats manually and more frequently (meaning that you don't want to rely on the database option AUTO_UPDATE_STATISTICS). In fact, depending on your data volatility, you might want to update these statistics on a VERY frequent basis. Part of the question you'll have to answer is how much of the filtered set changes over the course of a day/week/month. And, in whatever frequency you reach 5-10% (let's say weekly), I'd suggest updating the statistics weekly. If you update 5-10% within the course of a day - then daily. Let me give you the full story...

Statistics are traditionally updated when roughly 20% (+ a minimum of 500 rows) of the data has changed. If you have a table with 100,000 rows then SQL Server will hit the threshold at 20,500 rows (just a tad over 20%) and a table with 10,000 rows will hit the threshold at 2,500 rows (which is 25%). Additionally, in SQL Server 200x (2000, 2005 and 2008) statistics are NOT immediately updated when the threshold is reached, instead they are invalidated. It's not until someone needs the statistic that SQL Server updates it. This reduces thrashing that occurred in SQL Server 7.0 when stats were updated immediately instead of just being invalidated. Another interesting point is what is meant by "20% of the data has changed?"... How is that defined? Is it based on updates to columns or inserts of rows? Of course the answer is... it depends - here, it depends on the version of SQL Server that you're using:

  • SQL Server 2000 defines 20% as 20% of the ROWS have changed. You can see this in sysindexes.rcmodctr.
  • SQL Server 2005/8 defines 20% as 20% of the COLUMN data has changed. You cannot see this unless you are accessing SQL Server through the DAC as it's in a base system table (2005: sysrowsetcolumns.rcmodified and for 2008: sysrscols.rcmodified).

Now, for regular indexes (those that do not include a filter) both strategies have pros and cons. Let's start with the con for SQL 2000... If you have a particularly volatile column then the rowmodctr will be increased quickly and then ALL statistics (even those that have not changed) are invalidated. Without a way to understand a column's volatility, you will invalidate some statistics too soon. So, 2005/2008 created a column modification counter to better define when a statistic should be invalidated. But, the con for SQL 2005/2008 (which is the pro for row-based algorithm of SQL 2000) is that a lot more data must change in a single column before the statistics get updated. If your modifications are fairly well balanced (across the table) then the row-based algorithm could suit you better. If your modifications are isolated (to a specific column) then the column-based algorithm will suit you better. But, neither is perfect... and, the new column-based algorithm poses another HUGE problem for filtered indexes.

For filtered indexes the threshold invalidation algorithm is tied SOLELY to the column and not just to the filtered set. So, if your table has 10,000 rows it takes 2,500 modifications in that column to update statistics. If your filtered index only has 1,000 rows... then you could theoretically modify this specific filtered set 2.5 times before IT would be updated. Yes, I realized that this sounds unlikely but it's really not. What if your filter is for ACTIVE = 1. It's only 1% of your table and it's the most active data set. You will effectively have to modify this set 20 times over before the statistics for your filtered index get updated.

But, having said that - I also have a simple solution. Don't wait... For filtered indexes (especially on volatile data) add a special UPDATE STATISTICS job that specifically updates their stats with a FULLSCAN. The good news is that they are likely to be smaller indexes (and therefore easier/less-costly statistics to update) and you'll clearly get a lot better uses out of them. Additional good news is that many filtered indexes are designed to be covering indexes, covering indexes aren't going to rely as heavily on statistics so there are many covered queries that won't care. The following query will get the list of all of your tables/indexes that have filters. Once you have this list you can create a loop to update statistics.

SELECT object_name(object_id) AS [Table Name]
       , name AS [Index Name]
       , stats_date(object_id, stats_id) AS [Last Updated]
FROM sys.
stats
WHERE has_filter = 1

So, there are lots of options and definitely a few things of which to be aware... filtered indexes and filtered stats are a powerful option to leverage, just make sure you keep those stats up to date!

Cheers... it's time for a pint of Guinness!

kt

PS - For more information on statistics, check out the SQL Server whitepapers:

OK, for years I've been saying that SQL Server doesn't care about the order in which you define the columns of your table because internally SQL Server will re-arrange your columns to store all of the fixed width columns first and the variable columns last. In both the fixed-width portion of the row as well as the variable-width portion of the row, the columns are defined in the order in which they are declared. So, what does matter?

It's all in the cost of the variable array's offset values. If the large majority of your NULLable records are at the end, then SQL Server doesn't need to completely populate the variable block array (which saves 2 bytes per column). If you have a table where 36 columns are NULLable and generally they are NULL, then defining those columns at the end of the row can save you space.

The following script will show you how the maximum length of the row changes based on whether or not a later column in the variable block is NOT NULL - even when most/all of the prior columns are!

CREATE TABLE RowSizeVariableBlock
(
ID
int NOT NULL identity
,
c01 char(10) NOT NULL default 'test'
,
c02 datetime2(7) NOT NULL default sysdatetime
(),
c03 char(80) NOT NULL default 'junk'
,
c04 varchar(100)
NULL,
c05 varchar(100)
NULL,
c06 varchar(100)
NULL,
c07 varchar(100)
NULL,
c08 varchar(100)
NULL,
c09 varchar(100)
NULL,
c10 varchar(100)
NULL,
c11 varchar(100)
NULL,
c12 varchar(100)
NULL,
c13 varchar(100)
NULL,
c14 varchar(100)
NULL,
c15 varchar(100)
NULL,
c16 varchar(100)
NULL,
c17 varchar(100)
NULL,
c18 varchar(100)
NULL,
c19 varchar(100)
NULL,
c20 varchar(100)
NULL,
c21 varchar(100)
NULL,
c22 varchar(100)
NULL,
c23 varchar(100)
NULL,
c24 varchar(100)
NULL,
c25 varchar(100)
NULL,
c26 varchar(100)
NULL,
c27 varchar(100)
NULL,
c28 varchar(100)
NULL,
c29 varchar(100)
NULL,
c30 varchar(100)
NULL,
c31 varchar(100)
NULL,
c32 varchar(100)
NULL,
c33 varchar(100)
NULL,
c34 varchar(100)
NULL,
c35 varchar(100)
NULL,
c36 varchar(100)
NULL,
c37 varchar(100)
NULL,
c38 varchar(100)
NULL,
c39 varchar(100)
NULL,
c40 varchar(100)
NULL
)
go

insert RowSizeVariableBlock DEFAULT VALUES
go

select * from RowSizeVariableBlock
go

select * from sys.dm_db_index_physical_stats
(db_id(), object_id('RowSizeVariableBlock'), null, null, 'detailed'
)
-- review "max" record size = 114
go

insert RowSizeVariableBlock (c01, c03, c20)
values ('med row', 'up to c20', 'test'
)
go

select * from RowSizeVariableBlock
go

select * from sys.dm_db_index_physical_stats
(db_id(), object_id('RowSizeVariableBlock'), null, null, 'detailed'
)
-- review "max" record size = 154
go

insert RowSizeVariableBlock (c01, c03, c30)
values ('med+ row', 'up to c30', 'test'
)
go

select * from RowSizeVariableBlock
go

select * from sys.dm_db_index_physical_stats
(db_id(), object_id('RowSizeVariableBlock'), null, null, 'detailed'
)
-- review "max" record size = 174
go

insert RowSizeVariableBlock (c01, c03, c40)
values ('large row', 'up to c40', 'test'
)
go

select * from RowSizeVariableBlock
go

select * from sys.dm_db_index_physical_stats
(db_id(), object_id('RowSizeVariableBlock'), null, null, 'detailed'
)
-- review "max" record size = 194
go

While there are some other optimizations at this level, most tables cannot benefit from this as the data populations aren't as predictable nor are most tables filled with so many variable-width and NULLable columns. However, if you do have this data pattern, defining these columns at the end of your table's definition - MIGHT save a tremendous amount of space, especially when this table is very large!

Paul's blogged more on these structures as well as the NULL bitmap here: http://www.sqlskills.com/BLOGS/PAUL/post/Misconceptions-around-null-bitmap-size.aspx.

Enjoy! And, thanks for reading,
kt

Recently, the SQL Server 2008 Internals title was released (and only in the past few days have people actually received their copies)! In fact, I still haven't seen the book in person... soon though!

Anyway, Kalen, Paul, Conor, Adam and I worked to create a comprehensive resource on SQL Server 2008 internals and to supplement the written content, many of us created demo scripts and examples. I've now gone back and created a sample script based on ALL of the code in the entire chapter (and in many cases I extended the code in the samples). And, while this updated content is going to be located on the companion content site, I thought I’d also release it here with some notes.

The zip contains a solution with 3 projects, each with a few scripts:

I know the names seem a bit strange but everything is ordered EXACTLY as it is shown in the book. And, in the book, I referenced "a" script called EmployeeCaseStudy-AnalyzeStructures.sql but that script was so big that I broke it down into 7 parts (hence the naming convention of 06...01, 06...02, etc.). Regardless, each script contains a brief header and a few details about the script. To get the most from the script, do not just open the script and execute it. If you really work your way through the script, you should see all of the comments and they will help you to make instance specific changes so that everything runs without error. Just take your time and really try to step back and think about each command (and what you're expecting the output to be) to test yourself while your working through the results. Taking your time and really grapsing all of these internals is what makes it fun!

Quite a bit of this content can stand alone but it's really best as companion content to the title as there's a lot more "text" and detailed information in the book. But, the scripts are really a great way to dive deeper, learn documented/undocumented commands and really get to know what the heck is really going on internally.

Finally, I only worked on Chapter 6 so here it is: 20100628-IndexInternals-Chapter6-Resources.zip (6.13 mb). As for the other companion content, you'll need to get the links from the book! 

And, certainly, if you find a typo or anything that you think needs more clarification, let me know! I'm more than happy to post updates (see below) to this content. 

Have fun,
kt

UPDATES/ERRORLOG:
2009-Apr-13 (8am): Updated the zip after remembering in my sleep (yes, sad, but true!) that one of my comments that referenced some line numbers didn't get updated in the final version. So, no errors and if you don't get this update, it's not going to break anything. But, the script that's been tweaked is script: 05_EmployeeCaseStudy-TableDefinition.sql. Enjoy! kt
2009-Apr-13 (4:30pm): Ha... guess what arrived today. Yes - our copies of the book. Wow, it's great to see it in person. Again, enjoy!
2009-Aug-10: Added a :CONNECT option inside the IndexInternals restore script AND, cleaned up the zip as it had an extra copy of the IndexInternals database in it.
2010-Jun-28: Fixed a typo in the table definitions. However, this code is never executed so it doesn't create errors. Just a typo in the script.

Theme design by Nukeation based on Jelle Druyts