OK, I first blogged about re-writing sp_helpindex here as sp_helpindex2 (April 2008). Shortly thereafter (Aug 2008), a reader found a bug and when I went digging, so did I (a couple of others). I did a few tweaks and the version that's stuck for the past couple of years is in this blog post: Updates (fixes) to sp_helpindex2.

However, as much as I've used that - I've always been frustrated by its output. And, so, over the months+, I wrote sp_helpindex3 (this solely added the index_id as the first column and then ordered the output; I've always hated the weird way in which sp_helpindex/2 ordered the output). Then, I started writing sp_helpindex4 which listed the clustering key columns with each index so that I could more easily identify columns on large tables without a lot of scrolling. Ultimately, I didn't like this because the clustering key (or some of the columns in a multi-column clustering key - might already be in the nonclustered index and so I still had to figure things out more manually). And, so I finally sat down and created what's jokingly being called sp_helpindex8.

Geeky note: Er, you might wonder with how many geeky SQL people I've joked about sp_helpindex8 but in fact, in my last few events (both onsite with a couple of customers as well as when teaching the Indexing Strategies and Performance Tuning portion of the Microsoft Certified Masters - SQL Server course (two weeks ago)), I've actually joked - and promised - that I'd have this blogged before this particular 3 week MCM rotation ends. And, well, that's this week. So, I've actually joked with quite a few folks on this and I'm close to the end of time on blogging it. So, here I am - sitting in a hotel room in Houston, after having presented at the Houston Area User Group (7-9pm) (yes, with Paul) and after having been onsite all day with a customer - blogging this darn thing. I just have to get it out there. ;-)

Having said that - I'm pretty happy with this version but it was pretty painful to write. And, before I get into what it does - and why I wanted this version - let me explain what I didn't do and why. There are a lot of other things that could be added. And, quite a few folks have said - why don't you do this or why don't you do that... Specific suggestions have been made such as adding index usage stats (from the current state shown by the DMV) or add index operational stats (to show if an index is suffering from latch IO waits, for example) or add index physical stats (to show if an index is fragmented), etc. And while many of these suggestions would make for lots of interesting uses of this output, I've stayed away from these. Why? Some are expensive to run - such as physical stats - even a limited scan can be painful on a really large table. Or, the results might not be able to be truly trusted  - such as with index_operational_stats. Why? Because the information from that DMV might only cover a short period of time. The information help by them is cleared whenever SQL Server is restarted (or if a database is taken offline or restored, etc.). So, when were those usage or operational stats last cleared? Moreover, is this single "look" at the DMV information really the best on which to base our decisions (even just in general).

In the end, I think that physical stats, operational stats and even usage stats - should be used wisely and carefully (definitely used but not by this). In fact, we often recommend that the best way to use these is by studying them over time and then analyzing their trends not just the point in time values. And, there are LOTS of posts that can help you with understanding this information; here's just a couple to get you started:

So, having said that - I've kept my sp_helpindex8 more pure ;-). What I want to see is EXACTLY what's stored in that index. In fact, this has always been my main goal. The first version (sp_helpindex2) added INCLUDED columns (2005+) as well as filter definitions (2008) to the output. So, what was missing? The added columns that SQL Server might have to add if a nonclustered is nonunique and/or when the clustering key is nonunique. The reason this is so important to me is that there are many things coming together and recommending indexes (DTA, the missing index DMVs, application tools, etc.) and while many of these tools are solely there to "help" us get better performance and "help" us create more effective indexing strategies - these tools sometimes recommend the same index over and over again (see Paul's recent blog post: Missing index DMVs bug that could cost your sanity... about the bug in the missing index DMVs) OR the tools might recommend similar indexes to what we already have.

Additionally, the output from sp_helpindex/2 only shows the columns that are EXPLICITLY defined in the key and in the leaf (using INCLUDE) but they don't show what SQL adds for you. Most folks know that SQL Server adds the clustering key to the nonclustered indexes and when you have a simple single-column clustering key it's not difficult to know that this IS "there" in your index even though it doesn't show up. But, even with that, I think there are still a few things that are lesser known. For example, if your nonclustered index is nonunique - did you know that the clustering key is not only added to the leaf level of the index (like INCLUDE does) but that the clustering key is added into the tree (as part of the key and ordering?). And, if have one index that lists this clustering key explicitly and another that doesn't - are you actually going to be able to easily tell that these are in fact the SAME. Ultimately, I want you to be able to better recognize what's REALLY in that index. What you might end up seeing is that two indexes are more similar than you first thought and as a result - maybe you can drop one of them. Or, you might be able to combine them to reduce your overall indexes (I call this index consolidation). And, all of this becomes significantly more challenging when you have more complex clustering keys (and, yes, they do exist!). The other thing I did is delimit the columns and index names properly using brackets. This is also nice because you might have identifiers that make the output of sp_helpindex/2 hard to read. This should help. However, I do have to admit that you probably have other problems if your index/column names have commas, spaces or brackets in them. But, this should still help/work.

And, that's why I wrote "sp_helpindex8" it's a MUCH better version of sp_helpindex but it requires a few other components. And, it was a lot more challenging to write. Basically, this information is more internal to SQL Server and not directly exposed anywhere. So, we had to do a bit of digging. Paul started the process (thanks!) but, I ended up spending about 5 hours getting it exactly the way I wanted it. And, I also ended up breaking it up into two parts: one sp that gets column definitions and the other that builds the final output. And, I've actually named it with a longer name (sorry!). But, I rarely type in the name anyway. What I typically do is go to SSMS, Tools | Options, then under Environment | Keyboard, I set the keyboard shortcut for Ctrl+F1 to my sp_SQLskills_SQL2008_helpindex.

Now, when I want to see the true defintions of indexes on a table - I highlight the tablename and hit Ctrl+F1. But, if you want to rename this to sp_helpindex8 - I won't be offended. ;-)

So, how do you create this?

  1. You need to create sp_SQLskills_ExposeColsInIndexLevels (sp_SQLskills_ExposeColsInIndexLevels.sql (6.73 kb)). This is what gives us the tree/leaf definitions. And, this works for both SQL Server 2005 and SQL Server 2008.
  2. You need to create the version specific sp_helpindex8. These are named: sp_SQLskills_SQL2005_helpindex (sp_SQLskills_SQL2005_helpindex.sql (10.50 kb)) and  sp_SQLskills_SQL2008_helpindex (sp_SQLskills_SQL2008_helpindex.sql (10.73 kb)). Again, I know they're long names but using the keyboard shortcuts can help. And, if you do rename these make sure you do a replace ALL. There are multiple places where there are comments and/or statements that need to execute to get everything to work (for example the sp_MS_marksystemobject execution).

NOTE: The very first version of the 2005 script had a bug in it (thanks Calvin!). Funnily one I already knew about and thought I had caught (it was fixed in the 2008 version). Anyway, as of 3:15pm CT on May 19, it's fixed in the scripts here. There could be others but so far - it's looking good!

Enjoy!
kt

When SQL Server 2008 was in beta, Paul and I worked on a project to create content that would "jumpstart" folks in learning and working with SQL Server 2008. We delivered these "new features" courses at various locations and finally, we recorded the content. All of this content is now available FOR FREE from Microsoft through their eLearning outlet. This course is officially titled and available as: Clinic 10259: SQL Server 2008: Database Infrastructure and Scalability.

There are 12 hours of lecture and demo content... lots of good stuff.

Modules & Lessons

Availability Enhancements

  • Module overview
  • Demo: Database Mirroring automatic page repair
  • Database Mirroring enhancements continued, automatic page repair, log stream
  • Backup compression and using Transparent Data Encryption (TDE) to encrypt data
  • Hot-add CPU, DBCC CHECKDB enhancements, failover clustering enhancements
  • Peer-to-peer replication enhancements
  • Demo: Peer-to-peer replication

Policy-Based Management and Multi-Server Administration

  • Module overview
  • Central management server
  • Demo: Central management server
  • Central management server metadata, multi-server script execution
  • Demo: Multi-server script execution
  • Policy-based management overview, conditions, facets and policies
  • Policy-based management demo: Facets and conditions
  • Applying facets and conditions to policies, policy enforcement
  • Demo: Applying facets and conditions to policies
  • Policy based management architecture and execution
  • Demo: Applying policies to multiple servers
  • Policy based management advanced topics

Performance Data Collection

  • Module overview, management data warehouse
  • Demo: Creating and configuring a management data warehouse
  • Data collection overview and architecture
  • Demo: Configuring system data collection on an instance
  • Demo: Transact-SQL custom data collection
  • Data collection reports and module summary

Resource Management and Troubleshooting Enhancements

  • Module Overview
  • Demo: Resource Governor
  • Extended events, predicates, actions, targets, types, maps, sessions

Security Enhancements

  • Module overview
  • Transparent data encryption
  • Demo: Transparent data encryption
  • Backups and transparent data encryption, transparent data encryption limitations
  • Integrated authentication enhancements, SQL Server 2008 auditing improvements
  • Demo: SQL Server 2008 all actions audited
  • Security Summary

Management Implications of New Features Part 1

  • Module overview, spatial indexes, integrated full-text search
  • Problems addressed by sparse columns
  • Demo: Sparse columns and column sets
  • Sparse columns and filtered indexes, best practices for sparse columns
  • Demo: Constructing tabular data sets with sparse columns
  • Limitations of sparse columns and interaction with other features
  • Overview of filtered indexes
  • Demo: Using filtered indexes with sparse columns
  • Filtered statistics, monitoring filtered indexes and statistics
  • Demo: Using filtered indexes for improved index coverage

Management Implications of New Features Part 2

  • Change tracking overview, change tracking best practices
  • Change data capture (CDC) overview
  • Demo: Change Data Capture
  • Blob storage with Filestream overview, enabling Filestream
  • Filestream security, filestream and transactions, monitoring filestream storage
  • Demo: Filestream manageability demo

Scalability Enhancements

  • Data compression overview, data compression trade-offs
  • Demo: Estimating data compression space savings
  • Estimating data compressing space savings, choosing row compression
  • Demo: Enabling compression and comparing performance
  • Enabling and disabling compression, data compression options and syntax
  • Service broker improvements, Service Broker conversation priority
  • Query optimizer enhancements, MERGE overview
  • Forcing query optimizer plans with plan guides
  • Demo: Using plan guides
  • Star join optimizer improvements
  • Partitioning enhancements overview, partition-aligned indexed views
  • Partitioning and lock escalation enhancements
  • Demo: Improving concurrency with partition lock escalation
  • Module review

There are a few more resources to mention. I've got a few more posts coming on this right away!

Enjoy!!
kt

Time flies when you're having fun, eh? This week Paul and I are in Dublin, Ireland delivering a deep technical "SQL Immersion Event" on Internals, Performance Tuning and Maintenance with our partner ProData (and having a blast - we have a really fun and very interactive group). And, today's an even more interesting day to be in Dublin as it's the 250th anniversary of the Guinness Brewery...everyone around Dublin will be toasting the most popular drink in Ireland today (Sept 24) at 5:59pm GMT. Unfortunately, our class runs until 6pm but Paul tells me that we're still going to celebrate. ;-)

For now, we'll just have to celebrate some SQL (I can hear minor grumblings everywhere :)...

Filtered indexes and filtered stats are an incredibly powerful feature in SQL Server 2008 - one that I've posted on before and will again. Today's post is about some of the gotchas with filtered index stats and filtered stats. Cutting to the punchline - you might have some trouble with their accuracy as your data changes. The SIMPLE solution is to update these specific stats manually and more frequently (meaning that you don't want to rely on the database option AUTO_UPDATE_STATISTICS). In fact, depending on your data volatility, you might want to update these statistics on a VERY frequent basis. Part of the question you'll have to answer is how much of the filtered set changes over the course of a day/week/month. And, in whatever frequency you reach 5-10% (let's say weekly), I'd suggest updating the statistics weekly. If you update 5-10% within the course of a day - then daily. Let me give you the full story...

Statistics are traditionally updated when roughly 20% (+ a minimum of 500 rows) of the data has changed. If you have a table with 100,000 rows then SQL Server will hit the threshold at 20,500 rows (just a tad over 20%) and a table with 10,000 rows will hit the threshold at 2,500 rows (which is 25%). Additionally, in SQL Server 200x (2000, 2005 and 2008) statistics are NOT immediately updated when the threshold is reached, instead they are invalidated. It's not until someone needs the statistic that SQL Server updates it. This reduces thrashing that occurred in SQL Server 7.0 when stats were updated immediately instead of just being invalidated. Another interesting point is what is meant by "20% of the data has changed?"... How is that defined? Is it based on updates to columns or inserts of rows? Of course the answer is... it depends - here, it depends on the version of SQL Server that you're using:

  • SQL Server 2000 defines 20% as 20% of the ROWS have changed. You can see this in sysindexes.rcmodctr.
  • SQL Server 2005/8 defines 20% as 20% of the COLUMN data has changed. You cannot see this unless you are accessing SQL Server through the DAC as it's in a base system table (2005: sysrowsetcolumns.rcmodified and for 2008: sysrscols.rcmodified).

Now, for regular indexes (those that do not include a filter) both strategies have pros and cons. Let's start with the con for SQL 2000... If you have a particularly volatile column then the rowmodctr will be increased quickly and then ALL statistics (even those that have not changed) are invalidated. Without a way to understand a column's volatility, you will invalidate some statistics too soon. So, 2005/2008 created a column modification counter to better define when a statistic should be invalidated. But, the con for SQL 2005/2008 (which is the pro for row-based algorithm of SQL 2000) is that a lot more data must change in a single column before the statistics get updated. If your modifications are fairly well balanced (across the table) then the row-based algorithm could suit you better. If your modifications are isolated (to a specific column) then the column-based algorithm will suit you better. But, neither is perfect... and, the new column-based algorithm poses another HUGE problem for filtered indexes.

For filtered indexes the threshold invalidation algorithm is tied SOLELY to the column and not just to the filtered set. So, if your table has 10,000 rows it takes 2,500 modifications in that column to update statistics. If your filtered index only has 1,000 rows... then you could theoretically modify this specific filtered set 2.5 times before IT would be updated. Yes, I realized that this sounds unlikely but it's really not. What if your filter is for ACTIVE = 1. It's only 1% of your table and it's the most active data set. You will effectively have to modify this set 20 times over before the statistics for your filtered index get updated.

But, having said that - I also have a simple solution. Don't wait... For filtered indexes (especially on volatile data) add a special UPDATE STATISTICS job that specifically updates their stats with a FULLSCAN. The good news is that they are likely to be smaller indexes (and therefore easier/less-costly statistics to update) and you'll clearly get a lot better uses out of them. Additional good news is that many filtered indexes are designed to be covering indexes, covering indexes aren't going to rely as heavily on statistics so there are many covered queries that won't care. The following query will get the list of all of your tables/indexes that have filters. Once you have this list you can create a loop to update statistics.

SELECT object_name(object_id) AS [Table Name]
       , name AS [Index Name]
       , stats_date(object_id, stats_id) AS [Last Updated]
FROM sys.
stats
WHERE has_filter = 1

So, there are lots of options and definitely a few things of which to be aware... filtered indexes and filtered stats are a powerful option to leverage, just make sure you keep those stats up to date!

Cheers... it's time for a pint of Guinness!

kt

PS - For more information on statistics, check out the SQL Server whitepapers:

Recently, the SQL Server 2008 Internals title was released (and only in the past few days have people actually received their copies)! In fact, I still haven't seen the book in person... soon though!

Anyway, Kalen, Paul, Conor, Adam and I worked to create a comprehensive resource on SQL Server 2008 internals and to supplement the written content, many of us created demo scripts and examples. I've now gone back and created a sample script based on ALL of the code in the entire chapter (and in many cases I extended the code in the samples). And, while this updated content is going to be located on the companion content site, I thought I’d also release it here with some notes.

The zip contains a solution with 3 projects, each with a few scripts:

I know the names seem a bit strange but everything is ordered EXACTLY as it is shown in the book. And, in the book, I referenced "a" script called EmployeeCaseStudy-AnalyzeStructures.sql but that script was so big that I broke it down into 7 parts (hence the naming convention of 06...01, 06...02, etc.). Regardless, each script contains a brief header and a few details about the script. To get the most from the script, do not just open the script and execute it. If you really work your way through the script, you should see all of the comments and they will help you to make instance specific changes so that everything runs without error. Just take your time and really try to step back and think about each command (and what you're expecting the output to be) to test yourself while your working through the results. Taking your time and really grapsing all of these internals is what makes it fun!

Quite a bit of this content can stand alone but it's really best as companion content to the title as there's a lot more "text" and detailed information in the book. But, the scripts are really a great way to dive deeper, learn documented/undocumented commands and really get to know what the heck is really going on internally.

Finally, I only worked on Chapter 6 so here it is: 20100628-IndexInternals-Chapter6-Resources.zip (6.13 mb). As for the other companion content, you'll need to get the links from the book! 

And, certainly, if you find a typo or anything that you think needs more clarification, let me know! I'm more than happy to post updates (see below) to this content. 

Have fun,
kt

UPDATES/ERRORLOG:
2009-Apr-13 (8am): Updated the zip after remembering in my sleep (yes, sad, but true!) that one of my comments that referenced some line numbers didn't get updated in the final version. So, no errors and if you don't get this update, it's not going to break anything. But, the script that's been tweaked is script: 05_EmployeeCaseStudy-TableDefinition.sql. Enjoy! kt
2009-Apr-13 (4:30pm): Ha... guess what arrived today. Yes - our copies of the book. Wow, it's great to see it in person. Again, enjoy!
2009-Aug-10: Added a :CONNECT option inside the IndexInternals restore script AND, cleaned up the zip as it had an extra copy of the IndexInternals database in it.
2010-Jun-28: Fixed a typo in the table definitions. However, this code is never executed so it doesn't create errors. Just a typo in the script.

I started the series here: http://www.sqlskills.com/BLOGS/KIMBERLY/post/Spring-cleaning-your-indexes-Part-I.aspx and I want to continue with Part II today by clarifying some great questions/comments that have come up on the series. In Part III, I'll give you a few more ways to get rid of (or consolidate) indexes. And, I think there's still a bit more that Paul and I will investigate further (wrt to operational stats) but, I want to address a few comments and a few interesting things that both Paul and I have found.

In the Part I post, I talked about using sys.dm_db_index_usage_stats to see if there are any indexes that just aren't being used at all... A few comments asked why I didn't use operational_stats instead. To address that first, there are a few key differences:

  • dm_db_index_operational_stats is persisted only as long as an object is in cache (however, it's not cleared when objects are forced out of cache with DBCC DROPCLEANBUFFERS). If you want to clear ALL DMVs for a specific database, then a relatively easy way to do this (IN TESTING) is to take the database offline and then immediately bring it online again.
    • ALTER DATABASE <dbname> SET OFFLINE
    • ALTER DATABASE <dbname> SET ONLINE
      • NOTE: If there are any suspect files, you will NOT be able to bring your database back online without FIRST taking all suspect files OFFLINE. And, if you take a FILE offline then it's even more important to know that THERE IS NO WAY TO BRING A FILE ONLINE without restoring it from backups. So, it's VERY important to understand that OFFLINE/ONLINE for a database is really easy IF AND ONLY IF there are no other problems with the DB. You really need to resolve those problems first (or at least know that you're going to need to resolve those problems later through backup/restore) before you take a database offline.
  • dm_db_index_operational_stats is (from BOL) neither persistent nor transactionally consistent. This means you cannot use these counters to determine whether an index has been used or not, or when the index was last used. For information about this, see sys.dm_db_index_usage_stats.

Having said that though, none of these are really any guarantee of perfect information. And, they're not meant to be. I look at these DMVs as being a quick and easy way to get some relatively descent insight into what is or is not happening in my environement. However, even though the sys.dm_db_index_operational_stats might give you insight that you have a problem it still doesn't give you good insight into exactly what that problem might be. For example, it *might* be splits that causes some of your wait times to increase (column: page_io_latch_wait_in_ms) but, it could be something else too (some other system issue).

The main point, you can use these to get insight into which tables have the biggest problems (i.e. the biggest waits) and where they might have a lot of splits (column:nonleaf_allocation_count) but, in all honesty, that's not a guarantee. In fact, the reason I said "might" is that pages that are allocated at the end of the leaf level STILL allocate a page and require an entry to be made in the next level up in the index. So, a lot of nonleaf_allocations COULD be for a perfectly unfragmented index. So, it still doesn't tell you how fragmented the objects are or what the REAL problem is (or even if it is a problem yet).

Basically, these just tell you where you have the most activity and give you a starting point for problem solving. But, none of these (usage or operational) really tell you how to solve the problem. However, sometimes even knowing where to start IS the problem in and of itself. So, I'm not against these DMVs and I really do think you can get some good insight from them. Just use them as a tool to help focus your investigations. Use better tools like sys.dm_db_index_physical_stats to really see if you have fragmentation and where it's the worst.

hth,
kt

First and foremost, happy spring! I truly hope we're on the path to summer (although who would know it here - we're in Florida for SQL Connections and the weather is a bit chilly and it's been raining off/on today - I hope this is short term (no, I don't want to look at the forecast as I don't want to jinx it :)). But, wherever you are - I hope you're on your way to nicer weather and minimal cold weather (ok, I guess I only have warm wishes for the northern hem... for you southies - I hope your fall is lovely!!).

But, for everyone - now's a good time to start thinking about cleaning out the [non-literal index] closet... and getting rid of some of those dusty indexes?

Why/when should you get rid of some of your indexes?

  1. It's possible that some of them aren't being used at all...
  2. Especially when they're not being used but even when they're "redundant" (or minimally useful) they're still costing you in many ways:
    1. Wasting space on disk
    2. Wasting space in memory (well, if they are being used then they're cluttering up your cache)
    3. Wasting space/time in your maintenance routines (so, here they're cluttering up your cache for sure!)
    4. Wasting space in your backups
  3. You might be able to reduce your overall indexes with index consolidation...

So, for this post, I'm going to target #1 - are there any indexes that just aren't being used at all...

First, how do you know if your indexes are being used?
In SQL Server 2005 and higher, there's a DMV (dynamic management view) called sys.dm_db_index_usage_stats and it's there to track index usage patterns. However, it's not persisted since the beginning of time and as a result, if you look at this and believe that it's telling you ALL of the indexes that have been used in your database - then you might be mistaken. The index usage stats DMV is cleared when SQL Server is restarted as well as when you detach/attach the database or when you backup/restore the database.

Therefore, you don't want to just run the following query and drop all of the indexes that aren't being used. A better way to "trust" this information is to periodically persist the data from the DMV in your own table and then query it after you've completed a business cycle's worth of activity - logging all of the usage stats. Then you can trust this much more. Again, here are a couple of negatives:
1) it's not persisted
2) it only keeps the database_id and the index_id (which could change over time). You're right in thinking it probably shouldn't change but, a nonclustered index's ID is not permanent so, it's better to track the index name in addition to the index_id. And, when you run your queries to determine what to delete, you can easily verify indexes against the current indexes because your comparison is within the same db (more on this below).

If you want to persist this, then you have two ways to do this:

Store the index usage patterns in a table within the specific database you're tracking:

Pro: it goes with the database when you back it up, etc. and, it's easier to reverse engineer which actual indexes you're referring to (grabbing the names and not just the IDs).

Con: it's a bit more complex of a query to run and you'll need to run it for all of your databases (ok, it's really not all that bad - but, using something like sp_msForEachDB will really help)

Store the index usage patterns from all databases in a table within master or your own "performance database":

Pro: you only need one job to handle all the index usage info AND object_name *does* support TWO parameters (object_id and database_id) so, as long as you trap the name at the time of insert then you'll be good.

Con: it doesn't go with the database (e.g. backup/restore - and if you're restoring to a test system and you want to see what the usage patterns were then you'll need to get this information as well...)

Here's a simple query that you can run that shows all the indexes used right now - and adds the databasename/objectname into the results - in a persisted table you'll also want to add the runtime:

SELECT getdate() AS RunTime
, DB_NAME(i.database_id) as DatabaseName
, OBJECT_NAME(i.object_id, i.database_id) as ObjectName
, *
FROM sys.dm_db_index_usage_stats AS i
WHERE object_id > 100

And, if you want to get a few more insghts into how to persist this on Paul's blog here: http://www.sqlskills.com/BLOGS/PAUL/post/Indexes-From-Every-Angle-How-can-you-tell-if-an-index-is-being-used.aspx.

OK, so, you have a few options to think about and I have a few more parts to post!
kt

PS - The Tipping Point is coming soon too. I'm still adding a few things to that one!!! ;-)

When you decide to rebuild or reorganzie an index, you have an option to set something called FILLFACTOR. This is probably the MOST IMPORTANT thing to understand about index maintenance and reducing fragmentation (especially in databases that are prone to it). Unfortunately, we need to dive into some internals to really understand why this is so helpful...

What is FILLFACTOR?

FILLFACTOR defines how full the leaf level pages of the index are filled when rebuilding or reorganizing an index. The leaf level of an index is the largest structure of an index and it's the most prone to fragmentation. A good analogy would be to use the phone book (ah, loosely for this one) and imagine that new people come to the area. The *entire* phone book needs to get reprinted. Obviously this is terribly expensive and therefore it's not done often (well, we all use the internet anyway...). But, imagine if EACH page of the phone book were only filled to 90%. When new people come to the area we could just add their information - on the specific page where it should reside - and the phone book stays relatively intact and we don't have to kill a bunch of trees (OK, you could argue that you would have had to kill more trees to make a phone book that has 10% whitespace...but, let's not go there right now. :). OK, I know this analogy is weak at best but think back to "a page in the phone book" and think of it representing a "page" within SQL Server (quite literally here - SQL Server actually stores data on 8KB pages and indexes are doubly-linked lists of these 8KB pages).

So, if the pages are full and there's nowhere to put a particular row (let's say someone with a lastname of 'Tripp'), how does SQL Server "insert" this new row? They can't possibly shift all of the data down and rewrite all of the pages AFTER 'Tripp' could they? I suppose they could but it would get REALLY expensive - especially if someone with the last name of 'Anderson' moves to town...

How does SQL Server make room for new rows? They do something called a page split. Now, this can be A LOT more technical than I'm going to explain here but simply put, they split the page (on which the row is to be located) into half. The first 50% of the rows STAYS on the current page and the 50% MOVES to a new page (and this new page is probably nowhere near the page being split). However, they keep this list LOGICALLY in order by maintaining a doubly-linked list. So, SQL Server knows where to go when it's reading the data - even if the pages are not physically/sequentially in order. If this happens once or twice, it's not a huge problem. But as it continues to happen, it makes scans of your data more and more expensive and the process of the split can be expensive in and of itself. The biggest problems are:

  1. The data is not physically in order (and can create lot random IOs when scanning)
  2. The pages are no longer very dense (they've gone from 100% to 50%). Splitting does has a positive side in that it's going to take awhile before the page splits again BUT, splitting is meant as a temporary fix and ideally, it shouldn't be happening to every page. But, this also negatively impacts cache - you're only using 50% of the page that's split so now you're wasting half of that page in memory (and, on disk). So, while making a table 10% wider from the start seems bad; it reduces the need for splitting and therefore your table will stay more intact over time. 

What indexes have fragmentation?

Some indexes are prone to more fragmentation than others. For example, indexes with a high-order element (the first column of the index) that is NOT in ever-increasing order. This does NOT mean that ALL indexes should be created such that their first column is ever-increasing (that's NOT a good way to create indexes!). But, it does mean that you should pay special attention to some of your larger tables and therefore larger indexes (meaning ESPECIALLY the clustered index since the clustered index is over ALL of the columns). 

In terms of index management, here are the main things to think about for each index type: 

Rebuilding the clustered index is the most expensive index to rebuild. It's also the one that you'd rather avoid if you can. It has the most limitations... A clustered index CANNOT be rebuilt as an online operation IF the table has ANY LOB columns in it at all. What this means is that the clustered index - if fragmented - must be rebuilt using an offline operation. This translates into downtime for that table - while it's being maintained. So, if you can create a good clustering key EARLY, then you might be able to minimize clustered index maintenance.

Rebuilding a nonclustered index is much easier and likely to be an ONLINE operation. Because nonclustered indexes contain a much smaller number of columns - and rarely ever include a LOB column - they can be managed/maintained through ONLINE index operations. So, even if these are prone to more fragmentation (and, in general, they are) then it's not as big of a deal.

How do I solve this problem?

So, the BEST thing to do is to visually picture the ORDER of the data in the index (just think of it like a phone book - a LIST of records on PAGES - ordered by the KEY of the index). If you can visualize that structure then you can visualize what happens on an INSERT, UPDATE and a DELETE. Once you have a good visual you should be able to define a FILLFACTOR based on these two things:

  1. How much fragmentation is happening (is it becoming massively fragmented quickly) - then set a lower fillfactor (like 70 or 80 - depending on how big the table is)
  2. How frequently are you going to be running maintenance (if you can do this daily or weekly - for some objects) then you might not need much free space. If you can run maintenance against the table frequently then you won't need as much free space.

And, of course, you can directly check the level of fragmentation your table has by using the DMV: dm_db_index_physical_stats (which I referenced in Part I) and ideally, you'll create a procedure that runs through your objects determining how much fragmentation they have and the appropriate FILLFACTOR. Sometimes the best way to do this is to create your own "management table" which lists all of your tables and then have definitions for when to rebuild, when to reorg and what fillfactor to use. There are a few folks that have done things like this. Check out Ola's site here. He's put a lot of effort into a more robust and flexible combination of options for database maintenance. It might be more than you need but it's a GREAT EXAMPLE of all of the things that you can evaluate and consider wrt to maintenance.

Solving this using Database Maintenance Plans

Finally, if you CHOOSE to use a maintenance plan you ONLY have two options:

Reorganize pages with the default amount of free space
Does NOT sound like what it actually does (IMO). It sounds like it uses some form of SQL Server default value. However, it actually uses the "OrigFillFactor" value that's set for the index. This is actually a good thing IF you've set the FILLFACTOR on the CREATE INDEX statement OR on a rebuild. Reorganizing does NOT allow you to set this ONLY a rebuild does. So, if you have set this then this choice is probably best. If you run a query against sysindexes (yes, the backward compatibility view NOT the new sys.indexes catalog view), then you can see the OrigFillFactor setting for ALL of your indexes. (And, it's probably 0 - which sounds strange... 0 actually means 100... I could explain this but just trust me on this one.)

Change free space per page percentage to _____ %
Well, this is certainly an option you can consider. If you've never set the FILLFACTOR (or, even if you have...), this would overwrite it AND change the OrigFillFactor value to the percent you use here. If you decide to take this "sledgehammer" approach then I'd suggest something quite high - like 90-95% so that you don't end up wasting a lot of space. Some indexes (ones that have a high-order element that IS ever-increasing) don't really need free space and this might be completely wasted. So, I think it's better to set this indivually using stored procedures and/or a more custom maintenance procedure.

NOTE/TIP FROM COMMENTS POSTED: While I explain this concept (above) using the index option/setting (called FILLFACTOR) the UI is wanting the FREE SPACE percentage. So, if you want to use 90-95% FULL then be sure to set this value to 5-10%. Using the syntax, you will set page fullness.

OK, I still have at least 2 more in this series and EVEN THOUGH I've been completely sucked into facebook (as Paul blogged here), I'm still trying to manage to work and save SQLskills. Paul, on the other hand, just uploaded more photos. ;-)

Cheers,
kt

OK, it seems as though there's A LOT of confusion about what steps are required for proper database maintenance. And, it seems as though some recommendations are being given as "quick fixes" without any real recommendation for root cause analysis. I'm not saying that the generalizations are horribly wrong but in many cases they're just too broad and/or unspecific to actually be useful (and, well, in all honesty, some of them are just really bad recommendations because they’re so ambiguous). And, in my random internet trolling for the day, I found 4 different references that I want to go through (which is why this is only Part I). For this post, I’ll focus just on Sharepoint.

First, what did I see that’s motivating this post?
I found the following KB article – which was referenced by numerous sites as recommended reading. And, without knowing a lot about SQL (and, that’s NOT meant as a dig at all – it’s just that most apps that sit on SQL don’t ever even recommend that you need to know SQL and I can argue certain aspects of that point as well BUT, wrt to maintenance, it can really become a problem if you don't know a few things about these tasks), it does seem like good reading: Information about the Maintenance Plan Wizard in SQL Server 2005 and about tasks that administrators can perform against SharePoint databases

Here is the part that over-simplifies picking what maintenance tasks to run vs. what not to run:

DIRECTLY TAKEN FROM THIS KB ARTICLE IS THE FOLLOWING:

We have tested these tasks and the effects that these tasks have on database schema and performance. The following table summarizes the results of these tests.

Task                                        Safe to perform this task?
Check database Yes
Reduce a database Yes
Reorganize an index Yes
Clean up the history Yes
Update statistics Yes. However, this task is unnecessary because the SharePoint Timer service performs this task automatically.
Rebuild an index No. The task does not restore existing index options before the rebuild operation. However, you can use scripts that restore index options.
Note This problem was corrected in SQL Server 2005 Service Pack 2.

We used the following criteria to determine whether a task was safe to perform:

  • Whether the task modified the database schema from its default state
  • Whether the task decreased performanceResults may vary depending on the environment.

However, if you use the Maintenance Plan Wizard to perform the tasks that are listed in the table as "safe to perform," you are likely to experience increased performance in SQL Server 2005.

The big problem is: this is just too little information about too many VERY important tasks!

Let me break this down task by task and give you a few other places to go for more information.

Check database

The check database task refers to DBCC CHECKDB. This is definitely an important part of any maintenance plan. And, it really is a safe task to run as it’s NOT corrective by default. However, there is nothing mentioned about how this command may completely flush your buffer pool as it reads all of the pages of all of the objects it’s checking. So, this might impact performance but, of all of the tasks, this is the safest to run and it’s definitely a recommended task.

If you want to learn more about DBCC CHECKDB, check out these things: 

  1. Read Paul’s blog post to his TechNet article on Database Maintenance. It’s a great overview of all of these maintenance tasks and will give you a good overview of what each one does. Here’s a link to the blog post that pulls all of the TechNet resources together: http://www.sqlskills.com/BLOGS/PAUL/post/TechNet-Magazine-Effective-Database-Maintenance-article-and-August-SQL-QA-column.aspx  
  2. Read Paul’s blog post on Myths around causing corruption – so that you can get better insight into where/why the actual corruptions are occurring. 
  3. Finally, if you’re really interested in the internals of CHECKDB and how it works – Paul’s written a ton about it in his CHECKDB from Every Angle category. FYI, 3 of his 9 years on the SQL Server Development Team were spent writing CHECKDB and repair for SQL Server 2005 (so, he definitely knows how it works J). Here’s the link to the category: http://www.sqlskills.com/BLOGS/PAUL/category/CHECKDB-From-Every-Angle.aspx

Reduce a database 

OK, I’m sure I’ll get a lot of responses to this one but IMO, a database maintenance plan SHOULD NEVER INCLUDE A SHRINK.

Let me explain… J
To be honest, I'm not even a fan of manually running database-level shrinks (DBCC SHRINKDATABASE) either. Don't get me wrong - there are ACCEPTABLE times to shrink parts of a database but, in general, I'd recommend only using DBCC SHRINKFILE for individual file-level shrinks. I wouldn't schedule shrinks nor would I EVER turn on [the database option] autoshrink. I don't think shrinks should EVER be automated - either through the database option OR through maintenance plans.

If you need to do regular shrinks - then it's likely that you have some other problem. And, without DIRECTLY addressing this problem, you *might* be making things worse.

This is a bigger discussion and I’ve found a few other references that I want to pull together. I’ll post another post about this within the next day or so – and link to it from here BUT, for right now…Know this – free space is generally GOOD. Excessive free space has happened FOR A REASON. Maybe there’s a pattern to it but often shrinking is worse than just leaving the free space for the next data explosion (a bunch of data comes in, the database grows, the data is archived, the free space remains for the next set of data that comes in).

If you shrink the database you might make things worse by fragmenting everything. Paul’s video that goes with the TechNet article on Database Maintenance shows you the [shocking if you didn’t know this] effect of shrinking a database on indexes.

Reorganize an index, Update statistics and Rebuild an index

These need to be grouped together to start because this KB article does NOT address the impact of running these together. In fact, the problem, if you run these together – is that you MIGHT make things worse. First, let me give you an overview of each:

Reoganizing an index removes fragmentation in the largest part of an index (it’s called the leaf level of the index) and removing fragmentation in this level has the greatest (and positive) effect on range query scans and cache. So, this is really the most important type of fragmentation to remove. However, this is NOT the only way to do it… 

Rebuilding an index completely and totally removes ALL forms of fragmentation in all levels of an index; however, this is the most expensive (yet most effective) way to do it. As a result of rebuilding an index, SQL Server also updates the statistics for the indexes that were rebuilt. Therefore you do not need to update statistics OR reorganize an index if it gets rebuilt.

Updating statistics is important for query processing and optimization. The query processor uses statistics on your data to help determine how many rows will be processed by your query/statement. If SQL Server can accurately estimate the rows, then it can choose a more effective plan. However, if it doesn’t have good statistics, then it may not do as good of a job at accurately estimating rows and therefore it might not come up with as optimal of a plan. So, this is an integral part to good database health. However, some of this might be done via the database option: auto update statistics which is ON by default in SQL Server (and, YES, you should leave this on). Check out this post on: Auto update statistics and auto create statistics - should you leave them on and/or turn them on??

However, if you use a maintenance plan then I really see two problems: 

  1. You’ll end up doing maintenance on things that may not need it. The default behavior for these tasks is just to run them on the selected objects. And, since many people will choose all objects (possibly even of all databases) then you’ll probably select objects that won’t really need this as frequently as you run this maintenance plan. 
  2. You might end up running a combination of things that either – wastes cycles/CPU and a MASSIVE amount of log space (which can translate into all sorts of concerns for DR technologies like database mirroring which will need to send all log rows to the secondary server). For example, if you run ALL three of these things then they’ll have to be run in a certain order (you can change this in a maintenance plan). However, the default order is: Reorganize Index(es), Rebuid Index(es), Update Statistics. This means that the work that’s done by reorganizing is effectively wasted as the rebuild would have taken care of it AND the work that’s done for updating statistics could mean that they update statistics TWICE (during the rebuild AND after) and the end result is WORSE because the update statistics command might use a sampling mechanism to generate statistics (which can lead to LESS EFFECTIVE statistics information). However, this is ONLY if you change the wizard’s default. The default is for the updating statistics command to do a “full scan”. So, even if the statistics will end up being the same – it’s still problematic because it means that for all indexes you’ve just rebuilt – you’ve now updated their statistics TWICE.

SUMMARY

A database maintenance plan is CRITICAL for best performance (especially for databases that are prone to some of the problems corrected by these maintenance tasks (yes, you can read SHAREPOINT into that statement). Sharepoint uses GUIDs as PRIMARY KEYs (read this post to hear more about the side effects of this choice) and as a result, as clustering keys. This means that many Sharepoint tables are prone to [potentially a MASSIVE amount of] fragmentation.

You absolutely need to have a maintenance plan. But, what should it be?

My absolute preference is to NOT use the Database Maintenance Plan Wizard UNLESS you really know what you’re doing. It just doesn’t give enough prescriptive advice. And, if you just select the defaults, you will end up with an inoptimal maintenance plan.

A better approach would be to create your own maintenance plan. If you write the code yourself (or leverage one of the custom ones that are already out on the web) then you can strategically target ONLY the objects that have the warning signs and/or are out of date and you can set when to rebuild vs. when to reorganize (generally people rebuild if a table has more than 30% fragentation and they reorganize when it's less than 30%). Fragmentation is something that can be detected programmatically using the DMV: dm_db_index_physical_stats (in SQL 2005/2008) or by using DBCC SHOWCONTIG (in SQL 7.0/2000). Here are a few places to go to see the more flexible and programmatic way of rebuilding/reorganizing indexes:

Smart Indexing Part II - Conditional Rebuilding a blog post (with conditional index rebuild code) from SQLMCA Bob Duffy (a good friend who is located in Dublin, Ireland and whose wife (Carmel) just had a baby last week – congrats again Bob!! You guys are seriously outnumbered now!!!) here: http://blogs.msdn.com/boduff/archive/2007/06/08/smart-indexing-part-ii-conditional-rebuilding.aspx

Custom Index Defrag / Rebuild Procedures - a blog post with some posted code as well. http://www.sqlstuff.info/post/2008/03/Custom-Index-Defrag--Rebuild-Procedures.aspx

Rebuild and Reorganize Indexes in SQL 2005 – an article (with conditional index rebuild code) from SQL Server Central here: http://www.sqlservercentral.com/scripts/31857/  (NOTE: You will need to become a subscriber to get to this article.)

Rebuild Only the Indexes that Need Help - an article by Andrew Kelly (SQL MVP) on SQL Server Magazine here: http://www.sqlmag.com/articles/index.cfm?articleid=99019&pg=1 (NOTE: You will need to become a suscriber to get to the full text of the article.)

Or, build your own! Check out the BOL topic for the sys.dm_db_index_physical_stats for SQL 2005 here: http://msdn.microsoft.com/en-us/library/ms188917(SQL.90).aspx, Example D has sample code to help you get started! For SQL 2008 it’s here: http://msdn.microsoft.com/en-us/library/ms188917.aspx. It’s still Example D for the sample code to leverage. J

The most important thing I can tell you is that a SMALL amount of time getting familiar with what’s really happening in SQL as well as WHY it’s happening to you WILL BE A LOT MORE PRODUCTIVE then just slamming in a maintenance plan that solves some problems but probably creates others.

Hope this helps!
kt

OK, so this is interesting. I've got a few answers to my last survey (Tipping Point Query #1) and well, there's a good mix of answers (and, yes, some are correct! ;)). Be sure to go back and review that last post so that you can evaluate it and these two tipping point questions completely. So, now I want to see if people really know the basis of "the tipping point".

Try these two:

Tipping Point Query #2

Table1 (t1) has 1 million rows at 100 rows per page. The table has 10,000 pages. A nonclustered index exists (on name) but it does not cover the query. At what percentage (of the table) is this nonclustered index no longer selective enough to use:

Tipping Point Query #3

Table2 (t2) has 1 million rows at 2 rows per page. The table has 500,000 pages. A nonclustered index exists (on name) but it does not cover the query. At what percentage (of the table) is this nonclustered index no longer selective enough to use:

OK, so I'd really love to see quite a few responses to these *3* "tipping point" questions. I PROMISE to do a nice long (and detailed) post for what is the actual tipping point AND the answers to all three of these questions. I'll explain the math as well as how you can generalize "what is selective enough" so that you can better create your nonclustered indexes!!!

Thanks for reading - and responding to these brain teasers!!

Cheers,
kt

PS - It's snowing here (ah...again)... maybe I'll spend the day creating brain teasers??! Do you guys like this kind of a post? (well, I suppose you won't really know until I post the answer part of it... but, just in general??). I think it's pretty cool. But, don't worry, I won't (nor will Paul) make all of my posts surveys. But, I think this is a really good one. I'm anxious to see if the asnwers come in correctly for these two as well! Have at it!

Along the same lines of improving database design and getting better performance on SQL Server (which [IMO] DOES take an experienced SQL Server database developer - but, we'll talk more about "whose job this really is" in many more posts and probably even a RunAs - which Richard and I just setup to record on Thursday (Mar 12)), I started thinking about how I could convince people of why they NEED a database developer. So, I thought I'd ask this VERY important question...

What percentage of data IS selective enough to use a nonclustered index which doesn't cover the query... in other words (just in case you're not entirely sure of what I mean :)), think of indexes in the back of a book... if you need to go to the back of the book to reference a bunch of data (this is called a [bookmark] lookup in SQL Server), there's a point where the randomness of the lookups (especially if you think in terms of many rows on a page) becomes too expensive. For example, imagine that the index is customer name and the data (the book) is customer orders - and, each page (of this rather weird book ;)), has 20 orders on it. Doing a query to lookup customer number 12's orders might be really easy (if they only have only a few orders) BUT, what if the query is "show me all of the orders for people that have an 'e' in their name". First, the number of people have have an 'e' in their name is probably better than 50% (that's TOTALLY a guess) and, if there's 20 orders per page then a lookup from the index into the book would require SQL Server to touch every page roughly 10 times. If the table has 50,000 pages (therefore 1 million rows - at 20 rows per page), then to find the 500,000 rows (remember, I'm estimating half), SQL Server would have to do 500,000 bookmark lookups. For a table with only 50,000 pages that's terribly expensive.

So, here's the question - what's the tipping point? When is a nonclustered index on customer name NOT going to be used to lookup rows of sales orders? I'm going to use a survey to see what you think and then within a week, I'll give the specific SQL Server math AND a query you can run within your own DBs to see EVERY one of your table's "tipping points". It's really interesting and I think will really help you to understand why SQL Server might not be using those nonclustered indexes.............

Cheers,
kt

Expanding on the topic of "are you kidding me"... one of the MOST PREVALENT problems I see today is the dreaded "GUIDs as PKs" problem. However, just to be clear, it's not [as much of a] problem that your PRIMARY KEY is a GUID as much as it is a problem that the PRIMARY KEY is probably your clustering key. They really are two things BUT the default behavior in SQL Server is that a PRIMARY KEY uses a UNIQUE CLUSTERED INDEX to enforce entity integrity. So, I thought I'd take this post to really dive into why this is a problem and how you can hope to minimize it.

Relational Concepts - What is a PRIMARY KEY? (quick and basic reminder for what is what and why)

Starting at the very beginning... a primary key is used to enforce entity integrity. Entity integrity is the very basic concept that every row is uniquely identifiable. This is especially important in a normalized database because you usually end up with many tables and a need to reference rows across those tables (i.e. relationships). Relational theory says that every table MUST have a primary key. SQL Server does not have this requirement. However, many features - like replication - often have a requirement on a primary key so that they can guarantee which row to modify on a related database/server (like the subscriber in a replication environment). So, most people think to create one. However, not always...

What happens when a column(s) is defined as a PRIMARY KEY - in SQL Server?

The first thing that SQL Server checks is that ALL of the columns that make up the PRIMARY KEY constraint do not all NULLs. This is a requirement of a PRIMARY KEY but not a requirement of a UNIQUE KEY. They also check to make sure (if the table has data) that the existing data meets the uniqueness requirement. If there are any duplicate rows, the addition of the constraint will fail. And, to check this as well as to enforce this for [future] new rows - SQL Server builds a UNIQUE index. More specifically, if you don't specify index type when adding the constraint, SQL Server makes the index a UNIQUE CLUSTERED index. So, why is that interesting...

What is a clustered index?

In SQL Server 7.0 and higher the internal dependencies on the clustering key CHANGED. (Yes, it's important to know that things CHANGED in 7.0... why? Because there are still some folks out there that don't realize how RADICAL of a change occurred in the internals (wrt to the clustering key) in SQL Server 7.0). It's always (in all releases of SQL Server) been true that the clustered index defines the order of the data in the table itself (yes, the data of the table becomes the leaf level of the clustered index) and, it's always been a [potential] source of fragmentation. That's really not new. Although it does seem like it's more of a hot topic in recent releases but that may solely because there are more and more databases out there in general AND they've gotten bigger and bigger... and you feel the effects of fragmentation more when databases get really large.

What changed is that the clustering key gets used as the "lookup" value from the nonclustered indexes. Prior to SQL Server 7.0, SQL Server used a volatile RID structure. This was problematic because as records moved, ALL of the nonclustered indexes would need to get updated. Imagine a page that "splits" where half of the records are relocated to a new page. If that page has 20 rows then 10 rows have new RIDs - that means that 10 rows in EACH (and ALL) of your nonclustered indexes would need to get updated. The more nonclustered indexes you had, the worse it got (this is also where the idea that nonclustered indexes are TERRIBLY expensive comes from). In 7.0, the negative affects of record relocation were addressed in BOTH clustered tables and heaps. In heaps they chose to use forwarding pointers. The idea is that the row's FIXED RID is defined at insert and even if the data for the row has to relocate because the row no longer fits on the original page - the rows RID does not change. Instead, SQL Server just uses a forwarding pointer to make one extra hop (never more) to get to the data. In a clustered table, SQL Server uses the clustering key to lookup the data. As a result, this puts some strain on the clustering key that was never there before. It should be narrow (otherwise it can make the nonclustered indexes UNNECESSARILY wide). The clustering key should be UNIQUE (otherwise the nonclustered indexes wouldn't know "which" row to lookup - and, if the clustering key is not defined as unique then SQL Server will internally add a 4-byte uniquifier to each duplicate key value... this wastes time and space - both in the base table AND the nonclustered indexes). And, the clustering key should be STATIC (otherwise it will be costly to update because the clustering key is duplicated in ALL nonclustered indexes).

In summary, the clustering key really has all of these purposes:

  1. It defines the lookup value used by the nonclustered indexes (should be unique, narrow and static)
  2. It defines the table's order (physically at creation and logically maintained through a linked list after that) - so we need to be careful of fragmentation
  3. It can be used to answer a query (either as a table scan - or, if the query wants a subset of data (a range query) and the clustering key supports that range, then yes, the clustering key can be used to reduce the cost of the scan (it can seek with a partial scan)

However, the first two are the two that I think about the most when I choose a clustering key. The third is just one that I *might* be able to leverage if my clustering key also happens to be good for that. So, some examples of GOOD clustering keys are:

  • An identity column
  • A composite key of date and identity - in that order (date, identity) 
  • A pseudo sequential GUID (using the NEWSEQUENTIALID() function in SQL Server OR a "homegrown" function that builds sequential GUIDs - like Gert's "built originally to use in SQL 2000" xp_GUID here: http://sqldev.net/xp/xpguid.htm

But, a GUID that is not sequential - like one that has it's values generated in the client (using .NET) OR generated by the newid() function (in SQL Server) can be a horribly bad choice - primarily because of the fragmentation that it creates in the base table but also because of its size. It's unnecessarily wide (it's 4 times wider than an int-based identity - which can give you 2 billion (really, 4 billion) unique rows). And, if you need more than 2 billion you can always go with a bigint (8-byte int) and get 263-1 rows. And, if you don't really think that 12 bytes wider (or 8 bytes wider) is a big deal - estimate how much this costs on a bigger table and one with a few indexes...

  • Base Table with 1,000,000 rows (3.8MB vs. 15.26MB)
  • 6 nonclustered indexes (22.89MB vs. 91.55MB)

So, we're looking at 25MB vs 106MB - and, just to be clear, this is JUST for 1 million rows and this is really JUST overhead. If you create an even wider clustering key (something horrible like LastName, FirstName, MiddlieInitial - which let's say is 64bytes then you're looking at 427.25MB *just* in overhead..... And, then think about how bad that gets with 10 million rows and 6 nonclustered indexes - yes, you'd be wasting over 4GB with a key like that.

And, fragmentation costs you even more in wasted space and time because of splitting. Paul's covered A LOT about fragmentation on his blog so I'll skip that discussion for now BUT if your clustering key is prone to fragmentation then you NEED a solid maintenance plan - and this has it's own costs (and potential for downtime).

So............... choosing a GOOD clustering key EARLY is very important!

Otherwise, the problems can start piling up!

kt

**** UPDATE March 17, 2009 ****
Be sure to read the comments.
This doesn't appear to be as bad as it seemed...still glad I did the rant though; I learned more about what it's actually doing! :) 
And, it's still good to have good database design - NO MATTER WHAT!
****

OK... today's been a bit of a weird day. I've been doing research for some upcoming events and in general, I'm learning a few things. However, as one does with the web - you can get side-tracked. And, in watching the video (recommended by my good friend Timmie) highlighed in this blog post about DBAs and Developers: http://www.benhblog.com/2009/03/linq-dba-vs-developer.html, I followed a few other links as well and found this direct quote:

"I was surprised to learn that EF decided that since there was not a primary key it would just use all the non-nullable columns as a concatenated primary key.  This might not be what you want."

This quote is also from Ben's blog (and I agree) and specifically this post: http://www.benhblog.com/2008/11/entity-framework-and-tables-with-no.html. And, David Yack (a fellow RD) makes a GREAT comment that you can deselect this but I honestly cannot believe that this is the default???!

My response is....... Seriously, are you FREAKING kidding me?

OK, I don't want to go on a total rant here but this is so bad that I almost cannot believe it. And, well, it's somewhat par for the course with the day that I'm having (with other things that I've learned and been "working on"). I'll come back to more topics that fall under the category of "are you kidding me" in a second. Ah ha... there's a new category for me!!!

Anyway, why is that so bad?

  1. There's no guarantee that all non-null columns are even going to be unique (but, this is the LEAST of the problems)
  2. Making a WIDE composite PRIMARY KEY means that the underlying structure which is a UNIQUE CLUSTERED INDEX won't be as efficient. SQL Server creates this index to enforce the relational rule for a PRIMARY KEY-that it is UNIQUE. So, SQL Server will make this WIDE COMPOSITE SET OF COLUMNS the table's one and only clustered index (ugh - I hope at least it changes the index type to nonclustered.... someone please let me know if this is the case)?? However, if it's not.... then, the clustered index will now force SQL Server to order this table by this combination of columns (which probably don't have any form of pattern) so, this clustered index is probably going to be very prone to fragmentation (which in turn can negatively affect performance). And, even worse, ALL of the non-clustered indexes will ALSO be rebuilt to include ALL of the columns of the clustered index which are not already in the non-clustered index. In fact, this step alone - if a clustered index IS created - will cause ALL of the nonclustered indexes to be rebuilt in addition to the table being rebuilt.

So, please tell me this isn't the case? Sorry if this is a rant but I honestly cannot believe this one.

This is another reason where people just weren't even thinking about what was going on in the backend. Another scenario where a little bit of background knowledge of the "data store" would have REALLY helped.

Oh my?
kt

Something I learned while the SQL Server 2008 Internals book was in tech edit (thanks to our *awesome* tech editor Ben Nevarez - who, unfortunately, does not have a blog or anything...yet! (well, I'm hopeful)), was that you can use a FOREIGN KEY constraint to reference a UNIQUE index - one without a PRIMARY KEY or UNIQUE key constraint. At first glance this might seem like something relatively insignificant but in terms of reducing indexes and/or consolidating indexes it offers something that constraints do not. When you create a UNIQUE index you can use INCLUDE to reference (and include) non-key columns in the leaf level of an index. This offers more choices for covering and if you want to cover a query using INCLUDE but also have a UNIQUE column(s) as the key - you can do that with a regular index but not with a constraint based index. So, that got me thinking even more - can I use a UNIQUE index with INCLUDE and even a filters - from a FOREIGN KEY. My guess was that it probably wouldn't work because it would be too costly to have to verify it on every referencing row BUT, I did have hopes that a filter of IS NOT NULL would work. However, it does not. ;-( 

So, you CAN reference a UNIQUE index with INCLUDEd columns but not filters. Even that's really cool!

And, when you start your spring cleaning - try and cleanup and/or consolidate some of those redundant indexes!!

Cheers,
kt 

I know where I go but where the heck does the time go (it seems to move much faster than I)? First off, I hope everyone had a truly lovely holiday season. I know they've long since past for everyone but maybe this reminder will bring back a nice memory from the holidays as it does for me. Paul and I escaped the world of SQL and focused on finding frogfish. Unfortunately, it just wasn't the right time of year to find them though. Oh well - we did see some very coolstuff including a batfish (the odd-shaped-bottom-dweller kind) and, we also read a ton of books (a completely strage combination of books - list below) and we relaxed.  It was really quite lovely.

Once the holidays were over, we were back into the fire... Paul worked from home (mostly writing) and did a lot of work from there - including blogging (as always) and I was back in the classroom. We were only home for 10 days and then we were off again. Right now, we're both on the road - in Hyderabad, India this week (and last) and Bangkok, Thailand next week - teaching some classes on performance tuning, maintenance and high availability. I thought I'd do a quick post to remind you that I do exist (yes, I know - I really do *want* to blog more but it's not easy for me... first of all, before I publish a blog post I probably read it 50 times and that's after I even get the courage to write one... blogging is just different for me than it is for Paul... for that, I'm sorry. I'm *really* going to try... no, really. I will try to get a really good post out there as often as possible but for right now, I'm not going to give a number or a goal - just know that it drives me nuts that I don't blog all that often.)

In the interim, interviews/podcasts - for some reason - are a lot easier for me. Paul and I did a few interviews with TechNet in December and these are being made available for download as they're edited for publishing. The first one is available now and it's titled: New Features in Partitioning and Indexing for SQL Server 2008. Here are the download links:

OK, that's it for me for today. But, I do have a couple of other posts brewing. I will work to get these out as soon as possible as I have a small break from teaching, writing, and conferences (yes, I've been writing too!). Oh, speaking of which, we've *all* finished the SQL Server 2008 Internals work (Kalen doing the large majority) and it's getting really close! You can pre-order it on Amazon here: http://www.amazon.com/Microsoft%C2%AE-SQL-Server%C2%AE-2008-Internals/dp/0735626243/ref=sr_1_1?ie=UTF8&s=books&qid=1233824791&sr=1-1.

Thanks for reading - I'll be back soon!
Kimberly

As for books, in no particular order:  

  • Robin Maxwell's Mademoiselle Boleyn
  • Josh Grogan's Marley & Me
  • Sheridan Hay's The Secret of Lost Things
  • Tucker Max's I hope They Serve Beer in Hell
  • J.K. Rowling's Harry Potter books 2, 3, 4, 5 (these were re-reads - always a safe bet to bring with me and a fun re-read as Kiera (7) goes through the series now too) and where is the 6th movie? Ugh. ;)
  • Tom Philbin's The Killer Book of Serial Killers: Incredible Stories, Facts and Trivia from the World of Serial Killers
  • Stephenie Meyer's Twilight (last, but not least!)

OK, I told you it was a very wide-ranging book list; a few were purchased at the airport on the way. I've since read the entire Twilight series by Stephenie Meyer: New Moon, Eclipse, and Breaking Dawn (and I have to admit that I loved the series (especially Breaking Dawn))... I really hope she completes Midnight Sun; I absolutely loved Twilight from Edward's perspective (maybe even more than from Bella's). And, if she continues writing in this world - I really hope she writes as Renesmee. The Twilight movie was not spectacular - defnitely read the book before going to the movie - but, I loved the casting of Bella and Edward. And, I'm hopeful that the next movie will be *a lot* better. Next on my reading list, I'm planning to read many Robin Maxwell books starting with The Secret Diary of Anne Boleyn. In the end, the holidays were really relaxing.

Oh, and here's a great shot from India... At least he has a helmet??

Well... I think I had had too much tea that morning ;-). But, as always, chatting with Richard and Greg was great. Here's the specific show link: http://www.runasradio.com/default.aspx?showNum=76.

Oh, and just for the record, I didn't come up with that title. But, I do hope that all your [high-priority and important] queries are indexed!

Enjoy!
kt

I first posted an update to sp_helpindex here. My version of sp_helpindex was solely to expand what sp_helpindex showed and adds 1 or 2 things based on version: for SQL2005+ it adds included columns and for SQL2008 it also adds the filter predicate. So, there were two versions of sp_helpindex2 depending on which verison you're using. A lot of folks like the changes to this sp but, alas, it had a bug (or two :) and in fact, I found a few others when I went back over this as well. So, thanks to Josh (who commented here) and to a private email (thanks Vasco!), I have an updated version of sp_helpindex2:

For SQL Server 2005, here's your new sp_helpindex2 script: sp_helpindex2_2005.zip (2.89 KB)

And, here's a simple test script for 2005:

DROP TABLE tbl1
GO

CREATE TABLE tbl1( c1 int, c2 int, c3 int, c4 int)
GO
CREATE INDEX ix_1 ON tbl1(c1) INCLUDE (c2)
CREATE INDEX ix_2 ON tbl1(c1)
CREATE INDEX ix_3 ON tbl1(c1) INCLUDE (c2, c3)
CREATE INDEX ix_4 ON tbl1(c1, c3) INCLUDE (c2)
CREATE INDEX ix_5 ON tbl1(c3) INCLUDE (c1, c2, c4)
CREATE INDEX ix_6 ON tbl1(c1, c2) INCLUDE (c3, c4)
go

sp_helpindex2 tbl1
go

index_name

index_description

index_keys

included_columns

ix_1

nonclustered located on fg1

c1

c2

ix_2

nonclustered located on fg1

c1

NULL

ix_3

nonclustered located on fg1

c1

c2, c3

ix_4

nonclustered located on fg1

c1, c3

c2

ix_5

nonclustered located on fg1

c3

c1, c2, c4

ix_6

nonclustered located on fg1

c1, c2

c3, c4

For SQL Server 2008, here's your new sp_helpindex2 script: sp_helpindex2_2008.zip (2.84 KB)

And, here's a simple test script for 2008:

DROP TABLE tbl1
GO

CREATE TABLE tbl1( c1 int, c2 int, c3 int, c4 int)
CREATE INDEX ix_1 ON tbl1(c1) INCLUDE (c2)
CREATE INDEX ix_2 ON tbl1(c1)
CREATE INDEX ix_3 ON tbl1(c1) INCLUDE (c2, c3)
CREATE INDEX ix_4 ON tbl1(c1, c3) INCLUDE (c2)
CREATE INDEX ix_5 ON tbl1(c3) INCLUDE (c1, c2, c4)
CREATE INDEX ix_6 ON tbl1(c1, c2) INCLUDE (c3, c4)

CREATE INDEX ix_1f ON tbl1(c1) INCLUDE (c2)
WHERE c3 IS NOT NULL

CREATE
INDEX ix_2f ON tbl1(c1)
WHERE c4 > 2

CREATE INDEX ix_3f ON tbl1(c1) INCLUDE (c2, c3)
WHERE c4 > 2 AND c1 < 50 AND c2 = 12

CREATE INDEX ix_4f ON tbl1(c1, c3) INCLUDE (c2)
WHERE c4 IS NOT NULL AND c1 = 12

CREATE INDEX ix_5f ON tbl1(c3) INCLUDE (c1, c2, c4)
WHERE c1 > 5

CREATE INDEX ix_6f ON tbl1(c1, c2) INCLUDE (c3, c4)
WHERE c4 < 20
go

sp_helpindex2 tbl1
go

index_name

index_description

index_keys

included_columns

filter_definition

ix_1

nonclustered located on PRIMARY

c1

c2

NULL

ix_1f

nonclustered located on PRIMARY

c1

c2

([c3] IS NOT NULL)

ix_2

nonclustered located on PRIMARY

c1

c2

NULL

ix_2f

nonclustered located on PRIMARY

c1

c2

([c4]>(2))

ix_3

nonclustered located on PRIMARY

c1

c2, c3

NULL

ix_3f

nonclustered located on PRIMARY

c1

c2, c3

([c4]>(2) AND [c1]<(50) AND [c2]=(12))

ix_4

nonclustered located on PRIMARY

c1, c3

c2

NULL

ix_4f

nonclustered located on PRIMARY

c1, c3

c2

([c4] IS NOT NULL AND [c1]=(12))

ix_5

nonclustered located on PRIMARY

c3

c1, c2, c4

NULL

ix_5f

nonclustered located on PRIMARY

c3

c1, c2, c4

([c1]>(5))

ix_6

nonclustered located on PRIMARY

c1, c2

c3, c4

NULL

ix_6f

nonclustered located on PRIMARY

c1, c2

c3, c4

([c4]<(20))

Have fun!
kt

YES!!!

OK, well, I guess I should be more specific because as in most things in SQL Server, the real answer is "it depends". And for these two options, it depends mostly on your SQL Server version. Since SQL Server 7.0, the way that auto update works, has changed (much so for the better!!). So, if you're in SQL Server 2005 or SQL Server 2008, I would say most definitely - leave these ON (or if you turned them off - turn them back on!!!)! If you still have problems with a specific index causing your grief, then turn off auto update at the index level NOT at the database level. To turn of auto update at the index level use STATISTICS_NORECOMPUTE in the index (or statistics) definition.

Now, as for why (and how!) this has changed over the versions... here we go:

SQL Server 7.0

  • Invalidation: Statistics were *invalidated* when a row modification counter (sysindexes.rowmodctr) was reached. This meant that they could not tell where the modifications were occuring and, if modifications were somewhat isolated to a specific column ALL of the statistics for the TABLE would be invalidated (so, statistics could be invalidated earlier than necessary)
  • Updating: Even worse, in SQL Server 7.0, when statistics were invalidated, they were immediately updated. This caused two problems - thrashing at the time of invalidation because all of the stats needed to be updated AND two, if the statistics were not used for awhile then extra work was involved to update them and by the time they were used, they might already be somewhat out of date already.

SQL Server 2000

  • Invalidation: Statistics were still invalidated based on a row modification counter.
  • Updating: SQL Server 2000 fixed the "updating-potentially-too-often" problem by only updating statistics when they were needed.

SQL Server 2005

  • Invalidation: The biggest changes were introduced in SQL Server 2005 where they decided to NO LONGER user the sysindexes.rowmodctr and instead use an internal (and undocumented) columns specific modification counter. Now, statistic invalidation is more isolated to only those columns which are heavily modified. This internal/undoc'ed column is sysrowsetcolumns.rcmodified and can only be seen when connecting to SQL Server using the DAC (Dedicated Admin Connection).
  • Updating: Updating didn't really change but, SQL Server 2005 added "Aynch Auto Update" for statistics so that when the QO (query optimizer) encounters an out-of-date (i.e. invalidated) statistic, they can "trigger" the update but not wait for the update (meaning that they'll optimize using the out-of-date statistic). This can be both positive (faster) and negative (might not be the best plan if the statistics have changed drastically). It is off by default and IMO, I'd leave it off in most cases but if you find that auto update events (which can be Profiled) are causing you grief, then you can turn this on at the database level.

SQL Server 2008

Nothing new except "Filtered Statistics" and these are interesting as the density vector is still relative to the table (not the predicate) but the histogram is just over the predicate (OK, I know I'll have to blog a lot more about this one!). Anyway, I'm still playing/learning a lot more about these and they make the most sense with filtered indexes (as opposed to just a filtered statistic) but, just like statistics on secondary columns you will also potentially want statistics on the secondary columns of your indexes. The next question is should they have a filter or not. I've found that sp_createstats doesn't seem to create statistics with filters and I'm going to need to do some testing here but I think statistics with filters (filters that match the non-clustered index) should help to make the stats better (and even allow better usage of filtered indexes) but, I'm really going to need a bunch of time with this - and another post :). As for auto create/auto update - no changes there!

Long story short, if you're using SQL Server 2005 or SQL Server 2008, you should leave auto create/auto update ON.

Thanks for reading!
kt

PS - A few of you have mailed me about a bug in the sp_helpindex2 script(s). OK, that's my next post!!! Possibly with an sp_helpstats2 script as well!

OK, so, I don't blog very often. I don't know what it is... I think it's that I feel like I always need to blog huge posts and the thought of writing my huge post makes me not want to blog... so, sometimes my time-between-blogging (TBB :) is long. I'm going to turn over a new leaf (no, really... I'm *really* going to try this time!!) and I'm going to try and write smaller posts and more of them. And, yes, don't worry, I'll still write the long in-depth ones but I'll at least try to give you a few tidbits of things that I encounter - more frequently. And, that's partially why I'm posting this entry...

I've been working quite a bit with SQL Server 2008 and yesterday I went to test of a few things with database mail. Normally this gets setup quite early and gets setup with a lot of other things so restarts of the server/agent happen because of saving/tweaking my test VPC. I don't usually go in and setup Database Mail and then immediately try to send mail. Yesterday, I did... and, guess what, it didn't work. I got error [264] An attempt was made to send an email when no email session has been established. So, I felt like I had seen that before... and, with a bit of web searching I landed on Gops Dwarak's blog for a known issue of SQL Server 2005: http://blogs.msdn.com/gopsdwarak/archive/2006/04/25/583434.aspx. And, yes, restarting the SQL Server Agent solved the problem. I'm surprised that this is still a bug in 2008 but it's not entirely the worst I could come upon. However, having said that, I also thought that there was a general initiative for software to stop requiring restarts of services and/or the OS. And, yes, it's *just* an Agent restart so it won't directly impact your server's availability but, I'm still surprised that it wasn't resolved. Has anyone else found a bug (or even a documented "issue") that requires a service to be restarted - and you think it shouldn't?

And, well, that's definitely not the most interesting thing I've learned/found with SQL Server 2008, there's lots more (yes, I know - I need to blog a lot of it :). The most fun I've had has probably been learning sparse columns and filtered indexes. For right now, I want to give you some quick key points about some of the changes around indexes and in particular around sparse columns/filtered indexes:

  1. Do NOT create non-clustered indexes on sparse columns without filters (a filter that says WHERE sparsecolumn IS NOT NULL) because indexes do NOT have sparse columns in their definition. So, if you don't use a filter you will end up storing all of the NULL values in the index - which will waste a tremendous amount of space.
  2. A non-clustered index on a sparse column (without a filter) will be the same size as a nonclustered index on a non-sparse column (which is essentially the same point as above but, this might help clarify it a bit).
  3. Do NOT think that the increase in total indexes (from 250 in 2005 to 1000 in 2008) is because you should have more indexes... it's NOT. It's specifically because you might have a lot more columns (these changed as well from 1024 in 2005 to 30000 in 2008) because even that increase should not be used unless you're using sparse columns... So, you STILL want to use a lot of the best practices we've recommended in past webcasts, whitepapers, etc. and you still want to care about row size (and page density) and therefore work to create narrower tables (in general and relatively speaking - depending mostly on usage patterns). However, if you have a need for *lots* of wider tables BECAUSE you have an interesting set of properties that only some rows will have (i.e. the main reason to use sparse columns) then using sparse columns to handle these columns that will largely be filled with NULLs is a good thing. It's a good thing because a row that has a NULL for a sparse column will take ZERO BYTES. Absolutely no space is used for a sparse column that is NULL. So, this allows your tables to be wide (in definition) but your rows to be narrow (in practice). And, with well defined indexes you can VERY efficiently and effectively search on these properties.

OK, I hope to get some examples posted as well. Between Paul and I and our upcoming events - where we're demo'ing/discussing a lot of these principles, we'll plan to post a demo or two on how effective these really can be. And, I know... some of you will fight back with the thought that sparse columns introduce bad database design practices... I know, you want (instead) tables that have name/value pairs - which are the way most of us did this in all versions prior to 2008. However, name/value pairs tables become fragmented messes that also have fragmented indexes and therfore overall poor performance (for both inserts and queries). Not to mention, they can be difficult to query/join with (because you need to join multiple times to retrieve multiple properties) and the code gets messy quickly. Anyway, sparse columns - while they may not seem quite right at first - can really be a *MUCH BETTER* way to design (and perform!!!) around this problem.

And, speaking of events. We have a bunch of upcoming events... I'll give you the short bulleted list here because I'm sure many of you also read Paul's blog and he's mentioned quite a few of these coming up. Also, if you're interested in learning more details for these, the full abstract/links for many of these can be found on our Upcoming Events page. However, specific links are also listed below!

  • Best Practices in Performance and Availability for SQL Server 2005/2008, 1-3 September 2008 in Hatfield, England. You can get more details/register here
  • Dublin SQL Server User Group, Index Internals and Fragmentation, 4 September 2008, Dublin, Ireland. Bob Duffy blogged about this here and you can get more details/register here.
  • Microsoft SQL Academy 2008 - Session 1, 5 September 2008, Dublin, Ireland. You can get more details/register here.
  • Indexing for Performance in SQL Server 2000/2005/2008, 8-9 September 2008, Edinburgh, Scotland. You can get more details/register here.
  • SQL Connections "Power Workshops Series" Microsoft SQL Server 2008 Overview for Database Professionals (Hands-on — Bring Your Own Laptop), 6-8 October 2008, San Francisco, CA, USA. You can get more details/register here.
  • Microsoft Tech Ed EMEA ITPro, 3-7 November 2008, Barcelona, Spain. You can get more details/register here.
  • SQL Connections Fall Conference, 9-14 November 2008, Las Vegas, USA. You can get more details/register here.
  • PASS Community Summit 2008, 17-21 November 2008, Seattle, WA, USA. You can get more details/register here.
  • And, the last thing I'll leave you with is a recipe for the best darn Chocolate Chip Oatmeal Cookies *ever*. OK, I did a web search for exactly that (ok, it all started when Paul (who really doesn't really like sweets) decided that for his birthday he wanted Choc Chip Oatmeal Cookies (we asked becuase the girls and I wanted to bake something)). The search led me to here. And, they're definitely right that these are the best cookies ever! And, they make a few really good points:

    1. Don't microwave the butter... set it out for an hour before you're going to make the cookies. It's definitely different/better when the butter is naturally soft.
    2. Definitely take the cookies out when they look like they have a few more mins to go...

But, Paul's birthday was in July... and, it was not the only time we've made these cookies... mostly because *everyone* we introduce to these cookies is addicted - yes, the girls, me, Paul, my Mother, Brian Randell (who was forced to take some "togo" after a BBQ here) and countless of our other friends since this is now our "when-we-entertain" dessert ;-). So, since I've had the opportunity to tweak the recipe myself, here's what I do differently...

  1. I add less chocolate - usually 1-1.5 cups of chocolate chips. Sometimes I mix half/half semi-sweet and milk chocolate but I always use closer to 1 cup...
  2. I add about 1 cup of coconut - right at the end, with the chocolate chips...
  3. I make a single 12x18 cookie sheet "uni-cookie" and then I cut it up like brownies. As for the cook time, I've found that this giant cookie takes right about 14 minutes (but, you'll have to experiment with your oven... I usually take the giant cookie out when it's puffed up and the edges are just starting to go lightly brown). Oh, and you don't have to spread this out perfectly to the edges - it will expand and fill in. Just get somewhat close. The nice thing about this form factor is that it's a lot faster than making 36 individual cookies and then you get to choose the size(s) when you cut up the uni-cookie.

Oh, and don't blame me when you make 10 batches of these things and get all of your friends addicted as well. They really are yummy!!

Next up... installing Data Dude CTP16 with SQL 2008 and Visual Studio... what works with what and what's the best install order!!!

Thanks for reading!
Kimberly

OK, I first posted on some of the limitations to indexes in SQL Server 2005 and 2008 in part one here. Now, I want to dive into index internals for a post (or two). And, I often get the question “who is the best audience for your blog – or, for this post” and well, that’s a bit hard to answer. At SQL Connections in Orlando, I delivered a session titled: Index Internals & Usage and while we (fyi – Paul and I co-chair the SQL Connections portion of “DevConnections”) put it in the "developer-focused track," it was more of a Dev/DBA "hybrid" session with the emphasis on database development and best practices in creating and managing indexes (rather than management/maintenace/operational tuning - which is more for DBAs). Here at TechEd this week, I'm going to focus more on the management/maintenace/operational tuning side with a session called Are your Indexing Strategies Working? I'll also do a complementary blog post for that as well...

Having said that thought, indexes are definitely in a group of topics - very much so related to performance and scalability (index internals, indexing strategies, log maintenance, general database maintenance) which really needs to cross almost all database-related disciplines (dev, admin, ops, etc…). If you work with SQL Server in almost any capacity, you need to get a feel for at least some aspect of indexing for performance.

So, for this post, I’m continuing with some internals. In the first post (in this series), I wrote about limits. Limits/boundaries are interesting to discuss but it's also important to remember that good performance takes a lot more than just staying within the bounds of what’s possible. Creating indexes solely because you can – without reason and only with upper limits in mind – can be even worse than under indexing. So, if you find that you're wanting more about indexes (I have many blog posts that are solely Q&A posts), check out my Indexing category here. Now that you know how many indexes you can create, a better question would be when is it appropriate to create indexes at all?

So, what is “finding the right balance” in indexing? In my opinion, there are three requirements/pre-requisites:

  1. knowing the data
  2. knowing how the users use the data
  3. knowing how the underlying structures and database stores/manipulates and uses indexes

Bringing all of these things together is what I try to do in my workshops, seminars and lectures – in this post, I'll start with a smaller more digestible piece - internals.

Indexes have 2 components: a leaf level and a non-leaf level (or b-tree). The non-leaf level is interesting to understand and discuss (in terms of internals) but simply put, it’s used for navigation to the leaf level (more than anything else). So, we'll start with the leaf level (as does SQL Server - the leaf level is always built first). The leaf level of an index contains something (I’ll explain more coming up) for every row of the table in indexed order (note: I am focusing on traditional indexes in every release from SQL Server 2000 up to and including SQL Server 2008 – with the exception of filtered indexes which I will write about in a later post). Once the leaf level is built, non-leaf level(s) can be built to help navigate to the leaf level but the architecture is rather straightforward. The non-leaf level stores something for every page of the level below – and levels are added (each smaller than the previous because each level only contains one the first entry from every page) until the index gets to a root of one page. While it sounds like this could result in a lot of levels (ie. a tall tree), the limitation on the size of the key (which has a maximum of 900 bytes or 16 columns) helps to keep index trees relatively small. In fact, in the example I’ll show coming up – which has a fairly large (large meaning WIDE) index and has a key definition which is at the maximum size – even the tree size of this example index (at the time the index is created) is only 8 levels high/deep…

To see this tree (and the math used to create it – which is the same thing that SQL Server would go through to create it), we’ll use an example where the leaf level of the index contains 1,000,000 “rows.” I put quotes around “rows” because I don’t want to imply that these have to be data rows – these are really just leaf level rows and I’ll explain more on what leaf level rows can be... The leaf level rows are 4,000 bytes per row (therefore only 2 rows per page) or 500,000 pages. This is not ideal but at least the pages are almost full and we’re not wasting a lot of space – if we had two 3000 byte rows we’d still only fit 2 per page and then we’d have 2,000 bytes of wasted space. Now, as for why these are just “rows” and not specifically data rows is because this leaf level could be the leaf level for a clustered index (therefore data rows) OR these leaf level rows could be rows in a non-clustered index that uses INCLUDE (which was new to SQL Server 2005) to add non-key columns to the leaf level of the index (which therefore creates wider leaf rows (wider than the 900 bytes or 16 column maximum). Again, while this doesn’t currently sound interesting, I’ll explain why this can be beneficial coming up (possibly in another post depending on how long this particular post becomes… J).  

The leaf level of this index would result in a 4 GB structure (and this is only at the time it’s created – if a lot of rows are added and the key is not ever increasing then this structure could become heavily fragmented and therefore much larger/taller). In this case, it’s relatively large (again because of “row” width) and with an index key of 900 bytes you can even see that in this case, the tree would be relatively small and only result in 8 levels – as shown below.

Root page of non-leaf level (Level 7) = 2 rows = 1 page

Intermediate non-leaf level (Level 6) = 15 rows = 2 pages (8 rows per page at 900 bytes)

Intermediate non-leaf level (Level 5) = 122 rows = 15 pages (8 rows per page at 900 bytes)

Intermediate non-leaf level (Level 4) = 977 rows = 122 pages (8 rows per page at 900 bytes)

Intermediate non-leaf level (Level 3) = 7,813 rows = 977 pages (8 rows per page at 900 bytes)

Intermediate non-leaf level (Level 2) = 62,500 rows = 7,813 pages (8 rows per page at 900 bytes)

Intermediate non-leaf level (Level 1) = 500,000 rows = 62,500 pages (8 rows per page at 900 bytes)

Leaf level (Level 0) = 1,000,000 rows = 500,000 pages (2 rows per page)

 

Having said that though, this is NOT a goal. :) In more realistic scenarios [where the key is much smaller and] even when there are more rows, there are fewer levels (3-4 is quite normal). Most importantly, the size of an index (and the number of levels) depends on two things – the width of the key (in terms of the number of bytes) and the number of pages in the leaf level of the indexes. The number of pages in the leaf level of an index depends on the number of rows and the size of the rows (again, in terms of bytes) of the rows in the leaf level.

You can see the size of your index by using one of the following commands:

In SQL Server 2000: DBCC SHOWCONTIG … WITH ALL_LEVELS

In SQL Server 2005/2008: querying the dmv: sys.dm_db_index_physical_levels

To see the syntax of these commands and their output, we’ll use some structures created in the credit sample database. Using credit, you can see exactly how these commands work and how they return the details about every level.

NOTE: you can download a zip of a SQL Server 2000 backup of this database here – and since this is a SQL Server 2000 backup, you can restore this to SQL Server 2000, SQL Server 2005 or SQL Server 2008.

USE credit
go

SELECT *
FROM sys.dm_db_index_physical_stats
    (db_id(), object_id('Charge'), 1, NULL, 'DETAILED')
go

DBCC SHOWCONTIG('charge', 1) WITH ALL_LEVELS, TABLERESULTS
go

Using the DMV or DBCC SHOWCONTIG you can get the same picture of the charge table. Using the detailed (or ALL_LEVELS) parameter, you get the entire structure (all levels) for the clustered index (index_id = 1 is always the clustered index, IF the table is clustered). The reason it returns all levels is that the 'DETAILED' mode has been specified.

The clustered index in this table has 1,600,000 rows (DMV column: record_count or SHOWCONTIG column: rows) and these are stored on 9303 pages (DMV column: page_count or SHOWCONTIG column: pages). If you read to the next level which is level 1 because the leaf level is level 0 (remember index levels always start with the leaf level 0 and then go up to the root), you can see that it's number of "rows" is equal to the number of pages in the leaf level... and this keeps going until you get to a root of 1 page. In this case, the clustered index (which is the widest structure of the table) has a very narrow clustering key (the key is on charge_no which is an int) only has a total of 3 levels even though the table has 1,600,000 rows. Ideally, you should run this on a few of your production tables (in a development/test environment) and you can start to get some insight into how big your structures are. However, a BIG factor that you might see in production is fragmentation. If a particular level (or levels for that matter) are heavily fragmented then each level might be wider and less compact (and therefore less performant). Reviewing the DMV columns avg_fragmentation_in_percent and avg_page_space_used_in_percent, you can get a feel for how full each page is. Poor page density reflects that your pages are not as full as they could be but there are many factors for why this is the case: bad row size, splits due to inserts, splits due to updates of varchar columns or even a poorly chosen fillfactor that has left too much space on the pages. However, page density is only one piece of the puzzle and if your avg_fragmentation_in_percent is very low (0-5%) then I wouldn't be over worried about your pages not being entirely full unless you have the time to possibly re-design tables (eg. vertically partition them) and then rewrite your applications to direct your statements at only the appropriate base table. But, another factor to consider is the rate at which your fragmentation occurs as well as when you can fix that fragmentation. This is a HUGE discussion that requires time... And, I want to get back to index structures for now. However, both Paul and I have blogged quite a bit about rebuilding v. defragging indexes and what those operations do/how, etc. In fact, just today, Paul has blogged a Q&A about myths and misconceptions about index rebuild operations. So, I'll get back to internals for now! :)

You can use LIMITED (which is the default mode), SAMPLED, or DETAILED. All three have excellent uses and all use IS locks (to minimize blocking). Limited gives you a quick overview of fragmentation and mostly describes how intact and in order the levels are. Limited is quite clever in that it only scans the first non-leaf level above the leaf to determine how much fragmentation there is... since the non-leaf level always tracks the first entry (and a pointer to the page) then they know EACH and EVERY page in the leaf level by ONLY reading the non-leaf level (which is [typically] a lot smaller and therefore faster). However, because they don't touch every page and determine page density then they only track how out of order the levels are and not how dense/full the pages are (which is also a form of fragmentation). So, if you want a bit more details, you can use SAMPLED. The SAMPLED mode returns the fragmentation from reading every 100th page of the index (or heap). If the table has less than 80MB used (which is 10,000 pages), every page is read instead (which is a DETAILED scan). The DETAILED mode reads every page of every level to calculate the most accurate picture of your tables fragmentation. This is the best form of analysis but also takes the most time.

If you’re interested in learning a few more of the tips/tricks with using this DMV, check out the following script: Using dm_db_index_physical_stats.zip (2.23 KB)

A favorite tip is that the database in which you want to analyze tables does NOT have to be in 9.0 compatibility mode in order to use this DMV. Don’t get me wrong, you will get errors if you try to use this DMV in a database that’s not in 9.0 compat mode; however, if you are in master (which is set appropriately and cannot be changed) and then use the first parameter to target a non-9.0 compat mode database, then this DMV works great. However, a second "gotcha" is for parameter 2... as long as you don’t use 2-part naming for the objectname (2nd) parameter, everything will work as expected. If you specify object_id('tablename') from master for a table that's in credit then object_id will return NULL. The query will still run but against all tables in credit rather than the one you thought you were targeting. If you want to use this DMV across databases, you will need to supply the database name in the first parameter and then make sure that you use 3-part naming for the second parameter.

Now that you are getting to know some of the structures (in terms of seeing physical structures and internals), where do we go from here? The best route to start “finding the right balance” for performance is to know the data and as well as get some general insight into usage patterns (this is probably the hardest component to know and sometimes you only know exactly what’s going on if you profile what’s actually happening in production – is that too late? To a certain extent yes and to another extent no…there are still many things for which you can plan and other things you can confirm or test once the application is running (i.e. Profiler). All of those things together are going to help to “find the right balance”.

Having said that, and having discussed the general internals of a b-tree (and therefore an index structure), what’s the difference between a clustered and non-clustered index? Well… stay tuned, that will be part 3 in this series. And, then (finally), we'll get to appropriate uses for INCLUDE (which was new for SQL Server 2005) and then appropriate uses for Filtered Indexes (a new feature in SQL Server 2008). Also, somewhere in there I'll post a few tips from my TechEd session so that you can start to determine if your indexing strategies are working??

Thanks for reading!
kt

In my blog post on my new sp_helpindex proc (sp_helpindex2), I mentioned that the indexes in my sample were not necessarily a recommended set of indexes - just a test set of indexes. So... in this post, I thought I'd start a series on indexes, limitations and best practices/uses... Especially, why/how to best choose when to use INCLUDE v. having columns in your key. To start, I thought I'd give some background, as well as limitations that exist in various releases from 2005 to 2008 CTP6 (Feb CTP), plus what's expected in the SQL Server 2008 RTM (ah... I did say "expected" so don't come back and yell at me if/when I'm wrong ;-))

First, let's go through a few rules and limitations and background:

SQL Server 2005

  • 250 total indexes per table: 1 clustered index and up to 249 nonclustered indexes (no, this is not a goal!)
  • The index key can be up to 16 columns OR 900 bytes - whichever comes first...
  • The leaf level is limited just as a table is limited to 1024 columns (and, all column types are acceptable in the leaf level of an index - even LOB columns)
  • Statistics are kept for every index (so, up to 250 index-related stats) and there can also be statistics on columns or sets of columns in addition to the index-related stats. In earlier releases, statistics used index ids and as a result, the number of statistics were limited to 250 total minus the statistics used by indexes... in SQL Server 2005, they changed to having statistics kept/managed separately (use sys.stats to see them). As a result of using sys.stats, you can now have 2000 statistics on a table, in addition to the 250 (total) indexes and their statistics. If you want to test this out (and check it on various versions of SQL Server), use this script to setup a test database, a test table and then use dynamic string execution to loop through (until it errors) with creating nonclustered indexes and statistics: IndexMax.sql (853.00 bytes).

SQL Server 2008 CTP6

  • So far, it seems as though most of the maximums have not yet been lifted...
  • 250 total indexes per table: 1 clustered index and up to 249 nonclustered indexes (and this number  - for CTP6 - includes filtered indexes AND spatial indexes too!)
  • The index key limit hasn't changed (it can be up to 16 columns OR 900 bytes - whichever comes first)
  • The leaf level is still limited just as a table is limited to 1024 columns (and, all column types are still acceptable in the leaf level of an index)
  • Statistics in CTP6 seem to be limited to only 2000 stats per table...

SQL Server 2008 RTM (expected/target... no guarantees on this one :)

  • 30,000 columns per table (mostly to allow sparse columns)
  • 1,000 total indexes per table: 1 clustered index and up to 999 nonclustered indexes. This is also not a goal BUT, it makes sense because of both sparse columns and filtered indexes. Both Paul and I will try to post some entries about sparse columns and filtered indexes in the coming days...
  • The index key limit won't change
  • The leaf level is will be limited just as a table is limited to 30,000 columns (and, all column types are still acceptable in the leaf level of an index)
  • Statistics are also said to be increasing and likely to 30,000... And, for having extra statistics just sitting around and possibly not being used - well, outside of a minimal amount of disk space taken by the stat blob (which does start to get interesting at 1,000s I suppose), even stats that don't get used don't really create much of a problem. So, I'm OK with this one increasing - even significantly - but I have to admit I'm somewhat nervous about the significan't increase in indexes.........

So... you can have A LOT more indexes in SQL Server 2008 but just because you can - DOES it mean that you should?!

And on that - I'll leave you hanging for my next post where I start to talk about WHY they're increasing this (hint: sparse columns and filtered indexes = more columns/more indexes)....

Have fun,
kt

IMPORTANT NOTE: Use the updated version of sp_helpindex2... see this post: Updates (fixes) to sp_helpindex2

OK - so this has been frustrating me for many months... when you create indexes with included columns (which was a new feature of SQL Server 2005), they're not shown by sp_helpindex or by DBCC SHOW_STATISTICS. I understand this not showing for statistics because included columns are not factored into the histogram (that's only the high order element which is the first column in the index) OR the density vector (which only shows the densities (or averages) for the left-based subsets of the key). So, why doesn't sp_helpindex show it? Well... I guess it just didn't get updated for SQL 2005. So, in SQL 2008, I was hoping I'd not only see included columns but also filtered indexes... well, neither is there and sp_helpindex is still the same old proc. Don't get me wrong, you can use SSMS to see all of the index properties for a single index (pane, by pane for each property) OR you can run queries to find the included columns for a given index:

SELECT
(CASE ic.key_ordinal WHENTHEN CAST(AS tinyint) ELSE ic.key_ordinal END) AS [ID],
clmns.name AS [Name],
CAST(COLUMNPROPERTY(ic.object_id, clmns.name, N'IsComputed') AS bit) AS [IsComputed],
ic.is_descending_key AS [Descending],
ic.is_included_column AS [IsIncluded]
FROM sys.tables AS tbl
   
INNER JOIN sys.indexes AS
      
ON (i.index_id >AND i.is_hypothetical = 0) AND (i.object_id = tbl.object_id)
   INNER
JOIN sys.index_columns AS ic 
      
ON (ic.column_id >AND (ic.key_ordinal >OR ic.partition_ordinal =OR ic.is_included_column != 0)) 
         
AND (ic.index_id = CAST(i.index_id AS int) AND ic.object_id = i.object_id)
   INNER
JOIN sys.columns AS clmns 
   
ON clmns.object_id = ic.object_id AND clmns.column_id = ic.column_id
WHERE (i.name = N'[MyIndex]') AND ((tbl.name = N'[MyTable]' AND SCHEMA_NAME(tbl.schema_id) = N'[MySchema]'))
ORDER BY IsIncluded, [ID] ASC
but, there isn't a nice clean way to show all of the included columns for all indexes for a particular table... until now :)A couple of weeks ago I sat down and rewrote sp_helpindex. I was actually on a plane from Hyderabad to Frankfurt or from Frankfurt to San Fran or from San Fran to Seattle (it was a long day :) and I was using (and well, forcing myself to learn how to use :) my new Vista laptop. OK, that's a HUGE story in and of itself and it definitely warrants its own post but I'll sum up the story with the fact that I had to purchase a new laptop while in Hyderabad because BOTH my primary laptop (T61p) AND my backup laptop (T60p) BOTH (yes, BOTH!!!) suffered catastrophic disk failures on their boot drives within 24 hours of each other. In the end, I really cannot believe the "coincidence" of two laptops crashing within 24 hours of each other. Yes, I thought MTBF too (at first) but the laptops were two Lenovos - one Lenovo (the T60p) was purchased in Feb 2007 and the second, a Lenovo T61p was purchase in Oct 2007. And, it was the T61p that went first. The only thing I can even begin to speculate about and/or think to attribute it to (as I was in India for 17 days from Mar 3 through Mar 20 and this all started on Mar 17) was an overactive metal detector at the hotel at which I was staying (or something related to St. Patrick but I've since ruled that out - and no, I wasn't drinking green beer either...). OK, I really need to do another post to give you all of the details about this trip BUT, I did get a new laptop... and, having just bought it only shortly before I flew back I felt like I really needed to get my money's worth so I just *had* to work on the flights home (ah, security with *3* laptops was NOT fun and I'm *VERY* glad that none of them asked me to "boot" my laptops to prove they were working... that could have been a VERY bad situation... lol).

OK - so back to the story... I was working on the flights and I was preparing to deliver some content on the Friday after I returned (yes, I taught a full day in India on Wednesday then flew back leaving India at 2:15am Thursday morning so that I could arrive back in Redmond at roughly 7pm Thursday night - about 30 hours later - and then teach Friday morning for an 8:30 start time... ah, I was *really* tired on Friday night :). Anyway, in preparing, I decided that I finally needed to re-write sp_helpindex. When I was first writing it, I was only thinking of SQL Server 2005. So, here's the 2005 version that I wrote: sp_helpindex2_2005.zip (2.71 KB).

So, I had wanted to blog that when I got back to Redmond but in preparing for the trip we're on now AND rebuilding my primary and backup laptops, well, it got tabled. So now, today, Paul and I are in Iceland (working with our great friends at Miracle Iceland) and we're teaching "the Accidental DBA" (this past Monday) and SQL Server 2008 New Features in Database Infrastructure and Scalability (Tue through Thursday)... I was giving a lecture on Filtered Indexes in SQL Server 2008 and I, once again, found myself needing a better sp_helpindex. So, when Paul got up to talk about Compression (which is no short lecture for him :), I had time to rewrite sp_helpindex... again. And, here's what I ended up with...

exec sp_helpindex2 'member'

index_name index_description index_keys included_columns filter_definition
member_corporation_link nonclustered located on PRIMARY corp_no NULL NULL
member_ident clustered, unique, primary key located on PRIMARY member_no NULL NULL
member_region_link nonclustered located on PRIMARY region_no NULL NULL
NCIndexCoveringLnFnMiIncludePhone nonclustered located on PRIMARY lastname, firstname, middleinitial phone_no NULL
NCIndexCoversAll4Cols nonclustered located on PRIMARY lastname, firstname, middleinitial, phone_no NULL NULL
NCIndexLNinKeyInclude3OtherCols nonclustered located on PRIMARY lastname firstname, middleinitial, phone_no NULL
NCIndexLNOnly nonclustered located on PRIMARY lastname NULL NULL
QuickFilterTest nonclustered located on PRIMARY lastname phone ([lastname]>'S' AND [lastname]<'T')

So, in the end, I can quickly see whether or not my index has included_columns and/or a filter_definition. Don't get me wrong, these indexes above are NOT necessarily a good combination of indexes (or recommendation of ANY kind) to have - these were just created to make sure that my code works. And, as my good friend Gunnar would say - "it's not my best code but it's not my worst code either" <G>. And, so, here it is: sp_helpindex2_2008.zip (2.75 KB).

Pretty darn useful for sure! Oh, and I used the undoc'ed sp_MS_marksystemobject so that I could still create the sp_ in master but then execute it in all other databases. It's frustrating that this behavior (with sp_ named objects) no longers works in 2005/2008 but at least the sp_MS_marksystemobject still sets the behavior so that we can create this one proc in master but use it in all other databases.

Have fun!
kt

I know that Paul and I recommended that you subscribe to Conor's blog... but have you? He's posted some great details on Partitioning (Part 1 and Part 2) as well as statistics and it always reminds me of how much I can learn from other people's perspectives!

And, just to dove-tail on some of his statistics comments... I, too, have found that as tables get significantly larger AND have non-standard distributions of more than 200 distinct values (and un-even distribution between those values as well), that the optimizer just cannot possibly do a perfect job. The only way an optimizer can be good is when it can "find a good plan fast" (which I first heard from Nigel Ellis (former Development Manager of the Query Processor team) - back when he delivered a Pacific Northwest SQL Server User Group meeting many moons ago). The most important thing to realize is that it's just not possible to waste time to find the absolutely best plan... mathematically analyzing all permutations would be prohibitive - you'd have to take a vacation between query executions (wait, that's not a bad idea... I digress :).

The point:

  1. Make sure that statistics are up-to-date (either through the database option: auto update stats OR by manually updating statistics)
  2. Consider re-evaluating statistics over large tables (and, when poor performance occurs - look at the estimated rows v. the actual rows - if the estimate/actual are off by a fact of 10, then it could be the statistics). I'd try updating stats first and then if that doesn't work, updating with a fullscan. If neither of those work, I'd also re-evaluate other possible indexes (there are some distributions between tables being joined that just can't show a correct correlation between the values when in multiple indexes... sometimes the best index is a multi-column (ie. composite index)). 
  3. Consider breaking very large tables down into smaller chunks (not just table index partitioning but possibly Partitioned Tables AND Partitioned Views) as this can give the optimizer additional details about partiticularly interesting data sets. Even in SQL Server 2008, statistics are still table-level (filtered indexes can provide some, but not complete, relief... I'll give more details in a later post) but I'd often argue that some of the best table designs are not just for a single table. Consider the statistical, locking, and indexing implications for mixed workloads against a single table (and the tremendous amount of blocking that could occur in addition to varying access patterns). And, even while 2008 will offer Partition-level lock escalation, well-designed tables may not need it! I know I've mentioned this before but different perspectives on statistics, optimizers and the fact that a good optimizer has to be efficient in-and-of itself, remind me of some of the most basic things that are also the most common problems contributing to poor query performance.

Returning to the basics and optimizing a system from the ground up always leads to better scalability!

Enjoy!
kt

 

Well, I've promised to blog more and I'm really going to try to do so. This morning I got the perfect question/comment (in email) to respond to and after working through a response that was taking me upwards of 3 hours (you'll learn later why I have 3 "spare" hours :)......... I figured that it was time to turn the response into a blog post. ;)

Background: The Clustered Index Debate
In the years since the storage engine was re-architected (SQL Server 7.0+) there's been constant debate on how to appropriately choose the clustered index for your tables. I've generally recommended an ever-increasing key to use as a clustered index and many find that counterintuitive. The primary reason people feel it's counterintuitive is that it creates a hotspot of activity. [If "hotspot" is not a familar term - a hotspot is solely an active place within your table.] Hotspots were something that we greatly tried to avoid PRIOR to SQL Server 7.0 because of page level locking (and this is where the term hot spot became a negative term). In fact, it doesn't have to be a negative term. However, since the storage engine was rearchitected/redesigned (in SQL Server 7.0) and now includes true row level locking, this motivation (to avoid hotspots) is no longer there. In fact (and probably even more counterintuitive), the opposite is true. Hotspots (specifically hot PAGES not hot ROWS) can be very beneficial because they; minimize the number of pages needed in cache, improve the likelihood of the required page already being in cache and in general, they minimize the overall amount of cache required. So, this is why many of us have changed our recommendation on where to create the clustering key in 7.0+. Instead of focusing on range queries we now focus on placing the clustering key on an ever-increasing key. In earlier releases, focusing on range queries for the clustered index reduced hotspots for insert/update and this in fact was the PRIMARY motivation to choose them, NOT range query performance! But - there are even MORE reasons to choose an ever-increasing key and they are based on internals as well. These internals are based on the significant changes made in the storage engine for 7.0+. For a quick start on these, I went through them in the Blog entry here.

And, today's email is not uncommon. This is the basis for the title clustered index debate. In general, there are still a lot of questions related to creating clustered indexes to improve "range query" performance. Don't get me wrong, there's definitely a benefit in performance for some range queries but the first thing to remember is that you get only one CL index per table (therefore only one type of range query can benefit). In the real world, t's not likely that you want to see your data exactly in the same way all the time. Therefore it's very challenging to come up with the "right clustered" index if you're using range queries as your strategy. Even worse, the affect of choosing the clustering key to improve range queries causes problems for modifications against that table (INSERTs/DELETEs and UPDATEs). So.............. this is what started my day today. A great email from a reader that brought up these points. The question/comment (modified to hit only the highlights and to protect their identity :) was this:

The most important characteristic for a Clustered Index key is to satisfy range queries. More often than not, if a sufficient range of data will be scanned, the Optimizer will choose the Clustered Index over all others due to the excessive cost of Bookmark Lookup operations. As such, the table KEY is a more suitable clustered index candidate than any surrogate (few every query a database by range of surrogate keys).  [kt note: this second sentence is not entirely true... SQL Server will certainly choose a clustered index over non-clustered that require table scans but there are A LOT of algorithms that SQL Server can use instead of either of these and my examples later show this... non-clustered covering seekable indexes, non-clustered scanable indexes, index-intersection, etc. ] 

Now, when the default behavior for SQL Server was designed such that the PRIMARY KEY was chosen as the default clustered index, it was exactly for this reason.  It is the business key.  It would satisfy uniqueness (by definition of logical KEY).  And, it is well suited for a wide variety of range scans.  However, this is when the PRIMARY KEY is defined on the Business Key of the data.But, when you introduce the usage of surrogate keys (i.e., IDENTITY) as a physical implementation, and thus transfer the PRIMARY KEY definition to it, two things must be considered.  First, the Business Key this IDENTITY will be a proxy for must still exist as it is still apart of the logical design.  As part of the physical design, the logical key needs to be implemented as a physical constraint to maintain logical uniqueness.  Second, just because a proxy has been defined does not make it a natural candidate for the clustered index.  The business key still maintains this distinction.What is often cited as the “reason” for IDENTITY PRIMARY KEY clustered index definitions is its monotonic nature, thus minimizing page splits.  However, I argue that this is the only “reason” for defining the clustered index as such, and is the poorest reason in the list.  Page Splits are managed by proper FILLFACTOR not increasing INSERTS.  Range Scans are the most important “reason” when evaluating clustered index key definitions and IDENTITies do not solve this problem.Moreover, although clustering the IDENTITY surrogate key will minimize page splits and logical fragmentation due to its monotonic nature, it will not reduce EXTENT FRAGMENTATION, which can cause just as problematic query performance as page splitting.

In short, the argument runs shallow.

Luckily, this email arrived with perfect timing for me as I'm sitting in a "bootcamp" event on Always On technologies and I'm not speaking this morning (my colleague Bob Beauchemin is doing lectures on Scale Out technologies: Scalable Shared Databases, Service Broker, DPVs, etc.). Anyway, in addition to listening to Bob, I've decided to continue the blog series on "the clustered index debate". The first and most important point to stress is that minimizing page splits is NOT the only reason nor is it the most important. In fact, the most important factors in choosing a clustered index are that it's unique, narrow and static (ever-increasing has other benefits to minimizing splits).

The Clustered Index Debate Continued
First, there are many angles to look at wrt to "the clustered index debate" and it's not until all of the issues are reviewed, that this strategy (a monotonically increasing key) becomes obvious. So, I think it will probably take a couple of blog posts to really prove this. I'll start up this debate again here...... When you look at a general purpose table (which is most) where the table has ALL DML (S/I/D/U) then you are best off with an ever-increasing key (again, you have to look at the overall impact of all operations against the table - not just select... because I/D/U will also impact select in the long term). So, I'll break this down into each DML operation here. If you don't look at the overall impact, then large tables can end up having a tremendous number of problems once they're put into production. I've certainly heard this concern/debate before (and most people are skeptical at first glance) but when you look at the situation overall, you'll find that "finding the right balance" includes not just looking at range queries. In fact, here's a quick list of the things/tests/numbers/scenarios that help to prove my strategy:

  • Inserts are faster in a clustered table (but only in the "right" clustered table) than compared to a heap. The primary problem here is that lookups in the IAM/PFS to determine the insert location in a heap are slower than in a clustered table (where insert location is known, defined by the clustered key). Inserts are faster when inserted into a table where order is defined (CL) and where that order is ever-increasing. I have some simple numbers but I'm thinking about creating a much larger/complex scenario and publishing those. Simple/quick tests on a laptop are not always as "exciting". But - this is a well documented issue (IAM/PFS lookups) and poor performance on a heap is also referenced in this KB: PRB: Poor Performance on a Heap. note: this KB is quite dated and I don't actually agree with everything in this article however, the general concern of poor performance for inserts is still true on SQL Server 2005.
  • Updates are often faster (when the row needs to be relocated) and for the same reason (IAM/PFS lookups) BUT there are many types of updates and not all updates cause records to be relocated. Here are a few things to think about wrt to updates:
    • Updates that are completely in-place (some examples are where the update is updating a fixed-width column OR to variable-width columns where the row size doesn't change, etc.). These types of updates don't really care.
    • Updates that cause record relocation (where the row size changes) are definitely better by having a clustering key because the record relocation (which will be handled by a split) is defined by the clustering key
    • Updates to the clustering key are the WORST (in this case) which is one of the key reasons for having a cl key that is static (so we have to keep this in mind when we choose a clustering key).
  • Deletes aren't nearly as big of a concern BUT deletes in heaps create more gaps and more gaps creates more work in PFS/IAM lookups and while this helps to reduce wasted space, it still requires the time to find the space........ hence the slowed performance of Inserts/Updates. I've also written some blog entries that cover very interesting test cases for large scale deletes and why you'd want to consider partitioning to optimize for the "sliding window scenario" in this blog entry: MSDN Webcast Q&A: Index Defrag Best Practices - Fragmentation, Deletes and the “Sliding Window” Scenario and it's the LAST one!.
  • Selects.............. now this is the hardest one to go through in just a couple of bullets (ah, I guess this will lead to another one or two posts :) BUT I'll start by saying that the best way to tune the vast majority of range queries is through non-clustered [covering] indexes. But, it's also important for me to stress that I do NOT advocate covering every query (it's impossible to do). What's important to realize in terms of covering is that SQL Server 7.0 and up continues to include internal algorithms to improve performance when you don't have the "perfect" non-clustered covering seekable index and instead still gives better performance than going to the base table (or performing bookmark lookups - as mentioned in the mail...and I completely agree that these [bookmark lookups] can be evil!). To start this discussion, I'll give one of my favorite examples of a large-scale aggregate. The absolute best way to improve the performance is through an indexed view but the data can be gathered through many other algorithms - ideally through a non-clustered covering index that is in order by the group by and that includes the column(s) being aggregated. For example, take this query:

SELECT c.member_no AS MemberNo,
 sum(c.charge_amt) AS TotalSales
FROM dbo.charge AS c
GROUP BY c.member_no

On a charge table of 1.6 million rows here are the performance numbers to handle this aggregation:

  • Clustered table scan (CL PK on Charge_no) with a hash aggregate = 2.813 seconds
  • Index scan (non-clustered covering but NOT in order of the group by) with a hash aggregate = 1.436 seconds
  • Index scan (non-clustered covering in order of the group by) with a hash aggregate = .966 seconds
  • Indexed view = .406 seconds

Now this was a pretty small table (narrow rows and only 1.6 million rows) AND I didn't have any concurrent activity. The concurrent activity would have caused this to be even slower for hash aggregates, etc. Regardless, it proves the point (at least generally). Now, if I wanted to improve this range query then I'd have to cluster on the member_no column (and this is an ideal example because I often hear people say that clustering on a foreign key column helps to improve range/join queries - which can be true as well)......... But - this strategy has a few problems in addition to a few benefits (and we have to look at everything to be sure of our choice/decision). First, member_no is not unique (in the charge table) so SQL Server has to "uniquify" the rows. The process of "uniquification" impacts both time (on insert) and space (the rows will be wider to store each duplicate row's uniqufier). Also, theoretically it could change (in this case that's not true). Anyway, the time it takes for the clustered index is 2.406 seconds which is better than the clustered on the PK (of course) but if I were to also start modifying the rows (which creates splits) or even just insert 15% more rows........ then my table would become fragmented. At that point, the query performance should get worse in the table clustered by member_no table and it will continue to get even worse in the table clustered by charge_no (because of the worktable created in tempdb by the hash aggregate) BUT it won't be all that much worse in the non-clustered index examples (especially the covering index that's in the order of the group by - because this doesn't require a worktable).........

  • CL on member_no = 4.906 seconds
  • CL on charge_no = 6.173 seconds
  • Index scan (non-clustered covering but NOT in order of the group by) with a hash aggregate = 3.906 seconds
  • Index scan (non-clustered covering in order of the group by) with a hash aggregate = 1.250 seconds
  • Indexed view = .516 seconds

This is a great start to furthering the clustered index debate but I do have to admit that it's a counterintuitive and difficult issue to tackle because often isolated tests lead you to different conclusions. In this case though, the non-clustered indexes are better for this range query and the indexed view is the best (but I wouldn't consider the Indexed unless this were more of a read focused database rather than read/write). [and - of course, that statement warrants yet another blog post :)]

So, depending on the tests that you do - especially if you focus only on selects and you don't have modifications (i.e. fragmentation) - then they will make "creating the clustered index for range queries" appear to be best. Again, I'm not just saying this to prevent fragmentation, I'm saying this because I wouldn't use the clustered index OR a non-clustered index with bookmark lookups to handle this query. I'd consider a non-clustered covering that's seekable OR even a non-clustered covering that's scanable before I'd even choose the clustered (and that's what the optimizer would prefer as well). In the end it's really a bit of an art and a science to "finding the right balance" of indexing.

Oh - and if you arbitrarily add a column to use for clustering (maybe not as the primary key) that can help but many would prefer to use actual data... which means [potentially] creating your primary key with a new identity [or similar] column and this can impact your business logic (absolutely). I'm certain that certain tests can show that range queries are faster and it's absolutely correct that business application/usage can be a concern but when you look at the big picture (and the impact on I/D/U) then the benefits of the monotonically increasing key significantly outweigh these concerns. Simply put, a small/narrow key can help join performance and an ever increasing key can also help lookups for rows! (yes, definitely more coming)

Happy Friday! Have a great weekend. I'll try to continue more threads on this debate shortly!
kt

On Wednesday I had the pleasure of chatting with Greg Low - a fellow RD and MVP who is also excited to be focused and working in the SQL Server space. He's been doing a few podcasts on database technologies and we were finally able to hook up today. He caught me off guard with a couple of questions on my hobbies (the hobbies that I have outside of SQL Server ;-) but more than anything he got me talking about indexes (well, now that's not all that hard, eh? ;-). It was fun - thanks for the chat Greg.

Check it out:

SQL Down Under: http://www.sqldownunder.com/
The show in mp3 format: http://www.sqldownunder.com/SDU15FullShow.mp3
The show in wma format: http://www.sqldownunder.com/SDU15FullShow.wma

Enjoy - and thanks Greg!
kt

Categories:
Events | Indexes | Tips

Hey there everyone! I know I still owe you a few Q&A entries (for sessions 7, 8 and 9) but I wanted to get this blog entry out there so that you can play a bit with some of the resources. This series was targeted at developers but really helps to "bridge the gap" between development and administration by always remembering the impacts of what you implement (and techniques to help you to see if you do). More specifically, everything you do and/or design, has the potential for a negative impact to something else - there's no free lunch, eh?

So, this series focused more on Scalability but always remembered the impact to availability and/or reliability. This last session brought together the three primary areas to remember while developing a scalable and reliable architecture:

  1. Know your data
    • Design for Performance - Sessions 1, 2, 3, 6, and 8
  2. Know your users
    • Indexing for Performance - Sessions 4, 5 and 9
    • Optimizing Procedural Code - Session 7
    • Controlling Mixed Workloads and Concurrency - Session 6
  3. Users lie
    • Profile - to make sure that you're tuning what's really happening as opposed to what you think was going to happen! - Session 9

This last session had some great questions and as a result, a few new resources were used. Here are a few of the things we talked about:

Event Notifications and DDL Triggers

DMVs

Webcast links for the entire series!

Part 1: Creating a Recoverable Database
For the MSDN Download for Part 1, click here.
For the SQLskills Blog Entry for Part 1, click here.

Part 2: Creating a Reliable and Automated Backup Strategy
For the MSDN Download for Part 2, click here.
For the SQLskills Blog Entry for Part 2, click here.

Part 3: Designing Tables that Scale, Best Practices in Data Types and Initial Table Structures
For the MSDN Download for Part 3, click here.
For the SQLskills Blog Entry for Part 3, click here.

Part 4: SQL Server Indexing Best Practices
For the MSDN Download for Part 4, click here.
For the SQLskills Blog Entries for Part 4
Resource links blog entry, click here.
Q&A blog entry, click
here.

Part 5: SQL Server Index Defrag Best Practices
For the MSDN Download for Part 5, click here.
For the SQLskills Blog entry, click here.

Part 6: Mixed Workloads, Secondary Databases, Locking and Isolation
For the MSDN Download for Part 6, click here.
For the SQLskills Blog Entry for Part 6, click here.

Part 7: Understanding Plan Caching and Optimizing Procedure Performance
For the MSDN Download for Part 7, click here.

Part 8: Data Loading and Aging Strategies - Table and Index Partitioning
For the MSDN Download for Part 8, click here.

Part 9: Profiling for Better Performance
For the MSDN Download for Part 9, click here.

Part 10: Session Summary - Common Roadblocks to Scalability
For the MSDN Download for Part 10, click here.
Transcript can be found here.

So, the series comes to an end (even though I still have more work to do). I have to say that it was a lot of fun and I enjoyed everyone's questions. And /start shameless plug here/ starting in March, SQLskills will begin a 10-12 part series on TechNet. The series will include sessions from my colleague Bob Beauchemin as well as me. This will definitely be more Operations and DBA focused but for all of you developers - it may help you better understand the system, High Availability and a myriad of New Features in SQL Server 2005.

I hope to see you there - or at least your DBA... ;-)

Thanks again everyone,

Kimberly

Effectively Designing a Scalable and Reliable Database

A Primer to Proper SQL Server Development

New Features in Indexing and Index Maintenance Best Practices, Part 5 of 10

Presented by Kimberly L. Tripp, SQLskills.com

Q: Can I view a recording of this webcast? Part 5 can be replayed by clicking here.

Q: Where can we get the demo scripts AND the sample database: Credit? The demo scripts are in this zip (20050902MSDNDemoScripts.zip (8.52 KB)); here in this blog entry. However, at the series completion, I will also create an entry under Past Event Resources for the entire webcast series.  To download the ZIP of the Credit Database Backup click here. Once unzipped, restore this backup to SQL Server 2000 or SQL Server 2005. The backup is a SQL Server 2000 backup and can be restored to either version! If restoring to SQL Server 2005, you might want to change the destination for the data and log file as the path will probably be different.

Q: Where are the links to all prior Webcast Q&As from this series?

Part 1: Creating a Recoverable Database
For the MSDN Download for Part 1, click here.
For the SQLskills Blog Entry for Part 1, click here.

Part 2: Creating a Reliable and Automated Backup Strategy
For the MSDN Download for Part 2, click here.
For the SQLskills Blog Entry for Part 2, click here.

Part 3: Designing Tables that Scale, Best Practices in Data Types and Initial Table Structures
For the MSDN Download for Part 3, click here.
For the SQLskills Blog Entry for Part 3, click
here.

Part 4: SQL Server Indexing Best Practices
For the MSDN Download for Part 4, click here.
For the SQLskills Blog Entries for Part 4
Resource links blog entry, click here.
Q&A blog entry, click here.

Q: How can I replay previous sessions? I thought we were going to get emails for replaying -- but I haven't received any replay emails. You will receive replay emails ONLY when you register for these sessions through MSDN. We’ve come to find out that there are other ways to register but it’s only through MSDN that we know for sure you will receive the replay information. Regardless, you can always find the “on-demand” version of the sessions here.

Related Resources

MSDN Webcast: Indexing for Performance – Proper Index Maintenance MSDN Whitepaper: Microsoft SQL Server 2000 Index Defragmentation Best Practices TechNet It’s ShOwtime Webcast: Index Defragmentation with SQL Server 2005 

Technical Questions

Q: In your script, what is "HA Requirements"? HA = High Availability. This is the requirement that your table stay online and available. Some companies are trying to achieve 99.999% uptime, this is especially challenging when even maintenance operations take a table offline.

Q: If you create extra indexes is there a easy to configure utility that you can run across an application after it has run for a few months to list keys that were never or infrequently used? Use one of the new DMVs: sys.dm_db_index_usage_stats. To see the complete list of DMV objects, use the following query:SELECT * FROM sys.system_objects WHERE [name] LIKE 'dm[_]%'

Q: What is DMV again? Dynamic Management View. These are new objects which give information about in-memory objects and state information.

Q: What are the parameters and their usage – for sys.dm_db_index_physical_stats? (DatabaseID, ObjectID, IndexID, PartitionNumber, Mode)

DatabaseID = [ NULL | 'DatabaseID' ] NULL: returns information for ALL databases, if NULL is used no other options can be supplied. This returns ALL indexes for all objects in all databases. Easy but possibly slow.
DatabaseID: smallint type. Refers to the ID for a specific database. DB_ID() or DB_ID('DatabaseID') can be used. The latter allows you to run this from ANY database as long as you have access. However, 3-part naming must be used.
ObjectID = [ DEFAULT | NULL | 'ObjectID' ]DEFAULT/NULL: return ALL base data: CL, Heap, LOB for the specified database.
ObjectID: int type. Refers to the ID for a specific object. OBJECT_ID('TableName') can be used. When using OBJECT_ID, you can use 1/2/3-part naming. Be sure to use 3-part when executing outside of database.
IndexID = [ DEFAULT | NULL | 'IndexID' ]DEFAULT/NULL: All indexes
IndexID: tinyint type. Refers to the ID of a specific index.
PartitionNumber = [ DEFAULT | NULL | # ]DEFAULT/NULL/0: return ALL partitions
#: returns only the details about specific partition. When a PartitionNumber is specified then an IndexID must also be specified.
Mode = [ DEFAULT | NULL | 'SpecificMode' ]DEFAULT/NULL/LIMITED: return FAST scan and use only an IS (Intent Shared) Table-level lock. This lock blocks ONLY eXclusive TABLE-level locks and schema changes. Excellent, relatively unobtrusive way to get fragmentation details.
LIMITED: IS Lock. Same as SQL 2000 WITH FAST, only page counts and EXTERNAL fragmentation displayed. Does not detail INTERNAL fragmentation and page density.SAMPLED:  IS Lock. For tables less than 10,000 pages (~80MB), all details are produced. For tables of more than 80MB, two samples are done (1% and 2%) at every nth page. The samples are compared and if close, 2% sampling output returned. If not close, then up to 10% will be sampled.DETAILED: S Lock. Entire table analyzed for both internal and external fragmentation. Returns one row for each level of the index from the leaf level (level 0) all the way up to the root level. This can help you determine fragmentation in the non-leaf levels but at the expense of holding a shared table level lock.
Q: How often should you run DEFRAG on your SQL server box? Should this be a part a regular schedule? Taking down SQL is their any other consideration? First, the only thing that’s not available is the table being REBUILT. Defragging an index does not take that table/index offline. So, more than anything, it depends on what you’re trying to achieve. If you want achieve better availability on SQL Server 2000 then you might choose to defrag rather than rebuild – to keep your tables available.

Q: How often do you get such perfect tables in practice? A table is always completely clean and contiguous after a rebuild. To periodically fix a table, you should use consistent and automated rebuild strategies.

Q: Do you have suggestions for developers using MSDE when customer’s demands can vary? Vary from few transactions to a large customer with many transactions. The general best practices in database and table design scale from the low end all the way up to the high-end and in the end – helps your database scale!

Q: Can you touch on rules of thumb for "pad index"? If fragmentation in the leaf level is minimized through proper index maintenance and fillfactor – then fragmentation in the non-leaf levels should be low as well. You rarely need to specify padindex unless you have widely varying distribution of data and really want to leave larger gaps because of strange densities of data.

Q: Do most of these "Index Rules" apply to Indexed-Views? Yes! All indexes can become fragmented after data modifications... Your scripts should always look for fragmentation across all scripts.

Q: Can you discuss fragmentation WRT horizontal partitioning, especially range partitioning on the primary key? SQL Server 2005 offers more granular rebuild options –but not necessarily online. In many cases, you might want to design a read-only partitioned table and keep the volatile portion of the table (especially if only one partition), in its own separate table – possibly using a partition view (or an inline table valued function) over these two tables.

Q: If I'm selecting from a table with a where FirstName = ... and LastName = ... and I have 2 indexes, one on LastName and another on FirstName. Are they both used? With an AND – maybe. The optimizer will look at the Index statistics to determine if either of them selective enough to use only one index. If neither is selective alone and a better index does not exist (a better index for AND would be one that includes BOTH of the columns in the SAME index – as a composite index), then SQL Server may choose to join the indexes (index intersection).

Q: URLs on the Resources slide can't be read. Could you type then into the Q&A, please? When the session is available for download (which is what happens when MSDN put this online), then you can access the URLs there as well. Typically, I place all of the links at the beginning of the Q&A – resources section. I’ll make sure to do this consistently!

Q: How does an uniqueidentifier used as a clustered primary key effect performance? This is best answered by session 4. In short, a non-sequential GUID can cause a lot of fragmentation.

Q: What is ExtentFragmentation as reported by DBCC SHOWCONTIG and is it less important than Logical Fragmentation? Extent Fragmenation refers to how many extents are next to each other. This is a bit more important than Logical Fragmentation as logical fragmentation shows whether or not the pages are next to each other.

Q: How much danger is there in the defrag processes? What kind of backup procedures do you suggest when you defrag? More frequent transaction log backups. A defrag generates a lot of log information. However, it does so in mini transactions. As a result, transaction log backups can occur concurrently with the defrag process and even though the defrag is not complete, the transaction can still be cleared because the defrag process runs as small transactions instead of one long running transaction. This also improves concurrency because the locks are released throughout the process.

Q: Defraging a large index can cause the log file to grow quite large. Is there a way to minimize this other than frequently log backups? Yes, you’re correct – defraging a large index WILL grow the log file quite large! As for minimizing this activity in the log – no way to do that. But – you’re correct in increasing the frequency of log backups!

Q: With very large tables, how much available disk space (both transaction logs and data drive) do you need to have to rebuild? Does it take less space to defrag than to rebuild? Well, this is really a multipart question… First, log space for rebuilds is mostly dependent on the recovery model. If you are running the FULL recovery model then creating and/or rebuilding indexes will take enough log space for the entire rebuild to complete. If you are running in the BULK_LOGGED or SIMPLE recovery models then this operation will run as a bulk operation and will be minimally logged. While this will take less time and significantly less log space, you are giving up some features when switching recovery models. I would strongly suggest reviewing the second session to see if this is appropriate.Now, as for data space – a rebuild will always require at least the table size in free space and possibly as much as double (if an online rebuild is being performed). Typically, when space estimates are being done (when capacity planning the database) I always recommend taking the largest table size and multiplying it by 2 or 3 – in order to make sure you have enough space for rebuilds. There is space needed for sorting as well – this can come from the database OR from tempdb (using the SORT_IN_TEMPDB option).Defraging doesn’t move an object so it doesn’t take additional data space BUT it does require more overall log space because it runs as mini transactions instead of just one.

Q: Should we look at different fragmentation stats if there are multiple files in the same filegroup? No, you still want to review average fragmentation. However, you may have more “fragments” in a table that spans filegroups; this does not necessarily mean that your table is fragmented.

Q: Are there any good third party tools for checking fragmentation and performing maintenance? Unfortunately no revolutionary ones (that I know of and/or can recommend)...but I still have high hopes :)

Q: How do you determine the appropriate fill factor? Unfortunately, there isn't a magic number... but, you can test your guestimate by seeing how fragmented the table becomes between your regularly scheduled defragmentation routines.

Q: Does it matter if I build the clustered index before/after rebuilding the non-clustered indexes? You should always create the clustered index before creating non-clustered index but as for rebuilding - you can rebuild your indexes independently as a rebuild of the clustered does not (generally) cause a rebuild of the non-clustered. There is one exception to this rule in SQL Server 2000 – in 2000 ONLY, SQL Server will automatically rebuild the non-clustered indexes when a non-unique clustered index is rebuild. Why? Because the uniqueifiers are rebuilt.

Q: Will doing a defrag followed later by a rebuild decrease the work of the rebuild? Not really. A defrag doesn’t move the object – only a rebuild does. However, you might minimize the cost of the sort…

Q: How does cache map to table pages, i.e., does free space in table pages have a 1:1 correspondence to wasted cache? SQL Server reads the 8K page from disk into memory. The number of bytes that are wasted on disk are also wasted in memory. This is often the motivation for vertical partitioning! You might refer back to session three for more details on row/page structures!

Q: If switching a varchar cluster to a bigint and vice-versa in 2000, what would the best order of drop/create index? Actually, this is the reason that CREATE with DROP_EXISTING was created… so that you could “change” the definition of the clustered

DROP TABLE test
go
CREATE TABLE test
(
      testid      int               not null,
      col1       varchar(100)      not null
)
go
CREATE CLUSTERED INDEX testind ON test(col1)
go
CREATE CLUSTERED INDEX testind ON test(testid) WITH DROP_EXISTING
go
sp_rename 'test.testind', 'NewIndexName', 'INDEX'
go
sp_helpindex test
go
Q: What about instances of one name only? (like Madonna, Cher, etc. ;^) Well, this is a good question and this is something that you might need to plan for in design. In these cases, you might allow NULLs in the lastname column and then make sure to search both when a lookup is performed. To be honest, I probably won’t do all that much to find these special first names – unless you wanted to do searches across both columns without knowing whether or not this is a first or last name. You might do something like this in a lookup
SELECT * FROM NamesTable
WHERE LastName = @variable
      OR (FirstName = @variable AND LastName IS NULL)
Comment: Just wanted to say I appreciate the blog you have put together.

Thanks for the thanks! It's a lot of work but I think it's great as a reference!! Even for me! To be honest, I can't always remember where to find things either! J

Thanks! So – we’re half way there – 5 more to go! And, lots more questions coming I’m sure J For the next session, we’re going to cover Isolation and options in Isolation in SQL Server 2005. If you’re interested in hearing more isolation, locking/blocking – here’s the registration link:MSDN Webcast: A Primer to Proper SQL Server Development (Part 6 of 10): Mixed Workloads, Secondary Databases, Wait States, Locking and Isolation See you on Friday!

kt

Effectively Designing a Scalable and Reliable Database A Primer to Proper SQL Server Development
Indexing Best Practices, Part 4 of 10
Presented by Kimberly L. Tripp, SQLskills.com For a list of the non-technical Resources related to this webcast, please review the following blog entry:MSDN Webcast Q&A for Resources: Indexing Best Practices, Part 4 of 10  Technical Questions

Q: When a clustered index is added to a table what happens physically to the data pages? SQL Server moves the data to the new location defined by the ON clause of the index creation statement. The ON clause can specify a different filegroup or even a partition scheme and the heap will move to the new location.

 Q: I'm not getting the follow-up emails with the replay link. How do I get on this email list for replay?

You *must* register through MSDN. If you are using a partner site then that is probably why you’re not getting the email. IF you are registering through MSDN and still not receiving the email, please contact eventsup@microsoft.com with your information.

 Q: Does the copy affect the transaction log?

Yes. A create index or rebuild index statement does require the statement to be a single transaction. If you are running with your database set to the FULL recovery model then this single transaction will create a lot of log activity. You could switch to the BULK_LOGGED or SIMPLE recovery model and get less log activity but switching to the SIMPLE recovery model should be avoided as it will break the continuity of the transaction log. I would recommend two resources to get more information about Recovery Models and their impact on performance, logging and recovery.

MSDN Webcast Q&A: A Primer to Proper SQL Server Development, Creating a Reliable and Automated Backup Strategy, Part 2 of 10

Review the sample chapter: Chapter 9: Database Environment Basics for Recovery from the MSPress title: Microsoft SQL Server 2000 High AvailabilityQ: Can one switch back and forth between simple and full recovery mode?

Ideally, you would switch ONLY between the FULL Recovery model and BULK_LOGGED Recovery model but not with SIMPLE. The two resources listed above will also help to clarify this point.

 Q: What about re-building a clustered index? Will it be that slow too? Is it recommended to do it off-hours?

Rebuilding a clustered index takes a lot of the same requirements that creating the CL index does. It can also be slow and yes – as a result, you might want to do this off hours. You can speed up the process by changing recovery models but switching recovery models can also have an impact on recovery. See the two resources above for more details to help clarify this point. And, in part 5, I’ll cover more of the fragmentation details.

 Q: When is it desirable to move a table?

This is not necessarily an easy one to answer simply. But, I often recommend separating transaction processing data from read-only data as well as possibly placing a large table on its own filegroup (especially true with partitioned objects). But, I don’t often recommend separating indexes. Here’s a link to a SQL Server Magazine article I wrote – it does give some guidance. In general I don’t separate too many objects. However, if you’re building a database that you will put into a read-only state you might want to work a bit harder to “shift” objects to filegroups in such a way that each of those objects gets a nice and clean and contiguous chunk of space within the file. Another time that it’s useful to “move” an object is when you load data into a heap and then want to build the clustered index on another filegroup – where that heap will become a partition of a partitioned table. This is good when the clustered index will move the object into a filegroup that does not have other objects in it. Again, the clustered index will be nice and clean and contiguous in the destination filegroup.

 Q: So, if I understand correctly… a good approach to creating a table would be to have a clustered index independent of the primary key. So, ideally you could use a GUID for the PK and then have a clustered index in order to minimize the fragmentation.

The most ideal combination is when the clustered index is on an identity column – and even better if this is the Primary Key. However, if you do want to keep a GUID as the primary key then yes, you can create a clustered index which is independent of the primary key.

Q: In general, do you think folks have more problems setting up indexes for OLTP, OLAP/DWH, or Web apps?

In general, I think Decision Support environments can have more indexes with less of a negative… OLTP is a bit harder to find the right balance. The hardest is combination environments…

 Q: What was the diff between 0 and 1 on the third argument of DBCC IND?

The third parameter is for Index ID. 0 = Heap, 1 = Clustered Index, 2-250 are non-clustered index ids.

 Q: Should we keep the install defaults like Collation, and others? Or do you have recommendations?

Generally speaking, yes – I would recommend keeping the defaults. However, I do recommend that some development environments – especially those who are producing software for sale. When the database can be installed/setup on other servers that may have different code pages, etc., you might want to use binary case-sensitive sort. This will make your application more consistent – even with case-sensitive servers…and you never know if one of your customers may have one! It will also help to make you a better coder.

 Q: What design considerations should I take into account when planning my primary keys, indexes and clustered indexes for databases that will be used with replication?

There are no specific requirements for Replication other than the fact that you must have a Primary Key… but the same best practices discussed today apply and should be used in replication as well.

 Q: I am using SQL Server 2000. I have a table that is used a lot on the web server. The scan density is 97%. Should I adjust the indexing and/or Primary Key?

I tend to think more about how fast it drops rather than the actual number. But – 97% is pretty good in general. What you might want to do is a rebuild IF you have off hours time to rebuild (in SQL Server 2000 rebuilding can be done offline ONLY). Otherwise, you might just want to setup a defrag procedure. However, this is NOT

 Q: Why are so many SQL commands undocumented and how do you get to know how to use them if they are undocumented?

First and foremost, I want to strongly urge you to take your time with commands that are undocumented. If the commands are really useful for day-to-day operations, they’re probably already documented. Occasionally, development teams add additional functionality during development to test their functionality. These commands – while proving useful in some cases – do not generally go through the same quality assurance procedures that fully documented commands go through and as a result, I would be very cautious before using most undocumented commands.

 

Having said all of that, you’ll find a lot of useful tidbits in articles, on webcasts, in books, etc.  Just make sure you test the command thoroughly and don’t expect to get support on it. In general, undocumented commands are NOT supported.

 Q: What is the command for scan density on SQL Server 2000?

DBCC SHOWCONTIG. If you’re interested in learning more about defragmentation on SQL Server 2000, please review this MSDN Webcast: SQL Server 2000 Performance Tuning: Index Defragmentation Best Practices

 Q: In SQL Server 2000, DBCC SHOWCONTIG, can you explain extent scan fragmentation and if we should pay attention to it?

We’re going to go over this more with our next session on Index Defrag but Extent Scan Fragmentation describes whether or not the extents that an object owns are “next” to each other. If there are gaps then this implies that there are other objects “interleaved” with this one. This is not ideal, defragmenting the object will help to improve scans and should.

 Q: The size of the database doubles when a clustered index is added?

Once the index is created the space needed to build the index is released. However, while the index is being created you need roughly 2-3 times the TABLE size to build it. If you have a table which is 1GB then you should plan for at least 2-3GB of free space in order to build or rebuild a clustered index. However, it’s only the table’s size that doubles and it’s only during the index creation NOT after the index has been created.

 Q: Aren't GUIDs too large to use for an efficient index?

No, an index can be useful even if it’s a large percentage of the size of the table – mostly, because of the order of the columns and the criteria used in the query. Hard to give a fast answer to this but in general 16 bytes isn’t so bad – much larger may even be acceptable.

 Q: Does level 1 of a clustered index point to rows, or does it describe the minimum and maximum row on a page?

Basically, it’s a pointer to the page and it has a pointer to the lowest value on the page. Each pointer in the non-leaf always points to the “first” value on the page (meaning the lowest).

 

Q: Why would you be inserting using an index other than using the primary key? Assuming the primary key is appropriate, I wouldn’t. However, there are a lot of environments where the primary key is chosen and kept – for a variety of reasons. SQL Server creates the clustered index on the Primary Key by default but it is not required’ you can create the clustered index on something else. It’s the clustered index which defines the insert order – not the Primary Key; however, they are often one and the same.

 Q: Should all indexes be clustered if unique? Would this increase performance on search results from SP?

I think the question should really be should all clustered indexes be unique. Generally, the answer to this is yes. But to answer the second part – would this increase performance on an SP – the answer here is not necessarily. Stored procedures have many reasons for being – or not being – optimal. Plan choice, optimization and recompilation issues all play a major factor in whether or not a stored procedure is optimal. However, in session 7, I will talk about plan caching and stored procedure optimizations!

 Q: If there is no natural clustering key, what is the advantage of "coming up with one". Or, said another way, what is the basic reason why I want a cluster rather than a heap?A heap does not inherently have order to it (that’s the point) but when the records are not ordered then SQL Server needs to find a location in which to insert the row. This causes a lookup in something called a PFS (page free space) as well as a lookup in the table’s Allocation Map (actually called an Index Allocation Map in SQL Server 2000 and a Heap or B-Tree Map in SQL Server 2005). To help the insert performance you generally want to have the insert location defined. However, just having a clustered index is not the best answer as there are other internal dependencies on the log. Because of these internal dependencies you typically want the clustered index to be unique, narrow and static.  If you’re interested in reading more about the reasons behind having a clustered index which is unique, narrow and static:Ever-increasing clustering key - the Clustered Index Debate..........again! MSDN Webcast Q&A: Index Defrag Best Practices - Clustering Key Choice In some cases this means adding a column to a table solely for the purposes of clustering on it. Q: Is it a good practice to just have non-clustered indexes?Well, that’s a different way of answering the prior question…and while heaps can be useful for high performance loading they are generally not as useful for day to day OLTP or combination OLTP/DSS tables. So, my recommendation is almost always – the RIGHT clustered index + additional supporting non-clustered indexes. Q: In large OLTP databases is it recommended to move reporting to a separate database which has a lot of non-clustered indexes and no clustered indexes?No… Again, it’s not really a binary thing (should you always have a clustered, or not always have a clustered). Really, there are cases for both BUT in most databases – even DSS (Decision Support System) – a clustered index helps to define a narrow and unique value that all indexes have in common. This can help other operations…  So, even in a DSS environment I generally recommend a clustered, as well as additional supporting non-clustered. Having said that there are cases where a single clustered – setup for order by and range queries can be useful but that’s generally only when the large majority of queries all want * (all columns) and you typically want the exact search and order by in almost every case. Q: I've read that you shouldn't cluster a column that is an Identity column because it hurts performance for OLTP as all writes are on same page/area of disk? Your thoughts?

There are some cases where insert performance can be compromised – in very high volume OLTP systems (typically more than 400 inserts per second but this can vary) but in general, inserting into the same “hot” spot of a table improves performance because the activity is isolated and the needed pages are already in cache. Lots more on this on in other Q&As.

 Q: If I add an identity column to my table to have a good clustered key, it will never be used in a query. Would I naturally also include the primary key, even though it would make the clustered key less narrow?

You don’t need to include the primary key is the clustered key is already unique, narrow, static and ever-increasing.

 Q: Does this mean that vertical partitioning of a table is only helpful when the table is scanned?

No, there are still many benefits in vertical partitioning – such as with locking, scanning. I’d refer you to review the Webcast that was part 3 in this series, for more details.

 

Q: Can you use DTA against SQL 2000 databases?

Yes. You can use either the DTA or the ITW against SQL Server 2000. If you're looking for details about how to use the SQL Server 2000 Index Tuning Wizard you can review this whitepaper: Index Tuning Wizard for Microsoft SQL Server 2000.

 Q: Is DTA in the Express version, too?

DTA does not come with the Express Edition but you can certainly do tuning on a higher version and then use that tuned database in Express.

 Q: Can you explain briefly Index View?

It’s a materialized view – in the sense that the data as defined by the view – is actually materialized and stored on disk. This has both pro’s and con’s and needs to be evaluated fully before you implement these. Please check out these whitepapers for more details:

 Improving Performance with SQL Server 2005 Indexed Views by Author: Eric Hanson and Contributors: Gail Erickson, Lubor Kollar, Jason WardImproving Performance with SQL Server 2000 Indexed Views by Author: Gail Erickson and Contributors: Lubor Kollar, Jason Ward contributor Q: Do you think one can become expert in SQL Server and other aspects of Design/Programming or does one need to specialize?

Wow, I think it this is a tough question; more than anything I would recommend learning all of the options available so that you can better architect the right solution. Once you think you have a good solution, then move forward to really learn that area/topic/features. I’ve been using a phrase with SQL Server 2005 a lot – Jack of all trades, master of some...

 Q: Can you repeat the new function INCLUDE on index?

In SQL Server 2000 the maximum size of an index key is 900 bytes or 16 columns – whichever comes first. This key helps to keep an index’s b-tree smaller and more scalable. In SQL Serer 2005, the leaf-level of the index can include additional non-key columns. This continues to help keep the tree structure scalable but also allows you to cover more queries.

 Q: I have installed SQL Server 2005 CTP, but can only find SQL Configuration Manager. I don't see the management tools that you are using. Any suggestions?

Actually, no. This seems a bit strange?! I might try uninstalling and re-installing, as well as reviewing the setup logs to see if there are errors within them. For SQL Server 2005, all installation log files are stored in C:\Program Files\Microsoft SQL Server\90\Setup Bootstrap\LOG.

 Q: if your non clustered index either includes all or part of the clustered index are the clustered index columns added again to the non clustered index?

NO! SQL Server only adds the column(s) of the clustering key – if they are missing.

 Q: Did you mention the cool timebound option in DTA?

Yes, but only briefly. Vipul Shah discusses DTA in greater detail in these two webcasts.

 TechNet Webcast: Performance Diagnosis in SQL Server 2005 (Level 300) by Vipul ShahTechNet Webcast: SQL Server 2005 Database Tuning Advisor (Level 300) by Vipul Shah Q: What is the fill factor and how is it used?

I think I’m going to defer you to the next session! Now, there’s some motivation. J

 

And here are a few comments that I’ve received! I just wanted to say thanks for your comments and thank yous… J It's your questions that keep me on my toes and current. This is still fun so keep it coming.

 

Comment: Thank you again, particularly for thoroughness and enthusiasm.

Comment: Let me add my voice to the chorus of "thank you's", Kim. You're  presentations and verbal Q&As are great, and your online ones are a truly generous gift to the community. No other webcaster has come close to doing what you do. Many thanks from all of us.  

Comment: I must compliment you on your Q&A in your blog - I've just checked it out - couldn't be more comprehensive!

 

So – 4 down, 6 to go! Lots more questions coming I’m sure J

 

If you’re interested in hearing more about fragmentation, how to view it, how to clean it up and finally, how to prevent it – here’s the registration link:

MSDN Webcast: A Primer to Proper SQL Server Development (Part 5 of 10): New Features in Indexing and Index Maintenance Best Practices (Level 200) Thanks!kt

Effectively Designing a Scalable and Reliable Database A Primer to Proper SQL Server Development

Indexing Best Practices, Part 4 of 10

Presented by Kimberly L. Tripp, SQLskills.com

Q: Can I view a recording of this webcast? The webcast will be available for viewing within 24 hours. All “on-demand” webcasts can be found here.

Q: Where can we get the demo scripts AND the sample database: Credit? The demo scripts are in this zip (20050826MSDNDemoScripts.zip (12.17 KB)); here in this blog entry. However, at the series completion, I will also create an entry under Past Event Resources for the entire webcast series.  To download the ZIP of the Credit Database Backup click here. Once unzipped, restore this backup to SQL Server 2000 or SQL Server 2005. The backup is a SQL Server 2000 backup and can be restored to either version! If restoring to SQL Server 2005, you might want to change the destination for the data and log file as the path will probably be different.

Q: Where are the links to all prior Webcast Q&As from this series?

Part 1: Creating a Recoverable Database
For the MSDN Download for Part 1, click here.
For the SQLskills Blog Entry for Part 1, click here.

Part 2: Creating a Reliable and Automated Backup Strategy
For the MSDN Download for Part 2, click here.
For the SQLskills Blog Entry for Part 2, click here.

Part 3: Designing Tables that Scale, Best Practices in Data Types and Initial Table Structures
For the MSDN Download for Part 3, click here.
For the SQLskills Blog Entry for Part 3, click here.

Q: How can I replay previous sessions? I thought we were going to get emails for replaying -- but I haven't received any replay emails. You will receive replay emails ONLY when you register for these sessions through MSDN. We’ve come to find out that there are other ways to register but it’s only through MSDN that we know for sure you will receive the replay information. Regardless, you can always find the “on-demand” version of the sessions here.

Resources mentioned in today’s Session:

If you’re interested in hearing more of the theory behind the indexing recommednations, as well as more of the reasoning behind unique, narrow and static criteria for the clustering key –  watch the TechNet "It's Showtime" recording, recorded at Tech Ed Europe titled: Index Creation Best Practices with SQL Server 2005 which covers Index Creation Best Practices on SQL Server 2005 and really gets into a lot of the "why's". If you’re interested in hearing a lot more about how to use Profiler effectively with ITW, watch the MSDN Webcast titled: SQL Server 2000: Performance Tuning - Finding the Right Balance of Indexes which covers Index Creation Best Practices on SQL Server 2000.

As for whitepapers:

SQL Server 2005 Beta II Whitepaper: Partitioned Tables and Indexes by Kimberly L. Tripp

SQL Server 2005 Beta II Whitepaper: Snapshot Isolation by Kimberly L. Tripp

PREVIEW: The Database Administrator’s Guide to the SQL Server Database Engine .NET Common Language Runtime Environment by Kimberly L. Tripp

Improving Performance with SQL Server 2005 Indexed Views by Author: Eric Hanson and Contributors: Gail Erickson, Lubor Kollar, Jason Ward

Improving Performance with SQL Server 2000 Indexed Views by Author: Gail Erickson and Contributors: Lubor Kollar, Jason Ward contributor 

Additional Webcasts:

TechNet Webcast: Performance Diagnosis in SQL Server 2005 (Level 300) by Vipul Shah

TechNet Webcast: SQL Server 2005 Database Tuning Advisor (Level 300) by Vipul Shah Support WebCast: SQL Server 2000 Profiler: What's New and How to Effectively Use It by Sri Kasam and Ajay Manchepalli TechNet Support WebCast: I/O performance problems and resolutions in Microsoft SQL Server 2000 by David G. Brown TechNet Webcast: SQL Server 2005 Series (Part 7 of 10): Indexing Enhancements (Level 200) By Bryan Von Axelson

Summary!

Finally, check out the Indexes category on my blog – for a variety of Q&A postings, other resources and lots of great links! The rest of the technical Q&A, I should be able to post soon. Stay tuned! Thanks,kt

MSDN Webcast: A Primer to Proper SQL Server Development
Designing Tables that Scale, Best Practices in Data Types and Initial Table Structures, Part 3 of 10

Presented by Kimberly L. Tripp, SQLskills.com

Q: Can I view a recording of this webcast? Here’s the specific replay link.

Q: Where can we get the demo scripts? The demo scripts are in this zip (20050819MSDNScripts.zip (3.67 KB)); here in this blog entry. However, at the series completion, I will also create an entry under Past Event Resources for the entire webcast series.

Q: Does Kimberly L. Tripp have a blog? Yes, www.SQLskills.com/blogs/Kimberly and if you’re reading this on my blog…well, then you already know! J

Q: Where are the links to all prior Webcast Q&As from this series?

Part 1: Creating a Recoverable Database
For the MSDN Download for Part 1, click here.
For the SQLskills Blog Entry for Part 1, click here.

Part 2: Creating a Reliable and Automated Backup Strategy
For the MSDN Download for Part 2, click here.
For the SQLskills Blog Entry for Part 2, click here.

Q: Is this Session 2 of the "Effectively Designing a Scalable and Reliable Database" webcast series? The Title bar here reads "MSDN Webcast: MSDN: A Primer to Proper SQL Server Development (Part...)". A bit misleading don't you think? So, I finally figured out the mystery here. When I originally submitted my proposal for this series, I titled it like this: Effectively Designing a Scalable and Reliable DatabaseA Database Developer’s Primer to Proper SQL Server Development (Level 200-300) But somehow, through a few edits and since each module also has a more detailed title shown in this following list, then I think the title just got edited down to fit. In fact, with subtitles like these, I can barely fit the entire title, subtitle and session title on a single slide. J

Part 1: Creating a Recoverable Database

Part 2: Creating a Reliable and Automated Backup Strategy

Part 3: Designing Tables that Scale – Best Practices in Data Types and Initial Table Structures

So, the mystery ends here… It’s all the same content and the general idea is that building a scalable and reliable database takes a variety of best practices that together make it happen; there is no “DBCC MAKEITGOFASTER(‘TRUE’)” option. I'll unveal a new title slide in part 4... oh, the excitement!

Q: How can I replay previous sessions? I thought we were going to get emails for replaying -- but I haven't received any replay emails. You will receive replay emails ONLY when you register for these sessions through MSDN. We’ve come to find out that there are other ways to register but it’s only through MSDN that we know for sure you will receive the replay information. Regardless, you can always find the “on-demand” version of the sessions here.

Q: What does CLR stand for? Common Language Runtime. Quick WHATIS.COM Definition here.

Q: If something uses sp_OA* now in SQL 2000, am I guaranteed it will use SQLCLR in SQL 2005? Not automatically; however, you should consider rewriting you sp_OA* is used to automate COM objects; these, by definition, are not framework objects. Moving to SQL Server 2005, you can continue to run with your previous sp_OA procedures to access COM objects but if you can write SAFE .NET Frameworks code to do the same thing then you should convert your sp_OA* code where possible. Not everything can be converted but the basic principle is that which you use sp_OA* is a good thing to target for conversion. For some great details on the differences and how to move forward – go here. Also helpful is this KB Article regarding COM and .NET in SQL Server 2000 Using extended stored procedures or SP_OA stored procedures to load CLR in SQL Server is not supported

Q: I loved how XML data is available interactively in SQL Server Management Studio (SSMS); can HTML data columns be displayed in the same way? Yes, if you store HTML in a column of type XML then the column data will appear as a link and when clicked, you can edit it in the XML Editor window in SSMS.

Q: Is the "XML" datatype really a "varchar(max)"? No. In previous releases the XML type really was just a “blob” stored in the database. There were no inherent optimizations and you were not able to define a scheme for an XML column. In SQL Server 2005, XML data is stored natively and offers the ability to put indexes on the columns as well as better access and manipulation. You *can* store XML as varchar(max) but it will be harder to access/code. There’s a very nice description of the tradeoffs in this whitepaper:XML Options in Microsoft SQL Server 2005 

Q: What is the difference between numeric and decimal? In SQL Server they are synonyms. I seem to remember hearing that there was a difference in the standards…somewhere but with a few references and looking around I can’t find any. Regardless, I’d make sure that you standardize on one or the other – just in case they do change, at least all of your data is consistent.

Q: I use decimal datatype for money values. Why it may be suboptimal to use decimal (instead of money datatype)? I wouldn’t say that it’s suboptimal – except maybe in storage length. However, decimal can be more precise.

Q: When should you use a float? A number of my clients use Float – i.e. Banks, Stocks, etc. What would be your argument against this? Just the lack of an absolute minimum precision/scale.

Q: We have a database where datatypes are set. But to use a simple percentage calculation we had to add .0 to the calculation to get the correct percentage. For example 100(5/9) is truncated to 0, but 100((5+.0)/9) yields correct result. Is there any standard procedure without massive conversions to assure a correct result? It’s a matter of data type. In the case of 100, 5 and 9 – these are all integers – without any decimal place. So – the division of 5 divided by 9 yields 0. However, 5 divided by 9.0 or 5.0/9 – both have a decimal and in that case the other data types are implicitly converted to the higher/more precise data value. A cool way to find out what the actual “base type” and/or the level of precision/scale is to ask these questions as if the data is of type sql_variant:

SELECT SQL_VARIANT_PROPERTY(5/9.0, 'BaseType')
SELECT SQL_VARIANT_PROPERTY(5/9.0, 'Precision')
SELECT SQL_VARIANT_PROPERTY(5/9.0, 'Scale')   -- numeric(8,6)
SELECT SQL_VARIANT_PROPERTY(5/9, 'BaseType')
SELECT SQL_VARIANT_PROPERTY(5/9, 'Precision')
SELECT SQL_VARIANT_PROPERTY(5/9, 'Scale')   -- int

Q: I am surprised by the nchar/nvarchar suggestions. I've always been told to NOT use these data types. Can you give the "reasoning" for this suggestion here? The primary reason that you want to use Unicode data is to avoid a cost in “translation” between the client and the server – when you’re using web-based applications that natively use Unicode. Additionally, with ever growing customer bases – you can more effectively store international data properly by supporting Unicode in the client and in the database. You are correct that these take 2 bytes per character rather than one – but that’s what gives them the larger range of characters to use/display/manipulate. Also, if you plan to use the SQLCLR against these columns, you will need to use Unicode as opposed to ASCii as the SQLCLR requires it.

Q: Does Unicode use twice as much space? Yes, Unicode columns are 2 bytes per character. If you defined a column of type nchar(100), that column will require 200 bytes of storage.

Q: Sometimes, and not always, it becomes a tradeoff in Storage Space v. Performance, doesn’t it? That’s an excellent point. You always need to look at the trade-offs associated with any decision and in fact, this is yet another. So, I always want the row size to be as optimal and as reasonable as possible. However, there are many compelling reasons – international and/or web-based applications as well as SQLCLR functionality.

Q: Is the N' keyword usable in MSSQL 2000? Yes, all Unicode character strings should be preceded with an upper case N in all releases where Unicode is supported (SQL Server 7.0 was the first release that supported Unicode).

Q: Is there any relationship between the OS's code page and nvarchar/varchar? Yes. When SQL Server is installed- it chooses the server’s default code page from the Operating System. A code page (or collation) defines the order in which characters are evaluated. The additional attributes a collation can define are: whether or not the characters are case-sensitive; the sensitivity of width and kana-type. This may not seem obvious but if these characters sets are native to you then the order to the data would probably make more sense.  Once the collation is set for the server, the database inherits the server’s collation by default, when tables are created – each character-based column inherits the database’s default collation by default. However, at ANY of these levels the collation can be changed. There are benefits to doing so – in that you can keep data within a specific table ordered for a specific character set. Wow, now that I’m getting started, this could be a session in and of itself. So, just to give you a few tips:

  • Be careful using temporary tables if your database’s collation is different than your server’s collation.
  • Changing collation on the fly (for example, if you want to do case-sensitive searching through a view – of which I’ve included a sample file) can be expensive.
  • Changing a database’s collation is supported but you might have troubles converting from a case-sensitive to a case insensitive database.

The files that can help you to get more familiar with verifying collations, changing collations, etc. are CaseSensitiveSearchingInView.sql and ChangingDatabaseCollation.sql.

Q: Another great use of adjunct tables is to isolate culture-specific data from the base (and usually indexed) data. So, this is a great point. If you change the collation for a column – and index it – then you can get better performance when accessing that column. Changing a collation on the fly (as in the script sample for the previous questions), is what’s expensive. However, you should make sure to do a lot of testing.

Q: I have a US code page OS but Latin general 850 collation set for SQL Server; I have no problems storing and retrieving data. Exactly! That’s the benefit of collation information being stored with each column. It allows the application to store many types of data – natively and then just deal with their display at the client.  For those of you who want more details on how to do this, see COLLATE in the books online.

Q: Somewhere along the line I was always told to define varchars in multiples of 8 characters. Something to do with space allocated with new rows.  Any ideas if it has any premise? Well, my first response to this was that I’ve never heard of this… So, I thought I’d check around and well, the couple of folks I’ve chatted with haven’t heard this either… IF anyone has – feel free to comment on this blog.

Q: How about images? Is it better to store them in the db or just the url? This is really an old debate. I still think there’s validity to both sides of it as well. I’d say that you should go through the pro’s and con’s and really choose what’s best for your application. If you use a lot of urls then you might be able to reference more without having to manage a copy of that data… If the url is your data then you might find that managing it in the database (from a backup/restore perspective) is easier – and you know that it’s always going to be there v. the url occasionally not working. Wow, I’d almost like to get a panel of folks together to have everyone hash out their opinions on this one. J SQL Server 2005 does make managing images easier but there are still points to both arguments!

Q: How would SQL Server support becoming a data server of video file? The only real format for this data would be binary. However, there is a 2GB limitation to a single value.

Q: Are you aware of a stored procedure to handle blob data effectively? i.e. for image storage? No. But there is an executable that comes with SQL Server 2000’s samples and I think it’s called textcopy.exe. That might be some good sample code to review.

Q: Does SQL Server support the SQL 2003 NULLS FIRST and NULLS LAST keywords of the ORDER BY clause? No support for NULLS FIRST or NULLS LAST but the order by can include ASC for ASCending or DESC for DESCending. NULL values are treated as the lowest value so NULLS FIRST would be ASC and NULLS LAST would be DESC. But – that’s the only support for specific placement of NULL values.

Q: Is there a "best practice" for the new C# nullable types? i.e. int? vs. int -- re:MSSQL2005? When programming with the SQLCLR, always use types that allows NULLs (assuming your data can contain nulls) and then make sure to always follow the appropriate behavior with NULL. For example, concatenating NULL should yield NULL.

Q: Can I change (refactor) the underlying type of a UDT (e.g., from int to bigint) in SQL Server 2005?  Unless I've missed something, I can't do it in SQL Server 2000 unless I drop the type from all uses. No, as the user-defined data type (UDDT) is only used at create time to map to the datatype; the connection is then broken (altering the UDDT definition will only impact new tables and variable definitions). User-defined Types (UDTs) are even more schema bound than UDDTs and it's a much tougher job to alter them (you need to write conversion from the old to the new). This is much more complex since they fall outside the conversion matrix and the only thing SQL knows is the binary representation. So if the data type was smart enough to add a signature to the serialized data, so that the next version can understand which data type is represents and determine if it can convert to it you might have a shot, otherwise you have to add a column, and convert columns and drop a column afterwards.

Q: Can I easily replace the SQL Server 2000 UDT with a SQL Server 2005 CLR Custom Type? I have a client who wants to -- eventually -- migrate to SQL Server 2005, but for the immediate future wants me to build their application with SQL Server 2000. I'm very interested in SQLCLR types, especially WRT to maintaining type consistency across T-SQL, CLR SP's and external data objects, so using CLR custom types. This is a typical thought for the SQLCLR – “I am going to build an object database.” Type fidelity between the client and the server exists today, but they want class fidelity, so they are going to create the customer object UDT, which is not really for what the SQLCLR was intended. You should be thinking scalar types with additional custom characteristics – not object types – as the scalar types will give you optimal and predictable performance.

Q: Can you find ANSI_NULL_DFLT_OFF setting as on or off? Does DBCC USEROPTIONS do that? Yes, DBCC USEROPTIONS is a nice, quick way of seeing the currently set LIST of session settings. However, there are better ways to check these settings programmatically. it’s not as nice to check it programmatically  The files that can help you to get more familiar with ANSI Null Issues is ANSINullIssues.sql.

Q: It would be great if you could show us a list of  classic fields "Account No", "Account Balance", "Description" etc… and tell us what type you would use inc. "Null/Not Null" etc. Well, this is a bit subjective and with a bank account there would probably be a lot of business rules that surround the account number. However, if this were a sales table and you needed an ordered then it would be a bit easier (again, business rules might change this). Typically, I’d go for a Primary Key which is simple and narrow and not necessarily a “natural” key. However, some of this reasoning comes from how tables are joined and how SQL Server internally stores data. Account Balance – I’d go with a precise numeric. Based on length of storage and the business rules behind required precision, you might choose a numeric(19,6). This will require 9 bytes of storage. For description, I would definitely go with a variable type – probably nvarchar(200) depending on what description really means. Here’s how the table would probably look on creation:

CREATE TABLE Account(
AccountID     int           IDENTITY  NOT NULL CONSTRAINT AccountPK          PRIMARY KEY,
Balance       numeric(19,6)           NOT NULL CONSTRAINT AccountBalanceDflt DEFAULT 0,
Description   nvarchar(200)           NULL)
go
INSERT Account (Description) VALUES ('The account of Kimberly Tripp.')
PRINT 'Kimberly Tripp''s account number is ' + convert(varchar(10), @@IDENTITY) + '.'
go

Q: In the Q&A you might want to tell your listeners that in SQL the string 'abc' is equal to 'abc ' when compared – unlike for C, C++ ,C# , VBNET, etc. This will help with the char v. varchar issue… It’s a good point! You should make sure that you do some testing to make sure that data behaves the way you expect. Trailing blanks may be trimmed with some data types and not others and once again, ANSI settings may impact this behavior.

Q: What are the NULL block and varchar block in the row layout? The NULL block helps to store and identify rows that are set to NULL. In the past, SQL Server stored this as a single space in a varchar column ad there was no real way to differentiate between a real space (as a value) from a NULL value (stored as a space). When the storage engine was re-architected for SQL Server 7.0 the NULL block was added to store a NULL value specifically.

Q: How would the table structures differ between the 32-bit and 64-bit in SQL Server 2005 and SQL Server 2000? They don’t! There is no difference in the on-disk format between any of the versions of SQL Server. In fact, this was even true of the MIPS/Alpha/Win32 releases when SQL Server 6.0 released on all of those platforms.

Q: do you have a script to calculate rows per pages (using the formula you showed? Ha! Seemed a bit painful I know… But that formula works.But, yes – it’s a heck of a lot easier to just do this yourself if you have a table already. In SQL Server 2000 you need to use DBCC SHOWCONTIG with TABLERESULTS and in SQL Server 2005, you use a new DMV called sys.dm_db_index_physical_stats. This takes 5 parameters and has some really nice new features to it.

Q: I might have missed this, but is the SQL Server Management Studio (SSMS) the replacement for Enterprise Manager? Yes, SSMS replaces the Enterprise Manager and the Query Analyzer. And, for a more scaled down query tool, you can use Express Manager (XM). To download XM, you can go here  If you want more details regarding the latest SQL Server CTP, you can go here.

Q: Is Management Studio backward compatible for use with SQL Server 2000? Yes! You can use SSMS to connect to SQL Server 2005 and SQL Server 2000.

Q: How do we write our queries for Insert/Update/Select on a database that we have no idea how it was setup - so we don’t experience the local issue that you talked about? Basically, the best way is for the developer to create an interface that masks the complexities of the database schema through stored procedures, functions and views.

Q: Do indexes point to the row level or the page level? There are two primary parts to an index – the leaf level and the non-leaf level. The leaf-level of a clustered index IS the data (a clustered index is an ordered table).

Q: What is the best practice or rule of thumb on using identity primary keys on tables in a database? And would that be all tables and/or some tables or not all? This is a great question. However, I might phrase it a bit differently because I *think* I know where you’re going with this… Generally, I recommend that large tables have a clustered index on an identity column. This can help to improve performance of inserts by not causing splits (inserts go to the end) and by helping non-clustered indexes look up the corresponding rows in the clustered table by using a clustering key which is unique, narrow and static. This is really quite a bit related to the session on indexing… I think you should definitely attend session!

Q: How could a LOB end up in the leaf level? This is really another great question for the indexing session – and a bit on internals as well. In SQL Server 2005, rows can span pages (so you can have a row greater than 8060 bytes) and rows can have LOB types (even in the leaf level of an index). I’ll explain how this can happen with the new INCLUDE option for indexes – in the next session!

Q: Are you talking at the Vegas launch of 2005? Yes! You can see the complete list of my upcoming events here.

Comment: By the way, thanks for the great Web cast. I don't know of many other companies that do this sort of thing -- take questions directly from their customers.

Thanks! This is part of what I love about these webcasts… it allows me to hear what you’re interested in and it allows me to keep learning. To be honest, often I have to look things up and/or hit the KB and/or ask some of the other SQL geeks I know – in order to answer some of these questions. It really helps to keep me on my toes! In fact, this week I bothered two of my favorite people, my partner Patrick and my good friend Gert.

 

Thanks again for another GREAT week of questions!

kt

See you later today – Friday, Aug 26th

MSDN Webcast: A Primer to Proper SQL Server Development (Part 4 of 10): Best Practices in Indexing (Level 200) 

Two days isn't enough time!

In fact, that's really true. There are so many interesting (well, I think they're interesting) aspects to Indexing that we could in fact do a whole 10-part Series on Indexing and it's likely that we'd still have other things to look at and talk about. So, in preparation for part 4 of my MSDN Webcast Series, I thought I'd take a few minutes to pull together a number of related resources. If you watch/read them prior to the webcast then you'll be a step ahead. To minimize overlap, part 4 is a myriad of best practices with more of the "what's" and "how's" to implement the right indexes - but not quite as much of the theory and/or the "why's" as I've done quite a bit of that before and even if you don't watch/read these resources BEFORE the session...you can use these resources to fill in the gaps and answers the "why's" AFTER the session!

So, let's talk about the session... I'm going to focus on SQL Server 2005 for all of my demos (and I'm going to try to be VERY demo heavy in this one) but quite a bit of the theory also applies to SQL Server 2000...

If you're interested in some of the "why's" and/or you want to hear the information for SQL Server 2000 and/or you just want to see what other folks have asked, here are your options:

Watch the MSDN Webcast titled: SQL Server 2000: Performance Tuning - Finding the Right Balance of Indexes which covers Index Creation Best Practices on SQL Server 2000
Watch the TechNet "It's Showtime" recording, recorded at Tech Ed Europe titled: Index Creation Best Practices with SQL Server 2005 which covers Index Creation Best Practices on SQL Server 2005 and really gets into a lot of the "why's"

Read many of the past blog entries - including A LOT of Q&As created from former webcasts by accessing my "Indexes" category on my blog entries here.

So, if you're not scared to spend quite a bit of time with indexes... I think there's lots of good stuff to review! At a minimum you should think about showing up tomorrow, I promise to give you Practical Indexing Strategies and a lot of demos to prove my point!

Effectively Designing a Scalable and Reliable Database
A Database Developer’s Primer to Proper SQL Server Development
Part 4: Best Practices in Indexing

Attend Part 4 LIVE on 26 August 2005 by registering at this link!

Many of you in the .NET Community are already familar with theServerSide.NET but many of you in the SQL Server community are not...well, here's your chance to bridge the gap over to .NET and hear a bunch of great .NET interviews! My interview is more SQL-centric but there are lots of great ones on Development in general and even one on SQL Server 2005 Reporting Services by Jason Carlson.

My interview was recorded with them back in February when I was speaking at VSLive in San Francisco and they've been saving them up, editing them and then posting an interview roughly each week (mine is the 29th posted!). The interview was great fun and Paul Ballard asked some really interesting questions about tuning, indexes, stored procedures and SQL Server 2005. Mostly it's an interview about optimization tips in both SQL Server 2000 and 2005 and just in general - things to look forward to in SQL Server 2005.

Here's a link to all of theServerSide.NET interviews: http://www.theserverside.net/talks/index.tss 
Here's a link specifically to mine: http://www.theserverside.net/talks/videos/KimberlyTripp/interview.tss?bandwidth=dsl
Here's the link to Jason Carlson's Reporting Services interview: http://www.theserverside.net/talks/videos/JasonCarlson/interview.tss?bandwidth=dsl and finally, here's a TechNet Webcast that Jason did titled: Authoring Reports in SQL Server 2000 Reporting Services (Level 200)

Enjoy!

As I'm preparing for my Tech*Ed session on Indexing Best Practices in SQL Server 2005, I'm reminded that there are a ton of best practices that really apply to both SQL Server 2000 as well as SQL Server 2005. When it comes to indexing, there are many dependencies on the storage structures. These dependencies are the basis for why I recommend a very specific type of clustering key - for all versions of SQL Server, 7.0 and higher!

I'm going to start with my recommendation for the Clustering Key - for a couple of reasons. First, it's an easy decision to make and second, making this decision early helps to proactively prevent some types of fragmentation. If you can prevent certain types of base-table fragmentation then you can minimize some maintenance activities (some of which, in SQL Server 2000 AND less of which, in SQL Server 2005) require that your table be offline. OK, I'll get to the rebuild stuff later.....

Let's start with the key things that I look for in a clustering key:

  • Unique
  • Narrow
  • Static

Why Unique?
A clustering key should be unique because a clustering key (when one exists) is used as the lookup key from all non-clustered indexes. Take for example an index in the back of a book - if you need to find the data that an index entry points to - that entry (the index entry) must be unique otherwise, which index entry would be the one you're looking for? So, when you create the clustered index - it must be unique. But, SQL Server doesn't require that your clustering key is created on a unique column. You can create it on any column(s) you'd like. Internally, if the clustering key is not unique then SQL Server will “uniquify” it by adding a 4-byte integer to the data. So if the clustered index is created on something which is not unique then not only is there additional overhead at index creation, there's wasted disk space, additional costs on INSERTs and UPDATEs, and in SQL Server 2000, there's an added cost on a clustereD index rebuild (which because of the poor choice for the clustering key is now more likely).

Why Narrow?
A clustering key should be narrow for some of the same reasons it should be unique. If the clustering key is used as the lookup key from all non-clustered indexes, then the clustering key is duplicated in all non-clustered indexes. If the clustering key is really wide, then all of the non-clustered indexes will be [unnecessarily] wide. This will waste disk space, create additional costs on INSERTs and UPDATEs, and require more time (because of size) when rebuilding these index structures. So, what does narrow mean - as few bytes as possible to help uniquely define your rows. A narrow numeric when possible.

Why Static?
A clustering key should be static for some of the same reasons it should be unique and narrow. If the clustering key is used as the lookup key from all non-clustered indexes, then the clustering key is duplicated in all non-clustered indexes. In fact, for a given table the clustering key will be the most duplicated data. If this data changes then they'll need to update the value in the base table as well as in EVERY non-clustered index. And, if the key changes, it will cause the record to move. When a record moves, it creates fragmentation. This will waste disk space, create additional costs on INSERTs and UPDATEs, and require more time (because of record relocation and [the likely] subsequent splits) and require more maintenance. 

OK, so it sounds like I want a narrow, unique and static value... What about a guid?
Typically, I recommend a numeric IDENTITY column as the clustering key but I always get this question. In fact, I often wait to see how long it's going to take before I get this question ;). Anyway, a guid does meet the criteria fairly well - it's certainly unique, it's usually static and it's relatively narrow. So, what's wrong with it? In SQL Server 2000, the guid function (newid()) is built using a value that does not create an ever increasing pattern (an IDENTITY column would). But wait, I didn't say that you needed to have an ever-increasing pattern.....

OK, so the final criteria I look for in a clustering key is: an ever-increasing pattern!
If the clustering key is ever-increasing then new rows have a specific location where they can be placed. If that location is at the end of the table then the new row needs space allocated to it but it doesn't have to make space in the middle of the table. If a row is inserted to a location that doesn't have any room then room needs to be made (e.g. you insert based on last name then as rows come in space will need to be made where that name should be placed). If room needs to be made, it's made by SQL Server doing something called a split. Splits in SQL Server are 50/50 splits - simply put - 50% of the data stays and 50% of the data is moved. This keeps the index logically intact (the lowest level of an index - called the leaf level - is a douly-linked list) but not physically intact. When an index has a lot of splits then the index is said to be fragmented. Good examples of an index that is ever-increasing are IDENTITY columns (and they're also naturally unique, natural static and naturally narrow) or something that follows as many of these things as possible - like a datetime column (or since that's NOT very likely to be unique by itself datetime, identity). But wait, what about that a guid.

Well, in SQL Server 2000 the only SQL Server function for guids is newid - that does not create an ever increasing pattern. In SQL Server 2005, you can use a new guid function called newsequentialid() to populate your uniqueidentifier column. Here's an example of how you can use it:

CREATE TABLE Test
(
TestID uniqueidentifier CONSTRAINT Test_TestID_Default DEFAULT newsequentialid(),
Inserted datetime CONSTRAINT Test_Inserted_Default DEFAULT getdate()
)
go

INSERT Test DEFAULT VALUES
go

SELECT * FROM Test
go

Is there a way to create a sequential guid in SQL Server 2000?
YES, use Gert Drapers wrote an xp to generate sequential guids! Check it out here. He just published this recently (May 2005) and it's a good change for you in SQL Server 2000 databases/applications.
Full title with link: XPGUID.DLL - Sequential GUID generation and GUID helper functions XP

OK, so I've tackled a few things here today and I've answered a few questions related to indexes before. Check out my Indexes Category of blog entries here and if you're at Tech*Ed today, I hope to see you at 5PM.

Thanks for reading,
kt

The title is a common question I've received in the past and I thought I'd take a few minutes to explain a bit about keys and indexes...

This is by no means a lot of detail regarding relational theory, etc. but there are a few things that we should quickly review to make sure that the basis for indexes being created makes sense. First, from a relational theory perspective every table must have a primary key. From SQL Server's perspective it's not a requirement but it's generally a good idea. A primary/unique key are entity identifiers. Each are a unique way of identifying a row. There are subtle differences between the two - in implementation:

Primary Key

  • In SQL Server the Primary Key is enforced through a Primary Key Constraint.
  • None of the columns that make up the primary key allow nulls.
  • The values in the Primary Key must be unique - to enforce uniqueness (as well as make it efficient), SQL Server creates a unique clustered [composite] index on the column(s) that make up the key.

Unique Key

  • In SQL Server the Primary Key is enforced through a Unique Key Constraint.
  • The columns that make up the unique key CAN allow nulls but not for more than one complete key. Meaning that allowing Nulls on a single column unique key really only allows NULL (only one NULL value). Overall, allowing Nulls in a column (like Social Security Number) doesn't really make much sense but if you need to then you can't go with a unique constraint - instead consider a unique index. At that point, I typically get the question of what's the difference between a unique key and a unique index...
  • Allowing Nulls values in the columns that make up a composite unique key makes more sense as long as the complete key is not null for more than one row.
  • The values in the unique Key must be unique and to enforce uniqueness (as well as make it efficient), SQL Server creates a unique NON-clustered [composite] index on the column(s) that make up the key.

What's the difference between a unique index and a unique constraint?

  • A unique key CAN be referenced by a foreign key constraint and a column which has only a unique index cannot be referenced.
  • Constraints are checked before indexes and this can lead to a large multi-row insert/select or update to fail before modification. However, indexes might (for a large modification) be validated at the end (instead of row by row) so a large modification that has a failure will need to rollback at the end of the modification rather than before. This is a good point - and one I hadn't really thought of until I was poking around some of the Q&A on SQLMag's website. Here's where I found it.

So, all of this leads me up to the original question (yes, you knew I get here someday :) and that's “When did SQL Server stop putting indexes on Foreign Key columns?”

First, SQL Server has NEVER put an index on a foreign key column... Indexes are used (as described above) to make the lookup (in a primary or unique key) for a duplicate value FAST. If the keys are ordered then checking to see if one already exists is trivial (i.e. fast). There is NO reason for SQL Server to put an index on a foreign key column as the column does not (and probably would never be) unique (if it is then it's likely to have a primary or unique key on it as well - as in a 1-1 relationship). So, that leads me to another key point...

Are there any benefits to indexing foreign key columns? YES

  • Better performance on maintaining the relationship on a delete of a primary/unique key. When you delete a key row, SQL Server must check to see if there are any rows which reference the row being deleted.
    • If the foreign key relationship is defined with NO ACTION (on update/delete) then a referenced row CANNOT be deleted as it would leave the referencing rows “orphaned.” To find the rows efficiently an index on the foreign key column helps!
    • If the foreign key relationship is defined with CASCADE (on update/delete) then when a referenced row is modified all of the referencing rows must be modified as well (either updated to reflect the new value or on cascade delete). To find the rows to modify efficiently, an index on the foreign key column helps!
  • Better join performance - for many of the reasons above, SQL Server can more effectively find the rows to join to when tables are joined on primary/foreign key relationships. However, this is NOT always the “best” indexing choice for joins but it is a good start. 

Finally, if you want a few titles related to relational theory check out these links:

An Introduction to Database Systems, Eighth Edition
     by C.J. Date
     E.F. Codd

Database Design for Mere Mortals: A Hands-On Guide to Relational Database Design
     Mike Hernandez

Have fun!
kt

Categories:
Design | Indexes | Resources | Tips

People always ask me how I keep up to date with SQL Server and how I could possibly know so many intricate details about SQL Server… Well, I test/play/figure it out until it’s obvious (painfully sometimes and sometimes it’s not obvious). So, this blog entry is based on a specific question that made me want to see if I could come up with a test and positive proof in patterns and performance – when working with large tables and range deletes. I generally recommend a clustered index be created on a unique, narrow and static column – and best if it’s an identity column – yet there is an exception to this when the table has VERY specific administrative/operational/archival patterns a.k.a. the “sliding window” scenario. The question starts by asking about my point that “deletes aren’t as big of a deal” in fragmentation. While my point was solely referring to the internal v. external fragmentation issue that led us to take some more time evaluating delete performance. In talking about delete performance, we arrive at some best practices in index creation as well as horizontal partitioning! (fyi – the scripts used to create/build this scenario are fairly cool in and of themselves – not a bad idea to spend some time reviewing that!)

 

Q: You mention that deletes are not as “big of a deal” since they leave gaps. I had a situation where I had a table with around 50 million rows, about a million rows per day. After about a month and a half, I delete some old ones. I can delete a day's worth of data (1 million rows) in about 2 minutes with a delete statement, if there are no indexes. If I have a clustered index based on the date, it took about 22 minutes. Inserts were instantaneous, about two pages worth of rows every second, and even updates were quicker. Is there anything obvious that would cause this?

OK, so I need to clarify first. My comment that deletes aren't as “big of an issue” is specifically regarding fragmentation (not really performance of the actual delete)... A delete (when it’s a range delete – as opposed to a lot of singleton deletes) typically de-allocates the entire page freeing it back to the table (remember, it’s still “reserved” by this table) or if the entire extent is freed then it goes back to the DB. However, remember, this is only when the range is contiguous or grouped, per se. So, looking solely at the result and impact to the base table – we would say that range deletes aren’t that big of a deal because they free entire pages and even entire extents – effectively, the table doesn’t have gaps. OK, so having said that, you make a good point: what about performance?

 

First, I’m somewhat intrigued by the fact that you say a delete was faster with no indexes rather than with a clustered especially if the clustered was date BUT I have a feeling that there might have been nonclustered indexes as well? (In the upcoming test matrix, you may see some patterns in timing that corroborate this idea).  Without knowing whether you had non-clustered indexes or not, let me at least speculate and test (and hopefully prove) a few points? First, I would ask what the exact definition of the CL index was – was it just date? Remember, I really feel as though there are three types of tables – tables with the wrong clustered index, tables with NO clustered index (i.e. HEAP) and tables with the right clustered index. However, the “right” clustered index CAN vary based on pattern usage of the table. In a table that has range deletes based on dates – as well as an identity column – I often create a clustered index on the composite combination of these two columns. If the CL Key is on the date itself then this should not be that much slower (that the heap – in fact, I would have expected it to be faster even though there will be some overhead in maintaining the index). I really think there’s something else there – probably nonclustered indexes, maybe a blocking scenario? So, to help – I think I can prove a few points (in terms of perf) with a script/test to see if I can give you some rough numbers (I’m really, really intrigued to see if I can’t prove my point as well J). So, I decided to try an interesting combination – and really test the numbers. My test (albeit not entirely scientific) should show a pattern! First, I’m going to take the 1.6 million row charge table (in my “pumped up version of the credit database) and I’m going to copy it 10 times into slightly different table structures. Details and the creation for each table follow:

 

ChargeHeap – A table with NO Clustered index and NO non-clustered indexes.

SELECT *

INTO ChargeHeap

FROM Credit.dbo.Charge

 

ChargeCLPKForDelete – Is really NOT that bad of a choice… I typically recommend this choice for everyday tables where there are lots of inserts/updates/deletes. However, with the specific “sliding window” scenario (where data comes in at the end AND is periodically archived off the other end) then I might make a different choice. But – still a good table to create and compare against. This will certainly improve inserts and minimize fragmentation BUT it might not be ideal for range deletes and archiving. This table has a clustered index on the Primary Key; this column is an ever-increasing identity value on charge_no.

 

SELECT *

INTO ChargeCLPKForDelete

FROM Credit.dbo.Charge

go

 

ALTER TABLE ChargeTesting.dbo.ChargeCLPKForDelete

ADD CONSTRAINT ChargeCLPKForDeletePK

          PRIMARY KEY CLUSTERED (Charge_NO)

go

 

Again, because this table will follow a “sliding window” scenario then I would change the clustering key to include the date and then the ID. I don’t typically chose date (as a single column primary key) because dates alone are rarely, if ever, unique (for more details on datetime data check out the articles I’ve written on Datetime data for SQL Magazine here). Because you already have an identity column in the table (which uniquely identfies every row), there’s no need to add a “uniqifier.”  The process of “uniqification” happens internally if the clustereing key is defined on a non-unique column (see last month’s webcast if you’re not familiar with this “uniqification” process). In the case where you already have a row identifier, it’s best to use it in the clustering key with date – especially as it has useful data in it. So – this leads us to the next TWO tables…

 

ChargeCLDateForDelete – Is a clustered index on the date ALONE.

 

SELECT *

INTO ChargeCLDateForDelete

FROM Credit.dbo.Charge

go

 

CREATE CLUSTERED INDEX ChargeCLDateForDelete

ON ChargeTesting.dbo.ChargeCLDateForDelete (Charge_DT)

go

 

ChargeCLDateForDeleteWithCompPK – Is a composite clustered primary key index on the date, charge_no together.

 

SELECT *

INTO ChargeCLDateForDeleteWithCompPK

FROM Credit.dbo.Charge

go

 

ALTER TABLE ChargeTesting.dbo.ChargeCLDateForDeleteWithCompPK

ADD CONSTRAINT ChargeCLDateForDeleteWithCompPK_PK

          PRIMARY KEY CLUSTERED (Charge_DT, Charge_NO)

go

 

The last base table will have a poor choice for the clustering key (for these deletes) as they will not have “ranges” with which to work.

 

ChargeCLReallyBadForDelete – This clustered index will be on something NOT used at all in this delete. The deletes will result in lots of internal fragmentation. In other words, the table will be larger and “internally” fragmented. The table will essentially look like “swiss cheese” meaning that the table will have lot of little gaps within the pages themselves.

 

SELECT *

INTO ChargeCLReallyBadForDelete

FROM Credit.dbo.Charge

go

 

CREATE CLUSTERED INDEX ChargeCLReallyBadForDelete

ON ChargeTesting.dbo.ChargeCLReallyBadForDelete (Charge_amt)

go

 

And what about nonclustered indexes? Finally, we will add five final tables – each of the above – with two non-clustered indexes added to see what their impact has on overall performance.

 

ChargeHeapWNCIndexes – Is a heap table with two non-clustered indexes.

 

SELECT *

INTO ChargeHeapWNCIndexes

FROM Credit.dbo.Charge

go

 

CREATE NONCLUSTERED INDEX ChargeHeapWNCIndexesNC2

ON ChargeHeapWNCIndexes (Statement_NO)

go

 

CREATE NONCLUSTERED INDEX ChargeHeapWNCIndexesNC3

ON ChargeHeapWNCIndexes (Member_no)

go

 

ChargeCLPKForDeleteWNCIndexes – Is clustered on the Primary Key and has two additional non-clustered indexes.

 

SELECT *

INTO ChargeCLBadForDeleteWNCIndexes

FROM Credit.dbo.Charge

go

 

ALTER TABLE ChargeTesting.dbo.ChargeCLBadForDeleteWNCIndexes

ADD CONSTRAINT ChargeCLBadForDeleteWNCIndexesPK

          PRIMARY KEY CLUSTERED (Charge_NO)

go

 

CREATE NONCLUSTERED INDEX ChargeCLBadForDeleteWNCIndexesNC2

ON ChargeCLBadForDeleteWNCIndexes (Statement_NO)

go

 

CREATE NONCLUSTERED INDEX ChargeCLBadForDeleteWNCIndexesNC3

ON ChargeCLBadForDeleteWNCIndexes (Member_no)

go

 

ChargeCLDateForDeleteWNCIndexes – Is clustered on the Date alone and has two additional non-clustered indexes.

 

SELECT *

INTO ChargeCLDateForDeleteWNCIndexes

FROM Credit.dbo.Charge

go

 

CREATE CLUSTERED INDEX ChargeCLDateForDeleteWNCIndexes

ON ChargeTesting.dbo.ChargeCLDateForDeleteWNCIndexes (Charge_DT)

go

 

CREATE NONCLUSTERED INDEX ChargeCLDateForDeleteWNCIndexesNC2

ON ChargeCLDateForDeleteWNCIndexes (Statement_NO)

go

 

CREATE NONCLUSTERED INDEX ChargeCLDateForDeleteWNCIndexesNC3

ON ChargeCLDateForDeleteWNCIndexes (Member_no)

go

 

ChargeCLDateForDeleteWithCompPKWNCIndexes – Is the composite primary key on the date and identity. This table also has the same two non-clustered indexes.

 

SELECT *

INTO ChargeCLDateForDeleteWithCompPKWNCIndexes

FROM Credit.dbo.Charge

go

 

ALTER TABLE ChargeTesting.dbo.ChargeCLDateForDeleteWithCompPKWNCIndexes

ADD CONSTRAINT ChargeCLDateForDeleteWithCompPKWNCIndexes_PK

          PRIMARY KEY CLUSTERED (Charge_DT, Charge_NO)

go

 

CREATE NONCLUSTERED INDEX ChargeCLDateForDeleteWithCompPKWNCIndexesNC2

ON ChargeCLDateForDeleteWithCompPKWNCIndexes (Statement_NO)

go

 

CREATE NONCLUSTERED INDEX ChargeCLDateForDeleteWithCompPKWNCIndexesNC3

ON ChargeCLDateForDeleteWithCompPKWNCIndexes (Member_no)

go

 

ChargeCLReallyBadForDeleteWNCIndexes – Is the poor choice (for this delete) of a clustered index on charge_amt. Again, this table will end up with a lot of internal fragmentation after these deletes (not to mention that inserts will probably cause a lot of fragmentation as well). Over time, this table would probably be one of the worst performers – for almost all DML.

 

SELECT *

INTO dbo.ChargeCLReallyBadForDeleteWNCIndexes

FROM Credit.dbo.Charge

go

 

CREATE CLUSTERED INDEX ChargeCLReallyBadForDeleteWNCIndexesCL

ON ChargeTesting.dbo.ChargeCLReallyBadForDeleteWNCIndexes (Charge_amt)

go

 

CREATE NONCLUSTERED INDEX ChargeCLReallyBadForDeleteWNCIndexesNC2

ON dbo.ChargeCLReallyBadForDeleteWNCIndexes (Statement_NO)

go

 

CREATE NONCLUSTERED INDEX ChargeCLReallyBadForDeleteWNCIndexesNC3

ON dbo.ChargeCLReallyBadForDeleteWNCIndexes (Member_no)

go

 

Well, now that you know the players we can setup the test. The idea of the test is to take a large chunk of early data “off” the table. Presumably to archive but I will not address the whole archiving process at this time. Once archived the data must be removed (for many – just removed – as they can always rely on backups if they need the earlier data… just make sure you keep those backups!)

 

In the tests we will delete a varying percentage of the table based on a specific date for the charge rows. I have modified and increased the size of credit in many ways before getting to these scripts so it might be hard to reproduce this BUT there are some great samples to use from these.

 

First, I increased the number of rows in credit.dbo.charge by just “doubling” the rows:

 

INSERT charge

SELECT c.[member_no] ,

            c.[provider_no] ,

            c.[category_no] ,

            c.[charge_dt] ,

            c.[charge_amt] ,

            c.[statement_no] ,

            c.[charge_code]

FROM charge AS c    -- to go to 200K rows

go

INSERT charge

SELECT c.[member_no] ,

            c.[provider_no] ,

            c.[category_no] ,

            c.[charge_dt] ,

            c.[charge_amt] ,

            c.[statement_no] ,

            c.[charge_code]

FROM charge AS c    -- to go to 400K rows

go

 

 

However, once I get the rows there I’ve realized that I have a bunch of data that doesn’t really follow an increasing ID and increasing date –which isn’t realistic at all… And, of course, I realized this on a version of my database which already had the rows increased and which I always use for all sorts of samples/examples SO – I’d rather fix the current data rather than re-write the INSERT/SELECTs. But (and as it would always happen, I found other problems – like that I had a gap in the identity values). Anyway, I decided that the best way to get dates to be ever increasing along with the identity is to FIRST cleanup the identity. I’m going to do that first by dropping and re-adding the identity col. Re-creating the identity column to reseed it and then updating the charge_dt column based on a starting time and then adding the charge_no to it. OK, so that’s not what I did BUT that would work to give you a relatively increasing value – there would be duplicates though as SQL Server rounds ms values to the nearest timetick. AND – I’m not going to get into this here at all but I did write a series of articles on datetime data for SQL Server Magazine, see here.

 

OK, so crank up your data, have ever-increasing identities to correspond to ever-increasing data and then create duplicate test tables based on the settings and patterns above. You can then create the testing patterns. The tests are going to be based on deleting a percentage of your data (you will need to determine your date ranges and estimate the correct date which is 5-10% of your table). 5-10 percent is a reasonable percentage to delete based on an “archiving” principle/idea (in fact, this might be large in many situations). Since the date values are the same for all tables this will make testing easier. Also, there a few simple testing practices here – use a separate database in which you track test run numbers, do multiple test runs and take averages (consider removing the highest and lowest values to get rid of any strange anomalies as well). SO, here’s the (TestingDeletePerformance SetupScript1 ChargeTestingResults.sql (2.49 KB)) that I used to track/store the test results. Note: you may need to change directory/path/instance names. Once the results database (ChargeTestingResults) has been created then you can create your testing database (based on your modified version of Credit.dbo.charge). The setup script to create this database (ChargeTesting), the testing procedure (DeleteTestRun) and perform a backup (to restore for reproducible test runs) is here (TestingDeletePerformance SetupScript2 ChargeTesting.sql (9.41 KB)).

 

Now, on to the testing…. You should run a series of test runs of slightly varying numbers of rows being deleted and then analyze your findings. I have a feeling that we’ll all get slightly different results on the extreme ends of a few tests BUT if you average the timings over multiple test runs I think you’ll find the same thing I’ve found!

 

So, you’ll need to set your @DateToDeleteTo and make sure your path is correct. based on your dates) but here’s the general idea… so that you input the timings into your ChargeTestingResults database:

 

DECLARE @DateToDeleteTo              datetime,

                   @NumberOfTestRuns     tinyint,

                   @CurrentTestRun           tinyint

 

-----------------------------------------

-----------------------------------------

SET @NumberOfTestRuns = 1

SET @DateToDeleteTo = 'yyyymmdd'

-----------------------------------------

-----------------------------------------

 

SET @CurrentTestRun = 0

 

WHILE @CurrentTestRun < @NumberOfTestRuns

BEGIN

          USE ChargeTesting

          EXEC DeleteTestRun @DateToDeleteTo

          USE Master

          RESTORE DATABASE ChargeTesting

                   FROM DISK = 'fix path'

          SET @CurrentTestRun = @CurrentTestRun + 1

END

go

 

OK, so now we’re ready to roll! I’ve run this 23 times (yes, thanks to my machine being idle for a few of my sleeping hours – it’s either a great time for SETI or automated tests…)

 

I’ve run the following query:

SELECT TableName, AVG(DeleteTime) AS [Average Time]

, MIN(DeleteTime) AS [Min Time]
, MAX(DeleteTime) AS [Max Time]

FROM ChargeTestingResults.dbo.DeleteTestResults

WHERE TestRunID NOT IN

                             (SELECT TestRunID

                             FROM ChargeTestingResults.dbo.DeleteTestResults AS DTR

                             WHERE DeleteTime = (SELECT min(DeleteTime)

                                                                             FROM ChargeTestingResults.dbo.DeleteTestResults AS DTRC -- correlated

                                                                             WHERE DTR.TableName = DTRC.TableName GROUP BY TableName)

                             OR

                                        DeleteTime = (SELECT max(DeleteTime)

                                                                             FROM ChargeTestingResults.dbo.DeleteTestResults AS DTRC2 -- correlated

                                                                             WHERE DTR.TableName = DTRC2.TableName GROUP BY TableName))

GROUP BY TableName

ORDER BY [Average Time]

 

This query removes the highest and lowest values from the test runs and then averages the final results into this data:

 

TableName

Average Time

Min Time

Max Time

ChargeCLDateForDelete

1566

1143

2903

ChargeCLDateForDeleteWithCompPK

2359

1253

8080

ChargeHeap

10392

9603

11476

ChargeCLPKForDelete

11377

9453

18476

ChargeCLReallyBadForDelete

17491

12640

35440

ChargeCLDateForDeleteWithCompPKWNCIndexes

19594

8020

29414

ChargeCLDateForDeleteWNCIndexes

22467

9243

63520

ChargeCLPKForDeleteWNCIndexes

30132

17343

59366

ChargeCLReallyBadForDeleteWNCIndexes

44208

25946

62000

ChargeHeapWNCIndexes

49407

19716

78383

 

Now, I’m sure you’ve completely forgotten what brought us to this result set but it all started when we were concerned about large scale range deletes… I really created this test set for multiple reasons:

1)     To give you ideas for tracking test runs (separate database, testing tables, dates/times and interesting data)

2)     Show you a few practices for reproducible tests (i.e. backup after setup and then restore before each test run)

3)     Show a variety of examples of dynamic string execution

4)     Show you an interesting query which removes the best case and worst case test scenario and then averages the results (fyi – that query is easier to write in SQL Server 2005 with common table expressions!).

5)     Prove my original point… (of course) J

6)     Lead you to wanting to learn more about the BEST point of all… consider partitioning.

 

So, let’s talk about #5 for a second – the original point was that I *think* the reason the deletes were slower with the clustered index is that you also had non-clustered indexes. Hard to say now but that would be my guess. The MORE interesting point is that the Date alone in the CL key is faster than the composite clustering key. Now, I’m not about to start doing insert testing or space testing but having a clustering key which is NOT unique has other problems. Given the VERY small difference in times for these deletes (yes, it’s almost 50% here but the numbers are VERY small comparatively speaking regardless – ok, we can certainly debate this BUT are you really ONLY focusing on performance for deletes or do you need to look at the big picture? When you do I would guess that INSERTs will be more important overall). But – what you really get to when you look over all of this you see the ultimate problem – NON CLUSTERED INDEXES…. Yes, if you have non-clustered indexes they will kill your delete performance. So – the point here is going to be – how can you get GREAT archiving performance when deleting ranges AS WELL AS great performance for INSERTs AS WELL AS a more manageable environment……… partitioning.

 

So – there was another reason for this whole entry… it’s partially to help try and convince a few of you (especially those of you that have tables in double-digit or triple-digit millions of rows) that large tables are harder to manage and large archival operations can be EXTREMELY costly!

Categories:
Events | Indexes | Resources | Tips

Q: How do I interpret DBCC SHOWCONTIG WITH ALL_INDEXES on index id = 255 on a table with multiple text columns. I'm looking at a copy of last night's production after running Maintenance Plan Optimizations. Scan density on table and all indexes is 99 - 100%. Logical Fragmentation is 0% for everything except text column index, which is over 99% (Extent Fragmentation is 83%). Do these numbers have any negative impact on performance? Are they worrisome?

OK, so the first thing to respond to is the “Logical Fragmentation is 0% for everything except text column indexes” is a bug. This is a recently filed bug. It’s based on using the ALL_INDEXES option with DBCC SHOWCONTIG so if you can walk the indexes individually instead then you can probably avoid this (and you could always tweak my version of sp_RebuildIndexes to do this).

 

Now, related to text indexes (and this is as per the Index Defrag Whitepaper), Logical Scan Fragmentation is not relevant on text indexes. However, Extent Scan Fragmentation is and yes, this is something that could be worrisome IF you return the text/image data to your clients in large requests (meaning LOBs measured in MBs+ or requests where you’re returning a lot of LOBs). If you think you’re doing large scans (or if the text data is very large and generally returned to the client) then you might want to compact this part of the database and even consider moving the text/image data into its own filegroup (but – remember that singleton selects won’t really benefit from this).

 

So, to start – it’s important to just mention this option. When you create a table, you can state where the data is located with the ON clause; you can also state where to place the text/image data with the TEXTIMAGE_ON clause. A sample CREATE TABLE that places data on one filegroup and text/image data on another is below:

 

CREATE TABLE dbo.TestTable

(

          col1   int              identity(100,10),

          col2   datetime     default getdate(),

          col4   char(30)      default suser_name()

)

ON [FG1]

TEXTIMAGE_ON [FG2]

 

Since the table is already created – and very fragmented – the only option you have now is to “move” that data to another filegroup. To do this you have a couple of options depending mostly on table size. The easiest way would be to use the SQL Enterprise Manager to first create a new file and filegroup and then set the Table Properties for the Text filegroup (under properties in Design table) to be this new filegroup. The bad side of using this is that the entire table (and all related objects, constraints and indexes) will also be recreated, as the ENTIRE table will be “rebuilt” using this method. The other option – which is a bit harder to implement is to BCP out and then BCP back in the data; however, you cannot change the filegroup when using ALTER TABLE so – this would only work if you wanted to load it back into the same filegroup in which the data already resides. Depending on how many gaps there are within the file you may not end up with very contiguous allocations at the end of this either.

 

So, I think it mostly depends on how much work you’re willing to do and how much of a gain you think it’s going to have.  If MOST queries only return a singleton text/image value AND the performance of these operations is adequate, then I might not worry about it so much (even though, yes it is fragmented). As per one of the dev team folks – one thing to do might be to test the timings of SELECTs and determine if they are slower than you would like and if yes, then DELETE/INSERT the text/image value to have it re-inserted. BUT – if you have time to shift things around and don’t mind some of the administrative impact then there are a couple of options to review if you are getting poor performance from text/image data. One option would be to separate it on its own filegroup (fyi – there are other management benefits in doing this especially in the area of backup and restore when this is a large portion of your database) and you should consider taking the time to do this off hours. Additionally, it will probably be easiest using the UI. You should really follow these steps:

 

  1. Perform a full backup your database
  2. Change the recovery model to simple or bulk_logged (unless it’s already simple or bulk_logged) and then restrict user access to dbo use only (just in case you need to revert back to the older version). You can choose bulk_logged if this operation is going to take a significant amount of time and IF you want to do periodic log backups while it’s running. If you just plan to do a full backup upon success completion of this operation then you can choose the simple recovery model.
  3. Use the UI to create a new file and filegroup (although this syntax is pretty easy)
  4. Use the UI to change the table’s properties to point to your new filegroup (probably should do some testing with a development version before you do this on production – especially with the same sizing, etc. so you can see how much log space and time this is likely to take, etc.)
  5. Change the recovery model back to what it was (likely full?)
  6. If you switched to simple, perform a full backup of your database. If you switched to bulk_logged then all you need is a transaction log backup.
  7. Allow users back in.

 

As for resources that might help you out more with this question and with understanding some of the management options (in terms of backup/restore for VLDB, see my series on Backup/Restore on SQL Server Magazine. A complete list of my articles is here.

 

Q: After doing index maintenance in Enterprise Manager, the LogicalScanFragmentation was zero for the indexes that I modified. After running DBCC DBREINDEX the LogicalScanFragmentation jumped up to approx 12% on all the indexes. Not sure why this happened, but there must be some difference in the way the indexes were rebuilt.

Also, somewhat related to the first part of the previous question… It turns out that Logical Scan Fragmentation of 0 (as reported by a DBCC SHOWCONTIG with ALL_INDEXES) is actually a bug. It's unlikely that it “jumped” to 12% but that it may not have been 0 to begin with...

 

But – for completeness – there could be one other factor in it increasing after a rebuild and that's IF the space into which it rebuilds is NOT contiguous. You *MAY* want to consider rebuilding/moving the object to another filegroup to make sure that it's completely contiguous but that might not be worth it.

 

Q: (This is from a recent email, fyi) I found a bug in your sp_RebuildIndexes.sql script… It doesn’t work for clustered indexes that have columns set to DESC!

THANKS! This is great. I have to admit that I haven't had clients with clustered indexes with DESC columns but it could certainly happen and good to know that I have a bug (we all have to have one someday, right? he he)!

 

So, I have updated/fixed the script here:sp_RebuildIndexes.sql (11.08 KB). For completeness, I have also updated the msdn webcast Q&A zip files: Part I here and Part II here and I have updated the version that is listed on my sample scripts page on SQLskills (here). It shouldn't impact too many people but good to have the right code! Always let me know if you find something. (This fix is from fellow RD Morten Abrahamsen who is an RD for Denmark, Norway and Sweden.)

 

Q: (Also, from a recent email) I don't quite understand what indexdefrag will do for a clustered index. It is my understanding that the leaf level of a clustered index is the data (i.e. the leaf pages are the data pages). If these pages are being physically reordered and compacted [during a rebuild], wouldn't that accomplish the same thing as a defrag index?

Rebuilding and Defraging both work on the leaf level of an index - regardless of index type... So a rebuild completely rebuilds the entire structure (so, it’s more thorough). Defrag solely rearranges the leaf level pages in "mini" transactions as it moves data through the leaf level pages.

 

So - your question is really more of a statement - and a correct one too... Yes, the leaf level of the clustered index IS the data so if you perform a rebuild then there's no reason to defragment.

 

Q: You mentioned that DBCC DBREINDEX and CREATE w/DROP_EXISTING are synonymous and from what I can tell from your documentation as well as other documentation that I have read, the DBCC DBREINDEX does rebuild the clustered index.  We currently have a script similar to the sp_RebuildIndexes.sql that you provided; however, we perform the DBCC DBREINDEX for all indexes including the clustered indexes.  We have a consultant telling us that we must use the CREATE w/DROP_EXISTING for the clustered indexes.  At first I didn't think this was true based on what I learned from your webcast. However, after looking at your sp_RebuildIndexes.sql script I do see that you are also using the CREATE w/DROP_EXISTING for the clustered index. Can you provide some clarification for me and verify if we should be using CREATE w/DROP_EXISTING for clustered indexes?

Well, yes and no… To be honest… you caught me! J

So, to answer whether or not these are synonymous – well, kind of! There's one thing you can do with CREATE w/DROP_EXISTING that you can't do with DBCC DBREINDEX. Using CREATE w/DROP_EXISTING you can CHANGE the definition of the clustering key. If you're just rebuilding then there's NO difference....

 

Now, to answer the question of why I use CREATE with DROP_EXISTING for the clustered index... it was just for the challenge! Really, I was trying to see how complex the dynamically executed string was going to be to build in order to supply the entire CREATE INDEX statement. So, didn’t mean to create the confusion! There’s really NO technical reason for it! (Good catch!)

Categories:
Events | Indexes | Resources | Tips

Q: Can I automate DBCC DBREINDEX to rebuild all tables on all databases on a server?

Actually, yes and no… With stored procedures you’ll have a hard time changing database context without having to use dynamic string execution…. And so this can add a bit of frustration. BUT – it’s your lucky day (and really, it’s mainly because I’m on a ferry going to Victoria Canada right now and the Washington State Ferry system doesn’t currently have wireless internet access… and well, I have a few spare minutes J).

 

Anyway, I’ve put together a stored procedure which allows you to enter a command that will be processed in all databases. It’s similar to the undocumented sp_MSforeachdb but it’s got a simpler format. The basic idea is to create a cursor which has the list of all of the databases in it and then while walking the cursor build a string which includes a change in database context switch followed by the command to execute.

 

                   SELECT @ExecString = N'USE ' + QUOTENAME(@DBToProcessName, ']') + N' ' + @CommandToExecute

                   SELECT @ExecString AS 'Command to be executed'

 

Please note that I have NOT done a lot of error handling nor have I done a lot of testing but this works pretty well for what I’ve used it for and in response to this question it certainly works. It’s a great start for you to use/learn from and build upon for your own procedures. Here’s the script: sp_ExecuteInDatabase.sql (3.76 KB). And here’s how you use it:

 

            exec sp_ExecuteInDatabase N'exec sp_RebuildIndexes'

 

or if you want to execute ONLY in one database:

 

            exec sp_ExecuteInDatabase N'exec sp_RebuildIndexes', 'pubs'

 

Let me know if you find anything interesting and/or make changes to this one! Have fun.

 

Q: Is there a baseline amount of improvement for INDEXDEFRAG or DBREINDEX to be worthwhile? Please express your answer looking at % change for logical fragmentation and Scan Density.

Actually, the associated whitepaper does a nice job of correlating table size with defragmentation best practices. I’ll defer to that BUT I will say that IF you have the maintenance window then there’s never a negative in doing this – even for small tables. However, (and as the whitepaper also states) the larger and more active the table the more rewarding (in terms of performance as well as disk/cache savings) a defrag/rebuild is! Here’s the link to the whitepaper: Microsoft SQL Server 2000 Index Defragmentation Best Practices.

 

Q: In a *typical* scenario (500 tables, 15 large ones) how often would you recommend to go between rebuilding indexes? Ex. lower the fill factor until you go X days before your scan density is below the threshold.

Yep! I think I got this one during the webcast as well BUT – this is exactly my suggestion! If you can find the stable point where fragmentation only begins to occur when you’re about to rebuild then that’s the best choice. In that case you will keep your table the most compact (i.e. at the highest fillfactor) without wasting space ; at the same time you will minimize fragmentation and keep the DML operations fast! Again, the defrag whitepaper has a few additional ideas for you. Here’s the link to the whitepaper: Microsoft SQL Server 2000 Index Defragmentation Best Practices.

Categories:
Events | Indexes | Resources | Tips

Q: Is there another way to determine fragmentation? I don’t want to use DBCC SHOWCONTIG as it locks objects.

Well, this is a good question and unfortunately, there really isn’t a way to generate information such as Scan Density or Average Page Density without running DBCC SHOWCONTIG. However, if you decide that you’re only interested in Scan Density then you will be interested in the WITH FAST option for DBCC SHOWCONTIG. Using the WITH FAST option the command will only calculate the out of order extents (i.e. extent switches) and not calculate average row length, maximum row length, minimum row length, average page density, etc. and in using this option SQL Server will execute this command a lot faster thereby releasing the locks faster. However, locks will still be acquired. As an odd response though – I’d have to say that you could just defrag without looking… Meaning instead of even checking for fragmentation you could just execute DBCC INDEXDEFRAG. Since DBCC INDEXDEFRAG doesn’t use long running locks (it runs as “mini transactions” as it moves through the leaf level of the index) and doesn’t do anything when there’s no fragmentation (it really only moves data around when fragmentation exists) then you could do this without checking – and not creating blocking.

 

Q: How does the fill factor influence index structure?

FILLFACTOR – which is an option which ONLY takes effect when an index is initially built or later when an index is rebuilt or defragmented – defines the level of fullness for the leaf level of an index. Fillfactor ONLY applies to the leaf level and when set appropriately for the data and the volatility patterns of the data, it should help to minimize fragmentation thereby reducing the need to free space to be left at higher levels in the tree. However, if you also want to pad the upper levels of the tree with the same level of free space as the leaf level, you can add PAD_INDEX to the index statement and it will use the fillfactor setting for all levels of the index. For example:

 

CREATE INDEX Test

ON TestTable (TestColumn)

WITH FILLFACTOR = 80

 

Fills the leaf pages 80 percent full and the non-leaf pages are balanced such that at LEAST one entry is open on each leaf page.

 

CREATE INDEX Test

ON TestTable (TestColumn)

WITH FILLFACTOR = 80, PAD_INDEX

 

Fills BOTH the leaf and non-leaf pages 80 percent full.

 

Typically, PAD_INDEX is not needed unless you have extremely unbalanced activity and are unable to rebalance the tree (meaning you need to achieve high availability). However, I would say that PAD_INDEX is not perfect as implemented as I would rather set that value differently from fillfactor…

 

So, not a big fan of PAD_INDEX but feel it’s very important to set an APPROPRIATE fillfactor. (Review the webcast for more details on what I mean by “appropriate” fillfactor.)

 

Q: What would be the ideal settings [of fillfactor] for OLAP/DSS Databases?

To be honest, 100% is an ideal setting for a read-only decision support database, as new data is not causing splits. However, data loads (which build these reporting databases) can cause fragmentation. Ideally, you would drop your indexes, load your data and then rebuild your indexes upon completion. If you’re trying to reduce the amount of time the process takes AND you’re not loading in a lot of data (compared to what’s already there) then you might not want to drop and recreate indexes… Setting fillfactor in these cases will be based on how much data you will be loading compared with how much data you have. Set fillfactor to whatever that percentage is (if you’re loading 100,000 rows into a table which already has 900,000 rows then a fillfactor of 90 should have just enough room – IF the data is evenly distributed). If the data is NOT evenly distributed then you might want to drop this number a tad lower. AND – you only want to set this on indexes that are going to have splits. Remember, if your clustering key ever increasing and the new data coming is “appended” to the end of the table – both logically and by clustering key definition – then you only need the fillfactor setting on the nonclustered indexes. Regardless, it is still likely that you will end up with some fragmentation. Dropping the indexes, loading the data, and then rebuilding the indexes would certainly be better – but it will take more time. IF you’re loading a lot of data relative to what you already have then there’s NO question: drop the indexes first, load the data and then rebuild. When dropping indexes ALWAYS make sure to drop NC Indexes first and then drop the clustered index. When creating indexes ALWAYS make sure to create the clustered index first and then add the nonclustered. For more details on High Performance Data Loading check out the presentations on Gert Drapers’ site: www.SQLDev.Net, under Events. (FYI - There are all sorts of great presentations there!)

 

Q: When trying to determine the fill factor to set, would you recommend setting the same fillfactor to all tables or just adjust the fillfactor on the large tables and keep the small ones at the default?

I know I got this one during the webcast… But as a recap here. I would really recommend starting with the “monster” tables and then work your way down from there. Remember, you don’t really need a fillfactor unless your object is going to have splits. In many cases your clustered structures will not have a lot of splits as they might have a clustering key which is ever-increasing. Additionally, if you have all fixed width fields or you don’t have updates to varchar columns – then you might be able to keep a large table very compact at 100% in fact. The reason I mention this is that I have a customer who did exactly what you’re asking… They decided it was easier to just set the server option for fillfactor to 90% so that they’d change the default value instead of having to set it individually for every index. Unfortunately, for their largest (yes, their monster!) table it turned out that they didn’t have updates and their clustering key was on an identity column. Once they went back and changed their rebuild job to build to 100% they reclaimed 3.2 GB of disk space!

 

So, let me summarize. Only worry about fillfactor for your largest tables. Set it somewhat strategically and make sure to check on it using DBCC SHOWCONTIG (if possible) or just defrag regularly (if you just want to make sure and can’t afford the impact on availability – review the rest of the questions in “Index Fragmentation” for more details).

 

Q: Will Kimberly show the rebuild or defrag statement she used? I had to run to the restroom and missed that brief part.

I would have answered your question but I had to use the restroom myself… Seriously though, J the commands are:

            DROP and re-CREATE

            DBCC DBREINDEX (was added in 6.5)

            CREATE with DROP_EXISTING (was added in 7.0)

            DBCC INDEXDEFRAG (was added in 2000)

Personally, when I’m going to do a rebuild – I use DBCC DBREINDEX as it’s easier to automate and easier syntax in general. If my availability requirements don’t allow the table/indexes to be rebuilt because of locking, then you only have one choice for defraging – DBCC INDEXDEFRAG. I have a feeling that I cleared this up once you returned and certainly, you can review the webcast on demand here!

 

Q: Regarding sp_RebuildIndexes, does the SP need to be created in the Master Database? Our companies Best Practices recommend against it. We use an “Admin User Database” for most shared stored procedures.

Sure, you can create the sp in any database (just make sure to modify the script) and then make sure to explicitly qualify the execution with the datasbasename.owner.procedurename and everything should work. Also, there are a couple of procedures created. Make sure to review thoroughly the script.

 

Q: What type of Defrag do the data maintenance plans use?

They use DBCC INDEXDEFRAG.

 

Q: Can you use DBCC DBREINDEX if you are using transactional replication? I thought this 'broke' replication.

Checked the KB and pinged a few people and no one is aware of any issues with regard to DBREINDEX breaking replication (nor should creating/dropping indexes). In fact, because transactional replication focuses on the transactional data changes (not the page locations of the data) this operation should not even have the potential of “breaking” replication. Don’t get me wrong, it can create blocking and this in turn can impact user access to the replicated tables but it still should not break replication. 

On a related note (and probably another common replication/maintenance question I get), often people want to know how to have replicated table’s indexes rebuilt/defraged when the source table is rebuilt/defraged. And – this is a fairly simple trick. Use stored procedures to execute the rebuild/defrag and then stored procedure replication to have the execution of the stored procedure replicated to the subscribers.

Categories:
Events | Indexes | Resources | Tips

Before I launch into the Q&A from this area there are a few other resources with Index Usage Q&A that you should also check out:

            Review the Q&A from the June 11 Webcast here.

            Review the Q&A page on SQLskills here.

 

Q: How does SQL server decide what index to use when there are multiple indexes in at table?

When SQL Server receives a request for data it goes through 5 primary steps: parsing, standardization, optimization, compilation and execution. The process of optimization (while a bit simplified here) includes multiple phases as well: query evaluation, index evaluation, join evaluation and so on. During index evaluation SQL Server determines whether or not USEFUL indexes exist to aid in processing your query. The usefulness of an index is based on the selectivity of the query and the columns requested in the query… There are many things for which SQL Server looks but the idea is that the better the information it has – the better the job of optimization. To ensure that the optimizer does a good job you should make sure to have auto update statistics and auto create statistics and you should consider using tools like Index Tuning Wizard to help you to create useful indexes.

 

While this is only touching the tip of the iceberg on these concepts last month’s webcast: Indexing Best Practices – will give you a lot more information. And – if you’re really interested… I offer a 5 day course in Performance Tuning and almost 2 full days cover almost nothing but indexes (nope, not kidding!). For details on the next SQL Immersion Event, click here.

 

Q: Why does the optimizer occasionally decide NOT to use an index(s)? In a few cases, we have had to force an index (e.g select * from a (index=index name) join b on a.id = b.id…

Also, related to how the optimizer makes decisions on index usage... and a bit hard to answer with certainty BUT there are a few reasons why the optimizer might make an incorrect decision. However, before I get there – are you sure that using an index was really the “right” decision? There are often cases where a query is not selective enough to use an index and a table scan is actually a better choice… But if you find a case where your queries performance improves when you force an index and SQL Server is unable to make that same optimization choice then I’d check these things:

  • Make sure that statistics are current and up to date. First, make sure the auto create and auto update statistics database-level options are turned ON. Also, you can use these commands to get some information about statistics: DBCC SHOW_STATISTICS or the STATS_DATE function. If the statistics are up to date then you might also want to check to see if they were based on a sampling v. a full scan. In the output of DBCC SHOW_STATISTICS you can see the number of rows as well as the number of rows sampled. If the sampling is low compared to the number of rows in the table then you might want to automate an UPDATE STATISTICS with FULLSCAN. However, this can cause blocking so you’ll want to schedule it off hours.
  • Maybe you don’t have the best index for the query… consider using Index Tuning Wizard on the query in Query Analyzer.
  • Maybe consider re-writing the query.. If it’s a join consider a subquery and visa versa.
  • If the command is in a stored procedure consider recompilation techniques. See this blog entry for more details on stored procedure recompilation issues.

Well, this certainly isn’t an exhaustive list but it should help!

Categories:
Events | Indexes | Resources | Tips

Q: What about using uniqueidentifier fields for keys in a clustered index, will this affect performance? They are random numbers so they are not monotonically increasing…

Ha, I always want to time how long it’s going to take for me to get this question. Seriously, I don’t think I’ve EVER given an indexing lecture (where I talked about the clustering key) where someone didn’t ask about GUIDs. And – once again – a great question! I do not really have a problem with GUIDs in general. I think they have a purpose in that they are relatively “random” numbers (they are not really random entirely as they get part of their value from SQL Server, part from Windows, part is based on time, part is ever increasing and part is random; however, the final values do not create any form of pattern). They truly make sense in distributed applications or those where you do not want a predictable row identifier; however, they are not always an appropriate choice for a key. (I think this should be a discussion in and of itself. Issues related to GUIDs. OK, I’ll work on this one!) But this leads me to another issue – is the primary key always clustered and/or always need to be an identity. These three things are really three separate things and they do NOT have to be on the same column. I often prefer that all three are the same column as an identity is naturally unique, naturally static, naturally narrow as well as creates an ever-increasing pattern that minimizes splits/fragmentation due to inserts (again, only when this is also the clustering key). So I guess the question is – does it really need to be the Primary key? And well, that’s the debatable but if all three criteria come together that often gives you the largest gains. If the Primary key is not an identity (and it’s a very large transaction processing table with a lot of inserts/updates/deletes) then I would consider adding a surrogate column – solely for clustering – and make the Primary Key non-clustered. The non-clustered index will certainly become fragmented (if on a GUID) so as long as proper maintenance practices are in place you can reduce the excess disk space and improve modification performance.

 

Q: What about UPDATEs that change the row's clustering columns?

This question was related to “what can cause fragmentation” where I was discussing UPDATEs to variable width columns. While updates to variable width columns can cause a row to widen (and possibly result in a page split) it’s a good point to mention that volatility in the clustering key is far worse. In fact, in last month’s webcast, I spent quite a bit of time discussing proper creation of the clustering key – to minimize fragmentation and in general, improve performance. There are three criteria for which I look in a clustering key: the key should be Unique, Narrow and Static. As final criteria – and this helps to minimize fragmentation proactively – I look for the key to have an ever-increasing pattern. Sometimes this means creating the clustered index on an identity column and if one doesn’t exist you might consider adding a column solely for this purpose! I probably wouldn’t worry about this for small tables but for an extremely large and volatile table this can often be a good idea. You might also consider (depending on access patterns), creating the clustered index on a combination of columns – such as date, identity; this creates the ever increasing pattern and also ensures uniqueness for the value as date is likely to have duplicates. There are multiple options and there are multiple considerations – all of which affect query performance, data modification performance, maintenance requirements and therefore availability. Make sure to review many of the other questions in this area as a few have a great amount of detail.

Categories:
Events | Indexes | Resources

Q: Does clustering rely on the operating system for performance?

Well, I have to admit I am not entirely sure of the question here… But – I think I understand from where it might originate (well, I’ll try). First – and unfortunately (and also somewhat seriously), I think many people lack creativity in choosing a term for something – at least in the computer industry! J Here, I think the confusion is between “clustered index” and the OS term/feature “clustering.” The term clustered index stems from having data “clustered” in a certain order or grouping. In previous releases, we tended to create the clustered index for range queries (like on LastName) and therefore people with the same LastName were “clustered” together. However, because the internals of SQL Server changed [significantly] between 6.5 and 7.0 the internals really dictated a change in where the clustered index should be created. So, even though the term didn’t change it doesn’t entirely apply. Now the OS term/feature “clustering” applies to high availability. Specifically, failover clustering refers to a feature of the Enterprise Edition of SQL Server – which requires a higher level of Windows (typically an “Advanced”, “DataCenter”, or “Enterprise” version) to support it. This really is not the primary focus for this webcast but below are a couple of references if you want to understand more about High Availability and Failover Clustering in SQL Server:

 

Q: How well are Microsoft's standard products that use SQL optimized? Are there published studies on optimizing Microsoft CRM, Sharepoint Services, Project Server, and the various Microsoft Business Products?

This one I kind of answered during the webcast…First, I will start by mentioning that I’m NOT an expert on any of these products specifically. However, you are correct in that they are based on SQL Server and use SQL Server as their data store. What that means is that they too must be optimized and maintained; however, it is impossible for anyone to predict perfectly the exact usage of each of these products and as a result they might have a good selection of indexes but they are unlikely to have maintenance operations defined. However – again, I cannot say that I am an expert and I could be wrong! What I will say however, is that regardless of what is defined and what is maintained – you can create additional indexes, possibly drop indexes (more on this in a moment) and maintain indexes with NO application changes. The reason why I say “possibly” drop indexes is because queries/procedures, etc. will FAIL if index hints are used and the index does not exist at runtime. And while I’m fairly certain that NO Microsoft product hardcodes index hints, I have seen other third party products which do and unfortunately they become VERY hard to optimize and manage. In general, it is harder to drop indexes than add indexes. Nevertheless, it is always possible to defrag/rebuild indexes – regardless of the application whether Microsoft of not. So, while I can’t speak to each of these applications directly (although my partner works on Microsoft Project and he’s pretty sharp! J), you can always improve upon an application with better/more useful indexes (meaning more specific indexes for YOUR usage patterns and requirements) and proper/consistent/automated index maintenance.

 

Q: Is there any internal command to output query plan used by optimizer (not showplan or external tools available), which may show splits with DML operations?

Well, I’ll start by saying that these are really two different parts to your query processing. First, splits are NOT part of the optimization plan as they are only known at runtime. So, a plan really wouldn’t ever show them. As a result, I don’t believe an internal command exists to show this; however, I’m a bit confused as to why it must be “internal” and not something else? Are you just trying to determine which commands and in which tables you are having a lot of splits? (And IMO, this is the most important.). If that’s the case then you can start by using PerfMon to see if you have splits (and get a rough idea of when) and then use DBCC SHOWCONTIG to see which tables are fragmented (meaning they had the splits). Perf Mon will only tell you that you have them – not in which table they are occurring (but I will say that you might be very pleased with a few new features of SQL Server 2005 – look forward to playing with beta II Profiler and PerfMon). Right now, check out PerfMon SQL Server: Access Methods, Page Splits/sec.

 

Q: How do we use the I/O affinity option in SQL Server 2000?

Actually, I’ll let the KB take care of this one. Check out KB Article Q298402 titled: INF: Understanding How to Set the SQL Server I/O Affinity Option.

 

Q: Can we use MAXDOP with Index Creation?

No, MAXDOP is supported only as a query hint – for DML (SELECT, UPDATE, etc.) statements. However, if you want to limit all operations to a reduced number of CPUs you can set the server setting Max Degree of Parallelism. See the sp_configure option in the Books Online as well as KB Article 329204 titled: INF: General Guidelines to Use to Configure the MAXDOP Option.

Categories:
Events | Indexes | Resources

Q: Can I view a recording of this webcast?

On demand MSDN webcasts (in general) can be found here. The Index Defragmentation Best Practices session is ready and available here. The prerequisite Indexing Best Practices session (from June 11) is here.

 

Q: Where can we get a print out of the slides?

I have posted a pdf of the presentation here as well as on the “Past Events” resources page on SQLskills.com here.

 

Q: Where can we get the demo scripts?

I have posted the demo scripts here (MSDNWebcast-Tripp-20040719-SCRIPTS.zip (68.37 KB)) as well as on the “Past Events” resources page on SQLskills.com here.

 

Q: Does Kimberly L. Tripp have a blog?

Yes, www.SQLskills.com/blogs/Kimberly and if you’re reading this on my blog…well, then you already know! J

 

Q: What's the URL for your website (for scripts/samples/articles/events?)

The main website link is: www.SQLskills.com

For a list of articles (many of which do NOT require a subscription), click here.

For a list of upcoming events, click here.

For a list of past events (including demo scripts, samples and Q&As from past events), click here.

For a list of scripts (addition sample scripts not necessarily part of demo scripts), click here.

For the upcoming SQLskills SQL Immersion Events (focused, intense performance tuning events), click here.

 

Q: Do you speak at other events for more advanced tuning topics?

Yes, in fact, SQLskills has a new event coming up in October that’s a 5-day event on performance tuning on SQL Server. We’re calling these events “Immersion Events” as they’re focused, indepth and intense! Full details are on http://www.sqlskills.com/sqlimmersion.asp.

 

Q: Are you available for onsite custom courses or consulting?

Yes! Feel free to email us at:

Mentoring@SQLskills.com

CustomCourses@SQLskills.com

or feel free to email me at Kimberly@SQLskills.com

 

Q: Can you list all of the additional resource links from the final section of the slides:

Categories:
Events | Indexes | Resources

NOTE: These have NOT yet been completed and there are still a few questions that need to be answered. I will wrap these up shortly and also break them down into separate blog entries so they're easier to wade through. I will also post the pdf, the webcast link (when it's available), the demo scripts and links to all of these blog entries - on SQLskills - when everything is complete. BUT - here's a good start with ALL of these resources as well as the majority of the Q&A. Have fun!

 

Q: Can I view a recording of this webcast?

It should be available on www.microsoft.com/usa/webcasts/ondemand shortly.

 

Q: Where can we get a print out of the slides?

I have posted a pdf of the presentation here as well as on the “Past Event” resources page on SQLskills.com here.

 

Q: Where can we get the demo scripts?

I have posted the demo scripts here (MSDNWebcast-Tripp-20040719-SCRIPTS.zip (68.37 KB)) as well as on the “Past Event” resources page on SQLskills.com here.

 

Q: Does Kimberly L. Tripp have a blog?

Yes, www.SQLskills.com/blogs/Kimberly and if you’re reading this on my blog…well, then you already know! J

 

Q: Do you speak at other events for more advanced tuning topics?

Yes, in fact, SQLskills has a new event coming up in October that’s a 5-day event on performance tuning on SQL Server. We’re calling these events “Immersion Events” as they’re focused, indepth and intense! Full details are on www.SQLskills.com.

 

Finally, here are all of the resource links I recommended in the resources section of the slides:

Q: What about using uniqueidentifier fields for keys in a clustered index, will this affect performance? They are random numbers so they are not monotonically increasing…

Ha, I always want to time how long it’s going to take for me to get this question. Seriously, I don’t think I’ve EVER given an indexing lecture (where I talked about the clustering key) where someone didn’t ask about GUIDs. And – once again – a great question! I do not really have a problem with GUIDs in general. I think they have a purpose in that they are relatively “random” numbers (they are not really random entirely as they get part of their value from SQL Server, part from Windows, part is based on time, part is ever increasing and part is random; however, the final value created does not have an ever-increasing or decreasing pattern to it). They truly make sense in distributed applications or those where you do not want a predictable row identifier; however, they are not always an appropriate choice for a key. But this leads me to another discussion – is the primary key always clustered and/or always need to be an identity. These three things are really three separate things and they do NOT have to be on the same column. I often prefer that all three are the same column as an identity is naturally unique, naturally static, naturally narrow as well as creates an ever-increasing pattern that minimizes splits/fragmentation due to inserts (again, only when this is also the clustering key). So I guess the question is – does it really need to be the Primary key? And well, that’s the debatable but if all three criteria come together that often gives you the largest gains. If the Primary key is not an identity (and it’s a very large transaction processing table with a lot of inserts/updates/deletes) then I would consider adding a surrogate column – solely for clustering – and make the Primary Key non-clustered. The non-clustered index will certainly become fragmented (if on a GUID) so as long as proper maintenance practices are in place you can reduce the excess disk space and improve modification performance.

 

Q: What about UPDATEs that change the row's clustering columns?

This question was related to “what can cause fragmentation” where I was discussing UPDATEs to variable width columns. While updates to variable width columns can cause a row to widen (and therefore cause a split) it’s a good point to mention that volatility in the clustering key is far worse. In fact, in last month’s ewbcast I spent quite a bit of time discussing proper placement of the clustering key – to minimize fragmentation and in general, improve performance. There are three criteria for which I look in a clustering key: the key should be Unique, Narrow and Static. As a final criteria – and this helps to proactively minimize fragmentation – I look for the key to have an ever-increasing pattern.

 

Q: Does clustering rely on the operating system for performance?

Well, I have to admit I am not entirely sure of the question here… But – I think I understand from where it might originate (well, I’ll try). First – and unfortunately (and also somewhat seriously), I think many people lack creativity in choosing a term for something – at least in the computer industry! J Here, I think the confusion is between “clustered index” and the term  OS term/feature “clustering.” The term clustered index stems from having data “clustered” in a certain order or grouping. In previous releases, we tended to create the clustered index for range queries (like on LastName) and therefore people with the same LastName were “clustered” together. However, because the internals of SQL Server changed [significantly] between 6.5 and 7.0 the internals really dictated a change in where the clustered index should be created. So, even though the term didn’t change it doesn’t entirely apply. Now the OS term/feature “clustering” applies to high availability. Specifically, failover clustering refers to a feature of the Enterprise Edition of SQL Server – which requires a higher level of Windows (typically an “Advanced”, “DataCenter”, or “Enterprise” version) to support it. This really is not the primary focus for this webcast but below are a couple of references if you want to understand more about High Availability and Failover Clustering in SQL Server:

 

Q: How well are Microsoft's standard products that use SQL optimized? Are there published studies on optimizing Microsoft CRM, Sharepoint Services, Project Server, and the various Microsoft Business Products?

This one I kind of answered during the webcast…First, I will start by mentioning that I’m NOT an expert on any of these products specifically. However, you are correct in that they are based on SQL Server and use SQL Server as their data store. What that means is that they too must be optimized and maintained; however, it is impossible for anyone to predict perfectly the exact usage of each of these products and as a result they might have a good selection of indexes but they are unlikely to have maintenance operations defined. However – again, I cannot say that I am an expert and I could be wrong! What I will say however, is that regardless of what is defined and what is maintained – you can create additional indexes, possibly drop indexes (more on this in a moment) and maintain indexes with NO application changes. The reason why I say “possibly” drop indexes is because queries/procedures, etc. will FAIL if index hints are used and the index does not exist at runtime. And while I’m fairly certain that NO Microsoft product hardcodes index hints, I have seen other third party products which do and unfortunately they become VERY hard to optimize and manage. In general, it is harder to drop indexes than add indexes. Nevertheless, it is always possible to defrag/rebuild indexes – regardless of the application whether Microsoft of not. So, while I can’t speak to each of these applications directly (although my partner works on Microsoft Project and he’s pretty sharp! J), you can always improve upon an application with better/more useful indexes (meaning more specific indexes for YOUR usage patterns and requirements) and proper/consistent/automated index maintenance.

 

Q: Is there any internal command to output query plan used by optimizer (not showplan or external tools available), which may show splits with DML operations?

Well, I’ll start by saying that these are really two different parts to your query processing. First, splits are NOT part of the optimization plan as they are only known at runtime. So, a plan really wouldn’t ever show them. As a result, I don’t believe an internal command exists to show this; however, I’m a bit confused as to why it must be “internal” and not something else? Are you just trying to determine which commands and in which tables you are having a lot of splits? (And IMO, this is the most important.). If that’s the case then you can start by using PerfMon to see if you have splits (and get a rough idea of when) and then use DBCC SHOWCONTIG to see which tables are fragmented (meaning they had the splits). Perf Mon will only tell you that you have them – not in which table they are occurring (but I will say that you might be very pleased with a few new features of SQL Server 2005 – look forward to playing with beta II Profiler and PerfMon). Right now, check out PerfMon SQL Server: Access Methods, Page Splits/sec.

 

Q: How do we use the I/O affinity option in SQL Server 2000?

Actually, I’ll let the KB take care of this one. Check out KB Article Q298402 titled: INF: Understanding How to Set the SQL Server I/O Affinity Option.

 

Q: Can we use MAXDOP with Index Creation?

No, MAXDOP as a query hint only works on DML statements. However, if you want to limit all operations to a reduced number of CPUs you can set the server setting Max Degree of Parallelism. See the sp_configure option in the Books Online as well as KB Article 329204 titled: INF: General Guidelines to Use to Configure the MAXDOP Option.

 

Q: Is there another way to determine fragmentation? I don’t want to use DBCC SHOWCONTIG as it locks objects.

Well, this is a good question and unfortunately there really isn’t a way to generate information such as Scan Density or Average Page Density without running DBCC SHOWCONTIG. However, if you decide that you’re only interested in Scan Density then you will be interested in the WITH FAST option for DBCC SHOWCONTIG. Using the WITH FAST option the command will only calculate the out of order extents (i.e. extent switches) and not calculate average row length, maximum row length, minimum row length, average page density, etc. and in using this option SQL Server will execute this command a lot faster thereby releasing the locks faster. However, locks will still be acquired. As an odd response though – I’d have to say that you could just defrag without looking… Meaning instead of even checking for fragmentation you could just execute DBCC INDEXDEFRAG. Since DBCC INDEXDEFRAG doesn’t use long running locks (it runs as “mini transactions” as it moves through the leaf level of the index) and doesn’t do anything when there’s no fragmentation (it really only moves data around when fragmentation exists) then you could do this without checking – and not creating blocking.

 

Q: How does the fill factor influence index structure?

FILLFACTOR – which is an option which ONLY takes effect when an index is initially built or later when an index is rebuilt or defragmented – defines the level of fullness for the leaf level of an index. Fillfactor ONLY applies to the leaf level and when set appropriately for the data and the volatility patterns of the data, it should help to minimize fragmentation thereby reducing the need to free space to be left at higher levels in the tree. However, if you also want to pad the upper levels of the tree with the same level of free space as the leaf level, you can add PAD_INDEX to the index statement and it will use the fillfactor setting for all levels of the index. For example:

 

CREATE INDEX Test

ON TestTable (TestColumn)

WITH FILLFACTOR = 80

 

Fills the leaf pages 80 percent full and the non-leaf pages are balanced such that at LEAST one entry is open on each leaf page.

 

CREATE INDEX Test

ON TestTable (TestColumn)

WITH FILLFACTOR = 80, PAD_INDEX

 

Fills BOTH the leaf and non-leaf pages 80 percent full.

 

Typically, PAD_INDEX is not needed unless you have extremely unbalanced activity and are unable to rebalance the tree (meaning you need to achieve high availability). However, I would say that PAD_INDEX is not perfect as implemented as I would rather set that value differently from fillfactor…

 

So, not a big fan of PAD_INDEX but feel it’s very important to set an APPROPRIATE fillfactor. (Review the webcast for more details on what I mean by “appropriate” fillfactor.)

 

Q: What would be the ideal settings [of fillfactor] for OLAP/DSS Databases?

To be honest, 100% is an ideal setting for a read-only decision support database, as new data is not causing splits. However, data loads (which build these reporting databases) can cause fragmentation. Ideally, you would drop your indexes, load your data and then rebuild your indexes upon completion. If you’re trying to reduce the amount of time the process takes AND you’re not loading in a lot of data (compared to what’s already there) then you might not want to drop and recreate indexes… Setting fillfactor in these cases will be based on how much data you will be loading compared with how much data you have. Set fillfactor to whatever that percentage is (if you’re loading 100,000 rows into a table which already has 900,000 rows then a fillfactor of 90 should have just enough room – IF the data is evenly distributed). If the data is NOT evenly distributed then you might want to drop this number a tad lower. AND – you only want to set this on indexes that are going to have splits. Remember, if your clustering key ever increasing and the new data coming is “appended” to the end of the table – both logically and by clustering key definition – then you only need the fillfactor setting on the nonclustered indexes. Regardless, it is still likely that you will end up with some fragmentation. Dropping the indexes, loading the data, and then rebuilding the indexes would certainly be better – but it will take more time. IF you’re loading a lot of data relative to what you already have then there’s NO question: drop the indexes first, load the data and then rebuild. When dropping indexes ALWAYS make sure to drop NC Indexes first and then drop the clustered index. When creating indexes ALWAYS make sure to create the clustered index first and then add the nonclustered. For more details on High Performance Data Loading check out the presentations on Gert Drapers’ site: www.SQLDev.Net, under Events. (FYI - There are all sorts of great presentations there!)

 

Q: When trying to determine the fill factor to set, would you recommend setting the same fill factor to all tables or just adjust the fill factor on the large tables and keep the small ones at the default?

I know I got this one

 

Q: Will Kimberly show the rebuild or defrag statement she used? I had to run to the restroom and missed that brief part.

I would have answered your question but I had to use the restroom myself… Seriously though, J the commands are:

            DROP and re-CREATE

            DBCC DBREINDEX (was added in 6.5)

            CREATE with DROP_EXISTING (was added in 7.0)

            DBCC INDEXDEFRAG (was added in 2000)

If I’m going to do a rebuild – I typically use DBCC DBREINDEX as it’s easier to automate and easier syntax in general. If my availability requirements don’t allow the table/indexes to be rebuilt because of locking, then you only have one choice for defraging – DBCC INDEXDEFRAG. I have a feeling that I cleared this up once you returned and certainly, you can also review the webcast on demand once it’s available!

Q: Regarding sp_RebuildIndexes, does the SP need to be created in the Master Database? Our companies Best Practices recommend against it. We use an “Admin User Database” for most shared stored procedures.

Sure, you can create the sp in any database (just make sure to modify the script) and then make sure to explicitly qualify the execution with the datasbasename.owner.procedurename and everything should work!

 

Q: What type of Defrag do the data maintenance plans use?

They use DBCC INDEXDEFRAG.

Categories:
Design | Events | Indexes | Resources | Tips

Abstract: In SQL Server, Performance Tuning can be achieved through many avenues: hardware, software, the database and the client. While there are many pathes that yield gains, finding the imporvements that give the greatest gain with the least impact on your current systems is a must. In SQL Server there is no other area in the product where you can get better performance gains than with indexes. However, knowing the right types of indexes as well as finding the right balance of indexes (e.g. Balancing OLTP and DSS requirements) takes strategy. Moreover, once created - indexes must be maintained. Even when properly created (and with the right strategies based on data usage patterns) indexes can lose their effectiveness over time due to fragmentation through inserts, updates and deletes. In this session you will learn the effect of data modifications on indexes and how to clean up the fragmentation. Did you know there are multiple ways? Did you know that each method provides different pros and cons in terms of maintenance and availability?

Prerequisites: Blog entry which has the prerequisite Webcast Resources: the link to the on-demand webcast from June 11, the pdf for that deck, the sample database and all scripts, the Q&A write-up and more here

PDF: So you can print and be ready for the webcast on Monday, July 19, 2004 here it is: IndexesRightBalance-Defrag.pdf (2.13 MB)

To Register: go here and DO NOT BELIEVE the abstract :)... I am NOT repeating last month's session this is a follow-on to that session which focuses on Index Defrag Best Practices!

See you Monday (and yes, I know! I still owe you a few Table Design Best Practices... it's been a long week already! :)

Categories:
Events | Indexes | Resources

Well, the person then asked if we could sitdown... I said sure...

We sat down at a table and they proceeded to rummage through their bag, pull out a bunch of paper (which turned out to be a single, very, large piece of plotter paper which was folded up) and proceeded to unfold it and smooth it out on the table...

It was their ERD - yes, all 500 tables (or whatever)...

They smoothed it all out (I was getting nervous as to the length of the still pending “question” :) and then here it came.....

How would you index this?

Literally, that was it. Here's my ERD - all 500 tables... what indexes do I need.......

How do I even begin? Database design (and indexing) are NOT something with an easy and single “right” answer. In fact, one of the reasons I love working with SQL Server is that there's almost always multiple ways to handle a problem - each having different pros and cons. So - understanding more than just the data is ciritical. For example, I could have noticed a customer table in their ERD and I could have said that an index to aid in customer lookups - by name - would be a good idea. However, if they responded with “well, we don't really use the customer name - this is our demographics analysis database,” then my recommendation would have been not only useless but a negative in terms of overhead, maintenance and therefore performance.

So - what's my point - there are hundreds of things that lead up to good design, optimal indexing, and effective procedural code... a design that gets better cache utilization and one that keeps the data pages compact, logically ordered or grouped, minimizes maintenance and has efficient lookups/modifications, etc... There isn't a single right answer in design.... and I think that's good! It means that you can design for the usage patterns you anticipate (within reason).

I promise - if there were a sqlserver /faster switch or a single silver bullet - I would tell you (well, I'd have you hire me as a consultant, surf the web and IM with friends for a week, and then set the switch on Friday :)....

Seriously, how does one design for performance?

You need to know the data and more important know how it's going to be used (which I realize might NOT be perfect - in fact, I've sometimes been surprised by usages patterns once something goes into production so you might not always be right). However, the more you know about usage patterns the better you will be able to design for them and proactively reduce the possibly negative impact of them (for example data modifications can cause fragmentation...which not only creates potentially excessive freespace but also makes the modification take longer to process - and well, there's so much more to this one...).

Having said that, over my next few blog entries I'll give you a bunch of ideas and best practices for which to look...

Categories:
Design | Indexes | Opinions | Tips

OK - I've requested that MSDN create a new “support webcast” page that has all of these links added to it (for a more complete archive/reference item) but it's not yet been completed.

For completeness - here are ALL of the resources:

So - I think that should do it. I hope you enjoy these resources. And - if you have comments, please post them to the specific blog entry for that subject area.

Have fun and I hope you'll be there in July and I hope to meet you in October!

Categories:
Events | Indexes | Resources

Index Management/Maintenance Questions:

Q: Is there a tool or method for monitoring a database for awhile to identify indexes that are rarely or never used – that should be considered for removing?

I think I mentioned this one late in the presentation but it certainly warrants an answer. Unfortunately, the answer is both yes and no. To start with the “no’s” – there’s nothing inside of SQL Server [today] that keeps track of index usage. One option is to create a trace and run it through ITW. ITW has an option on the first dialog that asks if you want to “Keep all existing indexes.” If you deselect this it will give recommendations to drop indexes that are not used by that workload. Be careful! You might drop an index that someone else needs. In SQL Server 2005, the ITW is being replaced by DTA (the Database Tuning Advisor) and it has a "drop-only" mode that allows DTA to only check for the unused indexes.

 

But – this still doesn’t really give you a list of what gets used, what doesn’t. I can certainly think of a way that will be expensive (in terms of profiler cost and your time J) to implement and given a bit of time this might be somewhat useful but not sure?! Here’s my idea, you could capture showplan information into a profiler trace, load it into a sql table and then look for your indexes by name (make sure to use the fully qualified name such as [Credit].[dbo].[member2].[member2Cl] as index names only need to be unique per table). You could even do an outer join with sysindexes to get the complete list of indexes related to the workload. It won’t be pretty but it can certainly be done and it’s still ONLY for the time of the workload. Overall, I think this is too expensive to do. Good news coming though… In SQL Server 2005 there will be a virtual system table that will keep track of index usage since the last server restart or since you reset the values.

 

Q: What is the best way to analyze both the space used and fragmentation of each table? More specifically, is there one procedure that will check all tables in a database, output the results and show me which tables/indexes are foobar?

I don’t think any particular command actually outputs the level of “foobar-ness” but… this one will get you very close. J Using DBCC SHOWCONTIG WITH TABLERESULTS returns almost everything you’re looking for. Use the “pages” column to calculate space ((convert(decimal(10,3), pages) * 8K)/1024 = MB), and then the AveragePageDensity (for internal fragmentation) and then ScanDensity, LogicalFragmentation and ExtentFragmentation (for different forms of external fragmentation). One of the best parts of having returned a tabular set is that you can put the data into a table, verify if/change it and even walk it to analyze it as well as run certain operations as a result of the information you learn. In fact, this is what my sp_RebuildIndexes script does. And check out DBCC ShowContig with TableResults.sql (another demo script) for the definition of the table. Both of these scripts can be found with the demo scripts: MSDNWebcast-Tripp-20040611-SCRIPTS.zip (73.19 KB). Additionally, you can find the scripts on www.SQLskills.com under Event resources.

 

Q: Does SHOWCONTIG keep a table lock? and I was told not to use DBCC SHOWCONTIG on a production server as it can adversely affect performance… Is that true?

First, yes, there is a shared table level lock acquired for the length of the scan. If you run DBCC SHOWCONTIG with the default options that scan will not only scan for external fragmentation but also calculate internal fragmentation. To reduce the time for which these locks are held you can use the WITH FAST option on DBCC SHOWCONTIG. This will ONLY do the scan (returning less fragmentation details) but it's good if all your looking for is Scan Density! So – this can create blocking and you should consider running it off hours OR consider running it on another server where you restore a backup of your production database. All fragmentation that exists when the database is backed up - will be restored since the restore does not physically change any data/pages, etc.

 

Q: Could you tell us how FILLFACTOR affects performance?

The short answer is that it helps IMPROVE performance by leaving free space on the pages when the index is built, rebuilt or defragged. When building or rebuilding the index you can specify (using FILLFACTOR) the percent fullness for each page. By leaving free space you can minimize splits, reduce fragmentation and therefore minimize the need to defrag as frequently. So – you should set the fillfactor based on the frequency with which you can rebuild… Combining the frequency with which you can rebuild and the rate at which your table becomes fragmented between rebuilds you can start to find the most optimal setting. The more fragmented it becomes then you should lower the fillfactor (on the next rebuild)... and if the table doesn’t become overly fragmented (and there’s still free space by the time of the next rebuild) then you might want to increase the fillfactor... so over time you can find the most optimal setting... but it does take a bit of work! Start with the “monster” tables first – the ones that give you the most problems and the ones that are the largest! With a bit of time you’ll have a more consistently balanced system.

 

Q: On a very large table (32 GB, 20 mil rows) with a clustered index and 10 non-clustered indexes – where would you start? If the length of time it takes to execute is long and you can only reindex or defrag a few indexes per day, which ones would you do and in what order?

Well, this is a great question! But – I’d like to change the focus a bit; if you focus on minimizing fragmentation then that will in turn keep things more optimal. Minimizing fragmentation in the clustered index is done by creating the recommended type of clustering key (unique, narrow, static and ever-increasing). If you meet all of these then the base table should only be getting fragmented due to updates to varchar. If you have lots of these then setting a fillfactor can help. The fillfactor defines the amount the pages should be filled to when the index is rebuilt or defraged. If your index is becoming very fragmented between builds then you should consider increasing this frequency or lowering the fillfactor. So – this is really what you should look into: verifying that you’re setting the fillfactor appropriately between rebuilds/reindexes and then adjusting that schedule appropriately!

 

But – when you still have a lot of fragmentation and still need to do frequent rebuilds – focus on the most important tables first, then the most important indexes first (i.e. start with the clustered). From there it’s hard to say but the wider indexes are probably getting used more so I’d look at those next.

 

And – one other tip – if some of you are getting a lot of fragmentation due to updates, think about: fixed length fields (which is rarely an option), default values (GREAT choice if the updates come in with no value and later get updated) and rebuilds either more frequently or with a more appropriate fillfactor.

Categories:
Events | Indexes | Resources | Tips

Index Creation Questions:

Q: In an OLTP server do clustered indexes create a negative performance impact to the constant inserts and updates?

The wrong clustered index can for sure… In most environments, having the right clustered index (an index on an ever-increasing key) will be better than a heap (a table without a clustered index) and much better than the wrong clustered index (a clustered index that creates wildly random access patterns for insert – i.e. LastName). So – the idea with the clustered index debate (as I often refer to this as J) is that the right clustered index balances an improvement in performance for inserts (by minimizing splits as well as having the needed pages already in cache – and by needing less cache overall) with keeping the table more compact (i.e. less fragmentation). However, if you have a high volume of inserts (typically more than 400 per second) then you might want to add additional hot spots (cluster on a composite key like “state, id” which creates as many hot spots as you deal with states) to better balance this increased volume. I still wouldn’t go with something random (like name) and if the table has a lot of scans/reads I certainly wouldn’t go with a heap. You do have to be careful with updates though – if there’s a high volume of updates and the updates are to varchar then there are a few things I’d think about: fixed length fields (which is rarely an option), default values (GREAT choice if the updates come in with no value and later get updated) and rebuilds with a more appropriate fillfactor.

 

Q: Is a unique key the same as a primary key?

Almost… SQL Server has two ways of enforcing entity integrity: the Primary Key (of which there can only be one per table) and Unique Key(s) (of which there can be many – 249 is the maximum number of non-clustered indexes so unique keys and other non-clustered indexes combined can be no more than 249).

 

Q: I thought you said the primary key constraint automatically gets a clustered index, how can you make it non-clustered key?

When specifying index-based constraints (Primary Key and Unique Key) you can specify index type. The default index type (when not specified) is Clustered for the Primary Key and non-clustered for a unique key. Here’s complete syntax for adding these keys to existing tables.

 

ALTER TABLE dbo.Employee
  ADD CONSTRAINT EmployeePK
    PRIMARY KEY CLUSTERED (EmployeeID)

 

ALTER TABLE dbo.Employee
  ADD CONSTRAINT EmployeeSSNUK
    UNIQUE NONCLUSTERED (SSN)

 

In either case you can CHANGE the underlined part of the syntax to change the index type.

 

Q: How is a unique key created? Explicitly or is it Implied via Indexes?

Well, I have to admit I’m not entirely sure of the question here but I think I do… When you create a unique key constraint SQL Server enforces that uniqueness by automatically adding a unique non-clustered index to the table.

 

Q: Are GUIDs as efficient as identity for the clustering key?

Not really. They do meet the criteria of being static and unique however, they’re not as narrow (they’re 16 bytes) and if the newid function is used to populate the values (which is the common approach) then the values are not ever-increasing. With the values populated by the newid function you can end up with a lot of fragmentation quickly. There are alternatives to the newid function for generating the GUID value as an incrementing value through a windows call. Gert Drapers has an extended procedure to help you do this on www.SQLDev.net. I’m not sure of the specific link but I will update this here when I do!

 

Q: When you say to avoid a “volatile” clustering key, do you mean the data value actually changing/being updated?

Yes! Since the clustering key is in EVERY nonclustered index you want to choose it wisely and make sure that it’s relatively static. If it changes that will require all of the redundant versions of the value to change as well…

 

Q: If you create an identity column in a table just for the sake of creating a clustered index on it, how does this improve performance (especially if most of the queries use other fields in the table)?

Another great question! Remember two things: the internals of SQL Server rely heavily on the clustering key (even if you don’t) and if the table stays more compact (by not becoming fragmented) then queries [especially large scan queries] improve. So – it’s a better balance. Now to add a bit more – in many cases having the right nonclustered index can yield better performance for most queries anyway!

 

Q: You said in the power point slides that generally a few wide indexes are better than several narrow; However, later you said that one narrow clustered index on an id column + non clustered indexes is the way to go. So which is it? Or did I not really understand?

You’ve got it – it’s a combination of the two. The clustered should be narrow/static/unique but non-clustered indexes tend to be more effective when they are slightly wider (i.e. they contain a couple to a few columns rather than just one) so that they can service more types of query requests.

 

Q: Do I need to change my indexes if I use the like operator? I noticed that I get Index Scans instead of seeks when I use LastName like '%johnson%' v. LastName = 'johnson'?

Well, this one is kind of mixed. But – let me give you something to think about (which might help)… Think about the phone book for a second – if you have someone’s last name it’s pretty easy to find them. However, if you only know they have “johnson” somewhere in their name it’s a lot harder, right. This could yield ‘Cajohnson’, ‘Ejohnson’, ‘Smith-Johnson’ and so forth. So – there’s really no effective way to get at that data… SQL Server will always need to scan for that type of request (now I could add that covering indexes might help but quite honestly I think there’s a better option here). SO - changing index strategy really won’t help. But – I often see applications that just automatically add the % before and after the requested data. If this is the case, can you ask the user if it’s a real value or a pattern value? Or can you even allow them to supply the exact request? When the value is specific (i.e. Johnson) then the query will be fast, when the value is pattern matching (i.e. %johnson%) then SQL Server will need to scan and the query won’t be as fast – but it will be correct! If you want to search searching for more wildcards and more complex pattern matching into descriptions, etc. then you’ll probably want to look into Full-text Indexing.

Categories:
Events | Indexes | Resources | Tips

General Questions:

Q: If you want to know the value of the key prior to your insert statement, how can you use identity?

You could insert a “place-holder” row (i.e. a row that uses only defaults and/or just basic information so that you can get the @@identity of the row) and then come back later to update the data values. However, be very cautious with this approach. If you insert a lot of NULL values and have a very narrow row inserted then when you come back to update it (and therefore widen the variable width columns, if any) you can end up creating a lot of fragmentation. To minimize this, make sure to use default values (i.e. DEFAULT constraints) to pre-allocate space and reduce fragmentation.

 

Q: Can you give me a brief definition of a doubly-linked list?

Speaking directly to the “leaf-level” of an index the doubly linked list refers to the pages that contain the index data – in indexed order. Since there’s likely to be more than one row per page and since an index implies order; the pages are “linked” to provide that order logically.

The NIST (National Institute of Standards) also has a series of “data structures” definitions and doubly-linked list is here.

 

And for a bit more about SQL internals – SQL Server 2000 (actually 7.0 and higher) uses 8K pages. Each page has a 96 byte header – [a very small] part of which is used to store this previous-page and next-page page pointers. As for how many rows SQL Server will store; that depends on the width of the row. You can divide 8096 by your average row size to get an idea of how many rows you will be storing and you can use DBCC SHOWCONTIG (‘tablename’) WITH TABLERESULTS to see information columns labeled MinimumRecordSize, MaximumRecordSize, and AverageRecordSize for more specific table-related values. There is a maximum limit of 8060 bytes for a single inserted row.

 

Q: Well, this is a live meeting question and I should have reminded you all! But - for future reference… How do I see the demo in full-screen mode?

Press Control + H when the webcast begins a demo.

 

Q: Could you please give an example of a foreign key constraint?

A foreign key constraint is used to enforce referential integrity between two columns of the same or different tables. For example, you might create two tables: Employees and Departments. Because each employee can only be in one department, you can enforce this relationship with keys. First, make the DepartmentID (of the Departments table) a Primary Key and then make the DepartmentID column of the Employee Table reference Departments. Here’s a small snippet of pseudo code.

 

There are lots of good references out there on database design and constraints and I’ve heard good things about Database Design for Mere Mortals by Mike Hernandez. And, here’s an article on MSDN titled: Implementing Referential Integrity and Cascading Actions by Itzik Ben-Gan.

 

Q: Why does SQL Server allocate up the all of the memory? Even if I stop the service in short time it’s the same?
Since SQL Server is typically run as a dedicated service in production environments, SQL Server typically allocates memory to have it available when it needs the memory. If it detects memory pressure SQL Server will release those pages. If you’d like to configure memory for a specific max server memory and/or min server memory, you can check out the books online as well as this MSDN article titled: Inside SQL Server 2000's Memory Management Facilities by Ken Henderson.

Categories:
Events | Indexes | Resources

Tools Questions:

Q: Isn't there some issue with profiler where it does not recognize DBName but only the DBID?

Well, this is a good one – and a frustrating one as well. It is true that Profiler doesn’t always generate a data column value for every type of event. So – if you setup filters you will only filter rows which have a value for that data column. If you ONLY want to see things with a very specific value it can take quite a bit of time to remove all of the uninteresting events to narrow it down to only what you need. In the SQL Server Books Online there is a topic for each category – for example: Stored Procedures Event Category has a listing of all of the data columns that are produced for each event. This is a bit tedious to go through though and there’s a nice full html version that’s available on Gert Draper’s website: www.SQLDev.net. Checkout this link specifically: http://sqldev.net/misc/SQLTraceEventMatrix.htm

 

Q: Is it true that ITW is beneficial because it can predict how the optimizer will benefit most from the indexes it recommends? Sometimes I find that the ITW recommends statistics that hinder the query performance. Do you feel comfortable using it for the most part?

Ok, so there are really two questions here – first “how does ITW work?” and second “is it always a good thing?”

 

As for how it works – yes, you’re correct… ITW hypothesizes about indexes by creating statistics about your data, looking at the queries in the workload and then based on how the optimizer would optimize – it tells you which indexes would be best serve your data!

 

As for whether or not it’s always a good thing… the first thing I’ll jokingly say is – Is there ever an ALWAYS good, good thing? J Joking aside, there is some truth to this (although rare). I do feel comfortable using ITW and because the statistics are based on real data, not updated if they’re not used (there’s a good whitepaper that explains statistics here), so in general, statistics don't usually hinder query (or server) performance. However, having an extremely large number of statistics and indexes can increase optimization time (in analyzing all of these choices) and this may in fact be something you’re seeing.

 

Q: When statistics exists on some of my client tables, SQL Server sometimes chooses to use an index that is less beneficial (even after we update stats with full scan). In some cases, we've had to force the index using optimizer hints. As a result, we disable "AUTO CREATE" and only create indexes when performance issues arise and are needed. Granted this isn't the best method but until the tables become really large we find that some queries work fine.

Ok, well there’s a lot to this one. First, I’d say that there’s a difference between just having statistics and having the right indexes. There are some cases where ONLY having statistics isn’t good enough to give SQL Server enough information to proceed and instead they make incorrect decisions. It’s generally when SQL Server only has statistics for individual columns and doesn’t have better information for a combination of columns (wow, this is hard to describe in a short Q&A). But – I’ll sum it up to say that statistics aren’t always perfect – some assumptions may need to be made and sometimes those assumptions don’t work out because the data isn’t evenly distributed… So, if the indexes that exist are not "perfect" for the query SQL Server ends up not doing as good of a job as it might if it had the right indexes... If it helps, I have a class where I get into more details on statistics (a whole module on just statistics J) that explains a lot more details... It's hard to sum up in a sentence or two BUT I would guess that a bit more tuning (i.e. more real indexes and not just statistics) might help solve the problem... Really, hard to say in a quick chat but that really shouldn't be the majority of the time.

Categories:
Events | Indexes | Resources | Tips

Related Resources:

Q: Can I view a recording of this webcast?

It should be available on www.microsoft.com/usa/webcasts/ondemand shortly.

 

Q: Where can we get the powerpoint file?

The powerpoint will be available on the MSDN webcast resource page. I will post the exact link here and on SQLskills.com as soon as the final link is available.

 

Q: Where can we get the demo scripts?

The demo scripts will also be available on the MSDN webcast resource page but since it’s a small file I will also place them here MSDNWebcast-Tripp-20040611-SCRIPTS.zip (73.19 KB). Additionally, you can find the scripts on www.SQLskills.com under Event resources.

 

Q: Where can we get BPA – the Best Practices Analyzer?

The BPA is on the front page http://www.microsoft.com/sql and the download is specifically here.

 

Q: Can you recommend a good source for information on optimizing full-text indexes and searches?

Start with the Books Online topic: Full-text Query Architecture and then make sure to check out this MSDN article by Andrew B. Cencini titled: Building Search Applications for the Web Using Microsoft SQL Server 2000 Full-Text Search.

 

Q: Do you speak in other events for more advanced tuning topics?

Yes, in fact, SQLskills has a new event coming up in October that’s a 5-day event on performance tuning on SQL Server. We’ve just announced this event in June and for those of you who become subscribers on SQLskills and sign up by July 1st you’ll receive the lowest early bird cost for the event. We’re calling these events “Immersion Events” as they’re focused, indepth and intense! Full details are on www.SQLskills.com.

 

Finally, here are all of the resource links I recommended in the resources section of the slides:

 

Q: Does Kimberly L. Tripp have a blog?

Yes, www.SQLskills.com/blogs/Kimberly and if you’re reading this on my blog…well, then you already know! J

Categories:
Events | Indexes | Resources | Tips

Theme design by Nukeation based on Jelle Druyts