Data Dude (Visual Studio "DBPro") has generally been an expensive tool - only in the "Team Systems" Edition of Visual Studio. But, that's all going to change in VS 2010. Here's a picture of the new editions and which components of Data Dude are where:

This is a HUGE improvement (after the clarification isn't not quite as huge but still VERY useful that the tools are in different editions) and is going to allow many of you to use/leverage these key features (such as Data Generation and Static Code Analysis - those are some of my favorites :)) in smaller shops - without having to purchase the full-on Team Systems Edition of VS. If you're interested in more info - check out the following:

CLARIFICATION!!

In 2010 “Team System” goes away (as a brand), and it's been replaced with VS 2010 Ultimate and Premium (VS Pro is still there). Basically VS Premium is similar to VSTS Development Edition + Database Edition (and is priced the same). Ultimate is similar to Team Suite. 

The functionality didn't really move down the product line (or the price point), the big different is that there's more than only DB Pro tooling in there now!

Thanks to Doug "D7" Seven for setting me straight!

Enjoy!
kt

Categories:
Tips | VS DBPro

A couple of weeks ago I wrote a blog post titled Whose job is it anyway? It's an interesting debate and something I've been hearing more and more - that SQL Server is a "set it and forget it" technology - a black box where you just don't need to know how it works to do well with it. In fact, I've even had a few folks comment that they think it would be better to "roll their own" database rather than have to learn how to work in a "general purpose" database. And, while there are certainly lots of different angle to this debate - one fact remains... if you don't know anything about the database on which you're developing (whether it's SQL Server, mySQL, Oracle, whatever), I *PROMISE* you won't have a truly scalable, optimal solution. Why do you think there are so many knobs? It's because there are so many different ways to work with data. There is more than one way to query, more than one way to design. This is also why every answer to a "how should I do this" question starts with "It depends". And, while that seems like a scary response it's actually a good one. It means that you have lots of options - options that can offer many different pros/cons. And, as a result of knowing these pros/cons, you can make better decisions - decisions that will ultimately determine how well you can scale.

So..... while I don't think this debate will EVER be finished (as to WHOSE job it is to know these things), I do think a lot of folks are seeing the effects of not knowing more about their store (and, again, this is NOT limited to SQL Server in any way, shape or form).

At a minimum, hear the discussion on RunAsRadio with Richard, Greg and I and let us know what you think!

Kim Tripp on the Roles of Developers and DBAs with the Database!

Cheers,
kt

I started the series here: http://www.sqlskills.com/BLOGS/KIMBERLY/post/Spring-cleaning-your-indexes-Part-I.aspx and I want to continue with Part II today by clarifying some great questions/comments that have come up on the series. In Part III, I'll give you a few more ways to get rid of (or consolidate) indexes. And, I think there's still a bit more that Paul and I will investigate further (wrt to operational stats) but, I want to address a few comments and a few interesting things that both Paul and I have found.

In the Part I post, I talked about using sys.dm_db_index_usage_stats to see if there are any indexes that just aren't being used at all... A few comments asked why I didn't use operational_stats instead. To address that first, there are a few key differences:

  • dm_db_index_operational_stats is persisted only as long as an object is in cache (however, it's not cleared when objects are forced out of cache with DBCC DROPCLEANBUFFERS). If you want to clear ALL DMVs for a specific database, then a relatively easy way to do this (IN TESTING) is to take the database offline and then immediately bring it online again.
    • ALTER DATABASE <dbname> SET OFFLINE
    • ALTER DATABASE <dbname> SET ONLINE
      • NOTE: If there are any suspect files, you will NOT be able to bring your database back online without FIRST taking all suspect files OFFLINE. And, if you take a FILE offline then it's even more important to know that THERE IS NO WAY TO BRING A FILE ONLINE without restoring it from backups. So, it's VERY important to understand that OFFLINE/ONLINE for a database is really easy IF AND ONLY IF there are no other problems with the DB. You really need to resolve those problems first (or at least know that you're going to need to resolve those problems later through backup/restore) before you take a database offline.
  • dm_db_index_operational_stats is (from BOL) neither persistent nor transactionally consistent. This means you cannot use these counters to determine whether an index has been used or not, or when the index was last used. For information about this, see sys.dm_db_index_usage_stats.

Having said that though, none of these are really any guarantee of perfect information. And, they're not meant to be. I look at these DMVs as being a quick and easy way to get some relatively descent insight into what is or is not happening in my environement. However, even though the sys.dm_db_index_operational_stats might give you insight that you have a problem it still doesn't give you good insight into exactly what that problem might be. For example, it *might* be splits that causes some of your wait times to increase (column: page_io_latch_wait_in_ms) but, it could be something else too (some other system issue).

The main point, you can use these to get insight into which tables have the biggest problems (i.e. the biggest waits) and where they might have a lot of splits (column:nonleaf_allocation_count) but, in all honesty, that's not a guarantee. In fact, the reason I said "might" is that pages that are allocated at the end of the leaf level STILL allocate a page and require an entry to be made in the next level up in the index. So, a lot of nonleaf_allocations COULD be for a perfectly unfragmented index. So, it still doesn't tell you how fragmented the objects are or what the REAL problem is (or even if it is a problem yet).

Basically, these just tell you where you have the most activity and give you a starting point for problem solving. But, none of these (usage or operational) really tell you how to solve the problem. However, sometimes even knowing where to start IS the problem in and of itself. So, I'm not against these DMVs and I really do think you can get some good insight from them. Just use them as a tool to help focus your investigations. Use better tools like sys.dm_db_index_physical_stats to really see if you have fragmentation and where it's the worst.

hth,
kt

First and foremost, happy spring! I truly hope we're on the path to summer (although who would know it here - we're in Florida for SQL Connections and the weather is a bit chilly and it's been raining off/on today - I hope this is short term (no, I don't want to look at the forecast as I don't want to jinx it :)). But, wherever you are - I hope you're on your way to nicer weather and minimal cold weather (ok, I guess I only have warm wishes for the northern hem... for you southies - I hope your fall is lovely!!).

But, for everyone - now's a good time to start thinking about cleaning out the [non-literal index] closet... and getting rid of some of those dusty indexes?

Why/when should you get rid of some of your indexes?

  1. It's possible that some of them aren't being used at all...
  2. Especially when they're not being used but even when they're "redundant" (or minimally useful) they're still costing you in many ways:
    1. Wasting space on disk
    2. Wasting space in memory (well, if they are being used then they're cluttering up your cache)
    3. Wasting space/time in your maintenance routines (so, here they're cluttering up your cache for sure!)
    4. Wasting space in your backups
  3. You might be able to reduce your overall indexes with index consolidation...

So, for this post, I'm going to target #1 - are there any indexes that just aren't being used at all...

First, how do you know if your indexes are being used?
In SQL Server 2005 and higher, there's a DMV (dynamic management view) called sys.dm_db_index_usage_stats and it's there to track index usage patterns. However, it's not persisted since the beginning of time and as a result, if you look at this and believe that it's telling you ALL of the indexes that have been used in your database - then you might be mistaken. The index usage stats DMV is cleared when SQL Server is restarted as well as when you detach/attach the database or when you backup/restore the database.

Therefore, you don't want to just run the following query and drop all of the indexes that aren't being used. A better way to "trust" this information is to periodically persist the data from the DMV in your own table and then query it after you've completed a business cycle's worth of activity - logging all of the usage stats. Then you can trust this much more. Again, here are a couple of negatives:
1) it's not persisted
2) it only keeps the database_id and the index_id (which could change over time). You're right in thinking it probably shouldn't change but, a nonclustered index's ID is not permanent so, it's better to track the index name in addition to the index_id. And, when you run your queries to determine what to delete, you can easily verify indexes against the current indexes because your comparison is within the same db (more on this below).

If you want to persist this, then you have two ways to do this:

Store the index usage patterns in a table within the specific database you're tracking:

Pro: it goes with the database when you back it up, etc. and, it's easier to reverse engineer which actual indexes you're referring to (grabbing the names and not just the IDs).

Con: it's a bit more complex of a query to run and you'll need to run it for all of your databases (ok, it's really not all that bad - but, using something like sp_msForEachDB will really help)

Store the index usage patterns from all databases in a table within master or your own "performance database":

Pro: you only need one job to handle all the index usage info AND object_name *does* support TWO parameters (object_id and database_id) so, as long as you trap the name at the time of insert then you'll be good.

Con: it doesn't go with the database (e.g. backup/restore - and if you're restoring to a test system and you want to see what the usage patterns were then you'll need to get this information as well...)

Here's a simple query that you can run that shows all the indexes used right now - and adds the databasename/objectname into the results - in a persisted table you'll also want to add the runtime:

SELECT getdate() AS RunTime
, DB_NAME(i.database_id) as DatabaseName
, OBJECT_NAME(i.object_id, i.database_id) as ObjectName
, *
FROM sys.dm_db_index_usage_stats AS i
WHERE object_id > 100

And, if you want to get a few more insghts into how to persist this on Paul's blog here: http://www.sqlskills.com/BLOGS/PAUL/post/Indexes-From-Every-Angle-How-can-you-tell-if-an-index-is-being-used.aspx.

OK, so, you have a few options to think about and I have a few more parts to post!
kt

PS - The Tipping Point is coming soon too. I'm still adding a few things to that one!!! ;-)

OK, I was tagged by Ward Pond here. Ward was tagged by Jimmy May here. Jimmy was tagged by Kevin Kline here. Kevin was tagged by Chris Shaw here. Chris was tagged by Michelle Uford here. And, Michelle was tagged by Mike Walsh here.

Ah... at least I know who to blame. Ah, thanks Mike! ;-)

So, as for some things I know now that I wish I knew then and/or what I value the most. First, I took a peek at a lot of the other posts. There are some really good tips and tricks in all of them and so it's hard to come up with something really clever. But, I had to admit the following:

You don't know what you don't know..

Yeah, I know. Then you don't know. So, does it matter?! Actually, yes. There have been times where I wished I had listened better before I jumped to what turned out to be the wrong solution. Sometimes taking more time at design - can REALLY help. This is something that I've been working really hard to prove in a lot of my recent posts. We've all thrown together a schema... got things working... and then found out it wasn't ideal... But, how much can you really know if you don't know the workload? Well, this is where the listening part comes in. Talk to your users (as painful as it may be ;). Talk to the stakeholders. Even if you can't figure out the specific queries that are going to run, give them sheets of paper on which they can draw their own dialogs and tell you what they'd LIKE to see (tell them to visualize what they want to see on the screen - on the paper). This is an ideal way of getting to the meat of what they want/need - without getting lost in the "glitz" of the UI. Even if you only have some insight into what they'll be doing - some is WAY better than none and the more you listen - the better you'll design. The better you design; the better it will scale/perform.

You can't know it all!!!

Give up now. Don't try to know everything - you'll never sleep. I work really hard knowing a lot about a lot of different things - but, I still specialize AND, I'm HAPPY to say "I don't know" (and say it often). In fact, what I pride myself on the most is that I'm never afraid to prototype and/or try something out. One of my favorite sayings for SQL Server is: "Jack of all trades, master of SOME." Know enough to know you need to learn more (because you plan to use that feature)... And, know enough to know when a feature is NOT appropriate. But, don't get lost in all of the minutia. We just don't have time for it. Choose your battles.

And, since I'm starving and I know when to quit. ;-) I'm going to tag a few folks including a couple of NON-SQL folks and hear what they say!! And, Paul was also tagged - check out his post here: http://www.sqlskills.com/BLOGS/PAUL/post/Things-you-know-now.aspx.

And, I'd love to hear what you know now!

Thanks,
kt

Categories:
Opinions | Tips

When you decide to rebuild or reorganzie an index, you have an option to set something called FILLFACTOR. This is probably the MOST IMPORTANT thing to understand about index maintenance and reducing fragmentation (especially in databases that are prone to it). Unfortunately, we need to dive into some internals to really understand why this is so helpful...

What is FILLFACTOR?

FILLFACTOR defines how full the leaf level pages of the index are filled when rebuilding or reorganizing an index. The leaf level of an index is the largest structure of an index and it's the most prone to fragmentation. A good analogy would be to use the phone book (ah, loosely for this one) and imagine that new people come to the area. The *entire* phone book needs to get reprinted. Obviously this is terribly expensive and therefore it's not done often (well, we all use the internet anyway...). But, imagine if EACH page of the phone book were only filled to 90%. When new people come to the area we could just add their information - on the specific page where it should reside - and the phone book stays relatively intact and we don't have to kill a bunch of trees (OK, you could argue that you would have had to kill more trees to make a phone book that has 10% whitespace...but, let's not go there right now. :). OK, I know this analogy is weak at best but think back to "a page in the phone book" and think of it representing a "page" within SQL Server (quite literally here - SQL Server actually stores data on 8KB pages and indexes are doubly-linked lists of these 8KB pages).

So, if the pages are full and there's nowhere to put a particular row (let's say someone with a lastname of 'Tripp'), how does SQL Server "insert" this new row? They can't possibly shift all of the data down and rewrite all of the pages AFTER 'Tripp' could they? I suppose they could but it would get REALLY expensive - especially if someone with the last name of 'Anderson' moves to town...

How does SQL Server make room for new rows? They do something called a page split. Now, this can be A LOT more technical than I'm going to explain here but simply put, they split the page (on which the row is to be located) into half. The first 50% of the rows STAYS on the current page and the 50% MOVES to a new page (and this new page is probably nowhere near the page being split). However, they keep this list LOGICALLY in order by maintaining a doubly-linked list. So, SQL Server knows where to go when it's reading the data - even if the pages are not physically/sequentially in order. If this happens once or twice, it's not a huge problem. But as it continues to happen, it makes scans of your data more and more expensive and the process of the split can be expensive in and of itself. The biggest problems are:

  1. The data is not physically in order (and can create lot random IOs when scanning)
  2. The pages are no longer very dense (they've gone from 100% to 50%). Splitting does has a positive side in that it's going to take awhile before the page splits again BUT, splitting is meant as a temporary fix and ideally, it shouldn't be happening to every page. But, this also negatively impacts cache - you're only using 50% of the page that's split so now you're wasting half of that page in memory (and, on disk). So, while making a table 10% wider from the start seems bad; it reduces the need for splitting and therefore your table will stay more intact over time. 

What indexes have fragmentation?

Some indexes are prone to more fragmentation than others. For example, indexes with a high-order element (the first column of the index) that is NOT in ever-increasing order. This does NOT mean that ALL indexes should be created such that their first column is ever-increasing (that's NOT a good way to create indexes!). But, it does mean that you should pay special attention to some of your larger tables and therefore larger indexes (meaning ESPECIALLY the clustered index since the clustered index is over ALL of the columns). 

In terms of index management, here are the main things to think about for each index type: 

Rebuilding the clustered index is the most expensive index to rebuild. It's also the one that you'd rather avoid if you can. It has the most limitations... A clustered index CANNOT be rebuilt as an online operation IF the table has ANY LOB columns in it at all. What this means is that the clustered index - if fragmented - must be rebuilt using an offline operation. This translates into downtime for that table - while it's being maintained. So, if you can create a good clustering key EARLY, then you might be able to minimize clustered index maintenance.

Rebuilding a nonclustered index is much easier and likely to be an ONLINE operation. Because nonclustered indexes contain a much smaller number of columns - and rarely ever include a LOB column - they can be managed/maintained through ONLINE index operations. So, even if these are prone to more fragmentation (and, in general, they are) then it's not as big of a deal.

How do I solve this problem?

So, the BEST thing to do is to visually picture the ORDER of the data in the index (just think of it like a phone book - a LIST of records on PAGES - ordered by the KEY of the index). If you can visualize that structure then you can visualize what happens on an INSERT, UPDATE and a DELETE. Once you have a good visual you should be able to define a FILLFACTOR based on these two things:

  1. How much fragmentation is happening (is it becoming massively fragmented quickly) - then set a lower fillfactor (like 70 or 80 - depending on how big the table is)
  2. How frequently are you going to be running maintenance (if you can do this daily or weekly - for some objects) then you might not need much free space. If you can run maintenance against the table frequently then you won't need as much free space.

And, of course, you can directly check the level of fragmentation your table has by using the DMV: dm_db_index_physical_stats (which I referenced in Part I) and ideally, you'll create a procedure that runs through your objects determining how much fragmentation they have and the appropriate FILLFACTOR. Sometimes the best way to do this is to create your own "management table" which lists all of your tables and then have definitions for when to rebuild, when to reorg and what fillfactor to use. There are a few folks that have done things like this. Check out Ola's site here. He's put a lot of effort into a more robust and flexible combination of options for database maintenance. It might be more than you need but it's a GREAT EXAMPLE of all of the things that you can evaluate and consider wrt to maintenance.

Solving this using Database Maintenance Plans

Finally, if you CHOOSE to use a maintenance plan you ONLY have two options:

Reorganize pages with the default amount of free space
Does NOT sound like what it actually does (IMO). It sounds like it uses some form of SQL Server default value. However, it actually uses the "OrigFillFactor" value that's set for the index. This is actually a good thing IF you've set the FILLFACTOR on the CREATE INDEX statement OR on a rebuild. Reorganizing does NOT allow you to set this ONLY a rebuild does. So, if you have set this then this choice is probably best. If you run a query against sysindexes (yes, the backward compatibility view NOT the new sys.indexes catalog view), then you can see the OrigFillFactor setting for ALL of your indexes. (And, it's probably 0 - which sounds strange... 0 actually means 100... I could explain this but just trust me on this one.)

Change free space per page percentage to _____ %
Well, this is certainly an option you can consider. If you've never set the FILLFACTOR (or, even if you have...), this would overwrite it AND change the OrigFillFactor value to the percent you use here. If you decide to take this "sledgehammer" approach then I'd suggest something quite high - like 90-95% so that you don't end up wasting a lot of space. Some indexes (ones that have a high-order element that IS ever-increasing) don't really need free space and this might be completely wasted. So, I think it's better to set this indivually using stored procedures and/or a more custom maintenance procedure.

NOTE/TIP FROM COMMENTS POSTED: While I explain this concept (above) using the index option/setting (called FILLFACTOR) the UI is wanting the FREE SPACE percentage. So, if you want to use 90-95% FULL then be sure to set this value to 5-10%. Using the syntax, you will set page fullness.

OK, I still have at least 2 more in this series and EVEN THOUGH I've been completely sucked into facebook (as Paul blogged here), I'm still trying to manage to work and save SQLskills. Paul, on the other hand, just uploaded more photos. ;-)

Cheers,
kt

YES!!!

OK, well, I guess I should be more specific because as in most things in SQL Server, the real answer is "it depends". And for these two options, it depends mostly on your SQL Server version. Since SQL Server 7.0, the way that auto update works, has changed (much so for the better!!). So, if you're in SQL Server 2005 or SQL Server 2008, I would say most definitely - leave these ON (or if you turned them off - turn them back on!!!)! If you still have problems with a specific index causing your grief, then turn off auto update at the index level NOT at the database level. To turn of auto update at the index level use STATISTICS_NORECOMPUTE in the index (or statistics) definition.

Now, as for why (and how!) this has changed over the versions... here we go:

SQL Server 7.0

  • Invalidation: Statistics were *invalidated* when a row modification counter (sysindexes.rowmodctr) was reached. This meant that they could not tell where the modifications were occuring and, if modifications were somewhat isolated to a specific column ALL of the statistics for the TABLE would be invalidated (so, statistics could be invalidated earlier than necessary)
  • Updating: Even worse, in SQL Server 7.0, when statistics were invalidated, they were immediately updated. This caused two problems - thrashing at the time of invalidation because all of the stats needed to be updated AND two, if the statistics were not used for awhile then extra work was involved to update them and by the time they were used, they might already be somewhat out of date already.

SQL Server 2000

  • Invalidation: Statistics were still invalidated based on a row modification counter.
  • Updating: SQL Server 2000 fixed the "updating-potentially-too-often" problem by only updating statistics when they were needed.

SQL Server 2005

  • Invalidation: The biggest changes were introduced in SQL Server 2005 where they decided to NO LONGER user the sysindexes.rowmodctr and instead use an internal (and undocumented) columns specific modification counter. Now, statistic invalidation is more isolated to only those columns which are heavily modified. This internal/undoc'ed column is sysrowsetcolumns.rcmodified and can only be seen when connecting to SQL Server using the DAC (Dedicated Admin Connection).
  • Updating: Updating didn't really change but, SQL Server 2005 added "Aynch Auto Update" for statistics so that when the QO (query optimizer) encounters an out-of-date (i.e. invalidated) statistic, they can "trigger" the update but not wait for the update (meaning that they'll optimize using the out-of-date statistic). This can be both positive (faster) and negative (might not be the best plan if the statistics have changed drastically). It is off by default and IMO, I'd leave it off in most cases but if you find that auto update events (which can be Profiled) are causing you grief, then you can turn this on at the database level.

SQL Server 2008

Nothing new except "Filtered Statistics" and these are interesting as the density vector is still relative to the table (not the predicate) but the histogram is just over the predicate (OK, I know I'll have to blog a lot more about this one!). Anyway, I'm still playing/learning a lot more about these and they make the most sense with filtered indexes (as opposed to just a filtered statistic) but, just like statistics on secondary columns you will also potentially want statistics on the secondary columns of your indexes. The next question is should they have a filter or not. I've found that sp_createstats doesn't seem to create statistics with filters and I'm going to need to do some testing here but I think statistics with filters (filters that match the non-clustered index) should help to make the stats better (and even allow better usage of filtered indexes) but, I'm really going to need a bunch of time with this - and another post :). As for auto create/auto update - no changes there!

Long story short, if you're using SQL Server 2005 or SQL Server 2008, you should leave auto create/auto update ON.

Thanks for reading!
kt

PS - A few of you have mailed me about a bug in the sp_helpindex2 script(s). OK, that's my next post!!! Possibly with an sp_helpstats2 script as well!

OK, so SQL Server 2008 came out *BEFORE* Visual Studio SP1 but VSSP1 was required (this was a bit annoying and created a lot of confusion/emails/newsgroup posts, etc.) but luckily VSSP1 came out only a couple of days after SQL Server 2008 RTM'ed. As for Data Dude, well, it's not RTM'ing until later this year (I think there will be one more release - probably an RC0 - before RTM) but they've been pushing out various CTPs... so, what works with what and how the heck do you get everything to install?? Oh my!

Really, it's very simple:

  1. Install Visual Studio 2008 (with Team System/Database Edition), then, install SP1
  2. Install SQL Server 2008
  3. Download/Install the latest Data Dude CTP. BOTH CTP15 and CTP16 work with SQL Server 2008 RTM but 16 has a lot of updates.

SQL/VS can really be installed in any order but this (above) is what was recommended as the best order. However, I've not heard any problems with SQL then VS. As for where to go to get more details on Data Dude - check out Gert's "Data Dude" blog for links to the downloads and a lot of tips/tricks.

Have fun!
kt

Categories:
SQL Server 2008 | Tips

OK, so, I don't blog very often. I don't know what it is... I think it's that I feel like I always need to blog huge posts and the thought of writing my huge post makes me not want to blog... so, sometimes my time-between-blogging (TBB :) is long. I'm going to turn over a new leaf (no, really... I'm *really* going to try this time!!) and I'm going to try and write smaller posts and more of them. And, yes, don't worry, I'll still write the long in-depth ones but I'll at least try to give you a few tidbits of things that I encounter - more frequently. And, that's partially why I'm posting this entry...

I've been working quite a bit with SQL Server 2008 and yesterday I went to test of a few things with database mail. Normally this gets setup quite early and gets setup with a lot of other things so restarts of the server/agent happen because of saving/tweaking my test VPC. I don't usually go in and setup Database Mail and then immediately try to send mail. Yesterday, I did... and, guess what, it didn't work. I got error [264] An attempt was made to send an email when no email session has been established. So, I felt like I had seen that before... and, with a bit of web searching I landed on Gops Dwarak's blog for a known issue of SQL Server 2005: http://blogs.msdn.com/gopsdwarak/archive/2006/04/25/583434.aspx. And, yes, restarting the SQL Server Agent solved the problem. I'm surprised that this is still a bug in 2008 but it's not entirely the worst I could come upon. However, having said that, I also thought that there was a general initiative for software to stop requiring restarts of services and/or the OS. And, yes, it's *just* an Agent restart so it won't directly impact your server's availability but, I'm still surprised that it wasn't resolved. Has anyone else found a bug (or even a documented "issue") that requires a service to be restarted - and you think it shouldn't?

And, well, that's definitely not the most interesting thing I've learned/found with SQL Server 2008, there's lots more (yes, I know - I need to blog a lot of it :). The most fun I've had has probably been learning sparse columns and filtered indexes. For right now, I want to give you some quick key points about some of the changes around indexes and in particular around sparse columns/filtered indexes:

  1. Do NOT create non-clustered indexes on sparse columns without filters (a filter that says WHERE sparsecolumn IS NOT NULL) because indexes do NOT have sparse columns in their definition. So, if you don't use a filter you will end up storing all of the NULL values in the index - which will waste a tremendous amount of space.
  2. A non-clustered index on a sparse column (without a filter) will be the same size as a nonclustered index on a non-sparse column (which is essentially the same point as above but, this might help clarify it a bit).
  3. Do NOT think that the increase in total indexes (from 250 in 2005 to 1000 in 2008) is because you should have more indexes... it's NOT. It's specifically because you might have a lot more columns (these changed as well from 1024 in 2005 to 30000 in 2008) because even that increase should not be used unless you're using sparse columns... So, you STILL want to use a lot of the best practices we've recommended in past webcasts, whitepapers, etc. and you still want to care about row size (and page density) and therefore work to create narrower tables (in general and relatively speaking - depending mostly on usage patterns). However, if you have a need for *lots* of wider tables BECAUSE you have an interesting set of properties that only some rows will have (i.e. the main reason to use sparse columns) then using sparse columns to handle these columns that will largely be filled with NULLs is a good thing. It's a good thing because a row that has a NULL for a sparse column will take ZERO BYTES. Absolutely no space is used for a sparse column that is NULL. So, this allows your tables to be wide (in definition) but your rows to be narrow (in practice). And, with well defined indexes you can VERY efficiently and effectively search on these properties.

OK, I hope to get some examples posted as well. Between Paul and I and our upcoming events - where we're demo'ing/discussing a lot of these principles, we'll plan to post a demo or two on how effective these really can be. And, I know... some of you will fight back with the thought that sparse columns introduce bad database design practices... I know, you want (instead) tables that have name/value pairs - which are the way most of us did this in all versions prior to 2008. However, name/value pairs tables become fragmented messes that also have fragmented indexes and therfore overall poor performance (for both inserts and queries). Not to mention, they can be difficult to query/join with (because you need to join multiple times to retrieve multiple properties) and the code gets messy quickly. Anyway, sparse columns - while they may not seem quite right at first - can really be a *MUCH BETTER* way to design (and perform!!!) around this problem.

And, speaking of events. We have a bunch of upcoming events... I'll give you the short bulleted list here because I'm sure many of you also read Paul's blog and he's mentioned quite a few of these coming up. Also, if you're interested in learning more details for these, the full abstract/links for many of these can be found on our Upcoming Events page. However, specific links are also listed below!

  • Best Practices in Performance and Availability for SQL Server 2005/2008, 1-3 September 2008 in Hatfield, England. You can get more details/register here
  • Dublin SQL Server User Group, Index Internals and Fragmentation, 4 September 2008, Dublin, Ireland. Bob Duffy blogged about this here and you can get more details/register here.
  • Microsoft SQL Academy 2008 - Session 1, 5 September 2008, Dublin, Ireland. You can get more details/register here.
  • Indexing for Performance in SQL Server 2000/2005/2008, 8-9 September 2008, Edinburgh, Scotland. You can get more details/register here.
  • SQL Connections "Power Workshops Series" Microsoft SQL Server 2008 Overview for Database Professionals (Hands-on — Bring Your Own Laptop), 6-8 October 2008, San Francisco, CA, USA. You can get more details/register here.
  • Microsoft Tech Ed EMEA ITPro, 3-7 November 2008, Barcelona, Spain. You can get more details/register here.
  • SQL Connections Fall Conference, 9-14 November 2008, Las Vegas, USA. You can get more details/register here.
  • PASS Community Summit 2008, 17-21 November 2008, Seattle, WA, USA. You can get more details/register here.
  • And, the last thing I'll leave you with is a recipe for the best darn Chocolate Chip Oatmeal Cookies *ever*. OK, I did a web search for exactly that (ok, it all started when Paul (who really doesn't really like sweets) decided that for his birthday he wanted Choc Chip Oatmeal Cookies (we asked becuase the girls and I wanted to bake something)). The search led me to here. And, they're definitely right that these are the best cookies ever! And, they make a few really good points:

    1. Don't microwave the butter... set it out for an hour before you're going to make the cookies. It's definitely different/better when the butter is naturally soft.
    2. Definitely take the cookies out when they look like they have a few more mins to go...

But, Paul's birthday was in July... and, it was not the only time we've made these cookies... mostly because *everyone* we introduce to these cookies is addicted - yes, the girls, me, Paul, my Mother, Brian Randell (who was forced to take some "togo" after a BBQ here) and countless of our other friends since this is now our "when-we-entertain" dessert ;-). So, since I've had the opportunity to tweak the recipe myself, here's what I do differently...

  1. I add less chocolate - usually 1-1.5 cups of chocolate chips. Sometimes I mix half/half semi-sweet and milk chocolate but I always use closer to 1 cup...
  2. I add about 1 cup of coconut - right at the end, with the chocolate chips...
  3. I make a single 12x18 cookie sheet "uni-cookie" and then I cut it up like brownies. As for the cook time, I've found that this giant cookie takes right about 14 minutes (but, you'll have to experiment with your oven... I usually take the giant cookie out when it's puffed up and the edges are just starting to go lightly brown). Oh, and you don't have to spread this out perfectly to the edges - it will expand and fill in. Just get somewhat close. The nice thing about this form factor is that it's a lot faster than making 36 individual cookies and then you get to choose the size(s) when you cut up the uni-cookie.

Oh, and don't blame me when you make 10 batches of these things and get all of your friends addicted as well. They really are yummy!!

Next up... installing Data Dude CTP16 with SQL 2008 and Visual Studio... what works with what and what's the best install order!!!

Thanks for reading!
Kimberly

OK, so thought I'd do a follow up to the post I did a couple of days ago titled: The perils of case-insensitive data (and our life in tangent-land). The reason I'd like to followup on it is that I received some excellent comments and I want to make sure that you're all aware of the tips/tricks and recommendations that there were (some of you may not have returned to see all of the comments). Really, I was impressed by the speed at which people responded as well as the great comments (and things I learned!). It just reminds me of the fact that none of us can know everything AND that our SQL community is awesome in its willingness to share and communicate.

As for the tips/tricks and "yes, duh!" realizations I came to... here are the interesting points from the comments:

First - why did my comparison work for a single character (e.g. '%A%') but not when I did a character range (e.g. '%[A-Z]%')? Well, it was because it was unicode! This was a "right! duh!" realization that I think I dreamed after I wrote this BUT, Hugo Kornelis is exactly right in his comment. Thanks Hugo! Here is a direct cut/paste of his comment:

The reason [A-Z] doesn't work, is that a collation doesn't just govern case sensitive vs case insensitive but also (amongst others) the sort order of letters. And most case sensitive collations sort like A - a - B - b - ... - Z - z. So [^A-Z] would include all letters except the lowercase z.

You can use [A-Z] to find uppercase characters in a binary collation (since all uppercase characters are in one range of ASCII, and all lowercase characters in another), but not in any other collation.

And, you can check out more from Hugo on his blog: http://sqlblog.com/blogs/hugo_kornelis/default.aspx

Second - the comparison query that I wrote all together (where I stated each letter individually in the WHERE clause) only took a few seconds to write (thanks to cut and paste :) AND it did work...And, sometimes getting something to work and moving on is all we can do (come on - you've ALL been there, eh? :). However, my main comment was that "it wasn't pretty". A much more elegant and unbelievably simple solution came from David R Buckingham (aren't the great answers always the really simple ones :)). Here is a direct/cut/paste of his comment:

The following query will return any fully lower case names in the table:

SELECT LastName
FROM Person.Contact
WHERE LastName COLLATE Latin1_General_CS_AS_KS_WS = LOWER( LastName ) COLLATE Latin1_General_CS_AS_KS_WS

I don't believe that David has a blog... maybe he should :).

Third - a very cool and clever trick that came in from Denis Gobo is related to the performance of repeatedly doing case-sensitive searches on a case-insensitive column. I suggested that creating an additional column (preferably a computed column that uses the case sensitive collation) would be an easy and optimal solution. This is still definitely true when the case-insensitive values are NOT selective enough to warrant using an index and the case-sensitive values are... However, if both the case-sensitive AND the case-insensitive values are reasonably selective then the trick that helps is from his comment. Here is a direct cut/paste of his comment:

Kimberly, the way to force an index seek is to do this

SELECT *
FROM MyTestContacts
WHERE Lastname = N'adams'
AND Lastname COLLATE Latin1_General_CS_AS_KS_WS = N'Adams'

The WHERE might return more than one row but the AND will return only the case sensitive one

I wrote about that a while back here:
http://sqlservercode.blogspot.com/2007/05/make-your-case-sensitive-searches-1000.html 

And, you can check out more from Denis on his blog: http://sqlservercode.blogspot.com/

Now, as for the issues related to creating a view in a database that has a different collation from the server's collation... Here, I'm fairly certain that there's still a bug. However, I'm happy to say that I don't think that it's the most likely situation that exists for collations. I think the two most likely situations are:

  1. The server has one collation. The database inherits that collation. The database developer makes column level collation changes throughout the db. This seems to work well. OR
  2. The server has one collation. The database has a different collation. The database developer consistently uses that collation throughout their app. A good example of this is where people have case-sensitive databases on case-insensitive servers. This works fairly well (although there are some issues wrt to temp tables, etc. and default_collation is a good thing to know).

I guess there's even a third one where column level changes are made in a database whose collation is different from the server but where there aren't any views that also change the collation to yet a different collation (and this is where there seems to be a bug).

So, this was an excellent (and reasonably fun :) :) exercise to go through wrt collations. And, this is how I (we?) learn! I really want to thank everyone for reading - and commenting/sharing! - the things they learned/knew. That's part of why I love the SQL Server community. And, speaking of which, I thought I'd end this entry with a few community links - as a reminder to everything that's out there:

Thanks for reading! Thanks for commenting!
kt

Categories:
SQL Server 2005 | Tips

DDL Triggers were a new feature of SQL Server 2005 and while seemingly simple, they are very powerful. DDL Triggers allow you to trap an attempted DDL operation to audit it, prevent it, or do anything you want to validate/verify/”authorize”/etc – you write the code. And, since a trigger fires as part of the transaction, you can roll it back.

In many conference demos/webcasts, etc., I have provided a sample script that prevents ddl within a [production] database. That script has been really helpful/useful but recently I thought about an update to it…

SQL Server 2005 has another new feature "execute as". While I definitely see many benefits, I’m also a bit concerned. To a certain extent, I feel that the potential for SQL Injection is actually higher. If a developer creates a poorly written/tested stored procedure (ok, therein lies the problem, really!) that includes dynamic string execution AND then uses "execute as" to essentially elevate a user with minimal privileges to a higher level (so that they don’t need to give the base object rights to the user), a malicious user could “inject” code in and actually succeed if the “execute as” user has rights to the injected code. In prior releases, and with the default behavior (execute as caller), this is not possible (which is good for security but bad for dynamically executed strings within stored procedures as base object rights are necessary).

Having said that, and since security is always a concern, my DDL Trigger only audited for the login of the user who executed the statement, not for the actual user that’s logged in. In other words, if EXECUTE AS is used (or SETUSER is used), then the context of the user executing is actually different then the logged in user. To see this shift in context, SQL Server 2005 added a new function: ORIGINAL_LOGIN().

(reading between the lines is even more frightening in that prior to SQL Server 2005, the original user could not be tracked from SETUSER. The good news is that SETUSER is ONLY allowed to be used by DBOs so it’s not as widespread as the potential for “execute as”).

OK, so how can we put all of this together? We’ll want to add the ORIGINAL_LOGIN function into our audit table in our DDL Trigger. Even if you choose NOT to rollback, at least you’ll know who performed the operation (even if from a dynamically executed string!).

USE AdventureWorks;

go

--Create a login/user - just for this exercise

CREATE LOGIN Paul WITH PASSWORD = 'PxKoJ29!07';

go

CREATE USER Paul FOR LOGIN Paul;

go

sp_addrolemember 'db_ddladmin', 'Paul'

go

 

CREATE SCHEMA SecurityAdministration

go

CREATE TABLE SecurityAdministration.AuditDDLOperations

(

            OpID                            int                                NOT NULL identity    

                                                                                    CONSTRAINT AuditDDLOperationsPK

                                                                                                PRIMARY KEY CLUSTERED,

            OriginalLoginName    sysname                     NOT NULL,

            LoginName                 sysname                     NOT NULL,

            UserName                   sysname                     NOT NULL,

            PostTime                     datetime                     NOT NULL,

            EventType                   nvarchar(100)            NOT NULL,

            DDLOp                        nvarchar(2000)          NOT NULL

);

go

GRANT INSERT ON SecurityAdministration.AuditDDLOperations TO public;

go

 

CREATE TRIGGER PreventAllDDL

ON DATABASE

WITH ENCRYPTION

FOR DDL_DATABASE_LEVEL_EVENTS

AS

DECLARE @data XML

SET @data = EVENTDATA()

RAISERROR ('DDL Operations are prohibited on this production database. Please contact ITOperations for proper policies and change control procedures.', 16, -1)

ROLLBACK

INSERT SecurityAdministration.AuditDDLOperations

                        (OriginalLoginName,

                         LoginName,

                         UserName,

                         PostTime,

                         EventType,

                         DDLOp)

VALUES   (ORIGINAL_LOGIN(), SYSTEM_USER, CURRENT_USER, GETDATE(),

   @data.value('(/EVENT_INSTANCE/EventType)[1]', 'nvarchar(100)'),

   @data.value('(/EVENT_INSTANCE/TSQLCommand)[1]', 'nvarchar(2000)') )

RETURN;

go

 

--Test the trigger.

CREATE TABLE TestTable (col1 int);

go

DROP TABLE SecurityAdministration.AuditDDLOperations;

go

EXECUTE AS LOGIN = 'Paul' -- note: Remember, Paul is a DDL_admin

go

DROP TABLE SecurityAdministration.AuditDDLOperations;

go

REVERT;

go

 

SELECT * FROM SecurityAdministration.AuditDDLOperations;

go

DROP TRIGGER PreventAllDDL ON DATABASE;

go

DROP TABLE SecurityAdministration.AuditDDLOperations;

go

DROP SCHEMA SecurityAdministration;

go

DROP USER Paul;

go
DROP LOGIN Paul;
go

 

So, have fun testing with this one.

 

Thanks for reading!

kt

Categories:
Security | SQL Server 2005 | Tips

I had a discussion earlier today (with Paul Randal) about many misconceptions that exist about upgrading databases and more importantly, about "downgrading" databases. Really, the issue is that I've heard people get frustrated when they find that things like backup/restore works FROM SQL 7.0/2000 TO SQL Server 2005 but not the other way around - even if the database is in SQL Server 2000 (80) compatibility mode. First and foremost, compatibility mode only affects parsing, query processing, and general data manipulation; it does not affect physical storage (well, there's more to it than that but that's a general overview). When you upgrade a database to SQL Server 2005, you WILL benefit immediately from changes in the storage engine, etc. regardless of compatibility mode. Compatibility modes are there to give you time in updating/upgrading your code - if/when necessary. Most code will work when upgrading but some code may not be supported because of changes to keywords, syntax changes, etc... The best thing to do is check your application compatibility with the Upgrade Advisor. I did a a couple of webcasts on Installation/Upgrade as part of my 11-part series on TechNet. See the blog entry for the entire series here. Part 3 and part 4 are focused on Installation and Upgrade and their associated blog entries have a lot of additional links (including links to the Upgrade Advisor as well as a series of things you might want to do before you upgrade). Also, be sure to checkout the upgrade site off of the main Microsoft SQL Server site.

How to move USER databases around - a quick list of what's supported between versions

Backup/Restore from 7.0 to SQL Server 2000
Detach from 7.0, copy the files, then attach to SQL Server 2000
Backup/Restore from 7.0 to SQL Server 2005
Detach from 7.0, copy the files, then attach to SQL Server 2005
Backup/Restore from 2000 to SQL Server 2005
Detach from 2000, copy the files, then attach to SQL Server 2005

Why use Backup/Restore?

PROs

  1. Because you have a backup! This will allow you to go back to the version from which you came. However, without any changes made on the uplevel version.
  2. Because it doesn't require the database to be taken "offline" when the backup is performed (note: that this is both good and bad - bad because you don't really know the exact point in time to which the database reconciles...which may not matter if you're just testing).
  3. Because the backup will be the size of data only and will not include database free space. Free space is not backed up (e.g. a database with a 100GB data file with only 20GB of data should yield a file that's roughly 20GB in size). I say "roughly 20 GB" because the internals of a backup require that the transaction log records for the activity that occured during the backup process are also backed up with the full database (or differential) backup. This is actually the basis for why transaction log backups are not supported during a full/differential backup in SQL Server 2000 (they are in SQL Server 2005). However, this is the reason why the transaction log cannot be cleared while a full or differential is ALSO running in SQL Server 2005.

CONs

  1. You don't know the exact point in time to which the database reconciles (it will be the time that the backup completed) AND logs CAN be restored uplevel as well. NOTE: If you're interested in creating an exact point in time version of the database - consider putting the database into "restricted user" mode or "single user" mode (so that user transactions are not allowed during the backup). Again, this may not be a concern.
  2. It takes time to complete the backup (there are four phases of a restore: create/initialization, copy, redo, undo). Make the create/initialization *much* faster by enabling Instant Initialization. See my Instant Initialization blog post for more details.

Why use detach/attach?

PROs

  1. It's simple, it's fast... but once detached then the database is OFFLINE.
  2. You know the exact point in time to which it reconciles because no transactions are allowed into the database once it is offline. Again, this may not be a concern.

CONs

  1. You must copy the entire file - including the free space to the other location and the network copy might be the most expensive (meaning time consuming) part of the entire process. However, once copied, the files do NOT need to be created on the destination because on attach, these files will be used.
  2. The database is offline once detached and during copy.
  3. If you don't COPY the files and instead you attach the detached files, you will have ABSOLUTELY NO WAY of getting back to the version from which you detached. (ah, this is probably the single most important reason for why I prefer backup/restore!)

Summary for "How to move USER databases around"
Between these versions "upgrades" are supported ONLY to the uplevel version. There is NO single (or simple) feature that can be used to get back to the version from which you started (without exporting/importing all of the data). There is also no undocumented back-door to do this either (no trace flags, no DBCC commands, NA DA!!! as per Paul).

What about System Databases?
This is a whole other can of worms to open and the easiest thing I can say here is that you generally should not move/upgrade system databases across machines. These are upgraded through "in-place" upgrades of SQL Server (on the same machine) or through manual migrations (to different machines) of the users/objects (SQL Agent Jobs, user-defined system procedures in master, logins in master, etc.). This is not an easy process (manual migration) but may prove to be a better choice over an upgrade in place if something were to go horribly wrong (which is unlikely but I'm a "what's the worst case scenario" person when it comes to availability :). The other benefit of NOT upgrading in place - and instead MOVING databases from one version to another on an upgrade - is that you get to complete some basic "spring cleaning". New hardware, freshly formatted, freshly installed/configured OS, clean disks, etc. This can often alleviate some of the strangest, hard-to-determine-problems, that have plagued you for weeks/months. Like I said, this is a whole other can of worms to open!

But - if you're interested in moving system databases around on the SAME machine, here's a great KB that covers the required options, syntax, rules and restrictions: How to move SQL Server databases to a new location by using Detach and Attach functions in SQL Server

And - if you're interested in transferring logins and passwords between instances (for upgrade or for sync'ing two servers used to create a standby partnership - with Database Mirroring and/or Log Shipping), here's a great KB article that includes links to other articles even uplevel transfering of logins (like 2000 to 2005): How to transfer logins and passwords between instances of SQL Server

And - that's it for this week (probably)... two in a row is not likely to become three in a row (just setting expectations :) :) :),
kt

Categories:
Resources | SQL Server 2005 | Tips

Well, I've promised to blog more and I'm really going to try to do so. This morning I got the perfect question/comment (in email) to respond to and after working through a response that was taking me upwards of 3 hours (you'll learn later why I have 3 "spare" hours :)......... I figured that it was time to turn the response into a blog post. ;)

Background: The Clustered Index Debate
In the years since the storage engine was re-architected (SQL Server 7.0+) there's been constant debate on how to appropriately choose the clustered index for your tables. I've generally recommended an ever-increasing key to use as a clustered index and many find that counterintuitive. The primary reason people feel it's counterintuitive is that it creates a hotspot of activity. [If "hotspot" is not a familar term - a hotspot is solely an active place within your table.] Hotspots were something that we greatly tried to avoid PRIOR to SQL Server 7.0 because of page level locking (and this is where the term hot spot became a negative term). In fact, it doesn't have to be a negative term. However, since the storage engine was rearchitected/redesigned (in SQL Server 7.0) and now includes true row level locking, this motivation (to avoid hotspots) is no longer there. In fact (and probably even more counterintuitive), the opposite is true. Hotspots (specifically hot PAGES not hot ROWS) can be very beneficial because they; minimize the number of pages needed in cache, improve the likelihood of the required page already being in cache and in general, they minimize the overall amount of cache required. So, this is why many of us have changed our recommendation on where to create the clustering key in 7.0+. Instead of focusing on range queries we now focus on placing the clustering key on an ever-increasing key. In earlier releases, focusing on range queries for the clustered index reduced hotspots for insert/update and this in fact was the PRIMARY motivation to choose them, NOT range query performance! But - there are even MORE reasons to choose an ever-increasing key and they are based on internals as well. These internals are based on the significant changes made in the storage engine for 7.0+. For a quick start on these, I went through them in the Blog entry here.

And, today's email is not uncommon. This is the basis for the title clustered index debate. In general, there are still a lot of questions related to creating clustered indexes to improve "range query" performance. Don't get me wrong, there's definitely a benefit in performance for some range queries but the first thing to remember is that you get only one CL index per table (therefore only one type of range query can benefit). In the real world, t's not likely that you want to see your data exactly in the same way all the time. Therefore it's very challenging to come up with the "right clustered" index if you're using range queries as your strategy. Even worse, the affect of choosing the clustering key to improve range queries causes problems for modifications against that table (INSERTs/DELETEs and UPDATEs). So.............. this is what started my day today. A great email from a reader that brought up these points. The question/comment (modified to hit only the highlights and to protect their identity :) was this:

The most important characteristic for a Clustered Index key is to satisfy range queries. More often than not, if a sufficient range of data will be scanned, the Optimizer will choose the Clustered Index over all others due to the excessive cost of Bookmark Lookup operations. As such, the table KEY is a more suitable clustered index candidate than any surrogate (few every query a database by range of surrogate keys).  [kt note: this second sentence is not entirely true... SQL Server will certainly choose a clustered index over non-clustered that require table scans but there are A LOT of algorithms that SQL Server can use instead of either of these and my examples later show this... non-clustered covering seekable indexes, non-clustered scanable indexes, index-intersection, etc. ] 

Now, when the default behavior for SQL Server was designed such that the PRIMARY KEY was chosen as the default clustered index, it was exactly for this reason.  It is the business key.  It would satisfy uniqueness (by definition of logical KEY).  And, it is well suited for a wide variety of range scans.  However, this is when the PRIMARY KEY is defined on the Business Key of the data.

But, when you introduce the usage of surrogate keys (i.e., IDENTITY) as a physical implementation, and thus transfer the PRIMARY KEY definition to it, two things must be considered.  First, the Business Key this IDENTITY will be a proxy for must still exist as it is still apart of the logical design.  As part of the physical design, the logical key needs to be implemented as a physical constraint to maintain logical uniqueness.  Second, just because a proxy has been defined does not make it a natural candidate for the clustered index.  The business key still maintains this distinction.

What is often cited as the “reason” for IDENTITY PRIMARY KEY clustered index definitions is its monotonic nature, thus minimizing page splits.  However, I argue that this is the only “reason” for defining the clustered index as such, and is the poorest reason in the list.  Page Splits are managed by proper FILLFACTOR not increasing INSERTS.  Range Scans are the most important “reason” when evaluating clustered index key definitions and IDENTITies do not solve this problem.

Moreover, although clustering the IDENTITY surrogate key will minimize page splits and logical fragmentation due to its monotonic nature, it will not reduce EXTENT FRAGMENTATION, which can cause just as problematic query performance as page splitting.

In short, the argument runs shallow.

Luckily, this email arrived with perfect timing for me as I'm sitting in a "bootcamp" event on Always On technologies and I'm not speaking this morning (my colleague Bob Beauchemin is doing lectures on Scale Out technologies: Scalable Shared Databases, Service Broker, DPVs, etc.). Anyway, in addition to listening to Bob, I've decided to continue the blog series on "the clustered index debate". The first and most important point to stress is that minimizing page splits is NOT the only reason nor is it the most important. In fact, the most important factors in choosing a clustered index are that it's unique, narrow and static (ever-increasing has other benefits to minimizing splits).

The Clustered Index Debate Continued
First, there are many angles to look at wrt to "the clustered index debate" and it's not until all of the issues are reviewed, that this strategy (a monotonically increasing key) becomes obvious. So, I think it will probably take a couple of blog posts to really prove this. I'll start up this debate again here...... When you look at a general purpose table (which is most) where the table has ALL DML (S/I/D/U) then you are best off with an ever-increasing key (again, you have to look at the overall impact of all operations against the table - not just select... because I/D/U will also impact select in the long term). So, I'll break this down into each DML operation here. If you don't look at the overall impact, then large tables can end up having a tremendous number of problems once they're put into production. I've certainly heard this concern/debate before (and most people are skeptical at first glance) but when you look at the situation overall, you'll find that "finding the right balance" includes not just looking at range queries. In fact, here's a quick list of the things/tests/numbers/scenarios that help to prove my strategy:

  • Inserts are faster in a clustered table (but only in the "right" clustered table) than compared to a heap. The primary problem here is that lookups in the IAM/PFS to determine the insert location in a heap are slower than in a clustered table (where insert location is known, defined by the clustered key). Inserts are faster when inserted into a table where order is defined (CL) and where that order is ever-increasing. I have some simple numbers but I'm thinking about creating a much larger/complex scenario and publishing those. Simple/quick tests on a laptop are not always as "exciting". But - this is a well documented issue (IAM/PFS lookups) and poor performance on a heap is also referenced in this KB: PRB: Poor Performance on a Heap. note: this KB is quite dated and I don't actually agree with everything in this article however, the general concern of poor performance for inserts is still true on SQL Server 2005.
  • Updates are often faster (when the row needs to be relocated) and for the same reason (IAM/PFS lookups) BUT there are many types of updates and not all updates cause records to be relocated. Here are a few things to think about wrt to updates:
    • Updates that are completely in-place (some examples are where the update is updating a fixed-width column OR to variable-width columns where the row size doesn't change, etc.). These types of updates don't really care.
    • Updates that cause record relocation (where the row size changes) are definitely better by having a clustering key because the record relocation (which will be handled by a split) is defined by the clustering key
    • Updates to the clustering key are the WORST (in this case) which is one of the key reasons for having a cl key that is static (so we have to keep this in mind when we choose a clustering key).
  • Deletes aren't nearly as big of a concern BUT deletes in heaps create more gaps and more gaps creates more work in PFS/IAM lookups and while this helps to reduce wasted space, it still requires the time to find the space........ hence the slowed performance of Inserts/Updates. I've also written some blog entries that cover very interesting test cases for large scale deletes and why you'd want to consider partitioning to optimize for the "sliding window scenario" in this blog entry: MSDN Webcast Q&A: Index Defrag Best Practices - Fragmentation, Deletes and the “Sliding Window” Scenario and it's the LAST one!.
  • Selects.............. now this is the hardest one to go through in just a couple of bullets (ah, I guess this will lead to another one or two posts :) BUT I'll start by saying that the best way to tune the vast majority of range queries is through non-clustered [covering] indexes. But, it's also important for me to stress that I do NOT advocate covering every query (it's impossible to do). What's important to realize in terms of covering is that SQL Server 7.0 and up continues to include internal algorithms to improve performance when you don't have the "perfect" non-clustered covering seekable index and instead still gives better performance than going to the base table (or performing bookmark lookups - as mentioned in the mail...and I completely agree that these [bookmark lookups] can be evil!). To start this discussion, I'll give one of my favorite examples of a large-scale aggregate. The absolute best way to improve the performance is through an indexed view but the data can be gathered through many other algorithms - ideally through a non-clustered covering index that is in order by the group by and that includes the column(s) being aggregated. For example, take this query:

SELECT c.member_no AS MemberNo,
 sum(c.charge_amt) AS TotalSales
FROM dbo.charge AS c
GROUP BY c.member_no

On a charge table of 1.6 million rows here are the performance numbers to handle this aggregation:

  • Clustered table scan (CL PK on Charge_no) with a hash aggregate = 2.813 seconds
  • Index scan (non-clustered covering but NOT in order of the group by) with a hash aggregate = 1.436 seconds
  • Index scan (non-clustered covering in order of the group by) with a hash aggregate = .966 seconds
  • Indexed view = .406 seconds

Now this was a pretty small table (narrow rows and only 1.6 million rows) AND I didn't have any concurrent activity. The concurrent activity would have caused this to be even slower for hash aggregates, etc. Regardless, it proves the point (at least generally). Now, if I wanted to improve this range query then I'd have to cluster on the member_no column (and this is an ideal example because I often hear people say that clustering on a foreign key column helps to improve range/join queries - which can be true as well)......... But - this strategy has a few problems in addition to a few benefits (and we have to look at everything to be sure of our choice/decision). First, member_no is not unique (in the charge table) so SQL Server has to "uniquify" the rows. The process of "uniquification" impacts both time (on insert) and space (the rows will be wider to store each duplicate row's uniqufier). Also, theoretically it could change (in this case that's not true). Anyway, the time it takes for the clustered index is 2.406 seconds which is better than the clustered on the PK (of course) but if I were to also start modifying the rows (which creates splits) or even just insert 15% more rows........ then my table would become fragmented. At that point, the query performance should get worse in the table clustered by member_no table and it will continue to get even worse in the table clustered by charge_no (because of the worktable created in tempdb by the hash aggregate) BUT it won't be all that much worse in the non-clustered index examples (especially the covering index that's in the order of the group by - because this doesn't require a worktable).........

  • CL on member_no = 4.906 seconds
  • CL on charge_no = 6.173 seconds
  • Index scan (non-clustered covering but NOT in order of the group by) with a hash aggregate = 3.906 seconds
  • Index scan (non-clustered covering in order of the group by) with a hash aggregate = 1.250 seconds
  • Indexed view = .516 seconds

This is a great start to furthering the clustered index debate but I do have to admit that it's a counterintuitive and difficult issue to tackle because often isolated tests lead you to different conclusions. In this case though, the non-clustered indexes are better for this range query and the indexed view is the best (but I wouldn't consider the Indexed unless this were more of a read focused database rather than read/write). [and - of course, that statement warrants yet another blog post :)]

So, depending on the tests that you do - especially if you focus only on selects and you don't have modifications (i.e. fragmentation) - then they will make "creating the clustered index for range queries" appear to be best. Again, I'm not just saying this to prevent fragmentation, I'm saying this because I wouldn't use the clustered index OR a non-clustered index with bookmark lookups to handle this query. I'd consider a non-clustered covering that's seekable OR even a non-clustered covering that's scanable before I'd even choose the clustered (and that's what the optimizer would prefer as well). In the end it's really a bit of an art and a science to "finding the right balance" of indexing.

Oh - and if you arbitrarily add a column to use for clustering (maybe not as the primary key) that can help but many would prefer to use actual data... which means [potentially] creating your primary key with a new identity [or similar] column and this can impact your business logic (absolutely). I'm certain that certain tests can show that range queries are faster and it's absolutely correct that business application/usage can be a concern but when you look at the big picture (and the impact on I/D/U) then the benefits of the monotonically increasing key significantly outweigh these concerns. Simply put, a small/narrow key can help join performance and an ever increasing key can also help lookups for rows! (yes, definitely more coming)

Happy Friday! Have a great weekend. I'll try to continue more threads on this debate shortly!
kt

Categories:
Indexes | Opinions | SQL Server 2005 | Tips

OK, it's been a heck of a long time since I blogged... and for that I apologize. I'm also WAY overdue in my posting my demo scripts from a TON of conferences BUT... now everything has been posted. Check out the past events page on SQLskills and you can find the demo scripts that you're looking for......lots of fun stuff and TONS of scripts to play with and test.

Now - as for the reason(s).... many are business and for that I blame the following (yes, 17 flights [yes, one boarding pass is missing] over ONE 5 week trip with 7 events and 5 continent changes):

The other reason(s) are personal...suffice it to say that the last 6 months have been some life changing times for me and what I'm finding (or trying to find) is that ever important balance between work and life. During this holiday season (and always), I wish you and your loved ones well and I hope that you too can find (and cherish) what's most important to you.

So, you won't see anything else from me for this year but I do hope to be better (and more frequent) with blogging in the New Year and I especially hope to see you again at an upcoming conference. Let me leave you with the most exciting picture I've witnessed this year...it was during my one day of site seeing in Cape Town - where I went cage diving with Great White Sharks (and got horribly sea sick - which is rare for me) but where I was able to witness these amazing and powerful creatures....

Have a happy and safe holiday season!
kt

Been thinking a lot about something that was mentioned in a few of my most recent posts... Especially when I get comments like "that's another item to add to our checklist" or "that's a good trick to add to our arsenal" and well, I thought in this blog entry I'd ask for your tricks that fall under the umbrella of designing for performance.

For example - do you change collations? I had a recommendation here.
For example - do you have a view that you want ordered? I had a recommendation (with caution) here. But - Adam Mechanic came back and said that he's used that trick to improve performance... and, I'm sure that's the case as well!
For example - do you have stored procedure parameters that are giving you grief? I had a series of recommendations in my Optimizing Procedural Code category here.

In fact, sometimes the best form of "hint" to SQL Server is NOT an optimizer hint but instead a more subtle change to the join (derived tables for example) or the infamous subquery -> join rewrite or the join -> subquery rewrite. I'm always asked "which is better - a subquery or a join" and I always answer YES. ;-)  OR taking a complex process and breaking it down into temp tables (I'd try to create views instead of temp tables first and see if the optimizer figures it out but there are cases when sometimes they just don't). Remember, it's not the optimizer's job to find the absolutely BEST plan; it's their job to find a good plan fast. And - they typically do. Really, no general "tricks" work ALL of the time and often they don't help at all but there are LOTS of things that I'm sure you've done and you really want to tell someone about it. How about here? I'm going to try to compile these tips/tricks into a best of...

Categories:
Design | SQL Server 2005 | Tips

Hey there everyone - If you're into Analysis Services, Integration Services, Reporting Services and BI in general - you'll want to check out Elizabeth Vitt's new blog. Liz has been specializing in BI since SQL Server started adding BI-centric components. She's got a wealth of information to share and many great insights into performance tuning as she's working on a BI Performance Tuning resource that will probably hit 100 pages (from current guestimates).

And - no surprise from Liz, she's out the gates running with her first entry on Influencing Aggregation Candidates.

Subscribe now!

And a big welcome to blogging for Liz!
kt

Categories:
Resources | SQL Server 2005 | Tips

Well, it's been a GREAT week here in Switzerland while working with my partner Trivadis. Today, we wrapped up a two-day course on Designing for Performance (in Geneva) while on Monday/Tuesday we did a two-day course on Indexing for Performance in Zurich. The food, the wine, the cheese, the butter, yum! Oh... and the questions/comments/technical focus, etc. has all been great. :) I'm flying home today (Sat) so wish me luck on having internet access at 36,000 feet again (probably not...I'm flying United instead of Lufthansa - and it's only Lufthansa that has FlyNet). Wow - can you imagine where we're going to be in only a couple more years? Internet access everywhere! (hmmm.. how do we escape? well, that's another blog entry for another day :)

Anyway, one of the great things about teaching is that I get to meet all sorts of people and work through all sorts of interesting problems... And - this blog entry is based on a discussion I had with [a very blogless ;-] Meinrad Weiss - a Trivadis employee/consultant AND a fellow RD. (I was bullied into blogging by CV so now I do my part and do the same to others)

I can't remember how it started but somehow a discussion started on Top 100 PERCENT being used in views. I mentioned that while this was a good trick in SQL Server 2000, it has been REMOVED from SQL Server 2005 (meaning that TOP 100 PERCENT does NOT order the data within a view). Theoretically, I agree with this decision - data within a view should NOT be ordered. A view should SOLELY define a tabular set. It is up to the query which is accessing the view to define the presentation of the view. Using TOP within a view should be limited to ONLY when it is used to further define the data set (i.e. TOP 10 PERCENT... ORDER BY TotalSales DESC makes perfect sense).

Now, having said that - it was a cool trick - but with Pros/Cons. The obvious Pro is simplicity in access. While adding the ORDER BY to the query against the view really isn't all that difficult, it does make it a bit easier for quick/simple query access. BUT - there's a HUGE con here too. If the view starts getting used for other purposes (like in joins to other tables), then the being ordered before the joins, etc. can cause you an additional step that is NOT necessary. As a result, performance was compromised.

Long story short, I generally recommended against it but it was still cool. So - then Meinrad started playing and came up with - what about 99.9 on a table that has < 100 rows OR 99.99% on a table that has < 1000 rows, etc. And - yes - that DOES work, because SQL Server rounds to 100%. So, you are back to getting 100% of your data, ORDERED within a view. But - you need to set your percentage to an appropriate percentage based on rows - but what if you don't know the row count? How about TOP n where n = the max value for a bigint (9,223,372,036,854,775,807)?? That should always work...and it does.

OK - so what's the point? Yes, we DO have a workaround for the removal of TOP 100 PERCENT in SQL Server 2005 - but be CAREFUL - you are potentially shooting yourself in the foot. If this view is NEVER used for anything but SELECT * FROM View, you're OK. If you start adding joins, etc. then you might get into trouble. In the showplan below - the data returned is EXACTLY the same.

SELECT C.ContactName, Sub.*
FROM OrderSubTotalsViewOrdered2 AS Sub
   
JOIN Orders AS O ON O.orderid = Sub.Orderid
   JOIN Customers AS c ON c.customerid = o.customerid
WHERE C.City = 'Madrid'
ORDER BY SubTotal DESC
go

SELECT C.ContactName, Sub.*
FROM OrderSubTotalsViewNOTOrdered AS Sub
   
JOIN Orders AS O ON O.orderid = Sub.Orderid
   JOIN Customers AS c ON c.customerid = o.customerid
WHERE C.City = 'Madrid'
ORDER BY SubTotal DESC
go

BUT - the first plan of execution queries against the ORDERED set and the second against the un-ordered. Check out the showplans below:

This is a VERY COMPELLING reason to BE CAREFUL ordering data within a view. While the trick does work, please use it sparingly.

If you want to play with the views created above, you'll need a copy of the Northwind Database and you'll need this script: Top 100 Percent in SQL Server 2005.sql (4.46 KB).

Have fun!
kt

PS - I'm adding my blog to Technorati so I need to post their link to start generating my profile... here we go: Technorati Profile

Categories:
SQL Server 2005 | Tips

Ha... did that get your attention? Well, what I really hope to do is make everyone aware of what's made the Developer Community rounds this week. On Wednesday, Microsoft announced "Data Dude" (aka the Visual Studio sku for Database Developers). This was an announcement that may be glanced over by many DBAs thinking it's just another tool for developers...what can it offer me? And, well, that's where I think there are some VERY cool things to point out. I've been following Data Dude for a couple of months now (ah... a little birdy told me :) and at first I wasn't sure how much it would impact me. However, after starting to get a better feel for their future directions, I've realized that even though I'm not their initial and/or direct target audience that I'll definitely find some great uses for their product. In fact, in getting ready for their announcement and in chatting with a few press folks, I wrote up a small amount of text. Some of this was quoted in the eWeek article here but there's a few more things that I really think you'll (yes, even DBAs) be interested in. This is the second half of the content that was quoted:

For Administration and Operations, I especially like their direction with regard to unit testing and sample data generation. I work with a myriad of customers who do not let development/operations perform testing/tuning on real production data (even a copy) due to data sensitivity requirements/policies. As a result, performance testing can be horribly flawed. With the ability to generate large sample volumes of statistically "real" data, real-world tuning will be possible without compromising data sensitivity. This is the area that I'm most interested in initially but refactoring and schema comparisons are very interesting as well. One of my favorite sayings is "The sooner you start to code, the longer the program will take." (Roy Carlson) as schema changes can be challenging at best and often things can be missed (data types, columns names, etc.). Often the changes are made on alternative systems and then they need to be integrated in - often through comparing schemas and with hand-created alteration scripts. With the ability to have intelligent refactoring, application and database logic can be fixed through a straightforward and flexible interface. This will help to minimize later errors or even harder to recognize performance problems caused by implicit conversions.

So, in the end, I'm not their primary target but I think I'll probably get really into it and try to consider a variety of ways to leverage it for Operations/Administrations teams for "after the fact" tuning cases. However, I do think teams will be even more productive if they adopt it earlier in their production lifecycle.
Now, if you're even slightly interested. You can get a lot more information about Data Dude already. Here's a beginning list:

And, if you're going to be at Microsoft TechEd 2006 - in Boston in a couple of weeks, there's a LOT more information coming. In fact, my pre-conference workshop co-presenter Brian Randell has authored some Hands-on Labs for Data Dude and those will be available in the HOL area. He's on DotNetRocks this week (to be released on DNR's site next Tuesday) and he's chatted with them in the past about Virtual Server/Virtual PC (for hours...now you know how I feel ;). Honestly though, we use VPC/VS a lot in our HOLs and Brian is REALLY knowledgeable about how to optimize them, compact drives, etc. Anyway, here's the link to his last show on DNR and here's Brian's blog entry on Data Dude. And, just as a small hint... you should consider making a Sched+ for the "Live from TechEd" show from DNR. All I have to say is that it might have some great guests on it (maybe even related to Data Dude, hint, hint)!

Now, the last thing that's the most exciting for me to announce is that there are some new bloggers as a result of the Data Dude annoucement. FINALLY, one of my best friends - Gert Drapers - has started blogging (don't forget his already awesome content site: http://www.SQLDev.Net). If you're at all interested in geeky database development stuff, subscribe now! And - many of his team members are great friends too (Richard and Matt!) and I'm very excited to see them blogging as well (it's just that I've been begging Gert to do it for the past couple of years ;)). Anyway, it will be great fun watching this team grow and watching this product evolve.

Here are the Data Dude team blogs:

Here's the official Visual Studio Team Edition for Database Professionals site.

The times are changing............. for the better!
kt

Categories:
Resources | SQL Server 2005 | Tips

OK, so... I don't know how many of you use different collations but if you do then you know that there are two truths:

1) They're very flexible
2) They can cause you a bit of grief (changing collations and tempdb)

Flexibility

As of SQL Server 2000 (or heck, maybe it was 7.0?), database collations could be changed at installation OR set/changed later. You can set the collation when a database is created (if not set, the database will use the server's default). You can set the collation when a table is created (if not set, the table will use the database's default). You can set the collation when a query is executed (which doesn't really make sense unless it's in a WHERE clause or ORDER by clause). And - you can set the collation in a view or stored procedure to do things like case sensitive searching - on the fly. However, neither of these will perform well over large results sets (at least not without indexes) so, I'd be careful of doing any adHoc changes to collations (even in views/sps - without appropriate indexes)!

Anyway, the key point is that they're very flexible. In many international databases/localized databases, column collation differs by table (in order to do efficient sorting, etc.) and different language data may be separated (either with a column that described which language/country code is used OR in different tables).

Grief in Changing Database Collations

Actually, changing database collation is *very* simple. Literally, it only takes an ALTER DATABASE to do. For example, the following code runns flawlessly:

USE master
go

DROP DATABASE TestCollation
go

CREATE DATABASE TestCollation
COLLATE
SQL_Latin1_General_CP1_CI_AS
go

sp_helpdb TestCollation
go

ALTER DATABASE TestCollation
COLLATE Latin1_General_CS_AS_KS_WS
go

sp_helpdb TestCollation
go

BUT... if you go from case sensitive to case insensitive... be careful! It is important to realize that ALL of your tables AND data will need to be checked against the new collation. In fact, changing database collation will not be allowed if the objects/data would no longer adhere to your unique constraints, etc. Check out this more complete script (ChangingDatabaseCollation.sql (2.85 KB)), if you want to see what happens.

Grief with temporary objects

So.. the other area (and this seems to be the one where everyone has trouble), is with temporary objects. If you create a temp table and your database has a different collation other than TempDB (which has the same collation as the system - based on installation), then comparisons/lookups/joins - may have problems. A simple trick to get around this is to use database_default. Check out this sample and you'll see how it works:

CREATE DATABASE Test
COLLATE Icelandic_BIN
go

USE Test
go

CREATE TABLE #test1
(
   col1 varchar(12)
)
go

CREATE TABLE #test2
(
   
col1 varchar(12) COLLATE database_default
)
go

USE Tempdb
go

CREATE TABLE #test3
(
   
col1 varchar(12) COLLATE database_default
)
go

sp_help 'tempdb..#test1' -- Will use TempDB's collation
exec sp_help 'tempdb..#test2' -- Will use Test's collation (Icelandic BIN)
exec sp_help 'tempdb..#test3' -- Will use TempDB's collation
go

So simple, so obvious... and, well - I just found out about that one?! I used to recommend that you explcitly set the collation for every column. Now, that still works - but, it doesn't offer you any flexbility. So, you could get around that with dynamic string execution but that can also get very complicated, very quickly. So... database_default is a VERY simple and clean way of doing this.

Have fun,
kt

Categories:
SQL Server 2005 | Tips | tempdb

Hey there everyone - The series has completed and I know that many of you struggled to get access to the surveys... Microsoft has asked me to post links to the surveys...so, for completeness, I decided to create this blog entry to have links for every session, every blog link (resources, demo scripts, etc.) and the survey links. I really did have a lot of fun on the series and I hope we can do this again!

TechNet Webcast Series

Session 1: A Fast-Paced Feature Overview and Series Introduction (Level 200)
   Presenter: Kimberly L. Tripp, SQLskills.com, 
   Session's corresponding blog entry is here
   Session's survey is here.

Session 2: Security (Level 200)
   Presenter: Bob Beauchemin, SQLskills.com, 
   Session's corresponding blog entry is here
   Session's survey is here.

Session 3: Understanding Installation Options and Initial Configuration (Level 200)
   Presenter: Kimberly L. Tripp, SQLskills.com, 
   Session's corresponding blog entry is here
   Session's survey is here.

Session 4: Upgrade Considerations and Migration Paths (Level 200)
   Presenter: Kimberly L. Tripp, SQLskills.com, 
   Session's corresponding blog entry is here
   Session's survey is here.

Session 5: Effective Use of the New Management Tools (Level 200)
   Presenter: Kimberly L. Tripp, SQLskills.com, 
   Session's corresponding blog entry is here
   Session's survey is here.

Session 6: New Application Design Patterns for Scalability and Availability and the Operational Implications of Service Broker (Level 200)   
   Presenter: Bob Beauchemin, SQLskills.com, 
   Session's corresponding blog entry is here
   Session's survey is here.

Session 7: Technologies and Features to Improve Availability (Level 200)
   Presenter: Kimberly L. Tripp, SQLskills.com, 
   Session's corresponding blog entry is here
   Session's survey is here.

Session 8: Implementing Database Mirroring, Part 1 of 2 (Level 200)
   Presenter: Mark Wistrom, SQL Server Program Manager - Microsoft Corp., 
   Session's corresponding blog entry is here
   Session's survey is here.

Session 9: Implementing Database Mirroring, Part 2 of 2 (Level 200)
   Presenter: Kimberly L. Tripp, SQLskills.com, 
   Session's corresponding blog entry is here
   Session's survey is here.

Session 10: Recovering from Isolated Disasters and Human Error (Level 200)
   Presenter: Kimberly L. Tripp, SQLskills.com, 
   Session's corresponding blog entry here. And a second blog entry here.
   Session's survey is here.

Session 11: Best Practices in Building Robust, Recoverable, and Reliable Systems (Level 200)
   Presenter: Kimberly L. Tripp, SQLskills.com, 
   Session's corresponding blog entry here.
   Session's survey is here.

And that's about it! I hope you really enjoy the series... and if you like that one, you might want to checkout the entire 10-part series on MSDN. The link to the blog entry that has all the links (like this one) is here.

Have fun,
kt

Well... 11 of 11 has completed. Friday was our last chat - until next time ;). It was a summary event where I took a slightly different spin on things focusing on grouping technologies by the amount of effort that's needed to implement them. Simply put, we looked at the technologies in order of what gives you the biggest bang for the buck. We ended the session with a ton of great questions (as always!) and there was even a question on the origin of foo (make sure to also see fubar).

First, there were a few links that I wanted to provide from the session, I'll start with those:

And, we also talked about Migrations:

Finally, capacity planning:

  1. Calculate the amount of space needed for your tables (calculate this as rows per page and then required pages as MB)
  2. Calculate the amount of space needed for your indexes (you can use sp_spaceused to get a current ratio of index to data and then use that OR you can estimate 1-3times your current data in indexes...yes, if you have 10GB of tables - you should estimate 10-30GB for indexes)
  3. Calculate in your estimate on future growth
  4. Take your single largest table and multiply by 1.5 for free space. (Use 2.5 IF you're going to use ONLINE index operations). So, if the single largest table is 3GB then I'd add 7-8GB for free space)
  5. Add a "just in case" extra 10-20%
  6. And, I didn't mention this BUT you should also include alerts to help you monitor space usage and significant changes to your free space!

And that wraps up the series. Wow - I can't believe how many of you joined in for questions as well as stayed on until the end. It's really great that so many of you are still having fun with SQL Server as well. I look forward to another series with you...at some point! In the interim, here are a few places where I'll be:

SQLskills Immersion Events - in the US... will be announced shortly. The BEST place to be when we announce the dates for these events is a subscriber on SQLskills. Subscribing is FREE and the announcements are going to be later this month. Here's a link to directly subscribe on SQLskills: http://www.sqlskills.com/login.aspx.

Thanks again for attending the series! It was great fun. I'll post a final blog entry with ALL of the links as well as all of the survey links. I know that they're going to send me these so that you can get easier access to them.

See you next time,
kt

In part 9 of our webcast series titled: Implementing Database Mirroring, we covered the steps from setup to failover to monitoring. There were lots of great questions and I think we could easily go back and do a couple more hours on database mirroring, failover combinations - including manual failover and client application questions. Having said that, there were a few interesting scenarios that came up that I thought I'd add a bit more details about here. For simplicity I created sections...

Where to go for more information on Database Mirroring and SQL Server SP1

Database Mirroring between Editions

Database Mirroring is supported in both the Standard Edition (SE) and the Enterprise Engine (EE) Edition(s): Enterprise, Enterprise Eval and Developer. In the EE Editions all configurations (synchronous and asynchronous) are supported: High Availability (sync), High Protection (sync) and High Performance (async). In the SE, only the synchronous forms of Database Mirroring are supported: High Availability and High Protection. One thing that is true however, (and I learned this as well - durin the webcast in Part 9 - thanks to the question submitted and Mark being present...thanks Mark!), is that even while synchronous mirroring is supported in both SE and EE, you can only create a mirroring partnership between servers of the same edition.

Database Mirroring between Platforms

Database Mirroring is supported in both the Standard Edition (SE) and the Enterprise Engine (EE) Edition(s): Enterprise, Enterprise Eval and Developer. In the EE Editions all configurations are supported: High Availability, High Protection and High Performance. In the SE, only the synchronous forms of Database Mirroring are supported: High Availability and High Protection but not the asynchronous High Performance configuration. One thing that is true however, (and I learned this as well in Part 9 - thanks Mark!), is that even while synchronous mirroring is supported in both SE and EE, you can only create a mirroring partnership between servers of the same edition.

Combining Database Mirroring with Other Technologies

The Books Online has a section targeting exactly this discussion. Review this section in the SQL Server 2005 Books Online (April Update): Database Mirroring and Other Features and Components. Additionally, I've provided a few comments for you to review as well as links to some of the specific BOL topics that exist on these combinations.

Database Mirroring with Failover Clustering

These two technologies CAN be combined but there are multiple things with which you should be aware. First, a failover of a cluster is SLOWER than a failover of a Mirror pair... as a result, it is likely that your secondary server will come online as the new principal in the time that it takes your principal (which is on a cluster) to recover. In a lot of cases, this is good because this keeps you online longer and results in less downtime but it may also be undesireable when your primary is now running at your alternate operations site - which is unstaffed. So, in some cases you may want to prevent automatic failover and instead only use the secondary mirror when you absolutely have to (i.e. NOT just when the cluster fails). If this is the case then you might prefer running with the High Protection configuration of Database Mirroring instead of the High Availability configuration.

This will allow you to manually failover when desired.

As another option - you can increase the timeout for Database Mirroring failover to 90 seconds. If the cluster comes back online within 90 seconds then the automatic detection/failover of the High Availabilty configuration will not occur unless the cluster does not come back online (as the principal) within x seconds. You can configure the Database Mirroring Failover timeout by using ALTER DATABASE.

ALTER DATABASE dbname SET PARTNER TIMEOUT x

Please note, this is only one timeout of many. There are many different types of timeouts in the system that can cause a failover. However, a hard error code generally starts the failure procedure sooner.  Mark pointed this out in his failure detection slides in our TechNet webcast series, Part 8.

Review this section in the SQL Server 2005 Books Online (April Update): Database Mirroring and Failover Clustering.

Database Mirroring with Replication

These two technologies CAN be combined together but not all configurations are supported and where supported, there are specific setup requirements. From the BOL: Replication supports mirroring the publication database for merge replication and for transactional replication with read-only Subscribers or queued updating Subscribers. Immediate updating Subscribers, Oracle Publishers, Publishers in a peer-to-peer topology, and republishing are not supported.

 
Review this section in the SQL Server 2005 Books Online (April Update): Replication and Database Mirroring

Database Mirroring with Log Shipping

These two technologies CAN be combined together but it will require a bit of manual configuration to continue log shipping when a mirror becomes the new principal.

Review this section in the SQL Server 2005 Books Online (April Update): Database Mirroring and Log Shipping.

And - there are others in the BOL. Please reference the sections listed above for more details.

And - with that - we're caught up with our resources and references for this series. Part 11 - the LAST one - is this Friday, May 19. I look forward to your being there LIVE. Register here and come ready with your questions, this one is going to be VERY focused on best practices, ideas/architectures and your questions. Those of you that are there LIVE will help to direct the session.

Thanks!
kt

In the last few minutes of the webcast (part 10), I goofed up one line of code and didn't realize it until today. As my very last demo (and there were at least 10 different scenarios/concepts/demos yesterday) in my webcast, I decided to show a Database Snapshot on a Mirror database. It was the second database snapshot that I had created so my first database snapshot demo was just fine. However, when I went to create the database snapshot on the mirror, I inadvertently left off the most important part "AS SNAPSHOT OF AdventureWorks". The irony is that I tried to query some tables and just ended up (because we were right at the end of the webcast ;)) saying that I probably wasn't getting the table names right. Ha - there were no tables... I hadn't created a database snapshot, I had created just another database - so the only tables I was seeing were the catalog views.

Anyway, just for clarity, I corrected the "Demo Scripts" zip that's associated with Part 10 BUT if you've already downloaded it then you'll have the old (and incorrect) version of this script (SnapshotOnMirror.sql). And, for completeness, I'll put the code that I executed during the webcast here:

USE AdventureWorks
go

USE master
go

CREATE DATABASE AdventureWorksSnap
ON
( NAME = N'AdventureWorks_Data',
FILENAME = N'C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\Data\AdventureWorksSnap_Data.mdfss')
-- , SIZE = 167872KB , MAXSIZE = UNLIMITED, FILEGROWTH = 16384KB )
go

and the code that I should have executed here:

USE AdventureWorks
go

USE master
go

CREATE DATABASE AdventureWorksSnap
ON
( NAME = N'AdventureWorks_Data',
FILENAME = N'C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\Data\AdventureWorksSnap_Data.mdfss')
-- , SIZE = 167872KB , MAXSIZE = UNLIMITED, FILEGROWTH = 16384KB )
AS SNAPSHOT OF AdventureWorks  <<<< ----------
go

USE AdventureWorksSnap
go

SELECT * FROM person.contact
go

So, quick demos right at the end of the webcast might not have been my best idea ;). But - I'm surprised none of you called me on it?! I'll blame it on this for now.

Have a great weekend,
kt

OK - today's session was quite fun... lots of demos and quite a few "tie-ins" where I tried to bring together many things that we've touched on in our series. And - that's really the point of the series - creating a reliable, robust, scalable and available environment takes MANY different features. You really need to architect a complete solution in order to handle the many potential problems that may occur. And, unfortunately, it's a never ending process; you're never done and you're never going to get everything (sorry!). You will need to re-evaluate, monitor, and manage your system as long as it runs to keep it reliable, available and fast. Something will come up...someday...that you didn't think about, evaluate and/or prevent. But, then you'll know and then you'll put something into place to keep it from happening again.

So - to tie back into some of the other sessions and resources, here is a list of everything to date in the series as well as a few specific references I made during the session.

Demo Scripts are here: 20060512_TechNetWebcast-Part10.zip (25.46 KB) (updated on Sat, May 13 at 2:55 PDT)
Credit Database zip is here. NOTE: This is a 48MB zip which expands to a 175MB backup and restores to a 700 MB database (with a lot of free space for testing, etc.).

TechNet Webcast Series

Session 1: A Fast-Paced Feature Overview and Series Introduction (Level 200)
   Presenter: Kimberly L. Tripp, SQLskills.com, 
   Session's corresponding blog entry, here

Session 2: Security (Level 200)
   Presenter: Bob Beauchemin, SQLskills.com, 
   Session's corresponding blog entry, here

Session 3: Understanding Installation Options and Initial Configuration (Level 200)
   Presenter: Kimberly L. Tripp, SQLskills.com, 
   Session's corresponding blog entry, here

Session 4: Upgrade Considerations and Migration Paths (Level 200)
   Presenter: Kimberly L. Tripp, SQLskills.com, 
   Session's corresponding blog entry, here

Session 5: Effective Use of the New Management Tools (Level 200)
   Presenter: Kimberly L. Tripp, SQLskills.com, 
   Session's corresponding blog entry, here

Session 6: New Application Design Patterns for Scalability and Availability and the Operational Implications of Service Broker (Level 200)   
   Presenter: Bob Beauchemin, SQLskills.com, 
   Session's corresponding blog entry, here

Session 7: Technologies and Features to Improve Availability (Level 200)
   Presenter: Kimberly L. Tripp, SQLskills.com, 
   Session's corresponding blog entry, here

Session 8: Implementing Database Mirroring, Part 1 of 2 (Level 200)
   Presenter: Mark Wistrom, SQL Server Program Manager - Microsoft Corp., 
   Session's corresponding blog entry, here

Session 9: Implementing Database Mirroring, Part 2 of 2 (Level 200)
   Presenter: Kimberly L. Tripp, SQLskills.com, 
   Session's corresponding blog entry, here.

Session 10: Recovering from Isolated Disasters and Human Error (Level 200)
   Presenter: Kimberly L. Tripp, SQLskills.com, 
   You're reading it! :-)

Recovery Models and Backup/Restore

  • MSDN Webcast Parts I and II cover Recovery Models and some issues/best practices related to changing recovery models. Check out the blog entry here which has links to the sessions and their associated blog entries.
  • MSPress Title: SQL Server 2000 High Availability, Chapter 9: Database Environment Basics for Recovery is here. The MSPress page for this title is here.
  • SQL Server Magazine Article on Isolated Disasters and Recovery (using RESTORE with STANDBY/STOPAT to investigate when a database became damaged) is here. Check out a consolidated list of all of my SQL Server Magazine Articles here and SQL Server Magazine here.

Table and Index Partitioning

RAID 0+1 and RAID 1+0

There was a question that came up on this and the question basically asked - which is better. Well, this is a hard question to answer because they both have pro's and con's BUT before I get to the pro's/con's there's also another [more important] issue; these two get confused and swapped all the time. In fact, many vendors USED to refer to these interchangeably and even just lumped them together as RAID 10. Today, most people don't do this and most people also try to refer to the underlying technlogy instead of the numbers. Having said all of that, RAID 1+0 is Striped Mirrors and is my general recommendation because it tends to be more reliable than 0+1 and can tolerate more drive failures than 0+1. RAID 0+1 is Mirrored Stripes - which generally outperforms RAID 1+0 but cannot tolerate the loss of more than one drive and because of that it's more vulnerable. In the end, I'd suggest a simple "educational" site here (it's on a commercial site but it has a nice - and short - description of the different types of RAID arrays).

See you next week - for our LAST part in this series - Part 11: Best Practices in Building Robust, Recoverable, and Reliable Systems (Level 200).

Thanks for reading, listening and continuing to ask great questions!
kt

On Wednesday I had the pleasure of chatting with Greg Low - a fellow RD and MVP who is also excited to be focused and working in the SQL Server space. He's been doing a few podcasts on database technologies and we were finally able to hook up today. He caught me off guard with a couple of questions on my hobbies (the hobbies that I have outside of SQL Server ;-) but more than anything he got me talking about indexes (well, now that's not all that hard, eh? ;-). It was fun - thanks for the chat Greg.

Check it out:

SQL Down Under: http://www.sqldownunder.com/
The show in mp3 format: http://www.sqldownunder.com/SDU15FullShow.mp3
The show in wma format: http://www.sqldownunder.com/SDU15FullShow.wma

Enjoy - and thanks Greg!
kt

Categories:
Events | Indexes | Tips

Well, Friday brought another flood of great questions from everyone as we moved our way through many of the new 2005 tools. The one thing that I really wanted to stress was that *many* SQL Server 2005 tools (SQLCMD, SSMS and SQL Profiler) offer important features that can be leveraged today, even if your primary production servers are still SQL Server 2000. I did move through the tools quickly and showed quite a few new features; there are a lot of excellent resources to help you dive in deeper now that you're interested, ready and know some of the rewards of starting now. Here are a few of those resources:

For deleting old database backup history, there are a couple of stored procedures in msdb that can be used:

  • sp_delete_backup_and_restore_history
  • sp_delete_backuphistory
  • sp_delete_database_backuphistory

For cycling errorlogs, use: sp_cycle_errorlog.

And - lots of other questions that I primarily answered online in the last 40+ minutes. We had a great group and I hope everyone had fun. For the second half+ of the series we're going to focus on architectures and solutions - mostly related to disaster recovery and avoidance. However, the next part of the series is going to branch into a new (and *very* interesting) area of SQL Server 2005 - Service Broker. There are many impacts of Service Broker on the SQL Server system AND you might find a few applications of the technology within your own application as well. Have a great time with Bob for part 6 and I'll be back for part 7 next Friday.

See you soon!
kt

And another one bites the dust! Wow - what a great group today... soooooo many questions! For those of you that weren't there - the lecture was 80 minutes and the additional Q&A went on for another 45 minutes. So - as a result, there were *a lot* of additional resources needed. Let me get started with all of those right away.

To prepare for moving to SQL Server 2005 there are a few EXCELLENT resources with which you should start:

Phase 1 - Prepare to Upgrade/Migrate

Phase 2 - Database-level Testing

  • Copy Database Wizard or
  • Backup/Restore or
  • Detach/Attach

Phase 3 - Server-level Testing

  • Consider upgrade in-place or
  • Make sure that you manually migrate all EXTERNAL objects, logins, jobs, error messages, etc.

Phase 4 - Testing/Updating after the upgrade/migration

  • Update statistics immediately
  • Test application code, database compatibility modes, session settings
  • Check for "broken code" in terms of system table changes
  • MOST of this should have already been done and assessed in Phase 1 but better to be safe!

And - finally - the other things we talked about and the rest of the links are here:

And - that's it for this week. See you next Friday when we chat about the new Management Tools and how to effectively use them!

Thanks for listening/watching and asking GREAT questions,
kt

Hey there everyone - Well there was lots of excitement around our first session...so much so that apparently a Live Meeting server went down and caused MANY of you to get booted-out or even blocked-from attending (figures, right!).... Ugh (talk about the irony here - a series on high availability that isn't available because a server crashes...hhmmm, I think I know where to go for my next potential customers ;) ;). Regardless, I'm glad that at least a couple hundred of you did get in. For the more than 1000 others that were registered but unable to get in - I truly want to apologize!

The good news is that we now have the on-demand link available and for all of you who registered, it should have been sent to you via email. Also, as promised, I've attached the resources and demo scripts we talked about today.

Partial Database Availability Demo Scripts: PartialDBAvail-DemoScripts.zip (4.19 KB)
Database Mirroring Demo Scripts: DatabaseMirroring-DemoScripts.zip (3.74 KB)
Replication
Demo Scripts - Since this demo was completed through the UI, here are some useful references on Replication:

Other Resources:

SQLCMD Resources: My blog entry after Michiel Worries' Webcast (includes links to webcast, etc.)
TechNet Resource Center: SQL Server 2005 Mission Critical High Availability
Demo: Windows Server System Reference Architecture Design Considerations for SQL Server 2005 High Availability
Whitepaper: Choosing a Database for High Availability: An Analysis of SQL Server and Oracle

Also, to get you ready for SQL Server 2005 - check out the Upgrade/Migration Resource Center: Upgrading to SQL Server 2005

And... that should keep you busy between now and next week!

Have fun,
kt

Hey there everyone! I know I still owe you a few Q&A entries (for sessions 7, 8 and 9) but I wanted to get this blog entry out there so that you can play a bit with some of the resources. This series was targeted at developers but really helps to "bridge the gap" between development and administration by always remembering the impacts of what you implement (and techniques to help you to see if you do). More specifically, everything you do and/or design, has the potential for a negative impact to something else - there's no free lunch, eh?

So, this series focused more on Scalability but always remembered the impact to availability and/or reliability. This last session brought together the three primary areas to remember while developing a scalable and reliable architecture:

  1. Know your data
    • Design for Performance - Sessions 1, 2, 3, 6, and 8
  2. Know your users
    • Indexing for Performance - Sessions 4, 5 and 9
    • Optimizing Procedural Code - Session 7
    • Controlling Mixed Workloads and Concurrency - Session 6
  3. Users lie
    • Profile - to make sure that you're tuning what's really happening as opposed to what you think was going to happen! - Session 9

This last session had some great questions and as a result, a few new resources were used. Here are a few of the things we talked about:

Event Notifications and DDL Triggers

DMVs

Webcast links for the entire series!

Part 1: Creating a Recoverable Database
For the MSDN Download for Part 1, click here.
For the SQLskills Blog Entry for Part 1, click here.

Part 2: Creating a Reliable and Automated Backup Strategy
For the MSDN Download for Part 2, click here.
For the SQLskills Blog Entry for Part 2, click here.

Part 3: Designing Tables that Scale, Best Practices in Data Types and Initial Table Structures
For the MSDN Download for Part 3, click here.
For the SQLskills Blog Entry for Part 3, click here.

Part 4: SQL Server Indexing Best Practices
For the MSDN Download for Part 4, click here.
For the SQLskills Blog Entries for Part 4
Resource links blog entry, click here.
Q&A blog entry, click
here.

Part 5: SQL Server Index Defrag Best Practices
For the MSDN Download for Part 5, click here.
For the SQLskills Blog entry, click here.

Part 6: Mixed Workloads, Secondary Databases, Locking and Isolation
For the MSDN Download for Part 6, click here.
For the SQLskills Blog Entry for Part 6, click here.

Part 7: Understanding Plan Caching and Optimizing Procedure Performance
For the MSDN Download for Part 7, click here.

Part 8: Data Loading and Aging Strategies - Table and Index Partitioning
For the MSDN Download for Part 8, click here.

Part 9: Profiling for Better Performance
For the MSDN Download for Part 9, click here.

Part 10: Session Summary - Common Roadblocks to Scalability
For the MSDN Download for Part 10, click here.
Transcript can be found here.

So, the series comes to an end (even though I still have more work to do). I have to say that it was a lot of fun and I enjoyed everyone's questions. And /start shameless plug here/ starting in March, SQLskills will begin a 10-12 part series on TechNet. The series will include sessions from my colleague Bob Beauchemin as well as me. This will definitely be more Operations and DBA focused but for all of you developers - it may help you better understand the system, High Availability and a myriad of New Features in SQL Server 2005.

I hope to see you there - or at least your DBA... ;-)

Thanks again everyone,

Kimberly

This is a much needed and much overdue blog entry... In 8 Steps to Better Transaction Log throughput, I mentioned a customer that was helped by TWO typical optimization problems I see. In that blog entry, I said I would write two blog entries - that one on transaction log optimization and another on common tempdb optimizations. Well, I forgot...until I was reminded with an email this morning (thanks Marcus!).

First - a bit of understanding of TempDB - what goes there?

  • Internal temporary objects needed by SQL Server in the midst of other complex operations. For example, worktables created by a hash aggregate will be stored in TempDB or interim tables uses in hash joins (almost anything that shows as "hash" something in your query plan output is likely to go to tempdb).
  • User objects created with either # (for local temporary objects), ## (globabl temporary objects) or @ (table variables)
    • # = Local temporary object
      Local temp objects are objects accessible ONLY in the session that created it. These objects are also removed automatically when the session that created it ends (unless manually dropped).
    • ## = Globabl temporary object
      Global temporary objects are objects that are accessible to ANYONE who can login to your SQL Server. They will only persist as long as the user that created it lasts (unless manually dropped) but anyone who logs in during that time can directly query, modify or drop these temporary objects. These objects are also removed automatically when the session that created it ends (unless manually dropped) OR if being used by another session when the session that created it ends, when the session using it finishes using it (and it's only as long as any locks are held). If other sessions need more permanent use of a temporary object you should consider creating a permanent objects and dropping it manually.
    • @ = User-defined Table Variable
      User-defined Table Variables were introduced in SQL Server 2000 (or, wow - was it 7.0?) and provide an alternative to temporary tables by allowing you to create a variable defined as type TABLE and then you can populate and use it in a variety of ways. There has been A LOT of debate over whether or not you should always use table variables or always use temp tables. My response is that I ALWAYS avoid the word always! My point is that table variables are NOT always better nor are temp tables always better. There are key uses to each. I tend to like temp tables in scenarios where the object is used over a longer period of time - I can create non-key indexes on it and it's more flexible to create to begin with (SELECT INTO can be used to create the temp table). I also have the ability to use the temporary table in nested subprocedures because it's not local to the procedure in which it was created. However, if you don't need any of those things then a table variable might be better. When it is likely to be better - when you have smaller objects that don't need to be accessed outside of the procedure in which it was created and when you only need KEY indexes (a table variable ONLY supports the indexes created by a create table statement - meaning PRIMARY KEY and UNIQUE KEY).
  • Objects created by client applications - this is possibly a large part of your problem... Profiling can help you to determine if there's a lot of TempDB usage from your client applications.

OK, so now that you know what goes there - how do you make it optimal?

First and foremost, TempDB is in cache just as any other database is in cache. TempDB does not spill to disk unless you are low on cache and/or if you have a lot of inflight transactions in TempDB. Although TempDB is not persisted from shutdown to restart - it still needs to do some logging and therefore you should consider its optimization a lot like other databases.

Things you should do for TempDB (that are a lot like what you should do for every database):

  1. Isolate the data and log portion of TempDB.
  2. Place them on clean, defragmented disks.
  3. Pre-allocate them so they don't need to do a lot of autogrowth.
  4. Make sure you have sufficient memory to support active objects (check for disk activity to the disks that contain TempDB files).
  5. Make sure that transactions are written efficiently so that there are no unusually long running transactions that are unnecessarily holding resources (and therefore locks and therefore log activity).

And - if you need to move TempDB, you should review this KB Article: Moving SQL Server databases to a new location with Detach/Attach

Things you should do SPECIFICALLY for TempDB (especially if you're running on a multiproc machine):

Before I say what... let me tell you why? TempDB has a large number of objects being created all the time. For an object to be created, space must be allocated to it. Space allocation is determined by looking at some of the internal system pages (the GAM, and SGAM). In the end, it is these pages that start to have significant contention (with just one file) in a VERY active TempDB. To minimize that contention you can create multiple files.

  1. Consider creating multiple files for TempDB (even if on the same physical disks) so that there is less of a bottleneck when objects are being allocated. Make sure to read associated KB.
  2. Consider setting a trace flag to have object allocation grab extents rather than pages. Make sure to read associated KB.

BOTH of these last two are described in detail by a KB article: FIX: Concurrency enhancements for the tempdb database.

OK - so that should really help! Moving forward (meaning SQL Server 2005), having multiple files can still help for TempDB.

Effectively Designing a Scalable and Reliable Database

A Primer to Proper SQL Server Development

New Features in Indexing and Index Maintenance Best Practices, Part 5 of 10

Presented by Kimberly L. Tripp, SQLskills.com

Q: Can I view a recording of this webcast? Part 5 can be replayed by clicking here.

Q: Where can we get the demo scripts AND the sample database: Credit? The demo scripts are in this zip (20050902MSDNDemoScripts.zip (8.52 KB)); here in this blog entry. However, at the series completion, I will also create an entry under Past Event Resources for the entire webcast series.  To download the ZIP of the Credit Database Backup click here. Once unzipped, restore this backup to SQL Server 2000 or SQL Server 2005. The backup is a SQL Server 2000 backup and can be restored to either version! If restoring to SQL Server 2005, you might want to change the destination for the data and log file as the path will probably be different.

Q: Where are the links to all prior Webcast Q&As from this series?

Part 1: Creating a Recoverable Database
For the MSDN Download for Part 1, click here.
For the SQLskills Blog Entry for Part 1, click here.

Part 2: Creating a Reliable and Automated Backup Strategy
For the MSDN Download for Part 2, click here.
For the SQLskills Blog Entry for Part 2, click here.

Part 3: Designing Tables that Scale, Best Practices in Data Types and Initial Table Structures
For the MSDN Download for Part 3, click here.
For the SQLskills Blog Entry for Part 3, click
here.

Part 4: SQL Server Indexing Best Practices
For the MSDN Download for Part 4, click here.
For the SQLskills Blog Entries for Part 4
Resource links blog entry, click here.
Q&A blog entry, click here.

Q: How can I replay previous sessions? I thought we were going to get emails for replaying -- but I haven't received any replay emails. You will receive replay emails ONLY when you register for these sessions through MSDN. We’ve come to find out that there are other ways to register but it’s only through MSDN that we know for sure you will receive the replay information. Regardless, you can always find the “on-demand” version of the sessions here.

Related Resources

MSDN Webcast: Indexing for Performance – Proper Index Maintenance MSDN Whitepaper: Microsoft SQL Server 2000 Index Defragmentation Best Practices TechNet It’s ShOwtime Webcast: Index Defragmentation with SQL Server 2005 

Technical Questions

Q: In your script, what is "HA Requirements"? HA = High Availability. This is the requirement that your table stay online and available. Some companies are trying to achieve 99.999% uptime, this is especially challenging when even maintenance operations take a table offline.

Q: If you create extra indexes is there a easy to configure utility that you can run across an application after it has run for a few months to list keys that were never or infrequently used? Use one of the new DMVs: sys.dm_db_index_usage_stats. To see the complete list of DMV objects, use the following query:SELECT * FROM sys.system_objects WHERE [name] LIKE 'dm[_]%'

Q: What is DMV again? Dynamic Management View. These are new objects which give information about in-memory objects and state information.

Q: What are the parameters and their usage – for sys.dm_db_index_physical_stats? (DatabaseID, ObjectID, IndexID, PartitionNumber, Mode)

DatabaseID = [ NULL | 'DatabaseID' ] NULL: returns information for ALL databases, if NULL is used no other options can be supplied. This returns ALL indexes for all objects in all databases. Easy but possibly slow.
DatabaseID: smallint type. Refers to the ID for a specific database. DB_ID() or DB_ID('DatabaseID') can be used. The latter allows you to run this from ANY database as long as you have access. However, 3-part naming must be used.
ObjectID = [ DEFAULT | NULL | 'ObjectID' ]DEFAULT/NULL: return ALL base data: CL, Heap, LOB for the specified database.
ObjectID: int type. Refers to the ID for a specific object. OBJECT_ID('TableName') can be used. When using OBJECT_ID, you can use 1/2/3-part naming. Be sure to use 3-part when executing outside of database.
IndexID = [ DEFAULT | NULL | 'IndexID' ]DEFAULT/NULL: All indexes
IndexID: tinyint type. Refers to the ID of a specific index.
PartitionNumber = [ DEFAULT | NULL | # ]DEFAULT/NULL/0: return ALL partitions
#: returns only the details about specific partition. When a PartitionNumber is specified then an IndexID must also be specified.
Mode = [ DEFAULT | NULL | 'SpecificMode' ]DEFAULT/NULL/LIMITED: return FAST scan and use only an IS (Intent Shared) Table-level lock. This lock blocks ONLY eXclusive TABLE-level locks and schema changes. Excellent, relatively unobtrusive way to get fragmentation details.
LIMITED: IS Lock. Same as SQL 2000 WITH FAST, only page counts and EXTERNAL fragmentation displayed. Does not detail INTERNAL fragmentation and page density.SAMPLED:  IS Lock. For tables less than 10,000 pages (~80MB), all details are produced. For tables of more than 80MB, two samples are done (1% and 2%) at every nth page. The samples are compared and if close, 2% sampling output returned. If not close, then up to 10% will be sampled.DETAILED: S Lock. Entire table analyzed for both internal and external fragmentation. Returns one row for each level of the index from the leaf level (level 0) all the way up to the root level. This can help you determine fragmentation in the non-leaf levels but at the expense of holding a shared table level lock.
Q: How often should you run DEFRAG on your SQL server box? Should this be a part a regular schedule? Taking down SQL is their any other consideration? First, the only thing that’s not available is the table being REBUILT. Defragging an index does not take that table/index offline. So, more than anything, it depends on what you’re trying to achieve. If you want achieve better availability on SQL Server 2000 then you might choose to defrag rather than rebuild – to keep your tables available.

Q: How often do you get such perfect tables in practice? A table is always completely clean and contiguous after a rebuild. To periodically fix a table, you should use consistent and automated rebuild strategies.

Q: Do you have suggestions for developers using MSDE when customer’s demands can vary? Vary from few transactions to a large customer with many transactions. The general best practices in database and table design scale from the low end all the way up to the high-end and in the end – helps your database scale!

Q: Can you touch on rules of thumb for "pad index"? If fragmentation in the leaf level is minimized through proper index maintenance and fillfactor – then fragmentation in the non-leaf levels should be low as well. You rarely need to specify padindex unless you have widely varying distribution of data and really want to leave larger gaps because of strange densities of data.

Q: Do most of these "Index Rules" apply to Indexed-Views? Yes! All indexes can become fragmented after data modifications... Your scripts should always look for fragmentation across all scripts.

Q: Can you discuss fragmentation WRT horizontal partitioning, especially range partitioning on the primary key? SQL Server 2005 offers more granular rebuild options –but not necessarily online. In many cases, you might want to design a read-only partitioned table and keep the volatile portion of the table (especially if only one partition), in its own separate table – possibly using a partition view (or an inline table valued function) over these two tables.

Q: If I'm selecting from a table with a where FirstName = ... and LastName = ... and I have 2 indexes, one on LastName and another on FirstName. Are they both used? With an AND – maybe. The optimizer will look at the Index statistics to determine if either of them selective enough to use only one index. If neither is selective alone and a better index does not exist (a better index for AND would be one that includes BOTH of the columns in the SAME index – as a composite index), then SQL Server may choose to join the indexes (index intersection).

Q: URLs on the Resources slide can't be read. Could you type then into the Q&A, please? When the session is available for download (which is what happens when MSDN put this online), then you can access the URLs there as well. Typically, I place all of the links at the beginning of the Q&A – resources section. I’ll make sure to do this consistently!

Q: How does an uniqueidentifier used as a clustered primary key effect performance? This is best answered by session 4. In short, a non-sequential GUID can cause a lot of fragmentation.

Q: What is ExtentFragmentation as reported by DBCC SHOWCONTIG and is it less important than Logical Fragmentation? Extent Fragmenation refers to how many extents are next to each other. This is a bit more important than Logical Fragmentation as logical fragmentation shows whether or not the pages are next to each other.

Q: How much danger is there in the defrag processes? What kind of backup procedures do you suggest when you defrag? More frequent transaction log backups. A defrag generates a lot of log information. However, it does so in mini transactions. As a result, transaction log backups can occur concurrently with the defrag process and even though the defrag is not complete, the transaction can still be cleared because the defrag process runs as small transactions instead of one long running transaction. This also improves concurrency because the locks are released throughout the process.

Q: Defraging a large index can cause the log file to grow quite large. Is there a way to minimize this other than frequently log backups? Yes, you’re correct – defraging a large index WILL grow the log file quite large! As for minimizing this activity in the log – no way to do that. But – you’re correct in increasing the frequency of log backups!

Q: With very large tables, how much available disk space (both transaction logs and data drive) do you need to have to rebuild? Does it take less space to defrag than to rebuild? Well, this is really a multipart question… First, log space for rebuilds is mostly dependent on the recovery model. If you are running the FULL recovery model then creating and/or rebuilding indexes will take enough log space for the entire rebuild to complete. If you are running in the BULK_LOGGED or SIMPLE recovery models then this operation will run as a bulk operation and will be minimally logged. While this will take less time and significantly less log space, you are giving up some features when switching recovery models. I would strongly suggest reviewing the second session to see if this is appropriate.Now, as for data space – a rebuild will always require at least the table size in free space and possibly as much as double (if an online rebuild is being performed). Typically, when space estimates are being done (when capacity planning the database) I always recommend taking the largest table size and multiplying it by 2 or 3 – in order to make sure you have enough space for rebuilds. There is space needed for sorting as well – this can come from the database OR from tempdb (using the SORT_IN_TEMPDB option).Defraging doesn’t move an object so it doesn’t take additional data space BUT it does require more overall log space because it runs as mini transactions instead of just one.

Q: Should we look at different fragmentation stats if there are multiple files in the same filegroup? No, you still want to review average fragmentation. However, you may have more “fragments” in a table that spans filegroups; this does not necessarily mean that your table is fragmented.

Q: Are there any good third party tools for checking fragmentation and performing maintenance? Unfortunately no revolutionary ones (that I know of and/or can recommend)...but I still have high hopes :)

Q: How do you determine the appropriate fill factor? Unfortunately, there isn't a magic number... but, you can test your guestimate by seeing how fragmented the table becomes between your regularly scheduled defragmentation routines.

Q: Does it matter if I build the clustered index before/after rebuilding the non-clustered indexes? You should always create the clustered index before creating non-clustered index but as for rebuilding - you can rebuild your indexes independently as a rebuild of the clustered does not (generally) cause a rebuild of the non-clustered. There is one exception to this rule in SQL Server 2000 – in 2000 ONLY, SQL Server will automatically rebuild the non-clustered indexes when a non-unique clustered index is rebuild. Why? Because the uniqueifiers are rebuilt.

Q: Will doing a defrag followed later by a rebuild decrease the work of the rebuild? Not really. A defrag doesn’t move the object – only a rebuild does. However, you might minimize the cost of the sort…

Q: How does cache map to table pages, i.e., does free space in table pages have a 1:1 correspondence to wasted cache? SQL Server reads the 8K page from disk into memory. The number of bytes that are wasted on disk are also wasted in memory. This is often the motivation for vertical partitioning! You might refer back to session three for more details on row/page structures!

Q: If switching a varchar cluster to a bigint and vice-versa in 2000, what would the best order of drop/create index? Actually, this is the reason that CREATE with DROP_EXISTING was created… so that you could “change” the definition of the clustered

DROP TABLE test
go
CREATE TABLE test
(
      testid      int               not null,
      col1       varchar(100)      not null
)
go
CREATE CLUSTERED INDEX testind ON test(col1)
go
CREATE CLUSTERED INDEX testind ON test(testid) WITH DROP_EXISTING
go
sp_rename 'test.testind', 'NewIndexName', 'INDEX'
go
sp_helpindex test
go
Q: What about instances of one name only? (like Madonna, Cher, etc. ;^) Well, this is a good question and this is something that you might need to plan for in design. In these cases, you might allow NULLs in the lastname column and then make sure to search both when a lookup is performed. To be honest, I probably won’t do all that much to find these special first names – unless you wanted to do searches across both columns without knowing whether or not this is a first or last name. You might do something like this in a lookup
SELECT * FROM NamesTable
WHERE LastName = @variable
      OR (FirstName = @variable AND LastName IS NULL)
Comment: Just wanted to say I appreciate the blog you have put together.

Thanks for the thanks! It's a lot of work but I think it's great as a reference!! Even for me! To be honest, I can't always remember where to find things either! J

Thanks! So – we’re half way there – 5 more to go! And, lots more questions coming I’m sure J For the next session, we’re going to cover Isolation and options in Isolation in SQL Server 2005. If you’re interested in hearing more isolation, locking/blocking – here’s the registration link:MSDN Webcast: A Primer to Proper SQL Server Development (Part 6 of 10): Mixed Workloads, Secondary Databases, Wait States, Locking and Isolation See you on Friday!

kt

Effectively Designing a Scalable and Reliable Database A Primer to Proper SQL Server Development
Indexing Best Practices, Part 4 of 10
Presented by Kimberly L. Tripp, SQLskills.com For a list of the non-technical Resources related to this webcast, please review the following blog entry:MSDN Webcast Q&A for Resources: Indexing Best Practices, Part 4 of 10  Technical Questions

Q: When a clustered index is added to a table what happens physically to the data pages? SQL Server moves the data to the new location defined by the ON clause of the index creation statement. The ON clause can specify a different filegroup or even a partition scheme and the heap will move to the new location.

 Q: I'm not getting the follow-up emails with the replay link. How do I get on this email list for replay?

You *must* register through MSDN. If you are using a partner site then that is probably why you’re not getting the email. IF you are registering through MSDN and still not receiving the email, please contact eventsup@microsoft.com with your information.

 Q: Does the copy affect the transaction log?

Yes. A create index or rebuild index statement does require the statement to be a single transaction. If you are running with your database set to the FULL recovery model then this single transaction will create a lot of log activity. You could switch to the BULK_LOGGED or SIMPLE recovery model and get less log activity but switching to the SIMPLE recovery model should be avoided as it will break the continuity of the transaction log. I would recommend two resources to get more information about Recovery Models and their impact on performance, logging and recovery.

MSDN Webcast Q&A: A Primer to Proper SQL Server Development, Creating a Reliable and Automated Backup Strategy, Part 2 of 10

Review the sample chapter: Chapter 9: Database Environment Basics for Recovery from the MSPress title: Microsoft SQL Server 2000 High AvailabilityQ: Can one switch back and forth between simple and full recovery mode?

Ideally, you would switch ONLY between the FULL Recovery model and BULK_LOGGED Recovery model but not with SIMPLE. The two resources listed above will also help to clarify this point.

 Q: What about re-building a clustered index? Will it be that slow too? Is it recommended to do it off-hours?

Rebuilding a clustered index takes a lot of the same requirements that creating the CL index does. It can also be slow and yes – as a result, you might want to do this off hours. You can speed up the process by changing recovery models but switching recovery models can also have an impact on recovery. See the two resources above for more details to help clarify this point. And, in part 5, I’ll cover more of the fragmentation details.

 Q: When is it desirable to move a table?

This is not necessarily an easy one to answer simply. But, I often recommend separating transaction processing data from read-only data as well as possibly placing a large table on its own filegroup (especially true with partitioned objects). But, I don’t often recommend separating indexes. Here’s a link to a SQL Server Magazine article I wrote – it does give some guidance. In general I don’t separate too many objects. However, if you’re building a database that you will put into a read-only state you might want to work a bit harder to “shift” objects to filegroups in such a way that each of those objects gets a nice and clean and contiguous chunk of space within the file. Another time that it’s useful to “move” an object is when you load data into a heap and then want to build the clustered index on another filegroup – where that heap will become a partition of a partitioned table. This is good when the clustered index will move the object into a filegroup that does not have other objects in it. Again, the clustered index will be nice and clean and contiguous in the destination filegroup.

 Q: So, if I understand correctly… a good approach to creating a table would be to have a clustered index independent of the primary key. So, ideally you could use a GUID for the PK and then have a clustered index in order to minimize the fragmentation.

The most ideal combination is when the clustered index is on an identity column – and even better if this is the Primary Key. However, if you do want to keep a GUID as the primary key then yes, you can create a clustered index which is independent of the primary key.

Q: In general, do you think folks have more problems setting up indexes for OLTP, OLAP/DWH, or Web apps?

In general, I think Decision Support environments can have more indexes with less of a negative… OLTP is a bit harder to find the right balance. The hardest is combination environments…

 Q: What was the diff between 0 and 1 on the third argument of DBCC IND?

The third parameter is for Index ID. 0 = Heap, 1 = Clustered Index, 2-250 are non-clustered index ids.

 Q: Should we keep the install defaults like Collation, and others? Or do you have recommendations?

Generally speaking, yes – I would recommend keeping the defaults. However, I do recommend that some development environments – especially those who are producing software for sale. When the database can be installed/setup on other servers that may have different code pages, etc., you might want to use binary case-sensitive sort. This will make your application more consistent – even with case-sensitive servers…and you never know if one of your customers may have one! It will also help to make you a better coder.

 Q: What design considerations should I take into account when planning my primary keys, indexes and clustered indexes for databases that will be used with replication?

There are no specific requirements for Replication other than the fact that you must have a Primary Key… but the same best practices discussed today apply and should be used in replication as well.

 Q: I am using SQL Server 2000. I have a table that is used a lot on the web server. The scan density is 97%. Should I adjust the indexing and/or Primary Key?

I tend to think more about how fast it drops rather than the actual number. But – 97% is pretty good in general. What you might want to do is a rebuild IF you have off hours time to rebuild (in SQL Server 2000 rebuilding can be done offline ONLY). Otherwise, you might just want to setup a defrag procedure. However, this is NOT

 Q: Why are so many SQL commands undocumented and how do you get to know how to use them if they are undocumented?

First and foremost, I want to strongly urge you to take your time with commands that are undocumented. If the commands are really useful for day-to-day operations, they’re probably already documented. Occasionally, development teams add additional functionality during development to test their functionality. These commands – while proving useful in some cases – do not generally go through the same quality assurance procedures that fully documented commands go through and as a result, I would be very cautious before using most undocumented commands.

 

Having said all of that, you’ll find a lot of useful tidbits in articles, on webcasts, in books, etc.  Just make sure you test the command thoroughly and don’t expect to get support on it. In general, undocumented commands are NOT supported.

 Q: What is the command for scan density on SQL Server 2000?

DBCC SHOWCONTIG. If you’re interested in learning more about defragmentation on SQL Server 2000, please review this MSDN Webcast: SQL Server 2000 Performance Tuning: Index Defragmentation Best Practices

 Q: In SQL Server 2000, DBCC SHOWCONTIG, can you explain extent scan fragmentation and if we should pay attention to it?

We’re going to go over this more with our next session on Index Defrag but Extent Scan Fragmentation describes whether or not the extents that an object owns are “next” to each other. If there are gaps then this implies that there are other objects “interleaved” with this one. This is not ideal, defragmenting the object will help to improve scans and should.

 Q: The size of the database doubles when a clustered index is added?

Once the index is created the space needed to build the index is released. However, while the index is being created you need roughly 2-3 times the TABLE size to build it. If you have a table which is 1GB then you should plan for at least 2-3GB of free space in order to build or rebuild a clustered index. However, it’s only the table’s size that doubles and it’s only during the index creation NOT after the index has been created.

 Q: Aren't GUIDs too large to use for an efficient index?

No, an index can be useful even if it’s a large percentage of the size of the table – mostly, because of the order of the columns and the criteria used in the query. Hard to give a fast answer to this but in general 16 bytes isn’t so bad – much larger may even be acceptable.

 Q: Does level 1 of a clustered index point to rows, or does it describe the minimum and maximum row on a page?

Basically, it’s a pointer to the page and it has a pointer to the lowest value on the page. Each pointer in the non-leaf always points to the “first” value on the page (meaning the lowest).

 

Q: Why would you be inserting using an index other than using the primary key? Assuming the primary key is appropriate, I wouldn’t. However, there are a lot of environments where the primary key is chosen and kept – for a variety of reasons. SQL Server creates the clustered index on the Primary Key by default but it is not required’ you can create the clustered index on something else. It’s the clustered index which defines the insert order – not the Primary Key; however, they are often one and the same.

 Q: Should all indexes be clustered if unique? Would this increase performance on search results from SP?

I think the question should really be should all clustered indexes be unique. Generally, the answer to this is yes. But to answer the second part – would this increase performance on an SP – the answer here is not necessarily. Stored procedures have many reasons for being – or not being – optimal. Plan choice, optimization and recompilation issues all play a major factor in whether or not a stored procedure is optimal. However, in session 7, I will talk about plan caching and stored procedure optimizations!

 Q: If there is no natural clustering key, what is the advantage of "coming up with one". Or, said another way, what is the basic reason why I want a cluster rather than a heap?A heap does not inherently have order to it (that’s the point) but when the records are not ordered then SQL Server needs to find a location in which to insert the row. This causes a lookup in something called a PFS (page free space) as well as a lookup in the table’s Allocation Map (actually called an Index Allocation Map in SQL Server 2000 and a Heap or B-Tree Map in SQL Server 2005). To help the insert performance you generally want to have the insert location defined. However, just having a clustered index is not the best answer as there are other internal dependencies on the log. Because of these internal dependencies you typically want the clustered index to be unique, narrow and static.  If you’re interested in reading more about the reasons behind having a clustered index which is unique, narrow and static:Ever-increasing clustering key - the Clustered Index Debate..........again! MSDN Webcast Q&A: Index Defrag Best Practices - Clustering Key Choice In some cases this means adding a column to a table solely for the purposes of clustering on it. Q: Is it a good practice to just have non-clustered indexes?Well, that’s a different way of answering the prior question…and while heaps can be useful for high performance loading they are generally not as useful for day to day OLTP or combination OLTP/DSS tables. So, my recommendation is almost always – the RIGHT clustered index + additional supporting non-clustered indexes. Q: In large OLTP databases is it recommended to move reporting to a separate database which has a lot of non-clustered indexes and no clustered indexes?No… Again, it’s not really a binary thing (should you always have a clustered, or not always have a clustered). Really, there are cases for both BUT in most databases – even DSS (Decision Support System) – a clustered index helps to define a narrow and unique value that all indexes have in common. This can help other operations…  So, even in a DSS environment I generally recommend a clustered, as well as additional supporting non-clustered. Having said that there are cases where a single clustered – setup for order by and range queries can be useful but that’s generally only when the large majority of queries all want * (all columns) and you typically want the exact search and order by in almost every case. Q: I've read that you shouldn't cluster a column that is an Identity column because it hurts performance for OLTP as all writes are on same page/area of disk? Your thoughts?

There are some cases where insert performance can be compromised – in very high volume OLTP systems (typically more than 400 inserts per second but this can vary) but in general, inserting into the same “hot” spot of a table improves performance because the activity is isolated and the needed pages are already in cache. Lots more on this on in other Q&As.

 Q: If I add an identity column to my table to have a good clustered key, it will never be used in a query. Would I naturally also include the primary key, even though it would make the clustered key less narrow?

You don’t need to include the primary key is the clustered key is already unique, narrow, static and ever-increasing.

 Q: Does this mean that vertical partitioning of a table is only helpful when the table is scanned?

No, there are still many benefits in vertical partitioning – such as with locking, scanning. I’d refer you to review the Webcast that was part 3 in this series, for more details.

 

Q: Can you use DTA against SQL 2000 databases?

Yes. You can use either the DTA or the ITW against SQL Server 2000. If you're looking for details about how to use the SQL Server 2000 Index Tuning Wizard you can review this whitepaper: Index Tuning Wizard for Microsoft SQL Server 2000.

 Q: Is DTA in the Express version, too?

DTA does not come with the Express Edition but you can certainly do tuning on a higher version and then use that tuned database in Express.

 Q: Can you explain briefly Index View?

It’s a materialized view – in the sense that the data as defined by the view – is actually materialized and stored on disk. This has both pro’s and con’s and needs to be evaluated fully before you implement these. Please check out these whitepapers for more details:

 Improving Performance with SQL Server 2005 Indexed Views by Author: Eric Hanson and Contributors: Gail Erickson, Lubor Kollar, Jason WardImproving Performance with SQL Server 2000 Indexed Views by Author: Gail Erickson and Contributors: Lubor Kollar, Jason Ward contributor Q: Do you think one can become expert in SQL Server and other aspects of Design/Programming or does one need to specialize?

Wow, I think it this is a tough question; more than anything I would recommend learning all of the options available so that you can better architect the right solution. Once you think you have a good solution, then move forward to really learn that area/topic/features. I’ve been using a phrase with SQL Server 2005 a lot – Jack of all trades, master of some...

 Q: Can you repeat the new function INCLUDE on index?

In SQL Server 2000 the maximum size of an index key is 900 bytes or 16 columns – whichever comes first. This key helps to keep an index’s b-tree smaller and more scalable. In SQL Serer 2005, the leaf-level of the index can include additional non-key columns. This continues to help keep the tree structure scalable but also allows you to cover more queries.

 Q: I have installed SQL Server 2005 CTP, but can only find SQL Configuration Manager. I don't see the management tools that you are using. Any suggestions?

Actually, no. This seems a bit strange?! I might try uninstalling and re-installing, as well as reviewing the setup logs to see if there are errors within them. For SQL Server 2005, all installation log files are stored in C:\Program Files\Microsoft SQL Server\90\Setup Bootstrap\LOG.

 Q: if your non clustered index either includes all or part of the clustered index are the clustered index columns added again to the non clustered index?

NO! SQL Server only adds the column(s) of the clustering key – if they are missing.

 Q: Did you mention the cool timebound option in DTA?

Yes, but only briefly. Vipul Shah discusses DTA in greater detail in these two webcasts.

 TechNet Webcast: Performance Diagnosis in SQL Server 2005 (Level 300) by Vipul ShahTechNet Webcast: SQL Server 2005 Database Tuning Advisor (Level 300) by Vipul Shah Q: What is the fill factor and how is it used?

I think I’m going to defer you to the next session! Now, there’s some motivation. J

 

And here are a few comments that I’ve received! I just wanted to say thanks for your comments and thank yous… J It's your questions that keep me on my toes and current. This is still fun so keep it coming.

 

Comment: Thank you again, particularly for thoroughness and enthusiasm.

Comment: Let me add my voice to the chorus of "thank you's", Kim. You're  presentations and verbal Q&As are great, and your online ones are a truly generous gift to the community. No other webcaster has come close to doing what you do. Many thanks from all of us.  

Comment: I must compliment you on your Q&A in your blog - I've just checked it out - couldn't be more comprehensive!

 

So – 4 down, 6 to go! Lots more questions coming I’m sure J

 

If you’re interested in hearing more about fragmentation, how to view it, how to clean it up and finally, how to prevent it – here’s the registration link:

MSDN Webcast: A Primer to Proper SQL Server Development (Part 5 of 10): New Features in Indexing and Index Maintenance Best Practices (Level 200) Thanks!kt

Effectively Designing a Scalable and Reliable Database A Primer to Proper SQL Server Development

Indexing Best Practices, Part 4 of 10

Presented by Kimberly L. Tripp, SQLskills.com

Q: Can I view a recording of this webcast? The webcast will be available for viewing within 24 hours. All “on-demand” webcasts can be found here.

Q: Where can we get the demo scripts AND the sample database: Credit? The demo scripts are in this zip (20050826MSDNDemoScripts.zip (12.17 KB)); here in this blog entry. However, at the series completion, I will also create an entry under Past Event Resources for the entire webcast series.  To download the ZIP of the Credit Database Backup click here. Once unzipped, restore this backup to SQL Server 2000 or SQL Server 2005. The backup is a SQL Server 2000 backup and can be restored to either version! If restoring to SQL Server 2005, you might want to change the destination for the data and log file as the path will probably be different.

Q: Where are the links to all prior Webcast Q&As from this series?

Part 1: Creating a Recoverable Database
For the MSDN Download for Part 1, click here.
For the SQLskills Blog Entry for Part 1, click here.

Part 2: Creating a Reliable and Automated Backup Strategy
For the MSDN Download for Part 2, click here.
For the SQLskills Blog Entry for Part 2, click here.

Part 3: Designing Tables that Scale, Best Practices in Data Types and Initial Table Structures
For the MSDN Download for Part 3, click here.
For the SQLskills Blog Entry for Part 3, click here.

Q: How can I replay previous sessions? I thought we were going to get emails for replaying -- but I haven't received any replay emails. You will receive replay emails ONLY when you register for these sessions through MSDN. We’ve come to find out that there are other ways to register but it’s only through MSDN that we know for sure you will receive the replay information. Regardless, you can always find the “on-demand” version of the sessions here.

Resources mentioned in today’s Session:

If you’re interested in hearing more of the theory behind the indexing recommednations, as well as more of the reasoning behind unique, narrow and static criteria for the clustering key –  watch the TechNet "It's Showtime" recording, recorded at Tech Ed Europe titled: Index Creation Best Practices with SQL Server 2005 which covers Index Creation Best Practices on SQL Server 2005 and really gets into a lot of the "why's". If you’re interested in hearing a lot more about how to use Profiler effectively with ITW, watch the MSDN Webcast titled: SQL Server 2000: Performance Tuning - Finding the Right Balance of Indexes which covers Index Creation Best Practices on SQL Server 2000.

As for whitepapers:

SQL Server 2005 Beta II Whitepaper: Partitioned Tables and Indexes by Kimberly L. Tripp

SQL Server 2005 Beta II Whitepaper: Snapshot Isolation by Kimberly L. Tripp

PREVIEW: The Database Administrator’s Guide to the SQL Server Database Engine .NET Common Language Runtime Environment by Kimberly L. Tripp

Improving Performance with SQL Server 2005 Indexed Views by Author: Eric Hanson and Contributors: Gail Erickson, Lubor Kollar, Jason Ward

Improving Performance with SQL Server 2000 Indexed Views by Author: Gail Erickson and Contributors: Lubor Kollar, Jason Ward contributor 

Additional Webcasts:

TechNet Webcast: Performance Diagnosis in SQL Server 2005 (Level 300) by Vipul Shah

TechNet Webcast: SQL Server 2005 Database Tuning Advisor (Level 300) by Vipul Shah Support WebCast: SQL Server 2000 Profiler: What's New and How to Effectively Use It by Sri Kasam and Ajay Manchepalli TechNet Support WebCast: I/O performance problems and resolutions in Microsoft SQL Server 2000 by David G. Brown TechNet Webcast: SQL Server 2005 Series (Part 7 of 10): Indexing Enhancements (Level 200) By Bryan Von Axelson

Summary!

Finally, check out the Indexes category on my blog – for a variety of Q&A postings, other resources and lots of great links! The rest of the technical Q&A, I should be able to post soon. Stay tuned! Thanks,kt

MSDN Webcast: A Primer to Proper SQL Server Development
Designing Tables that Scale, Best Practices in Data Types and Initial Table Structures, Part 3 of 10

Presented by Kimberly L. Tripp, SQLskills.com

Q: Can I view a recording of this webcast? Here’s the specific replay link.

Q: Where can we get the demo scripts? The demo scripts are in this zip (20050819MSDNScripts.zip (3.67 KB)); here in this blog entry. However, at the series completion, I will also create an entry under Past Event Resources for the entire webcast series.

Q: Does Kimberly L. Tripp have a blog? Yes, www.SQLskills.com/blogs/Kimberly and if you’re reading this on my blog…well, then you already know! J

Q: Where are the links to all prior Webcast Q&As from this series?

Part 1: Creating a Recoverable Database
For the MSDN Download for Part 1, click here.
For the SQLskills Blog Entry for Part 1, click here.

Part 2: Creating a Reliable and Automated Backup Strategy
For the MSDN Download for Part 2, click here.
For the SQLskills Blog Entry for Part 2, click here.

Q: Is this Session 2 of the "Effectively Designing a Scalable and Reliable Database" webcast series? The Title bar here reads "MSDN Webcast: MSDN: A Primer to Proper SQL Server Development (Part...)". A bit misleading don't you think? So, I finally figured out the mystery here. When I originally submitted my proposal for this series, I titled it like this: Effectively Designing a Scalable and Reliable DatabaseA Database Developer’s Primer to Proper SQL Server Development (Level 200-300) But somehow, through a few edits and since each module also has a more detailed title shown in this following list, then I think the title just got edited down to fit. In fact, with subtitles like these, I can barely fit the entire title, subtitle and session title on a single slide. J

Part 1: Creating a Recoverable Database

Part 2: Creating a Reliable and Automated Backup Strategy

Part 3: Designing Tables that Scale – Best Practices in Data Types and Initial Table Structures

So, the mystery ends here… It’s all the same content and the general idea is that building a scalable and reliable database takes a variety of best practices that together make it happen; there is no “DBCC MAKEITGOFASTER(‘TRUE’)” option. I'll unveal a new title slide in part 4... oh, the excitement!

Q: How can I replay previous sessions? I thought we were going to get emails for replaying -- but I haven't received any replay emails. You will receive replay emails ONLY when you register for these sessions through MSDN. We’ve come to find out that there are other ways to register but it’s only through MSDN that we know for sure you will receive the replay information. Regardless, you can always find the “on-demand” version of the sessions here.

Q: What does CLR stand for? Common Language Runtime. Quick WHATIS.COM Definition here.

Q: If something uses sp_OA* now in SQL 2000, am I guaranteed it will use SQLCLR in SQL 2005? Not automatically; however, you should consider rewriting you sp_OA* is used to automate COM objects; these, by definition, are not framework objects. Moving to SQL Server 2005, you can continue to run with your previous sp_OA procedures to access COM objects but if you can write SAFE .NET Frameworks code to do the same thing then you should convert your sp_OA* code where possible. Not everything can be converted but the basic principle is that which you use sp_OA* is a good thing to target for conversion. For some great details on the differences and how to move forward – go here. Also helpful is this KB Article regarding COM and .NET in SQL Server 2000 Using extended stored procedures or SP_OA stored procedures to load CLR in SQL Server is not supported

Q: I loved how XML data is available interactively in SQL Server Management Studio (SSMS); can HTML data columns be displayed in the same way? Yes, if you store HTML in a column of type XML then the column data will appear as a link and when clicked, you can edit it in the XML Editor window in SSMS.

Q: Is the "XML" datatype really a "varchar(max)"? No. In previous releases the XML type really was just a “blob” stored in the database. There were no inherent optimizations and you were not able to define a scheme for an XML column. In SQL Server 2005, XML data is stored natively and offers the ability to put indexes on the columns as well as better access and manipulation. You *can* store XML as varchar(max) but it will be harder to access/code. There’s a very nice description of the tradeoffs in this whitepaper:XML Options in Microsoft SQL Server 2005 

Q: What is the difference between numeric and decimal? In SQL Server they are synonyms. I seem to remember hearing that there was a difference in the standards…somewhere but with a few references and looking around I can’t find any. Regardless, I’d make sure that you standardize on one or the other – just in case they do change, at least all of your data is consistent.

Q: I use decimal datatype for money values. Why it may be suboptimal to use decimal (instead of money datatype)? I wouldn’t say that it’s suboptimal – except maybe in storage length. However, decimal can be more precise.

Q: When should you use a float? A number of my clients use Float – i.e. Banks, Stocks, etc. What would be your argument against this? Just the lack of an absolute minimum precision/scale.

Q: We have a database where datatypes are set. But to use a simple percentage calculation we had to add .0 to the calculation to get the correct percentage. For example 100(5/9) is truncated to 0, but 100((5+.0)/9) yields correct result. Is there any standard procedure without massive conversions to assure a correct result? It’s a matter of data type. In the case of 100, 5 and 9 – these are all integers – without any decimal place. So – the division of 5 divided by 9 yields 0. However, 5 divided by 9.0 or 5.0/9 – both have a decimal and in that case the other data types are implicitly converted to the higher/more precise data value. A cool way to find out what the actual “base type” and/or the level of precision/scale is to ask these questions as if the data is of type sql_variant:

SELECT SQL_VARIANT_PROPERTY(5/9.0, 'BaseType')
SELECT SQL_VARIANT_PROPERTY(5/9.0, 'Precision')
SELECT SQL_VARIANT_PROPERTY(5/9.0, 'Scale')   -- numeric(8,6)
SELECT SQL_VARIANT_PROPERTY(5/9, 'BaseType')
SELECT SQL_VARIANT_PROPERTY(5/9, 'Precision')
SELECT SQL_VARIANT_PROPERTY(5/9, 'Scale')   -- int

Q: I am surprised by the nchar/nvarchar suggestions. I've always been told to NOT use these data types. Can you give the "reasoning" for this suggestion here? The primary reason that you want to use Unicode data is to avoid a cost in “translation” between the client and the server – when you’re using web-based applications that natively use Unicode. Additionally, with ever growing customer bases – you can more effectively store international data properly by supporting Unicode in the client and in the database. You are correct that these take 2 bytes per character rather than one – but that’s what gives them the larger range of characters to use/display/manipulate. Also, if you plan to use the SQLCLR against these columns, you will need to use Unicode as opposed to ASCii as the SQLCLR requires it.

Q: Does Unicode use twice as much space? Yes, Unicode columns are 2 bytes per character. If you defined a column of type nchar(100), that column will require 200 bytes of storage.

Q: Sometimes, and not always, it becomes a tradeoff in Storage Space v. Performance, doesn’t it? That’s an excellent point. You always need to look at the trade-offs associated with any decision and in fact, this is yet another. So, I always want the row size to be as optimal and as reasonable as possible. However, there are many compelling reasons – international and/or web-based applications as well as SQLCLR functionality.

Q: Is the N' keyword usable in MSSQL 2000? Yes, all Unicode character strings should be preceded with an upper case N in all releases where Unicode is supported (SQL Server 7.0 was the first release that supported Unicode).

Q: Is there any relationship between the OS's code page and nvarchar/varchar? Yes. When SQL Server is installed- it chooses the server’s default code page from the Operating System. A code page (or collation) defines the order in which characters are evaluated. The additional attributes a collation can define are: whether or not the characters are case-sensitive; the sensitivity of width and kana-type. This may not seem obvious but if these characters sets are native to you then the order to the data would probably make more sense.  Once the collation is set for the server, the database inherits the server’s collation by default, when tables are created – each character-based column inherits the database’s default collation by default. However, at ANY of these levels the collation can be changed. There are benefits to doing so – in that you can keep data within a specific table ordered for a specific character set. Wow, now that I’m getting started, this could be a session in and of itself. So, just to give you a few tips:

  • Be careful using temporary tables if your database’s collation is different than your server’s collation.
  • Changing collation on the fly (for example, if you want to do case-sensitive searching through a view – of which I’ve included a sample file) can be expensive.
  • Changing a database’s collation is supported but you might have troubles converting from a case-sensitive to a case insensitive database.

The files that can help you to get more familiar with verifying collations, changing collations, etc. are CaseSensitiveSearchingInView.sql and ChangingDatabaseCollation.sql.

Q: Another great use of adjunct tables is to isolate culture-specific data from the base (and usually indexed) data. So, this is a great point. If you change the collation for a column – and index it – then you can get better performance when accessing that column. Changing a collation on the fly (as in the script sample for the previous questions), is what’s expensive. However, you should make sure to do a lot of testing.

Q: I have a US code page OS but Latin general 850 collation set for SQL Server; I have no problems storing and retrieving data. Exactly! That’s the benefit of collation information being stored with each column. It allows the application to store many types of data – natively and then just deal with their display at the client.  For those of you who want more details on how to do this, see COLLATE in the books online.

Q: Somewhere along the line I was always told to define varchars in multiples of 8 characters. Something to do with space allocated with new rows.  Any ideas if it has any premise? Well, my first response to this was that I’ve never heard of this… So, I thought I’d check around and well, the couple of folks I’ve chatted with haven’t heard this either… IF anyone has – feel free to comment on this blog.

Q: How about images? Is it better to store them in the db or just the url? This is really an old debate. I still think there’s validity to both sides of it as well. I’d say that you should go through the pro’s and con’s and really choose what’s best for your application. If you use a lot of urls then you might be able to reference more without having to manage a copy of that data… If the url is your data then you might find that managing it in the database (from a backup/restore perspective) is easier – and you know that it’s always going to be there v. the url occasionally not working. Wow, I’d almost like to get a panel of folks together to have everyone hash out their opinions on this one. J SQL Server 2005 does make managing images easier but there are still points to both arguments!

Q: How would SQL Server support becoming a data server of video file? The only real format for this data would be binary. However, there is a 2GB limitation to a single value.

Q: Are you aware of a stored procedure to handle blob data effectively? i.e. for image storage? No. But there is an executable that comes with SQL Server 2000’s samples and I think it’s called textcopy.exe. That might be some good sample code to review.

Q: Does SQL Server support the SQL 2003 NULLS FIRST and NULLS LAST keywords of the ORDER BY clause? No support for NULLS FIRST or NULLS LAST but the order by can include ASC for ASCending or DESC for DESCending. NULL values are treated as the lowest value so NULLS FIRST would be ASC and NULLS LAST would be DESC. But – that’s the only support for specific placement of NULL values.

Q: Is there a "best practice" for the new C# nullable types? i.e. int? vs. int -- re:MSSQL2005? When programming with the SQLCLR, always use types that allows NULLs (assuming your data can contain nulls) and then make sure to always follow the appropriate behavior with NULL. For example, concatenating NULL should yield NULL.

Q: Can I change (refactor) the underlying type of a UDT (e.g., from int to bigint) in SQL Server 2005?  Unless I've missed something, I can't do it in SQL Server 2000 unless I drop the type from all uses. No, as the user-defined data type (UDDT) is only used at create time to map to the datatype; the connection is then broken (altering the UDDT definition will only impact new tables and variable definitions). User-defined Types (UDTs) are even more schema bound than UDDTs and it's a much tougher job to alter them (you need to write conversion from the old to the new). This is much more complex since they fall outside the conversion matrix and the only thing SQL knows is the binary representation. So if the data type was smart enough to add a signature to the serialized data, so that the next version can understand which data type is represents and determine if it can convert to it you might have a shot, otherwise you have to add a column, and convert columns and drop a column afterwards.

Q: Can I easily replace the SQL Server 2000 UDT with a SQL Server 2005 CLR Custom Type? I have a client who wants to -- eventually -- migrate to SQL Server 2005, but for the immediate future wants me to build their application with SQL Server 2000. I'm very interested in SQLCLR types, especially WRT to maintaining type consistency across T-SQL, CLR SP's and external data objects, so using CLR custom types. This is a typical thought for the SQLCLR – “I am going to build an object database.” Type fidelity between the client and the server exists today, but they want class fidelity, so they are going to create the customer object UDT, which is not really for what the SQLCLR was intended. You should be thinking scalar types with additional custom characteristics – not object types – as the scalar types will give you optimal and predictable performance.

Q: Can you find ANSI_NULL_DFLT_OFF setting as on or off? Does DBCC USEROPTIONS do that? Yes, DBCC USEROPTIONS is a nice, quick way of seeing the currently set LIST of session settings. However, there are better ways to check these settings programmatically. it’s not as nice to check it programmatically  The files that can help you to get more familiar with ANSI Null Issues is ANSINullIssues.sql.

Q: It would be great if you could show us a list of  classic fields "Account No", "Account Balance", "Description" etc… and tell us what type you would use inc. "Null/Not Null" etc. Well, this is a bit subjective and with a bank account there would probably be a lot of business rules that surround the account number. However, if this were a sales table and you needed an ordered then it would be a bit easier (again, business rules might change this). Typically, I’d go for a Primary Key which is simple and narrow and not necessarily a “natural” key. However, some of this reasoning comes from how tables are joined and how SQL Server internally stores data. Account Balance – I’d go with a precise numeric. Based on length of storage and the business rules behind required precision, you might choose a numeric(19,6). This will require 9 bytes of storage. For description, I would definitely go with a variable type – probably nvarchar(200) depending on what description really means. Here’s how the table would probably look on creation:

CREATE TABLE Account(
AccountID     int           IDENTITY  NOT NULL CONSTRAINT AccountPK          PRIMARY KEY,
Balance       numeric(19,6)           NOT NULL CONSTRAINT AccountBalanceDflt DEFAULT 0,
Description   nvarchar(200)           NULL)
go
INSERT Account (Description) VALUES ('The account of Kimberly Tripp.')
PRINT 'Kimberly Tripp''s account number is ' + convert(varchar(10), @@IDENTITY) + '.'
go

Q: In the Q&A you might want to tell your listeners that in SQL the string 'abc' is equal to 'abc ' when compared – unlike for C, C++ ,C# , VBNET, etc. This will help with the char v. varchar issue… It’s a good point! You should make sure that you do some testing to make sure that data behaves the way you expect. Trailing blanks may be trimmed with some data types and not others and once again, ANSI settings may impact this behavior.

Q: What are the NULL block and varchar block in the row layout? The NULL block helps to store and identify rows that are set to NULL. In the past, SQL Server stored this as a single space in a varchar column ad there was no real way to differentiate between a real space (as a value) from a NULL value (stored as a space). When the storage engine was re-architected for SQL Server 7.0 the NULL block was added to store a NULL value specifically.

Q: How would the table structures differ between the 32-bit and 64-bit in SQL Server 2005 and SQL Server 2000? They don’t! There is no difference in the on-disk format between any of the versions of SQL Server. In fact, this was even true of the MIPS/Alpha/Win32 releases when SQL Server 6.0 released on all of those platforms.

Q: do you have a script to calculate rows per pages (using the formula you showed? Ha! Seemed a bit painful I know… But that formula works.But, yes – it’s a heck of a lot easier to just do this yourself if you have a table already. In SQL Server 2000 you need to use DBCC SHOWCONTIG with TABLERESULTS and in SQL Server 2005, you use a new DMV called sys.dm_db_index_physical_stats. This takes 5 parameters and has some really nice new features to it.

Q: I might have missed this, but is the SQL Server Management Studio (SSMS) the replacement for Enterprise Manager? Yes, SSMS replaces the Enterprise Manager and the Query Analyzer. And, for a more scaled down query tool, you can use Express Manager (XM). To download XM, you can go here  If you want more details regarding the latest SQL Server CTP, you can go here.

Q: Is Management Studio backward compatible for use with SQL Server 2000? Yes! You can use SSMS to connect to SQL Server 2005 and SQL Server 2000.

Q: How do we write our queries for Insert/Update/Select on a database that we have no idea how it was setup - so we don’t experience the local issue that you talked about? Basically, the best way is for the developer to create an interface that masks the complexities of the database schema through stored procedures, functions and views.

Q: Do indexes point to the row level or the page level? There are two primary parts to an index – the leaf level and the non-leaf level. The leaf-level of a clustered index IS the data (a clustered index is an ordered table).

Q: What is the best practice or rule of thumb on using identity primary keys on tables in a database? And would that be all tables and/or some tables or not all? This is a great question. However, I might phrase it a bit differently because I *think* I know where you’re going with this… Generally, I recommend that large tables have a clustered index on an identity column. This can help to improve performance of inserts by not causing splits (inserts go to the end) and by helping non-clustered indexes look up the corresponding rows in the clustered table by using a clustering key which is unique, narrow and static. This is really quite a bit related to the session on indexing… I think you should definitely attend session!

Q: How could a LOB end up in the leaf level? This is really another great question for the indexing session – and a bit on internals as well. In SQL Server 2005, rows can span pages (so you can have a row greater than 8060 bytes) and rows can have LOB types (even in the leaf level of an index). I’ll explain how this can happen with the new INCLUDE option for indexes – in the next session!

Q: Are you talking at the Vegas launch of 2005? Yes! You can see the complete list of my upcoming events here.

Comment: By the way, thanks for the great Web cast. I don't know of many other companies that do this sort of thing -- take questions directly from their customers.

Thanks! This is part of what I love about these webcasts… it allows me to hear what you’re interested in and it allows me to keep learning. To be honest, often I have to look things up and/or hit the KB and/or ask some of the other SQL geeks I know – in order to answer some of these questions. It really helps to keep me on my toes! In fact, this week I bothered two of my favorite people, my partner Patrick and my good friend Gert.

 

Thanks again for another GREAT week of questions!

kt

See you later today – Friday, Aug 26th

MSDN Webcast: A Primer to Proper SQL Server Development (Part 4 of 10): Best Practices in Indexing (Level 200) 

Two days isn't enough time!

In fact, that's really true. There are so many interesting (well, I think they're interesting) aspects to Indexing that we could in fact do a whole 10-part Series on Indexing and it's likely that we'd still have other things to look at and talk about. So, in preparation for part 4 of my MSDN Webcast Series, I thought I'd take a few minutes to pull together a number of related resources. If you watch/read them prior to the webcast then you'll be a step ahead. To minimize overlap, part 4 is a myriad of best practices with more of the "what's" and "how's" to implement the right indexes - but not quite as much of the theory and/or the "why's" as I've done quite a bit of that before and even if you don't watch/read these resources BEFORE the session...you can use these resources to fill in the gaps and answers the "why's" AFTER the session!

So, let's talk about the session... I'm going to focus on SQL Server 2005 for all of my demos (and I'm going to try to be VERY demo heavy in this one) but quite a bit of the theory also applies to SQL Server 2000...

If you're interested in some of the "why's" and/or you want to hear the information for SQL Server 2000 and/or you just want to see what other folks have asked, here are your options:

Watch the MSDN Webcast titled: SQL Server 2000: Performance Tuning - Finding the Right Balance of Indexes which covers Index Creation Best Practices on SQL Server 2000
Watch the TechNet "It's Showtime" recording, recorded at Tech Ed Europe titled: Index Creation Best Practices with SQL Server 2005 which covers Index Creation Best Practices on SQL Server 2005 and really gets into a lot of the "why's"

Read many of the past blog entries - including A LOT of Q&As created from former webcasts by accessing my "Indexes" category on my blog entries here.

So, if you're not scared to spend quite a bit of time with indexes... I think there's lots of good stuff to review! At a minimum you should think about showing up tomorrow, I promise to give you Practical Indexing Strategies and a lot of demos to prove my point!

Effectively Designing a Scalable and Reliable Database
A Database Developer’s Primer to Proper SQL Server Development
Part 4: Best Practices in Indexing

Attend Part 4 LIVE on 26 August 2005 by registering at this link!

MSDN Webcast Q&A: A Primer to Proper SQL Server Development

Creating a Reliable and Automated Backup Strategy, Part 2 of 10

Presented by Kimberly L. Tripp, SQLskills.com

Q: Can I view a recording of this webcast? Here’s the specific replay link.

Q: Where can we get the demo scripts? The demo scripts are in this zip (20050812MSDNWebcast.zip (4.86 KB)); here in this blog entry. However, I will also create an entry www.SQLskills.com under Past Event Resources for the entire webcast series.

Q: Does Kimberly L. Tripp have a blog? Yes, www.SQLskills.com/blogs/Kimberly and if you’re reading this on my blog…well, then you already know! J

Q: Where are the links to all prior Webcast Q&As from this series? Click here for the MSDN Download for Part 1, click here. For the SQLskills Blog Entry for Part 1, click here.

Q: Is this Session 2 of the "Effectively Designing a Scalable and Reliable Database" webcast series? The Title bar here reads "MSDN Webcast: MSDN: A Primer to Proper SQL Server Development (Part...)". A bit misleading don't you think? So, I finally figured out the mystery here. When I originally submitted my proposal for this series I titled it like this: Effectively Designing a Scalable and Reliable DatabaseA Database Developer’s Primer to Proper SQL Server Development (Level 200-300) But somehow, through a few edits and since each module also has a more detailed title like the following, then I think the title just got edited down to fit. In fact, with subtitles like the following… I can’t even fit the entire title, subtitle and session title on a single slide. JPart 1: Creating a Recoverable DatabasePart 2: Creating a Reliable and Automated Backup StrategyPart 3: Designing Tables that Scale – Best Practices in Data Types and Initial Table StructuresSo, the mystery ends here… It’s all the same content and the general idea is that building a scalable and reliable database takes a variety of best practices that together make it happen; there is no “DBCC MAKEITGOFASTER(‘TRUE’)” option.

Q: WRT "best practices," what do you think of SQLBPA? It’s a great (and free J) tool that can help you to recognize some of the most commonly overlooked best practices. Part of the intent is also to help you determine if you’re violating certain practices that might impact your upgrade strategy. For example, if you’re accessing the system tables you will want change that code to use information_schema views, stored procedures (sp_*) or system functions to gather the system information. If you’re looking solely for information about your upgrade path, the BPA has been updated/replaced with a new tool called the Upgrade Advisor.

Best Practices Analyzer Tool for Microsoft SQL Server 2000 1.0 

Upgrade advisor URL - Microsoft SQL Server 2005 upgrade Advisor - CTP June version

Q: In the SBS2003 environment the default backup does a live backup using a volume shadow snapshot of the MSDE and SQL databases on the server. How bad is this practice v. doing a SQL backup and backing up the backup and the logs? Well, there are really a couple of issues here… First, the positives of controlling backups within SQL Server:(1)   You can backup databases individually(2)   With certain database backup strategies (specifically, Database/Log strategies in the FULL Recovery Model), you can recover up to a specific point in time (P.I.T. Recovery) or to a marked transaction or even up-to-the-minute (if the transaction log is still accessible at the time of the disaster.(3)   The process of backing up the transaction log helps to keep the transaction log small and manageable.Second, the negatives:(1)   You have to manage each database backup strategy individually – however, tools like the Database Maintenance Wizard can help to minimize this.(2)   You’ll need to do backups more frequently if you want to take more advantage of these features – however, this would also give you better granularity on your backups AND result in less data loss because your snapshots probably aren’t very frequent.Now – as for volume shadow snapshot – this is GREAT for a complete image of the system and it does work (especially well) when you have to replace the complete system or when you want access to specific files on the system. Where you can run into troubles (and I’m not saying that this is always going to be the case, one way or the other) is when a SQL Server database becomes damaged and you revert back to just a file (or even all of the database files) from a shadow copy image. In almost all cases (and I really can’t think of too many where this would work except in some read only cases), you won’t be successful *just* restoring an old image of just a single database file. So, the damaged database *might* end up being lost. Instead, you could restore the entire database image from the shadow copy point in time and that *might* work; however, if the database structures have changed since you backed it up then the best choice would be to drop the database, copy over the files (the mdf and ldf) and then re-attach the database. That should work and all you’ll need to do from there is add any logins and/or changes since the last backup. Where things could be a lot worse is if one of your system databases becomes corrupt. This might limit your ability to recover. Now, having said all of this – I *know* that your SQL backup strategy WILL work and I know the complete set of options related to it. While it is a bit more administrative work, if your data is critical and minimizing data loss is *very* high on your list; I would use BOTH. Volume Shadow Copy for the system and file recovery strategies and SQL Server database and log backups for your databases. OK, so I decided that I wanted to check things out a bit more…especially, since I am not all that well-versed on VSS; I decided to look up as much as I could find and well, I’m going to stop with this. First, pre-Windows 2003 sp1 it looks like even backing up a volume that includes databases that are in the FULL or BULK_LOGGED Recovery models, generates an error. As of sp1 (or the hotfix associated with this KB), the error is no longer generated BUT I did find the following paragraph very useful:

This hotfix does not let Windows Backup perform a Volume Shadow Copy service backup of a SQL Server database with a recovery model that is set to Full or Bulk-Logged. You must use the SQL Server backup procedure to back up your SQL Server databases that have a recovery model that is set to Full or Bulk-Logged. And – if you do a backup with NTBackup, it will log a message in your backup log of the following: NtBackup does not support backing up SQL databases which are configured in non-simple recovery mode.If the database has been backed up, it is likely corrupted. Please do not restore this database.The recommended way of backing up non-simple SQL database is to use the backup solution that comes with SQL Server.

So, while I’d really like to dig into this deeper – I have to use my best judgement at this point and say that you should probably work to handle backups through SQL Server.

Q: What is the default recovery model in MSDE and SQL Express? I ended up demo’ing this in the presentation. I was sure about MSDE but I wanted to do a quick demo on SQL Express. The quick check is just to check the database properties of the model database… Another option, create a new database and see what the properties are. For both MSDE and SQL Express, the default database recovery model is SIMPLE.

Q: How would you compare the native backup to products like LiteSpeed from Imceda (which is used internally at Microsoft - I'll avoid the typical dog food comments) and SQL Backup from the chaps at Red-Gate? Third-party products can offer significant additional benefits over just the native backup types – for example, encryption and compression. This can result in improved security, faster backup times and smaller backup files.  

For more details about SQL Lite Speed, please go to: http://www.imceda.com/For more details about SQL Backup, please go to: http://www.red-gate.com/

However, I would also like to make a comment about the fact that Lite Speed is used internally at Microsoft. Backup products are an interesting special case in terms of how they work with regard to SQL Server. Most backup products use the VDI interface (the Virtual Device Interface provided for SQL Server) in order to get access to essentially the same pages that SQL Server Backup does… So, in fact, using a third-party product like SQL Lite Speed, is actually still testing the VDI interface and still working SQL Server pretty well.

Q: Can you use 3-rd party backup with log shipping? Depends on whether or not they support it and/or have automated software to help you configure it. Many products do offer ways to help automate log shipping with compressed versions of your backup files… See earlier links for references to other products.

Q: How do you fix a log file that won't truncate? The best way to fix a transaction log that won’t clear:(1)   Terminate all connections to the database (best to set the database to SINGLE_USER mode)(2)   Backup the transaction log with a normal BACKUP LOG command(3)   Shrink the transaction log to the smallest size possible (DBCC SHIRNKFILE WITH TRUNCATEONLY)(4)   Alter the database to increase the transaction log to a more appropriate size with ALTER DATABASE.(5)   Return the database to MULTI_USER mode.These steps are exactly the same as the steps necessary to remove fragmentation in VLFs. The file that can help you to get more familiar with this process is located in the scripts associated with this session and is named: VLF Optimization - Northwind Walkthrough.sql.

Q: If you do the steps recommended to clean up the fragmented VLFs, does it break the continuity of the transaction log backup chain? No! You can backup the transaction log before, as part of (step 2) and after you clear up the fragmentation – and no other backups are necessary. It is a complete chain that can be used in recovery.

Q: The transaction log portion of the tempdb seems to be growing and growing- even with the simple recovery model. In fact, it only seems to release space when CHECKPOINT is run manually? Well, I have to admit that this is a bit strange and should not otherwise be the case. I would suggest seeing if you have any open transactions in the database (DBCC OPENTRAN) but that shouldn’t make a difference if it DOES clear when you execute CHECKPOINT. So, I have to admit that this is a bit bizarre. I would try to do some profiling and see if you can find any consistencies with the types of commands that are running at the time when it doesn’t seem to clear. I would also wonder if this is ALL the time or just occasionally. You may want to contact PSS for more assistance with this.

Q: TRUNCATE TABLE does not benefit from BULK_LOGGED? When I say that something benefits from the change to the BULK_LOGGED recovery model, it means that the change must offer a performance gain or otherwise. In the case of TRUNCATE TABLE, it is always executed in a manner which is logged – efficiently. So, there is no gain in switching; it is always an optimal command to execute.

Q: We have a habit of switching to "bulk-logged" mode when we are doing index -rebuilding, etc. which seem to benefit from this setting. Then, when we are done, we "switch" back to full mode. Does this affect our ability to recover later on using the logs that we are creating because of this switching back and forth? and Q: If I switch between the FULL and BULK_LOGGED recovery model, do I need to take full database backup after the switch to BULK_LOGGED? No! In fact, it’s recommended as long as you’re aware of the options that you lose when you are running in the BULK_LOGGED recovery model. However, the transaction log continuity is not broken – given the fact that you can successfully back up the log after the bulk operation completes. To minimize the time of which you are vulnerable, make sure to backup your log right before the switch (to BULK_LOGGED) as well as right after the switch (to FULL).  The file that can help you to get more familiar with this process is located in the scripts associated with this session and is named: DBAlterForBatchOperation.sql.

Q: Is it advisable to use transactions for each and every SQL statement? Even if they are a single, simple, one-command transaction? If not, how can we use the marked transaction? Well, the answer to the first part of the statements are: NO, it is not necessary and it is not advisable to wrap every command in a BEGIN TRAN/COMMIT TRAN block. However, as per the second part – you cannot mark a transaction without using the following complete syntax: BEGIN TRANSACTION TransactionName      WITH MARK 'description of mark'SQL_StatementsCOMMIT TRANThe file that can help you to get more familiar with this process is located in the scripts associated with this session and is named: Restore Log and Marked Transactions.sql.

Q: Do you dynamically create Transaction MARKs to make the mark names unique? I would assume a restore would stop at the first mark it came across with the specified name? You can certainly use dynamic string execution to build your transaction mark names dynamically. However, you can also use a mark with a datetime parameter. From the BOL: The STOPATMARK and STOPBEFOREMARK options both support an AFTER datetime clause. When used with datetime, mark names need not be unique. In a RESTORE DATABASE statement, datetime is required to recover to a marked transaction. In a RESTORE LOG, however, datetime is optional. If AFTER datetime is omitted, roll forward stops at the first mark with the specified name. If AFTER datetime is specified, roll forward stops at the first mark having the specified name exactly at or after datetime.

Q: When configuring a database maintenance plan through the Database Maintenance Plan Wizard, there is no possibility to switch between recovery models…Correct. However, you can modify the jobs created by the Database Maintenance Plans to include changes to the recovery model. BUT – having said that, I’m not sure why you’d want to change it during the actual backups… Instead, you should consider changing the recovery model as part of your batch process that would benefit from the change.

Q: How do you clear the log in SQL Server 2000 through the Enterprise Manager? The only way to clear the transaction log through the SQL Server Enterprise Manager is to do a regular transaction log backup. The way you can do this is to right-click on databases, tasks, backup - then choose a transaction log backup. The default behavior is to clear the inactive portion when the backup completes.There is no way to just “clear the transaction log” as that would break the continuity of the log. You could execute a command to clear the log but if you find that you’re manually clearing the log and NOT keeping transaction log backups, I would recommend setting the recovery model to SIMPLE so that the transaction log is cleared automatically.

Q: Can you explain when you need to recover .mdf and .ldf files? Recovering directly from the .mdf and .ldf files (through sp_attach_db) is really only guaranteed when the files were closed properly – through either a sp_detach_db or the files were copied when the server was shutdown.

Q: How do you recover if the master database becomes corrupt? Is the only option uninstalling and reinstalling SQL Server? No, SQL Server 2000 includes a utility to rebuild the master database called rebuildm. SQL Server 2005 uses setup – but with special parameters that only rebuild the system databases as opposed to do a full installation.

Q: Any BP on restores on replicated database i.e. distribution matching database? Transaction Log marking can help to ensure that multiple servers are all restored and rolled forward to the same point in time. Additionally, there are options related to whether or not you KEEP_REPLICATION (this is for the publisher). However, outside of those couple of options, you should review the Books Online for best practices in this configuration.

Q: Is it possible to backup database in the standby mode? No. But, I wasn’t completely sure during the webcast so I decided to try it. On the backup, I received:Database 'test' is in warm-standby state (set by executing RESTORE WITH STANDBY) and cannot be backed up until the entire load sequence is completed. BACKUP DATABASE is terminating abnormally.

Q: With log shipping and 2000, and the destination database is exactly the same as the source, how about backing up the destination rather than the source? What happens to the shipped logs during the backup? You can’t backup a database that is in the NORECOVERY or the STANDBY recovery completion states. As a result, you can only backup the source, not the destination.

Q: What are my risks if I use Simple Recovery and what are the advantages of it? Risks are mostly in the potential for work-loss exposure and data loss in the event of a disk failure for your database. The advantages are in simplicity and ease of management because you don’t need to backup the transaction log. However, through simple automation procedures you can automate the backups of transaction logs while getting the most options for recovery. If you want more details, you should review the chapter that’s available for download on SQLskills. This chapter covers the differences in the Recovery Models as well their pros and cons in management and recovery! Here’s a direct link: www.sqlskills.com/resources/SQLServerHAChapter9.pdf 

Q: How come you have both the SQL Server 2005 and Express on the same machine? Is it because you don't have Visual Studio 2005 installed? The documentation of VS.Net 2005 states to uncheck the SQL Server 2005 Express if also installing SQL Server 2005 CTP. Part of the reason that Visual Studio doesn’t want you to install Express if you already have the CTP of SQL Server is that they may not be the same versions. However, if you have the appropriately matched builds of SQL Server and Visual Studio, you can install multiple versions of SQL Server on the same machine- with Visual Studio Whidbey. Having said that – of what’s available today- the correct combination would be SQL Server CTP June (IDW15) with Visual Studio Beta II. On my machine, I’m actually running SQL Server 2000 (as a default instance), two Developer editions and one Express edition. AFTER SQL Server was successfully installed, I installed the appropriately matched Visual Studio installation. There won’t be any need for this once the products release in November but while they’re both being developed – and while there are “later” releases that leap-frog over each-other, it’s a bit confusing.

Q: SQL 2000 Maintenance Plans for transaction logs (say, all user databases) indicate job failure if just one user db is in the SIMPLE recovery model. Any alternatives short of creating our own backup script to avoid this SQL Maintenance backup job failure? Yes, this is because the sqlmaint.txt does not verify the recovery model.

Q: Is this a SQL 2005 behavior too? Database Maintenance Plan Wizard has been changed significantly in SQL Server 2005 and it’s now a lot easier to create more interesting/more complex plans on one or more databases – I would work harder to create a class of jobs for your SIMPLE recovery model databases (just full backups and maybe differentials) and a different class of jobs for your FULL/BULK_LOGGED recovery model databases.

Q: Does the execution of a transaction log backup impact simultaneous database usage? If well optimized the cost of a transaction log backup should be relatively low. Might be a good time to make sure that the transaction log is optimized! Go figure, I wrote another blog entry to help you optimize the transaction log. Check out 8 Steps to Better Transaction Log Throughput.

Q: Why does the log backup restore slower than restoring a full database backup? In general, a transaction log is a more intense operation. When restored, the transaction log must be redone (“redo”) and then undone (“undo”). This process is what applies the changes to get you up to what the database looked like when the transaction log backup was executed. And, while each log row’s redo operation is generally much faster than the original execution (they do not have to do any calculations, etc. as the FINAL version of the changes for the row is what’s in the log), there might be a lot of them!

Q: Do I have to backup the ReportServerTempDB? I’m not a reporting services expert but my understanding is that yes, you do need to backup a ReportServerTempDB. The best thing to read is from the Books Online: Backup and Restore Operations for a Reporting Services Installation.

Q: Will the 2005 Backup wizard allow scripting of the 'scheduling'? The backup command itself will be in the scheduled job but the statement that is used to create the job can only be scripted by scripting the job – not the backup command.

Q: Did you talk about partial database availability? Yes! SQL Server 2000 does not offer this new feature, but SQL Server 2005 does. Partial database availability is a new feature that allows a damaged database to stay online – even while secondary data files are damaged and unavailable.

Q: What is an optional secondary file?An optional secondary file is a data file that is NOT the mdf. Secondary data files are created in larger databases and store user-defined data. They might be used to store different types of data and/or even possibly part of a large table (partitioning). Imagine having 100s of millions of rows – because you have three years of historical data – if year 2003 is in a secondary data file and becomes damaged, it will not require the database to be taken offline.

Q: I have not seen Management Studio. Is this new with SQL Server 2005? Yes, SQL Server Management Studio (SSMS) replaces both Query Analyzer and SQL Server Enterprise Manager – in a more robust UI that also include Solution/Project control and Source Integration. Check out the online labs to get some hands-on lab time in the Virtual Lab environment here.

Q: A FULL BACKUP has the option "remove inactive entries from transaction log" --- why doesn't it seem to do this? It would seem from this option that nightly full database backups would keep the log size small? This is a SQL Server Enterprise Manager bug. It should NOT even be an option on that tab. It should be grayed out. If you click around to differential and then back to full on the General Tab, before you go to the Options tab, then you will no longer see that option available. As far as this being something desired – not really… The reason why clearing the log is NOT performed after a full backup (or as part of a full backup) is so that if the full database backup becomes corrupt in some way, you can fall back on your transaction log sequence to successfully restore! Basically, this ensures that nothing ever breaks the continuity of the transaction log!

Q: You recommend Trace Flag 3231, what does it do? Trace Flag 3231 is an undocumented Trace Flag that disables the ability to clear the transaction log in a database that is running in the FULL or BULK_LOGGED recovery models. This significantly reduces the ability for someone to break the continuity of the transaction log… ah, so that would probably be the next question – what else could possibly break the continuity of the transaction log? Changes to the SIMPLE recovery model. It is highly recommended that ONLY changes between FULL and BULK_LOGGED be performed in databases where transaction log backups are relied upon for recovery.

Q: How can we test our backup strategy? How do we test our differential backup content? When we test the differential backup content, do we need to restore the full backup content at the same time?The best test of your backup strategy is a complete restore sequence to a secondary server – which you then follow up by testing your application. There *are* other dependencies outside of the database that could cause your application to fail. However, the restore to the secondary server will ensure that at least your data is accessible. Below are some KB articles that you can review for more details on some of these additional dependencies. To access them, go to http://support.microsoft.com.  Q240872 – INF: How to Resolve Permission Issues When a Database is Moved Between SQL ServersQ246133 – INF: How To Transfer Logins and Passwords Between SQL ServersQ307775 – INF: Disaster Recovery Articles for Microsoft SQL ServerQ224071 – INF: Moving SQL Server Databases to a New Location with Detach/Attach.

Q: I have a database server hosted by a 3rd party on the web ... is there a secure, automated way for me to make remote copies of the Database Backup? You'll need to talk to your ISP to see if they support that. More than anything it depends on the interface they give you, the services you've paid for and whether or not you have secure channels over which you can communicate with your ISP.

Q: Is there a way for you to answer these questions – that is – release the question from the question manager so that another question can be asked? I had a question in the "queue" for most of the webcast and because of that, I could not ask additional questions which I had wanted to ask... I ended up deleting my original question just so the question window would allow me to ask the next question... This is a LiveMeeting issue. I’m trying to put just a quick “a” in the response during the webcast but really, it’s an issue with the LiveMeeting Query Manager UI. I REALLY wish that questions just went into the queue and didn’t hold your window as well… I’ll see if we can pass your suggestion on to the LiveMeeting folks!

Thanks again for another GREAT week of questions. See you in the next session…which because this is posted so late, is session 4 - on Friday, August 26th. You can register for this session here: MSDN Webcast: A Primer to Proper SQL Server Development (Part 4 of 10): Best Practices in Indexing (Level 200)

Thanks for reading!

kt

MSDN Webcast Q&A: A Primer to Proper SQL Server Development

Creating a Recoverable Database, Part 1 of 10

Presented by Kimberly L. Tripp, SQLskills.com

Q: Can I view a recording of this webcast? Here’s the specific Replay Link: http://msevents.microsoft.com/CUI/EventDetail.aspx?EventID=1032278585&Culture=en-US

Q: Where can we get the demo scripts? The demo scripts are in this zip (20050805 MSDN Webcast ScalableSys01.zip (5.47 KB)); here in this blog entry. However, I will also create an entry www.SQLskills.com under Past Event Resources for the entire webcast series.

 

Q: Does Kimberly L. Tripp have a blog? Yes, www.SQLskills.com/blogs/Kimberly and if you’re reading this on my blog…well, then you already know! J

Q: How about running on a VMware Virtual Machine? Sure.  

Q: Is the placement of data and log important when using a disk array or SAN? It’s important to make sure that you don’t have bottlenecks at the physical disk. You should see if your SAN software supports monitoring at the disk level and if so, consider reconfiguring if bottlenecks occur. It’s harder to say that a SAN will have the same problems that you would have with direct attached storage since SANs often place larger logical disks on more than one physical disks (by using 9, 18 or 36 GB chunks from a variety of physical drives). The main point is that SANs offer a lot of caching and other optimizations, so you don’t often have to worry as much. However, I would strongly suggest getting to know your SAN config as well as work with your SAN vendor to really get an understanding of how the resources are allocated as well as used. Also, make sure you’re current with all bios, firmware, drivers, etc.

Q: Where will be the demo code available? What version of SQL Server will you be using? All demos were done on SQL Server 2005; however, for this (and even for many others), a lot of the demo code works on both SQL Server 2000 as well as SQL Server 2005. At a minimum, many of the concepts apply and where a feature is new or only supported on one version, it will be pointed out. For example, fast file initialization is on SQL Server 2005 only.

Q: Does SQL Server support log mirroring (i.e. dual logging)? No. A very old version did (I think 4.2 on OS/2 did) but the key problem was performance. You are a lot better off letting the hardware handle the mirroring rather than software talking to os talking to hardware.

Q: Do Secondary files allow partial backups? Not really sure I follow this question but…if you’re asking whether or not you can perform a backup of just a part of the database – yes! In fact, any file can be backed up independently of the filegroup of which it’s a member (there was a restriction in 7.0 that didn’t allow this but as of SQL Server 2000, files can be backed up at any time). If you’re really inrerested in file and filegroup backups, I wrote a couple of articles for SQL Server Magazine and you can find the complete list here: http://www.sqlskills.com/articles.asp 

Q: What is a page? Generally speaking a page is a unit of storage. Specifically speaking with regard to to SQL Server, a page is 8K in size and is the smallest unit of I/O in terms of data. When a table is stored on disk, it is stored in 8K chunks (and most of the time SQL Server allocates 8 – 8K chunks to objects). A 64KB block of the database is called an extent. SQL Server allocates extents once an object reaches a minimum size (which is also 64KB) to try to keep an object more contiguous.

Q: Is caching user specific or server specific. If two users are accessing same table, does it create two separate pages in cache or just one? Well, data cache is not user specific (at least not with regard to your question). However, there are a few things that would be local in scope – like a user-defined temporary table… However, a database table being accessed by many users, would only have one set of pages in cache.

Q: Why would a user of the database do a checkpoint? Users cannot perform checkpoints, only database owners and administrators can force a checkpoint. Generally speaking, it is not often that a checkpoint needs to be force. SQL Server controls and handles the checkpoint automatically.

Q: Wouldn't the engine know the best time to do a checkpoint? Yep! And it does. The default setting for checkpoint is “0” minutes. Meaning – SQL Server decides!

Q: How does page caching function during the 'redo' process? Same as it would normally. Data being “redone” is loaded into cache and changed – based on what’s held in the transaction log. (Keep reading, more in the next question/answer.)

Q: So does the transaction log contain all the information about say an update… i.e. what columns, what data? Yes, for the most part what is contained in the transaction log is the “after” version of the modifications so that log rows do not need to execute any functions or other code. The idea is that log rows can be processed extremely quickly – but have sufficient information to make sure that the data is modified properly. (And keep reading, more in the next question/answer.)

Q: When you say transaction is re-done, what exactly happens? Does SQL Server automatically take values again from cache and copy to the disk? What if we loose the cache also during the process? Then redo will begin again when the system restarts…Generally, the process is – go to the log, redo, undo, checkpoint and the information is not deemed inactive until after it checkpoints. So, even if you had the equivalent to “truncate log on checkpoint” this information would not be lost in the midst of restart recovery (redo) because it’s still active until it’s checkpointed!

Q: Is Roll back notification done asynchronously or is another client request required, e.g. long running client with long SQL call intervals…I’m not sure if I follow this exactly BUT I think you’re wondering how you would know that your transaction has been rolled back? The key way that many applications know – is that they never got actual confirmation and maybe they’ve lost their connection and just timed out. But, if this doesn’t answer your question…send me mail!

Q: Can you pre-grow a database at non-peak times grow the database when space starts getting low? Sure, there are a few options really. One way, create a SQL Agent Job that checks space allocation at 2am and if it’s getting close to full – execute a manual increase in space. With a bit of dynamic string execution I think you could get this to be a very flexible and easily automated process!

Q: Best drive configuration for data, logs, indexes? Well, typically, I like to separate data by access pattern and type rather than data from indexes. Typically, I like RO v. RW v. a single large table v. LOB Data.

Q: I have a 24/7 SQL Server production environment with a database that's well over 300 GB - when do I defrag it? You really have a couple of options to defrag a table – truly defrag (and only defrag it) or rebuild it (which does a lot more than just defrag). To make the answer even more clear – defrag often as it doesn’t take the table offline (in 2000). Rebuild the table when/if you can afford downtime against that table (in SQL Server 2000, a rebuild requires that the table be either read-only [when rebuilding a non-clustered index] or completely inaccessible [when rebuilding the clustered index]). So, now this gets a bit harder to answer! Check out the webcast on SQL Server 2000 – Index Fragmentation Best Practices here.

Q: If you have multiple data file, can you merge back to a single data file? Only when the files are a member of the same filegroup. If/when you want to do this you have two steps: DBCC SHRINKFILE(file_to_remove_logical_name, EMPTYFILE) This will empty the contents of the file into the other files within the same filegroup. Once emptied, use:ALTER DATABASE dbnameREMOVE FILE file_to_remove_logical_name

Q: What is zero initialization?Where the entire contents of the file are zero’ed out. This is done for security reasons.

Q: Doesn’t truncateonly make us lose continuity between transaction backups? First – this was with regard to my recommendation to use DBCC SHRINKFILE with the TRUNCATEONLY option NOT using with TRUNCATE_ONLY on a transaction log backup… So, specifically, NO. DBCC SHRINKFILE with the TRUNCATEONLY option does not break the continuity of the transaction log. Now – just to add a bit of irony here… BACKUP LOG with TRUNCATE_ONLY no longer breaks the continuity of the transaction log in SQL Server 2005. In fact, BOTH the TRUNCATE_ONLY and the NO_LOG options have been changed to ONLY perform a CHECKPOINT. In a database running in the FULL or BULK_LOGGED Recovery Model, this will have NO real impact on the transaction log. In a database running in the SIMPLE Recovery Model, this will execute a checkpoint and the database setting of simple truncates the inactive portion of the transaction log when a checkpoint occurs.

Q: I've read the BOLs about faster performance when putting files on raw partitions? Hmm… I’d love to see the reference. They might have said that raw partitions may offer performance benefits but I would generally doubt it. More importantly, you’d lose other key features if you didn’t use NTFS – like Database Snapshots. So, my main point – don’t use raw partitions! Even if they did offer a performance gain, what you’d lose isn’t generally worth AND I can get better gains elsewhere (indexing, optimizing procedural code, etc.)

Q: Could you please tell us quickly why raid 1+0 is better? Basically, RAID 1+0 offers better availability than RAID 0+1 because it can tolerate the loss of more than one drive. If a drive in a RAID 1 array is lost, all other drives still function. If a drive in a RAID 0 array is lost, all other drives in the RAID 0 array stop functioning. Here’s a good link to review more about different RAID configurations: http://www.raidarray.com/04_00.html

Q: I recently came across a 200meg db with a 50GB log. We tried everything to truncate it but finally could only fix it by changing to a SIMPLE recovery model. What is the most likely culprit for such an outrageous log growth? Well, I wish I could say I hadn’t seen this… But, it’s due to autogrowth and it’s due to running in the full recovery model without performing transaction log backups (but you are performing database backups). Now, the reason why you had so much trouble with it – was because it was horribly fragmented (because of the autogrowth defaults).  So, the main point here – if you follow the steps of the VLF Optimization content from the session, you would have also solved the problem. I’d at least check your current fragmentation and make sure that everything is fixed!

Q: What is a "long" transaction (seconds, minutes, or hours?) There is no specific time but the longer a transaction is – especially relative to transaction log backups – the more possible larger growth of the log and less control.

Q: Will *you* be the one doing parts 2-10? Yep… I’m a glutten for punishment. Just kidding! To be honest, I really enjoy these webcasts!!! See you in the next one.

Q: Should SQL Server developers invest the time to learn Microsoft's Enterprise Library (the Application Blocks thing)? OK, I checked around a bit on this and the general consensus is that developers should be aware of all of the resources that exist. It’s by no means an absolute but there are quite a few *very* useful resources there. Review things lightly to see if anything seems useful to your application and then dig deeper if/when they do.

Q: Should the SQL Server Disks be defragged at the os layer with disk tools to defrag? Sure. It doesn’t hurt and can help. The only negative is that you must shutdown SQL Server in order to do so. The good news, once defragged you don’t really need to do it again (unless you have a lot of autogrowth and other files on the disks).

Q: I had a lot of problems convincing my clients to create a new instance of SQL in their servers. The objective was to isolate our systems from theirs. What are the pros and cons of creating an additional instance against adding more DBs to an existing one? Well, you’re correct that it’s mostly an isolation issue. A big concern in some SQL Server environments is the access to metadata that everyone has. In SQL Server 2000, logins can see that other databases and other logins exist – even if they can’t use them. This presents a security concern. In SQL Server 2005, metadata is restricted so that this doesn’t happen.

Finally, here is the link to the Part 2 Registration page.

See you in Part 2,

kt

Wow, I've been horribly bad at blogging these days and I owe you a few entries as well... But, good news, I have a new MSDN Webcast series that will help to answer a lot of best practices in database design strategies that can help you achieve better scalability, availability and reliability. All of my demos are on SQL Server 2005 but many of the concepts apply to both SQL Server 2000 as well as 2005.

The sessions are all targeted at best practices and as such we went with a base of 200 level. Knowing me I can't go through an entire session without some internals and/or technical depth so you should expect 200-300 overall. I'm very excited about this series because everything counts! What I mean is that each and every best practice you implement helps to bring about a more scalable, available and reliable solution - there are no magic bullets only great overall design can achieve these targets!

So, what are we going to talk about in 10 - 90 minute presentations..... a lot!

Here's a link to the first one and as we get rolling, I'll blog a lot more about additional resources, best practices and of course, I'll blog the webcast Q&A as I've done in the past.

Enjoy!

NOTE/UPDATE: Be sure to read this post as well: Transaction Log VLFs too many or too few? after reading this one. Not only can you have too many small VLFs but if incorrectly sized, you can have too few! Enjoy!!

On a few of my last trips, I've had the pleasure of doing a lot of customer visits - visiting some of the larger implementations of SQL Server. Many of these implementations are at banks where both performance and recovery are critical. After my trip to Turkey (where I did customer visits prior to presenting at the Microsoft Professional Developers Summit), I received this email:

As you might remember we talked at xxxbank, Turkey about performance problems in transaction log backups and splitting tempdb data file into equal size chunks equal to number of CPUs. I have implemented both optimizations in my troubled server, and the results are great! Transaction log backups do not impact the server at all and with lowered congestion in tempdb, overall system performance has gone up as well.

So - this is my first of two blog entries targeting these two VERY typical problems:
* Transaction log fragmentation (both internal and external - this customer was having problems with internal fragmentation)
* Optimizing TempDB

We'll start with 8 Steps to Optimizing your Transaction Log - not necessarily in any order. It's best to review all of these to make sure you have a comprehensive view of how to improve both performance and recovery of your transaction logs!

1) Try to dedicate the transaction log portion of your database to its own phyiscal disk. In high volume OLTP system, isolating the transaction log can allow the disk head to be ready for the next write by not having other files contend for the physical disk resource. If your database already exists, the best way to "move" your transaction log is to detach your database (sp_detach_db) and then reattach it (sp_attach_db). Here's a great KB article titled: Moving SQL Server databases to a new location with Detach/Attach.

2) Defrag the disk(s) on which your transaction logs reside. This will get rid of external fragmentation of the transaction log - better known as disk file fragmentation. This will require that your server be taken offline but if your files have had a lot of autogrowth and/or they reside on a disk with a lot of other files that have been modified, then all of your files (incl. the transaction log file) are likely to be interleaved and fragmented. The good news is that this is a one time operation (unless the files continue to grow and shrink excessively - but we'll fix that as well). The bad news is that you'll need to shutdown SQL Server and then use some form of system tool - Microsoft or third party - to defrag. On Windows XP and Windows Server 2003 machines, the disk defragmenter tool can be launched by running: dfrg.msc.

3) Create only ONE transaction log file. Even though you can create multiple transaction log files, you only need one... SQL Server DOES not "stripe" across multiple transaction log files. Instead, SQL Server uses the transaction log files sequentially. While this might sound bad - it's not. If you want to get better performance out of the transaction log, place it on faster disks and/or a more performant disk (RAID) configuration. You could place the transaction log on a RAID 0 array but you'll likely want fault tolerance as well as performance. Isolation, as per number 1 is a good start but if you also need increased capacity then using hardware based RAID to combine disks for both better performance and redundancy is your next choice. See 4.

4) Not only should you try to isolate the transaction log to its own physical disk but you should make sure that the logical/physical disk configuration is as efficient as possible. Try to use an isolated RAID 1 mirroring set if you don't need significant capacity. If you need a greater capacity OR you want better performance, consider a combination of RAID 0 and RAID 1 (either RAID 0 + 1 or RAID 1 + 0). While RAID 0 + 1 can often offer better performance, RAID 1 + 0 offers better reliability. If you're new to RAID and are interested in learning more - check out the RAID Tutorial here: http://www.raidarray.com/04_01_00.html.

5) Don't be caught up in nothing but transaction log speed, you'll also want to make sure that your transaction log is always available as this can help you in times of disaster. Even if the data is damaged, if the transaction log is available and you have a series of backups up to and including the last transaction log then you can make a final backup of the transaction log that will represent all of the changes since your last transaction log backup. If this backup is possible (it's called backing up the "tail" of the log), then you can achieve up-to-the-minute recovery. This is only possible when the log is available. By placing the log on mirrored disks you can increase your chances of recovering data and minimize data loss!

6) Create transaction log files with a reasonable initial size. When you create a database it's ideal to PRE-ALLOCATE both your data files and your transaction log file. A little bit of capacity planning goes a long way... Now, if you think that you've got absolutely no idea how to size your transaction log you're going to need, well - here are the things that have the greatest impact:

  • Type of activity - transaction processing or decision support
  • Frequency of that activity - the more frequent the changes, the faster the transaction log will grow
  • Recovery Model - the recovery model of the database
  • Frequency of transaction log backups
  • Whether or not replication is used (since the log reader relies on the transaction log)

I wish I could give you a rough idea on sizing but if you look only at database size and none of these other factors, you could end up with a transaction log that's seriously oversized or seriously undersized. I've seen recommendations of 10-25% of the size of the data and you can use that but I would also add a bit of common sense. A larger database with very frequent transaction log backups may not need a transaction log that's even 1% of the size of the data... The best way is to setup your development environment similar to that of your production environment (including backup jobs) and then see how the transaction log grows. If you have a lot of autogrowth (because your guess was wrong), you can later clean up the fragmentation that has occurred and get back to a reasonable, intact, and optimal transaction log. If you want to know more about Recovery Models and their impact on the transaction log, review a sample book chapter from an MSPress book (Microsoft® SQL Server™ 2000 High Availability) that I helped to co-author. I only helped out on a couple of chapters but this specific chapter explains a lot about the general database environment settings that impact recovery. You can download Chapter 9: Database Environment Basics for Recovery here: http://www.sqlskills.com/resources/SQLServerHAChapter9.pdf

7) Don't let autogrowth get out of control. As important as capacity planning, you're not likely to be spot-on in your estimates. I don't recommend completely turning off autogrowth but in general I also don't like SQL Server 2000's default growth rate (or max size). In general, I would recommend setting ALL of the transaction log files attributes: initial size, growth rate AND maximum size. For the growth rate, I recommend something that can be allocated somewhat quickly and something of a fixed size. In general, I recommend a value which is less than or equal to 1GB (based on total size) but something that doesn't mean that you're going to autogrow again soon. So, for databases whose transaction logs are under 1GB then you might set autogrow to somewhere between 20 and 100 MB. For those of you who have transaction logs measured in GB, then I'd set the autogrowth to 500MB or 1GB. In general, I don't like the percentage because it needs to be calculated (which I realize isn't really all that big of a deal) but the larger the file the larger the autogrowth and the longer it takes. In SQL Server 2000, autogrowth can create blocking...so, it's best to minimize this in general.

8) Check/fix your internal fragmentation. OK, so this is the one that really helped the customer in Turkey. Often, when transaction logs are not pre-allocated and/or when there's been a lot of autogrowths, the transaction log can become internally fragmented. Internally your transaction logs are broken down into smaller more granular chunks called VLFs (Virtual Log Files). The size and number of VLFs you'll have depends largely on the size that the chunk is when it's added to you transaction log. If you add a new chunk to the transaction log which is 20MB (through autogrowth or through manual growth) then the number of VLFs that are added is 4. If you add a chunk which is greater than 64MB but less than or equal to 1GB, you'll add 8 VLFs. If you add more than 1GB then you'll add 16VLFs. In general, most transaction logs will only have 20 or 30 VLFs - even 50 could be reasonable depending on the total size of the transaction log. However, in many cases what happens is that excessive autogrowths can cause an excessive number of VLFs to be added - sometimes resulting in hundreds of VLFs. Having an excessive number of VLFs can negatively impact all transaction log related activities and you may even see degradation in performance when transaction log backups occur. To see how many VLFs you have solely look at the number of rows returned by DBCC LOGINFO. The number of rows returned equals the number of VLFs your transaction log file has. If you have more than 50, I would recommend fixing it and adjusting your autogrowth so that it doesn't occur as fequently. To get rid of all of the execessive VLFs, follow these easy steps to shrink off the fragmented chunk and add a new, clean chunk to your transaction log:

1. Wait for an inactive time of day (ideally, it would be best to put the database into single user mode first) and then clear all transaction log activity through a regular transaction log backup. If you're using the simple recovery model then you don't need to do a log backup... Instead, just clear the transaction log by running a checkpoint.

 BACKUP LOG databasename TO devicename

2. Shrink the log to as small a size as possible (truncateonly)

DBCC SHRINKFILE(transactionloglogicalfilename, TRUNCATEONLY)

NOTE: if you don't know the logical filename of your transaction log use sp_helpfile to list all of your database files.

3. Alter the database to modify the transaction log file to the appropriate size - in one step

ALTER DATABASE databasename
MODIFY FILE 

      NAME = transactionloglogicalfilename 
    , SIZE = newtotalsize
)

And, that should do it!

Many of you in the .NET Community are already familar with theServerSide.NET but many of you in the SQL Server community are not...well, here's your chance to bridge the gap over to .NET and hear a bunch of great .NET interviews! My interview is more SQL-centric but there are lots of great ones on Development in general and even one on SQL Server 2005 Reporting Services by Jason Carlson.

My interview was recorded with them back in February when I was speaking at VSLive in San Francisco and they've been saving them up, editing them and then posting an interview roughly each week (mine is the 29th posted!). The interview was great fun and Paul Ballard asked some really interesting questions about tuning, indexes, stored procedures and SQL Server 2005. Mostly it's an interview about optimization tips in both SQL Server 2000 and 2005 and just in general - things to look forward to in SQL Server 2005.

Here's a link to all of theServerSide.NET interviews: http://www.theserverside.net/talks/index.tss 
Here's a link specifically to mine: http://www.theserverside.net/talks/videos/KimberlyTripp/interview.tss?bandwidth=dsl
Here's the link to Jason Carlson's Reporting Services interview: http://www.theserverside.net/talks/videos/JasonCarlson/interview.tss?bandwidth=dsl and finally, here's a TechNet Webcast that Jason did titled: Authoring Reports in SQL Server 2000 Reporting Services (Level 200)

Enjoy!

As I'm preparing for my Tech*Ed session on Indexing Best Practices in SQL Server 2005, I'm reminded that there are a ton of best practices that really apply to both SQL Server 2000 as well as SQL Server 2005. When it comes to indexing, there are many dependencies on the storage structures. These dependencies are the basis for why I recommend a very specific type of clustering key - for all versions of SQL Server, 7.0 and higher!

I'm going to start with my recommendation for the Clustering Key - for a couple of reasons. First, it's an easy decision to make and second, making this decision early helps to proactively prevent some types of fragmentation. If you can prevent certain types of base-table fragmentation then you can minimize some maintenance activities (some of which, in SQL Server 2000 AND less of which, in SQL Server 2005) require that your table be offline. OK, I'll get to the rebuild stuff later.....

Let's start with the key things that I look for in a clustering key:

  • Unique
  • Narrow
  • Static

Why Unique?
A clustering key should be unique because a clustering key (when one exists) is used as the lookup key from all non-clustered indexes. Take for example an index in the back of a book - if you need to find the data that an index entry points to - that entry (the index entry) must be unique otherwise, which index entry would be the one you're looking for? So, when you create the clustered index - it must be unique. But, SQL Server doesn't require that your clustering key is created on a unique column. You can create it on any column(s) you'd like. Internally, if the clustering key is not unique then SQL Server will “uniquify” it by adding a 4-byte integer to the data. So if the clustered index is created on something which is not unique then not only is there additional overhead at index creation, there's wasted disk space, additional costs on INSERTs and UPDATEs, and in SQL Server 2000, there's an added cost on a clustereD index rebuild (which because of the poor choice for the clustering key is now more likely).

Why Narrow?
A clustering key should be narrow for some of the same reasons it should be unique. If the clustering key is used as the lookup key from all non-clustered indexes, then the clustering key is duplicated in all non-clustered indexes. If the clustering key is really wide, then all of the non-clustered indexes will be [unnecessarily] wide. This will waste disk space, create additional costs on INSERTs and UPDATEs, and require more time (because of size) when rebuilding these index structures. So, what does narrow mean - as few bytes as possible to help uniquely define your rows. A narrow numeric when possible.

Why Static?
A clustering key should be static for some of the same reasons it should be unique and narrow. If the clustering key is used as the lookup key from all non-clustered indexes, then the clustering key is duplicated in all non-clustered indexes. In fact, for a given table the clustering key will be the most duplicated data. If this data changes then they'll need to update the value in the base table as well as in EVERY non-clustered index. And, if the key changes, it will cause the record to move. When a record moves, it creates fragmentation. This will waste disk space, create additional costs on INSERTs and UPDATEs, and require more time (because of record relocation and [the likely] subsequent splits) and require more maintenance. 

OK, so it sounds like I want a narrow, unique and static value... What about a guid?
Typically, I recommend a numeric IDENTITY column as the clustering key but I always get this question. In fact, I often wait to see how long it's going to take before I get this question ;). Anyway, a guid does meet the criteria fairly well - it's certainly unique, it's usually static and it's relatively narrow. So, what's wrong with it? In SQL Server 2000, the guid function (newid()) is built using a value that does not create an ever increasing pattern (an IDENTITY column would). But wait, I didn't say that you needed to have an ever-increasing pattern.....

OK, so the final criteria I look for in a clustering key is: an ever-increasing pattern!
If the clustering key is ever-increasing then new rows have a specific location where they can be placed. If that location is at the end of the table then the new row needs space allocated to it but it doesn't have to make space in the middle of the table. If a row is inserted to a location that doesn't have any room then room needs to be made (e.g. you insert based on last name then as rows come in space will need to be made where that name should be placed). If room needs to be made, it's made by SQL Server doing something called a split. Splits in SQL Server are 50/50 splits - simply put - 50% of the data stays and 50% of the data is moved. This keeps the index logically intact (the lowest level of an index - called the leaf level - is a douly-linked list) but not physically intact. When an index has a lot of splits then the index is said to be fragmented. Good examples of an index that is ever-increasing are IDENTITY columns (and they're also naturally unique, natural static and naturally narrow) or something that follows as many of these things as possible - like a datetime column (or since that's NOT very likely to be unique by itself datetime, identity). But wait, what about that a guid.

Well, in SQL Server 2000 the only SQL Server function for guids is newid - that does not create an ever increasing pattern. In SQL Server 2005, you can use a new guid function called newsequentialid() to populate your uniqueidentifier column. Here's an example of how you can use it:

CREATE TABLE Test
(
TestID uniqueidentifier CONSTRAINT Test_TestID_Default DEFAULT newsequentialid(),
Inserted datetime CONSTRAINT Test_Inserted_Default DEFAULT getdate()
)
go

INSERT Test DEFAULT VALUES
go

SELECT * FROM Test
go

Is there a way to create a sequential guid in SQL Server 2000?
YES, use Gert Drapers wrote an xp to generate sequential guids! Check it out here. He just published this recently (May 2005) and it's a good change for you in SQL Server 2000 databases/applications.
Full title with link: XPGUID.DLL - Sequential GUID generation and GUID helper functions XP

OK, so I've tackled a few things here today and I've answered a few questions related to indexes before. Check out my Indexes Category of blog entries here and if you're at Tech*Ed today, I hope to see you at 5PM.

Thanks for reading,
kt

Categories:
Events | Indexes | SQL Server 2005 | Tips

I think there are numerous reasons for why I love technology but at the top of the list: learning. It's amazing to me that not a day goes by where I don't dig deeper into something or clarify it further. Even learning something trivial, like a new keystroke, can make our work easier to do and make us more productive at it. Things are constantly changing; the one thing I do know about technology is that there's a lot to learn!

So, to start, thanks for everyone's feedback (esp. Steffen Krause here) on some of my recent posts regarding LEFT and RIGHT based partitioning functions here. There was a lot feedback regarding the simplicity in syntax and declaration of a RIGHT-based partition function and by making the first partition of a RIGHT-based partition function empty, YOU'RE RIGHT! We can eliminate the need for data movement. There's no performance difference and it doesn't matter internally which type you choose but - I've heard you all loud and clear! You don't like dealing with the imprecision of a datetime data type when specifying upper boundaries. So, having said that - I need to make a few changes. In my next revision of my presentation materials, whitepaper, scripts, etc. I'll work to give both perspectives. For some reason, I still like LEFT-based partition function but RIGHT is definitely easier to define.

So, keep it coming everyone. You've hit the nail on the head. The fun part about technology is... no one knows everything and we're all always learning!

Enjoy Tech*Ed!

Blog edits brought to you by Richard Campbell (long story)

NOTE: Generally, I recommend RIGHT-based partition function so that you don't have to deal with datetime timetick issues at all. However, this post can really help you if you still want to use LEFT-based partition functions. Enjoy!!! 

When creating partitioned tables in SQL Server 2005, a partition function requires a LEFT or RIGHT designation. In general, I recommend that you choose to create a LEFT-based partition. However, choosing LEFT always creates some confusion because a left-based partition function requires upper boundaries. This results in a more complicated partition function definition (than a RIGHT partition function).

Background First
In SQL Server 2005, you can create truly Partitioned Objects (objects are inclusive to Tables and Indexes) and to create a partitioned table you must base that table on a Partition Scheme (PS) and the PS must be based on a Partition Function (PF). Now, if you're new to the concept of partitioning, you should seriously consider reading the whitepaper I wrote for MSDN which is posted here.

Regardless of your having read the whitepaper, I want to add a few basic principles and make sure that some of the key points are clear. Simply put, a PF is the logical definition of your boundary cases (or possibly better described as the intervals over which each partition covers). A PF must cover the entire domain from negative infinity to positive infinity. If a value doesn't have a place to go then you'd have a problem with the logic of your partition function. Having said that, when you define the PF you are ONLY stating values, not where they will physically will reside. Futhermore, the expression on which the data will be based has to be somewhat straightforward and consistent. In other words, simple. If the PF were extremely complex it could compromise the usage and performance of the Partitioned Object. Finally, the definition of the PF is really just a way to easily define these boundaries. As a result, you can choose to enter a specific value (like June 1, 2005) or a function like the following to create a PF over 4 months with each month being computed from current:

CREATE PARTITION FUNCTION TestFunctionValues(datetime)
AS
RANGE RIGHT FOR VALUES (
   CONVERT(char(4), year(dateadd(m, 1, getdate()))) + right('0' + convert(char(1), month(dateadd(m, 1, getdate()))), 2) + '01'
    -- next month (executed in May, 2005 yields June 2005)
   , CONVERT(char(4), year(dateadd(m, 2, getdate()))) + right('0' + convert(char(1), month(dateadd(m, 2, getdate()))), 2) + '01'
    -- executed in May, 2005 yields July 2005
   , CONVERT(char(4), year(dateadd(m, 3, getdate()))) + right('0' + convert(char(1), month(dateadd(m, 3, getdate()))), 2) + '01'
    -- executed in May, 2005 yields August 2005
   , CONVERT(char(4), year(dateadd(m, 4, getdate()))) + right('0' + convert(char(1), month(dateadd(m, 4, getdate()))), 2) + '01'
    -- executed in May, 2005 yields September 2005
)
GO

Since all of these complex expressions are evaluated to a constant at creation, you can immediately verify your boundary cases by accessing some catalog views. Here is the catalog view query which helps you see what your partition boundaries are:

SELECT PF.[name], RV.boundary_id, RV.[value]
FROM sys.partition_range_values AS RV
 JOIN sys.partition_functions AS PF
  ON RV.function_id = PF.function_id
WHERE PF.[name] = 'TestFunctionValues'

In the above partition function, you'll notice that in addition to the values, I also have a RIGHT designation. The choice of using RIGHT, helps to target whether or not the value is a member of the first partition or the second partition. In the case of RIGHT, this says that the value is a lower boundary of the second partition. In an example that involves something like "sales" where data rows are ever increasing and their usage is for more recent rather than older data you migth want to always have the most recent four months of data available (in the real world you'd probably want more than 4 months but I'm going to simplify to four so that we have fewer boundaries with which to work). So, if we want 4 partitions - one for June, July, August and September - we need to define the partitions in the form of intervals (using boundaries). The first boundary point will either reside in the first partition (as an upper boundary) or the second partition (as a lower boundary). For four partitions, there are numerous ways in which you could define the partition function such that the logical/physical placement of the data is THE SAME however, the behavior of the partition function during later SPLIT and MERGE operations is also impacted by the choice of LEFT/RIGHT so understanding how they work is critical in making your decision to use one or the other. More than anything, will need to respect the full domain from negative infinity to positive infinity.

The following partition function is one of the easiest and most logical to understand. We need four partitions so let's just enter our boundary points and be done.

CREATE PARTITION FUNCTION RightPartitionFunction1(datetime)
AS
RANGE RIGHT FOR VALUES (
   '20040601', -- Jun 2004
   '20040701', -- Jul 2004
   '20040801', -- Aug 2004
   '20040901' -- Sep 2004
)

However, remember that we have the full domain to cover. As a result, the above partition function will create 5 partitions where the partitions will be:
 Partition 1: all data less than 20040601
 Partition 2: all data greater then/equal to 20040601 and less than 20040701
 Partition 3: all data greater then/equal to 20040701 and less than 20040801
 Partition 4: all data greater then/equal to 20040801 and less than 20040901
 Partition 5: all data greater then/equal to 20040901

This really isn't too bad but it seems to waste the first partition because it's completely empty... so, you could remove the first boundary point and just supply the second, third and forth. In that case, the first partition would cover June (instead of being empty) and you could get down to 4 partitions with only 3 boundary points.

CREATE PARTITION FUNCTION RightPartitionFunction2(datetime)
AS
RANGE RIGHT FOR VALUES (
   '20040701', -- Jul 2004
   '20040801', -- Aug 2004
   '20040901' -- Sep 2004
)

This will create 4 partitions where the partitions will be:
 Partition 1: all data less than 20040701
 Partition 2: all data greater then/equal to 20040701 and less than 20040801
 Partition 3: all data greater then/equal to 20040801 and less than 20040901
 Partition 4: all data greater then/equal to 20040901

OK, so this seems a bit more optimal. And by definition alone, I would agree with you. Where it becomes problematic is in the merge and split operations. When a partition function goes through a merge of a boundary point, that boundary point is essentially removed. This seems simple but also has an impact on the data which resides in that partition (the partition which includes that boundary point). So, if this partitioned table is NEVER going to be modified and you never need to plan for a merge or split, then you can choose whatever definition is easier for you to use. However, if you plan to do a merge or split, you need to understand the ramifications of the LEFT/RIGHT decision. If November 1 comes around and you want the data for October to be added to your partitioned table and you ONLY want to show the most four months, then you will need to add October and remove June. The process will essentially consist of the following:

  1. Create a place for new data (Oct) to reside
  2. Add this location to the partition scheme (more details on the scheme v. the function shortly)
  3. Split the last boundary to add the new month (Oct)
  4. Switch-in the staging table of October data so it becomes part of the partitioned table
  5. Create a staging table into which the old data (Jun) will go
  6. Switch-out the June data into the staging table
  7. Merge the boundary point to remove June from the table

More background first - Partition Schemes
You will want to have a place to put the new data (there are lots of things we could look at here... do you want to ALWAYS only use 4 filegroups and when you remove June's data do you want to reuse that space - or do you have a new location on which October can be placed). This is a whole different topic over which we should chat... but for simplicity let's just say that we want to use a new location. To tell the PF where this location is, you need to work with the physical definition of the partition function. OK, that might not be clear. So far, I've only told you about the PF and so far, I've only said that the PF described the logical boudaries. You may have been wondering how SQL Server maps these logical boundaries to the phyiscal files and the answer is through the Partition Scheme. The partition scheme (at initial creation) sets up the mapping of the boundaries to their phyiscal locations (through filegroups). For the first and second PFs created above the respective partition schemes would be:

CREATE PARTITION SCHEME [PartitionScheme1]
AS
PARTITION [RightPartitionFunction1] TO
  ( [Primary]    -- should be empty?
  , [PartitionedTablesFG1]  -- for June
  , [PartitionedTablesFG2]  -- for July
  , [PartitionedTablesFG3]  -- for Aug
  , [PartitionedTablesFG4]  -- for Sept
  )
and/or

CREATE PARTITION SCHEME [PartitionScheme2]
AS
PARTITION [RightPartitionFunction2] TO
  ( [PartitionedTablesFG1]  -- for June
  , [PartitionedTablesFG2]  -- for July
  , [PartitionedTablesFG3]  -- for Aug
  , [PartitionedTablesFG4]) -- for Sept

In both cases, it seems as though these do the same thing with the exception that the first PF ends up with an extra partition which is essentially empty. In this case (in a RIGHT-based PF), the second is a better choice (coming up there will be a reason for why we create an empty partition).

Back to SPLIT and MERGE
OK, so back to the issue of spliting and merging. Our goal is to remove June and add October so that the "current" data shows the four months of July, Aug, Sept and Oct. If we want to create a location on which October can be placed, we need to add another filegroup to the scheme. The way we do this is that we first need to add the filegroup/file to the database and then we need to add this new filegroup to our partition scheme.

Step 1 - Create a place for new data (Oct) to reside
Create a new filegroup and new file.

ALTER DATABASE CreditPT
 ADD FILEGROUP PartitionedTablesFG5
GO

ALTER DATABASE CreditPT
 ADD FILE
  (NAME = N'CreditPTFG5File1',
  FILENAME = N'C:\SQLTemp\CreditPTFG5File1.ndf',
  SIZE = 30, FILEGROWTH = 10, MAXSIZE = 50)
  TO FILEGROUP [PartitionedTablesFG5]
GO

Step 2 - Add this location to the partition scheme
The way to do this is to add a "NEXT USED" filegroup. Syntax:

ALTER PARTITION SCHEME PartitionScheme2 NEXT USED [PartitionedTablesFG5]
GO

Once the partition scheme has a new filegroup to use (and you can only have one at a time - note that if you add a second it solely replaces the first and does not warn you that you already had a next used), then the next time a partition is needed (which will be on the next split) then this "next used" partition will be used.

Step 3 - Split the last boundary to add the new month (Oct)
Now that we have a filegroup AND we have told the partition scheme to use it for the next split operation, we can split the PS to include this 5th filegroup

ALTER PARTITION FUNCTION RightPartitionFunction2()
SPLIT RANGE ('20041001')
GO

Step 4 - Switch-in the staging table of October data so it becomes part of the partitioned table
This step really needs more to it. To switch in your staging table you need to already have a staging table. To make this an optimal operation, you MUST create your staging table on the same fielgroup on which the partition will reside. What you're trying to do is make the switch optimal by making it a metadata only operation. This step really consists of creating an identical table with identical indexes and creating it with a constraint that matches the partitions' definition (so that SQL Server can trust that it only includes data within the interval for that partition).

CREATE TABLE ChargesPTRightOct
(  charge_no  int   NOT NULL IDENTITY,
 member_no  int   NOT NULL
      CONSTRAINT ChargesPTRightOctMemberNoFK
       REFERENCES dbo.Member(Member_No),
 provider_no  int   NOT NULL
      CONSTRAINT ChargesPTRightOctProviderNoFK
       REFERENCES dbo.Provider(Provider_No),
 category_no  int   NOT NULL
      CONSTRAINT ChargesPTRightOctCategoryNoFK
       REFERENCES dbo.Category(Category_No),
 charge_dt  datetime  NOT NULL
      CONSTRAINT ChargesPTRightOctChargeDtCK
       CHECK (Charge_dt >= '20041001'
         AND Charge_dt < '20041101'),
 charge_amt  money  NOT NULL,
 statement_no int   NOT NULL,
 charge_code  char(2)  NOT NULL
) ON [PartitionedTablesFG5]
GO

ALTER TABLE ChargesPTRightOct
ADD CONSTRAINT ChargesPTRightOctPK
  PRIMARY KEY CLUSTERED (charge_dt, charge_no)
--   ON Credit4MonthPFN (charge_dt)
GO

INSERT ChargesPTRightOct (member_no, provider_no, category_no
      , charge_dt, charge_amt
      , statement_no, charge_code)
 SELECT member_no, provider_no, category_no
   , dateadd(yy, 5, charge_dt), (charge_amt + charge_no)/10
   , statement_no, charge_code
 FROM CreditPT.dbo.Charge
 WHERE month(charge_dt) IN (10)
 ORDER BY charge_dt, charge_no
GO

ALTER TABLE ChargesPTRightOct
SWITCH TO ChargesPTRight PARTITION 5
GO

Step 5 - Create a staging table into which the old data (Jun) will go
Similarly, you will want to create a place for the June data to go - kind of like a “staging” table but more for the exit of the table.

CREATE TABLE ChargesPTRightJun
(  charge_no  int   NOT NULL IDENTITY,
 member_no  int   NOT NULL
      CONSTRAINT ChargesPTRightJunMemberNoFK
       REFERENCES dbo.Member(Member_No),
 provider_no  int   NOT NULL
      CONSTRAINT ChargesPTRightJunProviderNoFK
       REFERENCES dbo.Provider(Provider_No),
 category_no  int   NOT NULL
      CONSTRAINT ChargesPTRightJunCategoryNoFK
       REFERENCES dbo.Category(Category_No),
 charge_dt  datetime  NOT NULL,
 charge_amt  money  NOT NULL,
 statement_no int   NOT NULL,
 charge_code  char(2)  NOT NULL
) ON [PartitionedTablesFG1]
GO

ALTER TABLE ChargesPTRightJun
ADD CONSTRAINT ChargesPTRightJunPK
  PRIMARY KEY CLUSTERED (charge_dt, charge_no)
--   ON Credit4MonthPFN (charge_dt)
GO

Step 6 - Switch-out the June data into the staging table
This must also be a metadata only step. You need to make sure that it's created on the same filegroup as where the data currently resides (see the ON [ParitionedTablesFG1] clause. This is the current location of this partition. As a result, the switch out will be very fast.

ALTER TABLE ChargesPTRight
SWITCH PARTITION 1
TO ChargesPTRightJun
GO

Step 7 - Merge the boundary point to remove June from the table
This last step is the whole reason for this blog entry... yes, we're getting there ;).

ALTER PARTITION FUNCTION RightPartitionFunction2()
MERGE RANGE ('20040701')
GO

When you merge this boundary point, you will remove the value AND the parition in which it resides. Since this boundary point resides in the second partition (PartitionedTablesFG2) then it's actually PartitionedTablesFG2 that gets removed. As a result ALL of the data that resides on it, must be moved to the filegroup that now covers this boundary (which is PartitionedTablesFG1 and which was just emptied). As a result, MERGE does NOT optimally merge with a RIGHT-based partition on the rolling range scenario.

And - as a result, I recommend LEFT-based partition functions... Since the whitepaper tends to focus on using LEFT-based parition functions, please review that for more details. You can access the “Paritioned Tables” whitepaper on MSDN, here.

Hope this helps and thanks for reading!
kt

The title is a common question I've received in the past and I thought I'd take a few minutes to explain a bit about keys and indexes...

This is by no means a lot of detail regarding relational theory, etc. but there are a few things that we should quickly review to make sure that the basis for indexes being created makes sense. First, from a relational theory perspective every table must have a primary key. From SQL Server's perspective it's not a requirement but it's generally a good idea. A primary/unique key are entity identifiers. Each are a unique way of identifying a row. There are subtle differences between the two - in implementation:

Primary Key

  • In SQL Server the Primary Key is enforced through a Primary Key Constraint.
  • None of the columns that make up the primary key allow nulls.
  • The values in the Primary Key must be unique - to enforce uniqueness (as well as make it efficient), SQL Server creates a unique clustered [composite] index on the column(s) that make up the key.

Unique Key

  • In SQL Server the Primary Key is enforced through a Unique Key Constraint.
  • The columns that make up the unique key CAN allow nulls but not for more than one complete key. Meaning that allowing Nulls on a single column unique key really only allows NULL (only one NULL value). Overall, allowing Nulls in a column (like Social Security Number) doesn't really make much sense but if you need to then you can't go with a unique constraint - instead consider a unique index. At that point, I typically get the question of what's the difference between a unique key and a unique index...
  • Allowing Nulls values in the columns that make up a composite unique key makes more sense as long as the complete key is not null for more than one row.
  • The values in the unique Key must be unique and to enforce uniqueness (as well as make it efficient), SQL Server creates a unique NON-clustered [composite] index on the column(s) that make up the key.

What's the difference between a unique index and a unique constraint?

  • A unique key CAN be referenced by a foreign key constraint and a column which has only a unique index cannot be referenced.
  • Constraints are checked before indexes and this can lead to a large multi-row insert/select or update to fail before modification. However, indexes might (for a large modification) be validated at the end (instead of row by row) so a large modification that has a failure will need to rollback at the end of the modification rather than before. This is a good point - and one I hadn't really thought of until I was poking around some of the Q&A on SQLMag's website. Here's where I found it.

So, all of this leads me up to the original question (yes, you knew I get here someday :) and that's “When did SQL Server stop putting indexes on Foreign Key columns?”

First, SQL Server has NEVER put an index on a foreign key column... Indexes are used (as described above) to make the lookup (in a primary or unique key) for a duplicate value FAST. If the keys are ordered then checking to see if one already exists is trivial (i.e. fast). There is NO reason for SQL Server to put an index on a foreign key column as the column does not (and probably would never be) unique (if it is then it's likely to have a primary or unique key on it as well - as in a 1-1 relationship). So, that leads me to another key point...

Are there any benefits to indexing foreign key columns? YES

  • Better performance on maintaining the relationship on a delete of a primary/unique key. When you delete a key row, SQL Server must check to see if there are any rows which reference the row being deleted.
    • If the foreign key relationship is defined with NO ACTION (on update/delete) then a referenced row CANNOT be deleted as it would leave the referencing rows “orphaned.” To find the rows efficiently an index on the foreign key column helps!
    • If the foreign key relationship is defined with CASCADE (on update/delete) then when a referenced row is modified all of the referencing rows must be modified as well (either updated to reflect the new value or on cascade delete). To find the rows to modify efficiently, an index on the foreign key column helps!
  • Better join performance - for many of the reasons above, SQL Server can more effectively find the rows to join to when tables are joined on primary/foreign key relationships. However, this is NOT always the “best” indexing choice for joins but it is a good start. 

Finally, if you want a few titles related to relational theory check out these links:

An Introduction to Database Systems, Eighth Edition
     by C.J. Date
     E.F. Codd

Database Design for Mere Mortals: A Hands-On Guide to Relational Database Design
     Mike Hernandez

Have fun!
kt

Categories:
Design | Indexes | Resources | Tips

One of my favorite features of Management Studio is the ability to “script” a step rather than execute it. Within SQL Management Studio, each dialog has 4 different options from which to choose (these are including the typical OK/Cancel) but also add a couple more at the top of the dialog:

 

At the top of the dialog

  • Schedule – which will create an Agent job.
  • Script – which will send the command(s) to a new Query Window in Management Studio

At the bottom of the dialog (the norm):

  • Run it right now – with the OK button
  • Cancel it – fairly obvious, eh?

The “Script” option has to be one of my favorites. You can learn from this dialog, you can cut/paste and help to build more complex scripts, you can just see what the heck the UI was going to do, etc… and well, that’s what led to SQLCMD. Something I wanted to do (from just a practical perspective) was create a simple/fast way to connect to my existing SQL Server 2000 machine, backup a database and then restore it to my SQL Server 2005 machine. One of the benefits of backup/restore is that SQL Server will upgrade the database on restore (no special settings required - this is default behavior if you restore a SQL Server 2000 backup to SQL Server 2005). Although finding the correct syntax for each step is easy through the UI, I wanted to take the overall process one-step further and create a SQLCMD-specific script to automate it (yes, I kind of like the idea of a command line tools for batch/automation, etc.. and yes, I could have also done this with DTS or the SQL Agent... good idea for another blog entry! Hey guys (and you know who you are!) you should do the comparable DTS/Agent jobs?! I'll be waiting. J J J).

 

Anyway, the primary goal is to leverage some of the new capabilities of SQLCMD (which I have to admit – if you’re a command line/automation person – this is going to quickly become your favorite new features).  Here’s a very quick summary of SQLCMD:

 

  • Replaces OSQL (and then some!)
  • Can set a variety of environment variables
  • Offers a -A parameter for the dedicated admin connection (only one window available for this at a time and the usage here is when you can’t otherwise connect to your server – for whatever reason. To ensure that they can always use this they have kept resources available so that this should always work. I use “should” because well, I rarely use always (or never)… J
  • Adds –v for variable input from the command line but also supports variables set as environment variables (via SET) or variables set within the SQLCMD script (SETVAR).
  • Allows connections to be specified within the script via:
        
    :CONNECT servername
    NOTE: the semi-colon is required
  • Has a few new error handling options such as
        
    :ON ERROR EXIT
    NICE: this uses RAISERROR and when a severity of 11 (I think) or higher is raised you can define what happens (EXIT, IGNORE or RETRY n) to control script flow and retry behavior (yeah!)
  • And with the exception of a few new (yet VERY cool things like this) SQLCMD is really just TSQL plus some very nice automation stuff. (I can hear TSQL geeks cheering...)

So – I created a script that does this and here’s the play by play of it:

 

I don't want to support system databases for backup/restore. You could certainly do a backup and then restore to a new name... but that will have to be in version 2.

:ON ERROR EXIT
go

 

IF '$(DB)' IN ('master', 'model', 'msdb', 'ReportServer', 'distribution', 'tempdb')
BEGIN
 RAISERROR ('System Databases are NOT supported. Script Terminated.', 16, -1)
END
go

Next, I wanted to create a connection to my 2000 server (this could have easily been a parameter!)

:CONNECT servername

 

BACKUP DATABASE $(DB)
 TO DISK = '\\servername\sharename\path\$(db).bak' -- NOTICE YOU CAN USE VARIABLES WITHIN STRINGS!
 WITH INIT
PRINT ' '
go

NOTE: Be sure to set a GO after each batch that includes changes in connections, etc. The idea is that much of the SQLCMD code (per batch) is evaluated and executed first. Without the go I would have ended up in the next connection - executing the backup. Here's a quick example of what I mean:

-- Example 1

:CONNECT SERVER1
EXEC PROC1


: CONNECT SERVER2
EXEC PROC2

-- In Example 1, both procedures execute on SERVER2

 

:CONNECT SERVER1
EXEC PROC1
go

 

-- Example 2

: CONNECT SERVER2
EXEC PROC2
go

-- In Example 2, PROC1 executes on SERVER1 then PROC2 executes on SERVER2 - as desired.

After the database is backed up I connect to the SQL 2005 server and drop the existing database... you may or may not want to do this so blindly.

:CONNECT servername

SET NOCOUNT ON

 

IF DATABASEPROPERTYEX('$(DB)', 'COLLATION') IS NOT NULL
 DROP DATABASE [$(DB)]

 

WAITFOR DELAY '$(DELAY)' -- this was just for testing...

Next, I need to determine the SQL 2005's Data Path as I didn't want to assume C:\Program Files\...

To do this I'm using an undocumented xp (cause I wanted to see if it still works? it does!)

DECLARE @SQLPath nvarchar(512)
EXECUTE master.dbo.xp_regread 'HKEY_LOCAL_MACHINE',
            'SOFTWARE\Microsoft\MSSQLServer\Setup',
            'SQLPath', @SQLPath OUTPUT

Next, I do some cool stuff to create the correct locations for the new data and log file names. This was the hardest part really... What if the database has multiple files - and the pathes are NOT the same as the server from which the database was backed up. Well, now I need to do a restore with MOVE. Have you seen the syntax for RESTORE with MOVE... it's not overly pretty. Anyway, I first create a table into which I will store the output of LOAD FILELISTONLY (so I can interrogate the backup device and see the filelist of the database which was backed up):

 

CREATE TABLE #BackupFileList
( LogicalName sysname NULL
 , PhysicalName sysname NULL
 , [Type] char(1)
 , FileGroupName sysname NULL
 , Size bigint
 , MaxSize bigint
 , FileId smallint
 , CreateLSN numeric(25,0)
 , DropLSN numeric(25,0)
 , UniqueId uniqueidentifier
 , ReadOnlyLSN numeric(25,0)
 , ReadWriteLSN numeric(25,0)
 , BackupSizeInBytes bigint
 , SourceBlockSize bigint
 , FileGroupId  smallint
 , LogGroupGUID uniqueidentifier
 , DifferentialBaseLSN numeric(25,0)
 , DifferentialBaseGUID uniqueidentifier
 , IsReadOnly bit
 , IsPresent bit
)

Next, I get the info from the backup device (again, notice the parameters being used even in Dynamic String Execution!)

INSERT #BackupFileList
 EXEC('LOAD FILELISTONLY FROM DISK = ''D:\SQLDemo\$(db).bak''')

Now I reverse the physical name to find only the “file“ part of the name... reversing it to find the first string up to a backslash (from the end). Then I extract that string and reverse it back... and - at the beginning of the string, I add the local server's data path.

UPDATE #BackupFileList
 SET PhysicalName = @SQLPath + N'\Data\' + REVERSE(SUBSTRING(REVERSE(PhysicalName), 1, PATINDEX('%\%', REVERSE(PhysicalName)) -1))

Now I'm ready to walk the filelist and build the RESTORE with MOVE database string. This completes the script!

DECLARE @LogicalName sysname
 , @PhysicalName sysname
 , @ExecStr   nvarchar(max)

 

DECLARE FileListCursor CURSOR FAST_FORWARD FOR
 SELECT LogicalName, PhysicalName
 FROM #BackupFileList

 

OPEN FileListCursor

 

FETCH NEXT FROM FileListCursor INTO @LogicalName, @PhysicalName

 

SELECT @ExecStr = N'RESTORE DATABASE $(DB)' +
       N' FROM DISK = ''D:\SQLDemo\$(db).bak''' +
       N' WITH MOVE ''' + @LogicalName + N''' TO ''' + @PhysicalName + N''''
 
FETCH NEXT FROM FileListCursor INTO @LogicalName, @PhysicalName

WHILE @@FETCH_STATUS <> -1
BEGIN
 SELECT @ExecStr = @ExecStr + N', MOVE ''' + @LogicalName + ''' TO ''' + @PhysicalName + ''''
 FETCH NEXT FROM FileListCursor INTO @LogicalName, @PhysicalName
END

 

-- SELECT @ExecStr

EXEC (@ExecStr)

 

DEALLOCATE FileListCursor
go

I set my parameters on the call to the script.

If you're interested in reading more about somewhat stressful demos for the BI Team read here.

Categories:
Events | Tips | SQL Server 2005

People always ask me how I keep up to date with SQL Server and how I could possibly know so many intricate details about SQL Server… Well, I test/play/figure it out until it’s obvious (painfully sometimes and sometimes it’s not obvious). So, this blog entry is based on a specific question that made me want to see if I could come up with a test and positive proof in patterns and performance – when working with large tables and range deletes. I generally recommend a clustered index be created on a unique, narrow and static column – and best if it’s an identity column – yet there is an exception to this when the table has VERY specific administrative/operational/archival patterns a.k.a. the “sliding window” scenario. The question starts by asking about my point that “deletes aren’t as big of a deal” in fragmentation. While my point was solely referring to the internal v. external fragmentation issue that led us to take some more time evaluating delete performance. In talking about delete performance, we arrive at some best practices in index creation as well as horizontal partitioning! (fyi – the scripts used to create/build this scenario are fairly cool in and of themselves – not a bad idea to spend some time reviewing that!)

 

Q: You mention that deletes are not as “big of a deal” since they leave gaps. I had a situation where I had a table with around 50 million rows, about a million rows per day. After about a month and a half, I delete some old ones. I can delete a day's worth of data (1 million rows) in about 2 minutes with a delete statement, if there are no indexes. If I have a clustered index based on the date, it took about 22 minutes. Inserts were instantaneous, about two pages worth of rows every second, and even updates were quicker. Is there anything obvious that would cause this?

OK, so I need to clarify first. My comment that deletes aren't as “big of an issue” is specifically regarding fragmentation (not really performance of the actual delete)... A delete (when it’s a range delete – as opposed to a lot of singleton deletes) typically de-allocates the entire page freeing it back to the table (remember, it’s still “reserved” by this table) or if the entire extent is freed then it goes back to the DB. However, remember, this is only when the range is contiguous or grouped, per se. So, looking solely at the result and impact to the base table – we would say that range deletes aren’t that big of a deal because they free entire pages and even entire extents – effectively, the table doesn’t have gaps. OK, so having said that, you make a good point: what about performance?

 

First, I’m somewhat intrigued by the fact that you say a delete was faster with no indexes rather than with a clustered especially if the clustered was date BUT I have a feeling that there might have been nonclustered indexes as well? (In the upcoming test matrix, you may see some patterns in timing that corroborate this idea).  Without knowing whether you had non-clustered indexes or not, let me at least speculate and test (and hopefully prove) a few points? First, I would ask what the exact definition of the CL index was – was it just date? Remember, I really feel as though there are three types of tables – tables with the wrong clustered index, tables with NO clustered index (i.e. HEAP) and tables with the right clustered index. However, the “right” clustered index CAN vary based on pattern usage of the table. In a table that has range deletes based on dates – as well as an identity column – I often create a clustered index on the composite combination of these two columns. If the CL Key is on the date itself then this should not be that much slower (that the heap – in fact, I would have expected it to be faster even though there will be some overhead in maintaining the index). I really think there’s something else there – probably nonclustered indexes, maybe a blocking scenario? So, to help – I think I can prove a few points (in terms of perf) with a script/test to see if I can give you some rough numbers (I’m really, really intrigued to see if I can’t prove my point as well J). So, I decided to try an interesting combination – and really test the numbers. My test (albeit not entirely scientific) should show a pattern! First, I’m going to take the 1.6 million row charge table (in my “pumped up version of the credit database) and I’m going to copy it 10 times into slightly different table structures. Details and the creation for each table follow:

 

ChargeHeap – A table with NO Clustered index and NO non-clustered indexes.

SELECT *

INTO ChargeHeap

FROM Credit.dbo.Charge

 

ChargeCLPKForDelete – Is really NOT that bad of a choice… I typically recommend this choice for everyday tables where there are lots of inserts/updates/deletes. However, with the specific “sliding window” scenario (where data comes in at the end AND is periodically archived off the other end) then I might make a different choice. But – still a good table to create and compare against. This will certainly improve inserts and minimize fragmentation BUT it might not be ideal for range deletes and archiving. This table has a clustered index on the Primary Key; this column is an ever-increasing identity value on charge_no.

 

SELECT *

INTO ChargeCLPKForDelete

FROM Credit.dbo.Charge

go

 

ALTER TABLE ChargeTesting.dbo.ChargeCLPKForDelete

ADD CONSTRAINT ChargeCLPKForDeletePK

          PRIMARY KEY CLUSTERED (Charge_NO)

go

 

Again, because this table will follow a “sliding window” scenario then I would change the clustering key to include the date and then the ID. I don’t typically chose date (as a single column primary key) because dates alone are rarely, if ever, unique (for more details on datetime data check out the articles I’ve written on Datetime data for SQL Magazine here). Because you already have an identity column in the table (which uniquely identfies every row), there’s no need to add a “uniqifier.”  The process of “uniqification” happens internally if the clustereing key is defined on a non-unique column (see last month’s webcast if you’re not familiar with this “uniqification” process). In the case where you already have a row identifier, it’s best to use it in the clustering key with date – especially as it has useful data in it. So – this leads us to the next TWO tables…

 

ChargeCLDateForDelete – Is a clustered index on the date ALONE.

 

SELECT *

INTO ChargeCLDateForDelete

FROM Credit.dbo.Charge

go

 

CREATE CLUSTERED INDEX ChargeCLDateForDelete

ON ChargeTesting.dbo.ChargeCLDateForDelete (Charge_DT)

go

 

ChargeCLDateForDeleteWithCompPK – Is a composite clustered primary key index on the date, charge_no together.

 

SELECT *

INTO ChargeCLDateForDeleteWithCompPK

FROM Credit.dbo.Charge

go

 

ALTER TABLE ChargeTesting.dbo.ChargeCLDateForDeleteWithCompPK

ADD CONSTRAINT ChargeCLDateForDeleteWithCompPK_PK

          PRIMARY KEY CLUSTERED (Charge_DT, Charge_NO)

go

 

The last base table will have a poor choice for the clustering key (for these deletes) as they will not have “ranges” with which to work.

 

ChargeCLReallyBadForDelete – This clustered index will be on something NOT used at all in this delete. The deletes will result in lots of internal fragmentation. In other words, the table will be larger and “internally” fragmented. The table will essentially look like “swiss cheese” meaning that the table will have lot of little gaps within the pages themselves.

 

SELECT *

INTO ChargeCLReallyBadForDelete

FROM Credit.dbo.Charge

go

 

CREATE CLUSTERED INDEX ChargeCLReallyBadForDelete

ON ChargeTesting.dbo.ChargeCLReallyBadForDelete (Charge_amt)

go

 

And what about nonclustered indexes? Finally, we will add five final tables – each of the above – with two non-clustered indexes added to see what their impact has on overall performance.

 

ChargeHeapWNCIndexes – Is a heap table with two non-clustered indexes.

 

SELECT *

INTO ChargeHeapWNCIndexes

FROM Credit.dbo.Charge

go

 

CREATE NONCLUSTERED INDEX ChargeHeapWNCIndexesNC2

ON ChargeHeapWNCIndexes (Statement_NO)

go

 

CREATE NONCLUSTERED INDEX ChargeHeapWNCIndexesNC3

ON ChargeHeapWNCIndexes (Member_no)

go

 

ChargeCLPKForDeleteWNCIndexes – Is clustered on the Primary Key and has two additional non-clustered indexes.

 

SELECT *

INTO ChargeCLBadForDeleteWNCIndexes

FROM Credit.dbo.Charge

go

 

ALTER TABLE ChargeTesting.dbo.ChargeCLBadForDeleteWNCIndexes

ADD CONSTRAINT ChargeCLBadForDeleteWNCIndexesPK

          PRIMARY KEY CLUSTERED (Charge_NO)

go

 

CREATE NONCLUSTERED INDEX ChargeCLBadForDeleteWNCIndexesNC2

ON ChargeCLBadForDeleteWNCIndexes (Statement_NO)

go

 

CREATE NONCLUSTERED INDEX ChargeCLBadForDeleteWNCIndexesNC3

ON ChargeCLBadForDeleteWNCIndexes (Member_no)

go

 

ChargeCLDateForDeleteWNCIndexes – Is clustered on the Date alone and has two additional non-clustered indexes.

 

SELECT *

INTO ChargeCLDateForDeleteWNCIndexes

FROM Credit.dbo.Charge

go

 

CREATE CLUSTERED INDEX ChargeCLDateForDeleteWNCIndexes

ON ChargeTesting.dbo.ChargeCLDateForDeleteWNCIndexes (Charge_DT)

go

 

CREATE NONCLUSTERED INDEX ChargeCLDateForDeleteWNCIndexesNC2

ON ChargeCLDateForDeleteWNCIndexes (Statement_NO)

go

 

CREATE NONCLUSTERED INDEX ChargeCLDateForDeleteWNCIndexesNC3

ON ChargeCLDateForDeleteWNCIndexes (Member_no)

go

 

ChargeCLDateForDeleteWithCompPKWNCIndexes – Is the composite primary key on the date and identity. This table also has the same two non-clustered indexes.

 

SELECT *

INTO ChargeCLDateForDeleteWithCompPKWNCIndexes

FROM Credit.dbo.Charge

go

 

ALTER TABLE ChargeTesting.dbo.ChargeCLDateForDeleteWithCompPKWNCIndexes

ADD CONSTRAINT ChargeCLDateForDeleteWithCompPKWNCIndexes_PK

          PRIMARY KEY CLUSTERED (Charge_DT, Charge_NO)

go

 

CREATE NONCLUSTERED INDEX ChargeCLDateForDeleteWithCompPKWNCIndexesNC2

ON ChargeCLDateForDeleteWithCompPKWNCIndexes (Statement_NO)

go

 

CREATE NONCLUSTERED INDEX ChargeCLDateForDeleteWithCompPKWNCIndexesNC3

ON ChargeCLDateForDeleteWithCompPKWNCIndexes (Member_no)

go

 

ChargeCLReallyBadForDeleteWNCIndexes – Is the poor choice (for this delete) of a clustered index on charge_amt. Again, this table will end up with a lot of internal fragmentation after these deletes (not to mention that inserts will probably cause a lot of fragmentation as well). Over time, this table would probably be one of the worst performers – for almost all DML.

 

SELECT *

INTO dbo.ChargeCLReallyBadForDeleteWNCIndexes

FROM Credit.dbo.Charge

go

 

CREATE CLUSTERED INDEX ChargeCLReallyBadForDeleteWNCIndexesCL

ON ChargeTesting.dbo.ChargeCLReallyBadForDeleteWNCIndexes (Charge_amt)

go

 

CREATE NONCLUSTERED INDEX ChargeCLReallyBadForDeleteWNCIndexesNC2

ON dbo.ChargeCLReallyBadForDeleteWNCIndexes (Statement_NO)

go

 

CREATE NONCLUSTERED INDEX ChargeCLReallyBadForDeleteWNCIndexesNC3

ON dbo.ChargeCLReallyBadForDeleteWNCIndexes (Member_no)

go

 

Well, now that you know the players we can setup the test. The idea of the test is to take a large chunk of early data “off” the table. Presumably to archive but I will not address the whole archiving process at this time. Once archived the data must be removed (for many – just removed – as they can always rely on backups if they need the earlier data… just make sure you keep those backups!)

 

In the tests we will delete a varying percentage of the table based on a specific date for the charge rows. I have modified and increased the size of credit in many ways before getting to these scripts so it might be hard to reproduce this BUT there are some great samples to use from these.

 

First, I increased the number of rows in credit.dbo.charge by just “doubling” the rows:

 

INSERT charge

SELECT c.[member_no] ,

            c.[provider_no] ,

            c.[category_no] ,

            c.[charge_dt] ,

            c.[charge_amt] ,

            c.[statement_no] ,

            c.[charge_code]

FROM charge AS c    -- to go to 200K rows

go

INSERT charge

SELECT c.[member_no] ,

            c.[provider_no] ,

            c.[category_no] ,

            c.[charge_dt] ,

            c.[charge_amt] ,

            c.[statement_no] ,

            c.[charge_code]

FROM charge AS c    -- to go to 400K rows

go

 

 

However, once I get the rows there I’ve realized that I have a bunch of data that doesn’t really follow an increasing ID and increasing date –which isn’t realistic at all… And, of course, I realized this on a version of my database which already had the rows increased and which I always use for all sorts of samples/examples SO – I’d rather fix the current data rather than re-write the INSERT/SELECTs. But (and as it would always happen, I found other problems – like that I had a gap in the identity values). Anyway, I decided that the best way to get dates to be ever increasing along with the identity is to FIRST cleanup the identity. I’m going to do that first by dropping and re-adding the identity col. Re-creating the identity column to reseed it and then updating the charge_dt column based on a starting time and then adding the charge_no to it. OK, so that’s not what I did BUT that would work to give you a relatively increasing value – there would be duplicates though as SQL Server rounds ms values to the nearest timetick. AND – I’m not going to get into this here at all but I did write a series of articles on datetime data for SQL Server Magazine, see here.

 

OK, so crank up your data, have ever-increasing identities to correspond to ever-increasing data and then create duplicate test tables based on the settings and patterns above. You can then create the testing patterns. The tests are going to be based on deleting a percentage of your data (you will need to determine your date ranges and estimate the correct date which is 5-10% of your table). 5-10 percent is a reasonable percentage to delete based on an “archiving” principle/idea (in fact, this might be large in many situations). Since the date values are the same for all tables this will make testing easier. Also, there a few simple testing practices here – use a separate database in which you track test run numbers, do multiple test runs and take averages (consider removing the highest and lowest values to get rid of any strange anomalies as well). SO, here’s the (TestingDeletePerformance SetupScript1 ChargeTestingResults.sql (2.49 KB)) that I used to track/store the test results. Note: you may need to change directory/path/instance names. Once the results database (ChargeTestingResults) has been created then you can create your testing database (based on your modified version of Credit.dbo.charge). The setup script to create this database (ChargeTesting), the testing procedure (DeleteTestRun) and perform a backup (to restore for reproducible test runs) is here (TestingDeletePerformance SetupScript2 ChargeTesting.sql (9.41 KB)).

 

Now, on to the testing…. You should run a series of test runs of slightly varying numbers of rows being deleted and then analyze your findings. I have a feeling that we’ll all get slightly different results on the extreme ends of a few tests BUT if you average the timings over multiple test runs I think you’ll find the same thing I’ve found!

 

So, you’ll need to set your @DateToDeleteTo and make sure your path is correct. based on your dates) but here’s the general idea… so that you input the timings into your ChargeTestingResults database:

 

DECLARE @DateToDeleteTo              datetime,

                   @NumberOfTestRuns     tinyint,

                   @CurrentTestRun           tinyint

 

-----------------------------------------

-----------------------------------------

SET @NumberOfTestRuns = 1

SET @DateToDeleteTo = 'yyyymmdd'

-----------------------------------------

-----------------------------------------

 

SET @CurrentTestRun = 0

 

WHILE @CurrentTestRun < @NumberOfTestRuns

BEGIN

          USE ChargeTesting

          EXEC DeleteTestRun @DateToDeleteTo

          USE Master

          RESTORE DATABASE ChargeTesting

                   FROM DISK = 'fix path'

          SET @CurrentTestRun = @CurrentTestRun + 1

END

go

 

OK, so now we’re ready to roll! I’ve run this 23 times (yes, thanks to my machine being idle for a few of my sleeping hours – it’s either a great time for SETI or automated tests…)

 

I’ve run the following query:

SELECT TableName, AVG(DeleteTime) AS [Average Time]

, MIN(DeleteTime) AS [Min Time]
, MAX(DeleteTime) AS [Max Time]

FROM ChargeTestingResults.dbo.DeleteTestResults

WHERE TestRunID NOT IN

                             (SELECT TestRunID

                             FROM ChargeTestingResults.dbo.DeleteTestResults AS DTR

                             WHERE DeleteTime = (SELECT min(DeleteTime)

                                                                             FROM ChargeTestingResults.dbo.DeleteTestResults AS DTRC -- correlated

                                                                             WHERE DTR.TableName = DTRC.TableName GROUP BY TableName)

                             OR

                                        DeleteTime = (SELECT max(DeleteTime)

                                                                             FROM ChargeTestingResults.dbo.DeleteTestResults AS DTRC2 -- correlated

                                                                             WHERE DTR.TableName = DTRC2.TableName GROUP BY TableName))

GROUP BY TableName

ORDER BY [Average Time]

 

This query removes the highest and lowest values from the test runs and then averages the final results into this data:

 

TableName

Average Time

Min Time

Max Time

ChargeCLDateForDelete

1566

1143

2903

ChargeCLDateForDeleteWithCompPK

2359

1253

8080

ChargeHeap

10392

9603

11476

ChargeCLPKForDelete

11377

9453

18476

ChargeCLReallyBadForDelete

17491

12640

35440

ChargeCLDateForDeleteWithCompPKWNCIndexes

19594

8020

29414

ChargeCLDateForDeleteWNCIndexes

22467

9243

63520

ChargeCLPKForDeleteWNCIndexes

30132

17343

59366

ChargeCLReallyBadForDeleteWNCIndexes

44208

25946

62000

ChargeHeapWNCIndexes

49407

19716

78383

 

Now, I’m sure you’ve completely forgotten what brought us to this result set but it all started when we were concerned about large scale range deletes… I really created this test set for multiple reasons:

1)     To give you ideas for tracking test runs (separate database, testing tables, dates/times and interesting data)

2)     Show you a few practices for reproducible tests (i.e. backup after setup and then restore before each test run)

3)     Show a variety of examples of dynamic string execution

4)     Show you an interesting query which removes the best case and worst case test scenario and then averages the results (fyi – that query is easier to write in SQL Server 2005 with common table expressions!).

5)     Prove my original point… (of course) J

6)     Lead you to wanting to learn more about the BEST point of all… consider partitioning.

 

So, let’s talk about #5 for a second – the original point was that I *think* the reason the deletes were slower with the clustered index is that you also had non-clustered indexes. Hard to say now but that would be my guess. The MORE interesting point is that the Date alone in the CL key is faster than the composite clustering key. Now, I’m not about to start doing insert testing or space testing but having a clustering key which is NOT unique has other problems. Given the VERY small difference in times for these deletes (yes, it’s almost 50% here but the numbers are VERY small comparatively speaking regardless – ok, we can certainly debate this BUT are you really ONLY focusing on performance for deletes or do you need to look at the big picture? When you do I would guess that INSERTs will be more important overall). But – what you really get to when you look over all of this you see the ultimate problem – NON CLUSTERED INDEXES…. Yes, if you have non-clustered indexes they will kill your delete performance. So – the point here is going to be – how can you get GREAT archiving performance when deleting ranges AS WELL AS great performance for INSERTs AS WELL AS a more manageable environment……… partitioning.

 

So – there was another reason for this whole entry… it’s partially to help try and convince a few of you (especially those of you that have tables in double-digit or triple-digit millions of rows) that large tables are harder to manage and large archival operations can be EXTREMELY costly!

Categories:
Events | Indexes | Resources | Tips

Q: How do I interpret DBCC SHOWCONTIG WITH ALL_INDEXES on index id = 255 on a table with multiple text columns. I'm looking at a copy of last night's production after running Maintenance Plan Optimizations. Scan density on table and all indexes is 99 - 100%. Logical Fragmentation is 0% for everything except text column index, which is over 99% (Extent Fragmentation is 83%). Do these numbers have any negative impact on performance? Are they worrisome?

OK, so the first thing to respond to is the “Logical Fragmentation is 0% for everything except text column indexes” is a bug. This is a recently filed bug. It’s based on using the ALL_INDEXES option with DBCC SHOWCONTIG so if you can walk the indexes individually instead then you can probably avoid this (and you could always tweak my version of sp_RebuildIndexes to do this).

 

Now, related to text indexes (and this is as per the Index Defrag Whitepaper), Logical Scan Fragmentation is not relevant on text indexes. However, Extent Scan Fragmentation is and yes, this is something that could be worrisome IF you return the text/image data to your clients in large requests (meaning LOBs measured in MBs+ or requests where you’re returning a lot of LOBs). If you think you’re doing large scans (or if the text data is very large and generally returned to the client) then you might want to compact this part of the database and even consider moving the text/image data into its own filegroup (but – remember that singleton selects won’t really benefit from this).

 

So, to start – it’s important to just mention this option. When you create a table, you can state where the data is located with the ON clause; you can also state where to place the text/image data with the TEXTIMAGE_ON clause. A sample CREATE TABLE that places data on one filegroup and text/image data on another is below:

 

CREATE TABLE dbo.TestTable

(

          col1   int              identity(100,10),

          col2   datetime     default getdate(),

          col4   char(30)      default suser_name()

)

ON [FG1]

TEXTIMAGE_ON [FG2]

 

Since the table is already created – and very fragmented – the only option you have now is to “move” that data to another filegroup. To do this you have a couple of options depending mostly on table size. The easiest way would be to use the SQL Enterprise Manager to first create a new file and filegroup and then set the Table Properties for the Text filegroup (under properties in Design table) to be this new filegroup. The bad side of using this is that the entire table (and all related objects, constraints and indexes) will also be recreated, as the ENTIRE table will be “rebuilt” using this method. The other option – which is a bit harder to implement is to BCP out and then BCP back in the data; however, you cannot change the filegroup when using ALTER TABLE so – this would only work if you wanted to load it back into the same filegroup in which the data already resides. Depending on how many gaps there are within the file you may not end up with very contiguous allocations at the end of this either.

 

So, I think it mostly depends on how much work you’re willing to do and how much of a gain you think it’s going to have.  If MOST queries only return a singleton text/image value AND the performance of these operations is adequate, then I might not worry about it so much (even though, yes it is fragmented). As per one of the dev team folks – one thing to do might be to test the timings of SELECTs and determine if they are slower than you would like and if yes, then DELETE/INSERT the text/image value to have it re-inserted. BUT – if you have time to shift things around and don’t mind some of the administrative impact then there are a couple of options to review if you are getting poor performance from text/image data. One option would be to separate it on its own filegroup (fyi – there are other management benefits in doing this especially in the area of backup and restore when this is a large portion of your database) and you should consider taking the time to do this off hours. Additionally, it will probably be easiest using the UI. You should really follow these steps:

 

  1. Perform a full backup your database
  2. Change the recovery model to simple or bulk_logged (unless it’s already simple or bulk_logged) and then restrict user access to dbo use only (just in case you need to revert back to the older version). You can choose bulk_logged if this operation is going to take a significant amount of time and IF you want to do periodic log backups while it’s running. If you just plan to do a full backup upon success completion of this operation then you can choose the simple recovery model.
  3. Use the UI to create a new file and filegroup (although this syntax is pretty easy)
  4. Use the UI to change the table’s properties to point to your new filegroup (probably should do some testing with a development version before you do this on production – especially with the same sizing, etc. so you can see how much log space and time this is likely to take, etc.)
  5. Change the recovery model back to what it was (likely full?)
  6. If you switched to simple, perform a full backup of your database. If you switched to bulk_logged then all you need is a transaction log backup.
  7. Allow users back in.

 

As for resources that might help you out more with this question and with understanding some of the management options (in terms of backup/restore for VLDB, see my series on Backup/Restore on SQL Server Magazine. A complete list of my articles is here.

 

Q: After doing index maintenance in Enterprise Manager, the LogicalScanFragmentation was zero for the indexes that I modified. After running DBCC DBREINDEX the LogicalScanFragmentation jumped up to approx 12% on all the indexes. Not sure why this happened, but there must be some difference in the way the indexes were rebuilt.

Also, somewhat related to the first part of the previous question… It turns out that Logical Scan Fragmentation of 0 (as reported by a DBCC SHOWCONTIG with ALL_INDEXES) is actually a bug. It's unlikely that it “jumped” to 12% but that it may not have been 0 to begin with...

 

But – for completeness – there could be one other factor in it increasing after a rebuild and that's IF the space into which it rebuilds is NOT contiguous. You *MAY* want to consider rebuilding/moving the object to another filegroup to make sure that it's completely contiguous but that might not be worth it.

 

Q: (This is from a recent email, fyi) I found a bug in your sp_RebuildIndexes.sql script… It doesn’t work for clustered indexes that have columns set to DESC!

THANKS! This is great. I have to admit that I haven't had clients with clustered indexes with DESC columns but it could certainly happen and good to know that I have a bug (we all have to have one someday, right? he he)!

 

So, I have updated/fixed the script here:sp_RebuildIndexes.sql (11.08 KB). For completeness, I have also updated the msdn webcast Q&A zip files: Part I here and Part II here and I have updated the version that is listed on my sample scripts page on SQLskills (here). It shouldn't impact too many people but good to have the right code! Always let me know if you find something. (This fix is from fellow RD Morten Abrahamsen who is an RD for Denmark, Norway and Sweden.)

 

Q: (Also, from a recent email) I don't quite understand what indexdefrag will do for a clustered index. It is my understanding that the leaf level of a clustered index is the data (i.e. the leaf pages are the data pages). If these pages are being physically reordered and compacted [during a rebuild], wouldn't that accomplish the same thing as a defrag index?

Rebuilding and Defraging both work on the leaf level of an index - regardless of index type... So a rebuild completely rebuilds the entire structure (so, it’s more thorough). Defrag solely rearranges the leaf level pages in "mini" transactions as it moves data through the leaf level pages.

 

So - your question is really more of a statement - and a correct one too... Yes, the leaf level of the clustered index IS the data so if you perform a rebuild then there's no reason to defragment.

 

Q: You mentioned that DBCC DBREINDEX and CREATE w/DROP_EXISTING are synonymous and from what I can tell from your documentation as well as other documentation that I have read, the DBCC DBREINDEX does rebuild the clustered index.  We currently have a script similar to the sp_RebuildIndexes.sql that you provided; however, we perform the DBCC DBREINDEX for all indexes including the clustered indexes.  We have a consultant telling us that we must use the CREATE w/DROP_EXISTING for the clustered indexes.  At first I didn't think this was true based on what I learned from your webcast. However, after looking at your sp_RebuildIndexes.sql script I do see that you are also using the CREATE w/DROP_EXISTING for the clustered index. Can you provide some clarification for me and verify if we should be using CREATE w/DROP_EXISTING for clustered indexes?

Well, yes and no… To be honest… you caught me! J

So, to answer whether or not these are synonymous – well, kind of! There's one thing you can do with CREATE w/DROP_EXISTING that you can't do with DBCC DBREINDEX. Using CREATE w/DROP_EXISTING you can CHANGE the definition of the clustering key. If you're just rebuilding then there's NO difference....

 

Now, to answer the question of why I use CREATE with DROP_EXISTING for the clustered index... it was just for the challenge! Really, I was trying to see how complex the dynamically executed string was going to be to build in order to supply the entire CREATE INDEX statement. So, didn’t mean to create the confusion! There’s really NO technical reason for it! (Good catch!)

Categories:
Events | Indexes | Resources | Tips

Q: Can I automate DBCC DBREINDEX to rebuild all tables on all databases on a server?

Actually, yes and no… With stored procedures you’ll have a hard time changing database context without having to use dynamic string execution…. And so this can add a bit of frustration. BUT – it’s your lucky day (and really, it’s mainly because I’m on a ferry going to Victoria Canada right now and the Washington State Ferry system doesn’t currently have wireless internet access… and well, I have a few spare minutes J).

 

Anyway, I’ve put together a stored procedure which allows you to enter a command that will be processed in all databases. It’s similar to the undocumented sp_MSforeachdb but it’s got a simpler format. The basic idea is to create a cursor which has the list of all of the databases in it and then while walking the cursor build a string which includes a change in database context switch followed by the command to execute.

 

                   SELECT @ExecString = N'USE ' + QUOTENAME(@DBToProcessName, ']') + N' ' + @CommandToExecute

                   SELECT @ExecString AS 'Command to be executed'

 

Please note that I have NOT done a lot of error handling nor have I done a lot of testing but this works pretty well for what I’ve used it for and in response to this question it certainly works. It’s a great start for you to use/learn from and build upon for your own procedures. Here’s the script: sp_ExecuteInDatabase.sql (3.76 KB). And here’s how you use it:

 

            exec sp_ExecuteInDatabase N'exec sp_RebuildIndexes'

 

or if you want to execute ONLY in one database:

 

            exec sp_ExecuteInDatabase N'exec sp_RebuildIndexes', 'pubs'

 

Let me know if you find anything interesting and/or make changes to this one! Have fun.

 

Q: Is there a baseline amount of improvement for INDEXDEFRAG or DBREINDEX to be worthwhile? Please express your answer looking at % change for logical fragmentation and Scan Density.

Actually, the associated whitepaper does a nice job of correlating table size with defragmentation best practices. I’ll defer to that BUT I will say that IF you have the maintenance window then there’s never a negative in doing this – even for small tables. However, (and as the whitepaper also states) the larger and more active the table the more rewarding (in terms of performance as well as disk/cache savings) a defrag/rebuild is! Here’s the link to the whitepaper: Microsoft SQL Server 2000 Index Defragmentation Best Practices.

 

Q: In a *typical* scenario (500 tables, 15 large ones) how often would you recommend to go between rebuilding indexes? Ex. lower the fill factor until you go X days before your scan density is below the threshold.

Yep! I think I got this one during the webcast as well BUT – this is exactly my suggestion! If you can find the stable point where fragmentation only begins to occur when you’re about to rebuild then that’s the best choice. In that case you will keep your table the most compact (i.e. at the highest fillfactor) without wasting space ; at the same time you will minimize fragmentation and keep the DML operations fast! Again, the defrag whitepaper has a few additional ideas for you. Here’s the link to the whitepaper: Microsoft SQL Server 2000 Index Defragmentation Best Practices.

Categories:
Events | Indexes | Resources | Tips

Q: Is there another way to determine fragmentation? I don’t want to use DBCC SHOWCONTIG as it locks objects.

Well, this is a good question and unfortunately, there really isn’t a way to generate information such as Scan Density or Average Page Density without running DBCC SHOWCONTIG. However, if you decide that you’re only interested in Scan Density then you will be interested in the WITH FAST option for DBCC SHOWCONTIG. Using the WITH FAST option the command will only calculate the out of order extents (i.e. extent switches) and not calculate average row length, maximum row length, minimum row length, average page density, etc. and in using this option SQL Server will execute this command a lot faster thereby releasing the locks faster. However, locks will still be acquired. As an odd response though – I’d have to say that you could just defrag without looking… Meaning instead of even checking for fragmentation you could just execute DBCC INDEXDEFRAG. Since DBCC INDEXDEFRAG doesn’t use long running locks (it runs as “mini transactions” as it moves through the leaf level of the index) and doesn’t do anything when there’s no fragmentation (it really only moves data around when fragmentation exists) then you could do this without checking – and not creating blocking.

 

Q: How does the fill factor influence index structure?

FILLFACTOR – which is an option which ONLY takes effect when an index is initially built or later when an index is rebuilt or defragmented – defines the level of fullness for the leaf level of an index. Fillfactor ONLY applies to the leaf level and when set appropriately for the data and the volatility patterns of the data, it should help to minimize fragmentation thereby reducing the need to free space to be left at higher levels in the tree. However, if you also want to pad the upper levels of the tree with the same level of free space as the leaf level, you can add PAD_INDEX to the index statement and it will use the fillfactor setting for all levels of the index. For example:

 

CREATE INDEX Test

ON TestTable (TestColumn)

WITH FILLFACTOR = 80

 

Fills the leaf pages 80 percent full and the non-leaf pages are balanced such that at LEAST one entry is open on each leaf page.

 

CREATE INDEX Test

ON TestTable (TestColumn)

WITH FILLFACTOR = 80, PAD_INDEX

 

Fills BOTH the leaf and non-leaf pages 80 percent full.

 

Typically, PAD_INDEX is not needed unless you have extremely unbalanced activity and are unable to rebalance the tree (meaning you need to achieve high availability). However, I would say that PAD_INDEX is not perfect as implemented as I would rather set that value differently from fillfactor…

 

So, not a big fan of PAD_INDEX but feel it’s very important to set an APPROPRIATE fillfactor. (Review the webcast for more details on what I mean by “appropriate” fillfactor.)

 

Q: What would be the ideal settings [of fillfactor] for OLAP/DSS Databases?

To be honest, 100% is an ideal setting for a read-only decision support database, as new data is not causing splits. However, data loads (which build these reporting databases) can cause fragmentation. Ideally, you would drop your indexes, load your data and then rebuild your indexes upon completion. If you’re trying to reduce the amount of time the process takes AND you’re not loading in a lot of data (compared to what’s already there) then you might not want to drop and recreate indexes… Setting fillfactor in these cases will be based on how much data you will be loading compared with how much data you have. Set fillfactor to whatever that percentage is (if you’re loading 100,000 rows into a table which already has 900,000 rows then a fillfactor of 90 should have just enough room – IF the data is evenly distributed). If the data is NOT evenly distributed then you might want to drop this number a tad lower. AND – you only want to set this on indexes that are going to have splits. Remember, if your clustering key ever increasing and the new data coming is “appended” to the end of the table – both logically and by clustering key definition – then you only need the fillfactor setting on the nonclustered indexes. Regardless, it is still likely that you will end up with some fragmentation. Dropping the indexes, loading the data, and then rebuilding the indexes would certainly be better – but it will take more time. IF you’re loading a lot of data relative to what you already have then there’s NO question: drop the indexes first, load the data and then rebuild. When dropping indexes ALWAYS make sure to drop NC Indexes first and then drop the clustered index. When creating indexes ALWAYS make sure to create the clustered index first and then add the nonclustered. For more details on High Performance Data Loading check out the presentations on Gert Drapers’ site: www.SQLDev.Net, under Events. (FYI - There are all sorts of great presentations there!)

 

Q: When trying to determine the fill factor to set, would you recommend setting the same fillfactor to all tables or just adjust the fillfactor on the large tables and keep the small ones at the default?

I know I got this one during the webcast… But as a recap here. I would really recommend starting with the “monster” tables and then work your way down from there. Remember, you don’t really need a fillfactor unless your object is going to have splits. In many cases your clustered structures will not have a lot of splits as they might have a clustering key which is ever-increasing. Additionally, if you have all fixed width fields or you don’t have updates to varchar columns – then you might be able to keep a large table very compact at 100% in fact. The reason I mention this is that I have a customer who did exactly what you’re asking… They decided it was easier to just set the server option for fillfactor to 90% so that they’d change the default value instead of having to set it individually for every index. Unfortunately, for their largest (yes, their monster!) table it turned out that they didn’t have updates and their clustering key was on an identity column. Once they went back and changed their rebuild job to build to 100% they reclaimed 3.2 GB of disk space!

 

So, let me summarize. Only worry about fillfactor for your largest tables. Set it somewhat strategically and make sure to check on it using DBCC SHOWCONTIG (if possible) or just defrag regularly (if you just want to make sure and can’t afford the impact on availability – review the rest of the questions in “Index Fragmentation” for more details).

 

Q: Will Kimberly show the rebuild or defrag statement she used? I had to run to the restroom and missed that brief part.

I would have answered your question but I had to use the restroom myself… Seriously though, J the commands are:

            DROP and re-CREATE

            DBCC DBREINDEX (was added in 6.5)

            CREATE with DROP_EXISTING (was added in 7.0)

            DBCC INDEXDEFRAG (was added in 2000)

Personally, when I’m going to do a rebuild – I use DBCC DBREINDEX as it’s easier to automate and easier syntax in general. If my availability requirements don’t allow the table/indexes to be rebuilt because of locking, then you only have one choice for defraging – DBCC INDEXDEFRAG. I have a feeling that I cleared this up once you returned and certainly, you can review the webcast on demand here!

 

Q: Regarding sp_RebuildIndexes, does the SP need to be created in the Master Database? Our companies Best Practices recommend against it. We use an “Admin User Database” for most shared stored procedures.

Sure, you can create the sp in any database (just make sure to modify the script) and then make sure to explicitly qualify the execution with the datasbasename.owner.procedurename and everything should work. Also, there are a couple of procedures created. Make sure to review thoroughly the script.

 

Q: What type of Defrag do the data maintenance plans use?

They use DBCC INDEXDEFRAG.

 

Q: Can you use DBCC DBREINDEX if you are using transactional replication? I thought this 'broke' replication.

Checked the KB and pinged a few people and no one is aware of any issues with regard to DBREINDEX breaking replication (nor should creating/dropping indexes). In fact, because transactional replication focuses on the transactional data changes (not the page locations of the data) this operation should not even have the potential of “breaking” replication. Don’t get me wrong, it can create blocking and this in turn can impact user access to the replicated tables but it still should not break replication. 

On a related note (and probably another common replication/maintenance question I get), often people want to know how to have replicated table’s indexes rebuilt/defraged when the source table is rebuilt/defraged. And – this is a fairly simple trick. Use stored procedures to execute the rebuild/defrag and then stored procedure replication to have the execution of the stored procedure replicated to the subscribers.

Categories:
Events | Indexes | Resources | Tips

Before I launch into the Q&A from this area there are a few other resources with Index Usage Q&A that you should also check out:

            Review the Q&A from the June 11 Webcast here.

            Review the Q&A page on SQLskills here.

 

Q: How does SQL server decide what index to use when there are multiple indexes in at table?

When SQL Server receives a request for data it goes through 5 primary steps: parsing, standardization, optimization, compilation and execution. The process of optimization (while a bit simplified here) includes multiple phases as well: query evaluation, index evaluation, join evaluation and so on. During index evaluation SQL Server determines whether or not USEFUL indexes exist to aid in processing your query. The usefulness of an index is based on the selectivity of the query and the columns requested in the query… There are many things for which SQL Server looks but the idea is that the better the information it has – the better the job of optimization. To ensure that the optimizer does a good job you should make sure to have auto update statistics and auto create statistics and you should consider using tools like Index Tuning Wizard to help you to create useful indexes.

 

While this is only touching the tip of the iceberg on these concepts last month’s webcast: Indexing Best Practices – will give you a lot more information. And – if you’re really interested… I offer a 5 day course in Performance Tuning and almost 2 full days cover almost nothing but indexes (nope, not kidding!). For details on the next SQL Immersion Event, click here.

 

Q: Why does the optimizer occasionally decide NOT to use an index(s)? In a few cases, we have had to force an index (e.g select * from a (index=index name) join b on a.id = b.id…

Also, related to how the optimizer makes decisions on index usage... and a bit hard to answer with certainty BUT there are a few reasons why the optimizer might make an incorrect decision. However, before I get there – are you sure that using an index was really the “right” decision? There are often cases where a query is not selective enough to use an index and a table scan is actually a better choice… But if you find a case where your queries performance improves when you force an index and SQL Server is unable to make that same optimization choice then I’d check these things:

  • Make sure that statistics are current and up to date. First, make sure the auto create and auto update statistics database-level options are turned ON. Also, you can use these commands to get some information about statistics: DBCC SHOW_STATISTICS or the STATS_DATE function. If the statistics are up to date then you might also want to check to see if they were based on a sampling v. a full scan. In the output of DBCC SHOW_STATISTICS you can see the number of rows as well as the number of rows sampled. If the sampling is low compared to the number of rows in the table then you might want to automate an UPDATE STATISTICS with FULLSCAN. However, this can cause blocking so you’ll want to schedule it off hours.
  • Maybe you don’t have the best index for the query… consider using Index Tuning Wizard on the query in Query Analyzer.
  • Maybe consider re-writing the query.. If it’s a join consider a subquery and visa versa.
  • If the command is in a stored procedure consider recompilation techniques. See this blog entry for more details on stored procedure recompilation issues.

Well, this certainly isn’t an exhaustive list but it should help!

Categories:
Events | Indexes | Resources | Tips

NOTE: These have NOT yet been completed and there are still a few questions that need to be answered. I will wrap these up shortly and also break them down into separate blog entries so they're easier to wade through. I will also post the pdf, the webcast link (when it's available), the demo scripts and links to all of these blog entries - on SQLskills - when everything is complete. BUT - here's a good start with ALL of these resources as well as the majority of the Q&A. Have fun!

 

Q: Can I view a recording of this webcast?

It should be available on www.microsoft.com/usa/webcasts/ondemand shortly.

 

Q: Where can we get a print out of the slides?

I have posted a pdf of the presentation here as well as on the “Past Event” resources page on SQLskills.com here.

 

Q: Where can we get the demo scripts?

I have posted the demo scripts here (MSDNWebcast-Tripp-20040719-SCRIPTS.zip (68.37 KB)) as well as on the “Past Event” resources page on SQLskills.com here.

 

Q: Does Kimberly L. Tripp have a blog?

Yes, www.SQLskills.com/blogs/Kimberly and if you’re reading this on my blog…well, then you already know! J

 

Q: Do you speak at other events for more advanced tuning topics?

Yes, in fact, SQLskills has a new event coming up in October that’s a 5-day event on performance tuning on SQL Server. We’re calling these events “Immersion Events” as they’re focused, indepth and intense! Full details are on www.SQLskills.com.

 

Finally, here are all of the resource links I recommended in the resources section of the slides:

Q: What about using uniqueidentifier fields for keys in a clustered index, will this affect performance? They are random numbers so they are not monotonically increasing…

Ha, I always want to time how long it’s going to take for me to get this question. Seriously, I don’t think I’ve EVER given an indexing lecture (where I talked about the clustering key) where someone didn’t ask about GUIDs. And – once again – a great question! I do not really have a problem with GUIDs in general. I think they have a purpose in that they are relatively “random” numbers (they are not really random entirely as they get part of their value from SQL Server, part from Windows, part is based on time, part is ever increasing and part is random; however, the final value created does not have an ever-increasing or decreasing pattern to it). They truly make sense in distributed applications or those where you do not want a predictable row identifier; however, they are not always an appropriate choice for a key. But this leads me to another discussion – is the primary key always clustered and/or always need to be an identity. These three things are really three separate things and they do NOT have to be on the same column. I often prefer that all three are the same column as an identity is naturally unique, naturally static, naturally narrow as well as creates an ever-increasing pattern that minimizes splits/fragmentation due to inserts (again, only when this is also the clustering key). So I guess the question is – does it really need to be the Primary key? And well, that’s the debatable but if all three criteria come together that often gives you the largest gains. If the Primary key is not an identity (and it’s a very large transaction processing table with a lot of inserts/updates/deletes) then I would consider adding a surrogate column – solely for clustering – and make the Primary Key non-clustered. The non-clustered index will certainly become fragmented (if on a GUID) so as long as proper maintenance practices are in place you can reduce the excess disk space and improve modification performance.

 

Q: What about UPDATEs that change the row's clustering columns?

This question was related to “what can cause fragmentation” where I was discussing UPDATEs to variable width columns. While updates to variable width columns can cause a row to widen (and therefore cause a split) it’s a good point to mention that volatility in the clustering key is far worse. In fact, in last month’s ewbcast I spent quite a bit of time discussing proper placement of the clustering key – to minimize fragmentation and in general, improve performance. There are three criteria for which I look in a clustering key: the key should be Unique, Narrow and Static. As a final criteria – and this helps to proactively minimize fragmentation – I look for the key to have an ever-increasing pattern.

 

Q: Does clustering rely on the operating system for performance?

Well, I have to admit I am not entirely sure of the question here… But – I think I understand from where it might originate (well, I’ll try). First – and unfortunately (and also somewhat seriously), I think many people lack creativity in choosing a term for something – at least in the computer industry! J Here, I think the confusion is between “clustered index” and the term  OS term/feature “clustering.” The term clustered index stems from having data “clustered” in a certain order or grouping. In previous releases, we tended to create the clustered index for range queries (like on LastName) and therefore people with the same LastName were “clustered” together. However, because the internals of SQL Server changed [significantly] between 6.5 and 7.0 the internals really dictated a change in where the clustered index should be created. So, even though the term didn’t change it doesn’t entirely apply. Now the OS term/feature “clustering” applies to high availability. Specifically, failover clustering refers to a feature of the Enterprise Edition of SQL Server – which requires a higher level of Windows (typically an “Advanced”, “DataCenter”, or “Enterprise” version) to support it. This really is not the primary focus for this webcast but below are a couple of references if you want to understand more about High Availability and Failover Clustering in SQL Server:

 

Q: How well are Microsoft's standard products that use SQL optimized? Are there published studies on optimizing Microsoft CRM, Sharepoint Services, Project Server, and the various Microsoft Business Products?

This one I kind of answered during the webcast…First, I will start by mentioning that I’m NOT an expert on any of these products specifically. However, you are correct in that they are based on SQL Server and use SQL Server as their data store. What that means is that they too must be optimized and maintained; however, it is impossible for anyone to predict perfectly the exact usage of each of these products and as a result they might have a good selection of indexes but they are unlikely to have maintenance operations defined. However – again, I cannot say that I am an expert and I could be wrong! What I will say however, is that regardless of what is defined and what is maintained – you can create additional indexes, possibly drop indexes (more on this in a moment) and maintain indexes with NO application changes. The reason why I say “possibly” drop indexes is because queries/procedures, etc. will FAIL if index hints are used and the index does not exist at runtime. And while I’m fairly certain that NO Microsoft product hardcodes index hints, I have seen other third party products which do and unfortunately they become VERY hard to optimize and manage. In general, it is harder to drop indexes than add indexes. Nevertheless, it is always possible to defrag/rebuild indexes – regardless of the application whether Microsoft of not. So, while I can’t speak to each of these applications directly (although my partner works on Microsoft Project and he’s pretty sharp! J), you can always improve upon an application with better/more useful indexes (meaning more specific indexes for YOUR usage patterns and requirements) and proper/consistent/automated index maintenance.

 

Q: Is there any internal command to output query plan used by optimizer (not showplan or external tools available), which may show splits with DML operations?

Well, I’ll start by saying that these are really two different parts to your query processing. First, splits are NOT part of the optimization plan as they are only known at runtime. So, a plan really wouldn’t ever show them. As a result, I don’t believe an internal command exists to show this; however, I’m a bit confused as to why it must be “internal” and not something else? Are you just trying to determine which commands and in which tables you are having a lot of splits? (And IMO, this is the most important.). If that’s the case then you can start by using PerfMon to see if you have splits (and get a rough idea of when) and then use DBCC SHOWCONTIG to see which tables are fragmented (meaning they had the splits). Perf Mon will only tell you that you have them – not in which table they are occurring (but I will say that you might be very pleased with a few new features of SQL Server 2005 – look forward to playing with beta II Profiler and PerfMon). Right now, check out PerfMon SQL Server: Access Methods, Page Splits/sec.

 

Q: How do we use the I/O affinity option in SQL Server 2000?

Actually, I’ll let the KB take care of this one. Check out KB Article Q298402 titled: INF: Understanding How to Set the SQL Server I/O Affinity Option.

 

Q: Can we use MAXDOP with Index Creation?

No, MAXDOP as a query hint only works on DML statements. However, if you want to limit all operations to a reduced number of CPUs you can set the server setting Max Degree of Parallelism. See the sp_configure option in the Books Online as well as KB Article 329204 titled: INF: General Guidelines to Use to Configure the MAXDOP Option.

 

Q: Is there another way to determine fragmentation? I don’t want to use DBCC SHOWCONTIG as it locks objects.

Well, this is a good question and unfortunately there really isn’t a way to generate information such as Scan Density or Average Page Density without running DBCC SHOWCONTIG. However, if you decide that you’re only interested in Scan Density then you will be interested in the WITH FAST option for DBCC SHOWCONTIG. Using the WITH FAST option the command will only calculate the out of order extents (i.e. extent switches) and not calculate average row length, maximum row length, minimum row length, average page density, etc. and in using this option SQL Server will execute this command a lot faster thereby releasing the locks faster. However, locks will still be acquired. As an odd response though – I’d have to say that you could just defrag without looking… Meaning instead of even checking for fragmentation you could just execute DBCC INDEXDEFRAG. Since DBCC INDEXDEFRAG doesn’t use long running locks (it runs as “mini transactions” as it moves through the leaf level of the index) and doesn’t do anything when there’s no fragmentation (it really only moves data around when fragmentation exists) then you could do this without checking – and not creating blocking.

 

Q: How does the fill factor influence index structure?

FILLFACTOR – which is an option which ONLY takes effect when an index is initially built or later when an index is rebuilt or defragmented – defines the level of fullness for the leaf level of an index. Fillfactor ONLY applies to the leaf level and when set appropriately for the data and the volatility patterns of the data, it should help to minimize fragmentation thereby reducing the need to free space to be left at higher levels in the tree. However, if you also want to pad the upper levels of the tree with the same level of free space as the leaf level, you can add PAD_INDEX to the index statement and it will use the fillfactor setting for all levels of the index. For example:

 

CREATE INDEX Test

ON TestTable (TestColumn)

WITH FILLFACTOR = 80

 

Fills the leaf pages 80 percent full and the non-leaf pages are balanced such that at LEAST one entry is open on each leaf page.

 

CREATE INDEX Test

ON TestTable (TestColumn)

WITH FILLFACTOR = 80, PAD_INDEX

 

Fills BOTH the leaf and non-leaf pages 80 percent full.

 

Typically, PAD_INDEX is not needed unless you have extremely unbalanced activity and are unable to rebalance the tree (meaning you need to achieve high availability). However, I would say that PAD_INDEX is not perfect as implemented as I would rather set that value differently from fillfactor…

 

So, not a big fan of PAD_INDEX but feel it’s very important to set an APPROPRIATE fillfactor. (Review the webcast for more details on what I mean by “appropriate” fillfactor.)

 

Q: What would be the ideal settings [of fillfactor] for OLAP/DSS Databases?

To be honest, 100% is an ideal setting for a read-only decision support database, as new data is not causing splits. However, data loads (which build these reporting databases) can cause fragmentation. Ideally, you would drop your indexes, load your data and then rebuild your indexes upon completion. If you’re trying to reduce the amount of time the process takes AND you’re not loading in a lot of data (compared to what’s already there) then you might not want to drop and recreate indexes… Setting fillfactor in these cases will be based on how much data you will be loading compared with how much data you have. Set fillfactor to whatever that percentage is (if you’re loading 100,000 rows into a table which already has 900,000 rows then a fillfactor of 90 should have just enough room – IF the data is evenly distributed). If the data is NOT evenly distributed then you might want to drop this number a tad lower. AND – you only want to set this on indexes that are going to have splits. Remember, if your clustering key ever increasing and the new data coming is “appended” to the end of the table – both logically and by clustering key definition – then you only need the fillfactor setting on the nonclustered indexes. Regardless, it is still likely that you will end up with some fragmentation. Dropping the indexes, loading the data, and then rebuilding the indexes would certainly be better – but it will take more time. IF you’re loading a lot of data relative to what you already have then there’s NO question: drop the indexes first, load the data and then rebuild. When dropping indexes ALWAYS make sure to drop NC Indexes first and then drop the clustered index. When creating indexes ALWAYS make sure to create the clustered index first and then add the nonclustered. For more details on High Performance Data Loading check out the presentations on Gert Drapers’ site: www.SQLDev.Net, under Events. (FYI - There are all sorts of great presentations there!)

 

Q: When trying to determine the fill factor to set, would you recommend setting the same fill factor to all tables or just adjust the fill factor on the large tables and keep the small ones at the default?

I know I got this one

 

Q: Will Kimberly show the rebuild or defrag statement she used? I had to run to the restroom and missed that brief part.

I would have answered your question but I had to use the restroom myself… Seriously though, J the commands are:

            DROP and re-CREATE

            DBCC DBREINDEX (was added in 6.5)

            CREATE with DROP_EXISTING (was added in 7.0)

            DBCC INDEXDEFRAG (was added in 2000)

If I’m going to do a rebuild – I typically use DBCC DBREINDEX as it’s easier to automate and easier syntax in general. If my availability requirements don’t allow the table/indexes to be rebuilt because of locking, then you only have one choice for defraging – DBCC INDEXDEFRAG. I have a feeling that I cleared this up once you returned and certainly, you can also review the webcast on demand once it’s available!

Q: Regarding sp_RebuildIndexes, does the SP need to be created in the Master Database? Our companies Best Practices recommend against it. We use an “Admin User Database” for most shared stored procedures.

Sure, you can create the sp in any database (just make sure to modify the script) and then make sure to explicitly qualify the execution with the datasbasename.owner.procedurename and everything should work!

 

Q: What type of Defrag do the data maintenance plans use?

They use DBCC INDEXDEFRAG.

Categories:
Design | Events | Indexes | Resources | Tips

** Well, this started out as an easy “row size” write-up but I ended up adding a bunch of tangents in it and well, there's apparently more coming because as I got started I realized there's way more to this topic than I even barely touch on here... fyi :) **

Maybe not something that will make performance scream and/or improve performance by orders of magnitude but just choosing the right data type for the job goes a long way to creating a more compact and optimal row, keeping table schema stable (well thought out design shouldn't be as volatile - hopefully) and this can result in better cache utilization because the rows are more compact and more rows fit on a page. The overall table size – in terms of pages – is then smaller.

But this might not make as much sense to you yet... It's partially based on internals and the way in which SQL Server stores the data row. First, rows have a maximum length of 8060 bytes; this is based on the fact that rows are stored on pages and pages are 8K in size in SQL Server. Now, rows could possibly span pages (with a change in internals and design) but I would probably still argue against it. I can make numerous arguments that even if this were possible it's not necessarily desirable. In fact, I'm often advocate forms of “vertical” partitioning where columns are split into multiple tables in order to allow row sizes to be narrower and therefore fit more rows on a page... This leads to a narrower table where scans require fewer pages in cache. OK, vertical partitioning is another topic in and of itself. I'll touch on this toward the end. Back to row size - this 8060 byte limitation applies to ALL data types except LOB types (text, ntext, and image) as well as overhead. I'm not really interested in all of the details in the overhead as there are numerous excellent resources that describe each byte (and how they work) in great detail (for example, Inside SQL Server by Kalen Delaney) but I'm interested in the overall choices you make. Since rows cannot span pages then some consideration has to be made during table design because that implies that free space might remain if rows do not optimally fit on a page. Since many data rows have variable width data then the number of rows that fits on a page varies but the best example to understand the impact of improperly chosen types is seen if I start with fixed width rows. And - before I go too far, this is not THE silver bullet to make performance scream but a best practice that's easy to follow and HELPS lead to a well balanced system.

Take a row size of 400 bytes per row. If a page is 8k = 8192 bytes and SQL Server uses 96 bytes of overhead then the remainder of the page available for data is 8096. The SQL Server team consciously chose to restrict the total size even further (to 8060) to make room for potential overhead needed in future changes (and they are using at least 14 bytes per row for an upcoming feature in Yukon - which I won't mention now but an upcoming whitepaper I've written will go into quite a few details on this feature and the overhead needed, etc.). So - as with any good design, forethought went into overhead and internals while they weren't exactly sure how they might use those bytes in the future they kept them available. This often allows upgrades to future versions to be made simply and less painfully (and this is in fact the case on the upgrade to Yukon). It's not always perfect and there are other things that might require work to upgrade (efficiently - again, thinking ahead :) but it's always better to think about current, futures and how long you expect the design to work (and then double it :). So - suffice it to say - you get 8060 for a single inserted row but the entire 8096 including overhead if there are multiple rows on a page. Taking a simple table design where the rows are 400 bytes per row you can see that the math works out well 8096/400 = 20 rows per page with 96 bytes remaining (and my 400 bytes includes overhead - there's a header of 4 bytes, a null block, a variable block, and then offsets for each variable width column). Regardless, we have 20 rows per page and a very small amount of space remaining. If the row size were 4500 bytes then we'd have a very different story - only 1 row would fit per page and we'd end up wasting disk space. Again, with variable character data this probably wouldn't be as extreme... And, many of you would probably answer “disk space is cheap” which I agree with BUT... since the smallest unit of I/O is the page level then your free space wasted on disk translates to wasted space in cache - which is not cheap and is definitely NOT infinite. So - narrower rows where the right data type is used is a good start. I've seen designs where every time someone needs a date they choose datetime (which is 8 bytes and stores time - precise to the time tick (which is 3.33 ms)) and each time they needed a number they chose int - even if they only have 15 status codes currently and they only plan to add 20-30 more. The highest status code they're looking at is 35-40... Or let's even say 60-75... These numbers can be stored in a tinyint - which is only 1 byte. No real reason to waste 3 bytes for every row. And - I know you're thinking who cares about 3 bytes but imagine 3 bytes for each row, for 6 different columns over 87 million rows.... (It’s something like 1.5GB) which is a heck of a lot of memory and a heck of a lot of disk space...

And - just to make another point. I don't really care (from a disk space perspective) about EVERY table? But - these are good practices everywhere. They can become very important for LARGE tables though; review the “monsters” that will give you the most grief over time as they grow and change. Almost every OLTP database has lots of tables (lots of lookup tables, etc.) but often only a handful of really large tables.

If you’re interested in adding up all the bytes yourself – for a new table you can certainly do this… There are a few things to keep in mind in terms of overhead and since you’ll probably have variable width data then you probably don’t have to worry about being exact here. Just try and get the average size of actual data and add a bit of padding. If you have existing data you can get the average BYTE size of the column using:

SELECT avg(datalength(column)) FROM tablename

Make sure to use DATALENGTH – which is byte length – and not LEN – which is string length. Remember, UNICODE columns are two bytes per character.

If you have a table already – with sample data and you want to see the ROW length averages, min and max – then consider using DBCC SHOWCONTIG to see these columns: MinimumRecordSize, MaximumRecordSize, and AverageRecordSize. The only way to see this information is to use DBCC SHOWCONTIG WITH TABLERESULTS. Check out the BOL for more information as this command DOES use locking to access the data so it can cause shared lock/blocking.

OK, so I really need to wrap this up. What are some of my final thoughts in terms of this [probably one of many] mail on table design:

First, take some time to learn all of the different data types and then make sure to find the right data type for the job! Use the most optimal yet flexible data type possible. If you have 4 status codes but think you might have 100 - then go tinyint, no need for int.

Consider data access/pattern usage as well. If you are storing dates - without time then you have a few options, you could go smalldatetime (which is 4 bytes) if your dates are all greater than Jan 1, 1900 or if you don't want the hassle of dealing with time (when you only want a date) you could actually consider a char(8) for date. OK, this seems to waste space but can often make other operations easier with which to deal. A formatted string of yyyymmdd is a valid input to all of the datetime functions and can be a useful way of storing date without time. And - I have lots of opinions on datetime data... In fact, I've written a series of articles for SQL Server Magazine on exactly this topic so if you're interested check out the links here.

And what about Nullability and/or GUIDs? Ugh – I thought I’d get to these but I just can’t even go here now… This is enough for today. I’ll get motivated again soon; in fact, on my 8th flight in three weeks tomorrow - whatever shall I do on that plane?!  ;)

Categories:
Design | Tips

Well, the person then asked if we could sitdown... I said sure...

We sat down at a table and they proceeded to rummage through their bag, pull out a bunch of paper (which turned out to be a single, very, large piece of plotter paper which was folded up) and proceeded to unfold it and smooth it out on the table...

It was their ERD - yes, all 500 tables (or whatever)...

They smoothed it all out (I was getting nervous as to the length of the still pending “question” :) and then here it came.....

How would you index this?

Literally, that was it. Here's my ERD - all 500 tables... what indexes do I need.......

How do I even begin? Database design (and indexing) are NOT something with an easy and single “right” answer. In fact, one of the reasons I love working with SQL Server is that there's almost always multiple ways to handle a problem - each having different pros and cons. So - understanding more than just the data is ciritical. For example, I could have noticed a customer table in their ERD and I could have said that an index to aid in customer lookups - by name - would be a good idea. However, if they responded with “well, we don't really use the customer name - this is our demographics analysis database,” then my recommendation would have been not only useless but a negative in terms of overhead, maintenance and therefore performance.

So - what's my point - there are hundreds of things that lead up to good design, optimal indexing, and effective procedural code... a design that gets better cache utilization and one that keeps the data pages compact, logically ordered or grouped, minimizes maintenance and has efficient lookups/modifications, etc... There isn't a single right answer in design.... and I think that's good! It means that you can design for the usage patterns you anticipate (within reason).

I promise - if there were a sqlserver /faster switch or a single silver bullet - I would tell you (well, I'd have you hire me as a consultant, surf the web and IM with friends for a week, and then set the switch on Friday :)....

Seriously, how does one design for performance?

You need to know the data and more important know how it's going to be used (which I realize might NOT be perfect - in fact, I've sometimes been surprised by usages patterns once something goes into production so you might not always be right). However, the more you know about usage patterns the better you will be able to design for them and proactively reduce the possibly negative impact of them (for example data modifications can cause fragmentation...which not only creates potentially excessive freespace but also makes the modification take longer to process - and well, there's so much more to this one...).

Having said that, over my next few blog entries I'll give you a bunch of ideas and best practices for which to look...

Categories:
Design | Indexes | Opinions | Tips

Well... the performance ramifications are... not good!

Without seeing more of the proc I'd have to suggest that you avoid this like the plague! I think I can convince you with this simple example which shows why it's better to have multiple procedures and limit the flexibility a bit OR use DSE = dynamic string execution (and unfortunately even recompilation issues don't help - which I talked about here)...

And - if we go down the path of DSE I'm sure that will start a separate thread as far as "sql injection" issues/attacks and security (the user executing the procedure will need permission to directly execute the statement which is in the DSE string). But - there are a few ways to protect parameters in a stored procedure. In this procedure, the parameters will not allow “injection” because of the addition of the QUOTENAME function. This function is INVALUABLE for protecting isolated input values (which is another trick: isolate parameters if you can).

In the following “multi-purpose” procedure it becomes obvious (after testing) that ONLY DSE solves the performance problem. I could still argue that more than one procedure could be beneficial especially as it wouldn't have the access requirements that this one has (remember - users would have to have permissions to the base table). BUT that's a lot more work to maintain/call, etc. Regardless, one procedure with “all purpose parameters” is definitely NOT a good choice (without DSE). (And fyi - a future release might help with some of the permissions issues - that's all I'll say for now! :)

OK - so see the ProcedureParameters.sql (4.3 KB) procedure to play with this within the CreditSampleDB.zip (55.79 KB). You will need to modify some data to get rows to return to these queries; use the updates in the .sql script.

CREATE PROC dbo.GetMemberInfoParam
(
 @Lastname varchar(30) = NULL,
 @Firstname varchar(30) = NULL,
 @member_no int = NULL
)
AS
SET NOCOUNT ON

SELECT m.*
FROM dbo.member AS m
WHERE (m.lastname LIKE @lastname OR @lastname IS NULL)
 AND (m.member_no = @member_no OR @member_no IS NULL)
 AND (m.firstname LIKE @firstname OR @firstname IS NULL)
go

This next procedure uses dynamic string execution to build the exact and needed string. You could have built this from the client as well... and that would introduce other sql-injection issues as well. But - this is a simpe idea and I hope this helps.

CREATE PROC dbo.GetMemberInfoParamDSE
(
 @Lastname varchar(30) = NULL,
 @Firstname varchar(30) = NULL,
 @member_no int = NULL
)
AS
SET NOCOUNT ON

IF @LastName IS NULL AND @FirstName IS NULL AND @Member_no IS NULL
 RAISERROR ('You must supply at least one parameter.', 16, -1)

DECLARE @ExecStr varchar(1000)
  , @MemberNoStr varchar(100)

SELECT @ExecStr = 'SELECT m.* FROM dbo.member AS m WHERE '

IF @LastName IS NOT NULL
 SELECT @Lastname = 'm.lastname LIKE ' + QUOTENAME(@lastname, '''')
IF @FirstName IS NOT NULL
 SELECT @Firstname = 'm.firstname LIKE ' + QUOTENAME(@firstname, '''')
IF @Member_no IS NOT NULL
 SELECT @MemberNoStr = 'm.member_no = ' + convert(varchar(5), @member_no)

SELECT @ExecStr = @ExecStr + ISNULL(@LastName, ' ')
 +
 CASE
  WHEN @LastName IS NOT NULL AND @FirstName IS NOT NULL
   THEN ' AND '
  ELSE ' '
 END
 +
 ISNULL(@FirstName, ' ')
 +
 CASE
  WHEN (@LastName IS NOT NULL OR @FirstName IS NOT NULL)
     AND @MemberNoStr IS NOT NULL
   THEN ' AND '
  ELSE ' '
 END
 +
 ISNULL(@MemberNoStr, ' ')

EXEC(@ExecStr)
go

The first procedure generates a plan based on the MOST selective criteria (there's more info in the script and you'll see this in the showplan/statistics io output that I describe - also in the script). The second procedure generates a good plan for each and every execution (as expected). I think this is a good example of what you can do!

I'm going to put together one more entry to bring all of these points together as well. Stay tuned!

Have fun!

Index Management/Maintenance Questions:

Q: Is there a tool or method for monitoring a database for awhile to identify indexes that are rarely or never used – that should be considered for removing?

I think I mentioned this one late in the presentation but it certainly warrants an answer. Unfortunately, the answer is both yes and no. To start with the “no’s” – there’s nothing inside of SQL Server [today] that keeps track of index usage. One option is to create a trace and run it through ITW. ITW has an option on the first dialog that asks if you want to “Keep all existing indexes.” If you deselect this it will give recommendations to drop indexes that are not used by that workload. Be careful! You might drop an index that someone else needs. In SQL Server 2005, the ITW is being replaced by DTA (the Database Tuning Advisor) and it has a "drop-only" mode that allows DTA to only check for the unused indexes.

 

But – this still doesn’t really give you a list of what gets used, what doesn’t. I can certainly think of a way that will be expensive (in terms of profiler cost and your time J) to implement and given a bit of time this might be somewhat useful but not sure?! Here’s my idea, you could capture showplan information into a profiler trace, load it into a sql table and then look for your indexes by name (make sure to use the fully qualified name such as [Credit].[dbo].[member2].[member2Cl] as index names only need to be unique per table). You could even do an outer join with sysindexes to get the complete list of indexes related to the workload. It won’t be pretty but it can certainly be done and it’s still ONLY for the time of the workload. Overall, I think this is too expensive to do. Good news coming though… In SQL Server 2005 there will be a virtual system table that will keep track of index usage since the last server restart or since you reset the values.

 

Q: What is the best way to analyze both the space used and fragmentation of each table? More specifically, is there one procedure that will check all tables in a database, output the results and show me which tables/indexes are foobar?

I don’t think any particular command actually outputs the level of “foobar-ness” but… this one will get you very close. J Using DBCC SHOWCONTIG WITH TABLERESULTS returns almost everything you’re looking for. Use the “pages” column to calculate space ((convert(decimal(10,3), pages) * 8K)/1024 = MB), and then the AveragePageDensity (for internal fragmentation) and then ScanDensity, LogicalFragmentation and ExtentFragmentation (for different forms of external fragmentation). One of the best parts of having returned a tabular set is that you can put the data into a table, verify if/change it and even walk it to analyze it as well as run certain operations as a result of the information you learn. In fact, this is what my sp_RebuildIndexes script does. And check out DBCC ShowContig with TableResults.sql (another demo script) for the definition of the table. Both of these scripts can be found with the demo scripts: MSDNWebcast-Tripp-20040611-SCRIPTS.zip (73.19 KB). Additionally, you can find the scripts on www.SQLskills.com under Event resources.

 

Q: Does SHOWCONTIG keep a table lock? and I was told not to use DBCC SHOWCONTIG on a production server as it can adversely affect performance… Is that true?

First, yes, there is a shared table level lock acquired for the length of the scan. If you run DBCC SHOWCONTIG with the default options that scan will not only scan for external fragmentation but also calculate internal fragmentation. To reduce the time for which these locks are held you can use the WITH FAST option on DBCC SHOWCONTIG. This will ONLY do the scan (returning less fragmentation details) but it's good if all your looking for is Scan Density! So – this can create blocking and you should consider running it off hours OR consider running it on another server where you restore a backup of your production database. All fragmentation that exists when the database is backed up - will be restored since the restore does not physically change any data/pages, etc.

 

Q: Could you tell us how FILLFACTOR affects performance?

The short answer is that it helps IMPROVE performance by leaving free space on the pages when the index is built, rebuilt or defragged. When building or rebuilding the index you can specify (using FILLFACTOR) the percent fullness for each page. By leaving free space you can minimize splits, reduce fragmentation and therefore minimize the need to defrag as frequently. So – you should set the fillfactor based on the frequency with which you can rebuild… Combining the frequency with which you can rebuild and the rate at which your table becomes fragmented between rebuilds you can start to find the most optimal setting. The more fragmented it becomes then you should lower the fillfactor (on the next rebuild)... and if the table doesn’t become overly fragmented (and there’s still free space by the time of the next rebuild) then you might want to increase the fillfactor... so over time you can find the most optimal setting... but it does take a bit of work! Start with the “monster” tables first – the ones that give you the most problems and the ones that are the largest! With a bit of time you’ll have a more consistently balanced system.

 

Q: On a very large table (32 GB, 20 mil rows) with a clustered index and 10 non-clustered indexes – where would you start? If the length of time it takes to execute is long and you can only reindex or defrag a few indexes per day, which ones would you do and in what order?

Well, this is a great question! But – I’d like to change the focus a bit; if you focus on minimizing fragmentation then that will in turn keep things more optimal. Minimizing fragmentation in the clustered index is done by creating the recommended type of clustering key (unique, narrow, static and ever-increasing). If you meet all of these then the base table should only be getting fragmented due to updates to varchar. If you have lots of these then setting a fillfactor can help. The fillfactor defines the amount the pages should be filled to when the index is rebuilt or defraged. If your index is becoming very fragmented between builds then you should consider increasing this frequency or lowering the fillfactor. So – this is really what you should look into: verifying that you’re setting the fillfactor appropriately between rebuilds/reindexes and then adjusting that schedule appropriately!

 

But – when you still have a lot of fragmentation and still need to do frequent rebuilds – focus on the most important tables first, then the most important indexes first (i.e. start with the clustered). From there it’s hard to say but the wider indexes are probably getting used more so I’d look at those next.

 

And – one other tip – if some of you are getting a lot of fragmentation due to updates, think about: fixed length fields (which is rarely an option), default values (GREAT choice if the updates come in with no value and later get updated) and rebuilds either more frequently or with a more appropriate fillfactor.

Categories:
Events | Indexes | Resources | Tips

Index Creation Questions:

Q: In an OLTP server do clustered indexes create a negative performance impact to the constant inserts and updates?

The wrong clustered index can for sure… In most environments, having the right clustered index (an index on an ever-increasing key) will be better than a heap (a table without a clustered index) and much better than the wrong clustered index (a clustered index that creates wildly random access patterns for insert – i.e. LastName). So – the idea with the clustered index debate (as I often refer to this as J) is that the right clustered index balances an improvement in performance for inserts (by minimizing splits as well as having the needed pages already in cache – and by needing less cache overall) with keeping the table more compact (i.e. less fragmentation). However, if you have a high volume of inserts (typically more than 400 per second) then you might want to add additional hot spots (cluster on a composite key like “state, id” which creates as many hot spots as you deal with states) to better balance this increased volume. I still wouldn’t go with something random (like name) and if the table has a lot of scans/reads I certainly wouldn’t go with a heap. You do have to be careful with updates though – if there’s a high volume of updates and the updates are to varchar then there are a few things I’d think about: fixed length fields (which is rarely an option), default values (GREAT choice if the updates come in with no value and later get updated) and rebuilds with a more appropriate fillfactor.

 

Q: Is a unique key the same as a primary key?

Almost… SQL Server has two ways of enforcing entity integrity: the Primary Key (of which there can only be one per table) and Unique Key(s) (of which there can be many – 249 is the maximum number of non-clustered indexes so unique keys and other non-clustered indexes combined can be no more than 249).

 

Q: I thought you said the primary key constraint automatically gets a clustered index, how can you make it non-clustered key?

When specifying index-based constraints (Primary Key and Unique Key) you can specify index type. The default index type (when not specified) is Clustered for the Primary Key and non-clustered for a unique key. Here’s complete syntax for adding these keys to existing tables.

 

ALTER TABLE dbo.Employee
  ADD CONSTRAINT EmployeePK
    PRIMARY KEY CLUSTERED (EmployeeID)

 

ALTER TABLE dbo.Employee
  ADD CONSTRAINT EmployeeSSNUK
    UNIQUE NONCLUSTERED (SSN)

 

In either case you can CHANGE the underlined part of the syntax to change the index type.

 

Q: How is a unique key created? Explicitly or is it Implied via Indexes?

Well, I have to admit I’m not entirely sure of the question here but I think I do… When you create a unique key constraint SQL Server enforces that uniqueness by automatically adding a unique non-clustered index to the table.

 

Q: Are GUIDs as efficient as identity for the clustering key?

Not really. They do meet the criteria of being static and unique however, they’re not as narrow (they’re 16 bytes) and if the newid function is used to populate the values (which is the common approach) then the values are not ever-increasing. With the values populated by the newid function you can end up with a lot of fragmentation quickly. There are alternatives to the newid function for generating the GUID value as an incrementing value through a windows call. Gert Drapers has an extended procedure to help you do this on www.SQLDev.net. I’m not sure of the specific link but I will update this here when I do!

 

Q: When you say to avoid a “volatile” clustering key, do you mean the data value actually changing/being updated?

Yes! Since the clustering key is in EVERY nonclustered index you want to choose it wisely and make sure that it’s relatively static. If it changes that will require all of the redundant versions of the value to change as well…

 

Q: If you create an identity column in a table just for the sake of creating a clustered index on it, how does this improve performance (especially if most of the queries use other fields in the table)?

Another great question! Remember two things: the internals of SQL Server rely heavily on the clustering key (even if you don’t) and if the table stays more compact (by not becoming fragmented) then queries [especially large scan queries] improve. So – it’s a better balance. Now to add a bit more – in many cases having the right nonclustered index can yield better performance for most queries anyway!

 

Q: You said in the power point slides that generally a few wide indexes are better than several narrow; However, later you said that one narrow clustered index on an id column + non clustered indexes is the way to go. So which is it? Or did I not really understand?

You’ve got it – it’s a combination of the two. The clustered should be narrow/static/unique but non-clustered indexes tend to be more effective when they are slightly wider (i.e. they contain a couple to a few columns rather than just one) so that they can service more types of query requests.

 

Q: Do I need to change my indexes if I use the like operator? I noticed that I get Index Scans instead of seeks when I use LastName like '%johnson%' v. LastName = 'johnson'?

Well, this one is kind of mixed. But – let me give you something to think about (which might help)… Think about the phone book for a second – if you have someone’s last name it’s pretty easy to find them. However, if you only know they have “johnson” somewhere in their name it’s a lot harder, right. This could yield ‘Cajohnson’, ‘Ejohnson’, ‘Smith-Johnson’ and so forth. So – there’s really no effective way to get at that data… SQL Server will always need to scan for that type of request (now I could add that covering indexes might help but quite honestly I think there’s a better option here). SO - changing index strategy really won’t help. But – I often see applications that just automatically add the % before and after the requested data. If this is the case, can you ask the user if it’s a real value or a pattern value? Or can you even allow them to supply the exact request? When the value is specific (i.e. Johnson) then the query will be fast, when the value is pattern matching (i.e. %johnson%) then SQL Server will need to scan and the query won’t be as fast – but it will be correct! If you want to search searching for more wildcards and more complex pattern matching into descriptions, etc. then you’ll probably want to look into Full-text Indexing.

Categories:
Events | Indexes | Resources | Tips

Tools Questions:

Q: Isn't there some issue with profiler where it does not recognize DBName but only the DBID?

Well, this is a good one – and a frustrating one as well. It is true that Profiler doesn’t always generate a data column value for every type of event. So – if you setup filters you will only filter rows which have a value for that data column. If you ONLY want to see things with a very specific value it can take quite a bit of time to remove all of the uninteresting events to narrow it down to only what you need. In the SQL Server Books Online there is a topic for each category – for example: Stored Procedures Event Category has a listing of all of the data columns that are produced for each event. This is a bit tedious to go through though and there’s a nice full html version that’s available on Gert Draper’s website: www.SQLDev.net. Checkout this link specifically: http://sqldev.net/misc/SQLTraceEventMatrix.htm

 

Q: Is it true that ITW is beneficial because it can predict how the optimizer will benefit most from the indexes it recommends? Sometimes I find that the ITW recommends statistics that hinder the query performance. Do you feel comfortable using it for the most part?

Ok, so there are really two questions here – first “how does ITW work?” and second “is it always a good thing?”

 

As for how it works – yes, you’re correct… ITW hypothesizes about indexes by creating statistics about your data, looking at the queries in the workload and then based on how the optimizer would optimize – it tells you which indexes would be best serve your data!

 

As for whether or not it’s always a good thing… the first thing I’ll jokingly say is – Is there ever an ALWAYS good, good thing? J Joking aside, there is some truth to this (although rare). I do feel comfortable using ITW and because the statistics are based on real data, not updated if they’re not used (there’s a good whitepaper that explains statistics here), so in general, statistics don't usually hinder query (or server) performance. However, having an extremely large number of statistics and indexes can increase optimization time (in analyzing all of these choices) and this may in fact be something you’re seeing.

 

Q: When statistics exists on some of my client tables, SQL Server sometimes chooses to use an index that is less beneficial (even after we update stats with full scan). In some cases, we've had to force the index using optimizer hints. As a result, we disable "AUTO CREATE" and only create indexes when performance issues arise and are needed. Granted this isn't the best method but until the tables become really large we find that some queries work fine.

Ok, well there’s a lot to this one. First, I’d say that there’s a difference between just having statistics and having the right indexes. There are some cases where ONLY having statistics isn’t good enough to give SQL Server enough information to proceed and instead they make incorrect decisions. It’s generally when SQL Server only has statistics for individual columns and doesn’t have better information for a combination of columns (wow, this is hard to describe in a short Q&A). But – I’ll sum it up to say that statistics aren’t always perfect – some assumptions may need to be made and sometimes those assumptions don’t work out because the data isn’t evenly distributed… So, if the indexes that exist are not "perfect" for the query SQL Server ends up not doing as good of a job as it might if it had the right indexes... If it helps, I have a class where I get into more details on statistics (a whole module on just statistics J) that explains a lot more details... It's hard to sum up in a sentence or two BUT I would guess that a bit more tuning (i.e. more real indexes and not just statistics) might help solve the problem... Really, hard to say in a quick chat but that really shouldn't be the majority of the time.

Categories:
Events | Indexes | Resources | Tips

Related Resources:

Q: Can I view a recording of this webcast?

It should be available on www.microsoft.com/usa/webcasts/ondemand shortly.

 

Q: Where can we get the powerpoint file?

The powerpoint will be available on the MSDN webcast resource page. I will post the exact link here and on SQLskills.com as soon as the final link is available.

 

Q: Where can we get the demo scripts?

The demo scripts will also be available on the MSDN webcast resource page but since it’s a small file I will also place them here MSDNWebcast-Tripp-20040611-SCRIPTS.zip (73.19 KB). Additionally, you can find the scripts on www.SQLskills.com under Event resources.

 

Q: Where can we get BPA – the Best Practices Analyzer?

The BPA is on the front page http://www.microsoft.com/sql and the download is specifically here.

 

Q: Can you recommend a good source for information on optimizing full-text indexes and searches?

Start with the Books Online topic: Full-text Query Architecture and then make sure to check out this MSDN article by Andrew B. Cencini titled: Building Search Applications for the Web Using Microsoft SQL Server 2000 Full-Text Search.

 

Q: Do you speak in other events for more advanced tuning topics?

Yes, in fact, SQLskills has a new event coming up in October that’s a 5-day event on performance tuning on SQL Server. We’ve just announced this event in June and for those of you who become subscribers on SQLskills and sign up by July 1st you’ll receive the lowest early bird cost for the event. We’re calling these events “Immersion Events” as they’re focused, indepth and intense! Full details are on www.SQLskills.com.

 

Finally, here are all of the resource links I recommended in the resources section of the slides:

 

Q: Does Kimberly L. Tripp have a blog?

Yes, www.SQLskills.com/blogs/Kimberly and if you’re reading this on my blog…well, then you already know! J

Categories:
Events | Indexes | Resources | Tips

All sorts of purposes: maintenance, rebuilding the database objects, dropping/recreating the database in test environments, changing a state option like readonly (did you know that no users can be using the database when you make this change to RO)....

Anyway, lots of reasons and this has always been the case. In previous releases we used to create sp_ (you know this is a special naming convention for objects in master that can be referenced in any database WITHOUT three-part naming) stored procedures that would detect the users using a database and either KILL spid directly or do a net send and then use WAITFOR DELAY to give the users a couple of seconds/minutes/whatever, to complete. Well, procedures like that are still useful (because you can set the database to “dbo use only,” warn the user, and then give them some time... However, warning them typically uses xp_cmdshell (which a lot of you don't allow) and sometimes you don't want to wait and/or give them much time :) :).

So - did you know that ALTER DATABASE in SQL Server 2000 has been changed to allow termination options for STATE changes. The termination options allow one of three types of termination:

  • Give the users n number of seconds to wrap things up
    • ALTER DATBASE dbname SET READ_ONLY WITH ROLLBACK AFTER
  • Kick them out immediately
    • ALTER DATBASE dbname SET READ_ONLY WITH ROLLBACK IMMEDIATE
  • Or, optionally you can have your process NOT WAIT
    • ALTER DATBASE dbname SET READ_ONLY WITH NO_WAIT

If no options are set and the process cannot get exclusive use of the database then it will till it can - indefinitely. Some database option changes generate a “could not get exclusive use” error but READ_ONLY (and a few others) don't. If this is in an automated script/process this can cause you a lot of grief. (In fact, I'm adding this note about 90 minutes after I created this blog entry...because my QA window is still trying to get exclusive access to pubs - which I was setting to readonly just to see if it would eventually time out....well, 90 minutes and still running it hasn't. I think I'm going to kill it!)

For a good example of how to change certain state changes when creating batch processes see, DB Alter for Batch Operation.sql (6.3 KB).

Categories:
Tips

Have you ever been watching a profiler trace only to find what you're looking for, select it and then all of a sudden get "auto-scrolled" down to the end... UGH - that's the default! But - Profiler does have an icon for stopping that behavior. (and since I generally (and personally) HATE most icons cause I can't ever seem to tell what the heck they are) I'll just tell you it's the 5th one from the RIGHT. It's titled auto scroll window (at least I like the "tool-tips"). Click it to turn the option off (I wish that were the default)!

Categories:
Tips

OK, so I can't tell you how fast I type... I have no clue! But - I do know that people get made at me on IM chats because they (well, most of them) cannot keep up (especially my Father!!!)... And the key reason for why??! I tend to stay on the keyboard instead of constantly going for the mouse; I love learning shortcuts/keystrokes, etc. and I like creating them as well... SO - first, did you know that the SQL UE (User Education) Team documented many of the keyboard shortcuts? In the books online click on the small keyboard at the top of the right pane window. It links to "Using the SQL Server Tools, User Interface Reference, Keyboard Shortcuts."
 
Second, did you know that you can create your own keyboard keystrokes within Query Analyzer? Check out Tools, Customize. In fact, Alt+F1 already exists (and cannot be changed) as object help. Highlight an object in the QA window and then press Alt+F1 - this will execute sp_help objectname. If you want keyword help (meaning a SQL keyword), highlight the keyword (like SELECT) and press Shift+F1 - this will give you context sensitive books online entry. I typically setup Ctrl+F1 as sp_helpindex (be careful - if you get an extra space in there it won't work!) and Ctrl+3 as sp_helpconstraint.

Categories:
Tips

First, do you have the latest version of the SQL Server Books Online? Make sure you do: http://www.microsoft.com/sql/techinfo/productdoc/2000/books.asp
Second, do you ever need the info when you're not at your desk? Make sure you know where they are: http://msdn.microsoft.com/library/default.asp?url=/library/en-us/startsql/portal_7ap1.asp

Categories:
Tips

In a discussion with Scott about the "text in row" option for LOB (Large Object Data types: text, ntext, and image) data he questioned whether or not it was really a good idea to set... Well, I have mixed feelings about it really. In general I don't recommend it unless EVERY query tends to want the LOB data and if the LOB data is in general - narrow (meaning very few bytes). Yes, you do skip the extra lookup into the text/image b-tree but you also make all of the rows wider. If you don't want the text info most of the time (i.e. the description is only needed when they [occassionally] click on that button for the request, etc.) then you're putting all of those descriptions into memory when you don't really need to. It's [potentially] a great way to waste a large percentage of your cache.
 
So - I'd say that there really isn't a byte limit more than a usage rule. BUT - if you do want to skip that lookup (because most of the time you DO want to return the description) then I'd start to look at the data and see what the "average" byte length is. If most of the descriptions are small but they want to use the text datatype for the very rare and very large description then this is a GREAT feature to keep the small (and very frequently accessed descriptions) with the data row and avoid the lookups!
 
(fyi - the b-tree index is just like any other; it's the "index" associated with the table (object_id('tablename') = id) and has an indid of 255 in sysindexes. Each table has only ONE btree for all LOB columns of that table and seeing the indexes, statistics, etc. that each table has is interesting if you ever look at the system tables.)

Categories:
Tips

Theme design by Nukeation based on Jelle Druyts