The question is not ‘when should the PK be NC’, but instead you should ask ‘what is the proper key for the clustered index’?
And the answer really depends on how do you query the data. The clustered index has an advantage over all other indexes: since it always includes all columns, is always covering. Therefore queries that can leverage the clustered index certainly do not need to use lookups to satisfy some of the projected columns and/or predicates.
Another piece of the puzzle is how can an index be used? There are three typical patterns:
- probes, when a single key value is seek-ed in the index
- range scans, when a range of key values is retrieved
- order by requirements, when an index can satisfy an order by w/o requiring a stop-and-go sort
So if you analyze your expected load (the queries) and discover that a large number of queries would use a particular index because they use a certain pattern of access that benefits from an index, it makes sense to propose that index as the clustered index.
Yet another factor is that the clustered index key is the lookup key used by all non-clustered indices and therefore a wide clustered index key creates a ripple effect and widens all the non-clustered indices and wide indices mean more pages, more I/O, more memory, less goodness.
A good clustered index is stable, it does not change during the lifetime of the entity, because a change in the clustered index key values means the row has to be deleted and inserted back.
And a good clustered index grows in order not randomly (each newly inserted key value is larger than the preceding value) as to avoid page splits and fragmentation (without messing around with
So now that we know what a good clustered index key is, does the primary key (which is a data modelling logical property) match the requirements? If yes, then the PK should be clustered. If no, then the PK should be non-clustered.
To give an example, consider a sales facts table. Each entry has an ID that is the primary key. But the vast majority of queries ask for data between a date and another date, therefore the best clustered index key would be the sales date, not the ID. Another example of having a different clustered index from the primary key is a very low selectivity key, like a ‘category’, or a ‘state’, a key with only very few distinct values. Having a clustered index key with this low selectivity key as the leftmost key, e.g.
(state, id), often makes sense because of ranges scans that look for all entries in a particular ‘state’.
One last note about the possibility of a non-clustered primary key over a heap (i.e. there is no clustered index at all). This may be a valid scenario, the typical reason is when bulk insert performance is critical, since heaps have significantly better bulk insert throughput when compared to clustered indices.
The basic reason to use Clustered indexes is stated on Wikipedia:
Clustering alters the data block into a certain distinct order to match the index, resulting in the row data being stored in order. Therefore, only one clustered index can be created on a given database table. Clustered indices can greatly increase overall speed of retrieval, but usually only where the data is accessed sequentially in the same or reverse order of the clustered index, or when a range of items is selected.
Say that I have a table of People, and these people have a Country column and a unique Primary Key. It’s a demographics table, so these are the only things I care about; what Country and how many unique people are tied to that country.
I am thus only ever likely to SELECT WHERE or ORDER BY the Country column; a clustered index on the Primary Key doesn’t do me any good, I’m not accessing this data by PK, I’m accessing it by this other column. Since I can only have one clustered index on a table, declaring my PK as Clustered would prevent me from using a Clustered Index on Country.
In addition, here’s a good article on Clustered vs Nonclustered Indexes, turns out clustered indexes caused insert performance issues in SQL Server 6.5 (which at least hopefully isn’t relevant for most of us here).
If you put a clustered index on an IDENTITY column, then all of your inserts will happen on the last page of the table – and that page is locked for the duration of each IDENTITY. No big deal… unless you have 5000 people that all want the last page. Then you have a lot contention for that page
Note that this isn’t the case in later versions.
If your primary key is of the
UNIQUEIDENTIFIER, make sure to specify that it’s
NONCLUSTERED. If you make it clustered, every insert will have to do a bunch of shuffling of records to insert the new row in the correct position. This will tank performance.