Saturday, March 13, 2010 1:42 PM
Database Design - Surrogate keys: Part 1 of many (Rules for Surrogate Keys, E. F. Codd and C J Date research and problems they solve)
I started writing an article for my blog on surrogate keys drawing in the original research by E F Codd and C J Date, its getting a bit big :) so I'm going to chop it up into a number of posts over the coming weeks depending on my time.
I'm interested in your thoughts and if you disagree please let me know but more importantly give me references so I can go back and read what you've read and either modify my position or not :), either way it will improve the article.
Hope it makes sense.
There are two factions in the world of Database Design that enact wars against the use of surrogate keys, there is a lot of misunderstanding around what a surrogate key actually is and the need for them.
This article intends to draw a line under that battle and defend once and for all the use of surrogate keys, to do that we need to look at what problems they solve, the theory and rationale behind them and how to implement them in your chosen product.
I'm going to cover a lot:
- Rules for a Surrogate key
- Problems they solve
- The "Theory" - we will look at the research done primarily by E. F. Codd and C. J. Date but will also look at other authors and their take on the subject.
- How to implement a surrogate key
- o Database Design (includes a discussion on Normalisation and when you'd introduce a surrogate)
- o Techniques for generating the surrogate key value (IDENTITY, NEWID(), NEWSEQUENTIALID, Hashing and do it yourself)
- o Considerations around Referential Integrity
- o Considerations around External Verification of the surrogate key value and why even the notion is invalid
- o Locking Considerations - effect on concurrency
- Example Tricks using the surrogate key for example Table Partitioning
Rules for a Surrogate Key
- They must never be related (directly or indirectly) to the physical hardware or storage - to use the Relational phrase - they are not tuple-id's (row/record locators).
- Once a value has been used - it must never be reused by something else - basically once the value "5" has been used, then the surrogate key generator must never generate the number "5" again even if the original row has been deleted.
- They are read only - once generated that value must never be changed.
- The value must be atomic, that means no sets - just a constant for instance "5".
- The value should never be exposed outside the limits of the application as a whole (the database is not a black box) - I explain more later but basically the User of your application for instance the Call Centre staff entering client details, the external machine calling your web service should never see the surrogate key value (they should see one of the candidate key values instead) - that is because the surrogate key cannot be used for verification purposes but the candidate key can.
- A surrogate key can be the sole key for a table in a situation where there are no natural occurring candidate keys for example a Message Board. When used in this scenario arguably it’s no longer a surrogate key but an artificial one but I would still follow rules 1 – 4. Having said that you need to consider verification for instance if the rows were re-inserted and the artificial key re-generated then the same rows (entities) should get the same artificial key value; there are hashing solutions to provide that facility but the obvious problem of duplicate rows come into play – that discussion is outside the scope of surrogate keys.
The problem they solve
Surrogate keys form an abstraction from the candidate keys on a table, it can be argued that creating a surrogate key on a table where a single column candidate key exists makes no sense and that their real use is for making composite keys simpler to use throughout the application - a fair point but there is more to consider.
I'd say the real benefit of a surrogate key is in the abstraction it provides thus removing concurrency problems caused should the candidate key values you've used in foreign key clauses in related tables change - a surrogate key value once generated can never change and its value can never be reused even if the rows have been deleted (Codd and Date - see later).
I've summarised below some of the main points I see surrogates provide us:
- Joins between tables are done on a single expression unlike say a join using a composite key for instance instead of writing: ON Manufacturers.Maker = Registrations.Maker AND Manufacturers.Model = Registrations.Model you would write ON Manufacturers.ManufacturerID = Registrations.ManufacturerID.
- Reduce concurrency problems caused when values of existing candidate key being used in foreign key references change.
- Removes any possibility of inconsistencies in data type, length and collation creeping into the schema, for instance in the Manufacturers table the column Maker is defined as varchar(250) with a collation of Latin CI, in the Registrations table the column Maker is defined varchar(200) with a collation of Spanish.
- Join performance, this is really debatable; with today's hardware and improved code efficiency within the Optimiser and Storage Engine I don't think this really matters (that much) anymore.
- The surrogate key can provide a method for partitioning of your data - I'll show that later with an example of using partitioned views (as opposed to the Enterprise Edition feature Partitioning). The effect of partitioning your data helps reduce locking contention both within the storage engine (allocating new extents) and for ACID in terms of consistency - less blocking!
E. F. Codd and C. J. Date are seen as the parents of the Relational Model, their initial and continued research are the underpinnings of Database Design and why Relational Databases exist.
Codd on Surrogate Keys
In the paper ACM Transactions on Database Systems, Vol. 4, No. 4, December 1979 pages 409 - 410 Codd introduces the concept of a surrogate key, the key points of his text are as follows:
"There are three difficulties in employing user-controlled keys as permanent surrogates for entities.
(1) The actual values of user-controlled keys are determined by users and must therefore be subject to change by them (e.g., if two companies merge, the two employee databases might be combined with the result that some or all of the serial numbers might be changed).
(2) Two relations may have user-controlled keys defined on distinct domains (e.g., one uses social security, while the other uses employee serial number) and yet the entities denoted are the same.
(3) It may be necessary to carry information about an entity either before it has been assigned a user-controlled key value or after it has ceased to have one (e.g., an applicant for a job and a retiree).
These difficulties have the important consequence that an equi-join on common key values may not yield the same result as a join on common entities. A solution-proposed in part in  and more fully in -is to introduce entity domains which contain system-assigned surrogates. Database users may cause the system to generate or delete a surrogate, but they have no control over its value, nor is its value ever displayed to them.
Users will often need entity identifiers (such as part serial numbers) that are totally under their control, although they are no longer compelled to invent a user-controlled key if they do not wish to.
They will have to remember, however, that it is now the surrogate that is the primary key and provides truly permanent identification of each entity. The capability of making equi-joins on surrogates implies that users see the headings of such columns but not the specific values in those columns."
What Codd means by "User of the database" is open to debate in respect of whether the surrogate value should be displayed to them, C J Date makes the point that if the value is not displayed it breaks Codd's own Information Principal. I believe the User of the database is not the Developer or Database Administrators - they are not users of the Database - they build systems that have users a user being a person using the application or a machine using a web service for instance.
His point about "The capability of making equi-joins on surrogates implies that users see the headings of such columns" means that we do see the columns - we'd have to in order to join our tables together! Ok, some products do have this ability - they internally have their own surrogate keys that are not exposed and are used in joins under the covers, but that is not a requirement by Codd just a nice feature of those products - in SQL Server land we unfortunately do not have that feature.
Date on Surrogate Keys
From C. J. Date's book "An Introduction to Database Systems, C J Date, 8th Edition":
Page 434, reference 14.21 - he refers to P. Hall, J. Owlett and S J P Todd "Relations and Entities" and says:
"Surrogate Keys are keys in the usual Relational sense but have the following specific properties:
They always involve exactly one attribute.
Their values serve solely as surrogate (hence the name) for the entities they stand for ......
When a new entity is inserted into the database, it is given a surrogate key value that has never been used before and will never be used again, even if the entity in question is subsequently deleted.
Ideally surrogate keys value would be system-generated.
Surrogates must not be concealed from the user because of the Information Principal - tuple ID's should be however."
"In a nutshell: Surrogates are a model concept; tuple IDs are an implementation concept"
C J Date makes the distinction between a tuple ID (basically a row ID) and a surrogate key; the row ID changes as rows get removed from the table - that should never be exposed from the database even for developers or Database Administrators, the SQL language has the ROW_NUMBER() function to provide us with a numbering system for a result.
Filed under: Database Design