Thursday, May 11, 2006 12:10 PM tonyrogerson

Surrogate V Natural Key Usage

Introduction

This is a common problem within database design and implementation and often discussed with great passion within the community.

Most tables, allthough not all have a natural key, with some however its just not practical to use the natural key, for instance a membership database of a user group like mine (sqlserverfaq.com). One school of thought dictates you should always use the natural key throughout the design including foriegn keys, the second school of thought dictates you should have the natural key once and then use surrogates throughout your design.

The rest of this entry is quoted un-edited from usenet posts and shows the arguments on both sides.

My Opinion

The main opponent to the surrogate key crowd (--Celko--) fails to grasp the implementation of the logical model in a real business environment, we need to consider performance and concurrency of our design implementation. Using Natural Keys as foriegn key references causes a number of problems, notably performance for long and composite keys and join complexity for composite keys. It is difficult for applications to efficiently use a composite key of which most pure natural keys usually are. Also, what happens when the natural key changes, it will cause a very big transaction updating all those foreign key references, and the application will have very significant problems and possibly update the wrong data because you are referencing the natural key which may now have changed!

Surrogate keys are an answer to this performance and concurrency problem. They never change, never need updating and are consistent within the database and application, the application can very easily use them too. A surrogate key is using just an integer number or some people use guid's in a distributed environment. Auto-populating of the surrogate key is simple, you can use the IDENTITY property which incurrs no locking penalty unlike MAX( x ) + 1. Guid's present their own problem, they aren't easily readable so application debuging is difficult, also there generation is not sequential so you will get bad fragmentation on any index you use on them, and lets face it the surrogates will need indexing!

My own typical implementation as DDL is as follows :-

create table client (
   id int not null IDENTITY constraint sk_client unique clustered
,
   client_name nvarchar(100) not null constraint pk_client primary key
nonclustered
)

create table trade (
   client_id int not null references client( id
),
   trade_date int not null check( isdate( trade_date ) = 1
)
)

insert client( client_name ) values( 'tony rogerson' )
insert trade( trade_date, client_id ) values( 20060510, scope_identity() )

References

Quotes from 'Surrogate or Natural Keys?', 28th April 2006

Tony Rogerson reply to --Celko--

> An appeal to authority, with a quote from Dr. Codd: "..Database users
> may cause the system to generate or delete a surrogate, but they have
> no control over its value, nor is its value ever displayed to them
> ..."(Dr. Codd in ACM TODS, pp 409-410) and Codd, E. (1979), Extending
> the database relational model to capture more meaning.  ACM
> Transactions on Database Systems, 4(4).  pp. 397-434.

You've just described how people use IDENTITY as a surrogate correctly. The
column with the IDENTITY property can not be changed, users have no control
over its value.

If used as a surrogate it will never be displayed to the user, simply used
internally by BOTH the database and application for performance and
scalability.

-------------

David Portas reply to --Celko--

> An appeal to authority, with a quote from Dr. Codd: "..Database users
> may cause the system to generate or delete a surrogate, but they have
> no control over its value, nor is its value ever displayed to them
> ..."(Dr. Codd in ACM TODS, pp 409-410) and Codd, E. (1979), Extending
> the database relational model to capture more meaning.  ACM
> Transactions on Database Systems, 4(4).  pp. 397-434.

> This means that a surrogate ought to act like an index; created by the
> user, managed by the system and NEVER seen by a user.  That means never
> used in queries, DRI or anything else that a user does.

However, in that same paper, Codd goes on to say that joins and other
relational operations ARE permitted on the surrogate keys and that "it
is now the surrogate that is the primary key and provides truly
permanent identification of each entity" so Codd's surrogates are most
definitely part of the logical model and are definitely NOT what you
say is "like an index".

It's difficult to see how to reconcile Codd's surrogate key idea with
the rest of his model. The consequences of admitting "special data"
that isn't represented as real values in relations seem too serious.
Perhaps that's why I don't know of anyone other than you who has
anything to say about Codd-style surrogates. The definition that I
think most people are more familiar with is the one given by Date (in
An Introduction to Database Systems) who says "Surrogate keys are keys
in the usual relational sense" and "Tuple IDs are usually concealed
from the user, while surrogates must not be (because of The Information
Principle)".

--Celko-- reply to David Portas

>> Codd goes on to say that joins and other relational operations ARE permitted on the surrogate keys and that "it is now the surrogate that is the primary key and provides truly permanent identification of each entity" <<

Surrogate means "something which acts in place of another", so of
course the relational operations can be performed on them.  Would you
hire a sterile surrogate mother :)?  However, the system gets to pick
the method.  Consider a covering index.  I do not have to read the base
table to do an EXISTS() or other operations.  IThe index is my
surrogate.  A stronger version in Sybase is the PK-FK "key joins" and
the Informix "pre-join" indexes.  These produce pointer chains,. so I
can read one base table and get the rest of the data from the fact that
a relationship exists without ever seeing it.

>> It's difficult to see how to reconcile Codd's surrogate key idea with the rest of his model. The consequences of admitting "special data" that isn't represented as real values in relations seem too serious.  <<

That is why they have to be like indexes, hashing, pointer chains or
other access methods.  Hidden from the user and handled **completely**
by the system.

>> most people are more familiar with is the one given by Date <<

You have to watch out for Date versus Codd (both RM1 and RM2).   They
disagree on NULLs and quite a few other things.  Date's surrogate key
was what other people would call an artificial key at one point, then
he made some changes and I do not know what his position is now.  I am
still trying to clean up "scalar" versus "atomic" in my own notes!

David Portas reply to --Celko--

That is not Codd's position as I understand it. Where does he say that
another key is required? He says "Users [...] are no longer compelled
to invent a user-controlled key if they do not wish to". This idea is a
non-starter in my opinion. For example the projection consisting of the
surrogate key alone has no exposed key at all. How can such a relation
be completely hidden from the user? It cannot be, nor can it be a
proper relation as far as I can see. I have no problem with the idea of
indexes of course but in my view Codd's surrogates can only weaken RM.
They are hidden pointer values that do not belong in the logical model
where he put them.

> >> most people are more familiar with is the one given by Date <<

> You have to watch out for Date versus Codd (both RM1 and RM2).   They
> disagree on NULLs and quite a few other things.  Date's surrogate key
> was what other people would call an artificial key at one point, then
> he made some changes and I do not know what his position is now.

They do not agree because the field has moved on a great deal in 30
years. One of the things that got left behind was Codd's idea of
surrogates. Can you name any reference since Codd that refers to
surrogate keys that meet his definition? I can't.

Filed under: ,

Comments

# re: Surrogate V Natural Key Usage

Thursday, May 11, 2006 4:01 PM by Colin Leversuch-Roberts

In practical use I've found a mixture of both usually work best, the other point is so few systems are ideal anyway - there's still a number of dbase apps ported to sql which are pretty dire!!
Where I do get concerned is where the first column a developer places for a table is an identity regardless of whether it has a requirement. Nothing more frustating to join against an identity column to return an integer from the table with the same value as the identity - and yes I've seen it in prod systems many times!!!

# Subquery returned more than 1 value

Wednesday, May 31, 2006 6:00 PM by Tony Rogerson's ramblings

I've had so many referrals from google on this error because of my previous blog entry on the T-SQL Value...

# re: Surrogate V Natural Key Usage

Saturday, July 8, 2006 1:12 PM by Shan Plourde

I do prefer typically to use surrogate keys for several reasons, two of which are database insularity and agility. I think that some naturally selected keys are prone to these types of problems. Definitely lots of things to consider!

# Surrogate vs. NaturalKeys - References

Friday, February 2, 2007 7:35 PM by Mystery Blogger

Although not a complete listing, this is a good start for anyone interested in the debate over natural

# It works only in theory? Or does it? &laquo; ym-group.com

Tuesday, January 13, 2009 3:23 PM by It works only in theory? Or does it? « ym-group.com

Pingback from  It works only in theory? Or does it? &laquo;  ym-group.com

# DELETED and INSERTED tables in UPDATE TRIGGER | keyongtech

Pingback from  DELETED and INSERTED tables in UPDATE TRIGGER | keyongtech

# Un consiglio sulla progettazione delle chiavi delle tabelle. | hilpers

Pingback from  Un consiglio sulla progettazione delle chiavi delle tabelle. | hilpers

# Username as the primary key? | keyongtech

Sunday, February 8, 2009 9:07 AM by Username as the primary key? | keyongtech

Pingback from  Username as the primary key? | keyongtech

# MySQL vs MS ACCESS speed in UPDATE query | mywebsite

Saturday, July 28, 2012 6:00 AM by MySQL vs MS ACCESS speed in UPDATE query | mywebsite

Pingback from  MySQL vs MS ACCESS speed in UPDATE query | mywebsite