March 2010 - Posts

Scottish Area SQL Server User Group Meeting, Edinburgh - Thursday 25th March

 

An evening of SQL Server 2008 Reporting Services scalability and performance with Rob Carrol: see how to build
a high-performance, scalable reporting platform and learn the tuning techniques required to ensure that report
performance remains optimal as your platform grows. Pizza and drinks will be provided!

 

Register at http://www.sqlserverfaq.com/events/221/SQL-Server-2008-Reporting-Services-Scalability-and-Performance.aspx

Chris Date, one of the pioneers of the Relational Database, is giving a two-day seminar in Edinburgh (13th and 14th May 2010) based around his new book "SQL and Relational Theory - How to Write Accurate SQL Code", which, if you don't already have it, I'd say is a must buy.

When I first saw this and what he will cover I thought: oh yeah, this is going to cost the earth. Well, it doesn't - it's £750 for the two days, and there are discounts available for multiple bookings, for members of the UK SQL Server User Group and for members of the academic community.

More information here: http://www.justsql.co.uk/chris_date/cjd_edin_may_2010.htm.

The person you need to contact is Peter Robson (rde@justsql.com); if you are a member of the UK SQL Server User Group just say so and mention my name.

If you are interested, talk to Peter this week because he needs to confirm numbers by the end of March; you may miss out otherwise.

The detail of what Chris Date will cover is shown below. Note that this will be technology neutral - what, no T-SQL, you ask? Don't worry: SQL Server's T-SQL is based on ANSI SQL-92, so the majority of the syntax he shows will work, but frankly this is more about thinking relationally than being a code monkey.

1. Setting the scene

  • Codd's relational model
    Model vs. implementation
    Properties of relations
    Base vs. derived relations
    Relations vs. relvars
    Values vs. variables


2. Types and domains

  • Domains are types
    Types and operators
    System vs. user defined types
    Scalar vs. nonscalar types
    Scalar types in SQL
    SQL type checking and coercion
    SQL collations
    SQL row and table types


3. Tuples and relations, rows and tables

  • What's a tuple?
    Rows in SQL
    What's a relation?
    Relations are n-dimensional
    Relational comparisons
    TABLE_DUM and TABLE_DEE
    Tables in SQL
    SQL column naming


4. No duplicates, no nulls

  • What's wrong with duplicates?
    Avoiding duplicates in SQL
    What's wrong with nulls?
    Avoiding nulls in SQL
    A remark on outer join


5. Base relvars, base tables

  • Data definition
    INSERT, DELETE, UPDATE
    Updating is set level
    Relational assignment
    Candidate and foreign keys
    Tables and predicates
    Tables vs. types


6. SQL and relational algebra I:
The original operators

  • Result types
    Restriction, projection, join
    Union, intersection, difference
    WITH and complex expressions
    What expressions mean
    Evaluating SQL expressions
    Optimization
    Importance of column names


7. SQL and relational algebra II:
Additional operators

  • Semijoin and semidifference
    Extend
    Image relations
    Divide
    Aggregation and summarization
    Grouping and ungrouping
    "What if" queries
    What about ORDER BY?


8. SQL and constraints

  • Type constraints
    Type constraints in SQL
    Database constraints
    Database constraints in SQL
    The role of transactions
    Immediate vs. deferred checking
    Multiple assignment
    Constraints vs. predicates
    The Golden Rule
    Correctness vs. consistency


9. SQL and views

  • Views are relvars
    The Principle of Interchangeability
    Views and predicates
    Retrieval operations
    Views and constraints
    Updating operations
    What are views really for?
    Views and snapshots


10. SQL and logic I:
Relational calculus

  • Propositions and predicates
    Quantification: EXISTS, FORALL, UNIQUE
    Range variables and correlation
    Calculus expressions
    SQL support
    Transforming expressions


11. SQL and logic II:
Using logic to write SQL code

  • Important identities
    SQL and implication
    SQL and FORALL
    Correlated subqueries
    Naming subexpressions
    Dealing with ambiguity
    Using COUNT
    ALL or ANY comparisons
    GROUP BY and HAVING


12. Further SQL topics

  • Explicit tables
    Range variables
    Table, row, and scalar subqueries
    "Possibly nondeterministic" expressions
    Cursor operations
    Empty set issues
    A BNF grammar


13. The relational model

  • The relational model vs. others
    The relational model defined
    What remains to be done?
    The future of SQL


14. Database design theory

  • The place of design theory
    FDs and BCNF
    JDs and 5NF
    6NF
    Normalization is not a panacea
    But don't denormalize!
    The Principle of Orthogonal Design
    Remarks on physical design

 

I started writing an article for my blog on surrogate keys, drawing on the original research by E. F. Codd and C. J. Date; it's getting a bit big :) so I'm going to chop it up into a number of posts over the coming weeks, depending on my time.

I'm interested in your thoughts; if you disagree please let me know, but more importantly give me references so I can go back and read what you've read and either modify my position or not :) - either way it will improve the article.

Hope it makes sense.  

Surrogate keys

There are two factions in the world of Database Design that wage war over the use of surrogate keys; there is a lot of misunderstanding around what a surrogate key actually is and why they are needed.

This article intends to draw a line under that battle and defend once and for all the use of surrogate keys; to do that we need to look at what problems they solve, the theory and rationale behind them, and how to implement them in your chosen product.

I'm going to cover a lot:

  • Rules for a Surrogate key
  • Problems they solve
  • The "Theory" - we will look at the research done primarily by E. F. Codd and C. J. Date but will also look at other authors and their take on the subject.
  • How to implement a surrogate key
    • Database Design (includes a discussion on Normalisation and when you'd introduce a surrogate)
    • Techniques for generating the surrogate key value (IDENTITY, NEWID(), NEWSEQUENTIALID(), hashing and do-it-yourself)
    • Considerations around Referential Integrity
    • Considerations around External Verification of the surrogate key value and why even the notion is invalid
    • Locking considerations - the effect on concurrency
  • Example tricks using the surrogate key, for example Table Partitioning

Rules for a Surrogate Key

  1. They must never be related (directly or indirectly) to the physical hardware or storage - to use the Relational phrase, they are not tuple IDs (row/record locators).
  2. Once a value has been used it must never be reused by something else - basically, once the value "5" has been used, the surrogate key generator must never generate the number "5" again, even if the original row has been deleted (see the sketch after these rules).
  3. They are read only - once generated, that value must never be changed.
  4. The value must be atomic - that means no sets, just a single constant, for instance "5".
  5. The value should never be exposed outside the limits of the application as a whole (the database is not a black box) - I explain more later, but basically the user of your application, for instance the Call Centre staff entering client details or the external machine calling your web service, should never see the surrogate key value (they should see one of the candidate key values instead); that is because the surrogate key cannot be used for verification purposes but the candidate key can.
  6. A surrogate key can be the sole key for a table in a situation where there are no naturally occurring candidate keys, for example a Message Board. When used in this scenario it is arguably no longer a surrogate key but an artificial one, but I would still follow rules 1 - 4. Having said that, you need to consider verification: if the rows were re-inserted and the artificial key re-generated then the same rows (entities) should get the same artificial key value; there are hashing solutions to provide that facility, but the obvious problem of duplicate rows comes into play - that discussion is outside the scope of surrogate keys.
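
As a rough sketch of rules 1 to 5 in SQL Server (the table and column names here are mine, purely for illustration): an IDENTITY column generates the surrogate value, deleted values leave gaps rather than being reused (unless you deliberately reseed), and the natural candidate key is still declared so the outside world has something to verify against.

-- Illustrative sketch only: surrogate generated by the system, candidate key still enforced.
create table Clients (
    ClientID     int identity( 1, 1 ) not null primary key clustered,  -- surrogate: system generated, never shown to users
    NINumber     char( 9 )            not null unique,                 -- candidate key: the value users verify against
    ClientName   varchar( 100 )       not null
)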

The problem they solve

Surrogate keys form an abstraction over the candidate keys on a table. It can be argued that creating a surrogate key on a table where a single-column candidate key exists makes no sense, and that their real use is in making composite keys simpler to use throughout the application - a fair point, but there is more to consider.

I'd say the real benefit of a surrogate key is in the abstraction it provides, thus removing the concurrency problems caused should the candidate key values you've used in foreign key clauses in related tables change - a surrogate key value, once generated, can never change, and its value can never be reused even if the rows have been deleted (Codd and Date - see later).

I've summarised below some of the main benefits I see surrogates providing:

  • Joins between tables are done on a single expression, unlike a join using a composite key; for instance, instead of writing ON Manufacturers.Maker = Registrations.Maker AND Manufacturers.Model = Registrations.Model you would write ON Manufacturers.ManufacturerID = Registrations.ManufacturerID (see the sketch after this list).
  • Reduced concurrency problems caused when the values of an existing candidate key used in foreign key references change.
  • Removes any possibility of inconsistencies in data type, length and collation creeping into the schema; for instance, in the Manufacturers table the column Maker is defined as varchar(250) with a Latin case-insensitive collation, while in the Registrations table the column Maker is defined as varchar(200) with a Spanish collation.
  • Join performance - this is really debatable; with today's hardware and improved code efficiency within the Optimiser and Storage Engine I don't think this really matters (that much) anymore.
  • The surrogate key can provide a method for partitioning your data - I'll show that later with an example using partitioned views (as opposed to the Enterprise Edition Partitioning feature). Partitioning your data helps reduce locking contention both within the storage engine (allocating new extents) and, in ACID terms, for consistency - less blocking!
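
A quick sketch of the first bullet above, using hypothetical Manufacturers and Registrations tables (the extra column RegistrationNumber is made up purely for illustration):

-- Join on the composite candidate key (Maker, Model):
select r.RegistrationNumber, m.Maker, m.Model
from Manufacturers as m
inner join Registrations as r on m.Maker = r.Maker
                             and m.Model = r.Model

-- The same join once the surrogate key ManufacturerID is in place:
select r.RegistrationNumber, m.Maker, m.Model
from Manufacturers as m
inner join Registrations as r on m.ManufacturerID = r.ManufacturerID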

The Theory

E. F. Codd and C. J. Date are seen as the parents of the Relational Model; their initial and continued research underpins Database Design and is why Relational Databases exist.

Codd on Surrogate Keys

In his paper in ACM Transactions on Database Systems, Vol. 4, No. 4, December 1979, pages 409 - 410, Codd introduces the concept of a surrogate key; the key points of his text are as follows:

{snip}

"There are three difficulties in employing user-controlled keys as permanent surrogates for entities.

(1) The actual values of user-controlled keys are determined by users and must therefore be subject to change by them (e.g., if two companies merge, the two employee databases might be combined with the result that some or all of the serial numbers might be changed).

(2) Two relations may have user-controlled keys defined on distinct domains (e.g., one uses social security, while the other uses employee serial number) and yet the entities denoted are the same.

(3) It may be necessary to carry information about an entity either before it has been assigned a user-controlled key value or after it has ceased to have one (e.g., an applicant for a job and a retiree).

These difficulties have the important consequence that an equi-join on common key values may not yield the same result as a join on common entities. A solution - proposed in part in [4] and more fully in [14] - is to introduce entity domains which contain system-assigned surrogates. Database users may cause the system to generate or delete a surrogate, but they have no control over its value, nor is its value ever displayed to them.

{snip}

Users will often need entity identifiers (such as part serial numbers) that are totally under their control, although they are no longer compelled to invent a user-controlled key if they do not wish to.

They will have to remember, however, that it is now the surrogate that is the primary key and provides truly permanent identification of each entity. The capability of making equi-joins on surrogates implies that users see the headings of such columns but not the specific values in those columns."

What Codd means by "user of the database" is open to debate in respect of whether the surrogate value should be displayed to them; C. J. Date makes the point that if the value is not displayed it breaks Codd's own Information Principle. I believe the user of the database is not the Developer or Database Administrator - they are not users of the database; they build systems that have users, a user being a person using the application or a machine calling a web service, for instance.

His point about "The capability of making equi-joins on surrogates implies that users see the headings of such columns" means that we do see the columns - we'd have to in order to join our tables together! OK, some products do have this ability - they internally have their own surrogate keys that are not exposed and are used in joins under the covers - but that is not a requirement from Codd, just a nice feature of those products; in SQL Server land we unfortunately do not have that feature.

Date on Surrogate Keys

From C. J. Date's book "An Introduction to Database Systems" (8th Edition):

On page 434, reference 14.21, he refers to P. Hall, J. Owlett and S. J. P. Todd, "Relations and Entities", and says:

"Surrogate Keys are keys in the usual Relational sense but have the following specific properties:

They always involve exactly one attribute.

Their values serve solely as surrogates (hence the name) for the entities they stand for ...

When a new entity is inserted into the database, it is given a surrogate key value that has never been used before and will never be used again, even if the entity in question is subsequently deleted.

Ideally surrogate key values would be system-generated.

Surrogates must not be concealed from the user because of the Information Principle - tuple IDs, however, should be."


He finishes:

"In a nutshell: Surrogates are a model concept; tuple IDs are an implementation concept"

C. J. Date makes the distinction between a tuple ID (basically a row ID) and a surrogate key; the row ID changes as rows get removed from the table and should never be exposed from the database, even to developers or Database Administrators. The SQL language has the ROW_NUMBER() function to provide us with a numbering system for a result set.
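
For example (a sketch reusing the hypothetical Manufacturers table from the join example above), ROW_NUMBER() gives a numbering that exists only in the result set - it is not stored, and it changes with the ORDER BY, which is exactly why it is an implementation concept rather than a key:

select ROW_NUMBER() over ( order by Maker, Model ) as RowNo,  -- numbering belongs to this result only
       Maker,
       Model
from Manufacturers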

 

Granularity defines “the lowest level of detail”; but what is meant by “the lowest level of detail”?

Consider the Transactions table below:

create table Transactions (
    TransactionID       int not null     primary key clustered,
    TransactionDate     date not null,
    ClientID            int not null,
    StockID             int not null,
    TransactionAmount   decimal( 28, 2 ) not null,
    CommissionAmount    decimal( 28, 5 ) not null
)

 

A Client can trade in one or many Stocks on any date - there is no uniqueness across ClientID, StockID and TransactionDate.

The level of granularity on the table is TransactionID because that is our lowest level of detail; TransactionID uniquely identifies the row and there are no other candidate keys - the columns ClientID, StockID and TransactionDate form a context to the TransactionID.

Consider now that ClientID, StockID and TransactionDate are unique and enforced by a unique constraint (a Client can trade multiple Stocks but can trade each Stock only once in a given day) - what would the level of granularity now be?

The TransactionID in this case would be acting as a surrogate for the candidate key (ClientID, StockID, TransactionDate); therefore the granularity can be said to be TransactionID, or all of the columns ClientID, StockID and TransactionDate - and those columns are in no specific order when specified.
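
A sketch of that alternative (the constraint name is mine), enforcing the candidate key alongside the TransactionID surrogate on the Transactions table above:

-- A Client can now trade a given Stock at most once per day; TransactionID stays as the surrogate primary key.
alter table Transactions
    add constraint UQ_Transactions_Client_Stock_Date
        unique ( ClientID, StockID, TransactionDate )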

When using a separate key specific to the business "Transaction", i.e. TransactionID, the columns ClientID, StockID and TransactionDate do not form part of the granularity at all; in fact they provide a context to the Transaction, and the context itself is hierarchical because Clients trade in Stocks on a given Date.

The above really concerns non-aggregated data - the data you tend to find in an OLTP system.

When looking at aggregated data you would take all the columns as the granularity, because all the columns together determine the level of detail you are looking at. Aggregated data can be thought of as dimensional; unfortunately SQL does not give us operators to work dimensionally - instead we'd use MDX. But it's important to understand that an aggregated value sits in time and space, so to speak, and you access it through any of the member columns of the granular key.

If we had aggregated on ClientID, StockID and TransactionDate (see below) then CommissionAmountTotal has a granularity of all the columns (ClientID, StockID and TransactionDate) - remember the columns have no ordering; the granularity key is all of them.

select ClientID,
       StockID,
       TransactionDate,
       sum( CommissionAmount ) as CommissionAmountTotal
from Transactions
group by ClientID, StockID, TransactionDate

 

This can be demonstrated because you can rewrite the group by clause with the columns in any order and still get the same answer (see below).
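
For instance, reordering the grouping columns (a minimal check against the same Transactions table) returns exactly the same result set:

select ClientID,
       StockID,
       TransactionDate,
       sum( CommissionAmount ) as CommissionAmountTotal
from Transactions
group by TransactionDate, ClientID, StockID   -- same grouping columns, different order, same result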

When working on aggregated data, if you miss one of the columns out from the granular key then you are implying "All", so effectively the granularity has not changed for the query; but if you aggregate a level up then the result of that query execution will have a different granularity:

In the query below the source granularity is TransactionID and the result granularity is ClientID.

select ClientID, sum( CommissionAmount ) as CommissionAmountTotal
from Transactions
group by ClientID

In the query below the source granularity is still TransactionID within the derived table; for the outer query the source granularity is {ClientID, StockID, TransactionDate} and the resulting granularity is ClientID.

select ClientID, SUM( CommissionAmountTotal ) as CommissionAmountTotal
from (
    select ClientID,
           StockID,
           TransactionDate,
           sum( CommissionAmount ) as CommissionAmountTotal
    from Transactions
    group by ClientID, StockID, TransactionDate
    ) as agg( ClientID, StockID, TransactionDate, CommissionAmountTotal )
group by ClientID


In summary, granularity has a lot to do with perspective and with how the lowest level of data is identified; the level of granularity can change - it has a source and a result when aggregating. The level of granularity is important because it helps us understand what aggregations can be performed on what data, to prevent double counting for instance.