SQL and the like

Dave Ballantyne's blog. Freelance SQL Server database designer and developer at Clear Sky SQL

December 2011 - Posts

Book review - SQL Server Secret Diary (Know the unknown secrets of SQL Server)

Like a lot of people within the SQL community, I can never read enough on the subject.  Books, whitepapers, academic research and blogs can all be valuable sources of information, so whilst browsing Amazon I found this book as a free Kindle download.  The preface makes some bold claims indeed:

“This book is for developers who already know SQL Server and want to gain more knowledge in SQL Server.  This book is not for starter who want to start from the beginning.

The problem-solution approach will help you to understand and solve the real-time problems easily.

This Book will teach you (their emphasis)

  • How to solve common real-time problems
  • How to improve performance
  • How to protect your data and code
  • How to reduce your code
  • How to use SQL Server efficiently
  • Advanced topics with simple examples
  • Tips and tricks with sample queries
  • And also teach how to do in the better way.

The last bullet point sets the tone for the quite appalling use of grammar (yes, yes, people in glass houses and all that..) contained throughout the entire book.  I get that the authors may use English as a second (or third) language, but where are the proof readers?  That I can live with, though; it's the technical content I really have a problem with.  Here is just a small selection:

Q 2) How to use GO statement in SQL Server ?

IMO, the most important concept to understand about GO is that it is not a SQL statement.  It is processed on the client (SSMS, ISQL etc.) and splits the workload into separate batches.  This is not mentioned here, though to be fair, in Q3 (How to repeat the statements without using loops?) the author notes “GO is a client command and not a T-SQL command. So GO or GO <N> can only be used with Microsoft SQL Server client tools.”.
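To make the distinction concrete, here is a minimal sketch of my own (not from the book) showing GO acting as a client-side batch separator and GO <N> repeating a batch:

-- Two separate batches: the client tool splits the script at each GO,
-- so a variable declared in the first batch is not visible in the second.
declare @msg varchar(50) = 'batch one'
print @msg
go
print 'batch two' -- @msg no longer exists in this batch
go

-- GO 5 tells the client tool to submit the preceding batch five times.
print 'repeated batch'
go 5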

Q 5) How to use ORDER BY clause in view ?

Here the author spends a great deal of time and effort working around “As per RDBMS rule ORDER BY clause is not allowed in view”.  This section should be thrown away entirely; if you are depending on the view ordering (which is a contradiction in terms) for your result set ordering, you deserve all the lawsuits that are thrown at you.
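For the record, and as a sketch only (the view and table names here are mine, not the book’s), the ordering belongs on the query that reads from the view, never inside the view itself:

create view dbo.vwCustomers
as
select CustomerId, CustomerName
from dbo.Customers -- no ORDER BY here; a view is an unordered set
go
-- The only ordering you can rely on is the one applied by the outer query.
select CustomerId, CustomerName
from dbo.vwCustomers
order by CustomerName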

Q 10) How to do case sensitive searches in SQL Server ?

The author's solution here is to cast a column as varbinary. OK, fair enough, it works. Personally, I would have used COLLATE, but let's not split hairs.  The biggest issue I have here is that sargability is not mentioned; we are introducing the possibility of a scan.
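As a rough sketch (the table, column and collation are my own illustrative choices), the two approaches look like this; either way the predicate wraps the column in an expression, which is exactly why a scan becomes likely:

-- The book's approach: compare binary representations of the values.
select CustomerName
from dbo.Customers
where cast(CustomerName as varbinary(100)) = cast('Smith' as varbinary(100))

-- Forcing a case-sensitive collation instead.
select CustomerName
from dbo.Customers
where CustomerName collate Latin1_General_CS_AS = 'Smith'

-- Both apply an expression to the column, so an index on CustomerName
-- is unlikely to be seeked.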

Q 12) How to solve server time out problem ?

The scenario presented here is that session #1 has updated some data that session #2 needs to read.  The author presents two solutions, NOLOCK and READPAST, and, to be fair, does make an attempt at highlighting the dirty reads.  My issue here is that, once again, locking is seen as the enemy that must be worked around.  We should embrace locks, understand why they are happening and how they are protecting us.  The point is not raised that the fault here lies with the UPDATE’ing transaction not completing in a timely fashion, rather than with the reader that cannot complete because of it. The consequences of reading and processing dirty data are not explored thoroughly enough and, once again, NOLOCK is used as a “go faster” button.
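For context, here is a minimal sketch of the two hints being recommended (the table is illustrative); note that neither does anything about the long-running update itself:

-- Ignores locks: may return dirty, uncommitted data that is later rolled back.
select OrderId, Status
from dbo.Orders with (nolock)

-- Skips locked rows: the result set is silently incomplete.
select OrderId, Status
from dbo.Orders with (readpast)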

Q 33 ) How to improve the performance of stored procedure ?

Here we have been given 11 bullet points by the authors, which I have copied verbatim below, with my thoughts on each point following inline:

  • Use SET NOCOUNT ON to avoid sending row count information for every statement. So, this can help, but it will only have a measurable effect if you have many, many statements, and in that case you are coding SQL wrong anyway.
  • Always use the owner name or schema name before the object name to prevent recompilation of stored procedure.  Does this mean that objects referenced without the owner or schema (which one is it??) will always cause a recompile of the entire stored procedure? No.  The statement, not necessarily the stored procedure, will recompile if the user has a different default schema to the existing compiled statement. (A quick sketch covering this point and the previous one follows after the list.)
  • Avoid using DISTINCT Just DISTINCT? Anything else? Unnecessary ORDER BY?
  • Minimize the number of columns in SELECT clause So, Select Col1,Col2,Col3 is bad but Select Col1 + ' ' + Col2 + ' ' + Col3 is OK? Better wording here would be “Return only the data that is required by the application, nothing more, nothing less.”
  • Use table variables instead temporary tables. Seriously! What! Come again.  As a sweeping general statement: wrong, wrong, wrong.
  • Use the CTE (Common Table Expression) instead of derived tables and table variables as much as possible. Again, a massive over-generalisation.  Horses for courses.  Also, didn’t you just say that I should use table variables?
  • Avoid using cursors Why? And what should I do instead?  I have to get the data out somehow; what alternatives are there?
  • Don’t use duplicate codes, reuse the code by Views and UDF’s  This section is about performance, right? I would like to see one single instance where using a view (presumably unindexed) or a UDF (cough, splutter) improves performance.
  • Begin and commit transactions immediately Better wording would be “Keep transactions as short as possible, never leave a transaction open while waiting for user input.”
  • Avoid exclusive locks Confusing; in what context?
  • Use table hints BwaaHaa, this is really a Pandora’s box best left alone by the audience of this book.
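Since the first two of those points do have some merit, here is a minimal sketch of what they amount to in practice (procedure and table names are mine, purely illustrative):

create procedure dbo.GetCustomerOrders
    @CustomerId int
as
begin
    -- Suppress the 'n rows affected' message for each statement.
    set nocount on

    -- Schema-qualify object names so that name resolution does not
    -- depend on the caller's default schema.
    select o.OrderId, o.OrderDate
    from dbo.Orders o
    where o.CustomerId = @CustomerId
end
go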

And so it continues.  I’m trying really hard not to be too scathing or nit-picky about this book; there is some good advice here, but SQL Server is full of caveats, confusing and contradictory best practices, and ultimately 90% of the time you can state that “It depends”.

Questions are presented with solutions that can work, but they are given as 100% solutions, without any degree of warning that that may not always be the case.  Even as a free download, it is way too expensive and, remembering the target audience, could ultimately do more harm than good.

Extended Events - inaccurate_cardinality_estimate

Extended events have been a bit of a personal “Elephant in the room” for me.  I know they are there and I should really get on and start using them, but I never *quite* have a compelling enough reason.

So now I really do; after comparing the events in sys.dm_xe_objects between 2008R2 and 2012, I found one that really piqued my interest: inaccurate_cardinality_estimate.  This is described as “Occurs when an operator outputs significantly more rows than estimated by the Query Optimizer. Use this event to identify queries that may be using sub-optimal plans due to cardinality estimate inaccuracy. Using this event can have a significant performance overhead so it should only be used when troubleshooting or monitoring specific problems for brief periods of time.”

IMO, cardinality estimation errors are the number one cause of performance problems.  If SQL Server deduces (or even guesses) an incorrect row estimate then all bets are off; if you get anything approaching a decent plan, it is by luck not judgement.

So, let's see if we can cause this event to fire.  Firing up Management Studio, we have the new extended events manager,

[image: the new extended events manager in Management Studio]

which sounds like a fun tool to play with. So, starting a new session and going to the events library

[image: the new session dialog and its events library]

and filtering by ‘Card’

[image: the events library filtered by ‘Card’, returning no matches]

Oh, nothing found.  It's simply not there; here is a connect item for this issue.

So, we will have to do this in a more ‘manual’ way:

CREATE EVENT SESSION inaccurate_cardinality_estimate ON SERVER
ADD EVENT sqlserver.inaccurate_cardinality_estimate
( ACTION (sqlserver.plan_handle, sqlserver.sql_text) )
ADD TARGET package0.asynchronous_file_target
( SET FILENAME = N'c:\temp\inaccurate_cardinality_estimate.xel',
metadatafile = N'c:\temp\inaccurate_cardinality_estimate.xem' );

OK, session defined; let's start it.

ALTER EVENT SESSION inaccurate_cardinality_estimate ON SERVER STATE = START

To demonstrate the actual event we need to create and populate a temporary table:

if object_id('tempdb..#NewIds') is not null
    drop table #NewIds
go
create table #NewIds
(
    id char(36)
)
go
insert into #NewIds
select top(100)
    cast(newid() as char(36))
from sys.all_columns a cross join sys.all_columns b

If we now execute
 
declare @v varchar(10)
Select @v='%XX%'
select COUNT(*) from #NewIds where id like @v

We will get an estimated row count of 5.37528.
 
I can control the exact actual row count by updating a number of rows to start with ‘XX’ and thereby create a cardinality estimation error.
 
For starters, let's update all the rows (I'm wrapping the updates in a transaction and rolling back; for the sake of brevity this is not shown):
 
update 
#NewIds
set id = 'XX'+left(id,20)

declare @v varchar(10)
Select @v='%XX%'
select COUNT(*) from #NewIds where id like @v

The plan for the select shows a cardinality error, as expected

[image: execution plan showing the estimated versus actual row counts]
 
100 actual, 5.37528 estimated.
 
Stop the extended events session
 
ALTER EVENT SESSION inaccurate_cardinality_estimate ON SERVER STATE = STOP

and, all being well, in the c:\temp folder you will see an extended event log file.
 
Open that in Management Studio

[image: the .xel file opened in Management Studio, showing the captured event]

There is the event, nice.  But hold on one cotton-picking minute, look at the row counts.  Estimated = 5, actual = 26!?!

What happens if we repeat this operation but double the rows in the temp table?

If we double the number of rows in our temp table to 200, the estimated rows in the plan will show as 10.7506 and the actual as 200. In the extended event we see:

[image: the captured event for the 200-row test]

So the estimated count is shown as floor(row estimate), and the event fires as soon as the actual row count goes over 5 * the plan estimate, which is why the actual is shown here as 53 and not 200.  Notice that we also have the plan_handle and the node_id if we wish to tie this back to an exact operator in our system.

Quite why this is an extended event and not a plan warning, I really have no idea; still, it's nice to know it's there.

MythBusting - “Table variables have no statistics”

OK, as myths go, it's a pretty weak one.  In fact, it is true; this whitepaper explicitly states that.  But hand in hand with that statement goes another one: “Table variables will always estimate to one row”.  This is most definitely false; if there are no statistics then SQL Server can, at times, fall back to ‘guessing’ the distribution of data based upon row counts.  This behaviour can muddy the waters of the old “Which is better, table variables or temp tables?” argument even further.

To demonstrate this, first we need to populate a numbers table:

create table numbers
(
Num integer primary key
)
go
insert into numbers
Select top(1000) ROW_NUMBER() over (order by (select null))
from sys.columns a cross join sys.columns b

Now we execute the following code

Declare @TableVar Table
(
ID integer not null primary key,
Mod10 integer not null
)

insert into @TableVar(ID,Mod10)
Select top(20) num,num%10
from numbers
order by num

Select tv.Id,num
from @TableVar tv
join numbers
on tv.ID = num


and looking at the execution plan, we see:

[image: execution plan with 1 row estimated and 20 rows actual]

1 row estimated and 20 rows actual, as you may well expect.  Now add ‘OPTION(RECOMPILE)’ and the plan is different.

[image: execution plan with OPTION(RECOMPILE), now showing an accurate estimate]

Look at that, an accurate row estimation.  How about if we add a filter to the statement, say ‘Mod10 = 0’?

[image: execution plan for the filtered query]

Another different, but wrong, estimation.  This is because table variables don't have statistics, but we do have row counts.  It is worth pointing out that these are the same numbers you would get if you did these operations on a normal ‘permanent’ table, but had turned off AUTO STATISTICS.
 
Obviously in a production environment you would only be using RECOMPILE in ‘special’ circumstances, right?  So this isn't an issue; all your table variables will be estimating as one row.  Wrong.  I would be willing to bet that a surprisingly high number are estimating as something else.  If you are so inclined, you can probably find quite a few via the DMV sys.dm_exec_query_plan.  So, how does this happen?  Well, in a way it's nothing to do with table variables per se, but if you are joining to another table, then if (and when) that table has its stats updated, that will cause the statement to recompile and, surprise, surprise, you have a table variable with an estimate > 1.
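If you do want to go looking, something along these lines is a starting point.  This is only a rough sketch of mine (the XQuery and the '[@' filter are assumptions about how table variables appear in cached plan XML, and shredding every cached plan is not cheap, so run it somewhere quiet):

with xmlnamespaces (default 'http://schemas.microsoft.com/sqlserver/2004/07/showplan')
select qp.dbid,
       cp.plan_handle,
       op.value('@NodeId','int')                        as node_id,
       op.value('(.//Object/@Table)[1]','nvarchar(128)') as table_name,
       op.value('@EstimateRows','float')                 as estimate_rows
from sys.dm_exec_cached_plans cp
cross apply sys.dm_exec_query_plan(cp.plan_handle) qp
cross apply qp.query_plan.nodes('//RelOp') r(op)
where op.value('(.//Object/@Table)[1]','nvarchar(128)') like '[[]@%'
  and op.value('@EstimateRows','float') > 1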
 
OK… so let's step through that.  Ignore the extra Select statement that counts from AdventureWorks; it's just there to create a more ‘complicated’ stored procedure so that we get multiple statements cached in the plan.
 
if object_id('dbo.IDs') is not null
    drop table IDs
go
create table IDs
(
Id integer primary key,padding char(255)
)
go
insert into IDs(Id,padding)
Select top(1) num,'xxx'
from numbers
order by num
go
if object_id('dbo.TableVarTest','P') is not null
    drop procedure TableVarTest
go
create procedure TableVarTest
as
declare @TableVar Table
(
ID integer not null,
Mod10 integer not null
)

insert into @TableVar(ID,Mod10)
Select top(20) num,num%10
from numbers
order by num

select COUNT(*)
from AdventureWorks2008r2.dbo.Customer C
join AdventureWorks2008r2.dbo.CustomerOrders CO
on C.CustomerId = CO.CustomerId

Select tv.Id,IDs.id
from @TableVar tv
join IDs
on tv.ID = IDs.Id
where mod10 =0
go

On first execution the join of the table variable to IDs produces…
 
[image: execution plan for the first execution of the stored procedure]
 
Now, let's add some more data to IDs and force a recompile, just for good measure:
 
insert into IDs(Id,padding)
Select top(1000) num,'xxx'
from numbers
where not exists(select id from IDs where id = num )
order by num
go
exec sp_recompile ids

and then re-execute the stored procedure
 
[image: execution plan for the re-execution, with the table variable now estimating more than one row]
 
So, one myth busted and one proved, not bad for one blog.