SQL and the like

Dave Ballantyne's blog. Freelance SQL Server database designer and developer at Clear Sky SQL
Reporting on common code smells : A POC

Over the past few blog entries, I've been looking at parsing TSQL scripts in a variety of ways for a variety of tasks. In my last entry, 'How to prevent 'Select *' : The elegant way', I looked at parsing SQL to report upon uses of SELECT *. The obvious question leading on from this is, "Great, what about other code smells?" Well, using the language service parser to do that was turning out to be a bit of a hard job: sure, I was getting tokens, but no real context. I wasn't even being told when the end of a statement had been reached.

One of the other parsing options available from Microsoft is exposed in the assembly 'Microsoft.SqlServer.TransactSql.ScriptDom'. This is, I believe, installed with the SQL Server client development tools. It is much more feature-rich than the original parser I had used and breaks a TSQL script into intuitive classes for analysis.

So, what sort of smells can I now find using it? Well, for an opening gambit, quite a nice little list.

  • Use of NOLOCK
  • Setting of READ UNCOMMITTED
  • Use of SELECT *
  • Insert without column references
  • Explicit datatype conversion on SARGs
  • Cross-server selects
  • Non-use of two-part naming convention
  • Table and Query hint usage
  • Changes in set options
  • Use of single line comments
  • Use of ordinal column positions in ORDER BY clause

Now, let's not argue the "it depends" point about whether some of these really are smells, but as an academic exercise it is quite interesting.
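
For illustration, here is a small made-up snippet (the table name is invented) that would trip several of the checks above:

-- a single-line comment (itself one of the smells)
select *                  -- SELECT *
from OrderHeader (nolock) -- NOLOCK hint, and single-part rather than two-part naming
order by 2                -- ordinal column position in ORDER BY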

The code is available on codeplex http://tsqlsmells.codeplex.com/ 

All the usual disclaimers apply to this code: I cannot be held responsible for anything ranging from mild annoyance through to universe destruction caused by the use of this code or these examples.

The zip file contains a PowerShell script and my test cases. The assembly used requires .NET 4 to run, which means that you will need PowerShell 3 (though I'm running it through PowerGUI and all works OK).

The code searches for all .sql files in the folder hierarchy under the working path (you can override this by changing the $Folder variable) and processes each in turn for the smells. Feedback is not great at the moment: all it does is output to an XML file (Smells.xml) the offset position and a description of each smell found.

Right now, I am interested in your feedback. What do you think? Is this (or should it be) more than an academic exercise? Can tooling such as this be used as some form of code quality measure? Does it work? Do you have a case listed above which is not being reported? Do you have a case that you would love to be reported?

Let me know, please: parser@clearskysql.co.uk.

Thanks

How to prevent ‘Select *’ : The elegant way

UPDATE 2012-09-12: For my latest adventures with TSQL parsers, please see this post.

I've been doing a lot of work with the "Microsoft SQL Server 2012 Transact-SQL Language Service" recently; see my post here and article here for more details on it and some of its uses.

An obvious use is to interrogate SQL scripts to enforce our coding standards. In the SQL world a no-brainer is SELECT *. All apologies must now be given to Jorge Segarra and his post "How To Prevent SELECT * The Evil Way", as this is a blatant rip-off.

IMO, the only true way to check for this particular evilness is to parse the SQL as if we were SQL Server itself. The parser mentioned above is, pretty much, the best tool for doing this. So, without further ado, let's have a look at a PowerShell script that does exactly that:

cls
#Load the assembly
[System.Reflection.Assembly]::LoadWithPartialName("Microsoft.SqlServer.Management.SqlParser") | Out-Null
$ParseOptions = New-Object Microsoft.SqlServer.Management.SqlParser.Parser.ParseOptions
$ParseOptions.BatchSeparator = 'GO'
#Create the object
$Parser = new-object Microsoft.SqlServer.Management.SqlParser.Parser.Scanner($ParseOptions)
$SqlArr = Get-Content "C:\scripts\myscript.sql"
$Sql = ""
foreach($Line in $SqlArr){
    $Sql+=$Line
    $Sql+="`r`n"
}
$Parser.SetSource($Sql,0)
$Token=[Microsoft.SqlServer.Management.SqlParser.Parser.Tokens]::TOKEN_SET
$IsEndOfBatch = $false
$IsMatched = $false
$IsExecAutoParamHelp = $false
$Batch = ""
$BatchStart =0
$Start=0
$End=0
$State=0
$SelectColumns=@();
$InSelect = $false
$InWith = $false;
while(($Token = $Parser.GetNext([ref]$State ,[ref]$Start, [ref]$End, [ref]$IsMatched, [ref]$IsExecAutoParamHelp ))-ne [Microsoft.SqlServer.Management.SqlParser.Parser.Tokens]::EOF) {
    $Str = $Sql.Substring($Start,($End-$Start)+1) 
    try{
        ($TokenPrs =[Microsoft.SqlServer.Management.SqlParser.Parser.Tokens]$Token) | Out-Null
        #Write-Host $TokenPrs
        if($TokenPrs -eq [Microsoft.SqlServer.Management.SqlParser.Parser.Tokens]::TOKEN_SELECT){
            $InSelect =$true
            $SelectColumns+=""
        }    
        if($TokenPrs -eq [Microsoft.SqlServer.Management.SqlParser.Parser.Tokens]::TOKEN_FROM){
            $InSelect =$false
            #Write-Host $SelectColumns -BackgroundColor Red
            
            foreach($Col in $SelectColumns){
                if($Col.EndsWith("*")){
                    Write-Host "select * is not allowed"
                    exit
                
                }
            
            }
            $SelectColumns =@()
        }
        
    }catch{
        #$Error
        $TokenPrs = $null
        
    }    
    if($InSelect -and $TokenPrs -ne [Microsoft.SqlServer.Management.SqlParser.Parser.Tokens]::TOKEN_SELECT){
        if($Str -eq ","){
            $SelectColumns+=""
        }else{
            $SelectColumns[$SelectColumns.Length-1]+=$Str
        }    
    }
}

OK, I'm not going to pretend that it's the prettiest of PowerShell scripts, but if our parsed script file "C:\Scripts\MyScript.SQL" contains SELECT *, then "select * is not allowed" will be written to the host.

So, where can this go wrong? It can't, or at least shouldn't, go wrong, but it is lacking in functionality. IMO, SELECT * should be allowed in CTEs, views and inline table-valued functions at least, and as it stands those will be reported upon.

Anyway, it is a start and is more reliable than other methods.

Joining on NULLs

A problem I see on a fairly regular basis is that of dealing with NULL values. Specifically here, where we are joining two tables on two columns, one of which is 'optional', i.e. nullable. So, something like this:

[image: example lookup query]

i.e. lookup where all the columns are equal, even when NULL. NULLs are a tricky thing to initially wrap your mind around. Statements like "NULL is not equal to NULL and neither is it not not equal to NULL, it's NULL" can cause a serious brain freeze and leave you a gibbering wreck and needing your mummy.
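
To make that concrete, here is a quick sketch (mine, not from the original post) of how NULL comparisons actually evaluate:

-- Both comparisons evaluate to UNKNOWN rather than TRUE, so both CASEs fall through to ELSE
select case when null = null  then 'equal'     else 'not proven equal'     end,
       case when null <> null then 'not equal' else 'not proven not equal' end

-- IS NULL is the only test that treats NULL as something that can be matched
select case when null is null then 'null matched' end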

Before we plod on, time to setup some data to demo against.

Create table #SourceTable
(
Id         integer  not null,
SubId      integer null,
AnotherCol char(255) not null
)
go
create unique clustered index idxSourceTable on #SourceTable(id,subID)
go
with cteNums
as
(
select top(1000) number
from  master..spt_values
where  type ='P'
)
insert into #SourceTable
select Num1.number,nullif(Num2.number,0),'SomeJunk'
from  cteNums num1
cross join  cteNums num2

go
Create table #LookupTable
(
Id    integer  not null,
SubID integer null
)
go
insert into #LookupTable
Select top(100) id,subid
from #SourceTable 
where subid is not null
order by newid()
go
insert into #LookupTable
Select top(3) id,subid 
from #SourceTable 
where subid is null
order by newid()

If that has run correctly, you will have 1 million rows in #SourceTable and 103 rows in #LookupTable.  We now want to join one to the other.

First attempt – Let's just join

select * 
 from #SourceTable join #LookupTable
   on #LookupTable.id = #SourceTable.id
  and #LookupTable.SubID = #SourceTable.SubID

OK, that's a fail. We got 100 rows back; we didn't correctly account for the 3 rows that have NULL values. Remember, NULL does not equal NULL, and the join clause specifies SubID = SubID, which for those rows is not true.

Second attempt – Let's deal with those pesky NULLs

select * 
  from #SourceTable join #LookupTable
    on #LookupTable.id = #SourceTable.id
   and isnull(#LookupTable.SubID,0) = isnull(#SourceTable.SubID,0)

OK, that's the right result, well done, and 99.9% of the time that is where it's left. It is a relatively trivial CPU overhead to wrap ISNULL around both columns and compare the results, so no problem. But, although that's true, this is a relational database we are using here, not a procedural language. SQL is a declarative language: we are making a request to the engine for the results we want. How we ask for them can make a ton of difference.

Let's look at the plan for our second attempt, specifically the clustered index seek on #SourceTable.

[image: clustered index seek on #SourceTable, showing a Seek Predicate and a Predicate]

 

There are two predicates: the 'Seek Predicate' and the 'Predicate'. The 'Seek Predicate' describes how SQL Server has been able to use an index. Here, it has been able to navigate the index to resolve where ID = ID. So far so good, but what about the 'Predicate' (aka residual probe)? This is a row-by-row operation. For each row found in the index matching the Seek Predicate, the leaf-level nodes have been scanned and tested using this logical condition. In this example [Expr1007] is the result of the IsNull operation on #LookupTable, and that is tested for equality with the IsNull operation on #SourceTable. This residual probe is quite a high overhead; we can avoid it if we express our statement slightly differently, taking full advantage of the index and making the test part of the 'Seek Predicate'.

Third attempt – X is null and Y is null

So, let's state the query in a slightly different manner:

select * 
 from #SourceTable join #LookupTable
   on #LookupTable.id = #SourceTable.id
  and ( #LookupTable.SubID = #SourceTable.SubID or
        (#LookupTable.SubID is null     and #SourceTable.SubId is null)
      )

So it's slightly wordier and may not be as clear in its intent to the human reader (that is what comments are for), but the key point is that it is now clearer to the query optimizer what our intention is.

Let's look at the plan for that query, again specifically the index seek operation on #SourceTable.

[image: index seek on #SourceTable, Seek Predicate only]

No 'Predicate', just a 'Seek Predicate' against the index to resolve both ID and SubID. A subtle difference that can be easily overlooked. But has it made a difference to the performance? Well, yes, a perhaps surprisingly big one.

[image: performance comparison of the two queries]

Clever query optimizer, well done.

If you are using a scalar function on a column, you are pretty much guaranteeing that a residual probe will be used. By re-wording the query you may well be able to avoid this and use the index completely to resolve lookups. In terms of performance and scalability, your system will be in a much better position if you can.
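
As a hypothetical illustration of that last point (my own sketch, assuming an index exists on the column being filtered), compare a function-wrapped predicate with its sargable equivalent:

-- Non-sargable: wrapping the column in a function means every row must be tested (residual predicate / scan)
select Id, SubId, AnotherCol
  from #SourceTable
 where left(AnotherCol, 4) = 'Some'

-- Sargable equivalent: a leading-edge LIKE can be resolved by seeking the index
select Id, SubId, AnotherCol
  from #SourceTable
 where AnotherCol like 'Some%'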

Indexed view deadlocking

Deadlocks can be a really tricky thing to track down the root cause of. There are lots of articles on the subject of tracking down deadlocks, but seldom do I find that in a production system the cause is as straightforward. That being said, deadlocks are always caused by process A needing a resource that process B has locked while process B needs a resource that process A has locked. There may be a longer chain of processes involved, but that is the basic premise.

Here is one such (much simplified) scenario whose cause was at first non-obvious:

The system has two tables, Product and Stock. The Product table holds the description and price of a product, whilst Stock records the current stock level.

USE tempdb
GO
CREATE TABLE Product
(
ProductID INTEGER IDENTITY PRIMARY KEY,
ProductName VARCHAR(255) NOT NULL,
Price        MONEY NOT NULL
)
GO
CREATE TABLE Stock
(
ProductId INTEGER PRIMARY KEY,
StockLevel INTEGER NOT NULL
)
GO
INSERT INTO Product
SELECT TOP(1000) CAST(NEWID() AS VARCHAR(255)),
                 ABS(CAST(CAST(NEWID() AS VARBINARY(255)) AS INTEGER))%100
        FROM sys.columns a CROSS JOIN sys.columns b
GO
INSERT INTO Stock
SELECT ProductID,ABS(CAST(CAST(NEWID() AS VARBINARY(255)) AS INTEGER))%100
FROM Product

There is a single stored procedure, GetStock:

Create Procedure GetStock
as
SELECT Product.ProductID,Product.ProductName
  FROM dbo.Product
  join dbo.Stock
    on Stock.ProductId = Product.ProductID
 where Stock.StockLevel <> 0

Analysis of the system showed that this procedure was causing a performance overhead, and as reads of this data were many times more frequent than writes, an indexed view was created to lower the overhead.

CREATE VIEW vwActiveStock
With schemabinding
AS
SELECT Product.ProductID,Product.ProductName
  FROM dbo.Product
  join dbo.Stock
    on Stock.ProductId = Product.ProductID
 where Stock.StockLevel <> 0
go
CREATE UNIQUE CLUSTERED INDEX PKvwActiveStock on vwActiveStock(ProductID)

This worked perfectly: performance was improved, the team was cheered to the rafters and beers were had all round. Then, after a while, something else happened…

The system updating the data changed. The update pattern of both the Stock update and the Product update used to be:

BEGIN TRAN
UPDATE...
COMMIT
BEGIN TRAN
UPDATE...
COMMIT
BEGIN TRAN
UPDATE...
COMMIT

It changed to:

BEGIN TRAN
UPDATE...
UPDATE...
UPDATE...
COMMIT

Nothing that would raise an eyebrow in even the closest of code reviews. But after this change we saw deadlocks occurring.

You can reproduce this by opening two sessions. In session 1

begin transaction
Update Product
   set ProductName ='Test'
 where ProductID = 998

Then in session 2

begin transaction
Update Stock
   set Stocklevel = 5
 where ProductID = 999

Update Stock
   set Stocklevel = 5
 where ProductID = 998

Hop back to session 1 and..

Update Product
   set ProductName ='Test'
 where ProductID = 999

Looking at the deadlock graphs, we could see the contention was between two processes, one updating Stock and the other updating Product, but we knew that all the processes do to the tables is update them. Period. There are separate processes that handle the update of Stock and Product, and never the twain shall meet; there is no reason why one should be requiring data from the other.

Then it struck us: AH, the indexed view.

Naturally, when you make an update to any table involved in an indexed view, the view has to be updated. When this happens, the data in all the underlying tables has to be read, and that explains our deadlocks: the data from Stock is read when you update Product, and vice versa.

The fix, once you understand the problem fully, is pretty simple: the apps did not guarantee the order in which data was updated. Luckily it was a relatively simple change to order the updates, and the deadlocks went away. Note that there is still a *slight* risk of a deadlock occurring, if both a Stock update and a Product update occur at *exactly* the same time.
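
As a rough sketch of what 'ordering the updates' might look like here (my illustration based on the repro above, not the actual production code), both writers now touch rows in ascending ProductID order, so they can never hold locks on the indexed view's rows in opposite orders:

-- Stock-updating process
begin transaction
Update Stock
   set Stocklevel = 5
 where ProductID = 998

Update Stock
   set Stocklevel = 5
 where ProductID = 999
commit

-- Product-updating process
begin transaction
Update Product
   set ProductName ='Test'
 where ProductID = 998

Update Product
   set ProductName ='Test'
 where ProductID = 999
commit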

“Query cost (relative to the batch)” <> Query cost relative to batch

OK, so that is quite a contradictory title, but unfortunately it is true. There is a common misconception that the query with the highest percentage relative to the batch is the worst performing. Simply put, that is a lie, or, more accurately, we don't understand what these figures mean.

Consider the two simple queries below:

SELECT * FROM Person.BusinessEntity
JOIN Person.BusinessEntityAddress
ON Person.BusinessEntity.BusinessEntityID = Person.BusinessEntityAddress.BusinessEntityID
go
SELECT * FROM Sales.SalesOrderDetail
JOIN Sales.SalesOrderHeader
ON Sales.SalesOrderDetail.SalesOrderID = Sales.SalesOrderHeader.SalesOrderID

After executing these and looking at the plans, I see this :

[image: execution plans showing a 13% / 87% query cost split]

So, a 13% / 87% split, but 13% / 87% of WHAT? CPU? Duration? Reads? Writes? Or some magical weighted algorithm?

In a Profiler trace of the two we can find the metrics we are interested in.

[image: Profiler trace of the two queries]

CPU and duration are well out, but what about reads (210 and 1935)? To save you doing the maths, though you are more than welcome to, that's a 9.8% / 90.2% split. Close, but no cigar.

Let's try a different tack. Looking at the execution plans, the "Estimated Subtree Cost" of query 1 is 0.29449 and of query 2 it is 1.96596. Again, to save you the maths, that works out to 13.03% and 86.97%; round those and those are the figures we are after. But what is the worrying word there? "Estimated".
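
To spell out the arithmetic, each percentage is simply that query's estimated cost divided by the estimated cost of the whole batch:

0.29449 / (0.29449 + 1.96596) = 13.03%
1.96596 / (0.29449 + 1.96596) = 86.97%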

So these are not "actual" execution costs, but what's the problem with comparing the estimated costs to decide which query is "most costly"? Well, in the case of simple queries such as the above, probably not a lot. In more complicated queries, a fair bit.

By modifying the second query to also show the total number of lines on each order:

SELECT *,COUNT(*) OVER (PARTITION BY Sales.SalesOrderDetail.SalesOrderID)
 FROM Sales.SalesOrderDetail
JOIN Sales.SalesOrderHeader
ON Sales.SalesOrderDetail.SalesOrderID = Sales.SalesOrderHeader.SalesOrderID

The split in percentages is now 6% / 94%, and the Profiler metrics are:
[image: Profiler trace]

Even more of a discrepancy.

Estimates can be out from actuals for a whole host of reasons. Scalar UDFs are a particular bugbear of mine; in fact, the cost of a UDF call is entirely hidden inside the execution plan. It always estimates to 0 (well, a very small number).

Take, for instance, the following UDF:

Create Function dbo.udfSumSalesForCustomer(@CustomerId integer)
returns money
as
begin
   Declare @Sum money
   Select @Sum= SUM(SalesOrderHeader.TotalDue)
     from Sales.SalesOrderHeader
    where CustomerID = @CustomerId
   return @Sum
end

If we have two statements, one that fires the UDF and another that doesn't:
Select CustomerID
  from Sales.Customer
 order by CustomerID
go
Select CustomerID,dbo.udfSumSalesForCustomer(Customer.CustomerID)
  from Sales.Customer
 order by CustomerID

The costs relative to the batch are a 50/50 split, but there has to be an actual cost to firing the UDF. Indeed, Profiler shows us:
[image: Profiler trace]

Nowhere even remotely near 50/50!!!!

Moving forward to the window framing functionality in SQL Server 2012, the optimizer sees ROWS and RANGE (see here for their functional differences) as the same 'cost' too:

SELECT SalesOrderDetailID,SalesOrderId,
       SUM(LineTotal) OVER(PARTITION BY salesorderid 
         ORDER BY Salesorderdetailid RANGE unbounded preceding)
from Sales.SalesOrderdetail
go
SELECT SalesOrderDetailID,SalesOrderId,
       SUM(LineTotal) OVER(PARTITION BY salesorderid 
       ORDER BY Salesorderdetailid Rows unbounded preceding)
from Sales.SalesOrderdetail

By now it won't be a great surprise to see that the Profiler trace reads are a *tiny* bit different.
[image: Profiler trace]

So, the moral of the story: percentage relative to the batch can give a rough 'finger in the air' measurement, but don't rely on it as fact.

Offset without OFFSET

A while ago Robert Cary posted an article on SQLServerCentral entitled "2005 Paging – The Holy Grail", which is, as the title would suggest, about paging in SQL Server. This article provoked some really interesting chat around the subject and is well worth a read.

This is now a lot easier in SQL Server 2012 with the introduction of the OFFSET extension to the ORDER BY clause, but what is the most optimal method if you are not using 2012?

Well, whilst playing around with the OFFSET portion of my "What's new in SQL Server 2012 – TSQL" presentation, I hit on a different method that I've not seen published before.

Now, whilst finding which rows are on which page is a problem, it is only part of a much wider problem, that being the cost of the lookups to find other related data. For example: you have a list of people which you are paging through in the order of LastName, but you also wish to display FirstName. That is not in your index and so a key lookup occurs. OK, I could INCLUDE it in the index, but I'm just simplifying the problem.

So, to demonstrate this I need to create an index on Person.Person in AdventureWorks.

Create index idxLastName on Person.Person(LastName)

The query for the "holy grail" method would look something like this:
with ctePaging
as
(
Select LastName,FirstName,
       row_number() over (order by LastName,BusinessEntityID)-1  as RowN
 from  Person.Person
)
Select * from ctePaging
where RowN between 20 and 39 
order by RowN;

The issue here is that SQL Server has initiated an index scan (against a different index than the one we created) and had to process all the rows in the table and then sort them. 

[image: execution plan showing an index scan and sort]

We only want 20 rows returned so this is quite a lot of wasted effort on the engine's part.

OFFSET was introduced in 2012, and running the equivalent query of:

Select LastName,FirstName,BusinessEntityID
 from  Person.Person
 order by LastName,BusinessEntityID
 offset 20 rows fetch next 20 rows only;

Gives us the query plan of :

[image: execution plan for the OFFSET query]

Even this is non-optimal though, as the key lookup has occurred 40 times , even though we only needed the data (in this case FirstName) for 20 rows.

This can be resolved by doing the key lookup yourself.

with cteKeySeek
as
(
Select LastName,BusinessEntityID
 from  Person.Person
 order by LastName,BusinessEntityID
 offset 20 rows fetch next 20 rows only
)
Select cteKeySeek.LastName,
       FirstName,
       cteKeySeek.BusinessEntityID
 from  cteKeySeek
 inner join  Person.Person   
   on  cteKeySeek.BusinessEntityID =   Person.BusinessEntityID
order  by cteKeySeek.LastName,FirstName,cteKeySeek.BusinessEntityID;

Even though it's longer, wordier and involves a join, it is more efficient, as the join has replaced the key lookup and it is now only occurring on the 20 rows of data that we need.

[image: execution plan with the key lookup replaced by a join]

Quite neat, hey? When using OFFSET it is important to remember that no magic is happening: SQL Server still has to 'count' and scan through the rows that are not to be returned before it can decide which ones it does need.

A comparable query for previous versions and taking the lead from the holy grail method would be :

with cteKeySeek
as
(
Select BusinessEntityID,LastName,
       row_number() over (order by LastName,BusinessEntityID)-1 
             as RowN
 from  Person.Person

)
Select cteKeySeek.LastName,FirstName,cteKeySeek.BusinessEntityID ,RowN
  from cteKeySeek
  inner loop join  Person.Person   
   on  cteKeySeek.BusinessEntityID =   Person.BusinessEntityID
where RowN >= 20 and rown<=39
order by LastName,BusinessEntityID;

This does similarly filter the rows before doing the index lookup:

[image: execution plan for the pre-2012 paging query]

It does, however, still involve a scan of 19,972 rows, of which 19,932 are irrelevant to our final result set. You may have noticed in the OFFSET versions that the TOP operator is used to filter the data and 'stop' the scan once it has reached the last row that we are interested in. What if we could do something similar?

What about this ?:

with cteKeySeek
as
(
Select BusinessEntityID,LastName,
       row_number() over (order by LastName,BusinessEntityID)-1 
             as RowN
 from  Person.Person

)
Select top(20) cteKeySeek.LastName,FirstName,cteKeySeek.BusinessEntityID ,RowN
  from cteKeySeek
  inner loop join  Person.Person   
   on  cteKeySeek.BusinessEntityID =   Person.BusinessEntityID
where RowN >= 20 and RowN<=39
order by LastName,BusinessEntityID;

That does have the rather interesting effect of doing exactly that:

[image: execution plan with the TOP operator stopping the scan]

So, this is looking (at least in terms of row counts) very similar to the OFFSET functionality. If we look for a page of data further on (rows 200 to 219) and look at a Profiler trace, we can see how the three types of query compare.

[image: Profiler trace comparing the three queries]

So, as you can see, over a medium-sized(ish) dataset the fake and real OFFSET are comparable in terms of IO.

Hope this helps someone who needs to do paging.

Parsing T-SQL – The easy way

UPDATE 2012-09-12: For my latest adventures with TSQL parsers, please see this post.

Every once in a while, I hit an issue that requires me to interrogate or parse some T-SQL code. Normally, I would shy away from this and attempt to solve the problem in some other way. I have written parsers in the past using LEX and YACC, and as much fun and awesomeness as that path is, I couldn't justify the time it would take.

However, this week I have been faced with just such an issue, and at the back of my mind I remembered reading through the SQL Server 2012 feature pack and seeing something called the "Microsoft SQL Server 2012 Transact-SQL Language Service". This is described there as:

“The SQL Server Transact-SQL Language Service is a component based on the .NET Framework which provides parsing validation and IntelliSense services for Transact-SQL for SQL Server 2012, SQL Server 2008 R2, and SQL Server 2008. “

Sounds just what I was after. Documentation on this is very scant, so don't take what follows as best practice or best use, just a practice and a use.

Knowing roughly what I was looking for, I found the relevant assembly in the GAC, which is the simply named 'Microsoft.SqlServer.Management.SqlParser'.

Even knowing that, you won't find much in terms of documentation if you do a web search, but you will find the MSDN documentation that lists the members and methods etc…

The "Scanner" class sounded the most appropriate for my needs, as it is described as "Scans Transact-SQL searching for individual units of code or tokens.".

After a bit of poking around, the code I ended up with was something like:

[System.Reflection.Assembly]::LoadWithPartialName("Microsoft.SqlServer.Management.SqlParser") | Out-Null
$ParseOptions = New-Object Microsoft.SqlServer.Management.SqlParser.Parser.ParseOptions
$ParseOptions.BatchSeparator = 'GO'

$Parser = new-object Microsoft.SqlServer.Management.SqlParser.Parser.Scanner($ParseOptions)
$Sql = "Create Procedure MyProc as Select top(10) * from dbo.Table"
$Parser.SetSource($Sql,0)
$Token=[Microsoft.SqlServer.Management.SqlParser.Parser.Tokens]::TOKEN_SET
$Start =0
$End = 0
$State =0 
$IsEndOfBatch = $false
$IsMatched = $false
$IsExecAutoParamHelp = $false
while(($Token = $Parser.GetNext([ref]$State ,[ref]$Start, [ref]$End, [ref]$IsMatched, [ref]$IsExecAutoParamHelp ))-ne [Microsoft.SqlServer.Management.SqlParser.Parser.Tokens]::EOF) {
    try{
        ($TokenPrs =[Microsoft.SqlServer.Management.SqlParser.Parser.Tokens]$Token) | Out-Null
        $TokenPrs
        $Sql.Substring($Start,($end-$Start)+1)
    }catch{
        $TokenPrs = $null
    }    
}

As you can see, the $Sql variable holds the SQL to be parsed; that is pushed into the $Parser object using SetSource, and then we use GetNext until the EOF token is returned. GetNext also returns the start and end character positions of the parsed text within the source string.

This script’s output is :

TOKEN_CREATE
Create
TOKEN_PROCEDURE
Procedure
TOKEN_ID
MyProc
TOKEN_AS
as
TOKEN_SELECT
Select
TOKEN_TOP
top
TOKEN_INTEGER
10
TOKEN_FROM
from
TOKEN_ID
dbo
TOKEN_TABLE
Table

Note that the '(', ')' and '*' characters have returned a token type that is not present in the Microsoft.SqlServer.Management.SqlParser.Parser.Tokens enum, which has caused an error that has been caught in the catch block.

Fun, fun, fun: simple T-SQL parsing. Hope this helps someone in the same position; let me know how you get on.

SQLMidlands & SQLLunch

Many thanks to all those who turned out to see my presentation of "Cursors are Evil" on Thursday (16th of Feb) at SQLMidlands. The scripts I used are here:

https://skydrive.live.com/?cid=4004b6a3bc887e2c&id=4004B6A3BC887E2C%21216

You will need the AdventureWorks2008R2 release to run these; feel free to mail me (dave.ballantyne@live.co.uk) with any questions. They are based upon a series of articles I wrote for SQLServerCentral, which can be found here and here.

Also, I am starting, or at least having an attempt at, a new user group in London. This is SQLLunch, meeting downstairs at The Golden Fleece, EC4N 1SP, which is 2 minutes from Bank Tube. We will have a twice-monthly meeting (2nd and 4th Tuesdays) for an 'All Stuff, No Fluff' event. Put plainly, a quick hello followed by a 45-minute presentation, which will, optimistically, have you there and back to your desk within a lunch hour.

Registrations for the first series of dates are at sqlserverfaq.com

If you would like to speak, then please get in touch.

Hope to see you there. 

[BUG] Inserts to tables with an indexed view can fail

Unfortunately some of the more troubling bugs can be very hard to reproduce succinctly.  Here is one that has been troubling me for a little while :

The issue is in using indexed views with a calculated column. Indexed views, despite their restrictions, are a very handy addition to SQL Server, and materializing views into hard data can certainly improve performance. So, to demonstrate my issue, we will need to build a table and create a view on it.

create table myTable
(
Id integer not null,
InView char(1) not null,
SomeData varchar(255) not null
)
go
Create view vwIxView
with schemabinding
as
Select ID,Somedata,left(SomeData,CHARINDEX('x',SomeData)-1) as leftfromx
from dbo.myTable
Where InView ='Y'
 
As you can see, the view is filtering the data for where InView = 'Y' and is adding a calculated column to do some manipulation of the column 'SomeData'. This column, leftfromx, takes the characters up to, but not including, the first 'x' in the 'SomeData' column.

If we insert some data into the table with:

insert into myTable(Id,InView,SomeData)
select 1,'N','a'

then, unsurprisingly, if we look at the view there will be no data in it.

Now let's add an index to the view:

create unique clustered index pkvwIxView on vwIxView(Id)

The data is now persisted.

Let's now add some more data, the same data, in an ever so slightly different way.

declare @id integer,
@inview char(1),
@Somedata char(50)
select @id = 1, @inview = 'N',@Somedata = 'a'

insert into myTable(Id,InView,SomeData)
select @id,@inview,@Somedata

What is the result ?

[image: error message]

Huh? Well, it's kind of obvious which "LEFT or SUBSTRING function" has errored, but as InView = 'N', why should that piece of code even have been executed? Looking at the estimated plan, we can more easily see the flow of events.

[image: estimated execution plan]

The 'Compute Scalar' operation is where the LEFT is being executed. That is happening before the filter, and as there is no 'x' in the 'SomeData' column, it is unsurprising that the function is erroring. I have tested this on both 2008R2 and 2012 RC0.
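
For what it's worth, a defensive workaround (my own sketch, not from the post, using a hypothetical view name) is to write the computed column so that it cannot error even when it is evaluated before the filter:

Create view vwIxViewSafe
with schemabinding
as
Select ID,SomeData,
       -- NULLIF turns CHARINDEX's 'not found' result of 0 into NULL,
       -- so LEFT never receives a negative length; rows with no 'x' simply get NULL
       left(SomeData, nullif(CHARINDEX('x',SomeData),0) - 1) as leftfromx
from dbo.myTable
Where InView ='Y'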

I have raised a connect item here, if you want to upvote it.

Book review - SQL Server Secret Diary (Know the unknown secrets of SQL Server)

Like a lot of people within the SQL community, I can never read enough on the subject. Books, whitepapers, academic research and blogs can all be valuable sources of information, so whilst browsing Amazon I found this book as a free Kindle download. The preface makes some bold claims indeed:

“This book is for developers who already know SQL Server and want to gain more knowledge in SQL Server.  This book is not for starter who want to start from the beginning.

The problem-solution approach will help you to understand and solve the real-time problems easily.

This Book will teach you (their emphasis)

  • How to solve common real-time problems
  • How to improve performance
  • How to protect your data and code
  • How to reduce your code
  • How to use SQL Server efficiently
  • Advanced topics with simple examples
  • Tips and tricks with sample queries
  • And also teach how to do in the better way.

The last bullet point sets the tone for the quite appalling use of grammar (yes, yes, people in glass houses and all that…) contained throughout the entire book. I get that the authors may use English as a second (or third) language, but where are the proofreaders? That I can live with, though; it's the technical content I really have a problem with. Here is just a small selection:

Q 2) How to use GO statement in SQL Server ?

IMO, the most important concept to understand about GO is that it is not a SQL statement. It is processed on the client (SSMS, ISQL etc.) and splits the workload into separate batches. This is not mentioned here, though, to be fair, in Q3 ("How to repeat the statements without using loops?") the author notes "GO is a client command and not a T-SQL command. So GO or GO <N> can only be used with Microsoft SQL Server client tools."
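
To illustrate the point (my example, not the book's), GO is consumed by the client tool, and GO <N> simply resubmits the preceding batch N times:

create table #BatchDemo (Id integer identity)
go

-- SSMS/sqlcmd send this single-statement batch five times; the server never sees the word GO
insert into #BatchDemo default values
go 5

select count(*) as RowsInserted from #BatchDemo   -- returns 5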

Q 5) How to use ORDER BY clause in view ?

Here the author spends a great deal of time and effort working around "As per RDBMS rule ORDER BY clause is not allowed in view". This section should be thrown away entirely; if you are depending on the view ordering (which is a contradiction in terms) for your result-set ordering, you deserve all the lawsuits that are thrown at you.
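
For completeness, the dependable pattern (my illustration, re-using the Product table from the indexed view example above) is to leave the view unordered and put the ORDER BY on the query that reads from it:

Create view vwProductList
as
Select ProductID,ProductName,Price
from dbo.Product   -- no ORDER BY here: a view defines a set, not a sequence
go
Select ProductID,ProductName,Price
from vwProductList
order by ProductName   -- ordering belongs to the consuming query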

Q 10) How to do case sensitive searches in SQL Server ?

The authors' solution here is to cast a column as varbinary. OK, fair enough, it works. Personally, I would have used COLLATE, but let's not split hairs. The biggest issue I have here is that sargability is not mentioned; we are introducing the possibility of a scan.
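
A sketch of the COLLATE alternative (my example, assuming the default collation is case-insensitive); note that, just like the varbinary cast, applying it to the column side of the predicate can still cost you an index seek:

-- Case-insensitive on a typical default collation: matches 'Smith', 'SMITH', 'smith'...
Select BusinessEntityID,LastName
from Person.Person
where LastName = 'Smith'

-- Case-sensitive comparison by overriding the collation for this predicate
Select BusinessEntityID,LastName
from Person.Person
where LastName COLLATE Latin1_General_CS_AS = 'Smith'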

Q 12) How to solve server time out problem ?

The scenario presented here is that session #1 has updated some data that session #2 needs to read. The author presents two solutions, NOLOCK and READPAST, and, to be fair, does make an attempt at highlighting the dirty reads. My issue here is that, once again, locking is seen as the enemy that must be worked around. We should embrace locks, and understand why they are happening and how they are protecting us. The point is not raised that the fault here lies with the UPDATE'ing transaction not completing in a timely fashion, rather than with the reader being unable to complete because of it. The consequences of reading and processing dirty data are not explored thoroughly enough, and, once again, NOLOCK is used as a "go faster" button.

Q 33 ) How to improve the performance of stored procedure ?

Here we have been given 11 bullet points by the authors, which I have copied verbatim below. My thoughts about each point are inlined after it:

  • Use SET NOCOUNT ON to avoid sending row count information for every statement. So, this can help, but will only have a measurable effect if you have many, many statements, but in that case you are coding SQL wrong anyway.
  • Always use the owner name or schema name before the object name to prevent recompilation of stored procedure.  Does this mean that not referencing the owner or schema (which one is it??) will always cause a recompile of the entire stored procedure? No. The statement, not necessarily the stored procedure, will recompile if the user has a different default schema to the existing compiled statement.
  • Avoid using DISTINCT Just DISTINCT? Anything else? Unnecessary ORDER BY?
  • Minimize the number of columns in SELECT clause So, Select Col1,Col2,Col3 is bad but Select Col1 +’ ‘+ Col2 +’ ‘+Col3 is OK? Better wording here would be “Return only the data that is required by the application, nothing more, nothing less.”
  • Use table variables instead temporary tables. Seriously! What! Come again. As a sweeping general statement: wrong, wrong, wrong.
  • Use the CTE (Common Table Expression) instead of derived tables and table variables as much as possible. Again, a massive over-generalisation. Horses for courses. Also, didn’t you just say that I should use table variables?
  • Avoid using cursors Why? And what should I do instead? I have to get the data out somehow; what alternatives are there?
  • Don’t use duplicate codes, reuse the code by Views and UDF’s  This section is about performance, right? I would like to see one single instance where using a view (presumably unindexed) or a UDF (cough, splutter) improves performance.
  • Begin and commit transactions immediately Better wording would be “Keep transactions as short as possible, and never leave a transaction open while waiting for user input.”
  • Avoid exclusive locks Confusing; in what context?
  • Use table hints BwaaHaa, this is really a Pandora’s box best left alone by the audience of this book.

And so it continues. I'm trying really hard not to be too scathing or nit-picky about this book; there is some good advice here, but SQL Server is full of caveats and confusing and contradictory best practices, and ultimately 90% of the time you can state that "it depends".

Questions are presented with solutions that can work, but they are given as 100% solutions without any warning that that may not always be the case. Even as a free download, it is way too expensive and, remembering the target audience, could ultimately do more harm than good.
