May 2008 - Posts

I cannot emphasise enough the importance of understanding the absolute basic security principles when developing applications that connect to and run SQL against any database product.

If you are doing application development that requires database access and don't fully understand the term "SQL Injection" then really – stop what you are doing; in code you are doing the equivalent of buying a car and driving it away having had no driving lessons and without understanding the basics – accelerator and braking. SQL Injection needs only 15 minutes of your reading time to understand and prevent.

This is not a web server problem and it's not a database problem – it's a human coder problem, and as such it can only be fixed by you.
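To make that concrete, here is a minimal T-SQL sketch of the problem and the fix (the accounts table and @name value are made up for illustration):

-- Vulnerable: user input is concatenated straight into the SQL text
declare @name nvarchar(50), @sql nvarchar(max)
set @name = N'x''; drop table accounts; --'    -- attacker-supplied value
set @sql = N'select * from accounts where name = ''' + @name + N''''
--exec( @sql )    -- would happily run the injected DROP TABLE

-- Safe: the value travels as a typed parameter and is never parsed as SQL
set @sql = N'select * from accounts where name = @name'
exec sp_executesql @sql, N'@name nvarchar(50)', @name = @name

The same principle applies from client code – use parameterised commands (e.g. SqlParameter in ADO.NET) rather than building SQL strings from user input.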

Extract from Buck Woody's blog: http://blogs.msdn.com/buckwoody/archive/2008/05/30/sql-injection-attacks.aspx

"You might have read recently that there have been ongoing SQL injection attacks against vulnerable web applications occurring over the last few months.  These attacks have received recurring attention in the press as they pop up in various geographies around the world. These attacks do not leverage any SQL Server vulnerabilities or any un-patched vulnerabilities in any Microsoft product – the attack vector is vulnerable custom applications. In fact, SQL Injection is a coding issue that can attack any database system, so it's a good idea to learn how to defend against them.

 

In order to help you respond to and defend yourself from these attacks, Microsoft has an authoritative blog including talking points and guidance.  You can find this at http://blogs.technet.com/swi/archive/2008/05/29/sql-injection-attack.aspx. "

 

The SQL Server engine does not work in the same way as the Oracle one; there are lots of differences, and a lot of the "theory" you need to understand is quite different from Oracle's.

A good starting point for a like-for-like discussion is this paper from Microsoft:

Solution Guide for Migrating Oracle on UNIX to SQL Server on Windows

Chapter 5 - Developing: Databases - Migrating the Database Architecture 

http://www.microsoft.com/technet/solutionaccelerators/cits/interopmigration/unix/oracleunixtosql/or2sql05.mspx

Even if you aren't migrating it's a good read.

But the most important reads are probably the Inside SQL Server books (http://www.insidesqlserver.com/) and Ken Henderson's "The Guru's Guide to SQL Server Architecture and Internals".

 

It's going to be a biggy – we are setting our sights on 450 attendees for this one: a big auditorium that can hold 450, a couple of rooms holding 240 and one holding 120, so space is not going to be a problem. We are looking to raise sponsorship to pay for the event so we can make it free to attend again.

Two things. First of all, have a think – does your company fancy sponsoring a highly successful and well attended event? If so, ping me an email at tonyrogerson@torver.net or give me a call on my mobile 0796 816 0362; I'm just finalising the sponsorship pack so more on that later next week.

Secondly, I was at the Hertfordshire Show yesterday – I really enjoyed the entertainment put on by a group (from Norfolk, I think they said) with a collection of different breeds of sheep. I got this picture and I think it's ideal for a caption competition – so send your best to me and there is a SQL Server book in it for the best ones; nothing too rude please :)


Come and join me and many thousands of others at this great yearly event held at Rothamsted Park in Harpenden.

Pipe Bands and Solo Piping, Highland Dancing, Sheepdog demonstrations, Falconry, Children's Circus, Funfair, Classic Cars, Craft Stalls, French Market, Charity Stalls, Commercial Stalls, Caber Toss and Tug of War.

Beneficiaries are the local charities:

Harpenden Lions Life Skills Programme
St. Francis' Hospice, Berkhamsted
Help for Heroes
Local youth charities

Where and when:

Date: Sunday 13th July 2008
Time: 10:00 am to 5:45 pm
Location: Rothamsted Park, Harpenden, AL5 2HU

http://www.harplions.com/

Cost: £3.50 in advance or £6 on the gate.

Using the ROW_NUMBER() function in a non-recursive CTE gives a very big performance degradation because the sequence generation is executed not once, as you'd expect, but once for every row in the anchor's starting results – for example, if the anchor query returns 588 rows to the CTE then the Sequence Generator will be executed for every reference of the CTE multiplied by 588!

Check this query:

with Trans ( row_no, id, salesperson_id, tran_date, clear_date, amount, transaction_types_id, account_id )
as (
    select row_no = row_number() over( order by account_id, tran_date, id ), id, salesperson_id, tran_date, clear_date, amount, transaction_types_id, account_id
    from Transactions
    where account_id between 1000 and 1020
    )
select t1.*,
       running_amount = coalesce(
                         (  select sum( amount )
                            from Trans t2
                            where t2.row_no < t1.row_no
                              and t2.account_id = t1.account_id ), 0 )
                        + amount
from Trans t1
order by account_id, row_no
go


The results of SET STATISTICS PROFILE ON are as follows:

Rows    Execs
588     1     with Trans ( row_no, id, salesperson_id, tran_date, clear_date, amount, transaction_types_id, account_id )  as (      select row_no = row_number() over( order by account_id, tran_date, id ), id, salesperson_id, tran_date, clear_date, amount, transaction_types_id, account_id      from Transactions      where account_id between 1000 and 1020      )  select t1.*,          running_amount = coalesce(                            (  select sum( amount )                              from Trans t2                              where t2.row_no < t1.row_no                                and t2.account_id = t1.account_id ), 0 )                          + amount  from Trans t1  order by account_id, row_no
0       0       |--Compute Scalar(DEFINE:([Expr1016]=CASE WHEN [Expr1008] IS NOT NULL THEN [Expr1014] ELSE (0.00) END+[SQLBits20080301].[dbo].[transactions].[amount]))
588     1            |--Nested Loops(Inner Join, PASSTHRU:(IsFalseOrNull [Expr1008] IS NOT NULL), OUTER REFERENCES:([SQLBits20080301].[dbo].[transactions].[account_id], [Expr1003]))
588     1                 |--Nested Loops(Inner Join, OUTER REFERENCES:([SQLBits20080301].[dbo].[transactions].[account_id], [Expr1003]))
588     1                 |    |--Sequence Project(DEFINE:([Expr1003]=row_number))
0       0                 |    |    |--Compute Scalar(DEFINE:([Expr1033]=(1)))
588     1                 |    |         |--Segment
588     1                 |    |              |--Sort(ORDER BY:([SQLBits20080301].[dbo].[transactions].[account_id] ASC, [SQLBits20080301].[dbo].[transactions].[tran_date] ASC, [SQLBits20080301].[dbo].[transactions].[id] ASC))
588     1                 |    |                   |--Nested Loops(Inner Join, OUTER REFERENCES:([SQLBits20080301].[dbo].[transactions].[id], [Expr1031]) WITH UNORDERED PREFETCH)
588     1                 |    |                        |--Index Seek(OBJECT:([SQLBits20080301].[dbo].[transactions].[ncidx_account_id]), SEEK:([SQLBits20080301].[dbo].[transactions].[account_id] >= (1000) AND [SQLBits20080301].[dbo].[transactions].[account_id] <= (1020)) ORDERED FORWARD)
588     588               |    |                        |--Clustered Index Seek(OBJECT:([SQLBits20080301].[dbo].[transactions].[pk_transactions]), SEEK:([SQLBits20080301].[dbo].[transactions].[id]=[SQLBits20080301].[dbo].[transactions].[id]) LOOKUP ORDERED FORWARD)
588     588               |    |--Index Spool(SEEK:([SQLBits20080301].[dbo].[transactions].[account_id]=[SQLBits20080301].[dbo].[transactions].[account_id] AND [Expr1003]=[Expr1003]))
0       0                 |         |--Compute Scalar(DEFINE:([Expr1008]=CASE WHEN [Expr1036]=(0) THEN NULL ELSE [Expr1037] END))
588     588               |              |--Stream Aggregate(DEFINE:([Expr1036]=Count(*), [Expr1037]=SUM([SQLBits20080301].[dbo].[transactions].[amount])))
8132    588               |                   |--Filter(WHERE:([Expr1007]<[Expr1003] AND [SQLBits20080301].[dbo].[transactions].[account_id]=[SQLBits20080301].[dbo].[transactions].[account_id]))
173166  588               |                        |--Top(TOP EXPRESSION:(CASE WHEN [Expr1003] IS NULL OR [Expr1003]<(0) THEN (0) ELSE [Expr1003] END))
173166  588               |                             |--Sequence Project(DEFINE:([Expr1007]=row_number))
0       0                 |                                  |--Compute Scalar(DEFINE:([Expr1035]=(1)))
173166  588               |                                       |--Segment
173166  588               |                                            |--Sort(ORDER BY:([SQLBits20080301].[dbo].[transactions].[account_id] ASC, [SQLBits20080301].[dbo].[transactions].[tran_date] ASC, [SQLBits20080301].[dbo].[transactions].[id] ASC))
345744  588               |                                                 |--Index Seek(OBJECT:([SQLBits20080301].[dbo].[transactions].[ncidx]), SEEK:([SQLBits20080301].[dbo].[transactions].[account_id] >= (1000) AND [SQLBits20080301].[dbo].[transactions].[account_id] <= (1020)) ORDERED FORWARD)
567     567               |--Index Spool(SEEK:([SQLBits20080301].[dbo].[transactions].[account_id]=[SQLBits20080301].[dbo].[transactions].[account_id] AND [Expr1003]=[Expr1003]))
0       0                      |--Compute Scalar(DEFINE:([Expr1014]=CASE WHEN [Expr1040]=(0) THEN NULL ELSE [Expr1041] END))
567     567                         |--Stream Aggregate(DEFINE:([Expr1040]=Count(*), [Expr1041]=SUM([SQLBits20080301].[dbo].[transactions].[amount])))
8132    567                              |--Filter(WHERE:([Expr1013]<[Expr1003] AND [SQLBits20080301].[dbo].[transactions].[account_id]=[SQLBits20080301].[dbo].[transactions].[account_id]))
167250  567                                   |--Top(TOP EXPRESSION:(CASE WHEN [Expr1003] IS NULL OR [Expr1003]<(0) THEN (0) ELSE [Expr1003] END))
167250  567                                        |--Sequence Project(DEFINE:([Expr1013]=row_number))
0       0                                               |--Compute Scalar(DEFINE:([Expr1039]=(1)))
167250  567                                                  |--Segment
167250  567                                                       |--Sort(ORDER BY:([SQLBits20080301].[dbo].[transactions].[account_id] ASC, [SQLBits20080301].[dbo].[transactions].[tran_date] ASC, [SQLBits20080301].[dbo].[transactions].[id] ASC))
333396  567                                                            |--Index Seek(OBJECT:([SQLBits20080301].[dbo].[transactions].[ncidx]), SEEK:([SQLBits20080301].[dbo].[transactions].[account_id] >= (1000) AND [SQLBits20080301].[dbo].[transactions].[account_id] <= (1020)) ORDERED FORWARD)


The CTE should return just 588 rows because that is what the predicate "where account_id between 1000 and 1020" yields.

However, look at the number of rows processed in the STATISTICS PROFILE output above – 333,396 rows; ouch! Divide 333,396 by 567, or 345,744 by 588, and you get 588, our target result; so on every execution the ROW_NUMBER is recalculated over the 588 rows instead of the sequence being produced once and reused. This is also confirmed by the "Sequence Project(DEFINE:" steps above, which are executed 588 times.

When it boils down to it, the actual problem is that the CTE anchor is simply expanded inline into the main query – which is bad, as I demonstrated in my previous entry on CTE performance.

The problem does not occur when using a Recursive Member.

Recommendation: use the # table route as I suggested before.
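As a minimal sketch of that route (same Transactions query as above), materialise the anchor – including the ROW_NUMBER – into a # table once, index it, then run the running-total query against that:

select row_no = row_number() over( order by account_id, tran_date, id ),
       id, salesperson_id, tran_date, clear_date, amount, transaction_types_id, account_id
into #trans
from Transactions
where account_id between 1000 and 1020

-- an index so the correlated sub-query can seek rather than scan
create unique clustered index clidx on #trans ( account_id, row_no )

select t1.*,
       running_amount = coalesce(
                         (  select sum( amount )
                            from #trans t2
                            where t2.row_no < t1.row_no
                              and t2.account_id = t1.account_id ), 0 )
                        + amount
from #trans t1
order by account_id, row_no

drop table #trans

The sequence is now computed exactly once, at load time, no matter how many times the result is referenced.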

 

Unless you are using recursion the Common Table Expression sucks and you shouldn't use it; the big problem is that the SQL in the anchor is repeated rather than spooled into a work table.

In the last article I went through the basics of the Common Table Expression and CTE recursion and hinted that there may be performance problems; in this article I delve a bit deeper – specifically, we see how self-joins on a non-recursive CTE are very bad for performance.

Looking at our example query below, we are using the Common Table Expression with a self-join and additionally a sub-query on the select; don't look too hard at what the query actually does because it pretty much doesn't make any sense – the important aspects to concentrate on are the self-join and the sub-query.

Example CTE:

with ctePerf ( objname, colid, colname, max_length )
as (
      select objname = o.name,
             colid   = c.column_id,
             colname = c.name,
             c.max_length
      from #sysobjs o
            inner join #syscols c on c.object_id=o.object_id
      where o.type='U'
        and o.name= 'accounts'
  )
select p.objname, p.colname, p.colid,
       size_sofar = SUM( p2.max_length ),
       size_left  = ( SELECT SUM( p3.max_length )
                      FROM ctePerf p3
                      WHERE p3.objname=p.objname
                        and p3.colid > p.colid
                    )
from ctePerf p
      left outer join ctePerf p2 on p2.objname=p.objname
                                and p2.colid <=p.colid
group by p.objname, p.colname, p.colid
order by objname, colid

Let’s look at the query plan for the query we use as the CTE anchor:

CTE “anchor” query and plan:

select objname = o.name,
       colid   = c.column_id,
       colname = c.name,
       c.max_length
from #sysobjs o
      inner join #syscols c on c.object_id=o.object_id
where o.type='U'
  and o.name= 'accounts'

  |--Hash Match(Inner Join, HASH:([o ].[object_id])=([c ].[object_id]))
       |--Table Scan(OBJECT:([tempdb].[dbo].[#sysobjs] AS [o ]), WHERE:([tempdb].[dbo].[#sysobjs].[type] as [o ].[type]='U' AND [tempdb].[dbo].[#sysobjs].[name] as [o ].[name]=N'accounts'))
       |--Table Scan(OBJECT:([tempdb].[dbo].[#syscols] AS [c ]))

Check the query plan after incorporating the query into the CTE (below); before looking at the plan, note that we reference the CTE anchor 3 times (aliases p, p2 and p3):

Query and Execution Plan:

with ctePerf ( objname, colid, colname, max_length )
as (
      select objname = o.name,
             colid   = c.column_id,
             colname = c.name,
             c.max_length
      from #sysobjs o
            inner join #syscols c on c.object_id=o.object_id
      where o.type='U'
        and o.name= 'accounts'
  )
select p.objname, p.colname, p.colid,
       size_sofar = SUM( p2.max_length ),
       size_left  = ( SELECT SUM( p3.max_length )
                      FROM ctePerf p3
                      WHERE p3.objname=p.objname
                        and p3.colid > p.colid
                    )
from ctePerf p
      left outer join ctePerf p2 on p2.objname=p.objname
                                and p2.colid <=p.colid
group by p.objname, p.colname, p.colid
order by objname, colid

  |--Compute Scalar(DEFINE:([Expr1022]=[Expr1022]))
       |--Nested Loops(Inner Join, OUTER REFERENCES:([o ].[name], [c ].[column_id]))
            |--Compute Scalar(DEFINE:([Expr1015]=CASE WHEN [Expr1035]=(0) THEN NULL ELSE [Expr1036] END))
            |    |--Stream Aggregate(GROUP BY:([c ].[column_id], [c ].[name]) DEFINE:([Expr1035]=COUNT_BIG([tempdb].[dbo].[#syscols].[max_length] as [c ].[max_length]), [Expr1036]=SUM([tempdb].[dbo].[#syscols].[max_length] as [c ].[max_length]), [o ].[name]=ANY
            |         |--Nested Loops(Left Outer Join, WHERE:([tempdb].[dbo].[#syscols].[column_id] as [c ].[column_id]<=[tempdb].[dbo].[#syscols].[column_id] as [c ].[column_id]))
            |              |--Sort(ORDER BY:([c ].[column_id] ASC, [c ].[name] ASC))
            |              |    |--Hash Match(Inner Join, HASH:([o ].[object_id])=([c ].[object_id]))
            |              |         |--Table Scan(OBJECT:([tempdb].[dbo].[#sysobjs] AS [o ]), WHERE:([tempdb].[dbo].[#sysobjs].[type] as [o ].[type]='U' AND [tempdb].[dbo].[#sysobjs].[name] as [o ].[name]=N'accounts'))
            |              |         |--Table Scan(OBJECT:([tempdb].[dbo].[#syscols] AS [c ]))
            |              |--Table Spool
            |                   |--Hash Match(Inner Join, HASH:([o ].[object_id])=([c ].[object_id]))
            |                        |--Table Scan(OBJECT:([tempdb].[dbo].[#sysobjs] AS [o ]), WHERE:([tempdb].[dbo].[#sysobjs].[type] as [o ].[type]='U' AND [tempdb].[dbo].[#sysobjs].[name] as [o ].[name]=N'accounts'))
            |                        |--Compute Scalar(DEFINE:([c ].[max_length]=[tempdb].[dbo].[#syscols].[max_length] as [c ].[max_length]))
            |                             |--Table Scan(OBJECT:([tempdb].[dbo].[#syscols] AS [c ]))
            |--Index Spool(SEEK:([o ].[name]=[tempdb].[dbo].[#sysobjs].[name] as [o ].[name] AND [c ].[column_id]=[tempdb].[dbo].[#syscols].[column_id] as [c ].[column_id]))
                 |--Compute Scalar(DEFINE:([Expr1022]=CASE WHEN [Expr1037]=(0) THEN NULL ELSE [Expr1038] END))
                      |--Stream Aggregate(DEFINE:([Expr1037]=Count(*), [Expr1038]=SUM([tempdb].[dbo].[#syscols].[max_length] as [c ].[max_length])))
                           |--Hash Match(Inner Join, HASH:([o ].[object_id])=([c ].[object_id]))
                                |--Table Scan(OBJECT:([tempdb].[dbo].[#sysobjs] AS [o ]), WHERE:([tempdb].[dbo].[#sysobjs].[name] as [o ].[name]=[tempdb].[dbo].[#sysobjs].[name] as [o ].[name] AND [tempdb].[dbo].[#sysobjs].[type] as [o ].[type]='U' AND [tempdb
                                |--Table Scan(OBJECT:([tempdb].[dbo].[#syscols] AS [c ]), WHERE:([tempdb].[dbo].[#syscols].[column_id] as [c ].[column_id]>[tempdb].[dbo].[#syscols].[column_id] as [c ].[column_id]))


If you compare the above query plan to the first (the plan for just the anchor) you can see the same 'anchor' query has pretty much been expanded inline into three parts of the query plan – take a look back at the query and note there are 3 references to the anchor!

The output below is a subset of the information available when you use SET STATISTICS PROFILE ON; I've not bothered posting the full STATS PROFILE, instead truncating the output for readability – I'm more concerned with the Rows and Executes columns than the query tree.

Table 'Worktable'. Scan count 11, logical reads 56, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Table '#syscols'. Scan count 7, logical reads 35, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Table '#sysobjs'. Scan count 7, logical reads 21, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Rows  Executions
5      1      with ctePerf ( objname, colid, olid
0      0        |--Compute Scalar(DEFINE:([Expr1022]=[Expr1022]))
5      1             |--Nested Loops(Inner Join, OUTER REFERENCES:([o ].[name], [c ].[column_id]))
0      0                  |--Compute Scalar(DEFINE:([Expr1015]=CASE WHEN [Expr1035]=(0) THEN NUL
5      1                  |    |--Stream Aggregate(GROUP BY:([c ].[column_id], [c ].[name]) DEFINE
15     1                  |         |--Nested Loops(Left Outer Join, WHERE:([tempdb].[dbo].[#sys
5      1                  |              |--Sort(ORDER BY:([c ].[column_id] ASC, [c ].[name] ASC))
5      1                  |              |    |--Hash Match(Inner Join, HASH:([o ].[object_id])=(
1      1                  |              |         |--Table Scan(OBJECT:([tempdb].[dbo].[#sysobjs] AS [o ])
453    1                  |              |         |--Table Scan(OBJECT:([tempdb].[dbo].[#syscols] AS [c ]))
25     5                  |              |--Table Spool
5      1                  |                   |--Hash Match(Inner Join, HASH:([o ].[object_id])=([c ].[object_id]))
1      1                  |                        |--Table Scan(OBJECT:([tempdb].[dbo].[#sysobjs] AS [o ]),
0      0                  |                        |--Compute Scalar(DEFINE:([c ].[max_length]=[tempdb].[dbo].
453    1                  |                             |--Table Scan(OBJECT:([tempdb].[dbo].[#syscols] AS [c ]))
5      5                  |--Index Spool(SEEK:([o ].[name]=[tempdb].[dbo].[#sysobjs].[name] as [o ].[name] AND
0      0                       |--Compute Scalar(DEFINE:([Expr1022]=CASE WHEN [Expr1037]=(0) THEN NULL ELSE [Expr1038] END))
5      5                            |--Stream Aggregate(DEFINE:([Expr1037]=Count(*), [Expr1038]=SUM([tempdb].[dbo].[#syscols]
10     5                                 |--Hash Match(Inner Join, HASH:([o ].[object_id])=([c ].[object_id]))
5      5                                      |--Table Scan(OBJECT:([tempdb].[dbo].[#sysobjs] AS [o ]), WHERE:([tempdb].[dbo].[#sysobjs]
1496   5                                      |--Table Scan(OBJECT:([tempdb].[dbo].[#syscols] AS [c ]), WHERE:([tempdb].[dbo].[#syscols]

Notice the number of rows returned for each of the table scan steps – it's running the complete anchor query every time; the 1496 is higher because of the extra predicate in the sub-query.

Let’s use a temporary table for the anchor instead; first pump the query into a temporary table then use that in the CTE...

select objname = o.name,
       colid   = c.column_id,
       colname = c.name,
       c.max_length
into #anchor
from #sysobjs o
      inner join #syscols c on c.object_id=o.object_id
where o.type='U'
  and o.name= 'accounts'

;with ctePerf ( objname, colid, colname, max_length )
as (
      select objname, colid, colname, max_length
      from #anchor
  )
select p.objname, p.colname, p.colid,
       size_sofar = SUM( p2.max_length ),
       size_left  = ( SELECT SUM( p3.max_length )
                      FROM ctePerf p3
                      WHERE p3.objname=p.objname
                        and p3.colid > p.colid
                    )
from ctePerf p
      left outer join ctePerf p2 on p2.objname=p.objname
                                and p2.colid <=p.colid
group by p.objname, p.colname, p.colid
order by objname, colid
go

drop table #anchor

 

Execution plans...

select objname = o.name,
       colid   = c.column_id,
       colname = c.name,
       c.max_length
into #anchor
from #sysobjs o
      inner join #syscols c on c.object_id=o.object_id
where o.type='U'
  and o.name= 'accounts'

Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table '#syscols'. Scan count 1, logical reads 5, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table '#sysobjs'. Scan count 1, logical reads 3, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Rows  Executions
5      1      select objname = o.name,       colid   = c.column_id,      colname = c.name,
5      1        |--Table Insert(OBJECT:([#anchor]), SET:([#anchor].[objname] = [tempdb].[d
5      1             |--Top(ROWCOUNT est 0)
5      1                  |--Hash Match(Inner Join, HASH:([o ].[object_id])=([c ].[object_id]))
1      1                       |--Table Scan(OBJECT:([tempdb].[dbo].[#sysobjs] AS [o ]), WHERE:([tempdb].
453    1                       |--Table Scan(OBJECT:([tempdb].[dbo].[#syscols] AS [c ]))

 

;with ctePerf ( objname, colid, colname, max_length )
as (
      select objname, colid, colname, max_length
      from #anchor
  )
select p.objname, p.colname, p.colid,
       size_sofar = SUM( p2.max_length ),
       size_left  = ( SELECT SUM( p3.max_length )
                      FROM ctePerf p3
                      WHERE p3.objname=p.objname
                        and p3.colid > p.colid
                    )
from ctePerf p
      left outer join ctePerf p2 on p2.objname=p.objname
                                and p2.colid <=p.colid
group by p.objname, p.colname, p.colid
order by objname, colid

Table '#anchor'. Scan count 7, logical reads 11, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Rows  Executions
5     1       with ctePerf ( objname, colid, colname, max_length
0     0         |--Compute Scalar(DEFINE:([Expr1016]=[Expr1016]))
5     1              |--Nested Loops(Inner Join, OUTER REFERENCES:([tempdb].[dbo].[#a
0     0                   |--Compute Scalar(DEFINE:([Expr1011]=CASE WHEN [Expr1029]=(
5     1                   |    |--Stream Aggregate(GROUP BY:([tempdb].[dbo].[#anchor]
15    1                   |         |--Nested Loops(Left Outer Join, OUTER REFERENCES
5     1                   |              |--Sort(ORDER BY:([tempdb].[dbo].[#anchor].[
5     1                   |              |    |--Table Scan(OBJECT:([tempdb].[dbo].[#
0     0                   |              |--Compute Scalar(DEFINE:([tempdb].[dbo].[#a
15    5                   |                   |--Table Scan(OBJECT:([tempdb].[dbo].[#
0     0                   |--Compute Scalar(DEFINE:([Expr1016]=CASE WHEN [Expr1031]=(
5     5                        |--Stream Aggregate(DEFINE:([Expr1031]=Count(*), [Expr
10    5                             |--Table Scan(OBJECT:([tempdb].[dbo].[#anchor]),

Comparing the results above you can clearly see that the IO is a lot less – 19 logical reads using the # table (3 + 5 for the insert plus 11 for the CTE query) against 112 (21 + 35 + 56) just using the CTE; the Rows column in the STATISTICS PROFILE output tells the same story.

An example on more realistic volumes makes the difference even clearer. The example below is on a transaction table and gives running totals for a subset of accounts.

with Trans ( row_no, id, salesperson_id, tran_date, clear_date, amount, transaction_types_id, account_id )
as (
    select row_no = id, id, salesperson_id, tran_date, clear_date, amount, transaction_types_id, account_id
    from Transactions
    where account_id between 1000 and 1020
    )
select t1.*,
       running_amount = coalesce(
                         (  select sum( amount )
                            from Trans t2
                            where t2.row_no < t1.row_no
                              and t2.account_id = t1.account_id ), 0 )
                        + amount
from Trans t1
order by account_id, row_no


Table 'transactions'. Scan count 1156, logical reads 54084, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Rows  Executions
588    1      with Trans ( row_no, id, salesperson_id, tran_date, clear_date, amount, transaction_types_i
0      0        |--Compute Scalar(DEFINE:([Expr1013]=CASE WHEN [Expr1006] IS NOT NULL THEN [Expr1011] ELS
588    1             |--Nested Loops(Inner Join, PASSTHRU:(IsFalseOrNull [Expr1006] IS NOT NULL), OUTER R
588    1                  |--Nested Loops(Inner Join, OUTER REFERENCES:([SQLBits20080301].[dbo].[transact
588    1                  |    |--Nested Loops(Inner Join, OUTER REFERENCES:([SQLBits20080301].[dbo].[tra
588    1                  |    |    |--Index Seek(OBJECT:([SQLBits20080301].[dbo].[transactions].[ncidx_a
588    588                |    |    |--Clustered Index Seek(OBJECT:([SQLBits20080301].[dbo].[transactions
0      0                  |    |--Compute Scalar(DEFINE:([Expr1006]=CASE WHEN [Expr1027]=(0) THEN NULL EL
588    588                |         |--Stream Aggregate(DEFINE:([Expr1027]=Count(*), [Expr1028]=SUM([SQLB
8132   588                |              |--Nested Loops(Inner Join, OUTER REFERENCES:([SQLBits20080301].
8132   588                |                   |--Index Seek(OBJECT:([SQLBits20080301].[dbo].[transactions
8132   8132               |                   |--Clustered Index Seek(OBJECT:([SQLBits20080301].[dbo].[tr
0      0                  |--Compute Scalar(DEFINE:([Expr1011]=CASE WHEN [Expr1029]=(0) THEN NULL ELSE [E
567    567                     |--Stream Aggregate(DEFINE:([Expr1029]=Count(*), [Expr1030]=SUM([SQLBits20
8132   567                          |--Nested Loops(Inner Join, OUTER REFERENCES:([SQLBits20080301].[dbo]
8132   567                               |--Index Seek(OBJECT:([SQLBits20080301].[dbo].[transactions].[nc
8132   8132                              |--Clustered Index Seek(OBJECT:([SQLBits20080301].[dbo].[transact

Same query, but with the anchor sourced from a temporary table...

select row_no = id, id, salesperson_id, tran_date, clear_date, amount, transaction_types_id, account_id
into #t
from Transactions
where account_id between 1000 and 1020;

create unique clustered index clidx on #t ( account_id, row_no )

Table 'transactions'. Scan count 1, logical reads 1813, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

(588 row(s) affected)

Table '#t'. Scan count 1, logical reads 5, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Rows  Executions
588    1      select row_no = id, id, salesperson_id, tran_date, clear_date, amount, transaction_types_id, account_id
588    1        |--Table Insert(OBJECT:([#t]), SET:([#t].[row_no] = [SQLBits20080301].[dbo].[transactions].[id],[#t].
588    1             |--Top(ROWCOUNT est 0)
588    1                  |--Nested Loops(Inner Join, OUTER REFERENCES:([SQLBits20080301].[dbo].[transactions].[id], [Exp
588    1                       |--Index Seek(OBJECT:([SQLBits20080301].[dbo].[transactions].[ncidx_account_id]), SEEK:([S
588    588                     |--Clustered Index Seek(OBJECT:([SQLBits20080301].[dbo].[transactions].[pk_transactions]),

;with Trans ( row_no, id, salesperson_id, tran_date, clear_date, amount, transaction_types_id, account_id )
as (
    select *
    from #t
    )
select t1.*,
       running_amount = coalesce(
                         (  select sum( amount )
                            from Trans t2
                            where t2.row_no < t1.row_no
                              and t2.account_id = t1.account_id ), 0 )
                        + amount
from Trans t1
order by account_id, row_no
go

Table '#t'. Scan count 1156, logical reads 2400, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Rows  Executions
588    1      with Trans ( row_no, id, salesperson_id, tran_date, clear_date, amount, transaction_types_id, account_
0      0        |--Compute Scalar(DEFINE:([Expr1013]=CASE WHEN [Expr1006] IS NOT NULL THEN [Expr1011] ELSE (0.00) EN
588    1             |--Nested Loops(Inner Join, PASSTHRU:(IsFalseOrNull [Expr1006] IS NOT NULL), OUTER REFERENCES:(
588    1                  |--Nested Loops(Inner Join, OUTER REFERENCES:([tempdb].[dbo].[#t].[row_no], [tempdb].[dbo]
588    1                  |    |--Clustered Index Scan(OBJECT:([tempdb].[dbo].[#t]), ORDERED FORWARD)
0      0                  |    |--Compute Scalar(DEFINE:([Expr1006]=CASE WHEN [Expr1026]=(0) THEN NULL ELSE [Expr1027] END))
588    588                |         |--Stream Aggregate(DEFINE:([Expr1026]=Count(*), [Expr1027]=SUM([tempdb].[dbo].[
8132   588                |              |--Clustered Index Seek(OBJECT:([tempdb].[dbo].[#t]), SEEK:([tempdb].[dbo].
0      0                  |--Compute Scalar(DEFINE:([Expr1011]=CASE WHEN [Expr1028]=(0) THEN NULL ELSE [Expr1029] END))
567    567                     |--Stream Aggregate(DEFINE:([Expr1028]=Count(*), [Expr1029]=SUM([tempdb].[dbo].[#t].[
8132   567                          |--Clustered Index Seek(OBJECT:([tempdb].[dbo].[#t]), SEEK:([tempdb].[dbo].[#t].

As you can see, the pure CTE method costs 54,084 logical reads whereas the # table method costs just 4,218 (1,813 + 5 + 2,400); look too at the differences in the Rows and Executions columns between the two methods. My real query actually uses the ROW_NUMBER function, which highlights yet another performance problem – but I'll save that for another post.

There is another problem here that will work against you in the performance stakes – statistics! The CTE expands the anchor SQL into the other parts of the query like an inline macro, so SQL Server doesn't have statistics on the subset of rows the anchor produces. By using the temporary table you've pulled the results out and can create indexes on that # table, which the optimiser will then use when it compiles the statement containing the CTE.

I think my only advice here is: unless your application has a high number of concurrent users and you are suffering COMPILE locks because of recompiles, don't use Common Table Expressions unless you need recursion; the problem described above does not exist when you use recursion because of the way recursion works. And if you are simply using the # table in the anchor as I describe above then don't use the CTE at all – just use the # table in the first place!

 

Recursion in Common Table Expressions (CTEs) – how does it work? How can I use it with the adjacency list model (parent/child columns)? In this blog entry I show how recursion works and how to use it with the adjacency list model, and talk about other aspects and uses for CTEs; I also demonstrate some of the areas where CTEs aren't that good and how to remedy the performance penalty suffered.

This weekend I’ve been at the Scottish Developers Conference in Glasgow – a great day, I gave my making the leap to advanced TSQL session; as ever like the last time I ran out of time even though I’d cut a lot of stuff out. Congrats to Colin MacKay, Martin Bell and John Thomson for a great and well organised event – see http://www.scottishdevelopers.com/ for more details include slides etc.

CTE Basics

declare @howmany int
set @howmany = 5

declare @tb table( i int not null )

insert @tb values ( 1 )
insert @tb values ( 4 )
insert @tb values ( 9 )

; WITH LoopIT( org_i, i, recurse_depth, descr )
AS (
      --    The anchor is the starting point for possible recursion
      SELECT org_i = i,
             i = i,
             recurse_depth = CAST( 0 as tinyint ),
             DESCR = CAST( 'anchor' as varchar(100) )
      FROM @tb
)
select org_i, i, recurse_depth, descr
from LoopIT
order by org_i, i
go

The SELECT inside the WITH clause above is our actual query; the fluff around it – the WITH clause and the final SELECT from LoopIT – is the CTE framework.

The CTE is nothing but a derived table (think of it like a view) that is expanded into the main query; unlike a derived table, though, with a CTE you can do self-joins and recursion.
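To see the equivalence, here is the same (non-recursive) CTE from above written as a derived table – a sketch only, since recursion obviously wouldn't be possible in this form:

select org_i, i, recurse_depth, descr
from ( select org_i = i,
              i = i,
              recurse_depth = cast( 0 as tinyint ),
              descr = cast( 'anchor' as varchar(100) )
       from @tb ) as LoopIT
order by org_i, i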

Look at the query plan for the entire CTE structure above...

  |--Compute Scalar(DEFINE:([Expr1004]=(0), [Expr1005]='anchor'))
       |--Table Scan(OBJECT:(@tb))

Now compare it to just the query that we term the anchor:

SELECT org_i = i,
       i = i,
       recurse_depth = CAST( 0 as tinyint ),
       DESCR = CAST( 'anchor' as varchar(100) )
FROM @tb

 

  |--Compute Scalar(DEFINE:([Expr1004]=(0), [Expr1005]='anchor'))
       |--Table Scan(OBJECT:(@tb))

It’s important to remember that CTE’s offer no direct performance benefits in fact we will see later that they can impact performance.

Recursion Basics

The two main features of CTEs are that you can self-join and also perform recursion.

Consider the query below, which shows recursion and a subtle self-join (the self-join being the bottom query in the UNION ALL, where we "self-join" with LoopIT):

--  Recursion
declare @howmany int
set @howmany = 5

declare @tb table( i int not null )
declare @tb2 table( i int not null )

insert @tb values ( 1 )
insert @tb values ( 4 )
insert @tb values ( 9 )

insert @tb2 values ( 4 )

; WITH LoopIT( org_i, i, recurse_depth, descr )
AS (
      --    The anchor is the starting point for possible recursion
      --    The anchor can use UNION [ALL], INTERSECT, EXCEPT
      SELECT org_i = i,
             i = i,
             recurse_depth = CAST( 0 as tinyint ),
             DESCR = CAST( 'anchor' as varchar(100) )
      FROM @tb

      EXCEPT

      SELECT org_i = i,
             i = i,
             recurse_depth = CAST( 0 as tinyint ),
             DESCR = CAST( 'anchor' as varchar(100) )
      FROM @tb2

      --    Recursive Member; any rows returned are inserted into the CTE
      --    and are then available (seen) on the next recursion
      UNION ALL

      SELECT org_i = org_i,
             i = i + 1,
             recurse_depth = recurse_depth + cast( 1 as tinyint ),
             descr = CAST( 'recursed' as varchar(100) )
      FROM LoopIT
      WHERE i < @howmany
)
select org_i, i, recurse_depth, descr
from LoopIT
order by org_i, i
go

 

The starting point is the first query – the anchor. The anchor is iterated through row by row (think cursor); each row is passed into the recursive member query, and any results are first saved to the final result set and then used as input to the next execution of the recursive member query. This continues until the recursive member query returns no more rows, at which point it moves on to the next row from the anchor query.
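One aside worth knowing: recursion depth is capped at 100 levels by default and the statement errors if the cap is exceeded; you can change (or remove) the cap with a query hint on the outer SELECT, for example:

select org_i, i, recurse_depth, descr
from LoopIT
order by org_i, i
option ( maxrecursion 0 )    -- 0 removes the limit entirely; use with care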

Below is an illustration in TSQL of just how the Common Table Expression recursion works logically:

declare @tb table( i int not null )
declare @tb2 table( i int not null )

insert @tb values ( 1 )
insert @tb values ( 4 )
insert @tb values ( 9 )

insert @tb2 values ( 4 )

--    Create the anchor
SELECT org_i = i,
       i = i,
       recurse_depth = CAST( 0 as tinyint ),
       DESCR = CAST( 'anchor' as varchar(100) )
INTO #anchor
FROM @tb

EXCEPT

SELECT org_i = i,
       i = i,
       recurse_depth = CAST( 0 as tinyint ),
       DESCR = CAST( 'anchor' as varchar(100) )
FROM @tb2

--    Create the interim recursed set
SELECT *
INTO #recurse_set
FROM #anchor
WHERE 1 = 0

--    Previous recurse set
SELECT *
INTO #recurse_set_previous
FROM #anchor
WHERE 1 = 0

--    Final result
SELECT *
INTO #result
FROM #anchor
WHERE 1 = 0

TRUNCATE TABLE #recurse_set
TRUNCATE TABLE #recurse_set_previous
TRUNCATE TABLE #result

DECLARE @howmany int
SET @howmany = 5

DECLARE @org_i int, @i int, @recurse_depth int, @descr varchar(100)

DECLARE anchor_members CURSOR FOR
      SELECT *
      FROM #anchor

OPEN anchor_members

FETCH NEXT FROM anchor_members
      INTO @org_i, @i, @recurse_depth, @descr

WHILE @@FETCH_STATUS = 0
BEGIN
      --    Save anchor to result
      INSERT #result ( org_i, i, recurse_depth, descr )
            SELECT @org_i, @i, @recurse_depth, @descr

      --    Prime recursion by creating the single row Anchor set.
      TRUNCATE TABLE #recurse_set

      INSERT #recurse_set ( org_i, i, recurse_depth, descr )
            SELECT @org_i, @i, @recurse_depth, @descr

      WHILE 1 = 1 -- Do forever (in the CTE it stops at 100 by default, changeable using MAXRECURSION)
      BEGIN
            --    Create the previous set
            TRUNCATE TABLE #recurse_set_previous

            INSERT #recurse_set_previous ( org_i, i, recurse_depth, descr )
                  SELECT org_i, i, recurse_depth, descr
                  FROM #recurse_set

            --    Recurse creating the new set; if no rows then recursion has finished,
            --    stop and do the next anchor row
            TRUNCATE TABLE #recurse_set

            INSERT #recurse_set ( org_i, i, recurse_depth, descr )
                  SELECT org_i = org_i,
                         i = i + 1,
                         recurse_depth = recurse_depth + cast( 1 as tinyint ),
                         descr = CAST( 'recursed' as varchar(100) )
                  FROM #recurse_set_previous --LoopIT
                  WHERE i < @howmany  --  Adds in new rows to the CTE spool table (temporary table) (LoopIT)

            IF @@ROWCOUNT = 0
                  BREAK

            --    Save recurse result set into final result (UNION ALL) then recurse again
            INSERT #result ( org_i, i, recurse_depth, descr )
                  SELECT org_i, i, recurse_depth, descr
                  FROM #recurse_set
      END

      --    Get the next Anchor
      FETCH NEXT FROM anchor_members
            INTO @org_i, @i, @recurse_depth, @descr
END

CLOSE anchor_members
DEALLOCATE anchor_members

SELECT *
FROM #result

DROP TABLE #recurse_set, #recurse_set_previous, #result, #anchor

 

Using a Common Table Expression to traverse the Adjacency List Model

It’s common practice to use either the adjacency List model (parent/child columns) or the materialised path (root.node.node.node) to store hierarchies.

The problem with the adjacency list approach is that you need many self-joins to iterate the hierarchy (a join per level – see the sketch below) or a cursor as above.
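For instance, reaching just two levels below the root with plain self-joins looks like this (a sketch against the adjacency_list table created below – one extra join for every extra level you want to see):

select root.parent_extractedword_id,
       level1.child_extractedword_id as level1_child,
       level2.child_extractedword_id as level2_child
from adjacency_list root
      inner join adjacency_list level1 on level1.parent_extractedword_id = root.child_extractedword_id
      inner join adjacency_list level2 on level2.parent_extractedword_id = level1.child_extractedword_id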

You can now use a CTE; see example below...

create table adjacency_list (
      parent_extractedword_id int not null,
      child_extractedword_id  int not null,
            constraint pk_adjacency_list primary key clustered( parent_extractedword_id, child_extractedword_id )
      )

create table extractedword (
      id    int not null constraint pk_extractedword primary key clustered,
      word varchar(50) not null
      )

insert extractedword( id, word ) values( 11, 'demo' )
insert extractedword( id, word ) values( 12, 'on' )
insert extractedword( id, word ) values( 13, 'cte' )
insert extractedword( id, word ) values( 14, 'recursion' )

insert extractedword( id, word ) values( 21, 'does' )
insert extractedword( id, word ) values( 22, 'this' )
insert extractedword( id, word ) values( 23, 'rock' )
insert extractedword( id, word ) values( 24, 'ya' )
insert extractedword( id, word ) values( 25, 'boat?' )

insert adjacency_list ( parent_extractedword_id, child_extractedword_id ) values( 11, 12 )
insert adjacency_list ( parent_extractedword_id, child_extractedword_id ) values( 12, 13 )
insert adjacency_list ( parent_extractedword_id, child_extractedword_id ) values( 13, 14 )

insert adjacency_list ( parent_extractedword_id, child_extractedword_id ) values( 21, 22 )
insert adjacency_list ( parent_extractedword_id, child_extractedword_id ) values( 22, 23 )
insert adjacency_list ( parent_extractedword_id, child_extractedword_id ) values( 23, 24 )
insert adjacency_list ( parent_extractedword_id, child_extractedword_id ) values( 24, 25 )

; WITH WordStruct( org_parent_id, child_id, recurse_depth )
AS (
      --    The anchor is the starting point for possible recursion
      SELECT org_parent_id = parent_extractedword_id,
             child_id      = child_extractedword_id,
             recurse_depth = CAST( 0 as tinyint )
      FROM adjacency_list l1
      WHERE NOT EXISTS ( SELECT *         --    The Anchor should only contain root PARENTs
                         FROM adjacency_list l2
                         WHERE l2.child_extractedword_id = l1.parent_extractedword_id )

      --    Recursive Member
      UNION ALL

      SELECT org_parent_id = l1.child_id,
             child_id      = l2.child_extractedword_id,
             recurse_depth = recurse_depth + cast( 1 as tinyint )
      FROM WordStruct l1
            INNER JOIN adjacency_list l2 on l2.parent_extractedword_id = l1.child_id
)
select e.id, e.word, ws.*,
       case when org_parent_id=e.id then 'ROOT' else 'CHILD.DEPTH=' + CAST( recurse_depth + 1 as varchar(5) ) end
from WordStruct ws
      INNER JOIN extractedword e on e.id = ws.child_id
                                 or ( ws.recurse_depth = 0        --    Do this to get the desc for the root parent
                                  and e.id = ws.org_parent_id )
order by ws.org_parent_id, ws.child_id
go

 

First we step through the hierarchy collecting the surrogate keys; then, in the final SELECT over the CTE, we join to the extractedword table to pick up the node descriptions.

Performance Problems

The CTE is nothing but expanded SQL; by that I mean the CTE anchor is not materialised into a work table (temporary table). Unfortunately this presents us with a number of problems, which will hopefully be resolved in future versions of SQL Server.

The two specific ones I want to talk about are:

- CTE self-join in the result
- Use of ROW_NUMBER in the CTE – applied before filtering – ouch!

I’ll save those for another article in the coming week and I’ll delve into IO Statistics and the query plans and what and why it’s happening – oh, strategies for dealing with it too.

Update 2008-05-18:

So I got round to doing the performance posts too:

When not using recursion but self-joining with the CTE, performance sucks:

http://sqlblogcasts.com/blogs/tonyrogerson/archive/2008/05/17/non-recursive-common-table-expressions-performance-sucks-1-cte-self-join-cte-sub-query-inline-expansion.aspx

When not using recursion and using ROW_NUMBER in the CTE anchor, it's executed inefficiently, causing a massive amount of additional resource usage:

http://sqlblogcasts.com/blogs/tonyrogerson/archive/2008/05/17/non-recursive-common-table-expressions-performance-sucks-2-row-number-is-executed-number-of-cte-references-x-number-of-rows-from-the-anchor.aspx

 

Say you have an external file you want to load into your SQL Server database; you need to load, verify, cleanse and insert, and at the same time keep a cross reference between the original source file records and the inserted database rows.

The OUTPUT clause is an ideal candidate to help here; most people use the IDENTITY property for surrogate keys, and the OUTPUT clause lets you get at the IDENTITY value for every row you've just inserted – unlike SCOPE_IDENTITY(), which only returns the last inserted IDENTITY value, the OUTPUT clause gives you all the rows inserted!

Example

Take an input text file (testinp.txt) that simply contains a number of records separated by carriage return line feed eg...

My
Test
File
Is
Here

We can use the BULK option of OPENROWSET to load this file using the format file defined below:

9.0
1
1 SQLCHAR 0 50 "\r\n" 1 somedata Latin1_General_CI_AS

Note, it is important to name the destination column and give it a collation, otherwise you get this really (that's sarcasm) descriptive error:

Msg 4862, Level 16, State 1, Line 1

Cannot bulk load because the file "c:\temp\testinp.fmt" could not be read. Operating system error code (null).
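A quick way to sanity-check the format file before wiring it into the INSERT is to select from the rowset directly (paths as in this example):

select somedata
from OPENROWSET( BULK N'C:\temp\testinp.txt', FORMATFILE = 'c:\temp\testinp.fmt' ) AS a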


We need two tables: one is the base table we are inserting into and the second is the table that will capture what the IDENTITY value was and which row from the raw file it was set against.

 

create table stage_table_base (
    id  int not null identity,
    myinputdata varchar(100) not null
)

create table stage_table_captured (
    id  int not null,
    myinputdata varchar(100) not null
)

 

We now insert into the base table combining the OPENROWSET and the OUTPUT clause to capture the IDENTITY value set...

 

insert stage_table_base( myinputdata )
    output inserted.id, inserted.myinputdata into stage_table_captured( id, myinputdata )
    select somedata
    from OPENROWSET(BULK N'C:\temp\testinp.txt', FORMATFILE = 'c:\temp\testinp.fmt' ) AS a


The above statement is effectively doing two INSERTs for us: the first into stage_table_base and the second into the stage_table_captured table.

 

The OUTPUT clause gives us access only to the columns of the internally generated INSERTED and DELETED system tables (think triggers).
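For example (a sketch against the staging table above), an UPDATE can capture both the before and after values in one pass:

update stage_table_base
    set myinputdata = upper( myinputdata )
    output deleted.id,
           deleted.myinputdata  as before_value,
           inserted.myinputdata as after_value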

 

One important thing to note is that BULK INSERT and the BULK option are not guaranteed to read or return the file in order; up to now (SQL 2005 SP2 hotfix 3159) they have, and frankly probably will into SQL 2008, but the behaviour is not documented and when pressed Microsoft have told me not to rely on it going forward.

 

If you want a guaranteed order then you have to use ORDER BY on the select; the data is then fed to the insert row by row rather than as a set, so the insert will indeed be in the order you specify.
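A sketch of that ordered variant (same file and format file as above; we order by the data itself here because the raw file carries no sequence column):

insert stage_table_base( myinputdata )
    output inserted.id, inserted.myinputdata into stage_table_captured( id, myinputdata )
    select somedata
    from OPENROWSET( BULK N'C:\temp\testinp.txt', FORMATFILE = 'c:\temp\testinp.fmt' ) AS a
    order by somedata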

 

Realistically, in a migration scenario you'd have a third table that holds the raw loaded data plus a number of columns indicating whether the row was rejected because of data validation problems, the source file name for the data, etc... but that is a whole new topic and I need to research some other stuff for my presentation next weekend in Glasgow at the Scottish Developers. Maybe I'll take a look on the 4 hour train journey!