November 2008 - Posts
You might have read my previous post about coming to the
Hyperbac booth at PASS to see the next big thing.
So what was it that Hyperbac was announcing.
Imagine this secnario. You have your production database, if you've folowed
best practice the data files will be sized for say the next 6 months which means
there will be a lot of empty space in the mdf/ndf files. In addition your
transaction log will be sized to be big enough to reindex your largest table and
so again will generally contain a large amount of empty space. So of the 100Gb
space used by the data and log files you only have 50GB of used space.
If you want to restore this database onto a dev or test system you can't just
restore the data. You have to find 100Gb of space.
Hyperbac Online is the new product from Hyperbac which allows you to run your
database, in read/write mode, on compressed database files. This means that you
are able to restore your 100Gb database into a space that is a fraction of that
space. The compression of the data will be ~70-80% (similar to what you achieve
with a backup). So the 50Gb of data will compress to 15Gb, and as free space
compresses to nothing you will only need 15Gb of space on your dev/test system
to restore and run the database.
The initial response people have to this is what about peformance. The answer
is the same that you get when talking about compression within SQL Server 2008.
Even though you are compressing the data, you will be reading and writing less
data and so any CPU overhead that you incurr due to the compression is
outwieghed by the savings you make due to doing less IO. Hyperbac has done
initial tests with a TPC-H benchmark and found that performance is better using
compressed files because the IO is less with compressed files.
You might be aware that you can't run SQL on a compressed NTFS volume, in
which case how is Hyperbac Online any different. Well the issue with
NTFS compression is that its buffered IO and is not write ordered. Hyperbac
Online isn't buffered and is write ordered and so use of Hyperbac Online ensures
durability of data. There is a KB article on the requirements of an IO sub
system to support SQL Server Databases http://support.microsoft.com/default.aspx/kb/917047 Hyperbac
Online meets all these requirements.
Whilst this in itself is exciting whats just as exciting is that because
Hyperbac is an OS level service this compression can be applied to any file,
including virtual machine vhd files. There is a demo that has a 12GB VHD
compressed to 2GB.
I'm sure you can imagine the savings you can ow make across your storage. Pop
over to the Hyperbac stand today and see it in action. We've also got some SQL
2008 system table maps to giveaway.
One of the
arguments often put forward against the use of any (Object Relational Mapping)
ORM, whether LINQ 2 SQL or the others available, is the proliferation of Adhoc
SQL within your application code. (http://sqlblogcasts.com/blogs/simons/archive/2008/11/16/Whats-the-score-with-LINQ-to-SQL-.aspx)
heartedly agree, however the use of an ORM solution does not mean you have to
have SQL spread all over your code.
One of the
standard best practices is the use of a data layer. This layer provides a
defined set of interfaces to your data.
is that your application data layer should call stored procedures in your
Thats ok in
that your application has a data layer that has defined interfaces which it
satisfies by getting the data by calling one or more stored procedures. But via
do you need to use stored procedures to achieve this.
procedures don’t provide a guaranteed interface, they have a defined input but
not a defined output, i.e. the contract is undefined. Its very easy to have some
underlying table changed and your SP is now returning something completely
in writing SPs, until 2008, you have no intellisense for developing them and you
have to be doing them either in management studio of in a Database Professional
If you use an
ORM solution you will have a representation of the database schema defined in
your application. You will have a set of objects which will be mapped to the
database against which you can make queries.
access layer can perform these queries just as it would have done using SPs
directly against the database. The benefit is that if you want to change the
structure of your database you can and this will propagate through your code, or
not. The difference with the ORM approach is that if you change a data type,
because you are working with strongly typed interfaces your application won’t
compile until fix your application to use the new types.
think this is a floor. It’s not. I’ve seen many applications where they realise
that a field wasn’t big enough and so increase the field size, however they
don’t propagate the change through the code, i.e.
Parameters in stored
Related columns in other
Data types in ALL the code paths in the
coupling as in the case of stored procedures you don’t find out until a specific
code path in your application runs at run time. With strong typing you will find
out when you compile your application. (Erland has a suggestion for strict
checks that gets round some of this http://www.sommarskog.se/strict_checks.html , https://connect.microsoft.com/SQLServer/feedback/ViewFeedback.aspx?FeedbackID=260762 )
you can put (note use of the word “put”, stored procedures aren’t compiled until
they are executed) a stored procedure in a database and it can be based on
objects that don’t exist. You won’t find out until the stored procedure is
executed at which time the compilation will fail. With the lack of intellisense
(prior to SQL 2008) that can be a real obstacle to fast development.
So the next
point made is one of compiled query plans. Most ORM solutions use parameterised
SQL that benefits from plan caching in just the same way that stored procedures
do. For this reason the only benefit that stored procedures have over adhoc
paramerised SQL is that the text of an SP name is likely to be shorter than a
use of stored procedures is advocated because they are easy to change. You can
easily pop into your database and make a change. This is very true. However if
you are a developer then you are likely to be more comfortable in visual studio
writing C# or VB.Net in which case writing ALL your data access in your
preferred language in the same solution in visual studio, with intellisense and
compile debugging is likely to result in a much quicker development time, than
if you have to switch between development environments with different
Oh and one
more thing about LINQ is that a LINQ query isn’t executed when you create it.
What that means is that you can add to and manipulate a query after you have
first created it by working with IQueryable<T>. This means that you can
return IQueryable<T> from a function and pass that to another function and
manipulate it by adding additional filters, order bys etc and then when you
finally need to you can enumerate over the query or call .ToList(). The benefits
of this are very similar to views but can be taken much much further because you
can define your query structures based on runtime values and not fixed at design
So in summary
if you encapsulate your data access in a data layer using LINQ to SQL it will
provide just the same benefits as using stored procedures but with the benefit
of strong typing, intellisense (for whatever version of SQL) and a single
7 years ago some clever guys developed the first
backup compression product which at the time revolutionised the SQL Server
The same guys that came up with Litespeed have come up with the next thing
that will blow your mind away.
Come along to the Hyperbac stand (422 & 521) and be prepared to be
If you want to make the next step in saving money in storage costs then you
will kick yourself if you miss it.
If you are going to DDD by train then we (www.sqlbits.com) are providing coaches
again to get you from the station to the Microsoft HQ on Thames Valley Park.
We hope you find them useful.
LINQ to SQL (and any ORM for that matter) has the marmite factor. You either
love it or hate it. Whats more many of those that hate it haven't tried it. In
many cases the hate it crowd are making a decision based on previous experience
with systems using ORM solutions.
I've been looking at and trying to use LINQ to SQL for a while and thought I
would post my findings thoughts. Before starting this series of posts I thought
I would see what others have said. I had heard something about an announcement
about LINQ to SQL and LINQ to Entities and found the post hear. http://blogs.msdn.com/adonet/archive/2008/10/31/clarifying-the-message-on-l2s-futures.aspx
I made a conscious decision to look at L2S rather that L2E for these
It allows separation of code and database mapping easily. This is
commonly referred to as POCO
It supports more SQL features
However it seems that L2E will be enhanced to include the L2S features that
Thats a real shame and the biggest reason for me is the number of levels of
When you right a LINQ query with L2S your LINQ query is translated to an
expression tree and then translated to SQL.
With L2E you have an extra step because the entity framework is a framework
that sits on stop of any provider (that has been written). To enable that it has
its own data access language Entity SQL. So to get to SQL, your LINQ query is
translated from LINQ to ESQL to SQL.
Anyone thats worked or seen translated text will know that often what you put
in is often not what you get out. There is a common games on radio in the UK
where they translate an album/track name to a language and then back to English.
The quiz is then to guess what the original album/track name was given the final
translated text. (A bit like Chinese whispers).
We all know how complex SQL is, so you can understand why generally the SQL
you get from L2E is more complex than with L2S.
I personally like to keep things simple and so I’m disappointed by the
decision to concentrate. Although all is not lost.
If you use standard practice of implementing a data layer then you could move
to L2E in the future. Whats more your SELECT LINQ queries against a set of L2E
objects should be similar to those for L2S. The main difference is when you want
to do your insert, update and delete operations, the two technologies use
different mechanisms. Further more you will likely have to change you validation
code to use the different technology, although the core rules code shouldn’t
have to change.
I would still stick with L2S until the next version of EF. I'll discuss some
of my findings in future posts.
Go to http://www.sqlbits.com/ to see the
videos from SQLBits III.
Most of the presentations are also up.
We are also planing the next SQLBits, so if you're interested in speaking
then start thinking of a session, if you want to help drop us a line firstname.lastname@example.org and if you want to
sponsor the next event also drop us an email at email@example.com
Jon has an interesting view on why you should do it
If you can afford the downtime I don't see an issue. Allowing for downtime
means that if you HAVE to reboot its easily managed and not a
If you script objects in management studio 2008 be
careful its a bit forgetful. If you are using a case insensitive collation and
you have objects that refer to other objects but use different case for the
object names then when you try and script objects some might not be included. I
found this when trying to script a function and two tables. The function
referred to one of the tables but with a different case. Oddly this table didn't
This has been fixed and will be available in an upcoming cumulative
The next gotcha is using the debugging in SSMS 2008. You have to be careful
if you cancel your debug session because the database you are connected to might
If you cancel a query that is in debug mode, the query window will be
connected to the database selected when the query window was opened. i.e. if you
change the database before you start debugging the database will not be the same
before and after cancelling the query,
Open a new window this is likely to start in master
change the database
from the drop down say tempdb
type wait for delay '00:00:10'
you will now be connected to master and not
working with a number of clients recently who all have suffered at the hands of
TSQL Scalar functions. Scalar functions were introduced in SQL 2000 as a means
to wrap logic so we benefit from code reuse and simplify our queries. Who would
be daft enough not to think this was a good idea. I for one jumped on this
initially thinking it was a great thing to do.
you might have gathered from the title scalar functions aren’t the nice friend
you may think they are.
If you are
running queries across large tables then this may explain why you are getting
In this post
we will look at a simple padding function, we will be creating large volumes to
emphasize the issue with scalar udfs.
function PadLeft(@val varchar(100), @len int, @char char(1))
right(replicate(@char,@len) + @val, @len)
functions are interpreted code that means EVERY call to the function results in
your code being interpreted. That means overhead for processing your function is
proportional to the number of rows.
code you will see that the native system calls take considerable less time than
the UDF calls. On my machine it takes 2614 ms for the system calls and 38758ms
for the UDF. Thats a 19x increase.
statistics time on
max(right(replicate('0',100) + o.name + c.name, 100))
join msdb.sys.columns c
max(dbo.PadLeft(o.name + c.name, 100,'0'))
If you run
the last one again but with half the rows i.e. as follows the time is halved.
This highlights how linear the performance is in relation to the number of
max(dbo.PadLeft(o.name + c.name, 100,'0')), count(1)
top 50 percent
functions are calculated on a single thread. This means that if you move to a
multi core machine your performance will not be increased. This is shown by
running against a query that results in parallel operators.
Create a very
large table and populate it.
(col1 bigint, col2 bigint)--varchar(100))
@i = 0
@i < 20
select object_id, object_id
set @i = @i + 1
perform the following queries (You
need to do this on a multi core machine),
statistics time on
max(right(replicate('0',100) + cast(v1.col1+ v2.col2 as varchar(100)), 100))
veryLargeTable v2 on v2.col1 = v1.col2
veryLargeTable v2 on v2.col2 = v1.col1
You will see
in the duration that the system function version takes x amount of time, but it
uses ~2X time in CPU. On my machine I get elapsed time of 3247ms and CPU time of
6094 ms. That highlights that the query was able to work in parallel. Whereas
the scalar UDF results in an elapsed time of 27041ms and an CPU time of 26000ms,
showing no parallelism.
Reduction of CTE and
expressions are largely syntactic constructs that are merged into the main
query, i.e. they perform like a view and not like a physical set of data.
However the use of scalar functions in CTEs can cause undesired behaviour. In
the worst case, a query might only return a few rows and you may believe that
the CTE is only evaluated a few times and thus your function only called a few
times, however due the optimiser might choose a query plan which results in the
CTE being evaluated for every row in the source dataset resulting in your
function being called many more times than you expect. This in conjunction with
the issues above results in very very poor performance.
used scalar functions and tried to perform statement level profiling you will
have hit this last problem. Because each line in a scalar function is considered
a statement then EVERY time the function is evaluated, EACH line in the function
is recorded in profiler.
means you can’t generally find the statements you are really concerned with, but
of more a concern is that the performance hit of capturing ALL these statements
if very very large.
To test this
set up profiler to record SP:StmtCompleted and run the samples above, you will
soon be overwhelmed with data.
There is a
workaround for this which reduces the impact but it only reduces the impact. The
workaround is to filter out statements where the statement is for an object of
type “Scalar function”
So what are
You either need to put your code in
line, but then you lose the benefits of code reuse.
Write a CLR function
Write a table valued function and use a
Limited (single line)
Limited (single line)
Valued Function option is the one that is not commonly known, but is the best
option if you can convert your function into a single line of system function
calls and don’t want to go to CLR. What you do is write a Table Valued function.
So how does
that actually work. You create your table valued function to return 1 row. In
your query you then use a subquery to get the value from the table valued
function PadLeftTVF(@val varchar(100), @len int, @char char(1))
right(replicate(@char,@len) + @val, @len) val)
PadLeftTVF(o.name + c.name, 100,'0')) val
join msdb.sys.columns c) d
val from PadLeftTVF(v1.col1+ v2.col2 , 100,'0'))
veryLargeTable v2 on v2.col1 = v1.col2 )
derived table is used because you can’t use a subquery in an aggregate. But if
you do it via a derived table it works. Why I have no idea and it seems odd that
the optimiser can’t do this for me. Anyway the Max is so we only return 1 row
from the query and so our timing isn’t affected by the time it takes to return
and render the data.
looks very complicated this benefits from the fact that the optimiser can
effectively consume the system calls in line, because it can figure out there is
only going to be own row and column returned. You can see this by looking at the
properties of the streaming aggregate. The table valued function will have an
output of something like where the function has been totally reduced to system
Where as the
scalar one has the function call to PadLeft.
Use of Table
Valued functions means that using this method you can benefit from parallelism
because the function has been reduced to system function calls and so your query
will scale better when you go to a multi core machine. This is highlighted if
you run the last query above on a multi core machine.
If you aren’t
into writing C# code and you want ultimate performance and code reuse then write
your scalar udfs as table
valued functions and use a subquery
As I mentioned I've been at Kalens seminar today and as
always picked up some things I never new about and also was reminded of others
things I had forgotten. The latter is far too common with SQL being such a large
PathName can take a parameter
The first is the new PathName function for getting the virtual path for
filestream has an optional parameter. This parameter confirgures the format of
the server name being returned. The following is copied direct from books
I foresee that many people in a will need to use option 2 to ensure correct
resolution of the server name.
The text in row can be configured
I knew that you can configure text data to be stored in row but didn't
realise that there was a configuration option to say at which point you want
data to be stored off the row. Normally with sp_tableoption you only specify 0
or 1 to turn a feature on or off. However with the "text in row" option you can
specify a value between 24 and 7000. Any text data under that will be considered
to be stored "in row" if there is space. For more details have a look at http://msdn.microsoft.com/en-us/library/ms173530.aspx
Rows can overflow
I completely forgot that normal variable length data can overflow in SQL
2005, even if you are not using the new max types you can still store more than
8060 bytes in a row (as long as the extra is of variable length type).
This means that you can have a table with 100 varchar (100) columns that are
fully populated, i.e. a row size ~ 10k. This is clearly more than 8060
bytes and so would fail in SQL 2000. However in SQL 2005 and later the data
is stored by putting the extra data on OVERFLOW pages, with pointers from
the main row of data to the relevant overflow page(s).
There is downside in that doing this can result in poor performance if you
query is having to go off to lots of overflow pages to get data.
So thats some of the tips I learnt today.