November 2008 - Posts

You might have read my previous post about coming to the Hyperbac booth at PASS to see the next big thing.

So what was it that Hyperbac was announcing?

Imagine this scenario. You have your production database; if you've followed best practice the data files will be sized for, say, the next 6 months, which means there will be a lot of empty space in the mdf/ndf files. In addition, your transaction log will be sized to be big enough to reindex your largest table and so will generally contain a large amount of empty space. So of the 100GB of space used by the data and log files, only 50GB is actually used.

If you want to restore this database onto a dev or test system you can't just restore the used data; you have to find the full 100GB of space.

Hyperbac Online is the new product from Hyperbac which allows you to run your database, in read/write mode, on compressed database files. This means that you are able to restore your 100GB database into a fraction of that space. The compression of the data will be ~70-80% (similar to what you achieve with a backup). So the 50GB of data will compress to 15GB, and as free space compresses to nothing you will only need 15GB of space on your dev/test system to restore and run the database.

The initial response people have to this is: what about performance? The answer is the same one you get when talking about compression within SQL Server 2008. Even though you are compressing the data, you will be reading and writing less data, so any CPU overhead that you incur due to the compression is outweighed by the savings you make by doing less IO. Hyperbac has done initial tests with a TPC-H benchmark and found that performance is better using compressed files because the IO is lower.

You might be aware that you can't run SQL Server on a compressed NTFS volume, so how is Hyperbac Online any different? The issue with NTFS compression is that it uses buffered IO and is not write-ordered. Hyperbac Online isn't buffered and is write-ordered, so its use ensures durability of data. There is a KB article on the requirements of an IO subsystem to support SQL Server databases (http://support.microsoft.com/default.aspx/kb/917047); Hyperbac Online meets all of these requirements.

Whilst this in itself is exciting, what's just as exciting is that because Hyperbac is an OS-level service this compression can be applied to any file, including virtual machine VHD files. There is a demo that has a 12GB VHD compressed to 2GB.

I'm sure you can imagine the savings you can now make across your storage. Pop over to the Hyperbac stand today and see it in action. We've also got some SQL 2008 system table maps to give away.



-
Posted by simonsabin | with no comments

One of the arguments often put forward against the use of any Object Relational Mapping (ORM) tool, whether LINQ to SQL or the others available, is the proliferation of ad hoc SQL within your application code. (http://sqlblogcasts.com/blogs/simons/archive/2008/11/16/Whats-the-score-with-LINQ-to-SQL-.aspx)

I wholeheartedly agree; however, the use of an ORM solution does not mean you have to have SQL spread all over your code.

One of the standard best practices is the use of a data layer. This layer provides a defined set of interfaces to your data.

The argument is that your application data layer should call stored procedures in your database.

That's fine in that your application has a data layer with defined interfaces, which it satisfies by calling one or more stored procedures to get the data. But why do you need to use stored procedures to achieve this?

Stored procedures don't provide a guaranteed interface: they have a defined input but not a defined output, i.e. the contract is undefined. It's very easy for an underlying table to change and for your SP to now return something completely different.
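To illustrate (a minimal sketch; the table and procedure names are made up), the shape of a procedure's result set is simply whatever its SELECT happens to return, so a schema change alters the "contract" without any warning to callers:

create table Customers (CustomerId int, Name varchar(50))
go
create procedure GetCustomer @CustomerId int
as
    -- the output contract is implicitly "whatever Customers looks like today"
    select * from Customers where CustomerId = @CustomerId
go
-- later someone adds a column; the procedure now returns an extra column
-- (a different shape) even though its definition never changed
alter table Customers add Email varchar(200)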

Furthermore, when writing SPs prior to SQL Server 2008 you have no IntelliSense, and you have to develop them either in Management Studio or in a Database Professional project.

If you use an ORM solution you will have a representation of the database schema defined in your application. You will have a set of objects which will be mapped to the database against which you can make queries.

Your data access layer can perform these queries just as it would have done using SPs directly against the database. The benefit is that if you want to change the structure of your database you can, and the change propagates through your code. The difference with the ORM approach is that, because you are working with strongly typed interfaces, if you change a data type your application won't compile until you fix it to use the new types.

You might think this is a flaw. It's not. I've seen many applications where they realise that a field wasn't big enough and so increase the field size, but don't propagate the change through the code, i.e.:

1. Parameters in stored procedures
2. Related columns in other tables
3. Data types in ALL the code paths in the application

With loose coupling, as in the case of stored procedures, you don't find out until a specific code path in your application runs at run time. With strong typing you will find out when you compile your application. (Erland has a suggestion for strict checks that gets round some of this: http://www.sommarskog.se/strict_checks.html , https://connect.microsoft.com/SQLServer/feedback/ViewFeedback.aspx?FeedbackID=260762 )

What's more, you can put (note the use of the word "put": stored procedures aren't compiled until they are executed) a stored procedure in a database that is based on objects that don't exist. You won't find out until the stored procedure is executed, at which point compilation will fail. With the lack of IntelliSense (prior to SQL 2008) that can be a real obstacle to fast development.
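As a quick illustration of this deferred compilation (the object names here are invented), the CREATE below succeeds even though the table it references doesn't exist; the error only appears when the procedure is actually run:

create procedure GetMissingData
as
    select * from TableThatDoesNotExist
go
-- only now do we get "Invalid object name 'TableThatDoesNotExist'"
exec GetMissingData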

So the next point made is one of compiled query plans. Most ORM solutions use parameterised SQL, which benefits from plan caching in just the same way that stored procedures do. For this reason the only benefit that stored procedures have over ad hoc parameterised SQL is that the text of an SP name is likely to be shorter than a SQL statement.
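For example, this is a sketch of the kind of call an ORM typically issues under the covers; because the statement text is constant and the value is passed as a parameter, the plan is cached and reused just as it would be for a stored procedure:

exec sp_executesql
    N'select name from sys.objects where object_id = @id',
    N'@id int',
    @id = 1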

Finally, the use of stored procedures is advocated because they are easy to change: you can easily pop into your database and make a change. This is very true. However, if you are a developer you are likely to be more comfortable in Visual Studio writing C# or VB.Net, in which case writing ALL your data access in your preferred language, in the same solution, with IntelliSense, compilation and debugging, is likely to result in much quicker development than switching between development environments with different compilation semantics.

Oh, and one more thing about LINQ: a LINQ query isn't executed when you create it. That means you can add to and manipulate a query after you have created it by working with IQueryable<T>. You can return IQueryable<T> from a function, pass it to another function, manipulate it by adding additional filters, order-bys etc., and then, when you finally need to, enumerate over the query or call .ToList(). The benefits are very similar to views, but can be taken much further because you can define your query structure based on runtime values rather than fixing it at design time.

So in summary, if you encapsulate your data access in a data layer using LINQ to SQL, it will provide just the same benefits as using stored procedures, but with the added benefits of strong typing, IntelliSense (whatever version of SQL Server you are using) and a single development environment.

 



-
Posted by simonsabin | 10 comment(s)

7 years ago some clever guys developed the first backup compression product which at the time revolutionised the SQL Server backup space.

The same guys that came up with Litespeed have come up with the next thing that will blow your mind.

Come along to the Hyperbac stand (422 & 521) and be prepared to be astounded.

If you want to take the next step in saving on storage costs then you will kick yourself if you miss it.



-
Posted by simonsabin | with no comments

If you are going to DDD by train then we (www.sqlbits.com) are providing coaches again to get you from the station to the Microsoft HQ on Thames Valley Park.

We hope you find them useful.



-
Posted by simonsabin | 2 comment(s)

LINQ to SQL (and any ORM for that matter) has the marmite factor: you either love it or hate it. What's more, many of those that hate it haven't tried it; in many cases the hate-it crowd are making a decision based on previous experience with systems using ORM solutions.

I've been looking at and trying to use LINQ to SQL for a while and thought I would post my findings and thoughts. Before starting this series of posts I thought I would see what others have said. I had heard something about an announcement about LINQ to SQL and LINQ to Entities and found the post here: http://blogs.msdn.com/adonet/archive/2008/10/31/clarifying-the-message-on-l2s-futures.aspx

I made a conscious decision to look at L2S rather than L2E for these reasons:

1. It's simpler.
2. It allows easy separation of code and database mapping. This is commonly referred to as POCO.
3. It supports more SQL features.

However it seems that L2E will be enhanced to include the L2S features that are missing.

That's a real shame, and the biggest reason for me is the number of levels of interpretation.

When you write a LINQ query with L2S, your LINQ query is translated to an expression tree and then translated to SQL.

With L2E you have an extra step, because the Entity Framework is a framework that sits on top of any provider (that has been written for it). To enable that it has its own data access language, Entity SQL. So to get to SQL, your LINQ query is translated from LINQ to ESQL and then to SQL.

Anyone that's worked with or seen translated text will know that what you put in is often not what you get out. There is a common game on radio in the UK where they translate an album/track name into another language and then back into English; the quiz is then to guess the original album/track name from the final translated text. (A bit like Chinese whispers.)

We all know how complex SQL is, so you can understand why generally the SQL you get from L2E is more complex than with L2S.

I personally like to keep things simple, so I'm disappointed by the decision to concentrate on L2E. Although all is not lost.

If you follow the standard practice of implementing a data layer then you could move to L2E in the future. What's more, your SELECT LINQ queries against a set of L2E objects should be similar to those for L2S. The main difference is when you want to do your insert, update and delete operations, where the two technologies use different mechanisms. Furthermore, you will likely have to change your validation code to use the different technology, although the core rules code shouldn't have to change.

I would still stick with L2S until the next version of EF. I'll discuss some of my findings in future posts.

 



-

Go to http://www.sqlbits.com/ to see the videos from SQLBits III.

Most of the presentations are also up.

We are also planning the next SQLBits, so if you're interested in speaking then start thinking of a session. If you want to help, drop us a line at contactus@sqlbits.com, and if you want to sponsor the next event, also drop us an email at contactus@sqlbits.com.



-
Posted by simonsabin | 1 comment(s)

Jon has an interesting view on why you should restart your servers regularly: http://sqlblogcasts.com/blogs/jonsayce/archive/2008/11/03/restart-regularly-to-avoid-last-known-good-configuration-problems.aspx

If you can afford the downtime I don't see an issue. Allowing for downtime means that if you HAVE to reboot, it's easily managed and not a pain.



-
Posted by simonsabin | 3 comment(s)

If you script objects in Management Studio 2008, be careful: it's a bit forgetful. If you are using a case-insensitive collation and you have objects that refer to other objects but use a different case for the object names, then when you try to script objects some might not be included. I found this when trying to script a function and two tables; the function referred to one of the tables but with a different case, and oddly that table didn't get scripted.

This has been fixed and will be available in an upcoming cumulative update.

The next gotcha is using the debugger in SSMS 2008. You have to be careful if you cancel your debug session, because the database you are connected to might change.

If you cancel a query that is in debug mode, the query window will be connected to the database selected when the query window was opened, i.e. if you change the database before you start debugging, the database will not be the same before and after cancelling the query.

To repro:

1. Open a new query window; this is likely to start in master.
2. Change the database from the drop-down, say to tempdb.
3. Type waitfor delay '00:00:10'.
4. Start debugging.
5. Cancel debugging.
6. You will now be connected to master and not tempdb.

Be warned.



-
Posted by simonsabin | 1 comment(s)

I've been working with a number of clients recently who have all suffered at the hands of TSQL scalar functions. Scalar functions were introduced in SQL 2000 as a means to wrap logic so that we benefit from code reuse and simpler queries. Who would be daft enough not to think this was a good idea? I for one jumped on this initially, thinking it was a great thing to do.

However as you might have gathered from the title scalar functions aren’t the nice friend you may think they are.

If you are running queries across large tables then this may explain why you are getting poor performance.

In this post we will look at a simple padding function; we will be creating large volumes of data to emphasize the issue with scalar UDFs.

create function PadLeft(@val varchar(100), @len int, @char char(1))
returns varchar(100)
as
begin
  return right(replicate(@char,@len) + @val, @len)
end
go

Interpreted

Scalar functions are interpreted code, which means EVERY call to the function results in your code being interpreted. The overhead of processing your function is therefore proportional to the number of rows.

Running this code you will see that the native system calls take considerably less time than the UDF calls. On my machine it takes 2614ms for the system calls and 38758ms for the UDF. That's roughly a 15x increase.

set statistics time on
go

select max(right(replicate('0',100) + o.name + c.name, 100))
from msdb.sys.columns o
cross join msdb.sys.columns c

select max(dbo.PadLeft(o.name + c.name, 100,'0'))
from msdb.sys.columns o
cross join msdb.sys.columns c

If you run the last query again but with half the rows, as follows, the time is halved. This highlights how linear the performance is in relation to the number of rows.

select max(dbo.PadLeft(o.name + c.name, 100,'0')), count(1)
from (select top 50 percent * from msdb.sys.columns) o
cross join msdb.sys.columns c

Parallel

Queries that call scalar functions are executed on a single thread; the optimiser won't give them a parallel plan. This means that if you move to a multi-core machine your performance will not increase. This is shown by running a query that would otherwise result in parallel operators.

Create a very large table and populate it.

create table veryLargeTable (col1 bigint, col2 bigint)
go

declare @i int
set @i = 0
while @i < 20
begin
    insert into veryLargeTable
    select object_id, object_id
    from sys.columns
    set @i = @i + 1
end

 

Then perform the following queries (you need to do this on a multi-core machine):

set statistics time on
go

select max(right(replicate('0',100) + cast(v1.col1 + v2.col2 as varchar(100)), 100))
from veryLargeTable v1
join veryLargeTable v2 on v2.col1 = v1.col2
go

select max(dbo.PadLeft(v1.col1 + v2.col2, 100,'0'))
from veryLargeTable v1
join veryLargeTable v2 on v2.col2 = v1.col1

 

You will see that the system function version takes a given elapsed time but uses roughly twice that in CPU. On my machine I get an elapsed time of 3247ms and a CPU time of 6094ms, which shows the query was able to work in parallel. The scalar UDF, by contrast, gives an elapsed time of 27041ms and a CPU time of 26000ms, showing no parallelism.

Reduction of CTEs and views

Common table expressions are largely syntactic constructs that are merged into the main query, i.e. they behave like a view and not like a physical set of data. However, the use of scalar functions in CTEs can cause undesired behaviour. In the worst case a query might only return a few rows, so you may believe that the CTE is only evaluated a few times and thus your function only called a few times; however, the optimiser might choose a query plan in which the CTE is evaluated for every row in the source data set, resulting in your function being called many more times than you expect. Combined with the issues above, this results in very poor performance.
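As a rough sketch using the PadLeft function above, you might assume the function is only called for the rows the query returns, but depending on the plan the optimiser chooses it may be evaluated for every row in the CTE's source:

with padded as
(
    select c.object_id, c.column_id, dbo.PadLeft(c.name, 100, '0') as padded_name
    from msdb.sys.columns c
)
select padded_name
from padded
where column_id = 1
-- you might expect one call to dbo.PadLeft per qualifying row, but the plan
-- may compute the padded value for far more rows than the query returns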

Profiler

If you've used scalar functions and tried to perform statement-level profiling you will have hit this last problem. Because each line in a scalar function is considered a statement, EVERY time the function is evaluated EACH line in the function is recorded in Profiler.

This firstly means you can't generally find the statements you are really concerned with, but of more concern is that the performance hit of capturing ALL these statements is very large.

To test this, set up Profiler to record SP:StmtCompleted and run the samples above; you will soon be overwhelmed with data.

There is a workaround which reduces (but only reduces) the impact: filter out statements where the statement is for an object of type "Scalar function":

ObjectType <> 20038

Options

So what are your options?

1. Put your code inline, but then you lose the benefits of code reuse.
2. Write a CLR function.
3. Write a table valued function and use a subquery.

 

Option                | Code Reuse | Performance | Extensible            | Parallelism
System functions      | No         | Very High   | Limited (single line) | Yes
CLR function          | Yes        | High        | Yes                   | Yes
Table valued function | Yes        | High        | Limited (single line) | Yes
Scalar function       | Yes        | Low         | Yes                   | No

 

The table valued function option is the one that is not commonly known, but it is the best option if you can convert your function into a single line of system function calls and don't want to go to CLR. What you do is write a table valued function.

So how does that actually work? You create your table valued function to return a single row, and in your query you use a subquery to get the value from it.

create function PadLeftTVF(@val varchar(100), @len int, @char char(1))
returns table
as
  return (select right(replicate(@char,@len) + @val, @len) val)
go

select max(val)
from (select (select val from PadLeftTVF(o.name + c.name, 100,'0')) val
      from msdb.sys.columns o
      cross join msdb.sys.columns c) d

select max(val)
from (select (select val from PadLeftTVF(v1.col1 + v2.col2, 100,'0')) val
      from veryLargeTable v1
      join veryLargeTable v2 on v2.col1 = v1.col2) d

 

Note: the derived table is used because you can't use a subquery inside an aggregate, but if you go via a derived table it works. Why, I have no idea, and it seems odd that the optimiser can't do this for me. The MAX is there so we only return one row from the query, so our timing isn't affected by the time it takes to return and render the data.

Whilst this looks very complicated, it benefits from the fact that the optimiser can effectively inline the system calls, because it can figure out that only one row and one column will be returned. You can see this by looking at the properties of the stream aggregate: the table valued function version will have an output of something like the following, where the function has been totally reduced to system calls.

MAX(right('0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000'+CONVERT_IMPLICIT(varchar(100),[msdb].[sys].[syscolpars].[name]+[msdb].[sys].[syscolpars].[name],0),(100)))

Whereas the scalar version has the function call to PadLeft:

MAX([tempdb].[dbo].[PadLeft](CONVERT_IMPLICIT(varchar(100),[msdb].[sys].[syscolpars].[name]+[msdb].[sys].[syscolpars].[name],0),(100),'0')))

Using table valued functions in this way also lets you benefit from parallelism, because the function has been reduced to system function calls, so your query will scale better when you go to a multi-core machine. This is highlighted if you run the last query above on a multi-core machine.

Summary

If you aren't into writing C# code and you want the best performance along with code reuse, then write your scalar UDFs as table valued functions and use a subquery.

Other References

http://sqlblog.com/blogs/adam_machanic/archive/2006/08/04/scalar-functions-inlining-and-performance-an-entertaining-title-for-a-boring-post.aspx

 



-

As I mentioned, I've been at Kalen's seminar today and as always picked up some things I never knew about, and was also reminded of other things I had forgotten. The latter is far too common with SQL being such a large product.

PathName can take a parameter

The first is that the new PathName function, for getting the virtual path of a filestream value, has an optional parameter. This parameter configures the format of the server name being returned. The following is copied direct from Books Online (http://msdn.microsoft.com/en-us/library/bb895239.aspx):

Value | Description
0     | Returns the server name converted to BIOS format, for example: \\SERVERNAME\MSSQLSERVER\v1\Archive\dbo\Records\Chart\A73F19F7-38EA-4AB0-BB89-E6C545DBD3F9
1     | Returns the server name without conversion, for example: \\ServerName\MSSQLSERVER\v1\Archive\dbo\Records\Chart\A73F1
2     | Returns the complete server path, for example: \\ServerName.MyDomain.com\MSSQLSERVER\v1\Archive\dbo\Records\Chart\A73F19F7-38EA-4AB0-BB89-E6C545DBD3F9

I foresee that many people will need to use option 2 to ensure correct resolution of the server name.
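For example (a sketch assuming the Records table and Chart filestream column from the Books Online sample paths above):

-- PathName(2) returns the path with the fully qualified server name,
-- e.g. \\ServerName.MyDomain.com\MSSQLSERVER\...
select top (1) Chart.PathName(2) as FullPath
from Archive.dbo.Records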

The text in row can be configured

I knew that you can configure text data to be stored in row, but I didn't realise there was a configuration option to say at which point data should be stored off the row. Normally with sp_tableoption you only specify 0 or 1 to turn a feature on or off; however, with the "text in row" option you can specify a value between 24 and 7000. Any text data under that size will be stored in row if there is space. For more details have a look at http://msdn.microsoft.com/en-us/library/ms173530.aspx
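For example (a minimal sketch; the table name is made up):

-- store text values of up to 1000 bytes in the data row itself when there is space
exec sp_tableoption 'dbo.Documents', 'text in row', '1000'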

Rows can overflow

I completely forgot that normal variable length data can overflow in SQL 2005: even if you are not using the new max types you can still store more than 8060 bytes in a row (as long as the extra is of a variable length type).

This means that you can have a table with 100 varchar(100) columns that are fully populated, i.e. a row size of ~10KB. This is clearly more than 8060 bytes and so would fail in SQL 2000. However, in SQL 2005 and later the extra data is stored on overflow pages, with pointers from the main row of data to the relevant overflow page(s).
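A quick illustration (hypothetical table): three fully populated varchar(3000) columns give a row of roughly 9000 bytes, which would be rejected in SQL 2000 but works in SQL 2005 and later:

create table dbo.OverflowDemo (c1 varchar(3000), c2 varchar(3000), c3 varchar(3000))
go
-- on SQL 2005+ this succeeds; the data that doesn't fit on the 8060-byte row
-- is pushed to overflow pages, with a pointer left in the main row
insert into dbo.OverflowDemo
values (replicate('a', 3000), replicate('b', 3000), replicate('c', 3000))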

There is a downside: this can result in poor performance if your query has to go off to lots of overflow pages to get its data.

So that's some of the tips I learnt today.

Cheers Kalen



-
Posted by simonsabin | with no comments