relative query cost not accurate when using set rowcount

When tuning performance of a query, it is quite common to compare different variants of the query in the same batch and compare the execution plans to see if changes that you made actually work for better or not. Often you would compare just relative cost of execution plan of a query within the batch. I lived for a long time with innocent assumption that the relative cost is accurate and reliable. As Gail Shaw showed in her blog, this assumption doesn't hold if you use scalar functions within a query. I just found out that this is true also for certain type of queries that you would use in specific scenarios.


Let's assume that you have a huge table that is heavily used by OLTP system. Yet you have to delete significant amount of data from it with as small impact for the user as possible. If you just execute delete from table with some key value, it may happen that the amount of affected rows will trigger lock escalation to table level. This will obviously prevent users from inserting or even reading data from the table.
The remedy for such scenarion is to delete rows in batches.
Let's see how we can do it. Here is the table we are going to use.

create table t1(a int identity(1, 1) primary key clustered, b char(10) default('a'))


I will populate it with 100 000 rows:

set rowcount 100000

insert t1 (b)

select 'a'

from master.sys.objects a, master.sys.objects b, master.sys.objects c

set rowcount 0

select count(*) [before batch 1] from t1



Fist approach to delete rows in batches from this table can be as follows:

--batch 1

declare @batchsize int

set @batchsize = 100

set rowcount @batchsize

while 1=1

begin

delete t1 where a < 50000

if @@rowcount = 0

break;

end

set rowcount 0



As you see, we can modify the size of the batch here to adjust it to the size of the table and required degree of concurrency.
The other option is to use rowcount setting, similar to the approach we used to populate the table:

declare @batchsize int

set @batchsize = 100

while @batchsize <> 0

begin

delete top (@batchsize) from t1 where a < 50000

set @batchsize = @@rowcount

end



The query and the loop looks a bit simpler. Which approach is better?  

To be honest both techniques give similar results, at least for the sample data. I didn't have time to run tests for bigger amounts of data, but if you find this interesting, I attach the script so you can check it out for yourself.

More interesting are the execution plans. I removed while loops from above queries and executed them with "Include Actual Execution Plan" option on.

declare @batchsize int

set @batchsize = 100


--query 1

set rowcount @batchsize

delete t1 where a < 50000

set rowcount 0

--query 2

delete top (@batchsize) from t1 where a < 50000


The set rowcount query shows plainly delete from the clustered index. The select top() query shows plan that is more complicated, involving clustered index seek and Top operator. Yet SQL Server shows that the plan with select top() is much more efficient!


 

Why? I checked IO statistics and both queries show the same number of reads:

Table 't1'. Scan count 1, logical reads 3, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Table 't1'. Scan count 1, logical reads 3, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

If you look though at the estimated and actual numbers of rows that optimizer processed, you will see immediate difference

 

 

 


The optimizer doesn't know that set rowcount @batchsize is in effect and estimates that all rows will have to be processed. The select top() query is parsed by optimizer and it knows that we want only 100 rows, so it estimates that the cost of IO and memory will be much less than in case of set rowcount query. Yet actual numbers of rows are the same, which may explain why my tests show no particular difference between both approaches.

This leads me to conclusion that the relative difference between both plans, as you see on the first picture, is mostly caused by inaccurate estimation of amount of data the engine will have to read and modify. And if this is the case, I think this is worth to keep in mind that it may be that stale statistics may affect your plan comparisons too. You can recognize stale statistics amongst the others by significant differences between estimated and actual numbers of rows in execution plans.

I attach the scripts that I used for tests here: testbatchdelete.zip

like in the old times

My little brother got recently an old 50mm Pentax-M 1:2 lens. While this lens is not known for the best image quality amongst Pentax prime 50mm lenses, it gave me first opportunity to take pictures 'the old way' on my K10D. This is quite nostalgic feeling to have to focus and set aperture manually like in the old '80s with famous soviet camera Zenith. My brother has also some old lenses with M42 mount. I bought adapter allowing me to connect them to my camera and indeed, it works. I find though that 50mm is the most interesting.

This is one of pictures i took with it. It is quite hard to get focus properly, at least for me, but when I succeed, the satisfaction is much more rewarding than when I use 'intelligent' lens.

By the way, winter in Poland looks beautiful, provided you have heating (it can be -20 C here or less) :).

 

Posted by Piotr Rodak | with no comments
Filed under:

updating column with a random value

I have been working recently on load testing of our ETL. One of the factors that determines amount of transfered data is a lookup table that is joined to the source. I populated this table with aproppriate values, and then realized that I also need some values in second column, and they were null. Let's say the table looks like that:

create table t101

(

a int,

b char(10)

)

go

insert t101(a, b)

select 1, ''

union all select 2, ''

union all select 3, ''

union all select 4, ''

union all select 5, ''

union all select 6, ''

union all select 7, ''

union all select 8, ''

go

Luckily I had another table that contained the values I needed. The problem was only, there were much less rows in the second table than in the first one. There is also no relation between both tables.

 

create table t102

(

someval char(10)

)

 

go

insert t102(someval)

select 'val 01'

union all select 'val 02'

union all select 'val 03'

union all select 'val 04'

union all select 'val 05'

union all select 'val 06'

union all select 'val 07'

union all select 'val 08'

union all select 'val 09'

go

Happily I wrote a simple update:

update t101 set b = (select top 1 someval from t102 order by newid())

I ran the query and then selected from the t101:

select * from t101

And the results are..

a    b
1    val 06   
2    val 06   
3    val 06   
4    val 06   
5    val 06   
6    val 06   
7    val 06   
8    val 06     

I was surprised, to say the least. Since I had something other to do, I left this as it was, letting my 'background threads' do the thinking. Of course, I figured out what was going on when I was walking home, wading in snow slush and struggling with wind.

The problem with above subquery is that it is not correlated. This means, its execution doesn't depend on the row the query produces output for. So the optimizer chooses to execute the query only once and apply the result to every row of the outer query. It is a random result, but only one for the whole set.

Ok, so what to do then? No, don't use cursor.

declare c cursor for select a from t101
declare @a int
open c
fetch next from c into @a
while @@fetch_status = 0
begin
        update t101 set b = (select top 1 someval from t102 order by newid())
        where a = @a
        fetch next from c into @a
end
close c
deallocate c

Seriously, don't use cursor.

You have to find a way to correlate the query even if there is no relationship between the data. I thought about two ways, but I am sure there may be more.

First option is to return value which would be a result of concatenation of value from table 2 and value from current row of table 1.:

update t101 set b = (select top 1 someval + case when b is null then '' else '' end from t102 order by newid())

As you see, I don't add effectively anything to someval. This query works, but is kind of too elaborate and complicated to write. So there is another option for you - add a filter that will always return true.

update t101 set b = (select top 1 someval from t102 where a = a order by newid())

 

As you see, as long as a value is not null, the subquery will return a result. And it will be a new result for each processed row.

I attach the code so you can play with this idea a bit more, maybe you will find simpler ways of generating random data than these above - if you do, please let me know.

 Testing update random value.zip

 

 

 

 

updlock, holdlock and deadlocks

We have a process that manages sliding window over a number of tables in our reporting database. This process is triggered by first call to our reporting ETL. There is table that contains two dates that are important for the partitioning and business requirements. The table is build as the following example:

if object_id('dbo.tDateTest') is not null

drop table dbo.tDateTest

go

create table dbo.tDateTest

(

FromDate datetime not null,

ToDate datetime not null

)


The table contains only one row:

insert dbo.tDateTest(FromDate, ToDate) values('20090102', '20090101')


 
It is important that the code is called only once and other processes do not overlap, as this would lead to incorrect partition range and missing data in the interface views.

Just a few days we had a production issue, and then another one, next day - deadlocks. That surprised me to say the least, because I thought I had secured myself against this problem. The code looked like this:

declare @newFrom datetime, @newTo datetime, @oldFrom datetime

set @newFrom = '20090115'

set @newTo = '20090114'


begin tran

--######### original approach

select @oldFrom = FromDate from dbo.tDateTest with(updlock, holdlock)


while @oldFrom < @newFrom

begin

set @oldFrom = dateadd(day, 1, @oldFrom)

update dbo.tDateTest set FromDate = @newFrom, ToDate = @newTo

--some other processing here..

end

commit

select * from dbo.tDateTest


Query 1, Original approach.

As you see I added the updlock and holdlock hints to the select statement. This technique effectively changes isolation level of the transaction to serializable, by acquiring update lock at the time select statement is executed and holding it to the end of the transaction.
Well, it turned out that when multiple processes execute the above code, there may be a situation when a deadlock occurs.
At the beginning, I had trouble to actually reproduce the behavior. Fiddling with waitfor statement and multiple windows proved to be rather cumbersome. Then I remembered that Adam Machanic's little utility SQLQueryStress can do exactly what I want - call some code many times from different threads. And indeed, when I set up the utility and pasted above code (picture below), I encountered deadlocks exactly as in the production.



I turned on trace flag 1222 which writes detailed deadlock information to the error log to see what actually is going on.
dbcc traceon(1222, -1)
The -1 parameter indicates that you want to set the status flag globally for all connections to the server. If it is not provided, the flag applies only to the current connection.

After I had enabled the flag, I pasted query 1 to the SQLQueryStress util. I configured the tool to run the code 40 times in two threads. During the run errors were indicated and when I clicked the ellipsis (...) button I could verify the errors were indeed deadlocks. So I opened the SQL Server error log and got the following output from it:

deadlock-list
deadlock victim=process1cdf68c58
process-list
process id=process79a72e8 taskpriority=0 logused=0 waitresource=OBJECT: 15:1622453004:0  waittime=1812 ownerId=4097956687 transactionname=user_transaction lasttranstarted=2009-12-21T13:56:52.703 XDES=0x8031d0c0 lockMode=X schedulerid=1 kpid=1856 status=suspended spid=117 sbid=0 ecid=0 priority=0 transcount=1 lastbatchstarted=2009-12-21T13:56:52.703 lastbatchcompleted=2009-12-21T13:56:52.563 clientapp=.Net SqlClient Data Provider hostname=MYHOST hostpid=16244 loginname=testlogin isolationlevel=read committed (2) xactid=4097956687 currentdb=15 lockTimeout=4294967295 clientoption1=671096864 clientoption2=128056
executionStack
frame procname=adhoc line=7 stmtstart=336 stmtend=480 sqlhandle=0x020000000144d3314c9019b950c94aa9630febde7d049222
select @oldFrom = FromDate from dbo.tDateTest with(updlock<c/> holdlock)
inputbuf
declare @newFrom datetime<c/> @newTo datetime<c/> @oldFrom datetime
set @newFrom = '20090115'
set @newTo = '20090114'
begin tran
--#########        original approach
select @oldFrom = FromDate from dbo.tDateTest with(updlock<c/> holdlock)
while @oldFrom < @newFrom
begin
set @oldFrom = dateadd(day<c/> 1<c/> @oldFrom)
update dbo.tDateTest set FromDate = @newFrom<c/> ToDate = @newTo
#NAME?
end
commit
select * from dbo.tDateTest
process id=process1cdf68c58 taskpriority=0 logused=0 waitresource=OBJECT: 15:1622453004:0  waittime=1828 ownerId=4097956685 transactionname=user_transaction lasttranstarted=2009-12-21T13:56:52.703 XDES=0x35527f6d0 lockMode=X schedulerid=3 kpid=3324 status=suspended spid=118 sbid=0 ecid=0 priority=0 transcount=1 lastbatchstarted=2009-12-21T13:56:52.703 lastbatchcompleted=2009-12-21T13:56:52.533 clientapp=.Net SqlClient Data Provider hostname=MYHOST hostpid=16244 loginname=testlogin isolationlevel=read committed (2) xactid=4097956685 currentdb=15 lockTimeout=4294967295 clientoption1=671096864 clientoption2=128056
executionStack
frame procname=adhoc line=7 stmtstart=336 stmtend=480 sqlhandle=0x020000000144d3314c9019b950c94aa9630febde7d049222
select @oldFrom = FromDate from dbo.tDateTest with(updlock<c/> holdlock)
inputbuf
declare @newFrom datetime<c/> @newTo datetime<c/> @oldFrom datetime
set @newFrom = '20090115'
set @newTo = '20090114'
begin tran
--#########        original approach
select @oldFrom = FromDate from dbo.tDateTest with(updlock<c/> holdlock)
while @oldFrom < @newFrom
begin
set @oldFrom = dateadd(day<c/> 1<c/> @oldFrom)
update dbo.tDateTest set FromDate = @newFrom<c/> ToDate = @newTo
#NAME?
end
commit
select * from dbo.tDateTest
resource-list
objectlock lockPartition=0 objid=1622453004 subresource=FULL dbid=15 objectname=Database_Perf.dbo.tDateTest id=lockea708480 mode=IX associatedObjectId=1622453004
owner-list
owner id=process79a72e8 mode=IX
owner id=process1cdf68c58 mode=IX
waiter-list
waiter id=process1cdf68c58 mode=X requestType=convert
waiter id=process79a72e8 mode=X requestType=convert



There a couple of facts to be noticed here. As you see in the second line od the output, process process1cdf68c58 has been chosen as victim of the deadlock. The lines beginning with >process id< contain also information about what object was deadlocked. This is the waitresource part: =OBJECT: 15:1622453004:0. The identifier tells you that the deadlocked resource is an object (table), in database with db_id() 15, then there is id of the object within this database and eventually id of the index on the table. The index id 0 indicates that this is heap - either there there is no clustered index defined or it was not chosen to run the query.
By the end of the output in the resource-list section you can also see the name of the table and the types of locks that were in place before the deadlock occured. As you see, both processes (owner id lines) had intent exclusive (IX) lock on the table and both of them decided to convert the lockx to exclusive (X)) and since these locks (IX and X) are not compatible, deadlock occured and one of the processes was killed. There is a MSDN article here with overview of the locks compatibility.

To say the least, I was pretty surprised. I never thought this would occur, although when I think of this, the deadlocking is pretty reasonable. I just made (uneducated) assumption, that when there is only one row in a table, the enginge will get straight to this row and (updlock, holdlock) will work as planned. I was wrong as it turns out.

Anyway, I needed to fix the problem.

1. The first idea was to replace the updlock with tablockx and this resolved the deadlocking issue.
Note that the TABLOCKX hint causes the engine to take an exclusive lock on the whole table. Use it with caution, as if you have really concurrent environment, this hint can affect performance seriously. In my case, this is not an issue, so I am OK with this approach. Here is the query that I used to resolve the problem:

declare @newFrom datetime, @newTo datetime, @oldFrom datetime

set @newFrom = '20090115'

set @newTo = '20090114'


begin tran

--############ tablockx approach

select @oldFrom = FromDate from dbo.tDateTest with(tablockx, holdlock)


while @oldFrom < @newFrom

begin

set @oldFrom = dateadd(day, 1, @oldFrom)

update dbo.tDateTest set FromDate = @newFrom, ToDate = @newTo

--some other processing here..

end

commit

select * from dbo.tDateTest



This would indicate that updlock has different path of execution than tablockx and there is transition state from intent lock to exclusive lock that is prone to deadlock issues if exclusive table lock is not explicitly requested. Please note that while I accepted this approach as the solution, it may not be the best idea if you have heavily accessed table with many rows.


Since I was at it, I decided to try several various approaches and see if they work. Here are some approaches I tried:
2. I thought that maybe deadlocking is caused by query without where clause narrowing the set to one row (yeah, there is one row there anyway, but maybe optimizer chooses different optimization technique?). I changed the select statement to the following:

select @oldFrom = FromDate from dbo.tDateTest with(updlock, holdlock)

where FromDate = '20090102'



Obviously, because there is no index on the table, the select statement produces table scan in execution plan, so there is not a big change to the plan produced by query without where clause. The deadlocks are still there, too.

3. Next idea I had was to create a nonclustered index on the table and use it to select rows from it.

--add nonclustered index

if object_id('dbo.tDateTest') is not null

drop table dbo.tDateTest

go

create table dbo.tDateTest

(

FromDate datetime not null,

ToDate datetime not null

)

go

create index ix1 on dbo.tDateTest(FromDate)

go

insert dbo.tDateTest(FromDate, ToDate) values('20090102', '20090101')

go



Now, this query

select @oldFrom = FromDate from dbo.tDateTest with(updlock, holdlock)



uses index scan because optimizer chooses to use index instead of table to retrieve all required data. What is interesting - there are no deadlocks here. So it looks like index scan is somewhat safer in this case than table scan, in terms of deadlocking.

4. I added new variable to the script and changed the query so it has to use in some way a table.
The query looks now like this:

select @oldFrom = FromDate, @oldTo = ToDate from dbo.tDateTest with(updlock, holdlock)


Because there is no where clause, the optimizer chose to use table scan and bypass the index altogether. This takes us to the square one and the deadlocking issue.

5. I wondered what would happen if I enforced use of the index in the query:

select @oldFrom = FromDate, @oldTo = ToDate from dbo.tDateTest with(updlock, holdlock, index=ix1)



This query produces plan that contains index scan and RID lookup on the table. And there are no deadlocks interestingly.

6. So, next approach is to use index and search for the only row in the table using WHERE clause.

select @oldFrom = FromDate, @oldTo = ToDate from dbo.tDateTest with(updlock, holdlock, index=ix1) where FromDate = '20090102'



Since the optimizer goes directly to the searched row, the locking seems to be less prone to deadlocking.

7. The last change I checked was to modify the table and add surrogate clustered primary key:

--change definition of the table - add clustered key

if object_id('dbo.tDateTest') is not null

drop table dbo.tDateTest

go

create table dbo.tDateTest

(

PK int identity(1, 1) primary key clustered,

FromDate datetime not null,

ToDate datetime not null

)

go

insert dbo.tDateTest(FromDate, ToDate) values('20090102', '20090101')


go



Now, the query was modified to seek for the row with PK = 1,

select @oldFrom = FromDate from dbo.tDateTest with(updlock, holdlock)

where PK = 1



Since this is a Clustered Index seek in execution plan, there is no deadlocking as well.


In summary, this is another example that it is better to have a clustered index than not, even on small single-row tables. It is surprising that table-scan is more susceptible to deadlocks than index scan as idea number 5 showed.
I suppose these findings may be inaccurate if there are more rows in the table or in the index. I haven't tested such scenario yet, but definitely SQLStressQuery utility is coming to my toolbelt.

It is worth to mention, that there is another way of serializing execution of your code. There is a, somewhat not well known procedure sp_getapplock which, together with sp_releaseapplock allows for creating critical section around your code withouth applying locks to the table. I checked this approach as well and obviously it prevents deadlocks from occur. This is pretty interesting option in certain situations in my opinion.


The code I used for testing is available here: testing updlock, holdlock hint.zip, and the SQLStressQuery utility is here, please have your go if you find it interesting and have some time to spend.

Happy New Year

Well, this is my last post this year - I know, there weren't too many of them anyway, but I plan to improve my frequency, promiseSmile .

I have a few days off and all of a sudden I got a flu or cold, felt pretty rubbish two days ago. I spent all day in bed yesterday, sleeping, playing with Rubik's Cube and reading a book - The Storage Engine Big Smile. The effect of day in bed was that I couldn't sleep at night and my brain was plotting grand plans how to become rich and retire before end of 2010. Well, I didn't come up with anything you wouldn't know already so regarding this it was a wasted night. However, I came across idea of using regular expressions to parse some SQL queries to pick table names or columns in select statemets. I downloaded very cool regex designer and was playing with the idea for the whole day today. Call me a geek, but I think regular expressions are cool. It's a pity they are not natively implemented in SQL Server, but on the other hand it would encourage developers to do really heavy lifting on character data in the database - and DBAs would have yet another trouble to look after.

Anyway, happy New Year folks, let it be even better than the old one.

 

Default value or value provided for the report parameter .. is not a valid value

I've been working on a report for last few days. It worked fine on my visual Studio and when it was ready, I decided to deploy it to our application server which runs Report Server 2005. To my disappointment, the report didn't want to work after deployment. It threw an error on me as on the below picture:

 

Default value or value provided for the report parameter 'RunDate' is not a valid value

I looked for solution far and near. One of the first articles I stumbled upon was the fix for apparently the same error in Microsoft KB base: http://support.microsoft.com/kb/970058

I installed service pack 3 on the Report Server, bounced the application server box, then downloaded and installed CU4 which is supposed to contain fix for my error. Then I restarted the box once more and launched the reports manager. Unfortunately, the error didn't go away. It was the moment I begun to think rather than look for ready solutions. My report has several parameters passed to it from the feeding query. The values of these parameters are displayed on the header of each page of the report. It dawned on me, that the Report Manager doesn't complain about string parameters, it complains about this particular, DateTime type parameter. This parameter is read from the query and a formatting string is applied to it so it is displayed nicely in the header.

 

Ok, so this might be something related to the representation of the date on the app server. I checked regional settings and indeed, there was difference - my local machine had different location set - Ireland instead of United States. Yet all other values for date formats were the same.

I modified the data type of the parameter to be String and removed the formatting. I have got default date representation as specified in regional settings in Control Panel. This was not what I wanted and I wanted to have the thing done. I ended up with modifying query to convert all three datetime columns for parameters to strings and formatted them using the most important SQL Server function - CONVERT:

 convert(varchar(20), a.RunDate, 13) RunDate

The report works now nicely, producing expected output and I don't have to count on regional settings or whatever to properly format my date parameters.It looks like for Reporting Services it is better sometimes to format data for them than to rely on them to do it because they tend to behave differently on different machines, depending on the external environment settings.

DATA ACCESS setting on local server

When you look at the sys.servers catalog view, you will see list of linked servers defined on the server plus one additional row for the server itself. You can distinguish the local server from linked servers by column server_id, which in case of local server has value of 0. 

select server_id, name, is_data_access_enabled from sys.servers

We have an application, that stores query to execute in a configuration table, along with name of the linked server. The query is executed using OPENQUERY statement. This setup allows for changing environment practically at runtime, all you have to do is to change linked server name in the configuration. Last week we decided to make a 'proxy' database that would contain a subset of data of original linked server database on local server. This way we can have consistent development and testing environment.

So, the query would look something similar to this:

select * from openquery(Amilo, 'select * from testdb.dbo.tLinked')

Unfortunately, it didn't work as we expected:

Msg 7411, Level 16, State 1, Line 1
Server 'Amilo' is not configured for DATA ACCESS.

Now, it turns out that you have two ways of solving this issue.

The first option is that you can create loopback linked server to your local server.

To create a loopback server you can use sp_addlinkedserver procedure, as it is described in books on line:

Untitled

sp_addlinkedserver @server = N'AMILOLinked',
@srvproduct = N' ',
@provider = N'SQLNCLI',
@datasrc = N'AMILO',
@catalog = N'master'

After running this command you will see that there is new entry in sys.servers view and its is_linked column value will be set to 1.

Now you can use this server for example like this.

Untitled

select * from openquery(AmiloLinked, 'select * from testdb.dbo.tLinked')

The limitation of a loopback linked server is that it cannot take part in an explicit or implicit transaction because this would require loopback distributed transaction, something that MSDTC apparently does not support.In such scenario, you will get error 3910:

Msg 3910, Level 16, State 2, Line 1

Transaction context in use by another session.

Interestingly, autocommit transactions will work, but they are constrained to single statement, so it is not always what you maight be looking for.
If MSDTC is not enabled, both explicit and implicit transactions will throw a different error:

Msg 8501, Level 16, State 3, Line 2

MSDTC on server 'AMILO' is unavailable.

The second option you have is to enable DATA ACCESS on your server, the one listed with server_id = 0.

To enable this option hou have to use sp_serveroption procedure:

EXEC sp_serveroption 'AMILO', 'DATA ACCESS', TRUE

If you query sys.servers now, you will see that is_data_access_enabled column for server_id is now set to 1. Also implicit and explicit transactions work, unless you request for distributed transaction and MSDTC is not running:

begin distributed tran

insert Amilo.testdb.dbo.TLinked(a, b) values(1, 1)

commit

Msg 8501, Level 16, State 3, Line 2

MSDTC on server 'AMILO' is unavailable.

 

The DATA ACCESS server option is not very well documented in my opinion - the Books On Line say it is a property of linked servers. It doesn't mention at all that you actually can have it enabled on your local server to enable OPENQUERY calls. I noticed that when you disable DATA ACCESS on a linked server, you can't query any table located on it (I tested it on my loopback server) neither using OPENQUERY nor four-part naming convention. You can still call procedures (with four-part naming) that return rowsets. Well, the interesting question is why it is disabled by default on local server - I suppose to discourage users from using OPENQUERY against it.

 

 

 

Makesafe script - set up security for developers

I worked yesterday on a script that is supposed to run on DEV database after it has been restored from production. This script adjusts configuration data and security in the database so developers can work safely on production data without compromising security policies.

The original script has a long history, over 7 or 8 years. I think it's origins reach out as far as to SQL Server 7.0. Parts responsible for creating user accounts for developers were written with old, deprecated now syntax, like sp_grantdbaccess for example. I decided to modify this part to use CREATE USER and other, more contemporary language constructs.

I like to use queries that generate script for me. I wrote about such utility to generate create database snapshot script some time ago. Since the list of developers and potentially their role assignments in the make safe script may change over time, I thought that this will be pretty reasonable approach, the only difference be that instead of returning generated script to output window, saving it in a variable and later executing by sp_executesql.

The script has pretty simple structure. The first part of it contains declarations and initialization of structures that define logins, users and rights that have to be implemented. An example of the script is displayed below:

--This table contains list of logins to be mapped to databases.
declare @DevLogins table
(
LoginName sysname, --name of login on the server
UserName sysname --name of user mapped to the login in databases
)

--PR: add or remove WINDOWS logins from here as aproppriate. Remember to modify @PermissionsToAdd table below.
--LoginName is the domain name of the account, UserName is name of the user mapped to the login in all databases
insert into @DevLogins(LoginName, UserName)
select 'DOMAIN\Developer1', 'Developer1' union all --dev account
select 'DOMAIN\Developer2', 'Developer2' union all --dev account
select 'DOMAIN\s-Developer3', 's-Developer3' --application account

--This table contains SQL logins and users used in databases
declare @SQLLogins table
(
LoginName sysname,
LoginPwd varchar(50),
UserName sysname
)

insert into @SQLLogins(LoginName, LoginPwd, UserName)
select 'devLogin1', 'pwd1', 'devUser1' union all --application account
select 'devLogin2', 'pwd2', 'devUser2' --application account

I decided to split Windows logins and SQL Logins as they will usually have different rights assigned. SQL logins can be used for example by applications during development stage, for some bulk load operations and so on.

I also added a variable that will control if the generated script is executed or only printed:

 --used to decide if actions should be executed or only printed.
declare @DEBUG char(1)
set @DEBUG = 'N'

 You can see how it works in this piece of script for example:

    raiserror('---## Dropping DEV logins..', 10, 1 ) with nowait;
select @RunSQL = convert(varchar(max),
(select 'if exists(select 1 from sys.server_principals where name = '''+
LoginName+''')' + char(10) +
' drop login ['+LoginName+'];' +
char(10) + char(10)
from @DevLogins for xml path('')))


raiserror(@RunSql, 10, 1) with nowait;

if @DEBUG = 'N'
exec sp_executesql @RunSQL
raiserror('---## DEV logins dropped.', 10, 1 ) with nowait;

As you see, the above script generates drop login statements. It prints out the generated code and executes it only if the value of @DEBUG variable is set to 'N'. The script generates statements for the values defined earlier in the @DevLogins table.

 In my case, there are several databases that have to have security set up. Since the required rights are same in all databases (in development environment), I decided to put the database names into a variable as well:

--This table contains list of databases belonging to the system
declare @APPDatabases table (DBName sysname)

insert into @APPDatabases(DBName)
select 'FEEDS' union all
select 'REPORTING' union all
select 'PROCESSING' union all
select 'ARCHIVE'

 Require drole membership is covered by another table variable:

	-- Verify if all users defined in @DevLogins are refrenced in this table
-- Add one row for each user-role pair. If user belongs to more than one role, two rows have to be added
INSERT INTO @PermissionsToAdd(UserID, DBRole)
select 'Developer1', 'DeveloperRole' union all
select 'Developer2', 'DeveloperRole' union all
select 's-Developer3', 'DeveloperRole'

 Note that if you want particular user to be a member of more than role, each assignment has to be added to the table.

Some developers should have to have special rights assigned, like ability to run profiler or investigate DMVs. These rights can be implemented in the following way:

	--this table contains names of custom grant statements required for all LOGINS.
--DON'T add here execute rights to particular objects in databases
declare @CustomRights table(GrantRight varchar(100))
insert @CustomRights(GrantRight)
select 'GRANT VIEW SERVER STATE ' union all --use dmvs
select 'GRANT ALTER TRACE ' --run profiler

 The rights are assigned in the following script. Note there is also a way to skip certain logins from the assignment. You don't necessarily want to have application logins to have all rights required by developers.


raiserror('---## Adjust DEV login rights..', 10, 1 ) with nowait;

--exclude application account from custom rights
declare @ExcludeAccounts table(LoginName sysname)
insert @ExcludeAccounts(LoginName) values('
DOMAIN\s-Developer3')



select @RunSQL = convert(nvarchar(max), 'use master;' + char(10) + char(10) +
convert(nvarchar(max),
(select GrantRight + ' TO ' + quotename(LoginName) +';' + char(10) from @DevLogins cross join @CustomRights
where LoginName not in(select LoginName from @ExcludeAccounts) for xml path('')
)
) + char(10))

raiserror(@RunSql, 10, 1) with nowait;

if @DEBUG = 'N'
exec sp_executesql @RunSQL


raiserror('---## DEV logins rights adjusted.', 10, 1 ) with nowait;

And that's about it. You can modify the script to fit your needs for various environments. Even if you have to put some work initially in creating the script and setting up proper security for users, later you will have definitely less problems during the configuration of the environments. In fact, this script can be used to configure QA and UAT users, not only developers. I attach the full script, if you find the idea useful, please download it and adjust it to your needs.

 

 

 

 

 

 


 

 

 

 

 

 

 

 

DISABLE TRIGGER not captured, SSIS

This post was to be about behavior of SSIS when they insert rows to tables with triggers on them. We had an issue with one of our packs that began to fail 'all of a sudden'. But when I began writing this post and building test SSIS project and database objects, I came across something even more interesting.

Here we go:

The pack was working properly for several months and all of a sudden one of the execution paths began to fail. I related this failure to change that I had made a few days before. I added trigger to one of the tables that was populated by the pack. The pack uuses bulk method to insert rows to the table. There are numerous resources about the issue: by default triggers are disabled during BULK INSERT. MSDN has a document about different bulk operation techniques and behavior of triggers. I found it interesting by the way, that OPENROWSET method treats triggers in opposite way to BULK INSERT and bcp. SSIS behaves like bcp when it comes to bulk operations. If you want to use triggers during bulk insert, you must add FIRE_TRIGGERS hint in the Advanced Editor of OLEDB destination.

Usually (when FIRE_TRIGGERS is not supplied), SSIS will disable triggers while inserting rows to a table using fast load option. To find out what is happening, I set up DDL trigger that records all DDL operation on test database. The code to create audit table and the trigger is shown below:

--–this table will store some information about user activity
CREATE TABLE DDLAudit
(
dttime datetime,
strservername VARCHAR(25),
strusername VARCHAR(15),
strchange VARCHAR(max)
)

GO

CREATE
TRIGGER [trg_DDLAudit]
ON DATABASE
FOR DDL_DATABASE_LEVEL_EVENTS
AS
SET NOCOUNT ON
DECLARE @xEvent XML
SET @xEvent = eventdata()
INSERT INTO DDLAudit VALUES(
convert(datetime, convert(varchar(30), @xEvent.query('data(/EVENT_INSTANCE/PostTime)'))),
CONVERT(VARCHAR(25), @xEvent.query('data(/EVENT_INSTANCE/ServerName)')),
CONVERT(VARCHAR(15), @xEvent.query('data(/EVENT_INSTANCE/UserName)')),
CONVERT(VARCHAR(MAX), @xEvent.query('data(/EVENT_INSTANCE/TSQLCommand/CommandText)'))
)

GO

Since I have a CSV file with output of sys.dm_db_physical_index_stats DMV, I decided to create test table using code as follows:

select * into testbulk from sys.dm_db_index_physical_stats(db_id(), null, null, default, default)
where 1=0

GO

Then I created the history table and a (very simple) trigger for insert that copies the inserted rows from the work table testbulk to the history table.

select * into testbulkhistory from testbulk where 1=0

go

create
trigger trg_audit1 on testbulk
for insert
as
insert testbulkhistory
select * from inserted

To test that the trigger works, I ran the following queries. It works! :)

insert testbulk
select top 1 * from testbulk

select * from testbulkhistory

Ok, now it's time for the SSIS package. I created pretty simple pack that pulls data from CSV file and stores them in the testdb.dbo.testbulk table.

Data flow

I added Data Conversion and Derived Column components to adjust the data type, otherwise SSIS was complaining about conversion errors.

I set the data access mode to "Table or view - fast load".

Fast load data access mode set

Ok, I cleaned data in both tables and ran the pack.

delete testbulk
delete testbulkhistory

As expected, the DML trigger was disabled by SSIS. There were no data in testbulkhistory table. But I was surprised to see, that DDL trigger was not fired.

I launched Profiler to see what happens, but I didn't find any information. I added following events to the trace to see if I can catch the statement disabling DDL trigger:

Alter object events

I manually ran DROP TRIGGER and CREATE TRIGGER for the DDL trigger to see if I can catch the events.

Profiler trace

As you see, the events were captured correctly. I tried to capture also statements DISABLE TRIGGER and ENABLE TRIGGER, but these events are not covered by the Objects event category. There is a bug reported regarding DISABLE TRIGGER not being an event. I think it is not correct that there are two ways of disabling a DML trigger and only one of them is indirectly logged in ALTER TABLE statement. In case of DDL triggers, you don't have even the option to call 'logged' syntax, DISABLE and ENABLE TRIGGER events are just invisible to profiler.

If you think this is a problem, vote on this bug.

I wonder if there is a way to monitor what happens with the triggers during BCP/SSIS loads.

I attach test SSIS project and sample data, so you can test this yourself.


on update cascade...

When you define a foreign key constraint, you can also decide what SQL Server should do when primary key to which the constraint references to gets deleted or updated. While I always knew and imagined that ON DELETE CASCADE may be useful, I wondered, what scenarios would be suitable for ON UPDATE CASCADE. I still don't have this answer, but I came across some interesting behavior which kept me occupied for quite a bit more time that I had intended to.

A few days ago I decided to implement additional logging to some of our reporting helper tables.These tables have primary key defined on pair of columns and one of these columns is incremented by our ETL to indicate the version that should be returned by views that expose data.

create schema Reports authorization dbo

go
if
object_id('Reports.ClientVersionMap') is not null
drop table Reports.ClientVersionMap
go
create
table Reports.ClientVersionMap
(
ClientID char(5) not null,
Version int not null,
ChangeDate datetime
)

alter table Reports.ClientVersionMap add constraint PK_Reports_ClientVersionMap_ClientId_Version
primary key clustered (ClientId, Version)

go

I took pretty simple approach. For each table (there were three) I created separate corresponding table in History schema. The definitions of the tables were pretty much identical, I just added a datetime column to timestamp the change.


create schema History authorization dbo

go
if
object_id('History.ClientVersionMap') is not null
drop table History.ClientVersionMap
go
create
table History.ClientVersionMap
(
ClientID char(5) not null,
Version int not null,
ChangeDate datetime,
HistoryChange datetime constraint DEF_HistoryChange default(current_timestamp)
)

alter table History.ClientVersionMap add constraint FK_ClientId_Version
foreign key(ClientId, Version) references Reports.ClientVersionMap(ClientId, Version)
on delete cascade on update cascade

go
create
clustered index IX_HistoryChange on History.ClientVersionMap(HistoryChange)
go
create
nonclustered index IX_ClientID_Version on History.ClientVersionMap(ClientId, Version)
go

Then I created a trigger to populate the History data. The trigger is created for insert and update actions, and it always inserts row or rows from inserted to History table.


create trigger Reports.trg_SaveClientVersionMap on Reports.ClientVersionMap
for insert, update
as
begin
insert History.ClientVersionMap(ClientId, Version, ChangeDate, HistoryChange)
select ClientId, Version, ChangeDate, current_timestamp
from inserted
end

Ok, let's see how it works.

insert Reports.ClientVersionMap(ClientId, Version, ChangeDate)
values ('AAAA', 1, current_timestamp)

select * from History.ClientVersionMap
ClientID Version     ChangeDate              HistoryChange
-------- ----------- ----------------------- -----------------------
AAAA 1 2009-09-25 00:25:28.813 2009-09-25 00:25:28.987

Ok, this looks as expected. I added one row to the Reports table and I've got one in the History table.

Now, what happens when I update existing row? I want to have two rows in the History table.

 

update Reports.ClientVersionMap set Version = Version + 1, ChangeDate = current_timestamp where ClientId = 'AAAA'

select * from History.ClientVersionMap

And indeed,  I have two:

ClientID Version     ChangeDate              HistoryChange
-------- ----------- ----------------------- -----------------------
AAAA 2 2009-09-25 00:25:28.813 2009-09-25 00:25:28.987
AAAA 2 2009-09-25 00:32:03.670 2009-09-25 00:32:03.670

But what happened? I was pretty surprised by this result. I should have one row with Version value of 1 and second row with 2. I thought I had clean quick solution, and all of a sudden a problem. I had more code in my script, some procs, view definition changes and so on. It took me a while (a bunch of print statements too) to see the reason of this behavior: on update cascade in the definition of the foreign key constraint of the History.ClientVersioMap table.

So, it works like this. You have primary key on table A and foreign key on table B that references table A with option ON UPDATE CASCADE. Whenever you modify the primary key on table A, the foreign key value is modified accordingly to reflect the change. In this way you don't have to bother about referential integrity in a way you might be used to. At least I am used. In my opinion you need to analyze carefully if putting this option to foreign key is safe (in a way). If you have it, basically nothing can stop you then from updating primary key and having the change propagated through referencing tables, possibly millions of rows. This can be pretty expensive experience. Also, it may not be a good idea to have mutable primary key - change of primary key in some cases may lead to fragmentation of index the key it is defining.

Definitely this option can be useful in some cases, but it has to be used with care, and not unintentionally, as it happened to me.

 

 

 

 

Posted by Piotr Rodak | with no comments

Sql release builder

For a number of weeks, we have been regularly releasing code to our UAT database. The release process of database code differs slightly from releasing binary code because the release has to take into account existing data and schema in the database. With binary code (as in C# application) usually old version is just overwritten with new version. For database releases, change script has to be implemented that contains relevant changes to schema and data in the database. Developers who work on QC items prepare partial scripts that cover only issues discovered during QA.

The fixes and new features may be implemented in many SQL scripts. If there are a few teams working on the system, the number of files can grow significantly. There is always a question of sequence the scripts have to be executed in. Obviously, alter script adding a column to a table has to be executed before the script that alters view to use this column for example. The order of the scripts is not always obvious and during testing of the release it may happen that exceptions are thrown because of invalid schema for example.

Trying to simplify testing and the actual release, we decided to release only one script that contains all partial updates. I thought - wouldn't it be nice to be able to drag files to an app and then arrange them appropriately, then save the resulting script? I looked for a simple utility that will concatenate scripts into one file that would be easier to manage. I found no such utilities (maybe I wasn't looking well enough - if you know about an app like this, let me know!). I considered using a batch file with copy command to stick the scripts together, but this wasn't a 'clean' solution. So I decided to write such tool. 

The principle is very simple: you drag the files to the application, order them the right way and you can save the output. This output file contains all partial scripts concatenated and framed with logging messages. These messages allow you to observe progress of the release and locate scripts that cause issues more easily.

The picture shows the screen of the application:


On the left side there is list of files that are part of the release. On the right side, resulting script is being shown. Clicking on a file in the left listbox navigates to the beginning of this file in the right window. As you can see, the file is preceeded with raiserror message logging the name of the file. Additionally, full file path is commented into the release script. The logging messages have by default "---#" prefix, which allows them to be easily distinguished from other logging that may take place during execution of the release script.

You can save both output of the "compilation" and the project, which contains list of partial scripts in the order they are displayed in the listbox. Having a project has the advantage of being able to regenerate the same output if the partial files change. You can refresh the list of files with command File/Refresh. This command reloads files listed in the listbox.

There is configuration file associated with the application. It contains several simple settings:

<configuration>
    <appSettings>
        <add key="AddFileName" value="true"/>
        <add key="AddFileNameFormat" value="\r\n/******************** {FileName} **********************/\r\n"/>
        <add key="ItemSeparator" value="\r\nGO\r\n/* END END ******************* {FileName} ********************* END END */\r\n\r\n"/>
        <add key="AllowedExtensions" value=".sql;.prc;.tab"/>
        <add key="OutputFileName" value="true"/><!--should filename be output to resulting script-->
       </appSettings>
</configuration> 

 These settings are as follows:

  • AddFileName: If true, AddFileNameFormat setting is processed.
  • AddFileNameFormat: This string is inserted into the release script at the beginning of the partial script. the {FileName} variable is replaced with the partial script name
  • ItemSeparator: This string is inserted into the release script at the end of the partial script. the {FileName} variable is replaced with the partial script name
  • AllowedExtensions: Defines extensions of files that can be dropped on the application.
  • OutputFileName: If set to true, names of executing files are printed to output when the release script is run

Even though this application lacks bells and whistles, it saved us already a ton of time. I hope you will find it usefull too!

For your convenience, I attach two files: one is compressed solution for Visual Studio 2008, the other one, for these of you who are not interested in C# and compiling the code, msi installer that will install the application. You will find it in the Programs menu, in main folder.

ReleaseBuilderSolution.zip

ReleaseBuilderSetup.zip

 

Note: The utility is provided as is and you use it at your responsibility. You may copy and modify the application, but not sell it.

Let me know what you think of this tool.

 

Posted by Piotr Rodak | with no comments

ETL performance test - parallel package execution

I was recently busy with testing performance of our ETL process. Each day we process a number of independent feeds, which will increase in the future. These feeds contain usually similar number of rows. At present we have over 100 hundred feeds that are run daily. We have implemented a C# application that executes SSIS packages programatically, adjusting their various settings and setting some variables at run time. One of tests that we perform is to run all feeds in one go. Obviously, it is not possible (as we learned some time ago) to run 100 packages at the same time, because of memory pressure. We developed a solution that allows for configuration of maximum number of packages that can run at any given moment. This works pretty well, although is not clever enough to react to increasing load of the packages - one day we will implement this :).

We have a test environment which processes about 8000 rows in 117 feeds. Not too much, but this is current characteristics of the data. Some of the feeds are empty, but we still need to process them. All packages in general load data from a SQL Server 2005 database, store them in temporary files, reload the data to separate data flows, perform some transformations and output the data to two destinations. I ran a series of tests to see how the MaxConcurrentPacks configuration option of our process is related to execution time of all feeds configured in the system.

I put together comparison of execution times for different settings, let's have a look:


RunId CountOfFeeds MaxConcurrentPacks NumberOfRowsProcessed DurationOfRun
1 117 1 8226 5456
2 117 2 8226 3103
3 117 5 8226 1999
4 117 7 8226 1754
5 117 10 8226 1765
6 117 15 8226 1702


I created also chart that shows relation between number of packages and overall execution time of the whole lot of feeds.


As you see, there is significant drop of execution time when 5 concurrent packages are run compared to 1 package (basically, serial execution). Adding more packages doesn't improve the performance in such extent.

I also wanted to see, what will happen if I increase amount of data to process. I modified configuration so most of the feeds process quite a bit of data now and altogether, there is 3.3 million rows to be transfered. I started with 2 concurrent packages and the overall execution time exceeded 3 hours. I noticed that memory usage was significantly higher, reaching 6GB (on server with 8GB of RAM). Then I increased number of concurrent packages to 5. This maxed out the memory usage and the whole process crashed - I had to kill it actually, because system became unresponsive and had troubles with launching new applications. The application logged errors that indicate problems with memory pressure  in the log file :

A buffer failed while allocating 72816 bytes.

The system reports 97 percent memory load. There are 8587444224 bytes of physical memory with 236855296 bytes free. There are 8796092891136 bytes of virtual memory with 8782502318080 bytes free. The paging file has 12419862528 bytes with 5967872 bytes free.

The Data Flow task failed to create a required thread and cannot begin running. The usually occurs when there is an out-of-memory state.

and
The Data Flow task engine failed at startup because it cannot create one or more required threads.

For the above setup (3.3M rows), 2 concurrent packs seem to be pretty safe setting.  If you plan to implement concurrent package execution in your solutions, you should run tests and be able to modify easily the number of packages (for example, setting in configuration file) to adjust it in case you experience performance problems.

The testing application server is a quad CPU with 8GB of memory, running 64bit Windows Server 2003 Enterprise Edition.

Of course, your particular design and data conditions may be completely different from our setup, and mileage may vary. I think though it is interesting to see, that a bit of effort put into design of ETL may improve the throughput of the system. It often happens that application servers are either underutilized or run into performance issues because of the rigid design of the ETL.

 

 

Posted by Piotr Rodak | with no comments
Filed under: , , ,

BCP and numeric data field with scientific notation

There is a known issue in SQL Server 2005 with importing data using bcp.exe or BULK INSERT methods from character files that contain numeric values written using scientific notation, like 2.044E10. It was not a problem in versions prior to 2005 because bcp for SQL Server 200 and 7.0 converted such values implicitly. Beginning with SQL Server 2005, BCP follows the same rules when converting data from input files as CONVERT does. Unfortunately, CONVERT doesn't understand scientific notation if it is passed as a string. Check this out:

select convert ( decimal ( 18 , 6 ), 2.6944 E- 01 ) -- this works

select convert ( decimal ( 18 , 6 ), '2.6944E-01' ) --this doesn't

Msg 8114, Level 16, State 5, Line 1

Error converting data type varchar to numeric .

 

This error becomes especially bothersome when you have data feeds in character format, that contain fields written in scientific notation.  Let's create a table that we will populate with data from a file:

SET ANSI_NULLS ON

GO

SET QUOTED_IDENTIFIER ON

GO

CREATE TABLE [dbo].[IndexComponentsTypeChecked](

[RowID] [int] NOT NULL ,

[IndexID] [varchar]( 11 ) NULL ,

[ComponentId] [varchar]( 11 ) NULL ,

[ComponentCurrency] [char]( 3 ) NULL ,

[NumberOfItems] [decimal]( 18 , 6 ) NULL ,

[ComponentOrigprice] [decimal]( 18 , 6 ) NULL ,

[ComponentWeight] [decimal]( 18 , 6 ) NULL ,

[IndexType] [varchar]( 12 ) NULL ,

[IndexDivisor] [decimal]( 18 , 6 ) NULL

) ON [PRIMARY]

 

 Now Books On Line says that if you have columns with scientific notation in your file, you should specify float data type in your format file. I prepared test file along with several format files that define data format for above table. I attached all files used in this post in bcptests.zip file for your convenience.

The file testfeed.txt contains single line of data, containing values for all columns of the table IndexComponentsTypeChecked .

15564 TEST.INDX TSTSTR. 0001 EUR 0.9 0.0 2.6944 E- 01 INDEX 1.0

As you see, column ComponentWeight , defined as decimal(18, 6) is populated with value written using scientific notation.

First, let's try what happens when we don't use any format file.

bulk insert dbo.IndexComponentsTypeChecked from 'c:\Projects\BulkTest\testfeed.txt'

The output is as follows:

Msg 4864, Level 16, State 1, Line 7

Bulk load data conversion error (type mismatch or invalid character for the specified codepage) for row 1, column 7 (ComponentWeight).

  

Following the documentation, you can prepare a format file thatyou will pass to the BULK INSERT statement to overcome the conversion error. Here is line from the format file bcpcfloat.fmt :

7 SQLFLT8 0 41 "\t" 7 ComponentWeight ""

As you can see, the data type for the column has been defined as SQLFLT8, which is mapped to float t-sql data type.

The query loading data from the data file is looking now like this:

bulk insert dbo.IndexComponentsTypeChecked from 'c:\Projects\BulkTest\testfeed.txt' with (formatfile = 'c:\Projects\BulkTest\bcpcfloat.fmt' )

If you run above query, it will work, because 2.6944E-01 is converted to float and then to decimal(18,6). Seems that the problem is solved, isn't it? Unfortunately, I discovered that this doesn't work if value is negative. The file testfeednegative.txt contains the same data, but ComponentWeight column is populated with negative value this time:

15564 TEST.INDX TSTSTR. 0001 EUR 0.9 0.0 -2.6944E-01 INDEX 1.0

The following query now returns error:

bulk insert dbo.IndexComponentsTypeChecked from 'c:\Projects\BulkTest\testfeednegative.txt' with (formatfile = 'c:\Projects\BulkTest\bcpcfloat.fmt' )

Msg 8115, Level 16, State 6, Line 5

Arithmetic overflow error converting float to data type numeric .

Interestingly enough, CONVERT will work in this case:

select convert ( float , -2.6944E-01 )

There is another workaround (for the scientific notation workaround).You can use XML format files. While they basically contain the same formatting information, they cause BCP to behave in a different way. XML format files are more verbose than legacy format files, but also easier to understand the processing of data by bulk operations using them. Note that you can generate xml (and non-xml) format files using bcp utility.

This is the xml format file bcp.xml:

<?xml version="1.0"?>

<BCPFORMAT xmlns =" http://schemas.microsoft.com/sqlserver/2004/bulkload/format " xmlns:xsi =" http://www.w3.org/2001/XMLSchema-instance ">

<RECORD>

<FIELD ID =" 1 " xsi:type =" CharTerm " TERMINATOR =" \t " MAX_LENGTH =" 12 "/>

<FIELD ID =" 2 " xsi:type =" CharTerm " TERMINATOR =" \t " MAX_LENGTH =" 11 " COLLATION =" SQL_Latin1_General_CP1_CI_AS "/>

<FIELD ID =" 3 " xsi:type =" CharTerm " TERMINATOR =" \t " MAX_LENGTH =" 11 " COLLATION =" SQL_Latin1_General_CP1_CI_AS "/>

<FIELD ID =" 4 " xsi:type =" CharTerm " TERMINATOR =" \t " MAX_LENGTH =" 3 " COLLATION =" SQL_Latin1_General_CP1_CI_AS "/>

<FIELD ID =" 5 " xsi:type =" CharTerm " TERMINATOR =" \t " MAX_LENGTH =" 41 "/>

<FIELD ID =" 6 " xsi:type =" CharTerm " TERMINATOR =" \t " MAX_LENGTH =" 41 "/>

<FIELD ID =" 7 " xsi:type =" CharTerm " TERMINATOR =" \t " MAX_LENGTH =" 41 "/>

<FIELD ID =" 8 " xsi:type =" CharTerm " TERMINATOR =" \t " MAX_LENGTH =" 12 " COLLATION =" SQL_Latin1_General_CP1_CI_AS "/>

<FIELD ID =" 9 " xsi:type =" CharTerm " TERMINATOR =" \r\n " MAX_LENGTH =" 41 "/>

</RECORD>

<ROW>

<COLUMN SOURCE =" 1 " NAME =" RowID " xsi:type =" SQLINT "/>

<COLUMN SOURCE =" 2 " NAME =" IndexID " xsi:type =" SQLVARYCHAR "/>

<COLUMN SOURCE =" 3 " NAME =" ComponentId " xsi:type =" SQLVARYCHAR "/>

<COLUMN SOURCE =" 4 " NAME =" ComponentCurrency " xsi:type =" SQLCHAR "/>

<COLUMN SOURCE =" 5 " NAME =" NumberOfItems " xsi:type =" SQLDECIMAL " PRECISION =" 18 " SCALE =" 6 "/>

<COLUMN SOURCE =" 6 " NAME =" ComponentOrigprice " xsi:type =" SQLDECIMAL " PRECISION =" 18 " SCALE =" 6 "/>

<COLUMN SOURCE =" 7 " NAME =" ComponentWeight " xsi:type =" SQLFLT8 "/>

<COLUMN SOURCE =" 8 " NAME =" IndexType " xsi:type =" SQLVARYCHAR "/>

<COLUMN SOURCE =" 9 " NAME =" IndexDivisor " xsi:type =" SQLDECIMAL " PRECISION =" 18 " SCALE =" 6 "/>

</ROW>

</BCPFORMAT>


As you see, other decimal(18, 6) columns are processed correctly unless they contain scientific notation data.

The following query succeeds:

bulk insert dbo.IndexComponentsTypeChecked from 'c:\Projects\BulkTest\testfeednegative.txt'

with (formatfile = 'c:\Projects\BulkTest\bcp.xml' )

 

I wondered, what happens when I change the destination type to float? I altered table IndexComponentsTypeChecked using the following query:

alter table [IndexComponentsTypeChecked] alter column ComponentWeight float null

GO

bulk insert dbo.IndexComponentsTypeChecked from 'c:\Projects\BulkTest\testfeednegative.txt' with (formatfile = 'c:\Projects\BulkTest\bcpc1.fmt' )

This time, even negative value was processed correctly!

So, it looks like to properly process numeric columns with potentially negative scientific notation values you have to use XML format files if you have to use character data files. If you can use native data files the problem doesn't exists, as all numbers are stored as binary values in the data file. I created data file datanative.txt and format file bcpnative.fmt so you can investigate their structure.

There are still organizations that have large systems based on SQL Server 2000. Usually such systems process large number of feeds, many of them in character format. The change of behavior of bcp and especially bug with negative values can add a lot of work during migration of the systems. Perhaps Microsoft should think about making CONVERT understand string scientific notation values, this might save major headaches for many developers and project managers.

 

MARS - does anyone use it?

I read recently about MARS - Multiple Active Result Sets, functionality that came with SQL Server 2005. I tried to find some 'real life' example of using MARS. Most of the resources I found showed examples on AdventureWorks  database and they were, to say the least, showing how NOT to access the database. For example this article by Lawrence Moroney, shows two ways of updating inventory on AdventureWorks database. The first way requires opening the connection twice, once to read order details, second time to update inventory. The second way allows to save one connection by interleaving read operation with updates.

string connectionString = "Data Source=MEDIACENTER;" +
"Initial Catalog=AdventureWorks;Integrated Security=SSPI;" +
"MultipleActiveResultSets=True";
string
strSQLGetOrder = "Select * from Sales.SalesOrderDetail" +
"WHERE SalesOrderID = 43659";

string strSQLUpdateInv = "UPDATE Production.ProductInventory " +
"SET Quantity=Quantity-@amt WHERE (ProductID=@pid)";

SqlConnection
marsConnection = new SqlConnection(connectionString);
marsConnection.Open();

SqlCommand
readCommand =
new SqlCommand(strSQLGetOrder, marsConnection);
SqlCommand
writeCommand =
new SqlCommand(strSQLUpdateInv, marsConnection);

writeCommand.Parameters.Add(
"@amt", SqlDbType.Int);
writeCommand.Parameters.Add(
"@pid", SqlDbType.Int);
using (SqlDataReader rdr = readCommand.ExecuteReader())
{
while (rdr.Read())
{
writeCommand.Parameters["@amt"].Value = rdr["OrderQty"];
writeCommand.Parameters["@pid"].Value = rdr["ProductID"];
writeCommand.ExecuteNonQuery();
}
}
marsConnection.Close();

 

Now Lawrence writes that this is cleaner code comparing to 'classic' approach. While there are fewer lines, this is not what I would like developers to do when it comes to updating tables. This is client side RBAR (read Jeff Moden's articles- they are excellent!)

 The cleaner approach, besides using proper stored procedure is this:

Untitled
string connectionString = "Data Source=MEDIACENTER;" +
"Initial Catalog=AdventureWorks;Integrated Security=SSPI;";

string strSQLUpdateInv = "update Production.ProductInventory"+
" set Quantity = Quantity - OrderQty" +
" from Sales.SalesOrderDetail a inner join" +
" Production.ProductInventory b" +
" on a.ProductID = b.ProductId" +
" where a.SalesOrderID = 43659";

SqlConnection
aConnection = new SqlConnection(connectionString);
aConnection.Open();

SqlCommand
writeCommand =
new SqlCommand(strSQLUpdateInv, marsConnection);
writeCommand.ExecuteNonQuery();
}
aConnection.Close();

What happened to the principle that tables should not be 'touched' directly from client code? Ok, I know this is a sample only. But why the sample has to be built around completely wrong concept? Even worse sample can be found in the Training Kit book for exam 70-442 (Desigining and Optimizing Data Access by Using SQL Server 2005). Authors unfortunately didn't bother even to use parameters in sample queries - plain concatenation of strings is being taught there. 

I found more plausible example of using MARS. It's a good article of Thiru Thangarathinam. This is usage I could imagine, although I would rather be unlikely to actually use it. In my view, there is a tier lost somewhere in multi tier architecture in this case - if a website is going to be heavy loaded, one could benefit more from a  proper cache (in business layer) than from MARS connections. 

I  am not against MARS. I believe that there are scenarios which are much simpler to implement using MARS connections than without them or on a database level. I would like to know what are these scenarios. Have you come across such use cases?

 


Posted by Piotr Rodak | with no comments
Filed under: , , ,

Generate Create Database Snapshot script

This post is about yet another way of skinning a cat.

Recently I 'discovered' usefulness of database snapshots. I find them extremely useful for testing environments, where it is important to be able to revert to initial state of environment in case of any issues.

One thing that never stops amusing me though is, that SQL Server Management Studio 2005 doesn't provide simple things as 'Create snapshot' of selected database for example (There are other missing things of course!).

So, after creating a few snapshots manually, I looked for a script that creates snapshot code for me. I found script posted some time ago by Dejan Sunderic, but it wasn't what I was looking for. I don't want to have to create a stored procedure anywhere to run some adhoc queries. I decided to write something simple that I can put into templates in my SSMS and run it on any database I want. This is what I came up with:

---this script generates create snapshot code. run in text mode.

with

preamble(c) as (select 'create database ' + db_name() + '_Snapshot on'),

files(c) as (select '(name=' + name + ', filename=''' + physical_name + '.ss'')' + char(10) from sys.database_files where type = 0),

filescoalesce (c) as (select c + ',' from files for xml path('')),

lastline(c) as (select 'as snapshot of ' + db_name() + char(10) + char(10) + 'GO' + char(10))

select c [--] from preamble

union all

select left(c, len(c) -2) from filescoalesce

union all

select c from lastline

 

 The output for one of my test databases looks like this, ready for execution:

create database Post_Release_Snapshot on

(name=ds, filename='J:\SQLData\TEST02.mdf.ss')

,(name=ds2, filename='J:\SQLData\TEST02_1.ndf.ss')

,(name=ds3, filename='J:\SQLData\TEST02_2.ndf.ss')

,(name=ds4, filename='J:\SQLData\TEST02_4.ndf.ss')

as snapshot of Post_Release

GO

 

 

Posted by Piotr Rodak | 2 comment(s)
Filed under: ,
More Posts Next page »