ETL performance test - parallel package execution

I was recently busy testing the performance of our ETL process. Each day we process a number of independent feeds, and that number will grow in the future. The feeds usually contain a similar number of rows. At present we have over 100 feeds that run daily. We have implemented a C# application that executes SSIS packages programmatically, adjusting various settings and setting some variables at run time. One of the tests we perform is to run all feeds in one go. Obviously, it is not possible (as we learned some time ago) to run 100 packages at the same time because of memory pressure. We developed a solution that lets us configure the maximum number of packages that can run at any given moment. This works pretty well, although it is not clever enough to react to the increasing load of the packages - one day we will implement this :).
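To make the throttling idea concrete, here is a minimal sketch in Python rather than the C# of the real application. It is not our actual code: run_package is a stand-in that only sleeps, and MAX_CONCURRENT_PACKS is a hypothetical constant playing the role of the MaxConcurrentPacks option. The size of the worker pool is what caps the number of packages running at once.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

MAX_CONCURRENT_PACKS = 5  # hypothetical value; stands in for the MaxConcurrentPacks option

_lock = threading.Lock()
_running = 0
peak_running = 0

def run_package(feed_id):
    """Stand-in for executing one SSIS package; it only sleeps briefly."""
    global _running, peak_running
    with _lock:
        _running += 1
        peak_running = max(peak_running, _running)
    time.sleep(0.01)  # pretend to do the work
    with _lock:
        _running -= 1
    return feed_id

def run_all_feeds(feed_ids, max_concurrent=MAX_CONCURRENT_PACKS):
    # The pool size is the throttle: at most max_concurrent packages run at once.
    with ThreadPoolExecutor(max_workers=max_concurrent) as pool:
        return list(pool.map(run_package, feed_ids))

completed = run_all_feeds(range(117))
print(len(completed), "feeds done, peak concurrency", peak_running)
```

A fixed pool like this cannot react to memory pressure; it only guarantees an upper bound on concurrency.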

We have a test environment which processes about 8000 rows in 117 feeds. It's not much, but this is the current characteristic of the data. Some of the feeds are empty, but we still need to process them. In general, all packages load data from a SQL Server 2005 database, store it in temporary files, reload the data into separate data flows, perform some transformations and output the data to two destinations. I ran a series of tests to see how the MaxConcurrentPacks configuration option of our process affects the execution time of all feeds configured in the system.

I put together a comparison of execution times for different settings. Let's have a look:


RunId  CountOfFeeds  MaxConcurrentPacks  NumberOfRowsProcessed  DurationOfRun
1      117           1                   8226                   5456
2      117           2                   8226                   3103
3      117           5                   8226                   1999
4      117           7                   8226                   1754
5      117           10                  8226                   1765
6      117           15                  8226                   1702


I also created a chart that shows the relation between the number of concurrent packages and the overall execution time of the whole lot of feeds.


As you can see, there is a significant drop in execution time when 5 concurrent packages are run compared to 1 package (basically, serial execution). Adding more packages doesn't improve performance to the same extent.
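The diminishing returns are easier to see as speedup factors relative to the serial run. The short Python sketch below only recomputes ratios from the DurationOfRun figures in the table; it assumes nothing about the unit of those durations.

```python
# (MaxConcurrentPacks, DurationOfRun) pairs copied from the table above.
runs = [(1, 5456), (2, 3103), (5, 1999), (7, 1754), (10, 1765), (15, 1702)]

serial = runs[0][1]
speedups = {packs: serial / duration for packs, duration in runs}

for packs, speedup in speedups.items():
    print(f"{packs:>2} concurrent: {speedup:.2f}x faster than serial")
```

Going from 1 to 5 packages gives roughly a 2.7x speedup, while tripling that to 15 packages only reaches about 3.2x.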

I also wanted to see what would happen if I increased the amount of data to process. I modified the configuration so that most of the feeds now process quite a bit of data; altogether, there are 3.3 million rows to be transferred. I started with 2 concurrent packages and the overall execution time exceeded 3 hours. I noticed that memory usage was significantly higher, reaching 6GB (on a server with 8GB of RAM). Then I increased the number of concurrent packages to 5. This maxed out the memory usage and the whole process crashed - actually, I had to kill it, because the system became unresponsive and had trouble launching new applications. The application logged errors in the log file that indicate memory pressure:

A buffer failed while allocating 72816 bytes.

The system reports 97 percent memory load. There are 8587444224 bytes of physical memory with 236855296 bytes free. There are 8796092891136 bytes of virtual memory with 8782502318080 bytes free. The paging file has 12419862528 bytes with 5967872 bytes free.

The Data Flow task failed to create a required thread and cannot begin running. The usually occurs when there is an out-of-memory state.

and
The Data Flow task engine failed at startup because it cannot create one or more required threads.

For the above setup (3.3M rows), 2 concurrent packages seems to be a pretty safe setting. If you plan to implement concurrent package execution in your solutions, you should run tests and make the number of packages easy to modify (for example, as a setting in a configuration file), so you can adjust it if you experience performance problems.

The test application server is a quad-CPU machine with 8GB of memory, running 64-bit Windows Server 2003 Enterprise Edition.

Of course, your particular design and data conditions may be completely different from our setup, and your mileage may vary. I think, though, that it is interesting to see that a bit of effort put into the design of ETL may improve the throughput of the system. It often happens that application servers are either underutilized or run into performance issues because of a rigid ETL design.

 

 

Posted by Piotr Rodak | with no comments

BCP and numeric data field with scientific notation

There is a known issue in SQL Server 2005 with importing data using bcp.exe or BULK INSERT from character files that contain numeric values written in scientific notation, like 2.044E10. It was not a problem in versions prior to 2005, because bcp for SQL Server 2000 and 7.0 converted such values implicitly. Beginning with SQL Server 2005, BCP follows the same rules as CONVERT when converting data from input files. Unfortunately, CONVERT doesn't understand scientific notation if it is passed as a string. Check this out:

select convert(decimal(18, 6), 2.6944E-01) -- this works

select convert(decimal(18, 6), '2.6944E-01') -- this doesn't

Msg 8114, Level 16, State 5, Line 1
Error converting data type varchar to numeric.

 

This error becomes especially bothersome when you have data feeds in character format that contain fields written in scientific notation. Let's create a table that we will populate with data from a file:

SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO

CREATE TABLE [dbo].[IndexComponentsTypeChecked](
        [RowID] [int] NOT NULL,
        [IndexID] [varchar](11) NULL,
        [ComponentId] [varchar](11) NULL,
        [ComponentCurrency] [char](3) NULL,
        [NumberOfItems] [decimal](18, 6) NULL,
        [ComponentOrigprice] [decimal](18, 6) NULL,
        [ComponentWeight] [decimal](18, 6) NULL,
        [IndexType] [varchar](12) NULL,
        [IndexDivisor] [decimal](18, 6) NULL
) ON [PRIMARY]

 

Books Online says that if you have columns with scientific notation in your file, you should specify the float data type in your format file. I prepared a test file along with several format files that define the data format for the above table. I attached all files used in this post in the bcptests.zip file for your convenience.

The file testfeed.txt contains a single line of data, with values for all columns of the table IndexComponentsTypeChecked.

15564 TEST.INDX TSTSTR. 0001 EUR 0.9 0.0 2.6944E-01 INDEX 1.0

As you can see, the column ComponentWeight, defined as decimal(18, 6), is populated with a value written in scientific notation.

First, let's see what happens when we don't use any format file.

bulk insert dbo.IndexComponentsTypeChecked from 'c:\Projects\BulkTest\testfeed.txt'

The output is as follows:

Msg 4864, Level 16, State 1, Line 7

Bulk load data conversion error (type mismatch or invalid character for the specified codepage) for row 1, column 7 (ComponentWeight).

  

Following the documentation, you can prepare a format file that you pass to the BULK INSERT statement to overcome the conversion error. Here is the relevant line from the format file bcpcfloat.fmt:

7 SQLFLT8 0 41 "\t" 7 ComponentWeight ""

As you can see, the data type for the column has been defined as SQLFLT8, which maps to the T-SQL float data type.

The query loading data from the data file now looks like this:

bulk insert dbo.IndexComponentsTypeChecked from 'c:\Projects\BulkTest\testfeed.txt' with (formatfile = 'c:\Projects\BulkTest\bcpcfloat.fmt' )

If you run the above query, it will work, because 2.6944E-01 is converted to float and then to decimal(18,6). It seems the problem is solved, doesn't it? Unfortunately, I discovered that this doesn't work if the value is negative. The file testfeednegative.txt contains the same data, but the ComponentWeight column is populated with a negative value this time:

15564 TEST.INDX TSTSTR. 0001 EUR 0.9 0.0 -2.6944E-01 INDEX 1.0

The following query now returns an error:

bulk insert dbo.IndexComponentsTypeChecked from 'c:\Projects\BulkTest\testfeednegative.txt' with (formatfile = 'c:\Projects\BulkTest\bcpcfloat.fmt' )

Msg 8115, Level 16, State 6, Line 5

Arithmetic overflow error converting float to data type numeric.

Interestingly enough, CONVERT will work in this case:

select convert ( float , -2.6944E-01 )

There is another workaround (a workaround for the scientific notation workaround). You can use XML format files. While they contain basically the same formatting information, they cause BCP to behave in a different way. XML format files are more verbose than legacy format files, but they also make it easier to understand how bulk operations process the data. Note that you can generate XML (and non-XML) format files using the bcp utility.

This is the xml format file bcp.xml:

<?xml version="1.0"?>
<BCPFORMAT xmlns="http://schemas.microsoft.com/sqlserver/2004/bulkload/format" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <RECORD>
    <FIELD ID="1" xsi:type="CharTerm" TERMINATOR="\t" MAX_LENGTH="12"/>
    <FIELD ID="2" xsi:type="CharTerm" TERMINATOR="\t" MAX_LENGTH="11" COLLATION="SQL_Latin1_General_CP1_CI_AS"/>
    <FIELD ID="3" xsi:type="CharTerm" TERMINATOR="\t" MAX_LENGTH="11" COLLATION="SQL_Latin1_General_CP1_CI_AS"/>
    <FIELD ID="4" xsi:type="CharTerm" TERMINATOR="\t" MAX_LENGTH="3" COLLATION="SQL_Latin1_General_CP1_CI_AS"/>
    <FIELD ID="5" xsi:type="CharTerm" TERMINATOR="\t" MAX_LENGTH="41"/>
    <FIELD ID="6" xsi:type="CharTerm" TERMINATOR="\t" MAX_LENGTH="41"/>
    <FIELD ID="7" xsi:type="CharTerm" TERMINATOR="\t" MAX_LENGTH="41"/>
    <FIELD ID="8" xsi:type="CharTerm" TERMINATOR="\t" MAX_LENGTH="12" COLLATION="SQL_Latin1_General_CP1_CI_AS"/>
    <FIELD ID="9" xsi:type="CharTerm" TERMINATOR="\r\n" MAX_LENGTH="41"/>
  </RECORD>
  <ROW>
    <COLUMN SOURCE="1" NAME="RowID" xsi:type="SQLINT"/>
    <COLUMN SOURCE="2" NAME="IndexID" xsi:type="SQLVARYCHAR"/>
    <COLUMN SOURCE="3" NAME="ComponentId" xsi:type="SQLVARYCHAR"/>
    <COLUMN SOURCE="4" NAME="ComponentCurrency" xsi:type="SQLCHAR"/>
    <COLUMN SOURCE="5" NAME="NumberOfItems" xsi:type="SQLDECIMAL" PRECISION="18" SCALE="6"/>
    <COLUMN SOURCE="6" NAME="ComponentOrigprice" xsi:type="SQLDECIMAL" PRECISION="18" SCALE="6"/>
    <COLUMN SOURCE="7" NAME="ComponentWeight" xsi:type="SQLFLT8"/>
    <COLUMN SOURCE="8" NAME="IndexType" xsi:type="SQLVARYCHAR"/>
    <COLUMN SOURCE="9" NAME="IndexDivisor" xsi:type="SQLDECIMAL" PRECISION="18" SCALE="6"/>
  </ROW>
</BCPFORMAT>


As you can see, the other decimal(18, 6) columns are declared as SQLDECIMAL; they are processed correctly as long as they don't contain scientific notation data.

The following query succeeds:

bulk insert dbo.IndexComponentsTypeChecked from 'c:\Projects\BulkTest\testfeednegative.txt'

with (formatfile = 'c:\Projects\BulkTest\bcp.xml' )

 

I wondered what would happen if I changed the destination type to float. I altered the table IndexComponentsTypeChecked and reran the load using the following queries:

alter table [IndexComponentsTypeChecked] alter column ComponentWeight float null

GO

bulk insert dbo.IndexComponentsTypeChecked from 'c:\Projects\BulkTest\testfeednegative.txt' with (formatfile = 'c:\Projects\BulkTest\bcpc1.fmt' )

This time, even the negative value was processed correctly!

So it looks like, to properly process numeric columns with potentially negative scientific notation values, you have to use XML format files if you have to use character data files. If you can use native data files, the problem doesn't exist, as all numbers are stored as binary values in the data file. I created the data file datanative.txt and the format file bcpnative.fmt so you can investigate their structure.

There are still organizations that have large systems based on SQL Server 2000. Usually such systems process a large number of feeds, many of them in character format. The change in bcp behavior, and especially the bug with negative values, can add a lot of work when migrating such systems. Perhaps Microsoft should think about making CONVERT understand scientific notation in strings; this might save major headaches for many developers and project managers.
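For feeds where changing the format files is not an option, another route is to rewrite the scientific notation values before the file ever reaches bcp. This is not something I tested as part of this post; it is only a sketch in Python, and the names normalize_field and normalize_line are made up for illustration. It relies on the standard decimal module, which parses values such as -2.6944E-01 exactly and can print them back in plain notation.

```python
from decimal import Decimal, InvalidOperation

def normalize_field(text):
    """Rewrite a numeric field in plain notation; leave anything else untouched."""
    if "E" not in text.upper():
        return text
    try:
        return format(Decimal(text), "f")
    except InvalidOperation:
        return text  # not a number after all, e.g. a code that contains the letter E

def normalize_line(line, separator="\t"):
    # Apply the rewrite field by field, keeping the separator as it was.
    return separator.join(normalize_field(f) for f in line.split(separator))

# Shortened, tab-separated stand-in for a feed row (not the full test file line).
sample = "\t".join(["15564", "TEST.INDX", "EUR", "0.9", "-2.6944E-01", "INDEX", "1.0"])
print(normalize_line(sample))
```

Text fields that merely contain the letter E, such as TEST.INDX or EUR, fail to parse as numbers and are passed through unchanged.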

 

MARS - does anyone use it?

I recently read about MARS - Multiple Active Result Sets, functionality that came with SQL Server 2005. I tried to find some 'real life' examples of using MARS. Most of the resources I found showed examples on the AdventureWorks database and they were, to say the least, showing how NOT to access the database. For example, this article by Lawrence Moroney shows two ways of updating inventory in the AdventureWorks database. The first way requires opening the connection twice, once to read order details and a second time to update inventory. The second way saves one connection by interleaving the read operation with updates.

// Requires: using System.Data; using System.Data.SqlClient;
string connectionString = "Data Source=MEDIACENTER;" +
    "Initial Catalog=AdventureWorks;Integrated Security=SSPI;" +
    "MultipleActiveResultSets=True";

// The trailing space before WHERE matters; without it the two literals run together.
string strSQLGetOrder = "Select * from Sales.SalesOrderDetail " +
    "WHERE SalesOrderID = 43659";

string strSQLUpdateInv = "UPDATE Production.ProductInventory " +
    "SET Quantity=Quantity-@amt WHERE (ProductID=@pid)";

SqlConnection marsConnection = new SqlConnection(connectionString);
marsConnection.Open();

SqlCommand readCommand = new SqlCommand(strSQLGetOrder, marsConnection);
SqlCommand writeCommand = new SqlCommand(strSQLUpdateInv, marsConnection);

writeCommand.Parameters.Add("@amt", SqlDbType.Int);
writeCommand.Parameters.Add("@pid", SqlDbType.Int);

// With MARS enabled, the UPDATE runs while the reader is still open on the same connection.
using (SqlDataReader rdr = readCommand.ExecuteReader())
{
    while (rdr.Read())
    {
        writeCommand.Parameters["@amt"].Value = rdr["OrderQty"];
        writeCommand.Parameters["@pid"].Value = rdr["ProductID"];
        writeCommand.ExecuteNonQuery();
    }
}
marsConnection.Close();

 

Now Lawrence writes that this is cleaner code compared to the 'classic' approach. While there are fewer lines, this is not what I would like developers to do when it comes to updating tables. This is client-side RBAR (read Jeff Moden's articles - they are excellent!).

The cleaner approach, besides using a proper stored procedure, is this:

string connectionString = "Data Source=MEDIACENTER;" +
    "Initial Catalog=AdventureWorks;Integrated Security=SSPI;";

string strSQLUpdateInv = "update Production.ProductInventory" +
    " set Quantity = Quantity - OrderQty" +
    " from Sales.SalesOrderDetail a inner join" +
    " Production.ProductInventory b" +
    " on a.ProductID = b.ProductId" +
    " where a.SalesOrderID = 43659";

SqlConnection aConnection = new SqlConnection(connectionString);
aConnection.Open();

// A single set-based UPDATE replaces the read loop, so no MARS is needed.
SqlCommand writeCommand = new SqlCommand(strSQLUpdateInv, aConnection);
writeCommand.ExecuteNonQuery();
aConnection.Close();
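The difference between the two styles can be tried out without SQL Server. The sketch below uses Python with the built-in sqlite3 module and two tiny made-up tables (inventory and order_detail are hypothetical stand-ins for the AdventureWorks tables). It is not MARS and not T-SQL; it only shows that a row-by-row loop and a single set-based statement arrive at the same data, with the second needing one round trip.

```python
import sqlite3

def make_db():
    db = sqlite3.connect(":memory:")
    db.executescript("""
        create table inventory (product_id integer primary key, quantity integer);
        create table order_detail (order_id integer, product_id integer, order_qty integer);
        insert into inventory values (1, 100), (2, 50), (3, 10);
        insert into order_detail values (43659, 1, 5), (43659, 2, 7), (99999, 3, 1);
    """)
    return db

def update_row_by_row(db, order_id):
    # Client-side loop: one UPDATE per order line (RBAR).
    rows = db.execute(
        "select product_id, order_qty from order_detail where order_id = ?", (order_id,)
    ).fetchall()
    for product_id, qty in rows:
        db.execute(
            "update inventory set quantity = quantity - ? where product_id = ?",
            (qty, product_id),
        )

def update_set_based(db, order_id):
    # One statement: the database works out which rows to change.
    db.execute(
        """
        update inventory
        set quantity = quantity - (
            select sum(order_qty) from order_detail d
            where d.product_id = inventory.product_id and d.order_id = ?)
        where product_id in (select product_id from order_detail where order_id = ?)
        """,
        (order_id, order_id),
    )

def stock(db):
    return db.execute("select product_id, quantity from inventory order by product_id").fetchall()

loop_db, set_db = make_db(), make_db()
update_row_by_row(loop_db, 43659)
update_set_based(set_db, 43659)
print(stock(loop_db) == stock(set_db))
```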

What happened to the principle that tables should not be 'touched' directly from client code? OK, I know this is only a sample. But why does the sample have to be built around a completely wrong concept? An even worse sample can be found in the Training Kit book for exam 70-442 (Designing and Optimizing Data Access by Using SQL Server 2005). Unfortunately, the authors didn't even bother to use parameters in sample queries - plain concatenation of strings is being taught there.

I found a more plausible example of using MARS. It's a good article by Thiru Thangarathinam. This is a usage I could imagine, although I would be unlikely to actually use it. In my view, there is a tier lost somewhere in the multi-tier architecture in this case - if a website is going to be heavily loaded, one could benefit more from a proper cache (in the business layer) than from MARS connections.

I am not against MARS. I believe that there are scenarios which are much simpler to implement using MARS connections than without them or at the database level. I would like to know what these scenarios are. Have you come across such use cases?

 



Generate Create Database Snapshot script

This post is about yet another way of skinning a cat.

Recently I 'discovered' the usefulness of database snapshots. I find them extremely useful for test environments, where it is important to be able to revert to the initial state of the environment in case of any issues.

One thing that never ceases to amaze me, though, is that SQL Server Management Studio 2005 doesn't provide simple things such as 'Create snapshot' for a selected database (there are other missing things, of course!).

So, after creating a few snapshots manually, I looked for a script that would create the snapshot code for me. I found a script posted some time ago by Dejan Sunderic, but it wasn't what I was looking for. I don't want to have to create a stored procedure anywhere just to run some ad hoc queries. I decided to write something simple that I can put into templates in my SSMS and run on any database I want. This is what I came up with:

---this script generates create snapshot code. run in text mode.
with
preamble(c) as (select 'create database ' + db_name() + '_Snapshot on'),
files(c) as (select '(name=' + name + ', filename=''' + physical_name + '.ss'')' + char(10) from sys.database_files where type = 0),
filescoalesce (c) as (select c + ',' from files for xml path('')),
lastline(c) as (select 'as snapshot of ' + db_name() + char(10) + char(10) + 'GO' + char(10))
select c [--] from preamble
union all
select left(c, len(c) -2) from filescoalesce
union all
select c from lastline

 

 The output for one of my test databases looks like this, ready for execution:

create database Post_Release_Snapshot on
(name=ds, filename='J:\SQLData\TEST02.mdf.ss')
,(name=ds2, filename='J:\SQLData\TEST02_1.ndf.ss')
,(name=ds3, filename='J:\SQLData\TEST02_2.ndf.ss')
,(name=ds4, filename='J:\SQLData\TEST02_4.ndf.ss')
as snapshot of Post_Release

GO
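The same string assembly can be sketched outside the database, which makes the shape of the generated statement easier to see. The Python function below is illustrative only: build_snapshot_script is a made-up name, and the list of (logical name, physical name) pairs is typed in by hand here, whereas the query above reads it from sys.database_files.

```python
def build_snapshot_script(database, data_files, suffix=".ss"):
    """data_files: (logical_name, physical_name) pairs for the data files only."""
    file_lines = [
        f"(name={name}, filename='{path}{suffix}')" for name, path in data_files
    ]
    return "\n".join([
        f"create database {database}_Snapshot on",
        "\n,".join(file_lines),  # one file per line, with leading commas as in the output above
        f"as snapshot of {database}",
        "",
        "GO",
    ])

script = build_snapshot_script(
    "Post_Release",
    [("ds", r"J:\SQLData\TEST02.mdf"), ("ds2", r"J:\SQLData\TEST02_1.ndf")],
)
print(script)
```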

 

 


Numbers table

I haven't been here for a while. It's been quite a busy time at work, and also a bit of laziness, let's be honest ;)

There are a few things that I came across during the last few months. I will try to write about them in the following posts.

Today I would like to write about the Numbers, or Tally, table. The idea of having a table that contains only numbers and is used in various scenarios is not entirely new. Jeff Moden wrote an excellent article that contains a few ideas about how to use the tally table. These are really only a few samples and I would like to know about other implementations. I personally used one of the ideas from the article to implement string split functionality for parameters passed as CSV lists. You may like them or not, but in some cases it is still much better to pass such a list than to do some twisted programming in the DAL or business layer.

So, the string split is much the same as in Jeff's article. I wrapped the functionality in a UDF that returns a table. The code requires that the list of values begins and ends with a separator, and I have lists of values that do not follow this requirement. Since I wanted a single-statement UDF and also a pretty simple and clean way of calling it, I decided to use a CTE to modify the parameter value so that it is suitable for the split string code:

CREATE function [Admin].[fnListToTable](@list varchar(max), @separator varchar(10))
returns table
as
return
(
        --splits list of strings into a table. Uses Admin.tTally as indexer.
        --the one-character SUBSTRING test below assumes a single-character separator.
        with ParamCte(GroupIDs) as
        (select @separator + @list + @separator)

                SELECT SUBSTRING(GroupIDs,Number+1,CHARINDEX(@separator,GroupIDs,Number+1)-Number-1) Field
                        FROM Admin.tTally, ParamCte
                        --compare against the wrapped string; LEN(@list) would skip a one-character last element
                        WHERE Number < LEN(GroupIDs)
                        and SUBSTRING(GroupIDs,Number,1) = @separator
)
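For readers who want to trace the logic without a tally table at hand, here is the same idea as a small Python sketch. It is only an illustration and assumes a single-character separator; the loop variable number plays the role of Admin.tTally.Number (1-based positions), and the loop visits every position of the wrapped string except the last one.

```python
def list_to_table(items, separator=","):
    """Split a delimited string the way the tally-table query does."""
    wrapped = separator + items + separator
    fields = []
    # 'number' stands in for Admin.tTally.Number, i.e. a 1-based character position.
    for number in range(1, len(wrapped)):
        if wrapped[number - 1] == separator:
            # Equivalent of CHARINDEX(@separator, GroupIDs, Number + 1), as a 0-based index.
            next_separator = wrapped.index(separator, number)
            fields.append(wrapped[number:next_separator])
    return fields

print(list_to_table("10,20,30"))
```

Every separator except the closing one marks the start of a field, and the field runs up to the next separator.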

The other application of the tally table that I found very useful is decoding bit masks. I have a table that stores some simple scheduling information.

 
CREATE TABLE [Control].[tFeedSchedule](
        [ScheduleId] [int] IDENTITY(1,1) NOT NULL,
        [FeedId] [int] NOT NULL,
        [DaysOfWeek] [tinyint] NOT NULL,
        [WindowOpen] [smallint] NOT NULL,
        [WindowClose] [smallint] NOT NULL,
 CONSTRAINT [PK_tFeedSchedule] PRIMARY KEY CLUSTERED
(
        [ScheduleId] ASC
)

)

The DaysOfWeek column contains the days of the week on which a certain event is due to occur, and WindowOpen and WindowClose hold numbers of minutes from midnight. Days of the week are encoded as bit flags, starting from Monday at 00000001 binary and ending with Sunday at 01000000 binary. There are some functions in my database that return real dates stemming from the schedules defined in the above table.
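To make the encoding concrete, here is a small Python sketch of decoding one schedule row. The function names are made up for illustration; the bit layout (Monday = 1, Sunday = 64) and the minutes-from-midnight convention are the ones described above.

```python
DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

def decode_days(days_of_week):
    # Bit 0 (value 1) is Monday, bit 6 (value 64) is Sunday.
    return [name for bit, name in enumerate(DAY_NAMES) if days_of_week & (1 << bit)]

def minutes_to_time(minutes_from_midnight):
    # 510 minutes after midnight is 08:30.
    hours, minutes = divmod(minutes_from_midnight, 60)
    return f"{hours:02d}:{minutes:02d}"

print(decode_days(0b0010101), minutes_to_time(510))
```

The bitwise test against 1 << bit corresponds to the DaysOfWeek & power(2, Number-1) pattern that a tally table makes possible in a single query.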

The function below is one of them. It returns a table with the time 'windows' in which specific feeds are due to come in within a week.

CREATE function [Control].[fnGetScheduleDaysForFeed](@FeedID int = null)
returns table
as
return
(
        --Returns list of days feed is scheduled to run in with their time windows
        select FeedID, case when @@datefirst = 1 then Number else Number + 1 end as [DayOfWeek], WindowOpen WindowOpenMinutes, WindowClose WindowCloseMinutes,

        convert(varchar, Dateadd(minute, WindowOpen, DATEADD(day, 0, DATEDIFF(day, 0, GETDATE()))), 108) WindowOpenTime,

        convert(varchar, Dateadd(minute, WindowClose, DATEADD(day, 0, DATEDIFF(day, 0, GETDATE()))), 108) WindowCloseTime,

        case when WindowOpen > WindowClose then 1 else 0 end CrossMidnight
 from Control.tFeedSchedule a inner join Admin.tTally b
        on a.DaysOfWeek & power(2, Number-1) <> 0
        where Number < 8 and (@FeedID is null or a.FeedID = @FeedID)
)

There are many more possible applications of a numbers table. I wonder if it would be useful to create a library of such code snippets or ideas. If you know of links to such resources, please put them in the comments.

 

SSIS truncating BLOB fields from Sybase

Last week I struggled with an issue that had been causing many problems in our work. We have implemented SSIS packages that synchronize Sybase and SQL Server databases. Some of the tables contain text columns (on Sybase) that have to be passed to varchar(max) columns on SQL Server. It wasn't apparent at the beginning that only the first 32k of data was being passed, though.

I looked for an answer in many places. I tried to see whether moving the BLOBs via temporary files would help. Nope. The maximum size of such a file was also 32k. Then I looked at script components, trying to find out whether an error there might be truncating the data. There is a function, GetBlobData, that reads data from the source. The maximum length of data returned by this function was also 32k. I thought, maybe it is the maximum size of an internal buffer? This function takes three parameters: column index, starting index of the data to read, and data length. I tried to implement a moving window and call this method in a loop until all data had been read, but with no success. It read only 32k and not a single byte more. I knew the actual length of the data because I read it in the source query on the Sybase side.

Looking for information, I came across the textsize variable. Well, this seemed promising. But how do you change it in SSIS? If I put SET TEXTSIZE 100000 in front of the feeding select in the data source, there was no syntax error, but the data source could no longer retrieve column information and the whole data flow couldn't work. I tried adding an Execute SQL task before the Data Flow task, but it didn't help. It seemed that what works when you connect using any sort of console doesn't work if OLE DB is used. I reached into my mossy memory banks - I remembered that about 8 years ago, when I worked with COM+, I read about various parameters that OLE DB can accept, depending on the driver used. So, I found that you can indeed set textsize in the OLE DB parameters. The last issue was how to apply it to the connection manager in SSIS. The edit dialog has no place to specify this value. Extended properties did not work. If you append TextSize=10000000; to the connection string in the properties of the connection manager, it forgets the password. If you open the Edit window, the connection string is generated from scratch and you lose the setting. Catch-22.

We have a solution that reads connection strings from an external Windows config file. I added the parameter to the connection string in this file and verified that this really is the right way - BLOB fields were passed properly. But at design time, though it wasn't that important, I wanted to find a way to pass this parameter and keep a usable connection manager. I tried to imagine how the edit window of the connection manager might work. When you press OK, all properties you set are potentially verified and the connection string is created. I hoped that the verification was not too strict, because I decided to attach the TextSize parameter to one of the properties in the dialog. I chose the server name :). Guess what - there is no verification whatsoever of the server name in that dialog. So, my server name now looks like this:

sybase_server,5335;TextSize=100000;

It works. It turned out that the solution is easy (as usual); it's just a pity that I spent so much time trying to nail it down.

  

changing collation of all columns without dropping them

Just this week I had the opportunity to change the collation of all columns that use one in a database, without dropping them. I like computers to do what computers should do - work, that is :). So I created a query that gave me a script changing the collation of all columns that have a collation - varchar, char and so on.

I got 3900 lines altering tables and columns. Nice, but not so easy: the script did not work at the first F5. You can't change the collation of a column that is part of a constraint or index - you have to drop them first. So I had to wade through the script and add statements dropping constraints and indexes and recreating them after the modifications had been done. To find the tables that were causing problems I simply executed the script - execution stopped on the column whose collation could not be changed and I had exact information about what needed to be done.

This is the query that generates alter table .. alter column statements:

select 'alter table [' + s.name + '].[' + t.name + '] alter column [' + c.name + '] ' +
                ty.name +
                -- text and ntext take no length; max_length is in bytes, so it is halved for nchar/nvarchar
                case when ty.name not in ('text', 'ntext', 'sysname') then '(' + case when c.max_length > 0 then
                case when ty.name not in ('nchar', 'nvarchar') then convert(varchar, c.max_length) else convert(varchar, c.max_length/2) end else 'max' end +
                ')' else '' end + ' collate SQL_Latin1_General_CP1_CI_AS ' +
                case when c.is_nullable = 0 then 'NOT ' else '' end + 'NULL'
from (sys.columns c inner join sys.types ty on c.system_type_id = ty.system_type_id) inner join
                (sys.objects t inner join sys.schemas s on t.schema_id = s.schema_id) on c.object_id = t.object_id
where t.type='U' and c.collation_name is not null and ty.name <> 'sysname'
order by s.name, t.name, c.column_id

 I saved loads of time thanks to this script.
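The nested CASE expressions in the query are hard to read, so here is the length logic spelled out as a Python sketch. It only illustrates the rules: the function name and its parameters are made up, the inputs mimic what sys.columns reports (max_length in bytes, -1 for max), and treating ntext like text is my assumption about how that type should be handled.

```python
def alter_collation_statement(schema, table, column, type_name, max_length,
                              is_nullable, collation="SQL_Latin1_General_CP1_CI_AS"):
    """max_length is in bytes, as reported by sys.columns; -1 means (max)."""
    if type_name in ("text", "ntext"):
        length = ""  # these types take no length specification
    elif max_length == -1:
        length = "(max)"
    elif type_name in ("nchar", "nvarchar"):
        length = f"({max_length // 2})"  # two bytes per character
    else:
        length = f"({max_length})"
    nullability = "NULL" if is_nullable else "NOT NULL"
    return (f"alter table [{schema}].[{table}] alter column [{column}] "
            f"{type_name}{length} collate {collation} {nullability}")

print(alter_collation_statement("dbo", "Customer", "Name", "nvarchar", 100, False))
```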

csv list of elements as parameter for stored procedure

A while ago, Tony Rogerson showed a way to pass a list of integers (CSV) to a stored procedure. The approach of creating a script and executing it is OK for smaller amounts of data. I thought that maybe, as XML is a form of text after all, it would be more appropriate. I crafted a stored proc based on Tony's code that, instead of generating a script, generates an XML stream that is subsequently used in a query inserting rows. This method is also resistant to SQL injection attempts, so such a stored proc is not a 'bobby tables' one :)

Here's the code of the procedure:

create table tStoredInts (afield int)
go

create proc csv_to_table2
    @csv varchar(max)
AS
BEGIN
    /***
        Insert numbers from a CSV to tStoredInts table
    ***/
    declare @handle int --xml document handle

    SET @csv = ltrim(rtrim(@csv))

    IF RIGHT( @csv, 1 ) = ',' -- If the last character is a comma, remove it
        SET @csv = left(@csv, len(@csv) -1)

    DECLARE @xml varchar(max)
    SET @xml = REPLACE( @csv, ',', '</o><o>')
    SET @xml = '<numbers><o>' + @xml + '</o></numbers>'

    exec sp_xml_preparedocument @handle output, @xml

    --insert elements from xml
    insert tStoredInts (afield)
    select field
    from openxml(@handle, '/numbers/o', 2) with (field int '.')

    exec sp_xml_removedocument @handle
end --proc
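The transformation the procedure performs can be tried out in a few lines of Python. This is only a sketch of the same idea with a made-up function name, using the standard xml.etree.ElementTree parser instead of sp_xml_preparedocument: the CSV is turned into XML by plain string replacement, and the values are then read back as data rather than executed as code.

```python
import xml.etree.ElementTree as ET

def csv_to_ints(csv):
    csv = csv.strip()
    if csv.endswith(","):
        csv = csv[:-1]  # drop a single trailing comma, as the procedure does
    xml = "<numbers><o>" + csv.replace(",", "</o><o>") + "</o></numbers>"
    root = ET.fromstring(xml)
    # int() rejects anything that is not a whole number, so bad input fails loudly.
    return [int(o.text) for o in root.findall("o")]

print(csv_to_ints("1,2,3,"))
```

Input such as "1; drop table x" never reaches a SQL parser in this scheme; it simply fails the conversion to an integer.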


concurrency control explained

So, I just came back from a seminar led by Kalen Delaney, a guru of SQL Server (any version), organized by the Ireland SQL Technology User Group. The whole event was sponsored and hosted by Microsoft. I must say, I am impressed. Kalen was able to sort out and put into the proper drawers all the bits of knowledge I had before, and to stuff a whole lot more into my head. She talked about concurrency, transaction isolation levels, the differences between optimistic and pessimistic locking, blocking and diagnostic tools, and at the end about gotchas related to the (nolock) hint, overused so eagerly by many devs (I confess, I used this hint myself ;)).

You can find tons of good info about execution plans and isolation levels on Craig Freedman's blog.

Overall, it was a very interesting day!


 


sorting trap

So, you might think that sorting within the ASCII range is predictable and that the order defined by the ASCII table is final? That you are on the safe side when you use neither Unicode nor 'fancy' letters? Well... you are wrong.

Recently I have been working on a comparison of data stored in Sybase and SQL Server. The database structure is the same; I have a script pulling data from Sybase to SQL Server, and when this script has finished, the data are meant to be identical, of course. So I was thrown when we found out that in some cases rows can be returned in a different order on the two databases. Specifically, when a varchar field contained two underscore characters in a row, rows sorted by this field could be returned in a different order.

After some thinking and looking for information, I was able to create a script that reproduces this behavior on SQL Server.

First, create table and fill it with some data:

create table testtable
(
    afield varchar(25)
)
go

insert testtable (afield)
select 'A_'
union all
select 'A__'
union all
select 'A_B'
union all
select 'A__B'

Now select the rows, ordering by afield:

select afield from testtable order by afield

The results are pretty 'reasonable':

afield
-------------------------
A_
A__
A__B
A_B

OK, but Sybase returned a different order:

afield
-------------------------
A_
A_B
A__
A__B

The mystery is hidden in the collation of the field. Character fields derive their default collation from the database setting, which in turn, if it is not specified explicitly, is taken from the server settings.
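Both orders can be reproduced outside the database with a few lines of Python. The binary order is exact: it simply compares character codes, and 'B' (0x42) comes before '_' (0x5F). The second ordering is only a toy model in which punctuation sorts before letters; it happens to reproduce the SQL Server result above, but it is not the actual algorithm behind SQL_Latin1_General_CP1_CI_AS.

```python
values = ["A_", "A__", "A_B", "A__B"]

# Binary collation: compare raw code points ('B' is 0x42, '_' is 0x5F).
binary_order = sorted(values)

# Toy stand-in for a dictionary-style collation: punctuation sorts before letters.
# This is NOT the real SQL_Latin1_General_CP1_CI_AS algorithm.
def punctuation_first(text):
    return [(0, ch) if not ch.isalnum() else (1, ch.upper()) for ch in text]

dictionary_order = sorted(values, key=punctuation_first)

print(binary_order)
print(dictionary_order)
```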

If you run the following statement and then repeat the select, you will get results identical to the Sybase result above:

alter table testtable alter column afield varchar(25) collate Latin1_General_Bin

This statement applies an explicit collation to afield, overriding the default collation. You can check the collation of afield using this query:

select column_id, name, collation_name from sys.columns where object_id = object_id('testtable')

column_id   name        collation_name
----------- ----------- -------------------
1           afield      Latin1_General_BIN

This was the collation after the change, compatible with the Sybase collation in my case.

The default collation on my SQL Server is

column_id   name      collation_name
----------- --------- ------------------------------
1           afield    SQL_Latin1_General_CP1_CI_AS

As you can see, the collation setting may affect the sort order of fields which do not contain Unicode characters. When I run the same query on both database engines, I get mostly the same results, with the exception of values containing multiple '_' characters. The solution is simple: set the collation of character fields to one compatible with the settings of the other database server.

Note that modifying the default collation of a database after tables have been created does not affect the collation of existing fields. You also can't change the collation of a column if the column is part of an index.

 

Posted by Piotr Rodak

Distributed transactions and triggers

Last week I came across an interesting issue.

If you have a FOR INSERT trigger on a table and you want to store some information on a linked server from within it, the transaction the trigger is running in is automatically promoted to a distributed transaction. This creates some problems. First, you have to have the Distributed Transaction Coordinator (MSDTC) running on your machine. Second, even this does not mean you will succeed, as not all providers support distributed transaction protocols. We will see this later in this post.
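
The linked server call gets pulled into a transaction because a trigger always runs inside the transaction of the statement that fired it, even if you never wrote begin tran. A minimal sketch to see this for yourself; the dbo.trancheck table and the trg_trancheck trigger are made-up names used only for this illustration:

```sql
-- throwaway table and trigger, only for this demonstration
create table dbo.trancheck (id int not null)
go

create trigger trg_trancheck on dbo.trancheck
for insert
as
begin
      -- report the transaction nesting level seen from inside the trigger
      print 'trancount inside trigger: ' + convert(varchar(10), @@trancount)
end
go

-- no explicit transaction here, yet the trigger should report 1
insert dbo.trancheck (id) values (1)
go

-- clean up
drop table dbo.trancheck
go
```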

First, create a linked server that points to an Excel workbook on the local disk, for simplicity's sake.

DECLARE @RC int

DECLARE @server nvarchar(128)

DECLARE @srvproduct nvarchar(128)

DECLARE @provider nvarchar(128)

DECLARE @datasrc nvarchar(4000)

DECLARE @location nvarchar(4000)

DECLARE @provstr nvarchar(4000)

DECLARE @catalog nvarchar(128)

-- Set parameter values

SET @server = 'XLTEST'

SET @srvproduct = 'Excel'

SET @provider = 'Microsoft.Jet.OLEDB.4.0'

SET @datasrc = 'c:\LSBook.xls'

SET @provstr = 'Excel 8.0'

EXEC @RC = [master].[dbo].[sp_addlinkedserver] @server, @srvproduct, @provider,

@datasrc, @location, @provstr, @catalog

Create the LSBook.xls file on your drive first. In the first row of the sheet, write afield and bfield in the first two columns.
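
Before going further, it is worth checking that the linked server can actually see the workbook. A quick sketch, assuming the worksheet still has its default name Sheet1:

```sql
-- list the tables (worksheets) exposed by the XLTEST linked server
exec sp_tables_ex @table_server = N'XLTEST'

-- read the sheet; with only the header row filled in, this should
-- return the afield and bfield columns and no rows
select * from xltest...sheet1$
```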

Now, create a test table and trigger:

use testdb

go

 

if object_id('dbo.testtable') is not null

      drop table dbo.testtable

go

create table dbo.testtable

(

      afield int identity(1, 1),

      bfield varchar(20) not null

)

go

create trigger trg_testtrg1 on dbo.testtable

for insert

as

begin

 

insert xltest...sheet1$ (afield, bfield)

select afield, bfield

from inserted

 

print 'inserted ' + convert(varchar, @@rowcount) + ' row(s)'

 

end

Finally, execute some statements to see what happens:

insert into dbo.testtable values('somevealue1')

insert into dbo.testtable values('somevealue2')

go

select * from dbo.testtable

And the result is..

Msg 8501, Level 16, State 3, Procedure trg_testtrg1, Line 6

MSDTC on server 'VAMILOXP' is unavailable.

afield      bfield

----------- --------------------

 

(0 row(s) affected)

Oops, MSDTC is not running on the machine. Start it (note that you may have to install it first if you are running Windows Server 2003) and execute the statements again:

Msg 7390, Level 16, State 2, Procedure trg_testtrg1, Line 6

The requested operation could not be performed because OLE DB provider "Microsoft.Jet.OLEDB.4.0" for linked server "xltest" does not support the required transaction interface.

afield      bfield

----------- --------------------

 

(0 row(s) affected)

 

So, as you see, MSDTC did not help us much. The provider I used in this case does not support distributed transactions. What can we do to be able to insert information into the linked server data store?

There is an article in MSDN that mentions that you can COMMIT the transaction in the trigger before you call the linked server. However, this causes a batch-terminating error. Let's modify the trigger code to commit the transaction and see the results:

alter trigger trg_testtrg1 on dbo.testtable

for insert

as

begin

--commit transaction

commit

--call linked server

insert xltest...sheet1$ (afield, bfield)

select afield, bfield

from inserted

end

Will the code above work? No! Committing the transaction makes the inserted table empty, so we will not have anything to store! Here's a modification that overcomes this little issue:

alter trigger trg_testtrg1 on dbo.testtable

for insert

as

begin

--declare temp variable and store rows to insert to linked server table

declare @tmp table (a int, b varchar(20))

insert @tmp (a, b) select afield, bfield from inserted

 

--commit current tran to avoid escalation

commit

 

--this will work, but the batch will be broken.

insert xltest...sheet1$ (afield, bfield)

select a, b

from @tmp

end

Let's see results of this code:

insert into dbo.testtable values('somevealue1')

insert into dbo.testtable values('somevealue2')

go

--there is only one record in the table

select * from dbo.testtable

 

 

(1 row(s) affected)

 

(1 row(s) affected)

Msg 3609, Level 16, State 1, Line 1

The transaction ended in the trigger. The batch has been aborted.

afield      bfield

----------- --------------------

1           somevealue1

 

(1 row(s) affected)

As you see, there is only one record in the table: the batch was aborted before the second insert could run. Obviously, this is not what we want.

I don't actually understand why Microsoft decided to throw a batch-terminating error when a commit is called in a trigger. Perhaps to prevent poor coding practices? Anyway, this error is not thrown when you add a begin tran statement AFTER your linked server query:

alter trigger trg_testtrg1 on dbo.testtable

for insert

as

begin

--declare temp variable and store rows to insert to linked server table

declare @tmp table (a int, b varchar(20))

insert @tmp (a, b) select afield, bfield from inserted

 

--commit current tran to avoid escalation

commit

 

--the linked server insert now runs outside of the committed transaction

insert xltest...sheet1$ (afield, bfield)

select a, b

from @tmp

 

--open a new transaction so that the trigger does not end without one

begin tran

end

So far, so good. In some cases this will work.

afield      bfield

----------- --------------------

1           somevealue1

2           somevealue2

But how about this code? What if you decide to roll back some of the work done due to an error?

begin tran

insert into dbo.testtable values('somevealue1')

insert into dbo.testtable values('somevealue2')

rollback

Well.. we are in trouble now:

select * from dbo.testtable

afield      bfield

----------- --------------------

1           somevealue1

2           somevealue2

 

(2 row(s) affected)

 

The inserts are not rolled back, because they were already committed in the trigger! In some cases such behavior is acceptable, for example when an application issues a single insert statement. But if the insert is part of a bigger batch, this approach is definitely not an option.

The easiest way I found is to use SQL Server Agent to do the remote server work outside of the transaction boundaries. First, create a queue table that will temporarily store the rows to be inserted into the linked server storage:

 

use testdb

go

--this table stores records to be inserted into linked server table

create table dbo.linkedsrvqueue

(

      afield int not null,

      bfield varchar(20) not null

)

go

 

Then, create a simple job that executes a command moving the rows to the linked server:

USE [msdb]

GO

/****** Object:  Job [SaveToLinkedServer]    Script Date: 10/29/2007 23:01:19 ******/

BEGIN TRANSACTION

DECLARE @ReturnCode INT

SELECT @ReturnCode = 0

/****** Object:  JobCategory [[Uncategorized (Local)]]]    Script Date: 10/29/2007 23:01:20 ******/

IF NOT EXISTS (SELECT name FROM msdb.dbo.syscategories WHERE name=N'[Uncategorized (Local)]' AND category_class=1)

BEGIN

EXEC @ReturnCode = msdb.dbo.sp_add_category @class=N'JOB', @type=N'LOCAL', @name=N'[Uncategorized (Local)]'

IF (@@ERROR <> 0 OR @ReturnCode <> 0) GOTO QuitWithRollback

 

END

 

DECLARE @jobId BINARY(16)

EXEC @ReturnCode =  msdb.dbo.sp_add_job @job_name=N'SaveToLinkedServer',

            @enabled=1,

            @notify_level_eventlog=0,

            @notify_level_email=0,

            @notify_level_netsend=0,

            @notify_level_page=0,

            @delete_level=0,

            @description=N'No description available.',

            @category_name=N'[Uncategorized (Local)]',

            @owner_login_name=N'VAMILOXP\rogas', @job_id = @jobId OUTPUT

IF (@@ERROR <> 0 OR @ReturnCode <> 0) GOTO QuitWithRollback

/****** Object:  Step [StoreToLinkedServerStep]    Script Date: 10/29/2007 23:01:21 ******/

EXEC @ReturnCode = msdb.dbo.sp_add_jobstep @job_id=@jobId, @step_name=N'StoreToLinkedServerStep',

            @step_id=1,

            @cmdexec_success_code=0,

            @on_success_action=1,

            @on_success_step_id=0,

            @on_fail_action=2,

            @on_fail_step_id=0,

            @retry_attempts=0,

            @retry_interval=0,

            @os_run_priority=0, @subsystem=N'TSQL',

            @command=N'insert xltest...sheet1$ (afield, bfield)

select afield, bfield

from dbo.linkedsrvqueue

if @@error = 0

      delete from dbo.linkedsrvqueue',

            @database_name=N'testdb',

            @flags=0

IF (@@ERROR <> 0 OR @ReturnCode <> 0) GOTO QuitWithRollback

EXEC @ReturnCode = msdb.dbo.sp_update_job @job_id = @jobId, @start_step_id = 1

IF (@@ERROR <> 0 OR @ReturnCode <> 0) GOTO QuitWithRollback

EXEC @ReturnCode = msdb.dbo.sp_add_jobserver @job_id = @jobId, @server_name = N'(local)'

IF (@@ERROR <> 0 OR @ReturnCode <> 0) GOTO QuitWithRollback

COMMIT TRANSACTION

GOTO EndSave

QuitWithRollback:

      IF (@@TRANCOUNT > 0) ROLLBACK TRANSACTION

EndSave:

 

Finally, modify the trigger to store the rows in the queue table and start the job:

alter trigger trg_testtrg1 on dbo.testtable

for insert

as

begin

--insert to queue table

insert dbo.linkedsrvqueue(afield, bfield)

select afield, bfield

from inserted

 

print 'inserted ' + convert(varchar, @@rowcount) + ' row(s)'

exec msdb..sp_start_job @job_name='SaveToLinkedServer'

 

end

 

This way the linked server functionality is called outside of the transaction boundaries, which does not upset DTC, and the main transaction flow remains consistent. The issues with this approach are that you need proper permissions to call sp_start_job, and that if the insert statement is executed too often, you may get the following error:

Msg 22022, Level 16, State 1, Line 0

SQLServerAgent Error: Request to run job SaveToLinkedServer (from User VAMILOXP\rogas) refused because the job already has a pending request from User VAMILOXP\rogas.

 

You can avoid this error by checking the job status before calling sp_start_job. You can also schedule the job to run periodically and remove the sp_start_job call from the trigger altogether.
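
Both options can be sketched as follows. Treat this as a starting point rather than a tested solution: the status check reads the msdb job activity tables, so the caller needs select permission on them, and it only narrows the window for the error instead of closing it. The schedule name EveryMinute is made up for this example.

```sql
-- Option 1: in the trigger, start the job only if it has no
-- request that is still pending or running
if not exists
(
      select 1
      from msdb.dbo.sysjobactivity ja
      join msdb.dbo.sysjobs j on j.job_id = ja.job_id
      where j.name = N'SaveToLinkedServer'
        -- only the current SQL Server Agent session is relevant
        and ja.session_id = (select max(session_id) from msdb.dbo.syssessions)
        and ja.run_requested_date is not null
        and ja.stop_execution_date is null
)
      exec msdb.dbo.sp_start_job @job_name = N'SaveToLinkedServer'
go

-- Option 2: run the job every minute and drop the
-- sp_start_job call from the trigger
exec msdb.dbo.sp_add_jobschedule
      @job_name = N'SaveToLinkedServer',
      @name = N'EveryMinute',
      @freq_type = 4,             -- daily
      @freq_interval = 1,         -- every day
      @freq_subday_type = 4,      -- subday unit: minutes
      @freq_subday_interval = 1   -- every 1 minute
go
```

With the second option the rows reach the workbook with a delay of up to a minute, but the trigger no longer depends on SQL Server Agent being ready to accept a request.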

 

References:

http://msdn2.microsoft.com/en-us/library/ms187844.aspx

 

So.. this is it

My first blog post. As someone said somewhere, sharing information forces you to look at it from every side and ensure that it is really correct. I will do my best to make it so.

This blog is going to be about... SQL Server :). I am a database developer and from time to time I come across quite interesting problems, come up with (brilliant ;)) ideas, and have some observations that I would like to share. Some of them will not be revolutionary, but I hope that some of them will allow you to look at the questions you have from a different angle, which I have noticed is always helpful.

My main areas of interest are performance and programming; some SSIS and some C# will perhaps find their way here as well. We'll see.