Help to Improve SQL Batch Operations and ETL

How many times have you copied data into a table and got one of the following less-than-helpful errors?

String or binary data would be truncated.

Arithmetic overflow error converting numeric to data type numeric.

Arithmetic overflow error for data type tinyint, value = 256.

Violation of UNIQUE KEY constraint '??????'. Cannot insert duplicate key in object '??????'. The duplicate key value is (3).

Cannot insert the value NULL into column 'parentId', table 'tempdb.dbo.child'; column does not allow nulls. INSERT fails.

The INSERT statement conflicted with the FOREIGN KEY constraint "?????". The conflict occurred in database "????", table "?????", column '?????'.

Some of these messages have since been improved to include the offending value. However, that doesn't help much when you've got a multi-row insert, because the error only reports the first offending value. In the unique violation above, for instance, you can resolve the problem by excluding the rows with a value of 3, but there may be others. In the example below there are others, and so you have to write generic code to verify the data before you insert it.

What this means is that if you have a source of data to insert into a table, you have to go through the following process (a generic sketch of the pre-validation pattern in step 5 follows the list).

1. Perform the insert.

2. Get the failure message.

3. Identify the offending rows, then correct or exclude them.

4. Return to step 1 until you get no errors.

5. Once the insert works, implement a process that identifies bad rows, either before the insert so they can be corrected, or after it so they can be verified and reprocessed.
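Here is a minimal sketch of that pre-validation pattern, using the SourceTable and child tables from the sample script at the end of this post; the BadRows holding table and the failureReason label are my own additions for illustration.

-- Sketch of the pre-validation pattern: park suspect rows in a holding
-- table for correction, then insert only the rows that pass the check.
select 'strCol longer than 10' as failureReason, *
into BadRows
from SourceTable
where len(strCol) > 10

insert into child (parentId, strCol, unqCol, intCol, numCol, tinyintCol)
select parentId, strCol, unqCol, intCol, numCol, tinyintCol
from SourceTable
where len(strCol) <= 10   -- in practice, one predicate per possible failure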

 

I've worked through a very simple example below involving two tables, "parent" and "child". Each step resolves an insert error caused by a data type domain constraint or an explicit constraint. You can see that it's a painful process.

The end result is a hugely complex query, with many subqueries and repeated use of TRY_CONVERT. Note that prior to SQL Server 2012, which introduced TRY_CONVERT, verifying the data type values would have been more complex still. The uniqueness check is the most complex of all, because it has to consider the combination of values in both the source and target tables.
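For anyone who hasn't used it, TRY_CONVERT (SQL Server 2012 and later) returns NULL rather than raising an error when a value won't fit the target type, which is what makes these filter predicates possible:

select try_convert(tinyint, 256)          -- null: 256 is out of range for tinyint
select try_convert(tinyint, 255)          -- 255: fits
select try_convert(numeric(2,1), 10.22)   -- null: numeric(2,1) tops out at 9.9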

The plan goes from a simple insert

[image: execution plan for the simple insert]

to a much, much more complex plan involving two additional scans of the source and one of the parent. Whilst those could be indexed to improve the plan, that just highlights extra work and extra indexes that aren't required for the simple insert.

[image: execution plan for the validated insert]

What's immensely frustrating is that the engine evaluates all of these verification predicates anyway; that's how it generates the errors in the first place.

KISS


This is far from keeping it simple. On a recent project we've had to put in months of work to implement this level of validation. What's worse is that you have to put all this code in up front: if you don't, your application will just fail and fall apart when a problem arrives, and resolving the errors is hugely complex and time consuming.

I'd always accepted this as something you simply have to live with. That was until now.

I was recently notified of a discussion on the TSQL forum and an associated Connect item.

Forum thread - http://social.technet.microsoft.com/Forums/en-US/transactsql/thread/3e17f8dc-9685-412b-8e76-94ad41536d5d 

Connect item - https://connect.microsoft.com/SQLServer/feedback/details/774754/new-virtual-table-errors-it-would-analogous-to-the-deleted-and-inserted-tables#details

The idea here is that the failing rows get dumped into a table, just as the OUTPUT clause captures rows into a table. You can then examine the contents of the errors table and do something with them. This is very similar to the process seen in most ETL frameworks, where rows that fail a lookup or a data type validation are put in an errors table for later processing.

What a fantastic idea!
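To make the proposal concrete, here is a purely hypothetical sketch of how it might look. None of this syntax exists: the errors virtual table is the one proposed in the Connect item (by analogy with the inserted and deleted tables), and the ChildErrors target table is my own invention.

-- Imagined syntax, not real T-SQL: failed rows are diverted to an
-- errors table instead of aborting the whole statement.
insert into child (parentId, strCol, unqCol, intCol, numCol, tinyintCol)
output errors.* into ChildErrors    -- hypothetical: failed rows plus error info
select parentId, strCol, unqCol, intCol, numCol, tinyintCol
from SourceTable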

 

Implementation

I would like the statement not to fail by default (perhaps with an option clause to configure the behaviour), as sketched below.
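Something along these lines, where FAIL_ON_ERROR is an imagined hint for callers who still want the old all-or-nothing behaviour:

-- Imagined, not real T-SQL: by default bad rows would be diverted rather
-- than failing the statement; this hypothetical hint restores hard failure.
insert into child (parentId, strCol, unqCol, intCol, numCol, tinyintCol)
select parentId, strCol, unqCol, intCol, numCol, tinyintCol
from SourceTable
option (fail_on_error)    -- hypothetical hint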

The biggest question is what error information is included. One column could have multiple validation errors: not null, check, foreign key and so on. Ideally I would like to know every validation error for each column, but I can't see that being workable unless the result is a pivoted table, i.e. one row per column and validation rule, and that would require a unique key to be defined on the source somehow and would make things complex.

You would also have the situation that a value that fails one check is likely to fail lots more, so knowing that a column value has at least one error is good enough. That would need an error code column in the errors table per target column.

The simplest would be a single error per row, which is what you get with the SSIS implementation and would probably be the simplest to implement: as soon as a row fails, it is marked as bad and, rather than being passed to the rest of the plan, is directed to the errors output.

I would settle for one error per row. At least you know the row has failed, you have caught it, and you can reprocess the data easily.
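For illustration only, an errors table for the child inserts below might look something like this. The name ChildErrors and the exact columns are my assumption; the shape (source columns widened and made nullable, plus error metadata) mirrors an SSIS-style error output.

-- Hypothetical one-error-per-row errors table for the child insert.
-- Source columns are widened and nullable so any failing value fits.
create table ChildErrors (
    parentId    bigint        null,
    strCol      varchar(8000) null,
    unqCol      bigint        null,
    intCol      bigint        null,
    numCol      float         null,
    tinyintCol  int           null,
    errorNumber int           null,        -- e.g. 2627 for a unique key violation
    loadedAt    datetime2     not null default sysdatetime()
)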

Side note: what I found amusing while building the script below was that I kept hitting errors but couldn't find which source rows were causing them, because of the vagueness of the error messages. Exactly the problem that an errors table would solve.

 

VOTE


I think this idea is fantastic and urge you to vote on it if you agree.

https://connect.microsoft.com/SQLServer/feedback/details/774754/new-virtual-table-errors-it-would-analogous-to-the-deleted-and-inserted-tables#details


Sample script showing the ugly process of validating source data.

use tempdb
go

-- Drop any previous copies so the script can be rerun (child first, because of the FK)
if object_id('dbo.child') is not null drop table child
if object_id('dbo.parent') is not null drop table parent
if object_id('dbo.SourceTable') is not null drop table SourceTable
go

create table parent (parentId int not null primary key)

create table child (
    childid    int identity(1,1),
    parentId   int not null,
    strCol     varchar(10),
    unqCol     int unique,
    intCol     int,
    numCol     numeric(2,1),
    tinyintCol tinyint
)

alter table child add constraint fk_child foreign key (parentId) references parent(parentId)

 

go

insert into parent (parentId)
values (1)
go

-- 2^32 = 4294967296: deliberately too big for an int column
declare @bigint as bigint = power(cast(2 as bigint), 32)

select *
into SourceTable
from (
    values
     (1,    replicate('A',11), 1,  100,     10.22, 256)  -- strCol too long; numCol and tinyintCol overflow
    ,(1,    replicate('A',10), 1,  100,     10.22, 256)  -- numCol and tinyintCol overflow
    ,(1,    replicate('A',10), 2,  100,     10.22, 256)  -- numCol and tinyintCol overflow
    ,(1,    replicate('A',10), 3,  100,     1.2,   256)  -- tinyintCol overflow; unqCol 3 duplicated
    ,(1,    replicate('A',10), 3,  100,     1.2,   256)
    ,(1,    replicate('A',10), 3,  @bigint, 1.2,   255)  -- intCol overflow
    ,(1,    replicate('A',10), 3,  @bigint, 1.2,   255)
    ,(1,    replicate('A',10), 3,  @bigint, 1.2,   255)
    ,(1,    replicate('A',10), 4,  100,     1.2,   255)  -- unqCol 4 duplicated
    ,(1,    replicate('A',10), 4,  100,     1.2,   255)
    ,(null, replicate('A',10), 9,  100,     1.2,   255)  -- null parentId
    ,(2,    replicate('A',10), 10, @bigint, 1.2,   255)  -- no parent row 2; intCol overflow
    ,(2,    replicate('A',10), 5,  100,     1.2,   255)  -- no parent row 2
    ,(2,    replicate('A',10), 6,  100,     1.2,   255)  -- no parent row 2
    ,(1,    replicate('A',10), 11, @bigint, 1.2,   255)  -- intCol overflow
    ,(1,    replicate('A',10), 7,  100,     1.2,   255)  -- clean
    ,(1,    replicate('A',10), 8,  100,     1.2,   255)  -- clean
) v (parentId, strCol, unqCol, intCol, numCol, tinyintCol)
go

-- Naive insert: fails as soon as the engine hits a bad value
insert into child (parentId, strCol, unqCol, intCol, numCol, tinyintCol)
select parentId, strCol, unqCol, intCol, numCol, tinyintCol
from SourceTable
go

print ' '
print 'Exclude varchar values that will be truncated'
insert into child (parentId, strCol, unqCol, intCol, numCol, tinyintCol)
select parentId, strCol, unqCol, intCol, numCol, tinyintCol
from SourceTable
where len(strCol) <= 10

print ' '
print 'Exclude numeric values that are too big'
insert into child (parentId, strCol, unqCol, intCol, numCol, tinyintCol)
select parentId, strCol, unqCol, intCol, numCol, tinyintCol
from SourceTable
where len(strCol) <= 10
and (numCol is null or try_convert(numeric(2,1), numCol) is not null)

print ' '
print 'Exclude values too big for tinyint'
insert into child (parentId, strCol, unqCol, intCol, numCol, tinyintCol)
select parentId, strCol, unqCol, intCol, numCol, tinyintCol
from SourceTable
where len(strCol) <= 10
and (numCol is null or try_convert(numeric(2,1), numCol) is not null)
and (tinyintCol is null or try_convert(tinyint, tinyintCol) is not null)

print ' '
print 'Exclude values too big for int'
insert into child (parentId, strCol, unqCol, intCol, numCol, tinyintCol)
select parentId, strCol, unqCol, intCol, numCol, tinyintCol
from SourceTable
where len(strCol) <= 10
and (numCol is null or try_convert(numeric(2,1), numCol) is not null)
and (tinyintCol is null or try_convert(tinyint, tinyintCol) is not null)
and (intCol is null or try_convert(int, intCol) is not null)

print ' '
print 'Exclude specific value that breaks unique constraint'
insert into child (parentId, strCol, unqCol, intCol, numCol, tinyintCol)
select parentId, strCol, unqCol, intCol, numCol, tinyintCol
from SourceTable
where len(strCol) <= 10
and (numCol is null or try_convert(numeric(2,1), numCol) is not null)
and (tinyintCol is null or try_convert(tinyint, tinyintCol) is not null)
and (intCol is null or try_convert(int, intCol) is not null)
and unqCol <> 3

print ' '
print 'Exclude values that break unique constraint'
insert into child (parentId, strCol, unqCol, intCol, numCol, tinyintCol)
select parentId, strCol, unqCol, intCol, numCol, tinyintCol
from SourceTable
where len(strCol) <= 10
and (numCol is null or try_convert(numeric(2,1), numCol) is not null)
and (tinyintCol is null or try_convert(tinyint, tinyintCol) is not null)
and (intCol is null or try_convert(int, intCol) is not null)
-- not already in the target...
and not exists (select 1 from child c where c.unqCol = SourceTable.unqCol)
-- ...and not duplicated within the source itself
and (select count(1) from SourceTable s where s.unqCol = SourceTable.unqCol) <= 1
go
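As an aside, the source-side duplicate check above can also be written with a windowed count instead of a correlated subquery; a sketch with the same tables and the same semantics:

-- Alternative uniqueness filter using a windowed count (sketch)
with src as (
    select *,
           count(*) over (partition by unqCol) as srcDupes
    from SourceTable
)
select parentId, strCol, unqCol, intCol, numCol, tinyintCol
from src
where srcDupes = 1
and not exists (select 1 from child c where c.unqCol = src.unqCol)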

print ' '
print 'Exclude values that break foreign key constraint'
insert into child (parentId, strCol, unqCol, intCol, numCol, tinyintCol)
select parentId, strCol, unqCol, intCol, numCol, tinyintCol
from SourceTable
where len(strCol) <= 10
and (numCol is null or try_convert(numeric(2,1), numCol) is not null)
and (tinyintCol is null or try_convert(tinyint, tinyintCol) is not null)
and (intCol is null or try_convert(int, intCol) is not null)
and not exists (select 1 from child c where c.unqCol = SourceTable.unqCol)
and (select count(1) from SourceTable s where s.unqCol = SourceTable.unqCol) <= 1
-- FK semantics: a null parentId passes the FK check (but see the next step)
and (parentId is null
     or exists (select 1 from parent p where p.parentId = SourceTable.parentId))
go

print ' '
print 'Exclude values that break not null constraint'
insert into child (parentId, strCol, unqCol, intCol, numCol, tinyintCol)
select parentId, strCol, unqCol, intCol, numCol, tinyintCol
from SourceTable
where len(strCol) <= 10
and (numCol is null or try_convert(numeric(2,1), numCol) is not null)
and (tinyintCol is null or try_convert(tinyint, tinyintCol) is not null)
and (intCol is null or try_convert(int, intCol) is not null)
and not exists (select 1 from child c where c.unqCol = SourceTable.unqCol)
and (select count(1) from SourceTable s where s.unqCol = SourceTable.unqCol) <= 1
-- parentId must now be non-null as well as present in parent
and (parentId is not null
     and exists (select 1 from parent p where p.parentId = SourceTable.parentId))
go
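As a final sanity check (my addition, not part of the original walkthrough), comparing row counts shows how many of the 17 source rows survived the gauntlet:

-- How many of the 17 source rows actually made it into child?
select (select count(*) from SourceTable) as source_rows,
       (select count(*) from child)       as inserted_rows
go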

Published 31 December 2012 17:15 by simonsabin
