SQL Server 2008 - Batch inserts and identity columns using MERGE
Performing bulk inserts into a table when the table has
an identity column has long bee a difficult challenge for client
programming.
The difficulty has been getting the identity values that have been generated
for the set of data you are saving. It becomes even more difficult when you want
to save parent child relationships.
In SQL 2005 the OUTPUT clause was introduced which allowed you to return the
identity values generated by a an insert of multiple rows (you can also generate
any other data generated by the server i.e. defaults contrary to what BOL says).
However there was one draw back. You couldn't relate the identity values to the
original data set unless you had a natural key in your table or you stored a
second surrogate key. This means you had to fully reload your client application
data set. For large data sets thats not ideal.
This is because the OUTPUT clause for an INSERT statement can only reference
the inserted pseudo table. This only contains the columns of the underlying
table so you don't have access to any other data.
With SQL Server 2008 you get MERGE. MERGE is a combined INSERT, UPDATE and
DELETE statement that supports the OUTPUT clause. I thought I would have a look
and see if the same limitations apply to the OUTPUT claue as they do for the
INSERT statement. Well they don't.
This means that you can return any value from your source data even if its
not being inserted into the table. For example
CREATE
TABLE Person (id int identity(1,1), name varchar(200))
MERGE
INTO Person
USING
(Values ('Simon',1),('Mark',2),('Stuart',3))
People(name,clientSideId)
ON
1=2
WHEN
NOT MATCHED THEN
INSERT
(name) values
(People.Name)
OUTPUT
People.clientSideId, Inserted.Id;
This allows you to update your client code with the idenity values generated
by the server. Whilst this logically shold work, in CTP5 there is a bug that
causes this to fail. It appears that the issue is to do with the ON clause. When
the expression is always false something is causing a severe error. There are a
number of workarounds, sometimes using a WHEN MATCHED clause will help and in
others changing he USING to be a table reference and not a derived table
helps.
For an example of saving parent child relationships have a look at the code
below. This copies the objects and columns system views into new tables, whilst
maintaining correct referential integrity.
The use of the @objectTab and @columnTab is to avoid the error mentioned
above, once fixed this code can be greatly simplfied.
You might be thinking that this is goig to such from a performance stance. It
doesn't. Whilst the execution plan costs is much higher than the straight insert
the actual cost of cpu and IO can be less than the direct INSERT approach. I
beleieve this is due to some sorting and assertions that are imposed in the
MERGE statement which simplify the storage and upating of the underlying
indexes.
Here is the code mentioned above. It returns two data sets that can be used
to update your client data sets. You could combine into 1 data set if
required.
drop
table NewObject
go
drop
table NewColumn
go
create
table NewObject (id int identity(1,1) check
(id > 0 ), name sysname)
go
create
table NewColumn (id int identity(1,1), ObjectId int, name
sysname)
go
set
nocount on
declare
@objects xml = (
select
top 1000 object_id
[@object_id]
,name [@name]
,(select column_id [@id], name [@name]
from sys.columns where sys.columns.object_id = sys.objects.object_id
for xml
path('column'), Type )
from
sys.objects
for
xml path ('object'))
declare
@NewObject table (new_id int, old_id int, unique ([old_id]))
declare
@id int
declare
@NewColumn table (new_id int, old_id int,
new_col_id int, old_col_id int)
declare
@objectTab table (name sysname, object_id
int)
insert
into @objectTab (name,
object_id)
select obj.node.value('@name','sysname')
name
, obj.node.value('@object_id','int') object_id
from @objects.nodes('/object') obj(node)
merge
into NewObject
using
( select name, object_id from @objectTab
)src(name,id)
on
NewObject.id = 0
when
matched then
delete
when
not matched then
insert
(name) values(src.name)
output src.id, inserted.id into
@NewObject (old_id, new_id);
select
* from @NewObject
declare
@columnTab table (name sysname,id int, object_id
int)
insert
into @columnTab (name, id,
object_id)
select col.node.value('@name','sysname')
name
, col.node.value('@id','int') id
, obj.node.value('@object_id','int') object_id
from @objects.nodes('object') obj(node)
cross apply obj.node.nodes('column') col(node)
merge
into NewColumn
using
(select col.name, col.object_id , col.id ,
obj.new_id new_object_id
from @columnTab
col
join @NewObject obj on col.object_id = obj.old_id)
col(name,old_object_id,old_col_id, new_object_id)
on
1=2
when
not matched then
insert
(name, ObjectId) values(col.name,
col.new_object_id)
output col.old_object_id, col.old_col_id, col.new_object_id, inserted.id
into @NewColumn (old_id, old_col_id, new_id,
new_col_id);
select
* from @NewColumn
-