11 March 2007 15:24
tonyrogerson
BULK INSERT - FIRSTROW and BATCHSIZE cause very poor performance and significant Writes to LDF and MDF.
Quick Tip about BULK INSERT and using FIRSTROW – the BATCHSIZE parameter makes performance suck and you’ll never guess why!
What do you think FIRSTROW does? According to Books Online it “Specifies the number of the first row to load.”, so baring that in mind what does BATCHSIZE do? According to BOL it “Specifies the number of rows in a batch; each batch is copied to the server as one transaction.”
Given a data file with 100,000 rows in it and you just want the last two rows you’d expect SQL Server to read data up-to and including row 99,998 and then start writing for just the last two rows you want (99,999 and 100,000) – well it doesn’t.
In a nutshell specifying a BATCHSIZE of 1 and FIRSTROW of 99999 will cause SQL Server to write 1.2GBytes of data on the LDF and MDF files only to write just a few Kbytes of actual data! The repro example is below.
Note, this is on SQL Server 2005 (could be 2000 as well but I've not tested it).
CREATE DATABASE BUlkInsert
go
USE BulkInsert
go
CREATE TABLE Test (
somedata char(4096) not null
)
go
SET NOCOUNT ON
DECLARE @i int
SET @i = 1
WHILE @i <= 100000
BEGIN
INSERT Test values( cast( @i as char(10) ) )
SET @i = @i + 1
END
GO
sp_spaceused Test
GO
Now create the data file we will import in a moment..
bcp bulkinsert..test out testbulk.dat -n -T –S
TRUNCATE TABLE Test
CHECKPOINT
GO
SET NOCOUNT ON
SELECT * FROM sys.dm_io_virtual_file_stats( db_id(), null )
BULK INSERT Test
FROM 'C:\users\tonyrogerson\testbulk.dat'
WITH (
BATCHSIZE = 1,
FIRSTROW = 99999,
CODEPAGE = 'raw',
DATAFILETYPE = 'native',
TABLOCK )
SELECT * FROM sys.dm_io_virtual_file_stats( db_id(), null )
This will give you two result sets
database_id file_id sample_ms num_of_reads num_of_bytes_read io_stall_read_ms num_of_writes num_of_bytes_written io_stall_write_ms io_stall size_on_disk_bytes file_handle
----------- ------- ----------- -------------------- -------------------- -------------------- -------------------- -------------------- -------------------- -------------------- -------------------- ------------------
26 1 336123593 3937 867188736 1976212 84840 1486749696 2379713 4355925 4520607744 0x00000A60
26 2 336123593 96 815104 700 192953 935902208 712195 712895 40894464 0x00000928
database_id file_id sample_ms num_of_reads num_of_bytes_read io_stall_read_ms num_of_writes num_of_bytes_written io_stall_write_ms io_stall size_on_disk_bytes file_handle
----------- ------- ----------- -------------------- -------------------- -------------------- -------------------- -------------------- -------------------- -------------------- -------------------- ------------------
26 1 336967453 3968 867442688 1976278 185633 2312601600 2488036 4464314 6562185216 0x00000A60
26 2 336967453 96 815104 700 301099 1458697216 843031 843731 40894464 0x00000928
If we look at the num_of_bytes_written columns we can see the following IO...
MDF (file id 1) 787Mbytes
LDF (file id 2) 498Mbytes
Ouch! To read just two rows of data has caused 1.2GBytes of write activity.
Remove BATCHSIZE=1 and the problem goes away, instead of taking 13minutes it now takes 48 seconds and has the following IO characteristics:
MDF (file id 1) 16KBytes
LDF (file id 2) 77KBytes
Somewhat of a difference in performance eh! The only probably with removing BATCHSIZE is that the rows we are inserting are done as one complete transaction which if there are many then you can end up with a large transaction log.
Note, setting recovery model to Simple has no effect.
Filed under: SQL Server