11 March 2007 15:24 tonyrogerson

BULK INSERT - FIRSTROW and BATCHSIZE cause very poor performance and significant Writes to LDF and MDF.

Quick Tip about BULK INSERT and using FIRSTROW – the BATCHSIZE parameter makes performance suck and you’ll never guess why!

What do you think FIRSTROW does? According to Books Online it “Specifies the number of the first row to load.”, so baring that in mind what does BATCHSIZE do? According to BOL it “Specifies the number of rows in a batch; each batch is copied to the server as one transaction.”

Given a data file with 100,000 rows in it and you just want the last two rows you’d expect SQL Server to read data up-to and including row 99,998 and then start writing for just the last two rows you want (99,999 and 100,000) – well it doesn’t.

In a nutshell specifying a BATCHSIZE of 1 and FIRSTROW of 99999 will cause SQL Server to write 1.2GBytes of data on the LDF and MDF files only to write just a few Kbytes of actual data! The repro example is below.

Note, this is on SQL Server 2005 (could be 2000 as well but I've not tested it).

CREATE DATABASE BUlkInsert

go

USE BulkInsert

go

CREATE TABLE Test (

    somedata    char(4096)  not null

)

go

SET NOCOUNT ON

DECLARE @i int

SET @i = 1

WHILE @i <= 100000

BEGIN

    INSERT Test values( cast( @i as char(10) ) )

 

    SET @i = @i + 1

 

END

GO

 

sp_spaceused Test

GO

Now create the data file we will import in a moment..

bcp bulkinsert..test out testbulk.dat -n -T –S

TRUNCATE TABLE Test

CHECKPOINT

GO

SET NOCOUNT ON

 

SELECT * FROM sys.dm_io_virtual_file_stats( db_id(), null )

 

BULK INSERT Test

    FROM 'C:\users\tonyrogerson\testbulk.dat'

    WITH (

        BATCHSIZE = 1,

        FIRSTROW = 99999,

        CODEPAGE = 'raw',

        DATAFILETYPE = 'native',

        TABLOCK    )

 

SELECT * FROM sys.dm_io_virtual_file_stats( db_id(), null )

This will give you two result sets

database_id file_id sample_ms   num_of_reads         num_of_bytes_read    io_stall_read_ms     num_of_writes        num_of_bytes_written io_stall_write_ms    io_stall             size_on_disk_bytes   file_handle
----------- ------- ----------- -------------------- -------------------- -------------------- -------------------- -------------------- -------------------- -------------------- -------------------- ------------------
26          1       336123593   3937                 867188736            1976212              84840                1486749696           2379713              4355925              4520607744           0x00000A60
26          2       336123593   96                   815104               700                  192953               935902208            712195               712895               40894464             0x00000928

 

database_id file_id sample_ms   num_of_reads         num_of_bytes_read    io_stall_read_ms     num_of_writes        num_of_bytes_written io_stall_write_ms    io_stall             size_on_disk_bytes   file_handle
----------- ------- ----------- -------------------- -------------------- -------------------- -------------------- -------------------- -------------------- -------------------- -------------------- ------------------
26          1       336967453   3968                 867442688            1976278              185633               2312601600           2488036              4464314              6562185216           0x00000A60
26          2       336967453   96                   815104               700                  301099               1458697216           843031               843731               40894464             0x00000928

If we look at the num_of_bytes_written columns we can see the following IO...

MDF (file id 1)    787Mbytes

LDF (file id 2)      498Mbytes

Ouch! To read just two rows of data has caused 1.2GBytes of write activity.

Remove BATCHSIZE=1 and the problem goes away, instead of taking 13minutes it now takes 48 seconds and has the following IO characteristics:

MDF (file id 1)    16KBytes

LDF (file id 2)      77KBytes

Somewhat of a difference in performance eh! The only probably with removing BATCHSIZE is that the rows we are inserting are done as one complete transaction which if there are many then you can end up with a large transaction log.

Note, setting recovery model to Simple has no effect.

 

Filed under:

Comments

# MAXDOP Recommendation | keyongtech

22 January 2009 01:56 by MAXDOP Recommendation | keyongtech

Pingback from  MAXDOP Recommendation | keyongtech