I was recently busy testing the performance of our ETL process. Each day we process a number of independent feeds, and that number will increase in the future. The feeds usually contain a similar number of rows. At present we have over 100 feeds that are run daily. We have implemented a C# application that executes SSIS packages programmatically, adjusting their various settings and setting some variables at run time. One of the tests we perform is to run all feeds in one go. Obviously, it is not possible (as we learned some time ago) to run 100 packages at the same time, because of memory pressure. We developed a solution that allows us to configure the maximum number of packages that can run at any given moment. This works pretty well, although it is not clever enough to react to the increasing load of the packages - one day we will implement that :).
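To give an idea of the approach, here is a minimal sketch of throttled package execution built on the Microsoft.SqlServer.Dts.Runtime API. It is not our actual application; the feed paths, the User::FeedDate variable and the concurrency limit are purely illustrative.

```csharp
// Minimal sketch of throttled SSIS package execution (illustrative names throughout).
using System;
using System.Collections.Generic;
using System.Threading;
using Microsoft.SqlServer.Dts.Runtime;

static class FeedRunner
{
    // Runs every package in 'feedPaths', allowing at most 'maxConcurrentPacks'
    // of them to execute at the same time.
    public static void RunAllFeeds(IEnumerable<string> feedPaths, int maxConcurrentPacks)
    {
        var gate = new Semaphore(maxConcurrentPacks, maxConcurrentPacks);
        var threads = new List<Thread>();

        foreach (string feedPath in feedPaths)
        {
            string path = feedPath;                       // capture for the closure
            var t = new Thread(() =>
            {
                gate.WaitOne();                           // wait for a free execution slot
                try
                {
                    var app = new Application();
                    Package package = app.LoadPackage(path, null);
                    // Hypothetical run-time variable; real packages would take several.
                    package.Variables["User::FeedDate"].Value = DateTime.Today;
                    DTSExecResult result = package.Execute();
                    Console.WriteLine("{0}: {1}", path, result);
                }
                finally
                {
                    gate.Release();                       // free the slot for the next package
                }
            });
            t.Start();
            threads.Add(t);
        }

        foreach (Thread t in threads) t.Join();
    }
}
```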
We have a test environment which processes about 8000 rows in 117 feeds. Not much, but these are the current characteristics of the data. Some of the feeds are empty, but we still need to process them. In general, all packages load data from a SQL Server 2005 database, store it in temporary files, reload the data into separate data flows, perform some transformations and output the data to two destinations. I ran a series of tests to see how the MaxConcurrentPacks configuration option of our process relates to the execution time of all feeds configured in the system (a sketch of such a test loop follows below).
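For illustration only, such a measurement harness could look roughly like this; the feed location and the candidate settings are examples, and FeedRunner.RunAllFeeds is the hypothetical throttled runner sketched above.

```csharp
// Hypothetical measurement loop: run the whole batch once per candidate setting
// and record the wall-clock time of each run.
using System;
using System.Diagnostics;
using System.IO;

class ConcurrencyTest
{
    static void Main()
    {
        string[] feeds = Directory.GetFiles(@"C:\ETL\Feeds", "*.dtsx");   // example location
        foreach (int packs in new[] { 1, 2, 5, 10, 20 })                  // example settings
        {
            Stopwatch sw = Stopwatch.StartNew();
            FeedRunner.RunAllFeeds(feeds, packs);
            sw.Stop();
            Console.WriteLine("MaxConcurrentPacks={0}: elapsed {1}", packs, sw.Elapsed);
        }
    }
}
```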
I put together a comparison of execution times for the different settings; let's have a look:
I also created a chart that shows the relation between the number of concurrent packages and the overall execution time of the whole lot of feeds.
As you can see, there is a significant drop in execution time when 5 concurrent packages are run compared to 1 package (basically, serial execution). Adding more packages doesn't improve the performance to the same extent.
I also wanted to see what would happen if I increased the amount of data to process. I modified the configuration so that most of the feeds now process quite a bit of data; altogether, there are 3.3 million rows to be transferred. I started with 2 concurrent packages and the overall execution time exceeded 3 hours. I noticed that memory usage was significantly higher, reaching 6GB (on a server with 8GB of RAM). Then I increased the number of concurrent packages to 5. This maxed out the memory usage and the whole process crashed - actually, I had to kill it, because the system became unresponsive and had trouble launching new applications. The application logged errors in the log file that indicate memory pressure problems:
A buffer failed while allocating 72816 bytes.
The system reports
97 percent memory load. There are 8587444224 bytes of physical memory
with 236855296 bytes free. There are 8796092891136 bytes of virtual
memory with 8782502318080 bytes free. The paging file has 12419862528
bytes with 5967872 bytes free.
The Data Flow task
failed to create a required thread and cannot begin running. The
usually occurs when there is an out-of-memory state.
The Data Flow task engine failed at startup because it cannot create one or more required threads.
For the above setup (3.3M rows), 2 concurrent packages seems to be a pretty safe setting. If you plan to implement concurrent package execution in your solutions, you should run tests and make it easy to modify the number of concurrent packages (for example, via a setting in a configuration file) so you can adjust it if you experience performance problems.
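As a rough illustration of the configuration-file approach (the MaxConcurrentPacks key name and the fallback value are assumptions, not our actual settings), the limit could be read like this:

```csharp
// Illustrative only: read the concurrency limit from App.config, e.g.
//   <appSettings>
//     <add key="MaxConcurrentPacks" value="2" />
//   </appSettings>
// so it can be changed without recompiling or redeploying the application.
using System.Configuration;

static class EtlSettings
{
    public static int MaxConcurrentPacks
    {
        get
        {
            string raw = ConfigurationManager.AppSettings["MaxConcurrentPacks"];
            int value;
            // Fall back to serial execution if the setting is missing or malformed.
            return int.TryParse(raw, out value) && value > 0 ? value : 1;
        }
    }
}
```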
The test application server is a quad-CPU machine with 8GB of memory, running 64-bit Windows Server 2003 Enterprise Edition.
Of course, your particular design and data conditions may be completely different from our setup, and your mileage may vary. I think it is interesting to see, though, that a bit of effort put into the design of the ETL may improve the throughput of the system. It often happens that application servers are either underutilized or run into performance issues because of a rigid ETL design.