ETL performance test - parallel package execution

I was recently busy testing the performance of our ETL process. Each day we process a number of independent feeds, and this number will increase in the future. The feeds usually contain similar numbers of rows. At present we have over 100 feeds that run daily. We have implemented a C# application that executes SSIS packages programmatically, adjusting their various settings and setting some variables at run time. One of the tests we perform is to run all feeds in one go. Obviously, it is not possible (as we learned some time ago) to run 100 packages at the same time, because of memory pressure. So we developed a solution that allows us to configure the maximum number of packages that can run at any given moment. This works pretty well, although it is not clever enough to react to the increasing load of the packages - one day we will implement this :).
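Our tool is written in C#, but the throttling idea can be sketched in a few lines of Python. Everything below is illustrative only - the real application launches SSIS packages, while here a sleep stands in for the package work, and the counters exist just to show that the concurrency cap holds:

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

MAX_CONCURRENT_PACKS = 5  # in our tool this comes from configuration

# Counters used only to demonstrate that the limit is respected.
_lock = threading.Lock()
_current = 0
peak = 0
completed = 0

def run_package(feed):
    """Stand-in for executing one SSIS package (hypothetical body)."""
    global _current, peak, completed
    with _lock:
        _current += 1
        peak = max(peak, _current)
    time.sleep(0.01)  # simulate the package doing some work
    with _lock:
        _current -= 1
        completed += 1

# The pool's worker limit is what caps the number of packages in flight;
# feeds beyond the limit simply wait in the queue.
with ThreadPoolExecutor(max_workers=MAX_CONCURRENT_PACKS) as pool:
    pool.map(run_package, range(117))

print(peak, completed)  # peak never exceeds 5; all 117 feeds finish
```

The worker limit of the pool plays the role of our MaxConcurrentPacks setting: it bounds how many packages are in flight without any explicit scheduling code.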

We have a test environment which processes about 8000 rows in 117 feeds. Not much, but these are the current characteristics of the data. Some of the feeds are empty, but we still need to process them. In general, all packages load data from a SQL Server 2005 database, store it in temporary files, reload the data into separate data flows, perform some transformations and output the data to two destinations. I ran a series of tests to see how the MaxConcurrentPacks configuration option of our process relates to the execution time of all feeds configured in the system.

I put together a comparison of execution times for different settings; let's have a look:

RunId  CountOfFeeds  MaxConcurrentPacks  NumberOfRowsProcessed  DurationOfRun
1      117           1                   8226                   5456
2      117           2                   8226                   3103
3      117           5                   8226                   1999
4      117           7                   8226                   1754
5      117           10                  8226                   1765
6      117           15                  8226                   1702

I also created a chart that shows the relation between the number of concurrent packages and the overall execution time of the whole lot of feeds.

As you can see, there is a significant drop in execution time when 5 concurrent packages are run compared to 1 package (basically, serial execution). Adding more packages doesn't improve performance to the same extent.
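Dividing the serial run time by each of the other durations in the table makes the diminishing returns explicit:

```python
# Durations from the table above, keyed by MaxConcurrentPacks.
durations = {1: 5456, 2: 3103, 5: 1999, 7: 1754, 10: 1765, 15: 1702}
baseline = durations[1]  # serial execution

# Speedup relative to running one package at a time.
speedup = {n: round(baseline / d, 2) for n, d in durations.items()}
print(speedup)
# {1: 1.0, 2: 1.76, 5: 2.73, 7: 3.11, 10: 3.09, 15: 3.21}
```

Going from 1 to 5 packages nearly triples the throughput, while going from 5 to 15 adds only about half as much again - and 10 packages actually came out marginally slower than 7 in this run.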

I also wanted to see what would happen if I increased the amount of data to process. I modified the configuration so that most of the feeds now process quite a bit of data; altogether, there are 3.3 million rows to be transferred. I started with 2 concurrent packages and the overall execution time exceeded 3 hours. I noticed that memory usage was significantly higher, reaching 6GB (on a server with 8GB of RAM). Then I increased the number of concurrent packages to 5. This maxed out the memory usage and the whole process crashed - I had to kill it actually, because the system became unresponsive and had trouble launching new applications. The application logged errors in the log file that indicate memory pressure:

A buffer failed while allocating 72816 bytes.

The system reports 97 percent memory load. There are 8587444224 bytes of physical memory with 236855296 bytes free. There are 8796092891136 bytes of virtual memory with 8782502318080 bytes free. The paging file has 12419862528 bytes with 5967872 bytes free.

The Data Flow task failed to create a required thread and cannot begin running. This usually occurs when there is an out-of-memory state.

The Data Flow task engine failed at startup because it cannot create one or more required threads.

For the above setup (3.3M rows), 2 concurrent packs seems to be a pretty safe setting. If you plan to implement concurrent package execution in your solutions, you should run tests and make the number of packages easy to modify (for example, as a setting in a configuration file), so you can adjust it if you experience performance problems.
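Our application being .NET, it reads the setting from its own configuration file; as a language-neutral sketch, reading a MaxConcurrentPacks setting with a safe fallback could look like this (the INI section and key names are my own invention, not our actual config schema):

```python
import configparser
import os
import tempfile

def load_max_concurrent_packs(path, default=2):
    """Read MaxConcurrentPacks from an INI-style file, falling back to a safe default."""
    cfg = configparser.ConfigParser()
    cfg.read(path)  # a missing file simply yields no sections
    return cfg.getint("etl", "MaxConcurrentPacks", fallback=default)

# Demonstration with a throwaway config file.
with tempfile.NamedTemporaryFile("w", suffix=".ini", delete=False) as f:
    f.write("[etl]\nMaxConcurrentPacks = 5\n")
    path = f.name

max_packs = load_max_concurrent_packs(path)
os.unlink(path)
print(max_packs)                                 # 5
print(load_max_concurrent_packs("no_such.ini"))  # 2 (the default)
```

The point of the fallback is that a missing or broken config file degrades to the conservative setting rather than crashing the nightly run.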

The testing application server is a quad-CPU machine with 8GB of memory, running 64-bit Windows Server 2003 Enterprise Edition.

Of course, your particular design and data conditions may be completely different from our setup, and your mileage may vary. I think it is interesting to see, though, that a bit of effort put into the design of an ETL process may improve the throughput of the system. It often happens that application servers are either underutilized or run into performance issues because of a rigid ETL design.



Published Saturday, June 13, 2009 12:38 AM by Piotr Rodak

