Range lookups in SSIS
Recently someone posted to the SSIS Forums about performance of DTS and SSIS. They had found the performance of SSIS to be well below that of DTS and of other ETL tools. On investigation we found that the package was a fact load and that one of the tasks was a range lookup, i.e. given a set of ranges which one does a particular value fit. To achieve this the lookup component had been changed to perform a range lookup in the SQL. This is where the problem lies. In doing this only a partial cache is possible, meaning that only values that are retrieved cached, so a subsequent lookup with the same values results in a cache hit. However when the data is not very similar you end up with not many cache hits at all. As in the case where the values were monetary values.
The question followed was why does SSIS not support range lookups and the answer, which I have many times, is that there was a finite time and so the concentration was on building a framework that could have building blocks put on top to provide the required functionality, and for those building blocks to be easily created, some provided out of the box.
The is highlighted by the fact that within 10 minutes I had a range lookup working using a script component. I leveraged the power of lists in .Net and the ability to do a binary search on them. This is a very powerful feature and can be extended to perform custom matching of values i.e. to find which subnet and ipaddress belongs. I used the following as a starting point http://aspnet.4guysfromrolla.com/articles/110602-1.aspx.
So the script component,
- Generated the range data in the preexecute method (this would normally probably come from a file, database etc)
- Performed the lookup in the processInputRow method, and updated the buffer with the min value for the range
Doing this I was able to perform a search on 1,000,000 rows in 6 seconds
The code is below and is very simple. I used List(of X) as it doesn't require sorting, this is unlikely to be the fastest way but with the speed achieved not sure if you need any more.
' Microsoft SQL Server Integration Services user script component
' This is your new script component in Microsoft Visual Basic .NET
' ScriptMain is the entrypoint class for script components
Imports System
Imports System.Data
Imports System.Math
Imports Microsoft.SqlServer.Dts.Pipeline.Wrapper
Imports Microsoft.SqlServer.Dts.Runtime.Wrapper
Imports System.Collections.Generic
Public Class ScriptMain
Inherits UserComponent
Dim lookup As List(Of Decimal)
'Dim lookup() As Decimal
Dim lookupKey As List(Of Decimal)
Public Overrides Sub PreExecute()
MyBase.PreExecute()
lookup = New List(Of Decimal)(100)
'ReDim lookup(100)
Dim i As Integer
For i = 0 To 100
lookup.Add(100 * i)
'lookup(i) = 100 * i
Next
End Sub
Public Overrides Sub Input0_ProcessInputRow(ByVal Row As Input0Buffer)
'
' Add your code here
'
'Array.Sort(lookup)
'Dim itemIndex As Integer = Array.BinarySearch(lookup, Row.Value)
Dim itemIndex As Integer = lookup.BinarySearch(Row.Value)
Dim itemValue As Integer = 0
Try
itemValue = CInt(lookup.Item(-2 - itemIndex))
'itemValue = CInt(lookup(-2 - itemIndex))
Catch ex As Exception
End Try
'Range found
Row.Lookup = itemIndex '(CInt(lookup.Item(Abs(itemIndex)))
Row.range = itemValue
End Sub
End Class
End Sub
End Class
-