Range lookups in SSIS

Recently someone posted to the SSIS Forums about performance of DTS and SSIS. They had found the performance of SSIS to be well below that of DTS and of other ETL tools. On investigation we found that the package was a fact load and that one of the tasks was a range lookup, i.e. given a set of ranges which one does a particular value fit. To achieve this the lookup component had been changed to perform a range lookup in the SQL. This is where the problem lies. In doing this only a partial cache is possible, meaning that only values that are retrieved cached, so a subsequent lookup with the same values results in a cache hit. However when the data is not very similar you end up with not many cache hits at all. As in the case where the values were monetary values.

The question followed was why does SSIS not support range lookups and the answer, which I have many times, is that there was a finite time and so the concentration was on building a framework that could have building blocks put on top to provide the required functionality, and for those building blocks to be easily created, some provided out of the box.

The is highlighted by the fact that within 10 minutes I had a range lookup working using a script component. I leveraged the power of lists in .Net and the ability to do a binary search on them. This is a very powerful feature and can be extended to perform custom matching of values i.e. to find which subnet and ipaddress belongs. I used the following as a starting point http://aspnet.4guysfromrolla.com/articles/110602-1.aspx.

So the script component,

  1. Generated the range data in the preexecute method (this would normally probably come from a file, database etc)
  2. Performed the lookup in the processInputRow method, and updated the buffer with the min value for the range

Doing this I was able to perform a search on 1,000,000 rows in 6 seconds

The code is below and is very simple. I used List(of X)  as it doesn't require sorting, this is unlikely to be the fastest way but with the speed achieved not sure if you need any more.

' Microsoft SQL Server Integration Services user script component
' This is your new script component in Microsoft Visual Basic .NET
' ScriptMain is the entrypoint class for script components
Imports System
Imports System.Data
Imports System.Math
Imports Microsoft.SqlServer.Dts.Pipeline.Wrapper
Imports Microsoft.SqlServer.Dts.Runtime.Wrapper
Imports System.Collections.Generic
Public Class ScriptMain
    Inherits UserComponent
    Dim lookup As List(Of Decimal)
    'Dim lookup() As Decimal
    Dim lookupKey As List(Of Decimal)
    Public Overrides Sub PreExecute()
        MyBase.PreExecute()
        lookup = New List(Of Decimal)(100)
        'ReDim lookup(100)
        Dim i As Integer
        For i = 0 To 100
            lookup.Add(100 * i)
            'lookup(i) = 100 * i
        Next
    End Sub
    Public Overrides Sub Input0_ProcessInputRow(ByVal Row As Input0Buffer)
        '
        ' Add your code here
        '
        'Array.Sort(lookup)
        'Dim itemIndex As Integer = Array.BinarySearch(lookup, Row.Value)
        Dim itemIndex As Integer = lookup.BinarySearch(Row.Value)
        Dim itemValue As Integer = 0
        Try
            itemValue = CInt(lookup.Item(-2 - itemIndex))
            'itemValue = CInt(lookup(-2 - itemIndex))
        Catch ex As Exception
        End Try
        'Range found
        Row.Lookup = itemIndex '(CInt(lookup.Item(Abs(itemIndex)))
        Row.range = itemValue
    End Sub
End Class
End Sub
End Class

 

-
Published 04 October 2005 18:45 by simonsabin

Comments

27 January 2010 20:17 by CozyRoc

# re: Range lookups in SSIS

There is solution based on the third-party commercial CozyRoc SSIS+ library. CozyRoc has implemented data flow destination script, which creates memory-efficient range dictionary object. The dictionary object can then be used in CozyRoc Lookup Plus component. For more information and demonstration how to use the script, check here:

www.cozyroc.com/.../range-dictionary-destination