SSIS: Getting data from Excel files using Aspose.Cells

Excel spreadsheets with complicated layouts really do not lend themselves to processing with the standard Microsoft OLE DB drivers. Indeed, extracting one or two cells from different worksheets can be positively painful! This is where the power and flexibility of Aspose.Cells comes into its own.

Aspose.Cells provides an API that looks very similar to the OLE automation API exposed by Excel in the Office InterOp assemblies. However, you do not get all the headaches normally associated with the InterOp assemblies (i.e. modal dialogs, memory leaks, multiple Excel instances etc. etc.). With Aspose.Cells you do not need Excel to be installed on your server. Instead you only need a single assembly (Aspose.Cells.dll ) and the accompanying license file (Aspose.Cells.lic). Simple!

Working with Aspose.Cells in SSIS

To work with Aspose.Cells, you need to use the Script Task or Script Component. My personal preference is to use the Script Component in its asynchronous transformation mode. This way I can feed in a list of Workbooks that I want the component to process on its input buffer and the component can write the data read from the Workbooks onto multiple output buffers (one buffer for each type of data).

As SSIS uses Visual Studio for Applications (VSA), you do not have the flexible "Add Reference" dialog found in Visual Studio. In order to get around this issue while you are developing, you need to copy the Aspose.Cells assembly and associated license file into your .NET folder, which is usually:

C:\WINDOWS\Microsoft.NET\Framework\v2.0.50727

Restart Visual Studio and you will see Aspose.Cells listed on VSA's "Add Reference" dialog. Now you can add the appropriate "imports" statement to the top of your module and start coding. Easy!

Later, when you move to production, you will need to register the Aspose.Cells assembly in the GAC by simply dropping the assembly into the following folder:

C:\WINDOWS\assembly

Dealing with the Aspose.Cells License file

Aspose.Cells has a license file which can either be in the same directory as the assembly, or in your own location. The easiest way to handle this in SSIS is to use a File Connection Manager which records the location of the license file. Then in code, you will need to add:

Dim license As Aspose.Cells.License = New Aspose.Cells.License()

license.SetLicense(Me.Connections.AsposeCellsLicense.ConnectionString)

 

Example Code

My sample code reads a list of workbooks from the input buffer and writes the extracted data to the output buffer.  The data flow is shown below along with the code from my asynchronous Script Component. Enjoy!


 

' Aspose.Cells.dll must be registered in the GAC for production

' AND the C:\WINNT\Microsoft.NET\Framework\v2.0.50727 for development

 

' standard libraries

Imports System

Imports System.Data

Imports System.Math

Imports Microsoft.SqlServer.Dts.Pipeline.Wrapper

Imports Microsoft.SqlServer.Dts.Runtime.Wrapper

' additional libraries

Imports System.IO

Imports System.Collections.Generic

Imports Aspose.Cells

 

Public Class ScriptMain

Inherits UserComponent

 

Public Overrides Sub WorkbookList_ProcessInputRow(ByVal Row As WorkbookListBuffer)

 

' Imports all data from the the Excel workbooks provided on the input

 

' Note that each workbook contains several worksheets.

' For efficiency, we open the workbook once and then

' process all the worksheets in the file

 

Try

' The previous step in the package downloaded the workbook.

Dim fInfo As FileInfo = New FileInfo(Path.Combine(Me.Variables.ImportFolder, Row.URL))

If fInfo.Exists Then

 

' Set Aspose.Cells license

Dim license As Aspose.Cells.License = New Aspose.Cells.License()

license.SetLicense(Me.Connections.AsposeCellsLicense.ConnectionString)

 

' Open the workbook

Dim wrkBook As Workbook = New Workbook()

wrkBook.Open(fInfo.FullName)

 

' process each worksheet

For Each wrkSheet As Worksheet In wrkBook.Worksheets

Try

' Check we have a valid worksheet by testing

' for text prompts in specific locations

If ValidWorksheetLayout(wrkSheet) Then

' process each worksheet found in the workbook

' that has the correct format

ImportWorksheetData(Row.WorkbookID, wrkSheet, Row)

End If

Catch exi As Exception

Me.ComponentMetaData.FireInformation(0, Me.ComponentMetaData.Name, "WorkbookList_ProcessInputRow: Invalid Worksheet Layout: " & wrkSheet.Name, String.Empty, 0, True)

End Try

Next

Else

Me.ComponentMetaData.FireError(0, Me.ComponentMetaData.Name, "WorkbookList_ProcessInputRow: Missing Excel File: " & fInfo.FullName, String.Empty, 0, True)

End If

Catch ex As Exception

Me.ComponentMetaData.FireError(0, Me.ComponentMetaData.Name, "WorkbookList_ProcessInputRow: " & ex.Message, String.Empty, 0, True)

End Try

End Sub

 

Private Function ValidWorksheetLayout(ByVal wrkSheet As Worksheet) As Boolean

' checks if data is in expected locations. If not, then return false.

If wrkSheet.Cells("B3").Value.ToString = "Profit and loss projections" And wrkSheet.Cells("B14").Value.ToString = "Revenue" Then

Return True

End If

Return False

End Function

 

Private Sub ImportWorksheetData(ByVal WorkbookID As Integer, ByVal wrkSheet As Worksheet, ByVal Row As WorkbookListBuffer)

' imports all data from the worksheet

Dim iRow As Integer

Dim iCol As Integer

Dim iCount As Integer

Dim dataRange As DataTable

Dim dr As DataTableReader

Dim HeaderRow As List(Of Integer)

Try

' heading row is in seperate location to data, so import heading first

' note that Aspose.Cells uses zero-based row and column numbers in ExportDataTable call

' so Cell=4,Row=12 is actually cell E13

Try

' read single header row into generic list

HeaderRow = New List(Of Integer)

dataRange = wrkSheet.Cells.ExportDataTable(12, 4, 1, 5)

dr = dataRange.CreateDataReader()

If dr.Read Then

For iCount = 0 To 4

HeaderRow.Add(GetInteger(dr, iCount))

Next

End If

 

Catch ex0 As Exception

Me.ComponentMetaData.FireError(0, Me.ComponentMetaData.Name, "ImportWorksheetData ex0: " & ex0.Message, String.Empty, 0, True)

End Try

 

Try

dataRange = wrkSheet.Cells.ExportDataTable(14, 2, 2, 7)

dr = dataRange.CreateDataReader()

While dr.Read

ProcessDataTable(WorkbookID, dr, 2, 6, HeaderRow)

End While

 

Catch ex1 As Exception

Me.ComponentMetaData.FireError(0, Me.ComponentMetaData.Name, "ImportWorksheetData ex1: " & ex1.Message, String.Empty, 0, True)

End Try

 

Try

dataRange = wrkSheet.Cells.ExportDataTable(19, 2, 2, 7)

dr = dataRange.CreateDataReader()

While dr.Read

ProcessDataTable(WorkbookID, dr, 2, 6, HeaderRow)

End While

Catch ex2 As Exception

Me.ComponentMetaData.FireError(0, Me.ComponentMetaData.Name, "ImportWorksheetData ex2: " & ex2.Message, String.Empty, 0, True)

End Try

 

Try

dataRange = wrkSheet.Cells.ExportDataTable(23, 2, 9, 7)

dr = dataRange.CreateDataReader()

While dr.Read

ProcessDataTable(WorkbookID, dr, 2, 6, HeaderRow)

End While

Catch ex3 As Exception

Me.ComponentMetaData.FireError(0, Me.ComponentMetaData.Name, "ImportWorksheetData ex3: " & ex3.Message, String.Empty, 0, True)

End Try

 

Catch exOuter As Exception

Me.ComponentMetaData.FireError(0, Me.ComponentMetaData.Name, "ImportWorksheetData: outer " & exOuter.Message, String.Empty, 0, True)

End Try

End Sub

 

Public Sub ProcessDataTable(ByVal WorkbookID As Integer, ByVal dr As DataTableReader, ByVal iStartDataCol As Integer, ByVal iDataColCount As Integer, ByVal HeaderRow As List(Of Integer))

Dim iCol As Integer

Dim DataValue As Decimal

Dim bAddRow As Boolean

Dim TimePeriodID As Integer

Dim OutputCount As Integer = 0

 

Try

For iCol = iStartDataCol To iDataColCount

bAddRow = False

Try

If Not dr.IsDBNull(iCol) Then

DataValue = GetDecimal(dr, iCol)

If DataValue <> 0 Then bAddRow = True

If bAddRow Then

With Me.WorksheetDataBuffer

.AddRow()

' get header from the row we read into a list earlier

.FinancialYear = HeaderRow(iCol - 2)

.WorkbookID = WorkbookID

.RowName = dr.GetString(0)

.DataValue = DataValue

End With

End If

End If

Catch ex1 As Exception

Me.ComponentMetaData.FireError(0, Me.ComponentMetaData.Name, "ProcessDataTable: Inner " & ex1.Message, String.Empty, 0, True)

End Try

Next

 

Catch ex As Exception

Me.ComponentMetaData.FireError(0, Me.ComponentMetaData.Name, "ProcessDataTable: Outer " & ex.Message, String.Empty, 0, True)

End Try

End Sub

 

Function GetDecimal(ByVal dr As DataTableReader, ByVal iCol As Integer) As Decimal

Dim DataValue As Decimal

Try

Select Case dr.GetDataTypeName(iCol)

Case "String"

If IsNumeric(dr.GetString(iCol)) Then

DataValue = CDec(dr.GetString(iCol))

End If

Case "Integer"

DataValue = CDec(dr.GetInt32(iCol))

Case "Double"

DataValue = CDec(dr.GetDouble(iCol))

Case "Decimal"

DataValue = dr.GetDecimal(iCol)

End Select

Catch ex As Exception

Me.ComponentMetaData.FireError(0, Me.ComponentMetaData.Name, "GetDecimal " & ex.Message, String.Empty, 0, True)

End Try

Return DataValue

End Function

 

Function GetInteger(ByVal dr As DataTableReader, ByVal iCol As Integer) As Integer

Dim DataValue As Integer

Try

Select Case dr.GetDataTypeName(iCol)

Case "String"

If IsNumeric(dr.GetString(iCol)) Then

DataValue = CInt(dr.GetString(iCol))

End If

Case "Integer"

DataValue = dr.GetInt32(iCol)

Case "Double"

DataValue = CInt(dr.GetDouble(iCol))

Case "Decimal"

DataValue = CInt(dr.GetDecimal(iCol))

End Select

Catch ex As Exception

Me.ComponentMetaData.FireError(0, Me.ComponentMetaData.Name, "GetInteger " & ex.Message, String.Empty, 0, True)

End Try

Return DataValue

End Function

End Class

Published 11 November 2007 15:16 by DrJohn
Filed under: ,

Comments

No Comments