Excel spreadsheets with complicated layouts really do not lend themselves to processing with the standard Microsoft OLE DB drivers. Indeed, extracting one or two cells from different worksheets can be positively painful! This is where the power and flexibility of Aspose.Cells comes into its own.
Aspose.Cells provides an API that looks very similar to the OLE automation API exposed by Excel in the Office InterOp assemblies. However, you do not get all the headaches normally associated with the InterOp assemblies (i.e. modal dialogs, memory leaks, multiple Excel instances etc. etc.). With Aspose.Cells you do not need Excel to be installed on your server. Instead you only need a single assembly (Aspose.Cells.dll ) and the accompanying license file (Aspose.Cells.lic). Simple!
Working with Aspose.Cells in SSIS
To work with Aspose.Cells, you need to use the Script Task or Script Component. My personal preference is to use the Script Component in its asynchronous transformation mode. This way I can feed in a list of Workbooks that I want the component to process on its input buffer and the component can write the data read from the Workbooks onto multiple output buffers (one buffer for each type of data).
As SSIS uses Visual Studio for Applications (VSA), you do not have the flexible "Add Reference" dialog found in Visual Studio. In order to get around this issue while you are developing, you need to copy the Aspose.Cells assembly and associated license file into your .NET folder, which is usually:
C:\WINDOWS\Microsoft.NET\Framework\v2.0.50727
Restart Visual Studio and you will see Aspose.Cells listed on VSA's "Add Reference" dialog. Now you can add the appropriate "imports" statement to the top of your module and start coding. Easy!
Later, when you move to production, you will need to register the Aspose.Cells assembly in the GAC by simply dropping the assembly into the following folder:
C:\WINDOWS\assembly
Dealing with the Aspose.Cells License file
Aspose.Cells has a license file which can either be in the same directory as the assembly, or in your own location. The easiest way to handle this in SSIS is to use a File Connection Manager which records the location of the license file. Then in code, you will need to add:
Dim license As Aspose.Cells.License = New Aspose.Cells.License()
license.SetLicense(Me.Connections.AsposeCellsLicense.ConnectionString)
Example Code
My sample code reads a list of workbooks from the input buffer and writes the extracted data to the output buffer. The data flow is shown below along with the code from my asynchronous Script Component. Enjoy!
' Aspose.Cells.dll must be registered in the GAC for production
' AND the C:\WINNT\Microsoft.NET\Framework\v2.0.50727 for development
' standard libraries
Imports System
Imports System.Data
Imports System.Math
Imports Microsoft.SqlServer.Dts.Pipeline.Wrapper
Imports Microsoft.SqlServer.Dts.Runtime.Wrapper
' additional libraries
Imports System.IO
Imports System.Collections.Generic
Imports Aspose.Cells
Public
Class ScriptMain
Inherits UserComponent
Public
Overrides
Sub WorkbookList_ProcessInputRow(ByVal Row As WorkbookListBuffer)
' Imports all data from the the Excel workbooks provided on the input
' Note that each workbook contains several worksheets.
' For efficiency, we open the workbook once and then
' process all the worksheets in the file
Try
' The previous step in the package downloaded the workbook.
Dim fInfo As FileInfo = New FileInfo(Path.Combine(Me.Variables.ImportFolder, Row.URL))
If fInfo.Exists Then
' Set Aspose.Cells license
Dim license As Aspose.Cells.License = New Aspose.Cells.License()
license.SetLicense(Me.Connections.AsposeCellsLicense.ConnectionString)
' Open the workbook
Dim wrkBook As Workbook = New Workbook()
wrkBook.Open(fInfo.FullName)
' process each worksheet
For
Each wrkSheet As Worksheet In wrkBook.Worksheets
Try
' Check we have a valid worksheet by testing
' for text prompts in specific locations
If ValidWorksheetLayout(wrkSheet) Then
' process each worksheet found in the workbook
' that has the correct format
ImportWorksheetData(Row.WorkbookID, wrkSheet, Row)
End
If
Catch exi As Exception
Me.ComponentMetaData.FireInformation(0, Me.ComponentMetaData.Name, "WorkbookList_ProcessInputRow: Invalid Worksheet Layout: " & wrkSheet.Name, String.Empty, 0, True)
End
Try
Next
Else
Me.ComponentMetaData.FireError(0, Me.ComponentMetaData.Name, "WorkbookList_ProcessInputRow: Missing Excel File: " & fInfo.FullName, String.Empty, 0, True)
End
If
Catch ex As Exception
Me.ComponentMetaData.FireError(0, Me.ComponentMetaData.Name, "WorkbookList_ProcessInputRow: " & ex.Message, String.Empty, 0, True)
End
Try
End
Sub
Private
Function ValidWorksheetLayout(ByVal wrkSheet As Worksheet) As
Boolean
' checks if data is in expected locations. If not, then return false.
If wrkSheet.Cells("B3").Value.ToString = "Profit and loss projections"
And wrkSheet.Cells("B14").Value.ToString = "Revenue"
Then
Return
True
End
If
Return
False
End
Function
Private
Sub ImportWorksheetData(ByVal WorkbookID As
Integer, ByVal wrkSheet As Worksheet, ByVal Row As WorkbookListBuffer)
' imports all data from the worksheet
Dim iRow As
Integer
Dim iCol As
Integer
Dim iCount As
Integer
Dim dataRange As DataTable
Dim dr As DataTableReader
Dim HeaderRow As List(Of
Integer)
Try
' heading row is in seperate location to data, so import heading first
' note that Aspose.Cells uses zero-based row and column numbers in ExportDataTable call
' so Cell=4,Row=12 is actually cell E13
Try
' read single header row into generic list
HeaderRow = New List(Of
Integer)
dataRange = wrkSheet.Cells.ExportDataTable(12, 4, 1, 5)
dr = dataRange.CreateDataReader()
If dr.Read Then
For iCount = 0 To 4
HeaderRow.Add(GetInteger(dr, iCount))
Next
End
If
Catch ex0 As Exception
Me.ComponentMetaData.FireError(0, Me.ComponentMetaData.Name, "ImportWorksheetData ex0: " & ex0.Message, String.Empty, 0, True)
End
Try
Try
dataRange = wrkSheet.Cells.ExportDataTable(14, 2, 2, 7)
dr = dataRange.CreateDataReader()
While dr.Read
ProcessDataTable(WorkbookID, dr, 2, 6, HeaderRow)
End
While
Catch ex1 As Exception
Me.ComponentMetaData.FireError(0, Me.ComponentMetaData.Name, "ImportWorksheetData ex1: " & ex1.Message, String.Empty, 0, True)
End
Try
Try
dataRange = wrkSheet.Cells.ExportDataTable(19, 2, 2, 7)
dr = dataRange.CreateDataReader()
While dr.Read
ProcessDataTable(WorkbookID, dr, 2, 6, HeaderRow)
End
While
Catch ex2 As Exception
Me.ComponentMetaData.FireError(0, Me.ComponentMetaData.Name, "ImportWorksheetData ex2: " & ex2.Message, String.Empty, 0, True)
End
Try
Try
dataRange = wrkSheet.Cells.ExportDataTable(23, 2, 9, 7)
dr = dataRange.CreateDataReader()
While dr.Read
ProcessDataTable(WorkbookID, dr, 2, 6, HeaderRow)
End
While
Catch ex3 As Exception
Me.ComponentMetaData.FireError(0, Me.ComponentMetaData.Name, "ImportWorksheetData ex3: " & ex3.Message, String.Empty, 0, True)
End
Try
Catch exOuter As Exception
Me.ComponentMetaData.FireError(0, Me.ComponentMetaData.Name, "ImportWorksheetData: outer " & exOuter.Message, String.Empty, 0, True)
End
Try
End
Sub
Public
Sub ProcessDataTable(ByVal WorkbookID As
Integer, ByVal dr As DataTableReader, ByVal iStartDataCol As
Integer, ByVal iDataColCount As
Integer, ByVal HeaderRow As List(Of
Integer))
Dim iCol As
Integer
Dim DataValue As
Decimal
Dim bAddRow As
Boolean
Dim TimePeriodID As
Integer
Dim OutputCount As
Integer = 0
Try
For iCol = iStartDataCol To iDataColCount
bAddRow = False
Try
If
Not dr.IsDBNull(iCol) Then
DataValue = GetDecimal(dr, iCol)
If DataValue <> 0 Then bAddRow = True
If bAddRow Then
With
Me.WorksheetDataBuffer
.AddRow()
' get header from the row we read into a list earlier
.FinancialYear = HeaderRow(iCol - 2)
.WorkbookID = WorkbookID
.RowName = dr.GetString(0)
.DataValue = DataValue
End
With
End
If
End
If
Catch ex1 As Exception
Me.ComponentMetaData.FireError(0, Me.ComponentMetaData.Name, "ProcessDataTable: Inner " & ex1.Message, String.Empty, 0, True)
End
Try
Next
Catch ex As Exception
Me.ComponentMetaData.FireError(0, Me.ComponentMetaData.Name, "ProcessDataTable: Outer " & ex.Message, String.Empty, 0, True)
End
Try
End
Sub
Function GetDecimal(ByVal dr As DataTableReader, ByVal iCol As
Integer) As
Decimal
Dim DataValue As
Decimal
Try
Select
Case dr.GetDataTypeName(iCol)
Case
"String"
If IsNumeric(dr.GetString(iCol)) Then
DataValue = CDec(dr.GetString(iCol))
End
If
Case
"Integer"
DataValue = CDec(dr.GetInt32(iCol))
Case
"Double"
DataValue = CDec(dr.GetDouble(iCol))
Case
"Decimal"
DataValue = dr.GetDecimal(iCol)
End
Select
Catch ex As Exception
Me.ComponentMetaData.FireError(0, Me.ComponentMetaData.Name, "GetDecimal " & ex.Message, String.Empty, 0, True)
End
Try
Return DataValue
End
Function
Function GetInteger(ByVal dr As DataTableReader, ByVal iCol As
Integer) As
Integer
Dim DataValue As
Integer
Try
Select
Case dr.GetDataTypeName(iCol)
Case
"String"
If IsNumeric(dr.GetString(iCol)) Then
DataValue = CInt(dr.GetString(iCol))
End
If
Case
"Integer"
DataValue = dr.GetInt32(iCol)
Case
"Double"
DataValue = CInt(dr.GetDouble(iCol))
Case
"Decimal"
DataValue = CInt(dr.GetDecimal(iCol))
End
Select
Catch ex As Exception
Me.ComponentMetaData.FireError(0, Me.ComponentMetaData.Name, "GetInteger " & ex.Message, String.Empty, 0, True)
End
Try
Return DataValue
End
Function
End
Class
In my previous post, I outlined how to obtain a recursive list of all Excel files held in a SharePoint document library. In SSIS, I use this code to upsert records into a table which identifies which files are new or modified. Clearly I could process the file in-situ, but that would over complicate my code with data streams which SSIS would not understand – especially if you are attempting to use the standard Microsoft.Jet.OLEDB.4.0 engine to process the file! So downloading the file to a local folder is the easiest option. Fortunately, I spotted Greg Enslow's post which pointed me in the direction of using the WebClient library.
In my control flow I obtain a list all the files I need to process and store it in an ADO.NET recordset. I then iterate over this list using a ForEach task, which sets various variables containing the URL, file extension and the document's name and it's GUID in SharePoint. My script task then downloads each file so it can be processed by subsequent tasks in the control flow.
I have expanded on Greg's original code to support file system locations as well as SharePoint URLs. Enjoy!
Imports System
Imports System.Data
Imports System.Math
Imports Microsoft.SqlServer.Dts.Runtime
Imports System.Net
Imports System.IO
Public
Class ScriptMain
Public
Sub Main()
' downloads the file from SharePoint or a file system location to a local folder
Dim taskResult As
Integer = Dts.Results.Success
Try
' obtain location of local folder from variable
Dim dir As DirectoryInfo = New DirectoryInfo(Dts.Variables("ImportFolder").Value.ToString())
If dir.Exists Then
' Create the filename for local storage using
' the GUID from SharePoint as this will be unique.
Dim file As FileInfo = New FileInfo(dir.FullName & "\" & Dts.Variables("WorkbookGUID").Value.ToString() & Dts.Variables("Extension").Value.ToString())
If
Not file.Exists Then
' get the path of the file we need to download
Dim fileUrl As
String = Dts.Variables("EncodedAbsUrl").Value.ToString()
If fileUrl.Length <> 0 Then
' download the file from SharePoint or Archive file system to local folder
Dim client As
New WebClient()
If Left(fileUrl, 4).ToLower() = "http"
Then
'download the file from SharePoint
client.Credentials = System.Net.CredentialCache.DefaultCredentials
client.DownloadFile(fileUrl, file.FullName)
Else
' copy file from remote file system
System.IO.File.Copy(fileUrl, file.FullName)
End
If
Else
Throw
New ApplicationException("EncodedAbsUrl variable does not contain a value!")
End
If
End
If
Else
Throw
New ApplicationException("ImportFolder does not exist!")
End
If
Catch ex As Exception
Dts.Events.FireError(0, String.Empty, ex.Message, String.Empty, 0)
taskResult = Dts.Results.Failure
End
Try
Dts.TaskResult = taskResult
End
Sub
End
Class