November 2007 - Posts - Sparks from the Anvil

November 2007 - Posts

Cleaning Address Data with SSIS using a Web Service - Explained

Jamie Thompson gave me some very helpful feedback on my recently published article over at SQLCentral.com called Cleaning Address Data with SSIS Using a Web Service and my previous blog entry on Calling SharePoint web services from the data flow. Jamie pointed out that I had assumed my audience would be fully familiar with the .NET framework and terms such as WSDL and Proxy Class need further explanation. So here is my attempt to clarify some issues he pointed out.

  • WSDL
    WSDL is short for Web Service Description Language and is the definition of what the web service provides and how it can be invoked. A WSDL document is an XML file which lists all the methods provided by the web service and their parameters. The WSDL document is returned by a web service when "?wsdl" is appended to the web service's URL (e.g. http://services.postcodeanywhere.co.uk/uk/lookup.asmx?wsdl).
  • Proxy Class
    A proxy class simplifies the interaction between your application and a web service. The .NET Framework provides a command-line utility called wsdl.exe which generates the code for a proxy class automatically. In Visual Studio, the Add Web Reference command automatically executes wsdl.exe and generated the proxy class for you. However, with SSIS you need to create a proxy class manually using a command-line such as this:
    e.g.
    wsdl /language:VB http://services.postcodeanywhere.co.uk/uk/lookup.asmx?wsdl /out:c:\PostcodeAnywhere.vb
  • GAC
    The Global Assembly Cache (GAC) is the central repository for sharing assemblies (DLLs). In order to register an assembly in the GAC, it must be strong named (i.e. must have a unique hash value, public key, locale and version number).

If you want to know more about the .NET Framework, then read O'Reilly's .NET Framework Essentials, 3rd Edition. 

 

Posted by DrJohn with 1 comment(s)
Filed under: , ,

SSIS: Getting data from Excel files using Aspose.Cells

Excel spreadsheets with complicated layouts really do not lend themselves to processing with the standard Microsoft OLE DB drivers. Indeed, extracting one or two cells from different worksheets can be positively painful! This is where the power and flexibility of Aspose.Cells comes into its own.

Aspose.Cells provides an API that looks very similar to the OLE automation API exposed by Excel in the Office InterOp assemblies. However, you do not get all the headaches normally associated with the InterOp assemblies (i.e. modal dialogs, memory leaks, multiple Excel instances etc. etc.). With Aspose.Cells you do not need Excel to be installed on your server. Instead you only need a single assembly (Aspose.Cells.dll ) and the accompanying license file (Aspose.Cells.lic). Simple!

Working with Aspose.Cells in SSIS

To work with Aspose.Cells, you need to use the Script Task or Script Component. My personal preference is to use the Script Component in its asynchronous transformation mode. This way I can feed in a list of Workbooks that I want the component to process on its input buffer and the component can write the data read from the Workbooks onto multiple output buffers (one buffer for each type of data).

As SSIS uses Visual Studio for Applications (VSA), you do not have the flexible "Add Reference" dialog found in Visual Studio. In order to get around this issue while you are developing, you need to copy the Aspose.Cells assembly and associated license file into your .NET folder, which is usually:

C:\WINDOWS\Microsoft.NET\Framework\v2.0.50727

Restart Visual Studio and you will see Aspose.Cells listed on VSA's "Add Reference" dialog. Now you can add the appropriate "imports" statement to the top of your module and start coding. Easy!

Later, when you move to production, you will need to register the Aspose.Cells assembly in the GAC by simply dropping the assembly into the following folder:

C:\WINDOWS\assembly

Dealing with the Aspose.Cells License file

Aspose.Cells has a license file which can either be in the same directory as the assembly, or in your own location. The easiest way to handle this in SSIS is to use a File Connection Manager which records the location of the license file. Then in code, you will need to add:

Dim license As Aspose.Cells.License = New Aspose.Cells.License()

license.SetLicense(Me.Connections.AsposeCellsLicense.ConnectionString)

 

Example Code

My sample code reads a list of workbooks from the input buffer and writes the extracted data to the output buffer.  The data flow is shown below along with the code from my asynchronous Script Component. Enjoy!


 

' Aspose.Cells.dll must be registered in the GAC for production

' AND the C:\WINNT\Microsoft.NET\Framework\v2.0.50727 for development

 

' standard libraries

Imports System

Imports System.Data

Imports System.Math

Imports Microsoft.SqlServer.Dts.Pipeline.Wrapper

Imports Microsoft.SqlServer.Dts.Runtime.Wrapper

' additional libraries

Imports System.IO

Imports System.Collections.Generic

Imports Aspose.Cells

 

Public Class ScriptMain

Inherits UserComponent

 

Public Overrides Sub WorkbookList_ProcessInputRow(ByVal Row As WorkbookListBuffer)

 

' Imports all data from the the Excel workbooks provided on the input

 

' Note that each workbook contains several worksheets.

' For efficiency, we open the workbook once and then

' process all the worksheets in the file

 

Try

' The previous step in the package downloaded the workbook.

Dim fInfo As FileInfo = New FileInfo(Path.Combine(Me.Variables.ImportFolder, Row.URL))

If fInfo.Exists Then

 

' Set Aspose.Cells license

Dim license As Aspose.Cells.License = New Aspose.Cells.License()

license.SetLicense(Me.Connections.AsposeCellsLicense.ConnectionString)

 

' Open the workbook

Dim wrkBook As Workbook = New Workbook()

wrkBook.Open(fInfo.FullName)

 

' process each worksheet

For Each wrkSheet As Worksheet In wrkBook.Worksheets

Try

' Check we have a valid worksheet by testing

' for text prompts in specific locations

If ValidWorksheetLayout(wrkSheet) Then

' process each worksheet found in the workbook

' that has the correct format

ImportWorksheetData(Row.WorkbookID, wrkSheet, Row)

End If

Catch exi As Exception

Me.ComponentMetaData.FireInformation(0, Me.ComponentMetaData.Name, "WorkbookList_ProcessInputRow: Invalid Worksheet Layout: " & wrkSheet.Name, String.Empty, 0, True)

End Try

Next

Else

Me.ComponentMetaData.FireError(0, Me.ComponentMetaData.Name, "WorkbookList_ProcessInputRow: Missing Excel File: " & fInfo.FullName, String.Empty, 0, True)

End If

Catch ex As Exception

Me.ComponentMetaData.FireError(0, Me.ComponentMetaData.Name, "WorkbookList_ProcessInputRow: " & ex.Message, String.Empty, 0, True)

End Try

End Sub

 

Private Function ValidWorksheetLayout(ByVal wrkSheet As Worksheet) As Boolean

' checks if data is in expected locations. If not, then return false.

If wrkSheet.Cells("B3").Value.ToString = "Profit and loss projections" And wrkSheet.Cells("B14").Value.ToString = "Revenue" Then

Return True

End If

Return False

End Function

 

Private Sub ImportWorksheetData(ByVal WorkbookID As Integer, ByVal wrkSheet As Worksheet, ByVal Row As WorkbookListBuffer)

' imports all data from the worksheet

Dim iRow As Integer

Dim iCol As Integer

Dim iCount As Integer

Dim dataRange As DataTable

Dim dr As DataTableReader

Dim HeaderRow As List(Of Integer)

Try

' heading row is in seperate location to data, so import heading first

' note that Aspose.Cells uses zero-based row and column numbers in ExportDataTable call

' so Cell=4,Row=12 is actually cell E13

Try

' read single header row into generic list

HeaderRow = New List(Of Integer)

dataRange = wrkSheet.Cells.ExportDataTable(12, 4, 1, 5)

dr = dataRange.CreateDataReader()

If dr.Read Then

For iCount = 0 To 4

HeaderRow.Add(GetInteger(dr, iCount))

Next

End If

 

Catch ex0 As Exception

Me.ComponentMetaData.FireError(0, Me.ComponentMetaData.Name, "ImportWorksheetData ex0: " & ex0.Message, String.Empty, 0, True)

End Try

 

Try

dataRange = wrkSheet.Cells.ExportDataTable(14, 2, 2, 7)

dr = dataRange.CreateDataReader()

While dr.Read

ProcessDataTable(WorkbookID, dr, 2, 6, HeaderRow)

End While

 

Catch ex1 As Exception

Me.ComponentMetaData.FireError(0, Me.ComponentMetaData.Name, "ImportWorksheetData ex1: " & ex1.Message, String.Empty, 0, True)

End Try

 

Try

dataRange = wrkSheet.Cells.ExportDataTable(19, 2, 2, 7)

dr = dataRange.CreateDataReader()

While dr.Read

ProcessDataTable(WorkbookID, dr, 2, 6, HeaderRow)

End While

Catch ex2 As Exception

Me.ComponentMetaData.FireError(0, Me.ComponentMetaData.Name, "ImportWorksheetData ex2: " & ex2.Message, String.Empty, 0, True)

End Try

 

Try

dataRange = wrkSheet.Cells.ExportDataTable(23, 2, 9, 7)

dr = dataRange.CreateDataReader()

While dr.Read

ProcessDataTable(WorkbookID, dr, 2, 6, HeaderRow)

End While

Catch ex3 As Exception

Me.ComponentMetaData.FireError(0, Me.ComponentMetaData.Name, "ImportWorksheetData ex3: " & ex3.Message, String.Empty, 0, True)

End Try

 

Catch exOuter As Exception

Me.ComponentMetaData.FireError(0, Me.ComponentMetaData.Name, "ImportWorksheetData: outer " & exOuter.Message, String.Empty, 0, True)

End Try

End Sub

 

Public Sub ProcessDataTable(ByVal WorkbookID As Integer, ByVal dr As DataTableReader, ByVal iStartDataCol As Integer, ByVal iDataColCount As Integer, ByVal HeaderRow As List(Of Integer))

Dim iCol As Integer

Dim DataValue As Decimal

Dim bAddRow As Boolean

Dim TimePeriodID As Integer

Dim OutputCount As Integer = 0

 

Try

For iCol = iStartDataCol To iDataColCount

bAddRow = False

Try

If Not dr.IsDBNull(iCol) Then

DataValue = GetDecimal(dr, iCol)

If DataValue <> 0 Then bAddRow = True

If bAddRow Then

With Me.WorksheetDataBuffer

.AddRow()

' get header from the row we read into a list earlier

.FinancialYear = HeaderRow(iCol - 2)

.WorkbookID = WorkbookID

.RowName = dr.GetString(0)

.DataValue = DataValue

End With

End If

End If

Catch ex1 As Exception

Me.ComponentMetaData.FireError(0, Me.ComponentMetaData.Name, "ProcessDataTable: Inner " & ex1.Message, String.Empty, 0, True)

End Try

Next

 

Catch ex As Exception

Me.ComponentMetaData.FireError(0, Me.ComponentMetaData.Name, "ProcessDataTable: Outer " & ex.Message, String.Empty, 0, True)

End Try

End Sub

 

Function GetDecimal(ByVal dr As DataTableReader, ByVal iCol As Integer) As Decimal

Dim DataValue As Decimal

Try

Select Case dr.GetDataTypeName(iCol)

Case "String"

If IsNumeric(dr.GetString(iCol)) Then

DataValue = CDec(dr.GetString(iCol))

End If

Case "Integer"

DataValue = CDec(dr.GetInt32(iCol))

Case "Double"

DataValue = CDec(dr.GetDouble(iCol))

Case "Decimal"

DataValue = dr.GetDecimal(iCol)

End Select

Catch ex As Exception

Me.ComponentMetaData.FireError(0, Me.ComponentMetaData.Name, "GetDecimal " & ex.Message, String.Empty, 0, True)

End Try

Return DataValue

End Function

 

Function GetInteger(ByVal dr As DataTableReader, ByVal iCol As Integer) As Integer

Dim DataValue As Integer

Try

Select Case dr.GetDataTypeName(iCol)

Case "String"

If IsNumeric(dr.GetString(iCol)) Then

DataValue = CInt(dr.GetString(iCol))

End If

Case "Integer"

DataValue = dr.GetInt32(iCol)

Case "Double"

DataValue = CInt(dr.GetDouble(iCol))

Case "Decimal"

DataValue = CInt(dr.GetDecimal(iCol))

End Select

Catch ex As Exception

Me.ComponentMetaData.FireError(0, Me.ComponentMetaData.Name, "GetInteger " & ex.Message, String.Empty, 0, True)

End Try

Return DataValue

End Function

End Class

Posted by DrJohn with no comments
Filed under: ,

Downloading Excel files from SharePoint using SSIS

In my previous post, I outlined how to obtain a recursive list of all Excel files held in a SharePoint document library. In SSIS, I use this code to upsert records into a table which identifies which files are new or modified. Clearly I could process the file in-situ, but that would over complicate my code with data streams which SSIS would not understand – especially if you are attempting to use the standard Microsoft.Jet.OLEDB.4.0 engine to process the file! So downloading the file to a local folder is the easiest option. Fortunately, I spotted Greg Enslow's post which pointed me in the direction of using the WebClient library.

In my control flow I obtain a list all the files I need to process and store it in an ADO.NET recordset. I then iterate over this list using a ForEach task, which sets various variables containing the URL, file extension and the document's name and it's GUID in SharePoint. My script task then downloads each file so it can be processed by subsequent tasks in the control flow.


 

I have expanded on Greg's original code to support file system locations as well as SharePoint URLs. Enjoy!

Imports System

Imports System.Data

Imports System.Math

Imports Microsoft.SqlServer.Dts.Runtime

Imports System.Net

Imports System.IO

 

Public Class ScriptMain

 

Public Sub Main()

' downloads the file from SharePoint or a file system location to a local folder

 

Dim taskResult As Integer = Dts.Results.Success

 

Try

' obtain location of local folder from variable

Dim dir As DirectoryInfo = New DirectoryInfo(Dts.Variables("ImportFolder").Value.ToString())

If dir.Exists Then

 

' Create the filename for local storage using

' the GUID from SharePoint as this will be unique.

Dim file As FileInfo = New FileInfo(dir.FullName & "\" & Dts.Variables("WorkbookGUID").Value.ToString() & Dts.Variables("Extension").Value.ToString())

If Not file.Exists Then

 

' get the path of the file we need to download

Dim fileUrl As String = Dts.Variables("EncodedAbsUrl").Value.ToString()

If fileUrl.Length <> 0 Then

' download the file from SharePoint or Archive file system to local folder

Dim client As New WebClient()

 

If Left(fileUrl, 4).ToLower() = "http" Then

'download the file from SharePoint

client.Credentials = System.Net.CredentialCache.DefaultCredentials

client.DownloadFile(fileUrl, file.FullName)

Else

' copy file from remote file system

System.IO.File.Copy(fileUrl, file.FullName)

End If

Else

Throw New ApplicationException("EncodedAbsUrl variable does not contain a value!")

End If

End If

Else

Throw New ApplicationException("ImportFolder does not exist!")

End If

Catch ex As Exception

 

Dts.Events.FireError(0, String.Empty, ex.Message, String.Empty, 0)

 

taskResult = Dts.Results.Failure

 

End Try

Dts.TaskResult = taskResult

End Sub

 

End Class

Posted by DrJohn with no comments
Filed under: ,

SSIS: Calling SharePoint web services from the data flow

In my previous post, I outlined how to call a SharePoint web service to obtain a list of all Excel files held in a document library.  However, that code was developed in Visual Studio where adding a reference to a web service is simplicity itself.  Doing the same in SSIS is more of a challenge as Visual Studio for Applications (VSA) does not have the Add Web Reference command familiar to Visual Studio developers.

To keep my script component simple, I decided to place it in the data flow so that I could pass a list of SharePoint sites on the input buffer and the component could output a list of Excel files on the output buffer.  If you are familiar with the Script Component you will know that it can be used in three forms: as a Source, a Destination and as a Transformation.  In its default form, the Transformation is synchronous (i.e. one line of output is written for every line of input).  As I wanted the script component to generate more rows on its output than it received, the first thing I needed to do was to switch it to asynchronous mode by changing the SynchronousInputID of the output buffer to zero.

To call the SharePoint web service from within my SSIS script component I needed to create a Visual Basic proxy class using wsdl.exe using the following command-line:

wsdl /language:VB http://yourhost/_vti_bin/lists.asmx?wsdl /out:SharePointList.vb

This proxy class was then imported into the Script Component by selecting "Add Existing Item..." from the Project menu.   I needed to delete the first few "garbage" characters and add references to the System.Xml and System.Web.Services .NET assemblies before the proxy class would compile.

Next I converted the C# code I developed in my console application (see previous post) to VB.NET using the excellent conversion routine from DeveloperFusion and pasted it into my script component.  After modifying the code to work with the input and output buffers the whole package worked fine.  Below is my final data flow.

Beware!  If you need to call two different SharePoint web services from within the same script component, you will need to edit the code generated by wsdl.exe and add a Namespace in order to avoid conflicts.


 

Posted by DrJohn with 3 comment(s)
Filed under: ,

Converting C# to VB.NET for use in SSIS

If you much prefer C# over VB.NET but are forced to use VB.NET as that is the only language supported by SSIS, you will love the tool I discovered.  Some cool guys over at DeveloperFusion have created an on-line C# to VB.NET convertor which makes the transition simplicity itself.  It also works the other way around. So if you want to convert ugly VB.NET into cool C#, this is your answer.  Easy!

Posted by DrJohn with 2 comment(s)
Filed under: ,

Getting a list of files from a MOSS document library using a SharePoint web service

My challenge was simple.  I needed to develop an SSIS package that would download and extract data from every Excel file held in document libraries across several SharePoint sites.  SSIS was the natural choice as the data needed to be cleaned and validated before being imported into a database. However, SSIS is not great with web services – especially in the data flow. As I not worked with the SharePoint web services much, I started with a good old Console application.
 
MOSS, or more accurately, WSS provides a whole host of web services to obtain information about SharePoint sites.  However, figuring out which method to invoke and what parameters to pass is more problematic.  Especially as many of the parameters are chunks of Collaborative Application Mark-up Language (CAML) – a dialect of XML developed by Microsoft specifically for use with SharePoint.
 
A False Start
My first console app simply obtained the GUID of the document library using the GetListCollection() method of the Lists web service.  The GUID was then passed to the GetListItems() method which duly provided all documents and folders at the top level of the document library.  It then seemed logical to me to recursively call the GetListItems() method using the GUID of each sub-folder.  On no, how wrong could I be!  The GetListItems() method simply chokes on these folder GUIDs.
 
On searching the internet I found many other incorrect forum posts and blog entries about the same topic – but no working solutions.  I also made an extensive search of my eBook collection – but again no solutions – which overall motivated me to write this blog entry.
 
The solution - RTFM
Well, if I had read the whole page in the manual, I would have got to the solution earlierCrying.  The key to my puzzle was the QueryOptions XML fragment which has both a Folder element and the all important <ViewAttributes Scope="Recursive" /> element. Using these elements together, it is possible to obtain a list of all documents in all subfolders in the list. Indeed, it does not even bother returning the subfolder details!
 
So here is the code for my working C# sample.
 

using System;

using System.Collections.Generic;

using System.Text;

using System.Xml;

using System.Web.Services;

using System.Web;

using System.Net;

 

namespace ConsoleApplication1

{
    class Program
    {

        static void Main(string[] args)
        {

string siteUrl = @"http://yourserver/sites/yoursite";

string documentLibraryName = @"Shared Documents";

SharePointList.Lists wsList = new SharePointList.Lists();

wsList.Credentials = System.Net.CredentialCache.DefaultCredentials;

WebProxy proxyObj = new WebProxy("yourproxy", 80);

wsList.Proxy = proxyObj;

wsList.Url = siteUrl + @"/_vti_bin/lists.asmx";

 

// get a list of all top level lists

XmlNode allLists = wsList.GetListCollection();

// load into an XML document so we can use XPath to query content

XmlDocument allListsDoc = new XmlDocument();

allListsDoc.LoadXml(allLists.OuterXml);

// allListsDoc.Save(@"c:\allListsDoc.xml"); // for debug

XmlNamespaceManager ns = new XmlNamespaceManager(allListsDoc.NameTable);

ns.AddNamespace("d", allLists.NamespaceURI);

 

// now get the GUID of the document library we are looking for

XmlNode dlNode = allListsDoc.SelectSingleNode("/d:Lists/d:List[@Title='" + documentLibraryName + "']", ns);

if (dlNode == null)

{

Console.WriteLine("Document Library '{0}' not found!", documentLibraryName);

}

else

{

// obtain the GUID for the document library and the webID

string documentLibraryGUID = dlNode.Attributes["ID"].Value;

string webId = dlNode.Attributes["WebId"].Value;

 

Console.WriteLine("Opening folder '{0}' GUID={1}", documentLibraryName, documentLibraryGUID);

 

// create ViewFields CAML

XmlDocument viewFieldsDoc = new XmlDocument();

XmlNode ViewFields = AddXmlElement(viewFieldsDoc, "ViewFields", "");

AddFieldRef(ViewFields, "GUID");

AddFieldRef(ViewFields, "ContentType");

AddFieldRef(ViewFields, "BaseName");

AddFieldRef(ViewFields, "Modified");

AddFieldRef(ViewFields, "EncodedAbsUrl");

//viewFieldsDoc.Save(@"c:\viewFields.xml"); // for debug

 

// create QueryOptions CAML

XmlDocument queryOptionsDoc = new XmlDocument();

XmlNode QueryOptions = AddXmlElement(queryOptionsDoc, "QueryOptions", "");

AddXmlElement(QueryOptions, "Folder", documentLibraryName);

AddXmlElement(QueryOptions, "IncludeMandatoryColumns", "FALSE");

// this element is the key to getting the full recusive list

XmlNode node = AddXmlElement(QueryOptions, "ViewAttributes", "");

AddXmlAttribute(node, "Scope", "Recursive");

queryOptionsDoc.Save(@"c:\queryOptions.xml"); // for debug

 

// obtain the list of items in the document library

XmlNode listContent = wsList.GetListItems(documentLibraryGUID, null, null, ViewFields, null, QueryOptions, webId);

 

XmlDocument xmlResultsDoc = new XmlDocument();

xmlResultsDoc.LoadXml(listContent.OuterXml);

ns = new XmlNamespaceManager(xmlResultsDoc.NameTable);

ns.AddNamespace("z", "#RowsetSchema");

// xmlResultsDoc.Save(@"c:\listContent.xml"); // for debug

 

XmlNodeList rows = xmlResultsDoc.SelectNodes("//z:row", ns);

if (rows.Count == 0)

{

Console.WriteLine("No content found");

}

foreach (XmlNode row in rows)

{

Console.WriteLine(row.Attributes["ows_ContentType"].Value + " " + row.Attributes["ows_GUID"].Value + " :: " + row.Attributes["ows_BaseName"].Value);

}

}

Console.WriteLine("Done");

Console.Read();

}

 

public static XmlNode AddXmlElement(XmlNode parent, string elementName, string elementValue)

{

XmlNode element = parent.AppendChild(parent.OwnerDocument.CreateNode(XmlNodeType.Element, elementName, ""));

if (elementValue != "")

element.InnerText = elementValue;

return (element);

}

 

public static XmlNode AddXmlElement(XmlDocument parent, string elementName, string elementValue)

{

XmlNode element = parent.AppendChild(parent.CreateNode(XmlNodeType.Element, elementName, ""));

if (elementValue != "")

element.InnerText = elementValue;

return (element);

}

 

public static XmlNode AddXmlAttribute(XmlNode element, string attrName, string attrValue)

{

XmlNode attr = element.Attributes.Append((XmlAttribute)element.OwnerDocument.CreateNode(XmlNodeType.Attribute, attrName, ""));

if (attrValue != "")

attr.Value = attrValue;

return (attr);

}

 

public static void AddFieldRef(XmlNode viewFields, string fieldName)

{

    XmlNode fieldRef = AddXmlElement(viewFields, "FieldRef", "");

    AddXmlAttribute(fieldRef, "Name", fieldName);

}

}

}


Posted by DrJohn with 8 comment(s)
Filed under: ,