SSIS - XML Source Script

The XML Source in SSIS is great if you have a 1 to 1 mapping between entity and table. You can do more complex mapping but it becomes very messy and won't perform. What other options do you have?

The challenge with XML processing is to not need a huge amount of memory. I remember using the early versions of Biztalk with loaded the whole document into memory to map from one document type to another. This was fine for small documents but was an absolute killer for large documents.

You therefore need a streaming approach.

For flexibility however you want to be able to generate your rows easily, and if you've ever used the XmlReader you will know its ugly code to write.

That brings me on to LINQ. The is an implementation of LINQ over XML which is really nice. You can write nice LINQ queries instead of the XMLReader stuff. The downside is that by default LINQ to XML requires a whole XML document to work with. No streaming.

Your code would look like this. We create an XDocument and then enumerate over a set of annoymous types we generate from our LINQ statement

XDocument x = XDocument.Load("C:\\TEMP\\CustomerOrders-Attribute.xml");

 

foreach (var xdata in (from customer in x.Elements("OrderInterface").Elements("Customer")

                       from order in customer.Elements("Orders").Elements("Order")

                       select new { Account = customer.Attribute("AccountNumber").Value

                                  , OrderDate = order.Attribute("OrderDate").Value }

                       ))

{

    Output0Buffer.AddRow();

    Output0Buffer.AccountNumber = xdata.Account;

    Output0Buffer.OrderDate = Convert.ToDateTime(xdata.OrderDate);

}

As I said the downside to this is that you are loading the whole document into memory.

I did some googling and came across some helpful videos from a nice UK DPE Mike Taulty http://www.microsoft.com/uk/msdn/screencasts/screencast/289/LINQ-to-XML-Streaming-In-Large-Documents.aspx. Which show you how you can combine LINQ and the XmlReader to get a semi streaming approach. I took what he did and implemented it in SSIS. What I found odd was that when I ran it I got different numbers between using the streamed and non streamed versions. I found the cause was a little bug in Mikes code that causes the pointer in the XmlReader to progress past the start of the element and thus

foreach (var xdata in (from customer in StreamReader("C:\\TEMP\\CustomerOrders-Attribute.xml","Customer")

                               from order in customer.Elements("Orders").Elements("Order")

                               select new { Account = customer.Attribute("AccountNumber").Value

                                          , OrderDate = order.Attribute("OrderDate").Value }

                               ))

        {

            Output0Buffer.AddRow();

            Output0Buffer.AccountNumber = xdata.Account;

            Output0Buffer.OrderDate = Convert.ToDateTime(xdata.OrderDate);

        }

These look very similiar and they are the key element is the method we are calling, StreamReader. This method is what gives us streaming, what it does is return a IEnumerable list of elements, because of the way that LINQ works this results in the data being streamed in, it returns the elements one at a tiem rather than building a collection of them. The key is the use of the IEnumerable and the "yield return"

static IEnumerable<XElement> StreamReader(String filename, string elementName)

{

    using (XmlReader xr = XmlReader.Create(filename))

    {

        xr.MoveToContent();

        while (xr.Read()) //Reads the first element

        {

            while (xr.NodeType == XmlNodeType.Element && xr.Name == elementName)

            {

                XElement node = (XElement)XElement.ReadFrom(xr);

 

                yield return node;

            }

        }

        xr.Close();

    }

}

This code is specifically designed to return a list of the elements with a specific name. The first Read reads the root element and then the inner while loop checks to see if the current element is the type we want. If not we do the xr.Read() again until we find the element type we want. We then use the neat function XElement.ReadFrom to read an element and all its sub elements into an XElement. This is what is returned and can be consumed by the LINQ statement. Essentially once one element has been read we need to check if we are still on the same element type and name (the inner loop) This was Mikes mistake, if we called .Read again we would advance the XmlReader beyond the start of the Element and so the ReadFrom method wouldn't work.

To get this working,

1. Put a script component in your data flow as a source component.

2. Then add the columns you wish to the output.

3. Add the StreamReader function to your script code

4. Put the first foreach in the CreateNewOutputRows method

4. Change the LINQ query etc to match what you want

With the code above you can use what ever LINQ statement you like to flatten your XML into the rowsets you want. You could even have multiple outputs and generate your own surrogate keys.


Published Thursday, March 11, 2010 8:20 PM by simonsabin

Comments

# Help with blogger/blogspot template? it says I need to make changes to the html? | Hot Trending Topics

Pingback from  Help with blogger/blogspot template? it says I need to make changes to the html? | Hot Trending Topics

Wednesday, October 26, 2011 9:02 PM by unclebiguns

# re: SSIS - XML Source Script

Simon,

Trying to implement this in a Script Component as t Transformation Type as I have an XML Column that I need to shred and I'm getting this error:

"The body of 'ScriptMain.StreamReader(string, string)' cannot be an iterator block because 'IEnumerable<System.Xml.Linq.XElement>' is not an iterator interface type"

My Binoogle skills have failed me so I'm wondering if you have any ideas?