Simon Sabin UK SQL Consultant's Blog

I did a presentation at DDD8 on the entity framework and how to stop your DBA from having a heart attack. You can find my demos and slide deck here http://sqlblogcasts.com/blogs/simons/archive/2010/01/30/Entity-Framework-how-to-stop-your-DBA-having-a-heart-attack.aspx

Whilst at DDD Mike Ormond interviewed me about my view on ORMs and the battel between DBAs and Devs. To see what I said go tohttp://bit.ly/bnf1By


Posted by simonsabin | with no comments

The XML Source in SSIS is great if you have a 1 to 1 mapping between entity and table. You can do more complex mapping but it becomes very messy and won't perform. What other options do you have?

The challenge with XML processing is to not need a huge amount of memory. I remember using the early versions of Biztalk with loaded the whole document into memory to map from one document type to another. This was fine for small documents but was an absolute killer for large documents.

You therefore need a streaming approach.

For flexibility however you want to be able to generate your rows easily, and if you've ever used the XmlReader you will know its ugly code to write.

That brings me on to LINQ. The is an implementation of LINQ over XML which is really nice. You can write nice LINQ queries instead of the XMLReader stuff. The downside is that by default LINQ to XML requires a whole XML document to work with. No streaming.

Your code would look like this. We create an XDocument and then enumerate over a set of annoymous types we generate from our LINQ statement

XDocument x = XDocument.Load("C:\\TEMP\\CustomerOrders-Attribute.xml");

 

foreach (var xdata in (from customer in x.Elements("OrderInterface").Elements("Customer")

                       from order in customer.Elements("Orders").Elements("Order")

                       select new { Account = customer.Attribute("AccountNumber").Value

                                  , OrderDate = order.Attribute("OrderDate").Value }

                       ))

{

    Output0Buffer.AddRow();

    Output0Buffer.AccountNumber = xdata.Account;

    Output0Buffer.OrderDate = Convert.ToDateTime(xdata.OrderDate);

}

As I said the downside to this is that you are loading the whole document into memory.

I did some googling and came across some helpful videos from a nice UK DPE Mike Taulty http://www.microsoft.com/uk/msdn/screencasts/screencast/289/LINQ-to-XML-Streaming-In-Large-Documents.aspx. Which show you how you can combine LINQ and the XmlReader to get a semi streaming approach. I took what he did and implemented it in SSIS. What I found odd was that when I ran it I got different numbers between using the streamed and non streamed versions. I found the cause was a little bug in Mikes code that causes the pointer in the XmlReader to progress past the start of the element and thus

foreach (var xdata in (from customer in StreamReader("C:\\TEMP\\CustomerOrders-Attribute.xml","Customer")

                               from order in customer.Elements("Orders").Elements("Order")

                               select new { Account = customer.Attribute("AccountNumber").Value

                                          , OrderDate = order.Attribute("OrderDate").Value }

                               ))

        {

            Output0Buffer.AddRow();

            Output0Buffer.AccountNumber = xdata.Account;

            Output0Buffer.OrderDate = Convert.ToDateTime(xdata.OrderDate);

        }

These look very similiar and they are the key element is the method we are calling, StreamReader. This method is what gives us streaming, what it does is return a IEnumerable list of elements, because of the way that LINQ works this results in the data being streamed in, it returns the elements one at a tiem rather than building a collection of them. The key is the use of the IEnumerable and the "yield return"

static IEnumerable<XElement> StreamReader(String filename, string elementName)

{

    using (XmlReader xr = XmlReader.Create(filename))

    {

        xr.MoveToContent();

        while (xr.Read()) //Reads the first element

        {

            while (xr.NodeType == XmlNodeType.Element && xr.Name == elementName)

            {

                XElement node = (XElement)XElement.ReadFrom(xr);

 

                yield return node;

            }

        }

        xr.Close();

    }

}

This code is specifically designed to return a list of the elements with a specific name. The first Read reads the root element and then the inner while loop checks to see if the current element is the type we want. If not we do the xr.Read() again until we find the element type we want. We then use the neat function XElement.ReadFrom to read an element and all its sub elements into an XElement. This is what is returned and can be consumed by the LINQ statement. Essentially once one element has been read we need to check if we are still on the same element type and name (the inner loop) This was Mikes mistake, if we called .Read again we would advance the XmlReader beyond the start of the Element and so the ReadFrom method wouldn't work.

To get this working,

1. Put a script component in your data flow as a source component.

2. Then add the columns you wish to the output.

3. Add the StreamReader function to your script code

4. Put the first foreach in the CreateNewOutputRows method

4. Change the LINQ query etc to match what you want

With the code above you can use what ever LINQ statement you like to flatten your XML into the rowsets you want. You could even have multiple outputs and generate your own surrogate keys.


Don't forget to register for the SQL Server 2008 R2 Launch event in London on the 15th April.

http://www.microsoft.com/uk/techdays/itprodaythursday.aspx

Why not make a long weekend of it and do the launch and SQLBits (www.sqlbits.com ) followed by a few days sightseeing in London.

We will be opening registration for SQLBits next week but in the mean time get registering for the launch day, from what I understand there are only a few places left.

http://www.microsoft.com/uk/techdays/itprodaythursday.aspx


Posted by simonsabin | 1 comment(s)

I've had a chance to look at the results directly and it is clear that there is a tough choice.

On the one hand people are saying that they prefer to have PASS put their money into chapters and things like 24hrs of PASS rather than an event on the east coast. Whilst at the same time almost 50% more people said they would be more likely to attend an East Coast event than a Seattle event, and 60% more said they would be more likely to attend a US Central region event. What’s more 60% said that the summit should be outside of Seattle every other year with only 19% saying it should always stay in Seattle.

So clearly there is a huge desire for a non Seattle event.

Looking at the other reasons for keeping in Seattle and the big one being that people want Microsoft speakers. More people think it’s somewhat important of very important that the conference is in walking distance of the hotels and restaurants. Essentially the Q6 questions show an even balance for normal conference, highlighting that they are prepared to travel, not with the family and they want a well laid out conference.

What’s very annoying is that the questions, as people have commented, were biased towards certain answers. For instance there was no option about whether people feel it’s important to have industry leading speakers, MVPs etc at the conference. Only questions about Microsoft speakers. I know survey writing is very difficult to avoid biasing the answers one way or another. There was also no choice to show peoples preference, would people prefer Microsoft speakers or the summit to be held on the East Coast/Central US. I also find it amazing that people prefer hundreds of developers rather than the SQLCAT and CSS teams, surely that indicates another issue about a lack of understanding of what the these teams do.

All in all it is clear that people showed they want an event outside of Seattle and don't want PASS to be putting money into that instead of into other community activities. I find it surprising that there appears to have been a huge weighting against certain questions which have prioritised them over the huge desire for a PASS summit outside of Seattle.

Let’s see where we will be in 2013 or maybe they will rethink 2012 who knows.


Posted by simonsabin | with no comments

Due to the overwhelming number of session submissions we have increased the number. So we now have 24 sessions plus 4 sessions from sponsors planned. To do this we are shortening each session to 50 minutes.

I hope this goes down well, I guess we'll see when we get the feedback in.


Posted by simonsabin | with no comments

I've just received the PASS connector newsletter with details of the decision of where to locate the next summits.

To me it shows you how you can spin statistics they way you want to.

I am very surprised by the numbers, 81% said that they would like an east coast event. The spin is that "When we look at responses from only 2008 and 2009 Summit attendees (our most successful ones by far), the number who want a future Summit outside of Seattle drops to 69%." Wow the number drops that’s bad. We must stay in Seattle then.

My take however is that 69% of those willing to travel to Seattle want an event NOT in Seattle. Doesn't that suggest that people would like an event not to be in Seattle?

I appreciate the comment about the launch years however 2012 isn't going to be a launch year because they need to get SQL out of the door before then to meet the 3 year software assurance cycle. So sticking 2012 in Seattle is IMHO a bad decision.

It might keep the cost down for the PASS org, Microsoft and the attendees that live near Seattle, but does it keep the cost down for the rest of the SQL Community?

They do mention a possibility of an event on the East Coast and I do hope that it’s not just lip service. Flights for me to the East coast look ~30% cheaper than to Seattle and I've never been to the East Coast. Come on PASS show you are listening to the community.


Posted by simonsabin | 1 comment(s)

David's posted a great post on shrinking the transaction log and log shipping. Log shipping and shrinking transaction logs

Unlike shrinking the data file shrinking the transaction log isn't a bad thing, IF you don't need the log to be that size.

I've seen many systems that shrink the log because it has grown only for it to grow the next day to the same size becuase of an overnight operation.

To reduce the growth of the transaction log you need to do one or more of the following,

1.Back it up more frequently
2.Change to simple recovery model
3.Use minimally logged operations
4.Keep transactions short and small
5.Break large transactions into smaller transactions
6.If using replication ensure that your backup of the replication topology is frequent enough


Posted by simonsabin | with no comments

I just got this from fello SQL MVP Chris Testa O'Neil

 

"I am pleased to announce the release of the Author Model eCourseCollection 6233 AE: Implementing and Maintaining Business Intelligence in Microsoft® SQL Server® 2008: Integration Services, Reporting Services and Analysis Services

This 24-hour collection provides you with the skills and knowledge required for implementing and maintaining business intelligence solutions on SQL Server 2008. You will learn about the SQL Server technologies, such as Integration Services, Analysis Services, and Reporting Services.

This collection also helps students to prepare for Exam 70-448 and can be accessed from: http://www.microsoft.com/learning/elearning/course/6233.mspx 

 


Posted by simonsabin | with no comments

It appears the video for my session on when query plans go wrong was not working.

This has now been fixed.

You can view the video here http://sqlbits.com/Agenda/event5/When_a_query_plan_goes_wrong/default.aspx


Posted by simonsabin | with no comments

Something mildly amusing for a monday morning.

The Deprecation Event Class uses the ntext data type which is a deprecated feature.

To have a look yourself go to http://msdn.microsoft.com/en-us/library/ms178053.aspx

Yeh I know its profiler that is using ntext, still made me smile.


Posted by simonsabin | 1 comment(s)

We will be deciding on the sessions tomorrow (Tuesday 8th March) so make sure you get your session in for SQLBits quick.

Don't forget we are focussing on performance an scalability so make sure your session covers one or both of these.

To submit your session

  • Step 1 - Complete your Speaker Profile
  • Step 2 - Submit My Sessions

  • Posted by simonsabin | 1 comment(s)

    Often you have the need to archive data from a table.

    This leads to a number of challenges

    1. How can you do it without impacting users

    2. How can I make it transactionally consistent, i.e. the data I put in the archive is the data I remove from the main table

    3. How can I get it to perform well

    Points 1 is very much tied to point 3. If it doesn't perform well then the delete of data is going to cause lots of locks and thus potentially blocking.

    For points 1 and 3 refer to my previous posts DELETE-TOP-x-rows-avoiding-a-table-scan and UPDATE-and-DELETE-TOP-and-ORDER-BY---Part2. In essence you need to be removing small chunks of data from your table and you want to do that avoiding a table scan.

    So that deals with the delete approach but archiving is about inserting that data somewhere else.

    Well in SQL 2008 they introduced a new feature INSERT over DML (Data Manipulation Language, i.e. SQL statements that change data), or composable DML. The ability to nest DML statements within themselves, so you can past the results of an insert to an update to a merge. I've mentioned this before here SQL-Server-2008---MERGE-and-optimistic-concurrency. This feature is currently limited to being able to consume the results of a DML statement in an INSERT statement. There are many restrictions which you can find here http://msdn.microsoft.com/en-us/library/ms177564.aspx look for the section "Inserting Data Returned From an OUTPUT Clause Into a Table"

    Even with the restrictions what we can do is consume the OUTPUT from a DELETE and INSERT the results into a table in another database. Note that in BOL it refers to not being able to use a remote table, remote means a table on another SQL instance.

    To show this working use this SQL to setup two databases foo and fooArchive

    create database foo

    go

    --create the source table fred in database foo

    select * into foo..fred from sys.objects

    go

    create database fooArchive

    go

    if object_id('fredarchive',DB_ID('fooArchive')) is null

    begin

        select getdate() ArchiveDate,* into fooArchive..FredArchive from sys.objects where 1=2

     

        end

    go

    And then we can use this simple statement to archive the data

    insert into fooArchive..FredArchive

    select getdate(),d.*

    from (delete top (1)

            from foo..Fred

            output deleted.*) d

            go

    In this statement the delete can be any delete statement you wish so if you are deleting by ids or a range of values then you can do that. Refer to the DELETE-TOP-x-rows-avoiding-a-table-scan post to ensure that your delete is going to perform. The last thing you want to do is to perform 100 deletes each with 5000 records for each of those deletes to do a table scan.

    For a solution that works for SQL2005 or if you want to archive to a different server then you can use linked servers or SSIS. This example shows how to do it with linked servers. [ONARC-LAP03] is the source server.

    begin transaction

    insert into fooArchive..FredArchive

    select getdate(),d.*

    from openquery ([ONARC-LAP03],'delete top (1)

                        from foo..Fred

                        output deleted.*') d

    commit transaction

    and to prove the transactions work try, you should get the same number of records before and after.

    select (select count(1) from foo..Fred) fred

           ,(select COUNT(1) from fooArchive..FredArchive ) fredarchive

     

    begin transaction

    insert into fooArchive..FredArchive

    select getdate(),d.*

    from openquery ([ONARC-LAP03],'delete top (1)

                        from foo..Fred

                        output deleted.*') d

    rollback transaction

     

    select (select count(1) from foo..Fred) fred

           ,(select COUNT(1) from fooArchive..FredArchive ) fredarchive

    The transactions are very important with this solution. Look what happens when you don't have transactions and an error occurs

     

    select (select count(1) from foo..Fred) fred

           ,(select COUNT(1) from fooArchive..FredArchive ) fredarchive

     

    insert into fooArchive..FredArchive

    select getdate(),d.*

    from openquery ([ONARC-LAP03],'delete top (1)

                        from foo..Fred

                        output deleted.*

                        raiserror (''Oh doo doo'',15,15)') d

                       

    select (select count(1) from foo..Fred) fred

           ,(select COUNT(1) from fooArchive..FredArchive ) fredarchive

    Before running this think what the result would be. I got it wrong.

    What seems to happen is that the remote query is executed as a transaction, the error causes that to rollback. However the results have already been sent to the client and so get inserted into the


    Posted by simonsabin | 1 comment(s)

    Whenever I teach about SQL Server performance tuning I try can get across the message that there is no such thing as a table. Does that sound odd, well it isn't, trust me. Rather than tables you need to consider structures. You have

    1. Heaps

    2. Indexes (b-trees)

    Some people split indexes in two, clustered and non-clustered, this I feel confuses the situation as people associate clustered indexes with sorting, but don't associate non clustered indexes with sorting, this is wrong. Clustered and non-clustered indexes are the same b-tree structure(and even more so with SQL 2005) with the leaf pages sorted in a linked list according to the keys of the index.. The difference is that non clustered indexes include in their structure either, the clustered key(s), or the row identifier for the row in the table (see http://sqlblog.com/blogs/kalen_delaney/archive/2008/03/16/nonclustered-index-keys.aspx for more details). Beyond that they are the same, they have key columns which are stored on the root and intermediary pages, and included columns which are on the leaf level.

    The reason this is important is that this is how the optimiser sees the world, this means it can use any of these structures to resolve your query. Even if your query only accesses one table, the optimiser can access multiple structures to get your results. One commonly sees this with a non-clustered index scan and then a key lookup (clustered index seek), but importantly it's not restricted to just using one non-clustered index and the clustered index or heap, and that's the challenge for the weekend.

    So the challenge for the weekend is to produce the most complex single table query.

    For those clever bods amongst you that are thinking, great I will just use lots of xquery functions, sorry these are the rules.

    1. You have to use a table from AdventureWorks (2005 or 2008)

    2. You can add whatever indexes you like, but you must document these

    3. You cannot use XQuery, Spatial, HierarchyId, Full Text or any open rowset function.

    4. You can only reference your table once, i..e a FROM clause with ONE table and no JOINs

    5. No Sub queries.

    The aim of this is to show how the optimiser can use multiple structures to build the results of a query and to also highlight why the optimiser is doing that. How many structures can you get the optimiser to use?

    As an example create these two indexes on AdventureWorks2008

    create index IX_Person_Person on Person.Person (lastName, FirstName,NameStyle,PersonType)

    create index IX_Person_Person on Person.Person(BusinessentityId,ModifiedDate)with drop_existing

      

    select lastName, ModifiedDate

      from Person.Person

     where LastName = 'Smith'

    You will see that the optimiser has decided to not access the underlying clustered index of the table but to use two indexes above to resolve the query. This highlights how the optimiser considers all storage structures, clustered indexes, non clustered indexes and heaps when trying to resolve a query.

    Multi index query plan

    So are you up to the challenge for the weekend to produce the most complex single table query?

    The prize is a pdf version of a popular SQL Server book, or a physical book if you live in the UK.

     


    If you get the following error when trying to write an expression there is an easy solution

    Attempt to parse the expression "@[User::FilePath] + "\" + @[User::FileName] + ".raw"" failed.  The token "." at line number "1", character number "<some position>" was not recognized. The expression cannot be parsed because it contains invalid elements at the location specified.

    The SSIS expression language is a C based language and the \ is a token, this means you have to escape it with another one. i.e "\" becomes "\\", unlike C# you can't prefix the string with a @, you have to use the escaping route.

    In summary when ever you want to use \ you need to use two \\


    We've got over 50 people registered for the SQLSocial event on 16th March with Itzik Ben-Gan, Greg Low, Davide Mauri and Bill Vaughn

    I need to finalise numbers on early next week so if you want to come along please register asap, otherwise I can't promise that we'll have space for you.

    To register use he form on herehttp://sqlsocial.com/events.aspx.

    I look forward to hearing from you.


    Posted by simonsabin | 1 comment(s)
    More Posts Next page »