I did a presentation at DDD8 on the entity framework and how to stop your DBA
from having a heart attack. You can find my demos and slide deck here http://sqlblogcasts.com/blogs/simons/archive/2010/01/30/Entity-Framework-how-to-stop-your-DBA-having-a-heart-attack.aspx
Whilst at DDD Mike Ormond
interviewed me about my view on ORMs and the battel between DBAs and Devs. To
see what I said go tohttp://bit.ly/bnf1By
The XML Source in SSIS is great if you have a 1 to 1
mapping between entity and table. You can do more complex mapping but it becomes
very messy and won't perform. What other options do you have?
The challenge with XML processing is to not need a huge amount of memory. I
remember using the early versions of Biztalk with loaded the whole document into
memory to map from one document type to another. This was fine for small
documents but was an absolute killer for large documents.
You therefore need a streaming approach.
For flexibility however you want to be able to generate your rows easily, and
if you've ever used the XmlReader you will know its ugly code to write.
That brings me on to LINQ. The is an implementation of LINQ over XML which is
really nice. You can write nice LINQ queries instead of the XMLReader stuff. The
downside is that by default LINQ to XML requires a whole XML document to work
with. No streaming.
Your code would look like this. We create an XDocument and then enumerate
over a set of annoymous types we generate from our LINQ statement
XDocument x =
XDocument.Load("C:\\TEMP\\CustomerOrders-Attribute.xml");
foreach (var xdata in (from customer in
x.Elements("OrderInterface").Elements("Customer")
from order in customer.Elements("Orders").Elements("Order")
select new { Account = customer.Attribute("AccountNumber").Value
,
OrderDate = order.Attribute("OrderDate").Value }
))
{
Output0Buffer.AddRow();
Output0Buffer.AccountNumber =
xdata.Account;
Output0Buffer.OrderDate =
Convert.ToDateTime(xdata.OrderDate);
}
As I said the downside to this is that you are loading the whole document
into memory.
I did some googling and came across some helpful videos from a nice UK DPE
Mike Taulty http://www.microsoft.com/uk/msdn/screencasts/screencast/289/LINQ-to-XML-Streaming-In-Large-Documents.aspx.
Which show you how you can combine LINQ and the XmlReader to get a semi
streaming approach. I took what he did and implemented it in SSIS. What I found
odd was that when I ran it I got different numbers between using the
streamed and non streamed versions. I found the cause was a little bug in Mikes
code that causes the pointer in the XmlReader to progress past the start of the
element and thus
foreach (var xdata in (from customer in
StreamReader("C:\\TEMP\\CustomerOrders-Attribute.xml","Customer")
from order in customer.Elements("Orders").Elements("Order")
select new { Account = customer.Attribute("AccountNumber").Value
, OrderDate = order.Attribute("OrderDate").Value }
))
{
Output0Buffer.AddRow();
Output0Buffer.AccountNumber = xdata.Account;
Output0Buffer.OrderDate = Convert.ToDateTime(xdata.OrderDate);
}
These look very similiar and they are the key element is
the method we are calling, StreamReader. This method is what gives us streaming,
what it does is return a IEnumerable list of elements, because of the way that LINQ works this
results in the data being streamed in, it returns the elements one at
a tiem rather than building a collection of them. The key is the use
of the IEnumerable and the "yield return"
static IEnumerable<XElement> StreamReader(String filename, string elementName)
{
using (XmlReader
xr = XmlReader.Create(filename))
{
xr.MoveToContent();
while (xr.Read()) //Reads the first
element
{
while (xr.NodeType == XmlNodeType.Element && xr.Name ==
elementName)
{
XElement node = (XElement)XElement.ReadFrom(xr);
yield return node;
}
}
xr.Close();
}
}
This code is specifically designed to return a list of the elements with a
specific name. The first Read reads the root element and then the inner while
loop checks to see if the current element is the type we want. If not we do the
xr.Read() again until we find the element type we want. We then use the neat
function XElement.ReadFrom to read an element and all its sub elements into an
XElement. This is what is returned and can be consumed by the LINQ statement.
Essentially once one element has been read we need to check if we are still on
the same element type and name (the inner loop) This was Mikes mistake, if we
called .Read again we would advance the XmlReader beyond the start of the
Element and so the ReadFrom method wouldn't work.
To get this working,
1. Put a script component in your data flow as a source component.
2. Then add the columns you wish to the output.
3. Add the StreamReader function
to your script code
4. Put the first foreach in the CreateNewOutputRows method
4. Change the LINQ query etc to match what you want
With the code above you can use what ever LINQ statement you like to flatten
your XML into the rowsets you want. You could even have multiple outputs and
generate your own surrogate keys.
Don't forget to register for the SQL Server 2008 R2 Launch
event in London on the 15th April.
http://www.microsoft.com/uk/techdays/itprodaythursday.aspx
Why not make a long weekend of it and do the launch and SQLBits (www.sqlbits.com ) followed by a few days
sightseeing in London.
We will be opening registration for SQLBits next week but in the mean time
get registering for the launch day, from what I understand there are only a few
places left.
http://www.microsoft.com/uk/techdays/itprodaythursday.aspx
I've had a chance to look at the results directly and it is clear that there
is a tough choice.
On the one hand people are saying that they prefer to have PASS put their
money into chapters and things like 24hrs of PASS rather than an event on the
east coast. Whilst at the same time almost 50% more people said they would be
more likely to attend an East Coast event than a Seattle event, and 60% more
said they would be more likely to attend a US Central region event. What’s more
60% said that the summit should be outside of Seattle every other year with only
19% saying it should always stay in Seattle.
So clearly there is a huge desire for a non Seattle event.
Looking at the other reasons for keeping in Seattle and the big one being
that people want Microsoft speakers. More people think it’s somewhat important
of very important that the conference is in walking distance of the hotels and
restaurants. Essentially the Q6 questions show an even balance for normal
conference, highlighting that they are prepared to travel, not with the family
and they want a well laid out conference.
What’s very annoying is that the questions, as people have commented, were
biased towards certain answers. For instance there was no option about whether
people feel it’s important to have industry leading speakers, MVPs etc at the
conference. Only questions about Microsoft speakers. I know survey writing
is very difficult to avoid biasing the answers one way or another. There
was also no choice to show peoples preference, would people
prefer Microsoft speakers or the summit to be held on the East
Coast/Central US. I also find it amazing that people prefer hundreds of
developers rather than the SQLCAT and CSS teams, surely that indicates another
issue about a lack of understanding of what the these teams do.
All in all it is clear that people showed they want an event outside of
Seattle and don't want PASS to be putting money into that instead of into other
community activities. I find it surprising that there appears to have been a
huge weighting against certain questions which have prioritised them over the
huge desire for a PASS summit outside of Seattle.
Let’s see where we will be in 2013 or maybe they will rethink 2012 who
knows.
Due to the overwhelming number of session submissions we
have increased the number. So we now have 24 sessions plus 4 sessions from
sponsors planned. To do this we are shortening each session to 50 minutes.
I hope this goes down well, I guess we'll see when we get the feedback
in.
I've just received the PASS connector newsletter with details of the decision
of where to locate the next summits.
To me it shows you how you can spin statistics they way you want to.
I am very surprised by the numbers, 81% said that they would like an east
coast event. The spin is that "When we look at responses from only 2008 and 2009
Summit attendees (our most successful ones by far), the number who want a future
Summit outside of Seattle drops to 69%." Wow the number drops that’s bad. We
must stay in Seattle then.
My take however is that 69% of those willing to travel to Seattle want an
event NOT in Seattle. Doesn't that suggest that people would like an event not
to be in Seattle?
I appreciate the comment about the launch years however 2012 isn't going to
be a launch year because they need to get SQL out of the door before then to
meet the 3 year software assurance cycle. So sticking 2012 in Seattle is IMHO a
bad decision.
It might keep the cost down for the PASS org, Microsoft and the attendees
that live near Seattle, but does it keep the cost down for the rest of the SQL
Community?
They do mention a possibility of an event on the East Coast and I do hope
that it’s not just lip service. Flights for me to the East coast look ~30%
cheaper than to Seattle and I've never been to the East Coast. Come on PASS show
you are listening to the community.
David's posted a great post on shrinking the transaction
log and log shipping. Log shipping and shrinking transaction logs
Unlike shrinking the data file shrinking the transaction log isn't a bad
thing, IF you don't need the log to be that size.
I've seen many systems that shrink the log because it has grown only for it
to grow the next day to the same size becuase of an overnight operation.
To reduce the growth of the transaction log you need to do one or more of the
following,
1.Back it up more frequently
2.Change to simple recovery model
3.Use
minimally logged operations
4.Keep transactions short and small
5.Break
large transactions into smaller transactions
6.If using replication ensure
that your backup of the replication topology is frequent
enough
I just got
this from fello SQL MVP Chris Testa O'Neil
"I
am pleased to announce the release of the Author Model
eCourseCollection 6233 AE: Implementing and Maintaining Business Intelligence in
Microsoft® SQL Server® 2008: Integration Services, Reporting Services and
Analysis Services
This 24-hour
collection provides you with the skills and knowledge required for implementing
and maintaining business intelligence solutions on SQL Server 2008. You will
learn about the SQL Server technologies, such as Integration Services, Analysis
Services, and Reporting Services.
This
collection also helps students to prepare for Exam 70-448 and can be accessed
from: http://www.microsoft.com/learning/elearning/course/6233.mspx
It appears the video for my session on when query plans
go wrong was not working.
This has now been fixed.
You can view the video here http://sqlbits.com/Agenda/event5/When_a_query_plan_goes_wrong/default.aspx
Something mildly amusing for a monday morning.
The Deprecation Event Class uses the ntext data type which is a deprecated
feature.
To have a look yourself go to http://msdn.microsoft.com/en-us/library/ms178053.aspx
Yeh I know its profiler that is using ntext, still made me
smile.
We will be deciding on the sessions tomorrow (Tuesday
8th March) so make sure you get your session in for SQLBits quick.
Don't forget we are focussing on performance an scalability so make sure your
session covers one or both of these.
To submit your session
Step 1 - Complete your Speaker Profile
Step 2 - Submit My Sessions
Often you have the need to archive data from a
table.
This leads to a number of challenges
1. How can you do it without impacting users
2. How can I make it transactionally consistent, i.e. the data I put in the
archive is the data I remove from the main table
3. How can I get it to perform well
Points 1 is very much tied to point 3. If it doesn't perform well then
the delete of data is going to cause lots of locks and thus potentially
blocking.
For points 1 and 3 refer to my previous posts DELETE-TOP-x-rows-avoiding-a-table-scan and
UPDATE-and-DELETE-TOP-and-ORDER-BY---Part2.
In essence you need to be removing small chunks of data from your table and you
want to do that avoiding a table scan.
So that deals with the delete approach but archiving is about inserting that
data somewhere else.
Well in SQL 2008 they introduced a new feature INSERT over DML (Data
Manipulation Language, i.e. SQL statements that change data), or composable DML.
The ability to nest DML statements within themselves, so you can past the
results of an insert to an update to a merge. I've mentioned this before here SQL-Server-2008---MERGE-and-optimistic-concurrency.
This feature is currently limited to being able to consume the results of a DML
statement in an INSERT statement. There are many restrictions which you can find
here http://msdn.microsoft.com/en-us/library/ms177564.aspx look
for the section "Inserting Data Returned From an OUTPUT Clause Into a Table"
Even with the restrictions what we can do is consume the OUTPUT from a DELETE
and INSERT the results into a table in another database. Note that in BOL it
refers to not being able to use a remote table, remote means a table on another
SQL instance.
To show this working use this SQL to setup two databases foo and
fooArchive
create database foo
go
--create
the source table fred in database foo
select * into foo..fred from sys.objects
go
create database fooArchive
go
if object_id('fredarchive',DB_ID('fooArchive')) is null
begin
select getdate()
ArchiveDate,* into fooArchive..FredArchive from
sys.objects where
1=2
end
go
And then we can use this simple statement to archive the data
insert into fooArchive..FredArchive
select getdate(),d.*
from
(delete top (1)
from foo..Fred
output deleted.*) d
go
In this statement the delete can be any delete statement you wish so if you
are deleting by ids or a range of values then you can do that. Refer to the DELETE-TOP-x-rows-avoiding-a-table-scan post
to ensure that your delete is going to perform. The last thing you want to do is
to perform 100 deletes each with 5000 records for each of those deletes to do a
table scan.
For a solution that works for SQL2005 or if you want to archive to a
different server then you can use linked servers or SSIS. This example
shows how to do it with linked servers. [ONARC-LAP03] is the source server.
begin transaction
insert into fooArchive..FredArchive
select getdate(),d.*
from openquery ([ONARC-LAP03],'delete top (1)
from
foo..Fred
output deleted.*')
d
commit
transaction
and to prove the
transactions work try, you should get the same number of records before and
after.
select
(select count(1) from foo..Fred)
fred
,(select COUNT(1) from
fooArchive..FredArchive ) fredarchive
begin
transaction
insert into fooArchive..FredArchive
select getdate(),d.*
from openquery ([ONARC-LAP03],'delete top (1)
from foo..Fred
output deleted.*')
d
rollback transaction
select
(select count(1) from foo..Fred)
fred
,(select COUNT(1) from
fooArchive..FredArchive ) fredarchive
The transactions are very important with this solution. Look what happens
when you don't have transactions and an error occurs
select
(select count(1) from foo..Fred)
fred
,(select COUNT(1) from
fooArchive..FredArchive ) fredarchive
insert into fooArchive..FredArchive
select getdate(),d.*
from openquery ([ONARC-LAP03],'delete top (1)
from foo..Fred
output deleted.*
raiserror (''Oh doo doo'',15,15)')
d
select
(select count(1) from foo..Fred)
fred
,(select COUNT(1) from
fooArchive..FredArchive ) fredarchive
Before running this think what the result would be. I got it wrong.
What seems to happen is that the remote query is executed as a transaction,
the error causes that to rollback. However the results have already been sent to
the client and so get inserted into the
Whenever I teach about SQL Server performance tuning I
try can get across the message that there is no such thing as a table. Does that
sound odd, well it isn't, trust me. Rather than tables you need to consider
structures. You have
1. Heaps
2. Indexes (b-trees)
Some people split indexes in
two, clustered and non-clustered, this I feel confuses the situation as
people associate clustered indexes with sorting, but don't associate non clustered indexes with sorting, this
is wrong. Clustered and non-clustered indexes are the same b-tree structure(and
even more so with SQL 2005) with the leaf pages sorted
in a linked list according to the keys of the index.. The difference is
that non clustered indexes include in their structure either, the clustered key(s), or the
row identifier for the row in the table (see http://sqlblog.com/blogs/kalen_delaney/archive/2008/03/16/nonclustered-index-keys.aspx for
more details). Beyond that they are the same, they have key columns which
are stored on the root and intermediary
pages, and included columns which are on
the leaf level.
The reason this is important is that this is how the
optimiser sees the world, this means it can use any of these structures to
resolve your query. Even if your query only accesses one table, the optimiser can access
multiple structures to get your results. One commonly sees this with a non-clustered
index scan and then a key lookup (clustered index seek), but
importantly it's not restricted to just using one non-clustered index
and the clustered index or heap, and that's the
challenge for the weekend.
So the challenge for the weekend is to produce the most complex single table
query.
For those clever bods amongst you that are thinking, great I will just use
lots of xquery functions, sorry these are the rules.
1. You have to use a table from AdventureWorks (2005 or 2008)
2. You can add whatever indexes you like, but you must document these
3. You cannot use XQuery, Spatial, HierarchyId, Full Text or any open rowset
function.
4. You can only reference your table once, i..e a FROM clause with ONE table
and no JOINs
5. No Sub queries.
The aim of this is to show how the optimiser can use multiple structures to build the results of a query and to also highlight
why the optimiser is doing that. How many structures can you get the optimiser to
use?
As an example create these two indexes on AdventureWorks2008
create
index IX_Person_Person on Person.Person (lastName,
FirstName,NameStyle,PersonType)
create
index IX_Person_Person on Person.Person(BusinessentityId,ModifiedDate)with drop_existing
select lastName,
ModifiedDate
from Person.Person
where LastName = 'Smith'
You will see that the optimiser has decided to not access the
underlying clustered index of the table but to use two indexes above to resolve the
query. This highlights how the optimiser considers all storage structures,
clustered indexes, non clustered indexes and heaps when trying to resolve a
query.

So are you up to the challenge for the weekend to produce the most complex
single table query?
The prize is a pdf version of a popular SQL Server
book, or a physical book if you live in the UK.
If you get the following error when trying to write an
expression there is an easy solution
Attempt to parse the expression "@[User::FilePath] + "\" + @[User::FileName]
+ ".raw"" failed. The token "." at line number "1", character number
"<some position>" was not recognized. The expression cannot be parsed
because it contains invalid elements at the location specified.
The SSIS expression language is a C based language and the \ is a token, this
means you have to escape it with another one. i.e "\" becomes "\\", unlike C#
you can't prefix the string with a @, you have to use the escaping route.
In summary when ever you want to use \ you need to use two
\\
We've got over 50 people registered for the
SQLSocial event on 16th March with Itzik Ben-Gan, Greg Low, Davide Mauri and Bill Vaughn
I need to finalise numbers on early next week so if you want to come along please register asap, otherwise
I can't promise that we'll have space for you.
To register use he form on herehttp://sqlsocial.com/events.aspx.
I look forward to hearing from you.
More Posts
Next page »