June 2007 - Posts
I've just published my RSS feed on FeedBurner.
I have finally got round to looking into the error
reports and found a few bugs in this component.
I have uploaded a new install for the component that includes the fixes. The component now correctly validates that the right inputs are attached.
The key to using this component is to attach two outputs from the upstream component: the error output and the normal output. The normal output should be connected to the passthrough input of the enhanced error component, and the error output connected to the error input.
The enhanced error component uses the normal output during the pre-execution phase to map the column LineageIds to column names; after that the normal output is not touched, it just passes through the component, hence the name of the input.
I hope this explains how the component should be used.
I've decided it's time to move on from Totaljobs Group to work
for myself. That leaves an opportunity for someone to join Totaljobs Group and
be responsible for a really challenging database architecture.
Totaljobs Group is the largest internet job board company in the UK and has
been growing rapidly for the past 6 years. We are responsible for 9 job sites
in the UK and internationally, with more planned in the future.
SQL Server is at the heart of the job sites providing many functions
including the core job and candidate searching services. You will be:
- Developing the database architecture
- Mentoring the SQL team
- Liaising with the business to develop technical solutions
- Recommending and championing the use of SQL Server to enhance the services
Totaljobs Group is an early adopter of new technology where it benefits the
business; SQL 2005 was in production shortly after RTM and there is already a
commitment to using SQL Server 2008 (Katmai).
If you like to develop innovative solutions in SQL Server and want to see how
you can get the most out of SQL Server, working in a fast-moving, friendly,
relaxed environment, then this is the job for you.
If you're interested then you can apply for the job here
I've got a 50% off voucher for SQL 2005 certification
exams; the only downside is that it has to be used by the end of the week.
If you want it let me know and I'll send you the code.
First come first served.
If you are lucky enough to be running Enterprise Edition
of SQL Server you may have looked, or want to look, into online index rebuilds.
Unfortunately, having blobs in your index blocks you from being able to do an online
rebuild of that index. Crucially, that means that if your table is
clustered then you cannot rebuild the clustered index on that table online.
You can still reorganise an index with blobs.
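As a quick sketch (the table and names here are made up for illustration), an online rebuild fails as soon as a blob column is part of the index's row data, while a reorganise still works:

```sql
-- Hypothetical table: the nvarchar(max) column means the clustered
-- index contains LOB data
CREATE TABLE dbo.Document (
    DocumentId int NOT NULL,
    Title nvarchar(200) NOT NULL,
    Body nvarchar(max) NULL,
    CONSTRAINT PK_Document PRIMARY KEY CLUSTERED (DocumentId)
);

-- Fails on SQL Server 2005: an index containing LOB columns
-- cannot be rebuilt online
ALTER INDEX PK_Document ON dbo.Document REBUILD WITH (ONLINE = ON);

-- A reorganise is always an online operation and is still allowed
ALTER INDEX PK_Document ON dbo.Document REORGANIZE;
```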
For this reason my recommendation is to consider separating out your blob
data into another table. For instance, if you stored articles in your database,
you would have an article table. In this table you would store the
contents of the article along with other information about the article.
Many queries are likely to query the information and not the article contents.
If you stored the article contents in the same table as the other information,
you wouldn't be able to rebuild the table's clustered index online.
So in this scenario you would be best off with an Article table and an ArticleContents table.
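A sketch of that split (the table and column names are hypothetical): the blob moves to its own table keyed on the same id, and the main table's clustered index can then be rebuilt online:

```sql
-- Narrow table: the information most queries actually need
CREATE TABLE dbo.Article (
    ArticleId int NOT NULL,
    Title nvarchar(200) NOT NULL,
    Author nvarchar(100) NOT NULL,
    PublishedDate datetime NOT NULL,
    CONSTRAINT PK_Article PRIMARY KEY CLUSTERED (ArticleId)
);

-- Wide table: just the blob, one row per article
CREATE TABLE dbo.ArticleContents (
    ArticleId int NOT NULL,
    Contents nvarchar(max) NOT NULL,
    CONSTRAINT PK_ArticleContents PRIMARY KEY CLUSTERED (ArticleId),
    CONSTRAINT FK_ArticleContents_Article
        FOREIGN KEY (ArticleId) REFERENCES dbo.Article (ArticleId)
);

-- No LOB data in dbo.Article, so this now works
ALTER INDEX PK_Article ON dbo.Article REBUILD WITH (ONLINE = ON);
```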
This mechanism is also of benefit if you are using a full text index. Having a
separate table allows you to isolate full text indexing to that table. In addition it
allows you to drop, recreate and even repopulate your data without impacting
your main table with the core information in it.
Tony's blogged about the use of heaps http://sqlblogcasts.com/blogs/tonyrogerson/archive/2007/06/24/row-fragmentation-hopscotch-heap-v-clustered-and-io-cost.aspx.
Whilst this highlights data on a page being out of sequence, and thus yoyoing
when you scan the data, it doesn't mean you should always have a clustered index.
The stats in Tony's blog only arise because the first (heap) query is a full table scan.
With a clustered index you are not doing a table scan,
only a partial scan, and so the reads are different. If you remove
the where clause, thus forcing a table scan, then the heap is quicker because it is
more compact: it has to read fewer pages.
You can also address these issues with a covering index. A non-clustered
index is effectively its own special table: a clustered index whose
leaf pages also contain the row ID or primary key columns.
Also, the issue here is highlighted by a scan. If you are performing an
operation that does a seek, heaps aren't bad. Also, clustered indexes result in
the whole row being read, so if you only need a few columns they can be
worse than a heap with a covering index.
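To sketch that last point (hypothetical table and index): a covering non-clustered index can answer a query from its own leaf pages without ever reading the wide rows:

```sql
CREATE TABLE dbo.JobAd (
    JobAdId int NOT NULL,
    Title nvarchar(200) NOT NULL,
    Location nvarchar(100) NOT NULL,
    Description nvarchar(max) NULL   -- wide column we want to avoid reading
);

-- Covers the query below: key column Location plus included columns
CREATE NONCLUSTERED INDEX IX_JobAd_Location
    ON dbo.JobAd (Location) INCLUDE (JobAdId, Title);

-- Satisfied entirely from the index leaf pages; the heap's wide rows,
-- including Description, are never touched
SELECT JobAdId, Title
FROM dbo.JobAd
WHERE Location = N'London';
```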
Finally, having a clustered index is beneficial if your key size is less than 8
bytes, because the key is put in the leaf of non-clustered indexes, whereas with a
heap it is the row ID that is stored, and the row ID is 8 bytes.
Also, with clustered indexes you will end up with page splits. This can
result in page fragmentation across the database file; the worst case would be
that in order to read 8 pages (8KB each), 8 extents (64KB each) would have to be read
because each page is on a different extent. ALTER INDEX REORGANISE and
REBUILD address this fragmentation: the former moves pages so that their order
on disk matches their logical order and also compacts pages, while the latter
rebuilds the index entirely. REORGANISE is an online operation.
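A sketch of the two maintenance options (the table name is hypothetical); the fragmentation DMV shows whether either is worth doing:

```sql
-- Check logical fragmentation of the table's indexes first
SELECT index_id, avg_fragmentation_in_percent, page_count
FROM sys.dm_db_index_physical_stats(
         DB_ID(), OBJECT_ID(N'dbo.Article'), NULL, NULL, 'LIMITED');

-- Lightweight fix: reorder leaf pages into logical order and compact
-- them; always an online operation
ALTER INDEX ALL ON dbo.Article REORGANIZE;

-- Heavyweight fix: rebuild the indexes entirely; only online on
-- Enterprise Edition, and not online at all if blobs are in the index
ALTER INDEX ALL ON dbo.Article REBUILD;
```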
Following this post Tony and I have had a very long discussion. The bottom
line is that if there is one thing you should learn, it's the internals of
indexes. This would be my starter list: heaps and b-trees, covering
indexes, bookmark lookups, included columns, page fragmentation, page splitting.
We are currently implementing a data warehouse, and the
design of the time dimension was done recently.
Firstly it was agreed that day and time should be split out. Combining the
two into one dimension is nuts: your dimension would be huge and unworkable.
So, considering the day dimension, what should the key be?
There are a number of options
1. No surrogate key; just use the date (with no time component, as we don't
have a date datatype yet)
2. Use a generated surrogate key, i.e. an identity
3. Convert the datetime to an integer. As the decimal
portion of a datetime represents the time, if you remove it then you
have a number that represents the day.
4. Use an integer representation
of the date as a string, i.e. 1 July 2007 becomes 20070701
I am sure there are others; these are just the ones we considered.
The following are the pros and cons of each.
Option 1 is simple because you don't have to translate the date in your ETL (if
you don't have time in your datetimes). However, a datetime is 8 bytes compared
to a 4 byte integer.
Option 2 requires you to look up the relevant key in your ETL and has no
meaning when used in queries, so to get the date you always have to join to the
time dimension. This can use a smallint, but do you really need it?
Option 3 is simple because you can obtain the integer and remove the time
component easily by doing floor(cast(datecol as float)). However, you are basing
your key on the internal representation of a datetime in SQL Server. An example
of the problems you have with this is using this number in Excel: Excel starts
at 31 Dec 1899 rather than 1 Jan 1900 and also has a different leap year at some point,
which means your dates can be 2 days out. In addition, to get the date
information you need, you have to join to the time dimension or convert the
number to a datetime.
Option 4 is also simple to calculate, by doing cast(convert(char(8),@d
,112) as int). Ok, so the overhead is greater than option 3; however, the calculation
still takes .00059ms compared to 0.00017ms for option 3.
The big benefit is that the
key is a meaningful value that makes it much easier to query.
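Side by side, the two calculations look like this (a sketch; @d is just an example date):

```sql
DECLARE @d datetime;
SET @d = '20070701 10:30:00';

-- Option 3: truncate the internal day number
-- (days since 1 Jan 1900, i.e. 39262 for this date)
SELECT FLOOR(CAST(@d AS float));

-- Option 4: yyyymmdd as an integer, i.e. 20070701 for this date
SELECT CAST(CONVERT(char(8), @d, 112) AS int);
```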
So option 4 is my preferred route, and having spoken to a number of fellow MVPs,
they all agreed that whilst meaningful surrogate keys are something to be
avoided, this is one case where the rule should be broken.
The code and slides from my presentation today
(Service Broker - Asynchronous processing in SQL) are now online.
The code is on CodePlex: http://www.codeplex.com/MultiThreadedSQL.
The slides are at http://sqlblogcasts.com/files/folders/servicebrokerpresentations/entry1885.aspx
and a video of Service Broker restarting after I stopped SQL Server is also available.
I have just finished my presentation on asynchronous
processing/parallel processing using Service Broker at the Microsoft Technology Conference in Ireland. I was a bit tentative after deciding last night,
after a few Guinnesses and a whisky (thanks Barry), to add a new feature to my presentation.
Yep, I thought it would be great if I did a demo of how Service Broker handled
failure and performed reliably. How can I simulate failure, I thought? Well, the
best I could come up with at short notice was to shut down SQL Server with no
wait whilst Service Broker was processing.
Anyone that has done presentations will realise that effectively killing
SQL Server mid-demo is probably not the wisest idea. Anyway, the presentation
was going well and so I went for it.
In my demo Service Broker is configured with activation to run a stored procedure
that processes the messages and uploads the file.
So I kicked off my process, which results in ~180
messages (files) sitting in the queue to be uploaded. I then showed that Service Broker was processing
some messages, and then shut down SQL Server.
Once shutdown, my app dies because it can't access the DB.
I then restart SQL and pray.
Eventually the databases recover and Service Broker starts processing the
messages again without any intervention.
Wow, it worked. When I tried it last night I was quite shocked that it did.
A video of this is available here: http://sqlblogcasts.com/files/folders/servicebrokerdemos/entry1884.aspx