Getting your joins wrong can really impact performance

I've just posted about filtering with joins is difficult to optimize (http://sqlblogcasts.com/blogs/simons/archive/2010/02/16/Filtering-by-joining-is-difficult-to-optimise.aspx). This post is along the same lines and is about the opposite, when joins don't filter as expected but rather they expand the result set.

As with the filtering post, this join situation I am going to describe is also difficult to identify and the database tuning advisor won't be able to help you either.

Someone asked me the other day if I had ever come across the situaiton of someone defining a join by using a cross join and then putting criteria in the where clause. I hadn't but it make sense if the person has come form the old way of writing queries where you specify all the tables in the from clause and then define the join criteria in the where clause.

select

from tableA, tableB, tableC

where tableA.col1 = tableB.col1

and tableB.col2 = tableC.col2

select

from tableA

cross join tableB

cross join tableC

where tableA.col1 = tableB.col1

and tableB.col2 = tableC.col2

select

from tableA

join tableB on tableA.col1 = tableB.col1

join tableC on tableB.col2 = tableC.col2

All these are the same.

The danger of using cross join is that you might lose a join clause and that can be really bad, and thats what I'm going to discuss.

If you are doing a distinct or a group by in your query then you will never know if you have inadvertantly added a cartesian join into your query and thus generated more rows than you expect.

Look at this query to find the the max row size for a table

select SUM(c.max_length)

from msdb.sys.tables t

join msdb.sys.columns c on t.object_id = t.object_id

where t.name = 'sysjobhistory'

This will give a very large number, which makes you think. But why?

So the join is wrong, I've used the wrong column, this is effectively a cross join. You might think well thats just a bug but what about

select t.name , i.name, sum(s.user_lookups)

from msdb.sys.tables t

join msdb.sys.indexes i on t.object_id = i.object_id

join msdb.sys.dm_db_index_usage_stats s on s.object_id = t.object_id

                                       and s.index_id = i.index_id

group by t.name , i.name

Whats wrong with that?

Well the issue is that msdb.sys.dm_db_index_usage_stats Isn't just for msdb. The key columns on it are database_id, object_id and index_id. Because we haven't added database_id is means we've got erroneous data coming being summed. The danger is that you might start off not grouping and then see duplicates and think you have to group. So you do and it masks the issue.

select DB_NAME(database_id), * from msdb.sys.dm_db_index_usage_stats

where object_id in (select object_id

                      from msdb.sys.dm_db_index_usage_stats

                  group by object_id

                    having COUNT (distinct database_id)> 1)

This query will show you any iobjects with the same id in multiple databases.

What this highlights is the need to make sure we get our joins correct. As a starting point on one side of a query all the join columns should make up all the columns of a unique constraint(or primary key). If that is not the case then you will end up with a partial catesian join. This results in the number of rows being processed internally within the query being very very large. If you run a query and find the estimate and actuals wildly off then consider your joins.

How do you do that? Well what I do is draw my tables and joins and then write down the primary key and unique key columns, above the tables. I can then see clearly when I'm not joining correctly.

Imagine you have 3 tables, a Team table, a Player table and a Goal table. The player table has a surrogate key of Team and the Players registration code. This allows a player to move teams. The Goal table holds the Team and the Playercode and the datetime of the Goal. If we join these together without thinking we can end up with the joins as we have below.

This picture shows how you can use my technique to resolve the problem. By putting the primary key columns above the tables we can see that the join between the Player and Goal tables isn't matching the primary key on either side. In this situation what is missing is including the TeamId on the join.

Now this is a very simple solution. Normally this occurs when you have lots of tables and generally when you have composite keys. So if you find yourself grappling with why your query seems to be getting its estimates completely wrong then consider that you have your joins wrong and try this technique.


Published 17 February 2010 20:38 by simonsabin

Comments

18 February 2010 03:25 by SqlServerKudos

# Getting your joins wrong can really impact performance

Kudos for a great Sql Server article - Trackback from SqlServerKudos