Getting your joins wrong can really impact performance
I've just posted about filtering with joins is difficult to optimize (http://sqlblogcasts.com/blogs/simons/archive/2010/02/16/Filtering-by-joining-is-difficult-to-optimise.aspx).
This post is along the same lines and is about the opposite, when joins don't
filter as expected but rather they expand the result set.
As with the filtering post, this join situation I am going to describe is
also difficult to identify and the database tuning advisor won't be able to help
you either.
Someone asked me the other day if I had ever come across the situaiton of
someone defining a join by using a cross join and then putting criteria in the
where clause. I hadn't but it make sense if the person has come form the old way
of writing queries where you specify all the tables in the from clause and then
define the join criteria in the where clause.
select
from
tableA, tableB, tableC
where
tableA.col1 =
tableB.col1
and
tableB.col2 =
tableC.col2
select
from
tableA
cross
join tableB
cross
join tableC
where
tableA.col1 =
tableB.col1
and
tableB.col2 =
tableC.col2
select
from
tableA
join
tableB on tableA.col1 = tableB.col1
join
tableC on tableB.col2 = tableC.col2
All these are the same.
The danger of using cross join is that you might lose a join clause and that
can be really bad, and thats what I'm going to discuss.
If you are doing a distinct or a group by in your query then you will never
know if you have inadvertantly added a cartesian join into your query and thus
generated more rows than you expect.
Look at this query to find the the max row size for a table
select
SUM(c.max_length)
from
msdb.sys.tables
t
join
msdb.sys.columns c on t.object_id =
t.object_id
where
t.name = 'sysjobhistory'
This will give a very large number, which makes you think. But why?
So the join is wrong, I've used the wrong column, this is effectively a cross
join. You might think well thats just a bug but what about
select
t.name ,
i.name, sum(s.user_lookups)
from
msdb.sys.tables
t
join
msdb.sys.indexes i on t.object_id =
i.object_id
join
msdb.sys.dm_db_index_usage_stats s on s.object_id =
t.object_id
and s.index_id = i.index_id
group
by t.name
, i.name
Whats wrong with that?
Well the issue is that msdb.sys.dm_db_index_usage_stats Isn't just for msdb.
The key columns on it are database_id, object_id and index_id. Because we
haven't added database_id is means we've got erroneous data coming being summed.
The danger is that you might start off not grouping and then see duplicates and
think you have to group. So you do and it masks the issue.
select
DB_NAME(database_id), * from msdb.sys.dm_db_index_usage_stats
where
object_id in (select object_id
from msdb.sys.dm_db_index_usage_stats
group by object_id
having COUNT (distinct
database_id)> 1)
This query will show you any iobjects with the same id in multiple
databases.
What this highlights is the need to make sure we get our joins correct. As a
starting point on one side of a query all the join columns should make up all
the columns of a unique constraint(or primary key). If that is not the case then
you will end up with a partial catesian join. This results in the number of rows
being processed internally within the query being very very large. If you run a
query and find the estimate and actuals wildly off then consider your joins.
How do you do that? Well what I do is draw my tables and joins and then write
down the primary key and unique key columns, above the tables. I can then see
clearly when I'm not joining correctly.
Imagine you have 3 tables, a Team table, a Player table and a Goal table. The
player table has a surrogate key of Team and the Players registration code. This
allows a player to move teams. The Goal table holds the Team and the Playercode
and the datetime of the Goal. If we join these together without thinking we can
end up with the joins as we have below.

This picture shows how you can use my technique to resolve the problem. By
putting the primary key columns above the tables we can see that the join
between the Player and Goal tables isn't matching the primary key on either
side. In this situation what is missing is including the TeamId on the join.
Now this is a very simple solution. Normally this occurs when you have lots
of tables and generally when you have composite keys. So if you find yourself
grappling with why your query seems to be getting its estimates completely wrong
then consider that you have your joins wrong and try this
technique.