SQL and the like

Dave Ballantyne's blog. Freelance SQL Server database designer and developer at Clear Sky SQL

Joining on NULLs

A problem I see on a fairly regular basis is that of dealing with NULL values.  Specifically here, where we are joining two tables on two columns, one of which is ‘optional’ ie is nullable.  So something like this:

image

i.e. Lookup where all the columns are equal, even when NULL.   NULL’s are a tricky thing to initially wrap your mind around.  Statements like “NULL is not equal to NULL and neither is it not not equal to NULL, it’s NULL” can cause a serious brain freeze and leave you a gibbering wreck and needing your mummy.

Before we plod on, time to setup some data to demo against.

Create table #SourceTable
(
Id         integer  not null,
SubId      integer null,
AnotherCol char(255) not null
)
go
create unique clustered index idxSourceTable on #SourceTable(id,subID)
go
with cteNums
as
(
select top(1000) number
from  master..spt_values
where  type ='P'
)
insert into #SourceTable
select Num1.number,nullif(Num2.number,0),'SomeJunk'
from  cteNums num1
cross join  cteNums num2

go
Create table #LookupTable
(
Id    integer  not null,
SubID integer null
)
go
insert into #LookupTable
Select top(100) id,subid
from #SourceTable 
where subid is not null
order by newid()
go
insert into #LookupTable
Select top(3) id,subid 
from #SourceTable 
where subid is null
order by newid()

If that has run correctly, you will have 1 million rows in #SourceTable and 103 rows in #LookupTable.  We now want to join one to the other.

First attempt – Lets just join

select * 
 from #SourceTable join #LookupTable
   on #LookupTable.id = #SourceTable.id
  and #LookupTable.SubID = #SourceTable.SubID

OK, that’s a fail.  We had 100 rows back,  we didn’t correctly account for the 3 rows that have null values.  Remember NULL <> NULL and the join clause specifies SUBID=SUBID, which for those rows is not true.

Second attempt – Lets deal with those pesky NULLS

select * 
  from #SourceTable join #LookupTable
    on #LookupTable.id = #SourceTable.id
   and isnull(#LookupTable.SubID,0) = isnull(#SourceTable.SubID,0)

OK, that’s the right result, well done Smile and 99.9% of the time that is where its left. It is a relatively trivial CPU overhead to wrap ISNULL around both columns and compare that result, so no problems.  But, although that’s true, this a relational database we are using here, not a procedural language.  SQL is a declarative language, we are making a request to the engine to get the results we want.  How we ask for them can make a ton of difference.

Lets look at the plan for our second attempt, specifically the clustered index seek on the #SourceTable

image

 

There are 2 predicates. The ‘seek predicate’ and ‘predicate’.  The ‘seek predicate’ describes how SQLServer has been able to use an Index.  Here, it has been able to navigate the index to resolve where ID=ID.  So far so good, but what about the ‘predicate’ (aka residual probe) ? This is a row-by-row operation.  For each row found in the index matching the Seek Predicate, the leaf level nodes have been scanned and tested using this logical condition.  In this example [Expr1007] is the result of the IsNull operation on #LookupTable and that is tested for equality with the IsNull operation on #SourceTable.  This residual probe is quite a high overhead, if we can express our statement slightly differently to take full advantage of the index and make the test part of the ‘Seek Predicate’.

Third attempt – X is null and Y is null

So, lets state the query in a slightly manner:

select * 
 from #SourceTable join #LookupTable
   on #LookupTable.id = #SourceTable.id
  and ( #LookupTable.SubID = #SourceTable.SubID or
        (#LookupTable.SubID is null     and #SourceTable.SubId is null)
      )

So its slightly wordier and may not be as clear in its intent to the human reader, that is what comments are for, but the key point is that it is now clearer to the query optimizer what our intention is.

Let look at the plan for that query, again specifically the index seek operation on #SourceTable

image

No ‘predicate’, just a ‘Seek Predicate’ against the index to resolve both ID and SubID.  A subtle difference that can be easily overlooked.  But has it made a difference to the performance ? Well, yes , a perhaps surprisingly high one.

image

Clever query optimizer well done.

If you are using a scalar function on a column, you a pretty much guaranteeing that a residual probe will be used.  By re-wording the query you may well be able to avoid this and use the index completely to resolve lookups. In-terms of performance and scalability your system will be in a much better position if you can.

Comments

FatherJack said:

It's worth noting that the query plans are also quite different, the estimation is a lot better with the final attempt and the cost is much lower.

# July 26, 2012 10:52 AM

gsej said:

Surely there's something wrong with your data model if you need to join on nulls? The idea that two nulls are not equal (or not not equal) isn't something to be worked around. If those nulls in #SourceTable and #LookupTable are supposed to represent the same thing (and thus be equal to each other), they shouldn't be nulls!

# July 27, 2012 11:04 AM

Dave Ballantyne said:

Hi gsej,

I wouldnt disagree with that assesment but :

A) this is just a simplified version for the sake of brevity.

B) Sometimes changing the datamodel is not possible and you have to work with what you are given

# July 27, 2012 11:48 AM

MartinSmith said:

You can also use

SELECT *

FROM   #SourceTable

      JOIN #LookupTable

        ON #LookupTable.id = #SourceTable.id

           AND EXISTS (SELECT #LookupTable.SubID

                       INTERSECT

                       SELECT #SourceTable.SubID)

Gives the same plan as the last version.

# July 27, 2012 1:23 PM

MartinSmith said:

And in response to gsej's comment where I would usually use something like this would be in comparing two versions of the same table.

I stole the idea from this blog post sqlblog.com/.../undocumented-query-plans-equality-comparisons.aspx

# July 27, 2012 1:31 PM