SQL and the like

Dave Ballantyne's blog. Freelance SQL Server database designer and developer at Clear Sky SQL

Rows or Range, What’s the difference ?

With the release of Denali CTP3 came an extension of the over clause to allow for a sliding window of data.  This allows us to , amongst other things, to efficiently and neatly calculate rolling balances.  This is a very common requirement for a database system and one which crops up time and time again on forums etc.

The basic syntax is documented in BOL here and Wayne Sheffield(blog|twitter) has written a great introduction here.

To help demonstrate, I need to generate a temporary table using some data from AdventureWorks.

use AdventureWorks2008R2
go
drop table #orders
go

Select SalesPersonID,
min(OrderDate) as OrderMonth,
sum(TotalDue ) as TotalDue
into #Orders
from Sales.SalesOrderHeader SOH
where SOH.SalesPersonID is not null
group by SalesPersonID,datediff(mm,0,OrderDate);

go
create clustered index idxOrder on #Orders(SalesPersonId,OrderMonth);
go

So, in Denali CTP3 to perform a rolling balance of the TotalDue accumulating for each OrderMonth we can perform:

select *,
sum(TotalDue) over(partition by SalesPersonId
order by OrderMonth)
as RollingBalance
from #Orders
order by SalesPersonID,OrderMonth;

and we will see a result set like this
 
image

Neat, a rolling balance, and as we as partitioning by SalesPersonID it will be re-initialised for each SalesPerson.  Problem solved.

But what about the ‘BY RANGE’ and ‘BY ROWS’ option of the clause , what’s their significance ?

Let us duplicate the data in the table

insert into #Orders
select * from #Orders

and retry the RollingBalance query (  remember that the default is RANGE so this is exactly the same query as before )

select *,
sum(TotalDue) over(partition by SalesPersonId
order by OrderMonth
Range UNBOUNDED PRECEDING)
as RollingBalance
from #Orders
order by SalesPersonID,OrderMonth;

image

Notice how the RollingBalance is now summing together all the totaldue values for each SalesPersonId / ordermonth.

Where-as by using ROWS

select *,
sum(TotalDue) over(partition by SalesPersonId
order by OrderMonth
Rows UNBOUNDED PRECEDING)
as RollingBalance
from #Orders
order by SalesPersonID,OrderMonth;

image

We now have the value incremented over each row.

Other than the output , ( which i grant you is quite important Smile ) there is another fundamental difference to be looked at.  How much work has SQLServer done to service both of these queries ?

Lets create more data..

insert into #Orders
Select SalesPersonID,OrderMonth,TotalDue
from #Orders
go 8
and re-execute the queries
 
image
 
The columns are (CPU,Reads,Writes and TotalDuration)

Wow , quite a stark difference.  But, a very big but, if range is what you require then that is what you have to do.

Now consider a different set of data.

drop table #orders
go

Select SalesPersonID,
min(OrderDate) as OrderMonth,
sum(TotalDue ) as TotalDue
into #Orders
from Sales.SalesOrderHeader SOH
where SOH.SalesPersonID is not null
group by SalesPersonID,datediff(mm,0,OrderDate);

go
create unique clustered index idxOrder on #Orders(SalesPersonId,OrderMonth);
go


declare @MaxSalesPersonId integer
Select @MaxSalesPersonId = max(SalesPersonId)
from #Orders

insert into #Orders
select SalesPersonID+@MaxSalesPersonId,OrderMonth,TotalDue
from #Orders
go 8

The same amount of rows overall but because there are now many more combinations of SalesPersonID and OrderMonth the difference in timings is now even more stark.

image

Also,  the outputs are now exactly the same.  In-fact notice how I have a unique index on SalesPersonID and OrderMonth so by definition the outputs are guaranteed to be the same, every ‘range’ will only ever have one row. 

From experience I would say that ROWS will be used in 99% of scenario’s and TBH i think that

A) BY ROWS should of been the default

B) The optimizer could be enhanced to ensure that if unique , then BY ROWS is used. 

Comments

mcflyamorim said:

Hi Dave I only found your blog right now, congrats for the posts, I loved… lot of things to read :-).

There is another difference I would like to mention,

The execution plan window spool operator (used to process the window) as two ways for storing the frame data, on-disk or on-memory. Unfortunately the default frame (RANGE…) always use the on-disk window spool… the ROWS frame use an on-memory worktable if the number of rows is less than 10000 rows… If the number of rows is greater than 10k it will also use the on-disk spool.

I saw in your video at SQLBits you said the FIRS_VALUE function was not good on performance… In fact the problem was that you were using the default frame (RANGE…) and it was using the on-disk window spool, if you change the frame to ROWS, it will be better than the old solution (CROSS APPLY) because it will use the on-memory window spool.

You could use xEvents to capture the warning (window_spool_ondisk_warning), you can also see the difference between windows spooling on disks looking at the results of statistics io, the worktable will show lots of page reads.

I think there are some scenarios where MS could change this behavior, for instance if the ORDER BY clause is unique (as in your sample) even if we use the RANGE frame we know we could use the ROWS and consequently use the in-memory worktable…

Also, this 10000 number is fixed, I think it should use the same model on spill to disk if the memory grant has been exceeded, use an rowsize base to know how much memory to grant…

I think this is something that could be improved on SQL2012.

What do you think?

Cheers

Fabiano N. Amorim - SQL Server MVP

Twitter: @mcflyamorim │Blog: http://blogfabiano.com

# November 22, 2011 8:50 PM

SQL and the like said:

OK, so that is quite a contradictory title, but unfortunately it is true that a common misconception

# May 10, 2012 11:19 AM