INSERT and UPDATE loading practice - The impact of forwarding pointers. - Simons SQL Blog

INSERT and UPDATE loading practice - The impact of forwarding pointers.

Ever since I’ve been working with databases I have come across the following pattern of programming. The pattern is an INSERT and LOTS of UPDATEs. The reason for this pattern is reasonable, and is generally due to the desired data set being complex to generate from the source database.

For this reason a base set of data is generated and then updated by a number of updates.

I’ve seen this is two main scenarios, integration and warehousing. The former is due to a target system wanting something that the soruce system had in a different structure or a differnt level and the latter generally wants calculated data like number of children, or name of last dentist visited.

You might think these should easily be solved by one query, and if there was only on of these bits of infomation required that would be easy. However more often than not there are many, and even worse they are depenent, i.e. given the last dentist someone visted, give me the number of people that have sued said dentist.

So logically people break this down into manageable chunks. And do something like,

 

INSERT INTO CustomerDetails

SELECT somecolums

FROM Customers

 

 

ALTER sometable ADD NumberOfOrders int, NumberOfChildren int, NumberOfPeopleSued int, LastDentistVisited varchar(100) ...

 

UPDATE CustomerDetails

   SET NumberOfChildren = (SELECT COUNT(1) FROM ....)

 

UPDATE CustomerDetails

   SET LastDentistVisited = (SELECT COUNT(1) FROM ....)

 

...

 

UPDATE CustomerDetails

   SET NumberOfPeopleSued= (SELECT TOP 1 Surname FROM ....)

 

The reason for this post is that this approach is right for 1 reason and wrong for many more.

It is right because it is good practice to break small problems down into manageable understandable chunks as this aids understanding and means code is easier to support in the future.

It is wrong for many reasons of performance and one functional reason. I will cover the latter in a later post but will mention the performance reasons now.

1. The addition of the columns means each row has to be adjusted to fit the new columns. With fixed width columns this can be very expensive as columns have to be moved about. If you have a need for this the columns should be added in the insert statement. To get the correct type, use cast, i.e. CAST(NULL as int) or CAST(NULL as varchar(24))

2. The second reason is that for each UPDATE each row has to be updated. If you have N updates that means each page is touched at least N times. Which is not good for performance. If you have to do this the solution is to try and batch up dates to do it in one pass, by using derived tables and CASE statements. i.e.

UPDATE <table>

   SET Col1 = CASE WHEN <some criteria>

                   THEN <a new value> ELSE Col1 END

     , Col2 = CASE WHEN <Some other criteria>
                   THEN <some other new value> ELSE Col2 END

3. The final reason is one of forward pointers. Even if you add your additional columns by adding them in the INSERT statement, if any of them are variable length then they will result in FORWARDED RECORDS when you start doing an update. Whats a forwarded record, well it occurs when a row has to move because it no longer fits on the page it was put to start with. Its just like when you move house and redirect your mail. SQL Server maintains that your record is still located in its original location but in that location is a pointer to another location, in another page in your database. Why is this bad. Because it means SQL server has to perform two IOs to be able to read the data, and also it has to do more work to update the data. More work equals worse performance.

So lets have a look at forward pointers.

In this example we are going to create a table, and put 1 row in it. We will then put another row in the table that fills the page.

use tempdb

go

drop table heap

go

create table heap (id int not null, v1 varchar(5000))

go

insert into heap(id,v1)

select 1, ''

go

select object_id, partition_number, alloc_unit_type_desc, record_count, avg_record_size_in_bytes, forwarded_record_count

from sys.dm_db_index_physical_stats(db_id('tempdb'),object_id('heap'),0,null,'detailed')

go

insert into heap(id,v1)

select 2, REPLICATE('#',4000)

go

select object_id, partition_number, alloc_unit_type_desc, record_count, avg_record_size_in_bytes, forwarded_record_count

from sys.dm_db_index_physical_stats(db_id('tempdb'),object_id('heap'),0,null,'detailed')

go

select * from heap

cross apply sys.fn_PhysLocCracker(%%physloc%%)

 

What you should see is that after the first insert the DMV dm_db_index_physical_stats  is reporting 1 row and no forwarded records. After the second insert we get 2 tows and no forwarded records. Not we are also using a SQL 2008 undocumented feature %%physloc%% and fn_PhysLocCracker.  Both of these are undocumented, but are very handy as they show the physical location of a row, the File, Page and Slot.  There is also a scalar function as well. Thanks to Paul Randal for the pointer http://www.sqlskills.com/blogs/paul/post/SQL-Server-2008-New-(undocumented)-physical-row-locator-function.aspx  . What we will see is that this returns the original location and not any forwarded location.

Now if we update the first row so its too big to fit on the page what do you think happens.

update heap

set v1= REPLICATE('@',5000)

where id = 1

go

select * from heap

cross apply sys.fn_PhysLocCracker(%%physloc%%)

go

select object_id, partition_number, alloc_unit_type_desc, record_count, avg_record_size_in_bytes, forwarded_record_count

from sys.dm_db_index_physical_stats(db_id('tempdb'),object_id('heap'),0,null,'detailed')

 

Well you will see that we now have a forwarded record, but %%PhysLoc%% is still saying its in the same place. If we use DBCC IND we can see that the table now has 3 pages, an allocation page and two data pages.

dbcc ind(tempdb, heap,1)

go

dbcc traceon (3604)
go
dbcc page (tempdb,1,???,3)
dbcc page (tempdb,1,???,3)

Using DBCC PAGE we can look at the original page and you will see something like this. (note the ??? change these to the page numbers returned from DBCC IND where PageType = 1)

Slot 0 Offset 0x101a Length 9

 

Record Type = FORWARDING_STUB        Record Attributes =                  Record Size = 9

 

Memory Dump @0x0000000011A1B01A

 

0000000000000000:   04000900 00010000 00†††††††††††††††††..     ......       

Forwarding to  =  file 1 page 2304 slot 0

This highlights that the row has moved to page 2304 (your page will probably be different)

If we look at that page (which will be the second DBCC PAGE above) you will find your row.

So what happens if we move house again, what does SQL Server do. To check it out we need to add two rows, 1 fills up the first page and one fills up the second page. If we then update the first row so that its too big for its new page, it should move.

 

insert into heap(id,v1) select 3, REPLICATE('#',4000)

insert into heap(id,v1) select 4, REPLICATE('#',1000)

go

select * from heap

cross apply sys.fn_PhysLocCracker(%%physloc%%)

go

select object_id, partition_number, alloc_unit_type_desc, record_count, avg_record_size_in_bytes, forwarded_record_count

from sys.dm_db_index_physical_stats(db_id('tempdb'),object_id('heap'),0,null,'detailed')

go

update heap

set v1= REPLICATE('@',8000)

where id = 1

go

select * from heap

cross apply sys.fn_PhysLocCracker(%%physloc%%)

go

select object_id, partition_number, alloc_unit_type_desc, record_count, avg_record_size_in_bytes, forwarded_record_count

from sys.dm_db_index_physical_stats(db_id('tempdb'),object_id('heap'),0,null,'detailed')

go

dbcc ind(tempdb, heap,1)

Doing the insert we will see that the 4th row now appears on the second page but the other three appear to be on the first page. After the update nothing changes and the forward record count is still 1. So whats happened.

Well if we look at the first page again, using DBCC PAGE you will find that the page number that the forward pointer points to has changed to a new page. If you run DBCC IND you will see you now have a 3rd data page. Interestingly if you have a look at the second page, you will see that the slot 0 when the old forwarded record resided is now not being shown, thats because there is nothing in it. Slot 0 is now free.

So the bottom line is that if you are resorting to use such an approach doing an INSERT and then an UPDATE forwarding records can cause you a problem, not just during reading but also when updating because data has to be moved. If you have to use this approach then you are best to preallocate space in the table this ensures that the row doesn’t move and performance will improve. I’ve seen 50% improvements in systems by allocating defaults to varchar columns using replicate for instance.

So what should you do.

Well whilst you might be able to right 1 query that achieves your goal, it is probably only supportable by the person that wrote it or someone from MENSA. So breaking it up is a good thing, but the use of updates is a bad thing. You will probably find that it is quicker to perform 5 insert statements and then joined them all together rather than 1 INSERT and 4 UPDATES. If you use SELECT INTO it will be a bulk logged operation, you can then add a clustered index on your key which should then allow the final query to be an efficient MERGE  JOIN.

If you can you could use INSERT INTO with the CLUSTERED INDEX already inplace and use trace flag 610 http://blogs.msdn.com/sqlserverstorageengine/archive/2008/03/23/minimal-logging-changes-in-sql-server-2008-part-2.aspx This would normally be the quickest option.

ps. One nice feature in SQL 2008 is that the ALTER TABLE <heap> REBUILD, removes forward pointers.


-
Published 31 August 2009 01:29 by simonsabin

Comments

31 August 2009 16:23 by jamiet

# re: INSERT and UPDATE loading practice - The impact of forwarding pointers.

This was all new to me mate. Great stuff - thanks!

31 August 2009 20:16 by simonsabin

# re: INSERT and UPDATE loading practice - The impact of forwarding pointers.

Pleasure, it is to most people. I think its quite conter intuitive to think inserts is going to be quicker than doing updates.

# How can I tell if a SQL Server system is affected by Forwarded records? | Sankar Reddy, SQL Server Developer/DBA

Pingback from  How can I tell if a SQL Server system is affected by Forwarded records? | Sankar Reddy, SQL Server Developer/DBA