Piotr Rodak

if datepart(dw, getdate()) in (6, 7)
use pubs;

sorting trap

So, you might think that sorting within ASCII range is predictable and order defined by ASCII table is finite? You are on safe side when you do not use unicode nor 'fancy' letters? Well.. you are wrong.

Recently I have been working on comparison of data stored in Sybase and SQL. Database structure is the same, I have script pulling data from Sybase to SQL Server and when this script is finished, data are meant to be identical of course. So I was thrown when we found out that rows can be returned in some cases in different order on both databases. That is, when a varchar field contained two underscore characters in a row, rows sorted by this field could be returned in different order.

After some thinking and looking for information I was able to create script that can reproduce this behavior on SQL Server.

First, create table and fill it with some data:

create table testtable

(

afield varchar(25)

)

go

insert testtable (afield)

select 'A_'

union all

select 'A__'

union all

select 'A_B'

union all

select 'A__B'

 Now select rows ordering by afield:

select afield from testtable order by afield

The results are pretty 'reasonable':

afield

-------------------------

A_

A__

A__B

A_B

 Ok, but Sybase returned different order:

afield

-------------------------

A_

A_B

A__

A__B

The mystery is hidden within collation of the field. Character fields derive default collation from database setting, which in turn gets collation if it is not specified explicitly from server settings.

If you run following query, you will get results identical with the previous result:

alter table testtable alter column afield varchar(25) collate Latin1_General_Bin

This query applies explicit collation fo afield, overriding in this way default collation. You can check collation of afield using this query:

select column_id, name, collation_name from sys.columns where object_id = object_id('testtable')

column_id   name        collation_name

----------- ----------- -------------------

1           afield      Latin1_General_BIN

This was the collation after change, compliant with Sybase collation in my case.

Default collation on my SQL Server is

column_id   name      collation_name

----------- --------- ------------------------------

1           afield    SQL_Latin1_General_CP1_CI_AS

As you see, collation setting may affect sort order of fields which do not contain unicode characters. When I run the same query on both database engines, I get mostly same results with exception of values containing multiple '_' characters. The solution is simple: specify collation of character fields to collation compatible with settings of the other database server.

Note that modifying default collation of database after tables are created does not affect collation of these fields. You can't also change collation in a column if this column is part of an index.

 

Published 03 November 2007 21:03 by Piotr Rodak

Comments

No Comments