TSQL Challenge - Remove duplicates from a string

I have a table where a coloumn contains the ordering of some preferences. Each preference is represented by a letter ,i.e. AYFT represents preferences A, Y, F and T in that order. 

All 26 characters are possible preferences. A preference can only appear once in the string.

Due to a bug the data has become corrupt and we have duplcates appearing i.e. XEDDDEFFE.

There are three of these columns in the table and the challenge is to remove the duplicate occurrences and maintain the correct order (left to right), so the example above goes from XEDDDEFFE to XEDF.

The columns can be up to 100 characters long, CLR can't be used.

create table Preferences

(

PreferenceId int identity(1,1) primary key,

Ordering1 varchar(100),

Ordering2 varchar(100),

Ordering3 varchar(100)

)

The schema can be downloaded here http://sqlblogcasts.com/files/folders/tsql_challenge/entry12784.aspx

and a population script here http://sqlblogcasts.com/files/folders/tsql_challenge/entry12783.aspx

Post proposals as comments or email me.

[SS 05/01/2010 13:29 You can assume there is a numbers table called num with a column n with values 1to 10,000 if that helps]
[SS 05/01/2010 23:17 Added a bit more explanation about uniqueness and the characters being used.]


Published 05 January 2010 09:48 by simonsabin

Comments

05 January 2010 12:05 by Uri Dimant

# re: TSQL Challenge - Remove duplicates from a string

Hi Simon

There are many techigues to find out the duplicates (Peter Larsson has nice scripts) so I modified his script and added my logic further, see if that helps you

DECLARE @Orderring1 varchar(100),@PreferenceId int

DECLARE @FS smallint

DECLARE curp INSENSITIVE CURSOR

FOR SELECT  PreferenceId,COALESCE(Ordering1,'')+COALESCE(Ordering2,'')

                      +COALESCE(Ordering3,'') AS Ordering

FROM Preferences

FOR READ ONLY

open curp

FETCH Next FROM curp into @PreferenceId,@Orderring1

SELECT @FS =  @@FETCH_STATUS

WHILE (@FS<>-1)

BEGIN

WITH cte

AS

(

SELECT SUBSTRING(@Orderring1, 1 + Number, 1)AS Clearstr,Number,

                          PATINDEX('%[^' + SUBSTRING(@Orderring1, 1 + Number, 1) + ']%',

                    SUBSTRING(@Orderring1, 2 + Number, 8000)) AS rn1,

                    ROW_NUMBER() OVER (PARTITION BY  SUBSTRING(@Orderring1, 1 + Number, 1)ORDER BY Number )rn

                    FROM   master..spt_values

                    WHERE Number < DATALENGTH(@Orderring1)

                          AND type = 'P'

) UPDATE Preferences SET Ordering1_Clear=(SELECT REPLACE(STUFF(

 (SELECT ',' + [Clearstr] as [text()] FROM cte WHERE rn=1

   ORDER BY Number FOR XML PATH('')),  1, 1, ''),',','') AS D) WHERE PreferenceId=@PreferenceId;

FETCH Next FROM curp into @PreferenceId,@Orderring1

SELECT @FS =  @@FETCH_STATUS

END

CLOSE curp

DEALLOCATE curp

# Twitter Trackbacks for TSQL Challenge - Remove duplicates from a string - Simon Sabin UK SQL Consultant's Blog [sqlblogcasts.com] on Topsy.com

Pingback from  Twitter Trackbacks for                 TSQL Challenge - Remove duplicates from a string - Simon Sabin UK SQL Consultant's Blog         [sqlblogcasts.com]        on Topsy.com

05 January 2010 15:20 by SqlServerKudos

# TSQL Challenge - Remove duplicates from a string

Kudos for a great Sql Server article - Trackback from SqlServerKudos

05 January 2010 18:24 by guercheLE

# re: TSQL Challenge - Remove duplicates from a string

/*

* I would do it as following:

*/

CREATE FUNCTION RemoveDuplicates

(

@InputValue AS varchar(max)

)

RETURNS varchar(max)

AS

BEGIN

   DECLARE @CharacterIndex int

   SET @CharacterIndex = LEN(@InputValue)

   WHILE @CharacterIndex > 1

   BEGIN

       WHILE CHARINDEX(SUBSTRING(@InputValue, @CharacterIndex, 1), SUBSTRING(@InputValue, 1, @CharacterIndex - 1)) > 0

       BEGIN

           SET @InputValue = STUFF(@InputValue, @CharacterIndex, 1, '')

       END

       SET @CharacterIndex = @CharacterIndex - 1

   END

   RETURN @InputValue

END

GO

SELECT PreferenceId,

      Ordering1,

      dbo.RemoveDuplicates(Ordering1),

      Ordering2,

      dbo.RemoveDuplicates(Ordering2),

      Ordering3,

      dbo.RemoveDuplicates(Ordering3)

FROM preferences

GO

UPDATE preferences

SET Ordering1 = dbo.RemoveDuplicates(Ordering1),

   Ordering2 = dbo.RemoveDuplicates(Ordering2),

   Ordering3 = dbo.RemoveDuplicates(Ordering3)

GO

05 January 2010 18:58 by guercheLE

# re: TSQL Challenge - Remove duplicates from a string

/*

* Just noted there is no need for an inner

* while, but just an if

*/

CREATE FUNCTION RemoveDuplicates

(

@InputValue AS varchar(max)

)

RETURNS varchar(max)

AS

BEGIN

  DECLARE @CharacterIndex int

  SET @CharacterIndex = LEN(@InputValue)

  WHILE @CharacterIndex > 1

  BEGIN

      IF CHARINDEX(SUBSTRING(@InputValue, @CharacterIndex, 1), SUBSTRING(@InputValue, 1, @CharacterIndex - 1)) > 0

      BEGIN

          SET @InputValue = STUFF(@InputValue, @CharacterIndex, 1, '')

      END

      SET @CharacterIndex = @CharacterIndex - 1

  END

  RETURN @InputValue

END

GO

05 January 2010 21:07 by EricssonA

# re: TSQL Challenge - Remove duplicates from a string

create function RemoveDuplicates

(

@input as varchar(max)

)

returns varchar(max)

as

begin

declare @result as varchar(max);

with strings(s, r)

as

(

select substring(@input,1, 1) s ,1 r

union all

select substring(@input,r,1) s ,r+1 r

from strings

where len(@input) >= r

)

,dist as

(

select s,min(r) r

from strings

group by s

)

select @result = ( select s +''

from dist

order by r

for xml path('') )

return @result

end

05 January 2010 22:02 by SQL_Menace

# re: TSQL Challenge - Remove duplicates from a string

Here is my solution

-----------------------------

declare @loop int,@rowcount int

select @rowcount = 1

select @loop = 65

while @loop <=90

begin

while @rowcount > 0

begin

update preferences

set Ordering1  = replace(Ordering1,char(@loop) + char(@loop),''),

Ordering2  = replace(Ordering2,char(@loop) + char(@loop),''),

Ordering3  = replace(Ordering3,char(@loop) + char(@loop),'')

where Ordering1 like '%' +char(@loop) + char(@loop) + '%'

or Ordering2 like '%' +char(@loop) + char(@loop) + '%'

or Ordering3 like '%' +char(@loop) + char(@loop) + '%'

set @rowcount = @@rowcount

end

set @loop= @loop + 1

set @rowcount = 1

end

go

select * from preferences

05 January 2010 22:22 by simonsabin

# re: TSQL Challenge - Remove duplicates from a string

SQL_Menace

That only removes consecutive characters. I need to remove all but the first instance of a character.

06 January 2010 02:30 by Ennor

# re: TSQL Challenge - Remove duplicates from a string

select p.PreferenceId,

replace(cast((

select char(ascii('A') + number) as [data()]

from master.dbo.spt_values sv

where sv.type = 'P' and sv.number < 26

and charindex(char(ascii('A') + number), p.ordering1) > 0

order by charindex(char(ascii('A') + number), p.ordering1)

for xml path('')

) as varchar(200)), ' ', '') as [Filtered1],

replace(cast((

select char(ascii('A') + number) as [data()]

from master.dbo.spt_values sv

where sv.type = 'P' and sv.number < 26

and charindex(char(ascii('A') + number), p.ordering2) > 0

order by charindex(char(ascii('A') + number), p.ordering2)

for xml path('')

) as varchar(200)), ' ', '') as [Filtered2],

replace(cast((

select char(ascii('A') + number) as [data()]

from master.dbo.spt_values sv

where sv.type = 'P' and sv.number < 26

and charindex(char(ascii('A') + number), p.ordering3) > 0

order by charindex(char(ascii('A') + number), p.ordering3)

for xml path('')

) as varchar(200)), ' ', '') as [Filtered3]

from dbo.Preferences p;

First, we always have numbering table - it resides in master database; and it has some useful indexes, oh yes it has.

Second, the main trick here is an exploit of CHARINDEX() function' feature: it returns first position of the character in the string, ignoring all others, and this is exactly what we need. This way, we do not have to cross-join each character in every string by all 26 possible values; it is enough to join string with letters' rowset on condition of CHARINDEX() function having non-zero result (letter is present).

Correct strings then being concatenated from resulted letters by the usual FOR XML PATH('') method. It works well enough on short values, say, less than 10K, and will probably beats STUFF() down.

In the end, it should be wee bit faster than all those loops and numbering...

06 January 2010 09:17 by Dave Ballantyne

# re: TSQL Challenge - Remove duplicates from a string

Heres my attempt

Drop function StripDupChars

go

create function StripDupChars(@Order varchar(100))

returns table

as

return

(

with CharList (c)

as

(

Select 'A' union all Select 'B' union all Select 'C' union all Select 'D' union all Select 'E' union all Select 'F' union all Select 'G' union all Select 'H' union all

Select 'I' union all Select 'J' union all Select 'K' union all Select 'L' union all Select 'M' union all Select 'N' union all Select 'O' union all Select 'P' union all

Select 'Q' union all Select 'R' union all Select 'S' union all Select 'T' union all Select 'U' union all Select 'V' union all Select 'W' union all Select 'X' union all

Select 'Y' union all Select 'Z'

),

cteCharPos(c,Pos)

as

(

select CharList.c,CharIndex(CharList.c,@Order)

from  CharList

)

select * from(

select cteCharPos.c as [text()]

 from cteCharPos

where cteCharPos.Pos>0

order by  cteCharPos.Pos

for xml path('')) as ReturnChars(ReturnChars)

)

go

 select PreferenceId ,Ordering1.ReturnChars,Ordering2.ReturnChars,Ordering3.ReturnChars

from  Preferences

 outer apply dbo.StripDupChars(Ordering1) as Ordering1

 outer apply dbo.StripDupChars(Ordering2) as Ordering2

 outer apply dbo.StripDupChars(Ordering3) as Ordering3