Monday, July 25, 2005

I was answering a Google group post about removing duplicates from tables and wrote some sample scripts of ways to remove duplicates… IMHO the second example is the best, but I think it makes for an extremely large transaction on large tables… the first method (using a cursor) would keep the transaction small – I think.

Both samples assume the following table with the following data:

CREATE TABLE tblTest
(
   [ID]  smallint
      IDENTITY(1,1)
      PRIMARY KEY CLUSTERED,
   FirstName    varchar(50)     NOT NULL,
   LastName varchar(50) NOT NULL,
   Email varchar(50) NOT NULL
)

DATA:

insert into tblTest VALUES ('Hal','Lesesne','testemail@yahoo.com')
insert into tblTest VALUES ('Hal','Lesesne','testemail@yahoo.com')
insert into tblTest VALUES ('Hal','Lesesne','testemail@yahoo.com')
insert into tblTest VALUES ('Hal1','Lesesne','testemail@yahoo.com')
insert into tblTest VALUES ('Hal1','Lesesne','testemail@yahoo.com')
insert into tblTest VALUES ('Hal1','Lesesne','testemail@yahoo.com')
insert into tblTest VALUES ('Hal2','Lesesne','testemail@yahoo.com')
insert into tblTest VALUES ('Hal2','Lesesne','testemail@yahoo.com')
insert into tblTest VALUES ('Hal2','Lesesne','testemail@yahoo.com')
insert into tblTest VALUES ('Hal3','Lesesne','testemail@yahoo.com')
insert into tblTest VALUES ('Hal3','Lesesne','testemail@yahoo.com')
insert into tblTest VALUES ('NonDup','Lesesne','testemail@yahoo.com')
insert into tblTest VALUES ('AnotherNon', 'Dup', ‘whatever@yahoo.com’)

The first method uses a cursor to build the MIN(ID) of the duplicates then iterates through each of those, deleting rows that match the dup fields but don’t have that id.

DECLARE @id int, @FirstName varchar(50), @LastName varchar(50), @Email varchar(50)
DECLARE GOODCUR CURSOR FOR
     select MIN(ID),FirstName,LastName,Email as GoodID from tblTest group by FirstName,LastName,Email having count(1) > 1
OPEN GOODCUR
FETCH NEXT FROM GOODCUR INTO @id, @FirstName,@LastName,@Email
WHILE @@FETCH_STATUS=0
     BEGIN
          DELETE tblTest from tblTest where
               (FirstName = @FirstName AND LastName = @LastName AND Email = @Email) AND NOT (ID = @ID)
          FETCH NEXT FROM GOODCUR INTO @id, @FirstName,@LastName,@Email
     END
CLOSE GOODCUR
DEALLOCATE GOODCUR

The next method using a single delete statement and no cursor, but I think I remember it making for a huge transaction and lots of locking. This joins the table to itself based on the fields that you want to check for duplicates and deletes all but the lowest ID value for each.

DELETE A
 FROM tblTest A
 INNER JOIN tblTest B
  ON A.Email = B.Email
   AND A.FirstName = B.FirstName
   AND A.LastName = B.LastName
   AND A.ID <> B.ID
 WHERE A.ID > B.ID

I would appreciate any thoughts or comments on these statements. 

Name
E-mail
Home page

Comment (HTML not allowed)  

Enter the code shown (prevents robots):