I was answering a Google group post about removing duplicates from tables and wrote some sample scripts of ways to remove duplicates… IMHO the second example is the best, but I think it makes for an extremely large transaction on large tables… the first method (using a cursor) would keep the transaction small – I think.
Both samples assume the following table with the following data:
CREATE TABLE tblTest( [ID] smallint IDENTITY(1,1) PRIMARY KEY CLUSTERED, FirstName varchar(50) NOT NULL, LastName varchar(50) NOT NULL, Email varchar(50) NOT NULL)
DATA:
insert into tblTest VALUES ('Hal','Lesesne','testemail@yahoo.com')insert into tblTest VALUES ('Hal','Lesesne','testemail@yahoo.com')insert into tblTest VALUES ('Hal','Lesesne','testemail@yahoo.com')insert into tblTest VALUES ('Hal1','Lesesne','testemail@yahoo.com')insert into tblTest VALUES ('Hal1','Lesesne','testemail@yahoo.com')insert into tblTest VALUES ('Hal1','Lesesne','testemail@yahoo.com')insert into tblTest VALUES ('Hal2','Lesesne','testemail@yahoo.com')insert into tblTest VALUES ('Hal2','Lesesne','testemail@yahoo.com')insert into tblTest VALUES ('Hal2','Lesesne','testemail@yahoo.com')insert into tblTest VALUES ('Hal3','Lesesne','testemail@yahoo.com')insert into tblTest VALUES ('Hal3','Lesesne','testemail@yahoo.com')insert into tblTest VALUES ('NonDup','Lesesne','testemail@yahoo.com')insert into tblTest VALUES ('AnotherNon', 'Dup', ‘whatever@yahoo.com’)
The first method uses a cursor to build the MIN(ID) of the duplicates then iterates through each of those, deleting rows that match the dup fields but don’t have that id.
DECLARE @id int, @FirstName varchar(50), @LastName varchar(50), @Email varchar(50)DECLARE GOODCUR CURSOR FOR select MIN(ID),FirstName,LastName,Email as GoodID from tblTest group by FirstName,LastName,Email having count(1) > 1OPEN GOODCURFETCH NEXT FROM GOODCUR INTO @id, @FirstName,@LastName,@EmailWHILE @@FETCH_STATUS=0 BEGIN DELETE tblTest from tblTest where (FirstName = @FirstName AND LastName = @LastName AND Email = @Email) AND NOT (ID = @ID) FETCH NEXT FROM GOODCUR INTO @id, @FirstName,@LastName,@Email ENDCLOSE GOODCURDEALLOCATE GOODCUR
The next method using a single delete statement and no cursor, but I think I remember it making for a huge transaction and lots of locking. This joins the table to itself based on the fields that you want to check for duplicates and deletes all but the lowest ID value for each.
DELETE A FROM tblTest A INNER JOIN tblTest B ON A.Email = B.Email AND A.FirstName = B.FirstName AND A.LastName = B.LastName AND A.ID <> B.ID WHERE A.ID > B.ID
I would appreciate any thoughts or comments on these statements.
Remember Me
Page rendered at 11/21/2008 7:28:45 AM (Eastern Standard Time, UTC-05:00)
Disclaimer The opinions expressed herein are my own personal opinions and do not represent my employer's view in anyway.