Joe Celko The SQL Apprentice: Good Technique For Removing Dupes?

Thursday, September 28, 2006

Good Technique For Removing Dupes?

SQL Apprentice Question
I have a table full of records, some of which are duplicates
and I need to go through the table and remove the duplicate records.
My technique works, but it is extremely slow, and I was wondering if
anyone can tell me a faster method.

Here is a description of my current technique:

1. Declare a cursor and fetch all fields from first record from
TABLE1.
2. SELECT COUNT(*)
FROM TABLE1
WHERE Field1 = @Value1, Field2 = @Value2
3. If COUNT <= 1 THEN
This is not a duplicate record, go to the next record
ELSE
BEGIN
Must be a duplicate record.
DELETE FROM TABLE 1 where Field3 <> @Value 3
Go to the next record.
END

I am wondering if attempting to delete a record from the table
while it is being looped through and SELECTED from is causing it to be
unusually slow.

Does anyone have a different and possibly faster de-duping
algorithm?

Celko Answers
>> I have a table full of records [sic], some of which are duplicates and I need to go through the table and remove the duplicate records [sic].

1) Rows are not records; tables are not files, nor are columns
anything like fields. You are in trouble now because you did not
bother with the constraints that you needed. As soon as you scrub this
data, add those constraints and quit implementing files in SQL.

2) If you want to use cursors, then look up the "DELETE FROM [table
name] .. WHERE CURRENT OF [cursor] ;" syntax.

3) Your pseudo_code was a bit vague as to what makes a duplicate so I
would guess something like this will work:

DELETE FROM Table1
WHERE field3 <
(SELECT MAX(field3) FROM Table_1 AS T2
WHERE Table_1.field1 = T2.field1
AND Table_1.field2 = T2.field2 ;

This says find groupings based on (field1, field2) and keep the single
rows that have the max value in field3. Pretty common problem when
clenang out a history and fiedl3 is a datetime.

Joe Celko The SQL Apprentice

Thursday, September 28, 2006

Good Technique For Removing Dupes?

No comments:

Blog Archive

Labels