SQL Apprentice Question
Here is a table representing the problem
A | B | C | D
-----------------
a1 b1 c1 d1
a1 b2 c2 d2
a3 b3 c1 d3
a4 b4 c4 d3
a5 b5 c5 d5
a6 b6 c6 d3
Tha duplications are:
row 1+2 in param A
row 1+3 in param C
row 3+4+6 in param D
only row 5 is unique in all parameters.
conclusion: row 1+2+3+4+6 are the same user
goal: to find all duplicated rows & to delete them all accept one
instance to leave.
Note:
Finding that row 1similar to 2 in A & deleting it will loose data
because we won't know that row 1 is ALSO similar to 3 on C & later on
finding that 3 is similar to 4 & 6 on D & so on
The simple time consuming (about 2 weaks) query to acomplish the task
is:
SELECT count(*),A.B,C,D
FROM tbl
GROUP BY A,B,C,D
HAVING count(*)>1
Celko Answers
A | B | C | D
-----------------
a1 b1 c1 d1
a1 b2 c2 d2
a3 b3 c1 d3
a4 b4 c4 d3
a5 b5 c5 d5
a6 b6 c6 d3
I am going to guess at the DDL and add another column
CREATE TABLE Foobar
(a CHAR(2) NOT NULL,
b CHAR(2) NOT NULL,
c CHAR(2) NOT NULL,
d CHAR(2) NOT NULL,
dups INTEGER DEFAULT 0 NOT NULL
CHECK(dups >= 0),
PRIMARY KEY(a, b, c, d));
INSERT INTO Foobar(a, b, c, d) VALUES ('a1', 'b1', 'c1', 'd1');
INSERT INTO Foobar(a, b, c, d) VALUES ('a1', 'b2', 'c2', 'd2');
INSERT INTO Foobar(a, b, c, d) VALUES ('a3', 'b3', 'c1', 'd3');
INSERT INTO Foobar(a, b, c, d) VALUES ('a4', 'b4', 'c4', 'd3');
INSERT INTO Foobar(a, b, c, d) VALUES ('a5', 'b5', 'c5', 'd5');
INSERT INTO Foobar(a, b, c, d) VALUES ('a6', 'b6', 'c6', 'd3');
Since you seem to want to preserve some of the information about
duplications, you can keep a tally
UPDATE Foobar
SET dups
= dups
+ COALESCE((SELECT 1
FROM Foobar AS F1
WHERE F1.a = Foobar.a
HAVING COUNT(*) > 1) ,0)
+ COALESCE((SELECT 1
FROM Foobar AS F1
WHERE F1.b = Foobar.b
HAVING COUNT(*) > 1),0)
+ COALESCE((SELECT 1
FROM Foobar AS F1
WHERE F1.c = Foobar.c
HAVING COUNT(*) > 1) ,0)
+ COALESCE((SELECT 1
FROM Foobar AS F1
WHERE F1.d = Foobar.d
HAVING COUNT(*) > 1),0);
The duplications are:
row 1+2 in param A
row 1+3 in param C
row 3+4+6 in param D
only row 5 is unique in all parameters.
conclusion: row 1+2+3+4+6 are the same user
Thursday, June 01, 2006
Subscribe to:
Post Comments (Atom)
2 comments:
I was interested in this problem and your solution, so I pasted the code in to SQL Server Query Analyzer to create the table and populate it.
However, when I came to run the update query i get the error:
Cannot insert the value NULL into column 'dups', table 'Utility.dbo.Foobar'; column does not allow nulls. UPDATE fails.
The statement has been terminated.
I've not dug in to it yet but can you tell me what the problem is?
I have updated the code, you have to use COALESESCE
UPDATE Foobar
SET dups
= dups
+ COALESCE((SELECT 1
FROM Foobar AS F1
WHERE F1.a = Foobar.a
HAVING COUNT(*) > 1) ,0)
+ COALESCE((SELECT 1
FROM Foobar AS F1
WHERE F1.b = Foobar.b
HAVING COUNT(*) > 1),0)
+ COALESCE((SELECT 1
FROM Foobar AS F1
WHERE F1.c = Foobar.c
HAVING COUNT(*) > 1) ,0)
+ COALESCE((SELECT 1
FROM Foobar AS F1
WHERE F1.d = Foobar.d
HAVING COUNT(*) > 1),0);
Post a Comment