SQL for Smarties | SQL Programming Style | Trees and Hierarchies in SQL | SQL Puzzles and Answers | Data and Databases


Monday, June 12, 2006

Tough SQL problem, need expert advice!!!

SQL Apprentice Question
I think I am having a very tough problem, I need some expert advice
here.
Please bear with me since it will takes me a while to explain the
situation.

(Using SQL2005) I need to design the generic search utility on the
database server (run as Web Service), client supply the search criteria
(where clause)


I need return only one field:id back to user,


the data source the user need to search is coming from different
relational database tables , there is a main table and a few sub
tables, the main table has a primary key (id) and all the sub-table
have the main tables primary key as foreign key., but there are not
relationship among the sub tables, and client needs to able to search
the all the fields in all the tables.


Also, we have different database, each has its own main table and sub
tables, and the structure of each table is different. for each set, I
have those information stored in my own meta table. For each set I
could create view, dynamically generate sql statement on the fly from
the metadata table I have, but the code that generate the dynamic SQL
has to be the same


I have two ways to solve this problem but neither is satisfactory


1) Create a view or query that joins all the tables, because the
sub-tables only related to each other with the main table's primary
key, if I do a join on the main table and all sub- tables, I would
create a Cartesian products.


Select distinct id from cartesian_product_view where
maintable.field1=field1 and subtable1.field2=fied2 and
subtable3.field3=field3


Here on the main table with 40,000 rows, the cartesian_product_view has
100 million rows!!!!!


2) Substitute the where clause with subquery . for example, if the
where clause is following:


where subtable1.name ='john' or subtable2.product_code ='xyz'
or subtable1.name ='tom


I will replace it with the following : (please trust me I have the code
to do the Substitution but it will be too hard to explain here)


Select distinct maintable.id where
exists (select * from subttable1 where name = 'john' and
maintable.id=subtable1.id)
or
exists (select * from subttable2 where product_code = 'xyz' and
maintable.id=subtable2.id)
or
exists (select * from subttable1 where name = 'tom' and
maintable.id=subtable1.id)
order by ...


the problem with that is the query is getting slower & slower when user
specify more search criteria, it spend more time parsing the dynamic
query than querying the database, when you specify enough criteria
(which translated into subqueries) SQL Server just gave up and throw
the following exception.


The query processor ran out of internal resources and could not produce
a query plan. This is a rare event and only expected for extremely
complex queries or queries that reference a very large number of tables
or partitions. Please simplify the query. If you believe you have
received this message in error, contact Customer Support Services for
more information.


So I am totally stuck.


Please advice.


Celko Answers
I think what you want is the ability to load tables with criteria and
not have to use dynamic SQL:

skill = Java AND (skill = Perl OR skill = PHP)


becomes the disjunctive canonical form:


(Java AND Perl) OR (Java AND PHP)


which we load into this table:


CREATE TABLE Query
(and_grp INTEGER NOT NULL,
skill CHAR(4) NOT NULL,
PRIMARY KEY (and_grp, skill));


INSERT INTO Query VALUES (1, 'Java');
INSERT INTO Query VALUES (1, 'Perl');
INSERT INTO Query VALUES (2, 'Java');
INSERT INTO Query VALUES (2, 'PHP');


Assume we have a table of job candidates:


CREATE TABLE Candidates
(candidate_name CHAR(15) NOT NULL,
skill CHAR(4) NOT NULL,
PRIMARY KEY (candidate_name, skill));


INSERT INTO Candidates VALUES ('John', 'Java'); --winner
INSERT INTO Candidates VALUES ('John', 'Perl');
INSERT INTO Candidates VALUES ('Mary', 'Java'); --winner
INSERT INTO Candidates VALUES ('Mary', 'PHP');
INSERT INTO Candidates VALUES ('Larry', 'Perl'); --winner
INSERT INTO Candidates VALUES ('Larry', 'PHP');
INSERT INTO Candidates VALUES ('Moe', 'Perl'); --winner
INSERT INTO Candidates VALUES ('Moe', 'PHP');
INSERT INTO Candidates VALUES ('Moe', 'Java');
INSERT INTO Candidates VALUES ('Celko', 'Java'); -- loser
INSERT INTO Candidates VALUES ('Celko', 'Algol');
INSERT INTO Candidates VALUES ('Smith', 'APL'); -- loser
INSERT INTO Candidates VALUES ('Smith', 'Algol');


The query is simple now:


SELECT DISTINCT C1.candidate_name
FROM Candidates AS C1, Query AS Q1
WHERE C1.skill = Q1.skill
GROUP BY Q1.and_grp, C1.candidate_name
HAVING COUNT(C1.skill)
= (SELECT COUNT(*)
FROM Query AS Q2
WHERE Q1.and_grp = Q2.and_grp);


You can retain the COUNT() information to rank candidates. For example
Moe meets both qualifications, while other candidates meet only one of
the two. You can Google "canonical disjunctive form" for more details.
This is a form of relational division.

No comments: