Joe Celko The SQL Apprentice: July 2006

Monday, July 31, 2006

CASE WHEN not working

SQL Apprentice Question
I'm having a problem with a SQL query that uses a bunch of "case when"
statements. I am fairly new at this, so bear with me.

We are trying to determine how many people have the different versions
of antivirus software on their machines. In the instance where they
have none of the three versions that have been available on our
network, I am trying to create a field called "noVersions" that will
populate by evaluating a CASE WHEN statement that (I think) says, if
there's no version10, no version9 or no version8, then put a "1"in the
noVersions column.

Here is the entire query:

SELECT
CASE WHEN Version10.version10 = 1 then 1 Else 0 end as version10,
CASE WHEN Version9.version9 = 1 then 1 Else 0 end as version9,
CASE WHEN Version8.version8 = 1 then 1 Else 0 end as version8,
CASE WHEN Version10.version10 = 0 and Version9.version9 = 0 and
Version8.version8 = 0 then 1 Else 0 end as noVersions,
CASE WHEN Version10.version10 = 1 and Version9.version9 = 1 and
Version8.version8 = 1 then 1 Else 0 end as allVersions,
CASE WHEN GIManaged.ResourceID IS NULL THEN 0 Else 1 end as
ManagedByGI,
CASE WHEN GIManaged_LAB.ResourceID IS NULL THEN 0 Else 1 end as
GIManaged_LAB,
CASE WHEN Healthy.ResourceID IS NULL THEN 0 Else 1 end as Healthy,
v_R_System.Netbios_Name0, v_R_System.User_Name0,
v_R_System.Operating_System_Name_and0,
eCoreRev.eCoreRevision,
lastScanDate.lastScanDate, Obsolete0 , Resource_Domain_OR_Workgr0 ,
AD_Site_Name0
FROM v_R_System LEFT OUTER JOIN
(SELECT DISTINCT ResourceID
FROM v_FullCollectionMembership
WHERE (CollectionID = 'MRK00DAA'))
GIManaged_LAB ON v_R_System.ResourceID = GIManaged_LAB.ResourceID LEFT
OUTER JOIN
(SELECT DISTINCT ResourceID
FROM v_FullCollectionMembership
WHERE (CollectionID = 'MRK011FB'))
Healthy ON v_R_System.ResourceID = Healthy.ResourceID LEFT OUTER JOIN
v_GS_LastSoftwareScan lastScanDate ON
v_R_System.ResourceID = lastScanDate.ResourceID LEFT OUTER JOIN
(SELECT DISTINCT ResourceID
FROM v_FullCollectionMembership
WHERE (CollectionID = 'MRK00D4F'))
GIManaged ON v_R_System.ResourceID = GIManaged.ResourceID LEFT OUTER
JOIN
(SELECT ResourceID, Data0 AS
eCoreRevision
FROM v_GS_eCore_XP_Build0
WHERE (Property0 = 'eCoreRevision'))
eCoreRev ON v_R_System.ResourceID = eCoreRev.ResourceID LEFT OUTER JOIN
(SELECT ResourceID, 1 AS version10
FROM v_GS_eCore_XP_Softwa0
WHERE (Application0 LIKE'SAV 10
Client%')) Version10 ON
v_R_System.ResourceID = Version10.ResourceID LEFT
OUTER JOIN
(SELECT ResourceID, 1 AS version9
FROM v_GS_eCore_XP_Softwa0
WHERE (Application0 LIKE'SAV 9
Client%')) Version9 ON
v_R_System.ResourceID = Version9.ResourceID LEFT
OUTER JOIN
(SELECT ResourceID, 1 AS version8
FROM v_GS_eCore_XP_Softwa0
WHERE (Application0 LIKE'SAV 8%'))
Version8 ON
v_R_System.ResourceID = Version8.ResourceID
WHERE (v_R_System.Operating_System_Name_and0 LIKE '%5.1%') And
Obsolete0 NOT LIKE '1'

The problem is in the FOURTH "case when" statement. when I analyze the
results, there are several instances where all three versions are
zeroes but the 'noversions' column does not indicate a one.
Conversely, the fifth line uses an almost identical "case when"
statement and it DOES work correctly.

Am I missing something here?

Celko Answers
Please post DDL, so that people do not have to guess what the keys,
constraints, Declarative Referential Integrity, data types, etc. in
your schema are. Sample data is also a good idea, along with clear
specifications. It is very hard to debug code when you do not let us
see it.

What you did post looks awful. Table and columns with the same name.
The same subquery appearing over and over4. The construction of flags
in convoluted SQL. Etc.

My guess is that the whole thign can be reduced to something like this:

SELECT VR.resource_id,
CASE WHEN G.version_nbr = 10 THEN 1 ELSE 0 END AS ver_10,
CASE WHEN G.version_nbr = 9 THEN 1 ELSE 0 END AS ver_9,
CASE WHEN G.version_nbr = 8 THEN 1 ELSE 0 END AS ver_8,
..
FROM V_R_System AS VR
LEFT OUTER JOIN
(SELECT DISTINCT resource_id
FROM Full_CollectionMembership
WHERE collection_id
IN ('MRK00DAA', 'MRK011FB', 'MRK00D4F')
AND ..) AS G
ON VR.resource_id = G.resource_id;

This is only an outline since we have no DDL, sample data or clear
specs. The other columns and the playing around with COALESCE seem to
be redundant once you have the flags for the three versions.

>> I'm having a problem with a SQL query that uses a bunch of "case when" statements. I am fairly new at this, so bear with me. <<

First thing to know that this is a CASE **expression** and not a
statement. You are thinking about it all wrong.

>> We are trying to determine how many people have the different versions of antivirus software on their machines. In the instance where they have none of the three versions that have been available on our network, I am trying to create a field [sic] called "noVersions" that will populate by evaluating a CASE WHEN statement that (I think) says, if there's no version10, no version9 or no version8, then put a "1"in the noVersions column. <<

Redundant columns (whcih are not fields!) are a bad SQL programming
practice. If you really need to show it, then do it is the front end.
It would also help if you were consistent in formatting your code
(reserved words in UPPERCASE all the time, data element names in
lowercase, no more camelCase, etc.)

>> The problem is in the FOURTH "case when" statement. when I analyze the results, there are several instances where all three versions are zeroes but the 'noversions' column does not indicate a one. <<

Try putting the real version of my query in the FROM clause, then use

SELECT (ver_10 + ver_9 + ver_8) AS total_versions
FROM ( ..)

What data type should I use, Varchar or Decimal/Numeric

SQL Apprentice Question
I have a table which has the following fields:
ID (Int, IDENTITY)
SerialNumber (?)
OrderID (Integer)

The SerialNumber is a 15 digit number, and each Order may contain
10,000+ rows. The table will be populated through ADO.Net, the data
being appended from Excel spreadsheets that the user will upload.

Will I need need to do is produce reports highlighting which Serial
Numbers are duplicates that have appeared in previous orders, and also
allow the user to remove just those serial numbers from the current
OrderID. I think(!) I know how to do this bit in SQL, it's become a bit
rusty...

What I am not sure about is what the data type should be for Serial
Number, in terms of both size and speed. The number will always be 15
digits long, so should I use use. I've put down what the sizes of field
would be, correct me if I am wrong, I've never been good at figuring
these out!

Bigint = 8bytes
Char(15) = 15bytes
Varchar(15) = 30bytes
Numeric = ?
Decimal = ?

I am guessing that Bigint is the way to go, since it uses up the least
amount of space, but will it be the quickest for my comparison needs.
Bear in mind that I am expecting a few million records in this table.

But maybe I am hoping for too much and querying against this amount of
data will be extremely slow?

Your thoughts are greatly appreaciated.

Celko Answers
>> I have a table which has the following fields [sic]:

ID (Int, IDENTITY)
SerialNumber (?)
OrderID (Integer) <<

Please post DDL, so that people do not have to guess what the keys,
constraints, Declarative Referential Integrity, data types, etc. in
your schema are. Sample data is also a good idea, along with clear
specifications. It is very hard to debug code when you do not let us
see it.

>From just what you posted, it looks like you need to drop that silly

IDENTITY pseudo-column and that (order_id, serial_nbr) is the natural
key.

That is one of the constraints you can have on a column, which does not
apply to a field.

>> The SerialNumber is a 15 digit number, and each Order may contain 10,000+ rows. <<

Describe this serial number; check digits or validation rules? how do
you verify it? First design the encoding, then pick the data type. As
a practical observation, long identification numbers tend to be built
in sub-fields by a set of verifiable rules, so CHAR(15) might be a good
choice. For example, the 17 character VIN on automobiles. But without
any more specs, we are just guessing.

The one thing we are sure about is that you would never just rattle off
a simple sequence and accept that kind of data error rate.

>> Will I need need to do is produce reports highlighting which SerialNumbers are duplicates that have appeared in previous orders, and also allow the user to remove just those serial numbers from the current OrderID. <<

If your DDL was declared properly, then you would never have redundant
duplicates in the schema.

SQL query as parameter

SQL Apprentice Question
In SQL 2000. Suppose I have a stored procedure where @SQLWhere is
a varchar(100) parameter which contains the WHERE Clause which will be used
with a SELECT Clause(which is in the store procedure).

How can I put all of this together so that the whole query (SELECT+WHERE) will
work correctly?

Celko Answers
>> Suppose I have a stored procedure where @SQLWhere is a VARCHAR(100) parameter which contains the WHERE Clause which will be used with a SELECT Clause(which is in the store procedure). <<

Then, by definition, this string is not a parameter, is it? It is code
that you want to run using dynamic SQL. It is also a very, very bad
programming technique. What you are telling the world is that you have
no idea what this procedure should do, so you have to depend on any
random future user to do your job.

Do you remember coupling and cohesion from your freshman Software
Engineering course?

Did you look up "SQL injection" as a possible problem?

Database with many tables or multiple databases with lesser tables

SQL Apprentice Question
We are building a large application (a financial system) that, after
normalization, would approximately need 1500 or so tables. The question is
should we build this as one database or should we separate it out into
multiple databases.
There is a case for separating out into 5 or so databases based on
functional modules within the financial system (GL, AP, AR etc.) . Problem
though is they are related to one another in some way.
In a single database case, some standard tables (like U.S. States, Notes
etc.) can be a single table. Also it avoids cross database talks so ease of
development is more.
What are the performance advantages/disadvantages of one way or the other?
Does SQL Server performance get affected if the number of tables is large, as
in my case? Any inputs, suggestions?

Celko Answers
>> We are building a large application (a financial system) that, after normalization, would approximately need 1500 or so tables. <<

We have to take your word for it, but I have found that a bit larger
than most Fortune 500 systems. But who cares, if the data model is
right. And if they are part of the same data model, they should be in
one DB -- would put male employees in one schema and female employess
in a second schema?

With various products, you can partition the tables, do some indexing
tricks, etc. but that is implementation and not design.

>> The question is should we build this as one database or should we separate it out into

multiple databases. <<

Unh? Is it one data model or many different universes of discourse?

>> There is a case for separating out into 5 or so databases based on functional modules within the financial system (GL, AP, AR etc.) . <<

"functional modules within the financial system" -- Like a 1950's
COBOL system! Nothing like an RDBMS at all! No concept of a universe
of discourse at all.

>> though is they are related to one another in some way. In a single database case, some standard tables (like U.S. States, Notes etc.) can be a single table. <<

NO, THEY CANNOT!! A table a set of entities or relationships of the
same kind. Were the kid on dope in the back of my database class?
Look up the OTLT or MUCK design flaw.

>> Also it avoids cross database talks so ease of development is more. <<

Ease of developement is not the important thing. Data quality,
maintainabilty and those other things they told you about in the
Software Engineering course you should have taken are ORDERS OF
MAGNITUDE more important (literally -- did you ever read Boehm's
research, etc.?).

SQL Join question - getting most recent value in a history table

SQL Apprentice Question
My question concerns three tables:

- tblHeader - containing general information such as product ID, description
etc
HeaderID
HeaderDescription
HeaderSerialNumber

lstLocations - lookup table containing location IDs
LocID
LocDescription

tblLocationHistory - a transaction table containing the location history of
each header and the date of their installation.
HistoryID
HeaderID
LocID
InstallDate

What I'm after is a select which shows each line in tblHeader and its most
recent location. However my first attempt (as you would expect from SQL)
returns every product and every location they have been in - e.g. Product X
has 7 locations - thus appears 7 times in the results from the below query.
Unfortunately I'm after the most recent location of each Header *only*.

I feel like this is something I should be able to do with my eyes closed but
I can't seem to crack it. Please could someone point me in the right
direction! Should I be looking at using UDF's for example?

*******************

SELECT
dbo.tblHeader.HeaderID,
dbo.tblHeader.Type,
dbo.tblHeader.SerialNumber,
dbo.lstLocations.Location_Name,
dbo.tblLocationHistory.InstallDate
FROM dbo.lstLocations INNER JOIN
dbo.tblLocationHistory ON dbo.lstLocations.Location_ID
= dbo.tblLocationHistory.LocationID INNER JOIN
dbo.tblHeader ON dbo.tblLocationHistory.HeaderID =
dbo.tblHeader.HeaderID

Celko Answers
Please post DDL, so that people do not have to guess what the keys,
constraints, Declarative Referential Integrity, data types, etc. in
your schema are. Sample data is also a good idea, along with clear
specifications. It is very hard to debug code when you do not let us
see it.

You also need to learn ISO-11179 rules so you will stop using those
silly redundant affixes that describe physical implementations in an
RDBMS. It makes you look like a BASIC programmer. Also, tables are
sets which usually have more than one element, so you need to use
collecitve or plural names.

The proper way to model temporal data is with duration, not chronons
(google it). Try this:

CREATE TABLE LocationHistory
(header_id INTEGER NOT NULL
REFERENCES Headers (header_id),
loc_id INTEGER NOT NULL
REFERENCES Locations (loc_id),
start_date DATETIME NOT NULL.
end_date DATETIME, -- null is current location
PRIMARY KEY (header_id, loc_id, start_date),
etc.);

>> What I'm after is a select which shows each line in Headers and its most recent location. <<

Look for "end_date IS NULL" in your query. Proper DDL means much
easier coding.

Orders Per Hour Count

SQL Apprentice Question
Hello,

I have a table with 3 columns

ProductID int, Product varchar(30), OrderDate datetime

I have orders in this table from January to July I want to get a query that
will display how many orders I am getting per day and per hour.

Celko Answers
Please post DDL, so that people do not have to guess what the keys,
constraints, Declarative Referential Integrity, data types, etc. in
your schema are. Sample data is also a good idea, along with clear
specifications. It is very hard to debug code when you do not let us
see it.

You talk about orders, then post your personal pseudo-code about
products. Does that make sense to you?

>> I have orders in this table from January to July I want to get a query that will display how many orders I am getting per day and per hour. <<

Build a table of hourly report ranges. Think about using a spreadsheet
or a simple program to give you a year's worth of ranges -- 365 * 24 =
8760 rows

CREATE TABLE ReportRanges
(period_name CHAR(15) NOT NULL PRIMARY KEY, -- or use start_date
start_date DATETIME NOT NULL,
end_date DATETIME NOT NULL,
CHECK (start_date < end_date),
etc.) ;

Then the query is simply:

SELECT R.period_name, COUNT(*) AS hourly_cnt
FROM ReportRanges AS R, Orders AS O
WHERE O.order_date BETWEEN R.start_date AND R.end_date
GROUP BY R.period_name;

Constraint on an aggregate?

SQL Apprentice Question
Probably not a great idea, but is it possible to create a constraint on
an aggregate of rows? E.g. "sum(amount) = 0 group by foreign_key"

Celko Answers
Try this little trick:

CREATE TABLE Foobar
(..);

CREATE VIEW Foobar2 ( ..)
AS SELECT ( ..)
FROM Foobar
WHERE (NOT EXISTS
(SELECT *
FROM Foobar
GROUP BY foreign_key
HAVING SUM(amount) <> 0)
WITH CHECK OPTION;

Tuesday, July 25, 2006

Assign a "record number" to each row through an UPDATE clause

SQL Apprentice Question
I've a table with an interger column that accept NULL values.
When records are inserted, this column is not specified.
I would like to assign a "record number" to each row through this column
starting with 1 to each record with a single UPDATE command.
For instance, if the table contains 10 rows, atfer the UPDATE command, each
row will have a value in the interger column starting from 1 up to 10.
Is it possible ?

The primary key is a string...

sample data before UPDATE:

primary_key datetime imagedata record_number
olivier some_dt some_data NULL
olivier some_dt some_data NULL

After UPDATE

primary_key datetime imagedata record_number
olivier some_dt some_data 1
olivier some_dt some_data 2

Celko Answers
Please post DDL, so that people do not have to guess what the keys,
constraints, Declarative Referential Integrity, data types, etc. in
your schema are. Sample data is also a good idea, along with clear
specifications.

The personal pseudo-code you did post, with all of the illegal columns
names, is useless. Your "primary key" column has duplicate values. My
guess is that you meant to say:

CREATE TABLE Foobar
(vague_name VARCHAR(35) NOT NULL,
vague_date DATETIME DEFAULT CURRENT_TIMESTAMP NOT NULL,
PRIMARY KEY (vague_name, vague_date),
etc.);

Your attempt to mimic a 1950's magnetic tape file in SQL is a bad idea.
Let's get back to the basics of an RDBMS. Rows are not records; fields
are not columns; tables are not files; there is no sequential access or
ordering in an RDBMS, so "first", "next" and "last" are totally
meaningless. If you want an ordering, then you need to have a column
that defines that ordering. It is an attribute of the entity in the
data model, not a physical locator put on as an afrter thought.

In your case, you probably should use a VIEW with a "ROW_NUMBER () OVER
(PARTITION BY vague_name ORDER BY vague_date DESC)" function in
SQL-2005 or a self-join. PARTITION gives you a "row number" in each
vague_name grouping; drop it and you will get the whole table in
temporal order.

Find bit fields with no default

SQL Apprentice Question
Does anyone know how to write a select to search all tables in a database
and return the table name and column name of any bit data type that allows
nulls? I inherited a database where some bit fields apparently do not have
a default value of 1 or 0.

Celko Answers
>> Does anyone know how to write a select to search all tables in a database and return the table name and column name of any bit data type that allows nulls? <<

After that, you need to re-write this piece of crap to get rid of
assembly language style programming. The schema is a nightmare.

>> I inherited a database where some bit fields [sic] apparently do not have a default value of 1 or 0. <<

That is not the real problem. It is a screw-up on a baaaaad design.
You should never be using assembly language bits in SQL. I also hope
that you know that columns and fields are absolutely nothing alike.

SQL Apprentice Question
Do you suggest using tinyint instead for a column that is simply true or
false?

Celko Answers
>> Do you suggest using tinyint instead for a column that is simply true or false? <<

You still thinking in terms of old procedural languages. Look at VB
and C# -- they do not agree on 0, 1, or -1 for Boolean values! Bits do
not port evern with proprietary languages inside Microsoft. In fact
this is a real mess in the X3J language standards.

But more importantly, I have found that assembly style flags are
usually redundant and dangerous. Instead of deducing a fact from a
logical predicate, the relational declarative way of programming, the
programmer is set a flag like we did with punch cards.

An actual example was an Inventory that had "is_instock" flag when
"(qty_on_hand > 0)" gave the same information. The bad news was that
the flag and the qty_on_hand had no relation to each other until the
guy did an update. His thought was to add a CHECK() constraint and
make it worse! Then they ran into the (1, -1) problem with the front
end languages.

My advise is to sit down and design a status code. be sure to leave
room for expansion. For example, the inventory status might include
"in stock", "back ordered", "on regular order" and then needs to add
"discontinued", etc. in the future.

I have the basics for such designs in SQL PROGRAMING STYLE. Yes, doing things right is work and I know that lot of XP-Agile-cowboy coders hate
that. Think high level abstract, not low level bit flags.

Design: How to avoid violation of 2NF/splitting of compound primary key?

SQL Apprentice Question
I'm designing a new locations model.

I need to support the standard locations structures (country,
state/province, county/parish, etc.) and several code standards but
also need to support known and to-be-user-defined regions.

Since country, state, county and even city are technically "regions" --
composed of zero or more other regions, I have subtyped a Region table
(which I call Location for political reasons) to support/disambiguate
all these (Location)Types.

The question is whether to include the LocationType in the primary key
of the Location table.

Pro: The member location will be easily identified/selected from the
subtype table
Con: The parent locatioon (e.g. State)'s Location Type will be the
same for the entire table, thus violating 2NF... redundancy. Or am I
wrong?

DDL follows (sorry for the length, but I find it impossible to help
other people when I don't have the model in front of me, and I'm
guessing I'm not the only one).

Also, I'd love to hear constructive criticism. Designing is a fine
balance of clarity and performance, and since location data is not
strictly foreseeably hierarchical but pretty darn static, I'm hoping to
bank on indexed views as the best way for developers to reference the
data in their sprocs.

TIA

CREATE TABLE Catalog
(
Catalog_id integer IDENTITY (1,1) ,
Catalog_nm dm_Name ,
Catalog_desc dm_Description
)
go

ALTER TABLE Catalog
ADD PRIMARY KEY (Catalog_id ASC)
go

CREATE TABLE CityTown
(
StateProvince_Location_id integer NOT NULL ,
Country_Location_id integer NOT NULL ,
CityTown_Location_id integer NOT NULL
)
go

ALTER TABLE CityTown
ADD PRIMARY KEY (StateProvince_Location_id ASC,Country_Location_id
ASC,CityTown_Location_id ASC)
go

CREATE TABLE CityTownMember
(
CityTownMember_seq dm_Sequence ,
StateProvince_Location_id integer NOT NULL ,
Country_Location_id integer NOT NULL ,
CityTown_Location_id integer NOT NULL ,
CityTownMember_Location_id integer NOT NULL
)
go

ALTER TABLE CityTownMember
ADD PRIMARY KEY (StateProvince_Location_id ASC,Country_Location_id
ASC,CityTown_Location_id ASC,CityTownMember_Location_id ASC)
go

CREATE TABLE CodeStandard
(
CodeStandard_id integer IDENTITY (1,1) ,
CodeStandard_nm dm_Name ,
CodeStandard_desc dm_Description ,
Catalog_id integer NULL
)
go

ALTER TABLE CodeStandard
ADD PRIMARY KEY (CodeStandard_id ASC)
go

CREATE TABLE CountryMember
(
CountryMember_seq dm_Sequence ,
Location_id integer NOT NULL ,
CountryMember_id integer NOT NULL
)
go

ALTER TABLE CountryMember
ADD PRIMARY KEY (Location_id ASC,CountryMember_id ASC)
go

CREATE TABLE CountyParish
(
StateProvince_Location_id integer NOT NULL ,
Country_Location_id integer NOT NULL ,
CountyParish_Location_id integer NOT NULL
)
go

ALTER TABLE CountyParish
ADD PRIMARY KEY (StateProvince_Location_id ASC,Country_Location_id
ASC,CountyParish_Location_id ASC)
go

CREATE TABLE CountyParishMember
(
CountyParishMember_seq dm_Sequence ,
StateProvince_Location_id integer NOT NULL ,
Country_Location_id integer NOT NULL ,
CountyParish_Location_id integer NOT NULL ,
CountyParishMember_Location_id integer NOT NULL
)
go

ALTER TABLE CountyParishMember
ADD PRIMARY KEY (StateProvince_Location_id ASC,Country_Location_id
ASC,CountyParish_Location_id ASC,CountyParishMember_Location_id ASC)
go

CREATE TABLE Location
(
Location_nm dm_Name ,
Location_id integer IDENTITY (1,1) ,
LocationType_id integer NULL
)
go

ALTER TABLE Location
ADD PRIMARY KEY (Location_id ASC)
go

CREATE TABLE LocationCode
(
LocationCode_cd char(18) NULL ,
Location_id integer NOT NULL ,
CodeStandard_id integer NULL
)
go

ALTER TABLE LocationCode
ADD PRIMARY KEY (Location_id ASC)
go

CREATE TABLE LocationType
(
LocationType_id integer IDENTITY (1,1) ,
LocationType_nm dm_Name
)
go

ALTER TABLE LocationType
ADD PRIMARY KEY (LocationType_id ASC)
go

CREATE TABLE RegionMember
(
RegionMember_seq dm_Sequence ,
Location_id integer NOT NULL ,
RegionMember_id integer NOT NULL
)
go

ALTER TABLE RegionMember
ADD PRIMARY KEY (Location_id ASC,RegionMember_id ASC)
go

CREATE TABLE StateProvince
(
StateProvince_seq dm_Sequence ,
StateProvince_Location_id integer NOT NULL ,
Country_Location_id integer NOT NULL
)
go

ALTER TABLE StateProvince
ADD PRIMARY KEY (StateProvince_Location_id ASC,Country_Location_id
ASC)
go

CREATE TABLE StateProvinceMember
(
StateProvinceMember_seq dm_Sequence ,
Country_Location_id integer NOT NULL ,
StateProvinceMember_Location_id integer NOT NULL ,
StateProvince_Location_id integer NOT NULL
)
go

ALTER TABLE StateProvinceMember
ADD PRIMARY KEY (StateProvince_Location_id ASC,Country_Location_id
ASC,StateProvinceMember_Location_id ASC)
go

ALTER TABLE CityTown
ADD FOREIGN KEY (CityTown_Location_id) REFERENCES
Location(Location_id)
ON DELETE NO ACTION
ON UPDATE NO ACTION
go

ALTER TABLE CityTown
ADD FOREIGN KEY (StateProvince_Location_id,Country_Location_id)
REFERENCES StateProvince(StateProvince_Location_id,Country_Location_id)
ON DELETE NO ACTION
ON UPDATE NO ACTION
go

ALTER TABLE CityTownMember
ADD FOREIGN KEY
(StateProvince_Location_id,Country_Location_id,CityTown_Location_id)
REFERENCES
CityTown(StateProvince_Location_id,Country_Location_id,CityTown_Location_id)
ON DELETE NO ACTION
ON UPDATE NO ACTION
go

ALTER TABLE CityTownMember
ADD FOREIGN KEY (CityTownMember_Location_id) REFERENCES
Location(Location_id)
ON DELETE NO ACTION
ON UPDATE NO ACTION
go

ALTER TABLE CodeStandard
ADD FOREIGN KEY (Catalog_id) REFERENCES Catalog(Catalog_id)
ON DELETE NO ACTION
ON UPDATE NO ACTION
go

ALTER TABLE CountryMember
ADD FOREIGN KEY (CountryMember_id) REFERENCES Location(Location_id)
ON DELETE NO ACTION
ON UPDATE NO ACTION
go

ALTER TABLE CountryMember
ADD FOREIGN KEY (Location_id) REFERENCES Location(Location_id)
ON DELETE NO ACTION
ON UPDATE NO ACTION
go

ALTER TABLE CountyParish
ADD FOREIGN KEY (CountyParish_Location_id) REFERENCES
Location(Location_id)
ON DELETE NO ACTION
ON UPDATE NO ACTION
go

ALTER TABLE CountyParish
ADD FOREIGN KEY (StateProvince_Location_id,Country_Location_id)
REFERENCES StateProvince(StateProvince_Location_id,Country_Location_id)
ON DELETE NO ACTION
ON UPDATE NO ACTION
go

ALTER TABLE CountyParishMember
ADD FOREIGN KEY
(StateProvince_Location_id,Country_Location_id,CountyParish_Location_id)
REFERENCES
CountyParish(StateProvince_Location_id,Country_Location_id,CountyParish_Location_id)
ON DELETE NO ACTION
ON UPDATE NO ACTION
go

ALTER TABLE CountyParishMember
ADD FOREIGN KEY (CountyParishMember_Location_id) REFERENCES
Location(Location_id)
ON DELETE NO ACTION
ON UPDATE NO ACTION
go

ALTER TABLE Location
ADD FOREIGN KEY (LocationType_id) REFERENCES
LocationType(LocationType_id)
ON DELETE NO ACTION
ON UPDATE NO ACTION
go

ALTER TABLE LocationCode
ADD FOREIGN KEY (Location_id) REFERENCES Location(Location_id)
ON DELETE NO ACTION
ON UPDATE NO ACTION
go

ALTER TABLE LocationCode
ADD FOREIGN KEY (CodeStandard_id) REFERENCES
CodeStandard(CodeStandard_id)
ON DELETE NO ACTION
ON UPDATE NO ACTION
go

ALTER TABLE RegionMember
ADD FOREIGN KEY (RegionMember_id) REFERENCES Location(Location_id)
ON DELETE NO ACTION
ON UPDATE NO ACTION
go

ALTER TABLE RegionMember
ADD FOREIGN KEY (Location_id) REFERENCES Location(Location_id)
ON DELETE NO ACTION
ON UPDATE NO ACTION
go

ALTER TABLE StateProvince
ADD FOREIGN KEY (Country_Location_id) REFERENCES
Location(Location_id)
ON DELETE NO ACTION
ON UPDATE NO ACTION
go

ALTER TABLE StateProvince
ADD FOREIGN KEY (StateProvince_Location_id) REFERENCES
Location(Location_id)
ON DELETE NO ACTION
ON UPDATE NO ACTION
go

ALTER TABLE StateProvinceMember
ADD FOREIGN KEY (StateProvince_Location_id,Country_Location_id)
REFERENCES StateProvince(StateProvince_Location_id,Country_Location_id)
ON DELETE NO ACTION
ON UPDATE NO ACTION
go

ALTER TABLE StateProvinceMember
ADD FOREIGN KEY (StateProvinceMember_Location_id) REFERENCES
Location(Location_id)
ON DELETE NO ACTION
ON UPDATE NO ACTION
go

Celko Answers
>> I'm designing a new locations model. I need to support the standard locations structures (country, state/province, county/parish, etc.) and several code standards but also need to support known and to-be-user-defined regions. <<

Thank you for the DDL. The flaws I see are the use of IDENTITY as a
key in violation of the most basic RDBMS definitions and inconsistent
naming rules for data elements. .

I would suggest a nested sets model for this. One table for locations
with a name as a key and (longitude, latitude) or whatever as an
attribute. Then one table for each geographical partitioning of the
set of locations. You would have a political divisions table, a wine
of regions table, etc, that reference the locations. The political
name of a division is an attribute.

You made it complex by having a table for every level in a hierarchy.
Model the hierarchies instead.

Monday, July 24, 2006

Having Clause Question

SQL Apprentice Question
Sorry if this a long winded question with a simple answer

i have the following script that builds 2 tables including data

Create Table tblProducts
(
intProdID int IDENTITY(1,1),
strName varchar(20) UNIQUE
)
go
insert into tblProducts (strName) values('A')
insert into tblProducts (strName) values('B')
insert into tblProducts (strName) values('C')
insert into tblProducts (strName) values('D')

go
Create UNIQUE CLUSTERED index ProdIndex on tblProducts(intProdID)
go
--Prices
drop table tblPrices
go
Create table tblPrices
(
intPriceID int IDENTITY(1,1),
intProdID int,
intPrice int,
dtInputDate dateTime
)
go
insert into tblPrices(intProdID,intPrice,dtInputDate) values (1,1.5,'21
july 2006')
insert into tblPrices(intProdID,intPrice,dtInputDate) values
(1,12.5,'20 july 2006')
insert into tblPrices(intProdID,intPrice,dtInputDate) values
(1,13.5,'22 july 2006')
insert into tblPrices(intProdID,intPrice,dtInputDate) values
(1,14.5,'23 july 2006')
insert into tblPrices(intProdID,intPrice,dtInputDate) values
(2,15.5,'24 july 2006')
insert into tblPrices(intProdID,intPrice,dtInputDate) values
(3,17.5,'25 july 2006')
insert into tblPrices(intProdID,intPrice,dtInputDate) values
(2,81.5,'26 july 2006')
insert into tblPrices(intProdID,intPrice,dtInputDate) values
(2,16.5,'27 july 2006')
insert into tblPrices(intProdID,intPrice,dtInputDate) values
(3,18.5,'28 july 2006')
insert into tblPrices(intProdID,intPrice,dtInputDate) values
(3,14.5,'29 july 2006')
insert into tblPrices(intProdID,intPrice,dtInputDate) values
(4,12.5,'30 july 2006')
go
create unique clustered index PriceIndex on tblPrices(intPriceID)
go
create nonclustered index PriceProdIndex on tblPrices(intProdID)

Im trying to get the following data
the Product name, Price for that product at the latest date. Trying to
do this without using a sub query

So in my Sybase days i would have done the following

select tblProducts.strName, tblPrices.intPrice, tblPrices.dtInputDate
from tblProducts join tblPrices on tblProducts.intProdID =
tblPrices.intProdID
group by
strName
having dtInputDate = max(dtInputDate)

which my mate confirm works on Sysbase 12.5 now on SQL server i do the
previous and i get the following error

Msg 8121, Level 16, State 1, Line 2
Column 'tblPrices.dtInputDate' is invalid in the HAVING clause because
it is not contained in either an aggregate function or the GROUP BY
clause.

Is there anyway of doing the above with out using a sub query?

Celko Answers
Thanks for the DDL. It shows us immediately that your problem is a
lack of a proper design. Get rid of the silly prefixes that violate
ISO-11179 and the rules of data modeling. Stop using IDENTITY as a
key; I am goignto assume that you have a UPC or SKU or something that
your inventory uses. Stop leaving everything NULL-able. Start using
DRI actions. Learn how to do temporal models in SQL (i.e. read
Snodgrass or the current working drafts at NCITS H2)

CREATE TABLE Products
(sku INTEGER NOT NULL PRIMARY KEY, -- need industry std
product_name VARCHAR(20) NOT NULL);

CREATE TABLE PriceList
(sku INTEGER NOT NULL,
REFERENCES Products (sku)
ON UPDATE CASCADE
ON DELETE CASCADE,
item_price DECIMAL (12,2) NOT NULL,
start_date DATETIME NOT NULL,
end_date DATETIME, - null means current
PRIMARY KEY (sku, start_date));

>> I'm trying to get the following data: the product name, price for that product at the latest date. <<

SELECT P.product_name, L.item_price, L.start_date
FROM Products AS P
LEFT OUTER JOIN
PriceList AS L
ON P.sku = L.sku
WHERE L.end_date IS NULL;

There is no problem at all, if you start with the right design.

The other problem you have is not knowing SQL, and speaking in Sybase
dialect. Here is how a SELECT works in SQL ... at least in theory.
Real products will optimize things, but the code has to produce the
same results.

a) Start in the FROM clause and build a working table from all of the
joins, unions, intersections, and whatever other table constructors are
there. The [table expression] AS [correlation name] option allows you
give a name to this working table which you then have to use for the
rest of the containing query.

b) Go to the WHERE clause and remove rows that do not pass criteria;
that is, that do not test to TRUE (i.e. reject UNKNOWN and FALSE). The
WHERE clause is applied to the working set in the FROM clause.

c) Go to the optional GROUP BY clause, partiton the original table
into groups and reduce each grouping to a *single* row, replacing the
original working table with the new grouped table. The rows of a
grouped table must be only group characteristics: (1) a grouping column
(2) a statistic about the group (i.e. aggregate functions) (3) a
function or constant(4) an expression made up of only those three
items. The original table no longer exists and you cannot reference
anything in it (this was an error in early Sybase products).

d) Go to the optional HAVING clause and apply it against the grouped
working table; if there was no GROUP BY clause, treat the entire table
as one group.

e) Go to the SELECT clause and construct the expressions in the list.
This means that the scalar subqueries, function calls and expressions
in the SELECT are done after all the other clauses are done. The AS
operator can also give names to expressions in the SELECT list. These
new names come into existence all at once, but after the WHERE clause,
GROUP BY clause and HAVING clause have been executed; you cannot use
them in the SELECT list or the WHERE clause for that reason.

If there is a SELECT DISTINCT, then redundant duplicate rows are
removed. For purposes of defining a duplicate row, NULLs are treated
as matching (just like in the GROUP BY).

f) Nested query expressions follow the usual scoping rules you would
expect from a block structured language like C, Pascal, Algol, etc.
Namely, the innermost queries can reference columns and tables in the
queries in which they are contained.

g) The ORDER BY clause is part of a cursor, not a query. The result
set is passed to the cursor, which can only see the names in the SELECT
clause list, and the sorting is done there. The ORDER BY clause cannot
have expression in it, or references to other columns because the
result set has been converted into a sequential file structure and that
is what is being sorted.

As you can see, things happen "all at once" in SQL, not "from left to
right" as they would in a sequential file/procedural language model. In
those languages, these two statements produce different results:
READ (a, b, c) FROM File_X;
READ (c, a, b) FROM File_X;

while these two statements return the same data:

SELECT a, b, c FROM Table_X;
SELECT c, a, b FROM Table_X;

Think about what a confused mess this statement is in the SQL model.

SELECT f(c2) AS c1, f(c1) AS c2 FROM Foobar;

That is why such nonsense is illegal syntax.

incrementing a counter in sql

SQL Apprentice Question
need to create a table with two values - old no and new no
the old no should come from an existing table and the new number should begin with a starting value and increment by 1

how do i do this????

Celko Answers
>> I need to create a table with two values - old number and new number the old number should come from an existing table and the new number should begin with a starting value and increment by 1 <<

Generally, we prefer "nbr" as the affix for "number" so it does not
conflict the English word "No" and matches the more international
Latin-based "numero", "numbero", etc. and common abbreviation. Hey, I
wrote a book on progamming style and spent weeks on nit-picky things
like this to save you from it.

A motre serious question is: What the heck does this mean in your data
model? What is the transformation rule?

a) Order the old numbers and re-number? ASC or DESC?
b) Random physical order, so we know you have no data model !
c) a formula that you did not give in the lack of specs you posted?

Frankly, this sounds like a kludge to clean up something. Can you tell
us what you are actually trying to do? Or tell us that all you want is
the fastest, dirty, stinking proprietary kludge you can find, so we
will not waste time solving the real problems.

Help with a complicated query

SQL Apprentice Question
have a data set like the one below:

Qualifier Date Whole/Partial Partial Count
A 1/12/2000 W NULL
A 1/3/2001 P 1
A 1/6/2001 P 2
A 1/9/2001 P 3
A 1/12/2001 P 4
B 1/4/2001 W NULL
B 1/7/2001 P 1
B 1/10/2001 P 2
B 1/1/2002 P 3
B 1/1/2006 P 4
B 1/4/2002 P 4

There are more than a million rows of data in the database

What would be the most efficient SQL server query to extract for each
qualifier the
the latest 4 Partial dates given a reference date (like say getDate() or a
date)

Celko Answers
>> I have a data set [table?] like the one below: <<

Please post DDL, so that people do not have to guess what the keys,
constraints, Declarative Referential Integrity, data types, etc. in
your schema are. Sample data is also a good idea, along with clear
specifications. It is very hard to debug code when you do not let us
see it.

Why do you have a running count in a table? Also, learn the basics of
the trade like ISO-8601 data formats. Did you mean something more like
this, if you knew DDL?

CREATE TABLE Foobar
(foobar_qualifier CHAR(1) N0T NULL,
posting_date DATETIME DEFAULT CURRENT_TIMESTAMP NOT NULL,
wp_flag CHAR(1) DEFAULT 'W' NOT NULL
CHECK (foobar_qualifier IN ('W', 'P')),
PRIMARY KEY (foobar_qualifier, posting_date)); -- wild guess?

>> What would be the most efficient SQL server query to extract for each qualifier the the latest 4 Partial dates given a reference date (like say CURRENT_TIMESTAMP or a

date) <<

SELECT foobar_qualifier, posting_date,
ROW_NUMBER() OVER (PARTITION BY foobar_qualifier ORDER BY
posting_date)
AS rn
FROM Foobar
WHERE rn <= 4
AND wp_flag = 'P'
AND posting_date >= @my_date;

Untested, SQL 2005.

Case statement question

SQL Apprentice Question
Hi All,
Is it possible to update a specific field based on the case statement?
Something like this, for example:

UPDATE Person SET
CASE
WHEN HB = '1' THEN HasBrother = '1'
WHEN HS = '1' THEN HasSister = '1'
END
FROM
.....

I know one way is to specify both fields and have CASE statement for both of them, something like:
UPDATE Person SET
HasBrother = CASE
WHEN HB = '1' THEN '1'
ELSE HasBrother
END
HasSister = CASE
WHEN HB = '1' THEN '1'
ELSE HasSister
END
FROM
......

But I am curious if first example was possible.

Celko Answers
Better schema design and encodings! SQL is a data language and not a
procedural language.

CREATE TABLE Persons
( ..
sibling_status DEFAULT 0 INTEGER NOT NULL
CHECK (sibling_status IN (0, 1, 2, 3)),
..);

0= no siblings
1= has brothers
2= has sisters
3= has brothers and sisters

>> Of course, other than dynamically building the SQL statement. <<

Writing dynamic SQL is like playhing a video game where you have no
idea what will happen next, so you just start shooting at the problem
at run time :)

merge two date ranges with overlap from two tables

SQL Apprentice Question
I have two tables:

pat01 record patient status1 from date to date.

patientID startdate enddate status1
1 2000-01-01 2000-01-25 Good
1 2000-01-26 2000-02-25 not so Good
1 2000-02-26 2000-03-01 BAD
2 2000-01-01 2000-04-01 BAD

and another table pat02 record patient status2 from date to date

patientID startdate2 enddate2 status2
1 2000-01-01 2000-01-10 CHECKED
1 2000-01-11 2000-02-01 UNCHECKED

Please note that there are overlap between date range in two tables for
the same patient.Now I want to merge those two tables into one:

PATIENTID STARTDATE ENDDATE STATUS1 STATUS2
1 2000-01-01 2000-01-10 Good
CHECKED
1 2000-01-11 2000-01-25 Good
UNCHECKED
1 2000-01-26 2000-02-01 NOT SO GOOD
unchecked
1 2000-02-02 2000-02-25 not so good
null
1 2000-02-26 2000-03-01 BAD
null
2 2000-01-01 2000-04-01 BAD
null

Both table have about 1 million rows, more than 10 years data. I know I
can use number table to create row for each patient each day then
merge them but I am worry about the performance. Does anyone have other
ideas?

Thanks

DDL:

create table pat01
(PatientID int,
StartDate datetime,
EndDate datetime,
Status1 varchar(20)
)

insert into pat01 values (1, '2000-01-01','2000-01-25', 'Good')
insert into pat01 values (1, '2000-01-26','2000-02-25', 'not so Good')
insert into pat01 values (1, '2000-02-26','2000-03-01', 'BAD')
insert into pat01 values (2, '2000-01-01','2000-04-01', 'BAD')

create table pat02
(PatientID int,
StartDate2 datetime,
EndDate2 datetime,
Status2 varchar(20)
)

insert into pat01 values (1, '2000-01-01','2000-01-10', 'CHECKED')
insert into pat01 values (1, '2000-01-11','2000-02-01', 'UNCHECKED')

Celko Answers

>> Both table have about 1 million rows, more than 10 years data. I know I can use number table to create row for each patient each day then merge them but I am worry about the performance. <<

So basically something like this:

SELECT Pat01.patient_id, C1.cal_date, Pat01.status1, Pat02.status2
FROM Pat01, Pat02, Calendar AS C1
WHERE Pat01.patient_id = Pat02.patient_id
AND C1.cal_date BETWEEN Pat01.start_date AND Pat01.end_date
AND C1.cal_date BETWEEN Pat02.start_date AND Pat02.end_date
AND C1.cal_date BETWEEN @my_start_date AND @my_end_date;

so figure we have 3652+ rows per patient_id, assuming they all stay in
the study for 10 years. How many patients? Could be ugly. But I
also assume this is a one-shot data warehousing kind of thing, so this
might not be a bad way to go.

I cannot think of an answer right now, but I have the feeling that you
can use the new OLAP functions to build the ranges..

SQL ... Order of operations with Logical Operators

SQL Apprentice Question
Can anyone advise how operations are treated when AND and OR are
combined?

eg

WHERE: Condition 1 > X AND Condition 2 < Y OR Condition 3 = Z

Would the result of this query be to return results that

a) Satisfied Condition 1 AND Condition 2 or alternatively, Condition 3
regarless of Conditions 1 AND 2

OR

b) Condition 1 AND Condition 2 OR Condition 1 AND Condition 3

Thanks

Celko Answers
The usual way:
parens
NOT
AND
OR

Design Difficulties

SQL Apprentice Question
I've a project that requires a billing system that has different
requirements for each payment recipient and I was wondering how I'm to
design an efficient DB. Each payment requirement is stipulated in the
contracts w/ each payment recipient.

Some are very easy, just a price per item per month.

One recipient requires that the fee charged be based on different time
frames. E.g.:

1/1/2004 - 12/31/2004 $0.5 per item per month
1/1/2005 - 6/30/2005 $.04 per item per month
7/1/2005 - (until end) $.045 per item per month

Then there are just the flat-fee recipeints: e.g., $5,000.00 per year.

There are recipients that want different fees per area where their
service is given:

Latin America: $0.03 per item per month
UK: $0.05 per item per month
US: $0.06 per item per month

and it goes on like that.

There are those that will receive a different fee base on their ranking
in the market (how many items sold in an area).

We get a report for each area, for each month stating how many items
have been processed.

Do you believe I have to set up an SP that will calculate the payment
for each recipient? And, I'm thinking I'll have to have the data
hard-coded into the SP instead of building many tables to accommodate
the various data structures that would be necessary for each
payment/calculation type - including the necessity of building new
tables/stored procedures for any new contract that requries new payment
criteria.

Any thoughts welcome.

Celko Answers
>>I've a project that requires a billing system that has different requirements for each payment recipient .. Each payment requirement is stipulated in the contracts w/ each payment recipient .. Some are very easy, .. flat-fee recipeints.. recipients that want different fees per area where their service is given .. that will receive a different fee base on their ranking in the market (how many items sold in an area). <<

My first approach, without knowing more than this posting, would be to
create a single payments table with rows that give (recipient_id,
payment_date, payment_amt, payment_method, etc). Write a series of
front end procedure for creating the entire set of rows all at once. I
am assuming that "ranking in the market" takes place at the start of
the contract and not at payment time.

The app then goes to today's work and cuts a check to recipients "WHERE
ayment_date = CURRENT_TIMESTAMP" every morning.

SQL is not a computational language; it works much better with data.

>> I'm thinking I'll have to have the data hard-coded into the SP instead of building many tables to accommodate the various data structures that would be necessary for each payment/calculation type - including the necessity of building new tables/stored procedures for any new contract that requries new payment criteria.<<

hard wired code is not a good idea in SQL; the language is designed to
work with tables.
Think data, not process. Think complete sets that model complete facts
(like all the payments in a contract).

find null value in any column

SQL Apprentice Question
Assume I have a table with only 1 row and all column is 'int'. I just
wonder is it possible to use SQL to check if any one of those column
contain a 'null' without knowing the column name?

For example,

create t1 ( i1 int null, i2 int null ....)
insert into t1 (i1, i2, ...) values (1, 1, 1, ... , null, 1, ... null,
..)

and check if any of those column contain null...

Celko Answers
>> Assume I have a table with only 1 row and all column is INTEGER. I just wonder is it possible to use SQL to check if any one of those column contain a 'null' without knowing the column name? <<

SELECT 'yes'
FROM Foobar
WHERE (c1 + c2 + c3+ .. + cn) IS NULL;

eliminating the use of temp tables with derived tables

SQL Apprentice Question
I am trying to rewrite my procedure to use a derived table instead of temp
table to test if this increases performance.

I was using this article to help but I receive a error message where I am
using a fuction to build my parameter list. Could someone review my current
proc to provide a suggestion in using a derived table instead of using a temp
table?

CREATE TABLE #Metrolist (metroid varchar(100))

INSERT #Metrolist
SELECT * FROM db.rcudf_ConvertMetroTabletoList(@metroidlist)

BEGIN TRAN

SELECT

CASE WHEN @groupingtimeframe = 'daily' THEN DATEADD(dd,DATEDIFF(day,
0,r.datecreated) ,0)
WHEN @groupingtimeframe = 'weekly' THEN DATEADD(ww, DATEDIFF(week,0,
r.datecreated),-1)
WHEN @groupingtimeframe = 'monthly' THEN DATEADD(mm,DATEDIFF(month,
0,r.datecreated),0)
WHEN @groupingtimeframe = 'yearly' THEN DATEADD(yy,DATEDIFF(year,
0,r.datecreated), 0)
END as 'DATE',
Count(*) as 'Leads',
m.metroname, m.metroid

FROM rentclicks.rcadreplies as r
JOIN db.rcads as a on a.adid = r.adid
JOIN db.rczipcode AS z ON z.zipcode = a.zip
JOIN db.rcarea AS ar ON ar.areaid = z.primaryareaid
JOIN db.rcmetro AS m ON m.metroid = ar.primarymetroid
JOIN #Metrolist as ml on ml.metroid = m.metroid

WHERE r.datecreated >= @startdate AND r.datecreated < DATEADD(dd, 1,
@enddate)

GROUP BY CASE WHEN @groupingtimeframe = 'daily' THEN
DATEADD(dd,DATEDIFF(day, 0,r.datecreated) ,0)
WHEN @groupingtimeframe = 'weekly' THEN DATEADD(ww, DATEDIFF(week,0,
r.datecreated),-1)
WHEN @groupingtimeframe = 'monthly' THEN DATEADD(mm,DATEDIFF(month,
0,r.datecreated),0)
WHEN @groupingtimeframe = 'yearly' THEN DATEADD(yy,DATEDIFF(year,
0,r.datecreated), 0)
END, m.metroname, m.metroid

ORDER BY CASE WHEN @groupingtimeframe = 'daily' THEN
DATEADD(dd,DATEDIFF(day, 0,r.datecreated) ,0)
WHEN @groupingtimeframe = 'weekly' THEN DATEADD(ww, DATEDIFF(week,0,
r.datecreated),-1)
WHEN @groupingtimeframe = 'monthly' THEN DATEADD(mm,DATEDIFF(month,
0,r.datecreated),0)
WHEN @groupingtimeframe = 'yearly' THEN DATEADD(yy,DATEDIFF(year,
0,r.datecreated), 0)
END ASC

Celko Answers
>> I am trying to rewrite my procedure to use a derived table instead of temp table to test if this increases performance. <<

Worry about the things that REALLY matter in in the software life cycle
of a successful project; maintainability is first (80%+ of the cost of
software is here), portability (15%+ of the cost is here) and
performance shows up as a cost factor only when it really stinks and it
is usually the easiest (i.e. cheapest) one to fix.

Oh, you might want to learn Standard SQL. The syntax is INSERT INTO,
not the abbreviated INSERT, like a hillbilly dropping words in his
sentences. We do not use single around a column alias.

Good SQL programmers do not use Sybase/SQL Server style temp tables but
write with derived tables, VIEWs and CTEs. And would **never** use a
function call if we did use a temp table. What that says is that this
programmer does not understand the basic idea of a declarative
language.

What is the difference berween a zip and a zip_code? Two names means
two TOTALLY DIFFERENT DATA ELEMENTS. But it also means that you did
not build a data dictionary for the enterprise, doesn't it?

Finally, you did a lot of proprietary temporal math to make sure that
your query cannot use indexes. Build a calendar table with the proper
date ranges:

CREATE TABLE ReportRanges
(cal_date DATETIME NOT NULL PRIMARY KEY, --, start date
day_end_date DATETIME NOT NULL,
week_end_date DATETIME NOT NULL,
month_end_date DATETIME NOT NULL,
year_end_date DATETIME NOT NULL);

Now use a predicate like this:

{@my_period BETWEEN cal_date
AND CASE @report_period
WHEN 'dailey' THEN day_end_date
WHEN 'weekly' THEN week_end_date
WHEN 'monthly' THEN month_end_date
WHEN ''annual' THEN year_end_date
ELSE NULL END

SQL is a declarative language that is designed for data. It is not a
procedural language the way you are trying to use it. You are trying
to talk English with Japanse grammar rules :)

stored proc to new table

SQL Apprentice Question
what's the easiest way to get the results of a stored procedure into a new
table?

thanks,

Celko Answers

>> what's the easiest way to get the results of a stored procedure into a new table? <<

Quick answer: put an INSERT INTO statement in the body of the stored
procedure.

Better answer: you should never be creating table on the fly. You can
put the SELECT into a VIEW and call it that way. Based on cleaning up
non-RDBMS designs written in SQL, I will guess that you are mimicking a
scratch tape.

If SQL were a 1950's tape file system language when we did not have
much disk storage and it costs a fortune. One **procedural** step in
the procedure would write a result to a new tape, then the next step
would do more work with it. I will also bet that you have cursors in
your code.

Once you learn to actually think in a declarative language like SQL,
you will not even consider such kludges. You will ask for what you
want in one statement.

Monday, July 17, 2006

Referential integrity in DW necessary??

SQL Apprentice Question
In a data warehousing application, what is the impact of imposing
referential integrity on the database side? Does it help or degrade the
performance considering the complex transformations that take place during
the ETL process. I have read a few articles suggesting not to impose
referential integrity on the Data Warehouse but would like to hear more
views on this. Request you to please share your previous experiences on
this....
Thanks in advance.
Celko Answers
>> In a data warehousing application, what is the impact of imposing referential integrity on the database side? <<

Unlike OLTP, a warehouse is scrubbed BEFORE it is persisted in the
schema and then it is STATIC. There is not (well, should not be) any
need to have RI or a lot of constraints. What you want ina data
warehouse is access methods for aggregation of bulk data. I have a
book on OLAP & ANALYTICS IN SQL due out in a few months which gives an
over view of the issues

Complicated updates

SQL Apprentice Question
I am trying to do a stored procedure to delete & demote records. I have a
table with Client/Matter #'s where sometimes the matter number needs to be
deleted and so all matters above need to be demoted. For instance, my
client number is 113245, and I have matters 1, 2, 3, 4, 5, and so on.
Matter #3 needs to be deleted, so we rename 4 to 3, and 5 to 4, etc., etc.

My problem is because of relationships, it's turning out to not be so easy.

I have 3 tables that are set up with this client matter number (don't get me
started on why the tables are set up this way!). Each Table has a client
field and a matter field. Client/Matter is the primary key on tbl1, with a
relationship on tbl2 & 3 that says both of those fields have to match tbl1.

I cannot delete matter 3 from tbl1, because of the foreign keys on tbls 2 &
3. I cannot rename the matter because of the foreign keys. I cannot rename
the matter in tbls2 & 3 to something that doesn't exist in tbl1. I am not
sure how to do this. Can anyone tell me what steps to take to get around
these relationships? Thank you for your help.

Celko Answers
Please post DDL, so that people do not have to guess what the keys,
constraints, Declarative Referential Integrity, data types, etc. in
your schema are. Sample data is also a good idea, along with clear
specifications. It is very hard to debug code when you do not let us
see it.

The basic answer is simple:

UPDATE Foobar
SET foo_nbr
= (SELECT COUNT (F1.foo_key)
FROM Foobar AS F1
WHERE Foobar.foo_key = F1.fookey
AND F1.item_nbr <= Foobar.item_nbr);

Trying to avoid a stored procedure, is this possible?

SQL Apprentice Question
I have a single table that has two columns pkID & Notes. For each pkID there
can be several rows, what I would like to accomplish and know if possible.
Is to create a column, that I can number each row for a given pkID
sequentially.

pkID Notes NoteNumber
1 1
2 1
2 2
2 3
3 1
3 2

etc....... for each pkID in the table.

Celko Answers
Please post DDL, so that people do not have to guess what the keys,
constraints, Declarative Referential Integrity, data types, etc. in
your schema are. Sample data is also a good idea, along with clear
specifications. It is very hard to debug code when you do not let us
see it.

My guess is that you meant something like this, with better names,
constraints, a keyu, etc -- the basic stuff:

CREATE TABLE DocumentNotes
(document_nbr INTEGER NOT NULL,
note_nbr INTEGER NOT NULL
CHECK ( note_nbr > 0),
PRIMARY KEY (document_nbr, note_nbr),
note_txt VARCHAR(250) NOT NULL);

>> know if possible Is to create a column, that I can number each row for a given pkID

sequentially. <<

You can enforce that constraint with:
1) a TRIGGER in T-SQL
2) a CHECK() constraint in SQL-92 and up
3) a VIEW in SQL-89 and up
4) an OLAP function in SQL-99 and up

What behavior do you want? Should the notes re-number when there is a
gao? Etc.

Question for T-SQL Gurus

SQL Apprentice Question
I have a table that relates emails together. I would like to leverage
this information to create a rollup key that can be used to identify
the entire group. Some groups will have one email some will have many.

Is there a way to approach this problem with a set based approach? I
know I can do it with a cursor but I prefer not to.

Please feel free to offer schema and/or modeling changes.

Thanks for looking!

CREATE TABLE dbo.email_key (
email_id INT identity
,email VARCHAR(100)
,rollup_id INT
)

CREATE TABLE dbo.email_email (
email_email_id INT identity
,email_id INT
,rel_email_id INT
)

CREATE TABLE dbo.rollup_key (
rollup_id INT identity
,primary_email_id INT)

TRUNCATE TABLE dbo.email_key
TRUNCATE TABLE dbo.email_email
TRUNCATE TABLE dbo.rollup_key

--SAMPLE DATA
INSERT INTO dbo.email_key (email) SELECT '...@abc.com'
INSERT INTO dbo.email_key (email) SELECT '...@abc.com'
INSERT INTO dbo.email_key (email) SELECT '...@abc.com'
INSERT INTO dbo.email_key (email) SELECT '...@abc.com'

--SAMPLE DATA
INSERT INTO dbo.email_email (email_id,rel_email_id) SELECT 1,2
INSERT INTO dbo.email_email (email_id,rel_email_id) SELECT 1,3
INSERT INTO dbo.email_email (email_id,rel_email_id) SELECT 2,1
INSERT INTO dbo.email_email (email_id,rel_email_id) SELECT 3,1
INSERT INTO dbo.email_email (email_id,rel_email_id) SELECT 2,3

--ONE POSSIBLE END RESULT
INSERT INTO dbo.rollup_key (primary_email_id) SELECT 1
INSERT INTO dbo.rollup_key (primary_email_id) SELECT 4
--RUN A COUPLE UPDATES TO SET email_key.rollup_id

Celko Answers
>> I have a table that relates emails together. I would like to leverage this information to create a rollup key that can be used to identify the entire group. <<

Roll ups are mathematical operations done on a hierachy. Google ROLLUP
and CUBE for details. Are you talking about building a tree structure
to show a newsgroup thread?

>> Is there a way to approach this problem with a set based approach? <<

Get a copy of TREES & HIERARCHIES IN SQL for several approaches for
this kind of problem.

You mgiht also want to get a basic data modeling book, so you will not
have table names that contain "key" as affix, will stop using IDENTITY
and will create tables with relational keys. Also, think about how bad
"email_email" is as a data element name. The proper syntax is "INSERT
INTO .. VALUES()", and not the proprietary SELECT you used.

Query to find amissing number

SQL Apprentice Question
I need to write a query to find out a set of missing number in a given
sequence.

Eg : a Column in some table has the following data

Col1

1
2
3
4
5
6
8
9
10

Here I need to write a query to find out that number 7 is missing in the
given sequence.

One possible solution is by using any loop. But I am looking out if the same
can be achieved using any query.

Thanks in advance.

Celko Answers

>> I need to write a query to find out a set of missing numbers in a given sequence. <<

Let's assume we have a table of people who bought tickets that are
supposed to be in sequential order and we want to make a list of what
is missing in each buyer's set of tickets.

CREATE TABLE Tickets
(buyer CHAR(5) NOT NULL,
ticket_nbr INTEGER DEFAULT 1 NOT NULL
CHECK (ticket_nbr > 0),
PRIMARY KEY (buyer, ticket_nbr));

INSERT INTO Tickets VALUES ('a', 2);
INSERT INTO Tickets VALUES ('a', 3);
INSERT INTO Tickets VALUES ('a', 4);
INSERT INTO Tickets VALUES ('b', 4);
INSERT INTO Tickets VALUES ('c', 1);
INSERT INTO Tickets VALUES ('c', 2);
INSERT INTO Tickets VALUES ('c', 3);
INSERT INTO Tickets VALUES ('c', 4);
INSERT INTO Tickets VALUES ('c', 5);
INSERT INTO Tickets VALUES ('d', 1);
INSERT INTO Tickets VALUES ('d', 6);
INSERT INTO Tickets VALUES ('d', 7);
INSERT INTO Tickets VALUES ('d', 9);
INSERT INTO Tickets VALUES ('e', 10);

If we can assume that there is a relatively small number of Tickets,
then you could use a table of sequential numbers from 1 to (n) and
write:

SELECT DISTINCT T1.buyer, S1.seq
FROM Tickets AS T1, Sequence AS S1
WHERE seq <= (SELECT MAX(ticket_nbr) -- set the range
FROM Tickets AS T2
WHERE T1.buyer = T2.buyer)
AND seq NOT IN (SELECT ticket_nbr -- get missing numbers
FROM Tickets AS T3
WHERE T1.buyer = T3.buyer);

Another version:

BEGIN
SELECT *
INTO #foobar
FROM Tickets
UNION ALL
SELECT DISTINCT buyer, 0
FROM Tickets;

SELECT T1.buyer,
(T1.ticket_nbr + 1) AS gap_start,
(MIN(T2.ticket_nbr) - 1) AS gap_end
FROM --Tickets AS T1,
#foobar AS T1,
Tickets AS T2
WHERE T1.ticket_nbr < T2.ticket_nbr
AND T1.buyer = T2.buyer
GROUP BY T1.buyer, T1.ticket_nbr
HAVING MIN(T2.ticket_nbr) - T1.ticket_nbr > 1;

END;

The trick here is to add a zero to act as a boundary when 1 is missing
from the sequence.

In Standard SQL-92, you could write the UNION ALL expression directly
in the FROM clause.

weird sql query problem

SQL Apprentice Question
I dont have an DDL for this so I am just going to write it out by hand

CREATE TABLEA
(
AddressID BIGINT,
Address1 nvarchar(100),
Address2 nvarchar(200),
zipcode nvarchar(5)
)
Primary key is AddressID

CREATE Table JunctionA
(
AddressID BIGINT,
PersonID BIGINT,
Description nvarchar(300)
)

and say I have this same data

TableA===========

1 123 test st. '' 12345
2 123 test st '' 12345
3 230 1st st '' 12345
4 345 2nd st '' 90122

Junction==========
1 2 desc1
1 4 desc3
1 3 desc3
2 2 desc2
3 2 desc34
3 1 desc9

as you can see there are 2 addresses in tablea (ID 1 AND 2) that are the
same, and in the junction table, person 2 references both of them! how can I
fix this with a query so every item in tableA is unique while making all the
items in the junction change their references from the ones I want to delete
in tablea to the one that stays?

Celko Answers
This was pretty sloppy even for an example -- DDL is easy to write; do
you really have a need for BIGINT? Usually huge and proprietary are
bad things. Why did you invent variable length zip coes and address
lines that exceed the USPS CHAR(35)?

CREATE TABLE Addresses
(address_id INTEGER NOT NULL PRIMARY KEY,
address1 CHAR(35) NOT NULL,
address2 CHAR(35) NOT NULL,
zip_code CHAR (5) NOT NULL
CHECK (zip_code LIKE '[0-9][0-9][0-9][0-9][0-9]'),
UNIQUE (address1, address2, zip_code) -- prevents your problem!
);

The correct terms are "referencing" and "reference" tables. I have no
idea who invented "Junction" as a term for a table that holds a
relationship.

CREATE TABLE Mailings
(address_id INTEGER NOT NULL
REFERENCES Addresses(address_id)
ON UPDATE CASCADE
ON DELETE CASCADE,
person_id INTEGER NOT NULL
REFERENCES Personnel(person_id)
ON UPDATE CASCADE
ON DELETE CASCADE,
mailing_description VARCHAR(300) NOT NULL, -- 300? guess or research?
PRIMARY KEY (address_id, person_id)
);

Look at how the DRI actions and the constraints are doing most of your
work for you. The UNIQUE() constraint can be expensive and hard to
write because you are dealing with English ("Lane" = "Ln", "Street" =
"St" = "Str"). You probably should look at Melisa Data or SSA software
for cleaning address data -- SSA has a great little booklet on the
problems that will scare you to death.

Variables in SP do not compare as equal when both are NULL

SQL Apprentice Question
DB2 8.2 for Linux, FP 10 (also performs the same on DB2 8.2 for Windoes, FP
11).

Using the SAMPLE database, tables EMP and EMLOYEE.

In the followng stored procedure, 2 NULL columns (COMM) are selected into 2
different SP variables and compared for equal. They are both NULL, but do
not compare as equal. When the Not NULL columns (SALARY) are compared, they
do compare as equal.

Is there a reason for this?

Statement terminator is @.

-- SET THE COMMISION TO NULL ON TWO SAMPLE TABLES FOR EMPNO '000010'

UPDATE EMP SET COMM = NULL WHERE EMPNO = '000010'@
UPDATE EMPLOYEE SET COMM = NULL WHERE EMPNO = '000010'@

--------------------------------------------------
DROP PROCEDURE TEST_SP@

CREATE PROCEDURE TEST_SP
(
OUT EMP_COMM DECIMAL(9,2),
OUT EMP_SALARY DECIMAL(9,2),
OUT EMPLOYEE_COMM DECIMAL(9,2),
OUT EMPLOYEE_SALARY DECIMAL(9,2),
OUT v_ERRMSG_1 VARCHAR(500),
OUT v_ERRMSG_2 VARCHAR(500)
)
SPECIFIC TEST_SP
RESULT SETS 0
MODIFIES SQL DATA
LANGUAGE SQL
BEGIN

---------------------------------------------
-- Declare variables
---------------------------------------------

DECLARE not_found CONDITION FOR SQLSTATE '02000';
DECLARE at_end SMALLINT DEFAULT 0;

DECLARE A_EMPNO CHAR(6);
DECLARE A_EMP_COMM DECIMAL(9,2);
DECLARE A_EMP_SALARY DECIMAL(9,2);

DECLARE B_EMPNO CHAR(6);
DECLARE B_EMPLOYEE_COMM DECIMAL(9,2);
DECLARE B_EMPLOYEE_SALARY DECIMAL(9,2);

---------------------------------------------
-- Declare exit handlers
---------------------------------------------

DECLARE EXIT HANDLER FOR SQLEXCEPTION
GET DIAGNOSTICS EXCEPTION 1 v_ERRMSG_1 = MESSAGE_TEXT;

DECLARE CONTINUE HANDLER FOR not_found
SET at_end = 1;

DECLARE EXIT HANDLER FOR SQLWARNING
GET DIAGNOSTICS EXCEPTION 1 v_ERRMSG_1 = MESSAGE_TEXT;

---------------------------------------------------------------
-- Begin processing logic
---------------------------------------------------------------

SELECT EMPNO, COMM, SALARY INTO A_EMPNO, A_EMP_COMM, A_EMP_SALARY
FROM DB2INST1.EMP
WHERE EMPNO = '000010';

SELECT EMPNO, COMM, SALARY INTO B_EMPNO, B_EMPLOYEE_COMM, B_EMPLOYEE_SALARY
FROM DB2INST1.EMPLOYEE
WHERE EMPNO = A_EMPNO;

IF A_EMP_COMM = B_EMPLOYEE_COMM THEN
SET V_ERRMSG_1 = 'NULLS COMPARE AS EQUAL';
ELSE
SET V_ERRMSG_1 = 'NULLS DO NOT COMPARE AS EQUAL';
END IF;

IF A_EMP_SALARY = B_EMPLOYEE_SALARY THEN
SET V_ERRMSG_2 = 'NON-NULLS COMPARE AS EQUAL';
ELSE
SET V_ERRMSG_2 = 'NON-NULLS DO NOT COMPARE AS EQUAL';
END IF;

SET EMP_COMM = A_EMP_COMM;
SET EMP_SALARY = A_EMP_SALARY;

SET EMPLOYEE_COMM = B_EMPLOYEE_COMM;
SET EMPLOYEE_SALARY = B_EMPLOYEE_SALARY;

END@

CALL TEST_SP (?,?,?,?,?,?)@

Value of output parameters
--------------------------
Parameter Name : EMP_COMM
Parameter Value : -

Parameter Name : EMP_SALARY
Parameter Value : 52750.00

Parameter Name : EMPLOYEE_COMM
Parameter Value : -

Parameter Name : EMPLOYEE_SALARY
Parameter Value : 52750.00

Parameter Name : V_ERRMSG_1
Parameter Value : NULLS DO NOT COMPARE AS EQUAL

Parameter Name : V_ERRMSG_2
Parameter Value : NON-NULLS COMPARE AS EQUAL

Return Status = 0

Celko Answers
>> Yes, it is clear that DB2 follows the ANSI standard, but it is not clear that the ANSI standard is consistent (as noted above) or is the preferred interpretation, and there are several databases that apparently use different interpretations (unless you set them to be ANSI compliant).<<

GROUPING is not the same as testing for equality. Grouping is done
with groups and equality is for scalars. We debated this in the old
ANSI X3H2 committee decades ago when it was still an issue.

If you want to see if some columns are all NULL, use "COALESCE (c1, c2,
,,cn) IS NULL" on your scalars.

Using Decimal as a Key

SQL Apprentice Question
Is there any disadvantages of using Decimal type as a primary key? If
so, are there other data types that can hold large numbers (more than
an int can hold: 13 digits or so) and perform better as a Key field?

Celko Answers
>> Is there any disadvantages of using Decimal type as a primary key? If so, are there other data types that can hold large numbers (more than an int can hold: 13 digits or so) and perform better as a Key field [sic] ? <<

This question is like "What car should I buy?"

Look at the nature of the data from a logical view point. Is it fixed
length? Are there constraints on it? Your insanely vague pseudo-specs
are useless. And you don't know that a column and field re totally
different concepts, which is part of why you don't know how to give a
spec.

For a key, you should be using an industry standard code whenever
possible -- say a VIN, ISBN, UPC, EAN, etc. If this industry standard
is expressed as a decimal (ICD, Dewey Classification, etc.) then use
DECIMAL(s,p) or NUMERIC(s,p).

The worst thing you could is grab an absurd data type like FLOAT or
BIGINT to invite garbage and in the case of BIGINT prevent portable
code. CHAR(n) is a good choice for a hierarchical encoding.

Transact-SQL SUM Help!

SQL Apprentice Question
I am trying to tally up information from two tables, but I am running
in to trouble and I can't seem to figure this out. I have a
CreditsTable and a DebitsTable and I am trying to create a querry that
sums up the info from each per UserID and returns TotalCredits,
TotalDebits, and a Ballance.


CreditsTable 
UserID  Ammount Applied 
+----------+----------+----------+ 
 |  192    |  1       |  True   | 
 |  192    |  2       |  True   | 
 |  207    |  1       |  True   | 
 |  207    |  1       |  True   | 
 |  207    |  2       |  True   | 
 |  212    |  3       |  True   | 


DebitsTable 
UserID  Ammount Applied 
+----------+----------+----------+ 
 |  192    |  1       |  True   | 
 |  207    |  1       |  True   | 
 |  207    |  1       |  True   |

***This is the Function I have tried, but it doesn't return the correct
results

ALTER FUNCTION [dbo].[BallanceTotals]()
RETURNS TABLE
AS
RETURN
(
SELECT DISTINCT
dbo.CreditsTable.UserID, SUM(dbo.CreditsTable.Ammount) AS TotalCredits,
SUM(dbo.DebitsTable.Ammount) AS TotalDebits,
SUM(dbo.CreditsTable.Ammount - dbo.DebitsTable.Ammount) AS Ballance
FROM
dbo.CreditsTable FULL OUTER JOIN
dbo.DebitsTable ON dbo.CreditsTable.UserID = dbo.DebitsTable.UserID
WHERE
(dbo.CreditsTable.Applied = 1) OR (dbo.DebitsTable.Applied = 1)
GROUP BY
dbo.CreditsTable.UserID
)

*** This is what it returns, it is not adding things up correctly (it
looks like it is adding NULL values as 1 instead of 0 or something)


BallanceTotals 
              Total     Total 
UserID    Credits  Debits  Ballance 
+----------+----------+----------+----------+ 
 |  192    |  3       |  2        |  1       | 
 |  207    |  4       |  3        |  1       | 
 |  212    |  3       |            |           | 


*** This is what I want it to return! 


BallanceTotals 
              Total     Total 
UserID    Credits  Debits  Ballance 
+----------+----------+----------+----------+ 
 |  192    |  3       |  1        |  2       | 
 |  207    |  4       |  2        |  2       | 
 |  212    |  3       |  0        |  3       |

I would really appreciate some help in getting this to work correctly!

Celko Answers

>>. I have a CreditsTable and a DebitsTable and I am trying to create a querry that sums up the info from each per UserID and returns TotalCredits, TotalDebits, and a Ballance. <<

This design flaw is called attribute splitting. You need a single
table, say Accounts, that shows events related to each account. Also,
we do not have BOOLEAN data types in SQL. It would be much more
meaningful to show the time that a posting was made to the account.

CREATE TABLE Accounts
(user_id INTEGER NOT NULL,
posting_date DATETIME DEFAULT CURRENT_TIMESTAMP NOT NULL,
posting_amt DECIMAL (12,4) NOT NULL
CHECK (posting_amt <> 0.00));

We have negative numbers and do not have to use credit/debit columns as
they did in Roman times.

Monday, July 10, 2006

Trees, recursion, and grouping

SQL Apprentice Question
In a DB2 UDB LUW table, I have a table with pairs of equivalent ID's.
What I want to do is assign all equivalent IDs to the same group
number, including those that are transitively related, i.e., if A = B
and B = C then A = C, so I'd group all three together.

Although they're not related in a composition fashion per se, it seems
like the way to go conceptually is to consider the ID relationships as
a reporting tree (ID_A would be the manager and ID_B would be the
employee) and assign all IDs that share the same root to the same
group.

For instance, let's say I have the following pairs

ID_A ID_B
---- ----
1800 1804
1800 1808
1806 1809
1808 1810
1808 1812
1809 1815
1810 1811

I'd have two trees (sideways):

1800 1804
1808
1810
1811
1812
and

1806
1809
1815

I'm struggling with the following:

1. How to group based on a shared *root* (I'd hate to have to build a
chain column, e.g., for 1811: "1800-->1808-->1810" and do something
like DENSE_RANK() OVER(ORDER BY SUBSTR(CHAIN,1,4)--that seems
unreliable, and the ID is not always the same length)

2. How to write a recursive CTE that accomodates multiple, independent,
trees.

What I'd like to end up with is this:

ID GRP
---- ---
1800 1
1804 1
1808 1
1810 1
1811 1
1812 1
1806 2
1809 2
1815 2

I feel like I'm close--I've read Serge's "CONNECT BY" article and
Molinaro's chapter "Hierarchical Queries" in his _SQL Cookbook_, but
I'm just not able to stitch it all together.

Would anyone care to lend a hand?

Thanks

Celko Answers
>> I feel like I'm close--I've read Serge's "CONNECT BY" article and Molinaro's chapter "Hierarchical Queries" in his _SQL Cookbook_, but I'm just not able to stitch it all together. <<

You should have been reading Celko's TREES & HIERARCHIES IN SQL instead
:)!

What you have is a forest, not a tree and you can do this with a nested
sets model

CREATE TABLE FoobarForest
(foobar_id INTEGER NOT NULL PRIMARY KEY, --assumption
tree_nbr INTEGER NOT NULL,
lft INTEGER NOT NULL CHECK (lft > 0),
rgt INTEGER NOT NULL,
CHECK (lft < rgt),
UNIQUE (tree_nbr, lft));

Instead of a recursive CTE, pick the roots and use a simple push-down
stack for tree traversal. Here is a SQL/PSM version for one tree:

- Tree holds the adjacency model
CREATE TABLE Tree
(node CHAR(10) NOT NULL,
parent CHAR(10));

-- Stack starts empty, will holds the nested set model
CREATE TABLE Stack
(stack_top INTEGER NOT NULL,
node CHAR(10) NOT NULL,
lft INTEGER,
rgt INTEGER);

CREATE PROCEDURE TreeTraversal ()
LANGUAGE SQL
DETERMINISTIC
BEGIN ATOMIC
DECLARE counter INTEGER;
DECLARE max_counter INTEGER;
DECLARE current_top INTEGER;

SET counter = 2;
SET max_counter = 2 * (SELECT COUNT(*) FROM Tree);
SET current_top = 1;

--clear the stack
DELETE FROM Stack;

-- push the root
INSERT INTO Stack
SELECT 1, node, 1, max_counter
FROM Tree
WHERE parent IS NULL;

-- delete rows from tree as they are used
DELETE FROM Tree WHERE parent IS NULL;

WHILE counter <= max_counter- 1
DO IF EXISTS (SELECT *
FROM Stack AS S1, Tree AS T1
WHERE S1.node = T1.parent
AND S1.stack_top = current_top)
THEN BEGIN -- push when top has subordinates and set lft value
INSERT INTO Stack
SELECT (current_top + 1), MIN(T1.node), counter, NULL
FROM Stack AS S1, Tree AS T1
WHERE S1.node = T1.parent
AND S1.stack_top = current_top;

-- delete rows from tree as they are used
DELETE FROM Tree
WHERE node = (SELECT node
FROM Stack
WHERE stack_top = current_top + 1);
-- housekeeping of stack pointers and counter
SET counter = counter + 1;
SET current_top = current_top + 1;
END;
ELSE
BEGIN -- pop the stack and set rgt value
UPDATE Stack
SET rgt = counter,
stack_top = -stack_top -- pops the stack
WHERE stack_top = current_top;
SET counter = counter + 1;
SET current_top = current_top - 1;
END;
END IF;
END WHILE;
-- SELECT node, lft, rgt FROM Stack;
-- the top column is not needed in the final answer
-- move stack contents to new tree table
END;

Friday, July 07, 2006

Implementing a Business Rule

SQL Apprentice Question
I need to implement a business rule. Here is the original DDL:

create table parent(parentID int not null,
ValidFrom datetime not null, ValidTo datetime not null)
go
create table child(childID int not null, parentID int not null,
ValidFrom datetime not null, ValidTo datetime not null)

Besides the obvious RI constraint, I need to make sure that the parent
row is valid for all the time the child one is valid:

(parent.ValidFrom <= child.ValidFrom) and
(child.ValidFrom < child.ValidTo) and
(child.ValidTo <= parent.ValidTo)

Here is my approach. It works, but at a rather steep price of adding
two 8 byte datetime columns (smalldatetime does not have enough
precision) on the child table and two composite 3-column indexes. I was
wondering if there is a cheaper approach, and I am in a very big hurry.
I am sure I'm not the first person to encounter such a problem, and I d
not want to reinvent the wheel, but I ahve a very limited time for
research. Thanks in advance for the suggestions!

drop table child
drop table parent
go
create table parent(parentID int not null,
ValidFrom datetime not null, ValidTo datetime not null)
go
alter table parent add constraint parentPK primary key(parentID)
go
alter table parent add constraint parentUK1 unique(parentID, ValidFrom,
ValidTo)
go
--alter table parent add constraint parentU2 unique(parentID, ValidTo)
go
alter table parent add constraint parentC1 check(ValidFrom < ValidTo)
go
set nocount on
insert parent values(1, '20060101','21000101')
go
create table child(childID int not null, parentID int not null,
ValidFrom datetime not null, ValidTo datetime not null,
parentValidFrom datetime not null, parentValidTo datetime not null)
go
alter table child add constraint childPK primary key(childID)
go
alter table child add constraint childFK1 foreign key(parentID,
parentValidFrom, parentValidTo)
references parent(parentID, ValidFrom, ValidTo) on update cascade
go
--alter table child add constraint childFK2 foreign key(parentID,
parentValidTo)
-- references parent(parentID, ValidTo) on update cascade
go
alter table child add constraint childC1 check((parentValidFrom <=
ValidFrom)
and (ValidFrom < ValidTo) and (ValidTo <= parentValidTo))
go
set nocount on
select 'next batch must succeed'
go
set nocount off
insert into child(childID, parentID, ValidFrom, ValidTo,
parentValidFrom, parentValidTo)
select 1, 1, '20060115', '21000101', '20060101', '21000101'
go
set nocount on
select 'next insert must fail (check)'
go
set nocount off
insert into child(childID, parentID, ValidFrom, ValidTo,
parentValidFrom, parentValidTo)
select 2, 1, '20050115', '21000101', '20060101', '21000101'
go
set nocount on
select 'next insert must fail (FK)'
go
set nocount off
insert into child(childID, parentID, ValidFrom, ValidTo,
parentValidFrom, parentValidTo)
select 2, 1, '20060115', '21000101', '20060103', '21000101'
go
set nocount on
select 'next update must fail (check)'
go
set nocount off
update parent set ValidTo = '20060819' where parentID=1
go
set nocount on
select 'next batch must succeed'
go
set nocount off
update child set ValidTo = '20060818' where parentID=1
update parent set ValidTo = '20060819' where parentID=1
go
select * from parent
select * from child

Celko Answers
Here is my guess at "the obvious RI constraints", and a little
clean up on the data element names.

CREATE TABLE Parents
(parent_id INTEGER NOT NULL PRIMARY KEY,
from_date DATETIME NOT NULL,
to_date DATETIME NOT NULL,
CHECK (from_date < to_date));

CREATE TABLE Children
(parent_id INTEGER NOT NULL
REFERENCES Parents (parent_id)
ON UPDATE CASCADE
ON DELETE CASCADE,
child_id INTEGER NOT NULL,
from_date DATETIME NOT NULL,
to_date DATETIME NOT NULL,
CHECK (from_date < to_date),
PRIMARY KEY (parent_id, child_id));

>> I need to make sure that the parent row is valid for all the time the child one is valid:

Try an updatable VIEW with a CHECK OPTION

CREATE VIEW ValidParents (parent_id, from_date, to_date
AS
SELECT P.parent_id, P.from_date, P.to_date
FROM Parents AS P
WHERE NOT EXISTS
(SELECT *
FROM Children AS C
WHERE C.parent_id = P.parent_id
AND (C.from_date NOT BETWEEN P.from_date
AND P.to_date
OR C.to_date NOT BETWEEN P.from_date
AND P.to_date)
WITH CHECK OPTION;

You can turn it "inside out" to get a valid children VIEW, too.

CREATE VIEW ValidChildren (parent_id, child_id, from_date, to_date
AS
SELECT P.parent_id, child_id, P.from_date, P.to_date
FROM Children AS C
WHERE NOT EXISTS
(SELECT *
FROM Parents AS P
WHERE C.parent_id = P.parent_id
AND (C.from_date NOT BETWEEN P.from_date
AND P.to_date
OR C.to_date NOT BETWEEN P.from_date
AND P.to_date)
WITH CHECK OPTION;

Now do all the updates via the views.

And I would rather be using CREATE ASSERTION instead :)

Joe Celko The SQL Apprentice