Joe Celko The SQL Apprentice: June 2007

Saturday, June 23, 2007

I have problem with union

SQL Apprentice Question
use Sql Server 2000. I've structure like it:

Create table T1(
RID int primary key,
A varchar(8),
...
...
)

Create table T2(
RID int primary key,
A varchar(8),
...
...
)

To make easy to view, so I make a view like it

Create view VUni as
Select RID, A From T1
union

Select RID, A From T2

Now, when I execute

Select * From VUNI Where RID between 15 and 25

So the process needs a lot of times alias Very Slow. How to make it best
performance?

Oh yeach, it's other question, I want to select data from table which its
name is name dimanically (variable)

IF @YEAR=2000 Select * FROM T2000
ELSE
IF @YEAR=2001 Select * FROM T2001
ELSE
IF @YEAR=2002 Select * FROM T2002
ELSE
...
...

To make it compact, So how to the code?

Next question, How to avoid error? I execute like it

Select a/b from T2

seldom the statement is error, because value of filed "a" is 0. I want if
a/b is error calculation, so it give value 0 automatically. Can I do it?
Without I ve to make a function? Or any function (built in) to handle it?

Celko Answers
>> I want to select data from table whose name is dynamic (variable) <<

IF @year=2000 SELECT * FROM T2000
ELSE
IF @year=2001 SELECT * FROM T2001
ELSE
IF @year=2002 SELECT * FROM T2002
ELSE
...
<<

This design flaw is so bad it has a name, like a disease: "Attribute
Splitting". Instead of separate tables, you need one table with a
"_year" column in it. You then use VIEWs or queries.

You probably also split an attribute in the first part of this
posting, and are trying to fix it with a UNION [ALL] construct.

Original Source

Working days calendar in T-SQL

SQL Apprentice Question
My company working days is from Monday to Friday.
I would like to generate a result set which show all
consecutive working days (Monday to Friday excluding the
weekends) that is, for June 2007, it show me 1st, 4th, 5th,
6th, 8th, 11th etc... until 29th (last working day of June). Is
it possible to do this with T-SQL?My company working days is from Monday to Friday.
I would like to generate a result set which show all
consecutive working days (Monday to Friday excluding the
weekends) that is, for June 2007, it show me 1st, 4th, 5th,
6th, 8th, 11th etc... until 29th (last working day of June). Is
it possible to do this with T-SQL?

Celko Answers
Build a calendar table with one column for the calendar data and other
columns to show whatever your business needs in the way of temporal
information. Do not try to calculate holidays in SQL -- Easter alone
requires too much math.

CREATE TABLE Calendar
(cal_date DATE NOT NULL PRIMARY KEY,
fiscal_year INTEGER NOT NULL,
fiscal_month INTEGER NOT NULL,
week_in_year INTEGER NOT NULL, -- SQL server is not ISO standard
holiday_type INTEGER NOT NULL
CHECK(holiday_type IN ( ..), --
day_in_year INTEGER NOT NULL,
julian_business_day INTEGER NOT NULL,
...);

The Julian business day is a good trick. Number the days from
whenever your calendar starts and repeat a number for a weekend or
company holiday.

A calendar table for US Secular holidays can be built from the data at
this website, so you will get the three-day weekends:

http://www.smart.net/~mmontes/ushols.html

Time zones with fractional hour displacements http://www.timeanddate.com/worldclock/city.html?n=246 http://www.timeanddate.com/worldclock/city.html?n=176 http://www.timeanddate.com/worldclock/city.html?n=5 http://www.timeanddate.com/worldclock/city.html?n=54

But the STRANGE ones are:
http://www.timeanddate.com/worldclock/city.html?n=63 http://www.timeanddate.com/worldclock/city.html?n=5

Original Source

Saturday, June 16, 2007

group by datetime

SQL Apprentice Question
I am baffled by this query and can use a little help pls!

query count for each Monday of the week in the last few months between 9 pm
and 1 am. I know I can use datepart() but can't figure out how to query
between 9pm and 1am. Thanks.

Celko Answers
>> I am baffled by this query and can use a little help pls! <<

Please post DDL, so that people do not have to guess what the keys,
constraints, Declarative Referential Integrity, data types, etc. in
your schema are. Sample data is also a good idea, along with clear
specifications. It is very hard to debug code when you do not let us
see it. Why are you so rude?

>> query count for each Monday of the week in the last few month between 9 pm [sic: 11:00:00 Hrs] and 1 am [sic: 01:00:00 the next day??]. I know I can use DATEPART() but can't figure out how to query between 9 pm [sic] and 1 am [sic]. <<

You really have no idea how time works!! What the hell is that AM and
PM crap? You never heard of ISO-8601 Standards and UTC???

What you do is set up a table of temporal ranges with upper and lower
limits with TIMESTAMP limits and join to it. Hey, you spit on us by
not posting DDL, why should we post DDL and data for you?

Your unit of measurement is wrong and you are getting screwed up. But
your invisible DDL tells us nothing!!

Original Source

Updating based on values of two records

SQL Apprentice Question
I'm sure this is obvious, but for some reason, I can't get my hands
around the solution. I have a table like this:

Account CalYear CalMonth Amount

Comm 2006 12 80
Comm 2007 01 100
Comm 2007 02 125
Incr 2007 01 21
Incr 2007 02 28

I want to update the incr account so it shows the correct difference
between the previous month, the change basically. Corrected it would
look like this:

Account CalYear CalMonth Amount

Comm 2006 12 80
Comm 2007 01 100
Comm 2007 02 125
Incr 2007 01 20
Incr 2007 02 25

Celko Answers

>> I have a table like this: <<

Why do you spit on us and want us to do your homework/job for you?

Please post DDL, so that people do not have to guess what the keys,
constraints, Declarative Referential Integrity, data types, etc. in
your schema are. Sample data is also a good idea, along with clear
specifications. It is very hard to debug code when you do not let us
see it.

Why did you invent your own two column data types that avoid the
temporal data types? Have you ever read the ISO-8601 rules for
temporal data?

CREATE TABLE StinkingFoobar -- until we get a real name
(acct_type CHAR(4) NOT NULL,
start_date DATETIME NOT NULL,
end_date DATETIME NOT NULL,
CHECK (start_date < end_date),
squid_amt INTEGER NOT NULL, -- read ISO-11179 before you program
again!!
PRIMARY KEY (acct_type, start_date)
);

Did I guess right?? Gee, wish you had helped!!

>> I want to update the incr account so it shows the correct difference between the previous month <<

No. That is soooooo non-RDBMS!! This is a computed value and needs
to be done in a VIEW using the OLAP functions. RTFM.

Original Source

newbie question on update

SQL Apprentice Question
I have a clients table which consists of, among other things, a column
named source.

I have a second table, sources, with two columns - client_id and source.

The are fewer rows in sources than in clients. I am trying to update
clients.source with sources.source without affecting the clients that
do not have a corresponding row in sources.

My first try was update clients set clients.source = sources.source
where clients.client_id = sources.client_id

But I ended up with nulls in the clients.source column for rows in
which there was no row in sources.

I am guessing that I need to involve a join somehow, but I am having a
very hard time not thinking procedurally, and just don't "get" it.

I've been looking at Join examples, but anything non-trivial just
baffles me. Would someone be so kind as to explain how to do what I'm
trying to do?

Thanks.

Celko Answers
>> I am finding it difficult to think in terms of sets and 'simultaneous' actions. I keep wanting to solve these things procedurally. <<

It takes about a year to have the revelation :) My next book will
deal with thinking in sets. Hey, remember how weird recursion was?

Original Source

Thursday, June 14, 2007

Question regarding multiple data sources and coalesce

SQL Apprentice Question
I inherited a db that is pulling data from 5 different sources and
storing them in 5 different tables. The tables all have the same
basic data, minor differences based on source. The company currently
creates a "summary" table by joining the 5 source tables and using
coalesce( ) to display data from preferred data sources if it is
available from there.

I've been asked to present an alternate schema. My first thought was
to normalize the 5 tables into 1 with a source andI could just include
the source in the ORDER BY of queries to get the preferred source.
But then I thought since each source could have 30Million rows maybe I
would loose performance over the existing schema. So, could somebody
point me towards a reference source that may cover this type of
topic? Of course, any opinions (on this issue) would be appreciated
as well.

Celko Answers

>> I've been asked to present an alternate schema. My first thought was to normalize the 5 tables into 1 with a source and I could just include the source in the ORDER BY of queries to get the preferred source. <<

Something ike this? Cram all the data that you have into staging
table

CREATE TABLE Foobar
(foo_key INTEGER NOT NULL,
source_nbr INTEGER NOT NULL
CHECK (source_nbr BETWEEN 1 AND 5),
PRIMARY KEY (foo_key, source_nbr),
etc.);

Assuming sources are ranks from 1 to 5, pull out the most trusted for
each key

SELECT foo_key, etc.
FROM Foobar AS F1
WHERE source_nbr =
(SELECT MIN(source_nbr)
FROM Foobar AS F2
WHERE F1.foo_key = F2.foo_key);

This is the best we can do without more specs.

Original Source

Need urgent help on a QUERY

SQL Apprentice Question
Given:

SELECT STATE_ID
FROM GROUP_STATE

STATE_ID
----------------
10
15
16
17
18
19
20

SELECT GROUP_ID
FROM GROUP
WHERE GROUP_ID NOT IN (SELECT DISTINCT GROUP_ID FROM GROUP_STATE)

GROUP_ID
---------------
1
2
16
5

I need a query that would look like:
GROUP_ID STATE_ID
--------------- ----------------
1 10
1 15
1 16
1 17
1 18
1 19
1 20

2 10
2 15
2 16
2 17
2 18
2 19
2 20

ETC....

Celko Answers
SELECT GS.state_id, G.group_id
FROM ( SELECT state_id FROM GroupStates AS GS
CROSS JOIN
SELECT group_id FROM Groups
WHERE group_id
NOT IN (SELECT DISTINCT group_id FROM GroupStates)
AS G);

Do not use reserved words for data element names.

Original Source

changing order of records

SQL Apprentice Question
hi, i've got a problem. i need to change order of records in my table . here
is the table

id forumname
----------------
1 test1
2 test2
3 test3
4 test4

i need to write code that can move nay record up or down without changing
id of forumname. i thought about another column called order

id forumname order
---------------------
1 test1 1
2 test2 2
3 test3 3
4 test4 4

but couldnt figureout how to change that order column. any one got an idea?

Celko Answers
Here is an old "cut & paste" for this problem: Given a motorpool with
numbered parking spaces, you want to move the automobiles around.
CREATE TABLE Motorpool
(parking_space INTEGER NOT NULL PRIMARY KEY
CHECK (parking_space > 0),
vin CHAR(17) NOT NULL);

Re-arrange the display order based on the parking_space column:

CREATE PROCEDURE SwapVehicles (@old_parking_space INTEGER,
@new_parking_space INTEGER)
AS
UPDATE Motorpool
SET parking_space
= CASE parking_space
WHEN @old_parking_space
THEN @new_parking_space
ELSE parking_space + SIGN(@old_parking_space - @new_pos)
END
WHERE parking_space BETWEEN @old_parking_space AND
@new_parking_space
OR parking_space BETWEEN @new_parking_space AND
@old_parking_space;

When you want to drop a few rows, remember to close the gaps with
this:

CREATE PROCEDURE CloseMotorpoolGaps()
AS
UPDATE Motorpool
SET parking_space
= (SELECT COUNT (M1.parking_space)
FROM Motorpool AS M1
WHERE M1.parking_space <= Motorpool.parking_space);

Original Source

Wednesday, June 13, 2007

self-referencing constraint with identity column?

SQL Apprentice Question
have a table that requires to have a self-referencing constraint to enforce
a parent-child type of relationship:

ID (identity column)
ParentID (INT column that references the ID column).

To determine if I'm at the top-level of of the relationship, I was going to
leave the ParentID null, otherwise it must be a value of another ID column in
the table which indicates its child of another.

However, can you have it so you don't use NULL to indicate this and instead
set ID and ParentID equal to the same value and still use the IDENTITY column?

So on a INSERT (i.e the IDENTITY would generate 25, so I'd like to set the
ParentID to 25 as well). This would was causing FK violation and I was
wondering if there is a way around it using an insert trigger?

Celko Answers

>> I have a table that requires to have a self-referencing constraint to enforce a parent-child type of relationship: <<

No it is not required at all. Look up the Nested Sets model for trees
and hierarchies. You will find the constraitns are much easier and the
code will run1-2 orders of magnitude faster than recursive path
traversals.

Also, never use IDENTITY in an RDBMS; find a valid relational key
instead.

Original Source

Tuesday, June 12, 2007

Database Design Question

SQL Apprentice Question
I'm fairly new to database design, having only really worked with
existing tables etc in the past.

Simple question this really...........

In a users table, is it wise to have a ManagerID column that is
effectively pointing to another user in the same table, with the
theory being that if that person is top dog, they will just have a
null entry in ManagerID.

I will check for circular references before entering the data.

Is there a better way of storing the data, or is this the only way?

Celko Answers
>> Is there a better way of storing the data, or is this the only way? <<

Get a copy of TREES & HIERARCHIES IN SQL for several different ways to
do an organizational chart. What you have is thd adjacency list
model; look up the nested sets model.

Original Source

Using Soundex to identify possible duplicates

SQL Apprentice Question
In a table that has person forename and surname, is it possible to use
soundex to identify for every row in the table what similar matches
there are in the same table?

Celko Answers

>> is it possible to use Soundex to identify for every row in the table what similar matches there are in the same table? <<

Don't do it. Name handling is an ugly problem and if you have to do
this on a regular basis get a package designed for this kind of work.
Some companies are Group 1 Software, SSA (used to have a great booklet
on this topic), Melissa Data Corporation and Promark Software Inc.

Their software handles mailing lists and you can get a review copy
from Melissa Data. They do not choke on names like "John Paul van der
Poon" and worse.

Original Source

giving one union preference

SQL Apprentice Question
Let say I have a query with the following structure:

Select name, mdate, kdate
from table1 a
inner join (
select name, mdate
from table2 b
where a.id = b.id
and mdate >= kdate
union
select name, mdate
from table3 c
where a.id = c.id
and mdate >= kdate
) mindate

When the select mdate from the inner join union (mindate) returns the
same mdate for table b and c, I want to give preference to the c.mdate
and so c.name.

Is there a way to do this?

Celko Answers
>> When the select mdate from the inner join union (mindate) returns the same mdate for table b and c, I want to give preference to the c.mdate and so c.name. <<

This question makes no sense. The poorly named derived table,
Mindate, acts as if it is materialized and you no longer have access
to tables B and C. It is like asking to see the eggs after the cake
has been baked.

Original Source

Help writing query to return data if there are 4 consecutive holidays

SQL Apprentice Question
I need to write a query that returns data if an employee has taken four or
more consecutive holidays. I need to get the startdate and the employee name.
The employee works either a 4X10..or an 8X5 shift...
Also, the employee need not have an off only on Sat and Sun...Some of them
work on the weekends and have an off during the weekdays.

Also, I need to take into account the days employee is off (eg.Sat and Sun).

So if an employee is out Mon, Tues, Wed, Thu it shows that employee or if
the employee is out on Thu, Fri, Mon, Tues it also shows that employee...

Can anyone assist!

Celko Answers
Use a Calendar table which has the work days Julianized (consecutively
numbered). Google this newsgroup for examples.

Original Source

add blank row between groups of rows to separate them?

SQL Apprentice Question
I want to add a blank row between groups of rows to separate them.

select * from tbl1 where fname in ('Bill', 'tom', 'sam')

returns

Bill, 1, 2, 3
bill 4, 5, 6
tom, 1, 2, 3
tom, 4, 5, 6
sam, 1, 2, 3
sam, 4, 5, 6

I want to return this:

Bill, 1, 2, 3
bill 4, 5, 6

tom, 1, 2, 3
tom, 4, 5, 6

sam, 1, 2, 3
sam, 4, 5, 6

How can I perform this type of operation with tsql?

Celko Answers
>> How can I perform this type of operation with tsql? <<

Why are you formatting data in the back end? The basic principle of a
tiered architecture is that display is done in the front end and never
in the back end. This a more basic programming principle than just
SQL and RDBMS.

Original Source

Monday, June 11, 2007

Calc minutes between days

SQL Apprentice Question
I am using DATEDIFF('n',EndTime, StartTime) to return minutes between 2
datetime columns. The problem is that when the StartTime is 23:00:00 PM and
EndTime is 07:00:00 AM I get a negative # of minutes. How can get 480
minutes from this calculation. Note the 23:00:00 is yesterday and the
07:00:00 is today. Thanks

Celko Answers

>> OK, but I'm not storing the date portion so it thinks both dates are 12/30/1899. Do I need a CASE statement to check for 1st time 2nd time? <<

No, you need to learn to do it right and not look for kludges in
newsgroups. Oh, you will get the kludges, as you have seen, but then
your code will become a convoluted mess over time, queries cannot use
indexes, portability is destroyed, etc.

It would also help if you had learned to use ISO-8601 temporal
formats, too (i.e '1899-12-30'). That would have lead you to study
the ISO temporal model. A proper data model will use DATETIME data
types and model events as having durations. Get a free copy of Rick
Snodgrass book in PDF from the University of Arizona website and spent
a week with. There is also a ton of free code you can download.

Get info from multiple tables. Join problem? Options

SQL Apprentice Question
The Stored Procedure below returns article attributes for my articles
(tblArticleAttribute). This works fine IF NOT tblArticleAttribute.Content is
empty, then it returns all templatedefinitions with the Content. But, if the
Content is empty, it also has to return all template definitions, but with
the content fields empty (now it returns nothing). This is because I bind
this information to a dynamic form, and if (example) the title or ingress is
empty the textfields (definitions) should be shown with no text. Then the
user can fill inn text an save it.

----------------------------------
Here is the SQL:
----------------------------------

SELECT tblTemplateDefinition.[Name] as TemplateDefinitionName,
tblTemplateDefinition.HelpText as TemplateDefinitionHelpText,
tblArticleAttribute.[Content], tblArticle.Id AS ArticleId,
tblTemplateDefinition.Id AS TemplateDefinitionId
FROM tblTemplateDefinition
LEFT OUTER JOIN tblArticleAttribute ON tblTemplateDefinition.Id =
tblArticleAttribute.TemplateDefinitionId
INNER JOIN tblArticle ON tblArticleAttribute.ArticleId= tblArticle.Id
WHERE tblArticle.Id = @ArticleId

----------------------------------
Here is the tables:
----------------------------------

tblArticle:
id | Name

tblArticleAttribute
TemplateDefinitionId | ArticleId | Content

tblTemplate
Id | Name

tblTemplateDefinition
Id | TemplateId | TemplateDefinitionId |Name | HelpText

Celko Answers
Please post DDL, so that people do not have to guess what the keys,
constraints, Declarative Referential Integrity, data types, etc. in
your schema are. Sample data is also a good idea, along with clear
specifications. It is very hard to debug code when you do not let us
see it.

You might want to read any basic book on data modeling. You seem to
have multiple names for the same data element, based on where it
appears. And you have a universal, magical "id" column in all the
pseudo-code you did give us. You also violate ISO-11179 naming rules.
My guess as to what you meant to post if you knew how would be
something like this:

CREATE TABLE Articles:
(article_id INTEGER NOT NULL PRIMARY KEY, -- needs work
article_title VARCHAR(150) NOT NULL);

I have no idea which carefully researched industry standard you are
using the article identifier. But at least I know it is not an
IDENTITY column -- that would be soooo non-relational.

Why did you use a singular name for the table? Do you really have
only one article. Why did you put that silly "tbl-" prefix on the
table names? Just to mess up the data dictionary, or because you are
an OO programmer who has not learned RDBMS yet?

CREATE TABLE ArticleAttributes
(template_id INTEGER NOT NULL PRIMARY KEY,
article_id INTEGER NOT NULL
REFERENCES Articles(article_id)
ON UPDATE CASCADE
ON DELETE CASCADE,
article_content VARCHAR(255) NOT NULL);

We would never store attributes in a schema; that would mix data and
metadata. What about the templates? Why is the definition of a
template not an attribute of a template? Why it is an entity with its
own table?

Also note that you need DRI actions, constraints, etc. to make this
work.

I can make a lot of assumptions about the magical universal names and
how they are related (is the magical "id" actually article_id,
template_id, template_definition_id, template_id, or a hundred other
possible things?) and the cardinality of the relationships, etc., but
based on decades of cleaning up bad schemas, you need more than a
sample query with an OUTER JOIN.

Can you explain your problem better?

Thursday, June 07, 2007

Maximum number of tables (260) exceeded

SQL Apprentice Question
Hi! I receive the referenced message when i run a view I created in SQL2005
in a SQL2000 install of the same database. Basically it is a union query with
about 9 individual select queries. I am sure this has something to do with a
lack of efficiency in my query structure but I am confounded that it runs in
SQL2005 and not in SQL2000. Any suggestions? Thanks in avance for any help
you can offer...

Celko Answers
>> I am confounded that it runs in SQL2005 and not in SQL2000. Any suggestions?

Why not just write better code? Seriously, after all my decades (NOT
years) with SQL, I cannot imagine anything properly designed that
would break the limit.

Remember the human interface part of the first SE course you had
before you were allowed to production data? 5 to 7 levels is all a
human being can maintain. Duh.

Find the moron who nested things this deep and kill him. Then re-
write EVERYTHING he wrote and re-write it. Do a Wikipeida on
"spaghetti code" versus "lasagna code" for some of the issues.

Original Source

Monday, June 04, 2007

db table design question

SQL Apprentice Question
We have an application which has its settings stored in a database
(SQL Server 2000)

We have (strong) disagrements on the best way to store these values,
and I am looking for advice on doing this.
Most of the values are not used by the stored procedures, just the
applications running on the clients PC. The data rarely changes.

Developer #1 wants many tables with many columns. This gives the
ability to strongly type each variable. Each table would have only
one row. He believes this to be faster in retrieving the data.

Developer #2 wants one table with a Key column and a Value column.
This table would have many rows. The application would be
responsible
for detecting invalid datatype in the value. He believes this to be
a
more efficent use of the database, and easier to maintain.

So - what do you think? Are there any advantages or problems with
either approach?

Celko Answers
Use #1. The other way is called a OTLT ("One True Lookup Table") or
MUCK ("Massively Unified Code Keys") in the literature. It is
incredibly bad design. Here is a cut and paste from one of my books
on Constants tables

04.02. Constants Table

When you configure a system, you might want to have a way to set and
keep constants in the schema. One method for doing this is to have a
one-row table that can be set with default values at the start, and
then updated only by someone with administrative privileges.

CREATE TABLE Constants
(lock CHAR(1) DEFAULT 'X'
NOT NULL PRIMARY KEY
CHECK (lock = 'X'),
pi FLOAT DEFAULT 3.142592653 NOT NULL,
e FLOAT DEFAULT 2.71828182 NOT NULL,
phi FLOAT DEFAULT 1.6180339887 NOT NULL,
..);

To initialize the row, execute this statement.

INSERT INTO Constants VALUES DEFAULTS;

Most SQL programmers do not know about the VALUES DEFAULTS option in
the INSERT INTO statement. The lock column assures there is only one
row and the DEFAULT values load the initial values. These defaults
can include the current user and current timestamp, as well as numeric
and character constant values.

Another version of this idea that does not allow for any updates is a
VIEW defined with a table constructor. [[not in SQL Server yet!]]

CREATE VIEW Constants (pi, e, phi, ..)
AS VALUES (CAST 3.142592653 AS FLOAT),
(CAST 2.71828182 AS FLOAT),
(CAST 1.6180339887 AS FLOAT),
..;

Please notice that you have to use a CAST() operators to assure that
the data types are correct. This is not a problem with INTEGER
values, but can be if you wanted DOUBLE PRECISION and got a default of
DECIMAL(s, p) or FLOAT.

[[ a little bit later in the chapter..]]

OTLT or MUCK Table Problems

I think that Paul Keister was the first person to coin the phrase
"OTLT" (One True Look-up Table) for a common SQL programming technique
that is popular with Newbies. Don Peterson (www.SQLServerCentral.com)
gave the same technique the name "Massively Unified Code-Key" or MUCK
tables in one of his articles.

The technique crops up time and time again, but I'll give him credit
as the first writer to give it a name. Simply put, the idea is to
have one table to do all of the code look-ups in the schema. It
usually looks like this:

CREATE TABLE Look-ups
(code_type CHAR(10) NOT NULL,
code_value VARCHAR(255) NOT NULL, -- notice size!
code_description VARCHAR(255) NOT NULL, -- notice size!
PRIMARY KEY (code_value, code_type));

So if we have Dewey Decimal Classification (library codes), ICD
(International Classification of Diseases), and two-letter ISO-3166
country codes in the schema, we have them all in one, honking big
table.

Let's start with the problems in the DDL and then look at the awful
queries you have to write (or hide in VIEWs). So we need to go back
to the original DDL and add a CHECK() constraint on the code_type
column. Otherwise, we might "invent" a new encoding system by
typographical error.

The Dewey Decimal and ICD codes are digits and have the same format --
three digits, a decimal point and more digits (usually three); the
ISO-3166 is alphabetic. Oops, need another CHECK constraint that will
look at the code_type and make sure that the string is in the right
format. Now the table looks something like this, if anyone attempted
to do it right, which is not usually the case:

CREATE TABLE OTLT
(code_type CHAR(10) NOT NULL
CHECK(code_type IN ('DDC','ICD','ISO3166', ..),
code_value VARCHAR(255) NOT NULL,
CHECK
(CASE
WHEN code_type = 'DDC'
AND code_value
SIMILAR TO '[0-9][0-9][0-9].[0-9][0-9][0-9]'
THEN 1
WHEN code_type = 'ICD'
AND code_value
SIMILAR TO '[0-9][0-9][0-9].[0-9][0-9][0-9]'
THEN 1
WHEN code_type = 'ISO3166'
AND code_value SIMILAR TO '[A-Z][A-Z]'
THEN 1 ELSE 0 END = 1),
code_description VARCHAR(255) NOT NULL,
PRIMARY KEY (code_value, code_type));

The "SIMILAR TO" predicate is the SQL-92 version of a regular
expression parser based on the POSIX Standards, if you are not
familiar with it. Since the typical application database can have
dozens and dozens of codes in it, just keep extending this pattern for
as long as required. Not very pretty is it? In fact, there is a good
chance that you will exceed the number of WHEN clauses allowed in a
CASE expression. That's why most OTLT programmers don't bother with
it.

Now let us consider adding new rows to the OTLT.

INSERT INTO OTLT (code_type, code_value, code_description)
VALUES
('ICD', 259.0, 'Inadequate Genitalia after Puberty');

and also

INSERT INTO OTLT (code_type, code_value, code_description)
VALUES
('DDC', 259.0, 'Christian Pastoral Practices & Religious Orders');

If you make an error in the code_type during insert, update or delete,
you have screwed up a totally unrelated value. If you make an error
in the code_type during a query, the results could be interesting.
This can really hard to find when one of the similarly structured
schemes had unused codes in it.

The next thing you notice about this table is that the columns are
pretty wide VARCHAR(n), or even worse, that they are NVARCHAR(n) which
can store characters from a strange language. The value of (n) is
most often the largest one allowed in that particular SQL product.

Since you have no idea what is going to be shoved into the table,
there is no way to predict and design with a safe, reasonable maximum
size. The size constraint has to be put into the WHEN clause of that
second CHECK() constraint between code_type and code_value. Or you
can live with fixed length codes that are longer or shorter than what
they should be.

These large sizes tend to invite bad data. You give someone a
VARCHAR(n) column, and you eventually get a string with a lot of white
space and a small odd character sitting at the end of it. You give
someone an NVARCHAR(255) column and eventually it will get a Buddhist
sutra in Chinese Unicode.

Now let's consider the problems with actually using the OTLT in a
query. It is always necessary to add the code_type as well as the
value which you are trying to look-up.

SELECT P1.ssn, P1.lastname, .., L1.code_description
FROM OTLT AS L1, Personnel AS P1
WHERE L1.code_type = 'ICD'
AND L1.code_value = P1.disease_code
AND ..;

In this sample query, you need to know the code_type of the Personnel
table disease_code column and of every other encoded column in the
table. If you got a code_type wrong, you can still get a result.

You also need to allow for some overhead for data type conversions.
It might be more natural to use numeric values instead of VARCHAR(n)
for some encodings to ensure a proper sorting order. Padding a string
of digits with leading zeros adds overhead and can be risky if
programmers do not agree on how many zeros to use.

When you execute a query, the SQL engine has to pull in the entire
look-up table, even if it only uses a few codes. If one code is at
the start of the physical storage, and another is at the end of
physical storage, I can do a lot of caching and paging. When I update
the OTLT table, I have to lock out everyone until I am finished. It
is like having to carry an encyclopedia set with you when all you
needed was a magazine article.

Now consider the overhead with a two-part FOREIGN KEY in a table:

CREATE TABLE EmployeeAbsences
(..
code_type CHAR(3) -- min length needed
DEFAULT 'ICD' NOT NULL
CHECK (code_type = 'ICD'),

code_value CHAR(7) NOT NULL, -- min length needed
FOREIGN KEY (code_type, code_value)
REFERENCES OTLT (code_type, code_value),
..);

Now I have to convert the character types for more overhead. Even
worse, ICD has a natural DEFAULT value (000.000 means "undiagnosed"),
while Dewey Decimal does not. Older encoding schemes often used all
9's for "miscellaneous" so they would sort to the end of the reports
in COBOL programs. Just as there is no Magical Universal "id", there
is no Magical Universal DEFAULT value. I just lost one of the most
important features of SQL!

I am going to venture a guess that this idea came from OO programmers
who think of it as some kind of polymorphism done in SQL. They say to
themselves that a table is a class, which it isn't, and therefore it
ought to have polymorphic behaviors, which it doesn't.

Maybe there are good reasons for the data modeling principle that a
well-designed table is a set of things of the same kind instead of a
pile of unrelated items.

Original Source

Find first available block that does not intersect a range

SQL Apprentice Question
Given a number of numeric ranges, I need to find the first available row
(range) that can accomodate a requested block of contiguous values. For
example, given the following table definition:

CREATE TABLE [dbo].[tbl_AllocatedRange](
[RangeID] [int] IDENTITY(1,1) NOT NULL,
[RangeStart] [bigint] NOT NULL,
[RangeEnd] [bigint] NOT NULL,
CONSTRAINT [PK_tbl_AllocatedRange] PRIMARY KEY CLUSTERED
(
[RangeID] ASC
)

And the following DML:

INSERT INTO [dbo].[tbl_AllocatedRange] ([RangeID], [RangeStart], [RangeEnd])
VALUES ( 1, 100, 200 );

INSERT INTO [dbo].[tbl_AllocatedRange] ([RangeID], [RangeStart], [RangeEnd])
VALUES ( 2, 500, 650 );

INSERT INTO [dbo].[tbl_AllocatedRange] ([RangeID], [RangeStart], [RangeEnd])
VALUES ( 3, 2000, 2200 );

INSERT INTO [dbo].[tbl_AllocatedRange] ([RangeID], [RangeStart], [RangeEnd])
VALUES ( 4, 4000, 8000 );

INSERT INTO [dbo].[tbl_AllocatedRange] ([RangeID], [RangeStart], [RangeEnd])
VALUES ( 5, 16000, 25000 );

INSERT INTO [dbo].[tbl_AllocatedRange] ([RangeID], [RangeStart], [RangeEnd])
VALUES ( 6, 25001, 50000 );

INSERT INTO [dbo].[tbl_AllocatedRange] ([RangeID], [RangeStart], [RangeEnd])
VALUES ( 7, 100000, 200000 );

A request for a block of 10000, should return the row where RangeID=6
because that is the first available position where I could allocate a
contiguous block of 10000 without intersecting any of the values in the
existing ranges.

Celko Answers
Let's add some constraints and reduce those BIGINTs to just INTEGER
unless you really need it.

CREATE TABLE AllocatedRanges
(range_id INTEGER NOT NULL PRIMARY KEY,
range_start INTEGER NOT NULL
CHECK (range_start >= 0),
range_end INTEGER NOT NULL
CHECK (range_end >= 0),
CHECK (range_start <= range_end));

This will get you all of the candidates, but in RDBMS, there is no
concept of an ordering so talking about the "first avaialble" range
needs more definition.

SELECT range_id, (range_end - range_start)+1 AS range_size, @my_size
AS my_size
FROM AllocatedRanges
WHERE (range_end - range_start)+1 >= @my_size;

We can do it with the lowest range id:

SELECT MIN(range_id)
FROM AllocatedRanges
WHERE (range_end - range_start)+1 >= @my_size;

But the usual requirement is to find the best fit, not the first
one.

SELECT range_id
FROM AllocatedRanges
WHERE (range_end - range_start)+1 =
(SELECT MIN((range_end - range_start)+1 - @my_size)
FROM AllocatedRanges AS A1
WHERE ((range_end - range_start)+1 - @my_size) >= 0);

Original Source

mutliple refential integrity

SQL Apprentice Question
Suppose I want a column in my table to reference two other columns of
other two tables (of course datatypes are same, and the other two
columns are primary key in their respective relations). How can I
enforce this using SQL?

Celko Answers
>> Suppose I want a column in my table to reference two other columns of other two tables (of course data types are same, and the other two columns are primary key in their respective relations). How can I enforce this using SQL? <<-

CREATE TABLE Foo
(foo_id INTEGER NOT NULL PRIMARY KEY,
..);

CREATE TABLE Bar
(_id INTEGER NOT NULL PRIMARY KEY,
..);

CREATE TABLE Foobar
(foo_id INTEGER NOT NULL
REFERENCES Foo(foo_id),
bar_id INTEGER NOT NULL
REFERENCES Bar(bar_id),

Original Source

Joe Celko The SQL Apprentice