

Monday, June 04, 2007

db table design question

SQL Apprentice Question
We have an application that stores its settings in a database
(SQL Server 2000).

We have (strong) disagreements on the best way to store these values,
and I am looking for advice on doing this.
Most of the values are not used by the stored procedures, just by the
applications running on the clients' PCs. The data rarely changes.


Developer #1 wants many tables with many columns. This gives the
ability to strongly type each variable. Each table would have only
one row. He believes this to be faster in retrieving the data.


Developer #2 wants one table with a Key column and a Value column.
This table would have many rows. The application would be responsible
for detecting invalid datatypes in the value. He believes this to be
a more efficient use of the database, and easier to maintain.


So - what do you think? Are there any advantages or problems with
either approach?
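For concreteness, the two proposals look something like this (all table
and column names here are invented for illustration):

```sql
-- Developer #1: a one-row table, one strongly typed column per setting
CREATE TABLE AppSettings
(lock CHAR(1) DEFAULT 'X' NOT NULL PRIMARY KEY
      CHECK (lock = 'X'),
 retry_count INTEGER DEFAULT 3 NOT NULL,
 server_name VARCHAR(50) DEFAULT 'PRODSRV01' NOT NULL,
 timeout_sec INTEGER DEFAULT 30 NOT NULL);

-- Developer #2: one key/value table, everything stored as strings
CREATE TABLE Settings
(setting_key VARCHAR(50) NOT NULL PRIMARY KEY,
 setting_value VARCHAR(255) NOT NULL);
```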



Celko Answers
Use #1. The other way is called an OTLT ("One True Lookup Table") or
MUCK ("Massively Unified Code Keys") in the literature. It is
incredibly bad design. Here is a cut-and-paste from one of my books
on Constants tables.

04.02. Constants Table


When you configure a system, you might want to have a way to set and
keep constants in the schema. One method for doing this is to have a
one-row table that can be set with default values at the start, and
then updated only by someone with administrative privileges.


CREATE TABLE Constants
(lock CHAR(1) DEFAULT 'X'
      NOT NULL PRIMARY KEY
      CHECK (lock = 'X'),
 pi FLOAT DEFAULT 3.141592653 NOT NULL,
 e FLOAT DEFAULT 2.71828182 NOT NULL,
 phi FLOAT DEFAULT 1.6180339887 NOT NULL,
 ..);


To initialize the row, execute this statement.


INSERT INTO Constants DEFAULT VALUES;


Most SQL programmers do not know about the DEFAULT VALUES option of
the INSERT INTO statement. The lock column assures there is only one
row, and the DEFAULT clauses load the initial values. These defaults
can include the current user and current timestamp, as well as numeric
and character constant values.
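For example, a configuration row can record who set it up and when,
using the Standard niladic functions (the column names here are just
an illustration; SQL Server accepts both these functions and the
DEFAULT VALUES clause):

```sql
CREATE TABLE SchemaConfig
(lock CHAR(1) DEFAULT 'X' NOT NULL PRIMARY KEY
      CHECK (lock = 'X'),
 configured_by CHAR(30) DEFAULT CURRENT_USER NOT NULL,
 configured_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP NOT NULL,
 max_retries INTEGER DEFAULT 3 NOT NULL);

-- loads one row with all of the defaults filled in
INSERT INTO SchemaConfig DEFAULT VALUES;
```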


Another version of this idea that does not allow for any updates is a
VIEW defined with a table constructor. [[not in SQL Server yet!]]


CREATE VIEW Constants (pi, e, phi, ..)
AS VALUES (CAST (3.141592653 AS FLOAT),
           CAST (2.71828182 AS FLOAT),
           CAST (1.6180339887 AS FLOAT),
           ..);


Please notice that you have to use CAST() operators to assure that
the data types are correct. This is not a problem with INTEGER
values, but it can be if you wanted DOUBLE PRECISION and got a
default of DECIMAL(p,s) or FLOAT.


[[ a little bit later in the chapter..]]


OTLT or MUCK Table Problems


I think that Paul Keister was the first person to coin the phrase
"OTLT" (One True Look-up Table) for a common SQL programming technique
that is popular with Newbies. Don Peterson (www.SQLServerCentral.com)
gave the same technique the name "Massively Unified Code-Key" or MUCK
tables in one of his articles.


The technique crops up time and time again, but I'll give him credit
as the first writer to give it a name. Simply put, the idea is to
have one table to do all of the code look-ups in the schema. It
usually looks like this:


CREATE TABLE Lookups
(code_type CHAR(10) NOT NULL,
 code_value VARCHAR(255) NOT NULL, -- notice size!
 code_description VARCHAR(255) NOT NULL, -- notice size!
 PRIMARY KEY (code_value, code_type));


So if we have Dewey Decimal Classification (library codes), ICD
(International Classification of Diseases), and two-letter ISO-3166
country codes in the schema, we have them all in one, honking big
table.
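A few sample rows show how the unrelated encodings end up interleaved
in the one table (real DDC, ICD-9, and ISO-3166 values; the table name
is spelled Lookups here because hyphens are not legal in SQL
identifiers):

```sql
INSERT INTO Lookups (code_type, code_value, code_description)
VALUES ('DDC', '025.04', 'Information storage and retrieval');

INSERT INTO Lookups (code_type, code_value, code_description)
VALUES ('ICD', '493.90', 'Asthma, unspecified');

INSERT INTO Lookups (code_type, code_value, code_description)
VALUES ('ISO3166', 'US', 'United States');
```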


Let's start with the problems in the DDL and then look at the awful
queries you have to write (or hide in VIEWs). So we need to go back
to the original DDL and add a CHECK() constraint on the code_type
column. Otherwise, we might "invent" a new encoding system by
typographical error.


The Dewey Decimal and ICD codes are digits and have the same format --
three digits, a decimal point and more digits (usually three); the
ISO-3166 is alphabetic. Oops, need another CHECK constraint that will
look at the code_type and make sure that the string is in the right
format. Now the table looks something like this, if anyone attempted
to do it right, which is not usually the case:


CREATE TABLE OTLT
(code_type CHAR(10) NOT NULL
   CHECK (code_type IN ('DDC', 'ICD', 'ISO3166', ..)),
 code_value VARCHAR(255) NOT NULL,
 CHECK
 (CASE
  WHEN code_type = 'DDC'
       AND code_value
           SIMILAR TO '[0-9][0-9][0-9].[0-9][0-9][0-9]'
  THEN 1
  WHEN code_type = 'ICD'
       AND code_value
           SIMILAR TO '[0-9][0-9][0-9].[0-9][0-9][0-9]'
  THEN 1
  WHEN code_type = 'ISO3166'
       AND code_value SIMILAR TO '[A-Z][A-Z]'
  THEN 1 ELSE 0 END = 1),
 code_description VARCHAR(255) NOT NULL,
 PRIMARY KEY (code_value, code_type));


The "SIMILAR TO" predicate is the SQL:1999 version of a regular
expression parser based on the POSIX Standards, if you are not
familiar with it. Since the typical application database can have
dozens and dozens of codes in it, just keep extending this pattern
for as long as required. Not very pretty, is it? In fact, there is a
good chance that you will exceed the number of WHEN clauses allowed
in a CASE expression. That is why most OTLT programmers do not
bother with it.


Now let us consider adding new rows to the OTLT.


INSERT INTO OTLT (code_type, code_value, code_description)
VALUES
('ICD', '259.0', 'Inadequate Genitalia after Puberty');


and also


INSERT INTO OTLT (code_type, code_value, code_description)
VALUES
('DDC', '259.0', 'Christian Pastoral Practices & Religious Orders');


If you make an error in the code_type during an insert, update, or
delete, you have screwed up a totally unrelated value. If you make
an error in the code_type during a query, the results could be
interesting. This can be really hard to find when one of the
similarly structured schemes has unused codes in it.
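For example, a hypothetical maintenance statement that picks the wrong
code_type quietly rewrites a row in a completely different encoding
scheme:

```sql
-- Intended to retire the ICD disease code, but the wrong code_type
-- clobbers the Dewey Decimal row that shares the value '259.0'
UPDATE OTLT
   SET code_description = 'OBSOLETE -- do not use'
 WHERE code_type = 'DDC'   -- should have been 'ICD'
   AND code_value = '259.0';
```

The statement succeeds, one row is affected, and nothing tells you
that a library classification just became a retired disease.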


The next thing you notice about this table is that the columns are
pretty wide VARCHAR(n), or even worse, that they are NVARCHAR(n) which
can store characters from a strange language. The value of (n) is
most often the largest one allowed in that particular SQL product.


Since you have no idea what is going to be shoved into the table,
there is no way to predict and design with a safe, reasonable maximum
size. The size constraint has to be put into the WHEN clause of that
second CHECK() constraint between code_type and code_value. Or you
can live with fixed length codes that are longer or shorter than what
they should be.


These large sizes tend to invite bad data. You give someone a
VARCHAR(n) column, and you eventually get a string with a lot of white
space and a small odd character sitting at the end of it. You give
someone an NVARCHAR(255) column and eventually it will get a Buddhist
sutra in Chinese Unicode.


Now let's consider the problems with actually using the OTLT in a
query. It is always necessary to add the code_type as well as the
value that you are trying to look up.


SELECT P1.ssn, P1.lastname, .., L1.code_description
FROM OTLT AS L1, Personnel AS P1
WHERE L1.code_type = 'ICD'
AND L1.code_value = P1.disease_code
AND ..;


In this sample query, you need to know the code_type of the Personnel
table's disease_code column, and of every other encoded column in the
table. If you get a code_type wrong, you can still get a result.
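Using the duplicated value '259.0' from the INSERT statements shown
earlier, a mistyped code_type silently joins diseases to library
classifications, and rows still come back:

```sql
-- Wrong code_type: each employee's disease_code is now matched
-- against Dewey Decimal classifications, yet the query still "works"
SELECT P1.ssn, P1.lastname, L1.code_description
  FROM OTLT AS L1, Personnel AS P1
 WHERE L1.code_type = 'DDC'   -- should have been 'ICD'
   AND L1.code_value = P1.disease_code;
```

An employee coded 259.0 now appears to suffer from "Christian Pastoral
Practices & Religious Orders", and no constraint or error stops it.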


You also need to allow for some overhead for data type conversions.
It might be more natural to use numeric values instead of VARCHAR(n)
for some encodings to ensure a proper sorting order. Padding a string
of digits with leading zeros adds overhead and can be risky if
programmers do not agree on how many zeros to use.


When you execute a query, the SQL engine has to pull in the entire
look-up table, even if it only uses a few codes. If one code is at
the start of the physical storage and another is at the end, the
engine can do a lot of caching and paging. When I update the OTLT
table, I have to lock out everyone until I am finished. It is like
having to carry an encyclopedia set with you when all you needed was
a magazine article.


Now consider the overhead with a two-part FOREIGN KEY in a table:


CREATE TABLE EmployeeAbsences
(..
code_type CHAR(3) -- min length needed
DEFAULT 'ICD' NOT NULL
CHECK (code_type = 'ICD'),


code_value CHAR(7) NOT NULL, -- min length needed
FOREIGN KEY (code_type, code_value)
REFERENCES OTLT (code_type, code_value),
..);


Now I have to convert the character types for more overhead. Even
worse, ICD has a natural DEFAULT value (000.000 means "undiagnosed"),
while Dewey Decimal does not. Older encoding schemes often used all
9's for "miscellaneous" so they would sort to the end of the reports
in COBOL programs. Just as there is no Magical Universal "id", there
is no Magical Universal DEFAULT value. I just lost one of the most
important features of SQL!


I am going to venture a guess that this idea came from OO programmers
who think of it as some kind of polymorphism done in SQL. They say to
themselves that a table is a class, which it isn't, and therefore it
ought to have polymorphic behaviors, which it doesn't.


Maybe there are good reasons for the data modeling principle that a
well-designed table is a set of things of the same kind instead of a
pile of unrelated items.
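The usual alternative is one properly typed table per encoding, each
with its own key, format check, and sensible default; a sketch along
the lines of the examples above (names are illustrative):

```sql
-- One table per encoding: right size, right format, right default
CREATE TABLE DiseaseCodes
(icd_code CHAR(7) DEFAULT '000.000' NOT NULL PRIMARY KEY
   CHECK (icd_code SIMILAR TO '[0-9][0-9][0-9].[0-9][0-9][0-9]'),
 icd_description VARCHAR(100) NOT NULL);

CREATE TABLE CountryCodes
(country_code CHAR(2) NOT NULL PRIMARY KEY
   CHECK (country_code SIMILAR TO '[A-Z][A-Z]'),
 country_name VARCHAR(50) NOT NULL);

-- Referencing tables get a simple one-column FOREIGN KEY,
-- and ICD's natural DEFAULT comes back for free
CREATE TABLE EmployeeAbsences
(ssn CHAR(9) NOT NULL,
 icd_code CHAR(7) DEFAULT '000.000' NOT NULL
   REFERENCES DiseaseCodes (icd_code),
 absence_date DATE NOT NULL,
 PRIMARY KEY (ssn, absence_date));
```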




Original Source
