

Monday, June 12, 2006

Why use a composite PK ever?

SQL Apprentice Question
In a fabulous book -- "Practical Issues in Database Management" -- the
fabulous author, Fabian Pascal, lists three reasons for choosing one
candidate PK over another: familiarity, stability, and simplicity. I've
just started skimming through the book.

He notes further that those influenced by OO db design tend to use
simple, surrogate keys for PKs in all tables; that this is not
*precluded* by relational theory, but that it's somehow illicit.


I personally think it's a good rule of thumb to create surrogate keys
for almost all tables. Stability seems to be the single most important
factor to consider. If the database can't uniquely identify a row,
what's the point? Choosing a surrogate key guarantees stability.


Why chance it on a composite key when you can be assured of uniqueness
with a sequence-generated surrogate key?


"Familiarity" seems like a spurious concern, and a poor tradeoff
against both stability(guaranteeing you are uniquely identifying rows)
and simplicity (with queries, and others intuiting your design).


Am I missing something? Why use a composite key *ever* (and by ever, I
mean for most purposes) aside from "familiarity?"


Could someone give a real-world example where "familiarity" is a
compelling reason to choose a composite PK, trumping both stability and
simplicity?


Finally, if there *are* any compelling reasons to prefer composite keys
in some situations, after how many attributes in a composite key should
one give up and create a surrogate key? I've seen composite keys of up
to 5 attributes and often wondered why that particular decision was
made. Seems inelegant, but then perhaps I'm missing some other
fundamental here.


It's a great book by the way. Should be required reading for most
database designers.



Celko Answers
>> [simple, surrogate keys for PKs in all tables] I personally think it's a good rule of thumb to create surrogate keys for almost all tables. <<


No, absolutely not. First of all, they do not create surrogate keys;
they create exposed physical locators, like IDENTITY or other
auto-numbering schemes. Let me make an appeal to authority, with a
quote from Dr. Codd: "...Database users may cause the system to generate
or delete a surrogate, but they have no control over its value, nor is
its value ever displayed to them..." (Codd 1979, pp. 409-410).

This means that a surrogate ought to act like an index; created by the
user, managed by the system and NEVER seen by a user. That means never
used in queries, DRI or anything else that a user does.


Codd also wrote the following:


"There are three difficulties in employing user-controlled keys as
permanent surrogates for entities.


(1) The actual values of user-controlled keys are determined by users
and must therefore be subject to change by them (e.g. if two companies
merge, the two employee databases might be combined with the result
that some or all of the serial numbers might be changed).


(2) Two relations may have user-controlled keys defined on distinct
domains (e.g. one uses social security, while the other uses employee
serial numbers) and yet the entities denoted are the same.


(3) It may be necessary to carry information about an entity either
before it has been assigned a user-controlled key value or after it has
ceased to have one (e.g. an applicant for a job and a retiree).


These difficulties have the important consequence that an equi-join on
common key values may not yield the same result as a join on common
entities. A solution - proposed in part [4] and more fully in [14] -
is to introduce entity domains which contain system-assigned
surrogates. Database users may cause the system to generate or delete
a surrogate, but they have no control over its value, nor is its value
ever displayed to them....." (Codd in ACM TODS, pp 409-410).


References


Codd, E. (1979), Extending the database relational model to capture
more meaning. ACM Transactions on Database Systems, 4(4). pp. 397-434


A real surrogate key is maintained by the RDBMS itself, like an index;
the user never sees it, much less uses it in his code. Ever used
hashing in Teradata?



>> Stability seems to be the single most important factor to consider. If the database can't uniquely identify a row, what's the point? Choosing a surrogate key guarantees stability. <<


No, it makes the data unstable. You cannot validate or verify an
auto-numbering value. Think about it. I can look at an ISBN,
validate it with the check digit, then verify it with a search to
Amazon.com, B&N, Bowker, etc. in *any* database, on *any* SQL (or
non-SQL) engine.
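As a minimal sketch of that kind of verification in the schema itself, here is the ISBN-10 check digit as a declarative constraint. The Books table, its columns and the constraint name are made up for illustration; it assumes a plain ten-digit ISBN with a numeric check digit (no 'X', no hyphens) and standard SQL SUBSTRING and MOD, which some products spell differently.

CREATE TABLE Books
(isbn CHAR(10) NOT NULL PRIMARY KEY,
 book_title VARCHAR(100) NOT NULL,
 CONSTRAINT valid_isbn_check_digit  -- weighted digit sum must be zero modulo 11
 CHECK (MOD(CAST(SUBSTRING(isbn FROM 1 FOR 1) AS INTEGER) * 10
          + CAST(SUBSTRING(isbn FROM 2 FOR 1) AS INTEGER) * 9
          + CAST(SUBSTRING(isbn FROM 3 FOR 1) AS INTEGER) * 8
          + CAST(SUBSTRING(isbn FROM 4 FOR 1) AS INTEGER) * 7
          + CAST(SUBSTRING(isbn FROM 5 FOR 1) AS INTEGER) * 6
          + CAST(SUBSTRING(isbn FROM 6 FOR 1) AS INTEGER) * 5
          + CAST(SUBSTRING(isbn FROM 7 FOR 1) AS INTEGER) * 4
          + CAST(SUBSTRING(isbn FROM 8 FOR 1) AS INTEGER) * 3
          + CAST(SUBSTRING(isbn FROM 9 FOR 1) AS INTEGER) * 2
          + CAST(SUBSTRING(isbn FROM 10 FOR 1) AS INTEGER) * 1, 11) = 0);

No auto-number gives you anything comparable to verify against the outside world.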

By definition, a relational key is a subset of attributes of the
entity. You or your machinery do not invent it so much as discover it.



>> Why chance it on a composite key when you can be assured of uniqueness with a sequence-generated surrogate key? <<


Why is something you can verify more of a risk than something you
cannot verify? You use relational keys for data integrity, data
portability, code portability, and the fact that if the composite key
exists in the real world, then you **must** model it in your schema.
What is the universal, verifiable single integer for a given
(longitude, latitude) pair? If you are a Douglas Adams fan, the answer
is 42! Always 42?

All you have done is waste storage and create weak redundancy.



>> "Familiarity" seems like a spurious concern, and a poor tradeoff against both stability (guaranteeing you are uniquely identifying rows) and simplicity (with queries, and others intuiting your design). <<


Wrong again. Familiarity, in the sense of codes that are universally
understood by people in a given industry, makes data exchange possible.
It also means that users can do "reasonableness" checks on data by
eye, with regular expressions, check digits and other programming
tools.
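A tiny sketch of such a check in the DDL, with made-up table and column names and the standard SQL SIMILAR TO predicate (not every product supports it):

CREATE TABLE Shipments
(upc CHAR(12) NOT NULL
     CONSTRAINT upc_is_all_digits
     CHECK (upc SIMILAR TO '[0-9]{12}'),
 shipped_qty INTEGER NOT NULL);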


>> Could someone give a real-world example where "familiarity" is a compelling reason to choose a composite PK, trumping both stability and simplicity? <<


(longitude, latitude). Geographical hierarchies. Any co-ordinate
system ((x,y), (x,y,z), log-spiral, circular, etc.)
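A minimal sketch of such a composite key; the table name, column names and precision are hypothetical:

CREATE TABLE Landmarks
(longitude DECIMAL(9,6) NOT NULL
     CHECK (longitude BETWEEN -180.0 AND 180.0),
 latitude DECIMAL(8,6) NOT NULL
     CHECK (latitude BETWEEN -90.0 AND 90.0),
 landmark_name VARCHAR(50) NOT NULL,
 PRIMARY KEY (longitude, latitude));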


>> Finally, if there *are* any compelling reasons to prefer composite keys in some situations, after how many attributes in a composite key should one give up and create a surrogate key? <<


How big should a vehicle be? The real point is that if you have an
n-part composite key in the data model, you cannot make it go away with
a magic number (ever read the Kabbalah stuff that is a fad among movie
stars right now? Thinking there is a magic 17-digit Hebrew number that
God puts on everything is not science or logic).


>> I've seen composite keys of up to 5 attributes and often wondered why that particular decision was made. Seems inelegant, but then perhaps I'm missing some other fundamental here. <<


Data integrity and a correct data model. But people do not handle more
than five things very well, so that has been considered an upper limit
on human data processing ever since a guy named Brown wrote a paper on
it in the 1950's.

Let's look at the logical problems. First, create a table with two
columns and try to make them both IDENTITY. If you cannot declare
more than one column to be of a certain data type, then that thing is
not a data type at all, by definition. It is a property which belongs
to the PHYSICAL table, not the LOGICAL data in the table. It also has
to be NULL-able to be a data type in SQL.
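A quick sketch of that first experiment, in T-SQL with a throw-away table name; the CREATE TABLE is rejected outright:

CREATE TABLE TwoIdentities
(id1 INTEGER IDENTITY(1,1) NOT NULL,
 id2 INTEGER IDENTITY(1,1) NOT NULL);
-- fails: SQL Server allows at most one IDENTITY column per table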


Next, create a table with one column and make it an IDENTITY. Now try
to insert, update and delete different numbers from it. If you cannot
insert, update and delete rows from a table, then it is not a table by
definition.
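Again a sketch, assuming T-SQL and a made-up table name:

CREATE TABLE OneIdentity (id INTEGER IDENTITY(1,1) NOT NULL);

INSERT INTO OneIdentity DEFAULT VALUES;   -- the only way in; you do not pick the value
INSERT INTO OneIdentity (id) VALUES (42); -- fails unless IDENTITY_INSERT is switched on
UPDATE OneIdentity SET id = 42;           -- fails; an IDENTITY column cannot be updated
DELETE FROM OneIdentity WHERE id = 1;     -- allowed, but the value 1 is never reused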


Finally create a simple table with one IDENTITY and a few other
columns. Use a few statements like


INSERT INTO Foobar (a, b, c) VALUES ('a1', 'b1', 'c1');
INSERT INTO Foobar (a, b, c) VALUES ('a2', 'b2', 'c2');
INSERT INTO Foobar (a, b, c) VALUES ('a3', 'b3', 'c3');


to put a few rows into the table, and notice that the IDENTITY column
numbered them sequentially in the order they were presented. If you
delete a row, the gap in the sequence is not filled in, and the sequence
continues from the highest number that has ever been used in that
column in that particular table. This is how we did record numbers in
pre-allocated sequential files in the 1950's, by the way. A utility
program would then "pack" or "compress" the records that were flagged
as deleted or unused to move the empty space to the physical end of the
physical file. IDENTITY leaves the gaps.
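As a concrete sketch of that experiment in T-SQL -- the column declarations are invented here, since the original statements do not show them:

CREATE TABLE Foobar
(fake_id INTEGER IDENTITY(1,1) NOT NULL,
 a CHAR(2) NOT NULL,
 b CHAR(2) NOT NULL,
 c CHAR(2) NOT NULL);

-- after the three INSERT statements above, fake_id runs 1, 2, 3
DELETE FROM Foobar WHERE fake_id = 2;                   -- leaves a gap
INSERT INTO Foobar (a, b, c) VALUES ('a4', 'b4', 'c4'); -- gets fake_id = 4, not 2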


But now use a statement with a query expression in it, like this:


INSERT INTO Foobar (a, b, c)
SELECT x, y, z
FROM Floob;


Since a query result is a table, and a table is a set which has no
ordering, what should the IDENTITY numbers be? The entire, whole,
completed set is presented to Foobar all at once, not a row at a time.
There are (n!) ways to number (n) rows, so which one do you pick? The
answer has been to use whatever the *physical* order of the result set
happened to be. That non-relational phrase "physical order" again!


But it is actually worse than that. If the same query is executed
again, but with new statistics or after an index has been dropped or
added, the new execution plan could bring the result set back in a
different physical order. Indexes and statistics are not part of the
logical model.


Can you explain from a logical model viewpoint why the same rows in the
second query get different IDENTITY numbers? In the relational model,
they should be treated the same if all the values of all the attributes
are identical.


Using IDENTITY as a primary key is a sign that there is no data model,
only an imitation of a sequential file system. Since this "magic,
all-purpose, one-size-fits-all" pseudo-identifier exists only as a
result of the physical state of a particular piece of hardware at a
particular time as read by the current release of a particular database
product, how do you verify that an entity has such a number in the
reality you are modeling?


Another cute way to destroy data integrity:


BEGIN ATOMIC
DELETE FROM Foobar
WHERE id = <>;
INSERT INTO Foobar
VALUES (<>);
END;


Logically this should do nothing, but since IDENTITY has gaps and never
reuses a value, the "same" row comes back with a brand-new number, and
anything that referenced the old value is now broken: the data is
trashed.
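To make that concrete with the hypothetical Foobar sketched earlier (T-SQL; the exact new value depends on how many inserts have already happened):

DELETE FROM Foobar WHERE fake_id = 3;
INSERT INTO Foobar (a, b, c) VALUES ('a3', 'b3', 'c3');
-- the logically identical row comes back with a new fake_id,
-- so anything that remembered the old number is silently orphaned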


Now you are REALLY thinking about relations and keys instead of 1950's
sequential record numbering. Adding an IDENTITY column to either of
these tables as a candidate key would be dangerously redundant; one
query uses the IDENTITY and another uses the real key, and like a man
with two watches, you are never sure what time it is.


Daniel Wetzler found out the hard way that IDENTITY changed behavior
between SQL Server 2000 and SQL Server 2005. He expected the statement
below to produce only one row for each distinct set of values.


SELECT DISTINCT IDENTITY (INTEGER) AS fake_id, title1, ..
FROM Foobar
WHERE title1 IS NOT NULL
AND ..


The IDENTITY function makes each row unique, so DISTINCT doesn't
eliminate the duplicates in this case. Interestingly, this behavior
seems to have changed in SQL Server 2005. If you run this as a SELECT
INTO on 2005, the execution plan computes the IDENTITY value after the
DISTINCT.


For 2000 the kludge is to do the DISTINCT in a derived table first, so
the IDENTITY value is computed afterward. The following should insert
just one row into the target table.


CREATE TABLE Foobar (title1 VARCHAR(10), ..);


INSERT INTO Foobar VALUES ('1', ..);
INSERT INTO Foobar VALUES ('1', ..);


SELECT IDENTITY (INTEGER) AS fake_id, title1, ..
  INTO Foobar2
  FROM (SELECT DISTINCT title1, ..
          FROM Foobar
         WHERE ..) AS X;


Since we are dealing with a proprietary ...
