
Tuesday, April 25, 2006

Using one table vs. many

SQL Apprentice Question
I am designing a table that will store a large number of rows
(>100,000). The first idea is to create a generic table:

id | type | data

Where the data field is dependent on the type field. That is, if type == 1,
data is an INT; if type == 2, I might interpret data as VARCHAR... Is that a
good thing to do? I believe I could use BLOB as the data field type. This
way I could store variant types, right? Also note that I do not plan
to run SQL queries that depend on or otherwise manipulate the data field.

Or should I split the one huge table so the pieces look like this:

id | data (INT)

id | data (VARCHAR)

Celko Answers
Before I cut & paste a detailed rant about EAV, you might want to learn
all the HUGE differences between fields and columns. Given the proper
definitions, you will not ask questions like this again. The idea of a
variant column in an RDBMS is like a "changeable constant" -- a
contradiction in terms.

I found an old "cut & paste". Someone like you posted this:

CREATE TABLE EAV -- no key declared
(key_col VARCHAR (10) NULL,
attrib_value VARCHAR (50) NULL);

INSERT INTO EAV VALUES ('EVENT','verbal aggression');
INSERT INTO EAV VALUES ('EVENT','bad behavior');

CREATE TABLE EAV_DATA -- note the lack of constraints, defaults, DRI
(bts_id INTEGER NULL,
key_col VARCHAR (10) NULL,
attrib_value VARCHAR (50) NULL);

INSERT INTO EAV_DATA VALUES (1, 'EVENT', 'bad behavior');
INSERT INTO EAV_DATA VALUES (2, 'EVENT', 'verbal aggression');

Ideally, the result set of the query would be (headings if possible):

Location      Event              count
Bedroom       verbal aggression  1
Bedroom       peer               0
Bedroom       bad behavior       0
Bedroom       other              2
Dining Room   verbal aggression  0
Dining Room   peer               0
Dining Room   bad behavior       0
Dining Room   other              0
Bathroom      verbal aggression  0
Bathroom      peer               0
Bathroom      bad behavior       0
Bathroom      other              0
courtyard     verbal aggression  0
courtyard     peer               1
courtyard     bad behavior       0
courtyard     other              1

Also, if possible, another query would return this result set. (I think
I know how to do this one.)

Location      Event              count
Bedroom       verbal aggression  1
Bedroom       other              2
courtyard     peer               1
courtyard     other              1

Here is an answer from Thomas Coleman:

SELECT Locations.locationvalue, Events.eventvalue,
       (SELECT COUNT(*)
          FROM (SELECT TD1.bts_id, TD1.value AS locationvalue
                  FROM eav_data AS TD1
                 WHERE TD1.key = 'location') AS LocationData
         INNER JOIN
               (SELECT TD2.bts_id, TD2.value AS eventvalue
                  FROM eav_data AS TD2
                 WHERE TD2.key = 'event') AS EventData
            ON LocationData.bts_id = EventData.bts_id
         WHERE LocationData.locationvalue = Locations.locationvalue
           AND EventData.eventvalue = Events.eventvalue)
  FROM (SELECT DISTINCT T1.value AS locationvalue
          FROM eav_data AS T1
         WHERE T1.key = 'location') AS Locations,
       (SELECT DISTINCT T2.value AS eventvalue
          FROM eav_data AS T2
         WHERE T2.key = 'event') AS Events
 ORDER BY Locations.locationvalue, Events.eventvalue;

and, for the second result set, the same join grouped so that only the
combinations that actually occur come back:

SELECT CollatedEventData.locationvalue,
       CollatedEventData.eventvalue, COUNT(*)
  FROM (SELECT LocationData.locationvalue, EventData.eventvalue
          FROM (SELECT TD1.bts_id, TD1.value AS locationvalue
                  FROM eav_data AS TD1
                 WHERE TD1.key = 'location') AS LocationData
         INNER JOIN
               (SELECT TD2.bts_id, TD2.value AS eventvalue
                  FROM eav_data AS TD2
                 WHERE TD2.key = 'event') AS EventData
            ON LocationData.bts_id = EventData.bts_id
       ) AS CollatedEventData
 GROUP BY CollatedEventData.locationvalue, CollatedEventData.eventvalue
 ORDER BY CollatedEventData.locationvalue, CollatedEventData.eventvalue;

This is the same thing in a proper schema as:

SELECT L.locationvalue, E.eventvalue, COUNT(*)
FROM Locations AS L, Events AS E
WHERE L.bts_id = E.bts_id
GROUP BY L.locationvalue, E.eventvalue;

The reason that I had to use so many subqueries is that those entities
are all lopped into the same table. There should be separate tables for
Locations and Events.
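You can run the whole contrast end to end. Here is a sketch in Python's
sqlite3 (the sample rows and the bts_id / key / value names come from the
thread, not from any real schema, and the DISTINCT projections are my
addition): it builds the EAV heap, runs the cross-join-plus-scalar-count
pattern, then the proper-schema GROUP BY, and checks that the nonzero
rows agree.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# One generic EAV table: every entity lopped into the same structure.
cur.execute('CREATE TABLE eav_data (bts_id INTEGER, "key" TEXT, value TEXT)')
rows = [(1, 'location', 'Bedroom'),   (1, 'event', 'verbal aggression'),
        (2, 'location', 'Bedroom'),   (2, 'event', 'other'),
        (3, 'location', 'Bedroom'),   (3, 'event', 'other'),
        (4, 'location', 'courtyard'), (4, 'event', 'peer'),
        (5, 'location', 'courtyard'), (5, 'event', 'other')]
cur.executemany('INSERT INTO eav_data VALUES (?, ?, ?)', rows)

# EAV version: derived tables dig each "entity" back out of the heap,
# then a cross join plus a correlated scalar COUNT builds the full grid.
eav_counts = cur.execute('''
    SELECT Locations.locationvalue, Events.eventvalue,
           (SELECT COUNT(*)
              FROM (SELECT TD1.bts_id, TD1.value AS locationvalue
                      FROM eav_data AS TD1
                     WHERE TD1."key" = 'location') AS LocationData
             INNER JOIN
                   (SELECT TD2.bts_id, TD2.value AS eventvalue
                      FROM eav_data AS TD2
                     WHERE TD2."key" = 'event') AS EventData
                ON LocationData.bts_id = EventData.bts_id
             WHERE LocationData.locationvalue = Locations.locationvalue
               AND EventData.eventvalue = Events.eventvalue)
      FROM (SELECT DISTINCT value AS locationvalue
              FROM eav_data WHERE "key" = 'location') AS Locations,
           (SELECT DISTINCT value AS eventvalue
              FROM eav_data WHERE "key" = 'event') AS Events
     ORDER BY 1, 2''').fetchall()

# Proper schema: Locations and Events are their own tables, and the
# same answer is a three-line grouped join.
cur.execute('CREATE TABLE Locations (bts_id INTEGER, locationvalue TEXT)')
cur.execute('CREATE TABLE Events (bts_id INTEGER, eventvalue TEXT)')
cur.execute('''INSERT INTO Locations
               SELECT bts_id, value FROM eav_data WHERE "key" = 'location' ''')
cur.execute('''INSERT INTO Events
               SELECT bts_id, value FROM eav_data WHERE "key" = 'event' ''')
proper_counts = cur.execute('''
    SELECT L.locationvalue, E.eventvalue, COUNT(*)
      FROM Locations AS L, Events AS E
     WHERE L.bts_id = E.bts_id
     GROUP BY L.locationvalue, E.eventvalue
     ORDER BY 1, 2''').fetchall()

print(proper_counts)  # [('Bedroom', 'other', 2), ('Bedroom', 'verbal aggression', 1), ...]
```

Note how much of the EAV query is pure scaffolding to recover structure
that a proper schema would simply declare.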

The column names are seriously painful. Beyond the fact that I
personally hate underscores in column names, using underscores at the
end of the column name is really non-intuitive. I removed them for my
example and came across the next column name faux pas. Don't use "key"
and "value" for column names. It means that the developer *has* to
surround the column name with square brackets everywhere, which is a
serious pain.

There is such a thing as "too" generic. There has to be some structure
or everything becomes nothing more than a couple of tables called
"things". The real key (no pun intended) is commonality. Is there a
pattern to the data that they want to store? It may not be possible to
create one structure to rule them all and in the darkness bind them.

"To be is to be something in particular; to be nothing in particular is
to be nothing." --Aristotle

All data integrity is destroyed. Any typo becomes a new attribute or
entity. Entities are found missing attributes, so all the reports are
wrong.

Try to write a single CHECK() constraint that works for all the
attributes of those 30+ entities your users created because you were
too dumb or too lazy to do your job. It can be done! I needed a CASE
expression with almost 70 WHEN clauses for a simple invoice and order
system when I tried it as an exercise.
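To see why that single CHECK() turns into a monster, here is a toy
version in SQLite via Python (the attribute names and validation rules
are invented for illustration): one CASE expression must carry a WHEN
clause per attribute, and every new attribute means editing the table
definition again.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute('''
    CREATE TABLE eav_typed
    (entity_id INTEGER NOT NULL,
     attrib    TEXT NOT NULL,
     val       TEXT NOT NULL,
     CONSTRAINT one_check_to_rule_them_all
     CHECK (CASE attrib
            WHEN 'age'   THEN val GLOB '[0-9]*'   -- starts with a digit (crude)
            WHEN 'email' THEN val LIKE '%_@_%'    -- crude shape test
            WHEN 'state' THEN length(val) = 2     -- two-letter code
            ELSE 0 END))                          -- unknown attribute: reject
''')
conn.execute("INSERT INTO eav_typed VALUES (1, 'age', '42')")  # passes

try:
    conn.execute("INSERT INTO eav_typed VALUES (1, 'age', 'forty-two')")
    failed = False
except sqlite3.IntegrityError:
    failed = True
print(failed)  # True: the CASE rejected a non-numeric age
```

Three attributes already need three WHEN clauses; thirty entities'
worth of attributes is how you get to the 70-clause CASE described
above.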

Try to write a single DEFAULT clause for 30+ entities crammed into one
column. Impossible!

Try to set up DRI actions among the entities. If you thought the WHEN
clauses in the single CASE expression were unmaintainable, wait until
you see the "TRIGGERs from Hell" -- too bad that they might not fit
into older versions of SQL Server, which had some size limits. Now
maintain it.

Like most new ideas, the hard part of understanding what the relational
model is comes in un-learning what you know about file systems. As
Artemus Ward (Charles Farrar Browne, 1834-1867) put it, "It ain't so
much the things we don't know that get us into trouble. It's the things
we know that just ain't so."

If you already have a background in data processing with traditional
file systems, the first things to un-learn are:

(0) Databases are not file sets.
(1) Tables are not files.
(2) Rows are not records.
(3) Columns are not fields.

Modern data processing began with punch cards. The influence of the
punch card lingered on long after the invention of magnetic tapes and
disk for data storage. This is why early video display terminals were
80 columns across. Even today, files which were migrated from cards to
magnetic tape files or disk storage still use 80 column records.

But the influence was not just on the physical side of data processing.
The methods for handling data from the prior media were imitated in
the new media.

Data processing first consisted of sorting and merging decks of punch
cards (later, sequential magnetic tape files) in a series of distinct
steps. The result of each step fed into the next step in the process.
This led to temp tables and other tricks to mimic that kind of
step-by-step processing.

Relational databases do not work that way. Each user connects to the
entire database all at once, not to one file at time in a sequence of
steps. The users might not all have the same database access rights
once they are connected, however. Magnetic tapes could not be shared
among users at the same time, but shared data is the point of a
database.

Tables versus Files

A file is closely related to its physical storage media. A table may
or may not be a physical file. DB2 from IBM uses one file per table,
while Sybase puts several entire databases inside one file. A table is
a set of rows of the same kind of thing. A set has no ordering, so it
makes no sense to ask for the first or last row.

A deck of punch cards is sequential, and so are magnetic tape files.
Therefore, a physical file of ordered sequential records also
became the mental model for data processing and it is still hard
to shake. Anytime you look at data, it is in some physical ordering.

The various access methods for disk storage systems came later, but even
these access methods could not shake the mental model.

Another conceptual difference is that a file is usually data that deals
with a whole business process. A file has to have enough data in
itself to support applications for that business process. Files tend
to be "mixed" data which can be described by the name of the business
process, such as "The Payroll file" or something like that.

Tables can be either entities or relationships within a business
process. This means that the data which was held in one file is often
put into several tables. Tables tend to be "pure" data which can be
described by single words. The payroll would now have separate tables
for timecards, employees, projects and so forth.

Tables as Entities

An entity is a physical or conceptual "thing" which has meaning by
itself. A person, a sale, or a product would be examples. In a
relational database, an entity is defined by its attributes, which are
shown as values in columns in rows in a table.

To remind users that tables are sets of entities, I like to use
collective or plural nouns that describe the function of the entities
within the system for the names of tables. Thus "Employee" is a bad
name because it is singular; "Employees" is a better name because it is
plural; "Personnel" is best because it is collective and does not
summon up a mental picture of individual persons.

If you have tables with exactly the same structure, then they are sets
of the same kind of elements. But you should have only one set for
each kind of data element! Files, on the other hand, were PHYSICALLY
separate units of storage which could be alike -- each tape or disk
file represents a step in the PROCEDURE, such as moving from raw data,
to edited data, and finally to archived data. In SQL, this should be a
status flag in a table.
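A minimal sketch of that status-flag idea, in SQLite via Python (the
table and column names are illustrative): what used to be three
physical files becomes one table, and "moving" data from one stage to
the next is just an UPDATE.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute('''
    CREATE TABLE Measurements
    (measurement_id INTEGER PRIMARY KEY,
     reading        REAL NOT NULL,
     data_status    TEXT NOT NULL DEFAULT 'raw'
                    CHECK (data_status IN ('raw', 'edited', 'archived')))
''')
conn.execute("INSERT INTO Measurements (reading) VALUES (98.6)")

# The old "copy the raw file to the edited file" step is one UPDATE.
conn.execute("UPDATE Measurements SET data_status = 'edited' "
             "WHERE measurement_id = 1")
status = conn.execute("SELECT data_status FROM Measurements").fetchone()[0]
print(status)  # edited
```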

Tables as Relationships

A relationship is shown in a table by columns which reference one or
more entity tables. Without the entities, the relationship has no
meaning, but the relationship can have attributes of its own. For
example, a show business contract might have an agent, an employer and
a talent. The method of payment is an attribute of the contract
itself, and not of any of the three parties.

Rows versus Records

Rows are not records. A record is defined in the application program
which reads it; a row is defined in the database schema and not by a
program at all. A field is named in the READ or INPUT statements of
the application program; a row is named in the database schema.
Likewise, the PHYSICAL order of the field names in the READ statement
is vital (READ a, b, c is not the same as READ c, a, b), but SELECT
a, b, c returns the same data as SELECT c, a, b.
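The point about column order is easy to demonstrate in SQLite via
Python (toy table, invented values):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE T (a INTEGER, b INTEGER, c INTEGER)")
conn.execute("INSERT INTO T VALUES (1, 2, 3)")

# Same row, two column orders: the values travel with their names,
# so only the presentation order differs.
abc = conn.execute("SELECT a, b, c FROM T").fetchone()
cab = conn.execute("SELECT c, a, b FROM T").fetchone()
print(abc, cab)  # (1, 2, 3) (3, 1, 2)
```

In a READ statement, swapping the variable order would silently load
the wrong values into the wrong variables; here nothing is lost.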

All empty files look alike; they are a directory entry in the operating
system with a name and a length of zero bytes of storage. Empty tables
still have columns, constraints, security privileges and other
structures, even though they have no rows.

This is in keeping with the set theoretical model, in which the empty
set is a perfectly good set. The difference between SQL's set model
and standard mathematical set theory is that set theory has only one
empty set, but in SQL each table keeps its own structure, so an empty
table can be used only in places where its non-empty version could be
used.

Another characteristic of rows in a table is that they are all alike in
structure and they are all the "same kind of thing" in the model. In a
file system, records can vary in size, datatypes and structure by
having flags in the data stream that tell the program reading the data
how to interpret it. The most common examples are Pascal's variant
record, C's struct syntax and Cobol's OCCURS clause.

The OCCURS keyword in Cobol and the Variant records in Pascal have a
number which tells the program how many times a record structure is to
be repeated in the current record.

Unions in 'C' are not variant records, but variant mappings for the
same physical memory. For example:

union x {int ival; char j[4];} myStuff;

defines myStuff to be either an integer (which is 4 bytes on most
modern C compilers, though this code is non-portable) or an array of 4
bytes, depending on whether you access myStuff.ival or myStuff.j[0].
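The same two views of one chunk of storage can be shown with Python's
struct module (little-endian layout assumed, matching most modern
machines): four bytes read either as one int or as four separate bytes.

```python
import struct

as_int_bytes = struct.pack('<i', 1)   # the int 1 laid out as 4 bytes
as_byte_array = list(as_int_bytes)    # the same storage, byte by byte
print(as_byte_array)  # [1, 0, 0, 0]

(ival,) = struct.unpack('<i', bytes(as_byte_array))
print(ival)  # 1
```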

But even more than that, files often contained records which were
summaries of subsets of the other records -- so called control break
reports. There is no requirement that the records in a file be related
in any way -- they are literally a stream of binary data whose meaning
is assigned by the program reading them.

Columns versus Fields

A field within a record is defined by the application program that
reads it. A column in a row in a table is defined by the database
schema. The datatypes in a column are always scalar.

The order of the application program variables in the READ or INPUT
statements is important because the values are read into the program
variables in that order. In SQL, columns are referenced only by their
names. Yes, there are shorthands like the SELECT * clause and INSERT
INTO [table name] statements which expand into a list of column names
in the physical order in which the column names appear within their
table declaration, but these are shorthands which resolve to named
column lists.

The use of NULLs in SQL is also unique to the language. Fields do not
support a missing data marker as part of the field, record or file
itself. Nor do fields have constraints which can be added to them in
the record, like the DEFAULT and CHECK() clauses in SQL.
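Both ideas live in the schema rather than in any program. A sketch in
SQLite via Python (illustrative table and column names):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute('''
    CREATE TABLE Orders
    (order_nbr INTEGER PRIMARY KEY,
     order_qty INTEGER DEFAULT 1 CHECK (order_qty > 0),
     ship_date TEXT)  -- NULL here is the marker for "not shipped yet"
''')
conn.execute("INSERT INTO Orders (order_nbr) VALUES (1)")  # qty defaulted

qty, ship = conn.execute(
    "SELECT order_qty, ship_date FROM Orders WHERE order_nbr = 1").fetchone()
print(qty, ship)  # 1 None  <- DEFAULT applied; the NULL surfaces as None
```

No application program had to supply the quantity or remember the
"missing" convention; the schema did both.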

Relationships among tables within a database

Files are pretty passive creatures and will take whatever an
application program throws at them without much objection. Files are
also independent of each other simply because they are connected to one
application program at a time and therefore have no idea what other
files looks like.

A database actively seeks to maintain the correctness of all its data.
The methods used are triggers, constraints and declarative referential
integrity.

Declarative referential integrity (DRI) says, in effect, that data in
one table has a particular relationship with data in a second (possibly
the same) table. It is also possible to have the database change
itself via referential actions associated with the DRI.

For example, a business rule might be that we do not sell products
which are not in inventory. This rule would be enforced by a REFERENCES
clause on the Orders table which references the Inventory table, and by
a referential action of ON DELETE CASCADE.
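That rule can be sketched in SQLite via Python (illustrative schema;
note that SQLite requires foreign keys to be switched on per
connection):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE Inventory (product_nbr INTEGER PRIMARY KEY)")
conn.execute('''
    CREATE TABLE Orders
    (order_nbr   INTEGER PRIMARY KEY,
     product_nbr INTEGER NOT NULL
                 REFERENCES Inventory (product_nbr)
                 ON DELETE CASCADE)
''')
conn.execute("INSERT INTO Inventory VALUES (7)")
conn.execute("INSERT INTO Orders VALUES (1, 7)")

# You cannot order a product that is not in inventory ...
try:
    conn.execute("INSERT INTO Orders VALUES (2, 99)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
print(rejected)  # True

# ... and dropping the product cascades to its orders.
conn.execute("DELETE FROM Inventory WHERE product_nbr = 7")
remaining = conn.execute("SELECT COUNT(*) FROM Orders").fetchone()[0]
print(remaining)  # 0
```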

Triggers are a more general way of doing much the same thing as DRI. A
trigger is a block of procedural code which is executed before, after
or instead of an INSERT, UPDATE or DELETE statement. You can do anything
with a trigger that you can do with DRI and more.

However, there are problems with TRIGGERs. While there is a standard
syntax for them in the SQL-92 standard, most vendors have not
implemented it; what they have instead is highly proprietary syntax.
Secondly, a trigger cannot pass information to the optimizer the way
DRI can.
In the example in this section, I know that for every product number in
the Orders table, I have that same product number in the Inventory
table. The optimizer can use that information in setting up EXISTS()
predicates and JOINs in the queries. There is no reasonable way to
parse procedural trigger code to determine this relationship.

The CREATE ASSERTION statement in SQL-92 will allow the database to
enforce conditions on the entire database as a whole. An ASSERTION
looks like a CHECK() clause, but the difference is subtle. A CHECK()
clause is executed when there are rows in the table to which it is
attached. If the table is empty then all CHECK() clauses are
effectively TRUE. Thus, if we wanted to be sure that the Inventory
table is never empty, and we wrote:

( ...
CONSTRAINT inventory_not_empty
CHECK ((SELECT COUNT(*) FROM Inventory) > 0), ... );

it would not work. However, we could write:

CREATE ASSERTION Inventory_not_empty
CHECK ((SELECT COUNT(*) FROM Inventory) > 0);

and we would get the desired results. The assertion is checked at the
schema level and not at the table level.
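Since CREATE ASSERTION remains unimplemented in most products, the
usual workaround is a trigger. Here is a sketch in SQLite via Python
(the RAISE() syntax is SQLite-specific, and this only guards DELETE,
not an empty table created that way):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Inventory (product_nbr INTEGER PRIMARY KEY)")
conn.executescript('''
    CREATE TRIGGER Inventory_not_empty
    BEFORE DELETE ON Inventory
    WHEN (SELECT COUNT(*) FROM Inventory) <= 1
    BEGIN
        SELECT RAISE(ABORT, 'Inventory must not be empty');
    END;
''')
conn.execute("INSERT INTO Inventory VALUES (1)")
conn.execute("INSERT INTO Inventory VALUES (2)")

# Deleting down to one row is fine ...
conn.execute("DELETE FROM Inventory WHERE product_nbr = 2")

# ... but deleting the last row is blocked by the trigger.
try:
    conn.execute("DELETE FROM Inventory WHERE product_nbr = 1")
    blocked = False
except sqlite3.IntegrityError:
    blocked = True
print(blocked)  # True
```

This is exactly the kind of procedural scaffolding a real CREATE
ASSERTION would make declarative.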
