SQL for Smarties | SQL Programming Style | Trees and Hierarchies in SQL | SQL Puzzles and Answers | Data and Databases


Monday, May 01, 2006

algorithm for a kind of "between" query

SQL Apprentice Question
I've tried to construct an algorithm myself in vain for quite some time
now for the following problem. Could anyone help me out?


Given a table R with a primary key consisting of columns { K1, K2, ...,
Kp }, p >= 1, and two arbitrary rows a and b, I want to query all rows
"between" a and b. I mean, if I list all possible rows of R ordered by
the primary key, I want all rows between a and b (inclusive). And those
rows may be really arbitrary, the K1 value of a might be less than the
K2 value of b, while the K2 value of a might be greater than the K2
value of b, if you see what I mean. I want to express that query
without using any fancy SQL stuff.


The algorithm I'd like to find would give me the smallest set of
expression, such that the expression is on the form (using SQL syntax):


(K1 = v_1) and (K2 = v_2) and ... and (Kn = v_n) and (Km between v_m_a
and v_m_b)


where 0 <= n < p, m = n+1, and v_xxx are column values.


Celko Answers
Here are the rules for row comparisons in Standard SQL taken from SQL
FOR SMARTIES 3-rd edition
.

09.02. Row Comparisons in SQL


Standard SQL generalized the theta operators so they would work on row
expressions and not just on scalars. This is not a popular feature
yet, but it is very handy for situations where a key is made from more
than one column, and so forth. This makes SQL more orthogonal and it
has an intuitive feel to it. Take three row constants:


A = (10, 20, 30, 40);


B = (10, NULL, 30, 40);


C = (10, NULL, 30, 100);


It seems reasonable to define a row comparison as valid only when the
data types of each corresponding column in the rows are
union-compatible. If not, the operation is an error and should report
a warning. It also seems reasonable to define the results of the
comparison to the AND-ed results of each corresponding column using the
same operator. That is, (A = B) becomes:


((10, 20, 30, 40) = (10, NULL, 30, 40));


becomes:
((10 = 10) AND (20 = NULL) AND (30 = 30) AND (40 = 40))


becomes:
(TRUE AND UNKNOWN AND TRUE AND TRUE);


becomes:
(UNKNOWN);


This seems to be reasonable and conforms to the idea that a NULL is a
missing value that we expect to resolve at a future date, so we cannot
draw a conclusion about this comparison just yet. Now consider the
comparison (A = C), which becomes:


((10, 20, 30, 40) = (10, NULL, 30, 100));


becomes:
((10 = 10) AND (20 = NULL) AND (30 = 30) AND (40 = 100));


becomes:
(TRUE AND UNKNOWN AND TRUE AND FALSE);


becomes:
(FALSE);


There is no way to pick a value for column 2 of row C such that the
UNKNOWN result will change to TRUE because the fourth column is always
FALSE. This leaves you with a situation that is not very intuitive.
The first case can resolve to TRUE or FALSE, but the second case can
only go to FALSE.


Standard SQL decided that the theta operators would work as shown in
the table below. The expression RX RY is shorthand for a row
RX compared to a row RY; likewise, RXi means the i-th column in the row
RX. The results are still TRUE, FALSE, or UNKNOWN, if there is no
error in type matching. The rules favor solid tests for TRUE or FALSE,
using UNKNOWN as a last resort.


The idea of these rules is that as you read the rows from left to
right, the values in one row are always greater than or less than)
those in the other row after some column. This is how it would work if
you were alphabetizing words.


The rules are


1. RX = RY is TRUE if and only if RXi = RYi for all i.


2. RX <> RY is TRUE if and only if RXi <> RYi for some i.


3. RX < RY is TRUE if and only if RXi = RYi for all i < n and
RXn < RYn for some n.


4. RX > RY is TRUE if and only if RXi = RYi for all i < n and
RXn > RYn for some n.


5. RX <= RY is TRUE if and only if Rx = Ry or Rx < Ry.


6. RX >= RY is TRUE if and only if Rx = Ry or Rx > Ry.


7. RX = RY is FALSE if and only if RX <> RY is TRUE.


8. RX <> RY is FALSE if and only if RX = RY is TRUE.


9. RX < RY is FALSE if and only if RX >= RY is TRUE.


10. RX > RY is FALSE if and only if RX <= RY is TRUE.


11. RX <= RY is FALSE if and only if RX > RY is TRUE.


12. RX >= RY is FALSE if and only if RX < RY is TRUE.


13. RX RY is UNKNOWN if and only if RX RY is
neither TRUE nor FALSE.


The negations are defined so that the NOT operator will still have its
usual properties. Notice that a NULL in a row will give an UNKNOWN
result in a comparison. Consider this expression:


(a, b, c) < (x, y, z)


which becomes


((a < x)
OR ((a = x) AND (b < y))
OR ((a = x) AND (b = y) AND (c < z)))


The standard allows a single-row expression of any sort, including a
single-row subquery, on either side of a comparison. Likewise, the
BETWEEN predicate can use row expressions in any position in Standard
SQL.

No comments: