SQL for Smarties | SQL Programming Style | Trees and Hierarchies in SQL | SQL Puzzles and Answers | Data and Databases


Friday, December 28, 2007

A Strange Use of UDFs?

SQL Apprentice Question
started a new project recently and the .Net/SQL Server 2000
application I was hired to help with was developed by a large
consulting firm. All the data access for the application is done via
stored procedures, but most stored procedure data access is done
via functions. So, you might have a SELECT inside a stored
procedure that looks similar to this:

SELECT e.LastName, et.TerritoryDescription, en.NADescription
FROM Employees e
LEFT JOIN fxEmpNat() en ON e.EmployeeID = en.EmployeeID
LEFT JOIN fxEmpTer() et ON e.EmployeeID = et.EmployeeID
WHERE e.LastName = 'Fuller'


And the function definitions are *similar* to those created in the
script listed at the end of this post (using the Northwind database).
I would solve the same problem using this approach:


SELECT e.LastName, t.TerritoryDescription, n.NADescription
FROM dbo.Employees e
LEFT JOIN dbo.EmployeeTerritories et ON e.EmployeeID = et.EmployeeID
JOIN dbo.Territories t ON et.TerritoryID = t.TerritoryID
LEFT JOIN dbo.EmployeeNationality en ON e.EmployeeID = en.EmployeeID
JOIN dbo.Nationality n ON en.NationalityID = n.NationalityID
WHERE e.LastName = 'Fuller'


The original developers are long gone and no one currently working
on the project knows why the function-centric approach was used.
There are no security restrictions that would merit such an approach,
and even if there were I would think a VIEW solution would be the
more traditional approach. And there are only a couple of complicated
relationships that one might want to "hide" from a less experienced
developer that did not know the intricacies of the data. Anybody have
any ideas when this approach would be justified?


-- Addl. Table and Function Definitions
CREATE TABLE Nationality
(
NationalityID int PRIMARY KEY,
NADescription varchar(20) NOT NULL
)
go
CREATE INDEX IX_Nationality_NADescription ON Nationality(NADescription)
go
INSERT Nationality VALUES (1,'America')
INSERT Nationality VALUES (2,'Canada')
INSERT Nationality VALUES (3,'Angola')
go


CREATE TABLE EmployeeNationality
(
EmployeeID int,
NationalityID int,
CreateDate datetime
CONSTRAINT PK_EmployeeNationality
PRIMARY KEY NONCLUSTERED
(EmployeeID,NationalityID,CreateDate)
)
go
INSERT EmployeeNationality values(1,1,'01/01/80')
INSERT EmployeeNationality values(1,1,'01/01/90')
INSERT EmployeeNationality values(2,1,'01/01/90')
INSERT EmployeeNationality values(3,2,'01/01/90')
INSERT EmployeeNationality values(4,3,'01/01/90')
go


CREATE FUNCTION fxEmpNat()


RETURNS TABLE


AS


RETURN (
SELECT e.EmployeeID, n.NADescription
FROM dbo.Employees e
JOIN dbo.EmployeeNationality en
ON e.EmployeeID = en.EmployeeID
AND en.CreateDate = (SELECT MAX(en2.CreateDate)
FROM EmployeeNationality en2
WHERE en.EmployeeID = en2.EmployeeID
AND en.NationalityID = en2.NationalityID )
JOIN dbo.Nationality n ON en.NationalityID = n.NationalityID
)
go


CREATE FUNCTION fxEmpTer()


RETURNS TABLE


AS


RETURN (
SELECT e.EmployeeID, t.TerritoryDescription
FROM dbo.Employees e
JOIN dbo.EmployeeTerritories et ON e.EmployeeID = et.EmployeeID
JOIN dbo.Territories t ON et.TerritoryID = t.TerritoryID
)
go



Celko Answers
One reason I can think of is job security. This code will never port
and can be read only by dialect speakers. A function call cannot be
optimized like a VIEW, so you are at risk for poor performance, as
well as the maintenance problems, as time goes on.

Another reason is that they are not SQL programmers and do not think
in terms of declarations. They want to see the familiar function call
they know from procedural languages.



Original source

Using WHERE clause parameter

SQL Apprentice Question
When I try to use an SP with parameter that is the WHERE clause it generates
an error.

E.g,

@myWhere = varchar(200)

AS

SELECT x, y FROM skwi WHERE @myWhere

myWhere = status = 7 AND LastName = 'Smith'


The problem is that the where clause is built conditionally in the program.
Any advise and examples on how to accomplish would be appreciated.



Celko Answers

>> When I try to use an SP with parameter that is the WHERE clause it generates an error. <<


The short, dangerous kludge is to use Dynamic SQL.

The right answer is to get out that old text book on Software
Engineering and the chapters on coupling and cohesion of code
modules. Those rules still apply in SQL.


You are writing a "Britney Spears, Automobiles and Squids" module --
you have no idea what it will do at run time, so it has absolutely no
cohesion. Instead of depending on every random future user to write
proper SQL, you need to earn your salary and proper them with a well-
defined module with a meaningful name and a known parameter list.


If you want a general query tool, then use QA. Application users
should be kept away form it.





Original source

Thursday, December 27, 2007

How do I group in a union?

SQL Apprentice Question
I have a select statement that uses union to pull data from multiple
databases and return them in a single recordset. I want to group these
results using group by. How do I do that?


Here's what I have:


SELECT EmployeeID, ProjectID
FROM DB1.table1
UNION
SELECT EmployeeID, ProjectID
FROM DB2.table1


This works fine but I want to group all projects by EmployeeID. I tried the
following but it didn't work


SELECT EmployeeID, ProjectID
FROM DB1.table1
UNION
SELECT EmployeeID, ProjectID
FROM DB2.table1
GROUP BY EmployeeID


I'd appreciate some help here.


Celko Answers
The results of a UNION do not have column names

SELECT X.emp_id, COUNT(project_id) AS project_tot
FROM (SELECT emp_id, project_id
FROM DB1.Table1
UNION
SELECT emp_id, project_id
FROM DB2.Table1)
AS X(emp_id, project_id)
GROUP BY X.emp_id;


UNION ALL will be faster, if it is possible.




Original source

when calling UPDATE from the result set of a SELECT statement, is the order in which rows from the SELECT statement 100% geronteed?

SQL Apprentice Question
when calling UPDATE from the result set of a SELECT statement, is the order
in which rows from the SELECT statement 100% geronteed?

tableA


myid
-----------
0
1
2
3
4
5


UPDATE tableB u
SET myid = i.myid
FROM tableA i
ORDER BY myid


Will this always update tableB with 5 since it is the last one? is this 100%
garonteed to follow the order of the source result set?


Celko Answers
No. There is no ordering in a table by defintion. Since this strictly
proprietary syntax it an do anything MS feels like next week. You are
just looking for the comfort of a 1960's sequential file system
instead of learning to think in RDBMS.


Original source

Which values do NOT appear within a tolerance value Options

SQL Apprentice Question
I have a table of servers that receive updates on a regular basis

CREATE TABLE [dbo].[tbl_ServerUpdate](
[ServerUpdateID] [int] NOT NULL,
[Server] [nvarchar](256) NOT NULL,
[UpdateStamp] [datetime] NOT NULL,
)


I then have tolerance values in another table which I can get like this
(they are in terms of minutes):


DECLARE @Interval INT


SELECT @Interval = IntervalValue
FROM tbl_Interval (NOLOCK)
WHERE IntervalID = 1


I want to know which servers have NOT received an update within the
tolerance value. For example, give me the set of Server that have not
received an update in the last 5 (@Interval) minutes.


Celko Answers
What is an update_id? What would it mean in a logical data model?
Surely, you did NOT just physically number the rows in a table!

This table also had no key, so I made an assumption that you want to
use the (originally very long AND possibly in Chinese!) server names.
But isn't the update interval logically an attribute of each server?
Shouldn't it be in the Servers table? I cannot see an interval
floating around as an entity in itself.


CREATE TABLE Servers
(server_name VARCHAR(25) NOT NULL PRIMARY KEY,
update_stamp DATETIME DEFAULT CURRENT_TIMESTAMP NOT NULL,
update_interval INTEGER NOT NULL
CHECK(update_interval > 0),
etc.
);


If you need this a lot, put it in a VIEW.


CREATE VIEW ExpiredServers (server_name)
AS
SELECT server_name
FROM Servers
WHERE update_stamp
< DATEADD(MI, -update_interval, CURRENT_TIMESTAMP);




SQL Apprentice Question
In the real world, everything is not a beautifully static picture of
relational data. I didn't include the key because I don't think its relevant
to the solution. But since you brought it up: is it better to have a natural
key on a varchar or an identity key as an integer that doesn't really have
any intrinsic business meaning

? In theory, it should be the varchar, but in practice its more performant
to index an int rather than a varchar(25). Is that not true?


The interval is not an attribute of the server, but rather an attribute of
the update type. In my case, there are seven different minute intervals. So,
in the end, I need to know - for each update type - which server did not
receive it within the last [interval] (in terms of minutes). The interval is
configurable by the operations group - sometimes they may want to be alerted
when a server didn't receive an update in the last 3 minutes, sometimes in
the last 10 minutes. They can change this real-time in the database.


I am not a TSQL guru - thats why I use this newsgroup from time to time. I
also have every one of your books on my shelf. However, the corporate world
is NOT academic and there is a tradeoff between having an understandable
data model that is easy to work with and having a data model that can serve
as a model in CS405 (or whatever). Sometimes we have to denormalize or
flatten data to be able to run reports in an acceptable time; sometimes we
have to add bit columns that could easily be derived from another to improve
query performance.


The balance you give is valuable, so I hope you continue to berate us when
our designs violate, but I'll bet - despite all the companies you have
worked with - you can't name one that had a data model that gave you a
boner.



Celko Answers
>> is it better to have a natural key on a VARCHAR(n) or an IDENTITY key [SIC] as an integer that doesn't really have any intrinsic business meaning? <<


How important is data integrity? Why not use a pointer chain DB
instead of mimicking it in SQL? At least they have features to do
garbage collection, restore chains, etc. Eventually, denormalization
and short-cuts will come back and bite you.


>> In theory, it should be the VARCHAR, but in practice its more performant to index an INTEGER rather than a VARCHAR(25). Is that not true? <<


It depends on the product. You automatically assumed indexing; look
at hashing in Teradata. Longer strings for keys lead to easy perfect
hashing, which is always one probe as opposed to multiple probes with
an index when the database gets large. If you are using hidden
pointer chains like SQL Anywhere to implement PK-FK, there is no
difference. SQL Server does have a lot of "code museum" problems and
this is one of them; one type of simple B+ Tree index is used for all
data types and distributions. However, until you get to large DBs, it
works fine and it will get a shot in the arm from 64-bit hardware,
too.


>> The interval is not an attribute of the server, but rather an attribute of the update type. In my case, there are seven different minute intervals. So, in the end, I need to know - for each update type - which server did not receive it within the last [interval] (in terms of minutes). The interval is configurable by the operations group - sometimes they may want to be alerted when a server didn't receive an update in the last 3 minutes, sometimes in the last 10 minutes. They can change this real-time in the database. <<


A type is an attribute by definition, so what entity does the
update_type belong to, if not a server? I would think from this
description, you would have seven columns for the logically different
updates, and seven columns for their corresponding intervals. Or is
this a repeating group where not all update types apply to all servers
and the update types can be changed?


>> I also have every one of your books on my shelf. <<


Neat! Number 7 is out in 2008 February.


>> .. I'll bet - despite all the companies you have worked with - you can't name one that had a data model that gave you a boner. <<


LOL! Remember what I do for a living -- this is like asking a doctor
why all his patients are sick!

The best one I can remember just off hand was for a credit management
company. They had just designed it and wanted a two day review from
me. The only real problem I found was that a bunch of the columns
were stubbed in with a magical CHAR(1) NOT NULL data type; they were
still working on the encodings and waiting for advise from legal.


Another one was a software company with a portal product that manages
corporate software access. We replaced ~60 tables with a nested set
model that lead to ~6 tables for the core processes. In fairness, the
original model had grown over time from a denormalized model on a
small platform to a mainframe tool. People kept adding tables to it
as a work-around and it had become a jungle.



Original source

QUOTED_IDENTIFIER & ANSI_NULLS Options

SQL Apprentice Question
does anyone know how to keep QA from adding the lines setting these
two options on and off along with blank lines at the beginning and end
of every object you edit? i have searched quite a bit on this but
haven't been able to come up with anything.


Celko Answers
>> is there a reason I wouldn't want to do this? <<


Conformance to ANSI/ISO Standards should be a goal in any shop, so you
would not turn off options that bring you to that goal. Why would you
want to write your own database language?



Original source

update table dateCol3 to the later of dateCol1 or dateCol2 Options

SQL Apprentice Question
CREATE TABLE #tmp1(rowID int identity(1,1), dateCol1 datetime, dateCol2
datetime, dateCol3 datetime)

INSERT INTO #tmp1(dateCol1, dateCol2)
SELECT '1/1/05', '2/1/05'
UNION ALL SELECT '3/1/05', '3/7/05'
UNION ALL SELECT '4/1/05', '3/20/05'
UNION ALL SELECT '5/1/05', '5/13/05'


UPDATE #tmp1 SET dateCol3 = ?


If dateCol1 > dateCol2 then update dateCol3 to dateCol1
else
update #tmp1 Set dateCol3 to dateCol2


I am sure this is not the most normalized example, but what would be the
tSql to update my table with the latest date in a row without having to do it
in 2 queries?


I realize I could say

update #tmp1 set datecol3 = datecol1 where datecol1 > datecol2

and then


update #tmp1 set datecol3 = datecol2 where datecol2 > datecol1


Is there a way to do this in one query statement? What would that look like?


Thanks,




Celko Answers
If you followed ISO-11179 data element naming rules, avoided
needlessly proprietary code, and put your date into the proper format
for Standard SQL, would your posting look like this?

CREATE TABLE Foobar
(foobar_id INTEGER NOT NULL PRIMARY KEY,
col1_date DATETIME NOT NULL,
col2_date DATETIME NOT NULL,
col3_date DATETIME);


INSERT INTO Foobar(foobar_id, col1_date, col2_date)
VALUES (42, '2005-01-01', '2005-02-01');
Etc.


IF col1_date > col2_date
THEN update col3_date to col1_date
ELSE update col3_date to col2_date
<< col3_date =" CASE"> col2_date
THEN col1_date
ELSE col2_date END;


You have a few choices here.
1) You can make col3_date into a computed column in proprietary syntax
-- look it up.
2) You can use the CASE expression to update col3_date; "UPDATE Foobar
SET col3_date = CASE.. END;" Of course this still means that you have
materialized computed data in violate of good design.
3) You can put it in a VIEW, and drop col3_date from the base table.
The code will always be right and portable.


Original source