

Thursday, August 03, 2006

Towel tossed in on use of BETWEEN

SQL Apprentice Question
After extensive testing with tables of various sizes (the report below concerns a table of 10 million rows), I have concluded that my earlier assertion, that using [BETWEEN] seemed to be more efficient than using [ >= AND <= ], was not correct, as many of you gently countered. There were other confounding factors that I had not properly considered.


In fact, using [BETWEEN] incurs a slight performance penalty, most likely during creation of the execution plan, where [BETWEEN] is converted into [ >= AND <= ].
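One way to observe this conversion on SQL Server is to compare the textual plans rather than run the queries; the sketch below reuses the table and date literals from the test script, but the exact plan text varies by version, so treat it as illustrative:

```sql
-- Sketch (SQL Server): display the plan instead of executing.
-- Both statements should show the same Index Seek with a range
-- Seek Predicate ( TestDate >= start AND TestDate <= end ),
-- confirming that BETWEEN is rewritten during plan creation.
SET SHOWPLAN_TEXT ON;
GO
SELECT count(1) FROM #Test
WHERE TestDate BETWEEN '03/01/2001' AND '03/31/2001';

SELECT count(1) FROM #Test
WHERE TestDate >= '03/01/2001' AND TestDate <= '03/31/2001';
GO
SET SHOWPLAN_TEXT OFF;
GO
```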


It is worth noting that there was no difference found between a properly constructed [ >= AND < ] and [ >= AND <= ].
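One practical caveat when writing the two forms against datetime columns: a closed upper bound must name the last representable instant of the period, and SQL Server's datetime type ticks in 1/300-second (3.33 ms) increments. A sketch of the equivalence used in the test script below:

```sql
-- Half-open range: independent of datetime's 3.33 ms resolution.
SELECT count(1) FROM #Test
WHERE TestDate >= '03/01/2001' AND TestDate < '04/01/2001';

-- Closed range: the upper bound must stop just short of midnight.
-- '23:59:59.999' would round UP to '04/01/2001 00:00:00.000';
-- '23:59:59.998' rounds down to .997, the last storable instant of the day.
SELECT count(1) FROM #Test
WHERE TestDate >= '03/01/2001' AND TestDate <= '03/31/2001 23:59:59.998';
```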


Since the query was covered by the index, there was no difference between the use of a 'regular' index and a clustered index. (I will redo that test, because I want to know the effect of a non-covering index vs. a clustered index.)
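For reference, 'covered' here means the index alone can answer the query without touching the base table. Since the test queries read only [TestDate], a plain nonclustered index on that one column covers them; a sketch (index names are illustrative, and INCLUDE requires SQL Server 2005 or later):

```sql
-- The test queries read only TestDate, so this index covers them:
CREATE INDEX Test_datetime_idx ON #Test ( TestDate );

-- If a query also returned ID, coverage could be restored by
-- carrying the extra column at the leaf level (SQL Server 2005+):
-- CREATE INDEX Test_dt_id_idx ON #Test ( TestDate ) INCLUDE ( ID );
```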


The data below are a representative indication of my results. (The code is also included in case it is of use to anyone for similar testing.)


The query was crafted so that it would return approximately 2% of the table.
Each query was verified to return the same number of rows, and thus, ostensibly, the same data.
The datetime field was designed to hold 'scattered' time values over a date range of approximately 6.5 years.
The 'unordered' table was loaded so as to cause significant 'disruption' to table scans.


Table: 10 M rows

Criteria                                     [ BETWEEN ]  [ >= AND < ]  [ >= AND <= ]
Rows Returned                                    218,898       218,898        218,898
Execution Time (ms):
  No primary key, no index, disordered table        1360          1326           1330
  Primary key on [ID], index on [TestDate]            60            46             46
  Clustered index on [TestDate]                       60            46             46


/*
Demonstration of the query-speed effect of using
[BETWEEN] vs. [>= AND <] vs. [>= AND <=].

Kudos to the many who challenge and inspire.

The use of @RangeLimit in the WHILE block below limits the
range of generated dates to approximately 6.5 years.
Alter the primary key and indexing as desired.
*/


CREATE TABLE #Test
( ID int --PRIMARY KEY
, TestDate datetime
)


--CREATE INDEX Test_datetime_idx ON #Test ( TestDate )
--CREATE clustered INDEX dt_index ON #Test ( TestDate )


DECLARE
@iCounter int
, @RangeLimit int
, @StartDataLoad datetime
, @StopDataLoad datetime
, @TestStart1 datetime
, @TestStop1 datetime
, @TestStart2 datetime
, @TestStop2 datetime
, @TestStart3 datetime
, @TestStop3 datetime
, @DataRowsLoaded int
, @RowsTest1 int
, @RowsTest2 int
, @RowsTest3 int


SELECT
@iCounter = 0
, @RangeLimit = 1


SET NOCOUNT ON


SET @StartDataLoad = getdate()


WHILE @iCounter < 10000000


BEGIN


INSERT INTO #Test ( ID, TestDate )
VALUES ( @iCounter, cast( '20000101' AS datetime ) -- unambiguous date literal, safe under any DATEFORMAT
    + cast( rand( @iCounter ) * @RangeLimit AS datetime )) -- offset of 0..@RangeLimit days


SELECT
@iCounter = ( @iCounter + 1 )
, @RangeLimit = ( @RangeLimit + 1 )


IF ( @iCounter % 2400 ) = 0 -- reset so the offset spans at most ~2400 days (~6.5 years)
SET @RangeLimit = 1


END


SELECT @DataRowsLoaded = count(1)
FROM #Test


SET @StopDataLoad = getdate()


DBCC DROPCLEANBUFFERS
DBCC FREEPROCCACHE


SET @TestStart1 = getdate()


SELECT @RowsTest1 = count(1)
FROM #Test
WHERE TestDate BETWEEN '03/01/2001' AND '03/31/2001'


SET @TestStop1 = getdate()


DBCC DROPCLEANBUFFERS
DBCC FREEPROCCACHE


SET @TestStart2 = getdate()


SELECT @RowsTest2 = count(1)
FROM #Test
WHERE TestDate >= '03/01/2001' AND TestDate < '04/01/2001'


SET @TestStop2 = getdate()


DBCC DROPCLEANBUFFERS
DBCC FREEPROCCACHE


SET @TestStart3 = getdate()


SELECT @RowsTest3 = count(1)
FROM #Test
WHERE TestDate >= '03/01/2001' AND TestDate <= '03/31/2001 23:59:59.998' -- .998 rounds down to .997, the last storable instant


SET @TestStop3 = getdate()


DROP TABLE #Test


PRINT ( '' )
PRINT ( 'Loading Data...' )
PRINT ( ' Start at: ' + convert( varchar(30), @StartDataLoad, 113 ))
PRINT ( ' Stop at: ' + convert( varchar(30), @StopDataLoad, 113 ))
PRINT ( ' Load Time (ms): ' + cast( datediff( ms, @StartDataLoad, @StopDataLoad ) AS varchar(10) ))
PRINT ( ' Rows in Table: ' + cast( @DataRowsLoaded as varchar(10) ))
PRINT ( '' )
PRINT ( 'Query [ BETWEEN ]' )
PRINT ( ' Start at: ' + convert( varchar(30), @TestStart1, 113 ))
PRINT ( ' Stop at: ' + convert( varchar(30), @TestStop1, 113 ))
PRINT ( ' Difference (ms): ' + cast( datediff( ms, @TestStart1, @TestStop1 ) AS varchar(10) ))
PRINT ( ' Rows in Query: ' + cast( @RowsTest1 as varchar(10) ))
PRINT ( '' )
PRINT ( 'Query [ >= AND < ]' )
PRINT ( ' Start at: ' + convert( varchar(30), @TestStart2, 113 ))
PRINT ( ' Stop at: ' + convert( varchar(30), @TestStop2, 113 ))
PRINT ( ' Difference (ms): ' + cast( datediff( ms, @TestStart2, @TestStop2 ) AS varchar(10) ))
PRINT ( ' Rows in Query: ' + cast( @RowsTest2 as varchar(10) ))
PRINT ( '' )
PRINT ( 'Query [ >= AND <= ]' )
PRINT ( ' Start at: ' + convert( varchar(30), @TestStart3, 113 ))
PRINT ( ' Stop at: ' + convert( varchar(30), @TestStop3, 113 ))
PRINT ( ' Difference (ms): ' + cast( datediff( ms, @TestStart3, @TestStop3 ) AS varchar(10) ))
PRINT ( ' Rows in Query: ' + cast( @RowsTest3 as varchar(10) ))
PRINT ( '' )


Celko Answers
>> After extensive testing, with various sizes of tables (the report below concerns a table of 10 million rows), I have concluded that my earlier assertion that using [BETWEEN] seemed to be more efficient than using [ >= AND <= ] was not correct -as many of you gently countered. <<


In theory they should perform the same. However, BETWEEN is easier to
maintain and read. Since roughly 80% of the total cost of code is in
maintenance, I would not give it up for a small gain (which I doubt is
really there).

BETWEEN can also have a mild advantage in products with better
optimizers. Remember B-tree indexes? They use BETWEEN-style logic to
traverse the tree structure, so optimizers try to group (>= and <=)
pairs into a jump to a node in the index tree. If the BETWEEN is
already there, so much the better. The optimizer will assume that the
programmer knows something and will favor BETWEEN when it has a choice.
