Speed Up SELECT DISTINCT Queries
By Neil Boyle
Jun 30, 2002
http://www.sql-server-performance.com/articles/per/select_distinct_queries_p1.aspx
Many people use the DISTINCT option in a SELECT statement to filter out
duplicate rows from a query's output. Take this simple PUBS database
query as an example:
SELECT DISTINCT
au_fname,
au_lname
FROM authors
In a simple SELECT from one table (like the one above) this is the easiest and quickest way of doing things.
However, with a more complex query you should consider rewriting it to
gain a performance advantage. Take this query, for example, which
returns only authors who already have a book published:
SELECT DISTINCT
au_fname,
au_lname
FROM authors a JOIN titleAuthor t
ON t.au_id = a.au_id
Here, we only want to see the unique names of authors who have written
books. The query works as required, but we can get a small performance
improvement if we write it like this:
SELECT au_fname,
au_lname
FROM authors a
WHERE EXISTS (
SELECT *
FROM titleAuthor t
WHERE t.au_id = a.au_id
)
The second example runs slightly quicker because the EXISTS clause
returns an author's name as soon as the first matching book is found.
No further books for that author are considered (we already have the
author’s name, and we only want to see it once).
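On the other hand, the join in the DISTINCT query produces one copy of
the author’s name for each book the author has worked on, and the
resulting list then has to be examined for duplicates to satisfy the
DISTINCT clause. Note that the two queries return identical results
only when no two authors share the same first and last name: EXISTS
returns one row per author, whereas DISTINCT returns one row per name.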
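To see how much duplicate work the DISTINCT version takes on, you can compare the number of rows the join produces with the number of authors the EXISTS version returns. This is only a rough sketch against the PUBS sample database. The column aliases joined_rows and authors_with_titles are made up for illustration, and the table name is written titleauthor as PUBS defines it, which matters only on a case-sensitive server.

```sql
-- Rows produced by the join before DISTINCT removes duplicates:
-- one row for every (author, title) pair.
SELECT COUNT(*) AS joined_rows
FROM authors a JOIN titleauthor t
ON t.au_id = a.au_id

-- Authors with at least one title: one row per author,
-- which is what the EXISTS version returns.
SELECT COUNT(*) AS authors_with_titles
FROM authors a
WHERE EXISTS (
SELECT *
FROM titleauthor t
WHERE t.au_id = a.au_id
)
```

The wider the gap between the two counts, the more rows the DISTINCT step has to sort or hash away, so the benefit of the rewrite should grow as authors accumulate more titles.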
You can examine the execution plan for each query to see where the
performance improvement comes from. For example, in SQL Server 6.5 you
will normally see a step involving a Worktable for the DISTINCT
version, which does not appear for the EXISTS version. In SQL Server
7.0 and 2000 you can generate a graphical execution plan for the two
queries and compare them more easily.
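If you prefer a text plan to the graphical one, something along these lines should work from Query Analyzer in SQL Server 7.0 and 2000. Treat it as a sketch rather than a recipe: SET SHOWPLAN_TEXT has to be the only statement in its batch, hence the GO separators, and while it is on the queries are compiled but not executed. The second half runs the queries for real with SET STATISTICS IO ON so that the logical reads can be compared.

```sql
-- Show the estimated plans without running the queries.
SET SHOWPLAN_TEXT ON
GO
SELECT DISTINCT au_fname, au_lname
FROM authors a JOIN titleauthor t
ON t.au_id = a.au_id
GO
SELECT au_fname, au_lname
FROM authors a
WHERE EXISTS (
SELECT *
FROM titleauthor t
WHERE t.au_id = a.au_id
)
GO
SET SHOWPLAN_TEXT OFF
GO

-- Run both queries and report the I/O each one performs.
SET STATISTICS IO ON
GO
SELECT DISTINCT au_fname, au_lname
FROM authors a JOIN titleauthor t
ON t.au_id = a.au_id
GO
SELECT au_fname, au_lname
FROM authors a
WHERE EXISTS (
SELECT *
FROM titleauthor t
WHERE t.au_id = a.au_id
)
GO
SET STATISTICS IO OFF
GO
```

On tables as small as those in PUBS the difference may be barely measurable, so the plans themselves are usually more informative than the timings.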