The ANTI JOIN – all values from table1 where not in table2

One of the less intuitive concepts I come across regularly in SQL is that of the ANTI JOIN. That is, requesting data from a table where some value is not in another table.

The wrong way of doing it is this:

SELECT
*
FROM table1 t1
WHERE t1.id NOT IN (SELECT id FROM table2)

This is very slow and inefficient, because the RDBMS has to compare each row with an entire table of values. The following is much much faster, as it matches up each row once, then simply eliminates the results that are not in the second table:

SELECT
*
FROM table1 t1
LEFT JOIN table2 t2 ON t1.id = t2.id
WHERE t2.id IS NULL

This is what is known as an ANTI JOIN.. And I really wish it were part of standard SQL.. Be careful, though,if table2.id has any NULL rows, you will experience unexpected results. Here’s my proposal, and the following should throw a fatal error if table2.id is not a NOT NULL column:

SELECT
*
FROM table1 t1
ANTI JOIN table2 t2 ON t1.id = t2.id

Too bad I don’t make the rules.

  1. Another way to do anti-join:
    SELECT * FROM table1 t1 WHERE NOT EXISTS (SELECT 1 FROM table2 t2 WHERE t1.id = t2.id)

    Both this way and your LEFT JOIN … WHERE t2.id IS NULL give an efficient “Hash Anti Join” plan in PostgreSQL. The WHERE NOT IN query gives an efficient “hashed SubPlan 1″ plan in PostgreSQL. Comparing each row to “an entire table of values” is not slow after you’ve done a scan to hash the keys of table2. You may be thinking of a nested loop join, which is a possible plan if your database optimizes poorly, or if the join condition is complicated, or if table2 has lots of rows and an index on its id column (in the last case it will not be slow).

    The WHERE NOT EXISTS variant doesn’t suffer from your “unexpected results if table2.id has NULL values” scenario.

    => CREATE TABLE t1 AS SELECT generate_series(1, 10000) AS id;
    => CREATE TABLE t2 AS SELECT generate_series(1, 10000) AS id;
    => explain select id from t1 WHERE not exists (select id from t2 where t2.id = t1.id);
    QUERY PLAN
    ——————————————————————-
    Hash Anti Join (cost=261.68..896.51 rows=1 width=8)
    Hash Cond: (t1.id = t2.id)
    -> Seq Scan on t1 (cost=0.00..141.30 rows=9630 width=8)
    -> Hash (cost=141.30..141.30 rows=9630 width=8)
    -> Seq Scan on t2 (cost=0.00..141.30 rows=9630 width=8)

    => EXPLAIN SELECT t1.id FROM t1 LEFT JOIN t2 ON t1.id = t2.id WHERE t2.id IS NULL;
    QUERY PLAN
    ——————————————————————–
    Hash Anti Join (cost=270.00..565.00 rows=1 width=8)
    Hash Cond: (t1.id = t2.id)
    -> Seq Scan on t1 (cost=0.00..145.00 rows=10000 width=8)
    -> Hash (cost=145.00..145.00 rows=10000 width=8)
    -> Seq Scan on t2 (cost=0.00..145.00 rows=10000 width=8)

    => explain select id from t1 WHERE id not in (select id from t2);
    QUERY PLAN
    —————————————————————-
    Seq Scan on t1 (cost=170.00..340.00 rows=5000 width=8)
    Filter: (NOT (hashed SubPlan 1))
    SubPlan 1
    -> Seq Scan on t2 (cost=0.00..145.00 rows=10000 width=8)

  2. Looks like there is a similar alternative in MySQL:
    http://dev.mysql.com/doc/refman/5.0/en/exists-and-not-exists-subqueries.html

  1. No trackbacks yet.