Know your transactions and know them well


In the context of databases, transactions are usually thought of as a mechanism for ensuring that different batches of work can be executed in parallel, yet produce the same result as if they had been executed in series. This is however only the 10,000 foot overview, and as always, the devil is in the details.

First I would like to point to two great resources: the PostgreSQL documentation explains the four standard transaction isolation levels, also giving counterexamples (it is very important to know the limitations of a technology). If you prefer a slightly longer discussion and/or like podcasts, listen to episode 99 of Software Engineering Radio, which deals with transactions.

Now to give a concrete example (with PostgreSQL / PL/pgSQL): suppose that you wish to implement an m-to-n structure in the database. You would then typically use three tables:

----------      --------
| Master |      | Link |      ---------
----------      --------      | Child |
| mId    | <--> | mId  |      ---------
| ...... |      | cId  | <--> | cId   |
----------      --------      | ..... |
                              ---------
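
For concreteness, the diagram could correspond to DDL along these lines (a sketch only – someAttribute stands in for whatever naturally identifies a Child row; the unique constraint on it is what the rest of the discussion relies on):

CREATE TABLE Master (
  mId serial PRIMARY KEY
  -- ... further Master attributes ...
);

CREATE TABLE Child (
  cId           serial PRIMARY KEY,
  someAttribute text   NOT NULL UNIQUE
  -- ... further Child attributes ...
);

CREATE TABLE Link (
  mId integer NOT NULL REFERENCES Master (mId),
  cId integer NOT NULL REFERENCES Child (cId),
  PRIMARY KEY (mId, cId)
);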

When doing an insert, a naive approach would be:

SELECT cId INTO childId FROM Child WHERE someAttribute = someValue;
IF NOT FOUND THEN
  -- supposing that cId is a serial
  INSERT INTO Child (someAttribute) VALUES (someValue);
  SELECT cId INTO childId FROM Child WHERE someAttribute = someValue;
END IF;

And this works fine and dandy in development and test; however in production, when it gets some serious parallel pounding, things will start to fail with unique constraint violations (supposing that you have set up the tables correctly and put a unique constraint on someAttribute – otherwise you have data duplication, the very thing you wish to avoid with this structure).

What happened? The default transaction isolation level in PostgreSQL is read committed. So the sequence of events was: two transactions were trying to insert the same value. The first one committed immediately after the second one did its select. The second one was thus under the impression that the given value did not exist in the database, while it actually did exist and was already visible to the second transaction's insert (because it was committed) – hence the unique constraint violation.
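
To make the race concrete, here is one possible interleaving of two sessions running the naive code (a sketch – the exact timing is illustrative):

-- Session A                          -- Session B
SELECT cId ...;   -- no row found
                                      SELECT cId ...;   -- no row found
INSERT ...;       -- succeeds
COMMIT;           -- row is now visible
                                      INSERT ...;       -- ERROR: duplicate key value
                                                        -- violates unique constraint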

How can you guard against this? The first-instinct solution would be to raise the transaction isolation level to serializable; this works, but it has a big performance impact (and your code must be prepared to retry transactions aborted with serialization failures). The second solution is to catch the unique constraint violation exception and conclude that it doesn’t matter who inserted the value – it already exists, thus we can use it:

SELECT cId INTO childId FROM Child WHERE someAttribute = someValue;
IF NOT FOUND THEN
  -- supposing that cId is a serial
  BEGIN
    INSERT INTO Child (someAttribute) VALUES (someValue);
  EXCEPTION WHEN unique_violation THEN
    -- someone else already inserted it, but we're happy with it
    NULL;
  END;
  SELECT cId INTO childId FROM Child WHERE someAttribute = someValue;
END IF;
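
For completeness, here is how the pattern might be packaged as a full PL/pgSQL function (a sketch only – the function name get_or_create_child is my invention; the table and column names are those of the example above):

CREATE FUNCTION get_or_create_child(someValue text) RETURNS integer AS $$
DECLARE
  childId integer;
BEGIN
  SELECT cId INTO childId FROM Child WHERE someAttribute = someValue;
  IF NOT FOUND THEN
    BEGIN
      INSERT INTO Child (someAttribute) VALUES (someValue);
    EXCEPTION WHEN unique_violation THEN
      -- a concurrent transaction inserted it first; that is fine
      NULL;
    END;
    SELECT cId INTO childId FROM Child WHERE someAttribute = someValue;
  END IF;
  RETURN childId;
END;
$$ LANGUAGE plpgsql;

A call such as SELECT get_or_create_child('some value'); then returns the cId of the existing or freshly inserted row.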

One final remark: this code is optimized for the case when the value already exists most of the time. If the opposite is true (the value needs to be inserted in the majority of cases), you could use the following slight variation, which does not check before it tries to insert:

BEGIN
  INSERT INTO Child (someAttribute) VALUES (someValue);
EXCEPTION WHEN unique_violation THEN
  -- already exists, do nothing
  NULL;
END;
SELECT cId INTO childId FROM Child WHERE someAttribute = someValue;
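
As an aside: on PostgreSQL 9.5 and later, the insert-or-ignore part can be expressed declaratively with INSERT ... ON CONFLICT, avoiding the exception handler altogether (same illustrative names as above):

INSERT INTO Child (someAttribute) VALUES (someValue)
ON CONFLICT (someAttribute) DO NOTHING;
SELECT cId INTO childId FROM Child WHERE someAttribute = someValue;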

