10 Things I Hate About PostgreSQL

Over the last few years, the software development community’s love affair with the popular open-source relational database has reached a bit of a fever pitch. This Hacker News thread covering a piece titled “PostgreSQL is the worlds’ best database”, bursting at the seams with fawning sycophants lavishing unconditional praise, is a perfect example of this phenomenon.

#1: Disastrous XID Wraparound

Read more here. Suffice it to say, this one can bite hard. There are many stories of multi-day outages caused by this issue. Go ahead, Google it and you’ll find numerous poor souls writing about the time they stepped on this landmine. Pretty much any non-trivial PostgreSQL install that isn’t staffed with a top expert will run into it eventually.
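To make the failure mode concrete, here is a minimal Python sketch of the circular comparison behind it. Transaction IDs are 32-bit counters compared modulo 2^32, so only about 2 billion IDs on either side of the current one can be told apart as "past" or "future". The function name is made up for illustration, and the sketch ignores PostgreSQL's reserved special XIDs and the freezing that VACUUM performs to prevent the problem.

```python
XID_SPACE = 2**32   # transaction IDs are 32-bit counters
HALF_SPACE = 2**31  # roughly 2 billion IDs count as "the past"


def xid_precedes(a: int, b: int) -> bool:
    """Return True if XID a is logically older than XID b (modulo 2^32)."""
    diff = (a - b) % XID_SPACE
    # A difference in the upper half of the space is a negative
    # value when read as a signed 32-bit integer, i.e. "a is older".
    return diff >= HALF_SPACE


row_xid = 100  # XID that inserted some old, never-frozen row

# Shortly afterwards, the row is correctly seen as being in the past.
near_future = (row_xid + 1_000) % XID_SPACE
seen_as_past_early = xid_precedes(row_xid, near_future)

# After more than 2^31 further transactions the comparison flips:
# the old row now looks like it comes from the future.
far_future = (row_xid + HALF_SPACE + 1) % XID_SPACE
seen_as_past_late = xid_precedes(row_xid, far_future)

print(seen_as_past_early, seen_as_past_late)
```

The second comparison is the landmine: unless old rows are frozen in time, data that was committed long ago would suddenly stop being visible.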

#2: Failover Will Probably Lose Data

The run-of-the-mill streaming replication setup will almost certainly lose committed data if the active master suddenly fails. “Such is the price of asynchronous replication,” some might say, but it doesn’t have to be this way. PostgreSQL supports synchronous replication with quorum commit for fault-tolerant durability, but it has a much tighter performance envelope that complicates its application.
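The trade-off can be sketched in a few lines of Python. This is a toy model, not PostgreSQL code: the helper names, standby names, and latencies are all hypothetical. In PostgreSQL itself, quorum commit is configured through the `synchronous_standby_names` setting using its `ANY n (...)` form.

```python
def commit_is_durable(acks: set, standbys: set, quorum: int) -> bool:
    """A commit is reported to the client only once `quorum` standbys confirm it."""
    return len(acks & standbys) >= quorum


def quorum_commit_latency_ms(standby_latencies_ms: list, quorum: int) -> float:
    """Extra commit latency is the time until the quorum-th fastest standby answers."""
    if quorum == 0:
        return 0.0  # asynchronous: the primary does not wait for anyone
    return sorted(standby_latencies_ms)[quorum - 1]


standbys = {"replica_a", "replica_b", "replica_c"}  # hypothetical names
latencies_ms = [2.0, 15.0, 40.0]                    # hypothetical round trips

# Asynchronous replication behaves like quorum 0: the commit "succeeds"
# even though no standby has it yet, so a primary crash can lose it.
async_reported = commit_is_durable(set(), standbys, quorum=0)

# With a quorum of 2 out of 3, one acknowledgement is not enough.
one_ack = commit_is_durable({"replica_a"}, standbys, quorum=2)
two_acks = commit_is_durable({"replica_a", "replica_c"}, standbys, quorum=2)

print(async_reported, one_ack, two_acks)
print(quorum_commit_latency_ms(latencies_ms, quorum=2))
```

Every commit now pays for the second-fastest standby's round trip, which is where the tighter performance envelope comes from.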

#3: Inefficient Replication That Spreads Corruption

Streaming Replication is by far the most utilized replication mechanism in production deployments. It is a form of physical replication, meaning that it replicates changes in the on-disk binary data itself.

#4: MVCC Garbage Frequently Painful

Like most mainstream databases, PostgreSQL uses multi-version concurrency control (MVCC) to implement concurrent transactions. However, its particular implementation often introduces operational pain around garbage row versions and their cleanup (VACUUM). Generally speaking, UPDATE operations create new copies (or “row versions”) of any modified rows, leaving the old versions on disk until they can be cleaned up.
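A toy model helps show where the garbage comes from. The sketch below is a deliberate simplification with made-up class and method names, tracking a single logical row; it is not how PostgreSQL lays out tuples. Each UPDATE appends a new version and marks the previous one dead, and a vacuum pass may only remove dead versions that no still-running transaction could need.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class RowVersion:
    value: str
    xmin: int                   # transaction that created this version
    xmax: Optional[int] = None  # transaction that superseded it (None = live)


@dataclass
class SingleRowTable:
    versions: list = field(default_factory=list)

    def insert(self, value: str, xid: int) -> None:
        self.versions.append(RowVersion(value, xmin=xid))

    def update(self, new_value: str, xid: int) -> None:
        # The old version is not overwritten; it is only marked as superseded.
        for version in self.versions:
            if version.xmax is None:
                version.xmax = xid
        self.versions.append(RowVersion(new_value, xmin=xid))

    def vacuum(self, oldest_active_xid: int) -> int:
        """Remove dead versions no active transaction can still see."""
        before = len(self.versions)
        self.versions = [
            v for v in self.versions
            if v.xmax is None or v.xmax >= oldest_active_xid
        ]
        return before - len(self.versions)


table = SingleRowTable()
table.insert("v1", xid=1)
for xid, value in [(2, "v2"), (3, "v3"), (4, "v4")]:
    table.update(value, xid=xid)

versions_before = len(table.versions)                  # one live, three dead
removed_while_blocked = table.vacuum(oldest_active_xid=3)
removed_after = table.vacuum(oldest_active_xid=5)
print(versions_before, removed_while_blocked, removed_after, len(table.versions))
```

The first vacuum call models a long-running transaction holding cleanup back: most of the dead versions have to stay on disk until it finishes.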

#5: Process-Per-Connection = Pain at Scale

PostgreSQL forks a process for every connection, whereas most other databases use a more efficient connection concurrency model. This makes for a difficult tuning problem, as there is a relatively low threshold at which adding more connections degrades performance (around 2x the number of cores) and eventually another, higher threshold (hard to estimate, highly workload dependent) where performance will plummet.
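One common way to stay under the first threshold is to cap concurrent connections on the client side or in a pooler. The Python sketch below illustrates the idea only; the `ConnectionLimiter` class and `suggested_max_connections` helper are hypothetical, and the factor of 2 is just the rule of thumb quoted above, not a tuned value.

```python
import os
import threading
from contextlib import contextmanager


def suggested_max_connections(cores: int, factor: int = 2) -> int:
    """Rule-of-thumb cap: roughly `factor` connections per CPU core."""
    return max(1, cores * factor)


class ConnectionLimiter:
    """Admits at most `limit` concurrent users of the database."""

    def __init__(self, limit: int) -> None:
        self.limit = limit
        self._slots = threading.BoundedSemaphore(limit)

    @contextmanager
    def connection(self, timeout=None):
        # Wait for a free slot instead of opening yet another backend process.
        if not self._slots.acquire(timeout=timeout):
            raise TimeoutError("no free connection slot")
        try:
            yield  # a real pool would hand out a connection object here
        finally:
            self._slots.release()


# Assumes the database server has as many cores as this machine.
cores = os.cpu_count() or 1
limiter = ConnectionLimiter(suggested_max_connections(cores))
print(f"{cores} cores -> cap of {limiter.limit} connections")
```

Callers would wrap each unit of database work in `with limiter.connection(timeout=1.0): ...`, so excess requests queue briefly rather than piling more processes onto the server.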

#6: Primary Key Index is a Space Hog

Tables in PostgreSQL have an index for the primary key and separate row storage called the heap. Other databases integrate these together or support “index-organized tables”. In this arrangement, the primary key lookup process leads directly to the row data without a secondary fetch to get the full row and the requisite additional CPU and I/O utilization.

object_type INTEGER NOT NULL,
PRIMARY KEY(object_type, object_id, user_id)

#7: Major Version Upgrades Can Require Downtime

Some major version upgrades require hours of downtime to convert the data for a large database. Using the typical streaming replication mechanism, it is not possible to do this gracefully by upgrading a replica and doing a failover. The on-disk binary format is incompatible across major versions, thus the wire protocol between master and replica is effectively also incompatible.

#8: Somewhat Cumbersome Replication Setup

To be fair, MySQL’s out-of-the-box replication is much more cumbersome. But compared to some NoSQL stores like MongoDB and Redis, or some cluster-oriented replication systems like MySQL Group Replication and Galera Cluster, setting up replication in PostgreSQL leaves a lot to be desired from an ease-of-use and sharp-edge-avoidance perspective. While logical replication theoretically provides far more flexibility for third-party solutions to paper over these gaps, so far there are some pretty big caveats for using it in place of streaming replication.

#9: Ridiculous No-Planner-Hints Dogma

Planner hints allow queries to direct the query planner to use strategies it wouldn’t otherwise use on its own. In what seems to be a form of the sufficiently smart compiler argument, the PostgreSQL development team has held the line for years in refusing to support query planner hints.

#10: No Block Compression

InnoDB’s Page Compression in MySQL commonly cuts storage footprint in half and is pretty much “free” from a performance perspective. PostgreSQL will automatically compress large values but this isn’t useful for the most common ways data is stored in relational databases. For most RDBMS use cases, a row is typically a few hundred bytes or less, which means compression could only really be effective when applied across multiple rows, or in blocks.
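The difference between per-value and block compression is easy to demonstrate. The Python sketch below uses synthetic rows and `zlib` as a stand-in codec; both are assumptions for illustration and say nothing about the algorithms a real storage engine uses. It compresses the same rows once individually and once as a single block.

```python
import zlib

# Synthetic short rows that share structure, as rows in one table typically do.
rows = [
    f"{i},user{i % 50}@example.com,active,2020-01-{i % 28 + 1:02d},plan=basic".encode()
    for i in range(200)
]

raw_bytes = sum(len(row) for row in rows)

# Per-row compression: every row pays the codec's fixed overhead and
# cannot exploit the repetition between neighbouring rows.
per_row_bytes = sum(len(zlib.compress(row)) for row in rows)

# Block compression: all rows are compressed together as one unit.
block_bytes = len(zlib.compress(b"\n".join(rows)))

print(f"raw: {raw_bytes}  per-row: {per_row_bytes}  block: {block_bytes}")
```

On data like this, the per-row total stays close to the raw size, while the block shrinks substantially because the redundancy lives across rows rather than inside any single one.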

All That Said…

You should probably still use PostgreSQL and not something else for storing data that you’d ideally like to, you know, keep around for some time. In general I’d recommend starting with PostgreSQL and then trying to figure out why it won’t work for your use case.

I do Software Engineering on High-Impact, Large-Scale Internet Services.