Episode Details
Back to Episodes
Why PostgreSQL CDC Breaks in Production
Description
This story was originally published on HackerNoon at: https://hackernoon.com/why-postgresql-cdc-breaks-in-production.
PostgreSQL CDC fails less because of WAL and more because of snapshot gaps, unsafe checkpoints, retries, ordering, and restart behavior.
Check more stories related to programming at: https://hackernoon.com/c/programming.
You can also check exclusive content about #postgresql, #cdc, #database, #data-engineering, #replication, #backend-development, #devops, #good-company, and more.
This story was written by: @slotix_i2xic. Learn more about this writer by checking @slotix_i2xic's about page,
and for more stories, please visit hackernoon.com.
PostgreSQL CDC usually does not fail because WAL is unreliable.
The dangerous failures happen after WAL has already done its job.
A CDC pipeline can read the right change, then still lose it during handoff, retry, checkpointing, or restart.
The most common failure modes are:
- Initial load finishes, but CDC starts from the wrong WAL position.
- Checkpoints advance before the target write is durable.
- A crash turns “already read” into “already delivered”.
- Retries replay events into targets that are not idempotent.
- Parallel workers apply related changes out of order.
- Long transactions release a large batch only after COMMIT.
- Schema changes arrive as data changes, but the target schema is not ready.
The hard part is not reading WAL.
The hard part is proving that every committed source change reached the target in a recoverable order.