| The PostgreSQL 9.0 Reference Manual - Volume 3 - Server Administration Guide
by The PostgreSQL Global Development Group Paperback (6"x9"), 274 pages ISBN 9781906966072 RRP £9.95 ($14.95) Sales of this book support the PostgreSQL project! Get a printed copy>>> |
11.5.2 Handling query conflicts
The primary and standby servers are in many ways loosely connected. Actions on the primary will have an effect on the standby. As a result, there is potential for negative interactions or conflicts between them. The easiest conflict to understand is performance: if a huge data load is taking place on the primary then this will generate a similar stream of WAL records on the standby, so standby queries may contend for system resources, such as I/O.
There are also additional types of conflict that can occur with Hot Standby. These conflicts are hard conflicts in the sense that queries might need to be cancelled and, in some cases, sessions disconnected to resolve them. The user is provided with several ways to handle these conflicts. Conflict cases include:
-
Access Exclusive locks taken on the primary server, including both
explicit
LOCKcommands and various DDL actions, conflict with table accesses in standby queries. - Dropping a tablespace on the primary conflicts with standby queries using that tablespace for temporary work files.
- Dropping a database on the primary conflicts with sessions connected to that database on the standby.
- Application of a vacuum cleanup record from WAL conflicts with standby transactions whose snapshots can still “see” any of the rows to be removed.
- Application of a vacuum cleanup record from WAL conflicts with queries accessing the target page on the standby, whether or not the data to be removed is visible.
On the primary server, these cases simply result in waiting; and the user might choose to cancel either of the conflicting actions. However, on the standby there is no choice: the WAL-logged action already occurred on the primary so the standby must not fail to apply it. Furthermore, allowing WAL application to wait indefinitely may be very undesirable, because the standby's state will become increasingly far behind the primary's. Therefore, a mechanism is provided to forcibly cancel standby queries that conflict with to-be-applied WAL records.
An example of the problem situation is an administrator on the primary
server running DROP TABLE on a table that is currently being
queried on the standby server. Clearly the standby query cannot continue
if the DROP TABLE is applied on the standby. If this situation
occurred on the primary, the DROP TABLE would wait until the
other query had finished. But when DROP TABLE is run on the
primary, the primary doesn't have information about what queries are
running on the standby, so it will not wait for any such standby
queries. The WAL change records come through to the standby while the
standby query is still running, causing a conflict. The standby server
must either delay application of the WAL records (and everything after
them, too) or else cancel the conflicting query so that the DROP
TABLE can be applied.
When a conflicting query is short, it's typically desirable to allow it to
complete by delaying WAL application for a little bit; but a long delay in
WAL application is usually not desirable. So the cancel mechanism has
parameters, max_standby_archive_delay and max_standby_streaming_delay, that define the maximum
allowed delay in WAL application. Conflicting queries will be canceled
once it has taken longer than the relevant delay setting to apply any
newly-received WAL data. There are two parameters so that different delay
values can be specified for the case of reading WAL data from an archive
(i.e., initial recovery from a base backup or “catching up” a
standby server that has fallen far behind) versus reading WAL data via
streaming replication.
In a standby server that exists primarily for high availability, it's best to set the delay parameters relatively short, so that the server cannot fall far behind the primary due to delays caused by standby queries. However, if the standby server is meant for executing long-running queries, then a high or even infinite delay value may be preferable. Keep in mind however that a long-running query could cause other sessions on the standby server to not see recent changes on the primary, if it delays application of WAL records.
The most common reason for conflict between standby queries and WAL replay is “early cleanup”. Normally, PostgreSQL allows cleanup of old row versions when there are no transactions that need to see them to ensure correct visibility of data according to MVCC rules. However, this rule can only be applied for transactions executing on the master. So it is possible that cleanup on the master will remove row versions that are still visible to a transaction on the standby.
Experienced users should note that both row version cleanup and row version
freezing will potentially conflict with standby queries. Running a manual
VACUUM FREEZE is likely to cause conflicts even on tables with
no updated or deleted rows.
Once the delay specified by max_standby_archive_delay or
max_standby_streaming_delay has been exceeded, conflicting
queries will be cancelled. This usually results just in a cancellation
error, although in the case of replaying a DROP DATABASE
the entire conflicting session will be terminated. Also, if the conflict
is over a lock held by an idle transaction, the conflicting session is
terminated (this behavior might change in the future).
Cancelled queries may be retried immediately (after beginning a new transaction, of course). Since query cancellation depends on the nature of the WAL records being replayed, a query that was cancelled may well succeed if it is executed again.
Keep in mind that the delay parameters are compared to the elapsed time since the WAL data was received by the standby server. Thus, the grace period allowed to any one query on the standby is never more than the delay parameter, and could be considerably less if the standby has already fallen behind as a result of waiting for previous queries to complete, or as a result of being unable to keep up with a heavy update load.
Users should be clear that tables that are regularly and heavily updated
on the primary server will quickly cause cancellation of longer running
queries on the standby. In such cases the setting of a finite value for
max_standby_archive_delay or
max_standby_streaming_delay can be considered similar to
setting statement_timeout.
Remedial possibilities exist if the number of standby-query cancellations
is found to be unacceptable. The first option is to connect to the
primary server and keep a query active for as long as needed to
run queries on the standby. This prevents VACUUM from removing
recently-dead rows and so cleanup conflicts do not occur.
This could be done using ‘contrib/dblink’ and
pg_sleep(), or via other mechanisms. If you do this, you
should note that this will delay cleanup of dead rows on the primary,
which may result in undesirable table bloat. However, the cleanup
situation will be no worse than if the standby queries were running
directly on the primary server, and you are still getting the benefit of
off-loading execution onto the standby.
max_standby_archive_delay must be kept large in this case,
because delayed WAL files might already contain entries that conflict with
the desired standby queries.
Another option is to increase vacuum_defer_cleanup_age
on the primary server, so that dead rows will not be cleaned up as quickly
as they normally would be. This will allow more time for queries to
execute before they are cancelled on the standby, without having to set
a high max_standby_streaming_delay. However it is
difficult to guarantee any specific execution-time window with this
approach, since vacuum_defer_cleanup_age is measured in
transactions executed on the primary server.
| ISBN 9781906966072 | The PostgreSQL 9.0 Reference Manual - Volume 3 - Server Administration Guide | See the print edition |