Description
Is there an existing issue already for this bug?
- I have searched for an existing issue, and could not find anything. I believe this is a new bug.
I have read the troubleshooting guide
- I have read the troubleshooting guide and I think this is a new bug.
I am running a supported version of CloudNativePG
- I have read the troubleshooting guide and I think this is a new bug.
Contact Details
michael.waelischmiller@4screen.com
Version
1.26 (latest patch)
What version of Kubernetes are you using?
1.31
What is your Kubernetes environment?
Cloud: Azure AKS
How did you install the operator?
Helm
What happened?
We use a preStop hook to gracefully shut down the pooler instances:

lifecycle:
  preStop:
    exec:
      command:
        - "/bin/sh"
        - "-c"
        - "psql -c 'SHUTDOWN WAIT_FOR_CLIENTS' && sleep 360;"

We also set terminationGracePeriodSeconds: 360 for the pods.
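For context, this is roughly how the hook and the grace period are wired into the Pooler resource. The manifest below is a trimmed sketch rather than our exact resource: the pooler name, cluster reference, instance count, type and pool mode are placeholders, and it assumes the pod customization goes through the Pooler's spec.template with the container named pgbouncer.

apiVersion: postgresql.cnpg.io/v1
kind: Pooler
metadata:
  name: ***-pooler
  namespace: postgres
spec:
  cluster:
    name: ***
  instances: 2        # placeholder
  type: rw            # placeholder
  pgbouncer:
    poolMode: session # placeholder
  template:
    spec:
      # give the pod enough time to cover the preStop drain window
      terminationGracePeriodSeconds: 360
      containers:
        - name: pgbouncer
          lifecycle:
            preStop:
              exec:
                command:
                  - "/bin/sh"
                  - "-c"
                  - "psql -c 'SHUTDOWN WAIT_FOR_CLIENTS' && sleep 360;"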
This worked well with CloudNativePG operator 1.25 and Kubernetes 1.29.2: the container waited and terminated gracefully, without any disruption to the clients (our clients are configured with a maxConnectionLifetime of 300s).
This process is now broken. The containers shut down without waiting, causing I/O errors on the clients:
Caused by: org.postgresql.util.PSQLException: An I/O error occurred while sending to the backend.
We can see a FailedPreStopHook event appearing for the pooler pods.
We do not yet understand why the hook fails, but we assume this is what causes the pods to terminate immediately, since the Kubernetes docs state: "If either a PostStart or PreStop hook fails, it kills the Container."
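To narrow down why the hook fails, the event message can be inspected and the hook's psql call exercised by hand inside the pooler container. The commands below are only a sketch: <pooler-pod> is a placeholder, and they assume the pooler runs in the postgres namespace with a container named pgbouncer.

# Show the FailedPreStopHook event together with its message
kubectl -n postgres describe pod <pooler-pod> | grep -B 2 -A 3 FailedPreStopHook

# Check that psql can still reach the pgbouncer admin console from inside the container
kubectl -n postgres exec <pooler-pod> -c pgbouncer -- /bin/sh -c "psql -c 'SHOW VERSION'"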
Here are the logs from one of the bouncer instances:
There is already a discussion about this on the pgbouncer project:
pgbouncer/pgbouncer#1361
One of the findings is that the operator uses SIGINT instead of SIGTERM to stop the pooler, which does not shut pgbouncer down gracefully:
pgbouncer/pgbouncer#1361 (comment)
https://www.pgbouncer.org/usage.html#:~:text=other%20SQL%20command.)-,Signals,-SIGHUP
I would suggest changing the operator's behaviour to send SIGTERM instead.
In the meantime, we would appreciate any help or advice on how we can achieve a graceful shutdown.
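One interim idea, purely as a sketch: have the preStop hook send SIGTERM to the pgbouncer process directly instead of going through psql, since the linked Signals documentation describes SIGTERM as the shutdown mode that waits for clients to disconnect. This assumes the pooler image ships pidof (or another way to find the pgbouncer PID), that the hook is allowed to signal the process, and it would still need to be verified against the CNPG instance manager running in the same container.

lifecycle:
  preStop:
    exec:
      command:
        - "/bin/sh"
        - "-c"
        # SIGTERM lets pgbouncer wait for clients to disconnect;
        # the sleep keeps the hook alive for the drain window
        - "kill -TERM $(pidof pgbouncer) && sleep 360"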
Cluster resource
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: ***
  creationTimestamp: "2025-08-06T10:01:51Z"
  generation: 3
  labels:
    argocd.argoproj.io/instance: cloudnative-pg
  name: ***
  namespace: postgres
  resourceVersion: "961222642"
  uid: 58376f91-1e75-4465-a8e6-05c5d8935287
spec:
  affinity:
    podAntiAffinityType: preferred
  backup:
    barmanObjectStore:
      azureCredentials:
        connectionString:
          key: ***
          name: postgres-backups-azure
      destinationPath: ***
    retentionPolicy: 7d
    target: prefer-standby
  bootstrap:
    initdb:
      database: app
      encoding: UTF8
      localeCType: C
      localeCollate: C
      owner: app
      secret:
        name: ***-es
  enablePDB: false
  enableSuperuserAccess: true
  failoverDelay: 0
  imageName: ghcr.io/cloudnative-pg/postgis:17-3.5
  inheritedMetadata:
    annotations:
      config.linkerd.io/proxy-admin-shutdown: enabled
      config.linkerd.io/shutdown-grace-period: 2460s
  instances: 2
  logLevel: info
  managed:
    services:
      disabledDefaultServices:
        - r
  maxSyncReplicas: 0
  minSyncReplicas: 0
  monitoring:
    customQueriesConfigMap:
      - key: queries
        name: cnpg-default-monitoring
      - key: pg_stat_statements_metrics
        name: ***-monitoring
    disableDefaultQueries: false
    enablePodMonitor: true
  postgresGID: 26
  postgresUID: 26
  postgresql:
    parameters:
      archive_mode: "on"
      archive_timeout: 5min
      dynamic_shared_memory_type: posix
      enable_group_by_reordering: "off"
      full_page_writes: "on"
      jit: "off"
      log_destination: csvlog
      log_directory: /controller/log
      log_filename: postgres
      log_min_duration_statement: "100"
      log_rotation_age: "0"
      log_rotation_size: "0"
      log_truncate_on_rotation: "false"
      logging_collector: "on"
      max_parallel_workers: "32"
      max_replication_slots: "32"
      max_worker_processes: "32"
      pg_stat_statements.max: "10000"
      pg_stat_statements.track: all
      shared_memory_type: mmap
      shared_preload_libraries: ""
      ssl_max_protocol_version: TLSv1.3
      ssl_min_protocol_version: TLSv1.3
      vacuum_buffer_usage_limit: "256"
      wal_keep_size: 512MB
      wal_level: logical
      wal_log_hints: "on"
      wal_receiver_timeout: 5s
      wal_sender_timeout: 5s
    syncReplicaElectionConstraint:
      enabled: false
  primaryUpdateMethod: switchover
  primaryUpdateStrategy: unsupervised
  replicationSlots:
    highAvailability:
      enabled: true
      slotPrefix: _cnpg_
    synchronizeReplicas:
      enabled: true
    updateInterval: 30
  resources:
    limits:
      memory: 1Gi
    requests:
      cpu: 600m
      memory: 1Gi
  smartShutdownTimeout: 2100
  startDelay: 3600
  stopDelay: 2400
  storage:
    resizeInUseVolumes: true
    size: 5Gi
  switchoverDelay: 3600
  walStorage:
    resizeInUseVolumes: true
    size: 5Gi
status:
  availableArchitectures:
    - goArch: amd64
      hash: e1cddc583c4841ff68a4224fcb0b6302a26f0aff6cd6fa10be7f1937af3ab6b1
    - goArch: arm64
      hash: 7706e7a8d054a613b0016238290bdb57a3dd974e16b54234acbb7d222823f312
  certificates:
    clientCASecret: ***-ca
    expirations:
      ***-ca: 2025-11-04 09:56:51 +0000 UTC
      ***-replication: 2025-11-04 09:56:51 +0000 UTC
      ***-server: 2025-11-04 09:56:51 +0000 UTC
    replicationTLSSecret: ***-replication
    serverAltDNSNames:
      - ***
      - ***
      - ***
      - ***
      - ***
      - ***
      - ***
      - ***
    serverCASecret: ***-ca
    serverTLSSecret: ***-server
  cloudNativePGCommitHash: 252497fc9
  cloudNativePGOperatorHash: e1cddc583c4841ff68a4224fcb0b6302a26f0aff6cd6fa10be7f1937af3ab6b1
  conditions:
    - lastTransitionTime: "2025-09-01T09:36:19Z"
      message: A single, unique system ID was found across reporting instances.
      reason: Unique
      status: "True"
      type: ConsistentSystemID
    - lastTransitionTime: "2025-09-01T09:37:08Z"
      message: Cluster is Ready
      reason: ClusterIsReady
      status: "True"
      type: Ready
    - lastTransitionTime: "2025-09-01T09:31:49Z"
      message: Continuous archiving is working
      reason: ContinuousArchivingSuccess
      status: "True"
      type: ContinuousArchiving
    - lastTransitionTime: "2025-09-01T00:00:11Z"
      message: Backup was successful
      reason: LastBackupSucceeded
      status: "True"
      type: LastBackupSucceeded
  configMapResourceVersion:
    metrics:
      cnpg-default-monitoring: "871162125"
      ***-monitoring: "955605657"
  currentPrimary: ***-1
  currentPrimaryTimestamp: "2025-09-01T09:31:47.322259Z"
  firstRecoverabilityPoint: "2025-08-25T00:00:05Z"
  firstRecoverabilityPointByMethod:
    barmanObjectStore: "2025-08-25T00:00:05Z"
  healthyPVC:
    - ***-1
    - ***-1-wal
    - ***-2
    - ***-2-wal
  image: ghcr.io/cloudnative-pg/postgis:17-3.5
  instanceNames:
    - ***-1
    - ***-2
  instances: 2
  instancesReportedState:
    ***-1:
      ip: 10.244.224.6
      isPrimary: true
      timeLineID: 11
    ***-2:
      ip: 10.244.224.9
      isPrimary: false
      timeLineID: 11
  instancesStatus:
    healthy:
      - ***-1
      - ***-2
  lastSuccessfulBackup: "2025-09-01T00:00:08Z"
  lastSuccessfulBackupByMethod:
    barmanObjectStore: "2025-09-01T00:00:08Z"
  latestGeneratedNode: 2
  managedRolesStatus: {}
  onlineUpdateEnabled: true
  pgDataImageInfo:
    image: ghcr.io/cloudnative-pg/postgis:17-3.5
    majorVersion: 17
  phase: Cluster in healthy state
  poolerIntegrations:
    pgBouncerIntegration:
      secrets:
        - ***-pooler
  pvcCount: 4
  readService: ***-r
  readyInstances: 2
  secretsResourceVersion: ***
  switchReplicaClusterStatus: {}
  systemID: "7535410717556084761"
  targetPrimary: ***-1
  targetPrimaryTimestamp: "2025-09-01T09:31:38.311471Z"
  timelineID: 11
  topology:
    instances:
      ***-1: {}
      ***-2: {}
    nodesUsed: 1
    successfullyExtracted: true
  writeService: ***-rw
Relevant log output
Code of Conduct
- I agree to follow this project's Code of Conduct