[Bug]: PgBouncer not shutdown gracefully #8505

@michael4screen

Is there an existing issue already for this bug?

  • I have searched for an existing issue, and could not find anything. I believe this is a new bug.

I have read the troubleshooting guide

  • I have read the troubleshooting guide and I think this is a new bug.

I am running a supported version of CloudNativePG

  • I have read the troubleshooting guide and I think this is a new bug.

Contact Details

michael.waelischmiller@4screen.com

Version

1.26 (latest patch)

What version of Kubernetes are you using?

1.31

What is your Kubernetes environment?

Cloud: Azure AKS

How did you install the operator?

Helm

What happened?

We use a preStop hook to shut down pooler instances gracefully:

 lifecycle:
   preStop:
     exec:
       command:
         - "/bin/sh"
         - "-c"
         - "psql -c 'SHUTDOWN WAIT_FOR_CLIENTS' && sleep 360;"

We also set terminationGracePeriodSeconds: 360 for the pods.
This worked well with CloudNativePG operator version 1.25 and Kubernetes version 1.29.2.
The container waited and terminated gracefully without any disruption to the clients (our clients are configured with a maxConnectionLifetime of 300s).
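
For context, here is a minimal sketch of how these two settings fit together in a Pooler resource, assuming they are set through spec.template. The resource name, cluster name, instance count, type, and pool mode are placeholders and not our actual values; only the hook and the grace period are taken from our setup.

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Pooler
metadata:
  name: pooler-example-rw        # placeholder name
  namespace: postgres
spec:
  cluster:
    name: cluster-example        # placeholder: name of the Cluster resource
  instances: 2                   # placeholder
  type: rw                       # placeholder
  pgbouncer:
    poolMode: session            # placeholder
  template:
    spec:
      # Must be at least as long as the preStop hook is expected to run
      terminationGracePeriodSeconds: 360
      containers:
        - name: pgbouncer
          lifecycle:
            preStop:
              exec:
                command:
                  - "/bin/sh"
                  - "-c"
                  # Stop accepting new clients, then keep the container alive
                  # so that existing client connections can drain
                  - "psql -c 'SHUTDOWN WAIT_FOR_CLIENTS' && sleep 360;"
```

The sleep is slightly longer than the clients' maxConnectionLifetime of 300s, so every client connection should have been recycled before the container exits.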

This no longer works. The containers now shut down without waiting, causing I/O errors on the clients:

Caused by: org.postgresql.util.PSQLException: An I/O error occurred while sending to the backend.

We can see a FailedPreStopHook event for the pooler pods.
We have not yet understood why the hook fails, but presumably this is why the pods terminate immediately,
since the Kubernetes docs state: If either a PostStart or PreStop hook fails, it kills the Container.
Here are the logs from one of the bouncer instances:

bouncer_log (1).log
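
Regarding the FailedPreStopHook event: because the hook chains the two commands with &&, a non-zero exit status from psql skips the sleep and makes the whole hook fail. The variant below is an untested sketch, assuming the failure comes from the psql exit status (which we have not confirmed). It uses ; so that the sleep runs regardless and the hook exits with the status of sleep.

```yaml
lifecycle:
  preStop:
    exec:
      command:
        - "/bin/sh"
        - "-c"
        # ';' instead of '&&': sleep runs even if psql exits non-zero,
        # so the hook's exit status is that of sleep
        - "psql -c 'SHUTDOWN WAIT_FOR_CLIENTS'; sleep 360"
```

This only keeps the container alive for the grace period. If psql fails, pgbouncer is never asked to stop accepting new clients, so this would be a stopgap at best.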

We already have a discussion about this in the pgbouncer project:
pgbouncer/pgbouncer#1361

One finding is that the operator uses SIGINT instead of SIGTERM to stop the pooler, which does not shut down pgbouncer gracefully.
pgbouncer/pgbouncer#1361 (comment)
https://www.pgbouncer.org/usage.html#:~:text=other%20SQL%20command.)-,Signals,-SIGHUP

I suggest changing the operator to send SIGTERM instead.
We would also appreciate any help or advice on how to achieve a graceful shutdown in the meantime.

Cluster resource

apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: ***
  creationTimestamp: "2025-08-06T10:01:51Z"
  generation: 3
  labels:
    argocd.argoproj.io/instance: cloudnative-pg
  name: ***
  namespace: postgres
  resourceVersion: "961222642"
  uid: 58376f91-1e75-4465-a8e6-05c5d8935287
spec:
  affinity:
    podAntiAffinityType: preferred
  backup:
    barmanObjectStore:
      azureCredentials:
        connectionString:
          key: ***
          name: postgres-backups-azure
      destinationPath: ***
    retentionPolicy: 7d
    target: prefer-standby
  bootstrap:
    initdb:
      database: app
      encoding: UTF8
      localeCType: C
      localeCollate: C
      owner: app
      secret:
        name: ***-es
  enablePDB: false
  enableSuperuserAccess: true
  failoverDelay: 0
  imageName: ghcr.io/cloudnative-pg/postgis:17-3.5
  inheritedMetadata:
    annotations:
      config.linkerd.io/proxy-admin-shutdown: enabled
      config.linkerd.io/shutdown-grace-period: 2460s
  instances: 2
  logLevel: info
  managed:
    services:
      disabledDefaultServices:
      - r
  maxSyncReplicas: 0
  minSyncReplicas: 0
  monitoring:
    customQueriesConfigMap:
    - key: queries
      name: cnpg-default-monitoring
    - key: pg_stat_statements_metrics
      name: ***-monitoring
    disableDefaultQueries: false
    enablePodMonitor: true
  postgresGID: 26
  postgresUID: 26
  postgresql:
    parameters:
      archive_mode: "on"
      archive_timeout: 5min
      dynamic_shared_memory_type: posix
      enable_group_by_reordering: "off"
      full_page_writes: "on"
      jit: "off"
      log_destination: csvlog
      log_directory: /controller/log
      log_filename: postgres
      log_min_duration_statement: "100"
      log_rotation_age: "0"
      log_rotation_size: "0"
      log_truncate_on_rotation: "false"
      logging_collector: "on"
      max_parallel_workers: "32"
      max_replication_slots: "32"
      max_worker_processes: "32"
      pg_stat_statements.max: "10000"
      pg_stat_statements.track: all
      shared_memory_type: mmap
      shared_preload_libraries: ""
      ssl_max_protocol_version: TLSv1.3
      ssl_min_protocol_version: TLSv1.3
      vacuum_buffer_usage_limit: "256"
      wal_keep_size: 512MB
      wal_level: logical
      wal_log_hints: "on"
      wal_receiver_timeout: 5s
      wal_sender_timeout: 5s
    syncReplicaElectionConstraint:
      enabled: false
  primaryUpdateMethod: switchover
  primaryUpdateStrategy: unsupervised
  replicationSlots:
    highAvailability:
      enabled: true
      slotPrefix: _cnpg_
    synchronizeReplicas:
      enabled: true
    updateInterval: 30
  resources:
    limits:
      memory: 1Gi
    requests:
      cpu: 600m
      memory: 1Gi
  smartShutdownTimeout: 2100
  startDelay: 3600
  stopDelay: 2400
  storage:
    resizeInUseVolumes: true
    size: 5Gi
  switchoverDelay: 3600
  walStorage:
    resizeInUseVolumes: true
    size: 5Gi
status:
  availableArchitectures:
  - goArch: amd64
    hash: e1cddc583c4841ff68a4224fcb0b6302a26f0aff6cd6fa10be7f1937af3ab6b1
  - goArch: arm64
    hash: 7706e7a8d054a613b0016238290bdb57a3dd974e16b54234acbb7d222823f312
  certificates:
    clientCASecret: ***-ca
    expirations:
      ***-ca: 2025-11-04 09:56:51 +0000 UTC
      ***-replication: 2025-11-04 09:56:51 +0000 UTC
      ***-server: 2025-11-04 09:56:51 +0000 UTC
    replicationTLSSecret: ***-replication
    serverAltDNSNames:
    - ***
    - ***
    - ***
    - ***
    - ***
    - ***
    - ***
    - ***
    serverCASecret: ***-ca
    serverTLSSecret: ***-server
  cloudNativePGCommitHash: 252497fc9
  cloudNativePGOperatorHash: e1cddc583c4841ff68a4224fcb0b6302a26f0aff6cd6fa10be7f1937af3ab6b1
  conditions:
  - lastTransitionTime: "2025-09-01T09:36:19Z"
    message: A single, unique system ID was found across reporting instances.
    reason: Unique
    status: "True"
    type: ConsistentSystemID
  - lastTransitionTime: "2025-09-01T09:37:08Z"
    message: Cluster is Ready
    reason: ClusterIsReady
    status: "True"
    type: Ready
  - lastTransitionTime: "2025-09-01T09:31:49Z"
    message: Continuous archiving is working
    reason: ContinuousArchivingSuccess
    status: "True"
    type: ContinuousArchiving
  - lastTransitionTime: "2025-09-01T00:00:11Z"
    message: Backup was successful
    reason: LastBackupSucceeded
    status: "True"
    type: LastBackupSucceeded
  configMapResourceVersion:
    metrics:
      cnpg-default-monitoring: "871162125"
      ***-monitoring: "955605657"
  currentPrimary: ***-1
  currentPrimaryTimestamp: "2025-09-01T09:31:47.322259Z"
  firstRecoverabilityPoint: "2025-08-25T00:00:05Z"
  firstRecoverabilityPointByMethod:
    barmanObjectStore: "2025-08-25T00:00:05Z"
  healthyPVC:
  - ***-1
  - ***-1-wal
  - ***-2
  - ***-2-wal
  image: ghcr.io/cloudnative-pg/postgis:17-3.5
  instanceNames:
  - ***-1
  - ***-2
  instances: 2
  instancesReportedState:
    ***-1:
      ip: 10.244.224.6
      isPrimary: true
      timeLineID: 11
    ***-2:
      ip: 10.244.224.9
      isPrimary: false
      timeLineID: 11
  instancesStatus:
    healthy:
    - ***-1
    - ***-2
  lastSuccessfulBackup: "2025-09-01T00:00:08Z"
  lastSuccessfulBackupByMethod:
    barmanObjectStore: "2025-09-01T00:00:08Z"
  latestGeneratedNode: 2
  managedRolesStatus: {}
  onlineUpdateEnabled: true
  pgDataImageInfo:
    image: ghcr.io/cloudnative-pg/postgis:17-3.5
    majorVersion: 17
  phase: Cluster in healthy state
  poolerIntegrations:
    pgBouncerIntegration:
      secrets:
      - ***-pooler
  pvcCount: 4
  readService: ***-r
  readyInstances: 2
  secretsResourceVersion: ***
  switchReplicaClusterStatus: {}
  systemID: "7535410717556084761"
  targetPrimary: ***-1
  targetPrimaryTimestamp: "2025-09-01T09:31:38.311471Z"
  timelineID: 11
  topology:
    instances:
      ***-1: {}
      ***-2: {}
    nodesUsed: 1
    successfullyExtracted: true
  writeService: ***-rw

Relevant log output

Code of Conduct

  • I agree to follow this project's Code of Conduct

Metadata

Labels

triage (Pending triage)
