Fix: Preserve TaskInstance history during Kubernetes API rate limiting errors #55159

HsiuChuanHsu · 2025-09-01T23:07:39Z

Description

This PR fixes issue #49517 where TaskInstanceHistory records were lost when Kubernetes API rate limiting (429 errors) prevented task adoption during scheduler restarts.

Problem

When using KubernetesExecutor or CeleryKubernetesExecutor:

Task pods launch successfully but K8s API starts returning 429 errors
KubernetesJobWatcher crashes, causing Scheduler restart
Scheduler fails to re-adopt running pods due to continued 429s
Tasks are marked orphaned with state reset to None
TaskInstanceHistory is not recorded since state ≠ RUNNING
Airflow UI shows missing log links for failed attempts

Solution

KubernetesExecutor: Add 429 error handling to retry logic and detailed logging for adoption failures
TaskInstance: Detect orphaned tasks (state=None + start_date set + end_date unset ) and record TaskInstanceHistory

Impact

Before:
Task Running → K8s API 429 → Scheduler Restart → Task Orphaned → State Reset to None →
No History → Missing UI Logs

After:
Task Running → K8s API 429 → Scheduler Restart → Task Orphaned → State Reset to None →
History Recorded → UI Logs Available

Fixes: #49517
Related: #49244

^ Add meaningful description above
Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named {pr_number}.significant.rst or {issue_number}.significant.rst, in airflow-core/newsfragments.

uranusjr · 2025-09-02T02:10:12Z

Fix looks reasonable but tests don’t agree. This should include a test case too.

- Handle 429 errors in KubernetesExecutor task publishing retry logic - Detect orphaned tasks and record TaskInstanceHistory in failure handler - Add detailed logging for rate limiting scenarios

Move orphaned task detection before end_date assignment to ensure TaskInstanceHistory is recorded for tasks that become detached during scheduler restarts due to Kubernetes API 429 errors.

HsiuChuanHsu · 2025-09-02T14:55:32Z

Fix looks reasonable but tests don’t agree. This should include a test case too.

When implementing unit tests for the new orphaned task detection logic in the fetch_handle_failure_context method of taskinstance.py, a critical timing issue was discovered that prevented the logic from functioning correctly.

airflow/airflow-core/src/airflow/models/taskinstance.py

Line 1585 in bcf5125

ti.end_date = timezone.utcnow()

Original problem

The orphaned task detection relies on the condition: ti.end_date is None
However, ti.end_date = timezone.utcnow() was executed earlier in the method
This made the ti.end_date is None condition impossible to satisfy

def fetch_handle_failure_context(...):
    # ... other code ...
    ti.end_date = timezone.utcnow()  # ← Sets end_date first
    
    # ... later in the method ...
    if ti.state is None and ti.start_date is not None and ti.end_date is None:
        # ← This condition can never be True because end_date was already set above

Solution
The orphaned task detection logic was moved to execute before the end_date assignment:

def fetch_handle_failure_context(...):
    # Check for orphaned task BEFORE setting end_date
    if (
        ti.is_eligible_to_retry()
        and ti.state is None
        and ti.start_date is not None
        and ti.end_date is None  # ← Now this can actually be True
    ):
        # Handle orphaned task detection and history recording
    
    # THEN set end_date
    ti.end_date = timezone.utcnow()

HsiuChuanHsu requested review from jedcunningham, hussein-awala, XD-DENG and ashb as code owners September 1, 2025 23:07

boring-cyborg bot added area:providers provider:cncf-kubernetes Kubernetes (k8s) provider related issues labels Sep 1, 2025

HsiuChuanHsu force-pushed the fix/taskinstance-history-k8s-429-error branch from edbf605 to f0ab406 Compare September 1, 2025 23:07

HsiuChuanHsu force-pushed the fix/taskinstance-history-k8s-429-error branch from f0ab406 to 01962e3 Compare September 2, 2025 13:07

eladkal requested a review from uranusjr September 2, 2025 14:34

HsiuChuanHsu added 3 commits September 2, 2025 22:42

fix(k8s): Preserve task history during API rate limiting

77074e5

- Handle 429 errors in KubernetesExecutor task publishing retry logic - Detect orphaned tasks and record TaskInstanceHistory in failure handler - Add detailed logging for rate limiting scenarios

fix(k8s): Update tests to reflect new 429 error retry behavior

104d935

fix: Record history for orphaned tasks during K8s executor failures

63a5ad1

Move orphaned task detection before end_date assignment to ensure TaskInstanceHistory is recorded for tasks that become detached during scheduler restarts due to Kubernetes API 429 errors.

HsiuChuanHsu force-pushed the fix/taskinstance-history-k8s-429-error branch from 75dfd76 to 63a5ad1 Compare September 2, 2025 14:42

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix: Preserve TaskInstance history during Kubernetes API rate limiting errors #55159

Fix: Preserve TaskInstance history during Kubernetes API rate limiting errors #55159

HsiuChuanHsu commented Sep 1, 2025 •

edited

Loading

Uh oh!

uranusjr commented Sep 2, 2025

Uh oh!

HsiuChuanHsu commented Sep 2, 2025 •

edited

Loading

Uh oh!

Uh oh!

Fix: Preserve TaskInstance history during Kubernetes API rate limiting errors #55159

Are you sure you want to change the base?

Fix: Preserve TaskInstance history during Kubernetes API rate limiting errors #55159

Conversation

HsiuChuanHsu commented Sep 1, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Problem

Solution

Impact

Uh oh!

uranusjr commented Sep 2, 2025

Uh oh!

HsiuChuanHsu commented Sep 2, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

HsiuChuanHsu commented Sep 1, 2025 •

edited

Loading

HsiuChuanHsu commented Sep 2, 2025 •

edited

Loading