Skip to content

Conversation

HsiuChuanHsu
Copy link
Contributor

@HsiuChuanHsu HsiuChuanHsu commented Sep 1, 2025

Description

This PR fixes issue #49517 where TaskInstanceHistory records were lost when Kubernetes API rate limiting (429 errors) prevented task adoption during scheduler restarts.

Problem

When using KubernetesExecutor or CeleryKubernetesExecutor:

  1. Task pods launch successfully but K8s API starts returning 429 errors
  2. KubernetesJobWatcher crashes, causing Scheduler restart
  3. Scheduler fails to re-adopt running pods due to continued 429s
  4. Tasks are marked orphaned with state reset to None
  5. TaskInstanceHistory is not recorded since state ≠ RUNNING
  6. Airflow UI shows missing log links for failed attempts

Solution

KubernetesExecutor: Add 429 error handling to retry logic and detailed logging for adoption failures
TaskInstance: Detect orphaned tasks (state=None + start_date set + end_date unset ) and record TaskInstanceHistory

Impact

Before:
Task Running → K8s API 429 → Scheduler Restart → Task Orphaned → State Reset to None →
No History → Missing UI Logs

After:
Task Running → K8s API 429 → Scheduler Restart → Task Orphaned → State Reset to None →
History Recorded → UI Logs Available

Fixes: #49517
Related: #49244


^ Add meaningful description above
Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named {pr_number}.significant.rst or {issue_number}.significant.rst, in airflow-core/newsfragments.

@boring-cyborg boring-cyborg bot added area:providers provider:cncf-kubernetes Kubernetes (k8s) provider related issues labels Sep 1, 2025
@HsiuChuanHsu HsiuChuanHsu force-pushed the fix/taskinstance-history-k8s-429-error branch from edbf605 to f0ab406 Compare September 1, 2025 23:07
@uranusjr
Copy link
Member

uranusjr commented Sep 2, 2025

Fix looks reasonable but tests don’t agree. This should include a test case too.

@HsiuChuanHsu HsiuChuanHsu force-pushed the fix/taskinstance-history-k8s-429-error branch from f0ab406 to 01962e3 Compare September 2, 2025 13:07
@eladkal eladkal requested a review from uranusjr September 2, 2025 14:34
- Handle 429 errors in KubernetesExecutor task publishing retry logic
- Detect orphaned tasks and record TaskInstanceHistory in failure handler
- Add detailed logging for rate limiting scenarios
Move orphaned task detection before end_date assignment to ensure
TaskInstanceHistory is recorded for tasks that become detached
during scheduler restarts due to Kubernetes API 429 errors.
@HsiuChuanHsu HsiuChuanHsu force-pushed the fix/taskinstance-history-k8s-429-error branch from 75dfd76 to 63a5ad1 Compare September 2, 2025 14:42
@HsiuChuanHsu
Copy link
Contributor Author

HsiuChuanHsu commented Sep 2, 2025

Fix looks reasonable but tests don’t agree. This should include a test case too.

When implementing unit tests for the new orphaned task detection logic in the fetch_handle_failure_context method of taskinstance.py, a critical timing issue was discovered that prevented the logic from functioning correctly.

ti.end_date = timezone.utcnow()

Original problem

  • The orphaned task detection relies on the condition: ti.end_date is None
  • However, ti.end_date = timezone.utcnow() was executed earlier in the method
  • This made the ti.end_date is None condition impossible to satisfy
def fetch_handle_failure_context(...):
    # ... other code ...
    ti.end_date = timezone.utcnow()  # ← Sets end_date first
    
    # ... later in the method ...
    if ti.state is None and ti.start_date is not None and ti.end_date is None:
        # ← This condition can never be True because end_date was already set above

Solution
The orphaned task detection logic was moved to execute before the end_date assignment:

def fetch_handle_failure_context(...):
    # Check for orphaned task BEFORE setting end_date
    if (
        ti.is_eligible_to_retry()
        and ti.state is None
        and ti.start_date is not None
        and ti.end_date is None  # ← Now this can actually be True
    ):
        # Handle orphaned task detection and history recording
    
    # THEN set end_date
    ti.end_date = timezone.utcnow()

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:providers provider:cncf-kubernetes Kubernetes (k8s) provider related issues
Projects
None yet
Development

Successfully merging this pull request may close these issues.

TI history missing after Scheduler restart during K8s 429 error
2 participants