Replies: 1 comment
The root cause is that your API servers are heavily under-provisioned for a 300-DAG, 8-scheduler setup. The Airflow 3 task SDK communicates with the API server for XCom reads (and other task operations), so when the API server is saturated, those calls time out.

Why this works on dev but not prod

In Airflow 3, tasks fetch XComs by calling the REST API (

Fix 1: Increase API server workers (highest impact)

In your Helm chart values:

    apiServer:
      replicas: 2
      workers: 4  # Increase from 1 to 4 (or more)

Or if using gunicorn-based config:

    [api_server]
    workers = 4
    worker_timeout = 120

This alone will likely resolve the timeout.

Fix 2: Increase the SDK client timeout

Set a higher timeout for the task SDK's HTTP client:

    [api]
    # Timeout in seconds for task SDK → API server calls
    client_connect_timeout = 10
    client_read_timeout = 60  # default is often too low under load

Fix 3: Check your corporate network proxy

The stack trace shows the timeout happening at the TCP read level (

    env:
      - name: NO_PROXY
        value: "localhost,127.0.0.1,.cluster.local,<api-server-service-name>"

Fix 4: Reduce XCom payload size

Your dynamic task uses
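
For Fix 4, one rough way to see how much data each XCom pull has to move is to measure the JSON-encoded size of the value your setup task returns. This is a minimal sketch, assuming the XCom value is JSON-serializable; the record shape and the `xcom_size_bytes` helper are hypothetical and not taken from your DAG:

```python
import json


def xcom_size_bytes(value) -> int:
    """Return the size in bytes of `value` once JSON-encoded as UTF-8."""
    return len(json.dumps(value).encode("utf-8"))


# Hypothetical setup-task output: one full record per mapped task.
full_records = [
    {"id": i, "path": f"/data/batch_{i}.parquet", "rows": list(range(200))}
    for i in range(50)
]

# Leaner alternative: pass only the identifier each mapped task needs
# and let the task load the rest itself.
keys_only = [record["id"] for record in full_records]

print("full records:", xcom_size_bytes(full_records), "bytes")
print("keys only:   ", xcom_size_bytes(keys_only), "bytes")
```

The size actually stored depends on how Airflow serializes the value and on your XCom backend, so treat the numbers as an estimate for comparing alternatives rather than an exact figure.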
TL;DR
-
Airflow version 3.1.1
Running on Kubernetes in a Corporate Network
Active DAGs: around 300
Deployment:
2 API servers with one worker each
1 DAG Processor
8 Schedulers
1 Triggerer
Each has limits of around 4 CPUs and 20 GB of memory.
I have a particular DAG set up with Dynamic Task Mapping which also uses the .map function in between to map the output from my first task into the proper format for the dynamically mapped tasks.
This DAG ran perfectly fine on a dev server with fewer DAGs and low load.
When I moved it to Production for testing, my setup task, which is responsible for deciding which dynamically mapped tasks to run, completes successfully, but the two following jobs (one running on KubernetesExecutor and the other on LocalExecutor) keep failing at the same step.
This is the part where it tries to fetch the XComs from the API server and times out with an httpx timeout error.
The relevant DAG code:
The error in question:
I am at my wits' end on how to resolve this issue.
Any ideas, suggestions, prayers, spells, etc. will be appreciated. 💯