We are looking for a hands-on Airflow/AWS Platform Support engineer who keeps data pipelines and ML workloads running reliably in production, manages access and costs, and enables data teams to build safely.
Client JD:
Apache Airflow, AWS (S3, EMR and Bedrock) Administrator:
- Strong experience in Airflow DAG monitoring, including tracking task states, resolving DAG execution delays, and ensuring reliability across distributed environments.
- Expertise in failure recovery, including retry strategies, SLA miss handling, backfilling, re-running failed task instances, and ensuring consistent pipeline execution across environments.
- Hands-on experience providing SLA-based job execution support, ensuring time-critical pipelines meet business deadlines and production SLAs are continuously maintained.
- Skilled in performing root cause analysis (RCA) for pipeline failures, including dependency failures, task level exceptions, scheduler issues, and platform level bottlenecks.
- Experience in managing S3 storage optimization, including lifecycle policies, intelligent tiering, storage class transitions, versioning, and cost-effective data retention strategies.
- Expertise in securing S3 environments using IAM policies, bucket policies, encryption (KMS), access logging, and object-level permissions.
- Skilled in conducting cost and usage analysis for S3 storage and recommending optimization strategies to reduce operational spend.
- Strong background in administering Amazon EMR clusters, including cluster provisioning, configuration, autoscaling, and lifecycle management.
- Experience supporting Amazon Bedrock environments, including model endpoint configuration, invocation monitoring, access control, and cost governance.
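The S3 storage-optimization work described above often comes down to owning lifecycle configurations. The sketch below shows the general JSON shape of such a configuration; the rule ID, prefix, and day thresholds are illustrative assumptions, not values from this role.

```python
import json

# Illustrative S3 lifecycle configuration: transition objects to cheaper
# storage classes over time, then expire them. The prefix and day
# thresholds here are examples only.
lifecycle_config = {
    "Rules": [
        {
            "ID": "tier-and-expire-logs",
            "Status": "Enabled",
            "Filter": {"Prefix": "logs/"},
            "Transitions": [
                {"Days": 30, "StorageClass": "INTELLIGENT_TIERING"},
                {"Days": 90, "StorageClass": "GLACIER"},
            ],
            "Expiration": {"Days": 365},  # retention cutoff for cost control
            "NoncurrentVersionExpiration": {"NoncurrentDays": 30},
        }
    ]
}

print(json.dumps(lifecycle_config, indent=2))
```

In day-to-day administration, a configuration like this is what gets reviewed when analyzing storage spend: each transition trades retrieval latency for a lower per-GB price.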
Mandatory Skills
- Strong hands-on experience with SQL
- Strong hands-on experience with Apache Airflow (operations support)
- Experience debugging long-running / hung Airflow jobs
- Solid AWS knowledge: S3, EMR, EC2, IAM
- Monitoring and alerting using CloudWatch, Grafana, or similar
- Experience with CI/CD for Airflow DAGs (Dev → Test → Prod)
- Infrastructure automation using Terraform and/or CloudFormation
- Strong troubleshooting mindset in production environments
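The retry, SLA, and hung-job expectations in the skills above map to a handful of Airflow task defaults. The sketch below shows them as a plain dictionary so it stands alone (in a real DAG these would be passed as `default_args`); every threshold is an illustrative assumption.

```python
from datetime import timedelta

# Illustrative Airflow-style default_args for SLA-based operations support.
# All values are examples, not this client's actual thresholds.
default_args = {
    "retries": 3,                             # retry strategy for transient failures
    "retry_delay": timedelta(minutes=5),      # wait between attempts
    "retry_exponential_backoff": True,        # back off on repeated failures
    "execution_timeout": timedelta(hours=2),  # kill hung / long-running tasks
    "sla": timedelta(hours=4),                # past this, SLA-miss handling fires
}

# Worst-case added delay from retries alone (ignoring backoff growth):
total_retry_wait = default_args["retries"] * default_args["retry_delay"]
print(total_retry_wait)  # 0:15:00
```

The `execution_timeout` line is the lever most relevant to debugging hung jobs: it converts a silently stalled task into an explicit failure that the retry strategy can then handle.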
Good to Have
- Exposure to Databricks or Snowflake
- Experience with ML pipelines / MLOps concepts
- Knowledge of Amazon SageMaker or Amazon Bedrock
- Python scripting for automation/support
- Experience in Platform Support or SRE like roles
Core Responsibilities
- Support and operate Apache Airflow in production environments (Amazon MWAA, AWS's managed Airflow, preferred)
- Monitor DAGs, troubleshoot failures, recover missed or delayed pipelines
- Provide SLA-based job execution support for critical data pipelines
- Perform root cause analysis (RCA) for Airflow, AWS, and data platform issues
- Support AWS EMR and EC2 workloads running Spark, Python, and data processing jobs
- Manage S3 storage, access control, lifecycle policies, and cost optimization
- Ensure secure access via IAM roles, bucket policies, KMS encryption
- Support ML workloads triggered via Airflow (SageMaker / Bedrock integrations)
- Work closely with Data Engineers, ML Engineers, and DevOps teams
- Ensure production platforms are stable, scalable, and cost efficient
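The secure-access responsibility above (IAM roles, bucket policies, KMS encryption) frequently takes the form of a bucket policy like the following sketch, which denies object uploads that do not request SSE-KMS encryption. The bucket name is a placeholder assumption.

```python
import json

# Illustrative S3 bucket policy: deny any PutObject that does not request
# SSE-KMS server-side encryption. "example-data-bucket" is a placeholder.
bucket_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenyUnencryptedUploads",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:PutObject",
            "Resource": "arn:aws:s3:::example-data-bucket/*",
            "Condition": {
                "StringNotEquals": {
                    "s3:x-amz-server-side-encryption": "aws:kms"
                }
            },
        }
    ],
}

print(json.dumps(bucket_policy, indent=2))
```

A deny statement like this is a common guardrail because it enforces encryption regardless of which IAM role performs the upload.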
Disclaimer: This job posting has been aggregated from an external source. Role details, content, and availability are subject to change. Applicants are advised to confirm the latest information directly on the company website before applying.