Cloud Reliability & Recovery Engineer

Jobgether · India

This position is listed on behalf of a partner company, which manages all applications and next steps. Our partner is looking for a Cloud Reliability & Recovery Engineer based in India.

This role sits at the core of large-scale cloud resilience engineering, focused on ensuring critical systems remain highly available, fault-tolerant, and recoverable under any disruption. You will design and operate advanced AWS-based disaster recovery and business continuity architectures across multi-region environments. The position requires deep hands-on engineering expertise in cloud infrastructure, automation, and reliability practices, with a strong emphasis on Kubernetes, Infrastructure as Code, and CI/CD-driven operations. You will work closely with security, infrastructure, and application teams to define and enforce recovery strategies aligned with strict RTO/RPO objectives. This is a highly technical role where you will build automated DR systems, validate resiliency through chaos engineering, and continuously improve platform stability. The environment is fast-paced, engineering-driven, and focused on measurable reliability outcomes at enterprise scale.

Accountabilities:

You will design, implement, and maintain highly resilient cloud architectures with a strong focus on disaster recovery, business continuity, and system availability. Responsibilities include:

Designing multi-region and multi-AZ AWS architectures aligned with defined RTO/RPO targets
Building and maintaining failover and failback mechanisms using Route 53, Global Accelerator, and CloudFront
Developing automated disaster recovery runbooks using AWS Systems Manager, Step Functions, and related services
Implementing backup and recovery strategies across AWS services including EC2, RDS, S3, DynamoDB, and Aurora
Automating backup policies, replication workflows, and recovery validation processes
Performing chaos engineering and resilience testing using AWS Fault Injection Service (formerly Fault Injection Simulator)
Managing Infrastructure as Code using Terraform and/or CloudFormation for DR environments
Developing CI/CD-driven automation for failover, deployment, and recovery workflows
Building observability dashboards, alerts, and incident response workflows using CloudWatch and third-party tools
Participating in on-call rotations, incident response, and post-incident reviews
Maintaining DR documentation, compliance artifacts, and audit-ready recovery evidence

Requirements:

The ideal candidate brings strong AWS expertise, deep cloud reliability experience, and a proven ability to design and operate large-scale disaster recovery systems.

5+ years of experience in cloud infrastructure, SRE, or disaster recovery engineering roles
3+ years of hands-on AWS production experience at scale
Proven experience designing and implementing multi-region DR architectures with defined RTO/RPO
Strong expertise in AWS services including EC2, RDS, S3, DynamoDB, Aurora, and related resilience tools
Hands-on experience with Kubernetes-based deployments and cloud-native architecture
Strong scripting skills in Python, Bash, or PowerShell for automation and orchestration
Experience with Infrastructure as Code tools such as Terraform or AWS CloudFormation
Solid understanding of networking concepts including VPC, DNS failover, VPN, and Direct Connect
Strong knowledge of CI/CD pipelines and automation frameworks
Excellent communication skills with the ability to produce clear technical and executive reports
Experience with resilience frameworks, compliance standards, and operational best practices

Benefits:

Competitive compensation aligned with experience and industry standards
Opportunity to work on mission-critical, large-scale cloud resilience systems
Remote-friendly work environment with global collaboration
Exposure to advanced AWS architectures, DR automation, and chaos engineering practices
Strong focus on engineering excellence, automation, and continuous improvement
Learning opportunities in cloud reliability, security, and enterprise-scale infrastructure
Collaborative environment working with highly skilled engineering and security teams
