Cloud Reliability & Recovery Engineer

Jobgether · India

This position is listed on behalf of a partner company, which manages all applications and next steps. Our partner is looking for a Cloud Reliability & Recovery Engineer based in India.

This role sits at the core of large-scale cloud resilience engineering, focused on ensuring critical systems remain highly available, fault-tolerant, and recoverable under any disruption. You will design and operate advanced AWS-based disaster recovery and business continuity architectures across multi-region environments. The position requires deep hands-on engineering expertise in cloud infrastructure, automation, and reliability practices, with a strong emphasis on Kubernetes, Infrastructure as Code, and CI/CD-driven operations. You will work closely with security, infrastructure, and application teams to define and enforce recovery strategies aligned with strict RTO/RPO objectives. This is a highly technical role where you will build automated DR systems, validate resiliency through chaos engineering, and continuously improve platform stability. The environment is fast-paced, engineering-driven, and focused on measurable reliability outcomes at enterprise scale.

Accountabilities:

Design, implement, and maintain highly resilient cloud architectures with a strong focus on disaster recovery, business continuity, and system availability. Responsibilities include:

  • Designing multi-region and multi-AZ AWS architectures aligned with defined RTO/RPO targets
  • Building and maintaining failover and failback mechanisms using Route 53, Global Accelerator, and CloudFront
  • Developing automated disaster recovery runbooks using AWS Systems Manager, Step Functions, and related services
  • Implementing backup and recovery strategies across AWS services including EC2, RDS, S3, DynamoDB, and Aurora
  • Automating backup policies, replication workflows, and recovery validation processes
  • Performing chaos engineering and resilience testing using AWS Fault Injection Service (formerly Fault Injection Simulator)
  • Managing Infrastructure as Code using Terraform and/or CloudFormation for DR environments
  • Developing CI/CD-driven automation for failover, deployment, and recovery workflows
  • Building observability dashboards, alerts, and incident response workflows using CloudWatch and third-party tools
  • Participating in on-call rotations, incident response, and post-incident reviews
  • Maintaining DR documentation, compliance artifacts, and audit-ready recovery evidence

Requirements:

The ideal candidate brings strong AWS expertise, deep cloud reliability experience, and a proven ability to design and operate large-scale disaster recovery systems.

  • 5+ years of experience in cloud infrastructure, SRE, or disaster recovery engineering roles
  • 3+ years of hands-on AWS production experience at scale
  • Proven experience designing and implementing multi-region DR architectures with defined RTO/RPO
  • Strong expertise in AWS services including EC2, RDS, S3, DynamoDB, Aurora, and related resilience tools
  • Hands-on experience with Kubernetes-based deployments and cloud-native architecture
  • Strong scripting skills in Python, Bash, or PowerShell for automation and orchestration
  • Experience with Infrastructure as Code tools such as Terraform or AWS CloudFormation
  • Solid understanding of networking concepts including VPC, DNS failover, VPN, and Direct Connect
  • Strong knowledge of CI/CD pipelines and automation frameworks
  • Excellent communication skills with the ability to produce clear technical and executive reports
  • Experience with resilience frameworks, compliance standards, and operational best practices

Benefits:

  • Competitive compensation aligned with experience and industry standards
  • Opportunity to work on mission-critical, large-scale cloud resilience systems
  • Remote-friendly work environment with global collaboration
  • Exposure to advanced AWS architectures, DR automation, and chaos engineering practices
  • Strong focus on engineering excellence, automation, and continuous improvement
  • Learning opportunities in cloud reliability, security, and enterprise-scale infrastructure
  • Collaborative environment working with highly skilled engineering and security teams