Sr Staff Site Reliability Engineer (SRE)
Job Description:
We are seeking a Sr Staff Site Reliability Engineer, on a long-term basis and working US hours, who brings deep software engineering roots alongside SRE expertise. This individual will help shape and scale the reliability of our global cloud platform, bringing the full-stack perspective of someone who has built and shipped software and now drives reliability from the inside out.
The Role
This is a Senior Staff-level technical leadership role with organization-wide influence. You will define and drive reliability strategy across our multi-cloud infrastructure (AWS and GCP), establish architectural standards, and ensure our backend systems operate with exceptional availability, scalability, and resilience.
You will also collaborate with strategic partners and engineering teams to enable our organization as a cloud-integrated service, leading technical discussions and ensuring secure and reliable integrations.
This is a long-term position for someone who thrives at the intersection of software development and reliability engineering. The ideal candidate has hands-on development experience, understands the complete software delivery lifecycle, and brings an end-to-end systems perspective — from code commit to production operation.
What You’ll Do
- Define and drive the organization’s SRE strategy across engineering teams.
- Establish reliability standards, architectural guardrails, and production readiness frameworks.
- Initiate, participate in, and review architectural changes — leveraging development experience to ensure reliability and operability are built in, not bolted on.
- Apply SDLC knowledge to reliability decisions — engage early in design and architecture reviews to embed reliability, testability, and operability as first-class requirements.
- Proactively identify system-wide gaps — continuously assess the platform for reliability blind spots, missing observability, or architectural debt, and drive initiatives to close them without waiting to be asked.
- Bridge development and SRE teams — translate between engineering intent and operational reality, serving as a technical liaison who can read code, review PRs, and contribute to service-level design decisions.
- Design and maintain highly available, multi-region, multi-cloud systems.
- Ensure platform reliability supporting millions of IoT devices globally.
- Guide engineering teams in building fault-tolerant, scalable microservices and monolithic systems.
- Define and enforce SLIs, SLOs, and error budgets.
- Lead architecture reviews and production readiness reviews.
- Partner with strategic teams to deliver our organization as a cloud-integrated service and support partner integrations.
- Improve and streamline production release processes.
- Implement safe deployment strategies (canary, blue/green, progressive delivery).
- Build CI/CD guardrails to reduce deployment risk and improve reliability.
- Develop and mature observability strategies across infrastructure and services.
- Lead high-severity incident response, facilitate blameless postmortems, and drive systemic improvements to prevent recurring issues.
What You Bring
- 10+ years of combined software engineering and SRE/infrastructure experience, with a clear progression from development into reliability or platform engineering.
- Deep understanding of the complete Software Development Lifecycle (SDLC) — enabling well-informed reliability and design decisions across all phases of software delivery.
- Strong software development background — with hands-on experience building and shipping production software — enabling effective design collaboration, code-level review, and reliability-driven architectural input.
- End-to-end system comprehension — ability to reason about the full stack from device/client behavior through API layer, backend services, data stores, and infrastructure, connecting the dots across teams and domains.
- Self-directed gap identification — demonstrated initiative in spotting reliability, scalability, or process gaps and driving improvements without needing explicit direction.
- Collaborative cross-team communication — proven ability to work across engineering, product, and operations teams; comfortable influencing without authority and presenting technical decisions to both technical and non-technical stakeholders.
- Proven experience operating large-scale distributed systems in production.
- Strong hands-on expertise with AWS and GCP cloud platforms.
- Deep experience with Kubernetes in production environments.
- Advanced knowledge of Terraform, including modular design and infrastructure governance.
- Strong understanding of distributed systems, networking, and system reliability principles.
- Experience supporting Java-based monolithic systems and microservices architectures.
- Proficiency in Python for automation and tooling.
- Experience with modern observability stacks (Prometheus, Grafana, Datadog, OpenTelemetry, etc.).
- Strong debugging, incident response, and root cause analysis skills.
- Security knowledge in transport and identity — working knowledge of SSL/TLS certificate lifecycle management, mutual TLS (mTLS) for service-to-service authentication, cipher suite selection and hardening, and TLS version enforcement across microservices and infrastructure boundaries.
- Excellent written and verbal communication skills, with experience coordinating across distributed engineering teams, facilitating technical discussions, and driving alignment on reliability decisions.
Qualifications
- This position is only for a flexible rotating shift: IST evening (3 pm to midnight) or IST night (10 pm to 7 am).
- Bachelor’s degree in computer science or software engineering.