Staff Site Reliability Engineer, Cloud Reliability Intelligence

Google · Sunnyvale, CA, USA

Site Reliability Engineering (SRE) combines software and systems engineering to build and run large-scale, massively distributed, fault-tolerant systems. SRE ensures that Google Cloud's services, both our internally critical and our externally visible systems, have reliability and uptime appropriate to customers' needs and a fast rate of improvement. Additionally, SREs keep an ever-watchful eye on our systems' capacity and performance.
Much of our software development focuses on optimizing existing systems, building infrastructure and eliminating work through automation. On the SRE team, you’ll have the opportunity to manage the complex challenges of scale which are unique to Google Cloud, while using your expertise in coding, algorithms, complexity analysis and large-scale system design. SRE's culture of intellectual curiosity, problem solving and openness is key to its success. Our organization brings together people with a wide variety of backgrounds, experiences and perspectives. We encourage them to collaborate, think big and take risks in a blame-free environment. We promote self-direction to work on meaningful projects, while we also strive to create an environment that provides the support and mentorship needed to learn and grow.

The Reliability Outcome Enablement team develops the products, core infrastructure, and datasets that drive and sustain Google Cloud Platform's (GCP's) reliability promises. We build the evergreen intelligence platform, the core system that automates resilience across the GCP ecosystem. Every product team at Google (from BigQuery to Spanner) relies on our infrastructure and integrated data lake to keep their services bulletproof.

We are currently expanding our platform to integrate Generative AI and LLM-driven workflows, moving from reactive tracking to a predictive system that catches failures and automates risk mitigation.

Behind everything our users see online is the architecture built by the Technical Infrastructure team to keep it running. From developing and maintaining our data centers to building the next generation of Google platforms, we make Google's product portfolio possible. We're proud to be our engineers' engineers and love voiding warranties by taking things apart so we can rebuild them. We keep our networks up and running, ensuring our users have the best and fastest experience possible.

Individual pay is determined by factors including job-related skills, experience, and relevant education or training.

US: $207,000 - $301,000 (USD) + 20% bonus target + equity + benefits

Learn more about benefits at Google.

Minimum qualifications:

Bachelor's degree in Computer Science or a related technical field or equivalent practical experience.
8 years of experience with data structures and algorithms.
3 years of experience leading projects and designing, analyzing, and troubleshooting distributed systems.
3 years of experience in a technical leadership role, overseeing projects.
Experience overseeing full-stack architectures, ensuring cohesion between backend data automation layers and the engineering frontend.

Preferred qualifications:

Experience in applying LLMs or Generative AI to automate workflows.
Familiarity with large-scale reliability analysis or policy conformance frameworks.

DevOps pay context

Based on 1,251 disclosed DevOps salaries on RoleSuite, the role pays a median of $141K/year, with most offers between $115K and $173K (10th–90th percentile: $101K–$210K).

Google ranks among the higher-paying employers for this role, at a $217K median across 22 disclosed postings.

This posting lists $207K–$301K, above the $141K market median.

