Staff Site Reliability Engineer
This position is listed on behalf of a partner company, which manages all applications and next steps. Our partner is looking for a Staff Site Reliability Engineer based in the United States.
This role sits at the intersection of large-scale infrastructure operations, software engineering, and AI-driven systems reliability. You will help ensure the stability, performance, and scalability of complex SaaS platforms used by enterprise customers operating in highly critical domains. A key focus of the role is building and evolving intelligent, AI-assisted reliability tooling that reduces operational toil and accelerates incident resolution. You will own production systems end-to-end, from observability and incident response to long-term architectural improvements. The position blends hands-on engineering with technical leadership, requiring strong judgment in ambiguous, high-impact situations. You will also influence how modern SRE practices are defined and scaled across the organization. The environment is highly collaborative, fast-evolving, and deeply focused on engineering excellence and continuous improvement.
Accountabilities:
- Lead the design and development of AI-assisted reliability and operations tooling that leverages logs, traces, tickets, and documentation to improve incident diagnosis and resolution speed.
- Own end-to-end incident response, including detection, mitigation, root cause analysis, and implementation of long-term preventative fixes.
- Improve observability systems across critical production services by enhancing metrics, logging, tracing, and alerting quality.
- Define, implement, and evolve SLOs and SLIs to establish measurable reliability standards across key services.
- Drive improvements in cloud operations for large-scale SaaS deployments, ensuring consistent, repeatable, and reliable customer environments.
- Build internal tools and automation to reduce operational toil and increase engineering efficiency.
- Collaborate with product and engineering teams to embed reliability and observability into system design from the outset.
- Guide and mentor engineers on SRE practices, incident management, and operational excellence.
- Contribute to the evolution of deployment, upgrade, and operational workflows for distributed systems in production.
Requirements:
- Extensive experience in Site Reliability Engineering, platform engineering, or production-focused software engineering roles with strong operational ownership.
- Deep hands-on experience with Kubernetes, Linux systems, and major cloud platforms (AWS, GCP, or Azure).
- Strong software engineering skills in Python or Go, with a track record of building internal tools or production services.
- Proven ability to operate, troubleshoot, and optimize complex distributed systems in production environments.
- Strong expertise in observability practices, including metrics, logging, tracing, and incident response workflows.
- Experience defining and working with SLOs/SLIs in large-scale systems.
- Ability to lead technically ambiguous initiatives and influence cross-functional teams without formal authority.
- Demonstrated success improving system reliability through automation and engineering, not just manual operations.
- Strong communication skills with experience mentoring engineers or shaping technical practices.
- Practical judgment in applying AI/LLM tools effectively within operational or engineering workflows.
- Bonus: experience with SaaS platforms, LLM-based systems, or building tooling for support and developer productivity.
Benefits:
- Competitive US base salary range: $200,000 – $230,000 annually
- Equity participation in a high-growth technology company
- Annual performance bonus or variable compensation (where applicable)
- Comprehensive medical, dental, and vision insurance coverage
- 401(k) retirement savings plan
- Wellness and mental health support programs
- Flexible remote work environment within the United States
- Learning and development opportunities to support continuous growth
- Paid time off and flexible vacation policies
- Opportunity to work on cutting-edge AI-powered reliability systems at scale
DevOps pay context
Based on 1,235 disclosed DevOps salaries on RoleSuite, the role pays a median of $142K/year, with most offers between $115K and $175K (10th–90th percentile: $100K–$211K).
This posting lists $200K–$230K, above the $142K market median.