Staff Site Reliability Engineer

Jobgether · US

This position is listed on behalf of a partner company, which manages all applications and next steps. Our partner is looking for a Staff Site Reliability Engineer based in the United States.

This role sits at the intersection of large-scale infrastructure operations, software engineering, and AI-driven systems reliability. You will help ensure the stability, performance, and scalability of complex SaaS platforms used by enterprise customers operating in highly critical domains. A key focus of the role is building and evolving intelligent, AI-assisted reliability tooling that reduces operational toil and accelerates incident resolution. You will own production systems end-to-end, from observability and incident response to long-term architectural improvements. The position blends hands-on engineering with technical leadership, requiring strong judgment in ambiguous, high-impact situations. You will also influence how modern SRE practices are defined and scaled across the organization. The environment is highly collaborative, fast-evolving, and deeply focused on engineering excellence and continuous improvement.

Accountabilities:

Lead the design and development of AI-assisted reliability and operations tooling that leverages logs, traces, tickets, and documentation to improve incident diagnosis and resolution speed.
Own end-to-end incident response, including detection, mitigation, root cause analysis, and implementation of long-term preventative fixes.
Improve observability systems across critical production services by enhancing metrics, logging, tracing, and alerting quality.
Define, implement, and evolve SLOs and SLIs to establish measurable reliability standards across key services.
Drive improvements in cloud operations for large-scale SaaS deployments, ensuring consistent, repeatable, and reliable customer environments.
Build internal tools and automation to reduce operational toil and increase engineering efficiency.
Collaborate with product and engineering teams to embed reliability and observability into system design from the outset.
Guide and mentor engineers on SRE practices, incident management, and operational excellence.
Contribute to the evolution of deployment, upgrade, and operational workflows for distributed systems in production.

Requirements:

Extensive experience in Site Reliability Engineering, platform engineering, or production-focused software engineering roles with strong operational ownership.
Deep hands-on experience with Kubernetes, Linux systems, and major cloud platforms (AWS, GCP, or Azure).
Strong software engineering skills in Python or Go, with a track record of building internal tools or production services.
Proven ability to operate, troubleshoot, and optimize complex distributed systems in production environments.
Strong expertise in observability practices, including metrics, logging, tracing, and incident response workflows.
Experience defining and working with SLOs/SLIs in large-scale systems.
Ability to lead technically ambiguous initiatives and influence cross-functional teams without formal authority.
Demonstrated success improving system reliability through automation and engineering, not just manual operations.
Strong communication skills with experience mentoring engineers or shaping technical practices.
Practical judgment in applying AI/LLM tools effectively within operational or engineering workflows.
Bonus: experience with SaaS platforms, LLM-based systems, or building tooling for support and developer productivity.

Benefits:

Competitive US base salary range: $200,000 – $230,000 annually
Equity participation in a high-growth technology company
Annual performance bonus or variable compensation (where applicable)
Comprehensive medical, dental, and vision insurance coverage
401(k) retirement savings plan
Wellness and mental health support programs
Flexible remote work environment within the United States
Learning and development opportunities to support continuous growth
Paid time off and flexible vacation policies
Opportunity to work on cutting-edge AI-powered reliability systems at scale

DevOps pay context

Based on 1,235 disclosed DevOps salaries on RoleSuite, DevOps roles pay a median of $142K/year, with most offers between $115K and $175K (10th–90th percentile: $100K–$211K).

This posting lists $200K–$230K, above the $142K market median.

