Staff Site Reliability Engineer

Jobgether · US

This position is listed on behalf of a partner company, which manages all applications and next steps. Our partner is looking for a Staff Site Reliability Engineer based in the United States.

This role sits at the intersection of large-scale infrastructure operations, software engineering, and AI-driven systems reliability. You will help ensure the stability, performance, and scalability of complex SaaS platforms used by enterprise customers operating in highly critical domains. A key focus of the role is building and evolving intelligent, AI-assisted reliability tooling that reduces operational toil and accelerates incident resolution. You will own production systems end-to-end, from observability and incident response to long-term architectural improvements. The position blends hands-on engineering with technical leadership, requiring strong judgment in ambiguous, high-impact situations. You will also influence how modern SRE practices are defined and scaled across the organization. The environment is highly collaborative, fast-evolving, and deeply focused on engineering excellence and continuous improvement.

Accountabilities:

  • Lead the design and development of AI-assisted reliability and operations tooling that leverages logs, traces, tickets, and documentation to improve incident diagnosis and resolution speed.
  • Own end-to-end incident response, including detection, mitigation, root cause analysis, and implementation of long-term preventative fixes.
  • Improve observability systems across critical production services by enhancing metrics, logging, tracing, and alerting quality.
  • Define, implement, and evolve SLOs and SLIs to establish measurable reliability standards across key services.
  • Drive improvements in cloud operations for large-scale SaaS deployments, ensuring consistent, repeatable, and reliable customer environments.
  • Build internal tools and automation to reduce operational toil and increase engineering efficiency.
  • Collaborate with product and engineering teams to embed reliability and observability into system design from the outset.
  • Guide and mentor engineers on SRE practices, incident management, and operational excellence.
  • Contribute to the evolution of deployment, upgrade, and operational workflows for distributed systems in production.

Requirements:

  • Extensive experience in Site Reliability Engineering, platform engineering, or production-focused software engineering roles with strong operational ownership.
  • Deep hands-on experience with Kubernetes, Linux systems, and major cloud platforms (AWS, GCP, or Azure).
  • Strong software engineering skills in Python or Go, with a track record of building internal tools or production services.
  • Proven ability to operate, troubleshoot, and optimize complex distributed systems in production environments.
  • Strong expertise in observability practices, including metrics, logging, tracing, and incident response workflows.
  • Experience defining and working with SLOs/SLIs in large-scale systems.
  • Ability to lead technically ambiguous initiatives and influence cross-functional teams without formal authority.
  • Demonstrated success improving system reliability through automation and engineering, not just manual operations.
  • Strong communication skills with experience mentoring engineers or shaping technical practices.
  • Practical judgment in applying AI/LLM tools effectively within operational or engineering workflows.
  • Bonus: experience with SaaS platforms, LLM-based systems, or building tooling for support and developer productivity.

Benefits:

  • Competitive US base salary range: $200,000 – $230,000 annually
  • Equity participation in a high-growth technology company
  • Annual performance bonus or variable compensation (where applicable)
  • Comprehensive medical, dental, and vision insurance coverage
  • 401(k) retirement savings plan
  • Wellness and mental health support programs
  • Flexible remote work environment within the United States
  • Learning and development opportunities to support continuous growth
  • Paid time off and flexible vacation policies
  • Opportunity to work on cutting-edge AI-powered reliability systems at scale

DevOps pay context

Based on 1,235 disclosed DevOps salaries on RoleSuite, the role pays a median of $142K/year, with most offers between $115K and $175K (10th–90th percentile: $100K–$211K).

This posting lists $200K–$230K, above the $142K market median.
