Senior Site Reliability Engineer

Jobgether · US

This position is listed on behalf of a partner company, which manages all applications and next steps. Our partner is looking for a Senior Site Reliability Engineer based in the United States.

This role sits at the heart of large-scale production reliability and cloud operations within a highly regulated financial services environment. You will be responsible for ensuring the stability, observability, and performance of mission-critical systems that power modern banking and payments platforms. The position blends deep hands-on engineering with operational excellence, focusing on reducing noise, improving signal quality, and strengthening incident response practices. You will work closely with cross-functional engineering and operations teams to design resilient alerting frameworks, refine production monitoring strategies, and continuously improve system reliability. This is a high-impact role for an engineer who thrives in complex AWS environments and enjoys turning operational chaos into structured, scalable processes.

Accountabilities:

  • Own and improve production reliability across large-scale distributed systems, ensuring high availability and performance in critical financial infrastructure environments.
  • Design, refine, and maintain observability and monitoring systems using tools such as Splunk, Datadog, and ServiceNow, focusing on actionable insights rather than alert noise.
  • Reduce alert fatigue by analyzing existing monitoring signals, eliminating false positives, and improving severity classification frameworks and escalation paths.
  • Develop and maintain incident response playbooks, ensuring clear operational procedures for troubleshooting, mitigation, and post-incident review.
  • Lead efforts to troubleshoot complex production issues in AWS-based environments, ensuring rapid identification and resolution of system failures.
  • Collaborate with engineering, infrastructure, and product teams to improve system reliability, scalability, and operational efficiency.
  • Continuously enhance operational maturity by introducing automation, observability improvements, and best practices for production support.

Requirements:

This role requires a strong background in site reliability engineering, production support, and cloud infrastructure, with a focus on high-scale, regulated environments. The ideal candidate brings extensive hands-on experience with AWS, observability tools, and production incident management, along with a proven ability to reduce operational noise and improve system signal quality. Strong analytical and communication skills are essential, as this role requires collaboration across technical and non-technical stakeholders.

  • Extensive experience in Site Reliability Engineering, production support, or infrastructure engineering roles.
  • Strong expertise in AWS services and cloud-native architectures.
  • Proven experience with observability tools such as Splunk, Datadog, or similar monitoring platforms.
  • Demonstrated ability to reduce alert noise, improve signal-to-noise ratios, and design effective alerting strategies.
  • Experience building incident response playbooks, severity frameworks, and operational runbooks.
  • Strong troubleshooting skills in complex distributed systems and production environments.
  • Experience working in regulated industries such as financial services, banking, or payments is highly preferred.
  • Excellent communication skills with the ability to coordinate across engineering and operations teams.

Benefits:

  • Competitive compensation package
  • Flexible work arrangements with remote options
  • Professional development and continuous learning opportunities
  • Exposure to large-scale financial systems and modern cloud infrastructure
  • Collaborative, engineering-driven culture focused on innovation
  • Supportive environment encouraging ownership and autonomy
  • Tools and resources to support operational excellence and career growth

DevOps pay context

Based on 1,235 disclosed DevOps salaries on RoleSuite, the role pays a median of $142K/year, with most offers between $115K and $175K (10th–90th percentile: $100K–$211K).
