Senior Site Reliability Engineer

Jobgether · US

This position is listed on behalf of a partner company, which manages all applications and next steps. Our partner is looking for a Senior Site Reliability Engineer based in the United States.

This role sits at the heart of large-scale production reliability and cloud operations within a highly regulated financial services environment. You will be responsible for ensuring the stability, observability, and performance of mission-critical systems that power modern banking and payments platforms. The position blends deep hands-on engineering with operational excellence, focusing on reducing noise, improving signal quality, and strengthening incident response practices. You will work closely with cross-functional engineering and operations teams to design resilient alerting frameworks, refine production monitoring strategies, and continuously improve system reliability. This is a high-impact role for an engineer who thrives in complex AWS environments and enjoys turning operational chaos into structured, scalable processes.

Accountabilities:

Own and improve production reliability across large-scale distributed systems, ensuring high availability and performance in critical financial infrastructure environments.
Design, refine, and maintain observability and monitoring systems using tools such as Splunk, Datadog, and ServiceNow, focusing on actionable insights rather than alert noise.
Reduce alert fatigue by analyzing existing monitoring signals, eliminating false positives, and improving severity classification frameworks and escalation paths.
Develop and maintain incident response playbooks, ensuring clear operational procedures for troubleshooting, mitigation, and post-incident review.
Lead efforts to troubleshoot complex production issues in AWS-based environments, ensuring rapid identification and resolution of system failures.
Collaborate with engineering, infrastructure, and product teams to improve system reliability, scalability, and operational efficiency.
Continuously enhance operational maturity by introducing automation, observability improvements, and best practices for production support.

Requirements:

This role requires a strong background in site reliability engineering, production support, and cloud infrastructure, with a focus on high-scale, regulated environments. The ideal candidate brings extensive hands-on experience with AWS, observability tools, and production incident management, along with a proven ability to reduce operational noise and improve system signal quality. Strong analytical and communication skills are essential, as this role requires collaboration across technical and non-technical stakeholders.

Extensive experience in Site Reliability Engineering, production support, or infrastructure engineering roles.
Strong expertise in AWS services and cloud-native architectures.
Proven experience with observability tools such as Splunk, Datadog, or similar monitoring platforms.
Demonstrated ability to reduce alert noise, improve signal-to-noise ratios, and design effective alerting strategies.
Experience building incident response playbooks, severity frameworks, and operational runbooks.
Strong troubleshooting skills in complex distributed systems and production environments.
Experience working in regulated industries such as financial services, banking, or payments is highly preferred.
Excellent communication skills with the ability to coordinate across engineering and operations teams.

Benefits:

Competitive compensation package
Flexible work arrangements with remote options
Professional development and continuous learning opportunities
Exposure to large-scale financial systems and modern cloud infrastructure
Collaborative, engineering-driven culture focused on innovation
Supportive environment encouraging ownership and autonomy
Tools and resources to support operational excellence and career growth

DevOps pay context

Based on 1,235 disclosed DevOps salaries on RoleSuite, the role pays a median of $142K/year, with most offers between $115K and $175K (10th–90th percentile: $100K–$211K).

