OpsJobs
RoleSuite
Updated 2026-06-17 12:00 UTC

Staff Production Operations Engineer

Jobgether · US

This position is listed on behalf of a partner company, which manages all applications and next steps. Our partner is looking for a Staff Production Operations Engineer based in the United States.

This role sits at the intersection of reliability engineering, automation, and operational excellence, supporting large-scale distributed systems that process high volumes of real-time data.
You will be responsible for improving system reliability, streamlining incident management workflows, and building automation that reduces operational overhead across engineering teams.
The position plays a key role in shaping how incidents are detected, managed, and learned from, ensuring faster resolution and continuous improvement across production environments.
You will collaborate closely with engineering, product, and customer-facing teams to maintain high availability and performance standards across global systems.
A strong emphasis is placed on leveraging AI-driven tooling to automate repetitive operational tasks and enhance incident response efficiency.
This is a high-impact role ideal for someone who thrives in fast-paced infrastructure environments and enjoys combining SRE discipline with automation and tooling innovation.

Accountabilities:

  • Drive end-to-end improvements across the incident lifecycle, including alerting quality, severity classification, triage processes, and post-incident follow-ups.
  • Coordinate on-call programs across distributed teams, including scheduling, onboarding, and ensuring consistent operational coverage.
  • Lead incident reviews, identify root causes, and ensure actionable follow-ups are tracked and completed effectively.
  • Build and deploy automation and AI-driven agents to reduce operational toil, including incident summarization and on-call optimization.
  • Maintain and evolve runbooks, playbooks, and operational documentation to reflect current system behavior and best practices.
  • Partner with engineering and product teams to improve system observability, reliability metrics, and operational readiness.
  • Contribute directly to incident resolution when needed by debugging, prototyping fixes, or implementing mitigation strategies.
  • Improve monitoring, alerting, and observability systems to reduce noise and increase signal quality across production environments.

Requirements:

  • 5+ years of experience in Site Reliability Engineering, DevOps, or Production Operations in large-scale distributed environments.
  • Strong experience with incident management platforms such as PagerDuty, incident.io, or similar tools.
  • Hands-on expertise with observability stacks including Datadog, Grafana, CloudWatch, Sentry, or equivalents.
  • Strong understanding of reliability engineering principles such as SLOs, SLIs, MTTR, MTTA, and error budgets.
  • Experience building automation, tooling, or systems to reduce operational toil and improve engineering efficiency.
  • Proficiency in Go or another systems programming language with the ability to contribute to production codebases.
  • Familiarity with cloud environments (AWS, Azure, or GCP) and infrastructure-as-code practices.
  • Experience leveraging AI-assisted development tools to improve workflows and operational processes.
  • Strong written communication skills with the ability to coordinate across teams without direct authority.
  • Nice to have: experience building AI agents, working with streaming systems like Kafka or Redpanda, or experience in high-scale infrastructure environments.

Benefits:

  • Competitive compensation aligned with senior infrastructure engineering roles in the US market
  • Equity participation in a high-growth, innovation-driven technology company
  • Remote-first work environment across the United States
  • Comprehensive medical, dental, and vision insurance coverage
  • Flexible paid time off and paid holidays
  • Strong emphasis on engineering autonomy, tooling, and modern AI-driven workflows
  • Opportunity to work on large-scale distributed systems and real-time data infrastructure
  • Professional growth in a fast-moving, globally distributed engineering organization

Operations pay context

Based on 4,522 disclosed Operations salaries on RoleSuite, the role pays a median of $110K/year, with most offers between $83K and $145K (10th–90th percentile: $66K–$185K).

