This position is listed on behalf of a partner company, which manages all applications and next steps. Our partner is looking for a Staff Production Operations Engineer based in the United States.
This role sits at the intersection of reliability engineering, automation, and operational excellence, supporting large-scale distributed systems that process high volumes of real-time data.
You will be responsible for improving system reliability, streamlining incident management workflows, and building automation that reduces operational overhead across engineering teams.
The position plays a key role in shaping how incidents are detected, managed, and learned from, ensuring faster resolution and continuous improvement across production environments.
You will collaborate closely with engineering, product, and customer-facing teams to maintain high availability and performance standards across global systems.
The role places strong emphasis on using AI-driven tooling to automate repetitive operational tasks and speed up incident response.
This is a high-impact role ideal for someone who thrives in fast-paced infrastructure environments and enjoys combining SRE discipline with automation and tooling innovation.
Accountabilities:
- Drive end-to-end improvements across the incident lifecycle, including alerting quality, severity classification, triage processes, and post-incident follow-ups.
- Coordinate on-call programs across distributed teams, including scheduling, onboarding, and ensuring consistent operational coverage.
- Lead incident reviews, identify root causes, and ensure actionable follow-ups are tracked and completed effectively.
- Build and deploy automation and AI-driven agents to reduce operational toil, including incident summarization and on-call optimization.
- Maintain and evolve runbooks, playbooks, and operational documentation to reflect current system behavior and best practices.
- Partner with engineering and product teams to improve system observability, reliability metrics, and operational readiness.
- Contribute directly to incident resolution when needed by debugging, prototyping fixes, or implementing mitigation strategies.
- Improve monitoring, alerting, and observability systems to reduce noise and increase signal quality across production environments.
Requirements:
- 5+ years of experience in Site Reliability Engineering, DevOps, or Production Operations in large-scale distributed environments.
- Strong experience with incident management platforms such as PagerDuty, incident.io, or similar tools.
- Hands-on expertise with observability stacks including Datadog, Grafana, CloudWatch, Sentry, or equivalents.
- Strong understanding of reliability engineering principles such as SLOs, SLIs, MTTR, MTTA, and error budgets.
- Experience building automation, tooling, or systems to reduce operational toil and improve engineering efficiency.
- Proficiency in Go or another systems programming language, with the ability to contribute to production codebases.
- Familiarity with cloud environments (AWS, Azure, or GCP) and infrastructure-as-code practices.
- Experience leveraging AI-assisted development tools to improve workflows and operational processes.
- Strong written communication skills with the ability to coordinate across teams without direct authority.
- Nice to have: experience building AI agents, working with streaming systems such as Kafka or Redpanda, or operating high-scale infrastructure.
Benefits:
- Competitive compensation aligned with senior infrastructure engineering roles in the US market
- Equity participation in a high-growth, innovation-driven technology company
- Remote-first work environment across the United States
- Comprehensive medical, dental, and vision insurance coverage
- Flexible paid time off and paid holidays
- Strong emphasis on engineering autonomy, tooling, and modern AI-driven workflows
- Opportunity to work on large-scale distributed systems and real-time data infrastructure
- Professional growth in a fast-moving, globally distributed engineering organization