This position is listed on behalf of a partner company, which manages all applications and next steps. Our partner is looking for a Site Observability Engineer based in the United States.
This role is central to ensuring engineering teams have full visibility into system health, performance, and reliability across complex distributed environments. The engineer will design and operate end-to-end observability platforms covering metrics, logs, traces, and events, enabling fast and accurate detection of issues before they impact users. The environment is highly technical, cloud-native, and deeply aligned with SRE principles, with strong emphasis on automation, scalability, and signal quality. The role involves shaping how telemetry is collected, stored, and transformed into actionable insight across the organization. It also requires close collaboration with platform, SRE, and product engineering teams to embed observability into every layer of the system. The position is ideal for someone passionate about reliability engineering, data-driven operations, and building systems that empower others to debug and improve production services.
Accountabilities:
This role is responsible for building, operating, and evolving the organization’s observability ecosystem, ensuring engineers can effectively monitor, troubleshoot, and improve distributed systems at scale.
- Design and operate enterprise-grade observability platforms across metrics, logs, traces, and events
- Architect and manage tools such as Prometheus, Thanos, Mimir, Grafana, Loki, Tempo, OpenTelemetry, and Datadog
- Define and enforce SLOs, SLIs, error budgets, and observability standards across teams
- Build alerting frameworks integrated with on-call systems to reduce noise and improve incident response
- Develop instrumentation standards including logging formats, metric naming, and trace propagation
- Manage large-scale telemetry pipelines with a focus on performance, retention, and cost optimization
- Build dashboards and self-service tools to improve observability adoption across engineering teams
- Improve incident response readiness through better alerting, monitoring, and post-incident analysis
- Partner with SRE and platform teams to embed observability into CI/CD and deployment workflows
- Mentor engineers on observability best practices, debugging techniques, and reliability engineering principles
Requirements:
The ideal candidate brings deep experience in observability, SRE practices, and distributed systems, with strong technical and communication skills to drive adoption across engineering teams.
- 5+ years of experience in SRE, platform engineering, or observability-focused roles
- Strong hands-on expertise with Prometheus, Grafana, and at least one commercial tool (Datadog, New Relic, or Splunk)
- Solid understanding of OpenTelemetry, distributed tracing, and structured logging
- Proficiency in at least one programming language such as Go, Python, or Java
- Experience operating large-scale metrics and log pipelines that handle high-cardinality data
- Strong knowledge of SLOs, SLIs, error budgets, and reliability engineering principles
- Experience integrating observability systems with CI/CD and incident management tools
- Solid understanding of Linux systems, networking, and containerized environments
- Strong troubleshooting, analytical, and communication skills
- Experience building or scaling observability platforms is highly valued
Benefits:
- Competitive salary range ($100K–$150K based on experience)
- 100% remote work within the United States
- Full-time W2 employment structure (no C2C or 1099 arrangements)
- Health, dental, and vision insurance options
- Paid time off and company holidays
- Retirement savings plan with employer contributions
- Professional development and career growth opportunities
- Exposure to modern cloud-native observability stacks and large-scale distributed systems
- Collaborative engineering culture focused on reliability and continuous improvement