This position is listed on behalf of a partner company, which manages all applications and next steps. Our partner is looking for a DevOps/Observability Engineer based in the United States.
This role sits at the core of modern cloud infrastructure reliability, focused on building and scaling a next-generation observability platform for complex, distributed systems. You will design and implement end-to-end monitoring, logging, and telemetry pipelines that provide deep visibility across large-scale cloud environments. The position requires strong expertise in cloud-native architectures, with a focus on AWS, Kubernetes, and open-source observability tooling. You will play a key role in unifying metrics, logs, and traces using technologies such as OpenTelemetry, Prometheus, Grafana, and Splunk. Operating in a fast-paced, engineering-driven environment, you will collaborate closely with platform and DevOps teams to improve system reliability, performance, and cost efficiency. This is a highly technical, hands-on role where your work directly strengthens the stability and scalability of mission-critical systems.
Accountabilities:
- Design and implement end-to-end observability architectures using OpenTelemetry, Prometheus, Grafana, and related tools across cloud environments.
- Build and maintain centralized observability pipelines across multi-account AWS environments, including CloudWatch, CloudTrail, and VPC Flow Logs.
- Develop scalable log aggregation and routing strategies, including filtering, noise reduction, and integration with systems such as Splunk HEC.
- Create advanced alerting frameworks and high-quality dashboards using Alertmanager, CloudWatch Alarms, and Grafana with PromQL.
- Deploy and manage observability infrastructure using Infrastructure as Code tools such as Terraform.
- Support Kubernetes and container-based observability across EKS and ECS environments.
- Optimize observability systems for performance, cost efficiency, and scalability in large-scale production environments.
- Collaborate with engineering teams to improve system reliability, monitoring standards, and incident response capabilities.
Requirements:
- 8+ years of experience in DevOps, Site Reliability Engineering, or Observability Engineering roles.
- Strong hands-on experience designing unified observability pipelines using OpenTelemetry, Prometheus, and Grafana.
- Deep expertise in AWS observability services including CloudWatch, CloudTrail, and cross-account telemetry strategies.
- Proven ability to build and manage large-scale log aggregation systems and optimize high-volume data pipelines.
- Strong experience with Kubernetes (EKS) or containerized environments (ECS) in production settings.
- Advanced proficiency with Terraform or other Infrastructure as Code tools for infrastructure and observability deployments.
- Experience building alerting systems, dashboards, and monitoring frameworks for distributed systems.
- Strong understanding of cost optimization strategies for observability platforms (log filtering, metric reduction, storage tiering).
- Excellent problem-solving, debugging, and collaboration skills in complex cloud-native environments.
Benefits:
- Competitive compensation aligned with experience and market benchmarks.
- Remote work flexibility within the United States.
- Opportunity to work on large-scale, AI-driven, cloud-native infrastructure systems.
- Exposure to enterprise clients and high-impact digital transformation projects.
- Hands-on experience with leading observability and cloud technologies in production environments.
- Strong learning and upskilling culture in AI, cloud, and platform engineering.
- Collaborative, high-performance engineering environment focused on innovation and reliability.
- Opportunity to shape next-generation observability practices at scale.