Senior Site Reliability Engineer, Robotics & Cloud Infrastructure

Jobgether · US

This position is listed on behalf of a partner company, which manages all applications and next steps. Our partner is looking for a Senior Site Reliability Engineer, Robotics & Cloud Infrastructure, based in the United States.

This role sits at the intersection of robotics, edge computing, and cloud-scale infrastructure, ensuring the reliability of autonomous underwater systems and the data pipelines that power them. You will be responsible for keeping fleets of autonomous vehicles, onboard systems, and cloud services continuously operational in challenging real-world environments. The work spans everything from embedded compute on underwater vehicles to large-scale AWS-based data processing and customer-facing platforms. You will design and implement automation, observability, and self-healing systems that reduce manual intervention and enable sustained autonomous operations. The environment is highly technical, mission-driven, and deeply hands-on, with a strong emphasis on ownership and operational excellence. Success in this role means enabling continuous ocean data collection at scale with minimal downtime and maximum system resilience. Your work will directly support cutting-edge autonomy in maritime exploration and infrastructure intelligence.

Accountabilities

Own end-to-end system reliability across the full stack, including onboard robotics compute, operator systems, cloud infrastructure, and data delivery platforms.
Build and enhance infrastructure automation for provisioning, deployment, configuration management, and self-healing system behaviors across edge and cloud environments.
Design and scale observability systems (metrics, logging, tracing, alerting) to provide actionable insights across vehicle fleets and distributed cloud services.
Reduce operational overhead by eliminating single points of failure, automating manual workflows, and documenting runbooks for repeatable incident resolution.
Participate in a shared on-call rotation covering robotics and cloud incidents, while leading blameless postmortems and reliability improvements.
Define and track system reliability metrics such as uptime, data yield, and recovery time, aligned with continuous autonomous operations.
Manage and optimize AWS infrastructure across compute, storage, networking, security, and cost efficiency for large-scale data processing workloads.
Improve deployment safety, configuration management, and rollback strategies for fleet-wide updates across robotics systems.
Collaborate closely with robotics, data, and platform teams to embed reliability into system design from the ground up.

Requirements

5+ years of experience in Site Reliability Engineering, DevOps, or infrastructure engineering roles supporting production systems with on-call ownership.
Strong experience designing and operating scalable cloud infrastructure, preferably on AWS, including networking, compute, storage, and IAM.
Proficiency in infrastructure-as-code tools such as Terraform and strong automation skills using Python, Go, or Bash.
Experience with containerization and orchestration technologies such as Docker and Kubernetes or equivalent systems.
Strong understanding of Linux systems, networking fundamentals, and modern observability tooling (e.g., Prometheus, Grafana, or equivalents).
Experience operating in hybrid environments that include edge or embedded systems, intermittent connectivity, or physical hardware constraints.
Strong incident management mindset with experience improving operational reliability, reducing toil, and building scalable on-call practices.
Ability to write clear documentation, automate repetitive workflows, and design systems that reduce reliance on tribal knowledge.
Excellent communication skills and strong ownership mentality in fast-moving, small-team environments.
Comfort working across robotics, cloud infrastructure, and distributed data systems.
Bonus: experience with ROS/ROS 2, Jetson or edge compute platforms, or robotics-adjacent systems.
Bonus: exposure to field deployments, autonomous systems, or geospatial/data-heavy workloads.

Benefits

Competitive base salary ranging from $164,000 to $220,000 depending on location and experience
Equity compensation package
Comprehensive medical, dental, and vision insurance
Flexible PTO and paid holidays
Remote-friendly structure with required travel to field deployments and HQ as needed
Home office and equipment support
Opportunity to work on cutting-edge robotics and autonomous ocean systems
High-impact role with direct ownership of mission-critical infrastructure
Exposure to both advanced robotics and large-scale cloud systems

DevOps pay context

Based on 1,258 disclosed DevOps salaries on RoleSuite, the role pays a median of $140K/year, with most offers between $115K and $173K (10th–90th percentile: $99K–$210K).

This posting lists $164K–$220K, above the $140K market median.
