Site Reliability Consultant

Jobgether · Canada

This position is listed on behalf of a partner company, which manages all applications and next steps. Our partner is looking for a Site Reliability Consultant based in Canada.

This role sits at the intersection of cloud infrastructure, software reliability, and large-scale distributed systems engineering. You will be responsible for designing, operating, and continuously improving highly available platforms that support critical workloads across cloud-native environments. The position involves deep hands-on work with Kubernetes, observability tooling, and automation frameworks to ensure systems remain resilient, scalable, and performant. You will collaborate closely with engineering, data, and AI/ML teams to enable reliable infrastructure for complex workloads. This is a highly technical and impact-driven role where your work directly influences system uptime, performance, and engineering efficiency. You will also contribute to incident response, root cause analysis, and long-term reliability improvements across global systems.

Accountabilities:

This role is responsible for ensuring the reliability, scalability, and performance of distributed systems and cloud infrastructure across production environments.

Operate, optimize, and troubleshoot Kubernetes clusters, service mesh environments (Istio), and Linux-based systems.
Design and implement automation using Go, Python, and Shell scripting to reduce manual operational workload.
Build and maintain observability stacks using tools such as Prometheus, Grafana, and Loki for monitoring and alerting.
Diagnose and resolve complex issues across networking, storage, compute, and application performance layers.
Support AI/ML workloads by ensuring infrastructure readiness for training pipelines and data-intensive processing.
Participate in on-call rotations, incident response, and postmortem analysis to improve system reliability.
Collaborate with engineering teams to implement infrastructure-as-code practices using Terraform and cloud-native tools.

Requirements:

The ideal candidate has strong Site Reliability Engineering experience with deep expertise in cloud-native infrastructure, automation, and distributed systems.

5+ years of experience in Site Reliability Engineering, DevOps, or infrastructure engineering roles.
Strong hands-on experience with Google Cloud Platform and Infrastructure-as-Code tools such as Terraform.
Deep understanding of Kubernetes, Docker, microservices architectures, and service mesh concepts.
Strong Linux systems administration skills with experience in networking, PKI, and distributed system troubleshooting.
Proficiency in scripting and automation using Python, Shell, and ideally Go.
Experience building and maintaining observability and monitoring systems in production environments.
Strong incident management experience, including root cause analysis and postmortem practices.
Solid understanding of scalability, reliability engineering principles, and automation-first thinking.
Strong communication and collaboration skills in cross-functional engineering environments.

Benefits:

Competitive compensation package aligned with market standards (CAD 90,000 – 100,000 per year)
Fully remote-friendly work environment with flexibility and autonomy
Generous paid time off, including vacation days, sick leave, and volunteer days
Annual wellness budget supporting health, fitness, and personal well-being
Home office support with equipment and workspace personalization allowance
Strong learning and development support, including training, certifications, and professional growth opportunities
Collaborative engineering culture working alongside highly skilled global teams
Opportunities to work on cutting-edge cloud, AI, and distributed systems infrastructure

DevOps pay context

Based on 1,180 disclosed DevOps salaries on RoleSuite, the role pays a median of $142K/year, with most offers between $115K and $173K (10th–90th percentile: $101K–$210K).

