This position is listed on behalf of a partner company, which manages all applications and next steps. Our partner is looking for a Site Reliability Consultant based in Canada.
This role sits at the intersection of cloud infrastructure, software reliability, and large-scale distributed systems engineering. You will be responsible for designing, operating, and continuously improving highly available platforms that support critical workloads across cloud-native environments. The position involves deep hands-on work with Kubernetes, observability tooling, and automation frameworks to ensure systems remain resilient, scalable, and performant. You will collaborate closely with engineering, data, and AI/ML teams to enable reliable infrastructure for complex workloads. This is a highly technical and impact-driven role where your work directly influences system uptime, performance, and engineering efficiency. You will also contribute to incident response, root cause analysis, and long-term reliability improvements across global systems.
Accountabilities:
This role is responsible for ensuring the reliability, scalability, and performance of distributed systems and cloud infrastructure across production environments.
- Operate, optimize, and troubleshoot Kubernetes clusters, service mesh environments (Istio), and Linux-based systems.
- Design and implement automation using Go, Python, and Shell scripting to reduce manual operational workload.
- Build and maintain observability stacks using tools such as Prometheus, Grafana, and Loki for monitoring and alerting.
- Diagnose and resolve complex issues across networking, storage, compute, and application performance layers.
- Support AI/ML workloads by ensuring infrastructure readiness for training pipelines and data-intensive processing.
- Participate in on-call rotations, incident response, and postmortem analysis to improve system reliability.
- Collaborate with engineering teams to implement infrastructure-as-code practices using Terraform and cloud-native tools.
Requirements:
The ideal candidate has strong Site Reliability Engineering experience with deep expertise in cloud-native infrastructure, automation, and distributed systems.
- 5+ years of experience in Site Reliability Engineering, DevOps, or infrastructure engineering roles.
- Strong hands-on experience with Google Cloud Platform and Infrastructure-as-Code tools such as Terraform.
- Deep understanding of Kubernetes, Docker, microservices architectures, and service mesh concepts.
- Strong Linux systems administration skills with experience in networking, PKI, and distributed system troubleshooting.
- Proficiency in scripting and automation using Python, Shell, and ideally Go.
- Experience building and maintaining observability and monitoring systems in production environments.
- Strong incident management experience, including root cause analysis and postmortem practices.
- Solid understanding of scalability, reliability engineering principles, and automation-first thinking.
- Strong communication and collaboration skills in cross-functional engineering environments.
Benefits:
- Competitive compensation package aligned with market standards (CAD 90,000 – 100,000 per year)
- Fully remote-friendly work environment with flexibility and autonomy
- Generous paid time off, including vacation days, sick leave, and volunteer days
- Annual wellness budget supporting health, fitness, and personal well-being
- Home office support with equipment and workspace personalization allowance
- Strong learning and development support, including training, certifications, and professional growth opportunities
- Collaborative engineering culture alongside highly skilled global teams
- Opportunities to work on cutting-edge cloud, AI, and distributed systems infrastructure