Site Reliability Engineer (SRE)

Jobgether · Brazil

This position is listed on behalf of a partner company, which manages all applications and next steps. Our partner is looking for a Site Reliability Engineer (SRE) based in Brazil.

This role is at the core of ensuring reliability, scalability, and performance across mission-critical systems in a highly innovative technology environment. You will be responsible for shaping and evolving observability, incident response, and automation practices that directly impact platform stability and customer experience. Acting as a bridge between development, platform, and security teams, you will help define operational excellence standards and drive a “software as operations” mindset. The environment is fast-paced, collaborative, and strongly oriented toward engineering ownership and continuous improvement. You will work on distributed systems running in Kubernetes-based infrastructures, with strong emphasis on resilience and proactive problem-solving. A key part of your mission will be reducing manual operational work through automation and AI-driven approaches (AIOps). This is a high-impact role where your work will directly improve system reliability and engineering efficiency at scale.

Accountabilities:

In this role, you will be responsible for building and maintaining highly reliable systems while continuously improving operational maturity across engineering teams. You will define reliability standards, lead incident management practices, and drive automation initiatives that reduce operational toil and increase system resilience.

Define and track SLI, SLO, and SLA metrics, operating with error budget principles
Design and implement high availability, disaster recovery, and resilience strategies (RTO/RPO)
Build and evolve observability platforms (logs, metrics, traces, alerts, dashboards)
Lead incident response processes, including on-call coordination and escalation flows
Perform root cause analysis (RCA) and post-mortem reviews with preventive actions
Optimize system performance through capacity planning, tuning, and infrastructure analysis
Drive automation and self-healing solutions to eliminate repetitive operational tasks
Apply AI-driven approaches (AIOps) for anomaly detection, log analysis, and troubleshooting
Collaborate with development teams to improve system reliability and deployment safety
Ensure security, compliance, and operational best practices in production environments

Requirements:

We are looking for a strong technical profile with deep infrastructure understanding, solid automation skills, and a proactive mindset focused on reliability and scalability.

Experience as an SRE, DevOps, or Backend/Platform Engineer in production environments
Strong knowledge of Kubernetes, Docker, and cloud-native architectures
Solid experience with observability tools (Grafana, Prometheus, ELK, Datadog, or similar)
Strong understanding of Linux systems, networking, HTTP, DNS, and TLS/SSL
Proficiency in scripting/automation using Python, Shell, or similar languages
Experience with distributed systems, incident management, and troubleshooting
Familiarity with CI/CD pipelines, infrastructure automation, and Git workflows
Knowledge of reliability engineering concepts (SLI, SLO, error budgets) is highly valued
Experience with high-availability systems and production-scale environments
Strong analytical thinking, autonomy, and structured problem-solving skills
Clear communication skills and ability to collaborate across engineering teams
Familiarity with AIOps, OpenTelemetry, or chaos engineering is a plus

Benefits:

100% remote work, with flexibility to work from anywhere in Brazil
Competitive compensation aligned with senior-level engineering roles
Health and dental care plans
Life insurance coverage
Meal and food allowances (depending on contract model)
Home office support and ergonomic assistance
Wellness and mental health support programs
Access to fitness and wellness platforms and partnerships
Learning and development programs to support career growth
Performance-based recognition and engagement initiatives
Collaborative and innovation-driven engineering culture

DevOps pay context

Based on 1,127 disclosed DevOps salaries on RoleSuite, the role pays a median of $142K/year, with most offers between $118K and $174K (10th–90th percentile: $100K–$209K).
