Site Reliability Engineer Specialist
This position is listed on behalf of a partner company, which manages all applications and next steps. Our partner is looking for a Site Reliability Engineer Specialist based in Brazil.
This role is a senior technical leadership opportunity focused on defining and elevating reliability practices across a complex, distributed cloud-native platform. You will be responsible for shaping observability, incident response, and SRE standards across large-scale systems running in Kubernetes (GKE) and supported by a modern microservices ecosystem. The environment includes critical components such as messaging, databases, API gateways, and logging pipelines, requiring deep systems thinking and strong operational discipline. This is a highly influential individual contributor position, where you will set the benchmark for SRE excellence, drive SLO adoption, and reduce operational toil at scale. You will also play a key role in major incident management and postmortem culture. The role offers strong cross-team visibility and the opportunity to shape how reliability engineering is practiced across the entire platform.
Accountabilities:
- Define and own the technical strategy for observability across the platform, including metrics, logs, and distributed tracing using tools such as OpenTelemetry and Dash0.
- Establish and evolve SLIs, SLOs, and error budgets, ensuring they drive engineering and product decision-making.
- Lead major incident response efforts as incident commander, ensuring structured resolution and blameless postmortems with actionable outcomes.
- Improve on-call practices by reducing alert noise, minimizing toil, and building a sustainable operational model.
- Influence and support architectural decisions across distributed systems including GKE, Kong, RabbitMQ, PostgreSQL, MongoDB Atlas, Redis, and MinIO.
- Mentor SRE and platform engineers, raising the overall maturity of reliability engineering practices across teams.
- Drive adoption of observability and reliability best practices across Java and Node.js services in production.
Requirements:
- 8+ years of experience in SRE, infrastructure, or platform engineering, with senior or specialist-level exposure to large-scale production environments.
- Strong hands-on experience with Kubernetes (preferably GKE), including debugging and operating production workloads.
- Deep expertise in observability systems (OpenTelemetry, Prometheus, centralized logging such as Elasticsearch, Logstash, Fluent Bit).
- Experience defining and operationalizing SLIs, SLOs, and error budgets in real-world environments.
- Strong background in incident management, including leading high-severity incidents and postmortem processes.
- Experience operating distributed stateful systems such as PostgreSQL, MongoDB Atlas, Redis, RabbitMQ, or object storage (S3/MinIO).
- Production experience with Java services (JVM tuning, performance troubleshooting) and familiarity with Node.js environments.
- Proven ability to influence engineering teams and mentor senior engineers without formal authority.
- Strong communication skills in English and Portuguese, with experience working in distributed, cross-functional teams.
Nice to have:
- Experience with iPaaS or multi-tenant distributed platforms.
- Knowledge of Kong API Gateway, Apache Camel, or similar integration technologies.
- Experience with GitOps tools such as FluxCD or GitLab CI.
- Exposure to Chaos Engineering or Production Readiness Review frameworks.
- CNCF or cloud certifications (CKA, CKS, GCP Professional certifications).
- Contributions to open-source observability or Kubernetes ecosystems.
Benefits:
- Health and dental care coverage.
- Monthly flexible benefits via Caju card (R$ 1.400, covering food, mobility, home office, wellness, and education).
- Life insurance.
- Childcare assistance.
- Equity (RSUs).
- Gympass partnership for wellness and fitness.
- English classes at a subsidized group rate.
- Collaborative and flexible remote-first work environment.
- Strong engineering culture focused on learning, autonomy, and impact.
DevOps pay context
Based on 1,191 disclosed DevOps salaries on RoleSuite, the role pays a median of $142K/year, with most offers between $115K and $173K (10th–90th percentile: $101K–$210K).