Site Reliability Engineer Specialist

Jobgether · Brazil

This position is listed on behalf of a partner company, which manages all applications and next steps. Our partner is looking for a Site Reliability Engineer Specialist based in Brazil.

This role is a senior technical leadership opportunity focused on defining and elevating reliability practices across a complex, distributed cloud-native platform. You will be responsible for shaping observability, incident response, and SRE standards across large-scale systems running in Kubernetes (GKE) and supported by a modern microservices ecosystem. The environment includes critical components such as messaging, databases, API gateways, and logging pipelines, requiring deep systems thinking and strong operational discipline. This is a highly influential individual contributor position, where you will set the benchmark for SRE excellence, drive SLO adoption, and reduce operational toil at scale. You will also play a key role in major incident management and postmortem culture. The role offers strong cross-team visibility and the opportunity to shape how reliability engineering is practiced across the entire platform.

Accountabilities:

Define and own the technical strategy for observability across the platform, including metrics, logs, and distributed tracing using tools such as OpenTelemetry and Dash0.
Establish and evolve SLIs, SLOs, and error budgets, ensuring they drive engineering and product decision-making.
Lead major incident response efforts as incident commander, ensuring structured resolution and blameless postmortems with actionable outcomes.
Improve on-call practices by reducing alert noise, minimizing toil, and building a sustainable operational model.
Influence and support architectural decisions across distributed systems including GKE, Kong, RabbitMQ, PostgreSQL, MongoDB Atlas, Redis, and MinIO.
Mentor SRE and platform engineers, raising the overall maturity of reliability engineering practices across teams.
Drive adoption of observability and reliability best practices across Java and Node.js services in production.

Requirements:

8+ years of experience in SRE, infrastructure, or platform engineering, with senior or specialist-level exposure to large-scale production environments.
Strong hands-on experience with Kubernetes (preferably GKE), including debugging and operating production workloads.
Deep expertise in observability systems (OpenTelemetry, Prometheus, centralized logging such as Elasticsearch, Logstash, Fluent Bit).
Experience defining and operationalizing SLIs, SLOs, and error budgets in real-world environments.
Strong background in incident management, including leading high-severity incidents and postmortem processes.
Experience operating distributed stateful systems such as PostgreSQL, MongoDB Atlas, Redis, RabbitMQ, or object storage (S3/MinIO).
Production experience with Java services (JVM tuning, performance troubleshooting) and familiarity with Node.js environments.
Proven ability to influence engineering teams and mentor senior engineers without formal authority.
Strong communication skills in English and Portuguese, with experience working in distributed, cross-functional teams.

Nice to have:

Experience with iPaaS or multi-tenant distributed platforms.
Knowledge of Kong API Gateway, Apache Camel, or similar integration technologies.
Experience with GitOps or CI/CD tools such as FluxCD or GitLab CI.
Exposure to Chaos Engineering or Production Readiness Review frameworks.
CNCF or cloud certifications (CKA, CKS, GCP Professional certifications).
Contributions to open-source observability or Kubernetes ecosystems.

Benefits:

Health and dental care coverage.
Monthly flexible benefits via Caju card (R$ 1,400, covering food, mobility, home office, wellness, and education).
Life insurance.
Childcare assistance.
Equity (RSUs).
Gympass partnership for wellness and fitness.
English classes at a subsidized group rate.
Collaborative and flexible remote-first work environment.
Strong engineering culture focused on learning, autonomy, and impact.

DevOps pay context

Based on 1,191 disclosed DevOps salaries on RoleSuite, the role pays a median of $142K/year, with most offers between $115K and $173K (10th–90th percentile: $101K–$210K).

