Site Reliability Engineer Specialist

Jobgether · Brazil

This position is listed on behalf of a partner company, which manages all applications and next steps. Our partner is looking for a Site Reliability Engineer Specialist based in Brazil.

This role is a senior technical leadership opportunity focused on defining and elevating reliability practices across a complex, distributed cloud-native platform. You will be responsible for shaping observability, incident response, and SRE standards across large-scale systems running in Kubernetes (GKE) and supported by a modern microservices ecosystem. The environment includes critical components such as messaging, databases, API gateways, and logging pipelines, requiring deep systems thinking and strong operational discipline. This is a highly influential individual contributor position, where you will set the benchmark for SRE excellence, drive SLO adoption, and reduce operational toil at scale. You will also play a key role in major incident management and postmortem culture. The role offers strong cross-team visibility and the opportunity to shape how reliability engineering is practiced across the entire platform.

Accountabilities:

  • Define and own the technical strategy for observability across the platform, including metrics, logs, and distributed tracing using tools such as OpenTelemetry and Dash0.
  • Establish and evolve SLIs, SLOs, and error budgets, ensuring they drive engineering and product decision-making.
  • Lead major incident response efforts as incident commander, ensuring structured resolution and blameless postmortems with actionable outcomes.
  • Improve on-call practices by reducing alert noise, minimizing toil, and building a sustainable operational model.
  • Influence and support architectural decisions across distributed systems including GKE, Kong, RabbitMQ, PostgreSQL, MongoDB Atlas, Redis, and MinIO.
  • Mentor SRE and platform engineers, raising the overall maturity of reliability engineering practices across teams.
  • Drive adoption of observability and reliability best practices across Java and Node.js services in production.

Requirements:

  • 8+ years of experience in SRE, infrastructure, or platform engineering, with senior or specialist-level exposure to large-scale production environments.
  • Strong hands-on experience with Kubernetes (preferably GKE), including debugging and operating production workloads.
  • Deep expertise in observability systems (OpenTelemetry, Prometheus, centralized logging such as Elasticsearch, Logstash, Fluent Bit).
  • Experience defining and operationalizing SLIs, SLOs, and error budgets in real-world environments.
  • Strong background in incident management, including leading high-severity incidents and postmortem processes.
  • Experience operating distributed stateful systems such as PostgreSQL, MongoDB Atlas, Redis, RabbitMQ, or object storage (S3/MinIO).
  • Production experience with Java services (JVM tuning, performance troubleshooting) and familiarity with Node.js environments.
  • Proven ability to influence engineering teams and mentor senior engineers without formal authority.
  • Strong communication skills in English and Portuguese, with experience working in distributed, cross-functional teams.

Nice to have:

  • Experience with iPaaS or multi-tenant distributed platforms.
  • Knowledge of Kong API Gateway, Apache Camel, or similar integration technologies.
  • Experience with GitOps tools such as FluxCD or GitLab CI.
  • Exposure to Chaos Engineering or Production Readiness Review frameworks.
  • CNCF or cloud certifications (CKA, CKS, GCP Professional certifications).
  • Contributions to open-source observability or Kubernetes ecosystems.

Benefits:

  • Health and dental care coverage.
  • Monthly flexible benefits via Caju card (R$ 1.400, covering food, mobility, home office, wellness, and education).
  • Life insurance.
  • Childcare assistance.
  • Equity (RSUs).
  • Gympass partnership for wellness and fitness.
  • English classes at a subsidized group rate.
  • Collaborative and flexible remote-first work environment.
  • Strong engineering culture focused on learning, autonomy, and impact.

DevOps pay context

Based on 1,191 disclosed DevOps salaries on RoleSuite, the role pays a median of $142K/year, with most offers between $115K and $173K (10th–90th percentile: $101K–$210K).
