Senior Site Reliability Engineer (SRE)

Oowlish · Brasília, Federal District, Brazil / Recife, Pernambuco, Brazil / Mexico City / Buenos Aires Province / Bogota, D.C., Capital District / Rio de Janeiro, Rio de Janeiro, Brazil / São Paulo

Join Our Team

Oowlish, one of Latin America's rapidly expanding software development companies, is seeking experienced technology professionals to enhance our diverse and vibrant team.

As a valued member of Oowlish, you will collaborate with premier clients from the United States and Europe, contributing to pioneering digital solutions. Our commitment to a nurturing work environment is recognized by our Great Place to Work certification. Here you will find opportunities for professional development and growth, and a chance to make a significant international impact.

We offer the convenience of remote work, allowing you to craft a work-life balance that suits your personal and professional needs. We're looking for candidates who are passionate about technology, proficient in English, and excited to collaborate remotely with teams around the world.

About the Role:

We are looking for an experienced Senior Site Reliability Engineer (SRE) to own the reliability, availability, and operational excellence of business-critical production systems.

This is a dedicated Site Reliability Engineering role, not a general DevOps or Infrastructure position. You will define how reliability is measured, lead incident response during production outages, drive observability strategy, and continuously improve operational practices across high-availability environments.

The ideal candidate has hands-on experience managing SLOs, leading major incidents, improving on-call operations, and building a strong reliability culture through automation, observability, and continuous improvement.

Responsibilities:

  • Define, implement, and continuously improve Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Error Budgets.
  • Develop and maintain observability strategies, including monitoring, logging, tracing, and alerting.
  • Own observability configuration, instrumentation, and alert optimization.
  • Lead Incident Command during production incidents and coordinate cross-functional response efforts.
  • Drive blameless postmortems and ensure corrective actions are completed.
  • Own and continuously improve the on-call program, including rotations, escalation policies, runbooks, and alert tuning.
  • Establish production readiness standards for new services.
  • Partner with engineering teams on capacity planning, scalability, and disaster recovery initiatives.
  • Automate operational processes and reliability improvements using software engineering best practices.
  • Continuously improve system reliability, availability, and operational efficiency.

Requirements:

  • 5+ years of experience in Site Reliability Engineering, Production Engineering, Reliability Engineering, or similar roles.
  • Proven experience operating production systems in high-availability environments.
  • Hands-on experience defining and managing SLOs, SLIs, and Error Budgets.
  • Experience leading production incident response and Incident Command.
  • Strong observability and monitoring experience.
  • Strong software engineering skills using Python, Go, or TypeScript.
  • Experience working with cloud platforms.
  • Strong written and verbal English communication skills.

Must have:

  • Proven Site Reliability Engineering experience.
  • Experience defining and managing:
    • Service Level Indicators (SLIs)
    • Service Level Objectives (SLOs)
    • Error Budgets
  • Experience leading Incident Command during major production incidents.
  • Experience conducting blameless postmortems and driving follow-up actions.
  • Experience designing, maintaining, and improving on-call programs.
  • Experience developing runbooks and escalation policies.
  • Strong observability experience, including:
    • Monitoring
    • Logging
    • Alerting
    • Distributed Tracing
  • Experience tuning alerts to reduce operational noise.
  • Strong automation skills using Python, Go, or TypeScript.
  • Experience supporting mission-critical production systems.
  • Experience working in high-availability production environments.

Nice to have:

  • Experience with Datadog.
  • Experience with AWS.
  • Experience with Heroku.
  • Experience working in regulated industries (e.g., healthcare under HIPAA, financial services).
  • Experience establishing or maturing an SRE practice.
  • Capacity planning experience.
  • Disaster recovery planning and execution.
  • Experience with Kubernetes.
  • Experience with PostgreSQL or SQL Server.
  • Experience supporting modern TypeScript-based applications.