Site Reliability Engineer - AI Agents

Jobgether · Brazil

This position is listed on behalf of a partner company, which manages all applications and next steps. Our partner is looking for a Site Reliability Engineer – AI Agents based in Brazil.

This role sits at the intersection of platform engineering, site reliability, and applied AI, focusing on the systems that power production-grade AI agents at scale. You will help design, operate, and evolve the infrastructure that enables orchestration, execution, and serving of AI-driven workflows across internal tools and external-facing products. The environment is fast-moving and highly technical, requiring strong production discipline applied to emerging AI technologies. You will work closely with data, ML, and engineering teams to ensure reliability, observability, and scalability of agentic systems. Beyond operations, the role emphasizes building developer-facing platforms, APIs, and SDKs that make AI infrastructure accessible and reusable across teams. This is a high-impact opportunity to shape foundational systems for next-generation AI agent platforms in a globally distributed organization.

Accountabilities:

You will be responsible for building and operating the infrastructure backbone that supports AI agent systems in production, ensuring reliability, scalability, and usability across engineering teams.

  • Design, build, and operate cloud-native infrastructure supporting AI agent workflows, including orchestration, execution, and model serving
  • Ensure high reliability, scalability, and observability of distributed agentic systems across internal and external products
  • Develop platform capabilities such as APIs, SDKs, and self-service tools to enable efficient consumption of AI infrastructure
  • Manage compute, deployment, and serving infrastructure for AI and ML workloads in production environments
  • Build and maintain CI/CD pipelines enabling safe, reliable, and rapid deployment of AI services and agent workflows
  • Implement Infrastructure as Code using tools such as Terraform to provision and manage AWS-based environments
  • Design and operate observability systems, including monitoring, alerting, and incident response tailored to AI/ML workloads
  • Define reliability patterns, failure handling mechanisms, and recovery strategies for LLM and agent-based systems
  • Collaborate with AI, Data Engineering, and Product teams to transition experimental prototypes into production-grade systems
  • Manage Kubernetes-based container orchestration environments to ensure efficient scaling and deployment of services
  • Implement security controls and access management best practices across infrastructure layers
  • Document system architecture, operational procedures, and best practices to support platform adoption and knowledge sharing

Requirements:

The ideal candidate is a strong infrastructure engineer or SRE with platform engineering experience and exposure to ML or AI-driven systems in production.

  • 5+ years of experience in Site Reliability Engineering, Platform Engineering, Infrastructure Engineering, or similar roles
  • Hands-on experience supporting ML infrastructure, model serving, or MLOps pipelines in production environments
  • Experience building developer platforms, internal tooling, APIs, or SDKs used at scale by engineering teams
  • Strong understanding of platform engineering principles, including self-service infrastructure and developer experience design
  • Proficiency with Infrastructure as Code tools, particularly Terraform
  • Strong experience with Kubernetes and containerized environments (Docker)
  • Solid cloud infrastructure experience, preferably AWS
  • Strong scripting skills (bash/shell) and proficiency in at least one programming language (Python preferred)
  • Experience designing and operating observability, monitoring, and alerting systems
  • Experience with incident response, on-call rotations, and production reliability ownership
  • Strong collaboration skills across data, AI, and engineering organizations
  • High ownership mindset and ability to operate in fast-paced, high-stakes production environments
  • Familiarity with AI agent systems, LLM-based applications, or orchestration frameworks is a strong plus

Benefits:

  • Competitive compensation package with performance-based incentives
  • Fully remote working model across eligible countries, including Brazil
  • Comprehensive healthcare coverage (medical, dental, and vision where applicable)
  • Retirement savings programs with employer contributions (where applicable)
  • Flexible PTO policy and paid company holidays
  • Mental health and wellness support programs
  • Learning and development budget for professional and technical growth
  • Opportunity to work on cutting-edge AI agent infrastructure at global scale
  • Distributed, high-ownership engineering culture with strong collaboration across teams
  • Exposure to advanced platform engineering and applied AI systems

DevOps pay context

Based on 1,138 disclosed DevOps salaries on RoleSuite, the role pays a median of $142K/year, with most offers between $118K and $175K (10th–90th percentile: $100K–$209K).
