Site Reliability Engineer - AI Agents

Jobgether · Brazil

This position is listed on behalf of a partner company, which manages all applications and next steps. Our partner is looking for a Site Reliability Engineer – AI Agents based in Brazil.

This role sits at the intersection of platform engineering, site reliability, and applied AI, focusing on the systems that power production-grade AI agents at scale. You will help design, operate, and evolve the infrastructure that enables orchestration, execution, and serving of AI-driven workflows across internal tools and external-facing products. The environment is fast-moving and highly technical, requiring strong production discipline applied to emerging AI technologies. You will work closely with data, ML, and engineering teams to ensure reliability, observability, and scalability of agentic systems. Beyond operations, the role emphasizes building developer-facing platforms, APIs, and SDKs that make AI infrastructure accessible and reusable across teams. This is a high-impact opportunity to shape foundational systems for next-generation AI agent platforms in a globally distributed organization.

Accountabilities

You will be responsible for building and operating the infrastructure backbone that supports AI agent systems in production, ensuring reliability, scalability, and usability across engineering teams.

Design, build, and operate cloud-native infrastructure supporting AI agent workflows, including orchestration, execution, and model serving
Ensure high reliability, scalability, and observability of distributed agentic systems across internal and external products
Develop platform capabilities such as APIs, SDKs, and self-service tools to enable efficient consumption of AI infrastructure
Manage compute, deployment, and serving infrastructure for AI and ML workloads in production environments
Build and maintain CI/CD pipelines enabling safe, reliable, and rapid deployment of AI services and agent workflows
Implement Infrastructure as Code using tools such as Terraform to provision and manage AWS-based environments
Design and operate observability systems, including monitoring, alerting, and incident response tailored to AI/ML workloads
Define reliability patterns, failure handling mechanisms, and recovery strategies for LLM and agent-based systems
Collaborate with AI, Data Engineering, and Product teams to transition experimental prototypes into production-grade systems
Manage Kubernetes-based container orchestration environments to ensure efficient scaling and deployment of services
Implement security controls and access management best practices across infrastructure layers
Document system architecture, operational procedures, and best practices to support platform adoption and knowledge sharing

Requirements

The ideal candidate is a strong infrastructure or SRE engineer with platform engineering experience and exposure to ML or AI-driven systems in production.

5+ years of experience in Site Reliability Engineering, Platform Engineering, Infrastructure Engineering, or similar roles
Hands-on experience supporting ML infrastructure, model serving, or MLOps pipelines in production environments
Experience building developer platforms, internal tooling, APIs, or SDKs used at scale by engineering teams
Strong understanding of platform engineering principles, including self-service infrastructure and developer experience design
Proficiency with Infrastructure as Code tools, particularly Terraform
Strong experience with Kubernetes and containerized environments (Docker)
Solid cloud infrastructure experience, preferably AWS
Strong scripting skills (bash/shell) and proficiency in at least one programming language (Python preferred)
Experience designing and operating observability, monitoring, and alerting systems
Experience with incident response, on-call rotations, and production reliability ownership
Strong collaboration skills across data, AI, and engineering organizations
High ownership mindset and ability to operate in fast-paced, high-stakes production environments
Familiarity with AI agent systems, LLM-based applications, or orchestration frameworks is a strong plus

Benefits

Competitive compensation package with performance-based incentives
Fully remote working model across eligible countries, including Brazil
Comprehensive healthcare coverage (medical, dental, and vision where applicable)
Retirement savings programs with employer contributions (where applicable)
Flexible PTO policy and paid company holidays
Mental health and wellness support programs
Learning and development budget for professional and technical growth
Opportunity to work on cutting-edge AI agent infrastructure at global scale
Distributed, high-ownership engineering culture with strong collaboration across teams
Exposure to advanced platform engineering and applied AI systems

DevOps pay context

Based on 1,138 disclosed DevOps salaries on RoleSuite, the role pays a median of $142K/year, with most offers between $118K and $175K (10th–90th percentile: $100K–$209K).

