This position is listed on behalf of a partner company, which manages all applications and next steps. Our partner is looking for a Staff Machine Learning Systems Engineer (MLOps) based in the United States.
This is a high-impact infrastructure role focused on building and operating the production systems that power large-scale AI and ML services. You will define how machine learning workloads are deployed, observed, secured, and scaled across cloud-native environments. The role sits at the intersection of platform engineering, DevOps, and applied AI, ensuring that every AI product can be shipped safely and reliably. You will design the underlying Kubernetes-based infrastructure, CI/CD pipelines, and model-serving systems that support mission-critical workloads. Working closely with ML engineers, product teams, and security stakeholders, you will help translate experimental AI capabilities into production-grade systems. This is a hands-on senior technical role for someone who thrives in complex, high-scale, and fast-evolving environments.
Accountabilities:
Lead the design, evolution, and operation of the core ML infrastructure platform that supports production AI workloads, ensuring scalability, reliability, and security across environments.
- Own and optimize Kubernetes-based infrastructure (e.g., EKS), including autoscaling, workload orchestration, and cluster lifecycle management for ML and AI systems
- Build and maintain GitOps-based CI/CD pipelines enabling safe, repeatable, and efficient deployment of AI services across environments
- Design and implement model serving and inference infrastructure, including LLM routing, API gateways, and multi-provider integrations
- Develop observability, tracing, and monitoring systems for AI workloads using tools such as OpenTelemetry, Datadog, and LLM tracing platforms
- Define and enforce SLOs, incident response processes, and reliability standards for ML systems in production
- Own infrastructure-as-code and platform tooling (Terraform, CLIs, internal frameworks) to improve developer velocity and consistency
- Drive security, IAM, and secrets management architecture, ensuring least-privilege access and compliance with data protection standards
- Collaborate with ML, product, and data teams to translate research and prototypes into production-ready systems
- Identify platform bottlenecks and lead initiatives to improve performance, cost efficiency, and deployment speed
- Provide technical leadership, mentorship, and architectural guidance across ML systems engineering initiatives
Requirements:
This role requires deep expertise in cloud infrastructure, ML systems, and production-grade platform engineering, with a strong focus on reliability, scalability, and security.
- 8+ years of experience in platform engineering, DevOps, SRE, or infrastructure roles, including hands-on ML/AI systems experience
- Strong expertise with Kubernetes (preferably EKS), including cluster operations, autoscaling, and workload orchestration
- Proficiency in infrastructure-as-code tools such as Terraform and experience designing secure cloud architectures
- Solid programming skills in Python, with experience building infrastructure tooling and automation systems
- Experience operating LLM or ML inference systems in production, including routing, serving, and observability
- Hands-on experience with observability stacks (Datadog, OpenTelemetry, logging/tracing systems, or equivalents)
- Strong understanding of CI/CD systems, GitOps workflows, and developer platform engineering
- Experience designing IAM, OIDC, and secrets management systems in cloud environments
- Systems-thinking mindset with strong attention to failure modes, reliability, and long-term maintainability
- Ability to collaborate across engineering, ML, security, and product teams in fast-paced environments
- Experience in regulated or high-compliance environments (healthcare, fintech, or similar) is a plus
Benefits:
- Competitive salary with equity opportunities
- Comprehensive health coverage including medical, dental, and vision
- Unlimited PTO, company holidays, and mental health days
- Parental leave and family support benefits
- 401(k) with employer matching
- Employee stock purchase program (ESPP)
- Remote-first flexibility and offsite team gatherings
- Strong emphasis on wellness, learning, and professional development