DevOpsJobs
RoleSuite
Updated 2026-07-04 16:00 UTC·© 2025–2026 RoleSuite

ML Infrastructure Engineer

Jobgether · US

This position is listed on behalf of a partner company, which manages all applications and next steps. Our partner is looking for an ML Infrastructure Engineer based in the United States.

This role focuses on building and operating the core platform that powers large-scale machine learning training and inference workloads.
You will work on GPU cluster infrastructure spanning cloud, on-prem, and hybrid environments.
The position plays a critical role in enabling efficient, reliable, and scalable AI development across multiple teams.
You will design systems for scheduling, distributed training, storage throughput, and high-performance networking.
The environment is highly technical, combining systems engineering, ML frameworks, and platform reliability at scale.
You will collaborate closely with ML researchers and engineers to optimize performance, cost, and developer experience.
This is a hands-on engineering role where impact is measured by infrastructure efficiency and production readiness of AI workloads.

Accountabilities:

  • Design, build, and operate GPU and accelerator infrastructure for large-scale training and inference workloads across cloud, on-prem, and hybrid environments.
  • Develop scheduling, queueing, and resource management systems to maximize utilization of compute clusters.
  • Integrate and support ML frameworks such as PyTorch, JAX, DeepSpeed, FSDP, Megatron-LM, and Ray-based training workflows.
  • Build and maintain high-performance storage and data pipelines ensuring consistent GPU throughput.
  • Design and optimize networking layers including RDMA, InfiniBand, and NCCL-based communication.
  • Implement observability, monitoring, and failure analysis tools for distributed ML workloads.
  • Drive automation for provisioning, lifecycle management, and infrastructure configuration.
  • Partner with ML teams to forecast capacity needs and improve developer workflows and tooling.
  • Ensure security, isolation, and multi-tenant access control across AI infrastructure systems.
  • Optimize cost efficiency across compute, storage, and networking through intelligent resource management.

Requirements:

  • Bachelor’s or Master’s degree in Computer Science or related field.
  • 6+ years of experience in infrastructure, platform engineering, or high-performance computing environments.
  • Hands-on experience operating GPU clusters or large-scale ML training systems in production.
  • Strong proficiency in Python and at least one systems programming language (Go or C++ preferred).
  • Deep understanding of distributed systems, accelerator architectures, and ML training workflows.
  • Experience with Kubernetes, Slurm, Ray, or similar orchestration/scheduling systems.
  • Strong knowledge of Linux internals, networking concepts, and high-performance storage systems.
  • Familiarity with at least one major cloud provider’s ML infrastructure stack.
  • Solid software engineering practices including testing, CI/CD, and code review workflows.
  • Strong communication skills and ability to collaborate across research and engineering teams.
  • Experience with RDMA/InfiniBand, FinOps for ML workloads, or open-source ML infrastructure is a plus.

Benefits:

  • Competitive salary range: $100,000 – $150,000
  • 100% remote (within the United States)
  • Full-time W2 employment structure
  • H1B transfer support for eligible candidates
  • Opportunity to work on large-scale AI infrastructure systems
  • Exposure to cutting-edge ML frameworks and distributed training technologies
  • Strong engineering culture focused on performance, reliability, and scalability
  • Direct impact on production AI systems and research acceleration
  • Comprehensive career growth opportunities in advanced ML infrastructure

DevOps pay context

Based on 1,222 disclosed DevOps salaries on RoleSuite, the role pays a median of $140K/year, with most offers between $115K and $173K (10th–90th percentile: $100K–$208K).

This posting lists $100K–$150K. The $125K midpoint of that range falls below the $140K market median, although the top of the range exceeds it.


Other roles at Jobgether

  • AI/ML Research Engineer · UK
  • AI/ML Research Engineer · Brazil
  • AI/ML Research Engineer · India
  • Dynamics 365 CE Field Service Consultant/Architect · Netherlands
  • Dynamics 365 CE Field Service Consultant/Architect · Ireland
  • SAP Database Engineer · US
  • Dynamics 365 CE Field Service Consultant/Architect · Switzerland
  • Dynamics 365 CE Field Service Consultant/Architect · France
  • Dynamics 365 CE Field Service Consultant/Architect · Germany
  • Dynamics 365 CE Field Service Consultant/Architect · Spain

More DevOps roles

  • IN_Senior Associate_SAP BASIS_SAP_Advisory_Kolkata · PwC · Kolkata DN 57
  • System Administrator II · General Dynamics · International
  • Network Operations Senior System Administrator · General Dynamics · USA FL MacDill AFB
  • CSfC System Administrator · General Dynamics · International
  • Cleared DevOps Lead Engineer Hybrid · Cisco · Annapolis Junction, Maryland, US
  • Infra Tech Support Practitioner · Accenture · Indore
  • Site Reliability Engineer - AI & ML Infrastructure (Kubernetes, AWS & Terraform) · Deepgram · USA | Remote
  • Software Eng, Sr Staff - DevOps · RDC · Austin, Texas, United States
  • Senior Mainframe Production Support Specialist · Encora · Mexico; Mexico City
  • Mainframe Production Support Specialist · Encora · Mexico; Mexico City