Updated 2026-06-10 00:00 UTC·© 2025–2026 RoleSuite

HPC Engineer

ifm · Sunnyvale, CA

About MBZUAI
The Institute for Foundation Models (IFM) operates some of the world's largest AI supercomputing environments.

Position Summary
This role provides operational coverage during Abu Dhabi overnight hours and serves as a primary point of contact for infrastructure monitoring, incident triage, researcher support, and production operations.

Responsibilities

• Monitor health, performance, and availability of large-scale GPU clusters.
• Respond to incidents and perform first-level triage.
• Support researchers and troubleshoot job failures.
• Execute operational runbooks and recovery procedures.
• Validate cluster deployments, upgrades, and maintenance activities.
• Track infrastructure utilization and operational metrics.
• Develop automation and monitoring tools.
• Contribute to documentation and reporting.

Education

Bachelor's degree in Computer Science, Computer Engineering, Software Engineering, Information Technology, Electrical Engineering, Mathematics, Physics, or a related discipline.

Experience

• 2+ years in Linux systems administration, SRE, DevOps, cloud operations, HPC, or infrastructure operations.
• Strong Linux troubleshooting skills.
• Experience with scripting using Python or Bash.

Preferred Qualifications

• Slurm.
• GPU infrastructure.
• AWS, Azure, or GCP.
• Grafana, Prometheus, Datadog, or similar tools.
• Containers and Kubernetes.
• AI/ML infrastructure exposure.
• Research computing environments.
