Senior HPC Cluster Engineer

Jobgether · France

This position is listed on behalf of a partner company, who manages all applications and next steps. Our partner is looking for a Senior HPC Cluster Engineer based in France.

This role sits at the core of a next-generation AI cloud infrastructure environment, focused on building and optimizing large-scale high-performance computing systems. You will work on complex GPU and InfiniBand cluster architectures that power AI and HPC workloads at scale. The position involves deep system-level engineering, performance tuning, and hands-on troubleshooting across distributed infrastructure. You will contribute directly to improving reliability, efficiency, and scalability of compute platforms used for advanced AI and data-intensive applications. Working in a highly technical engineering culture, you will collaborate with experts across systems, networking, and virtualization. This is a high-impact role where your work directly influences the performance of large-scale cloud and AI workloads.

Accountabilities:

Own the performance optimization and reliability of large-scale GPU clusters and InfiniBand networking environments supporting HPC workloads:

Tune and optimize GPU cluster performance and InfiniBand fabric efficiency to ensure high throughput and low-latency computing.
Diagnose, troubleshoot, and resolve complex system-level issues across GPU, network, and compute layers.
Integrate and validate new hardware components into existing HPC infrastructure, including support for GPUs and related accelerators.
Work across virtualization and orchestration layers (KVM/QEMU, Kubernetes) to ensure seamless hardware utilization and deployment.
Develop and improve automation for monitoring, fault detection, and proactive remediation in distributed compute environments.
Configure, manage, and maintain GPU devices, PCIe systems, and InfiniBand networks to ensure stability and scalability.

Requirements:

We are looking for a highly experienced systems engineer with strong expertise in HPC and low-level infrastructure:

5+ years of experience in system-level software engineering with a focus on performance, scalability, or infrastructure optimization.
3+ years of hands-on experience with Linux systems administration, debugging, and performance tuning.
Strong understanding of server and hardware architecture including PCIe, NICs, GPUs, and Linux kernel-level behavior.
Proficiency in C, C++, Go, or Python for systems or performance-oriented development.
Experience working with distributed or HPC environments and solving complex infrastructure challenges.
Strong analytical and problem-solving skills with the ability to work on deep technical issues independently.
Familiarity with GPU clusters, InfiniBand networking, and large-scale compute systems is highly desirable.
Experience with KVM/QEMU or containerized orchestration environments is a plus.
Exposure to distributed computing frameworks or libraries such as MPI or NCCL is advantageous.

Benefits:

Competitive compensation package.
Career development and continuous learning opportunities in advanced AI and HPC systems.
Flexible working arrangements and remote-friendly culture across Europe.
Opportunity to work on cutting-edge AI infrastructure and large-scale distributed systems.
Collaborative engineering environment with high technical ownership.
Exposure to international teams and world-class engineering challenges.

DevOps pay context

Based on 1,233 disclosed DevOps salaries on RoleSuite, the role pays a median of $142K/year, with most offers between $115K and $174K (10th–90th percentile: $100K–$211K).

See the full DevOps salary breakdown →

Apply →

Senior HPC Cluster Engineer

Accountabilities:

Requirements:

Benefits:

DevOps pay context

Other roles at Jobgether

More DevOps roles