This position is listed on behalf of a partner company, which manages all applications and next steps. Our partner is looking for a Senior HPC Cluster Engineer based in France.
This role sits at the core of a next-generation AI cloud infrastructure environment, focused on building and optimizing large-scale high-performance computing systems. You will work on complex GPU and InfiniBand cluster architectures that power AI and HPC workloads at scale. The position involves deep system-level engineering, performance tuning, and hands-on troubleshooting across distributed infrastructure. You will contribute directly to improving the reliability, efficiency, and scalability of compute platforms used for advanced AI and data-intensive applications. Working in a highly technical engineering culture, you will collaborate with experts across systems, networking, and virtualization. This is a high-impact role in which your work directly influences the performance of large-scale cloud and AI workloads.
Accountabilities:
Own the performance optimization and reliability of large-scale GPU clusters and InfiniBand networking environments supporting HPC workloads:
- Tune and optimize GPU cluster performance and InfiniBand fabric efficiency to ensure high-throughput, low-latency computing.
- Diagnose, troubleshoot, and resolve complex system-level issues across GPU, network, and compute layers.
- Integrate and validate new hardware components into existing HPC infrastructure, including support for GPUs and related accelerators.
- Work across virtualization and orchestration layers (KVM/QEMU, Kubernetes) to ensure efficient hardware utilization and reliable deployment.
- Develop and improve automation for monitoring, fault detection, and proactive remediation in distributed compute environments.
- Configure, manage, and maintain GPU devices, PCIe systems, and InfiniBand networks to ensure stability and scalability.
Requirements:
We are looking for a highly experienced systems engineer with strong expertise in HPC and low-level infrastructure:
- 5+ years of experience in system-level software engineering with a focus on performance, scalability, or infrastructure optimization.
- 3+ years of hands-on experience with Linux systems administration, debugging, and performance tuning.
- Strong understanding of server and hardware architecture, including PCIe, NICs, and GPUs, as well as Linux kernel-level behavior.
- Proficiency in C, C++, Go, or Python for systems or performance-oriented development.
- Experience working with distributed or HPC environments and solving complex infrastructure challenges.
- Strong analytical and problem-solving skills with the ability to work on deep technical issues independently.
- Familiarity with GPU clusters, InfiniBand networking, and large-scale compute systems is highly desirable.
- Experience with KVM/QEMU or container orchestration environments such as Kubernetes is a plus.
- Exposure to distributed computing frameworks or libraries such as MPI or NCCL is advantageous.
Benefits:
- Competitive compensation package.
- Career development and continuous learning opportunities in advanced AI and HPC systems.
- Flexible working arrangements and remote-friendly culture across Europe.
- Opportunity to work on cutting-edge AI infrastructure and large-scale distributed systems.
- Collaborative engineering environment with high technical ownership.
- Exposure to international teams and world-class engineering challenges.