System Hardware Reliability Manager, AI Infrastructure

Google · Taipei, Taiwan

Be part of a team that pushes boundaries, developing custom silicon solutions that power the future of Google's direct-to-consumer products. You'll contribute to the innovation behind products loved by millions worldwide. Your expertise will shape the next generation of hardware experiences, delivering unparalleled performance, efficiency, and integration.

In this role, you will lead the team responsible for building reliability into our products from early architecture through global deployment. You will shift our focus from reactive troubleshooting to scalable strategy, partnering with Design teams and APAC manufacturers to define specifications and mitigate hardware risks before they hit production. Ultimately, you will own the technical strategy for NPI reliability frameworks, drive systemic root-cause failure analysis, and oversee the health of our active global fleet to ensure our infrastructure remains highly resilient.

The AI and Infrastructure team is redefining what’s possible. We empower Google customers with breakthrough capabilities and insights by delivering AI and Infrastructure at unparalleled scale, efficiency, reliability and velocity. Our customers include Googlers, Google Cloud customers, and billions of Google users worldwide.

We're the driving team behind Google's groundbreaking innovations, empowering the development of our AI models, delivering unparalleled computing power to global services, and providing the essential platforms that enable developers to build the future. From software to hardware, our teams are shaping the future of world-leading hyperscale computing, with key teams working on the development of our TPUs, Vertex AI for Google Cloud, Google Global Networking, Data Center operations, systems research, and much more.

Minimum qualifications:

  • Bachelor's degree in Electrical Engineering, Mechanical Engineering, Reliability Engineering, Materials Science, or a related technical discipline, or equivalent practical experience.
  • 10 years of experience in manufacturing.
  • 8 years of experience in people management.

Preferred qualifications:

  • Experience with large-scale data center infrastructure, high-density compute/server topologies, or power/cooling sub-systems.
  • Demonstrated experience in performing risk mitigation during early design phases using predictive modeling or reliability simulations before design lockdown.
  • Experience designing and executing accelerated life testing (ALT, HALT) and manufacturing detection profiles tailored to data center environmental profiles.
  • Deep expertise in structured problem-solving methodologies (e.g., 8D, FMEA, FTA) and physical failure analysis for complex electronic assemblies or server-grade hardware.
  • Strong background in data analysis tools (e.g., JMP, SQL, Python/R) for life-data analysis, Weibull modeling, and predicting fleet-wide failure rates.