Senior Scientist, Synthetic Data Generation

NVIDIA · US, CA, Santa Clara

NVIDIA is at the forefront of the AI revolution, and our research is shaping the future of large language models. We are looking for a Senior Scientist to join our team and help advance our capabilities in synthetic data generation for training frontier models. You will contribute to open-source libraries within the NVIDIA NeMo ecosystem that generate synthetic datasets across text, code, structured, and multimodal data, directly feeding the pre- and post-training of LLMs such as Nemotron. This role combines hands-on software engineering with applied research in generative methods, and you will collaborate with research, engineering, product, and model teams as well as external labs.

What you'll be doing:

Build synthetic data generation pipelines using LLM-based methods and automated quality evaluation, producing datasets that improve the pre- and post-training of LLMs such as Nemotron in reasoning, coding, structured output, and multimodal understanding.
Advance multimodal synthetic data generation (image, document, video, and audio) in partnership with NVIDIA's model teams.
Design and maintain open-source libraries and SDKs with clean APIs and strong documentation.
Drive software excellence with modern tooling, configuration-driven architecture, and disciplined Git and CI/CD practices.
Publish original research at top machine learning and AI conferences to maintain NVIDIA's technical leadership.
Mentor interns and junior researchers, and foster technical growth within the team.

What we need to see:

PhD in Computer Science, Machine Learning, Statistics, or a related field, or equivalent experience.
3+ years of research experience in synthetic data generation, generative modeling, multimodal machine learning, or related areas, or comparable experience.
Deep technical understanding of LLMs, how data shapes their pre- and post-training, and inference frameworks such as vLLM or TGI.
Proven track record of developing or maintaining software libraries used by a broad developer community.
Strong publication record at premier venues such as NeurIPS, ICML, ICLR, ACL, or similar.

Ways to stand out from the crowd:

Open-source contributions in ML or data tooling.
Experience with multimodal generation or understanding (vision-language, document AI, video, or audio).
Experience building and optimizing scalable data pipelines for large-scale model training (throughput, distributed inference).
Experience generating data for agentic, tool-use, or reinforcement-learning post-training.

NVIDIA is widely considered to be one of the technology world's most desirable employers. We have some of the most forward-thinking and talented people in the world working with us. If you are creative, autonomous, and passionate about building open-source tools that make AI safer and more private, we want to hear from you.

Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. The base salary range is 168,000 USD - 264,500 USD for Level 3, and 192,000 USD - 304,750 USD for Level 4.

You will also be eligible for equity and benefits.

Applications for this job will be accepted at least until June 14, 2026.

This posting is for an existing vacancy.

NVIDIA uses AI tools in its recruiting processes.

NVIDIA is committed to fostering an inclusive work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.
