Job location: Remote in India
About the role:
We are looking for a seasoned Machine Learning Engineer with strong expertise in deep learning to design and build production-grade models in the speech/audio and computer vision domains. The Machine Learning Engineer will own the end-to-end model development lifecycle, from dataset curation and architecture design through training, optimization, and deployment, and will work closely with software and product teams to ship low-latency, scalable AI features.
Responsibilities:
- Research, design, train, and productionize deep learning models for speech/audio (ASR, speaker diarization, audio classification, TTS, noise suppression) and computer vision (detection, segmentation, classification, video understanding) use cases.
- Architect training pipelines capable of handling large-scale datasets, and manage data preprocessing, augmentation, and versioning workflows.
- Select and adapt state-of-the-art architectures (Transformers, CNNs, RNNs, diffusion models, etc.) and fine-tune or distill pre-trained models for production constraints.
- Optimize models for inference through quantization, pruning, and knowledge distillation, targeting latency, throughput, and memory budgets on CPU/GPU/edge hardware.
- Work with software engineers to integrate models into production systems, and design APIs and microservices for model serving.
- Define and track evaluation benchmarks, monitor model performance in production, and drive continuous improvement cycles.
- Stay current with research literature, evaluate and implement relevant SOTA techniques, and contribute to internal technical forums.
Requirements:
- B.Tech / M.Tech / M.Sc. / PhD in Computer Science, Electrical Engineering, Signal Processing, or a related field.
- 4+ years of experience in machine learning/deep learning engineering with significant hands-on work in speech/audio and/or computer vision in production environments.
- Deep expertise in PyTorch (primary) and TensorFlow/Keras; familiarity with JAX is a plus.
- Strong experience with speech/audio frameworks and toolkits: Wav2Vec 2.0, Whisper, ESPnet, SpeechBrain, torchaudio, librosa.
- Solid background in computer vision: YOLO, DETR, ViT, EfficientNet, SAM, and standard CV pipelines (OpenCV, torchvision, Albumentations).
- Hands-on experience with model inference optimization using TensorRT, ONNX Runtime, TorchScript, DeepSpeed, or Triton Inference Server.
- Proficiency in Python and strong software engineering fundamentals: OOP, clean code, testing, CI/CD.
- Proficient in SQL for data extraction and pipeline logging.
- Experience with distributed training frameworks (DDP, DeepSpeed, FSDP) and large-scale data pipelines.
- Familiarity with cloud ML platforms (AWS SageMaker, GCP Vertex AI, Azure ML) and containerized deployment (Docker, Kubernetes, ECS).
- Good knowledge of system design principles for ML services: throughput, latency, scalability, fault tolerance.
What we offer:
- Professional growth in a dynamic, rapidly expanding, high-social-impact industry.
- An open-minded, collaborative culture made up of enthusiastic colleagues who are driven by the challenge of innovation towards profound impact on people and the planet.
- A truly multicultural experience: You will have the chance to work with and learn from people from different geographies, nationalities, and backgrounds.
- Structured, tailored learning and development programs that help you become a better leader, manager, and professional through the Sun King Center for Leadership.