Our client is seeking a highly skilled and driven Senior AI Engineer to join their organization as a founding member of a new AI engineering group. In this role, you’ll help build the core data and AI infrastructure used to train advanced vision models and other foundation models for large-scale industrial and energy-related applications.
You will architect and optimize end-to-end systems, high-performance data pipelines, and distributed training workflows that fuel cutting-edge AI research. Working closely with research scientists, you will transform new research ideas into robust, scalable, and efficient implementations.
This position requires deep expertise in distributed training, data engineering, and AI infrastructure, along with familiarity with MLOps. The ideal candidate has a proven track record of building systems that scale.
How You’ll Make an Impact
- Design, build, and optimize large-scale training and fine-tuning systems across a variety of model architectures.
- Develop and improve every part of the training stack (data ingestion, preprocessing, distributed training pipelines, and inference workflows), with a focus on maximizing Model FLOPs Utilization (MFU) across multi-node GPU clusters.
- Collaborate proactively with research teams to turn ideas, algorithms, and papers into high-performance production code.
- Rapidly implement, test, and iterate on approaches inspired by academic research or open-source projects.
- Diagnose and resolve performance bottlenecks by profiling and optimizing all layers of the training stack.
- Contribute to evaluating hardware, software, and cloud services that shape the client’s AI platform.
- Use MLOps frameworks (MLflow, Weights & Biases, etc.) to establish best practices across the model lifecycle, including development, training, validation, and monitoring.
- Produce high-quality documentation for infrastructure, data workflows, and training procedures.
- Stay at the forefront of advances in large-scale training, distributed systems, and data engineering to drive innovation across the AI engineering organization.
- Operate with high ownership, initiative, and a commitment to delivering reliable, high-quality code.
What You Bring
- Bachelor’s or Master’s degree in Computer Science, Engineering, or a related technical field.
- 3+ years of hands-on experience in AI Engineering or Machine Learning Engineering.
- Deep practical expertise with AI frameworks (e.g., PyTorch, PyTorch Lightning, TorchTitan).
- Hands-on experience with large-scale multi-node GPU training, distributed systems, and training optimization for computer vision or foundation models.
- Background working on computer vision systems and projects.
- Strong debugging, system-level performance optimization, and problem-solving skills.
- Excellent communication and collaboration skills.
- Experience with MLOps practices for model tracking, evaluation, and deployment.
- Open-source contributions are a strong plus.