HPC and AI Software Architect
Nvidia
Remote
We are looking for a forward-thinking HPC and AI Software Architect to help shape the future of scalable AI infrastructure, focusing on distributed training, real-time inference, and communication optimization across large-scale systems. Join our world-class team of researchers and engineers building next-generation software and hardware systems that power the most demanding AI workloads on the planet.
What you will be doing:
- Design and prototype scalable software systems that optimize distributed AI training and inference, with a focus on throughput, latency, and memory efficiency.
- Develop and evaluate enhancements to communication libraries such as NCCL, UCX, and UCC, tailored to the unique demands of deep learning workloads.
- Collaborate with AI framework teams (e.g., TensorFlow, PyTorch, JAX) to improve integration, performance, and reliability of communication backends.
- Co-design hardware features (e.g., in GPUs, DPUs, or interconnects) that accelerate data movement and enable new capabilities for inference and model serving.
- Contribute to the evolution of runtime systems, communication libraries, and AI-specific protocol layers.
What we need to see:
- Ph.D. or equivalent industry experience in computer science, computer engineering, or a closely related field.
- 2+ years of experience in systems programming, parallel or distributed computing, or high-performance data movement.
- Strong programming background in C++ and Python, ideally with CUDA or other GPU programming models.
- Practical experience with AI frameworks (e.g., PyTorch, TensorFlow) and familiarity with how they use communication libraries under the hood.
- Experience in designing or optimizing software for high-throughput, low-latency systems.
- Strong collaboration skills in a multinational, interdisciplinary environment.
Ways to stand out from the crowd:
- Expertise with NCCL, Gloo, UCX, or similar libraries used in distributed AI workloads.
- Background in networking and communication protocols, RDMA, collective communications, or accelerator-aware networking.
- Deep understanding of large model training, inference serving at scale, and associated communication bottlenecks.
- Knowledge of quantization, tensor/activation fusion, or memory optimization for inference.
- Familiarity with infrastructure for deployment of LLMs or transformer-based models, including sharding, pipelining, or hybrid parallelism.
Apply Now