HPC and AI Software Architect

Nvidia

Remote

We are looking for a forward-thinking HPC and AI Software Architect to help shape the future of scalable AI infrastructure, with a focus on distributed training, real-time inference, and communication optimization across large-scale systems. Join our world-class team of researchers and engineers building next-generation software and hardware systems that power the most demanding AI workloads on the planet.

What you will be doing:

  • Design and prototype scalable software systems that optimize distributed AI training and inference, focusing on throughput, latency, and memory efficiency.
  • Develop and evaluate enhancements to communication libraries such as NCCL, UCX, and UCC, tailored to the unique demands of deep learning workloads.
  • Collaborate with AI framework teams (e.g., TensorFlow, PyTorch, JAX) to improve the integration, performance, and reliability of communication backends (a short sketch of this interplay follows this list).
  • Co-design hardware features (e.g., in GPUs, DPUs, or interconnects) that accelerate data movement and enable new capabilities for inference and model serving.
  • Contribute to the evolution of runtime systems, communication libraries, and AI-specific protocol layers.
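
For context on the framework-integration work above, the sketch below shows how an AI framework drives a communication library: PyTorch's torch.distributed dispatching an all-reduce to the NCCL backend. This is a hypothetical illustration, not part of the role description; it assumes a CUDA-capable machine and a torchrun launch.

    # Minimal sketch: a framework-level collective backed by NCCL.
    # Illustrative only; assumes CUDA GPUs and a torchrun launch.
    import os

    import torch
    import torch.distributed as dist

    def main():
        # torchrun exports RANK, WORLD_SIZE, and LOCAL_RANK per process.
        dist.init_process_group(backend="nccl")
        local_rank = int(os.environ["LOCAL_RANK"])
        torch.cuda.set_device(local_rank)

        # Each rank contributes a gradient-like tensor; all-reduce sums the
        # contributions, the core collective behind data-parallel training.
        grad = torch.full((1024,), float(dist.get_rank()), device="cuda")
        dist.all_reduce(grad, op=dist.ReduceOp.SUM)
        grad /= dist.get_world_size()  # average across ranks

        if dist.get_rank() == 0:
            print(f"averaged gradient value: {grad[0].item():.2f}")
        dist.destroy_process_group()

    if __name__ == "__main__":
        main()  # launch with: torchrun --nproc_per_node=<num_gpus> script.py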

What we need to see:

  • Ph.D. or equivalent industry experience in computer science, computer engineering, or a closely related field.
  • 2+ years of experience in systems programming, parallel or distributed computing, or high-performance data movement.
  • Strong programming background in C++, Python, and ideally CUDA or other GPU programming models.
  • Practical experience with AI frameworks (e.g., PyTorch, TensorFlow) and familiarity with how they use communication libraries under the hood.
  • Experience designing or optimizing software for high-throughput, low-latency systems (one common pattern is sketched after this list).
  • Strong collaboration skills in a multi-national, interdisciplinary environment.
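
As one example of the low-latency work referenced above, the following minimal sketch shows communication/computation overlap using an asynchronous collective. It assumes an already-initialized torch.distributed process group with the NCCL backend; the function and tensor names are illustrative assumptions.

    # Sketch of comm/compute overlap: launch a non-blocking collective,
    # do independent work, then wait on the handle before using the result.
    import torch
    import torch.distributed as dist

    def overlapped_step(grad: torch.Tensor, activations: torch.Tensor) -> torch.Tensor:
        # NCCL runs the collective on its own CUDA stream while the
        # default stream keeps computing.
        handle = dist.all_reduce(grad, async_op=True)
        out = torch.relu(activations)  # independent work overlaps the transfer
        handle.wait()                  # synchronize before grad is consumed
        return out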

Ways to stand out from the crowd:

  • Expertise with NCCL, Gloo, UCX, or similar libraries used in distributed AI workloads.
  • Background in networking and communication protocols, RDMA, collective communications, or accelerator-aware networking.
  • Deep understanding of large model training, inference serving at scale, and associated communication bottlenecks.
  • Knowledge of quantization, tensor/activation fusion, or memory optimization for inference.
  • Familiarity with infrastructure for deploying LLMs or transformer-based models, including sharding, pipelining, or hybrid parallelism (a toy sharding sketch follows this list).
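
To make the sharding item above concrete, here is a toy sketch of tensor (column) parallelism for a linear layer. The function name, shapes, and setup are illustrative assumptions; it presumes an initialized torch.distributed process group.

    # Toy column parallelism: each rank holds a shard of the weight,
    # computes its slice of the output, and an all-gather reassembles it.
    import torch
    import torch.distributed as dist

    def column_parallel_linear(x: torch.Tensor, w_shard: torch.Tensor) -> torch.Tensor:
        # x: [batch, in_features]; w_shard: [out_features // world_size, in_features]
        y_local = x @ w_shard.t()  # local slice of the output
        world_size = dist.get_world_size()
        parts = [torch.empty_like(y_local) for _ in range(world_size)]
        dist.all_gather(parts, y_local)   # collect every rank's slice
        return torch.cat(parts, dim=-1)   # full [batch, out_features]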

Apply Now

Don't forget to mention EuroTechJobs when applying.
