Organization
Our client’s innovative technology company is shaping the future of cloud computing by delivering scalable solutions to support the global AI economy. Their mission is to empower organizations to solve complex real-world problems and transform industries through advanced tools and infrastructure—without the high costs or need for extensive internal AI/ML teams. The company employs experts at the forefront of AI cloud infrastructure, collaborating with some of the most accomplished leaders and engineers in the industry.
Headquartered in Amsterdam and publicly listed, the company maintains a global presence with research and development centers across Europe, North America, and Israel. With a workforce of over 800 professionals, including more than 400 engineers, the organization brings deep technical knowledge in both hardware and software systems, supported by a dedicated AI research division.
You will have the opportunity to work with cutting-edge technologies in data operations, cloud computing, and infrastructure management. As global data center operations continue to expand, so do opportunities for career growth. Work in the data center directly impacts performance, customer satisfaction, and operational efficiency, while offering involvement in new data center projects. Collaboration with AI infrastructure experts provides valuable insight, and the environment promotes innovation and high standards in design and deployment.
We are looking for a GPU Cluster Architect to drive the design of the next-generation AI infrastructure. In this high-impact, hands-on role, you will make end-to-end architectural decisions across compute, networking, and storage — ensuring our platforms can meet the massive scale, performance, and reliability requirements of modern AI workloads.
This is a high-impact, hands-on architecture role where you’ll define how tens of thousands of GPUs are interconnected, cooled, powered, and optimized across multiple data center sites.
Responsibilities
- Cluster Design: Architect scalable GPU cluster topologies including compute nodes, interconnect (InfiniBand, Ethernet), storage, and control planes.
- Performance Modeling: Analyze AI/ML workloads (e.g. LLM training, inference) to inform design tradeoffs across latency, bandwidth, and GPU density.
- Network Architecture: Partner with the network architect on relevant designs, and validate low-latency, high-throughput interconnects (e.g., InfiniBand HDR/NDR, RoCEv2) at POD and DC scale.
- Storage Integration: Work with storage teams to optimize performance for training datasets, checkpointing, and other data-intensive workloads.
- Reliability & Monitoring: Analyze signals from monitoring systems to detect flaws in the design.
- Collaboration: Partner with site reliability, networking, storage, and DC engineering teams to operationalize and scale your architecture.
Job requirements
- 5+ years of experience designing clusters.
- Deep understanding of modern GPU architecture (NVIDIA, AMD, etc.).
- Experience with HPC interconnects (InfiniBand & RoCE).
- Solid background in systems architecture, networking, and hardware reliability.
- Experience in scripting for automation and telemetry pipelines (Python, Go, etc.).
Offer
- Competitive compensation and benefits.
- Career growth opportunities.
- Hybrid working arrangements.
- A collaborative environment that values initiative and innovation.