The Role:
The Senior HPC Engineer at Openchip will design and implement high-performance infrastructure to accelerate the training and serving of AI models across heterogeneous hardware backends. This role is ideal for someone who understands the deep mechanics of modern compute systems (memory bandwidth, threading models, instruction pipelines) and knows how to squeeze every drop of performance out of them. You will work closely with the AI optimization and compiler teams to build the low-level CUDA/C++ components that sit at the core of the AI OS runtime and toolchain. Your contributions will directly influence how efficiently we deploy large-scale models across GPUs, CPUs, and custom accelerators. You will report directly to the AI Model Optimization Lead and be a key driver in making secure AI practical, performant, and production-ready.
Key Responsibilities:
- Design and implement high-throughput, low-latency components for AI model training and inference using CUDA and C++.
- Profile, analyze, and eliminate bottlenecks in compute, memory, and threading across multi-GPU and CPU environments.
- Collaborate with compiler, optimization, and runtime teams to build backend components that integrate with TVM, MLIR, and/or ONNX Runtime stacks.
- Develop performant kernels, data movement routines, and scheduling mechanisms tailored to real-world AI workloads (e.g. LLMs, transformers, sparse ops).
- Contribute to the development of internal performance libraries and execution runtimes that abstract heterogeneous hardware backends.
- Drive improvements to build tooling (e.g. CMake), benchmarking pipelines, and system-level validation harnesses.
Qualifications:
- More than 7 years of experience in C++ systems or HPC engineering; strong familiarity with C++14 or newer.
- Deep experience writing and optimizing CUDA kernels; comfortable with warp-level programming and memory hierarchy tuning.
- Strong knowledge of systems-level performance: cache coherence, memory paging, NUMA architectures, and multithreaded programming models (OpenMP, pthreads, etc.).
- Demonstrated experience debugging and profiling performance using tools like Nsight, nvprof, perf, or VTune.
- Comfortable working close to hardware, including on Unix/Linux systems with low-level tooling and build systems (CMake, Make).
- Experience with ML runtimes (e.g., TensorRT, TVM, XLA, or IREE) or scientific computing stacks (BLAS, cuDNN, etc.) is a plus.
- Familiarity with Rust or Python FFI/bindings (e.g., pybind11, pyo3) is a plus.
Soft Skills:
- Obsessed with performance and deeply motivated to push systems to their limits.
- Collaborative engineer who thrives at the intersection of hardware and AI.
- Comfortable owning technically complex systems and mentoring others in low-level software practices.
- Communicates clearly across engineering disciplines, from compiler internals to model optimization.
What We Offer:
- Join an innovative team at a fast-growing company.
- We invest in our employees and provide opportunities for growth and career development.
- Work in a hybrid environment with flexible scheduling.
- We offer a competitive remuneration package that reflects your experience.
- The chance to work at one of the most transformative AI and silicon engineering companies in Europe.
- The position is based in Barcelona, Spain.
We are looking for outstanding people willing to join our mission to change the silicon industry and help build a better world. If Openchip's mission resonates with you, please get in touch.
At Openchip & Software Technologies S.L., we believe a diverse and inclusive team is the key to groundbreaking ideas. We foster a work environment where everyone feels valued, respected, and empowered to reach their full potential, regardless of race, gender, ethnicity, sexual orientation, or gender identity.