

Andiamo
SRE, Compute - Decentralized High-Performance Computing Leader
Featured Role | Apply direct with Data Freelance Hub
This role is for a Senior/Staff Site Reliability Engineer focused on high-performance computing. The contract length and pay rate are not specified. Key skills include Linux internals, virtualization (KVM/QEMU), and programming in C, Go, or Rust.
Country: United States
Currency: $ USD
Day rate: Unknown
Date: October 23, 2025
Duration: Unknown
Location: Unknown
Contract: Unknown
Security: Unknown
Location detailed: New York, NY
Skills detailed: #Debugging #AI (Artificial Intelligence) #VMware #Linux #KVM (Kernel-based Virtual Machine) #Observability #Programming #Virtualization #Consulting #Regression #Automation
Role description
Senior / Staff Site Reliability Engineer - Compute Infrastructure
About The Role
We're looking for a Senior or Staff Site Reliability Engineer who thrives at the intersection of large-scale infrastructure, deep systems engineering, and cutting-edge compute performance. In this role, you'll be a driving force behind the reliability, speed, and efficiency of a massive compute platform designed to power modern AI and high-performance computing workloads.
This is a hands-on, high-impact position where you'll architect, optimize, and evolve a global fleet of bare-metal and virtualized systems. You'll work across the full stack, from kernel tuning to orchestration automation, to ensure that complex workloads run flawlessly, at scale and with exceptional performance.
What You'll Do
• Push the limits of virtualization: Engineer hypervisors (KVM/QEMU) and fine-tune kernel subsystems, CPU topology, and NUMA configurations to drive down tail latencies for demanding AI and HPC workloads.
• Deploy and optimize at scale: Roll out new compute clusters with thousands of CPU and GPU nodes, validate offload capabilities on SmartNICs and DPUs, and fortify isolation across diverse workloads.
• Automate everything: Build intelligent telemetry systems and observability pipelines that surface kernel-to-orchestrator insights. Create automated incident-response tooling and rich performance dashboards to keep operations transparent and resilient.
• Diagnose the toughest issues: Lead deep-dive investigations into kernel crashes, kexec/kdump analyses, and performance regressions, distilling findings into actionable fixes, configuration improvements, or upstream contributions.
• Collaborate on the future of compute: Partner with hardware and kernel engineering teams to debug complex drivers, accelerate I/O pathways, and integrate emerging compute technologies such as TPUs and DPUs.
• Drive continuous improvement: Design chaos experiments, lead operational game days, and translate postmortems into meaningful SLOs that measure what truly impacts end users.
Who You Are
• 5+ years of experience in site reliability, kernel, or virtualization engineering within large-scale or compute-intensive environments.
• Expert understanding of Linux internals, from schedulers and memory management to device drivers and kernel debugging.
• Hands-on experience with virtualization technologies such as KVM, QEMU, Xen, or VMware in production settings.
• Strong programming skills in C, Go, or Rust, along with practical knowledge of Infrastructure-as-Code and CI/CD systems.
• Familiarity with SmartNICs, DPUs, or kernel-bypass networking technologies that enhance data throughput and reduce system overhead.
• Proven success scaling high-performance or HPC-grade infrastructure with measurable gains in reliability and efficiency.
Why This Role Matters
This role offers the chance to shape the foundations of large-scale AI and scientific computing infrastructure. You'll work on problems that demand both creativity and precision, optimizing systems that operate at the limits of what current hardware and software can deliver. Your impact will be felt not just in performance metrics, but in the ability of thousands of users to innovate faster and push boundaries across the AI and HPC ecosystem.
About Andiamo
Andiamo is a globally recognized staffing and consulting firm specializing in placing the top 2% of technology and go-to-market professionals with the world's largest and most well-known companies.
For over 20 years, we've maintained the status of tier-one vendor for firms such as Amazon, Bloomberg, Palantir, MasterCard, Visa, Two Sigma, Citadel, as well as other major financial services firms, elite hedge funds, Google-backed tech start-ups, and major software firms.
Our talent solutions include Permanent Placement, Contract Staffing, Executive Search, and Dedicated Recruiting Services (RPO). Find out more at www.andiamogo.com






