Andiamo

SRE, Observability - Decentralized High-Performance Computing Leader

⭐ - Featured Role | Apply direct with Data Freelance Hub
This role is for a Senior/Staff Site Reliability Engineer focused on observability and telemetry systems. The contract length and pay rate are not specified. Key skills include expertise in Grafana, Kubernetes, and programming in Go or Python.
🌎 - Country
United States
💱 - Currency
$ USD
💰 - Day rate
Unknown
🗓️ - Date
October 23, 2025
🕒 - Duration
Unknown
🏝️ - Location
Unknown
📄 - Contract
Unknown
🔒 - Security
Unknown
📍 - Location detailed
New York, NY
🧠 - Skills detailed
#Bash #Consulting #Deployment #Storage #Programming #AI (Artificial Intelligence) #Monitoring #Linux #Security #Batch #Kafka (Apache Kafka) #Terraform #Leadership #Data Ingestion #Ceph #ML (Machine Learning) #Observability #Cloud #Python #IP (Internet Protocol) #Istio #Debugging #Grafana #Automation #Kubernetes #Prometheus
Role description
Senior / Staff Site Reliability Engineer – Observability & Telemetry Systems

About The Role
We’re seeking an accomplished Site Reliability Engineer with deep expertise in large-scale observability systems to help shape and operate the monitoring backbone of a global AI cloud platform. You’ll design, build, and maintain the telemetry infrastructure that ensures the performance, reliability, and visibility of systems powering advanced machine learning and high-performance computing workloads around the world.

In this role, you’ll be the technical authority driving how metrics, logs, and traces are captured, processed, and visualized across a massive distributed environment. From optimizing cost efficiency at scale to ensuring rapid root-cause analysis during incidents, you’ll be building the observability systems that keep mission-critical AI workloads running smoothly and predictably.

What You’ll Do
• Architect large-scale observability systems: Design and operate telemetry pipelines for metrics, logs, and traces using modern observability stacks (Prometheus, Mimir, Loki, Tempo, Grafana) at petabyte scale.
• Ensure reliability and efficiency: Tune distributed telemetry systems for performance, cardinality control, and cost optimization while maintaining high availability across global deployments.
• Empower debugging and insight: Build tools and frameworks that give developers deep visibility into distributed ML training, inference pipelines, and infrastructure performance.
• Collaborate cross-functionally: Partner with platform, SRE, and infrastructure teams to extend observability coverage for Kubernetes clusters, SLURM schedulers, and GPU-based compute environments.
• Operational excellence: Establish SLOs, alerting policies, and observability standards that reduce noise, streamline incident response, and strengthen reliability culture across teams.
• Automate at scale: Develop clean, maintainable code in Go, Python, or Bash to extend observability tooling and automate operational workflows.

Who You Are
• 7+ years of total engineering experience, including at least 3 years building or operating large-scale observability or telemetry infrastructure (100M+ metric series, 10TB+/day logs).
• Proven expertise with the Grafana ecosystem (Prometheus, Mimir, Loki, Tempo, Grafana, and Alertmanager) in production environments.
• Hands-on proficiency with Kubernetes, including Helm, Kustomize, custom CRDs, and multi-cluster federation.
• Experienced with Terraform (or Pulumi) and Infrastructure-as-Code best practices for hybrid or bare-metal provisioning.
• Strong programming ability in Go (preferred), with additional experience in Python or Bash for automation, data collection, and controller development.
• Deep knowledge of Linux internals (cgroups, namespaces, networking, and filesystem performance), plus foundational TCP/IP and TLS expertise.
• Experienced in defining and enforcing SLOs, SLIs, and alerting mechanisms that align engineering focus with real user impact.
• Calm and methodical under pressure: you’ve led incident response efforts, authored postmortems, and driven systemic improvements afterward.
• Communicative and collaborative: able to explain complex systems clearly and influence peers in dynamic, cross-functional environments.

Preferred Experience
• Instrumentation of GPU-heavy or HPC clusters (NVIDIA A-/H-series, NVSwitch, DGX, RoCE, RDMA).
• Observability for distributed ML workloads managed by Slurm, Ray, or Kubernetes-native batch schedulers.
• Hands-on with eBPF, Cilium, or Hubble for high-fidelity, low-overhead network visibility.
• Experience deploying and migrating OpenTelemetry across metrics, logs, and traces.
• Operating service meshes like Istio or Linkerd and managing telemetry pipelines built on Envoy.
• Managing observability across distributed or multi-region environments (US/EU/APAC), optimizing for latency and cost.
• Implementing cost and resource monitoring using tools like Kubecost or Cloudability.
• Security observability overlap: integrating Falco, GuardDuty, or auditd into telemetry pipelines.
• Contributions to open-source observability projects or thought leadership through blogs, talks, or community participation.
• Knowledge of high-performance storage systems (Ceph, Lustre, NVMe-oF) and telemetry integrations for throughput and latency analysis.
• Experience building custom backends with Kafka, ClickHouse, or VictoriaMetrics for large-scale data ingestion.

Why This Role Matters
Observability isn’t just about monitoring; it’s about empowerment. In this role, you’ll be building the visibility layer that allows engineers, scientists, and operators to understand their systems at every level, from GPU utilization to global latency trends. Your work will directly impact the stability and performance of some of the most advanced AI computing environments in existence, ensuring they stay transparent, efficient, and resilient as they scale.

About Andiamo
Andiamo is a globally recognized staffing and consulting firm specializing in placing the top 2% of technology and go-to-market professionals with the world’s largest and most well-known companies. For over 20 years, we've maintained the status of tier-one vendor for firms such as Amazon, Bloomberg, Palantir, MasterCard, Visa, Two Sigma, Citadel, as well as other major financial services firms, elite hedge funds, Google-backed tech start-ups, and major software firms. Our talent solutions include Permanent Placement, Contract Staffing, Executive Search, and Dedicated Recruiting Services (RPO). Find out more at www.andiamogo.com