

Andiamo
Member of Technical Staff - Decentralized High-Performance Computing Leader
This role is for a Software Engineer - AI Systems & Infrastructure. The contract duration and pay rate are not specified. It requires expertise in distributed computing and ML tools, plus experience with large-scale ML workloads, ideally across 1,000+ GPUs.
Country: United States
Currency: $ USD
Day rate: Unknown
Date: November 14, 2025
Duration: Unknown
Location: Unknown
Contract: Unknown
Security: Unknown
Location detailed: New York, NY
Skills detailed: #Distributed Computing #Monitoring #Kubernetes #Data Management #ML Ops (Machine Learning Operations) #AI (Artificial Intelligence) #Consulting #Deployment #Data Pipeline #Libraries #ML (Machine Learning) #MLflow #PyTorch #TensorFlow #Observability #Scala
Role description
Software Engineer – AI Systems & Infrastructure
About The Role
We’re seeking a highly skilled Software Engineer to design and build the systems that power next-generation AI infrastructure. In this role, you’ll architect and develop the software that keeps large-scale machine learning workloads running efficiently, enabling researchers and engineers to push the boundaries of what’s possible with modern AI.
You’ll collaborate closely with internal teams and customers to craft robust, scalable solutions for distributed computing, data management, and model training. Each day brings new challenges — from optimizing GPU utilization to creating smarter orchestration tools for massive compute clusters.
What You’ll Do
• Design and enhance job scheduling systems to increase GPU efficiency and throughput for large-scale machine learning workloads.
• Develop intuitive management interfaces and APIs that simplify cluster control and integration with frameworks like PyTorch, JAX, and TensorFlow.
• Build observability and monitoring systems to track performance, utilization, and progress across vast distributed training environments.
• Streamline data pipelines to accelerate both model training and inference processes, ensuring smooth and reliable data flow.
• Integrate deeply with ML tooling such as MLflow, Kubeflow, and Weights & Biases, developing seamless services and connectors that enhance developer productivity.
• Write high-performance libraries and internal utilities to automate deployment, scaling, and the management of distributed training workloads.
Who You Are
You’re passionate about building the backbone of large-scale AI systems. You thrive in dynamic environments, enjoy solving deep technical problems, and have a track record of turning complex requirements into elegant, reliable code. You value clarity, teamwork, and the satisfaction that comes from shipping tools that others depend on daily.
What We Value
• A customer-focused mindset and the ability to turn user needs into thoughtful, scalable solutions.
• A drive to take initiative, act decisively, and deliver results without waiting for perfect conditions.
• Comfort working in ambiguous, fast-evolving problem spaces with shifting priorities.
• Excellent communication skills and a collaborative approach that uplifts teammates and partners alike.
Desired Experience
• Developed or optimized systems for training or serving large-scale ML models, ideally across 1,000+ GPUs.
• Improved performance and efficiency of distributed training workflows spanning multiple nodes and accelerators.
• Built APIs, SDKs, or interfaces that simplify machine learning operations and enhance developer experience.
• Worked with cluster orchestration technologies such as Kubernetes or SLURM in the context of large-scale ML workloads.
• Contributed to or worked with ML infrastructure tools such as Ray, Horovod, or DeepSpeed, and used workflow systems like MLflow, Kubeflow, or Weights & Biases.
Why This Role Matters
AI development is only as powerful as the infrastructure behind it. This position offers the opportunity to shape the systems that drive some of the world’s most advanced machine learning workloads. You’ll help design the tools, frameworks, and services that define how AI at scale is trained, deployed, and managed — with real impact on the industry’s evolution.
About Andiamo
Talent Partners for the AI Revolution. As a globally recognized staffing and consulting firm, we specialize in placing the top 2% of technology and go-to-market professionals with the world’s largest and most well-known companies.
For over 20 years, we've been a tier-one vendor for firms such as Palantir, Amazon, Fluidstack, Bloomberg, Relativity Space, Firefly, MasterCard, Visa, Two Sigma, and Citadel, along with other major financial services firms, elite hedge funds, Google-backed tech start-ups, and major software firms.
Our talent solutions include Permanent Placement, Contract Staffing, Executive Search, and Dedicated Recruiting Services (RPO). Find out more at www.andiamogo.com





