

BayInfotech
AI ML Ops Engineer
⭐ - Featured Role | Apply directly with Data Freelance Hub
This role is for an "AI ML Ops Engineer" on a contract basis, requiring 5-10 years of ML operations experience. Pay rate is competitive. Candidates must have worked at OpenAI, Anthropic, or similar companies. Key skills include MLflow, Airflow, and SageMaker.
🌎 - Country
United States
💱 - Currency
$ USD
-
💰 - Day rate
Unknown
-
🗓️ - Date
December 4, 2025
🕒 - Duration
More than 6 months
-
🏝️ - Location
Unknown
-
📄 - Contract
Unknown
-
🔒 - Security
Unknown
-
📍 - Location detailed
United States
-
🧠 - Skills detailed
#Data Quality #Kafka (Apache Kafka) #Spark (Apache Spark) #Kubernetes #PyTorch #Prometheus #Terraform #Deep Learning #SageMaker #Airflow #Delta Lake #Monitoring #Grafana #Deployment #MLflow #ETL (Extract, Transform, Load) #Snowflake #AutoScaling #Datasets #Batch #Automation #Observability #AI (Artificial Intelligence) #Migration #ML Ops (Machine Learning Operations) #Data Engineering #Databricks #ML (Machine Learning)
Role description
Note: Candidate must have worked at one of these companies for more than 6 months: OpenAI / Anthropic / FAANG / DeepMind / Databricks / Snowflake
1- Senior AI Infra Engineer
You build and run GPU-backed training and inference systems for high-scale teams. You support workloads that run across large clusters under strict SLAs.
Key duties
• Deploy and manage Kubernetes clusters for GPU jobs.
• Build training stacks with Ray, Dask, Triton, and TorchServe.
• Run KServe for online and batch inference.
• Set up autoscaling for GPU pools.
• Improve throughput with scheduling and resource tuning.
• Build deployment workflows for research teams.
• Track uptime, failures, and performance trends.
• Improve reliability across regions and zones.
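The GPU-pool autoscaling duty above can be sketched with the proportional rule the Kubernetes Horizontal Pod Autoscaler applies to replicas; the function name, pool bounds, and example numbers here are illustrative, not from the posting:

```python
import math

def desired_gpu_nodes(current_nodes: int, current_util: float, target_util: float,
                      min_nodes: int = 1, max_nodes: int = 64) -> int:
    """Return the node count that brings GPU utilization back toward target.

    Same proportional rule the Kubernetes HPA uses for pod replicas:
    desired = ceil(current * currentUtil / targetUtil), clamped to pool bounds.
    """
    desired = math.ceil(current_nodes * current_util / target_util)
    return max(min_nodes, min(max_nodes, desired))

# A pool of 8 nodes at 90% utilization with a 60% target scales up to 12.
print(desired_gpu_nodes(8, 0.90, 0.60))  # -> 12
```

In practice the clamp bounds come from the cluster autoscaler's node-group configuration, and the utilization signal from GPU metrics (e.g. DCGM exporters), but the scaling arithmetic is this simple.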
Required experience
• Seven to twelve years in infra and distributed systems.
• Strong work with GPUs, container platforms, and cluster tuning.
• Deep experience with Kubernetes, Ray, KServe, Dask, Triton, and TorchServe.
• Background from OpenAI, Anthropic, Databricks, Meta Infra, or similar scale groups.
Engagement: Contract or C2H.
2- ML Ops Engineer
You run the full ML operations layer from data to deployment. You support fast model delivery, testing, and monitoring across multiple teams.
Key duties
• Build ML pipelines with MLflow and Airflow.
• Set up CI and CD for model training, testing, and deployment.
• Operate SageMaker, Vertex, or BentoML for production traffic.
• Manage model registry, versioning, and lineage.
• Build monitoring for drift, latency, and accuracy.
• Improve deployment speed and reduce rollout failures.
• Support research teams with repeatable training flows.
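The drift-monitoring duty above is often implemented with a population stability index (PSI) over binned score distributions. A minimal stdlib sketch, with made-up bin fractions as the example data:

```python
import math

def population_stability_index(expected: list, actual: list) -> float:
    """PSI between two binned distributions (bin fractions summing to ~1).

    Common rule of thumb: < 0.1 stable, 0.1-0.25 moderate drift, > 0.25 major drift.
    """
    psi = 0.0
    for e, a in zip(expected, actual):
        e = max(e, 1e-6)  # floor empty bins to avoid log(0)
        a = max(a, 1e-6)
        psi += (a - e) * math.log(a / e)
    return psi

baseline = [0.25, 0.25, 0.25, 0.25]   # training-time score distribution
today    = [0.10, 0.20, 0.30, 0.40]   # production scores, same bins
print(round(population_stability_index(baseline, today), 3))  # -> 0.228
```

A production version would compute the bins from the model registry's training snapshot and emit the PSI as a metric for alerting, but the statistic itself is this small.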
Required experience
• Five to ten years in ML operations.
• Strong work with MLflow, Airflow, SageMaker, Vertex, or BentoML.
• Experience from applied AI startups or high volume ML groups.
Engagement: Contract.
3- Data Infra and AI Pipeline Engineer
You build and operate data systems that power ML training, real-time inference, and analytics. You keep data fresh, accurate, and delivered at scale.
Key duties
• Build ETL and ELT flows with Spark and Kafka.
• Manage streaming and batch pipelines for high throughput loads.
• Operate Delta Lake and Snowflake for large datasets.
• Build ingestion layers for structured and unstructured data.
• Improve pipeline reliability, backpressure handling, and error recovery.
• Track schema changes and data quality signals.
• Support ML teams with well modeled datasets.
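Tracking schema changes, as the duties above call for, usually reduces to diffing a {column: type} map between pipeline runs. A sketch with hypothetical column names; real systems would pull these maps from a catalog or Delta Lake table metadata:

```python
def schema_diff(old: dict, new: dict) -> dict:
    """Compare two {column: type} schemas; report added, dropped, and retyped columns."""
    added = {k: new[k] for k in new.keys() - old.keys()}
    dropped = {k: old[k] for k in old.keys() - new.keys()}
    changed = {k: (old[k], new[k]) for k in old.keys() & new.keys() if old[k] != new[k]}
    return {"added": added, "dropped": dropped, "changed": changed}

v1 = {"user_id": "bigint", "event_ts": "timestamp", "amount": "double"}
v2 = {"user_id": "bigint", "event_ts": "string", "amount": "double", "region": "string"}
print(schema_diff(v1, v2))
```

A type change like `event_ts` going from timestamp to string is exactly the kind of silent break downstream ML jobs need alerted, not discovered at training time.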
Required experience
• Six to twelve years in data infra or data engineering.
• Strong work with Spark, Kafka, Delta Lake, Snowflake, and Databricks (DBX).
• Experience in data heavy startups or enterprise AI groups.
Engagement: Contract.
4- Systems and GPU Performance Engineer
You improve GPU training speed across clusters. You inspect low-level code paths and fix performance issues that slow down training or inference.
Key duties
• Profile CUDA kernels and NCCL collectives.
• Improve scaling across multi-GPU and multi-node setups.
• Tune PyTorch internals for stable large-batch training.
• Identify bottlenecks in memory, compute, and network paths.
• Build tests for throughput, latency, and error cases.
• Work with research teams to speed up model experiments.
• Track improvements with benchmarks and reproducible tests.
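The reproducible-benchmark duty above generally needs a harness that discards warmup iterations (JIT compilation, cache fill, CUDA context creation) before timing. A stdlib-only sketch of that shape; a real GPU harness would also synchronize the device before reading the clock:

```python
import statistics
import time

def benchmark(fn, *, warmup: int = 3, iters: int = 10) -> dict:
    """Time fn() after warmup runs; report median latency and derived throughput."""
    for _ in range(warmup):
        fn()  # absorb one-off startup costs so timed runs are steady-state
    times = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        times.append(time.perf_counter() - start)
    median = statistics.median(times)
    return {"median_s": median, "throughput_per_s": 1.0 / median}

result = benchmark(lambda: sum(range(100_000)))
print(result)
```

Median rather than mean keeps a single scheduler hiccup from distorting the trend line across benchmark runs.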
Required experience
• Seven to fifteen years in HPC or deep learning systems.
• Strong work with CUDA, NCCL, and PyTorch internals.
• Background from research labs or advanced ML groups.
Engagement: Project.
5- Platform Engineer for LLM Workloads
You build and maintain the core platform used by LLM engineering teams. You support fast environment setup, stable deployments, and strong observability.
Key duties
• Build Terraform- and Helm-based automation for infra.
• Manage Kubernetes clusters for training, data, and inference.
• Run observability with Prometheus, Grafana, and custom alerts.
• Improve developer workflows with templates and CI flows.
• Set up secure access to GPU pools and shared services.
• Track cluster usage, cost, and resource pressure across teams.
• Support platform rollouts, migrations, and upgrades.
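The per-team usage and cost tracking mentioned above is, at its core, an aggregation over GPU-hour records. A sketch with hypothetical team names and an assumed flat rate (not a real price); in practice the records would come from a metrics store such as Prometheus:

```python
from collections import defaultdict

# Hypothetical usage records: (team, gpu_hours) pulled from cluster metrics.
USAGE = [("research", 120.0), ("serving", 80.0), ("research", 40.0), ("data", 10.0)]
RATE_PER_GPU_HOUR = 2.50  # assumed blended rate for illustration only

def cost_by_team(usage, rate):
    """Aggregate GPU hours per team and convert to cost at a flat rate."""
    hours = defaultdict(float)
    for team, gpu_hours in usage:
        hours[team] += gpu_hours
    return {team: round(h * rate, 2) for team, h in hours.items()}

print(cost_by_team(USAGE, RATE_PER_GPU_HOUR))
```

Surfacing these totals per team is what turns raw cluster telemetry into the resource-pressure and chargeback reports the role describes.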
Required experience
• Six to twelve years in platform engineering.
• Strong work with Terraform, Helm, Kubernetes, Prometheus, Grafana.
• Experience from infra heavy AI groups.
Engagement: Contract or FTE.






