Great Value Hiring

CUDA Kernel Optimizer - ML Engineer

⭐ - Featured Role | Apply directly with Data Freelance Hub
This role is for a CUDA Kernel Optimizer - ML Engineer, offering $120-$250/hr for a fully remote contract. Ideal candidates have deep CUDA expertise, GPU architecture knowledge, and experience with performance profiling, especially in deep learning contexts.
🌎 - Country
United States
💱 - Currency
$ USD
-
💰 - Rate
$120-$250/hr
-
🗓️ - Date
November 8, 2025
🕒 - Duration
Unknown
-
🏝️ - Location
Remote
-
📄 - Contract
Unknown
-
🔒 - Security
Unknown
-
📍 - Location detailed
United States
-
🧠 - Skills detailed
#Deep Learning #Programming #TensorFlow #PyTorch #ML (Machine Learning) #Scala #GitHub
Role description
CUDA Kernel Optimizer - ML Engineer [$120-$250/hr]
As a referral partner, we are seeking advanced CUDA experts who specialize in GPU kernel optimization, performance profiling, and numerical efficiency. These professionals possess a deep mental model of how modern GPU architectures execute deep learning workloads, and they are comfortable translating algorithmic concepts into finely tuned kernels that maximize throughput while maintaining correctness and reproducibility.
Key Responsibilities
• Develop, tune, and benchmark CUDA kernels for tensor and operator workloads
• Optimize for occupancy, memory coalescing, instruction-level parallelism, and warp scheduling
• Profile and diagnose performance bottlenecks using Nsight Systems, Nsight Compute, and comparable tools
• Report performance metrics, analyze speedups, and propose architectural improvements
• Collaborate asynchronously with PyTorch Operator Specialists to integrate kernels into production frameworks
• Produce well-documented, reproducible benchmarks and performance write-ups
Ideal Qualifications
• Deep expertise in CUDA programming, GPU architecture, and memory optimization
• Proven ability to achieve quantifiable performance improvements across hardware generations
• Proficiency with mixed precision, Tensor Core usage, and low-level numerical stability considerations
• Familiarity with frameworks such as PyTorch, TensorFlow, or Triton (beneficial but not required)
• Strong communication skills and independent problem-solving ability
• Demonstrated open-source, research, or performance benchmarking contributions
More About the Opportunity
• Ideal for candidates who thrive in performance-critical, systems-level work
• Engagements focus on measurable, high-impact kernel optimizations and scalability studies
• Work is fully remote and asynchronous; deliverables are outcome-driven
Application Process
• Submit a brief overview of prior CUDA optimization experience, profiling results, or performance reports
• Include links to relevant GitHub repos, papers, or benchmarks if available
• Indicate your hourly rate, time availability, and preferred engagement length
• Selected experts may complete a small, paid pilot kernel optimization project