GPU Kernel Engineer Consultant US

⭐ - Featured Role | Apply directly with Data Freelance Hub
This role is for a GPU Kernel Engineer Consultant (2–3 months, remote) focused on optimizing Transformer model training efficiency. Key skills include CUDA/Triton kernel optimization, GPU architecture knowledge, and experience with FlashAttention. Open-source contributions are highly desirable.
🌎 - Country
United States
💱 - Currency
$ USD
💰 - Day rate
-
🗓️ - Date discovered
August 28, 2025
🕒 - Project duration
1 to 3 months
🏝️ - Location type
Remote
📄 - Contract type
Unknown
🔒 - Security clearance
Unknown
📍 - Location detailed
United States
🧠 - Skills detailed
#ETL (Extract, Transform, Load) #ML (Machine Learning) #PyTorch #Deep Learning #AI (Artificial Intelligence) #Transformers #C++ #Consulting #GitHub
Role description
Giotto.ai is a Swiss research lab focused on advancing Artificial General Intelligence (AGI). We are best known for our work on the ARC-AGI benchmark, where we currently hold the top position in the Kaggle 2025 competition. Our mission is to push the boundaries of efficient training and to contribute cutting-edge AI research to the open-source community.

Role Overview
We are seeking a GPU Kernel Engineer (Consultant) to help us optimize the training efficiency of Transformer models. This is a short-term consulting engagement (initially 2–3 months, with a possible extension) in which you will work closely with our research team to design, implement, and benchmark high-performance GPU kernels that significantly accelerate training throughput while maintaining model quality.

Key Responsibilities
• Implement and optimize custom attention kernels for Transformers.
• Adapt and integrate FlashAttention/FlexAttention.
• Develop and optimize fused GPU kernels (RMSNorm, CrossEntropy, matmul-bias-activation).
• Optimize kernel performance for mixed-precision training (bf16/fp16, optionally fp8 on H100).
• Profile and debug GPU performance bottlenecks using Nsight Systems, Nsight Compute, or the PyTorch Profiler.
• Collaborate with our ML Systems Engineers to ensure functional equivalence of optimized kernels with baseline implementations.
• Provide knowledge transfer and best practices to the Giotto.ai research team.

Qualifications
• Proven experience writing and optimizing CUDA or Triton kernels for deep learning workloads.
• Strong understanding of GPU architectures (memory hierarchy, warp scheduling, tensor cores).
• Hands-on experience with FlashAttention, FlexAttention, xFormers, or similar optimized attention implementations.
• Familiarity with PyTorch internals (autograd, C++/CUDA extensions, torch.compile) is a strong plus.
• Solid understanding of mixed-precision training and numerical-stability trade-offs.
• Track record of open-source contributions or publications in ML systems is highly desirable.

Engagement Details
• Duration: Initial 2–3 months (extension possible).
• Location: Remote, with flexible working hours.
• Commitment: Part-time or full-time consulting engagement, depending on availability.
• Compensation: Competitive consulting rate (hourly or project-based, to be discussed).

Why Join Giotto.ai
• Work at the frontier of efficient AGI research.
• Collaborate with a small, high-impact team of researchers and engineers.
• Directly influence the training efficiency of cutting-edge long-context Transformer models.
• Flexible, research-oriented environment with opportunities for long-term collaboration.

How to Apply
If you are interested, please share:
• A brief description of your relevant experience.
• Links to any open-source contributions (GitHub, papers, blog posts).
• Your availability and consulting rates.
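As a small illustration of the mixed-precision numerical-stability trade-offs the role mentions (not part of the listing itself, and using hypothetical function names): in fp16, whose largest representable value is 65504, a naively computed softmax overflows for logits around 12 (exp(12) ≈ 162754), while the standard max-subtraction rewrite keeps every exponent ≤ 0 and stays finite. A minimal NumPy sketch:

```python
import numpy as np

def naive_softmax(x):
    """Softmax computed directly; exp() can overflow in low precision."""
    e = np.exp(x)
    return e / e.sum()

def stable_softmax(x):
    """Subtract the max first so every exponent is <= 0 and cannot overflow."""
    e = np.exp(x - x.max())
    return e / e.sum()

# Logits that are harmless in fp32 but overflow exp() in fp16,
# since exp(12) exceeds the fp16 maximum of 65504.
logits = np.array([12.0, 11.0, 10.0], dtype=np.float16)

print(naive_softmax(logits))   # contains inf/nan: exp(12.0) overflows fp16
print(stable_softmax(logits))  # finite probabilities summing to ~1
```

Fused attention kernels such as FlashAttention apply the same max-subtraction idea in an online, tile-by-tile form, which is one reason verifying functional equivalence against a baseline implementation matters at low precision.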