GPU Kernel Engineer Consultant US

⭐ - Featured Role | Apply directly with Data Freelance Hub
This role is for a GPU Kernel Engineer Consultant (2–3 months, remote) focused on optimizing Transformer model training efficiency. Key skills include CUDA/Triton kernel optimization, GPU architecture knowledge, and experience with FlashAttention. Open-source contributions are highly desirable.
🌎 - Country
United States
💱 - Currency
$ USD
💰 - Day rate
-
🗓️ - Date discovered
August 28, 2025
🕒 - Project duration
1 to 3 months
🏝️ - Location type
Remote
📄 - Contract type
Unknown
🔒 - Security clearance
Unknown
📍 - Location detailed
United States
🧠 - Skills detailed
#ETL (Extract, Transform, Load) #ML (Machine Learning) #PyTorch #Deep Learning #AI (Artificial Intelligence) #Transformers #C++ #Consulting #GitHub
Role description
Giotto.ai is a Swiss research lab focused on advancing Artificial General Intelligence (AGI). We are best known for our work on the ARC-AGI benchmark, where we currently hold the top position in the Kaggle 2025 competition. Our mission is to push the boundaries of efficient training and to contribute cutting-edge AI research to the open-source community.

Role Overview
We are seeking a GPU Kernel Engineer (Consultant) to help us optimize the training efficiency of Transformer models. This is a short-term consulting engagement (initially 2–3 months, with a possible extension) in which you will work closely with our research team to design, implement, and benchmark high-performance GPU kernels that significantly accelerate training throughput while maintaining model quality.

Key Responsibilities
• Implement and optimize custom attention kernels for Transformers.
• Adapt and integrate FlashAttention/FlexAttention.
• Develop and optimize fused GPU kernels (RMSNorm, CrossEntropy, matmul-bias-activation).
• Optimize kernel performance for mixed-precision training (bf16/fp16, optionally fp8 on H100).
• Profile and debug GPU performance bottlenecks using Nsight Systems, Nsight Compute, or the PyTorch Profiler.
• Collaborate with our ML Systems Engineers to ensure functional equivalence of optimized kernels with baseline implementations.
• Provide knowledge transfer and best practices to the Giotto.ai research team.

Qualifications
• Proven experience writing and optimizing CUDA or Triton kernels for deep learning workloads.
• Strong understanding of GPU architectures (memory hierarchy, warp scheduling, tensor cores).
• Hands-on experience with FlashAttention, FlexAttention, xFormers, or similar optimized attention implementations.
• Familiarity with PyTorch internals (autograd, C++/CUDA extensions, torch.compile) is a strong plus.
• Solid understanding of mixed-precision training and numerical-stability trade-offs.
• Track record of open-source contributions or publications in ML systems is highly desirable.

Engagement Details
• Duration: Initial 2–3 months (extension possible).
• Location: Remote, with flexible working hours.
• Commitment: Part-time or full-time consulting engagement, depending on availability.
• Compensation: Competitive consulting rate (hourly or project-based, to be discussed).

Why Join Giotto.ai
• Work at the frontier of efficient AGI research.
• Collaborate with a small, high-impact team of researchers and engineers.
• Directly influence the training efficiency of cutting-edge long-context Transformer models.
• Flexible, research-oriented environment with opportunities for long-term collaboration.

How to Apply
If you are interested, please share:
• A brief description of your relevant experience.
• Links to any open-source contributions (GitHub, papers, blog posts).
• Your availability and consulting rates.
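As a small illustration of the mixed-precision numerical-stability trade-offs the role mentions (not part of the listing itself, and using hypothetical function names): in fp16, whose largest representable value is 65504, a naively computed softmax overflows for logits around 12 (exp(12) ≈ 162754), while the standard max-subtraction rewrite keeps every exponent ≤ 0 and stays finite. A minimal NumPy sketch:

```python
import numpy as np

def naive_softmax(x):
    """Softmax computed directly; exp() can overflow in low precision."""
    e = np.exp(x)
    return e / e.sum()

def stable_softmax(x):
    """Subtract the max first so every exponent is <= 0 and cannot overflow."""
    e = np.exp(x - x.max())
    return e / e.sum()

# Logits that are harmless in fp32 but overflow exp() in fp16,
# since exp(12) exceeds the fp16 maximum of 65504.
logits = np.array([12.0, 11.0, 10.0], dtype=np.float16)

print(naive_softmax(logits))   # contains inf/nan: exp(12.0) overflows fp16
print(stable_softmax(logits))  # finite probabilities summing to ~1
```

Fused attention kernels such as FlashAttention apply the same max-subtraction idea in an online, tile-by-tile form, which is one reason verifying functional equivalence against a baseline implementation matters at low precision.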