

STEM Sync AI
Agentic Workflow Evaluation Consultant | Remote
⭐ - Featured Role | Apply direct with Data Freelance Hub
This role is for an "Agentic Workflow Evaluation Consultant" on a remote W2 contract requiring 30+ hours/week, with a referral bonus of up to $1,920. Candidates must be current or retired professors, or PhD students or candidates, in STEM or quantitative fields, with hands-on Python skills; prior model evaluation experience is a strong plus.
🌎 - Country
United States
💱 - Currency
$ USD
💰 - Day rate
880
🗓️ - Date
May 28, 2026
🕒 - Duration
Unknown
🏝️ - Location
Remote
📄 - Contract
W2 Contractor
🔒 - Security
Unknown
📍 - Location detailed
United States
🧠 - Skills detailed
#Compliance #Python #Data Science #Model Evaluation #Mathematics #Statistics #GitHub #ML (Machine Learning)
Role description
Frontier Model Evaluator (Academic & Domain Expert) Remote | W2 Contract | Up to $1,920 Referral Bonus | 30+ hrs/week
Quick Snapshot
• Embedded within a leading frontier-model lab's GenAI team, working directly on benchmark design and model evaluation for cutting-edge LLM development
• Design and validate real-world, domain-specific agentic tasks with executable Python test suites to surface reasoning and problem-solving failures in target models
• Analyze model and agent behavior to classify failure types, distinguishing logical reasoning gaps from other performance issues
• Open to professors, retired academics, and PhD candidates across STEM, finance, law, economics, business, and quantitative disciplines
• W2 employment through an established enterprise staffing partner: a structured role with payroll, benefits, and compliance support
• Minimum 30 hours/week commitment during weekdays; work is remote and task-driven, suited to researchers with flexible schedules
• Referral program available: earn up to $1,920 per successful referral, with no cap on referrals
Requirements
• Current or retired professor, or PhD student (or candidate) in a STEM field (ML, CS, mathematics, physics, engineering, statistics, biology, chemistry, data science) or quantitative/professional domain (finance, economics, law, accounting, business)
• Degree or PhD in progress from a top-tier university in your field
• Hands-on Python proficiency demonstrated through research, industry work, GitHub projects, or coursework; theoretical familiarity alone does not qualify
• Ability to design rigorous, real-world domain problems targeting specific capability gaps in large language models or agentic systems
• Build complete task specifications including golden solutions and executable test cases within an agentic development environment
• Evaluate model outputs systematically and classify failure modes with precision
• Prior experience in model evaluation, data annotation, or LLM/agent training is a strong plus
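As a rough illustration of what "golden solutions and executable test cases" might look like in practice, here is a minimal sketch. It is not the lab's actual task format: the `Task` structure, the `Failure` labels, the compound-interest problem, and the `evaluate` helper are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List, Tuple


class Failure(Enum):
    """Hypothetical failure labels; a real taxonomy would be richer."""
    NONE = "pass"
    RUNTIME_ERROR = "runtime_error"  # candidate raised an exception
    WRONG_ANSWER = "wrong_answer"    # candidate ran but disagrees with the golden solution


@dataclass
class Task:
    """Hypothetical task specification: prompt, golden solution, test inputs."""
    prompt: str
    golden: Callable[[float, float, int], float]
    cases: List[Tuple[float, float, int]]


def golden_future_value(principal: float, annual_rate: float, years: int) -> float:
    """Golden (reference) solution: interest compounded once per year."""
    return principal * (1 + annual_rate) ** years


def evaluate(task: Task, candidate: Callable[..., float], tol: float = 1e-9) -> List[Failure]:
    """Run the candidate on every test case and label each outcome."""
    labels = []
    for args in task.cases:
        expected = task.golden(*args)
        try:
            got = candidate(*args)
        except Exception:
            labels.append(Failure.RUNTIME_ERROR)
            continue
        # Relative tolerance, so large expected values are not penalized for rounding.
        ok = abs(got - expected) <= tol * max(1.0, abs(expected))
        labels.append(Failure.NONE if ok else Failure.WRONG_ANSWER)
    return labels


task = Task(
    prompt="Compute the future value of a deposit with annual compounding.",
    golden=golden_future_value,
    cases=[(1000.0, 0.05, 0), (1000.0, 0.05, 1), (1000.0, 0.05, 10)],
)


def simple_interest(principal: float, annual_rate: float, years: int) -> float:
    """Stand-in for a model output that confuses simple and compound interest."""
    return principal * (1 + annual_rate * years)


labels = evaluate(task, simple_interest)
```

In this sketch the flawed candidate agrees with the golden solution for 0 and 1 years and diverges only at 10 years, which is why a test suite needs cases that actually separate the intended reasoning from a plausible shortcut.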
Use Easy Apply to proceed.