

VeeAR Projects Inc.
Gen AI System Evaluations Engineer
⭐ - Featured Role | Apply directly with Data Freelance Hub
This role is for a "Gen AI System Evaluations Engineer" on a contract basis, offering a competitive pay rate. Key skills include AI expertise and Python proficiency. Experience in evaluating Generative AI systems and managing human annotation workflows is essential.
🌎 - Country
United States
💱 - Currency
$ USD
-
💰 - Day rate
Unknown
-
🗓️ - Date
February 6, 2026
🕒 - Duration
Unknown
-
🏝️ - Location
Unknown
-
📄 - Contract
Unknown
-
🔒 - Security
Unknown
-
📍 - Location detailed
Austin, TX
-
🧠 - Skills detailed
#AI (Artificial Intelligence) #Python
Role description
• Design, execute, and maintain evaluation frameworks for LLM responses to assess accuracy, relevance, safety, and overall quality (a minimal harness sketch appears after this list).
• Evaluate and benchmark Generative AI systems, including end-to-end GenAI pipelines, against defined performance metrics.
• Assess and validate Retrieval-Augmented Generation (RAG) systems, focusing on retrieval quality, grounding, and response faithfulness (see the grounding-check sketch below).
• Evaluate agentic systems, including multi-step reasoning, tool usage, and autonomous decision-making behaviors.
• Conduct systematic GenAI system evaluations, identifying failure modes, biases, and areas for model improvement.
• Perform and manage human annotation workflows, ensuring high-quality labeling, clear guidelines, and inter-annotator consistency (see the agreement-metric sketch below).
• Apply strong critical thinking to analyze model outputs, edge cases, and ambiguous scenarios with sound judgment.
• Collaborate with cross-functional teams (research, product, engineering) to define evaluation criteria and report actionable insights.
• (Nice to Have) Use Python to automate evaluation pipelines, analyze results, and generate performance reports.
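For illustration, here is a minimal sketch of the kind of LLM-response evaluation harness the first responsibility describes. It assumes rubric dimensions (accuracy, relevance, safety) rated 1-5 per response; the data structures, field names, and sample data are hypothetical, not part of the role.

```python
"""Minimal sketch of an LLM response evaluation harness.

Assumptions (illustrative only): each prompt/response pair carries
rubric scores on a 1-5 scale, assigned by a human rater or judge model.
"""
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class EvalRecord:
    prompt: str
    response: str
    scores: dict = field(default_factory=dict)  # dimension -> 1-5 rating


def summarize(records: list[EvalRecord]) -> dict:
    """Aggregate per-dimension mean scores across an evaluation run."""
    dimensions = {d for r in records for d in r.scores}
    return {d: mean(r.scores[d] for r in records if d in r.scores)
            for d in dimensions}


if __name__ == "__main__":
    run = [
        EvalRecord("What is RAG?", "Retrieval-Augmented Generation ...",
                   {"accuracy": 5, "relevance": 5, "safety": 5}),
        EvalRecord("Summarize the policy.", "The policy says ...",
                   {"accuracy": 3, "relevance": 4, "safety": 5}),
    ]
    print(summarize(run))  # e.g. {'accuracy': 4.0, 'relevance': 4.5, 'safety': 5.0}
```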
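The RAG responsibility centers on grounding and response faithfulness. The sketch below is a deliberately naive lexical-overlap proxy that flags answer sentences with little support in the retrieved context; the tokenization and threshold are assumptions made for this example, and production faithfulness checks more often use NLI models or LLM-as-judge scoring.

```python
"""Naive grounding check for RAG outputs: flags answer sentences whose
content-word overlap with the retrieved context falls below a threshold.
A rough proxy only, for illustration."""
import re


def _tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def ungrounded_sentences(answer: str, context: str, threshold: float = 0.5) -> list[str]:
    """Return answer sentences that look weakly grounded in the context
    (candidates for hallucination review)."""
    ctx = _tokens(context)
    flagged = []
    for sentence in re.split(r"(?<=[.!?])\s+", answer.strip()):
        toks = _tokens(sentence)
        if toks and len(toks & ctx) / len(toks) < threshold:
            flagged.append(sentence)
    return flagged


if __name__ == "__main__":
    context = "The invoice total was $420, due on March 3."
    answer = "The invoice total was $420. It was paid early with a 10% discount."
    print(ungrounded_sentences(answer, context))
    # -> ['It was paid early with a 10% discount.']
```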
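For the human-annotation responsibility, inter-annotator consistency is commonly summarized with Cohen's kappa. The pure-Python sketch below assumes two annotators labeling the same items with categorical labels; the labels and data are made up for illustration, and scikit-learn's cohen_kappa_score computes the same statistic.

```python
"""Inter-annotator agreement sketch: Cohen's kappa for two annotators
labeling the same items (e.g. 'safe' / 'unsafe')."""
from collections import Counter


def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    assert len(labels_a) == len(labels_b), "annotators must rate the same items"
    n = len(labels_a)
    # Observed agreement: fraction of items where both annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement under chance, from each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    a = ["safe", "safe", "unsafe", "safe", "unsafe", "safe"]
    b = ["safe", "unsafe", "unsafe", "safe", "unsafe", "safe"]
    print(round(cohens_kappa(a, b), 3))  # ~= 0.667
```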