

Teamware Solutions
LLM Evaluation Engineer
⭐ - Featured Role | Apply direct with Data Freelance Hub
This role is for an LLM Evaluation Engineer working remotely for 12+ months at competitive pay. Key skills include LLMs, AI evaluation methodologies, Python, and hands-on experience with evaluation tools. A strong understanding of AI safety and bias testing is essential.
🌎 - Country
United States
💱 - Currency
€ EUR
-
💰 - Day rate
Unknown
-
🗓️ - Date
December 17, 2025
🕒 - Duration
More than 6 months
-
🏝️ - Location
Remote
-
📄 - Contract
Unknown
-
🔒 - Security
Unknown
-
📍 - Location detailed
United States
-
🧠 - Skills detailed
#Data Analysis #AI (Artificial Intelligence) #API (Application Programming Interface) #Programming #Automation #Security #Python #Datasets #Batch
Role description
LLM Evaluation Engineer
Location: Remote
Duration: 12+ Months
Required Skills
• Strong understanding of LLMs and generative AI concepts, including model behavior and output evaluation
• Experience with AI evaluation and benchmarking methodologies, including baseline creation and model comparison
• Hands-on expertise in Eval testing, creating structured test suites to measure accuracy, relevance, safety, and performance
• Ability to define and apply evaluation metrics (precision/recall, BLEU/ROUGE, F1, hallucination rate, latency, cost per output); see the sketch after this list
• Prompt engineering and prompt testing experience across zero-shot, few-shot, and system prompt scenarios
• Python and other programming languages for automation, data analysis, batch evaluation execution, and API integration
• Experience with evaluation tools/frameworks (OpenAI Evals, HuggingFace evals, Promptfoo, Ragas, DeepEval, LM Eval Harness)
• Ability to create datasets, test cases, benchmarks, and ground truth references for consistent scoring
• Test design and test automation experience, including reproducible evaluation pipelines
• Knowledge of AI safety, bias, security testing, and hallucination analysis
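As a rough illustration of the metrics and batch-evaluation work described above, the Python sketch below scores a small test set against ground-truth references and reports exact-match accuracy plus a crude hallucination-rate proxy. This is a minimal sketch, not part of the posting or any specific framework; `call_model` and `test_cases.json` are hypothetical placeholders for the model endpoint and dataset under evaluation.

```python
# Illustrative batch-evaluation sketch; call_model and test_cases.json are hypothetical.
import json
from statistics import mean


def call_model(prompt: str) -> str:
    """Placeholder for the LLM API call under test (e.g., an HTTP client wrapper)."""
    raise NotImplementedError("wire this up to the model endpoint being evaluated")


def exact_match(prediction: str, reference: str) -> bool:
    """Strict accuracy check: normalized prediction equals the ground-truth reference."""
    return prediction.strip().lower() == reference.strip().lower()


def possible_hallucination(prediction: str, supporting_facts: list[str]) -> bool:
    """Crude proxy: flag outputs that mention none of the expected reference facts."""
    text = prediction.lower()
    return not any(fact.lower() in text for fact in supporting_facts)


def run_eval(test_cases: list[dict]) -> dict:
    """Each test case is {"prompt": str, "reference": str, "facts": [str, ...]}."""
    rows = []
    for case in test_cases:
        output = call_model(case["prompt"])
        rows.append({
            "prompt": case["prompt"],
            "output": output,
            "exact_match": exact_match(output, case["reference"]),
            "hallucination": possible_hallucination(
                output, case.get("facts", [case["reference"]])
            ),
        })
    return {
        "accuracy": mean(r["exact_match"] for r in rows),
        "hallucination_rate": mean(r["hallucination"] for r in rows),
        "details": rows,  # keep per-case rows so failures are inspectable and reruns reproducible
    }


if __name__ == "__main__":
    with open("test_cases.json") as f:
        summary = run_eval(json.load(f))
    print(f"accuracy={summary['accuracy']:.3f} "
          f"hallucination_rate={summary['hallucination_rate']:.3f}")
```

Frameworks named above (for example Promptfoo, DeepEval, or LM Eval Harness) generally package this same dataset-in, scores-out pattern with richer metrics, assertions, and reporting.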