Generative Evaluation AI Engineer (AI, MLOps, GxP, HIPAA, Python, RAG, etc. Needed)

⭐ - Featured Role | Apply directly with Data Freelance Hub
This role is for a Generative Evaluation AI Engineer on a long-term contract in New Brunswick, NJ (hybrid/remote), offering competitive pay. Key skills include Python, MLOps, and regulatory compliance (GxP, HIPAA). Requires 2–5 years of ML evaluation experience in regulated domains.
🌎 - Country
United States
💱 - Currency
$ USD
-
💰 - Day rate
-
🗓️ - Date discovered
September 17, 2025
🕒 - Project duration
More than 6 months
-
🏝️ - Location type
Hybrid
-
📄 - Contract type
W2 Contractor
-
🔒 - Security clearance
Unknown
-
📍 - Location detailed
New Brunswick, NJ
-
🧠 - Skills detailed
#Grafana #NLTK (Natural Language Toolkit) #Docker #GIT #Langchain #BI (Business Intelligence) #NumPy #Datasets #ML (Machine Learning) #Microsoft Power BI #Scala #Batch #Monitoring #Microservices #FastAPI #Pandas #Azure #Automated Testing #Infrastructure as Code (IaC) #Data Science #AWS S3 (Amazon Simple Storage Service) #Azure DevOps #Libraries #Kubernetes #Airflow #AI (Artificial Intelligence) #S3 (Amazon Simple Storage Service) #GitHub #Computer Science #Regression #Compliance #Azure Data Factory #Databases #ADF (Azure Data Factory) #DevOps #JSON (JavaScript Object Notation) #Statistics #Prometheus #Python #MLflow #Deployment #Terraform #AWS (Amazon Web Services) #NLP (Natural Language Processing)
Role description
Job Details:
Job Title: GenAI Evaluation Engineer
Location: New Brunswick, NJ (Hybrid/Remote)
Duration: Long-term contract with the possibility of converting to full time
Pay: Available on W2 and C2C; USC/GC holders are preferred for eventual conversion, but all candidates with excellent communication skills and a strong background will be considered

Role Overview
You will own end-to-end evaluation for our frameworks. Your mission is to translate stakeholder needs from Regulatory, QA, and Data Science into rigorous, automated pipelines that guarantee model quality, safety, and regulatory compliance. You will develop robust microservices and Retrieval-Augmented Generation (RAG) pipelines that power a wide range of critical applications, from regulatory-grade document quality control to agentic assistants for sales and medical claims reasoning. This role is essential for creating reliable, scalable AI systems that augment the work of our subject matter experts and drive innovation across the organization.

Key Responsibilities

High-Throughput RAG Pipeline Development:
• Design and build scalable document processing pipelines to ingest and semantically chunk large batches of documents (PDF/DOCX) from sources like Azure Blob Storage and AWS S3.
• Integrate embedding models and tune vector databases like Milvus for high-performance, sub-100 ms k-NN retrieval.
• Implement hybrid retrieval systems using BM25 and vector search, and continually track and improve retrieval performance using metrics like MRR and recall.

Model Fine-Tuning & Prompt Engineering:
• Apply large language models (LLMs) and NLP techniques to solve complex problems such as named-entity recognition, question answering, and summarization.
• Build fine-tuning pipelines using frameworks like LoRA/PEFT and run hyperparameter sweeps in Azure ML.
• Author multi-step prompt chains, enforce structured JSON outputs, and use validation guards to reduce hallucinations and improve model consistency.

MLOps & Production Deployment:
• Develop and containerize agent-based microservices using frameworks like FastAPI or Azure Functions.
• Define Infrastructure as Code using Terraform/ARM and build CI/CD workflows in GitHub Actions for automated testing and canary rollouts.
• Implement robust monitoring and alerting for latency (p50/p95) and error rates using tools like Prometheus, Grafana, or Azure Monitor to ensure SLA compliance.

Requirements & Test Plan Design:
• Run workshops with stakeholders to codify acceptance criteria for applications like CMC report analysis and patient-safety monitoring.
• Draft detailed test cases (both positive and negative), define clear pass/fail thresholds, and maintain traceability matrices.

Automated Evaluation Pipelines:
• Implement classic NLP metrics (BLEU/ROUGE), semantic similarity measures, and custom hallucination detectors in Python (an illustrative sketch follows this list).
• Orchestrate evaluation pipelines in Azure Data Factory or Airflow, integrating with tools like Prodigy or LightTag for human-in-the-loop annotation.
• Develop qualitative coding schemas, author annotation guidelines, and ensure high inter-annotator agreement (Cohen’s κ ≥ 0.8).

Data Versioning & Drift Monitoring:
• Manage and version evaluation datasets using DVC/Git-LFS while tracking dataset lineage.
• Automate data and model drift detection using KS tests and embedding-based alerts, publishing weekly reports on findings.

Reporting & Governance:
• Create interactive Power BI or Grafana dashboards to report on SLA compliance (e.g., accuracy > 98%, hallucination rate < 2%), trends, and anomalies.
• Set up automated regression suites that block deployments if key metrics degrade beyond a set threshold (e.g., ±5%); an illustrative sketch appears at the end of this posting.
• Maintain detailed audit logs of evaluation runs and sign-offs to comply with GxP/GMP standards.
• Lead internal training sessions on evaluation best practices and mentor junior evaluators.
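As a point of reference for the evaluation stack named above, here is a minimal Python sketch of such a metric check: ROUGE via the Hugging Face evaluate library and inter-annotator agreement via Cohen's kappa. The sample data is made up, and the use of scikit-learn for the kappa computation is an assumption for illustration; the posting itself only names the accuracy, hallucination, and κ ≥ 0.8 targets.

    import evaluate
    from sklearn.metrics import cohen_kappa_score

    # Placeholder model outputs vs. gold references (illustrative only).
    predictions = ["the patient reported mild nausea", "dosage was increased to 10 mg"]
    references = ["the patient reported mild nausea", "the dosage was increased to 10 mg"]

    # Classic NLP metric: ROUGE scores via the Hugging Face evaluate library.
    rouge = evaluate.load("rouge")
    rouge_scores = rouge.compute(predictions=predictions, references=references)
    print(rouge_scores)

    # Inter-annotator agreement: Cohen's kappa between two annotators' labels
    # on the same items; the posting targets kappa >= 0.8.
    annotator_a = [1, 0, 1, 1, 0, 1]
    annotator_b = [1, 0, 1, 0, 0, 1]
    kappa = cohen_kappa_score(annotator_a, annotator_b)
    print(f"Cohen's kappa: {kappa:.2f}")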
Required Qualifications:
• BS/MS in Computer Science, Statistics, Engineering, or a related field.
• 2–5 years of experience in ML evaluation, QA engineering, or analytics, ideally in regulated domains.
• Proficiency in Python, pandas, and NumPy.
• Hands-on experience with evaluation libraries like NLTK, Hugging Face evaluate, and sacreBLEU.
• Strong statistical rigor, with a deep understanding of metrics like Cohen's kappa.
• Experience with BI/dashboarding tools (Power BI, Grafana) and data versioning tools (DVC, Git-LFS).
• 2+ years of experience building end-to-end LLM/RAG systems in a production environment.
• Deep Python experience, including libraries like FastAPI, pandas, and NumPy.
• Hands-on experience with LLM orchestration frameworks (LangChain, LlamaIndex), NLP libraries (Hugging Face), and OpenAI/Azure SDKs.
• Proven expertise in MLOps, including CI/CD (GitHub Actions/Azure DevOps) and containerization (Docker/Kubernetes).

Preferred Qualifications (Nice-to-Haves):
• Experience with MLflow for tracking experiments and metrics.
• Background in qualitative research methods like open/axial coding, especially in safety-critical settings.
• Expertise in regulatory compliance and audits for standards like HIPAA or GxP (21 CFR Part 11).
• Experience working in a regulated industry such as pharmaceuticals or life sciences.
• Hands-on experience with vector databases like Milvus or Pinecone.
• Familiarity with chatbot frameworks like Rasa or Botpress.
• Experience with data-centric AI tools for validation and monitoring, such as Great Expectations or Deepchecks.
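Similarly, the drift-monitoring and regression-gate duties listed under Key Responsibilities can be pictured with a short illustrative sketch. The distributions, metric values, and the use of SciPy's two-sample KS test are assumptions made for illustration; only the ±5% degradation threshold comes from the posting.

    import numpy as np
    from scipy.stats import ks_2samp

    # Reference vs. current embedding/feature distributions (synthetic data).
    rng = np.random.default_rng(0)
    reference = rng.normal(loc=0.0, scale=1.0, size=5000)
    current = rng.normal(loc=0.3, scale=1.0, size=5000)

    # Two-sample Kolmogorov-Smirnov test for data drift.
    result = ks_2samp(reference, current)
    drift_detected = result.pvalue < 0.01
    print(f"KS statistic={result.statistic:.3f}, p-value={result.pvalue:.4f}")

    # Simple regression gate: flag a deployment if a key metric degrades
    # beyond the +/-5% threshold mentioned in the posting, or drift is seen.
    baseline_accuracy = 0.985
    candidate_accuracy = 0.972
    relative_change = (candidate_accuracy - baseline_accuracy) / baseline_accuracy
    blocked = abs(relative_change) > 0.05 or drift_detected
    print("Deployment blocked" if blocked else "Deployment allowed")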