

Evals Demonstration Engineer (Contract)
This role is an Evals Demonstration Engineer (Contract) for 6 months in London, paying £7,500 GBP per month. Key skills include Python programming, policy communication, and experience with AI/ML frameworks. Familiarity with Inspect is required; government experience is a plus.
🌎 - Country
United Kingdom
💱 - Currency
£ GBP
💰 - Day rate
357.14
🗓️ - Date discovered
August 10, 2025
🕒 - Project duration
6 months, with possibility of extension
🏝️ - Location type
On-site
📄 - Contract type
Unknown
🔒 - Security clearance
Unknown
📍 - Location detailed
London, England, United Kingdom
🧠 - Skills detailed
#AI (Artificial Intelligence) #Model Evaluation #Workday #Python #Base #ML (Machine Learning) #Documentation #Deployment #Visualization #Programming
Role description
Application deadline: We accept submissions until 10 Sep 2025. We review applications on a rolling basis and encourage early submissions.
The Opportunity
We're seeking an Evals Demonstration Engineer who will design new demonstrations and translate our technical evaluation findings into compelling and accessible demonstrations for policymakers and non-technical stakeholders (e.g. government officials, think tanks).
This role offers the opportunity to shape how frontier AI risks and our work are communicated to those with the power to address them. You will also act as the bridge between Apollo’s internal technical research and governance teams. This role requires a unique mix of technical understanding, policy acumen, creativity and communication skills.
This is a 6-month full-time contract role. Depending on your performance and the needs of Apollo, it might transition into a permanent role.
Key Responsibilities
• Build a library of reusable self-contained demos using the Inspect framework and our internal code base
• Develop interactive and effective demonstrations in the right medium (web-based and interactive, visual, video, report, etc.) that clearly communicate our evaluation findings and the AI capabilities and risks that we care about
• Here are examples of demos/visuals that we generally like: Anthropic Interpretability, Anthropic persona vectors, 3Blue1Brown, AI2027 video, OWID graphs, EPOCH graphs
• The visualizations will likely mostly revolve around transcripts rather than graphs. For example, these may look like better versions of our in-context scheming snippets
• Deliver live demonstrations and presentations to policymakers and non-technical stakeholders with clarity and coherence
• Create clear documentation and guides to enable Apollo team members to present demos effectively
• Rapidly prototype and iterate using evidence-based methods, e.g. user testing, analytics and stakeholder feedback
• Collaborate with internal research and governance teams to ensure that the content produced accurately represents our work and concerns and is tailored to our specific policy audience and objectives
Job Requirements
• Proven ability to choose and execute technical demonstrations in the right medium (web-based and interactive, visual, video, report etc.) based on audience needs and measured effectiveness
• Solid Python programming skills (in order to run evaluations in Inspect and modify tasks to fit demo needs)
• Exceptional verbal and written communication skills, specifically the ability to explain complex concepts simply
• Experience working with or presenting to policymakers and non-technical audiences, with awareness of policy communication requirements and decision-making
• Familiarity with Inspect (ability to run and modify evaluations and build small agent evaluations in Inspect)
• Proven ability to measure what resonates with audiences and to systematically iterate on content based on that evidence
• Self-directed work style with ability to execute independently on projects
• Ability to travel occasionally to key policy locations (e.g., Washington D.C., Brussels, London)
Nice to haves
• Previous experience with AI/ML evaluation frameworks
• Prior experience working in a government agency, think tank or policy organization
• Background in user experience (UX) design
We want to emphasize that people who feel they don’t fulfill all of these characteristics but think they would be a good fit for the position nonetheless are strongly encouraged to apply. We believe that excellent candidates can come from a variety of backgrounds and are excited to give you opportunities to shine.
Representative Projects
• Create at least 5 high-impact demonstrations used in meetings with policymakers and non-technical stakeholders
• Build a comprehensive library of demonstrations covering our key evaluation findings
• Publish 2 standalone blog posts highlighting and explaining some of our most important findings to a less technical audience.
About the team
• You will work both with the evals research and the governance teams. Marius Hobbhahn manages and advises the Evals team and Charlotte Stix leads the governance team.
• You can find our full team here.
Logistics
• Contract Duration: 6 months with possibility of extension and conversion to permanent
• Compensation: £7,500 GBP per month (approximately $10,000 USD)
• Start Date: Target of 2-3 months after the first interview
• Location: The office is in London, and the building is shared with the London Initiative for Safe AI (LISA) offices. This is an in-person role.
• Work Visas: Due to the current short-term nature of the role, we are prioritising candidates who have the right to work in the UK. If you think you have an exceptional profile but don't have the right to work, please apply anyway.
Benefits
• Flexible work hours and schedule
• Lunch, dinner, and snacks are provided for all employees on workdays
• Paid work trips, including staff retreats, business trips, and relevant conferences
• Private medical insurance
• Statutory benefits apply
• Potential pathway to full-time employment
• Opportunity to work on cutting-edge AI safety research
• Collaborative environment with leading researchers
• Central London location
About Apollo
• The capabilities of current AI systems are evolving at a rapid pace. While these advancements offer tremendous opportunities, they also present significant risks, such as the potential for deliberate misuse or the deployment of sophisticated yet misaligned models.
• At Apollo Research, our primary concern lies with deceptive alignment, a phenomenon where a model appears to be aligned but is, in fact, misaligned and capable of evading human oversight.
• Our approach focuses on behavioral model evaluations, which we then use to audit real-world models. In our evaluations, we focus on LM agents, i.e. LLMs with agentic scaffolding similar to AIDE or SWE agent.
• At Apollo, we aim for a culture that emphasizes truth-seeking, being goal-oriented, giving and receiving constructive feedback, and being friendly and helpful. If you’re interested in more details about what it’s like working at Apollo, you can find more information here.
Equality Statement: Apollo Research is an Equal Opportunity Employer. We value diversity and are committed to providing equal opportunities to all, regardless of age, disability, gender reassignment, marriage and civil partnership, pregnancy and maternity, race, religion or belief, sex, or sexual orientation.
Our streamlined interview process includes:
• Application review with detailed questionnaire
• Screening call (30 minutes)
• Work test (3 hours): Create a 3-minute demonstration from a provided evaluation (screen recording required)
• Technical interview (60 minutes) with Charlotte (Head of AI Governance)
• Final interview (30 minutes) with Marius (CEO)