Princeton University

Senior Data Engineer

⭐ - Featured Role | Apply directly with Data Freelance Hub
This role is for a Senior Data Engineer with a contract length of 3 months, offering a competitive pay rate. Key skills include Databricks, PySpark, Delta Lake, and large-scale ETL systems. Experience with cloud cost optimization and large datasets is required.
🌎 - Country
United States
💱 - Currency
$ USD
-
💰 - Day rate
Unknown
-
🗓️ - Date
December 5, 2025
🕒 - Duration
1 to 3 months
-
🏝️ - Location
Unknown
-
📄 - Contract
Unknown
-
🔒 - Security
Unknown
-
📍 - Location detailed
Atlanta Metropolitan Area
-
🧠 - Skills detailed
#PySpark #Scala #AutoScaling #DevOps #Metadata #Monitoring #Documentation #Spark (Apache Spark) #Cloud #Azure #Azure DevOps #Storage #Regression #Delta Lake #Data Management #GitHub #Libraries #Deployment #Databricks #Data Quality #Data Access #Python #Leadership #ETL (Extract, Transform, Load) #Compliance #Observability #GitLab #Data Engineering #Data Architecture #Datasets
Role description
1. Overview

The Princeton School of Public and International Affairs seeks proposals for a full-time Senior Data Engineer to support the data engineering backbone of the Accelerator initiative. The engineer will work across our Databricks-based platform, designing new pipelines, improving existing ones, optimizing performance, and driving down compute/storage costs through architectural decisions. This role will also support maintenance and ongoing enhancement of large text-based datasets (~25 TB uncompressed), alongside our primary social-media ingestion pipelines (Telegram, YouTube, and others). The ideal partner will bring deep experience with Databricks, PySpark, Delta Lake, and large-scale ETL systems, with a proven track record of performance and cost optimization.

2. Objectives

• Provide end-to-end data engineering leadership for the Accelerator's Databricks ecosystem.
• Architect, build, and optimize scalable, cost-efficient pipelines for social media, Comscore, and future datasets.
• Establish durable engineering standards, monitoring, and documentation to support long-term sustainability.
• Improve platform performance, ensure data reliability, and support the needs of researchers and internal stakeholders.

3. Scope of Work

A. Pipeline Maintenance & Optimization
• Performance Tuning & Monitoring: Optimize pipelines to reduce compute costs and improve query speed.
• Schema & Metadata Management: Maintain schema consistency and update documentation.
• Data Refresh Operations: Support ingestion of new datasets, including staging, validation, and promotion.
• Data Validation & QA: Implement automated data quality checks.
• Documentation: Maintain runbooks, lineage diagrams, and operational dashboards (e.g., job health, cost, runtime metrics).

B. Researcher & Stakeholder Support
• Respond to researcher tickets related to data access, derived datasets, transformation enhancements, or performance concerns.
• Prepare reference tables, specialized views, or dataset extracts as needed.
• Communicate all changes and updates through Slack, Confluence, and regular engineering meetings.

4. Deliverables

• Stable, scalable pipeline architecture across all Accelerator datasets.
• Improved performance and reliability of all pipelines, with measurable reductions in runtime and DBU consumption.
• Unified observability suite with job monitoring, lineage, and cost visibility.
• Updated documentation: runbooks, architecture diagrams, data dictionaries, and operational procedures.
• Comscore dataset reliably supported and refreshed with documented validation steps.
• Quarterly technical report summarizing work completed, issues addressed, and strategic recommendations.

5. Required Skills

• Expert-level experience with Databricks, including Spark optimization, cluster configurations, Delta Lake internals, DLT, and Unity Catalog.
• Advanced software engineering skills, capable of writing production-quality, well-tested, modular, and maintainable code in Python/PySpark.
• Experience designing and implementing scalable data architectures, including schema evolution, metadata management, partitioning strategies, and high-performance table design.
• Strong DevOps experience, including familiarity with CI/CD systems (GitHub Actions, Azure DevOps, GitLab CI, or equivalent).
• Ability to create and maintain CI/CD pipelines for data workflows, automate deployments, package shared libraries, and improve engineering processes within the data platform.
• Demonstrated skill in cloud cost engineering and Databricks DBU optimization, including autoscaling strategies, caching logic, cluster policy design, and Delta Lake performance tuning.
• Ability to diagnose and resolve complex distributed pipeline failures, performance regressions, or schema inconsistencies.
• Excellent communication and documentation habits, including the ability to translate technical decisions for nontechnical stakeholders.
• Experience working with large datasets (10–100 TB range) with evolving schemas or complex ingestion patterns.

6. Success Metrics

• Reduction in compute and/or storage costs due to architectural and performance improvements.
• All core pipelines achieve consistent SLA compliance.
• Comscore dataset refresh completed with validated data and minimal regression issues.
• Positive feedback from researchers and internal stakeholders regarding pipeline usability and responsiveness.
• Complete, maintainable documentation and observability tools in place.

7. Proposal Submission Guidelines

Interested vendors should submit:
• A brief company profile and relevant experience.
• Proposed approach and timeline.
• Key personnel and their qualifications.
• Budget estimate for the 3-month engagement.
• References from similar projects (if available).

8. Submission Deadline

All proposals must be submitted by December 12th to info@researchaccelerator.org.