Princeton University

Senior Data Engineer

⭐ - Featured Role | Apply directly with Data Freelance Hub
This role is for a Senior Data Engineer with a contract length of 3 months, offering a competitive pay rate. Key skills include Databricks, PySpark, Delta Lake, and large-scale ETL systems. Experience with cloud cost optimization and large datasets is required.
🌎 - Country
United States
💱 - Currency
$ USD
-
💰 - Day rate
Unknown
-
🗓️ - Date
December 5, 2025
🕒 - Duration
1 to 3 months
-
🏝️ - Location
Unknown
-
📄 - Contract
Unknown
-
🔒 - Security
Unknown
-
📍 - Location detailed
Atlanta Metropolitan Area
-
🧠 - Skills detailed
#PySpark #Scala #AutoScaling #DevOps #Metadata #Monitoring #Documentation #Spark (Apache Spark) #Cloud #Azure #Azure DevOps #Storage #Regression #Delta Lake #Data Management #GitHub #Libraries #Deployment #Databricks #Data Quality #Data Access #Python #Leadership #ETL (Extract, Transform, Load) #Compliance #Observability #GitLab #Data Engineering #Data Architecture #Datasets
Role description
1. Overview

The Princeton School of Public and International Affairs seeks proposals for a full-time Senior Data Engineer to support the data engineering backbone of the Accelerator initiative. The engineer will work across our Databricks-based platform, designing new pipelines, improving existing ones, optimizing performance, and driving down compute/storage costs through architectural decisions. This role will also support maintenance and ongoing enhancement of large text-based datasets (~25 TB uncompressed), alongside our primary social-media ingestion pipelines (Telegram, YouTube, and others). The ideal partner will bring deep experience with Databricks, PySpark, Delta Lake, and large-scale ETL systems, with a proven track record of performance and cost optimization.

2. Objectives

• Provide end-to-end data engineering leadership for the Accelerator's Databricks ecosystem.
• Architect, build, and optimize scalable, cost-efficient pipelines for social media, Comscore, and future datasets.
• Establish durable engineering standards, monitoring, and documentation to support long-term sustainability.
• Improve platform performance, ensure data reliability, and support the needs of researchers and internal stakeholders.

3. Scope of Work

A. Pipeline Maintenance & Optimization
• Performance Tuning & Monitoring: Optimize pipelines to reduce compute costs and improve query speed.
• Schema & Metadata Management: Maintain schema consistency and update documentation.
• Data Refresh Operations: Support ingestion of new datasets, including staging, validation, and promotion.
• Data Validation & QA: Implement automated data quality checks.
• Documentation: Maintain runbooks, lineage diagrams, and operational dashboards (e.g., job health, cost, runtime metrics).

B. Researcher & Stakeholder Support
• Respond to researcher tickets related to data access, derived datasets, transformation enhancements, or performance concerns.
• Prepare reference tables, specialized views, or dataset extracts as needed.
• Communicate all changes and updates through Slack, Confluence, and regular engineering meetings.

4. Deliverables

• Stable, scalable pipeline architecture across all Accelerator datasets.
• Improved performance and reliability of all pipelines, with measurable reductions in runtime and DBU consumption.
• Unified observability suite with job monitoring, lineage, and cost visibility.
• Updated documentation: runbooks, architecture diagrams, data dictionaries, and operational procedures.
• Comscore dataset reliably supported and refreshed with documented validation steps.
• Quarterly technical report summarizing work completed, issues addressed, and strategic recommendations.

5. Required Skills

• Expert-level experience with Databricks, including Spark optimization, cluster configurations, Delta Lake internals, DLT, and Unity Catalog.
• Advanced software engineering skills, capable of writing production-quality, well-tested, modular, and maintainable code in Python/PySpark.
• Experience designing and implementing scalable data architectures, including schema evolution, metadata management, partitioning strategies, and high-performance table design.
• Strong DevOps experience, including familiarity with CI/CD systems (GitHub Actions, Azure DevOps, GitLab CI, or equivalent).
• Ability to create and maintain CI/CD pipelines for data workflows, automate deployments, package shared libraries, and improve engineering processes within the data platform.
• Demonstrated skill in cloud cost engineering and Databricks DBU optimization, including autoscaling strategies, caching logic, cluster policy design, and Delta Lake performance tuning.
• Ability to diagnose and resolve complex distributed pipeline failures, performance regressions, or schema inconsistencies.
• Excellent communication and documentation habits, including the ability to translate technical decisions for nontechnical stakeholders.
• Experience working with large datasets (10–100 TB range) with evolving schemas or complex ingestion patterns.

6. Success Metrics

• Reduction in compute and/or storage costs due to architectural and performance improvements.
• All core pipelines achieve consistent SLA compliance.
• Comscore dataset refresh completed with validated data and minimal regression issues.
• Positive feedback from researchers and internal stakeholders regarding pipeline usability and responsiveness.
• Complete, maintainable documentation and observability tools in place.

7. Proposal Submission Guidelines

Interested vendors should submit:
• A brief company profile and relevant experience.
• Proposed approach and timeline.
• Key personnel and their qualifications.
• Budget estimate for the 3-month engagement.
• References from similar projects (if available).

8. Submission Deadline

All proposals must be submitted by December 12th to info@researchaccelerator.org.