

Princeton University
Senior Data Engineer
Featured Role | Apply directly with Data Freelance Hub
This role is for a Senior Data Engineer with a contract length of 3 months, offering a competitive pay rate. Key skills include Databricks, PySpark, Delta Lake, and large-scale ETL systems. Experience with cloud cost optimization and large datasets is required.
Country: United States
Currency: USD ($)
Day rate: Unknown
Date: December 5, 2025
Duration: 1 to 3 months
Location: Unknown
Contract: Unknown
Security: Unknown
Location detailed: Atlanta Metropolitan Area
Skills detailed: #PySpark #Scala #AutoScaling #DevOps #Metadata #Monitoring #Documentation #Spark (Apache Spark) #Cloud #Azure #Azure DevOps #Storage #Regression #Delta Lake #Data Management #GitHub #Libraries #Deployment #Databricks #Data Quality #Data Access #Python #Leadership #ETL (Extract, Transform, Load) #Compliance #Observability #GitLab #Data Engineering #Data Architecture #Datasets
Role description
1. Overview
The Princeton School of Public and International Affairs seeks proposals for a full-time Senior Data Engineer to support the data engineering backbone of the Accelerator initiative. The engineer will work across our Databricks-based platform: designing new pipelines, improving existing ones, optimizing performance, and driving down compute/storage costs through architectural decisions.
This role will also support maintenance and ongoing enhancement of large text-based datasets (~25 TB uncompressed), alongside our primary social-media ingestion pipelines (Telegram, YouTube, and others).
The ideal partner will bring deep experience with Databricks, PySpark, Delta Lake, and large-scale ETL systems, with a proven track record of performance and cost optimization.
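To give a concrete sense of the ingestion side of this work, the sketch below uses Databricks Auto Loader to incrementally land raw JSON exports in a bronze Delta table. The paths, catalog, and table names are hypothetical, and `spark` is the session Databricks provides in notebooks and jobs.

```python
# Minimal Auto Loader sketch: incrementally ingest raw JSON exports into a
# bronze Delta table. Paths, catalog, and table names are hypothetical;
# `spark` is the Databricks-provided session.
from pyspark.sql import functions as F

RAW_PATH = "s3://accelerator-raw/telegram/"                       # hypothetical landing zone
CHECKPOINT = "s3://accelerator-meta/checkpoints/telegram_bronze"  # stream + schema state

bronze = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", CHECKPOINT)  # Auto Loader tracks and evolves schema here
    .load(RAW_PATH)
    .withColumn("_ingested_at", F.current_timestamp())
)

(bronze.writeStream
    .option("checkpointLocation", CHECKPOINT)
    .trigger(availableNow=True)  # process everything available, then stop (incremental batch)
    .toTable("accelerator.bronze.telegram_messages"))
```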
2. Objectives
• Provide end-to-end data engineering leadership for the Accelerator's Databricks ecosystem.
• Architect, build, and optimize scalable, cost-efficient pipelines for social media, Comscore, and future datasets.
• Establish durable engineering standards, monitoring, and documentation to support long-term sustainability.
• Improve platform performance, ensure data reliability, and support the needs of researchers and internal stakeholders.
3. Scope of Work
A. Pipeline Maintenance & Optimization
• Performance Tuning & Monitoring: Optimize pipelines to reduce compute costs and improve query speed.
• Schema & Metadata Management: Maintain schema consistency and update documentation.
• Data Refresh Operations: Support ingestion of new datasets, including staging, validation, and promotion.
• Data Validation & QA: Implement automated data quality checks (a minimal sketch follows this list).
• Documentation: Maintain runbooks, lineage diagrams, and operational dashboards (e.g., job health, cost, runtime metrics).
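The following is one minimal way a promotion gate with automated quality checks could look: validate a staged Delta table and promote it only if the checks pass. Table names, the key column, and thresholds are hypothetical.

```python
# Promotion-gate sketch: run basic quality checks on a staged Delta table
# before promoting it. Table names, key column, and thresholds are
# hypothetical; `spark` is the Databricks-provided session.
from pyspark.sql import functions as F

def validate_staged_table(table, key_col, min_rows=1):
    """Return a list of human-readable failures; an empty list means it passed."""
    df = spark.read.table(table)
    failures = []

    total = df.count()
    if total < min_rows:
        failures.append(f"{table}: only {total} rows (expected >= {min_rows})")

    null_keys = df.filter(F.col(key_col).isNull()).count()
    if null_keys > 0:
        failures.append(f"{table}: {null_keys} null values in key column '{key_col}'")

    duplicates = total - df.select(key_col).distinct().count()
    if duplicates > 0:
        failures.append(f"{table}: {duplicates} duplicate '{key_col}' values")

    return failures

# Gate the refresh: promote only if the staged table is clean.
problems = validate_staged_table("accelerator.staging.comscore_daily", "record_id")
if problems:
    raise ValueError("Staged table failed validation:\n" + "\n".join(problems))
spark.sql(
    "CREATE OR REPLACE TABLE accelerator.prod.comscore_daily AS "
    "SELECT * FROM accelerator.staging.comscore_daily"
)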
B. Researcher & Stakeholder Support
• Respond to researcher tickets related to data access, derived datasets, transformation enhancements, or performance concerns.
• Prepare reference tables, specialized views, or dataset extracts as needed (see the brief sketch after this list).
• Communicate all changes and updates through Slack, Confluence, and regular engineering meetings.
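As a small example of the kind of derived artifact researchers typically request, the view below aggregates a hypothetical production table into daily per-channel counts; all catalog, table, and column names are illustrative only.

```python
# Publish a derived, researcher-facing view over a hypothetical production
# table. Catalog, schema, table, and column names are illustrative only.
spark.sql("""
    CREATE OR REPLACE VIEW accelerator.research.telegram_daily_counts AS
    SELECT channel_id,
           DATE(posted_at) AS post_date,
           COUNT(*)        AS n_messages
    FROM accelerator.prod.telegram_messages
    GROUP BY channel_id, DATE(posted_at)
""")
```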
4. Deliverables
• Stable, scalable pipeline architecture across all Accelerator datasets.
• Improved performance and reliability of all pipelines, with measurable reductions in runtime and DBU consumption.
• Unified observability suite with job monitoring, lineage, and cost visibility (a cost-visibility sketch follows this list).
• Updated documentation: runbooks, architecture diagrams, data dictionaries, and operational procedures.
• Comscore dataset reliably supported and refreshed with documented validation steps.
• Quarterly technical report summarizing work completed, issues addressed, and strategic recommendations.
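For the cost-visibility piece, one plausible starting point, assuming Unity Catalog system tables are enabled in the workspace, is to aggregate the documented system.billing.usage table. The query below is a sketch, not a prescribed dashboard; verify the schema against your workspace before relying on it.

```python
# Sketch of DBU cost visibility from Databricks system tables. Assumes the
# workspace has system.billing.usage enabled; verify column names against
# the current documentation before building on this.
dbu_by_sku = spark.sql("""
    SELECT usage_date,
           sku_name,
           SUM(usage_quantity) AS dbus
    FROM system.billing.usage
    WHERE usage_date >= date_sub(current_date(), 30)
    GROUP BY usage_date, sku_name
    ORDER BY usage_date, sku_name
""")
dbu_by_sku.show(truncate=False)
```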
5. Required Skills
• Expert-level experience with Databricks, including Spark optimization, cluster configurations, Delta Lake internals, DLT, and Unity Catalog.
• Advanced software engineering skills, capable of writing production-quality, well-tested, modular, and maintainable code in Python/PySpark.
• Experience designing and implementing scalable data architectures, including schema evolution, metadata management, partitioning strategies, and high-performance table design.
• Strong DevOps experience, including familiarity with CI/CD systems (GitHub Actions, Azure DevOps, GitLab CI, or equivalent).
• Ability to create and maintain CI/CD pipelines for data workflows, automate deployments, package shared libraries, and improve engineering processes within the data platform.
• Demonstrated skill in cloud cost engineering and Databricks DBU optimization, including autoscaling strategies, caching logic, cluster policy design, and Delta Lake performance tuning (a tuning sketch follows this list).
• Ability to diagnose and resolve complex distributed pipeline failures, performance regressions, or schema inconsistencies.
• Excellent communication and documentation habits, including the ability to translate technical decisions for nontechnical stakeholders.
• Experience working with large datasets (10 to 100 TB range) with evolving schemas or complex ingestion patterns.
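To make the tuning expectation concrete, here is a minimal sketch of a routine Delta Lake maintenance pass. The table and column names are hypothetical, and the right ZORDER columns depend on the actual query patterns observed.

```python
# Illustrative Delta Lake maintenance pass on a hypothetical large table:
# compact small files and co-locate data on commonly filtered columns.
spark.sql(
    "OPTIMIZE accelerator.prod.telegram_messages "
    "ZORDER BY (channel_id, posted_at)"
)

# Adaptive Query Execution helps downstream joins and aggregations;
# skew-join handling mitigates hot keys common in social-media data.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")

# Reclaim storage from files no longer referenced by the table
# (subject to the default 7-day retention window).
spark.sql("VACUUM accelerator.prod.telegram_messages")
```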
6. Success Metrics
• Reduction in compute and/or storage costs due to architectural and performance improvements.
• All core pipelines achieve consistent SLA compliance.
• Comscore dataset refresh completed with validated data and minimal regression issues.
• Positive feedback from researchers and internal stakeholders regarding pipeline usability and responsiveness.
• Complete, maintainable documentation and observability tools in place.
7. Proposal Submission Guidelines
Interested vendors should submit:
• A brief company profile and relevant experience.
• Proposed approach and timeline.
• Key personnel and their qualifications.
• Budget estimate for the 3-month engagement.
• References from similar projects (if available).
8. Submission Deadline
All proposals must be submitted by December 12th to info@researchaccelerator.org.