Insight Global

Data Engineer

⭐ - Featured Role | Apply direct with Data Freelance Hub
This role is for a Data Engineer with 5+ years of experience, focusing on large-scale distributed systems. The contract is remote, pays $640 per day (USD), and requires expertise in Apache Spark, Azure services, and healthcare data.
🌎 - Country
United States
💱 - Currency
$ USD
-
💰 - Day rate
640
-
🗓️ - Date
March 20, 2026
🕒 - Duration
Unknown
-
🏝️ - Location
Remote
-
📄 - Contract
Unknown
-
🔒 - Security
Unknown
-
📍 - Location detailed
United States
-
🧠 - Skills detailed
#Spark SQL #Data Transformations #Delta Lake #Azure cloud #ADF (Azure Data Factory) #Data Analysis #Scala #PySpark #Spark (Apache Spark) #Cloud #Data Quality #ML (Machine Learning) #SQL (Structured Query Language) #Data Pipeline #Azure Data Factory #Kubernetes #Distributed Computing #Data Lineage #ETL (Extract, Transform, Load) #Azure #AI (Artificial Intelligence) #Data Science #Data Engineering #Storage #Python #Data Framework #Databricks #Monitoring #Data Layers #Apache Spark
Role description
Must be willing to work hours that overlap with European time zones, ideally 6:00 AM-3:00 PM EST.

We are looking for a Data Engineer to join a dedicated team building and evolving a clinical data platform serving the clinical operations space. You will architect and build the large-scale data pipelines that power clinical insights, processing billions of records across medical claims, clinical trials, publications, and provider data.

This is a core infrastructure role. You will be responsible for designing, building, and maintaining ETL frameworks that feed into analytics, machine learning, and product surfaces. You should be deeply comfortable with distributed computing at scale and experienced working alongside ML and data science teams in production environments.

What You'll Do
● Design, build, and maintain large-scale ETL pipelines and data frameworks using Apache Spark (PySpark/Scala) on cloud infrastructure
● Architect scalable data models and pipeline patterns to process structured and unstructured healthcare data at volume
● Build and optimize data layers on Azure cloud services, including Databricks, Delta Lake, and supporting compute and storage infrastructure
● Ensure data quality, lineage, and governance across the platform, implementing validation, monitoring, and alerting at scale
● Collaborate with AI scientists and MLOps teams to build data pipelines that serve model training, inference, and retraining workflows
● Work with data analysts and product teams to ensure curated, reliable data is available for downstream insights and reporting
● Contribute to platform architecture decisions and help define best practices for data engineering within the team

What We're Looking For
● 5+ years of experience in data engineering with a focus on large-scale distributed data systems
● Strong proficiency in Python, SQL, and Scala
● Deep hands-on experience with Apache Spark (PySpark, Spark SQL) for building ETL pipelines and data transformations at scale
● Experience with Azure cloud services, including Databricks, Delta Lake, and Azure Data Factory
● Familiarity with Kubernetes and container orchestration for data workloads
● Understanding of MLOps practices and experience building data infrastructure that supports machine learning workflows
● Experience with data quality frameworks, data lineage, and governance tooling
● Background in healthcare, life sciences, pharma, or clinical research is a strong plus
● Comfortable working independently in a remote setting with a distributed, cross-time-zone team

Who You Are
● A builder who thinks in systems: you design for scale, reliability, and maintainability from the start
● Someone who understands how data engineering connects to ML and analytics, and proactively bridges those gaps
● Confident owning pipeline architecture end-to-end, from ingestion through transformation to serving layers
● Pragmatic and communicative: you flag trade-offs early and keep teams aligned on data dependencies
● Experienced collaborating across time zones with distributed teams, including data science, MLOps, and product stakeholders
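The "data quality, validation, and alerting" responsibility above usually comes down to rule-based checks run against every pipeline batch. As a rough, hypothetical sketch (plain Python standing in for what would be a PySpark/Databricks job over a Delta table in this stack; the field names and the 5% null-rate threshold are invented for illustration):

```python
# Hypothetical rule-based data quality check, sketched in plain Python.
# In the stack described above, this logic would run as aggregations
# inside a PySpark job over a Delta table, not over in-memory dicts.

def run_quality_checks(records, required_fields, max_null_rate=0.05):
    """Validate a batch of records; return (passed, per-field null rates)."""
    total = len(records)
    report = {}
    for field in required_fields:
        nulls = sum(1 for r in records if r.get(field) is None)
        # An empty batch counts as fully null so it fails the check.
        report[field] = nulls / total if total else 1.0
    passed = total > 0 and all(rate <= max_null_rate for rate in report.values())
    return passed, report

# Example batch: one claim record is missing its provider_id,
# so provider_id has a 50% null rate and the batch fails validation.
batch = [
    {"claim_id": "c1", "provider_id": "p9", "amount": 120.0},
    {"claim_id": "c2", "provider_id": None, "amount": 80.0},
]
ok, report = run_quality_checks(batch, ["claim_id", "provider_id"])
```

In production, a failing report would feed monitoring and alerting rather than a return value, and lineage metadata would record which upstream feed (claims, trials, publications, provider data) produced the bad batch.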