Rivago Infotech Inc

Data Engineer

⭐ - Featured Role | Apply directly with Data Freelance Hub
This role is for a Data Engineer in San Diego, CA, with a long-term contract. Key skills include Apache Spark, Spark SQL, Python, and AWS (EMR Serverless, S3, Redshift). Experience in fraud, risk, or compliance domains is preferred.
🌎 - Country
United States
💱 - Currency
$ USD
-
💰 - Day rate
Unknown
-
🗓️ - Date
May 1, 2026
🕒 - Duration
More than 6 months
-
🏝️ - Location
On-site
-
📄 - Contract
Unknown
-
🔒 - Security
Unknown
-
📍 - Location detailed
San Diego, CA
-
🧠 - Skills detailed
#Documentation #AWS (Amazon Web Services) #GCP (Google Cloud Platform) #Security #"ETL (Extract, Transform, Load)" #Spark SQL #Apache Spark #Compliance #Data Science #JSON (JavaScript Object Notation) #Data Governance #Data Quality #AWS EMR (Amazon Elastic MapReduce) #Spark (Apache Spark) #Monitoring #S3 (Amazon Simple Storage Service) #PySpark #Migration #Redshift #Data Lake #Data Engineering #AWS IAM (AWS Identity and Access Management) #Storage #Athena #ML (Machine Learning) #SQL (Structured Query Language) #Metadata #Python #Datasets #IAM (Identity and Access Management) #Data Documentation
Role description
Role: Data Engineer

Location: San Diego, CA 92129 (on-site)

Duration: Long-term project

Responsibilities

1. Design, build, and performance-tune Apache Spark workloads using Spark SQL and PySpark for complex transformations (JSON/semi-structured data, nested structures, window functions, joins, aggregations).
2. Profile and optimize Spark jobs: partitioning, shuffles, join strategies, skew, memory/spill, and right-sized resource usage, especially on EMR Serverless, for large-scale and petabyte-scale data.
3. Support customers and monitor pipelines around the clock, with strict SLAs for fixing and reinstating failed pipelines.
4. Implement reusable patterns for incremental loads, deduplication, and CDC-style processing.
5. Build and maintain ETL/ELT on AWS EMR Serverless (Spark), with S3 as the data lake layer: partitioning, compression, external tables, and layouts that support fast Spark and downstream SQL workloads.
6. Design and tune Redshift workloads: sort keys, distribution styles, and SQL patterns that fit S3 → Spark → Redshift flows.
7. Optimize cost and performance across Spark jobs, S3 storage, and Redshift (including retention and lifecycle policies where relevant).
8. Produce end-to-end designs: pipeline topology, data models, staging vs. curated layers, incremental strategies, and clear tradeoffs (freshness, cost, complexity, reliability).
9. Apply access controls for sensitive financial and user data (least privilege, row/column-level patterns where required).
10. Support data governance: metadata, documentation, and alignment with compliance expectations.
11. Implement data quality (validation rules, regex, null-safety) and monitoring/alerting with error handling for production pipelines.
12. Manage schema evolution and migrations with backward compatibility and risk reduction.
13. Partner with IRL teams and ML/Data Science on feature-rich datasets; work with risk/compliance and platform teams.

What we're looking for

1. Strong Spark and Spark SQL skills with hands-on performance tuning (not only SQL writing).
2. Python for Spark/data engineering.
3. AWS: EMR Serverless, S3 (Delta and data lake patterns), Redshift (SQL + tuning).
4. Ability to design pipelines and data models and communicate tradeoffs.
5. Familiarity with access control concepts for data platforms (AWS IAM, lake/warehouse permissions, RLS/column-level security as applicable).
6. Ownership of production systems: support, 24/7 monitoring, and collaboration.

Good to have

1. Experience in fraud, risk, or compliance domains.
2. Athena, GCP, and other S3 query engines alongside Spark.
3. Support and monitoring of highly interactive or SLA-tight workloads on large data pipelines.
4. Deeper Redshift operations (WLM, queues, workload patterns) alongside Spark.
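For context on the "deduplication and CDC-style processing" responsibility: in Spark this is commonly expressed as `ROW_NUMBER() OVER (PARTITION BY key ORDER BY updated_at DESC) = 1`. The pure-Python sketch below shows the same keep-latest-per-key logic; it is illustrative only, and the field names (`id`, `updated_at`) are assumptions, not from the posting.

```python
# Illustrative CDC-style dedup: keep the most recent change record per
# business key. Field names ("id", "updated_at") are assumed examples.
from operator import itemgetter

def dedupe_latest(records, key="id", ts="updated_at"):
    """Return one record per key, keeping the latest timestamp."""
    ordered = sorted(records, key=itemgetter(key, ts))
    latest = {}
    for rec in ordered:
        latest[rec[key]] = rec  # later timestamps overwrite earlier ones
    return list(latest.values())

changes = [
    {"id": 1, "updated_at": "2026-01-01", "status": "open"},
    {"id": 1, "updated_at": "2026-01-03", "status": "closed"},
    {"id": 2, "updated_at": "2026-01-02", "status": "open"},
]
current = dedupe_latest(changes)
```

In a real pipeline the same pattern would run as a Spark window function so it scales past a single machine; the control flow is identical.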
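The data-quality responsibility (validation rules, regex, null-safety) can be pictured as a small set of per-record checks. In Spark these would run as column expressions; the minimal Python sketch below only shows the shape, and the specific rules and field names are assumptions for illustration.

```python
import re

# Assumed example rule: a loose email shape check, not a full RFC validator.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def validate_row(row):
    """Return a list of rule violations for one record (empty = clean)."""
    errors = []
    # Null-safety: required fields must be present and non-empty
    for field in ("id", "email"):
        if row.get(field) in (None, ""):
            errors.append(f"{field}: missing")
    # Regex rule: email must look like an address (when present)
    email = row.get("email")
    if email and not EMAIL_RE.match(email):
        errors.append("email: malformed")
    return errors
```

Rows with a non-empty error list would typically be routed to a quarantine table and surfaced through the monitoring/alerting the posting mentions.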