Holistic Partners, Inc

Lead SRE Engineer || W2 Only

This is a contract Lead SRE Engineer role in Chicago, IL (Hybrid). It requires expertise in Java, Linux, multi-data-center architecture, and disaster recovery strategy, plus familiarity with Kafka and Oracle. Pay rate is "$X/hour".
🌎 - Country
United States
💱 - Currency
$ USD
💰 - Day rate
Unknown
🗓️ - Date
October 23, 2025
🕒 - Duration
Unknown
🏝️ - Location
Hybrid
📄 - Contract
W2 Contractor
🔒 - Security
Unknown
📍 - Location detailed
Chicago, IL
🧠 - Skills detailed
#Datadog #Data Integration #Deployment #ETL (Extract, Transform, Load) #Programming #Vault #Monitoring #Linux #Data Warehouse #Strategy #GitHub #Batch #Kafka (Apache Kafka) #Leadership #Replication #IP (Internet Protocol) #Oracle #Scala #Java #Disaster Recovery #Splunk #Documentation #Compliance #Jenkins
Role description
Job Opportunity: Lead SRE Engineer
Location: Chicago, IL (Hybrid)
Duration: Contract

Top Skills' Details
• Strong background in Java application architecture and TCP/IP socket programming.
• Expertise with Linux (Red Hat), VM environments, and Nutanix infrastructure.
• Knowledge of multi-data-center Active/Active design patterns and high-availability systems.
• Familiarity with Kafka, DB2, Oracle, and mainframe data integration.
• Hands-on experience with Splunk, Datadog, Jenkins, ServiceNow, and Autosys.
• Proven ability to lead technical DR exercises, coordinate multi-team execution, and present results to leadership.

Job Description
We are seeking a Lead Engineer to head the resiliency, disaster recovery (DR), and operational continuity efforts for our mission-critical Domestic Transaction Switching Application and Diners Club International (DCI) Switch. This role requires deep technical expertise in Java, Linux, networking, and distributed systems, combined with strategic program delivery skills to coordinate multiple infrastructure, development, and operations teams. The ideal candidate has hands-on experience managing Active/Active multi-data-center architectures, low-level TCP/IP integrations, and DR orchestration within regulated financial environments.

Core Responsibilities

Application & Infrastructure Oversight
• Oversee the Domestic Transaction Switching Application, a Java-based platform running on VMs hosted in Nutanix clusters with Red Hat Linux.
• Manage all low-level TCP/IP socket communications, including connectors, listeners, and transaction routing logic.
• Coordinate with teams supporting the Diners Club International Switch, a WebSphere application, ensuring interoperability and fault-tolerant communication between domestic and international payment networks.
• Ensure high availability, scalability, and compliance of multi-data-center deployments through Active/Active/Active architecture review and validation.

Disaster Recovery (DR) Strategy & Analysis
• Own end-to-end DR planning, testing, and documentation as outlined in Milestone 5.1 of the detailed DR plan.
• Evaluate the impact of DR events across configuration data sources, including 30+ read-only configuration files (IIN ranges, currency codes, merchant category codes, etc.) loaded into in-memory caches.
• Assess external dependencies such as the DB2 Global Database, mainframe negative files and account-level processing files, and the Oracle UI used by operations to manage client connections and routes.
• Perform criticality analysis to classify configuration dependencies (blockers, critical, non-critical) and design mitigation strategies for stale or unavailable data sources.
• Define recovery point objectives (RPO) and recovery time objectives (RTO) for all dependent systems.

Active/Active Architecture Validation
• Review and strengthen the Active/Active/Active data-center strategy for the Hydra Switching Application.
• Identify and document exceptions, such as low-volume participants operating in Active/Passive mode, and assess potential transaction impact during site failover; inventory and track remediation plans.
• Analyze inter-data-center dependencies, including the dynamic key exchange (DKE) process requiring three-way acknowledgment for encryption key rotation.
• Document functional areas that degrade or fail during partial data-center outages and propose operational mitigations.

Transaction Extracts & Event Processing
• Oversee downstream batch transaction extracts distributed to Data Warehouse, Settlement Systems, WorldPay, and regional datastores (e.g., India).
• Verify Kafka Enterprise Event Bus integrity during DR events, confirming Active/Active message replication and recovery consistency (trust but verify).
• Analyze downstream dependencies to validate continuity for all transaction, settlement, and compliance feeds.

Control Plane & Platform Dependencies
• Assess DR implications for control-plane components (Jenkins, GitHub, Nexus, Vault, Protegrity, Okta, etc.), which operate in Active/Passive configurations.
• Coordinate with enterprise platform teams to balance scope and minimize global outage risk during DR testing.
• Contribute to the Enterprise DR Playbook to define which components are within or excluded from DR scope.

Monitoring, Runbooks & Evidence Capture
• Maintain comprehensive monitoring coverage using Splunk (functional transaction view) and Datadog (infrastructure health).
• Develop runbooks and implementation plan templates integrating ServiceNow, Jenkins, and Autosys workflows for deployment, validation, and rollback.
• Standardize evidence capture processes using Splunk dashboards, system logs, and console screenshots for audit and compliance reporting.

Non-Production Test Environments & Simulation
• Design and coordinate a production-like QA/Dev environment for full DR simulation testing across all dependent components.
• Execute controlled DR test events, emulating change windows and data-center failovers: place the impacted data center into a down state; freeze configuration and batch jobs; redirect traffic and validate health on the remaining sites; bring the passive site online and validate configuration/job recovery.
• Document lessons learned and integrate continuous improvement into DR planning.

Required Skills & Experience
• Strong background in Java application architecture and TCP/IP socket programming.
• Expertise with Linux (Red Hat), VM environments, and Nutanix infrastructure.
• Knowledge of multi-data-center Active/Active design patterns and high-availability systems.
• Familiarity with Kafka, DB2, Oracle, and mainframe data integration.
• Hands-on experience with Splunk, Datadog, Jenkins, ServiceNow, and Autosys.
• Proven ability to lead technical DR exercises, coordinate multi-team execution, and present results to leadership.