

Synechron
Resiliency and Recovery Engineer
This role is for a Resiliency and Recovery Engineer. The contract length and pay rate are not specified, and the location is listed as remote. Key skills include production resiliency, SQL, automation (Python/PowerShell), and CI/CD experience. Financial services experience is required.
🌎 - Country
United States
💱 - Currency
$ USD
-
💰 - Day rate
Unknown
-
🗓️ - Date
April 4, 2026
🕒 - Duration
Unknown
-
🏝️ - Location
Unknown
-
📄 - Contract
Unknown
-
🔒 - Security
Unknown
-
📍 - Location detailed
Charlotte, NC
-
🧠 - Skills detailed
#Kubernetes #Datadog #Stories #Data Science #ETL (Extract, Transform, Load) #Python #Cloud #AWS (Amazon Web Services) #Monitoring #Batch #Terraform #Leadership #GitLab #Azure #DevOps #SQL (Structured Query Language) #JQL (Jira Query Language) #Consulting #Observability #Deployment #Splunk #SQL Server #Jira #Jenkins #Automation #BI (Business Intelligence) #AI (Artificial Intelligence) #Firewalls
Role description
We are
At Synechron, we believe in the power of digital to transform businesses for the better. Our global consulting firm combines creativity and innovative technology to deliver industry-leading digital solutions. Synechron’s progressive technologies and optimization strategies span end-to-end Artificial Intelligence, Consulting, Digital, Cloud & DevOps, Data, and Software Engineering, servicing an array of noteworthy financial services and technology firms. Through research and development initiatives in our FinLabs we develop solutions for modernization, from Artificial Intelligence and Blockchain to Data Science models, Digital Underwriting, mobile-first applications and more. Over the last 20+ years, our company has been honored with multiple employer awards, recognizing our commitment to our talented teams. With top clients to boast about, Synechron has a global workforce of 14,500+, and has 58 offices in 21 countries within key global markets.
Our challenge
The Resiliency & Recovery Engineer is a senior, hands-on engineering role focused on improving production resiliency and recovery outcomes across critical services and payment rails. This role is responsible for driving measurable improvements such as faster recovery (reduced time to restore service), stronger and actionable alert coverage, increased automation to reduce manual toil, and safer releases with repeatable rollback/cutback readiness. The engineer will partner closely with application teams, DevOps, Infrastructure, Database teams, and operational stakeholders to identify resiliency gaps, prioritize remediation, and implement durable solutions that improve stability and reduce customer impact.
• Work across all MMC payment rails to develop faster, more repeatable resiliency and recovery processes that benefit every platform, ensuring these enhancements are adopted broadly across the organization rather than siloed on any single platform.
• Identify resiliency gaps based on incident patterns and recurring failures; turn findings into prioritized remediation work.
• Build/strengthen monitoring, alerting, and dashboards that are actually used by engineers and leadership.
• Create runbooks and automate recovery actions to reduce manual toil and human error during incidents.
• Improve release safety and rollback/fallback readiness (clear, repeatable cutback procedures).
• Support SQL reliability efforts (SQL Server 2022 focus) in partnership with DB/infrastructure teams.
• Own the backlog, prioritization, design reviews, and cross-team coordination (Ops/Product/Tech).
• Run the weekly standup and prepare the bi-weekly executive readout.
• Integrate resilience testing into CI/CD pipelines and DevOps workflows to catch issues early and ensure robust, automated releases.
• Conduct chaos engineering experiments (failure injections, game days) to proactively uncover system weaknesses and validate recovery processes under real-world failure scenarios.
• Document and share resiliency best practices; mentor and train engineering teams to foster a culture of reliability and continuous improvement across the organization.
• Ensure a seamless handoff of all newly created resiliency and recovery practices (once mature and repeatable) to the MMC Engineering team by thoroughly documenting the improvements and conducting knowledge transfer, so that the permanent team can sustain and build upon these enhancements after the contract period.
Must-Have Qualifications:
• Proven experience in high-availability, high-transaction environments (preferably payments or financial services).
• Strong background in production resiliency and recovery (recovery execution, runbooks/playbooks, RCA mindset).
• Experience with incident pattern analysis, MTTR baselines (P2 Major/Minor), and recurring-failure taxonomy (by rail/service).
• Senior-level observability expertise: dashboards, monitors, and alerts (Datadog preferred; similar tools considered).
• Hands-on experience with Splunk, Datadog, SQL, JQL (Jira Query Language), and GitLab.
• Experience with CI/CD metrics and with generating executive reports on code quality, changes, and test automation from GitLab.
• Understanding of story quality, metrics, and monitoring, with the ability to gather data that showcases deficiencies.
• Senior CI/CD experience: pipeline design/operation, release safety patterns, and rollback readiness.
• Experience using metrics and monitoring data to identify and communicate deficiencies.
• Automation skills: Python and/or PowerShell (or equivalent) for building repeatable recovery workflows and operational tooling.
• Kubernetes/container platform production troubleshooting (deployments, pods, config drift, safe restarts, and “why did this change break prod” investigations).
• Experience with identity/credentials/certificate & secret-rotation resilience (preventing outages during password rotations, certificate upgrades, and secret propagation; implementing guardrails and monitoring for these events).
• Batch/scheduler/job-execution reliability (detecting/preventing silent job failures, validating multi-DC scenarios, and building controls to ensure scheduled processing does not impact customers).
• Distributed integration failure-handling (timeouts, retries, backpressure, idempotency, duplicate prevention, and reconciliation—especially across vendor/downstream dependencies).
Nice-to-have (differentiators)
• Experience with SRE-style reliability practices (SLO/SLI thinking, error budgets, operational metrics).
• Experience with failover / DC flip / active-active or active-passive recovery concepts and scenario-based runbooks.
• Cloud Engineering (Azure, AWS)
• DevOps tools expertise (Jenkins, Terraform, SonarQube, Helm charts)
• Network & traffic-management incident triage (load balancers/firewalls/VLAN changes, DC traffic flips, and rapid isolation of “app vs infra vs network” to stabilize service)
We offer:
• A highly competitive compensation and benefits package.
• A multinational organization with 58 offices in 21 countries and the possibility to work abroad.
• 10 days of paid annual leave (plus sick leave and national holidays).
• Maternity & paternity leave plans.
• A comprehensive insurance plan including medical, dental, vision, life insurance, and long-/short-term disability (plans vary by region).
• Retirement savings plans.
• A higher education certification policy.
• Commuter benefits (varies by region).
• Extensive training opportunities, focused on skills, substantive knowledge, and personal development.
• On-demand Udemy for Business for all Synechron employees with free access to more than 5000 curated courses.
• Coaching opportunities with experienced colleagues from our Financial Innovation Labs (FinLabs) and Centers of Excellence (CoE) groups.
• Cutting edge projects at the world’s leading tier-one banks, financial institutions and insurance firms.
• A flat and approachable organization.
• A truly diverse, fun-loving, and global work culture.
SYNECHRON’S DIVERSITY & INCLUSION STATEMENT
Diversity & Inclusion are fundamental to our culture, and Synechron is proud to be an equal opportunity workplace and is an affirmative action employer. Our Diversity, Equity, and Inclusion (DEI) initiative ‘Same Difference’ is committed to fostering an inclusive culture – promoting equality, diversity and an environment that is respectful to all. We strongly believe that a diverse workforce helps build stronger, successful businesses as a global company. We encourage applicants from across diverse backgrounds, race, ethnicities, religion, age, marital status, gender, sexual orientations, or disabilities to apply. We empower our global workforce by offering flexible workplace arrangements, mentoring, internal mobility, learning and development programs, and more.
All employment decisions at Synechron are based on business needs, job requirements and individual qualifications, without regard to the applicant’s gender, gender identity, sexual orientation, race, ethnicity, disabled or veteran status, or any other characteristic protected by law.






