

Flexon Technologies Inc.
Resiliency and Recovery Engineer - Tech Lead
This role is for a Resiliency and Recovery Engineer - Tech Lead in Charlotte, North Carolina. The contract length and pay rate are not specified. Key skills include high-availability experience, production resiliency, automation (Python/PowerShell), and CI/CD expertise.
🌎 - Country
United States
💱 - Currency
$ USD
💰 - Day rate
Unknown
🗓️ - Date
April 10, 2026
🕒 - Duration
Unknown
🏝️ - Location
On-site
📄 - Contract
Unknown
🔒 - Security
Unknown
📍 - Location detailed
Charlotte, NC
🧠 - Skills detailed
#Observability #DevOps #Datadog #Splunk #Python #Cloud #Kubernetes #Azure #JQL (Jira Query Language) #Leadership #ETL (Extract, Transform, Load) #Deployment #SQL Server #GitLab #Automation #Monitoring #SonarQube #Jenkins #Jira #Terraform #BI (Business Intelligence) #Documentation #Firewalls #SQL (Structured Query Language) #AWS (Amazon Web Services) #Batch
Role description
Job Title: Resiliency and Recovery Engineer - Tech Lead
Location: Charlotte, North Carolina (On-Site)
Job Description:
The Resiliency & Recovery Engineer (Contractor) is a senior, hands-on engineering role focused on improving production resiliency and recovery outcomes across critical services and payment rails. This role is responsible for driving measurable improvements such as faster recovery (reduced time to restore service), stronger and actionable alert coverage, increased automation to reduce manual toil, and safer releases with repeatable rollback/cutback readiness.
Responsibilities:
• Work across all payment rails to develop faster, repeatable resiliency and recovery processes adopted broadly across the organization.
• Identify resiliency gaps based on incident patterns and recurring failures; turn findings into prioritized remediation work.
• Build or strengthen monitoring, alerting, and dashboards that are actually used by engineers and leadership.
• Create runbooks and automate recovery actions to reduce manual toil and human error during incidents.
• Improve release safety and rollback/fallback readiness with clear, repeatable cutback procedures.
• Support SQL reliability efforts (SQL Server 2022 focus) in partnership with DB and infrastructure teams.
• Own backlog, prioritization, design reviews, and cross-team coordination (Ops/Product/Tech). Run weekly stand-ups and prepare bi-weekly executive readouts.
• Integrate resilience testing into CI/CD pipelines and DevOps workflows to catch issues early and ensure robust, automated releases.
• Conduct chaos engineering experiments (failure injections, game days) to proactively uncover system weaknesses and validate recovery processes under real-world failure scenarios.
• Document and share resiliency best practices; mentor and train engineering teams to foster a culture of reliability and continuous improvement.
• Ensure seamless handoff of newly created resiliency and recovery practices to the permanent Engineering team through thorough documentation and knowledge transfer.
Must-Have Qualifications:
• Proven experience in high-availability, high-transaction environments (preferably payments or financial services).
• Strong background in production resiliency and recovery, runbooks/playbooks, and root-cause analysis.
• Incident pattern analysis and MTTR baselining.
• Senior-level observability expertise with dashboards, monitors, and alerts (Datadog preferred; similar tools considered).
• Experience with Splunk, Datadog, SQL, JQL (Jira Query Language), and GitLab.
• Deep CI/CD experience: pipeline design and operation, release safety patterns, rollback readiness, and generating code-quality and testing metrics.
• Automation skills using Python and/or PowerShell for building repeatable recovery workflows and operational tooling.
• Kubernetes/container platform troubleshooting (deployments, pods, config drift, safe restarts, production incident investigation).
• Experience keeping identity, credential, certificate, and secret-rotation processes resilient.
• Reliability of batch/scheduler job execution and distributed integration failure-handling (timeouts, retries, idempotency, duplicate prevention, reconciliation).
Nice-to-Have Qualifications:
• SRE-style reliability practices (SLO/SLI, error budgets, operational metrics).
• Failover / data-center flip / active-active or active-passive recovery concepts.
• Cloud engineering with Azure or AWS.
• DevOps tooling such as Jenkins, Terraform, SonarQube, and Helm charts.
• Network and traffic-management incident triage (load balancers, firewalls, VLAN changes, rapid isolation of app vs. infra vs. network issues).