riTara.ai

Project: Synthetic PDF Form Generation & Auto-Labeling (Multiple Lending PDF Forms)

This is a remote data/ML engineer role for a 3–4 week project based in McLean, VA, focused on synthetic PDF form generation and auto-labeling. Key skills include Python or Java, machine learning, document analysis, and PDF generation.
🌎 - Country
United States
💱 - Currency
$ USD
💰 - Day rate
Unknown
🗓️ - Date
December 16, 2025
🕒 - Duration
1 to 3 months
🏝️ - Location
Remote
📄 - Contract
Unknown
🔒 - Security
Unknown
📍 - Location detailed
United States
🧠 - Skills detailed
#Java #Data Pipeline #Programming #Databases #Data Science #ML (Machine Learning) #Data Processing #Datasets #JSON (JavaScript Object Notation) #Documentation #Python #Computer Science
Role description
Company Description
We are a startup that organizes unstructured data across industry domains, helping businesses make better-informed decisions and execute workflows.

Role Description
This is a remote, project-based role based in McLean, VA. The project, Synthetic PDF Form Generation & Auto-Labeling, involves creating synthetic PDF forms and developing methods to auto-label multiple types of lending PDF forms. Day-to-day tasks include designing and generating synthetic PDFs, implementing automated labeling solutions, collaborating with team members to meet project milestones, building high-quality data pipelines, and testing thoroughly for accuracy and efficiency.

We are seeking an experienced data/ML engineer to build a fully automated pipeline that generates synthetic PDF forms and corresponding ground-truth label files for 10 lending-related document types. This is not a manual labeling project.

Scope
• Use the provided PDF form templates, spreadsheet-based field data, and spreadsheet-based ground-truth attributes
• Programmatically generate 200+ synthetic PDFs across 10 form types
• Automatically generate pixel-accurate label files (JSON) per document, including:
  • Field key
  • Field value
  • Page number
  • Bounding box coordinates
• Create a reusable, configuration-driven solution that supports new forms easily

Deliverables
• ~2000 PDF documents filled with synthetic data
• Ground-truth label files (JSON)
• Field-to-template mapping files (per form)
• Python codebase + SOP documentation

Qualifications
• Strong technical skills in PDF generation, data processing, and programming languages such as Python or Java
• Experience with machine learning, including implementing auto-labeling algorithms and training models
• Expertise in document analysis and optical character recognition (OCR) technologies, including working with coordinates and bounding boxes
• Experience with programmatic PDF generation (ReportLab, PyMuPDF, or similar)
• Proficiency in working with databases and managing large datasets
• Strong analytical and problem-solving skills, with attention to detail and accuracy
• Collaborative mindset and ability to work effectively in a cross-functional team environment
• Relevant educational background, such as a degree in Computer Science, Data Science, or a related field
• Experience with document formatting standards like PDF; familiarity with lending or financial forms is a plus

Timeline & Terms
• Estimated duration: 3–4 weeks
• Contract: Fixed price
• All work is proprietary; NDA required

How to Apply
Please include:
1. Relevant experience or examples
2. Brief description of your technical approach
3. Estimated timeline and cost