

riTara.ai
Project: Synthetic PDF Form Generation & Auto-Labeling (Multiple Lending PDF Forms)
⭐ - Featured Role | Apply direct with Data Freelance Hub
A remote, project-based data/ML engineer role (listed location: McLean, VA) for a 3–4 week project focused on synthetic PDF form generation and auto-labeling. Key skills include Python or Java, machine learning, document analysis, and programmatic PDF generation.
🌎 Country: United States
💱 Currency: $ USD
💰 Day rate: Unknown
🗓️ Date: December 16, 2025
🕒 Duration: 1 to 3 months
🏝️ Location: Remote
📄 Contract: Unknown
🔒 Security: Unknown
📍 Location detailed: United States
🧠 Skills detailed: #Java #Data Pipeline #Programming #Databases #Data Science #ML (Machine Learning) #Data Processing #Datasets #JSON (JavaScript Object Notation) #Documentation #Python #Computer Science
Role description
Company Description
We are a startup that organizes unstructured data across industry domains to help businesses make better-informed decisions and execute workflows.
Role Description
This is a remote, project-based role; the listed location is McLean, VA. The Synthetic PDF Form Generation & Auto-Labeling project involves creating synthetic PDF forms and developing auto-labeling methods for multiple types of lending PDF forms. Day-to-day tasks include designing and generating synthetic PDFs, implementing automated labeling, collaborating with team members to meet project milestones, building high-quality data pipelines, and testing thoroughly to ensure accuracy and efficiency.
We are seeking an experienced data / ML engineer to build a fully automated pipeline that generates synthetic PDF forms and corresponding ground-truth label files for 10 lending-related document types.
This is not a manual labeling project.
Scope
• Use the provided PDF form templates, spreadsheet-based field data, and spreadsheet-based ground-truth attributes
• Programmatically generate 200+ synthetic PDFs across 10 form types
• Automatically generate pixel-accurate label files (JSON) per document, including:
    • Field key
    • Field value
    • Page number
    • Bounding box coordinates
• Create a reusable, configuration-driven solution that supports new forms easily
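For applicants sketching a technical approach, the label output described in the scope above could look roughly like the following. This is a minimal illustration only, not the project's specification: the field names, the mapping layout, the 792 pt page height, the 144 DPI value, and the assumption that template boxes are stored in PDF points with a bottom-left origin are all hypothetical. Libraries differ in which origin they report, so the conversion should be checked against the tool actually used.

```python
import json

POINTS_PER_INCH = 72  # default PDF user-space unit


def to_pixel_bbox(bbox_pt, page_height_pt, dpi):
    """Convert [x0, y0, x1, y1] in PDF points (bottom-left origin, assumed)
    to pixel coordinates with a top-left origin at the given DPI."""
    x0, y0, x1, y1 = bbox_pt
    scale = dpi / POINTS_PER_INCH
    return [
        round(x0 * scale),
        round((page_height_pt - y1) * scale),  # top edge in image space
        round(x1 * scale),
        round((page_height_pt - y0) * scale),  # bottom edge in image space
    ]


def build_labels(field_map, row, page_height_pt=792, dpi=144):
    """Build one label record per field that has a value in `row`.

    `field_map` (hypothetical layout): {field_key: {"page": int, "bbox": [...]}}
    `row`: one spreadsheet row as {field_key: value}
    """
    labels = []
    for key, spec in field_map.items():
        if key not in row:
            continue  # no value supplied for this field, so no label
        labels.append({
            "key": key,
            "value": str(row[key]),
            "page": spec["page"],
            "bbox": to_pixel_bbox(spec["bbox"], page_height_pt, dpi),
        })
    return labels


# Hypothetical example data for illustration only
field_map = {
    "borrower_name": {"page": 1, "bbox": [72, 700, 300, 715]},
    "loan_amount": {"page": 1, "bbox": [72, 670, 200, 685]},
}
row = {"borrower_name": "Jane Doe"}
label_json = json.dumps(build_labels(field_map, row), indent=2)
```

Because the labels are derived from the same mapping that drives the PDF fill step, no manual annotation or OCR pass is needed to obtain the ground truth.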
Deliverables
• ~2000 PDF documents filled with synthetic data
• Ground-truth label files (JSON)
• Field-to-template mapping files (per form)
• Python codebase + SOP documentation
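As one possible shape for the per-form field-to-template mapping files listed above, the sketch below loads a JSON mapping and checks it before any PDFs are generated. The schema (`form_type`, `fields`, `page`, `bbox`) and the form and field names are assumptions made for illustration; the posting does not define a file format.

```python
import json

REQUIRED_KEYS = {"page", "bbox"}


def validate_mapping(mapping):
    """Return a list of human-readable problems found in a mapping dict."""
    errors = []
    for name, spec in mapping.get("fields", {}).items():
        missing = REQUIRED_KEYS - set(spec)
        if missing:
            errors.append(f"{name}: missing {sorted(missing)}")
            continue
        bbox = spec["bbox"]
        if len(bbox) != 4 or bbox[0] >= bbox[2] or bbox[1] >= bbox[3]:
            errors.append(f"{name}: bbox must be [x0, y0, x1, y1] with x0 < x1 and y0 < y1")
        if not isinstance(spec["page"], int) or spec["page"] < 1:
            errors.append(f"{name}: page must be a positive integer")
    return errors


# Hypothetical mapping file content for a single form type
EXAMPLE_MAPPING = json.loads("""
{
  "form_type": "example_loan_application",
  "fields": {
    "borrower_name": {"page": 1, "bbox": [72, 700, 300, 715]},
    "loan_amount":   {"page": 1, "bbox": [72, 670, 200, 685]}
  }
}
""")

problems = validate_mapping(EXAMPLE_MAPPING)
```

Keeping one such file per form type is one way to make the pipeline configuration-driven: supporting a new form would mean adding a mapping file instead of changing code.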
Qualifications
• Strong technical skills in PDF generation, data processing, and programming languages such as Python or Java
• Experience with machine learning, including implementing auto-labeling algorithms and training models
• Expertise in document analysis and optical character recognition (OCR) technologies, and experience working with coordinates and bounding boxes
• Experience with programmatic PDF generation (ReportLab, PyMuPDF, or similar)
• Proficiency in working with databases and managing large datasets
• Strong analytical and problem-solving skills, with attention to detail and accuracy
• Collaborative mindset and ability to work effectively in a cross-functional team environment
• Relevant educational background, such as a degree in Computer Science, Data Science, or a related field
• Experience with document formatting standards like PDF and familiarity with lending or financial forms is a plus
Timeline & Terms
• Estimated duration: 3–4 weeks
• Contract: Fixed price
• All work is proprietary; NDA required
How to Apply
Please include:
1. Relevant experience or examples
2. Brief description of your technical approach
3. Estimated timeline and cost