

HireTalent - Diversity Staffing & Recruiting Firm
Data Scientist
⭐ - Featured Role | Apply directly with Data Freelance Hub
This role is for a Senior Data Scientist with a contract length of "unknown," offering a pay rate of "unknown." Key skills include LLM tooling, Databricks, Python, and SQL. An advanced degree and 4+ years of experience in data science, NLP, and GenAI are required.
🌎 - Country
United States
💱 - Currency
$ USD
-
💰 - Day rate
Unknown
-
🗓️ - Date
December 19, 2025
🕒 - Duration
Unknown
-
🏝️ - Location
Unknown
-
📄 - Contract
Unknown
-
🔒 - Security
Unknown
-
📍 - Location detailed
United States
-
🧠 - Skills detailed
#Datasets #OpenSearch #Databases #HTML (Hypertext Markup Language) #Data Governance #Forecasting #Indexing #Data Processing #NLP (Natural Language Processing) #Normalization #Elasticsearch #Data Extraction #Monitoring #Observability #Spark (Apache Spark) #ETL (Extract, Transform, Load) #Metadata #Databricks #Statistics #SQL (Structured Query Language) #pydantic #ML (Machine Learning) #Computer Science #Regression #Strategy #Automation #Python #Data Enrichment #Schema Design #NumPy #Data Quality #Langchain #Libraries #Security #Pandas #Compliance #Data Science
Role description
Role Overview:
We are seeking a Senior Data Scientist to build and deploy LLM-based capabilities for working with large, diverse datasets and documents relevant to growth analytics & bid strategy. This role emphasizes ingestion, document processing, information extraction, and retrieval methods to support analytics use cases in production. Experience with modern LLM tooling and Databricks is required; hands-on experience with advanced reasoning models & agentic/orchestration frameworks is a plus.
Key Responsibilities:
• Architect, build, and refine retrieval-grounded LLM systems, including basic and advanced RAG patterns, to deliver grounded, verifiable answers and insights.
• Design robust pipelines for ingestion, transformation, and normalization of public and internal data, including ETL, incremental processing, and data quality checks.
• Build and maintain document processing workflows across PDFs, HTML, and scanned content, including OCR, layout-aware parsing, table extraction, metadata enrichment, and document versioning.
• Develop information extraction pipelines using LLM methods and best practices, including schema design, structured outputs, validation, error handling, and accuracy evaluation.
• Own the retrieval stack end-to-end, including chunking strategies, embeddings, indexing, hybrid retrieval, reranking, filtering, and relevance tuning across a vector database or search platform.
• Implement web data acquisition where needed, including scraping, change detection, source quality checks, and operational safeguards like retries and rate limiting.
• Establish evaluation and monitoring practices for retrieval and extraction quality, including golden datasets, regression testing, groundedness checks, and production observability.
• Collaborate with subject matter experts to translate business needs into practical retrieval and extraction workflows and measurable success criteria.
• Communicate complex findings, tradeoffs, and recommendations to technical and business stakeholders, supporting data-driven forecasting and strategy.
• Ensure compliance with data governance and security standards when handling sensitive data and deploying systems to production environments.
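The chunking, scoring, and top-k steps of the retrieval responsibilities above can be sketched in miniature. This is a toy illustration only: a token-overlap score stands in for embedding similarity so the example runs without any external services, and all names and the sample document are invented for the sketch, not part of this role's actual stack.

```python
def chunk_text(text: str, size: int = 40, overlap: int = 10) -> list[str]:
    """Split text into overlapping word-window chunks."""
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size])
            for i in range(0, max(len(words) - overlap, 1), step)]

def score(query: str, chunk: str) -> float:
    """Stand-in relevance score: Jaccard overlap of lowercased tokens.
    A production system would compare embedding vectors instead."""
    q, c = set(query.lower().split()), set(chunk.lower().split())
    return len(q & c) / len(q | c) if q | c else 0.0

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k highest-scoring chunks (the top-k retrieval step)."""
    return sorted(chunks, key=lambda ch: score(query, ch), reverse=True)[:k]

# Invented sample document for the sketch.
doc = ("The bid deadline is March 3. Proposals must include pricing. "
       "Late submissions are rejected. Pricing tables use USD.")
chunks = chunk_text(doc, size=8, overlap=2)
top = retrieve("pricing deadline", chunks, k=2)
```

A real pipeline would replace `score` with embeddings plus a vector index, and layer reranking and filtering on top of this same retrieve-then-rank shape.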
Qualifications:
• Advanced degree in Computer Science, Data Science, Statistics, Engineering, or a related quantitative field.
• Minimum of 4 years of experience in data science or applied ML/NLP with a focus on NLP & GenAI.
• Proficiency in Python and SQL, with strong engineering practices for maintainable, testable pipelines.
• Strong experience with Databricks for data processing and pipeline development, including Spark and common Lakehouse patterns.
• Demonstrated experience building retrieval-grounded LLM systems and/or LLM-based information extraction for real-world use cases.
• Experience with document ingestion and parsing, including OCR and handling messy, semi-structured content such as PDFs, tables, forms, and web pages.
• Familiarity with vector databases and retrieval concepts, including indexing, embeddings, hybrid retrieval, reranking, and performance and cost tuning.
• Strong understanding of best practices for reasoning models and techniques that improve reliability and reduce hallucinations, including grounding and attribution.
• Excellent communication skills, with a track record of partnering with stakeholders and turning ambiguous requests into adopted solutions.
Libraries and Tools:
• Proficiency with LLM and orchestration libraries such as OpenAI, Google GenAI, LangGraph, and LangChain.
• Experience with supporting tooling commonly used in production LLM systems, for example: Pydantic for schema validation, tenacity for retries, beautifulsoup4 for HTML data extraction, and standard Python data tooling such as pandas and NumPy.
• Experience with retrieval and vector tooling, such as FAISS, Elasticsearch or OpenSearch, and vector database platforms (for example, Pinecone, Weaviate, Milvus, Chroma).
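The validation-and-retry pattern named above (Pydantic for schema validation, tenacity for retries) can be sketched with the standard library alone. The field names, the fake model responses, and the helper functions below are all illustrative; production code would use Pydantic models and tenacity's decorators rather than this hand-rolled version.

```python
import json
import time

# Minimal schema: required fields and their expected types (illustrative).
REQUIRED = {"vendor": str, "amount": float}

def validate_extraction(raw: str) -> dict:
    """Parse model output and enforce the schema, raising on any mismatch."""
    data = json.loads(raw)
    for field, ftype in REQUIRED.items():
        if field not in data:
            raise ValueError(f"missing field: {field}")
        if not isinstance(data[field], ftype):
            raise ValueError(f"bad type for {field}")
    return data

def with_retries(fn, attempts: int = 3, base_delay: float = 0.01):
    """Call fn(), retrying on ValueError with exponential backoff."""
    for i in range(attempts):
        try:
            return fn()
        except ValueError:
            if i == attempts - 1:
                raise
            time.sleep(base_delay * 2 ** i)

# Simulated model responses: first malformed (missing "amount"), then valid.
responses = iter(['{"vendor": "Acme"}',
                  '{"vendor": "Acme", "amount": 1200.5}'])
result = with_retries(lambda: validate_extraction(next(responses)))
```

The same shape generalizes: validation failures become retriable errors, and the retry wrapper is where rate limiting and jitter would also live.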
Preferred Qualifications:
• Exposure to agentic patterns and tool-calling for workflow automation.
• Experience working in regulated environments and implementing governance controls such as access control, auditability, and retention.
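The tool-calling pattern mentioned above reduces to a small dispatch loop: the model emits a JSON "call" naming a tool and its arguments, and a harness routes it to a registered function. The tool names and the fake model message here are invented for illustration; real frameworks such as LangGraph add state, loops, and error handling around this core.

```python
import json

# Registry of callable tools (illustrative names only).
TOOLS = {
    "add": lambda a, b: a + b,
    "upper": lambda s: s.upper(),
}

def dispatch(model_message: str):
    """Route a model-emitted JSON tool call to the matching function."""
    call = json.loads(model_message)
    tool = TOOLS[call["tool"]]
    return tool(**call["args"])

# A simulated model message requesting a tool invocation.
out = dispatch('{"tool": "add", "args": {"a": 2, "b": 3}}')
```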