HireTalent - Diversity Staffing & Recruiting Firm

Data Scientist

⭐ - Featured Role | Apply directly with Data Freelance Hub
This role is for a Senior Data Scientist with a contract length of "unknown," offering a pay rate of "unknown." Key skills include LLM tooling, Databricks, Python, and SQL. An advanced degree and 4+ years of experience in data science, NLP, and GenAI are required.
🌎 - Country
United States
💱 - Currency
$ USD
-
💰 - Day rate
Unknown
-
🗓️ - Date
December 19, 2025
🕒 - Duration
Unknown
-
🏝️ - Location
Unknown
-
📄 - Contract
Unknown
-
🔒 - Security
Unknown
-
📍 - Location detailed
United States
-
🧠 - Skills detailed
#Datasets #OpenSearch #Databases #HTML (Hypertext Markup Language) #Data Governance #Forecasting #Indexing #Data Processing #NLP (Natural Language Processing) #Normalization #Elasticsearch #Data Extraction #Monitoring #Observability #Spark (Apache Spark) #ETL (Extract, Transform, Load) #Metadata #Databricks #Statistics #SQL (Structured Query Language) #pydantic #ML (Machine Learning) #Computer Science #Regression #Strategy #Automation #Python #Data Enrichment #Schema Design #NumPy #Data Quality #Langchain #Libraries #Security #Pandas #Compliance #Data Science
Role description
Role Overview:
We are seeking a Senior Data Scientist to build and deploy LLM-based capabilities for working with large, diverse datasets and documents relevant to growth analytics and bid strategy. This role emphasizes ingestion, document processing, information extraction, and retrieval methods to support analytics use cases in production. Experience with modern LLM tooling and Databricks is required; hands-on experience with advanced reasoning models and agentic/orchestration frameworks is a plus.

Key Responsibilities:
• Architect, build, and refine retrieval-grounded LLM systems, including basic and advanced RAG patterns, to deliver grounded, verifiable answers and insights.
• Design robust pipelines for the ingestion, transformation, and normalization of public and internal data, including ETL, incremental processing, and data quality checks.
• Build and maintain document processing workflows across PDFs, HTML, and scanned content, including OCR, layout-aware parsing, table extraction, metadata enrichment, and document versioning.
• Develop information extraction pipelines using LLM methods and best practices, including schema design, structured outputs, validation, error handling, and accuracy evaluation.
• Own the retrieval stack end to end, including chunking strategies, embeddings, indexing, hybrid retrieval, reranking, filtering, and relevance tuning across a vector database or search platform.
• Implement web data acquisition where needed, including scraping, change detection, source quality checks, and operational safeguards such as retries and rate limiting.
• Establish evaluation and monitoring practices for retrieval and extraction quality, including golden datasets, regression testing, groundedness checks, and production observability.
• Collaborate with subject matter experts to translate business needs into practical retrieval and extraction workflows with measurable success criteria.
• Communicate complex findings, tradeoffs, and recommendations to technical and business stakeholders, supporting data-driven forecasting and strategy.
• Ensure compliance with data governance and security standards when handling sensitive data and deploying systems to production environments.

Qualifications:
• Advanced degree in Computer Science, Data Science, Statistics, Engineering, or a related quantitative field.
• Minimum of 4 years of experience in data science or applied ML/NLP, with a focus on NLP and GenAI.
• Proficiency in Python and SQL, with strong engineering practices for maintainable, testable pipelines.
• Strong experience with Databricks for data processing and pipeline development, including Spark and common Lakehouse patterns.
• Demonstrated experience building retrieval-grounded LLM systems and/or LLM-based information extraction for real-world use cases.
• Experience with document ingestion and parsing, including OCR and handling messy, semi-structured content such as PDFs, tables, forms, and web pages.
• Familiarity with vector databases and retrieval concepts, including indexing, embeddings, hybrid retrieval, reranking, and performance and cost tuning.
• Strong understanding of best practices for reasoning models and techniques that improve reliability and reduce hallucinations, including grounding and attribution.
• Excellent communication skills, with a track record of partnering with stakeholders and turning ambiguous requests into adopted solutions.

Libraries and Tools:
• Proficiency with LLM and orchestration libraries such as OpenAI, Google GenAI, LangGraph, and LangChain.
• Experience with supporting tooling commonly used in production LLM systems, for example: Pydantic for schema validation, tenacity for retries, beautifulsoup4 for HTML data extraction, and standard Python data tooling such as pandas and NumPy.
• Experience with retrieval and vector tooling, such as FAISS, Elasticsearch or OpenSearch, and vector database platforms (for example, Pinecone, Weaviate, Milvus, Chroma).

Preferred Qualifications:
• Exposure to agentic patterns and tool-calling for workflow automation.
• Experience working in regulated environments and implementing governance controls such as access control, auditability, and retention.