IPolarity

Data Engineer

This role is for a Data Engineer; the contract length and pay rate are not specified. Key skills include Python, SQL, NoSQL, and experience with LLMs, Apache Spark, and cloud platforms. Industry experience in AI/ML is required.
🌎 - Country
United States
💱 - Currency
$ USD
💰 - Day rate
Unknown
🗓️ - Date
March 17, 2026
🕒 - Duration
Unknown
🏝️ - Location
Unknown
📄 - Contract
Unknown
🔒 - Security
Unknown
📍 - Location detailed
Whippany, NJ
🧠 - Skills detailed
#Scala #API (Application Programming Interface) #Data Cleaning #Langchain #SQL (Structured Query Language) #Datasets #AWS (Amazon Web Services) #Transformers #NoSQL #Cloud #ML (Machine Learning) #Apache Spark #Mathematics #Storage #Model Evaluation #Databases #Hugging Face #Azure #GCP (Google Cloud Platform) #Data Management #Kubernetes #Data Pipeline #Kafka (Apache Kafka) #PyTorch #Spark (Apache Spark) #Data Engineering #AI (Artificial Intelligence) #Python #Docker #ETL (Extract, Transform, Load) #NLP (Natural Language Processing) #TensorFlow
Role description
We are looking for a Data Engineer experienced in designing, building, and maintaining high-quality data pipelines, preprocessing workflows, and vector databases for training, fine-tuning, and deploying Large Language Models (LLMs). The role involves building and maintaining high-throughput data pipelines, infrastructure, and storage solutions that feed, train, and deploy AI/ML models, and implementing RAG (Retrieval-Augmented Generation) systems, data cleaning, and model evaluation to ensure efficient, scalable, and reliable LLM applications.

Required Skills & Qualifications
• Strong proficiency in Python is essential, along with SQL and NoSQL for data management.
• Experience with LangChain, LlamaIndex, Hugging Face Transformers, and the OpenAI API.
• Experience with Apache Spark, Kafka, or modern data stack tools.
• Knowledge of NLP techniques, word embeddings, tokenization, and vector mathematics.
• Familiarity with TensorFlow, PyTorch, or Hugging Face.
• Familiarity with cloud platforms (AWS, GCP, Azure), CI/CD, Docker, and Kubernetes.

Key Responsibilities
• Design and build robust ETL/ELT pipelines for unstructured text data, including scraping, cleaning, deduplication, and transformation for LLM training.
• Build and maintain vector search solutions (e.g., Pinecone, Milvus, Weaviate, Chroma) to store and retrieve embeddings for RAG systems.
• Prepare high-quality datasets for fine-tuning adapters (e.g., LoRA) and train LLMs using frameworks like PyTorch or TensorFlow.
• Implement Retrieval-Augmented Generation using frameworks like LangChain or LlamaIndex to connect LLMs to company data.
• Develop evaluation frameworks for model performance, testing for accuracy, hallucination, and bias, and monitor deployed models.
• Create APIs and internal web tools for data annotation, curation, and model interaction.