Senior Data Engineer - RIT

Randstad

  • Shanghai
  • RMB¥360,000-480,000 per year
  • Permanent
  • Full-time
  • 24 days ago
Job Overview
about the company.
Internet
about the team.
Data...
about the job.
  • Data Pipeline Development for LLMs: Design, develop, and maintain highly scalable, reliable, and efficient data pipelines (ETL/ELT) for ingesting, transforming, and loading diverse datasets critical for LLM pre-training, fine-tuning, and evaluation. This includes structured, semi-structured, and unstructured text data.
  • High-Quality Dataset Creation & Curation:
    • Implement advanced techniques for data cleaning and preprocessing, including deduplication, noise reduction, PII masking, tokenization, and formatting of large text corpora.
    • Explore and implement methods for expanding and enriching datasets for LLM training, such as data augmentation and synthesis.
    • Establish and enforce rigorous data quality standards, implement automated data validation checks, and ensure data privacy and security compliance (e.g., GDPR, CCPA).
  • Data Job Management:
    • Establish robust systems for data versioning, lineage tracking, and reproducibility of datasets used across the LLM development lifecycle.
    • Identify and resolve data-related performance bottlenecks within data pipelines, optimizing data storage, retrieval, and processing for efficiency and cost-effectiveness.
  • Data Infrastructure & Orchestration:
    • Build and maintain scalable data warehouses and data lakes specifically designed for LLM data on both on-premise and public cloud environments.
    • Implement and manage data orchestration tools (e.g., Apache Airflow, Prefect, Dagster) to automate and manage complex data workflows for LLM dataset preparation.
skills and experience required.
  • Bachelor's or Master's degree in Computer Science, Data Science, Engineering, or a related quantitative field, with 3+ years of professional experience in data engineering and a significant focus on building and managing data pipelines for large-scale machine learning or data science initiatives, especially those involving large text/image/voice datasets.
  • Direct experience with data engineering specifically for Large Language Models (LLMs), including pre-training, fine-tuning, and evaluation datasets.
  • Familiarity with common challenges and techniques for preprocessing massive text corpora (e.g., handling noise, deduplication, PII detection/masking, tokenization at scale).
  • Experience with data versioning and lineage tools/platforms (e.g., DVC, Pachyderm, LakeFS, or data versioning features within MLOps platforms like MLflow).
  • Familiarity with deep learning frameworks (e.g., PyTorch, TensorFlow, JAX) from a data loading and preparation perspective.
  • Experience designing and implementing data annotation workflows and pipelines.
  • Strong proficiency in Python, and extensive experience with its data ecosystem.
  • Proficiency in SQL, and a good understanding of data warehousing concepts, data modeling, and schema design.
