PySpark Developer (Big Data Engineer)
Job Description:
Job Summary: We are seeking a PySpark Developer (with a Python migration focus) to design, develop, and optimize big data solutions, with a critical focus on migrating and modernizing existing Python data processing codebases within a Health Insurance Company.
The PySpark Developer will be a key member of the Data Engineering team, responsible for leveraging PySpark and Apache Spark's distributed computing power to build scalable and high-performance data pipelines. A primary focus will be on migrating existing, complex Python scripts and data logic (e.g., Pandas, proprietary ETL tools) to the PySpark framework to enhance performance, scalability, and integration with the cloud-based data ecosystem. This role is vital for modernizing our data platform, handling massive volumes of sensitive health and claims data, and ensuring compliance with industry standards like HIPAA.
Key Responsibilities
PySpark Development & Data Engineering
- Design and Develop Data Pipelines: Create, maintain, and optimize scalable Extract, Transform, Load (ETL) and ELT pipelines using PySpark on a distributed computing environment (e.g., Databricks, AWS EMR, Azure Synapse).
- Code Migration and Modernization: Lead the effort to re-engineer existing Python-based data processes, functions, and analytical logic (including legacy systems or complex Pandas transformations) into efficient and performant PySpark code (a brief sketch follows this list).
- Performance Tuning: Profile, optimize, and fine-tune PySpark jobs for maximum speed and efficiency, focusing on techniques like partitioning, caching, broadcast variables, and query optimization to handle terabyte-scale healthcare datasets.
- Data Quality and Governance: Implement robust data validation, cleansing, and monitoring procedures within PySpark jobs to ensure the highest quality and integrity of claims, member, and provider data.
- Data Modeling: Collaborate with Data Architects to implement and optimize data models (Star, Snowflake, Data Vault) using technologies like Delta Lake or Parquet for the Health Insurance Data Lake/Warehouse.
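For illustration only, not a formal requirement: a minimal sketch of the kind of Pandas-to-PySpark migration and job tuning described above, assuming hypothetical table names, column names, and storage paths.

```python
# Minimal, hypothetical sketch: migrating a Pandas aggregation to PySpark,
# with a broadcast join and partitioned Parquet output as tuning examples.
# All names and paths are illustrative placeholders.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("claims_migration_sketch").getOrCreate()

# Legacy single-machine Pandas logic (memory-bound):
#   claims = pd.read_csv("claims.csv")
#   totals = claims.groupby("member_id")["paid_amount"].sum()

# Distributed PySpark equivalent reading columnar Parquet:
claims = spark.read.parquet("s3://example-bucket/claims/")      # large fact table
members = spark.read.parquet("s3://example-bucket/members/")    # small dimension table

totals = (
    claims
    .join(F.broadcast(members), "member_id")     # broadcast the small table to avoid a shuffle
    .groupBy("member_id", "plan_code")
    .agg(F.sum("paid_amount").alias("total_paid"))
)

# Partition the curated output to support downstream query performance.
totals.write.mode("overwrite").partitionBy("plan_code").parquet(
    "s3://example-bucket/curated/claim_totals/"
)
```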
Health Insurance-Specific Focus & Compliance
- Sensitive Data Handling: Implement security and privacy controls within the data pipelines to ensure strict compliance with health data regulations, including HIPAA and other privacy laws.
- Business Logic Implementation: Translate complex health insurance business rules (e.g., claim adjudication logic, risk adjustment, quality metrics, payment calculations) into accurate and scalable PySpark transformations.
- Cross-Functional Collaboration: Work closely with Actuarial, Data Science, and Business Intelligence teams to understand their data needs, deliver curated datasets, and support advanced analytics/Machine Learning initiatives.
Technology & Operations
- SQL Proficiency: Utilize advanced SQL and Spark SQL for complex data querying, manipulation, and analysis (a brief example follows this list).
- Cloud Integration: Develop and deploy solutions leveraging cloud services (AWS, Azure, or GCP) such as S3/ADLS/GCS for storage and EMR/Databricks/Synapse for processing.
- DevOps & CI/CD: Integrate PySpark applications into CI/CD pipelines (Git, Jenkins, GitLab CI, Azure DevOps) for automated testing, deployment, and operational monitoring.
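For illustration only: a short Spark SQL sketch of the kind of querying referenced above; the table, column, and status values are hypothetical placeholders.

```python
# Hypothetical Spark SQL sketch; names and values are illustrative only.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("claims_sql_sketch").getOrCreate()

# Register a curated dataset as a temporary view for SQL access.
spark.read.parquet("s3://example-bucket/curated/claims/").createOrReplaceTempView("claims")

monthly_paid = spark.sql("""
    SELECT member_id,
           date_trunc('month', service_date) AS service_month,
           SUM(paid_amount)                  AS total_paid
    FROM claims
    WHERE claim_status = 'ADJUDICATED'
    GROUP BY member_id, date_trunc('month', service_date)
""")

monthly_paid.show(10)
```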
Requirements:
Technical Skills
- Deep PySpark Expertise: Strong, hands-on experience (3+ years for mid-level, 5+ years for senior roles) with PySpark, Spark RDDs, DataFrames, Spark SQL, and optimizing Spark jobs.
- Core Programming: Advanced proficiency in Python and its data processing libraries.
- SQL & Data Warehousing: Expert-level SQL skills and solid understanding of data warehousing concepts (ETL/ELT).
- Big Data Ecosystem: Experience with distributed file systems (HDFS, S3/ADLS) and data storage formats like Parquet and Delta Lake.
- Cloud Computing: Hands-on experience with at least one major cloud platform (AWS, Azure, or GCP).
- Version Control: Proficiency with Git.
Education & Industry Experience
- Bachelor's degree in Computer Science, Data Engineering, or a related quantitative field.
- Prior experience in the Health Insurance or highly regulated Financial Services industries is highly preferred.
- Demonstrated experience in migrating legacy code (especially Python/Pandas/SAS) to a modern, distributed framework like PySpark is a significant advantage.
Desired Soft Skills
- Excellent problem-solving and debugging skills in a distributed environment.
- Strong communication skills to articulate complex technical concepts to non-technical stakeholders.
- Proven ability to work independently and collaboratively in an Agile/Scrum team structure.
Company Profile
One of the leading --- --- --- of India.
Apply Now
- Interested candidates are requested to apply for this job.
- Recruiters will evaluate your candidature and get in touch with you.