PySpark Developer (Big Data Engineer)
Job Description:
Job Summary: We are seeking a PySpark Developer (with a Python migration focus) to design, develop, and optimize big data solutions, with a critical focus on migrating and modernizing existing Python data processing codebases within a Health Insurance Company.
The PySpark Developer will be a key member of the Data Engineering team, responsible for leveraging PySpark and Apache Spark's distributed computing power to build scalable and high-performance data pipelines. A primary focus will be on migrating existing, complex Python scripts and data logic (e.g., Pandas, proprietary ETL tools) to the PySpark framework to enhance performance, scalability, and integration with the cloud-based data ecosystem. This role is vital for modernizing our data platform, handling massive volumes of sensitive health and claims data, and ensuring compliance with industry standards like HIPAA.
Key Responsibilities
PySpark Development & Data Engineering
- Design and Develop Data Pipelines: Create, maintain, and optimize scalable Extract, Transform, Load (ETL) and ELT pipelines using PySpark on a distributed computing environment (e.g., Databricks, AWS EMR, Azure Synapse).
- Code Migration and Modernization: Lead the effort to re-engineer existing Python-based data processes, functions, and analytical logic (including legacy systems or complex Pandas transformations) into efficient and performant PySpark code (a brief sketch follows this list).
- Performance Tuning: Profile, optimize, and fine-tune PySpark jobs for maximum speed and efficiency, focusing on techniques like partitioning, caching, broadcast variables, and query optimization to handle terabyte-scale healthcare datasets.
- Data Quality and Governance: Implement robust data validation, cleansing, and monitoring procedures within PySpark jobs to ensure the highest quality and integrity of claims, member, and provider data.
- Data Modeling: Collaborate with Data Architects to implement and optimize data models (Star, Snowflake, Data Vault) using technologies like Delta Lake or Parquet for the Health Insurance Data Lake/Warehouse.
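For illustration only, not a formal requirement: a minimal sketch of the kind of Pandas-to-PySpark migration and job tuning described above, assuming hypothetical table names, column names, and storage paths.

```python
# Minimal, hypothetical sketch: migrating a Pandas aggregation to PySpark,
# with a broadcast join and partitioned Parquet output as tuning examples.
# All names and paths are illustrative placeholders.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("claims_migration_sketch").getOrCreate()

# Legacy single-machine Pandas logic (memory-bound):
#   claims = pd.read_csv("claims.csv")
#   totals = claims.groupby("member_id")["paid_amount"].sum()

# Distributed PySpark equivalent reading columnar Parquet:
claims = spark.read.parquet("s3://example-bucket/claims/")      # large fact table
members = spark.read.parquet("s3://example-bucket/members/")    # small dimension table

totals = (
    claims
    .join(F.broadcast(members), "member_id")     # broadcast the small table to avoid a shuffle
    .groupBy("member_id", "plan_code")
    .agg(F.sum("paid_amount").alias("total_paid"))
)

# Partition the curated output to support downstream query performance.
totals.write.mode("overwrite").partitionBy("plan_code").parquet(
    "s3://example-bucket/curated/claim_totals/"
)
```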
Health Insurance-Specific Focus & Compliance
- Sensitive Data Handling: Implement security and privacy controls within the data pipelines to ensure strict compliance with health data regulations, including HIPAA and other privacy laws.
- Business Logic Implementation: Translate complex health insurance business rules (e.g., claim adjudication logic, risk adjustment, quality metrics, payment calculations) into accurate and scalable PySpark transformations.
- Cross-Functional Collaboration: Work closely with Actuarial, Data Science, and Business Intelligence teams to understand their data needs, deliver curated datasets, and support advanced analytics/Machine Learning initiatives.
Technology & Operations
- SQL Proficiency: Utilize advanced SQL and Spark SQL for complex data querying, manipulation, and analysis (a brief example follows this list).
- Cloud Integration: Develop and deploy solutions leveraging cloud services (AWS, Azure, or GCP) such as S3/ADLS/GCS for storage and EMR/Databricks/Synapse for processing.
- DevOps & CI/CD: Integrate PySpark applications into CI/CD pipelines (Git, Jenkins, GitLab CI, Azure DevOps) for automated testing, deployment, and operational monitoring.
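For illustration only: a short Spark SQL sketch of the kind of querying referenced above; the table, column, and status values are hypothetical placeholders.

```python
# Hypothetical Spark SQL sketch; names and values are illustrative only.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("claims_sql_sketch").getOrCreate()

# Register a curated dataset as a temporary view for SQL access.
spark.read.parquet("s3://example-bucket/curated/claims/").createOrReplaceTempView("claims")

monthly_paid = spark.sql("""
    SELECT member_id,
           date_trunc('month', service_date) AS service_month,
           SUM(paid_amount)                  AS total_paid
    FROM claims
    WHERE claim_status = 'ADJUDICATED'
    GROUP BY member_id, date_trunc('month', service_date)
""")

monthly_paid.show(10)
```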
Requirements:
Technical Skills
- Deep PySpark Expertise: Strong, hands-on experience (3+ years for mid-level, 5+ years for senior roles) with PySpark, Spark RDDs, DataFrames, Spark SQL, and optimizing Spark jobs.
- Core Programming: Advanced proficiency in Python and its data processing libraries.
- SQL & Data Warehousing: Expert-level SQL skills and solid understanding of data warehousing concepts (ETL/ELT).
- Big Data Ecosystem: Experience with distributed file systems (HDFS, S3/ADLS) and data storage formats like Parquet and Delta Lake.
- Cloud Computing: Hands-on experience with at least one major cloud platform (AWS, Azure, or GCP).
- Version Control: Proficiency with Git.
Education & Industry Experience
- Bachelor's degree in Computer Science, Data Engineering, or a related quantitative field.
- Prior experience in the Health Insurance or highly regulated Financial Services industries is highly preferred.
- Demonstrated experience in migrating legacy code (especially Python/Pandas/SAS) to a modern, distributed framework like PySpark is a significant advantage.
Desired Soft Skills
- Excellent problem-solving and debugging skills in a distributed environment.
- Strong communication skills to articulate complex technical concepts to non-technical stakeholders.
- Proven ability to work independently and collaboratively in an Agile/Scrum team structure.
Company Profile
One of the leading --- --- --- of India.
Apply Now
- Interested candidates are requested to apply for this job.
- Recruiters will evaluate your candidature and get in touch with you.