Strong proficiency in Python and PySpark / Apache Spark
Solid understanding of RDDs, Spark SQL, and Spark performance tuning (a short sketch follows this list)
Experience in writing optimized ETL/ELT pipelines
Experience with SQL and relational databases (PostgreSQL, MySQL, Oracle, etc.)
Exposure to Big Data ecosystems (Hadoop, Hive, HDFS)
Familiarity with batch and streaming data processing
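As an illustration of the Spark SQL and performance-tuning items above, here is a minimal, self-contained PySpark sketch showing a Spark SQL query and one common tuning technique (a broadcast join hint). The datasets, names, and app name are made up for the example, not part of the role.

```python
# Illustrative only: tiny in-memory datasets stand in for real tables.
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("skills_demo").getOrCreate()  # placeholder name

events = spark.createDataFrame(
    [(1, "click"), (2, "view"), (1, "view")], ["user_id", "event"]
)
users = spark.createDataFrame([(1, "IN"), (2, "US")], ["user_id", "country"])

# Spark SQL: register a temp view and query it declaratively.
events.createOrReplaceTempView("events")
counts = spark.sql("SELECT user_id, COUNT(*) AS n FROM events GROUP BY user_id")

# Performance tuning: hint a broadcast join so the small dimension table
# is shipped to executors instead of triggering a full shuffle join.
joined = counts.join(broadcast(users), "user_id")
joined.show()
```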
Good to Have
Exposure to a cloud platform: AWS / Azure / GCP (preferred)
AWS services such as S3, EMR, Glue, and Redshift
Version control using Git
Experience with CI/CD pipelines
Basic familiarity with Docker and workflow schedulers (Airflow preferred; a minimal DAG sketch follows this list)
Knowledge of Databricks
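Since Airflow is named as the preferred scheduler, the sketch below shows a minimal Airflow DAG that submits a PySpark job. It assumes Airflow 2.4+ (for the `schedule` argument); the DAG id, schedule, and script path are hypothetical placeholders.

```python
# Minimal Airflow DAG sketch (assumes Airflow 2.4+). All identifiers and
# paths below are hypothetical placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="orders_daily_etl",        # placeholder DAG id
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    # Submit the PySpark job; the script path is a placeholder.
    run_etl = BashOperator(
        task_id="run_etl",
        bash_command="spark-submit /opt/jobs/orders_daily_etl.py",
    )
```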
Responsibilities:
Design, develop, and maintain data pipelines using PySpark and Python (a minimal pipeline sketch follows this list).
Process and transform large structured and unstructured datasets in distributed environments.
Optimize Spark jobs for performance, scalability, and reliability.
Develop reusable data transformation frameworks and utilities.
Integrate data from multiple sources including relational, NoSQL, and streaming systems.
Perform data quality checks, validations, and error handling.
Collaborate with data analysts, data scientists, and upstream/downstream teams.
Support deployment and monitoring of data pipelines in production environments.
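The sketch below illustrates the kind of pipeline these responsibilities describe: extract from a relational source, transform, run a basic data quality check, and load to distributed storage. It is a minimal sketch under stated assumptions, not a prescribed implementation; the JDBC settings, table names, and output path are hypothetical, and the PostgreSQL JDBC driver is assumed to be on the Spark classpath.

```python
# Minimal PySpark ETL sketch. All paths, table names, and JDBC settings
# are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("orders_daily_etl").getOrCreate()

# Extract: read a source table from PostgreSQL over JDBC (assumed source;
# requires the PostgreSQL JDBC driver on the classpath).
orders = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://db-host:5432/sales")  # placeholder
    .option("dbtable", "public.orders")                     # placeholder
    .option("user", "etl_user")                             # placeholder
    .option("password", "***")                              # placeholder
    .load()
)

# Transform: keep valid rows and aggregate revenue per day.
daily_revenue = (
    orders
    .filter(F.col("amount") > 0)
    .groupBy(F.to_date("created_at").alias("order_date"))
    .agg(F.sum("amount").alias("revenue"))
)

# Data quality check: fail fast if the transformation produced no rows.
if daily_revenue.limit(1).count() == 0:
    raise ValueError("Data quality check failed: no rows after transformation")

# Load: write partitioned Parquet to distributed storage (path is a placeholder).
daily_revenue.write.mode("overwrite").partitionBy("order_date").parquet(
    "hdfs:///warehouse/daily_revenue"
)
```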
Mandatory Competencies
Big Data: PySpark, Hive, Hadoop, HDFS
Data Science and Machine Learning: Apache Spark
Programming Language: Python, Apache Airflow
Database: PostgreSQL; Database Programming: SQL
Behavioral: Communication and collaboration
Job Classification
Industry: IT Services & Consulting
Functional Area / Department: Engineering - Software & QA
Role Category: Software Development
Role: Data Platform Engineer
Employment Type: Full time