About
Experienced Data Engineer with a demonstrated history of solving complex data problems…
Articles by Shashank
Contributions
-
What skills will future data engineers need?
The field of Data Engineering is expanding at an unprecedented pace. Relying solely on foundational tools like Python, SQL, and Spark will no longer suffice. Below is a comprehensive list of skills essential for anyone aspiring to be an elite data engineer in today's world 🌟
1.) Foundational Skills: SQL, Python, Big Data Fundamentals
2.) Cloud: AWS, GCP, Azure
3.) Databases: PostgreSQL, MongoDB
4.) Processing: Apache Spark, Flink, Beam, Databricks
5.) Orchestration: Apache Airflow, Mage, Kestra
6.) CI/CD: Git, Jenkins, Docker
7.) Cluster Management: Kubernetes
8.) ETL Tools: Informatica, dbt, Airbyte
9.) Monitoring: Grafana, Splunk
10.) Streaming: Apache Kafka
11.) Lakes: AWS Lake Formation, Delta Lake
12.) Warehousing: Snowflake, BigQuery
Activity
-
Databricks has raised $10 billion in a funding round and is now valued at $62 billion 🤯🤯🤯 Itni funding bhi milti hai kya (Someone can really get this…
Liked by Shashank Mishra 🇮🇳
-
Databricks has raised $10 billion in a funding round and is now valued at $62 billion 🤯🤯🤯 Itni funding bhi milti hai kya (Someone can really get this…
Posted by Shashank Mishra 🇮🇳
-
10 Red Flags to Watch Out for in a Data Engineer 🚩 The role of a Data Engineer is pivotal, but not all engineers are created equal. Here are 10 red…
Liked by Shashank Mishra 🇮🇳
Experience
Education
-
Motilal Nehru National Institute Of Technology
-
Activities and Societies:
• Coordinator, Computer Club Classes, NIT Allahabad, 2016
• Dance Coordinator, SWAGAT (cultural festival) of NIT Allahabad, 2015
• Volunteer, Pahal (NGO); personally conducted 100+ classes for 150+ slum kids
-
-
Projects
-
Salesforce to Redshift Ingestion - Migration from Informatica to Native AWS
-
-> Tech Stack – Salesforce, Informatica, S3, Lambda, Glue, AppFlow, Redshift, SNS
- Crafted a generic, scalable native AWS solution for Salesforce-to-Redshift ingestion
- Moved ingestion pipelines off the third-party tool Informatica, saving its heavy license fees
- The generic framework helped other business units smoothly ingest newly onboarded Salesforce objects into the Redshift data lake
Incremental Ingestion pipeline – Employee Benefits Data
-
-> Tech Stack – Shell Scripting, AWS CLI, S3, EMR, Glue, Redshift, SNS, QuickSight, PySpark
- Built a generic, optimized ingestion pipeline for highly critical and confidential Employee Benefits data
- Designed the pipeline to handle gigabytes of daily and weekly data together for use cases such as Audit, Payroll, Reimbursement, and Education Reimbursement
- Took complete ownership and worked closely with business teams to understand the requirements and deliver enriching dashboards
Automated Alerting System for Job Monitoring
-
-> Tech Stack – Python, AWS CLI, QuickSight
- Created an automated alerting system for Redshift load metrics and job monitoring
- Saved each team member 1.5 hours/day of manual effort spent monitoring jobs and preparing the Daily Job Status
Feature Development For Telecom Data
-
-> Tech Stack – PySpark, Kedro, Azure Cloud, Databricks
- Created large-scale, optimized pipelines for telecom data using PySpark and the Kedro framework
- Worked closely with the client to gather business requirements
- Implemented business logic to prepare clean, aggregated data for Customer Churn Analysis
GG VMN migration
-
Tech Stack – PySpark, Hive, Azkaban, Jenkins
- Migrated all Facts/OLAPs written in Hive to PySpark
- Created job flows in Azkaban
Data Ingestion & Sync Process
-
Tech Stack – Python, Hive, ElasticSearch, Scala Play Framework, SBT, EMR, Lambda, DynamoDB, Azkaban, Jenkins
- Crafted data-sync logic that prioritizes datasets (High/Medium/Low tag) by criticality to meet the SLO
- Built preemption logic to prioritize highly critical datasets when multiple low-priority sync processes are running
- Designed a REST API in data ingestion for retention of GA data in order to optimize cluster space
- Added exception handling scenarios to the data-sync logic to fix multiple bugs
- Fix for missing PG data from Kafka for UMP panel: created a new pipeline to ingest missing data from HDFS to ElasticSearch in case of cluster failure
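The priority tagging and preemption described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the project's actual code: the `SyncScheduler` class, the `max_running` slot limit, and the dataset names are all hypothetical.

```python
import heapq
from itertools import count

# Lower rank = more critical. Tag names mirror the High/Medium/Low tags above.
PRIORITY = {"High": 0, "Medium": 1, "Low": 2}


class SyncScheduler:
    """Toy scheduler (hypothetical): a fixed number of slots for running syncs."""

    def __init__(self, max_running=2):
        self.max_running = max_running
        self.running = {}      # dataset name -> priority tag
        self._waiting = []     # heap of (rank, arrival order, dataset, tag)
        self._order = count()  # tie-breaker so equal ranks run first-come-first-served

    def submit(self, dataset, tag):
        """Queue a dataset sync, then start or preempt as needed."""
        heapq.heappush(self._waiting, (PRIORITY[tag], next(self._order), dataset, tag))
        self._schedule()

    def finish(self, dataset):
        """Mark a running sync as done and fill the freed slot."""
        self.running.pop(dataset, None)
        self._schedule()

    def _schedule(self):
        while self._waiting:
            rank, _, dataset, tag = self._waiting[0]
            if len(self.running) < self.max_running:
                # A slot is free: start the most critical waiting sync.
                heapq.heappop(self._waiting)
                self.running[dataset] = tag
                continue
            # All slots are busy: find the least critical running sync.
            victim = max(self.running, key=lambda d: PRIORITY[self.running[d]])
            victim_tag = self.running[victim]
            if PRIORITY[victim_tag] <= rank:
                break  # nothing running is less critical than what is waiting
            # Preempt the victim, run the critical sync, and re-queue the victim.
            heapq.heappop(self._waiting)
            del self.running[victim]
            self.running[dataset] = tag
            heapq.heappush(
                self._waiting,
                (PRIORITY[victim_tag], next(self._order), victim, victim_tag),
            )
```

Submitting two `Low` syncs followed by a `High` one with `max_running=2` pushes one of the low-priority syncs back into the waiting queue; it resumes once a slot frees up. A real implementation would also need to checkpoint or restart the preempted job, which this sketch leaves out.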
Near Real Time Data Pipeline - POC
-
Tech Stack – Java, Spark, Kafka, Datastax Cassandra, Datastax Studio, Zookeeper, Maven
- Crafted a Cassandra-based real-time ingestion pipeline for marketplace data to help the DWH team reduce request load on production MySQL. The objective was to shift business users off production and so overcome data leaks and security issues
- Interacted with different business users to learn about their use cases, ingestion tables, and PII data, and built data models accordingly for faster inserts and updates
- Set up the Datastax Studio web interface for users to query real-time data from Cassandra using LDAP authentication
Dehleez - Report Scheduling Tool
-
Tech Stack – Python, JavaScript, Django, Azkaban, Docker, Hive, Ajax, Bootstrap, REST API, DataDog
- Enhanced Paytm's proprietary report scheduling tool, used by business users working on data analysis to schedule reports by writing Hive/MySQL/Cassandra queries and to receive report output in various formats
- Diff Checker - Admins can check the difference between queries before approving reports
- Time slot picker to schedule a report - Users can see the reports scheduled for the next 4 hours from the intended schedule time and pick a slot accordingly
- Dump report output into S3 bucket - Users can dump report output into an AWS S3 bucket
- Cassandra Connector - Users can schedule reports with Cassandra query panels in addition to Hive/MySQL
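A query diff of the kind the Diff Checker item describes can be approximated with Python's standard `difflib` module. The sketch below is only an illustration; the function name `query_diff` and the `approved`/`proposed` labels are assumptions, and the actual tool may compare queries differently.

```python
import difflib


def query_diff(old_query: str, new_query: str) -> str:
    """Return a unified diff between two versions of a report query.

    An empty string means the two queries are identical line by line.
    """
    old_lines = old_query.strip().splitlines()
    new_lines = new_query.strip().splitlines()
    diff = difflib.unified_diff(
        old_lines,
        new_lines,
        fromfile="approved",   # hypothetical label for the last approved query
        tofile="proposed",     # hypothetical label for the query awaiting approval
        lineterm="",
    )
    return "\n".join(diff)


if __name__ == "__main__":
    old = "SELECT id\nFROM orders\nWHERE dt = '2020-01-01'"
    new = "SELECT id, amount\nFROM orders\nWHERE dt = '2020-01-01'"
    print(query_diff(old, new))
```

Removed lines are prefixed with `-` and added lines with `+`, so an admin can see at a glance which part of the query changed.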
Hive Query Parser
-
Tech Stack – Django, Django REST Framework, Python, NGINX
- Query Validator and Optimization Engine - Created a Django web application to parse and validate users' Hive queries. For a bad query (missing partition columns or unbalanced joins), it also suggests improvements
- PII detector - Built a Django web application to detect all running Hive queries that fetch PII data
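As an illustration of the missing-partition check mentioned above, here is a rough regex-based heuristic. It is a sketch only: the `PARTITIONS` catalog, the function names, and the suggestion wording are hypothetical, and a production validator would more likely use a proper Hive SQL parser than regular expressions.

```python
import re

# Hypothetical catalog: table name -> partition columns.
PARTITIONS = {"events": ["dt"], "orders": ["dt", "region"]}


def missing_partition_filters(query, partitions=PARTITIONS):
    """Return {table: [partition columns not mentioned in the WHERE clause]}."""
    q = query.lower()
    # Tables referenced after FROM or JOIN (a database prefix is dropped below).
    tables = re.findall(r"\b(?:from|join)\s+([a-z_][\w.]*)", q)
    match = re.search(r"\bwhere\b(.*)", q, re.S)
    where_clause = match.group(1) if match else ""

    problems = {}
    for table in tables:
        name = table.split(".")[-1]
        missing = [
            col
            for col in partitions.get(name, [])
            if not re.search(rf"\b{re.escape(col)}\b", where_clause)
        ]
        if missing:
            problems[name] = missing
    return problems


def suggestions(query, partitions=PARTITIONS):
    """Turn the findings into human-readable hints."""
    return [
        f"Add a filter on partition column(s) {', '.join(cols)} of table {table}"
        for table, cols in missing_partition_filters(query, partitions).items()
    ]
```

Calling `suggestions("SELECT * FROM events WHERE user_id = 7")` yields a hint to filter on `dt`, while a query that already filters on every partition column produces an empty list. The heuristic only checks that the column name appears somewhere after `WHERE`, so it can miss cases such as subqueries or aliased columns.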
Procurement Spend Optimizer
-
o Developed a CXO-level insights engine to manage USD 60Bn; the engine enabled cost optimization using smart categorisation, benchmarking and anomaly detection
o Crafted a Big Data based solution; organised structured & unstructured data
o Built the solution using the Hadoop ecosystem (HDFS, YARN), Spark and Python
o Built a Google translator API based solution to automate the legacy translation engine; improved record aggregation accuracy by 50% and saved the team 120 hours/month
Technologies Used: Hadoop Framework, Spark
Languages Used: Java, Python 2.7
Tools Used: Signal Hub (Opera's proprietary development framework), Signal Hub Manager (SHM)
Version Control: SVN
Trip Narrative
-
o Deployed an end-to-end solution for a leading US airline; aggregated a 360 view of the customer's engagement throughout the life-cycle of the trip
o Developed data pipelines from scratch; optimised data aggregation from 10+ independent sources and automated the ETL process to roll out the solution
o The solution powers a web application used by 1000+ CSRs and decision makers
o Built the application on RESTful APIs using the Hadoop ecosystem (HDFS, YARN), DataRush Applications (Distributed Processing Engine), SQL and Python
Technologies Used: Hadoop Framework, REST, Ingres DB, NGINX
Languages Used: Java, Python 2.7, YAML, SQL
Tools Used: Signal Hub (Opera's proprietary development framework), Signal Hub Manager (SHM)
Version Control: SVN
Honors & Awards
-
Opera Cool Ovation Award - 2017
Opera Solution
Received the Opera Cool Ovation Award 2017 for excellent contribution to the Trip Narrative project.
-
Geek Of The Month , September 2016
GeeksForGeeks
Received this honour for extraordinary contribution in article writing for GeeksForGeeks.
http://www.geeksforgeeks.org/geek-of-the-month/
Languages
-
English
Full professional proficiency
Recommendations received
2 people have recommended Shashank
More activity by Shashank
-
10 Red Flags to Watch Out for in a Data Engineer 🚩 The role of a Data Engineer is pivotal, but not all engineers are created equal. Here are 10 red…
Shared by Shashank Mishra 🇮🇳