“Ajay has excellent technical and business acumen and a strong data management toolkit. His thorough, quality-driven work always leads to the right outcomes; he is a self-learner and adapts well in complex situations.”
About
As a Lead Data Engineer with 7+ years of experience, I tackle the complexities of…
Contributions
Activity
-
🚀 Data Engineer Interview scenario - Slow running jobs As a data engineer we all have encountered this situation spark job is running on massive…
Liked by Ajay Kadiyala
Experience
Education
-
Siddhartha Institute of Engineering & Technology.
-
Activities and Societies: Qualified for a project at the Central Tool Design organization, Hyderabad. Represented district-level cricket and kabaddi teams. Volunteered in social activities through the NGC (National Green Corps).
-
-
Activities and Societies: Volunteered with and was certified by the Indian Armed Forces. Proficient at volleyball. Certified intern at the National Atmospheric Research Laboratory (ISRO-affiliated).
Licenses & Certifications
Projects
-
Log Analytics Project with Spark Streaming and Kafka
-
What is Log Processing?
Log analysis is the process of evaluating and interpreting computer-generated records known as logs. A wide range of programmable technologies, including networking devices, operating systems, applications, and more, produce logs. A log is a chronological collection of messages that describe what is going on in a system, and log analysis is the technique of evaluating and interpreting these messages to gain insight into a system's underlying behaviour. Web server log analysis can offer important insights into everything from security to customer service to SEO. The information collected in web server logs can help you with:
Network troubleshooting efforts
Development and quality assurance
Identifying and understanding security issues
Customer service
Maintaining compliance with both government and corporate policies
The common log-file format is as follows:
remotehost rfc931 authuser [date] "request" status bytes
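As a quick illustration of this format, here is a small Python sketch that parses one Common Log Format line with a regular expression (the sample line is made up):

```python
import re

# Common Log Format: remotehost rfc931 authuser [date] "request" status bytes
CLF_PATTERN = re.compile(
    r'(?P<remotehost>\S+) (?P<rfc931>\S+) (?P<authuser>\S+) '
    r'\[(?P<date>[^\]]+)\] "(?P<request>[^"]*)" (?P<status>\d{3}) (?P<bytes>\S+)'
)

sample = '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326'

match = CLF_PATTERN.match(sample)
if match:
    fields = match.groupdict()
    # A "-" in the bytes field means the byte count is unknown in CLF
    fields["bytes"] = 0 if fields["bytes"] == "-" else int(fields["bytes"])
    print(fields)
```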
Key Takeaways:
● Understanding the project and how to use AWS EC2 Instance
● Understanding the basics of Containers, log analysis, and their application & Port Forwarding
● Visualizing the complete Architecture of the system
● Introduction to Docker
● Usage of docker-compose and starting all tools
● Exploring dataset and common log format
● Understanding Lambda Architecture.
● Installing NiFi and using it for data ingestion
● Installing Kafka and using it for creating topics
● Publishing logs using NiFi
● Integration of NiFi and Kafka
● Installing Spark and using it for data processing and cleaning
● Integration of Kafka and Spark
● Reading data from Kafka via the Spark Structured Streaming API (sketched after this list)
● Installing and creating namespace and table in Cassandra
● Integration of Spark and Cassandra
● Continuously loading data in Cassandra for aggregated results.
● Integrating Cassandra and Plotly and Dash
● Displaying live-stream, hourly, and daily results using Python Plotly and Dash -
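A minimal, hedged sketch of the Kafka-to-Cassandra step referenced above, assuming a local Kafka broker, a topic named weblogs, a Cassandra keyspace and table with a matching schema, and the spark-cassandra-connector package on the classpath (all names are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("log-stream").getOrCreate()

# Read raw log lines from Kafka (broker address and topic are assumptions)
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "localhost:9092")
       .option("subscribe", "weblogs")
       .load())

lines = raw.selectExpr("CAST(value AS STRING) AS line")
# Parsing/aggregation of the CLF fields would go here (see the regex sketch above).

def write_to_cassandra(batch_df, batch_id):
    # Requires the spark-cassandra-connector; keyspace/table names are made up
    (batch_df.write
     .format("org.apache.spark.sql.cassandra")
     .options(keyspace="logs", table="raw_lines")
     .mode("append")
     .save())

query = (lines.writeStream
         .foreachBatch(write_to_cassandra)
         .option("checkpointLocation", "/tmp/checkpoints/weblogs")
         .start())
query.awaitTermination()
```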
Retail Analytics Project Example using Sqoop, HDFS, and Hive
-
Business Overview:
Retail analytics is the process of delivering analytical data on inventory levels, supply chain movement, customer demand, sales, and other important factors for marketing and procurement choices. Demand and supply data analytics may be utilized to manage procurement levels as well as make marketing decisions. Retail analytics provides us with precise consumer insights and insights into the organization's business and procedures, as well as the scope and need for development.
Companies may use retail analytics to strengthen their marketing strategies by better grasping individual preferences and gaining more detailed data. They may design strategies that focus on people and achieve a higher success rate by combining demographic data.
Here, we will be using Walmart store sales data to perform analysis and answer the following questions:
Which stores have the minimum and maximum sales?
Which store has the maximum standard deviation in sales?
Which store(s) had good quarterly growth in Q3 2012?
Find the holidays with higher sales than the mean sales of the non-holiday season, across all stores (a rough PySpark sketch of the first two questions follows).
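The project itself answers these with Hive queries on data imported via Sqoop; purely as an illustration, an equivalent aggregation in PySpark (the file name and column names such as store and weekly_sales are assumptions about the Walmart dataset):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("walmart-sales").getOrCreate()

# Hypothetical schema: store, date, weekly_sales, holiday_flag
sales = spark.read.csv("walmart_sales.csv", header=True, inferSchema=True)

# Total sales and standard deviation per store (questions 1 and 2)
per_store = (sales.groupBy("store")
             .agg(F.sum("weekly_sales").alias("total_sales"),
                  F.stddev("weekly_sales").alias("sales_stddev")))

per_store.orderBy(F.desc("total_sales")).show(1)   # store with maximum sales
per_store.orderBy("total_sales").show(1)           # store with minimum sales
per_store.orderBy(F.desc("sales_stddev")).show(1)  # store with maximum std deviation
```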
Approach
● Containers for all the services are spun up using Docker.
● Setup for MySQL is performed for Table creation using the dataset.
● Data is imported using Sqoop into Hive.
● Data transformation is performed for analysis and reporting.
Key Takeaways
● Understanding the project and how to use AWS EC2 Instance
● Introduction to Docker
● Visualizing the complete Architecture of the system
● Usage of docker-compose and starting all tools
● Understanding HDFS and various file formats
● Understanding the use of different HDFS commands
● Understanding Sqoop Jobs and valuable tools
● Introduction to Hive architecture
● Understanding Hive Joins and Views
● Performing various transformation tasks in Hive
● Setting up MySQL for table creation
● Migrating from RDBMS to Hive warehouse
Tech Stack
➔Language: SQL, Bash
➔Services: AWS EC2, Docker, MySQL, Sqoop, Hive, HDFS -
Data Processing and Transformation in Hive using Azure VM
-
Business Overview
Big Data is a collection of massive quantities of semi-structured and unstructured data created by a heterogeneous group of high-performance devices spanning from social networks to scientific computing applications. Companies have the ability to collect massive amounts of data, and they must ensure that the data is in highly usable condition by the time it reaches data scientists and analysts. The profession of data engineering involves designing and constructing systems for acquiring, storing, and analyzing vast volumes of data. It is a broad field with applications in nearly every industry.
Apache Hadoop is a Big Data solution that allows for the distributed processing of enormous data volumes across computer clusters by employing basic programming techniques. It is meant to scale from a single server to thousands of computers, each of which will provide local computation and storage.
Apache Hive is a fault-tolerant distributed data warehouse system that allows large-scale analytics. Hive allows users to access, write, and manage petabytes of data using SQL. It is built on Apache Hadoop, and as a result, it is tightly integrated with Hadoop and is designed to manage petabytes of data quickly. Hive is distinguished by its ability to query enormous datasets utilizing a SQL-like interface and an Apache Tez, MapReduce, or Spark engine.
Dataset Description
In this project, we will use the Airlines dataset to demonstrate the issues related to massive amounts of data and how various Hive components can be used to tackle them. Following are the files used in this project, along with a few of their fields (a hedged table-definition sketch follows this list):
airlines.csv - IATA_code, airport_name, city, state, country
carrier.csv - code, description
plane-data.csv - tail_number, type, manufacturer, model, engine_type
Flights data (yearly) - flight_num, departure, arrival, origin, destination, distance
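The project itself is written in HQL; as a hedged analogue issued through spark.sql (the HDFS location, header handling, and column types are assumptions about the dataset layout), the external table for airlines.csv could be declared roughly like this:

```python
from pyspark.sql import SparkSession

# enableHiveSupport assumes Spark is configured against a Hive metastore
spark = SparkSession.builder.appName("airlines-hive").enableHiveSupport().getOrCreate()

spark.sql("""
  CREATE EXTERNAL TABLE IF NOT EXISTS airlines (
    iata_code STRING,
    airport_name STRING,
    city STRING,
    state STRING,
    country STRING
  )
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  STORED AS TEXTFILE
  LOCATION '/data/airlines/airlines_csv'
  TBLPROPERTIES ('skip.header.line.count'='1')
""")

spark.sql("SELECT country, COUNT(*) AS airports FROM airlines GROUP BY country").show()
```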
Tech Stack
➔ Language: HQL
➔ Services: Azure VM, Hive, Hadoop -
Learn Data Processing with Spark SQL using Scala on AWS
-
Agenda:
Apache Spark is an open-source distributed processing solution for substantial data workloads. It combines in-memory caching and rapid query execution for quick analytic queries on any amount of data. It includes development APIs in Java, Scala, Python, and R. It allows code reuse across various workloads, including batch processing, interactive queries, real-time analytics, machine learning, and graph processing.
Scala is a multi-paradigm, general-purpose, high-level programming language. It's an object-oriented programming language that also supports functional programming. Scala applications can be converted to byte-codes and run on the Java Virtual Machine (JVM). Scala is a scalable programming language, and JavaScript run-times are also available. This project presents the fundamentals of Scala in an easy-to-understand manner.
Aim:
This project involves understanding the basics of Scala RDDs, performing transformations and actions, and analyzing the Movies dataset using RDDs and Spark SQL.
Data Description:
In the project, we will use Movies and Rating datasets. The Movies dataset contains movie id, title, release date, etc. The Rating dataset contains customer id, movie id, ratings, and timestamp information.
Approach:
● Create an AWS EC2 instance and launch it.
● Create docker images using docker-compose file on EC2 machine via ssh.
● Load data from local machine into Spark container via EC2 machine.
● Perform analysis on Movie and Ratings data.
Project Takeaways:
● Understanding various services provided by AWS
● Creating an AWS EC2 instance and launching it
● Connecting to an AWS EC2 instance via SSH
● Dockerization.
● Copying a file from a local machine to an EC2 machine
● Understanding fundamentals of Scala.
● Creating RDDs
● Applying Transformation operations on RDDs
● Difference between RDDs and Dataframes
● Perform analysis using RDDs
Tech Stack
➔ Language: SQL, Scala
➔ Services: AWS EC2, Docker, Hive, HDFS, Spark -
Streaming Data Pipeline using Spark, HBase and Phoenix
-
Business objective:
Sensor networks support a wide range of practice-oriented applications, from individual sensors to networked deployments for structural and highly efficient monitoring. The sensor network is designed to respond to emergencies in real time while observing the radiation field. An intelligent system-based sensor network is presented and used to monitor the condition of remote wells. The network consists of three sensors used to collect oil-field data: temperature, pressure, and gas. Intelligent microprocessor sensors are generally designed for oil-well data processing, critical failure alarms or signals, conventional data storage or signalling, and data/status connections.
Aim:
To build an application that monitors oil wells. Sensors in oil rigs generate streaming data processed by Spark and stored in HBase for use by various analytical and reporting tools.
Approach
Create an AWS EC2 instance and launch it.
Create docker images using docker-compose file on EC2 machine via ssh.
Download the dataset and load it into HDFS storage.
Read data from HDFS storage and write it into an HBase table using Spark (sketched after this list).
Create Phoenix view on top of HBase table to analyze data using SQL queries.
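A rough sketch of the load-and-write steps above, assuming the Phoenix Spark connector is on the classpath and a Phoenix table named SENSOR_READINGS with a matching schema already exists (the format string and options vary by Phoenix version):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("oil-well-sensors").getOrCreate()

# HDFS path and schema are assumptions about the sensor dataset
readings = (spark.read
            .option("header", True)
            .option("inferSchema", True)
            .csv("hdfs:///data/sensors/oil_wells.csv"))

# Write into an HBase-backed Phoenix table; table name and ZooKeeper URL are placeholders.
# The exact format string ("org.apache.phoenix.spark" vs "phoenix") depends on the Phoenix release.
(readings.write
 .format("org.apache.phoenix.spark")
 .mode("overwrite")
 .option("table", "SENSOR_READINGS")
 .option("zkUrl", "localhost:2181")
 .save())
```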
Tech Stack
● AWS EC2
● Docker
● Scala
● HBase
● Apache Spark SQL
● Spark Structured Streaming
● HDFS
● Apache Phoenix
● SBT
Project Takeaways:
● Understanding various services provided by AWS
● Creating an AWS EC2 instance and launching it
● Connecting to an AWS EC2 instance via SSH
● Copying a file from a local machine to an EC2 machine
● Dockerization
● Download the dataset and load it into HDFS
● Difference between RDBMS and HBase
● SBT packaging
● Read data from HDFS and write into HBase tables
● Understanding of Apache Phoenix
● Create a Phoenix view on top of the HBase table -
Hive Mini Project to Build a Data Warehouse for e-Commerce
-
Agenda:
Using SQL is still highly popular, and it will be for the foreseeable future. Most big data technologies have been modified to allow users to interact with them using SQL.
This big data project will look at Hive's capabilities to run analytical queries on massive datasets. We will use the AdventureWorks dataset in a MySQL database for this project, and we'll need to ingest and modify the data. We'll use AdventureWorks sales and customer demographics data to perform analysis.
Approach
● Create an AWS EC2 instance and launch it.
● Create docker images using docker-compose file on EC2 machine via ssh.
● Create tables in MySQL.
● Load data from MySQL into HDFS storage using Sqoop commands.
● Move data from HDFS to Hive.
● Integrate Hive into Spark.
● Using Scala, extract Customer demographics information from data and store it as parquet files.
● Move parquet files from Spark to Hive.
● Create tables in Hive and load data from Parquet files into tables.
● Perform Hive analytics on Sales and Customer demographics data.
Project Takeaways
● Understanding various services provided by AWS
● Creating an AWS EC2 instance
● Connecting to an AWS EC2 instance via SSH
● Introduction to Docker
● Visualizing the complete Architecture of the system
● Usage of docker-compose and starting all tools
● Copying a file from a local machine to an EC2 machine
● Understanding the schema of the dataset
● Data ingestion/transformation using Sqoop, Spark, and Hive
● Moving the data from MySQL to HDFS
● Creating Hive table and troubleshooting it
● Using Parquet and Xpath to access schema
● Understanding the GROUP BY, GROUPING SETS, ROLLUP, and CUBE clauses (see the sketch after this list)
● Understanding different analytic functions in Hive
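To make the GROUPING SETS / ROLLUP item above concrete, here is a hedged Spark SQL sketch over a tiny invented sales table (the project runs the real queries against the AdventureWorks tables in Hive):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("adventureworks-rollup").getOrCreate()

sales = spark.createDataFrame(
    [("US", "Bikes", 120.0), ("US", "Helmets", 30.0), ("DE", "Bikes", 90.0)],
    ["country", "category", "amount"],
)
sales.createOrReplaceTempView("sales")

# ROLLUP produces per-(country, category), per-country, and grand-total rows
spark.sql("""
  SELECT country, category, SUM(amount) AS total
  FROM sales
  GROUP BY country, category WITH ROLLUP
  ORDER BY country, category
""").show()

# GROUPING SETS lets you pick exactly which aggregation levels you want
spark.sql("""
  SELECT country, category, SUM(amount) AS total
  FROM sales
  GROUP BY country, category
  GROUPING SETS ((country), (category), ())
""").show()
```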
Tech Stack
Language: SQL, Scala
Services: AWS EC2, Docker, MySQL, Sqoop, Hive, HDFS, Spark -
SQL Project for Data Analysis using Oracle Database-Part 7
-
Agenda of the project:
This is the seventh project in the SQL project series. It involves understanding the Online Shopping database and using it to perform the following data wrangling activities (a rough PySpark analogue of the first two activities appears after the list):
1. Split full name into the first name and last name
2. Correct phone numbers and emails which are not in a proper format
3. Correct contact number and remove full name
4. Read BLOB column and fetch attribute details from the regular tag
5. Read BLOB column and fetch attribute details from nested columns
6. Read BLOB column and fetch attribute details from nested columns
7. Create separate tables for blob attributes
8. Remove invalid records from order_items where shipment_id is not mapped
9. Map missing first name and last name with email id credentials
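The project performs these steps in Oracle SQL Developer; purely as a hedged analogue of activities 1 and 2 in PySpark (the table and column names are invented):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("online-shopping-wrangling").getOrCreate()

customers = spark.createDataFrame(
    [("Asha Rao", "091-98765 43210"), ("Ravi Kumar", "(0)9123456789")],
    ["full_name", "phone"],
)

wrangled = (customers
    # 1. Split full name into first and last name
    .withColumn("first_name", F.split("full_name", r"\s+").getItem(0))
    .withColumn("last_name", F.split("full_name", r"\s+").getItem(1))
    # 2. Normalize phone numbers to digits only (the exact format rule is an assumption)
    .withColumn("phone_clean", F.regexp_replace("phone", r"[^0-9]", ""))
    .drop("full_name"))

wrangled.show(truncate=False)
```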
Key Takeaways:
● Understanding the project and how to use Oracle SQL Developer.
● Understanding the basics of data analysis, SQL commands, and their application.
● Understanding the use of Oracle SQL Developer.
● Understanding the concept of Data Wrangling.
● Understanding the Online Shopping database.
● Perform Data Wrangling activities on the data.
Tech stack:
● SQL Programming language
● Oracle SQL Developer -
Analyse Yelp Dataset with Spark & Parquet Format on Azure Databricks
-
Business Overview
The project involves analyzing the Yelp dataset with Spark and the Parquet format on Azure Databricks. We download the Yelp dataset from the Yelp website and understand the problem. A solution is then designed that covers ingesting the data, preparing it, and publishing it on Databricks. A Microsoft Azure subscription is set up and resources are organised into a resource group. A storage account is created to hold all the data required for the analysis, containers are created in the storage account, and the Yelp dataset is uploaded into them. An Azure Data Factory and a copy-data pipeline are created, with linked storage for the standard storage account. Data is copied from Azure Storage to Azure Data Lake Storage using the copy-data pipeline, followed by conversion of the Yelp dataset from JSON to the Parquet format and from JSON to the Delta format. Partitioning, repartitioning, and coalesce are then applied to the dataset in Databricks. Finally, data analysis is performed on the repartitioned dataset and recommendations are deduced.
Approach
Read the Yelp datasets in ADLS and convert JSON to Parquet for better performance.
Convert JSON to the Delta format.
Count the total records in each dataset.
Partition the tip dataset by a date column.
repartition() vs coalesce() (illustrated in the sketch after this list)
Find the top 3 users based on their total number of reviews.
Find the top 10 users with the most fans.
Analyse the top 10 categories by number of reviews.
Analyse top businesses which have over 1000 reviews.
Analyse business data: number of restaurants per state.
Analyse the top 3 restaurants in each state.
List the top restaurants in a state by the number of reviews.
Number of restaurants per city in Arizona.
Broadcast join: restaurants by review rating in Phoenix.
Most-rated Italian restaurant in Phoenix.
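A hedged sketch of the first few steps above on Databricks (the ADLS/mount paths are placeholders; writing Delta assumes a Databricks or Delta Lake-enabled runtime):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("yelp-format-conversion").getOrCreate()

# Paths are placeholders for the mounted ADLS locations
reviews = spark.read.json("/mnt/yelp/raw/yelp_academic_dataset_review.json")

# JSON -> Parquet and JSON -> Delta (Delta requires the Delta Lake libraries, e.g. on Databricks)
reviews.write.mode("overwrite").parquet("/mnt/yelp/parquet/review")
reviews.write.format("delta").mode("overwrite").save("/mnt/yelp/delta/review")

print("review records:", reviews.count())

# repartition() does a full shuffle and can increase or decrease the partition count;
# coalesce() only merges existing partitions and avoids a shuffle.
wide = reviews.repartition(64, "business_id")   # business_id is a column in the Yelp review data
narrow = wide.coalesce(8)
print(wide.rdd.getNumPartitions(), narrow.rdd.getNumPartitions())
```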
Tech Stack
Language: Python3
Services: Azure Data factory, Azure Databricks, ADLS -
Build an Azure Recommendation Engine on Movielens Dataset
-
Agenda of the project:
The project involves deriving movie recommendations using Python and Spark on Microsoft Azure. We understand the problem and download the Movielens dataset from the GroupLens website. A subscription is then set up for Azure and a resource group is created. A storage account is set up to store all the data required for serving movie recommendations using Python and Spark on Azure, followed by creating a storage blob account in the same resource group. First, we create containers in the storage account and the standard storage blob account and upload the Movielens zip file into the standard storage blob account. We then create an Azure Data Factory and a copy-data pipeline, and set up linked storage for the standard blob storage account in the Data Factory. Data is copied from Azure Blob Storage to Azure Data Lake Storage using the copy-data pipeline. This is followed by creating the Databricks workspace and cluster and accessing Azure Data Lake Storage from Databricks. We create mount points and extract the zip file to get the CSV files. Finally, we upload the files into Databricks, read the datasets into Spark dataframes, and analyze the dataset to get the movie recommendations.
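A hedged sketch of the final read-and-analyze step, assuming the extracted MovieLens CSVs sit under a Databricks mount point (the path is a placeholder; column names follow the standard MovieLens layout):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("movielens-analysis").getOrCreate()

# The mount point path is a placeholder for the ADLS mount created in the project
ratings = spark.read.csv("/mnt/movielens/ratings.csv", header=True, inferSchema=True)
movies = spark.read.csv("/mnt/movielens/movies.csv", header=True, inferSchema=True)

# A simple popularity-style recommendation: highest-rated movies with enough ratings
top_movies = (ratings.groupBy("movieId")
              .agg(F.count("*").alias("num_ratings"),
                   F.avg("rating").alias("avg_rating"))
              .filter("num_ratings >= 100")
              .join(movies, "movieId")
              .orderBy(F.desc("avg_rating")))

top_movies.select("title", "avg_rating", "num_ratings").show(10, truncate=False)
```

A collaborative-filtering model could replace this simple popularity ranking, but the project description does not specify one, so the sketch stays with plain aggregation.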
Data Analysis:
Data is downloaded from the GroupLens website (the Movielens dataset).
A resource group and a storage account are created in Azure.
A pipeline is created in Azure Data Factory to copy the data from Azure Blob Storage to Azure Data Lake Storage.
The Databricks workspace is created, Azure Data Lake Storage is accessed from Databricks, and mount points are created.
The Movielens zip file is extracted to get the CSV files using the Databricks File System (DBFS) and ADF.
In the transformation and load process, the datasets are read into Spark dataframes; the tags data is read into Spark in Databricks.
Finally, the data is analyzed in Spark on Databricks using the mount points, and the results are visualized using bar charts -
End-to-End Big Data Project to Learn PySpark SQL Functions
-
Business Overview
Apache Spark is a distributed processing engine that is open source and used for large data applications. It uses in-memory caching and efficient query execution for quick analytic queries against any quantity of data. It offers code reuse across many workloads such as batch processing, interactive queries, real-time analytics, machine learning, and graph processing. It provides development APIs in Java, Scala, Python, and R.
Agenda
This is the fifth project in the PySpark series. The fourth project covered advanced DataFrame functionality through a business case study, along with the spark-submit command. This project focuses on PySpark SQL, SQL functions, and the various joins available in PySpark SQL, again with the help of a business case study.
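A minimal sketch of the DataFrame-to-SQL-view conversion and two of the join types mentioned here (the tiny customer/order data is invented):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pyspark-sql-joins").getOrCreate()

customers = spark.createDataFrame(
    [(1, "Asha"), (2, "Ravi"), (3, "Meera")], ["customer_id", "name"])
orders = spark.createDataFrame(
    [(101, 1, 250.0), (102, 1, 80.0), (103, 2, 40.0)], ["order_id", "customer_id", "amount"])

# Register DataFrames as SQL temp views
customers.createOrReplaceTempView("customers")
orders.createOrReplaceTempView("orders")

# Inner join: only customers who have orders
spark.sql("""
  SELECT c.name, SUM(o.amount) AS total_spent
  FROM customers c
  JOIN orders o ON c.customer_id = o.customer_id
  GROUP BY c.name
""").show()

# Left join: keep customers with no orders as well
spark.sql("""
  SELECT c.name, o.order_id
  FROM customers c
  LEFT JOIN orders o ON c.customer_id = o.customer_id
""").show()
```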
Key Takeaways:
● Understanding the project overview
● Introduction to PySpark
● Introduction to SQL
● Features and benefits of SQL
● Understanding Spark SQL
● Understanding the business case study
● Understanding business requirements
● Converting PySpark Dataframes into SQL tables
● Different types of Joins in PySpark SQL
● Implementing PySpark SQL code
Tech stack:
➔Language: Python
➔Package: Pyspark -
GCP Project to Explore Cloud Functions using Python Part 1
-
Business Overview:
Google Cloud is a collection of physical assets, such as computers and hard disk drives, and virtual resources, such as virtual machines (VMs), housed in Google data centers worldwide. This resource distribution has various advantages, including redundancy in a failure and decreased latency by putting resources closer to customers. This release also presents some guidelines for combining resources.
GCP offers a web-based graphical user interface for managing Google Cloud projects and resources. If a user prefers to work at the command line, the gcloud command-line tool can handle most Google Cloud activities.
We will explore the following GCP services (a short Python sketch follows the list):
Cloud Storage
Compute Engine
PubSub
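A minimal, hedged Python sketch that touches two of the services above via the official client libraries (the project, bucket, and topic names are placeholders):

```python
from google.cloud import pubsub_v1, storage

PROJECT_ID = "my-gcp-project"     # placeholder
BUCKET_NAME = "my-demo-bucket"    # placeholder
TOPIC_ID = "file-events"          # placeholder

# Cloud Storage: upload a local file to a bucket
storage_client = storage.Client(project=PROJECT_ID)
bucket = storage_client.bucket(BUCKET_NAME)
bucket.blob("uploads/sample.txt").upload_from_filename("sample.txt")

# Pub/Sub: publish a small message to a topic
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(PROJECT_ID, TOPIC_ID)
future = publisher.publish(topic_path, b"sample.txt uploaded", source="gcs-demo")
print("published message id:", future.result())
```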
Key Takeaways
● Introduction to the Google Cloud Console
● Understanding Cloud Storage concepts and classes
● Creating a Service Account
● Setting up Gcloud SDK
● Installing Python and other dependencies
● Understanding retention policies and holds
● Setting up GCP Virtual Machine and SSH configuration
● Understanding Pub/Sub Architecture
● Creating a Pub/Sub Topic and implementing message flow
● Implementing Pub/Sub notification using GCS
Tech Stack:
Language: Python3
Services: Cloud Storage, Compute Engine, Pub/Sub -
Hadoop Project-Analysis of Yelp Dataset using Hadoop Hive
-
Business Overview
Big Data is the collection of huge datasets of semi-structured and unstructured data, generated by the high-performance heterogeneous group of devices ranging from social networks to scientific computing applications. Companies have the potential to gather large volumes of data, and they must guarantee that the data is in a highly useable shape by the time it reaches data scientists and analysts. Data engineering is the profession of creating and constructing systems for gathering, storing, and analyzing large amounts of data. It is a vast field with applications in almost every sector.
Apache Hadoop is a Big Data technology that enables the distributed processing of massive data volumes across computer clusters using simple programming concepts. It is intended to grow from a single server to thousands of computers, each supplying local computing and storage.
Yelp is a community review site and an American multinational firm based in San Francisco, California. It publishes crowd-sourced reviews of local businesses as well as the online reservation service Yelp Reservations. Yelp has made a portion of its data available in order to launch the Yelp Dataset Challenge, which allows anyone to do research or analysis to find what insights are buried in the data. Due to the bulk of the data, this project only selects a subset of Yelp data; the User and Review datasets are considered for this session.
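The project itself runs HiveQL on an EMR cluster; purely as a hedged illustration of the partitioning takeaways listed below (table and column names are simplified, and a staging table reviews_raw is assumed), the DDL issued through spark.sql might look like:

```python
from pyspark.sql import SparkSession

# Hive support assumes a configured metastore (as on an EMR cluster);
# reviews_raw is an assumed staging table loaded from the raw Yelp review JSON.
spark = SparkSession.builder.appName("yelp-hive").enableHiveSupport().getOrCreate()

spark.sql("""
  CREATE TABLE IF NOT EXISTS reviews_part (
    review_id STRING,
    user_id STRING,
    business_id STRING,
    stars DOUBLE,
    review_text STRING
  )
  PARTITIONED BY (review_year INT)
  STORED AS ORC
""")

# In Hive itself, bucketing would be added to the DDL with:
#   CLUSTERED BY (business_id) INTO 8 BUCKETS

# Dynamic partitioning derives the partition value from the data itself
spark.sql("SET hive.exec.dynamic.partition=true")
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")
spark.sql("""
  INSERT INTO TABLE reviews_part PARTITION (review_year)
  SELECT review_id, user_id, business_id, stars, text AS review_text,
         year(to_date(`date`)) AS review_year
  FROM reviews_raw
""")
```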
Key Takeaways
Understanding Project overview
Introduction to Big Data
Overview of Hadoop ecosystem
Understanding Hive concepts
Understanding the dataset
Implementing Hive table operations
Creating static and dynamic Partitioning
Creating Hive Buckets
Understanding different file formats in Hive
Using Complex Hive Functions in Hive
Launching EMR cluster in AWS
Tech Stack
Language: HQL
Services: AWS EMR, Hive, HDFS, AWS S3 -
PySpark Project to Learn Advanced DataFrame Concepts
-
Business Overview
Apache Spark is a distributed processing engine that is open source and used for large data applications. It uses in-memory caching and efficient query execution for quick analytic queries against any quantity of data. It offers code reuse across many workloads such as batch processing, interactive queries, real-time analytics, machine learning, and graph processing. It provides development APIs in Java, Scala, Python, and R.
Agenda:
This is the fourth project in the PySpark series. The third project covered DataFrames in depth: the different types of DataFrame operations and the implementation of transformation and action functions on Spark DataFrames. This project covers advanced DataFrame functionality through a business case study, along with the spark-submit command.
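A short sketch of the kind of advanced DataFrame operations this project covers, namely conditional columns and a window function (the sales data and column names are invented):

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("advanced-dataframes").getOrCreate()

sales = spark.createDataFrame(
    [("north", "2024-01", 120.0), ("north", "2024-02", 90.0),
     ("south", "2024-01", 200.0), ("south", "2024-02", 260.0)],
    ["region", "month", "revenue"],
)

# Conditional column with when/otherwise
labelled = sales.withColumn(
    "band", F.when(F.col("revenue") >= 150, "high").otherwise("low"))

# Window function: rank months by revenue within each region
w = Window.partitionBy("region").orderBy(F.desc("revenue"))
ranked = labelled.withColumn("rank_in_region", F.rank().over(w))

ranked.show()
```

Packaged as a script (say advanced_df.py), it would be launched with spark-submit advanced_df.py.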
Key Takeaways:
● Understanding the project overview
● Introduction to PySpark
● Understanding Spark Architecture and Lifecycle
● Introduction to Spark Operations
● Understanding the business case study
● Understanding business requirements
● Understanding Resilient Distributed Data (RDD)
● Difference between Transformation and Action
● Methods of creating Dataframes in pyspark
● Implementation of spark submit command
Tech stack:
➔Language: Python
➔Package: Pyspark -
Snowflake Azure Project to build real-time Twitter feed dashboard
-
Business Overview
Companies miss opportunities and are exposed to risk as a result of delays in company operations and decision-making. Organizations can move rapidly based on real-time data since it reveals issues and opportunities. Data that is gathered, processed, and evaluated in real-time is referred to as real-time data, and it comprises data that is ready to utilize as soon as it is created. A snapshot of historical data is what near real-time data is. When speed is critical, near real-time processing is preferred, although processing time in minutes rather than seconds is acceptable. Batch data that has been previously stored is considerably slower, and by the time it is ready to use, it might be days old.
Dataset Description
We will use the Twitter API to fetch tweets and their metadata (re-tweets, comments, likes) using Python.
Approach
● We write API calls in Python to fetch Twitter insights in real time; this code can be run on a local machine once a day (a hedged sketch of this step follows the list)
● We create a Snowpipe component in Snowflake using Azure IAM integration (cross-account access), since Snowflake is hosted on an Azure account different from the one we own. This in turn uses Azure Event Grid and a Function App in the backend to automate the file load
● As soon as the script lands files in Azure Blob Storage, Snowpipe recognizes the file arrival and automatically loads the Snowflake table with the file data
● We create a dashboard in Snowflake that is scheduled to refresh every 30 minutes to show actual feed data from Twitter, e.g. the number of likes and comments per feed, to understand popular feeds and their sentiment.
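A hedged sketch of the fetch-and-land step above, assuming a Twitter API v2 bearer token and an Azure Storage connection string are available as environment variables (the endpoint, query, container, and variable names are placeholders):

```python
import json
import os
from datetime import datetime, timezone

import requests
from azure.storage.blob import BlobServiceClient

# Placeholders: recent-search endpoint of the Twitter (X) v2 API and a sample query
SEARCH_URL = "https://api.twitter.com/2/tweets/search/recent"
QUERY = {"query": "#dataengineering", "tweet.fields": "public_metrics,created_at"}

headers = {"Authorization": f"Bearer {os.environ['TWITTER_BEARER_TOKEN']}"}
resp = requests.get(SEARCH_URL, headers=headers, params=QUERY, timeout=30)
resp.raise_for_status()
tweets = resp.json()

# Land the raw JSON in Azure Blob Storage so Snowpipe can pick it up
blob_service = BlobServiceClient.from_connection_string(os.environ["AZURE_STORAGE_CONN"])
blob_name = f"tweets/{datetime.now(timezone.utc):%Y%m%d_%H%M%S}.json"
blob_client = blob_service.get_blob_client(container="twitter-feed", blob=blob_name)
blob_client.upload_blob(json.dumps(tweets), overwrite=True)
print("uploaded", blob_name)
```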
Tech Stack
Language: Python
Services: Azure Storage Account, Azure Queue, Snowpipe, Snowflake, Azure Resource Group -
Hands-On Real Time PySpark Project for Beginners
-
Business Overview
Apache Spark is a distributed processing engine that is open source and used for large data applications. For quick analytic queries against any quantity of data, it uses in-memory caching and efficient query execution. It offers code reuse across many workloads such as batch processing, interactive queries, real-time analytics, machine learning, and graph processing. It provides development APIs in Java, Scala, Python, and R.
Data Pipeline:
A data pipeline is a technique for transferring data from one system to another. The data may or may not be updated, and it may be handled in real-time (or streaming) rather than in batches. The data pipeline encompasses everything from harvesting or acquiring data using various methods to storing raw data, cleaning, validating, and transforming data into a query-worthy format, displaying KPIs, and managing the above process.
PySpark:
PySpark is a Python interface for Apache Spark. It not only lets you develop Spark applications using Python APIs, but it also includes the PySpark shell for interactively examining data in a distributed context. PySpark supports most of Spark's capabilities, including Spark SQL, DataFrame, Streaming, MLlib, and Spark Core. In this project, you will learn about core Spark architecture, Spark Sessions, Transformations, Actions, and Optimization Techniques using PySpark.
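A minimal sketch of the core ideas this project walks through: creating a SparkSession, applying a lazy transformation, and triggering execution with an action (the numbers are invented):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pyspark-basics").master("local[*]").getOrCreate()

df = spark.createDataFrame([(i, i * i) for i in range(10)], ["n", "n_squared"])

# Transformation: lazily builds up the DAG, nothing runs yet
evens = df.filter(df.n % 2 == 0)

# Action: triggers the actual computation across the cluster (or local threads here)
print(evens.count())
evens.show()

spark.stop()
```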
Key Takeaways:
● Understanding the project overview
● Introduction to PySpark
● Understanding Spark Architecture and Lifecycle
● Introduction to Spark Operations
● Understanding the components of Apache Spark
● Understanding Resilient Distributed Data (RDD)
● Difference between Transformation and Action
● Understanding Interactive Spark Shell
● Understanding the concept of Directed Acyclic Graph(DAG)
● Features of Spark
● Applications of Spark
Tech stack:
➔Language: Python
➔Package: Pyspark -
PySpark Big Data Project to Learn RDD Operations
-
Business Overview
Apache Spark is a distributed processing engine that is open source and used for large data applications. It uses in-memory caching and efficient query execution for quick analytic queries against any quantity of data. It offers code reuse across many workloads such as batch processing, interactive queries, real-time analytics, machine learning, and graph processing. It provides development APIs in Java, Scala, Python, and R.
Agenda:
This is the second project in the PySpark series. The first project covered the PySpark introduction, Spark components and architecture, and a basic introduction to RDDs and DAGs. This project goes in depth on RDDs: the different types of RDD operations, the difference between transformations and actions, and the various functions available as transformations and actions, along with their execution.
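A minimal sketch of RDD transformations and actions of the kind covered here (the word data is invented):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-operations").master("local[*]").getOrCreate()
sc = spark.sparkContext

lines = sc.parallelize(["spark makes rdds", "rdds are resilient", "spark is fast"])

# Transformations (lazy): flatMap, filter, map, reduceByKey
word_counts = (lines.flatMap(lambda line: line.split())
               .filter(lambda word: len(word) > 2)
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

# Actions (eager): collect and count trigger execution
print(word_counts.collect())
print("distinct words:", word_counts.count())

spark.stop()
```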
Key Takeaways:
● Understanding the project overview
● Introduction to PySpark
● Understanding Spark Architecture and Lifecycle
● Introduction to Spark Operations
● Understanding the components of Apache Spark
● Understanding Resilient Distributed Data (RDD)
● Difference between Transformation and Action
● Understanding Interactive Spark Shell
● Understanding the concept of Directed Acyclic Graph(DAG)
● Understanding different Transformation functions
● Execute different Transformation functions
● Understanding different Action functions
● Execute different Action functions
Tech stack:
➔Language: Python
➔Package: Pyspark -
PySpark Project for Beginners to Learn DataFrame Operations
-
Business Overview
Apache Spark is a distributed processing engine that is open source and used for large data applications. It uses in-memory caching and efficient query execution for quick analytic queries against any quantity of data. It offers code reuse across many workloads such as batch processing, interactive queries, real-time analytics, machine learning, and graph processing. It provides development APIs in Java, Scala, Python, and R
Agenda:
This is the third project in the PySpark series. The second project covered RDDs in depth: the different types of RDD operations, the difference between transformations and actions, and the various transformation and action functions along with their execution. This project goes in depth on DataFrames: the different types of DataFrame operations and the implementation of transformation and action functions on Spark DataFrames.
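A short sketch of two of the items below: creating a DataFrame and applying a user-defined function (UDF); the sample data is invented:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("dataframe-basics").getOrCreate()

# One method of creating a DataFrame: from an in-memory list
# (another common method is spark.read.csv / spark.read.json from files)
people = spark.createDataFrame(
    [("asha", 34), ("ravi", 29), ("meera", 41)], ["name", "age"])

# A simple UDF that title-cases names; built-in functions are preferred when one exists
title_case = F.udf(lambda s: s.title() if s else s, StringType())

result = (people
          .withColumn("name_clean", title_case(F.col("name")))   # transformation
          .filter(F.col("age") > 30))

result.show()   # action
```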
Key Takeaways:
● Understanding the project overview
● Introduction to PySpark
● Understanding Spark Architecture and Lifecycle
● Introduction to Spark Operations
● Understanding the components of Apache Spark
● Understanding Resilient Distributed Data (RDD)
● Difference between Transformation and Action
● Understanding datasets
● Understanding spark Dataframes
● Difference between Structured and Semi-structured data
● Methods of creating Dataframes in pyspark
● Understanding UDF in spark
● Implementation of Transformation and Action functions on spark Dataframe
Tech stack:
➔Language: Python
➔Package: Pyspark -
SQL Project for Data Analysis using Oracle Database-Part 4
-
● Understanding the project and how to use Oracle SQL Developer
● Understanding the basics of data analysis, SQL commands, and their application
● Understanding the use of Oracle SQL Developer
● Understanding the difference between COUNT(*) and COUNT(column_name).
● Data analysis using WITH clause.
● Categorization using CASE statement.
● Understanding the inline view.
● Simplify query with WITH clause and View.
● Understanding the use of the ROWNUM clause.
Tech stack:
● SQL Programming language
● Oracle SQL Developer -
SQL Project for Data Analysis using Oracle Database-Part 5
-
Key Takeaways:
● Understanding the project and how to use Oracle SQL Developer
● Understanding the basics of data analysis, SQL commands, and their application
● Understanding the use of Oracle SQL Developer
● Understanding the ROW_NUMBER function
● Data analysis using the RANK function
● Difference between RANK and DENSE_RANK functions
● Understanding the use of SUBSTR and INSTR functions
● Data analysis using the built-in functions
● Deal with NULL values using the NVL function
● Understanding the use of COALESCE function
● Change the date format
Tech stack:
SQL Programming language
Oracle SQL Developer -
SQL Project for Data Analysis using Oracle Database-Part 6
-
Key Takeaways:
● Understanding the project and how to use Oracle SQL Developer.
● Understanding the basics of data analysis, SQL commands, and their application.
● Understanding the use of Oracle SQL Developer.
● Understanding the concept of Data Wrangling.
● Remove unwanted features from data using SQL queries.
● Deal with missing data.
● How to remove missing data using SQL queries.
● How to impute missing data using SQL queries.
● Understanding Pivot and Unpivot functions in SQL.
● Pivoting rows to columns using SQL queries.
● Pivoting rows to columns with joins using SQL queries
Tech stack:
● SQL Programming language
● Oracle SQL Developer -
SQL Project for Data Analysis using Oracle Database-Part 1
-
Data Analysis:
● The Oracle Database 21c is downloaded from the Oracle website for SQL Data analysis.
● The SQL Developer is downloaded for working in Oracle databases, connecting it to the “SYSTEM” username and creating tables in the database.
● Data is inserted into the tables, followed by the exploration of the tables, including a walkthrough of columns and seeing comments.
● The listing of the Employees and Departments is done based on some conditions using the SQL commands, followed by displaying the records in an ordered manner and handling of the NULL values.
● The selection of the records is made based on some patterns like Wildcard, Operators, etc., followed by implementation of the Data Manipulation commands (DML) like Add, Update and Delete for the Data Analysis.
● The backup of the table where migration is going on is taken, followed by COMMIT and ROLLBACK commands. Then the listing of distinct records is done, and further renaming of the columns.
● Finally, a listing of the employee details based on the complex nested conditions is done.
Tech stack:
● SQL Programming language
● Oracle SQL Developer -
SQL Project for Data Analysis using Oracle Database-Part 2
-
This is the second project in the SQL project series. This project's agenda involves analyzing data using SQL on the Oracle Database software: understanding the different types of joins (Inner Join, Left Outer Join, Right Outer Join, Full Outer Join, Self Join) and set operators (MINUS, UNION, UNION ALL, INTERSECT), and resolving the "column ambiguously defined" error.
Tech stack:
● SQL Programming language
● Oracle SQL Developer -
SQL Project for Data Analysis using Oracle Database-Part 3
-
This is the third project in the SQL project series; the second project involved analyzing data using SQL on the Oracle Database software and understanding the different types of joins (Inner Join, Left Outer Join, Right Outer Join, Full Outer Join, Self Join) and set operators (MINUS, UNION, UNION ALL, INTERSECT). This project involves data analysis using subqueries, the GROUP BY clause, and the EXISTS clause. It also uses inline views and aggregate functions (MIN, MAX, COUNT, AVG) to perform better analysis of the data.
Tech stack:
SQL Programming language
Oracle SQL Developer
Recommendations received
6 people have recommended Ajay
More activity by Ajay
-
🎉 Exciting update from AWS re:Invent 2024! The long-awaited S3 Table feature has been revealed, sparking considerable excitement. This latest…
Liked by Ajay Kadiyala
-
If You Don’t Know What to Pursue, Pursue Yourself When you’re unsure of what to pursue, turn your focus inward and uncover your true self. ➡️…
Liked by Ajay Kadiyala
-
Want to Clear Next Java Developer Interview? Prepare these topic to ace in your next Java Interview: This roadmap will guide you through the most…
Liked by Ajay Kadiyala
-
Get into COEP, Life will be sorted Learn Coding, build projects, Life will be sorted Get 20 LPA+ package, Life will be sorted Switch companies, get a…
Liked by Ajay Kadiyala