HOW TO ANALYZE BIG DATA WITH HADOOP TECHNOLOGIES

With rapid innovation, the frequent evolution of technologies and a growing internet population, systems and enterprises are generating huge amounts of data, from terabytes to even petabytes of information. Since data is being generated at very high volume and velocity, in multi-structured formats such as images, videos, weblogs and sensor data, and from many different sources, there is a huge demand to efficiently store, process and analyze this data to make it usable.

Hadoop is undoubtedly the preferred choice for this requirement due to its key characteristics: it is reliable, flexible, economical and scalable. While Hadoop provides the ability to store this large-scale data on HDFS (Hadoop Distributed File System), there are multiple solutions available for analyzing it, such as MapReduce, Pig and Hive. With the advancement of these data analysis technologies, there are many schools of thought about which Hadoop data analysis technology should be used when, and which would be most efficient.

Advantages of Big Data Analysis

Big data analysis allows market analysts, researchers and business users to gain deep insight into data, providing a business advantage. Business users are able to make precise analyses of the data, and the key early indicators from such analysis can mean fortunes for a business. Some exemplary use cases are below:

·       Whenever users browse a travel portal or shopping site, search for flights or hotels, or add a particular item to their cart, ad-targeting companies can analyze this wide variety of data and activity to provide better recommendations on offers, discounts and deals, based on the user's browsing history and product history.

·       In the telecommunications space, if customers are moving from one service provider to another, then analyzing huge volumes of call data records can reveal the issues those customers are facing. There could be a significant increase in call drops, or there may be network congestion problems. Based on this analysis, it can be determined whether the telecom company needs to place a new tower in a particular urban area, or needs to revise its marketing strategy for a particular region because a new player has come up there. In that way, customer churn can be proactively minimized.

Stock Market Data Case Study

Now let's walk through a case study analyzing stock market data. We will evaluate various big data technologies to analyze a sample ‘New York Stock Exchange’ dataset, calculate the covariance for this stock data, and aim to solve both the storage and the processing problems related to huge volumes of data.

Covariance is a finance term that represents the degree to which two stocks or financial instruments move together or apart from each other. With covariance, investors have the opportunity to seek out different investment options based on their respective risk profiles. It is a statistical measure of how one investment moves in relation to another.

- A positive covariance means that asset returns moved together. If investment instruments or stocks tend to be up or down during the same time periods, they have positive covariance.

- A negative covariance means returns move inversely. If one investment instrument tends to be up while the other is down, they have negative covariance.

This can help a stock broker in recommending stocks to customers.
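
For reference, the identity that the Hive query later in this article implements, computed per month over daily high prices, is:

Cov(X, Y) = E[X·Y] − E[X]·E[Y]

where E[·] denotes the average over the period. This form is convenient in SQL-style engines because it can be computed with nothing more than AVG aggregates.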

Dataset: The sample dataset provided is a comma-separated values (CSV) file named ‘NYSE_daily_prices_Q.csv’ that contains stock information from the New York Stock Exchange, such as daily quotes, stock opening price, stock highest price and so on.
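
Each row holds one day's figures for one stock; a line looks roughly like the following (the values shown here are illustrative, not taken from the actual file):

NYSE,QTM,2008-02-01,1.99,2.07,1.96,2.03,1456500,2.03

The fields are, in order: exchange, stock symbol, date, opening price, highest price, lowest price, closing price, volume and adjusted closing price, matching the table definition used later in this article.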

The dataset provided is just a small sample of around 3,500 records, but in a real production environment stock data could run into gigabytes or terabytes, so our solution must also work at production scale.

Hadoop Data Analysis Technologies

Let's have a look at the existing open-source Hadoop data analysis technologies for analyzing the huge stock data being generated so frequently.

Which Data Analysis Technology Should Be Used?

The provided sample dataset has the following properties:

  • The data is in a structured format
  • Calculating the stock covariance requires joins
  • The data can be organized into a schema
  • In a real environment, the data size would be very large

Weighing these criteria against each of the Hadoop data analysis technologies, we can conclude the following:

  • If we use MapReduce, then complex business logic needs to be written to handle the joins. We would have to think from a map-and-reduce perspective about which particular code goes into the map side and which into the reduce side. Considerable development effort is needed to decide how map-side and reduce-side joins will take place. We would not be able to map the data into a schema format, and all of that effort would need to be handled programmatically.
  • If we use Pig, then we would not be able to partition the data, which could otherwise be used to process a subset of data for a particular stock symbol or a particular date or month (see the partitioning sketch below). In addition, Pig is more of a scripting language, better suited to prototyping and rapidly developing MapReduce-based jobs. It also doesn't provide the facility to map our data into an explicit schema format, which seems more suitable for this case study.
  • Hive not only provides a familiar programming model for people who know SQL, it also eliminates lots of boilerplate and the sometimes tricky coding that we would have to do in MapReduce programming. If we apply Hive to analyze the stock data, then we can leverage the SQL capabilities of HiveQL, and the data can be managed under a particular schema. It will also reduce development time, and joins between stock data can be managed using HiveQL, which is of course quite difficult in MapReduce. Hive also has a Thrift server, through which we can submit Hive queries from anywhere to the Hive server, which executes them. Hive SQL queries are converted into MapReduce jobs by the Hive compiler, freeing programmers from complex programming and giving them the opportunity to focus on the business problem.

So based on the above discussion, Hive seems the perfect choice for the mentioned case study.
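
As a side note on the partitioning point made above: if we later needed to prune scans by year, the Hive table could be declared with a partition column. The sketch below is illustrative only and is not used in the case study; ‘trade_year’ is a hypothetical partition column:

hive> create table NYSE_partitioned (stock_symbol String, stock_date String, stock_price_high double) partitioned by (trade_year int) row format delimited fields terminated by ',';

Queries filtering on trade_year would then scan only the matching partition's files rather than the full dataset.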

Problem Solution with Hive

Apache Hive is a data warehousing package built on top of Hadoop that provides data summarization, query and analysis. The query language used by Hive is called HiveQL and is very similar to SQL.

Now that we have zeroed in on the data analysis technology, it's time to get your feet wet deriving a solution for the case study.

  • Hive Configuration on Cloudera

Follow the steps in my earlier blog post to configure Hive on Cloudera, linked below:

How to Configure Hive On Cloudera

  • Create Hive Table

Use the ‘create table’ Hive command to create the Hive table for our provided CSV dataset:

hive> create table NYSE (`exchange` String, stock_symbol String, stock_date String, stock_price_open double, stock_price_high double, stock_price_low double, stock_price_close double, stock_volume double, stock_price_adj_close double) row format delimited fields terminated by ',';
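
As a quick sanity check before loading any data, you can confirm the column layout (the exact output formatting varies by Hive version):

hive> describe NYSE;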

  • Load CSV Data into Hive Table

hive> load data local inpath '/home/cloudera/NYSE_daily_prices_Q.csv' into table NYSE;
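
Assuming the file loaded cleanly, a quick row count should come back close to the roughly 3,500 records in the sample file, and a small select lets you eyeball the parsed columns:

hive> select count(*) from NYSE;
hive> select * from NYSE limit 5;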

  • Calculate the Covariance

We can calculate the covariance for the provided stock dataset, for a given year, using the Hive select query below. The self-join on stock_date pairs each stock with every other stock trading on the same day, and the condition a.STOCK_SYMBOL<b.STOCK_SYMBOL ensures each pair is counted only once:

select a.STOCK_SYMBOL, b.STOCK_SYMBOL, month(a.STOCK_DATE), (AVG(a.STOCK_PRICE_HIGH*b.STOCK_PRICE_HIGH) - (AVG(a.STOCK_PRICE_HIGH)*AVG(b.STOCK_PRICE_HIGH))) from NYSE a join NYSE b on a.STOCK_DATE=b.STOCK_DATE where a.STOCK_SYMBOL<b.STOCK_SYMBOL and year(a.STOCK_DATE)=2008 group by a.STOCK_SYMBOL, b.STOCK_SYMBOL, month(a.STOCK_DATE);

This Hive select query is executed as a MapReduce job behind the scenes, and its output is the set of covariance results for the stock data analysis: the covariance between each pair of stocks, for each month of the provided year.

From the covariance results, stock brokers or fund managers can make the following recommendations:

·       Stocks QRR and QTM show more positive than negative covariance, so there is a high probability that these stocks will move together in the same direction.

·       Stocks QRR and QXM mostly show negative covariance, so there is a greater probability that their prices move in inverse directions.

·       Stocks QTM and QXM show positive covariance for most months, so they tend mostly to move in the same direction.
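
To drill into a single pair, the same query is easily restricted; a minimal variant for the QRR/QTM pair discussed above would be:

select month(a.STOCK_DATE), (AVG(a.STOCK_PRICE_HIGH*b.STOCK_PRICE_HIGH) - (AVG(a.STOCK_PRICE_HIGH)*AVG(b.STOCK_PRICE_HIGH))) from NYSE a join NYSE b on a.STOCK_DATE=b.STOCK_DATE where a.STOCK_SYMBOL='QRR' and b.STOCK_SYMBOL='QTM' and year(a.STOCK_DATE)=2008 group by month(a.STOCK_DATE);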

Similarly, we can analyze more big data use cases, explore the possible solutions for each, and then narrow down to the best solution by comparing them.

Conclusion/Benefits

This case study addresses both of the core goals of big data technologies in the best possible way.

  • Storage:

By storing the huge stock data in HDFS, the solution is robust, reliable, economical and scalable. Whenever the data size increases, you can simply add more nodes and configure them into Hadoop, and that's all. If a node ever goes down, other nodes are ready to take over the responsibility, thanks to data replication.

By managing the Hive schema in the embedded metastore database, or in any other standard SQL database, we are able to utilize the power of SQL as well.
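
For example, pointing the Hive metastore at an external MySQL database is only a matter of configuration. The hive-site.xml snippet below is a minimal sketch; the host, port, database name and driver shown are illustrative and depend on your environment:

<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://localhost:3306/metastore</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>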

  • Processing:

Since the Hive schema is backed by a standard SQL database, you get the advantage of running SQL queries over the huge dataset and can process gigabytes or terabytes of data with simple SQL queries. Since the actual data resides in HDFS, these Hive SQL queries are converted into MapReduce jobs, and these parallelized MapReduce jobs process the huge volume of data, achieving a scalable and fault-tolerant solution.
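
If you want to see this translation for yourself, prefixing a query with ‘explain’ prints Hive's execution plan, including the MapReduce stages, without running the job (the exact output depends on the Hive version):

hive> explain select count(*) from NYSE;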
