Distributed Storage Cluster and Hadoop
In this article, I am going to talk about the Hadoop cluster. Before that, I would like to discuss why this technology came into the picture and what need it fulfils.
There is no place where Big Data does not exist! Curiosity about what Big Data is has been soaring in the past few years. Let me tell you some mind-boggling facts! Forbes reports that every minute, users watch 4.15 million YouTube videos, send 456,000 tweets on Twitter, post 46,740 photos on Instagram, and post 510,000 comments and 293,000 status updates on Facebook!
Just imagine the huge chunk of data produced by such activities. This constant creation of data across social media, business applications, telecom and various other domains is leading to the formation of Big Data.
There is no single storage system where we can store this amount of data, and even if we somehow manage to store this big data, there are many problems we have to face, such as the following.
The first problem is storing the colossal amount of data.
Storing this huge data in a traditional system is not possible. The reason is obvious: the storage is limited to a single system, while the data is increasing at a tremendous rate.
The second problem is storing heterogeneous data.
Now, we know that storage is a problem, but let me tell you, it is only part of the problem. As we discussed, the data is not only huge but also present in various formats. So you need a system that can store all these varieties of data, generated from various sources.
The third problem is accessing and processing speed.
Hard disk capacity is increasing, but the disk transfer speed, or access speed, is not increasing at a similar rate. Let me explain this with an example: if you have only one 100 MB/s I/O channel and you are processing 1 TB of data, it will take around 2.91 hours. Now, if you have four machines, each with one such I/O channel, the same amount of data takes approximately 43 minutes. Thus, accessing and processing speed is an even bigger problem than storing Big Data.
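If you want to sanity-check those figures, here is a quick back-of-the-envelope calculation (a minimal Python sketch, assuming 1 TB = 1024⁴ bytes and a 100 MB/s channel):

```python
# Back-of-the-envelope check of the numbers above: reading 1 TB through a
# single 100 MB/s I/O channel versus four such channels working in parallel.
DATA_BYTES = 1024 ** 4                     # 1 TB
CHANNEL_BYTES_PER_SEC = 100 * 1024 ** 2    # 100 MB/s

def read_time_seconds(num_channels):
    """Time to stream the whole data set, split evenly across the channels."""
    return DATA_BYTES / (CHANNEL_BYTES_PER_SEC * num_channels)

print(f"1 channel : {read_time_seconds(1) / 3600:.2f} hours")   # ~2.91 hours
print(f"4 channels: {read_time_seconds(4) / 60:.1f} minutes")   # ~43.7 minutes
```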
Therefore, to handle such a huge amount of data, a technology, or rather a kind of cluster, was created, known as the "Distributed Storage Cluster".
So, how does this cluster work?
As the name suggests, it is all about distribution. What these companies do is use a distributed setup whose topology is called the Master-Slave topology. Master/slave is a model of asymmetric communication or control where one device or process (the "master") controls one or more other devices or processes (the "slaves") and serves as their communication hub.
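As a rough illustration of this control pattern (not Hadoop itself), here is a minimal Python sketch in which the main process plays the master and a pool of worker processes plays the slaves:

```python
# A toy master/slave setup: the main process (master) splits the work,
# hands one piece to each worker process (slave), and merges the results.
from multiprocessing import Pool

def process_chunk(chunk):
    # Each slave works only on the piece of data it was given.
    return sum(chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))
    chunks = [data[i::4] for i in range(4)]   # master splits the work four ways
    with Pool(processes=4) as pool:           # master coordinates the slaves
        partial = pool.map(process_chunk, chunks)
    print(sum(partial))                       # master merges the partial results
```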
To implement this distributed storage cluster, we use a framework named Hadoop.
So, what is Hadoop?
Hadoop is a framework that allows you to first store Big Data in a distributed environment so that you can process it in parallel. Hadoop works on a different protocol than HTTP or HTTPS.
It works on its own protocol, named the HDFS (Hadoop Distributed File System) protocol.
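Purely as an illustration (the article does not include code), this is roughly how a Python application could talk to HDFS using pyarrow; the NameNode host `namenode`, port 8020 and the path `/user/data` are assumptions about a hypothetical cluster:

```python
# A hedged sketch: connect to HDFS through pyarrow's libhdfs binding and list
# a directory. Assumes a running cluster with its NameNode at namenode:8020
# and the native libhdfs library installed on the client machine.
from pyarrow import fs

hdfs = fs.HadoopFileSystem("namenode", port=8020)   # i.e. hdfs://namenode:8020
entries = hdfs.get_file_info(fs.FileSelector("/user/data"))
for entry in entries:
    print(entry.path, entry.size)
```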
Then you will ask: how is Hadoop a solution to these Big Data problems?
Let’s understand how Hadoop provided the solution to the Big Data problems that we just discussed.
The first problem is storing Big Data.
HDFS provides a distributed way to store Big Data. Your data is stored in blocks across the DataNodes, and you can specify the block size. For example, if you have 512 MB of data and have configured HDFS with a block size of 128 MB, HDFS will divide the data into 4 blocks (512/128 = 4), store them across different DataNodes, and also replicate each block on other DataNodes. Now, since we are using commodity hardware, storing is no longer a challenge.
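To make the arithmetic concrete, here is a toy Python sketch of the idea (the DataNode names are hypothetical, and the real NameNode uses a rack-aware placement policy rather than simple round-robin):

```python
# Toy model of HDFS block splitting and replication: a 512 MB file is cut
# into 128 MB blocks, and each block is placed on several DataNodes.
import itertools

FILE_SIZE_MB = 512
BLOCK_SIZE_MB = 128                         # configurable via dfs.blocksize
REPLICATION = 3                             # HDFS default replication factor
DATANODES = ["dn1", "dn2", "dn3", "dn4"]    # hypothetical node names

num_blocks = -(-FILE_SIZE_MB // BLOCK_SIZE_MB)   # ceiling division -> 4
nodes = itertools.cycle(DATANODES)

for block_id in range(num_blocks):
    replicas = [next(nodes) for _ in range(REPLICATION)]
    print(f"block {block_id}: {BLOCK_SIZE_MB} MB on {replicas}")
```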
It also solves the scaling problem, since it focuses on horizontal scaling instead of vertical scaling. You can always add extra DataNodes to the HDFS cluster as and when required, instead of scaling up the resources of your existing DataNodes. To summarize: for storing 1 TB of data, you don't need a 1 TB system. You can instead spread it across multiple 128 GB systems, or even smaller ones.
The next problem was storing the variety of data.
With HDFS you can store all kinds of data, whether structured, semi-structured or unstructured, because HDFS performs no schema validation before the data is dumped. It also follows a write-once, read-many model: you write the data once and can read it many times to find insights.
The third challenge was accessing and processing the data faster.
Yes, this is one of the major challenges with Big Data. To solve it, we move the processing to the data, not the data to the processing. What does that mean? Instead of moving the data to the master node and then processing it, in MapReduce the processing logic is sent to the various slave nodes, and the data is processed in parallel across those slave nodes. The processed results are then sent to the master node, where they are merged, and the response is sent back to the client.
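To make that flow concrete, here is a minimal single-machine sketch of the map-and-merge pattern in Python (real MapReduce ships the job to the slave nodes and handles the shuffle for you; this only illustrates the idea):

```python
# A minimal, single-machine sketch of the MapReduce idea: map each block
# locally where the data lives, then merge ("reduce") the partial results.
from collections import Counter
from functools import reduce

# Pretend each list is the data block held by one slave node.
blocks = [
    ["big", "data", "big"],
    ["data", "hadoop", "hdfs"],
    ["hadoop", "big", "data"],
]

def map_phase(block):
    # Runs next to the data: count words within a single block.
    return Counter(block)

def reduce_phase(a, b):
    # Runs on the master side: merge partial counts into one result.
    return a + b

partial_counts = [map_phase(b) for b in blocks]       # in parallel on slaves
final_counts = reduce(reduce_phase, partial_counts)   # merged on the master
print(final_counts)   # Counter({'big': 3, 'data': 3, 'hadoop': 2, 'hdfs': 1})
```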
I hope this blog was informative and added value to your knowledge.
Thank You!
#bigdata #hadoop #bigdatamanagement #arthbylw #vimaldaga #righteducation #educationredefine #rightmentor
#worldrecordholder #ARTH #linuxworld #makingindiafutureready