Distributed Storage Cluster and Hadoop
In this article, I am going to talk about the Hadoop cluster. Before that, I would like to discuss why this technology came into the picture and what need it fulfils.
There is no place where Big Data does not exist! Curiosity about what Big Data is has been soaring in the past few years. Let me tell you some mind-boggling facts! Forbes reports that every minute, users watch 4.15 million YouTube videos, send 456,000 tweets on Twitter, post 46,740 photos on Instagram, and post 510,000 comments and 293,000 status updates on Facebook!
Just imagine the huge chunk of data produced by such activities. This constant creation of data across social media, business applications, telecom and various other domains is leading to the formation of Big Data.
There is no single storage system where we can store this amount of data, and even if we somehow manage to store this big data, there are many problems we have to face, such as the following.
The first problem is storing the colossal amount of data.
Storing this huge data in a traditional system is not possible. The reason is obvious: the storage is limited to a single system, while the data is increasing at a tremendous rate.
The second problem is storing heterogeneous data.
Now, we know that storage is a problem, but let me tell you, it is only part of the problem. As we discussed, the data is not only huge but also present in various formats. So you need a system that can store all these varieties of data, generated from various sources.
The third problem is accessing and processing speed.
Hard disk capacity is increasing, but the disk transfer speed, or access speed, is not increasing at a similar rate. Let me explain this with an example: if you have only one 100 MB/s I/O channel and you are processing 1 TB of data, it will take around 2.91 hours. Now, if you have four machines, each with one such I/O channel, the same amount of data takes approximately 43 minutes. Thus, accessing and processing speed is an even bigger problem than storing Big Data.
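If you want to sanity-check those figures, here is a quick back-of-the-envelope calculation (a minimal Python sketch, assuming 1 TB = 1024⁴ bytes and a 100 MB/s channel):

```python
# Back-of-the-envelope check of the numbers above: reading 1 TB through a
# single 100 MB/s I/O channel versus four such channels working in parallel.
DATA_BYTES = 1024 ** 4                     # 1 TB
CHANNEL_BYTES_PER_SEC = 100 * 1024 ** 2    # 100 MB/s

def read_time_seconds(num_channels):
    """Time to stream the whole data set, split evenly across the channels."""
    return DATA_BYTES / (CHANNEL_BYTES_PER_SEC * num_channels)

print(f"1 channel : {read_time_seconds(1) / 3600:.2f} hours")   # ~2.91 hours
print(f"4 channels: {read_time_seconds(4) / 60:.1f} minutes")   # ~43.7 minutes
```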
Therefore, to handle such a huge amount of data, a technology, or rather a kind of cluster, was created, known as the "Distributed Storage Cluster".
So, how does this cluster work?
As the name suggests, it is all about distribution. What these companies do is use a distributed setup whose topology is called the Master-Slave topology. Master/slave is a model of asymmetric communication or control where one device or process (the "master") controls one or more other devices or processes (the "slaves") and serves as their communication hub.
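As a rough illustration of this control pattern (not Hadoop itself), here is a minimal Python sketch in which the main process plays the master and a pool of worker processes plays the slaves:

```python
# A toy master/slave setup: the main process (master) splits the work,
# hands one piece to each worker process (slave), and merges the results.
from multiprocessing import Pool

def process_chunk(chunk):
    # Each slave works only on the piece of data it was given.
    return sum(chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))
    chunks = [data[i::4] for i in range(4)]   # master splits the work four ways
    with Pool(processes=4) as pool:           # master coordinates the slaves
        partial = pool.map(process_chunk, chunks)
    print(sum(partial))                       # master merges the partial results
```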
To implement this distributed storage cluster, we use a framework named Hadoop.
So, what is Hadoop?
Hadoop is a framework that allows you to first store Big Data in a distributed environment so that you can process it in parallel. Hadoop works on a different protocol than HTTP or HTTPS.
It works on its own protocol, named the HDFS (Hadoop Distributed File System) protocol.
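Purely as an illustration (the article does not include code), this is roughly how a Python application could talk to HDFS using pyarrow; the NameNode host `namenode`, port 8020 and the path `/user/data` are assumptions about a hypothetical cluster:

```python
# A hedged sketch: connect to HDFS through pyarrow's libhdfs binding and list
# a directory. Assumes a running cluster with its NameNode at namenode:8020
# and the native libhdfs library installed on the client machine.
from pyarrow import fs

hdfs = fs.HadoopFileSystem("namenode", port=8020)   # i.e. hdfs://namenode:8020
entries = hdfs.get_file_info(fs.FileSelector("/user/data"))
for entry in entries:
    print(entry.path, entry.size)
```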
Then you will ask: how is Hadoop a solution to these Big Data problems?
Let’s understand how Hadoop provided the solution to the Big Data problems that we just discussed.
The first problem is storing Big Data.
HDFS provides a distributed way to store Big Data. Your data is stored in blocks across the DataNodes, and you can specify the block size. For example, if you have 512 MB of data and have configured HDFS with a block size of 128 MB, HDFS will divide the data into 4 blocks (512/128 = 4), store them across different DataNodes, and also replicate each block on other DataNodes. Now, since we are using commodity hardware, storing is no longer a challenge.
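To make the arithmetic concrete, here is a toy Python sketch of the idea (the DataNode names are hypothetical, and the real NameNode uses a rack-aware placement policy rather than simple round-robin):

```python
# Toy model of HDFS block splitting and replication: a 512 MB file is cut
# into 128 MB blocks, and each block is placed on several DataNodes.
import itertools

FILE_SIZE_MB = 512
BLOCK_SIZE_MB = 128                         # configurable via dfs.blocksize
REPLICATION = 3                             # HDFS default replication factor
DATANODES = ["dn1", "dn2", "dn3", "dn4"]    # hypothetical node names

num_blocks = -(-FILE_SIZE_MB // BLOCK_SIZE_MB)   # ceiling division -> 4
nodes = itertools.cycle(DATANODES)

for block_id in range(num_blocks):
    replicas = [next(nodes) for _ in range(REPLICATION)]
    print(f"block {block_id}: {BLOCK_SIZE_MB} MB on {replicas}")
```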
It also solves the scaling problem, since it focuses on horizontal scaling instead of vertical scaling. You can always add extra DataNodes to the HDFS cluster as and when required, instead of scaling up the resources of your existing DataNodes. To summarize: for storing 1 TB of data, you don't need a 1 TB system. You can instead spread it across multiple 128 GB systems, or even smaller ones.
The next problem was storing the variety of data.
With HDFS you can store all kinds of data, whether structured, semi-structured or unstructured, because HDFS performs no schema validation before the data is dumped. It also follows a write-once, read-many model: you write the data once and can read it many times to find insights.
The third challenge was accessing and processing the data faster.
Yes, this is one of the major challenges with Big Data. To solve it, we move the processing to the data, not the data to the processing. What does that mean? Instead of moving the data to the master node and then processing it, in MapReduce the processing logic is sent to the various slave nodes, and the data is processed in parallel across those slave nodes. The processed results are then sent to the master node, where they are merged, and the response is sent back to the client.
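To make that flow concrete, here is a minimal single-machine sketch of the map-and-merge pattern in Python (real MapReduce ships the job to the slave nodes and handles the shuffle for you; this only illustrates the idea):

```python
# A minimal, single-machine sketch of the MapReduce idea: map each block
# locally where the data lives, then merge ("reduce") the partial results.
from collections import Counter
from functools import reduce

# Pretend each list is the data block held by one slave node.
blocks = [
    ["big", "data", "big"],
    ["data", "hadoop", "hdfs"],
    ["hadoop", "big", "data"],
]

def map_phase(block):
    # Runs next to the data: count words within a single block.
    return Counter(block)

def reduce_phase(a, b):
    # Runs on the master side: merge partial counts into one result.
    return a + b

partial_counts = [map_phase(b) for b in blocks]       # in parallel on slaves
final_counts = reduce(reduce_phase, partial_counts)   # merged on the master
print(final_counts)   # Counter({'big': 3, 'data': 3, 'hadoop': 2, 'hdfs': 1})
```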
I hope this blog was informative and added value to your knowledge.
Thank You!
#bigdata #hadoop #bigdatamanagement #arthbylw #vimaldaga #righteducation #educationredefine #rightmentor
#worldrecordholder #ARTH #linuxworld #makingindiafutureready