Distributed Storage Cluster & Hadoop

Data is a collection of facts (numbers, words, measurements, observations, etc.) that has been translated into a form computers can process, so that we can draw insights useful for analysis as well as for business purposes. Data comes in various types, such as structured and unstructured data, and in many varieties besides: personal data, sensor data, social feeds, web data, transactional data, logs, and many more.

Many businesses are built entirely on data, so in today's world data is the new fuel; it is the most vital thing in every field. There are 2.5 quintillion bytes of data created each day at our current pace, and that pace is only accelerating with the growth of the Internet of Things (IoT). Over the last two years alone, 90 percent of all the data in the world was generated. Take social media: Facebook, the most active of the social networks with over 1.4 billion monthly active users, generates the most social data of all; its users like over 4 million posts every minute (4,166,667 to be exact), which adds up to 250 million per hour! The same goes for other giant companies such as Google, with its best-in-class search engine, IBM, and Instagram.

So the data is huge: we need to store it, retrieve it when needed, and also process and analyse it. This is the challenge of Big Data. First comes the storage issue, also called volume. Even if we somehow provisioned a large enough fleet of hard disks, an I/O issue would come up: a large volume of data takes a huge amount of time to write to those disks and just as long to read back from them. Where, then, is the time left to process it, analyse it, and produce helpful insights? Obviously this hampers meeting the demand in the market.
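To put rough numbers on that I/O bottleneck (assuming a typical single hard disk streams data at around 100 MB/s, a commonly cited ballpark figure): scanning 1 TB from one disk takes about 10,000 seconds, close to three hours, before any analysis even begins. Spread the same 1 TB across 100 disks read in parallel and the scan drops to roughly 100 seconds, under two minutes. Trading one big disk for many cooperating ones is exactly the idea behind distributed storage.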

Hence we have a great concept that helps us overcome all these issues: the "Distributed Storage Cluster". This refers to an infrastructure that splits data across multiple physical servers, and often across more than one data center. It typically takes the form of a cluster of storage units, with a mechanism for data synchronization and coordination between cluster nodes. To implement this concept we have a great piece of software from the Apache community: HADOOP. It is an open-source framework used to efficiently store and process large datasets ranging in size from gigabytes to petabytes. Instead of using one large computer to store and process the data, Hadoop clusters multiple computers together so they can analyse massive datasets in parallel, much more quickly. The file system that makes this work is known as the Hadoop Distributed File System, or HDFS.
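To make this concrete, here is a minimal sketch in Java using Hadoop's FileSystem API: it writes a small file into HDFS and reads it back. Treat it as an illustration only; the NameNode address (hdfs://namenode:9000) and the path are placeholders, not the settings of any particular cluster.

```java
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsHello {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder NameNode address -- point this at your cluster's fs.defaultFS.
        conf.set("fs.defaultFS", "hdfs://namenode:9000");

        try (FileSystem fs = FileSystem.get(conf)) {
            Path path = new Path("/user/demo/hello.txt");

            // Write: HDFS splits the file into blocks and replicates them across DataNodes.
            try (FSDataOutputStream out = fs.create(path, true)) {
                out.write("Hello, HDFS!".getBytes(StandardCharsets.UTF_8));
            }

            // Read it back: the client fetches blocks from whichever nodes hold them.
            try (FSDataInputStream in = fs.open(path)) {
                byte[] buf = new byte[64];
                int n = in.read(buf);
                System.out.println(new String(buf, 0, n, StandardCharsets.UTF_8));
            }
        }
    }
}
```

Behind the scenes, the create() call causes HDFS to chop the file into blocks and replicate each block across several DataNodes, which is what lets many machines read different parts of a large dataset at the same time.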

So how do companies use this technology? Well, many data-driven companies use Hadoop at great scale.

Google has its own cloud, the Google Cloud Platform, where Dataproc is considered the managed Hadoop for the cloud. Google provides another service for Hadoop-style workloads, Cloud Dataflow. Dataflow is generally used when the user wants to perform both stream processing (ETL) and batch processing (ETL); in Dataproc, we can only perform batch processing.

Messaging has been one of Facebook's most popular features since its inception. Features such as the Like button or status updates are handled in MySQL databases, but applications such as the Facebook messaging system run on top of HBase, Hadoop's NoSQL database framework (a minimal HBase read/write sketch appears at the end of this section). Facebook's data-warehousing solution lies in Hive, which is built on top of HDFS, and Facebook's reporting needs are also met using Hive. After 2011, with the increasing magnitude of data and to improve efficiency, Facebook started implementing Apache Corona, which works very much like the YARN framework.

When it comes to the size of a Hadoop cluster, Yahoo beats all, with 42,000 nodes in about 20 YARN clusters and 600 petabytes of data on HDFS, serving the company's mobile, search, advertising, personalization, media, and communication efforts. Yahoo uses Hadoop to block around 20.5 billion messages, checking each one before it enters its email servers; Yahoo's spam-detection abilities have increased manifold since it started using Hadoop. Yahoo has been one of the major contributors to the ever-growing Hadoop family and a pioneer of many new technologies that have since embedded themselves in the Hadoop ecosystem. Notable technologies Yahoo uses apart from MapReduce and HDFS include Apache Tez and Spark. One of the main vehicles of Yahoo's Hadoop chariot is Pig, which started at Yahoo and still tops the chart, as 50-60 percent of its jobs are processed using Pig scripts.
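To give a flavour of the HBase side mentioned above, here is a minimal Java sketch of storing and fetching a chat message by row key, the access pattern a messaging system relies on. The table name, column family, and row-key scheme are hypothetical, chosen purely for illustration; they are not Facebook's actual schema.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseMessageDemo {
    public static void main(String[] args) throws Exception {
        // Reads hbase-site.xml from the classpath for ZooKeeper/cluster settings.
        Configuration conf = HBaseConfiguration.create();

        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("messages"))) {

            // Hypothetical row key: user id plus timestamp, so one user's
            // messages sort together and range scans stay cheap.
            byte[] rowKey = Bytes.toBytes("user123#2024-01-01T10:00");

            // Store one message cell under column family "m", qualifier "body".
            Put put = new Put(rowKey);
            put.addColumn(Bytes.toBytes("m"), Bytes.toBytes("body"), Bytes.toBytes("hey there!"));
            table.put(put);

            // Fetch it back by the same row key.
            Result result = table.get(new Get(rowKey));
            byte[] body = result.getValue(Bytes.toBytes("m"), Bytes.toBytes("body"));
            System.out.println(Bytes.toString(body));
        }
    }
}
```

In HBase the row-key design carries most of the weight: choosing keys that cluster related messages together is what keeps reads fast at this kind of scale.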

So this is the power of the Distributed Storage Cluster. Thank you, guys!

