Distributed Storage Cluster & Hadoop
Data is a collection of facts (numbers, words, measurements, observations, etc.) that has been translated into a form computers can process, and from which we can gain insights useful for analysis as well as business purposes. We have various types of data, such as structured and unstructured data. Beyond that, further varieties include personal data, sensor data, social feeds, web data, transactional data, logs, and many more.
Many businesses are based entirely on data, so in today's world data is the new fuel; it is vital in every field. There are 2.5 quintillion bytes of data created each day at our current pace, and that pace is only accelerating with the growth of the Internet of Things (IoT). Over the last two years alone, 90 percent of the data in the world was generated. If you talk about social media, Facebook alone, the most active of social networks with over 1.4 billion monthly active users, generates the largest amount of social data: users like over 4 million posts every minute (4,166,667, to be exact), which adds up to 250 million per hour! The same goes for other giants such as Google, which provides the best search engine, IBM, and Instagram.
So the data is huge, and we need to store it, retrieve it as needed, and also process and analyse it. Here comes the problem of Big Data. First there is a storage issue, which can also be called volume. Even if we somehow procure a large number of hard disks, an I/O issue comes up. In simple words, a large volume of data takes a huge amount of time to write to those disks and just as long to read back from them. So where is the time left to process it, analyse it, and produce helpful insights? Obviously this hampers our ability to meet the demands of the market.
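To get a feel for the bottleneck, here is a rough back-of-the-envelope sketch. The 1 TB dataset size and the 100 MB/s per-disk throughput are illustrative assumptions, not benchmarks; the point is simply that spreading the same data across 100 disks read in parallel turns a multi-hour read into a couple of minutes.

```java
public class IoBottleneck {
    public static void main(String[] args) {
        // Illustrative assumptions: 1 TB of data, one disk streaming at ~100 MB/s.
        double dataMB = 1_000_000;   // 1 TB expressed in MB
        double diskMBps = 100;       // sequential read throughput of one disk

        double singleDiskSeconds = dataMB / diskMBps;
        System.out.printf("One disk:  %.0f s (~%.1f hours)%n",
                singleDiskSeconds, singleDiskSeconds / 3600);

        // Split the same data across 100 disks read in parallel --
        // the core idea behind a distributed storage cluster.
        int disks = 100;
        double parallelSeconds = dataMB / (diskMBps * disks);
        System.out.printf("100 disks: %.0f s (~%.1f minutes)%n",
                parallelSeconds, parallelSeconds / 60);
    }
}
```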
Hence we have a great concept that helps us overcome all these issues: the "Distributed Storage Cluster". This refers to an infrastructure that splits data across multiple physical servers, and often across more than one data center. It typically takes the form of a cluster of storage units, with a mechanism for data synchronization and coordination between cluster nodes. To implement this concept we have great software from the Apache community: Hadoop. It is an open-source framework used to efficiently store and process large datasets, ranging in size from gigabytes to petabytes. Instead of using one large computer to store and process the data, Hadoop allows clustering multiple computers to analyse massive datasets in parallel, more quickly. Its file system is known as the Hadoop Distributed File System, or HDFS.
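As a minimal sketch of how an application talks to HDFS, the snippet below uses Hadoop's Java FileSystem API to copy a local file into the cluster and list the result. It assumes a running cluster and the Hadoop client libraries on the classpath; the NameNode address hdfs://namenode:9000 and the file paths are placeholders.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder address: point this at your own NameNode.
        conf.set("fs.defaultFS", "hdfs://namenode:9000");
        FileSystem fs = FileSystem.get(conf);

        // Copy a local file into the cluster; HDFS splits it into blocks
        // and replicates each block across several DataNodes.
        fs.copyFromLocalFile(new Path("/tmp/sales.csv"),
                             new Path("/data/sales.csv"));

        // List the directory to confirm the file landed.
        for (FileStatus status : fs.listStatus(new Path("/data"))) {
            System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
        }
        fs.close();
    }
}
```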
So how do companies use this technology? Well, many data-driven companies are using Hadoop at great scale.

Google has its own cloud, the Google Cloud Platform, where Dataproc is considered the managed Hadoop for the cloud. Google provides another Hadoop-related service, Cloud Dataflow, which is generally used when the user wants to perform both stream processing (ETL) and batch processing (ETL); in Dataproc, we can only perform batch processing.

Messaging has been one of Facebook's popular features since its inception. Other Facebook features, such as the Like button or status updates, are handled in MySQL databases, but applications such as the Facebook messaging system run on top of HBase, Hadoop's NoSQL database framework. Facebook's data warehousing solution lies in Hive, which is built on top of HDFS, and Facebook's reporting needs are also met using Hive. After 2011, with the increasing magnitude of data and to improve efficiency, Facebook started implementing Apache Corona, which works much like the YARN framework.

When it comes to the size of a Hadoop cluster, Yahoo beats all, with 42,000 nodes in about 20 YARN clusters and 600 petabytes of data on HDFS serving the company's mobile, search, advertising, personalization, media, and communication efforts. Yahoo uses Hadoop to check messages before they enter its email servers, blocking around 20.5 billion of them; its spam-detection abilities have grown manifold since it started using Hadoop. Yahoo has been one of the major contributors to the ever-growing Hadoop family and the pioneer of many new technologies that have since embedded themselves in the Hadoop ecosystem. Notable technologies Yahoo uses apart from MapReduce and HDFS are Apache Tez and Spark. One of the main vehicles of Yahoo's Hadoop chariot is Pig, which started at Yahoo and still tops the chart, as 50-60 percent of Yahoo's jobs are processed using Pig scripts. The MapReduce programming model underlying such jobs is sketched below.
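Since MapReduce comes up repeatedly above, here is the canonical WordCount job, adapted from the Apache Hadoop tutorial, to give a flavour of the programming model. The input and output paths are passed on the command line and are placeholders.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: each mapper processes one block of the input in parallel,
  // emitting (word, 1) for every token it finds.
  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reduce phase: all counts for the same word arrive together
  // and are summed into a single total.
  public static class IntSumReducer
       extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // combine locally to cut shuffle traffic
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. /data/input
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // e.g. /data/output
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Packaged into a jar, it would be launched with something like hadoop jar wordcount.jar WordCount /data/input /data/output, and Hadoop schedules the map tasks on the nodes that already hold the input blocks, so the computation moves to the data rather than the other way around.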
So this is the power of a Distributed Storage Cluster. Thank you, guys!