Hadoop
Hadoop HDFS
The Hadoop Distributed File System (HDFS) is Hadoop's storage layer. Files are split into fixed-size blocks, and those blocks are distributed and stored across the cluster's slave machines (DataNodes) running on multiple servers.
HDFS divides large files into blocks of 128 MB (the default block size) and replicates each block three times by default. Replica placement follows two rules; a short sketch after the rules shows how these settings can be inspected and changed programmatically:
Two replicas of the same block cannot be placed on the same DataNode
When a cluster is rack aware, not all replicas of a block can be placed on the same rack
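As a rough illustration, the sketch below uses the Hadoop Java FileSystem API to read the cluster's default block size and replication factor and to change the replication factor for a single file; the path /data/example.txt is a hypothetical placeholder.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationSettings {
    public static void main(String[] args) throws Exception {
        // Picks up core-site.xml / hdfs-site.xml from the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/data/example.txt"); // hypothetical example file

        // Cluster defaults: typically 128 MB blocks and 3 replicas unless overridden.
        System.out.println("Default block size: " + fs.getDefaultBlockSize(file) + " bytes");
        System.out.println("Default replication: " + fs.getDefaultReplication(file));

        // Replication can also be changed per file; the NameNode then schedules
        // the creation or removal of replicas to match the new factor.
        fs.setReplication(file, (short) 2);
    }
}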
In this example, blocks A, B, C, and D are replicated three times and placed on different racks. If DataNode 7 crashes, we still have two copies of block C data on DataNode 4 of Rack 1 and DataNode 9 of Rack 3.
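To see the placement rules in action, a client can ask the NameNode where each block's replicas live. The sketch below, again using a hypothetical path, prints the DataNodes hosting every block of a file.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocations {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/data/example.txt")); // hypothetical example file

        // One BlockLocation per block; getHosts() lists the DataNodes holding its replicas.
        for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("Block at offset " + block.getOffset()
                    + " replicated on: " + String.join(", ", block.getHosts()));
        }
    }
}

With the default factor of three, each line should list three DataNodes, spread across at least two racks on a rack-aware cluster.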
There are three components of the Hadoop Distributed File System; a short read sketch after the list shows how a client touches them:
NameNode (a.k.a. master node): Holds the file system metadata, both in RAM and on disk
Secondary NameNode: Periodically merges the NameNode's edit log into the fsimage, maintaining a checkpointed copy of the NameNode's metadata on disk
Slave node (DataNode): Stores the actual data in the form of blocks
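The sketch below, reusing the same hypothetical file, shows how a client touches these components during a read: the metadata lookup is answered by the NameNode from its in-memory namespace, while the file's bytes are streamed from the DataNodes that hold its blocks.

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRead {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/data/example.txt"); // hypothetical example file

        // Served by the NameNode: file length, replication factor, block size.
        FileStatus status = fs.getFileStatus(file);
        System.out.println(status.getLen() + " bytes, replication " + status.getReplication());

        // Served by the DataNodes: the block contents themselves.
        try (FSDataInputStream in = fs.open(file);
             BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
            System.out.println(reader.readLine());
        }
    }
}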