Unit-3 Hadoop Environment
• Hadoop Configuration,
• Security in Hadoop,
• Administering Hadoop,
• Hadoop Benchmarks,
• When Hadoop was first released in 2007 it was intended to manage large amounts
of web data in a trusted environment, so security was not a significant concern or
focus.
• It is also important to note that the commercial Hadoop distributions from software
vendors (e.g. Cloudera, Hortonworks, MapR) include additional, proprietary security features that
are not part of the free Hadoop releases available from the Apache Software Foundation.
Security in Hadoop
• Apache Hadoop is a powerful, robust, and highly scalable big-data processing
framework, capable of crunching petabytes of data with ease. Because of these
capabilities, organizations across the business, health, military, and finance sectors
started using Hadoop.
• As Hadoop gained popularity, its developers discovered a monumental oversight:
Hadoop lacked dedicated security software. This affected many areas where Hadoop
was in use.
• Hadoop Security is generally defined as the procedure for securing the Hadoop data storage
unit, offering a virtually impenetrable wall against any potential cyber threat. Hadoop attains
this high-calibre security by following the security protocol described below.
• Around 2009, Hadoop's security was designed and implemented, and it has been stabilizing
since then. In 2010, security features were added to Hadoop with two fundamental goals in mind.
• Hadoop Security thus refers to the process that provides authentication, authorization, and
auditing, and secures the Hadoop data storage unit by offering an inviolable wall of security
against any cyber threat.
3 A's of Security
• Authentication: The first stage, in which the identity of the user is verified (in
Hadoop, typically with Kerberos credentials) before any access is granted.
• Authorization: The second stage, which defines what individual users can do
after they have been authenticated. Authorization controls what a particular user
can do to a specific file, granting or denying permission to access the data.
• Auditing: The third stage, which records what an authenticated user actually did
on the cluster so that activity can be reviewed later.
Data Protection: The use of techniques such as encryption and data masking to
prevent unauthorized users and applications from accessing sensitive data.
Kerberos
• Kerberos is one of the simplest and safest network authentication protocols used by
Hadoop for its data and network security. It was developed at MIT.
• The main objective of Kerberos is to eliminate the need to exchange passwords over a
network, and also to secure the network from any potential cyber sniffing.
• The KDC, or Key Distribution Center, is the heart of Kerberos. It consists of three
components.
The main components of Kerberos are:
Authentication Server (AS): Performs the initial authentication of the user and issues the
ticket-granting ticket.
Database: The Authentication Server verifies the access rights of users against the database.
Ticket Granting Server (TGS): Issues the ticket for the target server.
[Figure: Kerberos message flow. The User exchanges messages with the Authentication Server
(which consults the Database), then with the Ticket Granting Server (TGS), and finally with the
target Server, in six numbered steps.]
Step-1: The user logs in and requests services on the host, i.e., requests a ticket-granting ticket.
Step-2: The Authentication Server verifies the user's access rights using the database and then
issues a ticket-granting ticket and a session key. The results are encrypted with the user's password.
Step-3: The client decrypts the message using the password and then sends the ticket to the Ticket
Granting Server. The ticket contains authenticators such as the user name and network address.
Step-4: The Ticket Granting Server decrypts the ticket sent by the user, verifies the request using
the authenticator, and then creates a ticket for requesting services from the server.
Step-5: The user sends the ticket and authenticator to the server.
Step-6: The server verifies the ticket and authenticator, then grants access to the service. After
this, the user can access the services.
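On a Kerberos-secured cluster, the six-step flow above is hidden behind a few client commands. A sketch of a typical session follows; the principal `alice@EXAMPLE.COM` and the HDFS path are illustrative, and a live KDC and cluster are required to actually run this:

```shell
# Obtain a ticket-granting ticket from the KDC (prompts for the password)
kinit alice@EXAMPLE.COM

# Inspect the ticket cache: shows the TGT and any service tickets held
klist

# HDFS commands now authenticate transparently using the cached tickets
hdfs dfs -ls /user/alice
```

Because the TGT is cached, subsequent Hadoop commands authenticate without ever sending the password over the network, which is exactly the property Kerberos is designed to provide.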
Transparent Encryption in HDFS
• For data protection, Hadoop HDFS implements transparent encryption. Once it is configured,
the data that is to be read from and written to the special HDFS directories is encrypted and
decrypted transparently without requiring any changes to the user application code.
• This encryption is end-to-end, which means that only the client encrypts and
decrypts the data. Hadoop HDFS never stores or has access to unencrypted data or
unencrypted data encryption keys, which satisfies both at-rest and in-transit encryption.
At-rest encryption refers to the encryption of data when data is on persistent media such as
a disk.
In-transit encryption means encryption of data when data is traveling over the network.
HDFS encryption enables the existing Hadoop applications to run transparently on the
encrypted data.
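Transparent encryption is configured per directory through encryption zones. Assuming a Hadoop KMS is configured, a zone is typically set up as sketched below; the key name `mykey` and the path `/secure` are illustrative:

```shell
# Create an encryption key in the Hadoop KMS
hadoop key create mykey

# Make an empty directory and turn it into an encryption zone backed by that key
hdfs dfs -mkdir /secure
hdfs crypto -createZone -keyName mykey -path /secure

# List the configured encryption zones to confirm
hdfs crypto -listZones
```

Files written under `/secure` are then encrypted and decrypted transparently on the client side, with no changes to application code.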
• The files or directories have different permissions for the owner, group members, and all
other users.
• For directories, r is the permission to list the content of the directory, w is the permission to
create or delete files/directories, and x is the permission to access a child of the directory.
• To prevent anyone except the file/directory owner and the superuser from deleting or
moving files within a directory, we can set a sticky bit on the directory.
• The owner of the file/directory is the user identity of the client process, and the group of
file/directory is the parent directory group.
• Also, every client process that accesses HDFS has a two-part identity: a
user name and a group list.
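The sticky bit described above is set with an octal mode whose leading digit is 1, just as on a POSIX filesystem. A sketch, with an illustrative path:

```shell
# World-writable shared directory, but only a file's owner (or the superuser)
# may delete or move files inside it, thanks to the sticky bit (mode 1777)
hdfs dfs -mkdir /tmp/shared
hdfs dfs -chmod 1777 /tmp/shared

# The trailing 't' in the permissions string confirms the sticky bit is set
hdfs dfs -ls -d /tmp/shared
```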
Tools for Hadoop Security
The Hadoop ecosystem contains some tools for supporting Hadoop Security. The two major
Apache open-source projects that support Hadoop Security are Knox and Ranger.
1. Knox
Knox is a REST API-based perimeter security gateway that performs authentication and supports
monitoring, auditing, authorization management, and policy enforcement on Hadoop clusters.
It generally authenticates user credentials against LDAP or Active Directory, and only
successfully authenticated users are allowed to access the Hadoop cluster.
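Because Knox exposes cluster services over REST, a client outside the cluster can reach WebHDFS through the gateway with plain HTTPS. A sketch, assuming a topology named `default` on Knox's conventional port 8443; the host name and credentials are illustrative:

```shell
# List the HDFS root directory through the Knox gateway; Knox checks the
# credentials against LDAP/AD before forwarding the request to WebHDFS
curl -iku alice:password \
  'https://knox.example.com:8443/gateway/default/webhdfs/v1/?op=LISTSTATUS'
```

Note that the client never talks to the NameNode directly; the gateway URL pattern `/gateway/<topology>/<service>` is the only surface exposed outside the perimeter.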
2. Ranger
Ranger is an authorization system that grants or denies access to Hadoop cluster resources
(HDFS files, Hive tables, etc.) based on predefined policies. A user request is assumed to be
already authenticated by the time it reaches Ranger. Ranger provides separate authorization
plugins for different Hadoop components such as YARN, Hive, and HBase.
Write a summary of the following security frameworks:
Apache Sentry
Apache Ranger
https://www.oreilly.com/library/view/hadoop-security/9781491900970/ch01.html
Hadoop Benchmarks
• NNThroughputBenchmark measures the number of operations performed by the NameNode per
second.
• Specifically, for each operation tested, it reports the total running time in seconds
(Elapsed Time), the operation throughput (Ops per sec), and the average time per
operation (Average Time). Higher throughput, and lower average time, is better.
$ hadoop org.apache.hadoop.hdfs.server.namenode.NNThroughputBenchmark -fs hdfs://nameservice:9000 -op open -threads 1000 -files 100000
• We need to fully utilize the cluster's capability so that we can obtain the best
performance from the underlying infrastructure.
• Benchmarks make good tests, as you also get numbers that you can compare with other
clusters as a sanity check on whether your new cluster is performing roughly as
expected.
• One can tune a cluster using benchmark results to squeeze the best performance out of
it. This is often done with monitoring systems in place, so it can be seen
how resources are being used across the cluster.
• To get the best results, one should run benchmarks on a cluster that is not being used by
others.
• In practice, this is just before it is put into service and users start relying on it. Once
users have periodically scheduled jobs on a cluster, it is generally impossible to find a
time when the cluster is not being used (unless you arrange downtime with users), so
you should run benchmarks to your satisfaction before this happens.
1. DFSIO: The DFSIO benchmark provides an HDFS I/O throughput calculation that
helps measure read and write throughput to HDFS.
• This test is particularly useful for stress-testing a Hadoop cluster: it loads a
huge data set onto HDFS and measures HDFS throughput in terms of the time taken to
read and write the data.
• This process helps identify potential bottlenecks in the cluster and tune the network
or hardware for better performance.
1. The write test focuses on writing data into HDFS and benchmarks HDFS write
throughput. By default, the output folder for this benchmarking is
/benchmarks/TestDFSIO in HDFS.
Data is generated by running the TestDFSIO tool in write mode.
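A typical TestDFSIO write invocation might look like the following; the tests jar name and path vary by Hadoop version and distribution (older releases spell the size flag `-fileSize` rather than `-size`), and the file count and size here are illustrative (400 files of 10 GB each, roughly 4 TB in total):

```shell
# Write 400 files of 10 GB each (~4 TB total) into /benchmarks/TestDFSIO
# and report the aggregate HDFS write throughput
hadoop jar "$HADOOP_HOME"/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-*-tests.jar \
  TestDFSIO -write -nrFiles 400 -size 10GB
```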
2. The read test starts a job that reads the 4 TB of data generated in the
previous step. Using this benchmarking tool, we can determine the cumulative read
throughput of the entire cluster by measuring how long it takes to read 4 TB of
data.
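The corresponding read invocation mirrors the write one (the same hedges about jar path and flag spelling apply), and a cleanup mode removes the generated files afterwards:

```shell
# Read back the files written by the write test and report read throughput
hadoop jar "$HADOOP_HOME"/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-*-tests.jar \
  TestDFSIO -read -nrFiles 400 -size 10GB

# Remove /benchmarks/TestDFSIO once finished
hadoop jar "$HADOOP_HOME"/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-*-tests.jar \
  TestDFSIO -clean
```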
Terasort Benchmark
TeraSort benchmarking focuses on the CPU cycles spent processing data. The idea of TeraSort
is to sort a randomly generated data set as fast as possible. The time taken to sort the data
gives a clear picture of how well the cluster is tuned for CPU-intensive operations.
TeraSort benchmarking is a test of CPU- and RAM-bound processing operations.
To start the TeraSort operation, we first generate a random data set in HDFS, which will
later be sorted. The data (for example, 4 TB) is produced with the TeraGen tool.
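A sketch of the full TeraGen/TeraSort/TeraValidate sequence follows; the examples jar path varies by distribution, the HDFS paths are illustrative, and the row count is chosen because 40 billion 100-byte rows is roughly 4 TB:

```shell
# Generate about 4 TB of random 100-byte records (40,000,000,000 rows)
hadoop jar "$HADOOP_HOME"/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
  teragen 40000000000 /tera/input

# Sort the generated data; the elapsed time is the benchmark result
hadoop jar "$HADOOP_HOME"/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
  terasort /tera/input /tera/output

# Verify that the output is globally sorted
hadoop jar "$HADOOP_HOME"/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
  teravalidate /tera/output /tera/report
```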
The idea of this benchmark is to stress NameNode throughput by generating lots
of small HDFS files, whose metadata the NameNode must store in memory.
MapReduce benchmark
The idea is to benchmark smaller jobs that run in parallel on the Hadoop cluster,
scheduling multiple MR jobs through the YARN scheduler. A typical run launches
50 small jobs.
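One common way to do this is the MRBench tool shipped in the jobclient tests jar (the jar path varies by distribution); `-numRuns` controls how many small jobs are launched:

```shell
# Run 50 small MapReduce jobs and report the average job completion time,
# exercising the YARN scheduler with many short-lived jobs
hadoop jar "$HADOOP_HOME"/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-*-tests.jar \
  mrbench -numRuns 50
```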
Hadoop in the Cloud
• AWS : Amazon Elastic Map/Reduce (EMR) is a managed service that allows you to process
and analyze large datasets using the latest versions of big data processing frameworks
such as Apache Hadoop, Spark, HBase, and Presto, on fully customizable clusters.
• Google Cloud: Google Dataproc is a fully-managed cloud service for running Apache
Hadoop and Spark clusters. It provides enterprise-grade security, governance, and
support, and can be used for general purpose data processing, analytics, and machine
learning.
Hadoop in the cloud
https://www.youtube.com/watch?v=7BSAbzwU6Ac&list=PL0hSJrxggIQoer5TwgW6qWHnVJfqzcX3l
Azure HDInsight:
https://www.youtube.com/watch?v=phKiAIHMLAI
https://www.youtube.com/watch?v=IzmT0fH4KE0
Hadoop Alternatives
• Ceph
• Hydra by AddThis
Happy Learning