Unit-3 Hadoop Environment


Hadoop Environment

By

Prof Shibdas Dutta


Associate Professor,

DCG DATA CORE SYSTEMS INDIA PVT LTD


Kolkata
Hadoop Environment:

• Setting up a Hadoop Cluster,

• Hadoop Configuration,

• Security in Hadoop,

• Administering Hadoop,

• Hadoop Benchmarks,

• Hadoop in the cloud.


Security in Hadoop

• When Hadoop was first released in 2007, it was intended to manage large amounts
of web data in a trusted environment, so security was not a significant concern or
focus.

• As adoption rose and Hadoop evolved into an enterprise technology, it developed a
reputation as an insecure platform. Most of the original Hadoop security shortcomings
have been addressed in subsequent releases, but perceptions are slow to change.
Hadoop’s security reputation has outlasted its reality.

• Security is actually quite inconsistent among Hadoop implementations, because the
built-in security and available options are inconsistent among release versions.

• It is also important to note that the commercial Hadoop distributions from software
vendors (e.g. Cloudera, Hortonworks, MapR) have additional, proprietary security that
is not included in the free Hadoop releases that are available from the Apache
Foundation.
Security in Hadoop
• Apache Hadoop is a powerful, robust, and highly scalable big data processing
framework, capable of crunching petabytes of data with ease. Because of these
capabilities, organizations across the business, health, military, and finance sectors
started using Hadoop.

• As Hadoop gained popularity, its developers recognized a monumental oversight:
Hadoop lacked dedicated security software. This affected many areas where Hadoop
was in use.
Security in Hadoop
• Hadoop Security is generally defined as a procedure to secure the Hadoop data storage
unit by offering a virtually impenetrable wall of security against any potential cyber threat.
Hadoop attains this high-calibre security by following the security protocol below.

• Around 2009, Hadoop’s security was designed and implemented, and it has been
stabilizing since then. In 2010, security features were added to Hadoop.

• Hadoop Security thus refers to the process that provides authentication, authorization,
and auditing, and secures the Hadoop data storage unit by offering an inviolable wall of
security against any cyber threat.

3 A’s of Security: Authentication, Authorization, Auditing


Security in Hadoop
• Authentication: It is the first stage, which strongly authenticates users to prove
their identities. In authentication, user credentials such as user ID and password are
verified. Authentication ensures that the user seeking to perform an operation is
who he claims to be, and thus trustworthy.

• Authorization: It is the second stage, which defines what individual users can do
after they have been authenticated. Authorization controls what a particular user
can do to a specific file. It determines whether the user has permission to access
the data or not.

• Auditing: Auditing is the process of keeping track of what an authenticated,
authorized user did once granted access to the cluster. It records all activity of
the authenticated user, including what data was accessed, added, or changed, and
what analyses the user performed from the time he logged in to the cluster.

Data Protection: It refers to the use of techniques like encryption and data masking
to prevent unauthorized users and applications from accessing sensitive data.
Kerberos
• Kerberos is one of the simplest and safest network authentication protocols used by
Hadoop for its data and network security. It was developed at MIT.

• The main objective of Kerberos is to eliminate the need to exchange passwords over a
network, and also, to secure the network from any potential cyber sniffing.

• The KDC, or Key Distribution Center, is the heart of Kerberos. It mainly consists of
three components.
Kerberos
The main components of Kerberos are:

Authentication Server (AS): The Authentication Server performs the initial
authentication and issues a ticket for the Ticket Granting Service.

Database: The Authentication Server verifies the access rights of users against the
database.

Ticket Granting Server (TGS): The Ticket Granting Server issues the ticket for the
Server.
Kerberos

[Diagram: Kerberos message flow (steps 1–6) between the User, Authentication Server,
Database, Ticket Granting Server (TGS), and Server]
Kerberos
Step-1: The user logs in and requests services on the host; thus the user requests the ticket-granting
service.

Step-2: The Authentication Server verifies the user’s access rights using the database, and then issues
a ticket-granting ticket and a session key. The result is encrypted using the user’s password.

Step-3: The user decrypts the message using the password, then sends the ticket to the Ticket
Granting Server. The ticket contains authenticators such as the user name and network address.

Step-4: The Ticket Granting Server decrypts the ticket sent by the user, verifies the request using the
authenticator, and then creates a ticket for requesting services from the Server.

Step-5: The user sends the ticket and authenticator to the Server.

Step-6: The Server verifies the ticket and authenticator, then grants access to the service. After
this, the user can access the services.
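The six steps above can be sketched as a toy simulation of the key exchange. This is a minimal illustration of the idea only: the XOR "cipher" and the key names, realm, and ticket format are made up for the example and bear no resemblance to real Kerberos cryptography.

```python
import hashlib

def key_from(secret: str) -> bytes:
    # Derive a 32-byte key from a secret (toy stand-in for Kerberos key derivation).
    return hashlib.sha256(secret.encode()).digest()

def xor_crypt(key: bytes, data: bytes) -> bytes:
    # Toy symmetric "cipher": XOR with a repeating key. NOT secure; illustration only.
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))

# Long-term keys known to the KDC and their owners -- never sent over the network.
user_key = key_from("user-password")
tgs_key = key_from("tgs-secret")
server_key = key_from("server-secret")

# Steps 1-2: the AS issues a session key, wrapped two ways -- once under the
# user's key (for the user) and once inside the TGT under the TGS's key.
session_key = key_from("session-1")
reply_for_user = xor_crypt(user_key, session_key)
tgt = xor_crypt(tgs_key, b"user@REALM|" + session_key)

# Step 3: the user recovers the session key using nothing but their password.
recovered = xor_crypt(user_key, reply_for_user)
assert recovered == session_key  # the password itself never crossed the network

# Step 4: the TGS opens the TGT and issues a service ticket under the server's key.
name, _, sk = xor_crypt(tgs_key, tgt).partition(b"|")
assert sk == session_key         # TGS and user now share the session key
service_ticket = xor_crypt(server_key, name + b"|ok")

# Steps 5-6: the server opens the service ticket and grants access.
print(xor_crypt(server_key, service_ticket).decode())  # prints "user@REALM|ok"
```

Note how each secret only ever decrypts material encrypted by a party that already knew it: this is why no password needs to travel over the network.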
Transparent Encryption in HDFS

• For data protection, Hadoop HDFS implements transparent encryption. Once it is configured,
data read from and written to the special HDFS directories is encrypted and decrypted
transparently, without requiring any changes to user application code.

• This encryption is end-to-end encryption, which means that only the client will encrypt or
decrypt the data. Hadoop HDFS never stores or has access to unencrypted data or
unencrypted data encryption keys, satisfying both at-rest and in-transit encryption
requirements.

At-rest encryption refers to the encryption of data when data is on persistent media such as
a disk.

In-transit encryption means encryption of data when data is traveling over the network.

HDFS encryption enables the existing Hadoop applications to run transparently on the
encrypted data.

This HDFS-level encryption also protects against filesystem- and OS-level attacks.
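The end-to-end property can be illustrated with a small sketch: the client encrypts before writing and decrypts after reading, so the storage layer only ever holds ciphertext. The XOR "cipher", key, and path here are toy stand-ins, not HDFS's actual AES-based encryption or key infrastructure.

```python
import hashlib

def xor_crypt(key: bytes, data: bytes) -> bytes:
    # Toy cipher (XOR keystream) standing in for a real cipher; NOT secure.
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))

dek = hashlib.sha256(b"data-encryption-key").digest()  # per-file key, client-side only

storage = {}  # stands in for the HDFS storage layer

def client_write(path: str, plaintext: bytes) -> None:
    # The client encrypts BEFORE handing data to storage.
    storage[path] = xor_crypt(dek, plaintext)

def client_read(path: str) -> bytes:
    # The client decrypts AFTER reading; storage never sees plaintext.
    return xor_crypt(dek, storage[path])

client_write("/secure/report.csv", b"secret,totals")
assert storage["/secure/report.csv"] != b"secret,totals"   # at rest: ciphertext only
assert client_read("/secure/report.csv") == b"secret,totals"
```

Because encryption and decryption happen only in the client, an attacker who reads the stored bytes (or sniffs them in transit) obtains ciphertext, which is the at-rest and in-transit guarantee described above.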


HDFS file and directory permission
• The HDFS permission model is very similar to the Portable Operating System Interface
(POSIX) model. Every file and directory in HDFS has an owner and a group.

• The files or directories have different permissions for the owner, group members, and all
other users.

• For files, r is the read permission and w is the write or append permission (the x
permission is irrelevant for files, since HDFS files cannot be executed).

• For directories, r is the permission to list the content of the directory, w is the permission to
create or delete files/directories, and x is the permission to access a child of the directory.

• To prevent anyone except the file/directory owner and the superuser from deleting or
moving files within a directory, we can set the sticky bit on the directory.

• The owner of a file/directory is the user identity of the client process that created it,
and its group is the group of the parent directory.

• Also, every client process that accesses HDFS has a two-part identity: a user name
and a group list.
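The checks described above can be sketched as a small model. This is an illustrative simplification (the superuser bypass and ACLs are omitted), and the owners, groups, and paths are hypothetical:

```python
from typing import NamedTuple

class Inode(NamedTuple):
    owner: str
    group: str
    mode: int              # octal permission bits, e.g. 0o755
    sticky: bool = False   # only meaningful for directories

def check(inode: Inode, user: str, groups: set[str], perm: str) -> bool:
    # perm is "r", "w", or "x"; select the owner/group/other bit triplet.
    bit = {"r": 4, "w": 2, "x": 1}[perm]
    if user == inode.owner:
        shift = 6
    elif inode.group in groups:
        shift = 3
    else:
        shift = 0
    return bool((inode.mode >> shift) & bit)

def may_delete_child(parent: Inode, child: Inode, user: str, groups: set[str]) -> bool:
    # Deleting a child needs w and x on the parent directory; with the sticky
    # bit set, only the child's owner or the directory's owner may delete it.
    if not (check(parent, user, groups, "w") and check(parent, user, groups, "x")):
        return False
    if parent.sticky:
        return user in (child.owner, parent.owner)
    return True

warehouse = Inode("hive", "analysts", 0o777, sticky=True)
table = Inode("alice", "analysts", 0o644)
print(may_delete_child(warehouse, table, "alice", {"analysts"}))  # True: file owner
print(may_delete_child(warehouse, table, "bob", {"analysts"}))    # False: sticky bit
```

The example shows why the sticky bit matters on a world-writable directory: without it, any group member with w and x on the directory could delete other users' files.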
Tools for Hadoop Security

The Hadoop ecosystem contains some tools for supporting Hadoop Security. The two major
Apache open-source projects that support Hadoop Security are Knox and Ranger.

1. Knox
Knox is a REST API-based perimeter security gateway that performs authentication and
supports monitoring, auditing, authorization management, and policy enforcement for
Hadoop clusters. It authenticates user credentials, generally against LDAP and Active
Directory, and allows only successfully authenticated users to access the Hadoop cluster.

2. Ranger
It is an authorization system that grants or denies access to Hadoop cluster resources such
as HDFS files, Hive tables, etc. based on predefined policies. User requests are assumed to
be already authenticated when they reach Ranger. It provides different authorization
functionality for different Hadoop components such as YARN, Hive, HBase, etc.
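Policy-based authorization of this kind can be sketched as follows; the policy structure, resource paths, and user names are hypothetical stand-ins, not Ranger's actual policy model or API.

```python
# A policy grants a set of users a set of actions on a resource subtree.
policies = [
    {"resource": "/warehouse/hive/sales", "users": {"alice"}, "actions": {"read", "write"}},
    {"resource": "/warehouse/hive/hr",    "users": {"bob"},   "actions": {"read"}},
]

def is_allowed(user: str, resource: str, action: str) -> bool:
    # Deny by default; any matching policy covering the action grants access.
    # The caller is assumed to be already authenticated, as with Ranger.
    return any(
        resource.startswith(p["resource"])
        and user in p["users"]
        and action in p["actions"]
        for p in policies
    )

print(is_allowed("alice", "/warehouse/hive/sales/2024", "write"))  # True
print(is_allowed("bob", "/warehouse/hive/hr", "write"))            # False
```

The deny-by-default stance is the key design point: access exists only where a policy explicitly grants it.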
Write a summary of the following security frameworks:

Apache Knox

Apache Accumulo

Apache Sentry

Apache Ranger
https://www.oreilly.com/library/view/hadoop-security/9781491900970/ch01.html
Hadoop Benchmarks
• The NNThroughputBenchmark measures the number of operations performed by the
NameNode per second.

• Specifically, for each operation tested, it reports the total running time in milliseconds
(Elapsed Time), operation throughput (Ops per sec), and the average time per
operation (Average Time). Higher throughput and lower average time are better.
$ hadoop org.apache.hadoop.hdfs.server.namenode.NNThroughputBenchmark -fs
hdfs://nameservice:9000 -op open -threads 1000 -files 100000

--- open inputs ---


nrFiles = 100000
nrThreads = 1000
nrFilesPerDir = 4
--- open stats ---
# operations: 100000
Elapsed Time: 9510
Ops per sec: 10515.247108307045
Average Time: 90
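The throughput line in the sample output can be reproduced from the other two figures, which also confirms that Elapsed Time is reported in milliseconds:

```python
# Recompute "Ops per sec" from the NNThroughputBenchmark output above:
# throughput is simply total operations divided by elapsed wall time.
ops = 100_000          # "# operations: 100000"
elapsed_ms = 9_510     # "Elapsed Time: 9510" (milliseconds)

ops_per_sec = ops / (elapsed_ms / 1000)
print(round(ops_per_sec, 2))  # 10515.25
```

The result matches the reported 10515.247108307045 ops/sec, so a cluster can be compared against another simply by re-running the same operation mix and comparing these ratios.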
Hadoop Benchmarks

• We need to fully utilize the cluster’s capability so that we can obtain the best
performance from the underlying infrastructure.

• Benchmarks make good tests, as you also get numbers that you can compare with other
clusters as a sanity check on whether your new cluster is performing roughly as
expected.

• One can tune a cluster using benchmark results to squeeze the best performance out of
it. This is often done with monitoring systems in place (“Monitoring”), so you can see
how resources are being used across the cluster.

• To get the best results, one should run benchmarks on a cluster that is not being used by
others.

• In practice, this is just before it is put into service and users start relying on it. Once
users have periodically scheduled jobs on a cluster, it is generally impossible to find a
time when the cluster is not being used (unless you arrange downtime with users), so
you should run benchmarks to your satisfaction before this happens.
Hadoop Benchmarks
1. DFSIO: The DFSIO benchmark provides an HDFS I/O throughput measurement that can
help identify and benchmark the read and write throughput of HDFS.

• This test is particularly useful for stress-testing a Hadoop cluster by loading a
huge set of data onto HDFS and measuring HDFS throughput in terms of the time taken
to read and write that data.

• This process helps identify potential bottlenecks in the cluster and tune the network
or hardware for better performance.

• The DFSIO benchmark has two parts:

1. The first part focuses on writing data into HDFS and benchmarks HDFS write
throughput. By default, the output folder for this benchmark is /benchmarks/TestDFSIO
in HDFS.

To run the DFSIO write benchmark and generate data, use the following command:

hadoop jar $HADOOP_HOME/hadoop-*test*.jar TestDFSIO -write -nrFiles 4000 -fileSize 1000


Hadoop Benchmarks
2. The second part of DFSIO ensures that HDFS reads can be properly initiated
and that all reads from HDFS happen at the best possible throughput. This read test
also provides an indication of how fast the processing algorithms will be able to read
and process data.

To start the read throughput test, we use the following command:

hadoop jar $HADOOP_HOME/hadoop-*test*.jar TestDFSIO -read -nrFiles 4000 -fileSize 1000

This command starts a job that reads the roughly 4 TB of data generated in the
previous step. Using this benchmarking tool, we can determine the cumulative read
throughput of the entire cluster by measuring how long it takes to read that
data.
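The "4 TB" figure follows directly from the flags used above (4,000 files of 1,000 MB each); a quick check:

```python
# Sanity-check the dataset size implied by the DFSIO flags:
# -nrFiles 4000 files, -fileSize 1000 (MB per file).
nr_files = 4000
file_size_mb = 1000

total_mb = nr_files * file_size_mb
total_tb = total_mb / 1_000_000      # decimal terabytes
print(total_tb)  # 4.0
```

Scaling either flag up or down changes the stress level of the test proportionally.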
Hadoop Benchmarks
Terasort Benchmark
The TeraSort benchmark focuses on the throughput of the CPU cycles spent on data
processing. The idea of TeraSort is to sort a randomly generated set of data as fast as
possible. The time taken to sort the data gives a clear picture of how well the cluster is
tuned to perform CPU-intensive operations.

TeraSort benchmarking is a test of CPU- and RAM-bound processing operations.

To start the TeraSort operation, we first generate a random set of data into HDFS, which will
later be sorted. To generate 4 TB of data, we use the following command:

hadoop jar hadoop-*examples*.jar teragen 40000000000 /user/hduser/terasort-input
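TeraGen writes fixed-size 100-byte records, so the row count passed above determines the dataset size:

```python
# The teragen argument is a row count; each TeraGen record is 100 bytes.
rows = 40_000_000_000      # argument passed to teragen above
bytes_per_row = 100        # fixed TeraGen record size

total_bytes = rows * bytes_per_row
print(total_bytes / 1e12)  # 4.0  (decimal terabytes)
```

This is why a 1 TB TeraSort run, for comparison, uses 10,000,000,000 rows.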


Hadoop Benchmarks
Namenode Benchmark

The idea of this benchmark is to measure NameNode hardware throughput by generating lots
of small HDFS files and storing their metadata in memory.

To perform the NameNode benchmark, we run the following:

hadoop jar hadoop-*test*.jar nnbench -operation create_write \

-maps 12 -reduces 6 -blockSize 1 -bytesToWrite 0 -numberOfFiles 1000 \

-replicationFactorPerFile 3 -readFileAfterOpen true \

-baseDir /benchmarks/NNBench-`hostname -s`


Hadoop Benchmarks

Mapreduce benchmark

The idea is to benchmark small jobs that run on the Hadoop cluster and to exercise the
YARN scheduler with multiple MR jobs. To run the benchmark, use the following
command, which runs a small test job 50 times:

hadoop jar hadoop-*test*.jar mrbench -numRuns 50


Cloud Computing
• The buzz word before “Big Data”
• Larry Ellison’s response in 2009
• Cloud Computing is a general term used to describe a new class of network-based
computing that takes place over the Internet
• A collection/group of integrated and networked hardware, software, and Internet
infrastructure (called a platform)
• Using the Internet for communication and transport, it provides hardware, software,
and networking services to clients
• These platforms hide the complexity and details of the underlying infrastructure
from users and applications by providing a very simple graphical interface or API
• A technical point of view
• Internet-based computing (i.e., computers attached to network)
• A business-model point of view
• Pay-as-you-go (i.e., rental)
Cloud Computing Architecture
Cloud Computing Services
• Infrastructure as a Service (IaaS)
• Offering hardware-related services using the principles of cloud computing. These
could include storage services (database or disk storage) or virtual servers.
• Amazon EC2, Amazon S3

• Platform as a Service (PaaS)
• Offering a development platform on the cloud.
• Google’s App Engine, Microsoft’s Azure

• Software as a Service (SaaS)
• Offering a complete software application on the cloud. Users can access a software
application hosted by the cloud vendor on a pay-per-use basis. This is a well-
established sector.
• Google’s Gmail, Microsoft’s Hotmail, Google Docs
Hadoop in the cloud

Hadoop on

• AWS : Amazon Elastic Map/Reduce (EMR) is a managed service that allows you to process
and analyze large datasets using the latest versions of big data processing frameworks
such as Apache Hadoop, Spark, HBase, and Presto, on fully customizable clusters.

• Azure: Azure HDInsight is a managed, open-source analytics service in the cloud.
HDInsight allows users to leverage open-source frameworks such as Hadoop, Apache
Spark, Apache Hive, LLAP, Apache Kafka, and more, running them in the Azure cloud
environment.

• Google Cloud: Google Dataproc is a fully-managed cloud service for running Apache
Hadoop and Spark clusters. It provides enterprise-grade security, governance, and
support, and can be used for general purpose data processing, analytics, and machine
learning.
Hadoop in the cloud

Amazon Elastic Map/Reduce (EMR):

https://www.youtube.com/watch?v=7BSAbzwU6Ac&list=PL0hSJrxggIQoer5TwgW6qWHnVJfqzcX3l

Azure HDInsight:

https://www.youtube.com/watch?v=phKiAIHMLAI

https://www.youtube.com/watch?v=IzmT0fH4KE0
Hadoop Alternatives

• Ceph

• Hydra by AddThis
Happy Learning
