Big Data Lab Manual
PRACTICAL NO – 1
Aim: Installation of Hadoop 3.1 (Single-Node Cluster).
THEORY:
Apache Hadoop 3.1 has noticeable improvements and many bug fixes over the previous stable 3.0 release. This version has many improvements in HDFS and MapReduce. This how-to guide will help you to set up a Hadoop 3.1.0 single-node cluster on CentOS/RHEL 7/6/5, Ubuntu 18.04, 17.10, 16.04 & 14.04, Debian 9/8/7 and LinuxMint systems. This article has been tested with Ubuntu 18.04 LTS.
1. Prerequisites
Java is the primary requirement for running Hadoop on any system, so make sure Java is installed (you can check with the java -version command). If you don't have Java installed on your system, install it first. Hadoop supports only Java 8; if any other version is already present, uninstall it first using one of the following commands.
sudo apt-get purge openjdk-* icedtea-* icedtea6-*
OR
sudo apt remove openjdk-8-jdk
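If Java 8 is not present, it can then be installed. The exact packages depend on your distribution; on Ubuntu or Debian, a minimal sequence looks roughly like this:

sudo apt-get update
sudo apt-get install openjdk-8-jdk
java -version

The last command should report a 1.8.x version before you continue with the Hadoop setup.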
By default, the Hadoop NameNode web interface starts on port 9870. Access your server on port 9870 in your favorite web browser.
http://localhost:9870/
Now access port 8042 to get information about the cluster and all applications:
http://localhost:8042/
The HDFS file browser in the NameNode UI can be reached at:
http://localhost:9870/explorer.html#/user/hadoop/logs/
PRACTICAL NO – 2
Aim: Hadoop Programming: Word Count MapReduce Program Using Eclipse
THEORY:
Step-3
Create a Java class file and write the code.
Click on the WordCount project. There will be an 'src' folder. Right-click on the 'src' folder -> New -> Class. Enter the class file name; here it is WordCount. Click on Finish.
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
}
}
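The class body shown above is only a fragment. For reference, a minimal WordCount implementation along the lines of the standard Hadoop example is sketched below; treat it as a sketch that may differ in detail from the code used in the lab.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: emits (word, 1) for every token in the input line
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reducer: sums the counts for each word
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The project can then be exported as a JAR and run with hadoop jar <jarfile> WordCount <input path> <output path>.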
Step-4
Add the external libraries from Hadoop.
Right-click on the WordCount project -> Build Path -> Configure Build Path -> click on Libraries -> click on the 'Add External JARs...' button.
PRACTICAL NO – 3
Aim: Implementing Matrix Multiplication Using Map-Reduce.
THEORY:
The Map function works as follows:
For each element mij of M, produce the (key, value) pairs ((i,k), (M, j, mij)) for k = 1, 2, 3, ... up to the number of columns of N.
For each element njk of N, produce the (key, value) pairs ((i,k), (N, j, njk)) for i = 1, 2, 3, ... up to the number of rows of M.
The result is the set of (key, value) pairs in which each key (i,k) has a list with the values (M, j, mij) and (N, j, njk) for all possible values of j.
CODE:
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import java.io.IOException;
@Override
Integer.parseInt(conf.get("p"));
// (M, i, j, Mij);
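The mapper listing above is incomplete here. A minimal sketch of the map step described at the start of this practical is given below; it assumes input lines of the form M,i,j,value or N,j,k,value and that the matrix dimensions m (rows of M) and p (columns of N) are passed through the job Configuration, matching the conf.get("p") call in the fragment above.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch of the map step described above (assumed input format: "M,i,j,Mij" or "N,j,k,Njk").
public class MatrixMultiplyMapper extends Mapper<LongWritable, Text, Text, Text> {

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        Configuration conf = context.getConfiguration();
        int m = Integer.parseInt(conf.get("m")); // number of rows of M
        int p = Integer.parseInt(conf.get("p")); // number of columns of N

        String[] tokens = value.toString().split(",");
        String matrixName = tokens[0];

        if (matrixName.equals("M")) {
            // element Mij: emit ((i,k), (M, j, Mij)) for every column k of N
            String i = tokens[1];
            String j = tokens[2];
            String mij = tokens[3];
            for (int k = 0; k < p; k++) {
                context.write(new Text(i + "," + k), new Text("M," + j + "," + mij));
            }
        } else {
            // element Njk: emit ((i,k), (N, j, Njk)) for every row i of M
            String j = tokens[1];
            String k = tokens[2];
            String njk = tokens[3];
            for (int i = 0; i < m; i++) {
                context.write(new Text(i + "," + k), new Text("N," + j + "," + njk));
            }
        }
    }
}

The corresponding reducer groups the values for each key (i,k), joins the M and N entries on j, multiplies the matching pairs and sums the products to produce element (i,k) of the result matrix.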
PRACTICAL NO – 4
Aim: Implementing Relational Operators using Pig.
THEORY:
In this practical, we will explore a few important relational operators in Pig, which are widely used in the big data industry. Before we look at the relational operators, let us see what Pig is.
Apache Pig, developed by Yahoo!, helps in analyzing large datasets and lets you spend less time writing mapper and reducer programs. Pig enables users to write complex data analysis code without prior knowledge of Java. Pig's simple SQL-like scripting language is called Pig Latin, and it has its own runtime environment in which Pig Latin programs are executed.
Below are the two datasets that will be used in this practical.
Employee_details.txt
This dataset has 4 columns:
Emp_id: unique id for each employee
Name: name of the employee
Salary: salary of an employee
Ratings: Rating of an employee.
Employee_expenses.txt
This dataset has 2 columns:
Emp_id: id of an employee
Expense: expenses made by an employee
Now that you have an idea of the datasets, let us proceed and perform some relational operations using this data.
NOTE: All the analysis is performed in local mode. To work in local mode, you need to start the Pig grunt shell using the "pig -x local" command.
Relational Operators:
Load
Loads data from either the local file system or the Hadoop file system (HDFS).
Syntax:
LOAD 'path_of_data' [USING function] [AS schema];
Where:
path_of_data : file/directory name in single quotes.
USING : is the keyword.
function : if you choose to omit this, the default load function PigStorage() is used.
AS : is the keyword.
schema : schema of your data along with the data type.
Eg:
The file named employee_details.txt is a comma-separated file, and we are going to load it from the local file system.
A = LOAD '/home/acadgild/pig/employee_details.txt' USING PigStorage(',') AS
(id:int, name:chararray, salary:int, ratings:int);
Similarly, you can load the other dataset into another relation, say 'B':
B = LOAD '/home/acadgild/pig/employee_expenses.txt' USING PigStorage('\t') AS
(id:int, expenses:int);
As the fields in this file are tab separated, you need to use '\t'.
NOTE: If you load this dataset into relation A, the earlier dataset will no longer be accessible.
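As a quick sanity check (not part of the original walkthrough), the loaded relations can be inspected in the grunt shell with Pig's DUMP and DESCRIBE operators:

DUMP A;       -- prints the contents of relation A to the console
DESCRIBE A;   -- prints the schema, e.g. A: {id: int, name: chararray, salary: int, ratings: int}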
PRACTICAL NO – 5
Aim: Implementing Database Operations on Hive.
THEORY:
Hive defines a simple SQL-like query language for querying and managing large datasets, called HiveQL (HQL). It is easy to use if you are familiar with SQL. Hive also allows programmers who are familiar with the MapReduce framework to plug in custom mappers and reducers to perform more sophisticated analysis.
Uses of Hive:
1. Apache Hive works on top of Hadoop's distributed storage.
2. Hive provides tools to enable easy data extract/transform/load (ETL).
3. It imposes structure on a variety of data formats.
4. By using Hive, we can access files stored in the Hadoop Distributed File System (HDFS) or in other data storage systems such as Apache HBase, and query and manage the large datasets residing there.
Limitations of Hive:
• Hive is not designed for online transaction processing (OLTP); it is only used for online analytical processing (OLAP).
• Hive supports overwriting or appending data, but not updates and deletes.
• In Hive, subqueries are not supported.
Why is Hive used in spite of Pig?
The following are the reasons why Hive is used even though Pig is available:
● Hive-QL is a declarative language like SQL, whereas Pig Latin is a data flow language.
● Pig: a data-flow language and environment for exploring very large datasets.
● Hive: a distributed data warehouse.
Components of Hive:
Metastore :
Hive stores the schema of the Hive tables in a Hive metastore. The metastore is used to hold all the information about the tables and partitions that are in the warehouse. By default, the metastore runs in the same process as the Hive service, and the default metastore database is Derby.
SerDe :
Serializer/Deserializer; it gives instructions to Hive on how to process a record.
Hive Commands :
Data Definition Language (DDL )
DDL statements are used to build and modify the tables and other objects in the database.
Example :
CREATE, DROP, TRUNCATE, ALTER, SHOW, DESCRIBE Statements.
Go to the Hive shell by giving the command sudo hive, and enter the command 'create database <database name>;' to create a new database in Hive.
To list out the databases in Hive warehouse, enter the command ‘show databases’.
Copy the input data to HDFS from the local file system by using the copyFromLocal command (hadoop fs -copyFromLocal <local path> <HDFS path>).
When we create a table in Hive, it is created in the default location of the Hive warehouse, "/user/hive/warehouse". After creating the table, we can move the data from HDFS into the Hive table.
The following command creates a table within the location "/user/hive/warehouse/retail.db".
Note: retail.db is the database created in the Hive warehouse.
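For illustration, a minimal HiveQL session along these lines might look as follows; the table name and columns are only an example and not necessarily the exact table used in the lab:

create database retail;
show databases;
use retail;
-- example table; the real lab table may have different columns
create table txnrecords (txnno INT, custno INT, amount DOUBLE, category STRING)
row format delimited
fields terminated by ','
stored as textfile;
-- load the data that was copied to HDFS earlier (path is illustrative)
load data inpath '/user/hive/input/txns.txt' overwrite into table txnrecords;

The table directory is created under /user/hive/warehouse/retail.db/txnrecords, matching the warehouse location mentioned above.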
PRACTICAL NO – 6
Aim: Implementing Bloom Filter using Map-Reduce.
THEORY:
Bloom filtering is similar to generic filtering in that it is looking at each record and deciding
whether to keep or remove it. However, there are two major differences that set it apart from
generic filtering. First, we want to filter the record based on some sort of set membership
operation against the hot values. For example: keep or throw away this record if the value in
the user field is a member of a predetermined list of users. Second, the set membership is going
to be evaluated with a Bloom filter.
Formula for the optimal size of a Bloom filter:
OptimalBloomFilterSize = (-n * ln(p)) / (ln 2)^2
where n is the number of members in the set and p is the desired false positive rate.
Formula to get the optimal number of hash functions:
OptimalK = (OptimalBloomFilterSize * ln 2) / n
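As a worked example, using the values that appear in the code below: for 32,658 members and a desired false positive rate of 0.2, the optimal filter size is roughly (-32658 * ln 0.2) / (ln 2)^2 ≈ 109,000 bits, and the optimal number of hash functions is roughly (109,000 * ln 2) / 32,658 ≈ 2.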
CODE:
We read the input file and store the Bloom filter hot-words file in the local file system (here, on Windows). Ideally, the file should be read and stored in HDFS using the Hadoop HDFS API; for simplicity, the code for the HDFS file system has not been included. This Bloom filter file can later be deserialized from HDFS or the local file system just as easily as it was written: open the file using the FileSystem object and pass it to BloomFilter.readFields.
import java.io.DataOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import org.apache.hadoop.util.bloom.BloomFilter;
import org.apache.hadoop.util.bloom.Key;
import org.apache.hadoop.util.hash.Hash;
public class DepartmentBloomFilterTrainer {
public static int getBloomFilterOptimalSize(int numElements, float falsePosRate) {
return (int) (-numElements * (float)
Math.log(falsePosRate) / Math.pow(Math.log(2), 2));
}
public static int getOptimalK(float numElements, float vectorSize) {
return (int) Math.round(vectorSize * Math.log(2) / numElements);
}
public static void main(String[] args) throws IOException {
args = new String[] { "32658", "0.2", "Replace this string with Input file location",
        "Replace this string with output path location where the bloom filter hot list data will be stored", "" };
int numMembers = Integer.parseInt(args[0]);
float falsePosRate = Float.parseFloat(args[1]);
int vectorSize = getBloomFilterOptimalSize(numMembers, falsePosRate);
int nbHash = getOptimalK(numMembers, vectorSize);
BloomFilter filter = new BloomFilter(vectorSize, nbHash, Hash.MURMUR_HASH);
// ConfigFile is a helper class from the original source; it simply reads the file contents into a String.
ConfigFile configFile = new ConfigFile(args[2], FileType.script, FilePath.absolutePath);
String fileContent = configFile.getFileContent();
String[] fileLine = fileContent.split("\n");
for (String lineData : fileLine) {
String lineDataSplit[] = lineData.split(",", -1);
String departmentName = lineDataSplit[3];
filter.add(new Key(departmentName.getBytes()));
}
DataOutputStream dataOut = new DataOutputStream(new FileOutputStream(args[3]));
filter.write(dataOut);
dataOut.flush();
dataOut.close();
}
}
In the setup method, the Bloom filter file is deserialized and loaded into the Bloom filter. In the map method, the departmentName is extracted from each input record and tested against the Bloom filter. If the word is a member, the entire record is output to the file system. Ideally, to load the Bloom filter hot words we should use DistributedCache, a Hadoop utility that ensures that a file in HDFS is present on the local file system of each task that requires it; for simplicity, it is loaded from the local file system here. As we have trained the Bloom filter with the PUBLIC LIBRARY department, the output of the MapReduce program will contain only employee data relevant to the PUBLIC LIBRARY department.
import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.util.bloom.BloomFilter;
import org.apache.hadoop.util.bloom.Key;
public class DepartmentBloomFilterMapper extends Mapper<Object, Text, Text,
NullWritable> {
private BloomFilter filter = new BloomFilter();
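    // The listing in the manual stops at the field declaration above. A minimal sketch of the
    // setup() and map() methods described in the text follows; the hot-list file path below is a
    // placeholder, and the comma-separated record format with the department name in the fourth
    // field is carried over from the trainer code above.
    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // Deserialize the trained Bloom filter (ideally from HDFS or the DistributedCache;
        // here, like the trainer, it is read from the local file system).
        DataInputStream dataIn = new DataInputStream(
                new FileInputStream("Replace this string with the bloom filter file location"));
        filter.readFields(dataIn);
        dataIn.close();
    }

    @Override
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] lineDataSplit = value.toString().split(",", -1);
        String departmentName = lineDataSplit[3];
        // Emit the whole record only if the department is (probably) in the hot list.
        if (filter.membershipTest(new Key(departmentName.getBytes()))) {
            context.write(value, NullWritable.get());
        }
    }
}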
PRACTICAL NO – 7
Aim: Implementing Frequent Item Set Algorithm Using Map-Reduce.
THEORY:
Frequent Itemset Mining aims to find regularities in a transaction dataset. MapReduce maps the presence of the data items in each transaction and, in the reduce phase, discards the itemsets with low frequency. The input consists of a set of transactions, and each transaction contains several items. The Map function reads the items from each transaction and generates output as (key, value) pairs, where the key is the item and the value is 1. After the map phase is completed, the Reduce function is executed; it aggregates the values corresponding to each key. From the results, the frequent items are computed on the basis of a minimum support value.
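Before looking at the Apriori driver below, the counting pass just described can be illustrated with a minimal mapper/reducer sketch; the whitespace-separated transaction format and the min.support parameter name are assumptions made here for illustration.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Counting pass: each item in a transaction is emitted with value 1, and the
// reducer keeps only items whose total count reaches the minimum support.
public class FrequentItemCount {

    public static class ItemMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text item = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            // one transaction per line, items separated by whitespace (assumed format)
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                item.set(itr.nextToken());
                context.write(item, one);
            }
        }
    }

    public static class SupportReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int minSupport = context.getConfiguration().getInt("min.support", 2); // assumed parameter
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            if (sum >= minSupport) {
                context.write(key, new IntWritable(sum));
            }
        }
    }
}

These classes would be wired into a Job in the same way as the WordCount driver of Practical 2.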
CODE:
Driver class (MRApriori):
// model.* and utils.* below are helper classes from the original Apriori project, not part of Hadoop.
import model.HashTreeNode;
import model.ItemSet;
import model.Transaction;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import utils.AprioriUtils;
import utils.HashTreeUtils;
import org.apache.hadoop.fs.*;
/*
 * A parallel hadoop-based Apriori algorithm
 */
public class MRApriori extends Configured implements Tool {

    private static String jobPrefix = "MRApriori Algorithm Phase ";

    // TODO : This is bad as I am using a global shared variable between functions which should
    // ideally be a function parameter. Need to fix this later. These parameters are required in
    // reducer logic and have to be dynamic. How can I pass some initialisation parameters to
    // the reducer?
    if (!isPassKMRJobDone) {
        System.err.println("Phase1 MapReduce job failed. Exiting !!");
        return -1;
    }
    return 1;
}
private static boolean runPassKMRJob(String hdfsInputDir, String hdfsOutputDirPrefix,
        int passNum, Double MIN_SUPPORT_PERCENT, Integer MAX_NUM_TXNS)
        throws IOException, InterruptedException, ClassNotFoundException
{
    boolean isMRJobSuccess = false;
PRACTICAL NO – 8
Aim: Implementing Clustering Algorithm Using Map-Reduce
Theory:
MapReduce runs as a series of jobs, with each job essentially a separate Java application that
goes out into the data and starts pulling out information as needed. Based on the MapReduce
design, records are processed in isolation via tasks called Mappers. The output from the Mapper
tasks is further processed by a second set of tasks, the Reducers, where the results from the
different Mapper tasks are merged together. The Map and Reduce functions of MapReduce are
both defined with respect to data structured in (key, value) pairs. Map takes one pair of data with
a type in one data domain, and returns a list of pairs in a different domain:
Map(k1,v1) → list(k2,v2)
The Map function is applied in parallel to every pair in the input dataset. This produces a list of
pairs for each call. After that, the MapReduce framework collects all pairs with the same key
from all lists and groups them together, creating one group for each key. The Reduce function is
then applied in parallel to each group, which in turn produces a collection of values in the same domain:
Reduce(k2, list(v2)) → list(v3)
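The manual does not show the clustering code at this point. As an illustration, one common clustering algorithm implemented on MapReduce is k-means; a minimal sketch of a single k-means iteration is given below. The comma-separated point format and the "centroids" configuration property are assumptions made for illustration; a real job would typically distribute the centroids via HDFS or the DistributedCache.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// One k-means iteration: the mapper assigns each point to its nearest centroid,
// the reducer averages the points of each cluster to produce the new centroid.
public class KMeansIteration {

    public static class AssignMapper extends Mapper<Object, Text, IntWritable, Text> {
        private final List<double[]> centroids = new ArrayList<>();

        @Override
        protected void setup(Context context) {
            // Centroids passed as a semicolon-separated list of comma-separated points,
            // e.g. "1.0,2.0;5.0,6.0" (assumed encoding for this sketch).
            for (String c : context.getConfiguration().get("centroids").split(";")) {
                String[] parts = c.split(",");
                double[] point = new double[parts.length];
                for (int i = 0; i < parts.length; i++) point[i] = Double.parseDouble(parts[i]);
                centroids.add(point);
            }
        }

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] parts = value.toString().split(",");
            double[] point = new double[parts.length];
            for (int i = 0; i < parts.length; i++) point[i] = Double.parseDouble(parts[i]);

            // Find the nearest centroid (squared Euclidean distance)
            int best = 0;
            double bestDist = Double.MAX_VALUE;
            for (int c = 0; c < centroids.size(); c++) {
                double dist = 0;
                for (int d = 0; d < point.length; d++) {
                    double diff = point[d] - centroids.get(c)[d];
                    dist += diff * diff;
                }
                if (dist < bestDist) { bestDist = dist; best = c; }
            }
            context.write(new IntWritable(best), value);
        }
    }

    public static class RecomputeReducer extends Reducer<IntWritable, Text, IntWritable, Text> {
        @Override
        public void reduce(IntWritable key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            double[] sum = null;
            long count = 0;
            for (Text v : values) {
                String[] parts = v.toString().split(",");
                if (sum == null) sum = new double[parts.length];
                for (int i = 0; i < parts.length; i++) sum[i] += Double.parseDouble(parts[i]);
                count++;
            }
            StringBuilder centroid = new StringBuilder();
            for (int i = 0; i < sum.length; i++) {
                if (i > 0) centroid.append(",");
                centroid.append(sum[i] / count);
            }
            context.write(key, new Text(centroid.toString())); // new centroid for this cluster
        }
    }
}

A driver would run this job repeatedly, feeding the reducer output back in as the new centroids, until the centroids stop changing.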
PRACTICAL NO – 9
Aim: Implementing Page Rank Algorithm Using Map-Reduce .
Theory:
PageRank is a way of measuring the importance of website pages. PageRank works by
counting the number and quality of links to a page to determine a rough estimate of how
important the website is. The underlying assumption is that more important websites are likely
to receive more links from other websites.
In the general case, the PageRank value for any page u can be expressed as:
PR(u) = Σ PR(v) / L(v), where the sum runs over all pages v in the set Bu
i.e. the PageRank value for a page u depends on the PageRank value of each page v contained in the set Bu (the set containing all pages linking to page u), divided by the number L(v) of links from page v.
Consider a small network of four web pages: A, B, C and D. Links from a page to itself, or multiple outbound links from one single page to another single page, are ignored.
PageRank is initialized to the same value for all pages. In the original form of PageRank, the
sum of PageRank over all pages was the total number of pages on the web at that time, so each
page in this example would have an initial value of 1.
The damping factor (generally set to 0.85) is subtracted from 1 (and in some variations of the algorithm, the result is divided by the number of documents N in the collection), and this term is then added to the product of the damping factor and the sum of the incoming PageRank scores. That is,
PR(u) = (1 - d) + d * Σ PR(v)/L(v), or PR(u) = (1 - d)/N + d * Σ PR(v)/L(v) in the variant that divides by N.
So any page’s PageRank is derived in large part from the PageRanks of other pages.
The damping factor adjusts the derived value downward.
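As a small illustrative calculation (numbers chosen only for illustration): with d = 0.85, N = 4 and an incoming sum Σ PR(v)/L(v) = 0.5, the variant that divides by N gives
PR(u) = (1 - 0.85)/4 + 0.85 * 0.5 = 0.0375 + 0.425 = 0.4625.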
CODE:
import numpy as np
import scipy as sc
import pandas as pd
from fractions import Fraction
def display_format(my_vector, my_decimal):
    # use the built-in float; np.float has been removed in recent NumPy versions
    return np.round((my_vector).astype(float), decimals=my_decimal)
my_dp = Fraction(1,3)
Mat = np.matrix([[0,0,1],
[Fraction(1,2),0,0],
[Fraction(1,2),1,0]])