DSBDA Lab
Anaconda :
Anaconda is a free and open-source distribution of the Python and R programming
languages. The distribution comes with the Python interpreter and various packages related to
machine learning and data science.
Basically, the idea behind Anaconda is to make it easy for people interested in those
fields to install all (or most) of the packages needed with a single installation.
NumPy
NumPy, which stands for Numerical Python, is a library consisting of multidimensional
array objects and a collection of routines for processing those arrays. Using NumPy,
mathematical and logical operations on arrays can be performed.
Besides its obvious scientific uses, NumPy can also be used as an efficient multi-
dimensional container of generic data. Arbitrary datatypes can be defined, which allows
NumPy to integrate with a wide variety of databases seamlessly and speedily.
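As a small illustrative sketch (not part of the original manual), the following shows a few basic NumPy array operations:
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])    # 2-D array with 2 rows and 3 columns
print(a.shape)           # (2, 3)
print(a * 2)             # element-wise multiplication
print(a.sum(axis=0))     # column-wise sums -> [5 7 9]
print(np.dot(a, a.T))    # matrix product with the transpose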
Matplotlib
Matplotlib is an amazing visualization library in Python for 2D plots of arrays. Matplotlib
is a multi-platform data visualization library built on NumPy arrays and designed to
work with the broader SciPy stack. It was introduced by John Hunter in the year 2002.
One of the greatest benefits of visualization is that it gives us visual access to huge
amounts of data in an easily digestible form. Matplotlib offers several plot types such as
line, bar, scatter, and histogram plots.
Matplotlib comes with a wide variety of plots. Plots help us understand trends and
patterns and make correlations visible; they are instrumental for reasoning about
quantitative information. Some sample plots are covered here.
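As a hedged example of the kinds of plots mentioned above (the data here is synthetic):
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 100)
plt.plot(x, np.sin(x), label='sin(x)')     # line plot
plt.scatter(x[::10], np.sin(x[::10]))      # scatter plot of a few of the points
plt.xlabel('x')
plt.ylabel('y')
plt.title('A simple Matplotlib figure')
plt.legend()
plt.show()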
Pandas is a Python package that provides fast, flexible, and expressive data structures
designed to make working with structured (tabular, multidimensional, potentially
heterogeneous) and time series data both easy and intuitive. It aims to be the
fundamental high-level building block for doing practical, real world data analysis in
Python. Additionally, it has the broader goal of becoming the most powerful and flexible
open source data analysis / manipulation tool available in any language. It is already
well on its way toward this goal.
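A minimal sketch of the tabular data structures described above (the column names and values are made up for illustration):
import pandas as pd

df = pd.DataFrame({'name':  ['A', 'B', 'C'],
                   'age':   [21, 22, 23],
                   'marks': [85.0, 90.5, 78.0]})
print(df.head())             # first rows of the table
print(df.describe())         # summary statistics of the numeric columns
print(df[df['age'] > 21])    # boolean filtering of rows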
Code:
Code:
Manual:
t_obs = 4.259, critical t = 2.101
Yes, the observed t falls in the tail. In fact, even at the .001 probability level the t
is still in the tail. Thus, we conclude that we are 99.9 percent sure that
there is a significant difference between the two groups.
Older adults in this sample have significantly higher life satisfaction than
younger adults (t = 4.257, p < .001). As this is a quasi-experiment, we
cannot make any statements concerning the cause of the difference.
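A minimal sketch of how such an independent-samples t-test could be computed with scipy.stats; the two score lists below are hypothetical placeholders, not the actual life-satisfaction data used above:
from scipy import stats

older   = [45, 38, 52, 48, 25, 39, 51, 46, 55, 46]   # hypothetical scores
younger = [34, 22, 15, 27, 37, 41, 24, 19, 26, 36]   # hypothetical scores

t_stat, p_value = stats.ttest_ind(older, younger)
print("t =", round(t_stat, 3), "p =", round(p_value, 4))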
Description:
• Before ANOVA, multiple t-tests were the only option available to compare
population means of two or more groups.
• As the number of groups increases, the number of two-sample t-tests also
increases.
• With an increase in the number of t-tests, the probability of making a Type I
error also increases.
Types of ANOVA
One-way ANOVA:
It is a hypothesis test in which only one categorical variable or single factor is
taken into consideration. With the help of the F-distribution, it enables us to compare
the means of three or more samples. The null hypothesis (H0) is that all population
means are equal, while the alternative hypothesis is that at least one mean differs.
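A minimal sketch of a one-way ANOVA using scipy.stats; the three group score lists are hypothetical:
from scipy import stats

g1 = [85, 86, 88, 75, 78, 94, 98, 79, 71, 80]
g2 = [91, 92, 93, 85, 87, 84, 82, 88, 95, 96]
g3 = [79, 78, 88, 94, 92, 85, 83, 85, 82, 81]

f_stat, p_value = stats.f_oneway(g1, g2, g3)
print("F =", round(f_stat, 3), "p =", round(p_value, 4))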
Two-way ANOVA:
It examines the effect of two independent factors on a dependent variable. It
also studies the inter-relationship between independent variables influencing the
values of the dependent variable, if any.
For example, analyzing the test score of a class based on gender and age. Here test
score is a dependent variable and gender and age are the independent variables.
Two-way ANOVA can be used to find the relationship between these dependent and
independent variables.
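A hedged sketch of how the test-score example above could be analyzed as a two-way ANOVA with statsmodels; the data frame and its column names are assumptions made for illustration:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

df = pd.DataFrame({
    'score':  [88, 75, 90, 65, 70, 85, 95, 60],
    'gender': ['M', 'M', 'F', 'F', 'M', 'F', 'F', 'M'],
    'age':    ['young', 'old', 'young', 'old', 'young', 'old', 'young', 'old'],
})

# main effects of gender and age plus their interaction
model = ols('score ~ C(gender) + C(age) + C(gender):C(age)', data=df).fit()
print(sm.stats.anova_lm(model, typ=2))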
Types of Correlation
Positive Correlation:
It refers to the extent to which the two variables increase or decrease in
parallel (think of this as directly proportional: when one increases the other increases,
and when one decreases the other follows).
Negative Correlation:
It refers to the extent to which one of the two variables increases as the other
decreases (think of this as inversely proportional: when one increases the other
decreases, and vice versa).
The most common correlation in statistics is the Pearson correlation. The full name
is the Pearson Product Moment Correlation (PPMC). In layman's terms, it is a number
between -1 and +1 that represents how strongly the two variables are associated.
To put it simply, it measures the strength of the linear association between two
variables.
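A minimal sketch of computing the Pearson correlation coefficient with scipy.stats; the temperature and sales numbers are hypothetical (loosely echoing the ice-cream task below):
from scipy import stats

temperature = [14, 16, 20, 23, 25, 28, 30]          # hypothetical x values
sales       = [215, 325, 410, 530, 610, 690, 760]   # hypothetical y values

r, p_value = stats.pearsonr(temperature, sales)
print("Pearson r =", round(r, 3))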
The chi-square test is one of the most common ways to examine relationships
between two or more categorical variables. Not surprisingly, it involves calculating a
number, called the chi-square statistic (χ2), which follows a chi-square distribution.
The chi-square test relies on the difference between observed and expected values.
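A minimal sketch of a chi-square test of independence on a hypothetical contingency table:
from scipy.stats import chi2_contingency

observed = [[30, 10, 20],     # e.g. counts for group 1
            [25, 15, 25]]     # e.g. counts for group 2

chi2, p, dof, expected = chi2_contingency(observed)
print("chi2 =", round(chi2, 3), "p =", round(p, 4), "dof =", dof)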
Task:
3. Find the value of the correlation coefficient from the following table
Program Code:
4. The local ice cream shop keeps track of how much ice cream they sell versus
the temperature on that day; here are their figures for the last 12 days. Find the
value of the correlation coefficient from the following table
Program Code:
Program Code:
Program Code:
Conclusion:
From this experiment, the working and procedure of the chi-square test and analysis of
variance are known, along with the scenarios that suit the different ANOVA tests and
the related Python modules. The use of, and the way to calculate, the correlation
coefficient are also known.
Aim:
Implement time series forecasting using ARIMA model with Air Passengers dataset
Description:
Time Series
A time series (TS) is a collection of data points collected at constant time intervals. These are
analyzed to determine the long-term trend to forecast the future or perform some
other form of analysis.
Variations
One of the most important features of a time series is variation. Variations are
patterns in the times series data. A time series that has patterns that repeat over
known and fixed periods of time is said to have seasonality. Seasonality is a general
term for variations that periodically repeat in data. In general, we think of variations
as 4 categories: Seasonal, Cyclic, Trend, and Irregular fluctuations.
Forecasting is the process of making predictions of the future, based on past and
present data. There are several methods for time series forecasting:
• Naive Approach
• Simple Average
• Moving Average
• Weighted moving average
• Simple Exponential Smoothing
• Holt’s Linear Trend Model
• Holt Winters Method
• ARIMA
p is the parameter associated with the auto-regressive aspect of the model; it determines
how many past values are used for forecasting the next value. The value of 'p' is
determined using the PACF plot. For example, if it rained a lot over the past few days,
you forecast that it is likely to rain tomorrow as well.
d is the parameter associated with the integrated part of the model; it specifies the
number of times the differencing operation is performed on the series to make it
stationary. Tests like ADF and KPSS can be used to determine whether the series is
stationary and help in identifying the d value. You can imagine an example of this
as forecasting that the amount of rain tomorrow will be similar to the amount of rain
today if the daily amounts of rain have been similar over the past few days.
q is the parameter associated with the moving-average part of the model; it defines the
number of past forecast errors used to predict future values. The ACF plot is used to
identify the correct 'q' value.
If our model has a seasonal component we use a seasonal ARIMA model (SARIMA).
In that case we have another set of parameters: P,D, and Q which describe the same
associations as p, d, and q, but correspond with the seasonal components of the
model.
Although ARIMA is a very powerful model for forecasting time series data, the data
preparation and parameter tuning processes end up being really time consuming.
Before implementing ARIMA, you need to make the series stationary, and determine
the values of p and q using the plots we discussed above. Auto ARIMA makes this
task simple for us as it eliminates steps 3 to 6 we saw in the previous section.
Below are the steps you should follow for implementing auto ARIMA:
1. Load the data: This step will be the same. Load the data into your notebook
2. Preprocessing data: The input should be univariate, hence drop the other
columns
3. Fit Auto ARIMA: Fit the model on the univariate series
4. Predict values on validation set: Make predictions on the validation set
5. Calculate RMSE: Check the performance of the model using the predicted
values against the actual values
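A minimal sketch of these steps using the pmdarima package (formerly pyramid-arima); the CSV file name and column name are assumptions for the Air Passengers dataset:
import pandas as pd
import pmdarima as pm

# 1-2. load the data and keep only the univariate series
data = pd.read_csv('AirPassengers.csv', index_col='Month', parse_dates=True)
series = data['#Passengers']

# 3. fit Auto ARIMA: stepwise search over (p, d, q) and seasonal (P, D, Q)
stepwise_model = pm.auto_arima(series, seasonal=True, m=12,
                               trace=True, suppress_warnings=True)
print(stepwise_model.summary())
Steps 4 and 5 (prediction on a validation set and RMSE) are sketched further below.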
Here we can see there is an upward trend. We can use statsmodels to perform a
decomposition of this time series. The decomposition of time series is a statistical
task that deconstructs a time series into several components, each representing one
of the underlying categories of patterns. With statsmodels we will be able to see the
trend, seasonal, and residual components of our data.
• Additive model is used when it seems that the trend is more linear, and the
seasonality and trend components seem to be constant over time.
Ex: every year we add 100 units of energy production.
• Multiplicative model is more appropriate when we are increasing (or
decreasing) at a non-linear rate.
Ex: each year we double the amount of energy production.
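A hedged sketch of the decomposition described above with statsmodels (same assumed file and column names as in the Auto ARIMA sketch earlier):
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose

data = pd.read_csv('AirPassengers.csv', index_col='Month', parse_dates=True)
series = data['#Passengers']

# multiplicative decomposition, since the seasonal swings grow with the trend
result = seasonal_decompose(series, model='multiplicative', period=12)
result.plot()     # observed, trend, seasonal and residual panels
plt.show()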
From the plot above we can clearly see the seasonal component of the data, and we
can also see the separated upward trend of the data.
Irregular fluctuations are abrupt changes that are random and unpredictable.
Now that we have analyzed the data, we can clearly see we have a time series with a
seasonal component, so it makes sense to use a Seasonal ARIMA model. To do this,
we will need to choose p, d, q values for the ARIMA, and P,D,Q values for the Seasonal
component.
There are many ways to choose these values statistically, such as looking at auto-
correlation plots, correlation plots, domain experience, etc.
The pyramid-arima library for Python allows us to quickly perform this grid search
and even creates a model object that you can fit to the training data.
We can then fit the stepwise_model object to a training data set. Because this is a
time series forecast, we will “chop off” a portion of our latest data and use that as
the test set. Then we will train on the rest of the data and forecast into the future.
Afterwards we can compare our forecast with the section of data we chopped off.
We can then train the model by simply calling .fit on the stepwise model and passing
in the training data.
Now that the model has been fitted to the training data, we can forecast into the
future using the .predict() method.
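A hedged, self-contained sketch of the train/forecast/compare workflow described above (same assumed file and column names; the 12-month hold-out split is an arbitrary choice):
import numpy as np
import pandas as pd
import pmdarima as pm

data = pd.read_csv('AirPassengers.csv', index_col='Month', parse_dates=True)
series = data['#Passengers']

# "chop off" the latest 12 months as the test set and train on the rest
train, test = series[:-12], series[-12:]

stepwise_model = pm.auto_arima(train, seasonal=True, m=12, suppress_warnings=True)
forecast = stepwise_model.predict(n_periods=12)    # forecast into the held-out period

rmse = np.sqrt(np.mean((test.values - np.asarray(forecast)) ** 2))
print("RMSE:", rmse)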
Conclusion:
From this experiment we came to know the concepts starting from the very basics
of forecasting, AR, MA, ARIMA, SARIMA and finally the SARIMAX model.
Aim:
Install libraries important for Machine learning (ScikitLearn, statsmodels, scipy,
NLTK, etc.) and write brief introduction about those modules.
Description:
Machine Learning and Deep Learning have been on the rise recently with the push
in the AI industry. Machine learning is a subset of Artificial Intelligence (AI) which
provides machines the ability to learn automatically & improve from experience
without being explicitly programmed to do so.
Several programming languages can get you started with AI, ML, and DL, with each
language offering a stronghold on a specific concept. Some of the popular
programming languages for ML and DL are Python, Julia, R, and Java, along with a few
more. But Python seems to be winning the battle as the preferred language of machine
learning. The availability of libraries and open-source tools makes it the ideal choice
for developing ML models.
One of Python’s greatest assets is its extensive set of libraries. Libraries are sets of
routines and functions that are written in each language. A robust set of libraries
can make it easier for developers to perform complex tasks without rewriting many
lines of code.
Scikit-learn :
Scikit-learn builds on two basic libraries of Python, NumPy and SciPy. It adds a set of
algorithms for common machine learning and data mining tasks, including clustering,
regression, and classification. Even tasks like transforming data, feature selection,
and ensemble methods can be implemented in a few lines.
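As a hedged illustration of the "few lines" style described above, a small classifier on scikit-learn's built-in iris dataset might look like this:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))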
Advantages:
• Simple, easy to use, and effective.
• In rapid development, and constantly being improved.
• Wide range of algorithms, including clustering, factor analysis, principal
component analysis, and more.
• Can extract data from images and text.
• Can be used for NLP.
Disadvantages:
• This library is especially suited to supervised learning, and not very well suited to
unsupervised learning or deep learning applications.
Statsmodels :
Statsmodels is another library to implement statistical learning algorithms. However,
it is more popular for its module that helps implement time series models. You can
easily decompose a time-series into its trend component, seasonal component, and
a residual component.
You can also implement popular ETS methods like exponential smoothing, Holt-
Winters method, and models like ARIMA and Seasonal ARIMA (SARIMA). The only
drawback is that this library does not have as much popularity or as thorough
documentation as scikit-learn.
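A minimal, self-contained sketch of the Holt-Winters style of model mentioned above, using a synthetic monthly series:
import numpy as np
import pandas as pd
from statsmodels.tsa.holtwinters import ExponentialSmoothing

# synthetic monthly data with an upward trend and yearly seasonality
idx = pd.date_range('2015-01-01', periods=48, freq='MS')
values = 100 + np.arange(48) + 10 * np.sin(2 * np.pi * np.arange(48) / 12)
series = pd.Series(values, index=idx)

model = ExponentialSmoothing(series, trend='add', seasonal='add',
                             seasonal_periods=12).fit()
print(model.forecast(12))    # forecast the next 12 months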
SciPy :
Advantages:
• Great for image manipulation.
• Provides easy handling of mathematical operations.
• Offers efficient numerical routines, including numerical integration and
optimization.
• Supports signal processing.
Disadvantages:
• There is both a stack and a library named SciPy. The library is part of the stack.
Beginners who do not know the difference may become confused.
NLTK :
NLTK is a framework and suite of libraries for developing both symbolic and
statistical Natural Language Processing (NLP) in Python. It is the standard tool for
NLP in Python.
Advantages:
• The Python library contains graphical examples, as well as sample data.
• Includes a book and cookbook making it easier for beginners to pick up.
• Provides support for different ML operations like classification, parsing, and
tokenization functionalities, etc.
• Acts as a platform for prototyping and building research systems.
• Compatible with several languages.
Disadvantages:
• Understanding the fundamentals of string processing is a prerequisite to using
the NLTK framework. Fortunately, the documentation is adequate to assist in
this pursuit.
• NLTK does sentence tokenization by splitting the text into sentences. This has
a negative impact on the performance.
PyTorch :
Advantages:
• Contains tools and libraries that support Computer Vision, NLP , Deep
Learning, and many other ML programs.
• Developers can perform computations on Tensors with GPU acceleration.
• Helps in creating computational graphs.
• The default “define-by-run” mode is more like traditional programming.
• Uses a lot of pre-trained models and modular parts that are easy to combine.
Disadvantages:
• Because PyTorch is relatively new, there are comparatively fewer online
resources to be found. This makes it harder to learn from scratch, although it
is intuitive.
• PyTorch is not widely considered to be production-ready compared to Google’s
TensorFlow, which is more scalable.
https://pytorch.org/get-started/locally/
Keras :
Keras is a very popular ML library for Python, providing a high-level neural network API
capable of running on top of TensorFlow, CNTK, or Theano.
Advantages:
• Great for experimentation and quick prototyping.
• Portable.
• Offers easy expression of neural networks.
• Great for use in modeling and visualization.
Disadvantages:
• Slow, since it needs to create a computational graph before it can perform
operations.
TensorFlow :
Advantages:
• Supports reinforcement learning and other algorithms.
• Provides computational graph abstraction.
• Offers a very large community.
• Provides TensorBoard, which is a tool for visualizing ML models directly in the
browser.
• Production ready.
• Can be deployed on multiple CPUs and GPUs.
Disadvantages:
• Runs dramatically slower than other frameworks utilizing CPUs/GPUs.
• Steep learning curve compared to PyTorch.
• Computational graphs can be slow.
• Not commercially supported.
Conclusion:
Python is a truly marvelous development tool that not only serves as a general-
purpose programming language but also caters to the specific niches of our projects and
workflows. Loads of libraries and packages expand the capabilities of Python, making it
an all-rounder and a perfect fit for anyone looking to get into developing programs and
algorithms. With some of the modern machine learning and deep learning libraries for
Python discussed briefly above, we can get an idea of what each of these libraries has
to offer and make our pick.
Aim:
• Tokenization, Stemming, Lemmatization and Stop Word removal using NLTK
• Implement Sentiment analysis for the reviews from any website using NLTK
Description:
Natural language processing is one of the fields in programming where the natural
language is processed by the software. This has many applications like sentiment
analysis, language translation, fake news detection, grammatical error detection etc.
The input in natural language processing is text. The data collection for this text
happens from a lot of sources. This requires a lot of cleaning and processing before
the data can be used for analysis.
In the past, only experts could be part of natural language processing projects that
required superior knowledge of mathematics, machine learning, and linguistics.
Now, developers can use ready-made tools that simplify text preprocessing so that
they can concentrate on building machine learning models.
There are many tools and libraries created to solve NLP problems. Some of the
amazing Python Natural Language Processing libraries are:
• Natural Language Toolkit (NLTK)
• TextBlob
• CoreNLP
• Gensim
• spaCy
• polyglot
• scikit–learn
• Pattern
Natural Language Toolkit (NLTK) is a Python library for building programs that work
with natural language. It provides a user-friendly interface to over 50 corpora and
lexical resources, such as the WordNet word repository. The library can
perform different operations such as tokenizing, stemming, classification, parsing,
tagging, and semantic reasoning.
Before processing a natural language, we need to identify the words that constitute
a string of characters. This is important because the meaning of the text could easily
be interpreted by analyzing the words present in the text.
We can use this tokenized form to :
• Count the number of words in the text
• Count the frequency of the word, that is, the number of times a particular word
is present
“He completed the task in spite of all the hurdles faced” is tokenized as
[‘He’, ‘completed’, ‘the’, ‘task’, ‘in’, ‘spite’, ‘of’, ‘all’, ‘the’, ‘hurdles’, ‘faced’]
If we add the ‘in spite of’ in the lexicon of the MWETokenizer,
[‘He’, ‘completed’, ‘the’, ‘task’, ‘in spite of’, ‘all’, ‘the’, ‘hurdles’, ‘faced’]
• The TweetTokenizer addresses the specific things for the tweets like handling
emojis.
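A hedged sketch of the tokenizers discussed above (the nltk.download call is needed only once per environment):
import nltk
from nltk.tokenize import word_tokenize, MWETokenizer

nltk.download('punkt')    # tokenizer models

text = "He completed the task in spite of all the hurdles faced"
print(word_tokenize(text))

# treat 'in spite of' as a single multi-word token
mwe = MWETokenizer([('in', 'spite', 'of')], separator=' ')
print(mwe.tokenize(word_tokenize(text)))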
There are mainly two errors that occur while performing stemming:
• Over-stemming occurs when two words with different meanings are reduced to the
same stem. For example, consider university and universe. Some stemming
algorithms may reduce both words to the stem univers, which would imply
both words mean the same thing, and that is clearly wrong.
• Under-stemming occurs when two words that should be reduced to the same stem
are not. For example, consider the words "data" and "datum." Some
algorithms may reduce these words to dat and datu respectively, which is
obviously wrong. Both must be reduced to the same stem dat.
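A minimal sketch of stemming with NLTK's PorterStemmer; the word list is illustrative:
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["connects", "connected", "connection", "university", "universe"]:
    print(word, "->", stemmer.stem(word))
# 'university' and 'universe' may both reduce to the same stem,
# illustrating the over-stemming problem described above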
To lemmatize, you need to create an instance of the WordNetLemmatizer() and call
the lemmatize() function on a single word. Sometimes, the same word can have
multiple lemmas based on the meaning/context. This can be corrected by providing
the correct part-of-speech (POS) tag as the second argument to lemmatize().
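A minimal sketch of lemmatization with the POS tag passed as the second argument (the downloads are needed only once per environment):
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')
nltk.download('omw-1.4')

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("stripes"))             # treated as a noun by default
print(lemmatizer.lemmatize("stripes", pos="v"))    # treated as a verb
print(lemmatizer.lemmatize("better", pos="a"))     # adjective, lemma differs from the stem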
NLTK supports stop word removal, and we can find the list of stop words in the
corpus module. To remove stop words from a sentence, we divide the text into words
and then remove a word if it exists in the list of stop words provided by NLTK.
The expression "US citizen" will be viewed as "us citizen", and "IT scientist" as
"it scientist". Since both "us" and "it" are normally considered stop words, this would
result in an inaccurate outcome. The strategy for treating stop words can thus be
refined by identifying that "US" and "IT" are not pronouns in the above examples,
through a part-of-speech tagging step.
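A minimal sketch of stop word removal as described above:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('stopwords')
nltk.download('punkt')

text = "This is a simple example showing the removal of stop words"
stop_words = set(stopwords.words('english'))
filtered = [w for w in word_tokenize(text) if w.lower() not in stop_words]
print(filtered)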
The Compound score is a metric that calculates the sum of all the lexicon ratings
which have been normalized between -1(most extreme negative) and +1 (most
extreme positive).
positive sentiment : (compound score >= 0.05)
neutral sentiment : (compound score > -0.05) and (compound score < 0.05)
negative sentiment : (compound score <= -0.05)
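A hedged sketch of scoring a single review with NLTK's VADER analyzer and the thresholds above; the review text is made up:
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download('vader_lexicon')    # needed once

sia = SentimentIntensityAnalyzer()
review = "The product quality is great, but the delivery was very slow."
scores = sia.polarity_scores(review)     # contains 'neg', 'neu', 'pos', 'compound'
print(scores)

if scores['compound'] >= 0.05:
    print("positive sentiment")
elif scores['compound'] <= -0.05:
    print("negative sentiment")
else:
    print("neutral sentiment")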
Natural language preprocessing is the fundamental step in all text processing tasks:
it transforms unstructured text into a form that computers understand. From this
point on, we can use it to generate features and perform other tasks like named
entity extraction, sentiment analysis, and topic detection.
Aim:
Installation of Big data technologies and building a Hadoop cluster
Description:
A Hadoop cluster is a collection of computers, known as nodes, that are
networked together to perform parallel computations on big data
sets. Unlike other computer clusters, Hadoop clusters are designed specifically to
store and analyze mass amounts of structured and unstructured data in a distributed
computing environment. Further distinguishing Hadoop ecosystems from other
computer clusters are their unique structure and architecture. Hadoop clusters
consist of a network of connected master and slave nodes that utilize high
availability, low-cost commodity hardware. The ability to linearly scale and quickly
add or subtract nodes as volume demands makes them well-suited to big data
analytics jobs with data sets highly variable in size.
Task:
The steps given below are to be followed to have Hadoop Multi-Node cluster setup:
1. Installing Java:
Java is the main prerequisite for Hadoop. Firstly, you should verify the
existence of java in your system using “java -version”. If java is not installed in
your system, then follow the steps for installing java:
i. Download java (JDK - X64.tar.gz) by visiting the following link
http://www.oracle.com/technetwork/java/javase/downloads/jdk7-
downloads1880260.html
Then jdk-7u71-linux-x64.tar.gz will be downloaded into your system.
ii. Generally, you will find the downloaded java file in Downloads folder. Verify
it and extract the jdk-7u71-linux-x64.gz file
iii. To make Java available to all users, move it to the location /usr/local.
iv. Set up PATH and JAVA_HOME variables.
5. Installing Hadoop
In the Master server, download and install Hadoop
Conclusion:
From this experiment, we know what a Hadoop cluster is, how it is configured on a
system, and the system requirements for installing Hadoop.
Aim:
Steps for data loading from local machine to Hadoop and Hadoop to local machine
Description:
Starting HDFS
Initially, we must format the configured HDFS file system. Open the namenode (HDFS
server) and execute the following command.
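On a typical installation, the format command is the following (run it only once, as it erases existing HDFS metadata):
$ hadoop namenode -format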
After formatting the HDFS, start the distributed file system. The following command
will start the namenode as well as the data nodes as cluster.
$ start-dfs.sh
$ hadoop fs -command
Ls
After loading the information into the server, we can find the list of files in a directory
or the status of a file using 'ls'. Given below is the syntax of ls, to which we can pass a
directory or a filename as an argument.
Mkdir
This command is used to create a directory in HDFS. Here "sujan" is the directory name.
Cat
In the Linux file system we use the cat command both to read and to create files. But in
HDFS we cannot create file contents this way; we can only load data. So we use the
cat command in HDFS only to read a file. "sujan.txt" is my file name, and the
command will show all the contents of this file on the screen.
In this, we are going to load data from a local machine's disk to HDFS. Assume we
have data in a file called "sujan.txt" on the local system which ought to be saved
in the HDFS file system.
Performing this is as simple as copying data from one folder to another. There are a
couple of ways to copy data from the local machine to HDFS. Follow the steps given
below to insert the required file in the Hadoop file system.
To copy the file to HDFS, we will first create an input directory on HDFS and
then put the local file into it. Here are the commands to do this:
Transfer and store a data file from local systems to the Hadoop file system
using the put command.
The moveFromLocal command can also be used to load data from the local file system
to HDFS, but it removes the file from the local file system.
We can validate that files have been copied to correct folders by listing the files:
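As a hedged illustration of the copy-in workflow just described (the directory and file names simply follow the sujan.txt example used in this manual and can be changed):
$ hadoop fs -mkdir -p /user/sujan/input
$ hadoop fs -put /home/sujan/sujan.txt /user/sujan/input
$ hadoop fs -ls /user/sujan/input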
In this, we are going to export/copy data from HDFS to the local machine.
Performing this is as simple as copying data from one folder to the other. There are
a couple of ways in which you can export data from HDFS to the local machine. Given
below is a simple demonstration for retrieving the required file from the Hadoop file
system.
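As a hedged illustration (same assumed paths as above), either of the following retrieves the file back to the local machine:
$ hadoop fs -get /user/sujan/input/sujan.txt /home/sujan/
$ hadoop fs -copyToLocal /user/sujan/input/sujan.txt /home/sujan/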
You can shut down the HDFS by using the following command.
$ stop-dfs.sh
Conclusion:
The put command is similar to copyFromLocal, although put is slightly more general:
it can copy multiple files into HDFS and can read input from stdin. copyFromLocal
returns 0 on success and -1 on error.
The get Hadoop shell command can be used in place of the copyToLocal command.
At this time, they share the same implementation. The copyToLocal command does
a Cyclic Redundancy Check (CRC) to verify that the data copied was unchanged. A
failed copy can be forced using the optional -ignorecrc argument. The file and its
CRC can be copied using the optional -crc argument.
Aim:
Prepare a document for the Map Reduce concept
Description:
Hadoop MapReduce is a software framework for easily writing applications which
process vast amounts of data (multi-terabyte datasets) in-parallel on large clusters
(thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.
A MapReduce job usually splits the input dataset into independent chunks which are
processed by the map tasks in a completely parallel manner. The framework sorts
the outputs of the maps, which are then input to the reduce tasks. Typically, both
the input and the output of the job are stored in a filesystem. The framework takes
care of scheduling tasks, monitoring them and re-executes the failed tasks.
The MapReduce framework operates exclusively on <key, value> pairs, that is, the
framework views the input to the job as a set of <key, value> pairs and produces a
set of <key, value> pairs as the output of the job, conceivably of different types.
1. Mapper Class
The first stage in data processing using MapReduce is the Mapper class. Here, the
RecordReader processes each input record and generates the respective key-value
pair. Hadoop stores this intermediate mapper output on the local disk.
1. Input Split
It is the logical representation of data. It represents a block of work that
contains a single map task in the MapReduce Program.
2. RecordReader
It interacts with the Input split and converts the obtained data in the form
of Key-Value Pairs.
2. Reducer Class
The Intermediate output generated from the mapper is fed to the reducer which
processes it and generates the final output which is then saved in the HDFS.
3. Driver Class
The major component in a MapReduce job is the Driver class. It is responsible for
setting up a MapReduce job to run in Hadoop. We specify the names of the Mapper
and Reducer classes along with the data types and their respective job names.
Now, suppose, we have to perform a word count on the sample.txt using MapReduce.
So, we will be finding the unique words and the number of occurrences of those
unique words.
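As a hedged sketch (one of several possible implementations), the word-count logic can be written as a Python Hadoop Streaming mapper and reducer; the file names mapper.py and reducer.py are illustrative:
# mapper.py - emits <word, 1> for every word in the input
import sys
for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")

# reducer.py - sums the counts for each word (input arrives sorted by key)
import sys
current_word, count = None, 0
for line in sys.stdin:
    word, value = line.rstrip("\n").split("\t")
    if word == current_word:
        count += int(value)
    else:
        if current_word is not None:
            print(f"{current_word}\t{count}")
        current_word, count = word, int(value)
if current_word is not None:
    print(f"{current_word}\t{count}")
These scripts would typically be submitted through the Hadoop Streaming jar, passing them as the -mapper and -reducer options along with the -input and -output paths; the exact jar location varies by installation.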
MapReduce Architecture
• Once the job is complete, the map output can be thrown away. So, storing it in
HDFS with replication becomes overkill.
• In the event of node failure, before the map output is consumed by the reduce
task, Hadoop reruns the map task on another node and re-creates the map
output.
• Reduce task does not work on the concept of data locality. An output of every
map task is fed to the reduce task. Map output is transferred to the machine
where reduce task is running.
• On this machine, the output is merged and then passed to the user-defined
reduce function.
• Unlike the map output, reduce output is stored in HDFS (the first replica is stored
on the local node and other replicas are stored on off-rack nodes). So, writing the
reduce output does consume network bandwidth, but only as much as a normal
HDFS write pipeline consumes.
The complete execution process (execution of both Map and Reduce tasks) is
controlled by two types of entities:
1. Jobtracker: Acts like a master (responsible for complete execution of
submitted job)
2. Multiple Task Trackers: Acts like slaves, each of them performing the job
• A job is divided into multiple tasks which are then run onto multiple data nodes
in a cluster.
• It is the responsibility of job tracker to coordinate the activity by scheduling tasks
to run on different data nodes.
• Execution of an individual task is then looked after by the task tracker, which resides
on every data node executing part of the job.
• The task tracker's responsibility is to send the progress report to the job tracker.
• In addition, the task tracker periodically sends a 'heartbeat' signal to the Jobtracker to
notify it of the current state of the system.
• Thus, job tracker keeps track of the overall progress of each job. In the event of
task failure, the job tracker can reschedule it on a different task tracker.
TaskTracker –
1. TaskTracker runs on DataNode. Mostly on all DataNodes.
2. TaskTracker is replaced by Node Manager in MRv2.
3. Mapper and Reducer tasks are executed on DataNodes administered by
TaskTrackers.
4. TaskTrackers will be assigned Mapper and Reducer tasks to execute by
JobTracker.
5. TaskTracker will be in constant communication with the JobTracker signalling
the progress of the task in execution.
6. TaskTracker failure is not considered fatal. When a TaskTracker becomes
unresponsive, JobTracker will assign the task executed by the TaskTracker to
another node.
Conclusion:
MapReduce is a Hadoop framework that helps us process vast volumes of data across
multiple nodes. From this experiment, we learned what MapReduce is, the
essential features of MapReduce, how the MapReduce algorithm works, and its benefits:
• Scalability
Businesses can process the petabytes of data stored in the Hadoop Distributed
File System (HDFS).
• Flexibility
Hadoop enables the easier access to multiple sources of data and multiple
types of data.
• Speed
With the parallel processing and minimal data movement, Hadoop offers fast
processing of massive amounts of data.
• Simple
Developers can write the code in a choice of languages, including Java, C++
and Python.
Week – 11
Aim:
Documentation for developing and handling a NOSQL database with HBase
Description:
Since the 1970s we have been using RDBMS, but that is not enough to handle large
amounts of data. We have witnessed an explosion of data, and it has always been
challenging to store and retrieve it. The rise of ever-growing data gave us NoSQL
databases, and HBase is one of the NoSQL databases built on top of Hadoop. HBase is
suitable for applications which require real-time read/write access to huge datasets.
Big data has proven itself a huge attraction point for many researchers and
academics across the world. Due to vast usage of social media applications data is
growing rapidly nowadays. This data is often formless, disorganized, and
unpredictable. Storing and analyzing this data is not an easy task. However, NoSQL
databases are the databases by which we can handle and extract this data with ease.
There are many NoSQL databases available for the data scientists.
NoSQL Types
The three main types of NoSQL are:
1. Column Database (column-oriented)
A NoSQL database that stores data in tables and manages them by columns
instead of rows, also called a columnar database management system (CDBMS).
It converts columns into data files. HBase is of this kind.
One benefit is that it can compress data, allowing fast aggregate operations such as
minimum, maximum, sum, count, and average.
Columns can be auto-indexed, using less disk space than a relational database
system containing the same data.
HBase is short for Hadoop Database, and it runs on top of Hadoop as a scalable big
data store. Being the Hadoop database means it has the advantages of Hadoop's
distributed file system and MapReduce model by default. It is referred to as a columnar
database because, in contrast to a relational database which stores data in rows,
HBase stores data in columns.
HBase is modeled after Google's BigTable, so it provides distributed data storage
capabilities like BigTable on top of HDFS. HBase allows access to sparse data. Sparse
data is defined as small but valuable data within the gigantic volume of unstructured
data used for big data analytics. HBase can report failures automatically, replicate data
throughout clusters, and provide consistent reads and writes. There are several
advantages of HBase over relational databases: the latter are hard to scale and must
have a fixed schema.
There is a short history behind HBase. Back in 2004, when Google was facing the
problem of how to provide efficient search results, it developed the BigTable
technology. In 2007, Mike Cafarella released code for an open-source implementation
of BigTable, which became known as HBase. At the start, the initial model of HBase was
developed as a contributing data model for Hadoop. Between 2008 and 2010, HBase
became a top-level project under Apache Hadoop.
BigTable
BigTable is a distributed storage system developed by Google, designed to store
gigantic amounts of data across several servers. Many projects such as Google Earth,
Google Analytics, personalized search, web indexing, and financial data are stored in
BigTable. All these applications have various demands regarding size and latency,
but BigTable provides a flexible and high-performance solution for them. It provides
a data model that supports dynamic control over data format rather than a relational
data model. The BigTable API delivers functionality for creating and deleting
tables and columns, changing metadata and cluster settings, and access control. All
these capabilities have been adopted by HBase; however, there are many features that
differentiate the two technologies.
• Row Key: Each row has a unique row key; the row key does not have a data
type and is treated internally as a byte array.
• Column Family: Data inside a row is organized into column families; each row
has the same set of column families, but across rows, the same column
families do not need the same column qualifiers. Under-the-hood, HBase
stores column families in their own data files, so they need to be defined
upfront, and changes to column families are difficult to make.
• Column Qualifier: Column families define actual columns, which are called
column qualifiers. You can think of column qualifiers as the columns
themselves.
• Version: Each column can have a configurable number of versions, and you
can access the data for a specific version of a column qualifier.
An individual row is accessible through its row key and is composed of one or more
column families. Each column family has one or more column qualifiers (called
“column” in above figure ) and each column can have one or more versions. To access
an individual piece of data, you need to know its row key, column family, column
qualifier, and version.
When designing an HBase data model, it is helpful to think about how the data is
going to be accessed. You can access HBase data in two ways:
• Through their row key or via a table scan for a range of row keys
• In a batch manner using map-reduce
This dual approach to data access is something that makes HBase particularly
powerful. Typically, storing data in Hadoop means that it is good for offline or batch
analysis (and it is very, very good at batch analysis) but not necessarily for real-time
access. HBase addresses this by being both a key/value store for real-time analysis
and supporting map-reduce for batch analysis.
HBase has a distributed and huge environment where HMaster alone is not sufficient
to manage everything. So, what helps HMaster manage this huge environment? That is
where ZooKeeper comes into the picture. After understanding how HMaster manages
the HBase environment, we will understand how ZooKeeper helps HMaster in
managing it.
Zookeeper also maintains the .META Server’s path, which helps any client
in searching for any region. The Client first must check with .META Server in which
Region Server a region belongs, and it gets the path of that Region Server. The
.META file maintains the table in form of keys and values. Key represents the start
key of the region and its id whereas the value contains the path of the Region Server.
• WAL: As we can conclude from the above image, Write Ahead Log (WAL) is a
file attached to every Region Server inside the distributed environment. The
WAL stores the new data that has not been persisted or committed to the
permanent storage. It is used in case of failure to recover the data sets.
• Block Cache: From the above image, it is clearly visible that Block Cache
resides in the top of Region Server. It stores the frequently read data in the
memory. If the data in BlockCache is least recently used, then that data is
removed from BlockCache.
• MemStore: It is the write cache. It stores all the incoming data before
committing it to the disk or permanent memory. There is one MemStore for
each column family in a region. As you can see in the image, there are multiple
MemStores for a region because each region contains multiple column
families. The data is sorted in lexicographical order before committing it to
the disk.
• HFile: From the above figure you can see that HFiles are stored on HDFS. Thus, they
store the actual cells on the disk. The MemStore commits its data to an HFile when
the size of the MemStore exceeds its threshold.
1. Whenever the client has a write request, the client writes the data to the WAL
(Write Ahead Log).
• The edits are then appended at the end of the WAL file.
• This WAL file is maintained in every Region Server and Region Server uses
it to recover data which is not committed to the disk.
2. Once data is written to the WAL, then it is copied to the MemStore.
3. Once the data is placed in MemStore, then client receives the acknowledgment.
4. When the MemStore reaches threshold, it dumps or commits the data into a HFile.
General Commands
• status - Provides the status of HBase, for example, the number of servers.
• version - Provides the version of HBase being used.
• table_help - Provides help for table-reference commands.
• whoami - Provides information about the user.
Conclusion:
From this experiment, we came to know about the architecture and structure of
HBase, its current usage and its limitations. The world of data is growing at a rapid
pace and we will need some better solutions for handling and analyzing this data in
future.
Aim:
Documentation for loading data from RDBMS to HDFS by using SQOOP.
Description:
We know that Apache Flume is a data ingestion tool for unstructured sources, but
organizations store their operational data in relational databases. So, there was a
need for a tool which can import and export data from relational databases.
Therefore, Apache Sqoop was born. Sqoop can easily integrate with Hadoop and
dump structured data from relational databases onto HDFS, complementing the power
of Hadoop.
Initially, Sqoop was developed and maintained by Cloudera. Later, on 23 July 2011,
it was incubated by Apache. In April 2012, the Sqoop project was promoted as
Apache’s top-level project.
Generally, applications interact with the relational database using RDBMS, and thus
this makes relational databases one of the most important sources that generate Big
Data. Such data is stored in RDB Servers in the relational structure. Here, Apache
Sqoop plays an important role in the Hadoop ecosystem, providing feasible
interaction between the relational database server and HDFS.
So, Apache Sqoop is a tool in Hadoop ecosystem which is designed to transfer data
between HDFS (Hadoop storage) and relational database servers like MySQL, Oracle
RDB, SQLite, Teradata, Netezza, Postgres etc. Apache Sqoop imports data from
relational databases to HDFS, and exports data from HDFS to relational databases. It
efficiently transfers bulk data between Hadoop and external data stores such as
enterprise data warehouses, relational databases, etc.
This is how Sqoop got its name – “SQL to Hadoop & Hadoop to SQL”
So, for this analysis, the data residing in the relational database management
systems need to be transferred to HDFS. The task of writing MapReduce code for
importing and exporting data from the relational database to HDFS is uninteresting
& tedious. This is where Apache Sqoop comes to rescue and removes their pain. It
automates the process of importing & exporting the data.
Sqoop makes the life of developers easy by providing a CLI for importing and
exporting data. They just have to provide basic information like database
authentication, source, destination, operations, etc. It takes care of the remaining
part.
Sqoop internally converts the command into MapReduce tasks, which are then
executed over HDFS. It uses YARN framework to import and export the data, which
provides fault tolerance on top of parallelism.
The import tool imports individual tables from RDBMS to HDFS. Each row in a table
is treated as a record in HDFS.
When we submit a Sqoop command, our main task gets divided into subtasks which
are handled by individual Map tasks internally. Each Map task imports part of the data
into the Hadoop ecosystem; collectively, all the Map tasks import the whole data set.
The export tool exports a set of files from HDFS back to an RDBMS. The files given
as input to Sqoop contain records, which are called rows in the table.
When we submit our Job, it is mapped into Map Tasks which brings the chunk of
data from HDFS. These chunks are exported to a structured data destination.
Combining all these exported chunks of data, we receive the whole data at the
destination, which in most of the cases is an RDBMS (MYSQL/Oracle/SQL Server).
The map job launches multiple mappers depending on the number defined by the user.
For a Sqoop import, each mapper task is assigned a part of the data to be imported.
Sqoop distributes the input data equally among the mappers to get high
performance. Each mapper then creates a connection with the database using JDBC,
fetches the part of the data assigned by Sqoop, and writes it into HDFS, Hive, or
HBase based on the arguments provided in the CLI.
Flume vs Sqoop
The major difference between Flume and Sqoop is that:
• Flume only ingests unstructured data or semi-structured data into HDFS.
• While Sqoop can import as well as export structured data from RDBMS or
Enterprise data warehouses to HDFS or vice versa.
Basic Nature: Sqoop works well with any RDBMS which has JDBC (Java Database
Connectivity), like Oracle, MySQL, Teradata, etc. Flume works well for streaming data
sources which are continuously generating data, such as logs, JMS, directories, crash
reports, etc.
Data Flow: Sqoop is specifically used for parallel data transfer; for this reason, the
output could be in multiple files. Flume is used for collecting and aggregating data
because of its distributed nature.
Event Driven: Sqoop is not driven by events. Flume is completely event driven.
Usage: Sqoop is used for copying data faster and then using it to generate analytical
outcomes. Flume is used to pull data when companies want to analyze patterns, root
causes, or sentiment using logs and social media.
USE db;
sqoop import
--connect jdbc:mysql://localhost/db
--username sujan
--password 12345
--table student
After the code is executed, we can check the HDFS Web UI, where the imported data can be seen.
sqoop import
--connect jdbc:mysql://localhost/db
--username sujan
--password 12345
--table student
--m 1
--target-dir mydir1
we can control the number of mappers independently from the number of files
present in the directory. Export performance depends on the degree of
parallelism. By default, Sqoop will use four tasks in parallel for the export
process. This may not be optimal; we will need to experiment with our own
setup. Additional tasks may offer better concurrency, but if the database is
already bottlenecked on updating indices, invoking triggers, and so on, then
additional load may decrease performance.
sqoop import
--connect jdbc:mysql://localhost/db
--username sujan
--password 12345
--table student
--m 1
--where "id > 175"
--target-dir mydir2
We should specify append mode when importing a table where new rows are
continually being added with increasing row id values. We specify the column
containing the row's id with --check-column. Sqoop imports rows where the
check column has a value greater than the one specified with --last-value.
When running a subsequent import, you should specify --last-value in this way
to ensure you import only the new or updated data. This is handled
automatically by creating an incremental import as a saved job, which is the
preferred mechanism for performing a recurring incremental import.
First, we insert a new row, which will then be reflected in HDFS after the incremental import.
INSERT INTO student values("Supreet","V",179);
sqoop import
--connect jdbc:mysql://localhost/db
--username sujan
--password 12345
--table student
--target-dir mydir2
--incremental append
--check-column id
--last-value 1
sqoop import
--connect jdbc:mysql://localhost/db
--username sujan
--password 12345
Sqoop — List Databases
You can list the databases present in the relational database server using Sqoop. The
Sqoop list-databases tool parses and executes the 'SHOW DATABASES' query against
the database server. The command for listing databases is:
sqoop list-databases
--connect jdbc:mysql://localhost/
--username sujan
--password 12345
sqoop list-tables
--connect jdbc:mysql://localhost/db
--username sujan
--password 12345
sqoop export
--connect jdbc:mysql://localhost/db
--username sujan
--password 12345
--table student2
--export-dir /user/sujan/db
Sqoop — Codegen
In an object-oriented application, every database table has a Data Access Object (DAO)
class that contains 'getter' and 'setter' methods to initialize objects. Codegen generates
the DAO class automatically, in Java, based on the table schema structure.
sqoop codegen
--connect jdbc:mysql://localhost/db
--username sujan
--password 12345
--table student
Conclusion:
Apache Sqoop supports bi-directional movement of data between any RDBMS and
HDFS, Hive, HBase, etc., but only for structured data. Sqoop automates most of this
process, relying on the database to describe the schema of the data to be imported.
Sqoop uses MapReduce to import and export the data, which provides parallel
operation as well as fault tolerance.
From this experiment, we came to know about the Sqoop, features of Sqoop, Sqoop
architecture and its working, flume vs Sqoop, commands of Sqoop, and import and
exporting data between RDBMS and HDFS using Sqoop.
Sqoop acts as a secure and economical transport layer that can be used efficiently and
effectively almost everywhere. And because it is fast, many organizations want to run
this technology at their own sites to get better results.