DSBDA Lab


Week-1

Anaconda :
Anaconda is a free and open-source distribution of the programming languages Python and R. The distribution comes with the Python interpreter and various packages related to machine learning and data science.

Basically, the idea behind Anaconda is to make it easy for people interested in those
fields to install all (or most) of the packages needed with a single installation.

What is included with Anaconda?


• An open-source package and environment management system called
Conda, which makes it easy to install/update packages and create/load
environments.
• Machine learning libraries like TensorFlow, scikit-learn and Theano.
• Data science libraries like pandas, NumPy and Dask.
• Visualization libraries like Bokeh, Datashader, matplotlib and Holoviews.
• Jupyter Notebook, a shareable notebook that combines live code,
visualizations, and text.



Installing Anaconda on Windows :
• Download the Anaconda installer.
• Double click the installer to launch and Click Next.
• Read the licensing terms and click “I Agree”.
• Select an install for “Just Me” unless you are installing for all users
(which requires Windows Administrator privileges) and click Next.
• Select a destination folder to install Anaconda and click the Next button

• Choose whether to add Anaconda to your PATH environment variable. We recommend not adding Anaconda to the PATH environment variable, since this can interfere with other software. Instead, use Anaconda software by opening Anaconda Navigator or the Anaconda Prompt from the Start Menu.



• Choose whether to register Anaconda as your default Python. Unless you plan
on installing and running multiple versions of Anaconda or multiple versions
of Python, accept the default and leave this box checked.
• Click the Install button. If you want to watch the packages Anaconda is
installing, click Show details.
• Click the Next button.
• To install PyCharm for Anaconda, click on the link to
https://www.anaconda.com/pycharm.

• Or to install Anaconda without PyCharm, click the Next button


• After a successful installation you will see the “Thanks for installing Anaconda”
dialog box.



• If you wish to read more about Anaconda Cloud and how to get started with
Anaconda, check the boxes “Learn more about Anaconda Cloud” and “Learn
how to get started with Anaconda”. Click the Finish button.

Installation steps for NumPy, SciPy, matplotlib and pandas:

• Open command prompt or anaconda prompt and type


“pip install numpy scipy matplotlib pandas” to install the libraries.
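A quick way to confirm that the installation succeeded is to import each library and print its version; a minimal sanity check (assuming the packages installed without errors) might look like this:

# Minimal sanity check after installation: import each library and print its version
import numpy as np
import scipy
import matplotlib
import pandas as pd

print("NumPy:", np.__version__)
print("SciPy:", scipy.__version__)
print("Matplotlib:", matplotlib.__version__)
print("Pandas:", pd.__version__)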



Week-2

Programs using python modules - NumPy, Matplotlib, Pandas

NumPy
NumPy, which stands for Numerical Python, is a library consisting of multidimensional
array objects and a collection of routines for processing those arrays. Using NumPy,
mathematical and logical operations on arrays can be performed.

It provides :

• a powerful N-dimensional array object


• sophisticated (broadcasting) functions
• tools for integrating C/C++ and Fortran code
• useful linear algebra, Fourier transform, and random number capabilities

Using NumPy, a developer can perform the following operations :

• Mathematical and logical operations on arrays.


• Fourier transforms and routines for shape manipulation.
• Operations related to linear algebra. NumPy has in-built functions for linear
algebra and random number generation.

Besides its obvious scientific uses, NumPy can also be used as an efficient multi-
dimensional container of generic data. Arbitrary datatypes can be defined. This allows
NumPy to integrate with a wide variety of databases seamlessly and speedily.

1. Create and print One-Dimensional Array using NumPy.

2. Create and print a Two-Dimensional Array using NumPy.

3. Product of a Two-Dimensional Array using NumPy.



4. Indexing, Slicing, Iterating, and Reshaping of an Array.
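The following is a minimal sketch covering these four NumPy tasks; the array values are arbitrary and chosen only for illustration.

import numpy as np

# 1. One-dimensional array
a = np.array([1, 2, 3, 4, 5])
print(a)

# 2. Two-dimensional array
b = np.array([[1, 2], [3, 4]])
print(b)

# 3. Product of two 2-D arrays (matrix product)
c = np.array([[5, 6], [7, 8]])
print(np.dot(b, c))

# 4. Indexing, slicing, iterating and reshaping
print(a[0], a[1:4])                  # indexing and slicing
for row in b:                        # iterating over rows
    print(row)
print(np.arange(6).reshape(2, 3))    # reshaping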

Matplotlib
Matplotlib is an amazing visualization library in Python for 2D plots of arrays. Matplotlib
is a multi-platform data visualization library built on NumPy arrays and designed to
work with the broader SciPy stack. It was introduced by John Hunter in the year 2002.

One of the greatest benefits of visualization is that it allows us visual access to huge
amounts of data in easily digestible visuals. Matplotlib consists of several plots like line,
bar, scatter, histogram etc.

Matplotlib comes with a wide variety of plots. Plots help to understand trends and patterns and to make correlations. They are typically instrumental for reasoning about quantitative information. Some of the sample plots are covered here.



1. Make line plot using Matplotlib.

2. Make Histogram using Matplotlib.

3. Make Scatterplot using Matplotlib.



4. Make 3D plot using Matplotlib.

5. Image plot using Matplotlib.
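The following is a minimal sketch of the plot types listed above; the data are randomly generated and used only for illustration.

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 50)

plt.plot(x, np.sin(x))                               # 1. line plot
plt.show()

plt.hist(np.random.randn(1000), bins=30)             # 2. histogram
plt.show()

plt.scatter(np.random.rand(50), np.random.rand(50))  # 3. scatter plot
plt.show()

ax = plt.axes(projection='3d')                       # 4. 3D plot
ax.plot3D(x, np.sin(x), np.cos(x))
plt.show()

plt.imshow(np.random.rand(10, 10))                   # 5. image plot
plt.show()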



Pandas

Pandas is a Python package that provides fast, flexible, and expressive data structures
designed to make working with structured (tabular, multidimensional, potentially
heterogeneous) and time series data both easy and intuitive. It aims to be the
fundamental high-level building block for doing practical, real world data analysis in
Python. Additionally, it has the broader goal of becoming the most powerful and flexible
open source data analysis / manipulation tool available in any language. It is already
well on its way toward this goal.

Pandas is well suited for many kinds of data:


• Tabular data with heterogeneously typed columns, as in an SQL table or an Excel spreadsheet.
• Ordered and unordered (not necessarily fixed frequency) time series data.
• Arbitrary matrix data (homogeneously typed or heterogeneous) with row and
column labels.
• Any other form of observational / statistical data sets. The data actually need not
be labeled at all to be placed into a pandas data structure.

The two primary data structures of pandas


• Series (1-dimensional) and
• Data Frame (2-dimensional)

1. Implement Data Frame using Pandas

2. Implement Series using Pandas
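A minimal sketch of both structures, using arbitrary sample values:

import pandas as pd

# 1. DataFrame from a dictionary (sample data chosen for illustration)
df = pd.DataFrame({"Name": ["A", "B", "C"], "Marks": [85, 90, 78]})
print(df)

# 2. Series from a list with a custom index
s = pd.Series([10, 20, 30, 40], index=["a", "b", "c", "d"])
print(s)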



Week-3
1. Calculate a paired t test manually and execute with python programming.

Code:
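The original data table for this exercise is not reproduced here, so the sketch below uses hypothetical before/after scores only to show how a paired t test can be run with scipy.stats.ttest_rel:

from scipy import stats

# Hypothetical paired observations (e.g., scores before and after a treatment)
before = [72, 68, 75, 80, 65, 70, 78, 74]
after  = [75, 70, 78, 83, 66, 74, 80, 76]

t_stat, p_value = stats.ttest_rel(before, after)
print("t =", t_stat, "p =", p_value)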



Manual:

2. Who has a better sense of humor - women or men?

Researchers asked 10 men and 10 women in their study to categorize 30 cartoons as either “funny” or “not funny”. Each participant received a score that represents his or her percentage of cartoons found to be “funny”. Below are fictional data for 9 people; these fictional data have approximately the same means as were reported in the original study (Azim, Mobbs, Jo, Menon, and Reiss, 2005).
Percentage of cartoons labeled as “funny”:
Women: 84, 97, 58, 90
Men: 88, 90, 52, 97, 86
How can we conduct an independent-samples t test for this scenario, using a two-tailed test and a significance level of 0.05?



Women   Men
 84      88
 97      90
 58      52
 90      97
         86

Code:
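A sketch of the test using the data above with scipy.stats.ttest_ind (pooled variance, two-tailed by default):

from scipy import stats

women = [84, 97, 58, 90]
men   = [88, 90, 52, 97, 86]

t_stat, p_value = stats.ttest_ind(women, men)   # two-tailed, pooled variance
print("t =", t_stat, "p =", p_value)
if p_value < 0.05:
    print("Reject the null hypothesis")
else:
    print("Fail to reject the null hypothesis")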

Manual:



3. A research study was conducted to examine the differences between older and
younger adults on perceived life satisfaction. A pilot study was conducted to
examine this hypothesis. Ten older adults (over the age of 70) and ten younger
adults (between 20 and 30) were given a life satisfaction test (known to have
high reliability and validity). Scores on the measure range from 0 to 60 with high
scores indicative of high life satisfaction, low scores indicative of low life
satisfaction. The data are presented below. Compute the appropriate t-test.

Older Adults Younger Adults


45 34
38 22
52 15
48 27
25 37
39 41
51 24
46 19
55 26
46 36
Manual:



Code:
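A sketch of the computation with scipy.stats.ttest_ind, using the data given above:

from scipy import stats

older   = [45, 38, 52, 48, 25, 39, 51, 46, 55, 46]
younger = [34, 22, 15, 27, 37, 41, 24, 19, 26, 36]

t_stat, p_value = stats.ttest_ind(older, younger)
print("t =", t_stat, "p =", p_value)    # t is approximately 4.257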

1. What is your computed answer?

4.257

2. What would be the null hypothesis in this study?

The null hypothesis would be that there are no significant differences


between younger and older adults on life satisfaction.

3. What would be the alternate hypothesis?

The alternate hypothesis would be that life satisfaction scores of older


and younger adults are different.

4. What is your tcrit?

2.101

5. Is there a significant difference between the two groups?

Yes, the observed t is in the tail. In fact, even if one uses a stricter probability level such as 0.001, the t is still in the tail. Thus, we conclude that we are 99.9 percent sure that there is a significant difference between the two groups.

6. Interpret your answer.

Older adults in this sample have significantly higher life satisfaction than
younger adults (t = 4.257, p < .001). As this is a quasi-experiment, we
cannot make any statements concerning the cause of the difference.



Week – 4
Aim: To solve questions based on ANOVA and chi Square

Description:

Analysis of variance (ANOVA)

Analysis of Variance, commonly known as ANOVA, is an extremely important tool for the analysis of data (both One-Way and Two-Way ANOVA are used). It is a statistical method to compare the population means of two or more groups by analyzing variance. The variance would differ only when the means are significantly different. It is a generalization of the t-test to more than two groups, but is more conservative (it results in fewer Type I errors) and hence is suited to a wide range of practical applications.

• Before ANOVA, multiple t-tests were the only option available to compare the
population means of two or more groups.
• As the number of groups increases, the number of two-sample t-tests also
increases.
• With an increase in the number of t-tests, the probability of making a Type I
error also increases.

Types of ANOVA

One-way ANOVA:
It is a hypothesis test in which only one categorical variable or single factor is taken into consideration. With the help of the F-distribution, it enables us to compare the means of three or more samples. The null hypothesis (H0) is that all population means are equal, while the alternative hypothesis is that at least one mean is different.

Two-way ANOVA:
It examines the effect of two independent factors on a dependent variable. It
also studies the inter-relationship between independent variables influencing the
values of the dependent variable, if any.

For example, analyzing the test score of a class based on gender and age. Here test
score is a dependent variable and gender and age are the independent variables.
Two-way ANOVA can be used to find the relationship between these dependent and
independent variables.
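As a minimal illustration of a one-way ANOVA in Python (the group scores below are hypothetical), scipy.stats.f_oneway returns the F statistic and its p-value:

from scipy import stats

# Hypothetical scores of three independent groups
group1 = [23, 25, 29, 31, 27]
group2 = [30, 33, 35, 32, 36]
group3 = [26, 24, 28, 25, 27]

f_stat, p_value = stats.f_oneway(group1, group2, group3)
print("F =", f_stat, "p =", p_value)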



Correlation

Correlation is a statistical measure of how well two variables are related to each other. There can be positive as well as negative correlation. These variables can be input data features which have been used to forecast our target variable.

Types of Correlation

Positive Correlation:
It refers to the extent to which the two variables increase or decrease in parallel (think of this as directly proportional: if one increases the other will increase, and if one decreases the other will follow).

Negative Correlation:
It refers to the extent to which one of the two variables increases as the other decreases (think of this as inversely proportional: if one increases the other will decrease, and if one decreases the other will increase).

The most common correlation in statistics is the Pearson correlation. The full name is the Pearson Product Moment Correlation (PPMC). In layman's terms, it is a number between +1 and -1 which represents how strongly the two variables are associated. Or, to put it in simpler words, it states the measure of the strength of the linear association between two variables.

Basically, the Pearson Product Moment Correlation (PPMC) attempts to draw a line of best fit through the data of the given two variables, and the Pearson correlation coefficient “r” indicates how far away all these data points are from the line of best fit. The value of “r” ranges from +1 to -1, where:
• r = +1/-1 means that all our data points lie exactly on the line of best fit, i.e.,
there is no data point which shows any variation from the line of best fit.
Hence, the stronger the association between the two variables, the closer r will
be to +1/-1.
• r = 0 means that there is no correlation between the two variables.
• Values of r between +1 and -1 indicate that the data points vary around the
line of best fit.
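A minimal sketch of computing Pearson's r in Python; the paired observations below are hypothetical and serve only to show the call:

from scipy import stats

# Hypothetical paired observations of two variables
x = [43, 21, 25, 42, 57, 59]
y = [99, 65, 79, 75, 87, 81]

r, p_value = stats.pearsonr(x, y)
print("Pearson r =", r, "p =", p_value)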



Chi-square (χ2) statistic

It is a test that measures how expectations compare to actual observed data


(or model results). The data used in calculating a chi-square statistic must be
random, raw, mutually exclusive, drawn from independent variables, and drawn
from a large enough sample.

The chi-square test is one of the most common ways to examine relationships between two or more categorical variables. Not surprisingly, it involves calculating a number, called the chi-square statistic (χ2), which follows a chi-square distribution. The chi-square test relies on the difference between observed and expected values.

Our hypotheses will be:

Steps to make a Chi-square test:

• Add marginal frequencies to a contingency table


• Translate joint and marginal frequencies into probabilities
• Estimate the expected probability for each cell
• Calculate x²
• Compare x² with the critical (table) value and decide (a code sketch follows below):
1. x² > table value: reject the null hypothesis
2. x² ≤ table value: fail to reject the null hypothesis
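A minimal sketch of these steps with scipy.stats.chi2_contingency, using a hypothetical 2x3 contingency table (the function returns the chi-square statistic, its p-value, the degrees of freedom, and the expected frequencies):

from scipy import stats

# Hypothetical 2x3 contingency table: rows = gender, columns = voting preference
observed = [[200, 150, 50],
            [250, 300, 50]]

chi2, p_value, dof, expected = stats.chi2_contingency(observed)
print("chi-square =", chi2, "p =", p_value, "dof =", dof)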

Task:

1. Using the following data, perform a one-way analysis of variance using α


= 0.05



Solution:



Program Code:

2. Using the following summary data, perform a one-way analysis of variance


using α = .01



Program Code:

3. Find the value of the correlation coefficient from the following table

Program Code:



Solution:

4. The local ice cream shop keeps track of how much ice cream they sell versus
the temperature on that day; here are their figures for the last 12 days. Find the
value of the correlation coefficient from the following table



Program Code:

5. A public opinion poll surveyed a simple random sample of 1000 voters.


Respondents were classified by gender (male or female) and by voting
preference (Republican, Democrat, or Independent). Results are shown in
the table below. (Solve this problem with Chi square test).

Program Code:

6. A department store, A, has four competitors: B, C, D, and E. Store A hires a


consultant to determine if the percentage of shoppers who prefer each of the
five stores is the same. A survey of 1100 randomly selected shoppers is
conducted, and the results about which one of the stores shoppers prefer are
below. Is there enough evidence using a significance level α = 0.05 to conclude
that the proportions are really the same?



Solution:

Program Code:



7. A doctor believes that the proportions of births in this country on each day
of the week are equal. A simple random sample of 700 births from a recent year
is selected, and the results are below. At a significance level of 0.01, is there
enough evidence to support the doctor’s claim?

Program Code:

Conclusion:

From this experiment, the working and procedure of the chi-square test and analysis of variance are known. Along with that, the scenarios that are suitable for different ANOVA tests and the related Python modules are also known. The use of the correlation coefficient and the way to calculate it are also known.



Week – 5

Aim:
Implement time series forecasting using ARIMA model with Air Passengers dataset

Description:
Time Series
TS is a collection of data points collected at constant time intervals. These are
analyzed to determine the long-term trend to forecast the future or perform some
other form of analysis.

Reasons why a TS problem is different from a regular regression problem:


1. It is time dependent. So, the basic assumption of a linear regression model
that the observations are independent does not hold in this case.
2. Along with an increasing or decreasing trend, most TS have some form
of seasonality trends, i.e. variations specific to a particular time frame. For
example, if you see the sales of a woolen jacket over time, you will invariably
find higher sales in winter seasons.

Variations
One of the most important features of a time series is variation. Variations are
patterns in the time series data. A time series that has patterns that repeat over
known and fixed periods of time is said to have seasonality. Seasonality is a general
term for variations that periodically repeat in data. In general, we think of variations
as 4 categories: Seasonal, Cyclic, Trend, and Irregular fluctuations.

• Seasonality - seasonal variances.


Ex: Ice cream sales increases in Summer only.
• Cyclicity - behavior that repeats itself after days, months, years etc.
Ex: The daily variation in temperature.
• Trend - Upward & downward movement of data with time over a large period.
Ex: Appreciation of Dollar vs rupee.
• Noise or Irregularity - Spikes & troughs at random intervals

Forecasting is the process of making predictions of the future, based on past and
present data. There are several methods for time series forecasting
• Naive Approach
• Simple Average
• Moving Average
• Weighted moving average
• Simple Exponential Smoothing
• Holt’s Linear Trend Model
• Holt Winters Method
• ARIMA



ARIMA
One of the most common methods for Time Series Forecasting is the ARIMA model,
which stands for Auto-Regressive Integrated Moving Average. ARIMA models work
on the following assumptions :
• The data series is stationary, which means that the mean and variance should
not vary with time. A series can be made stationary by using log
transformation or differencing the series.
• The data provided as input must be a univariate series, since ARIMA uses the
past values to predict the future values.

ARIMA has three components :


• AR (autoregressive term) p
• I (differencing term) d
• MA (moving average term) q

p is the parameter associated with the auto-regressive aspect of the model, i.e., the number of past values used for forecasting the next value. The value of ‘p’ is determined using the PACF plot. For example, forecasting that if it rained a lot over the past few days, it is likely that it will rain tomorrow as well.

d is the parameter associated with the integrated part of the model; it specifies the number of times the differencing operation is performed on the series to make it stationary. Tests like ADF and KPSS can be used to determine whether the series is stationary and help in identifying the d value. You can imagine an example of this as forecasting that the amount of rain tomorrow will be similar to the amount of rain today if the daily amounts of rain have been similar over the past few days.

q is the parameter associated with the moving average part of the model. It defines the number of past forecast errors used to predict the future values. The ACF plot is used to identify the correct ‘q’ value.

Types of ARIMA Model


• ARIMA: Non-seasonal Autoregressive Integrated Moving Averages
• SARIMA: Seasonal ARIMA
• SARIMAX: Seasonal ARIMA with exogenous variables

If our model has a seasonal component we use a seasonal ARIMA model (SARIMA).
In that case we have another set of parameters: P,D, and Q which describe the same
associations as p, d, and q, but correspond with the seasonal components of the
model.



Steps for ARIMA implementation
The general steps to implement an ARIMA model are –
1. Load the data: The first step for model building is of course to load the
dataset
2. Preprocessing: Depending on the dataset, the steps of preprocessing will be
defined. This will include creating timestamps, converting the dtype of
date/time column, making the series univariate, etc.
3. Make series stationary: To satisfy the assumption, it is necessary to make
the series stationary. This would include checking the stationarity of the series
and performing required transformations
4. Determine d value: For making the series stationary, the number of times the
difference operation was performed will be taken as the d value
5. Create ACF and PACF plots: This is the most important step in ARIMA
implementation. ACF PACF plots are used to determine the input parameters
for our ARIMA model
6. Determine the p and q values: Read the values of p and q from the plots in
the previous step
7. Fit ARIMA model: Using the processed data and parameter values we
calculated from the previous steps, fit the ARIMA model
8. Predict values on validation set: Predict the future values
9. Calculate RMSE: To check the performance of the model, check the RMSE
value using the predictions and actual values on the validation set

Need of Auto ARIMA

Although ARIMA is a very powerful model for forecasting time series data, the data
preparation and parameter tuning processes end up being really time consuming.
Before implementing ARIMA, you need to make the series stationary, and determine
the values of p and q using the plots we discussed above. Auto ARIMA makes this
task simple for us as it eliminates steps 3 to 6 we saw in the previous section.

Below are the steps you should follow for implementing auto ARIMA:
1. Load the data: This step will be the same. Load the data into your notebook
2. Preprocessing data: The input should be univariate, hence drop the other
columns
3. Fit Auto ARIMA: Fit the model on the univariate series
4. Predict values on validation set: Make predictions on the validation set
5. Calculate RMSE: Check the performance of the model using the predicted
values against the actual values

We completely bypass the selection of the p and q parameters.



Program:
Working with a dataset that contains the number of airplane passengers vs month.

Loading and Handling Time Series in Pandas:

Pandas has dedicated support for handling TS objects, particularly the datetime64[ns] dtype, which stores time information and allows us to perform some operations fast.
1. parse_dates: This specifies the column which contains the date-time
information. As seen above, the column name is ‘Month’.
2. index_col: A key idea behind using Pandas for TS data is that the index has to
be the variable depicting date-time information. So, this argument tells pandas
to use the ‘Month’ column as the index.
3. date_parser: This specifies a function which converts an input string into a
datetime variable. By default, Pandas reads data in the format ‘YYYY-MM-DD
HH:MM:SS’. If the data is not in this format, the format must be manually
defined. Something like the dataparse function defined here can be used for
this purpose.
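A sketch of this loading step, assuming the dataset is stored in a file named AirPassengers.csv with a 'Month' column (file and column names may differ in your copy of the dataset):

import pandas as pd

# Parses strings like '1949-01'; recent pandas versions prefer date_format='%Y-%m'
dateparse = lambda x: pd.to_datetime(x, format='%Y-%m')

data = pd.read_csv('AirPassengers.csv',
                   parse_dates=['Month'],
                   index_col='Month',
                   date_parser=dateparse)
print(data.head())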



Decomposition:

Here we can see there is an upward trend. We can use statsmodels to perform a
decomposition of this time series. The decomposition of time series is a statistical
task that deconstructs a time series into several components, each representing one
of the underlying categories of patterns. With statsmodels we will be able to see the
trend, seasonal, and residual components of our data.

• Additive model is used when it seems that the trend is more linear, and the
seasonality and trend components seem to be constant over time.
Ex: every year we add 100 units of energy production.
• Multiplicative model is more appropriate when we are increasing (or
decreasing) at a non-linear rate.
Ex: each year we double the amount of energy production.

From the plot above we can clearly see the seasonal component of the data, and we
can also see the separated upward trend of the data.
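A sketch of the decomposition with statsmodels, assuming data is the DataFrame loaded above; the multiplicative model is chosen here because the seasonal swings grow with the trend:

from statsmodels.tsa.seasonal import seasonal_decompose
import matplotlib.pyplot as plt

result = seasonal_decompose(data, model='multiplicative')
result.plot()            # shows observed, trend, seasonal and residual components
plt.show()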

Trends can be upward or downward and can be linear or non-linear. It is important to understand your data set to know whether a significant period has passed to identify an actual trend.

Irregular fluctuations are abrupt changes that are random and unpredictable.



Performing the Seasonal ARIMA

Now that we have analyzed the data, we can clearly see we have a time series with a
seasonal component, so it makes sense to use a Seasonal ARIMA model. To do this,
we will need to choose p, d, q values for the ARIMA, and P,D,Q values for the Seasonal
component.

There are many ways to choose these values statistically, such as looking at auto-
correlation plots, correlation plots, domain experience, etc.

The pyramid-arima library for Python allows us to quickly perform this grid search
and even creates a model object that you can fit to the training data.

This library contains an auto_arima function that allows us to set a range of p, d, q, P, D, and Q values and then fit models for all the possible combinations.
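A sketch of this grid search; pyramid-arima is distributed today as the pmdarima package, so the import below assumes that package, and the '#Passengers' column name is an assumption about the dataset loaded earlier:

import pmdarima as pm    # pyramid-arima lives on as the pmdarima package

y = data['#Passengers']  # assumed column name; adjust to your CSV

stepwise_model = pm.auto_arima(y,
                               start_p=1, start_q=1, max_p=3, max_q=3,
                               m=12,                  # monthly data -> seasonal period 12
                               start_P=0, seasonal=True,
                               d=1, D=1,
                               trace=True,
                               error_action='ignore',
                               suppress_warnings=True,
                               stepwise=True)
print(stepwise_model.aic())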



Train Test Split

We can then fit the stepwise_model object to a training data set. Because this is a
time series forecast, we will “chop off” a portion of our latest data and use that as
the test set. Then we will train on the rest of the data and forecast into the future.
Afterwards we can compare our forecast with the section of data we chopped off.

Train the Model

We can then train the model by simply calling .fit on the stepwise model and passing
in the training data.



Evaluation

Now that the model has been fitted to the training data, we can forecast into the
future. We use the .predict() method call.
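A sketch of the split, fit, and forecast steps, assuming y and stepwise_model come from the sketches above (the date boundaries assume the 1949-1960 Air Passengers series):

import pandas as pd
from sklearn.metrics import mean_squared_error

train = y.loc[:'1958-12-01']          # earlier years for training
test  = y.loc['1959-01-01':]          # last portion "chopped off" as the test set

stepwise_model.fit(train)
future_forecast = stepwise_model.predict(n_periods=len(test))
forecast = pd.Series(future_forecast, index=test.index)

rmse = mean_squared_error(test, forecast) ** 0.5
print("RMSE:", rmse)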



Forecast next three years using ARIMA Model

Conclusion:

From this experiment we came to know the concepts starting from the very basics
of forecasting, AR, MA, ARIMA, SARIMA and finally the SARIMAX model.

In the domain of machine learning, there is a collection of techniques for manipulating and interpreting variables that depend on time. Among these is ARIMA, which can remove the trend component to accurately predict future values.



Week – 6

Aim:
Install libraries important for Machine learning (ScikitLearn, statsmodels, scipy,
NLTK, etc.) and write brief introduction about those modules.

Description:
Machine Learning and Deep Learning have been on the rise recently with the push
in the AI industry. Machine learning is a subset of Artificial Intelligence (AI) which
provides machines the ability to learn automatically & improve from experience
without being explicitly programmed to do so.

Several programming languages can get you started with AI, ML and DL with each
language offering a stronghold on a specific concept. Some of the popular programming languages for ML and DL are Python, Julia, R, and Java, along with a few more. But Python seems to be winning the battle as the preferred language of Machine Learning. The availability of libraries and open-source tools makes it an ideal choice for developing ML models.

One of Python’s greatest assets is its extensive set of libraries. Libraries are sets of
routines and functions that are written in each language. A robust set of libraries
can make it easier for developers to perform complex tasks without rewriting many
lines of code.

Machine learning is largely based upon mathematics. Specifically, mathematical optimization, statistics, and probability. Python libraries help researchers or mathematicians who are less equipped with developer knowledge to easily “do machine learning”.

Below are some of the most used libraries in machine learning:



Scikit Learn :
Scikit Learn is perhaps the most popular library for Machine Learning. It provides
almost every popular model: Linear Regression, Lasso-Ridge, Logistic Regression,
Decision Trees, SVMs and a lot more. Not only that, but it also provides an extensive
suite of tools to pre-process data, vectorizing text using BOW, TF-IDF or hashing
vectorization and many more.

It builds on two basic libraries of Python, NumPy and SciPy. It adds a set of algorithms
for common machine learning and data mining tasks, including clustering,
regression, and classification. Even tasks like transforming data, feature selection
and ensemble methods can be implemented in a few lines.
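As a minimal illustration of the "few lines" claim, the sketch below trains and scores a classifier on scikit-learn's bundled Iris dataset:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = LogisticRegression(max_iter=200).fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))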

Advantages:
• Simple, easy to use, and effective.
• In rapid development, and constantly being improved.
• Wide range of algorithms, including clustering, factor analysis, principal
component analysis, and more.
• Can extract data from images and text.
• Can be used for NLP.
Disadvantages:
• This library is especially suited for supervised learning, and not very suited to
unsupervised learning applications like Deep Learning.

Statsmodels :
Statsmodels is another library to implement statistical learning algorithms. However,
it is more popular for its module that helps implement time series models. You can
easily decompose a time-series into its trend component, seasonal component, and
a residual component.

You can also implement popular ETS methods like exponential smoothing, Holt-
Winters method, and models like ARIMA and Seasonal ARIMA or SARIMA. The only drawback is that this library does not have as much popularity or as thorough documentation as Scikit-learn.



Scipy :
SciPy is a very popular ML library with different modules for optimization, linear
algebra, integration, and statistics.

Advantages:
• Great for image manipulation.
• Provides easy handling of mathematical operations.
• Offers efficient numerical routines, including numerical integration and
optimization.
• Supports signal processing.
Disadvantages:
• There is both a stack and a library named SciPy. The library is part of the stack.
Beginners who do not know the difference may become confused.

NLTK :
NLTK is a framework and suite of libraries for developing both symbolic and
statistical Natural Language Processing (NLP) in Python. It is the standard tool for
NLP in Python.

Advantages:
• The Python library contains graphical examples, as well as sample data.
• Includes a book and cookbook making it easier for beginners to pick up.
• Provides support for different ML operations like classification, parsing, and
tokenization functionalities, etc.
• Acts as a platform for prototyping and building research systems.
• Compatible with several languages.
Disadvantages:
• Understanding the fundamentals of string processing is a prerequisite to using
the NLTK framework. Fortunately, the documentation is adequate to assist in
this pursuit.
• NLTK does sentence tokenization by splitting the text into sentences. This has
a negative impact on the performance.



Pytorch :
PyTorch is a popular ML library for Python based on Torch, which is an ML library
implemented in C and wrapped in Lua. It was originally developed by Facebook, but
is now used by Twitter, Salesforce, and many other major organizations and
businesses.

Advantages:
• Contains tools and libraries that support Computer Vision, NLP , Deep
Learning, and many other ML programs.
• Developers can perform computations on Tensors with GPU acceleration.
• Helps in creating computational graphs.
• The default “define-by-run” mode is more like traditional programming.
• Uses a lot of pre-trained models and modular parts that are easy to combine.
Disadvantages:
• Because PyTorch is relatively new, there are comparatively fewer online
resources to be found. This makes it harder to learn from scratch, although it
is intuitive.
• PyTorch is not widely considered to be production-ready compared to Google’s
TensorFlow, which is more scalable.

https://pytorch.org/get-started/locally/

Keras :
Keras is a very popular ML library for Python, providing a high-level neural network API
capable of running on top of TensorFlow, CNTK, or Theano.

Advantages:
• Great for experimentation and quick prototyping.
• Portable.
• Offers easy expression of neural networks.
• Great for use in modeling and visualization.
Disadvantages:
• Slow, since it needs to create a computational graph before it can perform
operations.



Tensorflow :
Originally developed by Google, TensorFlow is an open-source library for high-
performance numerical computation using data flow graphs. Under the hood, it is a
framework for creating and running computations involving tensors. The principal
application for TensorFlow is in neural networks, and especially deep learning where
it is widely used. That makes it one of the most important Python packages for machine learning.

Advantages:
• Supports reinforcement learning and other algorithms.
• Provides computational graph abstraction.
• Offers a very large community.
• Provides TensorBoard, which is a tool for visualizing ML models directly in the
browser.
• Production ready.
• Can be deployed on multiple CPUs and GPUs.
Disadvantages:
• Runs dramatically slower than other frameworks utilizing CPUs/GPUs.
• Steep learning curve compared to PyTorch.
• Computational graphs can be slow.
• Not commercially supported.

Conclusion:
Python is a truly marvelous development tool that not only serves as a general-purpose programming language but also caters to specific niches of our project or workflow. Loads of libraries and packages expand the capabilities of Python, making it an all-rounder and a perfect fit for anyone looking to get into developing programs and algorithms. From the modern machine learning and deep learning libraries for Python discussed briefly above, we can get an idea of what each of these libraries has to offer and make our pick.



Week – 7

Aim:
• Tokenization, Stemming, Lemmatization and Stop Word removal using NLTK
• Implement Sentiment analysis for the reviews from any website using NLTK

Description:
Natural language processing is one of the fields in programming where the natural
language is processed by the software. This has many applications like sentiment
analysis, language translation, fake news detection, grammatical error detection etc.

The input in natural language processing is text. The data collection for this text
happens from a lot of sources. This requires a lot of cleaning and processing before
the data can be used for analysis.

These are some of the methods of processing the data in NLP:


• Tokenization
• Stop words removal
• Stemming
• Normalization
• Lemmatization
• Parts of speech tagging

In the past, only experts could be part of natural language processing projects that
required superior knowledge of mathematics, machine learning, and linguistics.
Now, developers can use ready-made tools that simplify text preprocessing so that
they can concentrate on building machine learning models.

There are many tools and libraries created to solve NLP problems. Some of the
amazing Python Natural Language Processing libraries are:
• Natural Language Toolkit (NLTK)
• TextBlob
• CoreNLP
• Gensim
• spaCy
• polyglot
• scikit–learn
• Pattern

Natural Language Tool Kit (NLTK) is a Python library to make programs that work
with natural language. It provides a user-friendly interface to over 50 corpora and
lexical resources such as the WordNet word repository. The library can
perform different operations such as tokenizing, stemming, classification, parsing,
tagging, and semantic reasoning.



Tokenization
Tokenization is essentially splitting a phrase, sentence, paragraph, or an entire text
document into smaller units, such as individual words or terms. Each of these smaller units is called a token.
For example, in the English Language-
The text “It is raining” can be tokenized into ‘It’, ‘is’, ‘raining’.

Before processing a natural language, we need to identify the words that constitute
a string of characters. This is important because the meaning of the text could easily
be interpreted by analyzing the words present in the text.
We can use this tokenized form to :
• Count the number of words in the text
• Count the frequency of the word, that is, the number of times a particular word
is present

Ordinarily, there are two types of tokenization:


• Word Tokenization : Used to separate words via unique space
character. Depending on the application, word tokenization may also tokenize
multi-word expressions like New York. This is often closely tied to a process
called Named Entity Recognition.
• Sentence Tokenization/Segmentation : Along with word tokenization, sentence
segmentation is a crucial step in text processing. This is usually performed based
on punctuations such as “.”, “?”, “!” as they tend to mark the sentence boundaries.

There are some other special tokenizers


• The MWETokenizer takes a string which has already been divided into tokens and
retokenizes it, merging multiword expressions into single tokens, by using a lexicon
of MWEs.

“He completed the task in spite of all the hurdles faced” is tokenized as
[‘He’, ‘completed’, ‘the’, ‘task’, ‘in’, ‘spite’, ‘of’, ‘all’, ‘the’, ‘hurdles’, ‘faced’]
If we add the ‘in spite of’ in the lexicon of the MWETokenizer,
[‘He’, ‘completed’, ‘the’, ‘task’, ‘in spite of’, ‘all’, ‘the’, ‘hurdles’, ‘faced’]

• The TweetTokenizer addresses the specific things for the tweets like handling
emojis.
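A sketch of word, sentence and multi-word-expression tokenization with NLTK (the punkt tokenizer data must be downloaded first):

import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize, sent_tokenize, MWETokenizer

text = "It is raining. He completed the task in spite of all the hurdles faced."

print(sent_tokenize(text))          # sentence tokenization
print(word_tokenize(text))          # word tokenization

# MWETokenizer re-tokenizes an already tokenized sentence, merging 'in spite of'
mwe = MWETokenizer([('in', 'spite', 'of')], separator=' ')
print(mwe.tokenize(word_tokenize("He completed the task in spite of all the hurdles faced")))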



Stemming
Stemming is the process of reducing words (generally modified or derived) to
their word stem or root form. The objective of stemming is to reduce related words
to the same stem even if the stem is not a dictionary word.
For example, in the English language-
• beautiful and beautifully are stemmed to beauti
• good, better, and best are stemmed to good, better, and best respectively

There are mainly two errors that occur while performing stemming:
• Over-stemming occurs when two words with different stems are reduced to the
same root. For example, university and universe. Some stemming algorithms may
reduce both words to the stem univers, which would imply both words mean the
same thing, and that is clearly wrong.
• Under-stemming occurs when two words that should be reduced to the same
root are not. For example, consider the words “data” and “datum”. Some
algorithms may reduce these words to dat and datu respectively, which is
obviously wrong. Both should be reduced to the same stem dat.

Python NLTK provides various stemmers, like:

• Porter Stemmer : It uses suffix stripping to produce stems. It does not follow a
linguistic set of rules to produce stems for phrases in different cases; for this
reason the Porter stemmer does not always generate actual English words as stems.
• Snowball Stemmer : It is an advanced version of Porter Stemmer, also named
as Porter2 Stemmer.
print(SnowballStemmer("english").stem("badly"))
Output: bad
print(SnowballStemmer("porter").stem("badly"))
Output: badli
Here, the word “badly” is stemmed with the English Snowball stemmer, giving the
output “bad”. When the Snowball stemmer is used on the same word with the
“porter” option, we get the output “badli”.
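A runnable version of the comparison, showing the Porter and Snowball stemmers side by side on a few sample words:

from nltk.stem import PorterStemmer, SnowballStemmer

words = ["beautiful", "beautifully", "badly", "running"]

porter = PorterStemmer()
snowball = SnowballStemmer("english")

for w in words:
    print(w, "->", porter.stem(w), "(Porter),", snowball.stem(w), "(Snowball)")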



Lemmatization
Lemmatization is the process of reducing a group of words into their lemma or
dictionary form. It considers things like POS(Parts of Speech), the meaning of the
word in the sentence, the meaning of the word in the nearby sentences etc. before
reducing the word to its lemma.
For example, in the English Language-
• beautiful and beautifully are lemmatized to beautiful and beautifully respectively.
• good, better, and best are lemmatized to good, good, and good respectively.

Python NLTK provides WordNet Lemmatizer


The WordNet Lemmatizer uses the WordNet database to look up lemmas of words. WordNet is a large, freely and publicly available lexical database for the English language, aiming to establish structured semantic relationships between words. It offers lemmatization capabilities as well and is one of the earliest and most used lemmatizers. NLTK offers an interface to it, but we must download it first in order to use it.

Follow the below instructions to download wordnet.


import nltk
nltk.download('wordnet')

To lemmatize, you need to create an instance of WordNetLemmatizer() and call the lemmatize() function on a single word. Sometimes, the same word can have multiple lemmas based on the meaning/context. This can be handled by providing the correct part-of-speech (POS) tag as the second argument to lemmatize().
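A sketch of WordNet lemmatization, including the effect of passing a POS tag:

import nltk
nltk.download('wordnet')     # newer NLTK versions may also need nltk.download('omw-1.4')
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("stripes"))            # noun by default -> 'stripe'
print(lemmatizer.lemmatize("stripes", pos="v"))   # as a verb -> 'strip'
print(lemmatizer.lemmatize("better", pos="a"))    # adjective -> 'good'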



Stop words removal
Stopwords are English words which do not add much meaning to a sentence.
They can safely be ignored without sacrificing the meaning of the sentence.
For example, in the English Language-
“There is a pen on the table”. Now, the words “is”, “a”, “on”, and “the” add no meaning
to the statement while parsing it, whereas words like “pen” and “table” are the
keywords and tell us what the statement is all about.

There are two considerations usually that motivate this removal.


• Irrelevance: Allows one to analyze only the content-bearing words. Stopwords do
not bear much meaning and introduce noise in the analysis/modeling process.
• Dimension: Removing the stopwords also allows one to reduce the tokens in
documents significantly, thereby decreasing the feature dimension.

NLTK supports stop word removal, and we can find a list of the stop words in its corpus module. To remove stop words from a sentence, we divide the text into words and then remove a word if it exists in the list of stop words provided by NLTK.

Follow the below instructions to download stopwords.


import nltk
nltk.download('stopwords')
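With the corpus downloaded, removing stop words from the example sentence is a short filter (word_tokenize additionally needs the punkt data):

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize   # requires nltk.download('punkt')

stop_words = set(stopwords.words('english'))
sentence = "There is a pen on the table"

tokens = word_tokenize(sentence)
filtered = [w for w in tokens if w.lower() not in stop_words]
print(filtered)    # ['pen', 'table'] -- the function words are dropped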

Expressions like “US citizen” will be viewed as “us citizen”, or “IT scientist” as
“it scientist”. Since both “us” and “it” are normally considered stop words, it would
result in an inaccurate outcome. The strategy regarding the treatment of stopwords
can thus be refined by identifying that “US” and “IT” are not pronouns in the above
examples, through a part-of-speech tagging step.



Sentiment analysis
Sentiment analysis (also known as opinion mining) is a text analysis technique that
detects polarity (e.g. a positive or negative or neutral opinion) within text, whether
a whole document, paragraph, sentence, or clause. Sometimes, the third attribute is
not taken to keep it a binary classification problem. In recent tasks, sentiments like
"somewhat positive" and "somewhat negative" are also being considered.
For example, in the English Language-
1. "Titanic is a great movie." ( positive sentiment )
2. "Titanic is not a great movie." ( negative sentiment )
3. "Titanic is a movie." ( neutral sentiment )

NLTK provides VADER Sentiment Analysis.


VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media. VADER uses a sentiment lexicon, i.e., a list of lexical features (e.g., words) which are generally labeled according to their semantic orientation as either positive or negative, in combination with a set of grammatical rules. VADER not only gives the positivity and negativity score but also tells us how positive or negative a sentiment is.

Follow the below instructions to download vader_lexicon.


import nltk
nltk.download('vader_lexicon')

The Compound score is a metric that calculates the sum of all the lexicon ratings
which have been normalized between -1(most extreme negative) and +1 (most
extreme positive).
positive sentiment : (compound score >= 0.05)
neutral sentiment : (compound score > -0.05) and (compound score < 0.05)
negative sentiment : (compound score <= -0.05)
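A sketch of VADER scoring applied to the three example sentences above:

from nltk.sentiment.vader import SentimentIntensityAnalyzer   # needs nltk.download('vader_lexicon')

sia = SentimentIntensityAnalyzer()
for sentence in ["Titanic is a great movie.",
                 "Titanic is not a great movie.",
                 "Titanic is a movie."]:
    scores = sia.polarity_scores(sentence)
    print(sentence, "-> compound:", scores['compound'])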



Conclusion:
NLP is a very vast and interesting topic and solves some challenging problems.
Specifically, the intersection of NLP and Deep Learning has given birth to some
fantastic products. It has completely revolutionized the way chatbots interact. The
list is never-ending.

Natural language processing provides the fundamental first step in all text processing tasks: transforming unstructured text into a form that computers understand. From this point on, we can use it to generate features and perform other tasks like named entity extraction, sentiment analysis, and topic detection.

Sentiment analysis can be applied to countless aspects of business, from brand monitoring and product analytics, to customer service and market research. By incorporating it into their existing systems and analytics, leading brands can work faster, with more accuracy, toward more useful ends.



Week – 8

Aim:
Installation of Big data technologies and building a Hadoop cluster
Description:
A Hadoop cluster is a collection of computers, known as nodes, that are
networked together to perform parallel computations on big data
sets. Unlike other computer clusters, Hadoop clusters are designed specifically to
store and analyze mass amounts of structured and unstructured data in a distributed
computing environment. Further distinguishing Hadoop ecosystems from other
computer clusters are their unique structure and architecture. Hadoop clusters
consist of a network of connected master and slave nodes that utilize high
availability, low-cost commodity hardware. The ability to linearly scale and quickly
add or subtract nodes as volume demands makes them well-suited to big data
analytics jobs with data sets highly variable in size.

Task:
The steps given below are to be followed to have Hadoop Multi-Node cluster setup:
1. Installing Java:
Java is the main prerequisite for Hadoop. Firstly, you should verify the
existence of java in your system using “java -version”. If java is not installed in
your system, then follow the steps for installing java:
i. Download java (JDK - X64.tar.gz) by visiting the following link
http://www.oracle.com/technetwork/java/javase/downloads/jdk7-
downloads1880260.html
Then jdk-7u71-linux-x64.tar.gz will be downloaded into your system.
ii. Generally, you will find the downloaded java file in Downloads folder. Verify
it and extract the jdk-7u71-linux-x64.gz file
iii. Make java available to all the users by moving it to the location /usr/local.
iv. Set up PATH and JAVA_HOME variables.

2. Creating User Account


Create a system user account on both master and slave systems to use the
Hadoop installation.

3. Mapping the nodes


You must edit the hosts file in the /etc/ folder on all nodes and specify the IP
address of each system followed by its host name.

4. Configuring Key Based Login


Set up ssh on every node so that they can communicate with one another
without any prompt for a password.

5. Installing Hadoop
In the Master server, download and install Hadoop



6. Configuring Hadoop
You must configure the Hadoop server by opening core-site.xml, hdfs-site.xml,
mapred-site.xml, and hadoop-env.sh and editing these files according to the
requirements.

7. Installing Hadoop on Slave Servers


Install Hadoop on all the slave servers, however many are present in the
requirement.

8. Configuring Hadoop on Master Server


Open the master server and configure it.

9. Configuring Master Node

10. Configuring Slave Nodes

11. Format Name Node on Hadoop Master


This is an important step because the NameNode in Hadoop serves a crucial
purpose: it is the node which manages the file system.

12. Starting Hadoop Services


The start-all.sh script is used to start all the Hadoop services on the Hadoop-Master.

13. Adding a New DataNode in the Hadoop Cluster Networking:


Add new nodes to an existing Hadoop cluster with some appropriate network
configuration.

14. Adding User and SSH Access


i. Add a User:
On a new node, add "hadoop" user and set password of Hadoop user.
ii. Set Hostname of New Node:
You can set hostname in file /etc/sysconfig/network.

15. Start the DataNode on New Node


Start the datanode daemon manually using the $HADOOP_HOME/bin/hadoop-
daemon.sh script. It will automatically contact the master NameNode and join
the cluster. We should also add the new node to the conf/slaves file in the
master server. The script-based commands will recognize the new node.

16. Login to new node

17. Start HDFS on a newly added slave node


Removing a DataNode from the Hadoop Cluster
We can remove a node from a cluster on the fly, while it is running, without any
data loss. To do this, follow the steps given below:
1: Login to master
Login to master machine user where Hadoop is installed.
2: Change cluster configuration
An exclude file must be configured before starting the cluster. Add a
key named dfs.hosts.exclude to our $HADOOP_HOME/etc/hadoop/
hdfs-site.xml file. The value associated with this key provides the full
path to a file on the NameNode's local file system which contains a list
of machines which are not permitted to connect to HDFS.
3: Determine hosts to decommission
Each machine to be decommissioned should be added to the file
identified by the hdfs_exclude.txt, one domain name per line. This will
prevent them from connecting to the NameNode.
4: Force configuration reload
This will force the NameNode to re-read its configuration, including the
newly updated ‘excludes’ file. It will decommission the nodes over a
period, allowing time for each node's blocks to be replicated onto
machines which are scheduled to remain active.
5: Shutdown nodes
After the decommission process has been completed, the
decommissioned hardware can be safely shut down for maintenance.
Run the report command of dfsadmin to check the status of the
decommission. The following command will describe the status of the
decommissioned node and the nodes connected to the cluster.
6: Edit excludes file again
Once the machines have been decommissioned, they can be removed
from the ‘excludes’ file. Running "$HADOOP_HOME/bin/hadoop
dfsadmin -refreshNodes" again will read the excludes file back
into the NameNode, allowing the DataNodes to rejoin the cluster after
the maintenance has been completed, or when additional capacity is
needed in the cluster again, etc.

Conclusion:
From this experiment we learn what a Hadoop cluster is, how it is supposed to be
configured on the system, and the system requirements to install Hadoop.



Week – 9

Aim:
Steps for data loading from local machine to Hadoop and Hadoop to local machine

Description:

Starting HDFS
Initially we must format the configured HDFS file system, open namenode (HDFS
server), and execute the following command.

$ hadoop namenode -format

After formatting the HDFS, start the distributed file system. The following command
will start the namenode as well as the data nodes as cluster.

$ start-dfs.sh

Hadoop commands structure


All the commands which come under the Hadoop file system can be written using the following syntax:

$ hadoop fs -command

Ls
After loading the information into the server, we can find the list of files in a directory or the status of a file using ‘ls’. Given below is the syntax of ls; we can pass a directory or a filename as an argument.

$ hadoop fs -ls <args>

Mkdir
This command is used to create the directory in Hdfs. Here “sujan” is directory name.

$ hadoop fs -mkdir sujan

Cat
In linux file system we use cat command both to read and create the file. But in
Hadoop system we cannot create files in HDFS. We can only load the data. So we use
cat command in Hdfs only to read the file. “sujan.txt” is my file name and the
command will show all the contents of this file on the screen.

$ hadoop fs -cat sujan.txt



Loading data from a local machine to HDFS

In this, we are going to load data from a local machine's disk to HDFS. Assume we have data in a file called “sujan.txt” on the local system which ought to be saved in the HDFS file system.

To perform this, you should have an already running Hadoop cluster.

Performing this is as simple as copying data from one folder to another. There are a
couple of ways to copy data from the local machine to HDFS. Follow the steps given
below to insert the required file in the Hadoop file system.

• Using the copyFromLocal command

To copy the file to HDFS, let us first create an input directory on HDFS and
then copy the file. Here are the commands to do this:

$ hadoop fs -mkdir /mydir1

$ hadoop fs -copyFromLocal /usr/local/sujan.txt /mydir1

• Using the put command

We will first create the input directory, and then put the local file in HDFS:

$ hadoop fs -mkdir /mydir2

Transfer and store a data file from local systems to the Hadoop file system
using the put command.

$ hadoop fs -put /usr/local/sujan.txt /mydir2

• Using the MoveFromLocal command

We also use this command to load data from local to HDFS, but this command
removes the file from the local system.

$ hadoop fs -mkdir /mydir3

$ hadoop fs -moveFromLocal /usr/local/sujan.txt /mydir3

We can validate that the files have been copied to the correct folders by listing the files:

$ hadoop fs -ls /mydir1
$ hadoop fs -ls /mydir2
$ hadoop fs -ls /mydir3



Exporting HDFS data to a local machine

In this, we are going to export/copy data from HDFS to the local machine.

To perform this, we should already have a running Hadoop cluster.

Performing this is as simple as copying data from one folder to the other. There are
a couple of ways in which you can export data from HDFS to the local machine. Given
below is a simple demonstration for retrieving the required file from the Hadoop file
system.

• Using the copyToLocal command

$ hadoop fs -copyToLocal /mydir1/sujan.txt /home/ubuntu

• Using the get command

$ hadoop fs -get /mydir1/sujan.txt /home/ubuntu

Shutting Down the HDFS

You can shut down the HDFS by using the following command.

$ stop-dfs.sh

Conclusion:

The put command is similar to copyFromLocal, although put is slightly more general: it can copy multiple files into HDFS and can read input from stdin. copyFromLocal returns 0 on success and -1 on error.

The get Hadoop shell command can be used in place of the copyToLocal command.
At this time, they share the same implementation. The copyToLocal command does
a Cyclic Redundancy Check (CRC) to verify that the data copied was unchanged. A
failed copy can be forced using the optional -ignorecrc argument. The file and its
CRC can be copied using the optional -crc argument.



Week – 10

Aim:
Prepare a document for the Map Reduce concept

Description:
Hadoop MapReduce is a software framework for easily writing applications which
process vast amounts of data (multi-terabyte datasets) in-parallel on large clusters
(thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.

A MapReduce job usually splits the input dataset into independent chunks which are
processed by the map tasks in a completely parallel manner. The framework sorts
the outputs of the maps, which are then input to the reduce tasks. Typically, both
the input and the output of the job are stored in a filesystem. The framework takes
care of scheduling tasks, monitoring them and re-executes the failed tasks.

The MapReduce framework operates exclusively on <key, value> pairs, that is, the
framework views the input to the job as a set of <key, value> pairs and produces a
set of <key, value> pairs as the output of the job, conceivably of different types.

• MapReduce consists of two distinct tasks – Map and Reduce.


• As the name MapReduce suggests, the reducer phase takes place after the
mapper phase has been completed.
• So, the first is the map job, where a block of data is read and processed to
produce key-value pairs as intermediate outputs.
• The output of a Mapper or map job (key-value pairs) is input to the Reducer.
• The reducer receives the key-value pair from multiple map jobs.
• Then, the reducer aggregates those intermediate data tuples (intermediate key-
value pair) into a smaller set of tuples or key-value pairs which is the final output.

Input and Output types of a MapReduce job:


(input) <k1,v1> -> map -> <k2,v2> -> combine -> <k2,v2> -> reduce -> <k3,v3> (output)



MapReduce has the following three major classes:

1. Mapper Class
The first stage in data processing using MapReduce is the Mapper class. Here, the
RecordReader processes each input record and generates the corresponding key-value
pair. The Mapper then writes this intermediate data to the local disk.
1. Input Split
It is the logical representation of data. It represents a block of work that
contains a single map task in the MapReduce Program.
2. RecordReader
It interacts with the input split and converts the obtained data into
key-value pairs.
2. Reducer Class
The Intermediate output generated from the mapper is fed to the reducer which
processes it and generates the final output which is then saved in the HDFS.
3. Driver Class
The major component in a MapReduce job is the Driver class. It is responsible for
setting up a MapReduce job to run in Hadoop. Here we specify the names of the Mapper
and Reducer classes along with the data types and the respective job name.
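Once the Mapper, Reducer, and Driver classes are packaged into a jar, the job is
submitted from the command line. A minimal sketch is shown below; the jar name,
driver class name, and paths are hypothetical placeholders, not part of a specific
experiment:

$ hadoop jar wordcount.jar WordCountDriver /user/sujan/input /user/sujan/output

Here WordCountDriver is the Driver class whose main() method configures and launches
the job, and the last two arguments are the HDFS input and output paths passed to it.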

A Word Count Example of MapReduce


Consider a text file called sample.txt whose contents are as follows:
Welcome, to, Hadoop, Class, Hadoop, is, good, Hadoop, is, bad

Now, suppose we have to perform a word count on sample.txt using MapReduce: we
will find the unique words and the number of occurrences of each of those unique
words.



The data goes through the following phases:
1. Input Splits:
An input to a MapReduce job is divided into fixed-size pieces called input splits.
An input split is a chunk of the input that is consumed by a single map task.
2. Mapping
This is the very first phase in the execution of the map-reduce program. In this phase,
the data in each split is passed to a mapping function to produce output values. In
our example, the job of the mapping phase is to count the number of occurrences of
each word from the input splits (input splits are described in point 1 above) and
prepare a list in the form of <word, frequency>.
3. Shuffling
This phase consumes the output of the Mapping phase. Its task is to consolidate the
relevant records from the Mapping phase output. In our example, the same words
are clubbed together along with their respective frequencies.
4. Reducing
In this phase, output values from the Shuffling phase are aggregated. This phase
combines values from the Shuffling phase and returns a single output value. In short,
this phase summarizes the complete dataset.
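To watch these phases end to end, we can run the word-count job that ships with the
Hadoop distribution. This is a minimal sketch; the examples jar path varies between
Hadoop versions, and the /wcinput and /wcoutput paths are placeholders:

$ hadoop fs -mkdir /wcinput
$ hadoop fs -put /usr/local/sample.txt /wcinput
$ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar wordcount /wcinput /wcoutput
$ hadoop fs -cat /wcoutput/part-r-00000

The last command prints the final <word, frequency> list produced by the reduce phase.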

MapReduce Architecture



• One map task is created for each split which then executes map function for each
record in the split.
• It is always beneficial to have multiple splits, because the time taken to process a
split is small compared to the time taken to process the whole input.
When the splits are smaller, the processing is better load-balanced, since we
are processing the splits in parallel.
• However, it is also not desirable to have splits that are too small. When splits are
too small, the overhead of managing the splits and of map task creation begins to
dominate the total job execution time.
• For most jobs, it is better to make the split size equal to the size of an HDFS block
(64 MB by default in Hadoop 1, 128 MB by default in Hadoop 2 and later; the
configured value can be checked with the command shown after this list).
• Execution of map tasks results in writing output to a local disk on the respective
node and not to HDFS.
• The reason for choosing the local disk over HDFS is to avoid the replication which
takes place in the case of an HDFS store operation.
• Map output is intermediate output which is processed by reduce tasks to produce
the final output.

• Once the job is complete, the map output can be thrown away. So, storing it in
HDFS with replication becomes overkill.
• In the event of node failure, before the map output is consumed by the reduce
task, Hadoop reruns the map task on another node and re-creates the map
output.
• Reduce task does not work on the concept of data locality. An output of every
map task is fed to the reduce task. Map output is transferred to the machine
where reduce task is running.
• On this machine, the output is merged and then passed to the user-defined
reduce function.
• Unlike the map output, reduce output is stored in HDFS (the first replica is stored
on the local node and other replicas are stored on off-rack nodes). So, writing the
reduce output does consume network bandwidth, but only as much as a normal
HDFS write pipeline consumes.
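As referenced in the list above, the block size configured for the cluster (and hence
the default split size) can be checked with the following command, assuming the hdfs
client is on the PATH:

$ hdfs getconf -confKey dfs.blocksize

The value is printed in bytes, for example 134217728 for a 128 MB block.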



How Does MapReduce Organize Work?
Hadoop divides the job into tasks. There are two types of tasks:
1. Map tasks (Splits & Mapping)
2. Reduce tasks (Shuffling, Reducing)
as mentioned above.

The complete execution process (execution of both Map and Reduce tasks) is
controlled by two types of entities:
1. JobTracker: acts like a master (responsible for the complete execution of the
submitted job)
2. Multiple TaskTrackers: act like slaves, each of them performing a part of the job

JobTracker and TaskTracker are two essential processes involved in MapReduce
execution in MRv1 (or Hadoop version 1). Both processes are now deprecated in MRv2
(or Hadoop version 2) and replaced by the ResourceManager, ApplicationMaster and
NodeManager daemons.

• A job is divided into multiple tasks which are then run onto multiple data nodes
in a cluster.
• It is the responsibility of job tracker to coordinate the activity by scheduling tasks
to run on different data nodes.
• Execution of an individual task is then looked after by the task tracker, which
resides on every data node executing part of the job.
• The task tracker's responsibility is to send the progress report to the job tracker.
• In addition, the task tracker periodically sends a 'heartbeat' signal to the JobTracker
to notify it of the current state of the system.
• Thus, job tracker keeps track of the overall progress of each job. In the event of
task failure, the job tracker can reschedule it on a different task tracker.



Job Tracker –
1. JobTracker process runs on a separate node and not usually on a DataNode.
2. JobTracker is an essential Daemon for MapReduce execution in MRv1. It is
replaced by ResourceManager/ApplicationMaster in MRv2.
3. JobTracker receives the requests for MapReduce execution from the client.
4. JobTracker talks to the NameNode to determine the location of the data.
5. JobTracker finds the best TaskTracker nodes to execute tasks based on the
data locality (proximity of the data) and the available slots to execute a task
on a given node.
6. JobTracker monitors the individual TaskTrackers and submits the overall
status of the job back to the client.
7. JobTracker process is critical to the Hadoop cluster in terms of MapReduce
execution.
8. When the JobTracker is down, HDFS will still be functional but the MapReduce
execution cannot be started, and the existing MapReduce jobs will be halted.

TaskTracker –
1. TaskTracker runs on DataNodes, usually on all DataNodes.
2. TaskTracker is replaced by Node Manager in MRv2.
3. Mapper and Reducer tasks are executed on DataNodes administered by
TaskTrackers.
4. TaskTrackers will be assigned Mapper and Reducer tasks to execute by
JobTracker.
5. TaskTracker will be in constant communication with the JobTracker signalling
the progress of the task in execution.
6. TaskTracker failure is not considered fatal. When a TaskTracker becomes
unresponsive, JobTracker will assign the task executed by the TaskTracker to
another node.

Conclusion:
MapReduce is a Hadoop framework that helps you process vast volumes of data across
multiple nodes. From this experiment, we learned what MapReduce is, the essential
features of MapReduce, how the MapReduce algorithm works, and its benefits:
• Scalability
Businesses can process petabytes of data stored in the Hadoop Distributed
File System (HDFS).
• Flexibility
Hadoop enables easier access to multiple sources of data and multiple
types of data.
• Speed
With parallel processing and minimal data movement, Hadoop offers fast
processing of massive amounts of data.
• Simplicity
Developers can write code in a choice of languages, including Java, C++
and Python.
Week – 11

Aim:
Documentation for developing and handling a NOSQL database with HBase

Description:
Since the 1970s we have been using RDBMSs, but they are not enough to handle large
amounts of data. We have witnessed an explosion of data, and it has always been
challenging for us to store and retrieve it. The rise of ever-growing data gave us the
NoSQL databases, and HBase is one of the NoSQL databases built on top of Hadoop.
HBase is suitable for applications which require real-time read/write access to huge
datasets.

Big data has proven to be a huge attraction point for many researchers and
academics across the world. Due to the vast usage of social media applications, data
is growing rapidly nowadays. This data is often formless, disorganized, and
unpredictable, and storing and analyzing it is not an easy task. NoSQL databases,
however, allow us to handle and extract this data with ease. There are many NoSQL
databases available to data scientists.

NoSQL Types
The three main types of NoSQL databases are:
1. Column Database (column-oriented)
A NoSQL database that stores data in tables and manages them by columns
instead of rows. Also called a columnar database management system
(CDBMS), it converts columns into data files. HBase is of this kind.

One benefit is that it can compress data, allowing fast aggregate operations such
as minimum, maximum, sum, count, and average.

Columns can be auto-indexed, using less disk space than a relational database
system containing the same data.

2. Key-Value Database (key/value oriented)


A key/value-oriented NoSQL database stores data in collections of key/value pairs.
For example, a student id number may be the key, and the student's name may
be the value.

It works like a dictionary, storing a value, such as an integer or a string (JSON or
a matrix file structure), along with a key used to reference that value.

3. Document Database (document-oriented)


Document-oriented NoSQL databases are similar to key/value stores, but they organize
documents into collections analogous to relational tables. We can query based on
values, not just on keys.



HBase
HBase is the NoSQL database type that is widely used in many big companies.
Apache HBase is a column-oriented NoSQL database developed to run on top of Hadoop
with HDFS. It is designed from the concepts of the original columnar database
developed by Google, called "BigTable". HBase is an Apache project and is sometimes
referred to as Apache HBase.

HBase is short for Hadoop Database, and it runs on top of Hadoop as a scalable big
data store. Being the Hadoop database means it has the advantages of Hadoop's
distributed file system and the MapReduce model by default. It is referred to as a
columnar database because, in contrast to a relational database which stores data in
rows, HBase stores data in columns.

HBase is modeled after Google's BigTable, so it provides distributed data storage
capabilities like BigTable on top of HDFS. HBase allows access to sparse data: small
but valuable data within a gigantic volume of unstructured data used for Big Data
analytics. It reports failures automatically, replicates data throughout clusters, and
provides coherent reads and writes. HBase has several advantages over relational
databases, as the latter are hard to scale and must have a schema defined for them.

Some of the key features of HBase are listed below:

• Horizontally scalable
• Fault-tolerant storage capability for sparse data
• Supports parallel processing, HDFS and MapReduce
• Highly adaptable data model
• Ability to host large tables
• Real-time lookups
• Automatic load balancing of tables
• Supports block cache and Bloom filters for high-volume query optimization
• Easy Java API for clients

There is a short history behind HBase. Back in 2004, when Google was facing the
problem of how to provide efficient search results, it developed the BigTable
technology. In 2007, Mike Cafarella released the code for an open-source
implementation of BigTable, known as HBase. At the start, the initial model of HBase
was developed as a contributing data model for Hadoop. It became a Hadoop
sub-project in 2008 and a top-level Apache project in 2010.
BigTable
BigTable is a distributed storage system developed by Google, designed to store
gigantic amounts of data across several servers. Many projects, such as Google Earth,
Google Analytics, personalized search, web indexing, and financial applications,
store data in BigTable. All these applications have various demands regarding size
and latency, but BigTable provides a flexible, high-performance solution for them. It
provides a data model that supports dynamic control over the data format rather than
a relational data model. The BigTable API delivers functionality for creating and
deleting tables and columns, changing metadata, cluster settings, and access control.
HBase has adopted all these capabilities; however, there are many features that
differentiate the two technologies.

HBase structure and architecture


HBase enables low-latency read-write access on top of HDFS. Tables in HBase are
stored as a multidimensional sparse map with rows and columns, enabling random
real-time read-write access. Each cell has a timestamp and is uniquely identified by
the table, row, column family, column qualifier, and timestamp. As HBase has a Java
client API, the tables in HBase can be used as an input and output target for
MapReduce jobs. HBase uses ZooKeeper, an open-source Apache project used especially
for the management of partial failures in distributed systems. Additionally, it also
provides maintenance of configuration information and distributed synchronization.
ZooKeeper has ephemeral nodes which represent the region servers and are used to
track failures and network partitions.

As mentioned earlier, HBase is a column-oriented database, so the tables in HBase
are organized by rows. The HBase table schema only defines the column families
with their key-value pairs. Tables are collections of rows, rows are collections of
column families, a column family is a collection of columns, and these columns are
collections of key-value pairs. The main benefit of a column-oriented database
over a row-oriented database is that it can be used for huge amounts of data which
require online analytical processing. The figure below illustrates the basic structure
of an HBase table; as shown, the columns hold collections of key-value pairs.



HBase defines a four-dimensional data model, and the following four coordinates
define each cell:

• Row Key: Each row has a unique row key; the row key does not have a data
type and is treated internally as a byte array.
• Column Family: Data inside a row is organized into column families; each row
has the same set of column families, but across rows, the same column
families do not need the same column qualifiers. Under-the-hood, HBase
stores column families in their own data files, so they need to be defined
upfront, and changes to column families are difficult to make.
• Column Qualifier: Column families define actual columns, which are called
column qualifiers. You can think of column qualifiers as the columns
themselves.
• Version: Each column can have a configurable number of versions, and you
can access the data for a specific version of a column qualifier.

An individual row is accessible through its row key and is composed of one or more
column families. Each column family has one or more column qualifiers (called
"column" in the figure above) and each column can have one or more versions. To
access an individual piece of data, you need to know its row key, column family,
column qualifier, and version.
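For illustration, this is how the four coordinates appear when reading a cell from the
HBase shell. This is a minimal sketch; the 'student' table, 'info' column family, and
'fname' qualifier are hypothetical examples:

$ hbase shell
get 'student', 'row1', {COLUMN => 'info:fname', VERSIONS => 3}
get 'student', 'row1', {COLUMN => 'info:fname', TIMESTAMP => 1617187200000}

The first get returns up to three versions of the info:fname cell in row 'row1'; the
second returns the version written at a specific timestamp.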

When designing an HBase data model, it is helpful to think about how the data is
going to be accessed. You can access HBase data in two ways:
• Through their row key or via a table scan for a range of row keys
• In a batch manner using map-reduce

This dual approach to data access is something that makes HBase particularly
powerful. Typically, storing data in Hadoop means that it is good for offline or batch
analysis (and it is very, very good at batch analysis) but not necessarily for real-time
access. HBase addresses this by being both a key/value store for real-time analysis
and supporting map-reduce for batch analysis.



The tables in HBase are divided into regions, and within a region the data is divided
vertically by column family into stores; these regions are served by the region
servers. Typically, the HBase architecture has three main components: a master server
(HMaster), the region servers, and ZooKeeper.

Master Server or HMaster


The main task of the master server is the assignment of regions to the region servers
with the help of ZooKeeper, and controlling the load balancing of region servers. Load
balancing is a key feature of the master server, in which it unloads the busy servers
and moves regions to less occupied servers, maintaining the state of the cluster. The
master server is also responsible for schema changes and the creation of tables and
column families. Other responsibilities of the master server are managing and
monitoring the Hadoop cluster and operating supervision on it, failover handling and
DDL operations.

HBase has a huge, distributed environment where HMaster alone is not sufficient
to manage everything. So, what helps HMaster to manage this huge environment? That
is where ZooKeeper comes into the picture. Having understood how HMaster manages the
HBase environment, we will now see how ZooKeeper helps HMaster in managing it.



Zookeeper – The Coordinator
ZooKeeper is also an essential part of the HBase architecture, as it stands between
the client and HMaster. It is used as a distributed coordination service to recover
from any crashes of region servers by reassigning their regions to other region
servers which are fully functional. As it is the link between the client and the
master server, it maintains configuration information and state. Whenever there is a
request from a client to access the regions, its first contact point is ZooKeeper,
because the master and region servers are registered with ZooKeeper. It is also
referred to as the coordinator. ZooKeeper keeps information about all the region
servers, such as how many region servers are available and which data nodes each of
them holds. Moreover, it facilitates the tracking of server failures and network
partitions, maintenance of configuration information, and so on.

ZooKeeper also maintains the META server's path, which helps any client in searching
for any region. The client first has to check with the META server to find which
Region Server a region belongs to, and it gets the path of that Region Server. The
META table stores data in the form of keys and values: the key represents the start
key of a region and its id, whereas the value contains the path of the Region Server.



Region server
Regions are portions of tables spread across the region servers. The regions of the
region servers communicate with the client and handle data-related operations. A
region server contains memory stores and HFiles. Region servers can be seen as the
worker nodes which handle the CRUD operations from clients. A region server runs on
the HDFS data nodes and has four main components: block cache, MemStore, WAL and
HFile. A block cache is basically a read cache that works similarly to other cache
systems: frequently read data is stored here, and when the block cache is full, the
least recently used data is removed. In contrast to the block cache, the MemStore is
a write cache which stores new data that has not yet been written to disk. The WAL
(Write Ahead Log) stores new data that is not yet stored in permanent storage. HFiles
are the actual storage files which store the sorted key-values of rows.

• WAL: As we can conclude from the above image, Write Ahead Log (WAL) is a
file attached to every Region Server inside the distributed environment. The
WAL stores the new data that has not been persisted or committed to the
permanent storage. It is used in case of failure to recover the data sets.
• Block Cache: From the above image, it is clearly visible that Block Cache
resides in the top of Region Server. It stores the frequently read data in the
memory. If the data in BlockCache is least recently used, then that data is
removed from BlockCache.
• MemStore: It is the write cache. It stores all the incoming data before
committing it to the disk or permanent memory. There is one MemStore for
each column family in a region. As you can see in the image, there are multiple
MemStores for a region because each region contains multiple column
families. The data is sorted in lexicographical order before committing it to
the disk.
• HFile: From the above figure you can see that HFiles are stored on HDFS. Thus, they
store the actual cells on the disk. The MemStore commits its data to an HFile when
the size of the MemStore exceeds its configured threshold.



Since clients do not waste time retrieving the location of the Region Server from the
META server every time, this saves time and makes the search process faster. Now,
let us see how reading and writing take place in HBase, which components are
involved, and how they are involved.

HBase Read Mechanism


ZooKeeper stores the META table location. Whenever a client approaches HBase with a
read or write request, the following operations occur:
1. The client retrieves the location of the META table from the ZooKeeper.
2. The client then requests for the location of the Region Server of corresponding
row key from the META table to access it. The client caches this information
with the location of the META Table.
3. Then it will get the row location by requesting from the corresponding Region
Server.
For future references, the client uses its cache to retrieve the location of the META
table and the Region Server of previously read row keys. The client will not refer to
the META table again unless there is a miss because the region has been shifted or
moved. In that case, it will again request the META server and update its cache.

HBase Write Mechanism

The write mechanism goes through the following process sequentially

1. Whenever the client has a write request, the client writes the data to the WAL
(Write Ahead Log).
• The edits are then appended at the end of the WAL file.
• This WAL file is maintained in every Region Server and Region Server uses
it to recover data which is not committed to the disk.
2. Once data is written to the WAL, then it is copied to the MemStore.
3. Once the data is placed in MemStore, then client receives the acknowledgment.
4. When the MemStore reaches its threshold, it dumps or commits the data into an HFile.



HBase Shell
HBase contains a shell using which you can communicate with HBase. HBase uses
the Hadoop File System to store its data. It will have a master server and region
servers. The data storage will be in the form of regions (tables). These regions will
be split up and stored in region servers. The master server manages these region
servers, and all these tasks take place on HDFS. Given below are some of the
commands supported by HBase Shell.

General Commands
• status - Provides the status of HBase, for example, the number of servers.
• version - Provides the version of HBase being used.
• table_help - Provides help for table-reference commands.
• whoami - Provides information about the user.

Data Definition Language


These are the commands that operate on the tables in HBase.
• create - Creates a table.
• list - Lists all the tables in HBase.
• disable - Disables a table.
• is_disabled - Verifies whether a table is disabled.
• enable - Enables a table.
• is_enabled - Verifies whether a table is enabled.
• describe - Provides the description of a table.
• alter - Alters a table.
• exists - Verifies whether a table exists.
• drop - Drops a table from HBase.
• drop_all - Drops the tables matching the ‘regex’ given in the command.
• Java Admin API - In addition to the above commands, Java provides an Admin
API to achieve DDL functionality through programming.
HBaseAdmin and HTableDescriptor are the two important classes in the
org.apache.hadoop.hbase.client package that provide DDL functionality.
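As a usage illustration, a minimal HBase shell session exercising the DDL commands
above (the 'student' table and 'info' column family are hypothetical examples):

$ hbase shell
create 'student', 'info'      # table with one column family
list                          # shows 'student'
describe 'student'
is_enabled 'student'          # => true
disable 'student'
is_disabled 'student'         # => true
enable 'student'
disable 'student'             # a table must be disabled before it can be dropped
drop 'student'
exists 'student'              # verifies the table no longer exists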

Data Manipulation Language


• put - Puts a cell value at a specified column in a specified row in a particular
table.
• get - Fetches the contents of row or a cell.
• delete - Deletes a cell value in a table.
• deleteall - Deletes all the cells in a given row.
• scan - Scans and returns the table data.
• count - Counts and returns the number of rows in a table.
• truncate - Disables, drops, and recreates a specified table.
• Java client API - In addition to the above commands, Java provides a client API to
achieve DML functionality, CRUD (Create, Retrieve, Update, Delete) operations
and more through programming, under the org.apache.hadoop.hbase.client
package. HTable, Put and Get are the important classes in this package.
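Continuing the sketch, a few hypothetical DML operations on the same 'student' table
(assuming it has been re-created with the 'info' column family as above):

put 'student', 'row178', 'info:fname', 'Sujan'
put 'student', 'row178', 'info:lname', 'Ch'
get 'student', 'row178'                         # all cells of the row
get 'student', 'row178', {COLUMN => 'info:fname'}
scan 'student'                                  # full table scan
count 'student'                                 # number of rows
delete 'student', 'row178', 'info:lname'        # delete a single cell
deleteall 'student', 'row178'                   # delete the whole row
truncate 'student'                              # disable, drop and recreate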
Conclusion:
HBase is a distributed, column-oriented database built on top of the capabilities of
Hadoop. It is horizontally scalable, can share data with other systems, and is
currently used by many big ventures across the globe.

From this experiment, we came to know about the architecture and structure of
HBase, its current usage and its limitations. The world of data is growing at a rapid
pace and we will need some better solutions for handling and analyzing this data in
future.



Week – 12

Aim:
Documentation for loading data from RDBMS to HDFS by using SQOOP.

Description:
We know that Apache Flume is a data ingestion tool for unstructured sources, but
organizations store their operational data in relational databases. So, there was a
need for a tool which can import and export data from relational databases.
Therefore, Apache Sqoop was born. Sqoop can easily integrate with Hadoop and
dump structured data from relational databases into HDFS, complementing the power
of Hadoop.

Initially, Sqoop was developed and maintained by Cloudera. Later, on 23 July 2011,
it was incubated by Apache. In April 2012, the Sqoop project was promoted as
Apache’s top-level project.

Generally, applications interact with relational databases through an RDBMS, and this
makes relational databases one of the most important sources that generate Big
Data. Such data is stored in RDB servers in a relational structure. Here, Apache
Sqoop plays an important role in the Hadoop ecosystem, providing feasible
interaction between the relational database server and HDFS.

So, Apache Sqoop is a tool in Hadoop ecosystem which is designed to transfer data
between HDFS (Hadoop storage) and relational database servers like MySQL, Oracle
RDB, SQLite, Teradata, Netezza, Postgres etc. Apache Sqoop imports data from
relational databases to HDFS, and exports data from HDFS to relational databases. It
efficiently transfers bulk data between Hadoop and external data stores such as
enterprise data warehouses, relational databases, etc.

This is how Sqoop got its name – “SQL to Hadoop & Hadoop to SQL”



Why Sqoop?
For a Hadoop developer, the actual game starts after the data is loaded into HDFS.
They play around with this data to gain the various insights hidden in the data stored in HDFS.

So, for this analysis, the data residing in the relational database management
systems needs to be transferred to HDFS. The task of writing MapReduce code for
importing and exporting data from the relational database to HDFS is uninteresting
and tedious. This is where Apache Sqoop comes to the rescue and removes that pain. It
automates the process of importing and exporting the data.

Sqoop makes the life of developers easy by providing a CLI for importing and
exporting data. They just have to provide basic information like database
authentication, source, destination, operations, etc. Sqoop takes care of the
remaining part.

Sqoop internally converts the command into MapReduce tasks, which are then
executed over HDFS. It uses YARN framework to import and export the data, which
provides fault tolerance on top of parallelism.

Key Features of Sqoop


• Full Load: Apache Sqoop can load the whole table by a single command. You can
also load all the tables from a database using a single command.
• Incremental Load: Apache Sqoop also provides the facility of incremental load,
where you can load parts of a table whenever it is updated.
• Parallel import/export: Sqoop uses the YARN framework to import and export
data, which provides fault tolerance on top of parallelism.
• Import results of SQL query: We can also import the result returned by an
SQL query into HDFS (see the sketch after this list).
• Compression: We can compress the data using the deflate (gzip) algorithm with
the --compress argument, or by specifying the --compression-codec argument. We
can also load a compressed table into Apache Hive.
• Connectors for all major RDBMS Databases: Apache Sqoop provides
connectors for multiple RDBMS databases, covering almost the entire
circumference.
• Kerberos Security Integration: Kerberos is a computer network authentication
protocol which works based on ‘tickets’ to allow nodes communicating over a
non-secure network to prove their identity to one another in a secure manner.
Sqoop supports Kerberos authentication.
• Load data directly into HIVE/HBase: You can load data directly into Apache Hive
for analysis and dump your data in HBase, which is a NoSQL database.
• Support for Accumulo: You can also instruct Sqoop to import the table in
Accumulo rather than a directory in HDFS.
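For illustration, a minimal sketch that combines two of the features above, the
free-form SQL query import and compression. The connection details are reused from
the import examples later in this document, and the query itself is a hypothetical
example:

sqoop import
--connect jdbc:mysql://localhost/db
--username sujan
--password 12345
--query 'SELECT fname, id FROM student WHERE id > 150 AND $CONDITIONS'
-m 1
--target-dir querydir
--compress

With --query, the literal token $CONDITIONS must appear in the WHERE clause so that
Sqoop can substitute its split conditions, and a --target-dir must be given explicitly.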



Sqoop Architecture & Working

The import tool imports individual tables from RDBMS to HDFS. Each row in a table
is treated as a record in HDFS.

When we submit a Sqoop command, our main task gets divided into subtasks, which are
handled by individual map tasks internally. Each map task imports a part of the data
into the Hadoop ecosystem. Collectively, all map tasks import the whole data.

Export also works in a similar manner.

The export tool exports a set of files from HDFS back to an RDBMS. The files given
as input to Sqoop contain records, which are called rows in the table.

When we submit our Job, it is mapped into Map Tasks which brings the chunk of
data from HDFS. These chunks are exported to a structured data destination.
Combining all these exported chunks of data, we receive the whole data at the
destination, which in most of the cases is an RDBMS (MYSQL/Oracle/SQL Server).



In addition, a reduce phase is required only in the case of aggregations. Since Apache
Sqoop just imports and exports the data, it does not perform any aggregations.

The map job launches multiple mappers depending on the number defined by the user.
For a Sqoop import, each mapper task will be assigned a part of the data to be
imported. Sqoop distributes the input data among the mappers equally to get high
performance. Then each mapper creates a connection with the database using JDBC,
fetches the part of the data assigned by Sqoop, and writes it into HDFS, Hive or
HBase based on the arguments provided in the CLI.

Flume vs Sqoop
The major difference between Flume and Sqoop is that:
• Flume only ingests unstructured data or semi-structured data into HDFS.
• While Sqoop can import as well as export structured data from RDBMS or
Enterprise data warehouses to HDFS or vice versa.

Below is the comparison table between Sqoop and Flume.


Basic Nature
SQOOP: Sqoop works well with any RDBMS which has JDBC (Java Database
Connectivity), such as Oracle, MySQL, Teradata, etc.
FLUME: Flume works well for streaming data sources which are continuously
generating data, such as logs, JMS, directories, crash reports, etc.

Data Flow
SQOOP: Sqoop is specifically used for parallel data transfer. For this reason,
the output could be in multiple files.
FLUME: Flume is used for collecting and aggregating data because of its
distributed nature.

Event Driven
SQOOP: Sqoop is not driven by events.
FLUME: Flume is completely event driven.

Architecture
SQOOP: Sqoop follows a connector-based architecture, which means connectors
know how to connect to a different data source.
FLUME: Flume follows an agent-based architecture, where the code written in
it is known as an agent that is responsible for fetching data.

Usage
SQOOP: Used for copying data faster and then using it to generate analytical
outcomes.
FLUME: Used to pull data when companies want to analyze patterns, root
causes or sentiment using logs and social media.

Performance
SQOOP: It reduces excessive storage and processing loads by transferring them
to other systems and has fast performance.
FLUME: Flume is fault-tolerant, robust and has a tenable reliability mechanism
for failover and recovery.



Prerequisites:
• Sqoop should be installed.
• This process assumes that we have a MySQL instance up and running that can
reach our Hadoop cluster. The mysql.user table is configured to accept a user
connecting from the machine where we will be running Sqoop. For more info
visit http://dev.mysql.com/doc/refman//5.5/en/installing.html on installing
and configuring MySQL.
• The MySQL JDBC driver JAR file has been copied to $SQOOP_HOME/libs. It
can be downloaded from http://dev.mysql.com/downloads/connector/j/.

Importing data from RDBMS to HDFS


Instead of moving data between clusters, Sqoop was designed to move data from
and into relational databases using a JDBC driver to connect. Its functionality is
extensive.

Steps to transfer data from a MySQL table to an HDFS file:


1. Create a new database in the MySQL instance:

CREATE DATABASE db;

2. Create and load the student table:

USE db;

CREATE TABLE student(
fname VARCHAR(64),
lname VARCHAR(64),
id INT
);

INSERT INTO student VALUES('Sujan', 'Ch', 178);

3. Import the data from MySQL to HDFS:

Sqoop — IMPORT Command


The import command is used to import a table from a relational database into
HDFS. In our case, we are going to import tables from a MySQL database into
HDFS.

sqoop import
--connect jdbc:mysql://localhost/db
--username sujan
--password 12345
--table student

After the command is executed, we can check the HDFS Web UI to see where the data is imported.



Sqoop — IMPORT Command with target directory
We can also import the table into a specific directory in HDFS using the below
command:

sqoop import
--connect jdbc:mysql://localhost/db
--username sujan
--password 12345
--table student
--m 1
--target-dir mydir1

Sqoop imports data in parallel from most database sources. The -m property is used to
specify the number of mappers to be executed. We can specify the number of map tasks
(parallel processes) to use for the import by using the -m or --num-mappers argument.
Each of these arguments takes an integer value which corresponds to the degree of
parallelism to employ.

We can control the number of mappers independently of the number of files present in
the directory. Export performance also depends on the degree of parallelism. By
default, Sqoop will use four tasks in parallel for the export process. This may not
be optimal; we will need to experiment with our own setup. Additional tasks may offer
better concurrency, but if the database is already bottlenecked on updating indices,
invoking triggers, and so on, then additional load may decrease performance.

Sqoop — IMPORT Command with Where Clause


We can import a subset of a table using the 'where' clause in the Sqoop import
tool. It executes the corresponding SQL query on the respective database
server and stores the result in a target directory in HDFS. We can use the
following command to import data with a 'where' clause:

sqoop import
--connect jdbc:mysql://localhost/db
--username sujan
--password 12345
--table student
--m 1
--where "id > 175"
--target-dir mydir2



Sqoop — Incremental Import
Sqoop provides an incremental import mode which can be used to retrieve
only rows newer than some previously imported set of rows. Sqoop supports
two types of incremental imports: append and lastmodified. We can use the
--incremental argument to specify the type of incremental import to perform.

We should specify append mode when importing a table where new rows are
continually being added with increasing row id values. We specify the column
containing the row's id with --check-column. Sqoop imports rows where the
check column has a value greater than the one specified with --last-value.

An alternate table update strategy supported by Sqoop is called lastmodified mode.
You should use this when rows of the source table may be updated, and each such
update sets the value of a last-modified column to the current timestamp.

When running a subsequent import, you should specify --last-value in this way
to ensure you import only the new or updated data. This is handled
automatically by creating the incremental import as a saved job, which is the
preferred mechanism for performing a recurring incremental import (a sketch is
given below).
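As referenced above, a minimal sketch of a saved job for a recurring incremental
import (the job name student_incr is a hypothetical example; connection details are
reused from the commands in this section):

sqoop job --create student_incr -- import
--connect jdbc:mysql://localhost/db
--username sujan
--password 12345
--table student
--target-dir mydir2
--incremental append
--check-column id
--last-value 178

sqoop job --list                 # shows the saved jobs
sqoop job --exec student_incr    # each run updates the stored last-value automatically

For lastmodified mode, --incremental lastmodified would be used instead, with
--check-column pointing at a timestamp column (for example a hypothetical
last_updated column) and --last-value set to the timestamp of the previous import.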

First, we insert a new row which will then be reflected in HDFS.
INSERT INTO student VALUES('Supreet', 'V', 179);

sqoop import
--connect jdbc:mysql://localhost/db
--username sujan
--password 12345
--table student
--target-dir mydir2
--incremental append
--check-column id
--last-value 178

Sqoop — Import All Tables


We can import all the tables from the RDBMS database server into HDFS.
Each table's data is stored in a separate directory, and the directory name is the
same as the table name. It is mandatory that every table in that database has a
primary key field. The command for importing all the tables from a database is:

sqoop import-all-tables
--connect jdbc:mysql://localhost/db
--username sujan
--password 12345
Sqoop — List Databases
You can list the databases present in the relational database server using Sqoop. The
Sqoop list-databases tool parses and executes the 'SHOW DATABASES' query against
the database server. The command for listing databases is:

sqoop list-databases
--connect jdbc:mysql://localhost/
--username sujan
--password 12345

Sqoop — List Tables


We can also list the tables of a particular database in the MySQL database
server using Sqoop. The Sqoop list-tables tool parses and executes the 'SHOW
TABLES' query. The command for listing the tables of a database is:

sqoop list-tables
--connect jdbc:mysql://localhost/db
--username sujan
--password 12345

Exporting data from HDFS to RDBMS


We can also export data from HDFS to an RDBMS database. The target table
must exist in the target database. The data is stored as records in HDFS; these
records are read, parsed, and delimited with a user-specified delimiter. The
default operation is to insert all the records from the input files into the database
table using INSERT statements. In update mode, Sqoop generates UPDATE statements
that replace the existing records in the database (a sketch of an update-mode export
is given after the steps below).

Steps to transfer data from HDFS to a MySQL table:


1. Create an empty table into which we will export our data.

CREATE TABLE student2(


fname VARCHAR(64),
lname VARCHAR(64),
id int
);

2. Export data from HDFS to a relational database:

sqoop export
--connect jdbc:mysql://localhost/db
--username sujan
--password 12345
--table student2
--export-dir /user/sujan/db
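As mentioned in the description above, a minimal sketch of an update-mode export
(assuming the id column uniquely identifies rows in student2):

sqoop export
--connect jdbc:mysql://localhost/db
--username sujan
--password 12345
--table student2
--export-dir /user/sujan/db
--update-key id
--update-mode allowinsert

With --update-key, existing rows whose id matches are updated with UPDATE statements;
--update-mode allowinsert also inserts rows that do not yet exist, whereas the default
updateonly mode skips them.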
Sqoop — Codegen
In an object-oriented application, every database table has a Data Access Object
(DAO) class that contains 'getter' and 'setter' methods to initialize objects. Codegen
generates the DAO class automatically: it generates the DAO class in Java, based on
the table schema structure.

sqoop codegen
--connect jdbc:mysql://localhost/db
--username sujan
--password 12345
--table student

It creates a student.jar file which has all the details.

Conclusion:
Apache Sqoop supports bi-directional movement of data between any RDBMS and
HDFS, Hive or HBase, etc., but for structured data only. Sqoop automates most of this
process, relying on the database to describe the schema of the data to be imported.
Sqoop uses MapReduce to import and export the data, which provides parallel
operation as well as fault tolerance.

From this experiment, we came to know about the Sqoop, features of Sqoop, Sqoop
architecture and its working, flume vs Sqoop, commands of Sqoop, and import and
exporting data between RDBMS and HDFS using Sqoop.

Sqoop is essentially a secure and economical transport layer for data, and we can use
it efficiently and effectively almost everywhere. And as it is fast in processing,
many organizations want to run this technology at their own sites to get better
results.

