
Unit-1

1. What is Big Data? Discuss different challenges of conventional systems.


Answer
 Big Data refers to data sets whose size, variety, and speed of growth exceed the
capacity of conventional systems to capture, store, manage and process them. It is the
next generation of data warehousing and business analytics and is poised to deliver
top-line revenue cost-efficiently for enterprises.
Data Challenges
 Volume
o The volume of data, especially machine-generated data, is exploding and
growing faster every year, with new sources of data constantly emerging. For
example, in the year 2000, 800,000 petabytes (PB) of data were stored in the
world, and this was expected to reach 35 zettabytes (ZB) by 2020 (according
to IBM). The challenge is how to deal with the sheer size of Big Data.
 Variety, Combining Multiple Data Sets
o A lot of this data is unstructured, or has a complex structure that’s hard to
represent in rows and columns. Organizations want to be able to combine all
this data and analyze it together in new ways. Tasks such as A/B testing,
sessionization, bot detection, and pathing analysis all require powerful
analytics on many petabytes of semi-structured Web data. Data come from
sensors, smart devices, and social collaboration technologies. Data are not
only structured, but also raw, semi-structured and unstructured data from
web pages etc. The challenge is how to handle the multiplicity of types,
sources, and formats.
 Velocity
o As businesses get more value out of analytics, it creates a success problem:
they want the data available faster, or in other words, they want real-time
analytics. And they want more people to have access to it, or in other words,
high user volumes. One of the key challenges is how to react to the flood of
information in the time required by the application.
 Veracity, Data Quality, Data Availability
o There are several challenges: How can we cope with uncertainty,
imprecision, missing values, misstatements or untruths? How good is the
data? How broad is the coverage? How fine is the sampling resolution? How
timely are the readings? How well understood are the sampling biases? Is
there data available, at all?
 Quality and Relevance
o The challenge is determining the quality of data sets and relevance to
particular issues (i.e., the data set making some underlying assumption that
renders it biased or not informative for a particular question).
 Scale and complexity
o Managing large and rapidly increasing volumes of data is a challenging issue.
Traditional software tools are not enough for managing the increasing
volumes of data. Data analysis, organization, retrieval and modeling are also
challenges due to scalability and complexity of data that needs to be
analyzed.

 Process Challenges
o It can take significant exploration to find the right model for analysis, and the
ability to iterate very quickly and ‘fail fast’ through many (possibly throw-
away) models, at scale, is critical. Process challenges with deriving insights
include:
 Capturing data.
 Aligning data from different sources (e.g., resolving when two objects
are the same).
 Transforming the data into a form suitable for analysis.
 Modeling it, whether mathematically, or through some form of
simulation.
 Understanding the output, visualizing and sharing the results; think
for a second how to display complex analytics on an iPhone or another
mobile device.
 Management Challenges
o The main management challenges are
 Data privacy
 Security
 Governance
 Ethical
o The challenges are: ensuring that data are used correctly (abiding by their
intended uses and relevant laws), tracking how the data are used,
transformed and derived, and managing their lifecycle.

2. How can Big Data be represented on a platform? Discuss the traits of Big Data.
Answer
 Volume
o The quantity of data that is generated is very important in this context.
o It is the size of the data which determines the value and potential of the data
under consideration, and whether it can actually be considered Big Data or
not.
o The name ‘Big Data’ itself contains a term which is related to size, and hence
the characteristic.
 Variety
o The next aspect of Big Data is its variety.
o This means that the category to which Big Data belongs is also a very essential
fact that needs to be known by the data analysts.
o This helps the people who closely analyze the data, and are associated with it,
to effectively use the data to their advantage and thus uphold the importance
of Big Data.
 Velocity
o The term ‘Velocity’ in this context refers to the speed of generation of data or
how fast the data is generated and processed to meet the demands and the
challenges which lie ahead in the path of growth and development.

 Variability
o This is a factor which can be a problem for those who analyze the data.
o It refers to the inconsistency which can be shown by the data at times, thus
hampering the process of handling and managing the data effectively.
 Complexity
o Data management can become a very complex process, especially when large
volumes of data come from multiple sources.
o These data need to be linked, connected and correlated in order to be able to
grasp the information that is supposed to be conveyed by these data.
o This situation is therefore termed the ‘complexity’ of Big Data.

3. Write a short note on web data analysis and modern data analytics tools?
Answer
 Web analytics is the methodological study of online/offline patterns and trends.
 It is a technique that you can employ to collect, measure, report and analyze your
website data.
 It is normally carried out to analyze the performance of a website and optimize its web
usage.
 We use web analytics to track key metrics and analyze visitors’ activity and traffic flow.
 It is a tactical approach to collect data and generate reports.
 Web analytics is an ongoing process; the data it collects helps in attracting more
traffic to a site, thereby increasing the return on investment.
 Analytics tools offer an insight into the performance of your website, visitors’
behavior and data flow.
 These tools are inexpensive and easy to use; sometimes they are even free.
 These tools are basically used to generate reports on –
o Acquisition analysis
o Behavior analysis
o Conversion analysis

4. Discuss different types of analytics tools with their processes.


Answer
1. Google Analytics
2. Spring Metrics
3. Clicky
4. Mint
5. Woopra
6. Kissmetrics
 All the above tools have the same underlying process model in general.
 The primary objective of carrying out web analytics is to optimize the website in
order to provide a better user experience.
 They provide data-driven reports to measure visitor flow through the website.
 The following steps depict the process of web analytics:
o Set the business goals.
o To back the goal achievement, set the key performance indicators (KPI).
o Collect correct and suitable data.
o Analyze the data to extract insights.
o Based on assumptions learned from the data analysis, test alternatives.
o Based on either data analysis or website testing, implement insights.

5. What is regression modeling? Discuss the working of the regression process in detail.


Answer
Regression modeling
 Regression models are used to predict one variable from one or more other
variables.
 Regression models provide the scientist with a powerful tool, allowing prediction
about past, present or future events to be made with information about past or
present events.
 In order to construct a regression model, both the information which is going to be
used to make the prediction and the information which is to be predicted must be
obtained from a sample of objects or individuals.
 Regression models involve the following variables:
 The unknown parameters, denoted as beta (β), which may represent a scalar or a
vector.
 The independent variable, X.
 The dependent variable, Y.
 A regression model relates Y to a function of X and β: Y = f(X, β).
 The approximation is usually formalized as E(Y|X) = f(X, β).
 To carry out regression analysis, the form of the function f must be specified.
 Sometimes the form of this function is based on knowledge about the relationship
between Y and X that does not rely on the data.
 If no such knowledge is available, a flexible or convenient form for f is chosen.
 The goal in the regression model is to create a model where the predicted and
observed values of the variable to be predicted are as similar as possible.
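A minimal sketch of this process (assuming Python with NumPy; the toy sample and the linear form f(X, β) = β0 + β1·X are illustrative choices), where least squares makes the predicted values as similar as possible to the observed ones:

```python
import numpy as np

# Sample of objects: X is the predictor, y is the variable to be predicted.
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Choose a linear form f(X, beta) = beta0 + beta1 * X and estimate the
# unknown parameters beta by least squares.
A = np.column_stack([np.ones_like(X), X])       # design matrix [1, X]
beta, *_ = np.linalg.lstsq(A, y, rcond=None)

predicted = A @ beta
print("estimated beta:", beta)
print("observed: ", y)
print("predicted:", predicted)                  # goal: predicted ~ observed
```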

6. Define neural network. How can one help a neural network to learn and
generalize on a specific topic?
Answer
 A neural network is a network inspired by biological neural networks, which is
used to estimate or approximate functions that can depend on a large number of
inputs that are generally unknown.
 The principal reason why neural networks have attracted such interest is the
existence of learning algorithms for neural networks: algorithms that use data to
estimate the optimal weights in a network to perform some task.
 There are three basic approaches to learning in neural networks:
o Supervised: Learning uses a training set that consists of a set of pattern pairs:
an input pattern and the corresponding desired output pattern.
o Reinforcement: If a network aims to perform some task, then the
reinforcement signal is a simple yes or no at the end of the task to indicate
whether the task has been performed satisfactorily.
o Unsupervised: Learning only uses input data; there is no training signal.
o The role of neural network training is to identify this “mystery function”
given only the training data.
 What the training process does is to estimate the parameters of the function so that
it replicates the data as well as possible and generalizes well to new data.
 Generalization is a measure that tells us how well the network performs on the
actual problem once training is complete.
 This can be measured by looking at the performance of the network on evaluation
data unseen during the training process.
 Three principles apply to generalization:
o Good performance on the training data does not necessarily lead to good
generalization performance.
o Simple solutions are better than complex solutions.
o Larger networks require more training data.
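A minimal sketch of measuring generalization on evaluation data unseen during training (assuming Python with scikit-learn; the synthetic data set and the network size are illustrative choices):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Illustrative data set; in practice this would be the task's own data.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Hold out evaluation data unseen during the training process.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

net = MLPClassifier(hidden_layer_sizes=(20,), max_iter=2000, random_state=0)
net.fit(X_train, y_train)

# Good training performance does not guarantee good generalization:
# compare accuracy on the training data against the unseen test data.
print("training accuracy:      ", net.score(X_train, y_train))
print("generalization accuracy:", net.score(X_test, y_test))
```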

7. Write a short note on: supervised learning, unsupervised learning, principal
components analysis, and the perceptron.


Answer
Supervised learning:
 Learning uses a training set that consists of a set of pattern pairs: an input pattern
and the corresponding desired output pattern.
 The basic approach in supervised learning is for the network to compute the output
its current weights produce for a given input and to compare this network output
with the desired output.
 The aim of the learning algorithm is to adjust the weights so as to minimize the
difference between the network output and the desired output.
Unsupervised learning
 Learning only uses input data; there is no training signal.
 The aim of unsupervised learning is to make sense of some data set, for example by
clustering similar patterns together.
Principal components analysis (PCA)
 PCA uses a linear projection method to reduce the number of parameters.
 It performs an orthogonal projection to transform a set of correlated variables into a
new set of uncorrelated variables.
 It maps the data into a space of lower dimensionality.
 It is a form of unsupervised learning.
 It can be viewed as a rotation of the existing axes to new positions in the space
defined by the original variables.
 The new axes are orthogonal and represent the directions of maximum variability.
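A minimal sketch of PCA as an orthogonal projection onto new axes of maximum variability (assuming Python with NumPy; the correlated toy data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
# Two correlated variables (illustrative data).
x = rng.normal(size=200)
data = np.column_stack([x, 2.0 * x + rng.normal(scale=0.3, size=200)])

# Center the data and compute the covariance matrix.
centered = data - data.mean(axis=0)
cov = np.cov(centered, rowvar=False)

# Eigenvectors of the covariance matrix give the new orthogonal axes,
# ordered by the variance (eigenvalue) along each direction.
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Project onto the first principal component: 2-D -> 1-D.
reduced = centered @ eigvecs[:, :1]
print("variance along each new axis:", eigvals)
print("reduced shape:", reduced.shape)
```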
Perceptron
 A single-layer perceptron is a single-layer neural network.
 It works with continuous or binary inputs.
 It stores pattern pairs (Ak, Ck), where Ak = (a1k, …, ank) and Ck = (c1k, …, cnk) are
bipolar valued [-1, +1].
 It applies the perceptron error-correction procedure, which always converges.
 A perceptron is a classifier.
Algorithm
Assume we are given a data set X = {(x1, y1), …, (xl, yl)}, where xi ∈ R^n and yi ∈ {+1, -1}.
Assume X is linearly separable, i.e. there exist w and t such that
(w^T xi + t) yi > 0 for all i
f(net_i) = 1 if net_i > 0
f(net_i) = -1 otherwise
net_i = w^T xi
Starting with w(0) = 0, we follow the learning rule
w(t+1) = w(t) + η yi xi
where η is the learning rate.
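A minimal sketch of this learning rule (assuming Python with NumPy; the toy data, the folding of the threshold t into w, and the rate η = 1 are illustrative choices):

```python
import numpy as np

# Linearly separable data set {(x_i, y_i)}, with y in {+1, -1}.
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1, 1, -1, -1])
X = np.hstack([X, np.ones((len(X), 1))])   # fold the threshold t into w

w = np.zeros(X.shape[1])                   # start with w(0) = 0
eta = 1.0                                  # learning rate

for _ in range(100):                       # repeat until convergence
    errors = 0
    for xi, yi in zip(X, y):
        net = w @ xi
        f = 1 if net > 0 else -1
        if f != yi:                        # misclassified: apply the rule
            w = w + eta * yi * xi          # w(t+1) = w(t) + eta * y_i * x_i
            errors += 1
    if errors == 0:                        # converged: all points correct
        break

print("learned weights:", w)
```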

8. Define sampling with the help of an example.


Answer
 Data sampling is a statistical analysis technique used to select, manipulate and
analyze a representative subset of data points in order to identify patterns and
trends in the larger data set being examined.
 It is a process of choosing a representative sample from a larger population and
collecting data from that sample in order to understand something about the
population as a whole.
 Sampling allows data scientists, predictive modellers and other data analysts to
work with a small, manageable amount of data in order to build and run analytical
models more quickly, while still producing accurate findings.
 Sampling can be particularly useful with data sets that are too large to efficiently
analyze in full -- for example, in big data analytics applications.
 An important consideration, though, is the size of the required data sample.
 In some cases, a very small sample can tell all of the most important information
about a data set.
 In others, using a larger sample can increase the likelihood of accurately
representing the data as a whole, even though the increased size of the sample may
impede ease of manipulation and interpretation.
 Either way, samples are best drawn from data sets that are as large and close to
complete as possible.
 There are many different methods for drawing samples from data, and the ideal one
depends on the data set and situation.
 Sampling can be based on probability, an approach that uses random numbers that
correspond to points in the data set.
 This approach ensures that there is no correlation between points that are chosen
for the sample.
 Further variations in probability sampling include simple, stratified and systematic
random sampling and multi-stage cluster sampling.
 Once generated, a sample can be used for predictive analytics.
 For example, a retail business might use data sampling to uncover patterns about
customer behavior and predictive modeling to create more effective sales strategies.
Example
 A company samples individuals in a particular market niche to find out what they
need and what problems they want to solve.
 The results of the sample help the business address the needs of people in the
market niche.
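A minimal sketch of simple random (probability) sampling (assuming Python; the numeric "population" stands in for individual records and is illustrative):

```python
import random

random.seed(42)

# Illustrative population: stand-in for records about a market niche.
population = list(range(100_000))

# Simple random sampling: random choices ensure there is no correlation
# between the points chosen for the sample.
sample = random.sample(population, k=500)

# A small, manageable sample estimates a property of the whole population.
print("population mean:", sum(population) / len(population))
print("sample mean:    ", sum(sample) / len(sample))
```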

9. How is time series analysis helpful? Discuss linear system analysis with an
example.
Answer
Time series analysis
 Time series analysis comprises methods for analyzing time series data in order to
extract meaningful statistics and other characteristics of the data.
 Time series analysis can be applied to real-valued, continuous data, discrete
numeric data or discrete symbolic data.
 The usage of time series models is twofold: to obtain an understanding of the
underlying forces and structure that produced the observed data, and to fit a model
for forecasting, monitoring, or even feedback and feed-forward control.
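A minimal sketch of extracting a meaningful statistic, here a trend, from a time series (assuming Python with pandas; the synthetic monthly series and the 12-month window are illustrative choices):

```python
import numpy as np
import pandas as pd

# Synthetic monthly series: an upward trend plus noise (illustrative).
idx = pd.date_range("2020-01-01", periods=36, freq="MS")
values = np.linspace(10, 20, 36) + np.random.default_rng(0).normal(
    scale=1.0, size=36)
series = pd.Series(values, index=idx)

# A 12-month moving average smooths out the noise and exposes the
# underlying trend, one of the characteristics time series analysis extracts.
trend = series.rolling(window=12).mean()
print(trend.dropna().head())
```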
Linear system Analysis
 Linear system analysis is concerned with the study of equilibrium and change in
dynamical systems, that is, in systems that contain variables that may change with
time.
 To perform the analysis, relationships between these variables are described by a
set of equations known as a model.
 In order for linear system analysis to be applicable, the model must possess the
linearity property: it must be a linear model.
Example
If an input x1(t) produces the output y1(t), and an input x2(t) produces the output y2(t),
then a linear system maps the combined input A x1(t) + B x2(t) to the output
A y1(t) + B y2(t) (the superposition property).
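A minimal sketch verifying this superposition property (assuming Python with NumPy; the convolution filter stands in for a generic linear system and is an illustrative choice):

```python
import numpy as np

h = np.array([0.5, 0.3, 0.2])              # impulse response of the system

def system(x):
    """An illustrative linear system: convolution with h."""
    return np.convolve(x, h)

rng = np.random.default_rng(0)
x1, x2 = rng.normal(size=50), rng.normal(size=50)
A, B = 2.0, -3.0

# Linearity: the response to A*x1 + B*x2 equals A*y1 + B*y2.
lhs = system(A * x1 + B * x2)
rhs = A * system(x1) + B * system(x2)
print("superposition holds:", np.allclose(lhs, rhs))
```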

10. Why is Fuzzy Logic named as fuzzy? How can one extract data from a fuzzy
model?
Answer
 Fuzzy logic is a form of many-valued logic in which the truth values of variables may
be any real number between 0 and 1; such values are considered to be “fuzzy”.
 Fuzzy logic has been employed to handle the concept of partial truth, where the
truth value may range between completely true and completely false. Hence it is
termed fuzzy.
 A static/dynamic system which makes use of fuzzy sets is called a fuzzy system.
 These are called rule-based systems, also known as fuzzy models, as extracting fuzzy
rules from data allows relationships in the data to be modeled by ‘if-then’ rules that
are easy to understand, verify and extend.
Example
 If antecedent proposition, then consequent proposition.
 A classic example of extracting data from a fuzzy model is the ‘fuzzy decision tree’.
 Fuzzy decision trees combine fuzzy representation, and its approximate reasoning,
with a symbolic decision tree to give an approximate output, or to generate data
based on the conditions of the rules specified at the nodes of the tree.
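A minimal sketch of an "if antecedent then consequent" fuzzy rule with truth values between 0 and 1 (assuming Python; the membership function and the fan-speed rule are illustrative choices):

```python
def warm(temp_c):
    """Fuzzy membership: the degree to which a temperature is 'warm',
    a truth value in [0, 1] rather than strictly true or false."""
    return max(0.0, min(1.0, (temp_c - 15) / 10))   # 15 C -> 0, 25 C -> 1

def fan_speed(warmth):
    """Rule: IF temperature is warm THEN fan speed is high.
    The consequent is scaled by the partial truth of the antecedent."""
    return warmth * 100.0                           # percent of full speed

for t in (14, 18, 22, 26):
    w = warm(t)
    print(f"{t} C -> warm to degree {w:.2f} -> fan at {fan_speed(w):.0f}%")
```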

11. Why is resampling required? How does it help in big data analysis?


Answer
 Resampling techniques allow us to base the analysis of a study solely on the design of
that study, rather than on a poorly-fitting model.
 Resampling is a method that consists of drawing repeated samples from the
original data samples.
 The method of resampling is a non-parametric method of statistical inference.
 In other words, the method of resampling does not involve the utilization of
generic distribution tables in order to compute approximate probability values.
 Resampling is highly useful in big data because –
o Fewer assumptions
 E.g. resampling methods do not require that distributions be normal
or that sample sizes be large.
o Generality
 Resampling methods are remarkably similar for a wide range of
statistics and do not require a new formula for every statistic.
o Promotes understanding
 Bootstrap procedures build intuition by providing concrete analogies
to theoretical concepts.
o No parametric assumptions
 A collection of procedures to make statistical inference without
relying on parametric assumptions.
o Unbiased
 The method of sampling yields unbiased estimates, as it is based on
unbiased samples of all possible results of the data.
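A minimal sketch of a bootstrap resampling procedure (assuming Python with NumPy; the non-normal toy sample and the percentile interval are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
sample = rng.exponential(scale=2.0, size=50)   # non-normal data: no problem

# Bootstrap: draw repeated samples (with replacement) from the original
# sample and recompute the statistic each time. No distribution tables
# or normality assumptions are required.
boot_means = np.array([
    rng.choice(sample, size=len(sample), replace=True).mean()
    for _ in range(10_000)
])

# Percentile confidence interval for the mean from the bootstrap draws.
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"sample mean: {sample.mean():.3f}")
print(f"95% bootstrap CI: ({lo:.3f}, {hi:.3f})")
```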

12. What is a fuzzy decision tree? How does it help in getting data from a huge
database?
Answer
 A decision tree is a decision support tool that uses a tree-like graph or model of
decisions and their possible consequences, including chance event outcomes,
resource costs and utility.
 Fuzzy decision trees are more advanced in the sense that they model uncertainty
around the split values of the features, represented as soft instead of hard splits.
 A fuzzy decision tree combines fuzzy representation, and its approximate
reasoning, with a symbolic decision tree.
 As such, they provide for the handling of language-related uncertainty, noise, and
missing or faulty features with robust behavior, while also providing
comprehensible knowledge interpretation.
 Fuzzy decision trees help in forming flexible queries; for example, a set of fuzzy
rules can be associated with a database as a knowledge base that can be used to help
answer frequent queries.
 Fuzzy decision trees help in information retrieval and mining because of their
capability to represent miscellaneous data in a synthetic way, their robustness with
regard to changes in the parameters of the user environment, and their unique
expressiveness.
