Big Data Exercises
1. What are the three characteristics of Big Data, and what are the main considerations in processing Big Data?
Big Data is characterized by Volume, Variety, and Velocity, each of which presents its own challenges.
Volume — Growing well beyond terabytes, big data can entail billions of rows and millions of columns. Data of
this size cannot efficiently be accommodated by traditional infrastructure or RDBMS.
Variety — Data that comes in many forms, not just well-structured tables with rows and columns. Some
unstructured data examples include: video files, audio files, XML, and free text. Traditional RDBMS provide little
support for these data types.
Velocity — Data that is collected and analyzed in real time. Often this data is time-sensitive, and its
value diminishes with time; in-memory data grids may be required to accommodate its real-time nature.
The main considerations in processing Big Data are how to store and analyze the data cost-effectively and
efficiently. New tools and technologies (e.g., Hadoop) are often necessary to accomplish these goals.
Chapter 3
Review of Basic Data Analytics Methods Using R
2. Two vectors, v1 and v2, are created with the following R code:
v1 <- 1:5
v2 <- 6:2
What are the results of cbind(v1,v2) and rbind(v1,v2)?
cbind() is used to combine variables column-wise; rbind() is used to combine datasets row-wise. Either
call merges the two vectors into a matrix.
(Reference: https://bookdown.org/ndphillips/YaRrr/creating-matrices-and-dataframes.html)

cbind(v1, v2) produces a 5 x 2 matrix:

     v1 v2
[1,]  1  6
[2,]  2  5
[3,]  3  4
[4,]  4  3
[5,]  5  2

rbind(v1, v2) produces a 2 x 5 matrix:

   [,1] [,2] [,3] [,4] [,5]
v1    1    2    3    4    5
v2    6    5    4    3    2
3. What R command(s) would you use to remove null values from a dataset?
na.omit(data) — returns the data frame, matrix, or vector with the rows (or elements) containing NA removed.
is.na() — provides a test for missing values.
na.exclude() — likewise returns the object with incomplete cases removed.
Reference: https://statisticsglobe.com/na-omit-r-example/
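A minimal sketch of these commands on a small made-up data frame (the values are invented purely for illustration):

```r
# Hypothetical data frame with missing values in both columns.
df <- data.frame(x = c(1, 2, NA, 4),
                 y = c("a", NA, "c", "d"))

is.na(df$x)          # flags the missing entries in column x: FALSE FALSE TRUE FALSE
complete <- na.omit(df)  # drops every row that contains an NA
nrow(complete)       # 2 rows survive (rows 1 and 4)
```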
4. What R command can be used to install an additional R package?
install.packages() installs an additional package, which can then be loaded into the session with library().

The rug() function creates a one-dimensional density plot along the bottom of a graph to emphasize the
distribution of the observations.
7. How many sections does a box-and-whisker plot divide the data into? What are these sections?
A box-and-whisker plot divides the data into four sections, each containing roughly 25% of the data. The
"box" shows the range containing the central 50% of the data, and the line inside the box marks the
median. The lower and upper hinges of the box correspond to the first and third quartiles of the data.
The upper whisker extends from the hinge to the highest value within 1.5 * IQR of the hinge, and the
lower whisker extends from the hinge to the lowest value within 1.5 * IQR of the hinge. Points outside
the whiskers are considered possible outliers.
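These statistics can be inspected directly with boxplot.stats(); a small sketch on an invented vector containing one extreme value:

```r
# Made-up sample with one extreme value (100) to trigger the outlier rule.
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100)

s <- boxplot.stats(x)
s$stats  # lower whisker, first quartile, median, third quartile, upper whisker
s$out    # points beyond 1.5 * IQR of the hinges -> flagged as possible outliers (here: 100)
```

Note that the upper whisker stops at 9, the highest observation within 1.5 * IQR of the upper hinge, and 100 is reported separately as an outlier.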
8. What attributes are correlated according to Figure 3-18? How would you describe their relationships?
According to the scatterplot matrix, within certain species there is a high correlation between:
• Sepal.Length and Sepal.Width (setosa)
• Sepal.Length and Petal.Length (versicolor and virginica)
• Sepal.Width and Petal.Length (versicolor)
• Sepal.Width and Petal.Width (versicolor)
• Petal.Width and Petal.Length (versicolor and virginica)
These relationships are approximately linear. The correlations can be quantified using the cor() function.
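A short sketch of that check on the built-in iris data set (the same data behind Figure 3-18), subsetting to one species before computing correlations:

```r
# Per-species correlations: iris ships with base R, so this runs as-is.
setosa <- subset(iris, Species == "setosa")

cor(setosa$Sepal.Length, setosa$Sepal.Width)  # about 0.74: the setosa pairing noted above
cor(setosa[, 1:4])                            # full correlation matrix for the four attributes
```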
If the data is skewed and positive, viewing the logarithm of data can help detect structures that might
otherwise be overlooked in a graph with a non-logarithmic scale.
11. What is a type I error? What is a type II error? Is one always more serious than the other? Why?
A Type I error is the rejection of the null hypothesis when the null hypothesis is true.
A Type II error is the acceptance of the null hypothesis when the null hypothesis is false.
Neither error is necessarily more serious than the other; it depends on the cost of each mistake in
context. Given the underlying assumptions, the Type I error rate can be fixed up front, before any data
is collected. For a given deviation from the null hypothesis, the Type II error rate can be reduced by
using a large enough sample size.
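The trade-off between sample size and Type II error can be illustrated with power.t.test() from the stats package (power = 1 minus the Type II error rate; the effect size of 0.5 standard deviations below is an arbitrary example):

```r
# Sample size per group needed to detect a 0.5-SD shift with a 5% Type I
# error rate and a 20% Type II error rate (power = 0.8): about 64.
power.t.test(delta = 0.5, sd = 1, sig.level = 0.05, power = 0.8)$n

# Fixing n = 100 per group instead raises the power (lowers Type II error).
power.t.test(n = 100, delta = 0.5, sd = 1, sig.level = 0.05)$power
```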
12. Suppose everyone who visits a retail website gets one promotional offer or no promotion at all. We
want to see if making a promotional offer makes a difference. What statistical method would you
recommend for this analysis?
Let's assume the objective is to compare whether a person receiving an offer spends more than someone
who does not receive one. If normality of the purchase amount distribution is a reasonable assumption,
Student's t-test could be used. Otherwise, a non-parametric test such as the Wilcoxon rank-sum test
could be applied.
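A minimal sketch of both tests on simulated purchase amounts (the means, spread, and sample sizes are invented for illustration):

```r
# Hypothetical spend data: the offer group is simulated with a higher mean.
set.seed(42)
offer    <- rnorm(200, mean = 55, sd = 10)  # spend of visitors shown the promotion
no_offer <- rnorm(200, mean = 50, sd = 10)  # spend of visitors shown no promotion

t.test(offer, no_offer)$p.value       # parametric comparison of the two means
wilcox.test(offer, no_offer)$p.value  # non-parametric rank-sum alternative
```

With either test, a p-value below the chosen significance level would suggest the promotion makes a difference in spend.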
13. You are analyzing two normally distributed populations, and your null hypothesis is that the mean μ1
of the first population is equal to the mean μ2 of the second. Assume the significance level is set at
0.05. If the observed p-value is 4.33e-05, what will be your decision regarding the null hypothesis?
The p-value of 4.33e-05 (0.0000433) is less than 0.05; therefore, the decision is to reject the null hypothesis.
14. A local retailer has a database that stores 10,000 transactions from last summer. After analyzing the
data, a data science team has identified the following statistics:
• {battery} appears in 6,000 transactions.
• {sunscreen} appears in 5,000 transactions.
• {sandals} appears in 4,000 transactions.
• {bowls} appears in 2,000 transactions.
• {battery,sunscreen} appears in 1,500 transactions.
• {battery,sandals} appears in 1,000 transactions.
• {battery,bowls} appears in 250 transactions.
• {battery,sunscreen,sandals} appears in 600 transactions.
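The sub-questions for this exercise are not reproduced in this excerpt, but with these counts the usual task is computing support, confidence, and lift for the listed itemsets; a sketch of that arithmetic in R (the rules chosen below are just examples):

```r
# Support, confidence, and lift from the transaction counts above.
n <- 10000

support_battery_sunscreen <- 1500 / n    # support({battery, sunscreen}) = 0.15

# confidence(X -> Y) = support(X and Y) / support(X)
conf_battery_to_sunscreen <- 1500 / 6000  # 0.25
conf_sunscreen_to_battery <- 1500 / 5000  # 0.30

# Lift compares the rule against independence of the two items.
lift_battery_sunscreen <- (1500 / n) / ((6000 / n) * (5000 / n))  # 0.5
```

A lift below 1, as here, would indicate the two items appear together less often than independence would predict.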
15.
(b) You are to cluster eight points: x1 = (2, 10), x2 = (2, 5), x3 = (8, 4), x4 = (5, 8), x5 = (7, 5),
x6 = (6, 4), x7 = (1, 2), and x8 = (4, 9). Suppose you assign x1, x4, and x7 as the initial cluster
centers for k-means clustering (k = 3). Using k-means, compute the three clusters for each round of the
algorithm until convergence.
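A hand trace can be checked against R's kmeans() using Lloyd's algorithm with the stated initial centers; under that setup the algorithm converges to the clusters {x1, x4, x8}, {x3, x5, x6}, and {x2, x7}:

```r
# The eight points, one per row, in the order x1..x8.
pts <- rbind(c(2, 10), c(2, 5), c(8, 4), c(5, 8),
             c(7, 5), c(6, 4), c(1, 2), c(4, 9))
init <- pts[c(1, 4, 7), ]  # x1, x4, x7 as the initial centers

fit <- kmeans(pts, centers = init, algorithm = "Lloyd")
fit$cluster  # converged assignment: {x1,x4,x8}, {x3,x5,x6}, {x2,x7}
fit$centers  # final centers: (3.67, 9), (7, 4.33), (1.5, 3.5)
```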
16.
(c) Trace the results of using the Apriori algorithm on the grocery store example with support threshold
s = 33.34% and confidence threshold c = 60%. Show the candidate and frequent itemsets for each database
scan. Enumerate all the final frequent itemsets. Also indicate the association rules that are generated,
highlight the strong ones, and sort them by confidence.
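The grocery transactions themselves are not reproduced in this excerpt, so the full trace cannot be shown here; the support and confidence bookkeeping behind each scan can, however, be sketched in base R on a hypothetical stand-in transaction list:

```r
# Hypothetical transactions standing in for the grocery store example.
trans <- list(c("milk", "bread"),
              c("milk", "bread", "butter"),
              c("bread", "butter"),
              c("milk", "butter"),
              c("milk", "bread", "butter"),
              c("bread"))

# Fraction of transactions containing every item in the itemset.
support <- function(items) {
  mean(sapply(trans, function(t) all(items %in% t)))
}

# confidence(lhs -> rhs) = support(lhs and rhs) / support(lhs)
confidence <- function(lhs, rhs) {
  support(c(lhs, rhs)) / support(lhs)
}

support(c("milk", "bread"))  # 3 of 6 baskets -> 0.5, above s = 33.34%
confidence("milk", "bread")  # 0.75, above c = 60%, so the rule is strong
```

In practice the arules package's apriori() function automates the candidate generation and database scans that this exercise asks you to trace by hand.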