Bigdata Assess1 PDF
Security: Most NoSQL big data platforms have weak security mechanisms (they lack proper authentication and authorization) when it comes to safeguarding big data. This gap cannot be ignored, given that big data carries credit card information, personal information, and other sensitive data.
Schema: Rigid schemas have no place. We want the technology to fit our big data, not the other way around. The need of the hour is the dynamic schema; static, pre-defined schemas are obsolete.
Continuous availability: The big question here is how to provide 24/7 support, because almost all RDBMS and NoSQL big data platforms have a certain amount of downtime built in.
Consistency: Should one opt for consistency or eventual consistency?
Partition tolerance: How does one build partition-tolerant systems that can handle both hardware and software failures?
Data quality: How does one maintain data quality (accuracy, completeness, timeliness, and so on)? Do we have the appropriate metadata in place?
These three groups must work together closely to solve complex Big
Data challenges.
Most organizations are familiar with people in the latter two groups, but the first group, Deep Analytical Talent, tends to be the newest role for most and the least understood.
Phases of Data Analytics Lifecycle
The seven roles follow:
Business User: Someone who understands the domain area and usually
benefits from the results. This person can consult and advise the project
team on the context of the project, the value of the results, and how the
outputs will be operationalized. Usually a business analyst, line manager,
or deep subject matter expert in the project domain fulfills this role.
Project Sponsor: Responsible for the genesis of the project. Provides the
impetus and requirements for the project and defines the core business
problem. Generally provides the funding and gauges the degree of value
from the final outputs of the working team. This person sets the priorities
for the project and clarifies the desired outputs.
Project Manager: Ensures that key milestones and objectives are met on
time and at the expected quality.
Business Intelligence Analyst: Provides business domain expertise based
on a deep understanding of the data, key performance indicators (KPIs),
key metrics, and business intelligence from a reporting perspective.
Business Intelligence Analysts generally create dashboards and reports and
have knowledge of the data feeds and sources.
Database Administrator (DBA): Provisions and configures the database
environment to support the analytics needs of the working team. These
responsibilities may include providing access to key databases or tables
and ensuring the appropriate security levels are in place related to the data
repositories.
Data Engineer: Leverages deep technical skills to assist with tuning SQL
queries for data management and data extraction, and provides support for
data ingestion into the analytic sandbox, which was discussed in Chapter 1,
“Introduction to Big Data Analytics.” Whereas the DBA sets up and
configures the databases to be used, the data engineer executes the actual
data extractions and performs substantial data manipulation to facilitate the
analytics. The data engineer works closely with the data scientist to help
shape data in the right ways for analyses.
Data Scientist: Provides subject matter expertise for analytical techniques,
data modeling, and applying valid analytical techniques to given business
problems. Ensures overall analytics objectives are met. Designs and
executes analytical methods and approaches with the data available to the
project.
Although most of these roles are not new, the last two roles—data engineer and
data scientist—have become popular and in high demand [2] as interest in Big
Data has grown.
The accuracy (or the overall success rate) is a metric defining the
rate at which a model has classified the records correctly. It is defined as
the sum of TP and TN divided by the total number of instances, as shown
in Equation 7-18.
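Equation 7-18 itself did not survive extraction; the standard confusion-matrix definition, consistent with the description above (TP, TN, FP, and FN denote the true positive, true negative, false positive, and false negative counts), is:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$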
A good model should have a high accuracy score, but a high accuracy score alone does not guarantee that the model performs well. The true positive rate (TPR) shows the percentage of positive instances that the classifier correctly identified, as shown in Equation 7-19.
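Equation 7-19 is likewise missing from the extract. The standard definitions of the TPR, together with the false positive rate (FPR) and false negative rate (FNR) referenced in the next paragraph, are:

$$TPR = \frac{TP}{TP + FN} \qquad FPR = \frac{FP}{FP + TN} \qquad FNR = \frac{FN}{TP + FN}$$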
A well-performing model should have a TPR close to the ideal value of 1 and an FPR and FNR close to the ideal value of 0. In some cases, a model with a TPR of 0.95 and an FPR of 0.3 is more acceptable than a model with a TPR of 0.9 and an FPR of 0.1, even if the second model is more accurate overall. Precision is the percentage of instances marked positive that really are positive, as shown in Equation 7-22.
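Equation 7-22 corresponds to the standard precision formula:

$$\text{Precision} = \frac{TP}{TP + FP}$$

A minimal Python sketch tying these metrics together follows; the function name and the example counts are illustrative assumptions, not taken from the source.

def classification_metrics(tp, tn, fp, fn):
    # Compute the metrics discussed above from confusion-matrix counts.
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,   # Equation 7-18
        "tpr": tp / (tp + fn),           # true positive rate, Equation 7-19
        "fpr": fp / (fp + tn),           # false positive rate
        "fnr": fn / (tp + fn),           # false negative rate
        "precision": tp / (tp + fp),     # Equation 7-22
    }

# Illustrative counts (not from the source)
print(classification_metrics(tp=95, tn=900, fp=30, fn=5))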
The goal is to find the line that best approximates the relationship
between the outcome variable and the input variables. With OLS, the
objective is to find the line through these points that minimizes the sum of
the squares of the difference between each point and the line in the vertical
direction. In other words, find the values of β0 and β1 such that the summation shown below is minimized (a small numeric sketch follows the variable definitions).
$$\sum_{i=1}^{n} \left[ y_i - (\beta_0 + \beta_1 x_i) \right]^2$$
Figure 4.2: Scatterplot of y versus x with vertical distance from the
observed points to a fitted line
where:
y is the outcome variable
xj are the input variables, for j=1,2,...,p-1
β0 is the value of y when each xj equals zero
βj is the change in y based on a unit change in xj for j=1,2,...,p-1
ε ~ N(0, σ²) and the ε terms are independent of each other
This normality assumption on the error terms shapes the distribution of the outcome variable, y, for a given value of x.
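A minimal numeric sketch of the OLS fit described above, assuming illustrative x and y values (not from the source) and the standard closed-form estimates for simple linear regression:

import numpy as np

# Illustrative sample data (assumed for this sketch)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Closed-form OLS estimates: beta1 = cov(x, y) / var(x), beta0 = mean(y) - beta1 * mean(x)
beta1 = np.cov(x, y, bias=True)[0, 1] / np.var(x)
beta0 = y.mean() - beta1 * x.mean()

# Sum of squared vertical distances that OLS minimizes
sse = np.sum((y - (beta0 + beta1 * x)) ** 2)
print(beta0, beta1, sse)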
Engineering: Based on operating conditions and various diagnostic measurements, determine the probability of a mechanical part experiencing a malfunction or failure. With this probability estimate, schedule the appropriate preventive maintenance activity.