BD Unit1


UNIT-1

Introduction to Big Data

Data – a collection of facts (numbers, words, measurements, observations, etc.)
that has been translated into a form that computers can process.

Data is of great importance nowadays:

1) Data helps you make better decisions

 Finding new customers
 Increasing customer retention
 Improving customer service
 Better managing marketing efforts
 Tracking social media interaction
 Predicting sales trends

2) Data helps you solve problems

Example: After experiencing a slow sales month or watching a poor-performing marketing
campaign, how do you pinpoint what went wrong? Tracking and reviewing data from
business processes helps you uncover performance breakdowns, so you can better
understand each part of the process and know which steps need to be fixed and which
are performing well.

3) Data helps you understand performance


One example: Say you have a top-performing sales rep to whom you send most leads.
However, when you delve into the data, it shows that she closes deals at a lower rate
than another of your sales reps who receives fewer leads but closes deals at a higher
percentage. This is performance data that can affect how you portion out leads, which
can lead to a revenue increase. Performance data provides the clarity needed for better
results.

4) Data helps you improve processes


Data helps you understand and improve business processes so you can reduce wasted
money and time. Every company feels the effects of waste. It depletes resources,
squanders time, and ultimately impacts the bottom line.

For example, bad advertising decisions can be one of the greatest wastes of resources
in a company. With data showing how different marketing channels are performing,
however, you can see which ones offer the greatest ROI and focus on those. Or you
could dig into why other channels are not performing as well and work to improve
their performance. This would allow your budget to generate more leads without
having to increase the advertising spend.
5) Data helps you understand consumers

What’s Big Data?


Big data is the term for a collection of data sets so large and complex that it
becomes difficult to process using on-hand database management tools or
traditional data processing applications.
The challenges include capture, curation, storage, search, sharing, transfer, analysis,
and visualization.
 The trend to larger data sets is due to the additional information derivable from
analysis of a single large set of related data, as compared to separate smaller
sets with the same total amount of data, allowing correlations to be found to
"spot business trends, determine quality of research, prevent diseases, link legal
citations, combat crime, and determine real-time roadway traffic conditions.”

TYPES OF DIGITAL DATA

1. Structured data –
Structured data is data whose elements are addressable for effective analysis.
It has been organized into a formatted repository, typically a database. It
covers all data that can be stored in an SQL database in tables with rows and
columns. Such data has relational keys and can easily be mapped into
pre-designed fields. Today, structured data is the most processed kind of data
and the simplest way to manage information. Example: relational data.
2. Semi-structured data –
Semi-structured data is information that does not reside in a relational
database but has some organizational properties that make it easier to
analyze. With some processing, it can be stored in a relational database
(though this can be very hard for some kinds of semi-structured data).
Example: XML data.
3. Unstructured data –
Unstructured data is data that is not organized in a pre-defined manner or
does not have a pre-defined data model, so it is not a good fit for a
mainstream relational database. For unstructured data there are alternative
platforms for storing and managing it; it is increasingly prevalent in IT
systems and is used by organizations in a variety of business intelligence
and analytics applications. Examples: Word documents, PDFs, text, media logs.
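
To make the three types concrete, here is a minimal Python sketch (all records and field names are invented for illustration) that handles the same customer fact as a structured SQL row, a semi-structured XML document, and an unstructured text note:

```python
# A minimal sketch contrasting the three types of digital data; the records
# below are invented for illustration.
import sqlite3
import xml.etree.ElementTree as ET

# Structured: fixed schema of rows and columns (relational data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT, city TEXT)")
conn.execute("INSERT INTO customers VALUES (1, 'Asha', 'Pune')")
print("structured:", conn.execute(
    "SELECT name, city FROM customers WHERE id = 1").fetchone())

# Semi-structured: no fixed table, but tags give organizational properties.
root = ET.fromstring(
    "<customer id='1'><name>Asha</name><city>Pune</city></customer>")
print("semi-structured:", root.find("name").text, root.find("city").text)

# Unstructured: free text with no data model; without further processing,
# crude keyword search is about all that works.
note = "Asha called from Pune asking about her last invoice."
print("unstructured:", "invoice" in note)
```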

History of Big Data innovation


Big Data as an Opportunity

5 V’S/Characteristics of Big Data

Velocity
Velocity refers to the speed at which the data is generated, collected and
analyzed. Data continuously flows through multiple channels such as computer
systems, networks, social media, mobile phones etc. In today’s data-driven
business environment, the pace at which data grows can be best described as
‘torrential’ and ‘unprecedented’. Now, this data should also be captured as close
to real-time as possible, making the right data available at the right time. The
speed at which data can be accessed has a direct impact on making timely and
accurate business decisions. Even a limited amount of data that is available in
real-time yields better business results than a large volume of data that needs a
long time to capture and analyze.

Several Big data technologies today allow us to capture and analyze data as it is
being generated in real-time.

Volume
Big data volume defines the ‘amount’ of data that is produced. The value of data is
also dependent on the size of the data.

Today data is generated from various sources in different formats – structured
and unstructured. Some of these data formats include Word and Excel documents,
PDFs and reports, along with media content such as images and videos. Due to
the data explosion caused by digital and social media, data is being produced
in such large chunks that it has become challenging for enterprises to store
and process it using conventional methods of business intelligence and analytics.
Enterprises must implement modern business intelligence tools to effectively
capture, store and process such unprecedented amounts of data in real-time.

Value
Although data is being produced in large volumes today, just collecting it is of no
use. Instead, data from which business insights are garnered add ‘value’ to the
company. In the context of big data, value amounts to how worthy the data is of
positively impacting a company’s business. This is where big data analytics come
into the picture. While many companies have invested in establishing data
aggregation and storage infrastructure in
their organizations, they fail to understand that the aggregation of data doesn’t
equal value addition. What you do with the collected data is what matters. With the
help of advanced data analytics, useful insights can be derived from the collected
data. These insights, in turn, are what add value to the decision-making process.

One way to ensure that the value of big data is considerable and worth investing
time and effort into is by conducting a cost vs. benefit analysis. By calculating the
total cost of processing big data and comparing it with the ROI that the business
insights are expected to generate, companies can effectively decide whether or not
big data analytics will actually add any value to their business.

Variety
While the volume and velocity of data are important factors that add value to a
business, big data also entails processing diverse data types collected from varied
data sources. Data sources may involve external sources as well as internal
business units. Generally, big data is classified as structured, semi-structured and
unstructured data. While structured data is one whose format, length and volume
are clearly defined, semi-structured data is one that may partially conform to a
specific data format. On the other hand, unstructured data is unorganized data and
doesn't conform to traditional data formats. Data generated via digital and
social media (images, videos, tweets, etc.) can be classified as unstructured data.

The sheer volume of data that organizations usually collect and generate may look
chaotic and unstructured. In fact, almost 80 percent of data produced globally
including photos, videos, mobile data, and social media content, is unstructured in
nature.

Veracity/Validity
The Veracity of big data or Validity, as it is more commonly known, is the
assurance of quality or credibility of the collected data. Can you trust the data that
you have collected? Is this data credible enough to glean insights from? Should
we be basing our business decisions on the insights garnered from this data? All
these questions and more, are answered when the veracity of the data is known.
Since big data is vast and involves so many data sources, there is the possibility
that not all collected data will be of good quality or accurate in nature. Hence,
when processing big data sets, it is important that the validity of the data is
checked before proceeding with processing.
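
As a small illustration, a few basic validity checks can be scripted before a data set enters processing; this sketch assumes pandas is installed and a hypothetical orders.csv with an 'amount' column:

```python
# A minimal data-validity sketch, assuming pandas and a hypothetical
# orders.csv with an 'amount' column.
import pandas as pd

df = pd.read_csv("orders.csv")

missing_share = df.isna().mean()        # fraction of missing values per column
duplicate_rows = df.duplicated().sum()  # exact duplicate records
bad_amounts = (df["amount"] < 0).sum()  # domain rule: amounts must be >= 0

print(missing_share)
print(duplicate_rows, "duplicates,", bad_amounts, "negative amounts")

# Proceed to analysis only if the data passes the basic quality gates.
assert duplicate_rows == 0 and bad_amounts == 0, "data failed validity checks"
```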

Types of Big Data Technologies:

Big Data Technology is mainly classified into two types:

1. Operational Big Data Technologies
2. Analytical Big Data Technologies
Operational Big Data is all about the normal day-to-day data that we
generate. This could be online transactions, social media, or the data
from a particular organization, etc. You can even consider this to be a kind of
raw data which is used to feed the Analytical Big Data Technologies.

A few examples of Operational Big Data Technologies are as follows:

 Online ticket bookings, which include your rail tickets, flight tickets, movie
tickets, etc.
 Online shopping on Amazon, Flipkart, Walmart, Snapdeal and many more.
 Data from social media sites like Facebook, Instagram, WhatsApp and a lot more.
 The employee details of any multinational company.

So, with this, let us move on to the Analytical Big Data Technologies.
Analytical Big Data is like the advanced version of Big Data Technologies. It is a
little more complex than Operational Big Data. In short, Analytical Big Data is where
the actual performance part comes into the picture and the crucial real-time
business decisions are made by analyzing the Operational Big Data.

A few examples of Analytical Big Data Technologies are as follows:

 Stock market analysis.
 Carrying out space missions, where every single bit of information is crucial.
 Weather forecast information.
 Medical fields, where a particular patient's health status can be monitored.

TYPES OF BIG DATA ANALYTICS


At different stages of business analytics, huge amounts of data are processed at various
steps. Depending on the stage of the workflow and the requirements of data analysis,
there are four main kinds of analytics – descriptive, diagnostic, predictive and
prescriptive. These four types together answer everything a company needs to know –
from what's going on in the company to what solutions should be adopted for optimising
its functions.
The four types of analytics are usually implemented in stages, and no one type of
analytics is said to be better than the others. They are interrelated, and each of them
offers a different insight. With data being important to so many diverse sectors – from
manufacturing to energy grids – most companies rely on one or all of these types
of analytics. With the right choice of analytical techniques, big data can deliver richer
insights for companies.
Before diving deeper into each of these, let’s define the four types of analytics:
1) Descriptive Analytics: Describing or summarising the existing data using existing
business intelligence tools to better understand what is going on or what has
happened.
2) Diagnostic Analytics: Focuses on past performance to determine what happened and
why. The result of the analysis is often an analytic dashboard.
3) Predictive Analytics: Emphasizes predicting the possible outcome using
statistical models and machine learning techniques.
4) Prescriptive Analytics: A type of predictive analytics that is used to
recommend one or more courses of action based on analyzing the data.
Let’s understand these in a bit more depth.
1. Descriptive Analytics
This can be termed the simplest form of analytics. The mighty size of big data is
beyond human comprehension, and the first stage hence involves crunching the data
into understandable chunks. The purpose of this type of analytics is just to summarise
the findings and understand what is going on.
Among some frequently used terms, what people call advanced analytics or
business intelligence is basically the usage of descriptive statistics (arithmetic
operations, mean, median, max, percentage, etc.) on existing data. It is said that 80% of business
analytics mainly involves descriptions based on aggregations of past performance. It
is an important step to make raw data understandable to investors, shareholders and
managers. This way it gets easy to identify and address the areas of strengths and
weaknesses such that it can help in strategizing.
The two main techniques involved are data aggregation and data mining; note that
this method is purely used for understanding the underlying behavior and not to make
any estimations. By mining historical data, companies can analyze consumer
behaviors and engagements with their businesses that could be helpful in targeted
marketing, service improvement, etc. The tools used in this phase are MS Excel,
MATLAB, SPSS, STATA, etc.
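
As a toy illustration of descriptive analytics, the sketch below summarises past sales with simple aggregations; pandas is assumed, and a sales.csv with 'region' and 'revenue' columns is hypothetical:

```python
# A minimal descriptive-analytics sketch, assuming pandas and a hypothetical
# sales.csv with 'region' and 'revenue' columns.
import pandas as pd

sales = pd.read_csv("sales.csv")

# Summarize what happened: count, mean, percentiles, max.
print(sales["revenue"].describe())

# Aggregate past performance by region to spot strengths and weaknesses.
by_region = sales.groupby("region")["revenue"].agg(["sum", "mean", "median"])
print(by_region.sort_values("sum", ascending=False))
```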

2. Diagnostic Analytics
Diagnostic analytics is used to determine why something happened in the past. It is
characterized by techniques such as drill-down, data discovery, data mining and
correlations. Diagnostic analytics takes a deeper look at data to understand the root
causes of the events. It is helpful in determining what factors and events contributed
to the outcome. It mostly uses probabilities, likelihoods, and the distribution of
outcomes for the analysis.
In time series data of sales, diagnostic analytics would help you understand why the
sales decreased or increased in a specific year. However, this type of
analytics has a limited ability to give actionable insights. It just provides an
understanding of causal relationships and sequences while looking backward.
A few techniques that use diagnostic analytics include attribute importance, principal
component analysis, sensitivity analysis, and conjoint analysis. Training algorithms
for classification and regression also fall into this type of analytics.
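
A simple first step when diagnosing why a metric moved is to drill into correlations; this sketch assumes pandas and a hypothetical monthly.csv with a few candidate factors:

```python
# A minimal diagnostic-analytics sketch, assuming pandas and a hypothetical
# monthly.csv with 'sales', 'ad_spend', 'discount' and 'returns' columns.
import pandas as pd

monthly = pd.read_csv("monthly.csv")

# Look backward: which factors move together with sales?
print(monthly[["sales", "ad_spend", "discount", "returns"]].corr()["sales"])

# Drill down into the worst month to inspect potential root causes.
print("worst month:\n", monthly.loc[monthly["sales"].idxmin()])
```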

3. Predictive Analytics
As mentioned above, predictive analytics is used to predict future outcomes.
However, it is important to note that it cannot tell for certain whether an event will
occur in the future; it merely forecasts the probability of the event occurring. A
predictive model builds on the preliminary descriptive analytics stage to derive the
possibility of the outcomes.
The essence of predictive analytics is to devise models such that the existing data is
understood well enough to extrapolate future occurrences or, simply, to predict future
data. One of the common applications of predictive analytics is found in sentiment
analysis, where all the opinions posted on social media are collected and analyzed
(existing text data) to predict the person's sentiment on a particular subject as
being positive, negative or neutral (future prediction).
Hence, predictive analytics includes building and validation of models that provide
accurate predictions. Predictive analytics relies on machine learning algorithms like
random forests, SVM, etc. and statistics for learning and testing the data. Usually,
companies need trained data scientists and machine learning experts for building
these models. The most popular tools for predictive analytics include Python, R,
RapidMiner, etc.
The prediction of future data relies on the existing data as it cannot be obtained
otherwise. If the model is properly tuned, it can be used to support complex forecasts
in sales and marketing. It goes a step ahead of the standard BI in giving accurate
predictions.
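
To make the sentiment-analysis example concrete, here is a minimal scikit-learn sketch; the tiny training set is invented for illustration, and a real model would need far more labeled data:

```python
# A minimal predictive-analytics sketch using scikit-learn; the training
# posts below are invented for illustration only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

posts = ["love this product", "terrible service", "great support team",
         "worst purchase ever", "really happy with it", "very disappointed"]
labels = ["positive", "negative", "positive",
          "negative", "positive", "negative"]

# Vectorize the text, then fit a classifier: build and validate on existing
# data, predict on unseen data.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(posts, labels)

print(model.predict(["happy with the service"]))        # forecast, not certainty
print(model.predict_proba(["happy with the service"]))  # class probabilities
```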

4. Prescriptive Analytics
The basis of this analytics is predictive analytics, but it goes beyond the three
types mentioned above to suggest future solutions. It can suggest all favorable
outcomes according to a specified course of action, and can also suggest various
courses of action to get to a particular outcome. Hence, it uses a strong feedback
system that constantly learns and updates the relationship between the action and
the outcome.
The computations include optimisation of some functions that are related to the
desired outcome. For example, while calling for a cab online, the application uses
GPS to connect you to the correct driver from among a number of drivers found
nearby. Hence, it optimises the distance for faster arrival time. Recommendation
engines also use prescriptive analytics.
The other approach includes simulation, where all the key performance areas are
combined to design the correct solution, making sure that the key performance
metrics are included in the solution. The optimisation model will further work on the
impact of the previously made forecasts. Because of its power to suggest favorable
solutions, prescriptive analytics is the final frontier of advanced analytics or data
science, in today’s term.
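
The cab-dispatch example can be sketched as a tiny optimisation: among nearby drivers, prescribe the one that minimises pickup distance. All coordinates below are invented, and Euclidean distance stands in for real road or ETA costs:

```python
# A minimal prescriptive-analytics sketch: recommend the action (driver
# assignment) that optimizes a function (pickup distance). Data is invented.
import math

rider = (18.52, 73.85)  # hypothetical GPS position of the rider
drivers = {"D1": (18.55, 73.80), "D2": (18.53, 73.86), "D3": (18.40, 73.95)}

def distance(a, b):
    # Euclidean distance as a stand-in for real road/ETA costs.
    return math.dist(a, b)

# Prescribe the action: dispatch the driver with the smallest pickup distance.
best = min(drivers, key=lambda d: distance(rider, drivers[d]))
print("dispatch", best, "at distance", round(distance(rider, drivers[best]), 4))
```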


Big data architecture:


A big data architecture is designed to handle the ingestion, processing,
and analysis of data that is too large or complex for traditional database
systems.

Big data solutions typically involve one or more of the following types
of workload:
 Batch processing of big data sources at rest.
 Real-time processing of big data in motion.
 Interactive exploration of big data.
 Predictive analytics and machine learning.
Big Data Architecture components

1. Data sources:
All big data solutions start with one or more data sources. Examples
include:
 Application data stores, such as relational databases.
 Static files produced by applications, such as web server log files.
 Real-time data sources, such as IoT devices.
2. Data storage:
 Data for batch processing operations is typically stored in a distributed file
store that can hold high volumes of large files in various formats. This kind of store
is often called a data lake.
 Options for implementing this storage include Azure Data Lake Store or blob
(object) containers in Azure Storage.
3. Batch processing:
 Because the data sets are so large, often a big data solution must process data
files using long-running batch jobs to filter, aggregate, and otherwise prepare
the data for analysis.
 Usually these jobs involve reading source files, processing them,
and writing the output to new files.
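
As a minimal illustration of the read-process-write shape of such a job, this pure-Python sketch assumes a hypothetical logs/ directory of space-separated web-server log lines ('timestamp status page'):

```python
# A minimal batch-processing sketch: read source log files, filter and
# aggregate, write the prepared output. Paths and log format are hypothetical.
import csv
from collections import Counter
from pathlib import Path

hits_per_page = Counter()

# Long-running batch jobs typically scan many large files at rest.
for log_file in Path("logs").glob("*.log"):
    for line in log_file.read_text().splitlines():
        ts, status, page = line.split(" ", 2)
        if status == "200":               # filter: successful requests only
            hits_per_page[page] += 1      # aggregate: hits per page

# Write the output to a new file for downstream analysis.
with open("page_hits.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["page", "hits"])
    writer.writerows(hits_per_page.most_common())
```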
4. Real-time message ingestion
 If the solution includes real-time sources, the architecture must include a way
to capture and store real-time messages for stream processing.
 This might be a simple data store, where incoming messages are dropped into a
folder for processing.
 However, many solutions need a message ingestion store to act as a buffer for
messages, and to support scale-out processing, reliable delivery, and other
message queuing semantics. Options include Azure Event Hubs, Azure IoT
Hub, and Kafka.
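
For illustration, consuming from such an ingestion buffer looks roughly like the sketch below, which assumes the kafka-python package, a broker on localhost:9092, and a hypothetical 'sensor-events' topic:

```python
# A minimal ingestion sketch, assuming the kafka-python package, a broker on
# localhost:9092, and a hypothetical 'sensor-events' topic.
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "sensor-events",
    bootstrap_servers="localhost:9092",
    group_id="ingest-demo",        # consumer groups enable scale-out processing
    auto_offset_reset="earliest",
)

# The broker acts as a buffer: messages wait here until consumers read them.
for message in consumer:
    print(message.topic, message.partition, message.offset, message.value)
```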
5. Stream processing:
 After capturing real-time messages, the solution must process them
by filtering, aggregating, and otherwise preparing the data for
analysis.
 The processed stream data is then written to an output sink.
 EXAMPLE: Azure Stream Analytics provides a managed stream
processing service based on perpetually running SQL queries that
operate on unbounded streams.
 You can also use open source Apache streaming technologies like Storm
and Spark Streaming in an HDInsight cluster.
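
Engine aside, stream processing means filtering and aggregating an unbounded sequence as it arrives; this pure-Python sketch simulates the idea with an invented sensor stream standing in for a real source:

```python
# A minimal stream-processing sketch: filter and aggregate an unbounded
# stream of events. The generator stands in for a real message source.
import itertools
import random
from collections import Counter

def event_stream():
    # Simulated unbounded source; a real system would read from a broker.
    while True:
        yield {"sensor": random.choice(["A", "B", "C"]),
               "temp": random.uniform(10, 40)}

running_counts = Counter()

# Process the first 1000 events: filter hot readings, keep a running aggregate.
for event in itertools.islice(event_stream(), 1000):
    if event["temp"] > 30:                    # filter
        running_counts[event["sensor"]] += 1  # aggregate

print("hot readings per sensor:", dict(running_counts))  # the output sink
```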
6. Analytical data store
 Many big data solutions prepare data for analysis and then serve the
processed data in a structured format that can be queried using
analytical tools. The analytical data store used to serve these queries
can be a relational data warehouse, as seen in most traditional business
intelligence (BI) solutions.
 The goal of most big data solutions is to provide insights into the data
through analysis and reporting.
 To empower users to analyze the data, the architecture may include a
data modeling layer, such as a multidimensional OLAP cube or tabular
data model in Azure Analysis Services.
7. Orchestration
 Most big data solutions consist of repeated data processing
operations, encapsulated in workflows, that transform source data,
move data between multiple sources and sinks, load the processed
data into an analytical data store, or push the results straight to a
report or dashboard
 Orchestration is the automated configuration, management, and
coordination of computer systems, applications, and
services. Orchestration helps IT to more easily manage complex
tasks and workflows.
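
For illustration, such a workflow is often expressed as a dependency graph of tasks; this sketch assumes Apache Airflow 2.x, and the task bodies are hypothetical placeholders:

```python
# A minimal orchestration sketch, assuming Apache Airflow 2.x is installed;
# the task bodies are hypothetical placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("read source data")

def transform():
    print("prepare data for analysis")

def load():
    print("load into the analytical data store")

with DAG(dag_id="daily_pipeline",
         start_date=datetime(2024, 1, 1),
         schedule="@daily",        # Airflow <2.4 uses schedule_interval instead
         catchup=False) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_transform = PythonOperator(task_id="transform", python_callable=transform)
    t_load = PythonOperator(task_id="load", python_callable=load)

    # Encode the workflow: extract, then transform, then load.
    t_extract >> t_transform >> t_load
```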

Main Components Of Big Data


Having discussed above what big data is, we now move on to the main components of big data.

1. Machine Learning
It is the science of making computers learn things by themselves. In machine learning,
a computer is expected to use algorithms and statistical models to perform specific
tasks without any explicit instructions. Machine learning applications provide results
based on past experience. For example, these days there are some mobile applications
that will give you a summary of your finances and bills, will remind you of your bill
payments, and may also give you suggestions to go for some saving plans. These
functions are performed by reading your emails and text messages.
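
As a toy version of the finance-app example, this sketch trains a classifier to flag bill-related messages from labeled examples; scikit-learn is assumed, and the messages are invented:

```python
# A minimal machine-learning sketch in the spirit of the finance-app example;
# the messages are invented and scikit-learn is assumed to be installed.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

messages = ["your electricity bill is due Friday",
            "invoice #123 payment reminder",
            "lunch on Saturday?",
            "credit card statement ready",
            "movie tonight at 8?",
            "water bill overdue notice"]
labels = ["bill", "bill", "personal", "bill", "personal", "bill"]

# No explicit rules: the model learns patterns from labeled past messages.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(messages, labels)

print(model.predict(["reminder: phone bill due tomorrow"]))  # -> ['bill']
```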


2. Natural Language Processing (NLP)
It is the ability of a computer to understand human language as it is spoken. The most
obvious examples that people can relate to these days are Google Home and Amazon
Alexa. Both use NLP and other technologies to give us a virtual assistant experience.
NLP is all around us without us even realizing it. When writing an email, any
mistakes we make are automatically corrected, and these days we get auto-suggestions
for completing our sentences; we are also automatically alerted when we try to send an
email without the attachment that we referenced in the text of the email. These are
Natural Language Processing applications running in the backend.
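
As a toy illustration of the missing-attachment check, the sketch below uses simple keyword matching; real mail clients use far richer NLP, and everything here is invented:

```python
# A toy sketch of the missing-attachment warning described above; real mail
# clients use far richer NLP than this keyword matching.
def mentions_attachment(body: str) -> bool:
    keywords = ("attached", "attachment", "enclosed", "see the file")
    return any(word in body.lower() for word in keywords)

draft = "Hi team, please find attached the Q3 report."
has_attachment = False  # hypothetical flag supplied by the mail client

if mentions_attachment(draft) and not has_attachment:
    print("Warning: the text mentions an attachment, but none is attached.")
```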

3. Business Intelligence
Business Intelligence (BI) is a technology-driven method or process for gaining
insights by analyzing data and presenting it in a way that end-users (usually
high-level executives like managers and corporate leaders) can draw actionable
insights from it and make informed business decisions.

4. Cloud Computing
We can define cloud computing as the delivery of computing services – servers,
storage, databases, networking, software, analytics, intelligence, and more – over
the Internet ("the cloud") to offer faster innovation, flexible resources, and
economies of scale.
Advantages and Disadvantages

Advantages:
 Better decision-making
 Increased productivity
 Reduced costs
 Improved customer service

Disadvantages:
 Data quality: the quality of the data needs to be good and well arranged before
proceeding with big data analytics.
 Hardware needs: the storage space needed for housing the data, and the
networking bandwidth to transfer it to and from analytics systems, are all
expensive to purchase and maintain in a Big Data environment.
 Cybersecurity risks: storing sensitive data in large amounts can make companies
a more attractive target for cyber attackers, who can use the data for ransom or
other wrongful purposes.
 Hiccups in integrating with legacy systems: many old enterprises that have been
in business for a long time have stored data in different applications and systems
across different architectures and environments. This creates problems in
integrating outdated data sources and moving data, which further adds to the time
and expense of working with big data.

Why is Big Data Important?


The importance of big data does not revolve around how much data a company has
but how a company utilises the collected data. Every company uses data in its own
way; the more efficiently a company uses its data, the more potential it has to grow.
The company can take data from any source and analyse it to find answers which will
enable:

1. Cost Savings : Some tools of Big Data like Hadoop and Cloud-Based Analytics
can bring cost advantages to business when large amounts of data are to be stored
and these tools also help in identifying more efficient ways of doing business.
2. Time Reductions: The high speed of tools like Hadoop and in-memory analytics
can easily identify new sources of data, which helps businesses analyze data
immediately and make quick decisions based on the learnings.
3. Understand the market conditions : By analyzing big data you can get a better
understanding of current market conditions. For example, by analyzing customers’
purchasing behaviors, a company can find out the products that are sold the most
and produce products according to this trend. By this, it can get ahead of its
competitors.
4. Control online reputation: Big data tools can do sentiment analysis. Therefore,
you can get feedback about who is saying what about your company. If you want
to monitor and improve the online presence of your business, then, big data tools
can help in all this.
5. Using Big Data Analytics to Boost Customer Acquisition and Retention
The customer is the most important asset any business depends on. There is no
single business that can claim success without first having to establish a solid
customer base. However, even with a customer base, a business cannot afford to
disregard the high competition it faces. If a business is slow to learn what
customers are looking for, then it is very easy to begin offering poor quality
products. In the end, loss of clientele will result, and this creates an adverse overall
effect on business success. The use of big data allows businesses to observe
various customer-related patterns and trends. Observing customer behaviour is
important to trigger loyalty.
6. Using Big Data Analytics to Solve Advertisers' Problems and Offer Marketing
Insights
Big data analytics can help change all business operations. This includes the ability
to match customer expectations, change the company's product line and, of course,
ensure that the marketing campaigns are powerful.
7. Big Data Analytics As a Driver of Innovations and Product Development
Another huge advantage of big data is the ability to help companies innovate and
redevelop their products.

Best Examples Of Big Data


The best examples of big data can be found in both the public and private sector:
from targeted advertising, education, and the already mentioned massive industries
(healthcare, insurance, manufacturing or banking) to real-life scenarios in guest
services or entertainment. With an estimated 1.7 megabytes of data generated every
second for every person on the planet by the year 2020, the potential for data-driven
organizational growth in the hospitality sector is enormous.
Big data can serve to deliver benefits in some surprising areas.
Big Data in Education industry
Following are some of the fields in the education industry that have been transformed
by big-data-motivated changes:

 Customized and dynamic learning programs
 Reframing course material
 Grading systems
 Career prediction

Big Data in Insurance industry


The insurance industry holds importance not only for individuals but also for business
companies. The reason insurance holds a significant place is that it supports
people during times of adversity and uncertainty. The data collected from these
sources is of varying formats and changes at tremendous speed.
Collecting information
As big data refers to gathering data from disparate sources, this feature creates a
crucial use case for the insurance industry to pounce on.
Determining customer experience and making customers the center of a company's
attention is of prime importance to organizations.
Fraud detection
Insurance fraud is a common occurrence. The big data use case for reducing fraud is
highly effective.
Threat mapping
When an insurance agency sells insurance, it wants to be aware of all the
possibilities of things going unfavourably for its customer that would make them
file a claim.
Big Data features – security, compliance, auditing and protection
 The sheer size of Big Data brings with it a major security challenge. Proper
security entails more than keeping the bad guys out; it also means backing up
data and protecting data from corruption.
■ Data access: data can be fully protected if you eliminate access to it, but that is
not pragmatic, so we opt to control access.
■ Data availability: controlling where the data is stored and how it is distributed;
more control positions you better to protect the data.
■ Performance: encryption and other measures can improve security, but they carry a
processing burden that can severely affect system performance.
 Liability: accessible data carries liability with it, such as the sensitivity of the
data, the legal requirements connected to data privacy issues, and IP
concerns.
 Adequate security becomes a strategic balancing act among the above
concerns. With planning, logic, and observation, security becomes
manageable: effectively protecting data while allowing access to the
authorized users and systems.
Pragmatic Steps to Securing Big Data:
 First, get rid of data that is no longer needed. If it is not possible to destroy it,
the information should be securely archived and kept offline.
 A real challenge is deciding which data is needed, as value can be found in
unexpected places. For example, activity logs represent a risk, but logs can be
used to determine the scale, use, and efficiency of big data analytics.
> There is no easy answer to this question, and it becomes a case of choosing
the lesser of two evils.
■ Classifying Data:
> Protecting data is much easier if data is classified into categories; e.g., internal
email between colleagues is different from a financial report, etc.

> A simple classification can be: financial, HR, sales, inventory, and communications.

> Once organizations better understand their data, they can take important steps to
segregate the information, and that makes it easier to employ security measures like
encryption and monitoring.
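
As a toy illustration of this step, a classification can drive which security measures apply to each record; the categories and the policy table below are hypothetical, not a real framework:

```python
# A minimal sketch of classification-driven security policy; the categories
# and the policy table are hypothetical examples.
POLICIES = {
    "financial":      {"encrypt": True,  "monitor": True},
    "hr":             {"encrypt": True,  "monitor": True},
    "communications": {"encrypt": False, "monitor": True},
    "sales":          {"encrypt": False, "monitor": False},
}

def protections_for(record_category: str) -> dict:
    # Unknown or unclassified data defaults to the strictest handling.
    return POLICIES.get(record_category, {"encrypt": True, "monitor": True})

for category in ("financial", "communications", "unknown"):
    print(category, "->", protections_for(category))
```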
 Protecting Big Data Analytics:
 A real concern with Big Data is the fact that Big Data contains all of the things
you don't want to see when you are trying to protect data: very unique sample sets,
etc.

Such uniqueness also means that you can't leverage time-saving backup and
security technologies such as deduplication.

 A significant issue is the large size and number of files involved in a Big Data
analytics environment. Backup bandwidth and/or the backup appliance must
be large, and the receiving devices must be able to ingest data at the delivery
rate of the data.

 Big Data and Compliance:


 Compliance has major effect on how Big Data is protected, stored, accessed,
and archived.
 Big Data is not easily handled by RDBMS; this means it is harder to
understand how compliance affects the data.
 Big Data is transforming the storage and access paradigm to a new world of
horizontally scaling, unstructured databases, which are more suited to solve old
business problems with analytics.
 New data types and methodologies are still expected to meet the legislative
requirements imposed by compliance laws.
 Preventing compliance from becoming the next Big Data nightmare is going to
be the job of security professionals.
 Health care is a good example of the Big Data compliance challenge, i.e.,
different data types and a vast rate of data from different devices, etc.
 NoSQL is evolving as the new data management approach to unstructured
data. There is no need for federating multiple RDBMSs; a clustered single
NoSQL database can be deployed in the cloud.
 Unfortunately, most data stores in the NoSQL world (e.g., Hadoop, Cassandra
and MongoDB) do not incorporate sufficient data security tools to provide
what is needed.
 Big Data has changed a few things: for example, network security developers
spent a great deal of time and money on perimeter-based security mechanisms
(e.g., firewalls), but those cannot prevent unauthorized access to data once a
criminal/hacker has entered the network.
 POINTS TO REMEMBER:
 Secure the data at the data store level
 Protect the cryptographic keys and store them separately from the data
 Create trusted applications and stacks to protect data from rogue users
 Once you begin to map and understand the data,
opportunities will become evident that will lead to automating and monitoring
security and compliance.
 Of course automation does not solve every problem; there are still basic rules
to be used to enable security while not derailing the value of Big Data:
→ Ensure that security does not obstruct performance or availability
→ Pick the right encryption scheme, i.e., file, document, column, etc.
→ Ensure that the security solution can evolve with your changing requirements
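
To illustrate the rules on encryption and key handling, here is a sketch that assumes the third-party cryptography package; in practice the key would live in a key-management service, stored separately from the data:

```python
# A minimal encryption sketch, assuming the 'cryptography' package is
# installed; in production the key belongs in a key-management service.
from cryptography.fernet import Fernet

key = Fernet.generate_key()          # store this separately from the data!
cipher = Fernet(key)

record = b"salary: 90000; employee: Asha"  # invented sensitive record
token = cipher.encrypt(record)       # data at rest stays unreadable
print(token[:16], b"...")

# Only holders of the key can recover the plaintext.
print(cipher.decrypt(token))
```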
The Intellectual Property (IP) Challenge:
 One of the biggest issues with Big Data is the concept of IP.
 IP refers to creations of the human mind, such as inventions, literary and
artistic works, and symbols, names, images used in commerce.
Some basic rules are:
→ Understand what IP is and know what you have to protect
→ Prioritize protection
→ Label (confidential information should be labeled)
→ Educate employees
 Know your tools: there are tools that can be used to track IP stores
 Use a holistic approach: include internal risks as well as external ones.
 Use a counterintelligence mind-set: think as if you were
spying on your own company and ask how you would do it.
The above guidelines can be applied to almost any information security paradigm
that is geared toward protecting IP.

Big Data Platform


 A Big Data Platform is an integrated IT solution for Big Data management which
combines several software systems, software tools and hardware to provide an
easy-to-use system to enterprises. It is a single one-stop solution for all Big
Data needs of an enterprise, irrespective of size and data volume.
 A Big Data Platform is an enterprise-class IT solution for developing, deploying
and managing Big Data. There are several open source and commercial Big
Data Platforms in the market with varied features which can be used in a Big
Data environment.
 They can be divided into three categories based on their heritage technology:
1. Relational databases
2. Hadoop distributions
3. Cloud managed services

Hadoop distributions:
 Big data platforms based on Hadoop are market newcomers that have appeared
within the past several years.
 The primary vendors in this space (MapR, Hortonworks, and Cloudera) run
Hadoop as their core data processing platform, which they supplement with a
wealth of open source software and, in some cases, proprietary software.
Cloud managed services:
 This category includes pure-play cloud service providers that manage and
operate big data platforms on behalf of subscribers in the cloud.
 More than a platform-as-a-service, a cloud managed service lets customers
focus solely on analyzing data and building data-driven applications rather than
data infrastructure.
 In addition, cloud managed services provide a quick and easy way for
customers without information technology experts or available servers to try
out or deploy a big data platform.
 Leading cloud managed service providers include Altiscale, Qubole,
Treasure Data, Cazena, and Amazon Web Services (AWS).
Challenges of Conventional Systems
The challenges when dealing with Big Data lie in three dimensions:
 data,
 process,
 and management.
Data Challenge
Volume
 The volume of data, especially machine-generated data, is exploding, and it
grows faster every year as new sources of data emerge.
 For example, in the year 2000, 800,000 petabytes (PB) of data were stored
in the world, and this is expected to reach 35 zettabytes (ZB) by 2020
(according to IBM).
 Social media plays a key role: Twitter generates 7+ terabytes (TB) of data
every day. Facebook, 10 TB.
 Mobile devices play a key role as well: there were an estimated 6 billion
mobile phones in 2011.
 The challenge is how to deal with the size of Big Data.
PROCESSING
Variety, Combining Multiple Data Sets
 More than 80% of today's information is unstructured and it is typically
too big to manage effectively.
 Today, companies are looking to leverage a lot more data from a wider
variety of sources, both inside and outside the organization:
things like documents, contracts, machine data, sensor data, social media,
health records, emails, etc. The list is really endless.
MANAGEMENT
A lot of this data is unstructured, or has a complex structure that is hard to
represent in rows and columns.

Modern Analytical tools


The growing demand for and importance of data analytics in the market have
generated many openings worldwide. It becomes slightly tough to shortlist the
top data analytics tools, as the open source tools are more popular, user-friendly
and performance-oriented than the paid versions. There are many open source tools
which don't require much (or any) coding and manage to deliver better results than
paid versions, e.g. R programming in data mining, and Tableau Public and Python in
data visualization. Below is a list of the top 10 data analytics tools, both open
source and paid, based on their popularity, learning curve and performance.
1. R Programming

R is the leading analytics tool in the industry and is widely used for statistics and
data modeling. It can easily manipulate your data and present it in different ways. It
has exceeded SAS in many ways, such as capacity of data, performance and outcome.
R compiles and runs on a wide variety of platforms, viz. UNIX, Windows and
macOS. It has 11,556 packages and allows you to browse the packages by
category. R also provides tools to automatically install all packages as per user
requirements, and it also works well with Big Data.
2. Tableau Public:

Tableau Public is free software that connects to any data source, be it a corporate
Data Warehouse, Microsoft Excel or web-based data, and creates data
visualizations, maps, dashboards etc. with real-time updates presented on the web.
They can also be shared through social media or with the client. It allows
access to download the file in different formats. To see the power of Tableau, you
must have a very good data source. Tableau's Big Data capabilities make it important,
and one can analyze and visualize data better than with any other data visualization
software in the market.
3.Python

Python is an object-oriented scripting language which is easy to read, write and
maintain, and it is a free open source tool. It was developed by Guido van Rossum
in the late 1980s and supports both functional and structured programming
methods.
Python is easy to learn as it is very similar to JavaScript, Ruby, and PHP. Also,
Python has very good machine learning libraries, viz. scikit-learn, Theano,
TensorFlow and Keras. Another important feature of Python is that it can work
with almost any platform, like a SQL Server database, a MongoDB database or JSON.
Python can also handle text data very well.
4. SAS:
SAS is a programming environment and language for data manipulation and a
leader in analytics, developed by the SAS Institute in 1966 and further developed
in the 1980s and 1990s. SAS is easily accessible and manageable and can analyze
data from any source. SAS introduced a large set of products in 2011 for customer
intelligence, and numerous SAS modules for web, social media and marketing
analytics are widely used for profiling customers and prospects. It can also
predict their behavior and manage and optimize communications.
5. Apache Spark

The University of California, Berkeley's AMP Lab developed Apache Spark in 2009.
Apache Spark is a fast, large-scale data processing engine that executes
applications in Hadoop clusters 100 times faster in memory and 10 times faster on
disk. Spark is built with data science in mind, and its concept makes data science
effortless. Spark is also popular for data pipelines and machine learning model
development.
Spark also includes a library, MLlib, that provides a progressive set of machine
learning algorithms for repetitive data science techniques like classification,
regression, collaborative filtering, clustering, etc.
6. Excel
Excel is a basic, popular and widely used analytical tool in almost all industries.
Whether you are an expert in SAS, R or Tableau, you will still need to use Excel.
Excel becomes important when there is a requirement for analytics on the client's
internal data. It handles the complex task of summarizing the data with a preview
of pivot tables that helps in filtering the data as per client requirements. Excel has
an advanced business analytics option which helps with modelling capabilities
through prebuilt options like automatic relationship detection, creation of DAX
measures and time grouping.
7. RapidMiner:

RapidMiner is a powerful integrated data science platform, developed by the
company of the same name, that performs predictive analysis and other advanced
analytics like data mining, text analytics, machine learning and visual analytics
without any programming. RapidMiner can incorporate any data source type,
including Access, Excel, Microsoft SQL, Teradata, Oracle, Sybase, IBM DB2,
Ingres, MySQL, IBM SPSS, dBase etc. The tool is very powerful and can generate
analytics based on real-life data transformation settings, i.e. you can control the
formats and data sets for predictive analysis.
8. KNIME

KNIME was developed in January 2004 by a team of software engineers at the
University of Konstanz. KNIME is a leading open source, reporting, and integrated
analytics tool that allows you to analyze and model data through visual programming;
it integrates various components for data mining and machine learning via its
modular data-pipelining concept.
9. QlikView

QlikView has many unique features, like patented technology and in-memory
data processing, which delivers results very fast to the end users and stores
the data in the report itself. Data associations in QlikView are automatically
maintained, and the data can be compressed to almost 10% of its original size. Data
relationships are visualized using colors – a specific color is given to related data
and another color to non-related data.
10. Splunk:
Splunk is a tool that analyzes and searches machine-generated data. Splunk pulls in
all text-based log data and provides a simple way to search through it; a user can
pull in all kinds of data, perform all sorts of interesting statistical analyses on it,
and present it in different formats.

Reporting vs Analysis
Living in the era of digital technology and big data has made organizations
dependent on the wealth of information data can bring. You might have seen how
reporting and analysis are used interchangeably, especially in the manner in which
outsourcing companies market their services. While both areas are part of web
analytics (note that analytics isn't the same as analysis), there's a vast difference
between them, and it's more than just spelling.

It's important that we differentiate the two, because some organizations might be
selling themselves short in one area and not reaping the benefits which web analytics
can bring to the table. The first core component of web analytics, reporting, is
merely organizing data into summaries. On the other hand, analysis is the process
of inspecting, cleaning, transforming, and modeling these summaries (reports)
with the goal of highlighting useful information.
Simply put, reporting translates data into information while analysis turns
information into insights. Also, reporting should enable users to ask “What?”
questions about the information, whereas analysis should answer "Why?" and
"What can we do about it?"

Here are five differences between reporting and analysis:


1. Purpose

Reporting has helped companies monitor their data since before digital technology
boomed. Various organizations have depended on the information it brings to their
business, as reporting extracts it and makes it easier to understand.

Analysis interprets data at a deeper level. While reporting can link cross-channels
of data, provide comparisons, and make information easier to understand (think of
dashboards, charts, and graphs, which are reporting tools and not analysis reports),
analysis interprets this information and provides recommendations on actions.

2. Tasks

As reporting and analysis have a very fine line dividing them, it's sometimes easy
to confuse tasks that have 'analysis' labeled on top of them when all they do is
reporting. Hence, ensure that your analytics team keeps a healthy balance of
both.

Here's a great differentiator to keep in mind for deciding whether what you're doing
is reporting or analysis:

Reporting includes building, configuring, consolidating, organizing, formatting,
and summarizing. It's very similar to the above mentioned, like turning data into
charts and graphs, and linking data across multiple channels.

Analysis consists of questioning, examining, interpreting, comparing, and
confirming. With big data, predicting is possible as well.

3. Outputs

Reporting and analysis have a push and pull effect on their users through their
outputs. Reporting has a push approach, as it pushes information to users, and
outputs come in the form of canned reports, dashboards, and alerts.

Analysis has a pull approach, where a data analyst draws information to further
probe and to answer business questions. Outputs can be in the form of
ad hoc responses and analysis presentations. Analysis presentations are composed
of insights, recommended actions, and a forecast of their impact on the
company – all in a language that's easy to understand at the level of the user
who'll be reading and deciding on it.

This is important for organizations to truly realize the value of data: a
standard report is not the same as meaningful analytics.

4. Delivery
Considering that reporting involves repetitive tasks – often with truckloads of
data – automation has been a lifesaver, especially now with big data. It's not
surprising that the first things to be outsourced are data entry services, since
outsourcing companies are perceived as data reporting experts.

Analysis requires a more custom approach, with human minds doing superior
reasoning and analytical thinking to extract insights, and technical skills to provide
efficient steps towards accomplishing a specific goal. This is why data analysts
and scientists are in demand these days, as organizations depend on them to come
up with recommendations that help leaders and business executives make decisions
about their businesses.

5. Value
This isn't about identifying which one brings more value, but rather understanding
that both are indispensable when looking at the big picture. Together they should
help businesses grow, expand, move forward, and make more profit or increase their
value.

The Path to Value diagram illustrates how data converts into value through reporting
and analysis, such that neither is achievable without the other.

Data alone is useless, and action without data is baseless. Both reporting and
analysis are vital to bringing value to your data and operations.
Reporting and Analysis are Valuable
Not to undermine the role of reporting in web analytics, but organizations need to
understand that reporting itself is just numbers. Without drawing insights and
getting reports aligned with your organization’s big picture, you can’t make
decisions based on reports alone.
Data analysis is the most powerful tool to bring into your business. Employing
the powers of analysis is comparable to finding gold in your reports, which
allows your business to increase profits and develop further.
