
BIG DATA Technology

What is Big Data?
▪ 'Big Data' is a term used to describe a collection of data that is huge in
size and yet growing exponentially with time.
▪ Such data is so large and complex that none of the traditional data
management tools can store or process it efficiently.
What comes under Big Data?
▪ Big Data involves data produced by many different devices and applications.
Given below are some of the fields that come under the umbrella of Big
Data:
▪ Black Box Data 
▪ Social Media Data
▪ Stock Exchange Data
▪ Power Grid Data
▪ Search Engine Data
Types of Data

▪ Big Data involves huge volume, high velocity, and a wide variety of
data. The data in it will be of three types, each illustrated in the sketch below:
▪ Structured data: relational data.
▪ Semi-structured data: XML data.
▪ Unstructured data: Word documents, PDFs, text, media logs.
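
A minimal sketch of what each type can look like in practice (the records and field names are invented for illustration):

```python
# Structured: fixed schema, like one row of a relational table
structured_row = {"id": 101, "name": "Asha", "balance": 2500.00}

# Semi-structured: self-describing tags, but no rigid schema (XML)
semi_structured = '<customer id="101"><name>Asha</name></customer>'

# Unstructured: free text with no predefined data model
unstructured = "Customer called at 3pm, unhappy about a delivery delay..."
```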
Importance of Big Data:

▪ The importance of big data doesn’t revolve around how much data
you have, but what you do with it. You can take data from any
source and analyze it to find answers that enable:
1) cost reductions,
2) time reductions,
3) new product development and optimized offerings, and
4) smart decision making.
Importance of Big Data: (cont.)

▪ When you combine big data with high-powered analytics, you can
accomplish business-related tasks such as:
1. Determining root causes of failures, issues and defects in near-real time.
2. Generating coupons at the point of sale based on the customer’s buying habits.
3. Recalculating entire risk portfolios in minutes.
4. Detecting fraudulent behavior before it affects your organization.
4 V’s of Big Data / characteristics of
Big data
Volume

▪ The name 'Big Data' itself refers to a size which is enormous.
▪ The size of data plays a very crucial role in determining the value that
can be derived from it.
▪ Also, whether particular data can actually be considered Big Data or
not depends upon the volume of the data.
▪ Hence, 'Volume' is one characteristic which needs to be
considered while dealing with 'Big Data'.
Velocity

▪ The term 'velocity' refers to the speed at which data is generated.
▪ How fast the data is generated and processed to meet demands
determines the real potential in the data.
▪ Big Data velocity deals with the speed at which data flows in from
sources like business processes, application logs, networks,
social media sites, sensors, mobile devices, etc. The flow of data is
massive and continuous.
Variety

▪ Variety refers to heterogeneous sources and the nature of data, both
structured and unstructured.
▪ In earlier days, spreadsheets and databases were the only
sources of data considered by most applications.
▪ Nowadays, data in the form of emails, photos, videos, monitoring
devices, PDFs, audio, etc. is also being considered in analysis
applications.
▪ This variety of unstructured data poses certain issues for storing,
mining and analyzing data.
Veracity

▪ Big Data veracity refers to the biases, noise, abnormalities,
ambiguities and latency in data.
▪ Is the data that is being stored and mined meaningful to the
problem being analyzed?
▪ Keep your data clean, and put processes in place to keep 'dirty data'
from accumulating in your systems.
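
As a small illustration of basic veracity checks, a sketch using pandas (assuming it is installed; the columns and cleaning rules are invented for the example):

```python
import pandas as pd

df = pd.DataFrame({
    "user_id": [1, 1, 2, None, 3],
    "age":     [34, 34, -5, 29, 51],   # -5 is an abnormal value
})

df = df.drop_duplicates()              # remove duplicated records
df = df.dropna(subset=["user_id"])     # drop rows missing a key field
df = df[df["age"].between(0, 120)]     # filter out-of-range values
print(df)                              # only the clean rows remain
```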
Distributed File System

[Diagram: file system in a computer]

Distributed File System

▪ A distributed file system is a client/server-based application that allows
clients to access and process data stored on the server as if it were on their
own computer.
▪ When a user accesses a file on the server, the server sends the user a copy
of the file, which is cached on the user's computer while the data is being
processed and is then returned to the server.
▪ A distributed file system organizes the file and directory services of individual
servers into a global directory in such a way that remote data access is not
location-specific but is identical from any client.
▪ All files are accessible to all users of the global file system, and the
organization is hierarchical and directory-based.
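
A toy sketch of the fetch/cache/write-back pattern described above (the classes and method names are invented; real distributed file systems such as NFS are far more involved):

```python
class FileServer:
    def __init__(self):
        self.files = {"/docs/report.txt": "v1 contents"}

    def read(self, path):
        return self.files[path]           # send the client a copy

    def write(self, path, data):
        self.files[path] = data           # client returns updated data


class Client:
    def __init__(self, server):
        self.server = server
        self.cache = {}                   # local copy while processing

    def open(self, path):
        self.cache[path] = self.server.read(path)
        return self.cache[path]

    def close(self, path):
        self.server.write(path, self.cache.pop(path))


server = FileServer()
client = Client(server)
text = client.open("/docs/report.txt")    # cached on the client
client.cache["/docs/report.txt"] = text + " + edits"
client.close("/docs/report.txt")          # written back to the server
```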
Distributed File System (Cont.)

▪ Since more than one client may access the same data simultaneously, the
server must have a mechanism in place (such as maintaining information
about the times of access) to organize updates so that the client always
receives the most current version of data and that data conflicts do not
arise.
▪  Distributed file systems typically use file or database replication
(distributing copies of data on multiple servers) to protect against data
access failures.
▪ Sun Microsystems' Network File System (NFS), Novell NetWare,
Microsoft's Distributed File System, and IBM/Transarc's DFS are some
examples of distributed file systems.
Benefits of DFS:
▪ Resource management:
users access all resources through a single point.
▪ Accessibility:
users do not need to know the physical location of a shared folder; they
can navigate to it through Explorer and the domain tree.
▪ Fault tolerance:
shares can be replicated, so if the server in Chicago goes down,
resources will still be available to users.
▪ Workload management:
DFS allows administrators to distribute shared folders and workloads
across several servers for more efficient use of network and server resources.
Big Data Analytics

▪ Big Data analytics is the process of collecting, organizing and analyzing
large sets of data (called Big Data) to discover patterns and other useful
information.
▪ Big Data analytics can help organizations better understand the
information contained within the data, and can also help identify the data
that is most important to the business and to future business decisions.
▪ Analysts working with Big Data typically want the knowledge that comes
from analyzing the data.
Types of Big Data analytics
▪ Data analytics enables data-driven decision-making with scientific
backing, so that decisions can be based on factual data and not
simply on past experience or intuition alone.
▪ There are four general categories of analytics that are
distinguished by the results they produce:
• descriptive analytics
• diagnostic analytics
• predictive analytics
• prescriptive analytics
Types of Big Data analytics (cont.)
Descriptive Analytics
▪ Descriptive analytics is mostly about summarizing and reporting data.
▪ This type of data analytics is geared towards what is currently happening or
what has already happened.
▪ Descriptive analytics is often carried out via ad-hoc reporting or dashboards.
▪ The reports are generally static in nature and display historical data
presented in the form of data grids or charts.
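
A minimal descriptive-analytics sketch with pandas (assuming it is installed; the sales figures are invented):

```python
import pandas as pd

sales = pd.DataFrame({
    "region":  ["East", "East", "West", "West"],
    "revenue": [120, 90, 200, 150],
})

# Summarize what has already happened: totals and basic statistics
print(sales.groupby("region")["revenue"].sum())
print(sales["revenue"].describe())
```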
Diagnostic Analytics

▪ Diagnostic analytics is a form of advanced analytics which examines data
or content to answer the question “Why did it happen?”, and is
characterized by techniques such as drill-down, data
discovery, data mining and correlation.
▪ Diagnostic analytics provides more value than descriptive
analytics but requires a more advanced skill set.
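
A small sketch of two diagnostic techniques named above, drill-down and correlation, again on invented data:

```python
import pandas as pd

orders = pd.DataFrame({
    "region":   ["East", "East", "West", "West"],
    "discount": [0.05, 0.20, 0.10, 0.30],
    "returns":  [2, 9, 4, 12],
})

# Drill-down: break an aggregate out by a finer dimension
print(orders.groupby("region")[["returns"]].sum())

# Correlation: a first hint at *why* returns vary
print(orders["discount"].corr(orders["returns"]))
```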
Predictive Analytics

▪ In predictive analytics, we need to make sense of why certain
things happened and then build a model to project what could
happen in the future.
▪ Predictive analytics can be remarkably beneficial for businesses,
as it can serve as a guide for making their operations more
efficient by cutting down on costs.
▪ The process can also help ensure that businesses preserve the
resources needed to take advantage of future opportunities.
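
A toy predictive model using scikit-learn (assuming scikit-learn and NumPy are installed; the spend/revenue numbers are invented):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Past monthly ad spend (x) and revenue (y)
x = np.array([[10], [20], [30], [40]])
y = np.array([55, 95, 150, 185])

model = LinearRegression().fit(x, y)

# Project what could happen at a new spend level
print(model.predict([[50]]))
```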
Big Data Applications
▪ Almost all industries today are leveraging Big Data applications in one
way or another, and all of them are benefiting from it.
▪ The following are some applications of Big Data:
1. Smart Healthcare
2. Telecom
3. Retail
4. Traffic Control
5. Manufacturing
6. Search Quality
Smarter Healthcare:
Making use of petabytes of patient data, organizations
can extract meaningful information and then build applications
that can predict a patient’s deteriorating condition in
advance.
Telecom:
The telecom sector collects information, analyzes it and provides
solutions to different problems. By using Big Data applications,
telecom companies have been able to significantly reduce data
packet loss, which occurs when networks are overloaded, thus
providing a seamless connection to their customers.
Retail:
Retail has some of the tightest margins and is one of
the greatest beneficiaries of big data. The beauty of
using big data in retail is understanding consumer
behavior. Amazon’s recommendation engine provides
suggestions based on the browsing history of the
consumer.
Traffic control:
Traffic congestion is a major challenge for many
cities globally. Effective use of data and sensors will be
key to managing traffic better as cities become more
densely populated.
Manufacturing:
Analyzing big data in the manufacturing industry can
reduce component defects, improve product quality,
increase efficiency, and save time and money.
Search Quality:
Every time we extract information from Google, we
simultaneously generate data for it. Google stores this
data and uses it to improve its search quality.
Challenges with Big Data

▪ Data Quality – The problem here is the 4th V, i.e. Veracity. The
data here is very messy, inconsistent and incomplete. Dirty
data costs companies in the United States an estimated
$600 billion every year.
▪ Discovery – Finding insights in Big Data is like finding a needle
in a haystack. Analyzing petabytes of data using extremely
powerful algorithms to find patterns and insights is very
difficult.

Challenges with Big Data (cont.)

▪ Storage – The more data an organization has, the more
complex the problems of managing it can become. The
question that arises here is “Where to store it?”. We
need a storage system which can easily scale up or
down on demand.
▪ Analytics – In the case of Big Data, most of the time we
are unaware of the kind of data we are dealing with, so
analyzing that data is even more difficult.
Challenges with Big Data (cont.)
▪ Security – Since the data is huge in size, keeping it secure is another
challenge. It involves user authentication, restricting access on a
per-user basis, recording data-access histories, proper use of data
encryption, etc.
▪ Lack of Talent – There are a lot of Big Data projects in major
organizations, but assembling a sophisticated team of developers,
data scientists and analysts who also have a sufficient amount of
domain knowledge is still a challenge.

Drivers of Big Data
▪ Social networking sites: Facebook, Google, LinkedIn – all these sites
generate huge amounts of data on a day-to-day basis, as they have billions
of users worldwide.
▪ E-commerce sites: Sites like Amazon, Flipkart and Alibaba generate huge
amounts of logs from which users’ buying trends can be traced.
▪ Weather stations: Weather stations and satellites produce huge volumes
of data, which are stored and processed to forecast the weather.
▪ Telecom companies: Telecom giants like Airtel and Vodafone study user
trends and publish their plans accordingly; for this they store the data
of their millions of users.
▪ Share market: Stock exchanges across the world generate huge amounts
of data through their daily transactions.
What is MapReduce?

MapReduce is a programming framework that allows us to
perform distributed and parallel processing on large data sets in a
distributed environment.
• MapReduce consists of two distinct tasks – Map and Reduce.
• As the name MapReduce suggests, the reducer phase takes place
after the mapper phase has been completed.
• So, the first is the map job, where a block of data is read and
processed to produce key-value pairs as intermediate outputs.
• The output of a Mapper or map job (key-value pairs) is the input
to the Reducer.
• The reducer receives key-value pairs from multiple map
jobs.
• Then, the reducer aggregates those intermediate data tuples
(intermediate key-value pairs) into a smaller set of tuples or
key-value pairs, which is the final output.
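
A classic word-count illustration of the two phases, sketched in plain Python (real frameworks such as Hadoop distribute the map and reduce calls across machines; here everything runs in one process):

```python
from collections import defaultdict

def mapper(block):
    # Map: read a block of text, emit intermediate (key, value) pairs
    return [(word.lower(), 1) for word in block.split()]

def reducer(key, values):
    # Reduce: aggregate all values seen for one key
    return (key, sum(values))

blocks = ["Deer Bear River", "Car Car River", "Deer Car Bear"]

# Shuffle: group intermediate pairs by key before reducing
grouped = defaultdict(list)
for block in blocks:
    for key, value in mapper(block):
        grouped[key].append(value)

print([reducer(k, v) for k, v in grouped.items()])
# [('deer', 2), ('bear', 2), ('river', 2), ('car', 3)]
```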
MapReduce - Algorithm
The MapReduce algorithm contains two important tasks, namely Map and
Reduce.
• The map task is done by means of a Mapper class.
• The reduce task is done by means of a Reducer class.

The Mapper class takes the input, tokenizes it, and maps and sorts it. The
output of the Mapper class is used as input by the Reducer class, which in
turn searches for matching pairs and reduces them.
MapReduce implements various mathematical algorithms to divide a task into
small parts and assign them to multiple systems.

These mathematical algorithms may include the following:
• Sorting
• Searching
• Indexing
• TF-IDF
Matrix-Vector Multiplication using MapReduce

▪ We have a sparse matrix A stored in the form ⟨i, j, aᵢⱼ⟩, where i and j
are the row and column indices, and a vector v stored as ⟨j, vⱼ⟩.
We wish to compute Av.
▪ For the following algorithm, we assume v is small enough
to fit into the memory of the mapper.

Algorithm 1: Matrix-Vector Multiplication on MapReduce
1: function MAP(⟨i, j, aᵢⱼ⟩)
2:   Emit(i, aᵢⱼ · v[j])
3: end function
4: function REDUCE(key, values)
5:   ret ← 0
6:   for val ∈ values do
7:     ret ← ret + val
8:   end for
9:   Emit(key, ret)
10: end function
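
A single-process Python sketch of Algorithm 1, with the shuffle step simulated by grouping partial products on the row index (a real MapReduce job would distribute this across workers):

```python
from collections import defaultdict

# Sparse matrix A as (i, j, a_ij) triples, and dense vector v as v[j]
A = [(0, 0, 2.0), (0, 2, 1.0), (1, 1, 3.0)]
v = [1.0, 2.0, 4.0]

def map_fn(i, j, a_ij):
    # Emit (row index, partial product a_ij * v[j])
    return (i, a_ij * v[j])

def reduce_fn(key, values):
    # Sum the partial products to get one entry of Av
    return (key, sum(values))

grouped = defaultdict(list)          # simulated shuffle phase
for i, j, a_ij in A:
    key, val = map_fn(i, j, a_ij)
    grouped[key].append(val)

print(sorted(reduce_fn(k, vals) for k, vals in grouped.items()))
# [(0, 6.0), (1, 6.0)]
```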
