Big Data Analytics On Large Scale Shared Storage System: First Seminar


Big Data Analytics on Large Scale

Shared Storage System

First Seminar

Supervised by: Dr. Ni Lar Thein
Presented by: Kyar Nyo Aye (7Ph.D-1)
1
Outline

 Abstract
 Introduction
 Motivation
 Objective
 Contribution
 Background Theory
 Proposed System Architecture
 Conclusion

2
Abstract

 In today’s world, almost every enterprise is seeing an explosion
of data.
 Enterprises generate huge amounts of digital data daily.
 Such huge amounts of data need to be stored, managed,
processed and analyzed in a scalable, fault tolerant and efficient
manner.
 The main challenges of big data are that most of it is semi-structured
or unstructured, that complex computations must be carried out over
it, and that processing time must be kept as low as possible.
 To address these challenges, a big data platform based on the Hadoop
MapReduce framework and the Gluster file system over a large scale
shared storage system is proposed.
3
Introduction

 There is a high-level categorization of big data platforms that store
and process big data in a scalable, fault tolerant and efficient
manner.
 The first category includes massively parallel processing (MPP)
data warehouses that are designed to store huge amounts of
structured data across a cluster of servers and perform parallel
computations over it.
 Most of these solutions follow a shared-nothing architecture, which
means that every node has a dedicated disk, memory and
processor.
 As they are designed to hold structured data, the structure must be
extracted from the data using an ETL tool before these data sources
can be populated with it.

4
Cont’d

 These MPP Data Warehouses include:


 MPP Databases: distributed database systems designed to run on a
cluster of commodity servers. E.g. Aster nCluster, Greenplum,
IBM DB2 and Teradata.
 Appliances: a purpose-built machine with preconfigured MPP hardware
and software designed for analytical processing. E.g. Oracle Optimized
Warehouse, Teradata Machines, Netezza Performance Server and Sun’s
Data Warehousing Appliance.
 Columnar Databases: they store data in columns instead of rows,
allowing greater compression and faster query performance. E.g. Sybase
IQ, Vertica, InfoBright Data Warehouse, ParAccel.
 Another category includes distributed file systems like Hadoop to store
huge unstructured data and perform MapReduce computations on it over
a cluster built of commodity hardware.
5
Motivation

 Big data analytics is an area of rapidly growing diversity.
 It can be defined by the need to parse large data sets from
multiple sources and to produce information in real time or near real
time.
 It requires massive performance and scalability; common problems are
that old platforms cannot scale to big data volumes, load data too slowly,
respond to queries too slowly, lack processing capacity for analytics and
cannot handle concurrent mixed workloads.
 Traditional data warehousing is a large but relatively slow producer of
information for analytics users, and is mostly suited to analyzing structured
data from various systems.
 A Hadoop-based platform is well suited to deal with not only structured
data but also semi-structured and unstructured data.

6
Objective

 To analyze high volume and variety of data


 To improve decision-making process
 To maximize performance for big data analytics
 To deliver big data platform that can perform large scale
data analysis efficiently and effectively
 To gain deep insights from big data
 To accelerate analytical processes
 To achieve a low-cost big data platform built upon
open source software

7
Contribution

 An approach to extract deep insight from a high volume and
variety of data in a cost-effective manner
 An approach to solve information challenges that do not natively
fit a traditional relational database approach to the problem
at hand
 An approach for a highly scalable, fault tolerant big data solution

8
Background Theory

9
Big Data

 The term big data applies to information that cannot be processed
or analyzed using traditional processes or tools.
 There are three characteristics of big data: volume, variety, and
velocity.
 Volume: the amount of data to be handled (scale from terabytes to
zettabytes)
 Variety: manage and benefit from diverse data types and data structures
 Velocity: analyze streaming data and large volumes of persistent data
 There are two types of big data:
 data at rest (e.g. a collection of what has streamed, web logs, emails, social
media, unstructured documents and structured data from disparate systems)
 data in motion (e.g. Twitter/Facebook comments, stock market data and
sensor data).

10
Big Data Analytics

 Big data analytics is the application of advanced analytic techniques to very big
data sets.
 Advanced analytics is a collection of techniques and tool types, including
predictive analytics, data mining, statistical analysis, and so on.
 There are two main techniques for analyzing big data: the store and analyze
approach, and the analyze and store approach.
 The store and analyze approach integrates source data into a consolidated data
store before it is analyzed.
 Two important big data trends for supporting the store and analyze approach
are relational DBMS products optimized for analytical workloads (analytic
RDBMSs or ADBMSs) and non-relational systems (NoSQL systems) for
processing unstructured data.
 The analyze and store approach analyzes data as it flows through processes,
across networks, and between systems.
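The two techniques above can be sketched in Python. This is a minimal single-process illustration; the event records, field names and the byte-count aggregate are assumptions made for the example, not part of the proposed system.

```python
# Illustrative contrast between the two big data analysis techniques.
events = [{"user": "a", "bytes": 120}, {"user": "b", "bytes": 300},
          {"user": "a", "bytes": 80}]

def store_and_analyze(source):
    """Store-and-analyze: land all records in a consolidated store first,
    then run the analysis over the full, stored data set."""
    store = list(source)                  # consolidated data store (batch load)
    return sum(e["bytes"] for e in store)

def analyze_and_store(source):
    """Analyze-and-store: compute results as records flow through,
    keeping only a running aggregate rather than waiting for a full load."""
    total = 0
    for e in source:                      # analysis happens in-stream
        total += e["bytes"]
    return total

print(store_and_analyze(events))  # 500
print(analyze_and_store(events))  # 500
```

Both arrive at the same answer here; the difference is when the computation happens, which is what makes the second approach attractive for data in motion.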

11
Big Data Storage
 Many organizations are struggling to deal with increasing data
volumes, and big data simply makes the problem worse.
 To solve this problem, organizations need to reduce the amount
of data being stored and exploit new storage technologies that
improve performance and storage utilization.
 From a big data perspective there are three important directions:
 Reducing data storage requirements using data compression and
new physical storage structures such as columnar storage.
 Improving input/output (I/O) performance using solid-state
drives (SSDs).
 Increasing storage utilization by using tiered storage to store data
on different types of devices based on usage.
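The first direction, columnar storage with compression, can be illustrated with a small Python sketch. The table, its column names and the run-length encoding scheme are assumptions for the example; real columnar databases combine several encodings.

```python
# Row-oriented vs column-oriented layout for the same table, plus a simple
# run-length encoding (RLE) pass showing why columnar data compresses well:
# a column holds values of one type, so repeats sit next to each other.
rows = [("2012-01-01", "US", 10),
        ("2012-01-01", "US", 12),
        ("2012-01-01", "UK", 7)]

# Columnar layout: one array per column instead of one tuple per row.
columns = {name: [r[i] for r in rows]
           for i, name in enumerate(["date", "country", "amount"])}

def rle(values):
    """Run-length encode a column: adjacent repeats collapse to (value, count)."""
    out = []
    for v in values:
        if out and out[-1][0] == v:
            out[-1] = (v, out[-1][1] + 1)
        else:
            out.append((v, 1))
    return out

print(rle(columns["date"]))     # [('2012-01-01', 3)]
print(rle(columns["country"]))  # [('US', 2), ('UK', 1)]
```

Three repeated dates compress to a single pair, while the row layout would store the date three times.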

12
Proposed System Architecture

13
Big Data Storage Architecture

 The large scale shared storage system
architecture for big data is shown in
Figure 1.
 In this architecture, a scale-out approach is
used; a clustered architecture, a
distributed/parallel file system and
commodity hardware are required.
 There are a number of significant
benefits to these new scale-out systems
that meet the needs of big data
challenges.
 They are manageability, elimination of
stovepipes, just-in-time scalability and
increased utilization rates.

Figure 1. Large scale shared storage system architecture for big data

14
Proposed Big Data Platform
 In general, there are three approaches to big data analytics:
 direct analytics over massively parallel processing data warehouses:
uses an analytic tool directly over any of the MPP data warehouses.
 indirect analytics over Hadoop: analyzes Hadoop data indirectly
by first processing, transforming and structuring it inside Hadoop and
then exporting the structured data into an RDBMS.
 direct analytics over Hadoop: all the queries that an analytic tool
wants to execute against the data are executed as MapReduce jobs over
big unstructured data placed into Hadoop.
 The proposed approach performs analytics over the Hadoop MapReduce
framework and the Gluster file system.
 All the queries for analytics are executed as MapReduce jobs over big
unstructured data placed into the Gluster file system.

15
Cont’d
Figure 2 describes the proposed big data approach.

Figure 2. Proposed big data approach

16
Cont’d

 Proposed big data platform is shown in Figure 3.

Figure 3. Proposed big data platform


17
Hadoop and MapReduce Framework

 Data growth – particularly of unstructured data – poses a special challenge as
the volume and diversity of data types outstrip the capabilities of older
technologies such as relational databases.
 Organizations are investigating next-generation technologies for data analytics.
 One of the most promising is the Apache Hadoop software and
MapReduce framework for dealing with this “big data” problem.
 A MapReduce framework typically divides the input data set into independent
chunks which are processed by the map tasks in a completely parallel manner.
 The framework sorts the outputs of the maps, which are then input to the reduce
tasks.
 Typically both the input and the output of the jobs are stored in a file system.
 The framework takes care of scheduling tasks, monitoring them and
re-executing the failed tasks.
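The map, sort and reduce phases described above can be sketched in Python. This is a single-process simulation, not the Hadoop API; the word-count job and the pre-split input chunks are assumptions chosen to keep the example small.

```python
from itertools import groupby
from operator import itemgetter

def map_phase(chunk):
    # Each map task runs independently over its own input chunk,
    # emitting (key, value) pairs -- here (word, 1).
    return [(word, 1) for word in chunk.split()]

def mapreduce(chunks):
    # 1. Map: process the independent input chunks (parallel in Hadoop,
    #    sequential in this simulation).
    pairs = [kv for chunk in chunks for kv in map_phase(chunk)]
    # 2. Shuffle/sort: the framework sorts map outputs by key so that
    #    all values for one key arrive at the same reduce task.
    pairs.sort(key=itemgetter(0))
    # 3. Reduce: aggregate all values for each key.
    return {key: sum(v for _, v in group)
            for key, group in groupby(pairs, key=itemgetter(0))}

chunks = ["big data", "big storage", "data data"]
print(mapreduce(chunks))  # {'big': 2, 'data': 3, 'storage': 1}
```

The sort step is what lets each reduce task see a contiguous run of one key, which is the core contract the framework provides between the two phases.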

18
Gluster File System
 Gluster file system is a scalable open source clustered file system that offers a global
namespace, distributed front end, and scales to hundreds of petabytes without difficulty.
 It is also a software-only, highly available, scalable, centrally managed storage pool for
unstructured data.
 It is also scale-out file storage software for NAS, object storage and big data.
 By leveraging commodity hardware, Gluster also offers extraordinary cost
advantages that are unmatched in the industry.
 Gluster file system can be used in place of HDFS, which brings all that software-based
data protection and functionality to the Hadoop cluster and removes the single point of
failure issue.
 GlusterFS 3.3 beta 2 includes compatibility for Apache Hadoop and it uses the standard
file system APIs available in Hadoop to provide a new storage option for Hadoop
deployments.

19
Cont’d

 Figure 4 describes Gluster file
system compatibility for Apache
Hadoop.
 The following are the advantages of
Hadoop Compatible Storage with
GlusterFS:
 provides simultaneous file-based and
object-based access within Hadoop,
 eliminates the centralized metadata
server,
 provides compatibility with
MapReduce applications without
requiring a rewrite, and
 provides a fault tolerant file system.

Figure 4. GlusterFS compatibility for Apache Hadoop
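The elimination of the centralized metadata server can be illustrated with a simplified Python sketch: if file placement is a pure function of the file path, every client computes a file's location independently and no lookup server is needed. GlusterFS's actual elastic hashing is more involved (it hashes within per-directory layout ranges); the server names and the use of MD5 here are assumptions made for the illustration.

```python
import hashlib

SERVERS = ["node1", "node2", "node3", "node4"]  # illustrative storage nodes

def locate(path, servers=SERVERS):
    """Deterministically map a file path to a server by hashing the path.
    Because every client runs the same function over the same server list,
    'where does this file live?' never needs a central metadata server."""
    digest = hashlib.md5(path.encode()).hexdigest()
    return servers[int(digest, 16) % len(servers)]

# Any two clients resolve the same placement independently:
assert locate("/logs/2012-01-01.log") == locate("/logs/2012-01-01.log")
print(locate("/logs/2012-01-01.log"))
```

Removing that central lookup is also what removes the single point of failure mentioned on the previous slide.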
20
Conclusion

 Big data is a growing problem for corporations as a result of sheer data
volume along with radical changes in the types of data being stored and
analyzed, and their characteristics.
 The main challenges of big data are data variety, velocity and volume,
and analytical workload complexity and agility.
 To address these challenges, many vendors have developed big data
platforms.
 A big data platform for large scale data analysis using the Hadoop
MapReduce framework and the Gluster file system over scale-out NAS is
proposed.
 However, Hadoop MapReduce is batch-oriented: it is not immediately
suited to real-time analysis and is ill-suited to ad hoc queries.
 Hadoop solves the volume and variety issues, so the velocity issue
remains to be solved.
21