Big Data Analytics On Large Scale Shared Storage System: First Seminar
Supervised by: Dr. Ni Lar Thein
Presented by: Kyar Nyo Aye (7Ph.D-1)
Outline
Abstract
Introduction
Motivation
Objective
Contribution
Background Theory
Proposed System Architecture
Conclusion
Abstract
Objective
Contribution
Background Theory
Big Data
Big Data Analytics
Big data analytics is the application of advanced analytic techniques to very
large data sets.
Advanced analytics is a collection of techniques and tool types, including
predictive analytics, data mining, and statistical analysis.
There are two main approaches to analyzing big data: the store and analyze
approach, and the analyze and store approach.
The store and analyze approach integrates source data into a consolidated data
store before it is analyzed.
Two important big data trends for supporting the store and analyze approach
are relational DBMS products optimized for analytical workloads (analytic
RDBMSs or ADBMSs) and non-relational systems (NoSQL systems) for
processing unstructured data.
The analyze and store approach analyzes data as it flows through processes,
across networks, and between systems.
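The difference between the two approaches can be illustrated with a minimal Python sketch. It is not taken from the seminar: the function names and the sample readings are hypothetical, and a plain list stands in for the consolidated data store.

```python
from typing import Iterable, Iterator


def store_then_analyze(source: Iterable[float]) -> float:
    """Store and analyze: consolidate every record first, then compute."""
    store = list(source)  # stand-in for loading into a consolidated data store
    return sum(store) / len(store)


def analyze_then_store(source: Iterable[float]) -> Iterator[float]:
    """Analyze and store: update the result as each record flows past."""
    count, total = 0, 0.0
    for value in source:
        count += 1
        total += value
        yield total / count  # running mean, available before all data arrives


readings = [4.0, 8.0, 6.0, 2.0]  # hypothetical sample values
batch_mean = store_then_analyze(readings)
running_means = list(analyze_then_store(readings))
```

The first function cannot answer until the whole data set has been stored, while the second produces an updated answer after every record.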
Big Data Storage
Many organizations are struggling to deal with increasing data
volumes, and big data simply makes the problem worse.
To solve this problem, organizations need to reduce the amount
of data being stored and exploit new storage technologies that
improve performance and storage utilization.
From a big data perspective there are three important directions:
Reducing data storage requirements using data compression and
new physical storage structures such as columnar storage.
Improving input/output (I/O) performance using solid-state
drives (SSDs).
Increasing storage utilization by using tiered storage to store data
on different types of devices based on usage.
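Two of these directions can be sketched in a few lines of Python. This is an illustration under stated assumptions, not part of the proposed system: the sample rows, the tier names, and the thresholds in `choose_tier` are all hypothetical. The sketch pivots rows into columns, shows how repeated values in a column collapse under run-length encoding, and picks a storage tier from an access count.

```python
from itertools import groupby

# Hypothetical row-oriented records: (day, sensor, reading)
rows = [
    ("day-1", "sensor-a", 20),
    ("day-1", "sensor-a", 21),
    ("day-1", "sensor-b", 20),
    ("day-2", "sensor-b", 22),
]


def to_columns(records):
    """Pivot row-oriented records into one list per column."""
    return [list(column) for column in zip(*records)]


def run_length_encode(column):
    """Collapse consecutive repeated values into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(column)]


def choose_tier(reads_per_day, hot_threshold=100, warm_threshold=10):
    """Pick a device tier from access frequency (thresholds are assumptions)."""
    if reads_per_day >= hot_threshold:
        return "ssd"
    if reads_per_day >= warm_threshold:
        return "disk"
    return "archive"


days, sensors, values = to_columns(rows)
encoded_days = run_length_encode(days)
encoded_sensors = run_length_encode(sensors)
```

Storing a column contiguously places similar values next to each other, which is why simple encodings such as this one become effective.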
Proposed System Architecture
Big Data Storage Architecture
Proposed Big Data Platform
In general, there are three approaches to big data analytics:
Direct analytics over massively parallel processing data warehouses
(MPP DWs): an analytic tool runs directly against an MPP DW.
Indirect analytics over Hadoop: the data is first processed, transformed,
and structured inside Hadoop, and the structured result is then exported
to an RDBMS for analysis.
Direct analytics over Hadoop: every query that the analytic tool issues
is executed as a MapReduce (MR) job over the big unstructured data
stored in Hadoop.
The proposed approach performs analytics over the Hadoop MapReduce
framework and the Gluster file system.
All analytic queries are executed as MapReduce jobs over big
unstructured data stored in the Gluster file system.
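The shape of such a MapReduce job can be sketched in plain Python. This is an in-process simulation of the map, shuffle, and reduce phases, not Hadoop's API. The function names and input lines are hypothetical, and a list of strings stands in for the unstructured files that the proposed system would read from the Gluster file system.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine the values collected for one key."""
    return key, sum(values)


def run_job(lines):
    """Run map, shuffle, and reduce in sequence over the input lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


lines = ["big data analytics", "big data storage"]  # hypothetical input
counts = run_job(lines)
```

In a real cluster the map and reduce functions run in parallel on many nodes; only the division of work into the three phases carries over from this sketch.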
Cont’d
Figure 2 describes the proposed big data approach.
Gluster File System
Gluster file system is a scalable, open source, clustered file system that offers a global
namespace and a distributed front end, and is designed to scale to hundreds of petabytes.
It is a software-only, highly available, centrally managed storage pool for
unstructured data.
It provides scale-out file storage for NAS, object, and big data workloads.
Because it runs on commodity hardware, Gluster also offers cost advantages.
Gluster file system can be used in place of HDFS, which brings its software-based
data protection and functionality to the Hadoop cluster and removes the single point of
failure.
GlusterFS 3.3 beta 2 includes compatibility with Apache Hadoop; it uses the standard
file system APIs available in Hadoop to provide a new storage option for Hadoop
deployments.