Big Data Notes


Big data is a term for data sets that are so large or complex that traditional data processing application software is inadequate to deal with them.

Big data usually includes data sets with sizes beyond the ability of commonly used software
tools to capture, curate, manage, and process data within a tolerable elapsed time. Big Data
encompasses unstructured, semi-structured and structured data, however the main focus is on
unstructured data. Big data "size" is a constantly moving target, as of 2012 ranging from a
few dozen terabytes to many petabytes of data.

Big Data represents information assets characterized by such high volume, velocity and variety as to require specific technology and analytical methods for their transformation into value.

Volume: big data doesn't sample; it just observes and tracks what happens.
Velocity: big data is often available in real-time.

Variety: big data draws from text, images, audio, video; plus it completes missing pieces
through data fusion

Characteristics
Big data can be described by the following characteristics:

Volume
The quantity of generated and stored data. The size of the data determines the value
and potential insight, and whether it can actually be considered big data or not.
Variety
The type and nature of the data. This helps people who analyze it to effectively use
the resulting insight.
Velocity
In this context, the speed at which the data is generated and processed to meet the
demands and challenges that lie in the path of growth and development.
Variability
Inconsistency of the data set can hamper processes to handle and manage it.
Veracity
The quality of captured data can vary greatly, affecting the accuracy of analysis.

Big data value


Measuring the value of data is a boundless process with endless options and approaches. Whether structured or unstructured, data is only as valuable as the business outcomes it makes possible.

It is how we make use of data that allows us to fully recognize its true value and potential to improve our decision-making capabilities and, from a business standpoint, measure it against the result of positive business outcomes.

There are multiple approaches to improving a business's decision-making process and to determining the ultimate value of data, including data warehouses, business intelligence systems, and analytics sandboxes and solutions.

These approaches place high emphasis on the importance of every individual data item that
goes into these systems and, as a result, highlight the importance of every single outcome
linking to business impacts delivered.

Big data characteristics are defined popularly through the four Vs: volume, velocity, variety and veracity. Adopting these four characteristics provides multiple dimensions to the value of the data at hand.

Essentially, there is an assumption that the data has great potential, but no one has explored where that might be. Unlike a business intelligence system, where analysts know what information they are seeking, the possibilities of exploring big data are all linked to identifying connections between things we don't know. It is all about designing the system to decipher this information.

A possible approach could be to take the four Vs into prime consideration and determine
what kind of value they deliver while solving a particular business problem.

Volume-based value

Now that organisations have the ability to store as much data as possible in a cost-effective
manner, they have the capabilities to do broader analysis across different data dimensions and
also deeper analysis going back to multiple years of historical context behind data.

In essence, they no longer need to sample data; they can carry out their analysis on the entire data set. This scenario applies heavily to developing true customer-centric profiles, as well as richer customer-centric offerings at a micro level.

The more data businesses have on their customers, both recent and historical, the greater the insights. This will in turn lead to generating better decisions around acquiring, retaining, increasing and managing those customer relationships.

Velocity-based value

This is all about speed, which is now more important than ever. The faster businesses can
inject data into their data and analytics platform, the more time they will have to ask the right
questions and seek answers. Rapid analysis capabilities provide businesses with the right
decision in time to achieve their customer relationship management objectives.

Variety-based value

In the digital era, the capability to acquire and analyse varied data is extremely valuable, as the more diverse the customer data businesses have, the more multi-faceted a view they develop of their customers.

This in turn provides deep insights into successfully developing and personalising customer
journey maps, and provides a platform for businesses to be more engaged and aware of
customer needs and expectations.

Veracity-based value

While many question the quality and accuracy of data in the big data context, for innovative business offerings the accuracy of data is not that critical, at least in the early stages of concept design and validation. Thus, the more business hypotheses that can be churned out from this vast amount of data, the greater the potential for a business differentiation edge.

Developing a framework of measurement taking these aspects into account allows businesses to easily measure the value of data in their most important metric: money.

Once a big data analytics platform that measures along the four Vs is implemented, businesses can utilize and extend the outcomes to directly impact customer acquisition, onboarding, retention, upsell, cross-sell and other revenue-generating indicators.

This can also lead to measuring the value of parallel improvements in operational
productivity and the influence of data across the enterprise for other initiatives.

On the other side of the spectrum, however, it is important to note that amassing a lot of data
does not necessarily deliver insights. Businesses now have access to more data than ever
before, but having access to more data can make it harder to distill insights, since the bigger
the datasets, the harder it becomes to search, visualise, and analyse.

It is not the amount of data that matters; it's how smart organisations are with the data they have. In reality, they can have tons of data, but if they're not using it intelligently it seldom delivers what they are looking for.

Development of Big Data

Big data for development is a concept that refers to the identification of sources of big data relevant to the policies and planning of development programs. It differs from both traditional development data and what the private sector and mainstream media call "big data".

In general, sources of big data for development are those which can be analyzed to gain
insight into human well-being and development, and generally share some or all of the
following features:

Digitally generated: Data is created digitally, not digitized manually, and can be
manipulated by computers.
Passively produced: Data is a by-product of interactions with digital services.
Automatically collected: A system is in place that automatically extracts and stores the
relevant data that is generated.
Geographically or temporally trackable: For instance, this is the case for mobile phone
location data or call duration time.
Continuously analyzed: Information is relevant to human well-being and development,
and can be analyzed in real time.
Big data for development is constantly evolving. However, a preliminary categorization of
sources may reflect:

What people say (online content): International and local online news sources, publicly
accessible blogs, forum posts, comments and public social media content, online advertising,
e-commerce sites and websites created by local retailers that list prices and inventory.

What people do (data exhaust): Passively collected transactional data from the use of digital
services such as financial services (including purchase, money transfers, savings and loan
repayments), communications services (such as anonymized records of mobile phone usage
patterns) or information services (such as anonymized records of search queries).

Before it can be used effectively, big data needs to be managed and filtered through data analytics - tools and methodologies that can transform massive quantities of raw data into "data about the data" for analytical purposes. Only then is it possible to detect changes in how communities access services, which may be useful proxy indicators of human well-being.

If properly mined and analyzed, big data can improve the understanding of human behavior
and offer policymaking support for global development in three main ways:
Early warning: Early detection of anomalies can enable faster responses to populations in
times of crisis.
Real-time awareness: Fine-grained representation of reality through big data can inform
the design and targeting of programs and policies.
Real-time feedback: Adjustments can be made possible by real-time monitoring of the
impact of policies and programs.
Global Pulse is a United Nations innovation initiative of the Secretary-General, exploring
how big data can help policymakers gain a better understanding of changes in human well-
being. Through strategic public-private partnerships and R&D carried out across its network
of Pulse Labs in New York, Jakarta and Kampala, Global Pulse functions as a hub for
applying innovations in data science and analytics to global development and humanitarian
challenges.

Big data analytics is not a panacea for age-old development challenges, and real-time
information does not replace the quantitative statistical evidence that governments
traditionally use for decision making. However, it does have the potential to inform whether
further targeted investigation is necessary, or prompt immediate response.

Big data challenges

The major challenges associated with big data are as follows:

Capturing data

Curation

Storage

Searching

Sharing

Transfer

Analysis

Presentation
The following ten V-based characterizations represent different challenges associated with the main tasks involving big data: capture, cleaning, curation, integration, storage, processing, indexing, search, sharing, transfer, mining, analysis, and visualization.
Volume: = lots of data (which I have labeled "Tonnabytes", to suggest that the actual numerical scale at which the data volume becomes challenging in a particular setting is domain-specific, but we all agree that we are now dealing with a ton of bytes).

Variety: = complexity, thousands or more features per data item, the curse of
dimensionality, combinatorial explosion, many data types, and many data formats.

Velocity: = high rate of data and information flowing into and out of our systems,
real-time, incoming!

Veracity: = necessary and sufficient data to test many different hypotheses, vast
training samples for rich micro-scale model-building and model validation, micro-
grained truth about every object in your data collection, thereby empowering
whole-population analytics.

Validity: = data quality, governance, master data management (MDM) on massive, diverse, distributed, heterogeneous, unclean data collections.

Value: = the all-important V, characterizing the business value, ROI, and potential
of big data to transform your organization from top to bottom (including the bottom
line).

Variability: = dynamic, evolving, spatiotemporal data, time series, seasonal, and any other type of non-static behavior in your data sources, customers, objects of study, etc.

Venue: = distributed, heterogeneous data from multiple platforms, from different owners' systems, with different access and formatting requirements, private vs. public cloud.

Vocabulary: = schema, data models, semantics, ontologies, taxonomies, and other content- and context-based metadata that describe the data's structure, syntax, content, and provenance.

Vagueness: = confusion over the meaning of big data (Is it Hadoop? Is it something that we've always had? What's new about it? What are the tools? Which tools should I use? etc.)
What is NoSQL?

NoSQL encompasses a wide variety of different database technologies that were developed in
response to the demands presented in building modern applications:

Developers are working with applications that create massive volumes of new, rapidly changing data types: structured, semi-structured, unstructured and polymorphic data.

Long gone is the twelve-to-eighteen month waterfall development cycle. Now small teams
work in agile sprints, iterating quickly and pushing code every week or two, some even
multiple times every day.

Applications that once served a finite audience are now delivered as services that must be
always-on, accessible from many different devices and scaled globally to millions of users.

Organizations are now turning to scale-out architectures using open source software,
commodity servers and cloud computing instead of large monolithic servers and storage
infrastructure.

Relational databases were not designed to cope with the scale and agility challenges that face
modern applications, nor were they built to take advantage of the commodity storage and
processing power available today.

NoSQL with MongoDB Atlas: Hosted Database as a Service

Try out the easiest way to start learning and prototyping applications on MongoDB, the
leading non-relational database.

Launching an application on any database typically requires careful planning to ensure performance, high availability, security, and disaster recovery, and these obligations continue as long as you run the application. With MongoDB Atlas, you receive all of the features of MongoDB without any of the operational heavy lifting, allowing you to focus instead on learning and building your apps. Features include:

On-demand, pay as you go model

Seamless upgrades and auto-healing

Fully elastic. Scale up and down with ease


Deep monitoring & customizable alerts

Highly secure by default

Continuous backups with point-in-time recovery

NoSQL Database Types

Document databases pair each key with a complex data structure known as a document.
Documents can contain many different key-value pairs, or key-array pairs, or even nested
documents.
Graph stores are used to store information about networks of data, such as social
connections. Graph stores include Neo4J and Giraph.
Key-value stores are the simplest NoSQL databases. Every single item in the database is
stored as an attribute name (or 'key'), together with its value. Examples of key-value stores
are Riak and Berkeley DB. Some key-value stores, such as Redis, allow each value to have a
type, such as 'integer', which adds functionality.
Wide-column stores such as Cassandra and HBase are optimized for queries over large
datasets, and store columns of data together, instead of rows.
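As a rough illustration of how these models differ, the sketch below contrasts a key-value view of a customer record with a document view, using nothing but standard Java collections. The field names and the 'customer:42' key are made up for the example; real stores such as Riak or MongoDB expose their own APIs on top of these shapes.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class NoSqlModels {
    public static void main(String[] args) {
        // Key-value view: one opaque value per key; the application
        // must know how to interpret the value it reads back.
        Map<String, String> kvStore = new HashMap<>();
        kvStore.put("customer:42", "{\"name\":\"Asha\",\"city\":\"Pune\"}");

        // Document view: the same record as a nested structure whose
        // individual fields (and nested arrays) remain addressable.
        Map<String, Object> document = new HashMap<>();
        document.put("_id", "customer:42");
        document.put("name", "Asha");
        document.put("city", "Pune");
        document.put("orders", List.of(
                Map.of("item", "lamp", "qty", 2),
                Map.of("item", "book", "qty", 1)));

        System.out.println(kvStore.get("customer:42"));
        System.out.println(document);
    }
}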

The Benefits of NoSQL

When compared to relational databases, NoSQL databases are more scalable and provide
superior performance, and their data model addresses several issues that the relational model
is not designed to address:

Large volumes of rapidly changing structured, semi-structured, and unstructured data

Agile sprints, quick schema iteration, and frequent code pushes

Object-oriented programming that is easy to use and flexible

Geographically distributed scale-out architecture instead of expensive, monolithic architecture

Top 5 Considerations When Evaluating NoSQL Databases:

Selecting the appropriate data model: document, key-value & wide column, or graph model.

The pros and cons of consistent and eventually consistent systems.

Why idiomatic drivers minimize onboarding time for new developers and simplify application development.

1. Dynamic Schemas

Relational databases require that schemas be defined before you can add data. For example, you might want to store data about your customers such as phone numbers, first and last name, address, city and state; a SQL database needs to know what you are storing in advance.

This fits poorly with agile development approaches, because each time you complete new
features, the schema of your database often needs to change. So if you decide, a few
iterations into development, that you'd like to store customers' favorite items in addition to
their addresses and phone numbers, you'll need to add that column to the database, and then
migrate the entire database to the new schema.

If the database is large, this is a very slow process that involves significant downtime. If you are frequently changing the data your application stores because you are iterating rapidly, this downtime may also be frequent. There's also no way, using a relational database, to effectively address data that's completely unstructured or unknown in advance.

NoSQL databases are built to allow the insertion of data without a predefined schema. That makes it easy to make significant application changes in real time, without worrying about service interruptions, which means development is faster, code integration is more reliable, and less database administrator time is needed. Developers have typically had to add
application-side code to enforce data quality controls, such as mandating the presence of
specific fields, data types or permissible values. More sophisticated NoSQL databases allow
validation rules to be applied within the database, allowing users to enforce governance
across data, while maintaining the agility benefits of a dynamic schema.
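To make the contrast concrete, here is a hedged sketch assuming a local MySQL instance reachable over JDBC and a local MongoDB instance with the Java driver on the classpath; the table, collection, connection strings and field names are illustrative only. The relational side must alter the schema before the new attribute exists, while the document side simply inserts documents with differing fields.

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;
import java.util.Arrays;

public class SchemaContrast {
    public static void main(String[] args) throws Exception {
        // Relational side: adding a new attribute means altering the schema
        // first, then migrating existing rows (connection details are assumed).
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/shop", "user", "password");
             Statement stmt = conn.createStatement()) {
            stmt.executeUpdate(
                "ALTER TABLE customers ADD COLUMN favorite_item VARCHAR(255)");
        }

        // Document side: new fields can simply appear on new documents,
        // with no upfront schema change (a local MongoDB is assumed).
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> customers =
                    client.getDatabase("shop").getCollection("customers");
            customers.insertOne(new Document("name", "Asha")
                    .append("phone", "555-0100"));
            customers.insertOne(new Document("name", "Ravi")
                    .append("phone", "555-0101")
                    .append("favoriteItems", Arrays.asList("lamp", "book")));
        }
    }
}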

2. Auto-sharding

Because of the way they are structured, relational databases usually scale vertically: a single server has to host the entire database to ensure acceptable performance for cross-table joins and transactions. This gets expensive quickly, places limits on scale, and creates a relatively
small number of failure points for database infrastructure. The solution to support rapidly
growing applications is to scale horizontally, by adding servers instead of concentrating more
capacity in a single server.

'Sharding' a database across many server instances can be achieved with SQL databases, but
usually is accomplished through SANs and other complex arrangements for making hardware
act as a single server. Because the database does not provide this ability natively,
development teams take on the work of deploying multiple relational databases across a
number of machines. Data is stored in each database instance autonomously. Application
code is developed to distribute the data, distribute queries, and aggregate the results of data
across all of the database instances. Additional code must be developed to handle resource
failures, to perform joins across the different databases, for data rebalancing, replication, and
other requirements. Furthermore, many benefits of the relational database, such as
transactional integrity, are compromised or eliminated when employing manual sharding.

NoSQL databases, on the other hand, usually support auto-sharding, meaning that they
natively and automatically spread data across an arbitrary number of servers, without
requiring the application to even be aware of the composition of the server pool. Data and
query load are automatically balanced across servers, and when a server goes down, it can be
quickly and transparently replaced with no application disruption.

3. Cloud computing makes this significantly easier, with providers such as Amazon Web
Services providing virtually unlimited capacity on demand, and taking care of all the
necessary infrastructure administration tasks. Developers no longer need to construct
complex, expensive platforms to support their applications, and can concentrate on writing
application code. Commodity servers can provide the same processing and storage
capabilities as a single high-end server for a fraction of the price.

4. Replication

Most NoSQL databases also support automatic database replication to maintain availability in
the event of outages or planned maintenance events. More sophisticated NoSQL databases
are fully self-healing, offering automated failover and recovery, as well as the ability to
distribute the database across multiple geographic regions to withstand regional failures and
enable data localization. Unlike relational databases, NoSQL databases generally have no
requirement for separate applications or expensive add-ons to implement replication.

5. Integrated Caching

A number of products provide a caching tier for SQL database systems. These systems can
improve read performance substantially, but they do not improve write performance, and they
add operational complexity to system deployments. If your application is dominated by reads
then a distributed cache could be considered, but if your application has just a modest write
volume, then a distributed cache may not improve the overall experience of your end users,
and will add complexity in managing cache invalidation.

Many NoSQL database technologies have excellent integrated caching capabilities, keeping frequently used data in system memory as much as possible and removing the need for a separate caching layer. Some NoSQL databases also offer a fully managed, integrated in-memory database management layer for workloads demanding the highest throughput and lowest latency.
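For comparison, the sketch below shows the kind of cache-aside logic that a separate caching tier (here Redis via the Jedis client) typically requires around a SQL read path; the key name, the 5-minute TTL and the stand-in loader are assumptions for illustration. Integrated caching in a NoSQL database removes the need for this extra layer and its invalidation logic.

import redis.clients.jedis.Jedis;

public class CacheAsideExample {
    // Hypothetical loader standing in for a slow SQL query.
    static String loadProfileFromSql(String userId) {
        return "{\"id\":\"" + userId + "\",\"name\":\"Asha\"}";
    }

    public static void main(String[] args) {
        try (Jedis cache = new Jedis("localhost", 6379)) { // assumed local Redis
            String key = "profile:42";
            String profile = cache.get(key);          // 1. try the cache first
            if (profile == null) {                    // 2. cache miss: hit the database
                profile = loadProfileFromSql("42");
                cache.setex(key, 300, profile);       // 3. populate the cache with a 5 min TTL
            }
            System.out.println(profile);
        }
    }
}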

NoSQL vs. SQL Summary

Types
  SQL: One type (SQL database) with minor variations.
  NoSQL: Many different types, including key-value stores, document databases, wide-column stores, and graph databases.

Development History
  SQL: Developed in the 1970s to deal with the first wave of data storage applications.
  NoSQL: Developed in the late 2000s to deal with limitations of SQL databases, especially scalability, multi-structured data, geo-distribution and agile development sprints.

Examples
  SQL: MySQL, Postgres, Microsoft SQL Server, Oracle Database.
  NoSQL: MongoDB, Cassandra, HBase, Neo4j.

Data Storage Model
  SQL: Individual records (e.g., 'employees') are stored as rows in tables, with each column storing a specific piece of data about that record (e.g., 'manager,' 'date hired,' etc.), much like a spreadsheet. Related data is stored in separate tables, and then joined together when more complex queries are executed. For example, 'offices' might be stored in one table, and 'employees' in another. When a user wants to find the work address of an employee, the database engine joins the 'employee' and 'office' tables together to get all the information necessary.
  NoSQL: Varies based on database type. For example, key-value stores function similarly to SQL databases, but have only two columns ('key' and 'value'), with more complex information sometimes stored as BLOBs within the 'value' columns. Document databases do away with the table-and-row model altogether, storing all relevant data together in a single 'document' in JSON, XML, or another format, which can nest values hierarchically.

Schemas
  SQL: Structure and data types are fixed in advance. To store information about a new data item, the entire database must be altered, during which time the database must be taken offline.
  NoSQL: Typically dynamic, with some enforcing data validation rules. Applications can add new fields on the fly, and unlike SQL table rows, dissimilar data can be stored together as necessary. For some databases (e.g., wide-column stores), it is somewhat more challenging to add new fields dynamically.

Scaling
  SQL: Vertically, meaning a single server must be made increasingly powerful in order to deal with increased demand. It is possible to spread SQL databases over many servers, but significant additional engineering is generally required, and core relational features such as JOINs, referential integrity and transactions are typically lost.
  NoSQL: Horizontally, meaning that to add capacity, a database administrator can simply add more commodity servers or cloud instances. The database automatically spreads data across servers as necessary.

Development Model
  SQL: Mix of open source (e.g., Postgres, MySQL) and closed source (e.g., Oracle Database).
  NoSQL: Open source.

Supports Transactions
  SQL: Yes, updates can be configured to complete entirely or not at all.
  NoSQL: In certain circumstances and at certain levels (e.g., document level vs. database level).

Data Manipulation
  SQL: Specific language using Select, Insert, and Update statements, e.g. SELECT fields FROM table WHERE ...
  NoSQL: Through object-oriented APIs.

Consistency
  SQL: Can be configured for strong consistency.
  NoSQL: Depends on product. Some provide strong consistency (e.g., MongoDB, with tunable consistency for reads) whereas others offer eventual consistency (e.g., Cassandra).
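The "Data Manipulation" row can be illustrated with a small, hedged sketch: the same lookup expressed once as a SQL statement over JDBC and once through the MongoDB Java driver's object-oriented API. The connection strings, credentials, and the 'customers' table and collection are assumptions made for the example.

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import org.bson.Document;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class QueryContrast {
    public static void main(String[] args) throws Exception {
        // SQL: a declarative statement sent as text (connection details assumed).
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/shop", "user", "password");
             PreparedStatement ps = conn.prepareStatement(
                 "SELECT name, phone FROM customers WHERE city = ?")) {
            ps.setString(1, "Pune");
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getString("name") + " " + rs.getString("phone"));
                }
            }
        }

        // NoSQL: the same lookup through an object-oriented driver API.
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> customers =
                    client.getDatabase("shop").getCollection("customers");
            for (Document d : customers.find(Filters.eq("city", "Pune"))) {
                System.out.println(d.getString("name") + " " + d.getString("phone"));
            }
        }
    }
}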

Implementing a NoSQL Database within Education

Often, organizations will begin with a small-scale trial of a NoSQL database in their
organization, which makes it possible to develop an understanding of the technology in a
low-stakes way. Most NoSQL databases are also open-source, meaning that they can be
downloaded, implemented and scaled at little cost. Because development cycles are faster,
organizations can also innovate more quickly and deliver superior customer experience at a
lower cost.

As you consider alternatives to legacy infrastructures, you may have several motivations: to
scale or perform beyond the capabilities of your existing system, identify viable alternatives
to expensive proprietary software, or increase the speed and agility of development. When
selecting the right database for your business and application, there are five important
dimensions to consider.

Big Data Technologies

Big data technologies are important in providing more accurate analysis, which may lead to
more concrete decision-making resulting in greater operational efficiencies, cost reductions,
and reduced risks for the business.

To harness the power of big data, you would require an infrastructure that can manage and process huge volumes of structured and unstructured data in real time and can protect data privacy and security.

There are various technologies in the market from different vendors including Amazon, IBM,
Microsoft, etc., to handle big data. While looking into the technologies that handle big data,
we examine the following two classes of technology:
Operational Big Data

This includes systems like MongoDB that provide operational capabilities for real-time, interactive workloads where data is primarily captured and stored.

NoSQL Big Data systems are designed to take advantage of new cloud computing
architectures that have emerged over the past decade to allow massive computations to be run
inexpensively and efficiently. This makes operational big data workloads much easier to
manage, cheaper, and faster to implement.

Some NoSQL systems can provide insights into patterns and trends based on real-time data
with minimal coding and without the need for data scientists and additional infrastructure.

Analytical Big Data

This includes systems like Massively Parallel Processing (MPP) database systems and
MapReduce that provide analytical capabilities for retrospective and complex analysis that
may touch most or all of the data.

MapReduce provides a new method of analyzing data that is complementary to the capabilities provided by SQL, and a system based on MapReduce can be scaled up from single servers to thousands of high- and low-end machines.

These two classes of technology are complementary and frequently deployed together.

Operational vs. Analytical Systems


                 Operational          Analytical
Latency          1 ms - 100 ms        1 min - 100 min
Concurrency      1000 - 100,000       1 - 10
Access Pattern   Writes and Reads     Reads
Queries          Selective            Unselective
Data Scope       Operational          Retrospective
End User         Customer             Data Scientist
Technology       NoSQL                MapReduce, MPP Database

Traditional Approach

In this approach, an enterprise will have a computer to store and process big data. Here, data will be stored in an RDBMS like Oracle Database, MS SQL Server or DB2, and sophisticated software can be written to interact with the database, process the required data and present it to the users for analysis purposes.

Limitation

This approach works well for smaller volumes of data that can be accommodated by standard database servers, or up to the limit of the processor that is processing the data. But when it comes to dealing with huge amounts of data, it is a really tedious task to process such data through a traditional database server.

Google's Solution

Google solved this problem using an algorithm called MapReduce. This algorithm divides the task into small parts and assigns those parts to many computers connected over the network, and collects the results to form the final result dataset.

The diagram referenced above shows various commodity machines, which could be single-CPU machines or servers with higher capacity.

Hadoop

Doug Cutting, Mike Cafarella and team took the solution provided by Google and started an
Open Source Project called HADOOP in 2005 and Doug named it after his son's toy
elephant. Now Apache Hadoop is a registered trademark of the Apache Software Foundation.

Hadoop runs applications using the MapReduce algorithm, where the data is processed in parallel on different CPU nodes. In short, the Hadoop framework is capable of developing applications that run on clusters of computers and perform complete statistical analysis of huge amounts of data.

Hadoop is an Apache open source framework written in Java that allows distributed processing of large datasets across clusters of computers using simple programming models. A Hadoop framework application works in an environment that provides distributed storage and computation across clusters of computers. Hadoop is designed to scale up from a single server to thousands of machines, each offering local computation and storage.

Hadoop Architecture

The Hadoop framework includes the following four modules:

Hadoop Common: These are Java libraries and utilities required by other Hadoop modules. These libraries provide filesystem and OS level abstractions and contain the necessary Java files and scripts required to start Hadoop.

Hadoop YARN: This is a framework for job scheduling and cluster resource management.

Hadoop Distributed File System (HDFS): A distributed file system that provides high-throughput access to application data.

Hadoop MapReduce: This is a YARN-based system for parallel processing of large data sets.

The following diagram depicts these four components of the Hadoop framework.
Since 2012, the term "Hadoop" often refers not just to the base modules mentioned above but
also to the collection of additional software packages that can be installed on top of or
alongside Hadoop, such as Apache Pig, Apache Hive, Apache HBase, Apache Spark etc.

MapReduce

Hadoop MapReduce is a software framework for easily writing applications which process big amounts of data in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.

The term MapReduce actually refers to the following two different tasks that Hadoop
programs perform:

The Map Task: This is the first task, which takes input data and converts it into a set of data, where individual elements are broken down into tuples (key/value pairs).

The Reduce Task: This task takes the output from a map task as input and combines those data tuples into a smaller set of tuples. The reduce task is always performed after the map task.
Typically both the input and the output are stored in a file system. The framework takes care of scheduling tasks, monitoring them and re-executing the failed tasks.
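A minimal sketch of these two tasks, using the classic word-count example and the org.apache.hadoop.mapreduce Java API (assuming the Hadoop client libraries are on the classpath): the map task emits (word, 1) tuples and the reduce task combines them into per-word counts.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountTasks {

    // Map task: break each input line into (word, 1) key/value pairs.
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce task: combine the tuples for each word into a single count.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }
}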

The MapReduce framework consists of a single master JobTracker and one slave TaskTracker per cluster node. The master is responsible for resource management, tracking resource consumption/availability and scheduling the jobs' component tasks on the slaves, monitoring them and re-executing the failed tasks. The slave TaskTrackers execute the tasks as directed by the master and provide task-status information to the master periodically.

The JobTracker is a single point of failure for the Hadoop MapReduce service, which means that if the JobTracker goes down, all running jobs are halted.

Hadoop Distributed File System

Hadoop can work directly with any mountable distributed file system such as Local FS,
HFTP FS, S3 FS, and others, but the most common file system used by Hadoop is the
Hadoop Distributed File System (HDFS).

The Hadoop Distributed File System (HDFS) is based on the Google File System (GFS) and
provides a distributed file system that is designed to run on large clusters (thousands of
computers) of small computer machines in a reliable, fault-tolerant manner.

HDFS uses a master/slave architecture in which the master consists of a single NameNode that manages the file system metadata, and one or more slave DataNodes store the actual data.

A file in an HDFS namespace is split into several blocks and those blocks are stored in a set of DataNodes. The NameNode determines the mapping of blocks to the DataNodes. The DataNodes take care of read and write operations on the file system. They also take care of block creation, deletion and replication based on instructions given by the NameNode.

HDFS provides a shell like any other file system, and a list of commands is available to interact with it. These shell commands will be covered in a separate chapter along with appropriate examples.
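Besides the shell, HDFS can also be reached programmatically through the FileSystem Java API. The sketch below is illustrative only; the NameNode address and the paths are assumptions, and in a real cluster the fs.defaultFS setting would normally come from the cluster configuration files rather than from code.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBasics {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed NameNode address for the example.
        conf.set("fs.defaultFS", "hdfs://namenode-host:9000");

        try (FileSystem fs = FileSystem.get(conf)) {
            Path dir = new Path("/user/demo");
            fs.mkdirs(dir);

            // Copy a local file into HDFS; the framework splits it into
            // blocks and replicates them across DataNodes.
            fs.copyFromLocalFile(new Path("file:///tmp/sample.txt"),
                                 new Path("/user/demo/sample.txt"));

            // List the directory, similar to the 'hdfs dfs -ls' shell command.
            for (FileStatus status : fs.listStatus(dir)) {
                System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
            }
        }
    }
}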

How Does Hadoop Work?

Stage 1

A user/application can submit a job to Hadoop (via a Hadoop job client) for processing by specifying the following items:
The location of the input and output files in the distributed file system.

The Java classes, in the form of a JAR file, containing the implementation of the map and reduce functions.

The job configuration, set through different parameters specific to the job (see the driver sketch below).
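These three items correspond roughly to the driver sketch below, written against the org.apache.hadoop.mapreduce API. It is illustrative only: it assumes the mapper and reducer classes from the earlier word-count sketch and takes the input and output HDFS paths as command-line arguments.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");

        // The JAR containing the map and reduce implementations.
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountTasks.TokenizerMapper.class);
        job.setReducerClass(WordCountTasks.IntSumReducer.class);

        // Job configuration parameters.
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Input and output locations in the distributed file system.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}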

Stage 2

The Hadoop job client then submits the job (JAR/executable etc.) and configuration to the JobTracker, which then assumes the responsibility of distributing the software/configuration to the slaves, scheduling tasks and monitoring them, and providing status and diagnostic information to the job client.

Stage 3

The TaskTrackers on different nodes execute the tasks as per the MapReduce implementation, and the output of the reduce function is stored in output files on the file system.

Advantages of Hadoop

The Hadoop framework allows the user to quickly write and test distributed systems. It is efficient, and it automatically distributes the data and work across the machines, in turn utilizing the underlying parallelism of the CPU cores.

Hadoop does not rely on hardware to provide fault tolerance and high availability (FTHA); rather, the Hadoop library itself has been designed to detect and handle failures at the application layer.

Servers can be added or removed from the cluster dynamically and Hadoop
continues to operate without interruption.

Another big advantage of Hadoop is that apart from being open source, it is compatible with all platforms since it is Java-based.

Hadoop Operation Modes


Once you have downloaded Hadoop, you can operate your Hadoop cluster in one of the three
supported modes:

Local/Standalone Mode: After downloading Hadoop to your system, by default it is configured in standalone mode and can be run as a single Java process.

Pseudo-Distributed Mode: This is a distributed simulation on a single machine. Each Hadoop daemon, such as HDFS, YARN and MapReduce, runs as a separate Java process. This mode is useful for development.

Fully Distributed Mode: This mode is fully distributed, with a minimum of two or more machines as a cluster. We will come across this mode in detail in the coming chapters.

Hadoop - HDFS Overview


The Hadoop File System was developed using a distributed file system design. It runs on commodity hardware. Unlike other distributed systems, HDFS is highly fault-tolerant and designed using low-cost hardware.

HDFS holds very large amounts of data and provides easy access. To store such huge data, the files are stored across multiple machines. These files are stored in a redundant fashion to protect the system from possible data loss in case of failure. HDFS also makes applications available for parallel processing.

Features of HDFS

It is suitable for distributed storage and processing.

Hadoop provides a command interface to interact with HDFS.

The built-in servers of the namenode and datanode help users easily check the status of the cluster.

Streaming access to file system data.

HDFS provides file permissions and authentication.

HDFS Architecture

Given below is the architecture of a Hadoop File System.


HDFS follows the master-slave architecture and it has the following elements.

Namenode

The namenode is the commodity hardware that contains the GNU/Linux operating system
and the namenode software. It is a software that can be run on commodity hardware. The
system having the namenode acts as the master server and it does the following tasks:

Manages the file system namespace.

Regulates clients' access to files.

It also executes file system operations such as renaming, closing, and opening
files and directories.

Datanode

The datanode is commodity hardware running the GNU/Linux operating system and the datanode software. For every node (commodity hardware/system) in a cluster, there will be a datanode. These nodes manage the data storage of their system.

Datanodes perform read-write operations on the file systems, as per client request.

They also perform operations such as block creation, deletion, and replication
according to the instructions of the namenode.
Block

Generally, user data is stored in the files of HDFS. A file in the file system is divided into one or more segments, which are stored in individual DataNodes. These file segments are called blocks. In other words, the minimum amount of data that HDFS can read or write is called a block. The default block size is 64 MB, but it can be changed in the HDFS configuration as needed. For example, with the default block size a 200 MB file would be split into three full 64 MB blocks plus one 8 MB block.

Goals of HDFS

Fault detection and recovery: Since HDFS includes a large amount of commodity hardware, failure of components is frequent. Therefore HDFS should have mechanisms for quick and automatic fault detection and recovery.

Huge datasets: HDFS should have hundreds of nodes per cluster to manage applications having huge datasets.

Hardware at data: A requested task can be done efficiently when the computation takes place near the data. Especially where huge datasets are involved, this reduces network traffic and increases throughput.

Hadoop MapReduce

MapReduce is a framework using which we can write applications to process huge amounts
of data, in parallel, on large clusters of commodity hardware in a reliable manner.

What is MapReduce?

MapReduce is a processing technique and a programming model for distributed computing based on Java. The MapReduce algorithm contains two important tasks, namely Map and Reduce. Map takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs). The reduce task then takes the output from a map as input and combines those data tuples into a smaller set of tuples. As the sequence of the name MapReduce implies, the reduce task is always performed after the map task.

The major advantage of MapReduce is that it is easy to scale data processing over multiple
computing nodes. Under the MapReduce model, the data processing primitives are called
mappers and reducers. Decomposing a data processing application into mappers and reducers
is sometimes nontrivial. But, once we write an application in the MapReduce form, scaling
the application to run over hundreds, thousands, or even tens of thousands of machines in a
cluster is merely a configuration change. This simple scalability is what has attracted many
programmers to use the MapReduce model.

The Algorithm

Generally, the MapReduce paradigm is based on sending the computation to where the data resides.

A MapReduce program executes in three stages, namely the map stage, the shuffle stage, and the reduce stage.

Map stage: The map or mapper's job is to process the input data. Generally the input data is in the form of a file or directory and is stored in the Hadoop file system (HDFS). The input file is passed to the mapper function line by line. The mapper processes the data and creates several small chunks of data.

Reduce stage: This stage is the combination of the Shuffle stage and the Reduce stage. The Reducer's job is to process the data that comes from the mapper. After processing, it produces a new set of output, which will be stored in the HDFS.

During a MapReduce job, Hadoop sends the Map and Reduce tasks to the
appropriate servers in the cluster.

The framework manages all the details of data-passing such as issuing tasks,
verifying task completion, and copying data around the cluster between the nodes.

Most of the computing takes place on nodes with data on local disks, which reduces network traffic.

After completion of the given tasks, the cluster collects and reduces the data to
form an appropriate result, and sends it back to the Hadoop server.
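The three stages can be imitated locally, without a cluster, by the small self-contained Java sketch below (illustrative only; real MapReduce runs the same logic distributed across many nodes): the map step emits (word, 1) pairs, the shuffle step groups them by key, and the reduce step sums each group.

import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class LocalMapReduceDemo {
    public static void main(String[] args) {
        List<String> lines = Arrays.asList("the cat sat", "the cat ran");

        // Map stage: each line is broken into (word, 1) pairs.
        List<Map.Entry<String, Integer>> mapped = lines.stream()
            .flatMap(line -> Arrays.stream(line.split("\\s+")))
            .map(word -> Map.entry(word, 1))
            .collect(Collectors.toList());

        // Shuffle stage: pairs are grouped by key.
        Map<String, List<Integer>> shuffled = mapped.stream()
            .collect(Collectors.groupingBy(Map.Entry::getKey,
                     Collectors.mapping(Map.Entry::getValue, Collectors.toList())));

        // Reduce stage: values for each key are summed.
        Map<String, Integer> reduced = new TreeMap<>();
        shuffled.forEach((word, counts) ->
            reduced.put(word, counts.stream().mapToInt(Integer::intValue).sum()));

        System.out.println(reduced); // {cat=2, ran=1, sat=1, the=2}
    }
}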

Relationship between Hadoop and big data


Basically, if we want to understand the difference between big data and Hadoop, we need to focus on what big data and Hadoop actually are. There is a huge difference in people's understanding of what Hadoop is and what big data is, because there is a lot of confusion about both. Professionals also get confused when they are asked to define big data and Hadoop. Let's first define Hadoop and big data in detail.

Big data as a term has a broad meaning and can be described in a number of ways, but essentially big data means data sets that are so large or complex that conventional data processing applications are not appropriate. The challenges every professional faces are analysis, capture, data curation, search, sharing, storage, transfer, visualization, querying, updating and information privacy. The term often refers simply to the use of analytics, which can be predictive or use certain other advanced methods to extract value from data, and sort it into a particular size of data set. Big data should be accurate so that it leads to more confident decision making, and better decisions can result in greater operational efficiency, cost reduction and reduced risk.

Now let's talk about Hadoop: what Hadoop actually is and how it is impacting today's data world. Hadoop is a free, Java-based programming framework that supports the processing of large data sets in a distributed computing environment. It is part of the Apache project sponsored by the Apache Software Foundation. The use of Hadoop makes it possible to run applications on systems with thousands of nodes involving thousands of terabytes. Its distributed file system helps in rapid data transfer rates among nodes and allows the system to continue operating uninterrupted in case of a node failure. This approach lowers the risk of catastrophic system failure, even if a significant number of nodes become inoperative. Hadoop is based on Google's MapReduce, a software framework in which an application is broken down into a large number of small parts. Any of these parts can be run on any node in the cluster. It was named after the creator's child's stuffed toy elephant. The current Hadoop ecosystem consists of the Hadoop kernel, MapReduce, the Hadoop distributed file system (HDFS) and a number of related projects. The Hadoop framework is used by major companies including Google, Yahoo and IBM, largely for applications involving search engines and advertising. The preferred operating systems are Windows and Linux, but Hadoop can also work with BSD and OS X.

As we have discussed what Hadoop is and what big data is, now let's talk about the difference between Hadoop and big data and how they are differentiated from each other. What are the major points on which we can focus to represent the difference between Hadoop and big data? Big data is nothing but a concept which facilitates handling large data sets. Hadoop is just a single framework out of dozens of tools, and it is primarily used for batch processing. The difference between big data and the open source software Hadoop is a distinct and fundamental one. The former is an asset, often a complex one with many interpretations, while the latter is a program that accomplishes a set of goals and objectives. Big data is simply the large sets of data that businesses and other parties put together to serve specific goals and operations. Big data can include many different kinds of data in many different kinds of formats. For example, a business might put a lot of work into collecting thousands of pieces of data on purchases in currency formats; there can be many identifiers like name and a special number, or there can be information about product, sales and inventory.

BIG DATA & CLOUD COMPUTING

The concept of big data became a major force of innovation across both academia and corporations. The paradigm is viewed as an effort to understand and get proper insights from big datasets (big data analytics), providing summarized information over huge data loads. As such, this paradigm is regarded by corporations as a tool to understand their clients, to get closer to them, and to find patterns and predict trends. Furthermore, big data is viewed by scientists as a means to store and process huge scientific datasets. This concept is a hot topic and is expected to continue to grow in popularity in the coming years. Although big data is mostly associated with the storage of huge loads of data, it also concerns ways to process and extract knowledge from it (Hashem et al., 2014). The five different aspects used to describe big data (commonly referred to as the five Vs) are Volume, Variety, Velocity, Value and Veracity (Sakr & Gaber, 2014):

Volume describes the size of the datasets that a big data system deals with. Processing and storing big volumes of data is rather difficult, since it concerns: scalability, so that the system can grow; availability, which guarantees access to data and ways to perform operations over it; and bandwidth and performance.

Variety concerns the different types of data from various sources that big data frameworks have to deal with.

Velocity concerns the different rates at which data streams may get in or out of the system, and provides an abstraction layer so that big data systems can store data independently of the incoming or outgoing rate.

Value concerns the true value of data (i.e., the potential value of the data regarding the information it contains). Huge amounts of data are worthless unless they provide value.

Veracity refers to the trustworthiness of the data, addressing data confidentiality, integrity, and availability. Organizations need to ensure the trustworthiness of the data as well as of the analyses performed on it.

Cloud computing is another paradigm which promises theoretically unlimited on-demand services to its users. The cloud's ability to virtualize resources allows abstracting hardware, requiring little interaction with cloud service providers and enabling users to access terabytes of storage, high processing power, and high availability in a pay-as-you-go model (González-Martínez et al., 2015). Moreover, it transfers cost and responsibilities from the user to the cloud provider, boosting small enterprises, for which getting started in the IT business represents a large endeavour, since the initial IT setup takes a big effort as the company has to consider the total cost of ownership (TCO), including hardware expenses, software licenses, IT personnel and infrastructure maintenance. Cloud computing provides an easy way to get resources on a pay-as-you-go model, offering scalability and availability, meaning that companies can easily negotiate resources with the cloud provider as required.

Cloud providers usually offer three different basic services: Infrastructure as a Service (IaaS), Platform as a Service (PaaS), and Software as a Service (SaaS):

IaaS delivers infrastructure, which means storage, processing power, and virtual machines. The cloud provider satisfies the needs of the client by virtualizing resources according to the service level agreements (SLAs).

PaaS is built atop IaaS and allows users to deploy cloud applications created using the programming and run-time environments supported by the provider. It is at this level that big data DBMSs are implemented.

SaaS is one of the best-known cloud models and consists of applications running directly on the cloud provider's infrastructure.

These three basic services are closely related: SaaS is developed over PaaS, and PaaS is ultimately built atop IaaS. From the general cloud services, other services such as Database as a Service (DBaaS) (Oracle, 2012), Big Data as a Service (BDaaS) and Analytics as a Service (AaaS) arose. Since the cloud virtualizes resources in an on-demand fashion, it is the most suitable and compliant framework for big data processing, which through hardware virtualization creates a high processing power environment for big data.

BIG DATA IN THE CLOUD

Storing and processing big volumes of data requires scalability, fault tolerance and availability. Cloud computing delivers all these through hardware virtualization. Thus, big data and cloud computing are two compatible concepts, as the cloud enables big data to be available, scalable and fault tolerant. Businesses regard big data as a valuable business opportunity. As such, several new companies, such as Cloudera, Hortonworks, Teradata and many others, have started to focus on delivering Big Data as a Service (BDaaS) or DataBase as a Service (DBaaS). Companies such as Google, IBM, Amazon and Microsoft also provide ways for consumers to consume big data on demand. Next, we present two examples, Nokia and RedBus, which discuss the successful use of big data within cloud environments.

3.1 Nokia

Nokia was one of the first companies to understand the advantage of big data in cloud environments (Cloudera, 2012). Several years ago, the company used individual DBMSs to accommodate each application requirement. However, realizing the advantages of integrating data into one application, the company decided to migrate to Hadoop-based systems, integrating data within the same domain and leveraging the use of analytics algorithms to get proper insights into its clients. As Hadoop uses commodity hardware, the cost per terabyte of storage was cheaper than for a traditional RDBMS (Cloudera, 2012). Since Cloudera Distributed Hadoop (CDH) bundles the most popular open source projects in the Apache Hadoop stack into a single, integrated package, with stable and reliable releases, it embodies a great opportunity for implementing Hadoop infrastructures and transferring IT and technical concerns onto the vendor's specialized teams. Nokia regarded Big Data as a Service (BDaaS) as an advantage and trusted Cloudera to deploy a Hadoop environment that copes with its requirements in a short time frame. Hadoop, and in particular CDH, strongly helped Nokia to fulfil its needs (Cloudera, 2012).

3.2 RedBus

RedBus is the largest company in India specialized in online bus ticket and hotel booking. This company wanted to implement a powerful data analysis tool to gain insights into its bus booking service (Kumar, 2006). Its datasets could easily stretch up to 2 terabytes in size. The application would have to be able to analyse booking and inventory data across hundreds of bus operators serving more than 10,000 routes. Furthermore, the company needed to avoid setting up and maintaining a complex in-house infrastructure. At first, RedBus considered implementing in-house clusters of Hadoop servers to process data. However, they soon realized it would take too much time to set up such a solution and that it would require specialized IT teams to maintain such an infrastructure. The company then regarded Google BigQuery as the perfect match for their needs, allowing them to:

Know how many times consumers tried to find an available seat but were unable to do so due to bus overload;

Examine decreases in bookings;

Quickly identify server problems by analysing data related to server activity.

Moving towards big data brought RedBus business advantages. Google BigQuery armed RedBus with real-time data analysis capabilities at 20% of the cost of maintaining a complex Hadoop infrastructure (Kumar, 2006). As supported by the Nokia and RedBus examples, switching towards big data enables organizations to gain competitive advantage. Additionally, BDaaS provided by big data vendors allows companies to leave the technical details to the big data vendors and focus on their core business needs.

The rise of cloud computing and cloud data stores has been a precursor and facilitator to the emergence of big data. Cloud computing is the commodification of computing time and data storage by means of standardized technologies.

It has significant advantages over traditional physical deployments. However, cloud platforms come in several forms and sometimes have to be integrated with traditional architectures.

This leads to a dilemma for decision makers in charge of big data projects: which kind of cloud computing is the optimal choice for their computing needs, especially for a big data project, and how should it be adopted? These projects regularly exhibit unpredictable, bursting, or immense computing power and storage needs. At the same time, business stakeholders expect swift, inexpensive, and dependable products and project outcomes. This article introduces cloud computing and cloud storage, the core cloud architectures, and discusses what to look for and how to get started with cloud computing.

CLOUD PROVIDERS

A decade ago, an IT project or start-up that needed reliable and Internet-connected computing resources had to rent or place physical hardware in one or several data centers. Today, anyone can rent computing time and storage of any size. The range starts with virtual machines barely powerful enough to serve web pages and extends to the equivalent of a small supercomputer. Cloud services are mostly pay-as-you-go, which means that for a few hundred dollars anyone can enjoy a few hours of supercomputer power. At the same time, cloud services and resources are globally distributed. This setup ensures a high availability and durability unattainable by all but the largest organizations.

The cloud computing space has been dominated by Amazon Web Services until recently. Increasingly serious alternatives are emerging, such as Google Cloud Platform, Microsoft Azure, Rackspace, or Qubole, to name only a few. Importantly for customers, a struggle over platform standards is underway. The two front-running solutions are Amazon Web Services compatible solutions, i.e. Amazon's own offering or companies with API-compatible offerings, and OpenStack, an open source project with wide industry backing. Consequently, the choice of a cloud platform standard has implications for which tools are available and which alternative providers with the same technology are available.

CLOUD STORAGE

Professional cloud storage needs to be highly available, highly durable, and has to scale from a few bytes to petabytes. Amazon's S3 cloud storage and Microsoft Azure Blob Storage are the most prominent solutions in the space. They promise in the range of 99.9% monthly availability and 99.999999999% durability per year; the former corresponds to less than an hour of outage per month. The durability can be illustrated with an example: if a customer stores 10,000 objects, he can expect to lose one object every 10,000,000 years on average. They achieve this by storing data in multiple facilities, with error checking and self-healing processes to detect and repair errors and device failures. This is completely transparent to the user and requires no actions or knowledge.

A company could build and achieve a similarly reliable storage solution, but it would require tremendous capital expenditure and pose operational challenges. Global data-centered companies like Google or Facebook have the expertise and scale to do this economically. Big data projects and start-ups, however, benefit from using a cloud storage service. They can trade a capital expenditure for an operational one, which is excellent since it requires no capital outlay or risk. From the first byte, it provides reliable and scalable storage of a quality otherwise unachievable.

This enables new products and projects with a viable option to start on a small scale with low
costs. When a product proves successful these storage solutions scale virtually indefinitely.
Cloud storage is effectively a boundless data sink. Importantly for computing performance,
many solutions also scale horizontally: when data is read or written in parallel by cluster or
parallel computing processes, throughput scales linearly with the number of nodes reading or
writing.
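
As a minimal, hedged illustration of using such an object store programmatically, the
following Python sketch uploads and then downloads a file with the boto3 client for S3; the
bucket and key names are hypothetical and credentials are assumed to be configured in the
environment.

# Minimal sketch: writing to and reading from S3-style cloud storage with boto3.
# Assumes boto3 is installed and AWS credentials are configured; bucket/key names are made up.
import boto3

s3 = boto3.client("s3")

# Upload a local file; the object becomes durably stored across multiple facilities.
s3.upload_file("events.log", "example-bigdata-bucket", "raw/2015/events.log")

# Download it again; many parallel readers could do this at once to scale throughput.
s3.download_file("example-bigdata-bucket", "raw/2015/events.log", "events_copy.log")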

CLOUD COMPUTING

Cloud computing employs virtualization of computing resources to run numerous
standardized virtual servers on the same physical machine. With this, cloud providers achieve
economies of scale, which permit low prices and billing in small time intervals, e.g.
hourly.

This standardization makes it an elastic and highly available option for computing needs. The
availability is not obtained by spending resources to guarantee the reliability of a single
instance but by the interchangeability of instances and a practically limitless pool of
replacements. This impacts design decisions and requires dealing with instance failure
gracefully.

The implications for an IT project or company using cloud computing are significant and
change the traditional approach to planning and utilization of resources. Firstly, resource
planning becomes less important; it is still required for costing scenarios to establish the
viability of a project or product. Instead, deploying and removing resources automatically
based on demand becomes the focus of a successful setup. Vertical and horizontal scaling
become viable once a resource is easily deployable.

Vertical scaling refers to the ability to replace a single small computing resource with a
bigger one to account for increased demand. Cloud computing supports this by making
various resource types available and letting users switch between them. This also works in
the opposite direction, i.e. switching to a smaller and cheaper instance type when demand
decreases. Since cloud resources are commonly paid for on a usage basis, no sunk costs or
capital expenditures block fast decision making and adaptation. Demand is difficult to
anticipate despite planning efforts, so traditional projects naturally end up over- or
under-provisioning resources. Therefore, traditional projects tend to waste money or deliver
poor outcomes.
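
The following short Python sketch illustrates the kind of demand-driven sizing decision
described above; the capacity figure, bounds, and function name are illustrative assumptions,
not a provider API.

import math

def target_instance_count(requests_per_sec, capacity_per_instance=200.0,
                          min_instances=1, max_instances=50):
    """Pick how many identical instances to run for the measured demand."""
    needed = math.ceil(requests_per_sec / capacity_per_instance)
    # Clamp to user-defined bounds so the cluster never disappears or explodes in size.
    return max(min_instances, min(max_instances, needed))

# Example: a measured load of 1,750 requests/s leads to 9 instances.
print(target_instance_count(1750))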

Cloud Big Data Challenges

Horizontal scaling achieves elasticity by adding additional instances, each of them serving a
part of the demand. Software like Hadoop is specifically designed as a distributed system to
take advantage of horizontal scaling. It processes small independent tasks at massive parallel
scale. Distributed systems can also serve as data stores, like NoSQL databases, e.g. Cassandra
or HBase, or filesystems like Hadoop's HDFS. Alternatives like Storm provide coordinated
stream data processing in near real-time through a cluster of machines with complex
workflows.

The interchangeability of the resources together with distributed software design absorbs
failures and, equally, the scaling of virtual computing instances, without disruption. Spiking
or bursting demand can be accommodated just as well as periodic peaks or continued growth.
Renting practically unlimited resources for short periods allows one-off or periodic projects
at a modest expense. Data mining and web crawling are great examples. It is conceivable to
crawl huge web sites with millions of pages in days or hours for a few hundred dollars or
less. Inexpensive tiny virtual instances with minimal CPU resources are ideal for this purpose,
since the majority of crawling time is spent waiting for IO. Instantiating thousands of these
machines to achieve millions of requests per day is easy and often costs a fraction of a cent
per instance hour.
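
Because crawling is IO-bound, even a tiny instance can keep many requests in flight at once.
The sketch below is a simplified example using only the Python standard library; it fetches a
list of made-up URLs concurrently with a thread pool.

from concurrent.futures import ThreadPoolExecutor
import urllib.request

# Hypothetical list of pages to fetch; a real crawler would read these from a queue.
URLS = ["https://example.org/page/%d" % i for i in range(1, 11)]

def fetch(url, timeout=10):
    """Download one page; threads spend most of their time waiting on network IO."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return url, len(resp.read())
    except Exception as exc:          # e.g. HTTP errors or timeouts
        return url, repr(exc)

# A small CPU can drive many concurrent downloads because the work is network-bound.
with ThreadPoolExecutor(max_workers=20) as pool:
    for url, result in pool.map(fetch, URLS):
        print(url, result)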

Of course, such mining operations should be mindful of the resources of the web sites or
application interfaces they mine, respect their terms, and not impede their service. A poorly
planned data mining operation is equivalent to a denial of service attack. Lastly, cloud
computing is naturally a good fit for storing and processing the big data accumulated from
such operations.

Cloud Architecture

Three main cloud architecture models have developed over time: private, public and hybrid
cloud. They all share the idea of resource commodification and to that end usually virtualize
computing and abstract storage layers.

PRIVATE CLOUD

Private clouds are dedicated to one organization and do not share physical resources. The
resources can be provided in-house or externally. A typical underlying driver of private
cloud deployments is security requirements and regulations that demand a strict separation of
an organization's data storage and processing from accidental or malicious access through
shared resources. Private cloud setups are challenging, since the economic advantages of
scale are usually not achievable within most projects and organizations despite the utilization
of industry standards. The return on investment, compared to public cloud offerings, is rarely
obtained, and the operational overhead and risk of failure are significant.

Additionally, cloud providers have recognized the demand for increased security and provide
special environments, i.e. dedicated hardware for rent, encrypted virtual private networks, and
encrypted storage, to address most security concerns. Cloud providers may also offer data
storage, transfer, and processing restricted to specific geographic regions to ensure
compliance with local privacy laws and regulations.

Another reason for private cloud deployments is legacy systems with special hardware needs
or exceptional resource demands, e.g. extreme memory or compute instances which are not
available in public clouds. These are valid concerns; however, if the demands are that
extraordinary, the question has to be raised whether a cloud architecture is the correct
solution at all. One option is to establish a private cloud for a transitional period to run
legacy and demanding systems in parallel while their services are ported to a cloud
environment, culminating in a switch to a cheaper public or hybrid cloud.

PUBLIC CLOUD

Public clouds share physical resources for data transfers, storage, and processing. However,
customers have private virtualized computing environments and isolated storage. Security
concerns, which entice a few to adopt private clouds or custom deployments, are irrelevant
for the vast majority of customers and projects. Virtualization makes access to other
customers' data extremely difficult.

Real-world problems around public cloud computing are more mundane, like data lock-in and
fluctuating performance of individual instances. Data lock-in is a soft measure and works by
making data inflow to the cloud provider free or very cheap, while copying data out to local
systems or other providers is often more expensive. This is not an insurmountable problem,
and in practice it encourages customers to utilize more services from one cloud provider
instead of moving data in and out for different services or processes. Usually this is not
sensible anyway due to network speed and the complexity of dealing with multiple platforms.

The varying performance of instances typically stems from the load other customers generate
on the shared physical infrastructure. Secondly, over time the physical infrastructure providing
the virtual resources changes and is updated. The available resources for each customer on a
physical machine are usually throttled to ensure that each customer receives a guaranteed
level of performance. Larger resources generally deliver very predictable performance since
they are much more closely aligned with the underlying physical machine's performance.
Horizontally scaling projects with small instances should not rely on the exact performance of
each instance, but should be adaptive, focus on the average performance required, and scale
according to need.

HYBRID CLOUD

The hybrid cloud architecture merges private and public cloud deployments. This is often an
attempt to achieve both security and elasticity, or to provide a cheaper base load and burst
capabilities. Some organizations experience short periods of extremely high load, e.g. as a
result of seasonality, like Black Friday for retail, or marketing events, like sponsoring a
popular TV event. These events can have a huge economic impact on organizations if they
are serviced poorly.

The hybrid cloud provides the opportunity to serve the base load with in-house services and
to rent, for a short period, a multiple of those resources to service the extreme demand. This
requires a great deal of operational ability in the organization to scale seamlessly between the
private and public cloud. Tools for hybrid or private cloud deployments exist, like Eucalyptus
for Amazon Web Services. In the long term the additional expense of the hybrid approach is
often not justifiable, since cloud providers offer major discounts for multi-year commitments.
This makes moving base-load services to the public cloud attractive, since it is accompanied
by a simpler deployment strategy.

Cloud and big data

Typical cloud big data projects focus on scaling or adopting Hadoop for data processing.
MapReduce has become a de facto standard for large scale data processing. Tools like Hive
and Pig have emerged on top of Hadoop which make it feasible to process huge data sets
easily. Hive, for example, transforms SQL-like queries into MapReduce jobs. It unlocks data
sets of all sizes for data and business analysts for reporting and greenfield analytics projects.
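
To illustrate what such a translation amounts to, the following plain-Python sketch
hand-writes the map and reduce steps that a query like SELECT word, COUNT(*) ... GROUP BY
word would roughly compile to; it is a conceptual example, not Hive's actual generated code.

from collections import defaultdict

lines = ["big data in the cloud", "big data on hadoop", "hive compiles sql to mapreduce"]

# Map phase: emit (key, 1) pairs, one per word.
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle phase: group values by key (Hadoop does this between map and reduce).
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce phase: aggregate each group, here a simple COUNT(*).
result = {word: sum(counts) for word, counts in groups.items()}
print(result)   # e.g. {'big': 2, 'data': 2, ...}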

Data can be either transferred to or collected in a cloud data sink like Amazon's S3 or
Microsoft Azure Blob Storage, e.g. to collect log files or export text-formatted data.
Alternatively, database adapters can be utilized to access data from databases directly with
Hadoop, Hive, and Pig. Qubole is a leading provider of cloud based services in this space.
They provide database adapters that can instantly unlock data which otherwise would be
inaccessible or require significant development resources. One example is their MongoDB
adapter: it gives Hive-table-like access to MongoDB collections. Qubole scales Hadoop jobs
to extract data as quickly as possible without overloading the MongoDB instance.

Ideally a cloud service provider offers Hadoop clusters that scale automatically with the
demand of the customer. This provides maximum performance for large jobs and optimal
savings when little or no processing is going on. Amazon Web Services' Elastic MapReduce
and Azure HDInsight, for example, allow scaling of Hadoop clusters. However, the scaling
does not happen automatically with demand and requires user action. The scaling itself is not
optimal either, since it does not utilize HDFS well and squanders Hadoop's strong point, data
locality. This means that an Elastic MapReduce cluster wastes resources when scaling and has
diminishing returns with more instances. Furthermore, Amazon's Elastic MapReduce and
HDInsight require a customer to explicitly request a cluster every time it is needed and
remove it when it is not required anymore. There is also no user-friendly interface for
interaction with or exploration of the data. This results in operational burden and excludes all
but the most proficient users.

Qubole scales and handles Hadoop clusters very differently. The clusters are managed
transparently, without any action required by the user. When no activity is taking place,
clusters are stopped and no further expenses accumulate. The Qubole system detects demand,
e.g. when a user queries Hive, and starts a new cluster if needed. It does this even faster than
Amazon raises its clusters on explicit user requests. The clusters that Qubole manages for the
user have a user-defined minimum and maximum size and scale as needed to provide the user
with optimal performance at minimal expense.

Importantly, users, whether developers, data engineers or business analysts, require an
easy-to-use graphical interface for ad hoc data analysis and for designing jobs and workflows.
Qubole provides a web interface including workflow management and querying capabilities.
Data is accessed from permanent data stores like S3 or Azure Blob Storage, or through
database connectors, using transient clusters. The pay-as-you-go billing of cloud computing
makes it easy to compare and try out such systems.

Big data and the Internet of Things: Two sides of the same coin
Read each statement below and determine if it's referring to big data or the Internet of
Things:
1. Every minute, we send 204 million emails, generate 1.8 million Facebook likes, send 278
thousand tweets, and upload 200 thousand photos to Facebook. Is this statement about big
data or the Internet of Things?
2. 12 million RFID tags (used to capture data and track movement of objects in the physical
world) were sold in 2011. By 2021, it's estimated this number will increase to 209 billion as
[big data or the Internet of Things?] takes off.
3. The boom of [big data or the Internet of Things?] will mean that the number of devices that
connect to the internet will rise from about 13 billion today to 50 billion by 2020.
4. The [big data or the Internet of Things?] industry is expected to grow from US$10.2 billion
in 2013 to about US$54.3 billion by 2017.
Here are the answers: 1. big data; 2. Internet of Things; 3. Internet of Things; 4. big data.

Big Data

Implement privacy by design.

Be transparent about what data is collected, how data is processed, for what purposes
data will be used, and whether data will be distributed to third parties.

Define the purpose of collection at the time of collection and, at all times, limit use
of the data to the defined purpose.

Obtain consent.

Collect and store only the amount of data necessary for the intended lawful purpose.

Allow individuals access to data maintained about them, information on the source
of the data, key inputs into their profile, and any algorithms used to develop their profile.

Allow individuals to correct and control their information.

Conduct a privacy impact assessment.

Consider data anonymization.

Limit and carefully control access to personal data.

Conduct regular reviews to verify if results from profiling are responsible, fair and
ethical and compatible with and proportionate to the purpose for which the profiles are being
used.

Allow for manual assessment of any algorithmic profiling outcomes with significant effects
on individuals.

Internet of Things
Self-determination is an inalienable right for all human beings.

Data obtained from connected devices is high in quantity, quality and sensitivity
and, as such, should be regarded and treated as personal data.

Those offering connected devices should be clear about what data they collect, for
what purposes and how long this data is retained.

Privacy by design should become a key selling point of innovative technologies.

Data should be processed locally, on the connected device itself. Where it is not
possible to process data locally, companies should ensure end-to-end encryption.

Data protection and privacy authorities should seek appropriate enforcement action
when the law has been breached.

All actors in the internet of things ecosystem should engage in a strong, active and
constructive debate on the implications of the internet of things and the choices to be made.

There is clearly a relationship between big data and IoT. Big data is a subset of the IoT.

Big data is about data, plain and simple. Yes, you can add all sorts of adjectives when
talking about big data, but at the end of the day, it's all data.

IoT is about data, devices, and connectivity. Data, big and small, is front and
center in the IoT world of connected devices.

Data Centres and Big Data trends

Several converging trends, such as IT consumerization, an increased number of users, more
devices and a lot more data, have pushed the storage environment to a new level. Now, these
new technologies aren't only driving the cloud; they're pushing forward all of the
technologies that support cloud computing. At the epicentre of the cloud sits the data centre.
This is the central point where all information is gathered, and then distributed to other data
centres or to the end-user.
Today's infrastructure is being tasked with supporting many more applications, users and
workloads. Because of this, the storage infrastructure of a data centre, especially one that is
cloud-facing, must be adaptable and capable of intelligent data management.
The TechTarget 2015 IT Priority Survey points out where IT budget is being spent in 2015
with regard to data centres. Overall, 61% of respondents will see their information technology
budget grow this year. When asked which initiatives their company will implement in 2015,
40% of respondents said data centre consolidation and upgrades.
Big data also got a positive vote, with 30% of respondents planning to implement this hot data
centre trend in 2015. The onset of big data means so much information is being collected that
storage will be a concern for most data centres. There are now new options for storage, such
as solid state drives, which have seen a gradual price drop in the last year. This drop could
fulfil the storage needs of big data projects without overrunning the budget.
Origin recommends specific data centre SSDs and will only use Micron, Intel or Seagate
SSDs/SAS SSDs in its data centre solutions.
With these trends in mind, many storage vendors have evolved their solutions to provide more
efficient systems capable of much more, to meet these new IT and business demands.
Solid State Drives (SSD) and flash. There is a growing argument around this technology: will
it take over all storage or is it still a niche player? The truth is that SSD and flash are really
designed to play a specific role within storage. For workloads that require very high IOPS
(input/output operations per second), for example VDI (virtual desktop infrastructure) or
database processing, working with SSD or flash systems may be the right move.
Organizations looking to offload heavy cycles from their primary spinning disks can add flash
or SSD to help carry that load. In many cases, a good array can off-load 80 to 90 percent of
the IOPS from the spinning disks that are part of the controller.

Replication. A big part of cloud computing and storage is the process of data distribution and
replication. New storage systems must be capable of not only managing data at the primary
site but also replicating that information efficiently to other locations. Why? There is a direct
need to manage branch offices, remote sites, other data centres, and of course disaster
recovery. Setting up the right replication infrastructure will mean managing bandwidth,
scheduling and what data is actually pushed out. Storage can be a powerful tool for both
cloud computing and business continuity. The key is understanding the value of your data and
identifying where that data fits in with your organization.
Data deduplication. Control over the actual data within the storage environment has always
been a big task as well. Storage resources aren't only finite, they're expensive. So, data
deduplication can help manage data that sits on the storage array as well as information being
used by other systems. For example, instead of storing 100 copies of a 20 MB attachment, the
storage array would be intelligent enough to store only one file and create 99 pointers. If a
change were made to the file, the system would be smart enough to log those changes and
create secondary pointers to a new file.
Because cloud computing will only continue to advance, there will be new demands placed
on storage. Even now, conversations around big data and storage are already heating up.
Whether it's big data, a distributed file system, cloud computing or just the user environment,
the storage infrastructure will always play an important role. The idea will always revolve
around ease of management and control over the data. In developing a solid storage platform,
always make sure to plan for the future, since data growth will be an inevitable part of today's
cloud environment.

Big Data Generation and Acquisition


We have introduced several key technologies related to big data, i.e., cloud computing, IoT,
data centers, and Hadoop. Next, we will focus on the value chain of big data, which can be
generally divided into four phases: data generation, data acquisition, data storage, and data
analysis. If we take data as a raw material, data generation and data acquisition are an
exploitation process, data storage is a storage process, and data analysis is a production
process that utilizes the raw material to create new value.
1. Big Data Generation
Data generation is the first step of big data. Specifically, it refers to the large-scale, highly
diverse, and complex datasets generated through longitudinal and distributed data sources.
Such data sources include sensors, videos, click streams, and/or all other available data
sources. At present, the main sources of big data are the operation and trading information in
enterprises, logistic and sensing information in the IoT, human interaction information and
position information in the Internet world, and data generated in scientific research, etc.
This information far surpasses the capacities of the IT architectures and infrastructures of
existing enterprises, while its real-time requirements also greatly stress existing computing
capacity.
1.1 Enterprise Data

The internal data of enterprises mainly consists of online trading data and online analysis
data, most of which is historically static data managed by RDBMSs in a structured manner. In
addition, production data, inventory data, sales data, and financial data, etc., also constitute
enterprise internal data, which aims to capture informationized and data-driven activities in
enterprises, so as to record all activities of enterprises in the form of internal data.
Over the past decades, IT and digital data have contributed a lot to improving the
profitability of business departments. It is estimated that the business data volume of all
companies in the world may double every 1.2 years [1], and that the daily business-to-business
and business-to-consumer turnover through the Internet will reach USD 450 billion [2]. The
continuously increasing business data volume requires more effective real-time analysis so as
to fully harvest its potential. For example, Amazon processes millions of terminal operations
and more than 500,000 queries from third-party sellers per day [3]. Walmart processes one
million customer transactions per hour, and such trading data is imported into a database with
a capacity of over 2.5 PB [4]. Akamai analyzes 75 million events per day for its targeted
advertisements [5].

1.2 IoT Data


As discussed, IoT is an important source of big data. In smart cities constructed based on
IoT, big data may come from industry, agriculture, traffic and transportation, medical care,
public departments, and households, etc. According to the processes of data acquisition and
transmission in IoT, its network architecture may be divided into three layers: the sensing
layer, the network layer, and the application layer. The sensing layer is responsible for data
acquisition and mainly consists of sensor networks. The network layer is responsible for
information transmission and processing, where short-range transmission may rely on sensor
networks and remote transmission depends on the Internet. Finally, the application layer
supports specific applications of IoT. According to the characteristics of IoT, the data
generated from IoT has the following features:
Large-Scale Data: in IoT, masses of data acquisition devices are deployed in a distributed
manner; they may acquire simple numeric data (e.g., location) or complex multimedia data
(e.g., surveillance video). In order to meet the demands of analysis and processing, not only
the currently acquired data but also the historical data within a certain time frame should be
stored. Therefore, data generated by IoT is characterized by large scale.
Heterogeneity: because of the variety of data acquisition devices, the acquired data also
differs in type and format, i.e. it features heterogeneity.
Strong Time and Space Correlation: in IoT, every data acquisition device is placed at a
specific geographic location and every piece of data has a time stamp. These time and space
correlations are important properties of data from IoT. During data analysis and processing,
time and space are also important dimensions for statistical analysis.
Effective Data Accounts for Only a Small Portion of the Big Data: a great quantity of noise
may occur during the acquisition and transmission of data in IoT. Among the datasets
acquired by acquisition devices, only a small amount of abnormal data is valuable. For
example, during the acquisition of traffic video, the few video frames that capture violations
of traffic regulations and traffic accidents are more valuable than those only capturing the
normal flow of traffic.
1.3 Internet Data
Internet data consists of search entries, Internet forum posts, chat records, and microblog
messages, among others, which share similar features, such as high value and low density.
Such Internet data may be valueless individually, but, through the exploitation of
accumulated big data, useful information such as the habits and hobbies of users can be
identified, and it is even possible to forecast users' behaviors and emotional moods.

1.4 Bio-medical Data
As a series of high-throughput bio-measurement technologies were innovatively developed at
the beginning of the twenty-first century, frontier research in the bio-medical field also
entered the era of big data. By constructing smart, efficient, and accurate analytical models
and theoretical systems for bio-medical applications, the essential governing mechanisms
behind complex biological phenomena may be revealed. Not only can the future development
of bio-medicine be determined, but leading roles can also be assumed in the development of
a series of important strategic industries related to the national economy, people's livelihood
and national security, with important applications such as medical care, new drug R&D, and
grain production (e.g., transgenic crops). It is predictable that, with the development of
bio-medical technologies, gene sequencing will become faster and more convenient, thus
making bio-medical big data grow continuously beyond all doubt. Apart from small and
medium-sized enterprises, well-known IT companies such as Google, Microsoft, and IBM
have invested extensively in the research and computational analysis of methods related to
high-throughput biological big data.

2 Big Data Acquisition


As the second phase of the big data system, big data acquisition includes data collection, data
transmission, and data pre-processing. During big data acquisition, once the raw data is
collected, an efficient transmission mechanism should be used to send it to a proper storage
management system to support different analytical applications. The collected datasets may
sometimes include much redundant or useless data, which unnecessarily increases storage
space and affects the subsequent data analysis. For example, high redundancy is very
common among datasets collected by sensors for environment monitoring. Data compression
techniques can be applied to reduce the redundancy. Therefore, data pre-processing
operations are indispensable to ensure efficient data storage and exploitation.
2.1 Data Collection
Data collection utilizes special techniques to acquire raw data from a specific data generation
environment. Four common data collection methods are described below.

Log Files: As one widely used data collection method, log files are record files automatically
generated by the data source system, so as to record activities in designated file formats for
subsequent analysis. Log files are used in nearly all digital devices. For example, web servers
record in log files the number of clicks, click rates, visits, and other properties of web users.
To capture the activities of users at web sites, web servers mainly use three log file formats:
the common log file format (NCSA), the extended log format (W3C), and the IIS log format
(Microsoft). All three types of log files are in ASCII text format. Databases, rather than text
files, may sometimes be used to store log information to improve the query efficiency of a
massive log store. There are also other log-based data collection applications, including stock
indicators in financial applications and the determination of operating states in network
monitoring and traffic management.
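
As a small hedged example of working with such server logs, the Python sketch below parses
a line in the NCSA common log format with a regular expression; the sample line and field
names are made up for illustration.

import re

# One made-up request line in the NCSA common log format.
LINE = '192.0.2.10 - alice [10/Oct/2015:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326'

# Fields: host, identity, user, timestamp, request, status, response size.
PATTERN = re.compile(r'^(\S+) (\S+) (\S+) \[([^\]]+)\] "([^"]*)" (\d{3}) (\d+|-)$')

match = PATTERN.match(LINE)
if match:
    host, _, user, ts, request, status, size = match.groups()
    print(host, user, ts, request, status, size)
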
Sensors: Sensors are common in daily life to measure physical quantities and transform
physical quantities into readable digital signals for subsequent processing (and storage).
Sensory data may be classified as sound wave, voice, vibration, automobile, chemical,
current, weather, pressure, temperature, etc. Sensed information is transferred to a data
collection point through wired or wireless networks. For applications that may be easily
deployed and managed, e.g., a video surveillance system [10], a wired sensor network is a
convenient solution to acquire the related information. Sometimes the accurate position of a
specific phenomenon is unknown, and sometimes the monitored environment does not have
the energy or communication infrastructure. Then wireless communication must be used to
enable data transmission among sensor nodes under limited energy and communication
capability. In recent years, WSNs (wireless sensor networks) have received considerable
interest and have been applied to many applications, such as environmental research, water
quality monitoring, civil engineering, and wildlife habitat monitoring. A WSN generally
consists of a large number of geographically distributed sensor nodes, each being a micro
device powered by a battery. Such sensors are deployed at designated positions as required
by the application to collect remote sensing data. Once the sensors are deployed, the base
station sends control information for network configuration/management or data collection to
the sensor nodes. Based on such control information, the sensory data is assembled at the
different sensor nodes and sent back to the base station for further processing. Interested
readers are referred to the literature for more detailed discussions.

Methods for Acquiring Network Data: At present, network data acquisition is accomplished
using a combination of a web crawler, a word segmentation system, a task system, and an
index system, etc. A web crawler is a program used by search engines for downloading and
storing web pages. Generally speaking, a web crawler starts from the uniform resource
locator (URL) of an initial web page to access other linked web pages, during which it stores
and sequences all the retrieved URLs. The crawler acquires URLs in order of precedence
through a URL queue, downloads the web pages, identifies all URLs in the downloaded
pages, and extracts new URLs to be put into the queue. This process is repeated until the
crawler is stopped. Data acquisition through a web crawler is widely applied in applications
based on web pages, such as search engines or web caching.
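
The queue-driven loop described above can be sketched in a few lines of Python; the seed
URL is hypothetical, the href extraction uses a crude regular expression, and a real crawler
would additionally respect robots.txt, rate limits, and site terms.

import re
from collections import deque
from urllib.parse import urljoin
from urllib.request import urlopen

seed = "https://example.org/"          # hypothetical starting page
queue, seen = deque([seed]), {seed}

while queue and len(seen) < 50:        # stop after a small number of pages
    url = queue.popleft()
    try:
        html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
    except Exception:
        continue                       # skip pages that fail to download
    # Extract candidate links and append unseen ones to the queue.
    for href in re.findall(r'href="([^"]+)"', html):
        link = urljoin(url, href)
        if link.startswith("https://example.org/") and link not in seen:
            seen.add(link)
            queue.append(link)

print(len(seen), "URLs discovered")
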
Traditional web page extraction technologies feature multiple efficient solutions, and
considerable research has been done in this field. As more advanced web page applications
emerge, extraction strategies have also been proposed to cope with rich Internet applications.
The current network data acquisition technologies mainly include the traditional
Libpcap-based packet capture technology, zero-copy packet capture technology, as well as
some specialized network monitoring software such as Wireshark, SmartSniff, and
WinNetCap.
Libpcap-Based Packet Capture Technology: Libpcap (packet capture library) is a widely
used network data packet capture function library. It is a general tool that does not depend on
any specific system and is mainly used to capture data at the data link layer. It features
simplicity, ease of use, and portability, but has relatively low efficiency. Therefore, in a
high-speed network environment, considerable packet loss may occur when Libpcap is used.
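
A hedged illustration of this kind of packet capture from Python is shown below using the
scapy library, which can sit on top of libpcap; it assumes scapy is installed and that the script
runs with sufficient privileges to sniff on a network interface.

from scapy.all import sniff   # scapy can capture via libpcap or raw sockets

def show(packet):
    # Print a one-line summary of each captured frame (data link layer and up).
    print(packet.summary())

# Capture 10 packets from the default interface; requires root/administrator rights.
sniff(count=10, prn=show)
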
Zero-Copy Packet Capture Technology: The so-called zero-copy (ZC) approach means that no
copies between internal memory areas occur during packet receiving and sending at a node.
When sending, the data packets start directly from the user buffer of the application, pass
through the network interfaces, and arrive at the external network. When receiving, the
network interfaces send data packets directly to the user buffer. The basic idea of zero-copy
is to reduce data copy operations, reduce system calls, and reduce CPU load while datagrams
are passed from network equipment to user program space. The zero-copy technique first
utilizes direct memory access (DMA) to transmit network datagrams directly to an address
space pre-allocated by the system kernel, so as to avoid the participation of the CPU.
Meanwhile, it maps the internal memory of the datagrams in the system kernel to that of the
detection program, or builds a cache region in user space and maps it to the kernel space. The
detection program then accesses this memory directly, so as to reduce memory copies from
system kernel to user space and reduce the number of system calls.
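
The same idea of avoiding copies through user space appears in ordinary operating system
APIs. The following hedged Python sketch uses socket.sendfile, which delegates to the
kernel's zero-copy sendfile call where available; the host, port, and file name are
placeholders.

import socket

# Placeholder endpoint and file; in practice these would be a real server and payload.
HOST, PORT, PATH = "127.0.0.1", 9000, "payload.bin"

with socket.create_connection((HOST, PORT)) as sock, open(PATH, "rb") as f:
    # socket.sendfile() uses os.sendfile() on platforms that support it, so the file
    # contents travel from the page cache to the socket without a user-space copy.
    sent = sock.sendfile(f)
    print("sent", sent, "bytes")
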
Mobile Equipment: At present, mobile devices are ever more widely used. As mobile device
functionality becomes increasingly powerful, devices feature more complex and varied means
of data acquisition as well as a greater variety of data. Mobile devices may acquire
geographical location information through positioning systems; acquire audio information
through microphones; acquire pictures, videos, streetscapes, two-dimensional barcodes, and
other multimedia information through cameras; and acquire user gestures and other body
language information through touch screens and gravity sensors. Over the years, wireless
operators have improved the service level of the mobile Internet by acquiring and analyzing
such information. For example, the iPhone itself can act as a 'mobile spy': it may collect
wireless data and geographical location information and send such information back to
Apple Inc. for processing, of which the user may not be aware. Apart from Apple, smartphone
operating systems such as Google's Android and Microsoft's Windows Phone can also collect
information in a similar manner.
In addition to the aforementioned data acquisition methods for the main data sources, there
are many other data collection methods or systems. For example, in scientific experiments,
many special tools can be used to collect experimental data, such as magnetic spectrometers
and radio telescopes. Data collection methods may be classified from different perspectives.
From the perspective of data sources, they can be classified into two categories: collection
methods recording through data sources and collection methods recording through other
auxiliary tools.

2.2 Data Transportation


Upon the completion of raw data collection, data will be transferred to a data storage
infrastructure for processing and analysis. Big data is mainly stored in a data center. The data
layout should be adjusted to improve computing efficiency or facilitate hardware
maintenance. In other words, internal data transmission may occur in the data center.
Therefore, data transmission consists of two phases: Inter-DCN transmissions and Intra-DCN
transmissions.
Inter-DCN transmissions are from the data source to the data center and are generally
achieved with the existing physical network infrastructure. Because of the rapid growth of
traffic demands, the physical network infrastructure in most regions around the world is
constituted by high-volume, high-rate, and cost-effective optical fiber transmission systems.
Over the past 20 years, advanced management equipment and technologies have been
developed, such as IP-based wavelength division multiplexing (WDM) network architectures,
to conduct smart control and management of optical fiber networks. WDM is a technology
that multiplexes multiple optical carrier signals with different wavelengths and couples them
onto the same optical fiber of the optical link. In such technology, lasers with different
wavelengths carry different signals. So far, backbone networks have been deployed with
WDM optical transmission systems with a single-channel rate of 40 Gb/s. At present,
100 Gb/s commercial interfaces are available, and 100 Gb/s (or Tb/s) systems will be
available in the near future.
However, traditional optical transmission technologies are limited by the bandwidth of the
electronic bottleneck. Recently, orthogonal frequency-division multiplexing (OFDM),
initially designed for wireless systems, has been regarded as one of the main candidate
technologies for future high-speed optical transmission. OFDM is a multi-carrier parallel
transmission technology. It segments a high-speed data flow into low-speed sub-data-flows to
be transmitted over multiple orthogonal sub-carriers. Compared with the fixed channel
spacing of WDM, OFDM allows sub-channel frequency spectra to overlap with each other.
Therefore, it is a flexible and efficient optical networking technology.

Intra-DCN transmissions are the data communication flows within data centers. Intra-DCN
transmissions depend on the communication mechanisms within the data center (i.e., on
physical connection plates, chips, internal memories of data servers, network architectures of
data centers, and communication protocols). A data center consists of multiple integrated
server racks interconnected by its internal connection networks. Nowadays, the internal
connection networks of most data centers are fat-tree, two-layer or three-layer structures
based on multi-commodity network flows. In the two-layer topological structure, the racks
are connected by 1 Gbps top-of-rack (ToR) switches, which in turn are connected with
10 Gbps aggregation switches. The three-layer topological structure adds one layer on top of
the two-layer structure; this layer is constituted by 10 or 100 Gbps core switches connecting
the aggregation switches. There are also other topological structures which aim to improve
data center networks.
Because of the inadequacy of electronic packet switches, it is difficult to increase
communication bandwidths while keeping energy consumption low. Over the years, due to
the huge success achieved by optical technologies, optical interconnection among the
networks in data centers has drawn great interest. Optical interconnection is a
high-throughput, low-delay, and low-energy-consumption solution. At present, optical
technologies are only used for point-to-point links in data centers. Such optical links provide
connections for the switches using low-cost multi-mode fiber (MMF) at a 10 Gbps data rate.
Optical interconnection (switching in the optical domain) of networks in data centers is a
feasible solution, which can provide Tbps-level transmission bandwidth with low energy
consumption.
Recently, many optical interconnection schemes have been proposed for data center
networks. Some schemes add optical paths to upgrade the existing networks, while other
schemes completely replace the current switches. As a strengthening technology, Zhou et al.
adopt wireless links in the 60 GHz frequency band to strengthen wired links. Network
virtualization should also be considered to improve the efficiency and utilization of data
center networks.
2.3 Data Pre-processing
Because of the wide variety of data sources, the collected datasets vary with respect to noise,
redundancy, consistency, etc., and it is undoubtedly a waste to store meaningless data. In
addition, some analytical methods have stringent requirements on data quality. Therefore,
data should in many circumstances be pre-processed to integrate the data from different
sources, so as to enable effective data analysis. Pre-processing data not only reduces storage
expense but also improves analysis accuracy. Some relevant data pre-processing techniques
are discussed in the following.

2.3.1 Integration
Data integration is the cornerstone of modern commercial informatics, which involves the
combination of data from different sources and provides users with a uniform view of the
data. This is a mature research field for traditional databases. Historically, two methods have
been widely recognized: data warehousing and data federation.
Data warehousing includes a process named ETL (Extract, Transform and Load). Extraction
involves connecting to source systems and selecting, collecting, analyzing, and processing the
necessary data. Transformation is the execution of a series of rules to transform the extracted
data into standard formats. Loading means importing the extracted and transformed data into
the target storage infrastructure. Loading is the most complex procedure among the three and
includes operations such as transformation, copying, clearing, standardization, screening, and
data organization. In data federation, a virtual database can be built to query and aggregate
data from different data sources, but such a database does not contain data itself; instead, it
includes information, or metadata, about the actual data and its locations. These two
'store-then-read' approaches do not satisfy the high performance requirements of data flows
or of search programs and applications. Compared with queries, data in these approaches is
more dynamic and must be processed during data transmission. Generally, data integration
methods are therefore accompanied by flow processing engines and search engines.
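
A toy end-to-end ETL pass in Python, using only the standard library, can make the three
steps concrete; the CSV file name, column names, and transformation rule are invented for
the example.

import csv
import sqlite3

# Extract: read raw rows from a (hypothetical) source file exported by a sales system.
with open("sales_raw.csv", newline="") as f:
    rows = list(csv.DictReader(f))        # expects columns: order_id, amount

# Transform: apply a simple standardization rule, here converting amounts to cents.
cleaned = [(r["order_id"], int(round(float(r["amount"]) * 100))) for r in rows]

# Load: import the transformed records into the target store (an SQLite table here).
conn = sqlite3.connect("warehouse.db")
conn.execute("CREATE TABLE IF NOT EXISTS sales (order_id TEXT, amount_cents INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", cleaned)
conn.commit()
conn.close()
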
2.3.2 Cleaning
Data cleaning is a process to identify inaccurate, incomplete, or unreasonable data, and then
to modify or delete such data to improve data quality. Generally, data cleaning includes five
complementary procedures: defining and determining error types, searching for and
identifying errors, correcting errors, documenting error examples and error types, and
modifying data entry procedures to reduce future errors. During cleaning, data formats,
completeness, rationality, and restrictions shall be inspected. Data cleaning is of vital
importance for keeping data consistent, and it is widely applied in many fields, such as
banking, insurance, the retail industry, telecommunications, and traffic control.
In e-commerce, most data is collected electronically and may have serious data quality
problems. Classic data quality problems mainly come from software defects, customization
errors, or system mis-configuration. Data cleaning in e-commerce has been performed, for
example, with crawlers and by regularly re-copying customer and account information. The
problem of cleaning RFID data has also been examined. RFID is widely used in many
applications, e.g., inventory management and object tracking. However, raw RFID data is of
low quality and includes a lot of abnormal data, limited by the physical design and affected
by environmental noise. A probability model was developed to cope with data loss in mobile
environments, and Khoussainova et al. proposed a system to automatically correct errors in
input data by defining global integrity constraints.
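
The sketch below shows a minimal rule-based cleaning pass over a few hand-made records in
Python; the field names, validity rules, and sample values are assumptions chosen only to
illustrate the "identify, then modify or delete" idea.

records = [
    {"id": 1, "age": 34, "email": "a@example.org"},
    {"id": 2, "age": -5, "email": "b@example.org"},   # unreasonable age
    {"id": 3, "age": 29, "email": ""},                # incomplete record
]

def is_valid(rec):
    """Very small set of cleaning rules: plausible age and a non-empty email."""
    return 0 <= rec["age"] <= 120 and "@" in rec["email"]

cleaned, rejected = [], []
for rec in records:
    (cleaned if is_valid(rec) else rejected).append(rec)

# Documenting rejected examples helps refine the data entry procedure later.
print("kept:", cleaned)
print("rejected for review:", rejected)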

2.3.3 Redundancy Elimination


Data redundancy refers to data repetition or surplus, which occurs in many datasets. Data
redundancy can increase unnecessary data transmission expense and cause defects in storage
systems, e.g., wasted storage space, data inconsistency, reduced data reliability, and data
corruption. Therefore, various redundancy reduction methods have been proposed, such as
redundancy detection, data filtering, and data compression. Such methods may apply to
different datasets or application environments. However, redundancy reduction may also
bring about certain negative effects; for example, data compression and decompression cause
additional computational burden. Therefore, the benefits of redundancy reduction and its cost
should be carefully balanced.
Data collected from different fields will increasingly appear in image or video formats. It is
well known that images and videos contain considerable redundancy, including temporal
redundancy, spatial redundancy, statistical redundancy, and sensing redundancy. Video
compression is widely used to reduce redundancy in video data, as specified in the many
video coding standards (MPEG-2, MPEG-4, H.263, and H.264/AVC). The problem of video
compression in a video surveillance system with a video sensor network has also been
studied, and a new MPEG-4-based method was proposed that investigates the contextual
redundancy related to background and foreground in a scene. The evaluation results
demonstrated the low complexity and low compression ratio of the proposed approach.
For generalized data transmission or storage, repeated data deletion (deduplication) is a
specialized data compression technique which aims to eliminate duplicate copies of data.
With repeated data deletion, individual data blocks or data segments are assigned identifiers
(e.g., using a hash algorithm) and stored, with the identifiers added to an identification list. As
the analysis of repeated data continues, if a new data block has an identifier that is identical to
one already in the identification list, the new data block is deemed redundant and is replaced
by a reference to the corresponding stored data block. Repeated data deletion can greatly
reduce storage requirements, which is particularly important for a big data storage system.
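
The identifier-list mechanism can be sketched directly in Python with a content hash as the
block identifier; the block size and sample payload are arbitrary choices for illustration.

import hashlib

BLOCK_SIZE = 8          # tiny blocks so the example shows duplicates; real systems use KBs

def deduplicate(data: bytes):
    """Split data into fixed-size blocks, store each unique block once, keep pointers."""
    store = {}          # identifier -> block contents
    pointers = []       # the original data is reconstructed by following these identifiers
    for i in range(0, len(data), BLOCK_SIZE):
        block = data[i:i + BLOCK_SIZE]
        ident = hashlib.sha256(block).hexdigest()
        if ident not in store:          # only new blocks consume storage
            store[ident] = block
        pointers.append(ident)
    return store, pointers

store, pointers = deduplicate(b"ABCDEFGH" * 4 + b"12345678")
print(len(pointers), "blocks referenced,", len(store), "blocks actually stored")
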
Apart from the aforementioned data pre-processing methods, specific data objects may need
to go through other operations such as feature extraction. Such operations play an important
role in multimedia search and DNA analysis. Usually, high-dimensional feature vectors (or
high-dimensional feature points) are used to describe such data objects, and the system stores
these feature vectors for future retrieval. Data transfer is usually used to process distributed
heterogeneous data sources, especially business datasets.

As a matter of fact, considering the variety of datasets, it is non-trivial, or even impossible, to
build a uniform data pre-processing procedure and technology that is applicable to all types
of datasets. The specific features, problems, performance requirements, and other factors of
the datasets at hand should therefore be considered in order to select a proper data
pre-processing strategy.

Big data applications

1. Banking and securities industry


The Securities and Exchange Commission (SEC) is using big data to monitor financial market
activity. It is currently using network analytics and natural language processing to catch
illegal trading activity in the financial markets.

Retail traders, big banks, hedge funds and other major players in the financial markets use
big data for trade analytics in high-frequency trading, pre-trade decision-support analytics,
sentiment measurement, predictive analytics, etc.

This industry also heavily relies on big data for risk analytics, including anti-money
laundering, enterprise risk management, "Know Your Customer" checks, and fraud
mitigation.

Big Data providers specific to this industry include: 1010data, Panopticon Software,
Streambase Systems, Nice Actimize and Quartet FS.

2. Communications, Media and Entertainment


Organizations in this industry simultaneously analyze customer data along with behavioral
data to create detailed customer profiles that can be used to:

Create content for different target audiences

Recommend content on demand

Measure content performance


A case in point is the Wimbledon Championships, which leverage big data to deliver detailed
sentiment analysis on the tennis matches to TV, mobile, and web users in real time.

Spotify, an on-demand music service, uses Hadoop big data analytics to collect data from its
millions of users worldwide and then uses the analyzed data to give informed music
recommendations to individual users.

Amazon Prime, which is driven to provide a great customer experience by offering video,
music and Kindle books in a one-stop shop, also heavily utilizes big data.

Big Data Providers in this industry include: Infochimps, Splunk, Pervasive Software, and
Visible Measures.

3. Healthcare sector

Some hospitals, like Beth Israel, are using data collected from a cell phone app from millions
of patients to allow doctors to practice evidence-based medicine, as opposed to administering
several medical/lab tests to every patient who goes to the hospital. A battery of tests can be
thorough, but it can also be expensive and often ineffective.

Free public health data and Google Maps have been used by the University of Florida to
create visual data that allows for faster identification and efficient analysis of healthcare
information, used in tracking the spread of chronic disease.

Obamacare has also utilized big data in a variety of ways.

4. Education

Big data is used quite significantly in higher education. For example, the University of
Tasmania, an Australian university with over 26,000 students, has deployed a learning
management system that tracks, among other things, when a student logs onto the system,
how much time is spent on different pages in the system, as well as the overall progress of a
student over time.

In a different use case, big data is also used in education to measure teachers' effectiveness
and to ensure a good experience for both students and teachers. Teachers' performance can be
fine-tuned and measured against student numbers, subject matter, student demographics,
student aspirations, behavioral classification and several other variables.

On a governmental level, the Office of Educational Technology in the U.S. Department of
Education is using big data to develop analytics that help course-correct students who are
going astray while taking online big data courses. Click patterns are also being used to detect
boredom.

5. Manufacturing and natural resources

In the natural resources industry, big data allows for predictive modeling to support decision
making, and has been utilized to ingest and integrate large amounts of geospatial data,
graphical data, text and temporal data. Areas of interest where this has been used include
seismic interpretation and reservoir characterization.

Big data has also been used in solving today's manufacturing challenges and in gaining
competitive advantage, among other benefits.

A study by Deloitte shows the big data supply chain capabilities currently in use and their
expected use in the future.

6. In Government

In public services, big data has a very wide range of applications, including energy
exploration, financial market analysis, fraud detection, health-related research and
environmental protection.

Some more specific examples are as follows:

Big data is being used in the analysis of the large numbers of social disability claims made to
the Social Security Administration (SSA), which arrive in the form of unstructured data. The
analytics are used to process medical information rapidly and efficiently for faster decision
making and to detect suspicious or fraudulent claims.

The Food and Drug Administration (FDA) is using big data to detect and study patterns of
food-related illnesses and diseases. This allows for a faster response, which has led to faster
treatment and fewer deaths.

The Department of Homeland Security uses big data for several different use cases. Big data
is analyzed from different government agencies and is used to protect the country.

7. In the insurance industry

Big data has been used in the industry to provide customer insights for transparent and
simpler products, by analyzing and predicting customer behavior through data derived from
social media, GPS-enabled devices and CCTV footage. Big data also allows for better
customer retention by insurance companies.
When it comes to claims management, predictive analytics from big data has been used to
offer faster service, since massive amounts of data can be analyzed, especially in the
underwriting stage. Fraud detection has also been enhanced.

Through massive data from digital channels and social media, real-time monitoring of claims
throughout the claims cycle has been used to provide insights.

8. In the Retail and Wholesale industry

Big data from customer loyalty programs, POS systems, store inventory and local
demographics continues to be gathered by retail and wholesale stores.

At New York's Big Show retail trade conference in 2014, companies like Microsoft, Cisco
and IBM pitched the need for the retail industry to utilize big data for analytics and for other
uses, including:

Optimized staffing through data from shopping patterns, local events, and so on

Reduced fraud

Timely analysis of inventory


Social media also has a lot of potential uses and continues to be adopted, slowly but surely,
especially by brick-and-mortar stores. Social media is used for customer prospecting,
customer retention, promotion of products, and more.

9. In the transportation industry

Some applications of big data by governments, private organizations and individuals include:

Government use of big data: traffic control, route planning, intelligent transport
systems, congestion management (by predicting traffic conditions)

Private sector use of big data in transport: revenue management, technological enhancements,
logistics and competitive advantage (by consolidating shipments and optimizing freight
movement)

Individual use of big data: route planning to save on fuel and time, travel arrangements in
tourism, etc.

10. In the energy and utilities industry


Smart meter readers allow data to be collected almost every 15 minutes, as opposed to once a
day with the old meter readers. This granular data is being used to analyze the consumption
of utilities better, which allows for improved customer feedback and better control of utility
use.

In utility companies, the use of big data also allows for better asset and workforce
management, which is useful for recognizing errors and correcting them as soon as possible,
before complete failure is experienced.
