Big Data Notes
Big data usually includes data sets with sizes beyond the ability of commonly used software
tools to capture, curate, manage, and process data within a tolerable elapsed time. Big Data
encompasses unstructured, semi-structured and structured data, however the main focus is on
unstructured data. Big data "size" is a constantly moving target, as of 2012 ranging from a
few dozen terabytes to many petabytes of data.
Big Data represents information assets characterized by such high volume, velocity and variety as to require specific technology and analytical methods for their transformation into value.
Volume: big data doesn't sample; it just observes and tracks what happens.
Velocity: big data is often available in real-time.
Variety: big data draws from text, images, audio and video; it also completes missing pieces through data fusion.
Characteristics
Big data can be described by the following characteristics:
Volume
The quantity of generated and stored data. The size of the data determines the value and potential insight, and whether it can actually be considered big data or not.
Variety
The type and nature of the data. This helps people who analyze it to effectively use
the resulting insight.
Velocity
In this context, the speed at which the data is generated and processed to meet the
demands and challenges that lie in the path of growth and development.
Variability
Inconsistency of the data set can hamper processes to handle and manage it.
Veracity
The quality of captured data can vary greatly, affecting accurate analysis.
It is how we make use of data that allows us to fully recognize its true value and potential to improve our decision-making capabilities and, from a business standpoint, to measure it against the results of positive business outcomes.
These approaches place high emphasis on the importance of every individual data item that
goes into these systems and, as a result, highlight the importance of every single outcome
linking to business impacts delivered.
Big data characteristics are popularly defined through the four Vs: volume, velocity, variety and veracity. Applying these four characteristics provides multiple dimensions to the value of the data at hand.
Essentially, there is an assumption that the data has great potential, but no one has explored where that might lie. Unlike a business intelligence system, where analysts know what information they are seeking, the possibilities of exploring big data are all linked to identifying connections between things we don't yet know. It is all about designing the system to decipher this information.
A possible approach could be to take the four Vs into prime consideration and determine
what kind of value they deliver while solving a particular business problem.
Volume-based value
Now that organisations have the ability to store as much data as possible in a cost-effective
manner, they have the capabilities to do broader analysis across different data dimensions and
also deeper analysis going back to multiple years of historical context behind data.
In essence, they no longer need to sample the data; they can carry out their analysis on the entire data set. This applies particularly to developing true customer-centric profiles, as well as richer customer-centric offerings at a micro level.
The more data businesses have on their customers, both recent and historical, the greater the
insights. This will in turn lead to generating better decisions around acquiring, retaining,
increasing and managing those customer relationships.
Velocity-based value
This is all about speed, which is now more important than ever. The faster businesses can ingest data into their data and analytics platform, the more time they will have to ask the right
questions and seek answers. Rapid analysis capabilities provide businesses with the right
decision in time to achieve their customer relationship management objectives.
Variety-based value
In the digital era, capability to acquire and analyse varied data is extremely valuable, as the
more diverse customer data businesses have, the more multi-faceted view they develop about
their customers.
This in turn provides deep insights into successfully developing and personalising customer
journey maps, and provides a platform for businesses to be more engaged and aware of
customer needs and expectations.
Veracity-based value
While many question the quality and accuracy of data in the big data context, for innovative business offerings the accuracy of data is not that critical, at least in the early stages of concept design and validation. Thus, the more business hypotheses that can be churned out from this vast amount of data, the greater the potential for a business differentiation edge.
Developing a measurement framework that takes these aspects into account allows businesses to easily measure the value of data in their most important metric: money.
Once a big data analytics platform that measures along the four Vs has been implemented, businesses can utilize and extend the outcomes to directly impact customer acquisition, onboarding, retention, upsell, cross-sell and other revenue-generating indicators.
This can also lead to measuring the value of parallel improvements in operational
productivity and the influence of data across the enterprise for other initiatives.
On the other side of the spectrum, however, it is important to note that amassing a lot of data
does not necessarily deliver insights. Businesses now have access to more data than ever
before, but having access to more data can make it harder to distill insights, since the bigger
the datasets, the harder it becomes to search, visualise, and analyse.
It is not the amount of data that matters; it's how smart organisations are with the data they have. In reality, they can have tons of data, but if they're not using it intelligently it seldom delivers what they are looking for.
Big data for development is a concept that refers to the identification of sources of big data
relevant to the policies and planning for development programs. It differs from both
traditional development data and what the private sector and mainstream media call big
data.
In general, sources of big data for development are those which can be analyzed to gain
insight into human well-being and development, and generally share some or all of the
following features:
Digitally generated: Data is created digitally, not digitized manually, and can be
manipulated by computers.
Passively produced: Data is a by-product of interactions with digital services.
Automatically collected: A system is in place that automatically extracts and stores the
relevant data that is generated.
Geographically or temporally trackable: For instance, this is the case with mobile phone location data or call duration time.
Continuously analyzed: Information is relevant to human well-being and development,
and can be analyzed in real time.
Big data for development is constantly evolving. However, a preliminary categorization of
sources may reflect:
What people say (online content): International and local online news sources, publicly
accessible blogs, forum posts, comments and public social media content, online advertising,
e-commerce sites and websites created by local retailers that list prices and inventory.
What people do (data exhaust): Passively collected transactional data from the use of digital
services such as financial services (including purchase, money transfers, savings and loan
repayments), communications services (such as anonymized records of mobile phone usage
patterns) or information services (such as anonymized records of search queries).
Before it can be used effectively, big data needs to be managed and filtered through data analytics - tools and methodologies that can transform massive quantities of raw data into 'data about the data' for analytical purposes. Only then is it possible to detect changes in how communities access services that may be useful proxy indicators of human well-being.
If properly mined and analyzed, big data can improve the understanding of human behavior
and offer policymaking support for global development in three main ways:
Early warning: Early detection of anomalies can enable faster responses to populations in times of crisis.
Real-time awareness: Fine-grained representation of reality through big data can inform
the design and targeting of programs and policies.
Real-time feedback: Adjustments can be made possible by real-time monitoring of the
impact of policies and programs.
Global Pulse is a United Nations innovation initiative of the Secretary-General, exploring
how big data can help policymakers gain a better understanding of changes in human well-
being. Through strategic public-private partnerships and R&D carried out across its network
of Pulse Labs in New York, Jakarta and Kampala, Global Pulse functions as a hub for
applying innovations in data science and analytics to global development and humanitarian
challenges.
Big data analytics is not a panacea for age-old development challenges, and real-time
information does not replace the quantitative statistical evidence that governments
traditionally use for decision making. However, it does have the potential to inform whether
further targeted investigation is necessary, or prompt immediate response.
The main tasks (and challenges) involved in working with big data include:
Capturing data
Curation
Storage
Searching
Sharing
Transfer
Analysis
Presentation
The following V-based characterizations represent different challenges associated with the main tasks involving big data (as mentioned earlier: capture, cleaning, curation, integration, storage, processing, indexing, search, sharing, transfer, mining, analysis, and visualization).
Volume: = lots of data (which I have labeled as 'Tonnabytes', to suggest that the actual numerical scale at which the data volume becomes challenging in a particular setting is domain-specific, but we all agree that we are now dealing with a ton of bytes).
Variety: = complexity, thousands or more features per data item, the curse of
dimensionality, combinatorial explosion, many data types, and many data formats.
Velocity: = high rate of data and information flowing into and out of our systems,
real-time, incoming!
Veracity: = necessary and sufficient data to test many different hypotheses, vast
training samples for rich micro-scale model-building and model validation, micro-
grained truth about every object in your data collection, thereby empowering
whole-population analytics.
Value: = the all-important V, characterizing the business value, ROI, and potential
of big data to transform your organization from top to bottom (including the bottom
line).
Vagueness: = confusion over the meaning of big data (Is it Hadoop? Is it something that we've always had? What's new about it? What are the tools? Which tools should I use? etc.)
What is NoSQL?
NoSQL encompasses a wide variety of different database technologies that were developed in
response to the demands presented in building modern applications:
Developers are working with applications that create massive volumes of new, rapidly changing data types - structured, semi-structured, unstructured and polymorphic data.
Long gone is the twelve-to-eighteen month waterfall development cycle. Now small teams
work in agile sprints, iterating quickly and pushing code every week or two, some even
multiple times every day.
Applications that once served a finite audience are now delivered as services that must be
always-on, accessible from many different devices and scaled globally to millions of users.
Organizations are now turning to scale-out architectures using open source software,
commodity servers and cloud computing instead of large monolithic servers and storage
infrastructure.
Relational databases were not designed to cope with the scale and agility challenges that face
modern applications, nor were they built to take advantage of the commodity storage and
processing power available today.
NoSQL Database Types
Document databases pair each key with a complex data structure known as a document.
Documents can contain many different key-value pairs, or key-array pairs, or even nested
documents.
Graph stores are used to store information about networks of data, such as social
connections. Graph stores include Neo4J and Giraph.
Key-value stores are the simplest NoSQL databases. Every single item in the database is
stored as an attribute name (or 'key'), together with its value. Examples of key-value stores
are Riak and Berkeley DB. Some key-value stores, such as Redis, allow each value to have a
type, such as 'integer', which adds functionality.
Wide-column stores such as Cassandra and HBase are optimized for queries over large
datasets, and store columns of data together, instead of rows.
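To make the contrast between these data models concrete, here is a minimal sketch in plain Java (the in-memory maps are purely illustrative stand-ins; no particular database API is implied) showing the same employee record held once as an opaque key-value pair and once as a nested, document-style structure.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DataModelSketch {
    public static void main(String[] args) {
        // Key-value style: one opaque value per key; the store knows nothing
        // about the structure inside the value.
        Map<String, String> keyValueStore = new HashMap<>();
        keyValueStore.put("employee:1001", "{\"name\":\"Jane Doe\",\"office\":\"Berlin\"}");

        // Document style: the record is a structured document whose fields
        // (including nested values and arrays) are visible to the database.
        Map<String, Object> document = new HashMap<>();
        document.put("_id", 1001);
        document.put("name", "Jane Doe");
        document.put("skills", List.of("Java", "Hadoop"));
        document.put("office", Map.of("city", "Berlin", "floor", 3));

        System.out.println(keyValueStore.get("employee:1001"));
        System.out.println(document);
    }
}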
When compared to relational databases, NoSQL databases are more scalable and provide superior performance, and their data model addresses several issues that the relational model is not designed to address. The first step is selecting the appropriate data model: document, key-value and wide-column, or graph. Beyond that choice, NoSQL databases offer the following advantages:
1. Dynamic Schemas
Relational databases require that schemas be defined before you can add data. For example, you might want to store data about your customers, such as phone numbers, first and last name, address, city and state - a SQL database needs to know what you are storing in advance.
This fits poorly with agile development approaches, because each time you complete new
features, the schema of your database often needs to change. So if you decide, a few
iterations into development, that you'd like to store customers' favorite items in addition to
their addresses and phone numbers, you'll need to add that column to the database, and then
migrate the entire database to the new schema.
If the database is large, this is a very slow process that involves significant downtime. If you are frequently changing the data your application stores - because you are iterating rapidly - this downtime may also be frequent. There's also no way, using a relational database, to effectively address data that's completely unstructured or unknown in advance.
NoSQL databases are built to allow the insertion of data without a predefined schema. That
makes it easy to make significant application changes in real time, without worrying about service interruptions - which means development is faster, code integration is more reliable,
and less database administrator time is needed. Developers have typically had to add
application-side code to enforce data quality controls, such as mandating the presence of
specific fields, data types or permissible values. More sophisticated NoSQL databases allow
validation rules to be applied within the database, allowing users to enforce governance
across data, while maintaining the agility benefits of a dynamic schema.
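As a hedged sketch of what a dynamic schema looks like in practice, the snippet below uses the MongoDB Java driver to insert two customer documents with different fields into the same collection; the connection string, database name and field names are placeholders, not taken from the text above.

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

public class DynamicSchemaSketch {
    public static void main(String[] args) {
        // Placeholder connection string; adjust for your own deployment.
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> customers =
                    client.getDatabase("shop").getCollection("customers");

            // First document: name, phone and city only.
            customers.insertOne(new Document("name", "Alice")
                    .append("phone", "555-0100")
                    .append("city", "Pune"));

            // Later iteration: a 'favoriteItems' field is added on the fly,
            // with no ALTER TABLE, no migration and no downtime.
            customers.insertOne(new Document("name", "Bob")
                    .append("phone", "555-0101")
                    .append("favoriteItems", java.util.List.of("laptop", "headphones")));
        }
    }
}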
2. Auto-sharding
Because of the way they are structured, relational databases usually scale vertically - a single server has to host the entire database to ensure acceptable performance for cross-table joins and transactions. This gets expensive quickly, places limits on scale, and creates a relatively
small number of failure points for database infrastructure. The solution to support rapidly
growing applications is to scale horizontally, by adding servers instead of concentrating more
capacity in a single server.
'Sharding' a database across many server instances can be achieved with SQL databases, but
usually is accomplished through SANs and other complex arrangements for making hardware
act as a single server. Because the database does not provide this ability natively,
development teams take on the work of deploying multiple relational databases across a
number of machines. Data is stored in each database instance autonomously. Application
code is developed to distribute the data, distribute queries, and aggregate the results of data
across all of the database instances. Additional code must be developed to handle resource
failures, to perform joins across the different databases, for data rebalancing, replication, and
other requirements. Furthermore, many benefits of the relational database, such as
transactional integrity, are compromised or eliminated when employing manual sharding.
NoSQL databases, on the other hand, usually support auto-sharding, meaning that they
natively and automatically spread data across an arbitrary number of servers, without
requiring the application to even be aware of the composition of the server pool. Data and
query load are automatically balanced across servers, and when a server goes down, it can be
quickly and transparently replaced with no application disruption.
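The following is only a conceptual sketch of the routing idea behind auto-sharding, not any particular database's implementation: a shard key is hashed, and the hash decides which server holds the record, so the application never needs to know the composition of the server pool (the shard host names are invented).

import java.util.List;

public class ShardRoutingSketch {
    // Hypothetical pool of shard servers; a real NoSQL database manages
    // this list (and rebalancing) internally.
    private static final List<String> SHARDS = List.of(
            "shard-0.example.internal",
            "shard-1.example.internal",
            "shard-2.example.internal");

    // Map a shard key (e.g., a customer id) to one of the servers.
    static String shardFor(String shardKey) {
        int bucket = Math.floorMod(shardKey.hashCode(), SHARDS.size());
        return SHARDS.get(bucket);
    }

    public static void main(String[] args) {
        System.out.println("customer:42 -> " + shardFor("customer:42"));
        System.out.println("customer:77 -> " + shardFor("customer:77"));
    }
}

A real system layers replication, rebalancing and routing metadata on top of this idea, which is exactly the work that manual sharding pushes onto the application team.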
Cloud computing makes horizontal scaling significantly easier, with providers such as Amazon Web Services providing virtually unlimited capacity on demand, and taking care of all the necessary infrastructure administration tasks. Developers no longer need to construct complex, expensive platforms to support their applications, and can concentrate on writing application code. Commodity servers can provide the same processing and storage capabilities as a single high-end server for a fraction of the price.
3. Replication
Most NoSQL databases also support automatic database replication to maintain availability in
the event of outages or planned maintenance events. More sophisticated NoSQL databases
are fully self-healing, offering automated failover and recovery, as well as the ability to
distribute the database across multiple geographic regions to withstand regional failures and
enable data localization. Unlike relational databases, NoSQL databases generally have no
requirement for separate applications or expensive add-ons to implement replication.
4. Integrated Caching
A number of products provide a caching tier for SQL database systems. These systems can
improve read performance substantially, but they do not improve write performance, and they
add operational complexity to system deployments. If your application is dominated by reads then a distributed cache could be considered, but if your application has even a modest write volume, then a distributed cache may not improve the overall experience of your end users, and will add complexity in managing cache invalidation.
Many NoSQL database technologies have excellent integrated caching capabilities, keeping
frequently-used data in system memory as much as possible and removing the need for a
separate caching layer. Some NoSQL databases also offer a fully managed, integrated in-memory database management layer for workloads demanding the highest throughput and lowest latency.
Data Storage Model
SQL databases: Individual records (e.g., 'employees') are stored as rows in tables, with each column storing a specific piece of data about that record (e.g., 'manager', 'date hired', etc.), much like a spreadsheet. Related data is stored in separate tables, and then joined together when more complex queries are executed. For example, 'offices' might be stored in one table, and 'employees' in another. When a user wants to find the work address of an employee, the database engine joins the 'employee' and 'office' tables together to get all the information necessary.
NoSQL databases: Varies based on database type. For example, key-value stores function similarly to SQL databases, but have only two columns ('key' and 'value'), with more complex information sometimes stored as BLOBs within the 'value' columns. Document databases do away with the table-and-row model altogether, storing all relevant data together in a single 'document' in JSON, XML, or another format, which can nest values hierarchically.

Schemas
SQL databases: Structure and data types are fixed in advance. To store information about a new data item, the entire database must be altered, during which time the database must be taken offline.
NoSQL databases: Typically dynamic, with some enforcing data validation rules. Applications can add new fields on the fly, and unlike SQL table rows, dissimilar data can be stored together as necessary. For some databases (e.g., wide-column stores), it is somewhat more challenging to add new fields dynamically.
Often, organizations will begin with a small-scale trial of a NoSQL database in their
organization, which makes it possible to develop an understanding of the technology in a
low-stakes way. Most NoSQL databases are also open-source, meaning that they can be
downloaded, implemented and scaled at little cost. Because development cycles are faster,
organizations can also innovate more quickly and deliver superior customer experience at a
lower cost.
As you consider alternatives to legacy infrastructures, you may have several motivations: to
scale or perform beyond the capabilities of your existing system, identify viable alternatives
to expensive proprietary software, or increase the speed and agility of development. When
selecting the right database for your business and application, there are five important
dimensions to consider.
Big data technologies are important in providing more accurate analysis, which may lead to
more concrete decision-making resulting in greater operational efficiencies, cost reductions,
and reduced risks for the business.
To harness the power of big data, you would require an infrastructure that can manage and
process huge volumes of structured and unstructured data in realtime and can protect data
privacy and security.
There are various technologies in the market from different vendors including Amazon, IBM,
Microsoft, etc., to handle big data. While looking into the technologies that handle big data,
we examine the following two classes of technology:
Operational Big Data
This includes systems, like MongoDB, that provide operational capabilities for real-time, interactive workloads where data is primarily captured and stored.
NoSQL Big Data systems are designed to take advantage of new cloud computing
architectures that have emerged over the past decade to allow massive computations to be run
inexpensively and efficiently. This makes operational big data workloads much easier to
manage, cheaper, and faster to implement.
Some NoSQL systems can provide insights into patterns and trends based on real-time data
with minimal coding and without the need for data scientists and additional infrastructure.
Analytical Big Data
This includes systems like Massively Parallel Processing (MPP) database systems and MapReduce that provide analytical capabilities for retrospective and complex analysis that may touch most or all of the data.
These two classes of technology are complementary and frequently deployed together.
Access pattern: operational systems handle both writes and reads, while analytical systems are mostly read-oriented.
Traditional Approach
In this approach, an enterprise will have a computer to store and process big data. Here the data will be stored in an RDBMS like Oracle Database, MS SQL Server or DB2, and sophisticated software can be written to interact with the database, process the required data and present it to the users for analysis purposes.
Limitation
This approach works well where we have a smaller volume of data that can be accommodated by standard database servers, or up to the limit of the processor that is processing the data. But when it comes to dealing with huge amounts of data, it is really a tedious task to process such data through a traditional database server.
Google's Solution
Google solved this problem using an algorithm called MapReduce. This algorithm divides the
task into small parts and assigns those parts to many computers connected over the network,
and collects the results to form the final result dataset.
The commodity hardware used for this can be anything from single-CPU machines to servers with higher capacity.
Hadoop
Doug Cutting, Mike Cafarella and their team took the solution provided by Google and started an open-source project called Hadoop in 2005; Doug named it after his son's toy elephant. Apache Hadoop is now a registered trademark of the Apache Software Foundation.
Hadoop runs applications using the MapReduce algorithm, where the data is processed in parallel on different CPU nodes. In short, the Hadoop framework makes it possible to develop applications that run on clusters of computers and perform complete statistical analysis of huge amounts of data.
Hadoop is an Apache open-source framework written in Java that allows distributed processing of large datasets across clusters of computers using simple programming models. A Hadoop application works in an environment that provides distributed storage and computation across clusters of computers. Hadoop is designed to scale up from a single server to thousands of machines, each offering local computation and storage.
Hadoop Architecture
The Hadoop framework consists of four modules: Hadoop Common (Java libraries and utilities used by the other modules), Hadoop YARN (job scheduling and cluster resource management), the Hadoop Distributed File System (HDFS), and Hadoop MapReduce.
Since 2012, the term "Hadoop" often refers not just to the base modules mentioned above but
also to the collection of additional software packages that can be installed on top of or
alongside Hadoop, such as Apache Pig, Apache Hive, Apache HBase, Apache Spark etc.
MapReduce
Hadoop MapReduce is a software framework for easily writing applications which process vast amounts of data in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.
The term MapReduce actually refers to the following two different tasks that Hadoop
programs perform:
The Map Task: This is the first task, which takes input data and converts it into a set of data, where individual elements are broken down into tuples (key/value pairs).
The Reduce Task: This task takes the output from a map task as input and combines those data tuples into a smaller set of tuples. The reduce task is always performed after the map task.
Typically both the input and the output are stored in a file system. The framework takes care of scheduling tasks, monitoring them and re-executing failed tasks.
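As a minimal illustration of these two tasks (a sketch using the standard Hadoop MapReduce Java API; the class names are arbitrary), the classic word-count example emits (word, 1) pairs in the map task and sums them in the reduce task.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map task: break each input line into (word, 1) pairs.
class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);            // emit (word, 1)
        }
    }
}

// Reduce task: combine the tuples for each word into a single count.
class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum)); // emit (word, total)
    }
}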
The JobTracker is a single point of failure for the Hadoop MapReduce service which means
if JobTracker goes down, all running jobs are halted.
Hadoop can work directly with any mountable distributed file system such as Local FS,
HFTP FS, S3 FS, and others, but the most common file system used by Hadoop is the
Hadoop Distributed File System (HDFS).
The Hadoop Distributed File System (HDFS) is based on the Google File System (GFS) and
provides a distributed file system that is designed to run on large clusters (thousands of
computers) of small computer machines in a reliable, fault-tolerant manner.
A file in an HDFS namespace is split into several blocks and those blocks are stored in a set
of DataNodes. The NameNode determines the mapping of blocks to the DataNodes. The
DataNodes take care of read and write operations with the file system. They also take care of block creation, deletion and replication based on instructions given by the NameNode.
HDFS provides a shell like any other file system, and a list of commands is available to interact with the file system. These shell commands will be covered in a separate chapter
along with appropriate examples.
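The shell commands themselves are left to that chapter, but as a rough sketch of the same kind of interaction from Java, HDFS can also be used through the org.apache.hadoop.fs.FileSystem API (the NameNode address and paths below are placeholders).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder NameNode address; taken from core-site.xml in a real cluster.
        conf.set("fs.defaultFS", "hdfs://namenode:9000");

        try (FileSystem fs = FileSystem.get(conf)) {
            fs.mkdirs(new Path("/user/demo"));                                // like a 'mkdir' shell command
            fs.copyFromLocalFile(new Path("input.txt"),
                                 new Path("/user/demo/input.txt"));           // like a 'put' shell command
            for (FileStatus status : fs.listStatus(new Path("/user/demo"))) { // like an 'ls' shell command
                System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
            }
        }
    }
}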
Stage 1
A user/application can submit a job to Hadoop (via the Hadoop job client) for the required processing by specifying the following items:
The location of the input and output files in the distributed file system.
The Java classes, in the form of a jar file, containing the implementation of the map and reduce functions.
Stage 2
The Hadoop job client then submits the job (jar/executable etc) and configuration to the
JobTracker which then assumes the responsibility of distributing the software/configuration
to the slaves, scheduling tasks and monitoring them, providing status and diagnostic
information to the job-client.
Stage 3
The TaskTrackers on different nodes execute the task as per MapReduce implementation and
output of the reduce function is stored into the output files on the file system.
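A sketch of the job-client side of these stages, reusing the word-count mapper and reducer shown earlier (input and output paths are passed on the command line and are only examples).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");

        job.setJarByClass(WordCountDriver.class);   // jar containing the map/reduce classes
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);  // optional local aggregation
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. /user/demo/input
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // e.g. /user/demo/output

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}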
Advantages of Hadoop
The Hadoop framework allows the user to quickly write and test distributed systems. It is efficient, and it automatically distributes the data and work across the machines and, in turn, utilizes the underlying parallelism of the CPU cores.
Servers can be added or removed from the cluster dynamically and Hadoop
continues to operate without interruption.
Another big advantage of Hadoop is that, apart from being open source, it is compatible with all platforms since it is Java based.
Hadoop can run in several operation modes (standalone, pseudo-distributed and fully distributed). Fully Distributed Mode: This mode is fully distributed, with a minimum of two or more machines as a cluster. We will come across this mode in detail in the coming chapters.
HDFS holds very large amounts of data and provides easy access. To store such huge data, the files are stored across multiple machines. These files are stored in a redundant fashion to protect the system from possible data loss in case of failure. HDFS also makes applications available for parallel processing.
Features of HDFS
The built-in servers of the namenode and datanode help users to easily check the status of the cluster.
HDFS Architecture
Namenode
The namenode is the commodity hardware that contains the GNU/Linux operating system
and the namenode software. It is a software that can be run on commodity hardware. The
system having the namenode acts as the master server and it does the following tasks:
It manages the file system namespace and regulates clients' access to files. It also executes file system operations such as renaming, closing, and opening files and directories.
Datanode
The datanode is a commodity hardware having the GNU/Linux operating system and
datanode software. For every node (Commodity hardware/System) in a cluster, there will be a
datanode. These nodes manage the data storage of their system.
They perform read-write operations on the file systems, as per client requests. They also perform operations such as block creation, deletion, and replication according to the instructions of the namenode.
Block
Generally the user data is stored in the files of HDFS. A file in the file system will be divided into one or more segments, which are stored in individual data nodes. These file segments are called blocks. In other words, the minimum amount of data that HDFS can read or write is called a block. The default block size is 64 MB, but it can be increased as needed by changing the HDFS configuration (for example, a 200 MB file stored with the default block size occupies three full 64 MB blocks plus one 8 MB block).
Goals of HDFS
Huge datasets: HDFS should have hundreds of nodes per cluster to manage applications having huge datasets.
Hadoop MapReduce
MapReduce is a framework using which we can write applications to process huge amounts
of data, in parallel, on large clusters of commodity hardware in a reliable manner.
What is MapReduce?
MapReduce is a processing technique and a programming model for distributed computing based on Java. The MapReduce algorithm contains two important tasks, namely Map and Reduce. Map takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs). The reduce task then takes the output from a map as input and combines those data tuples into a smaller set of tuples. As the name MapReduce implies, the reduce task is always performed after the map job.
The major advantage of MapReduce is that it is easy to scale data processing over multiple
computing nodes. Under the MapReduce model, the data processing primitives are called
mappers and reducers. Decomposing a data processing application into mappers and reducers
is sometimes nontrivial. But, once we write an application in the MapReduce form, scaling
the application to run over hundreds, thousands, or even tens of thousands of machines in a
cluster is merely a configuration change. This simple scalability is what has attracted many
programmers to use the MapReduce model.
The Algorithm
Map stage: The map or mapper's job is to process the input data. Generally the input data is in the form of a file or directory and is stored in the Hadoop file system (HDFS). The input file is passed to the mapper function line by line. The mapper processes the data and creates several small chunks of data.
Reduce stage: This stage is the combination of the Shuffle stage and the Reduce stage. The reducer's job is to process the data that comes from the mapper. After processing, it produces a new set of output, which is stored in HDFS.
During a MapReduce job, Hadoop sends the Map and Reduce tasks to the
appropriate servers in the cluster.
The framework manages all the details of data-passing such as issuing tasks,
verifying task completion, and copying data around the cluster between the nodes.
Most of the computing takes place on nodes with data on local disks that
reduces the network traffic.
After completion of the given tasks, the cluster collects and reduces the data to
form an appropriate result, and sends it back to the Hadoop server.
BIG DATA & CLOUD COMPUTING
The concept of big data became a major force of
innovation across both academics and corporations. The paradigm is viewed as an effort to
understand and get proper insights from big datasets (big data analytics), providing
summarized information over huge data loads. As such, this paradigm is regarded by
corporations as a tool to understand their clients, to get closer to them, find patterns and
predict trends. Furthermore, big data is viewed by scientists as a means to store and process
huge scientific datasets. This concept is a hot topic and is expected to continue to grow in
popularity in the coming years. Although big data is mostly associated with the storage of
huge loads of data it also concerns ways to process and extract knowledge from it (Hashem et
al., 2014). The five different aspects used to describe big data (commonly referred to as the
five Vs) are Volume, Variety, Velocity, Value and Veracity (Sakr & Gaber, 2014):
Volume describes the size of the datasets that a big data system deals with. Processing and storing big volumes of data is rather difficult, since it concerns: scalability, so that the system can grow; availability, which guarantees access to data and ways to perform operations over it; and bandwidth and performance.
Variety concerns the different types of data from various sources that big data frameworks have to deal with.
Velocity concerns the different rates at which data streams may get in or out of the system, and provides an abstraction layer so that big data systems can store data independently of the incoming or outgoing rate.
Value concerns the true value of data (i.e., the potential value of the data regarding the information it contains). Huge amounts of data are worthless unless they provide value.
Veracity refers to the trustworthiness of the data, addressing data confidentiality, integrity, and availability. Organizations need to ensure that the data, as well as the analyses performed on it, can be trusted.
BIG DATA IN THE CLOUD
Storing and processing big volumes of data requires
scalability, fault tolerance and availability. Cloud computing delivers all these through
hardware virtualization. Thus, big data and cloud computing are two compatible concepts as
cloud enables big data to be available, scalable and fault tolerant. Businesses regard big data as a valuable business opportunity. As such, several new companies, such as Cloudera,
Hortonworks, Teradata and many others, have started to focus on delivering Big Data as a
Service (BDaaS) or DataBase as a Service (DBaaS). Companies such as Google, IBM,
Amazon and Microsoft also provide ways for consumers to consume big data on demand.
Next, we present two examples, Nokia and RedBus, which discuss the successful use of big data within cloud environments.
3.1 Nokia
Nokia was one of the first companies to understand the advantage of big data in cloud environments (Cloudera, 2012). Several years
ago, the company used individual DBMSs to accommodate each application requirement.
However, realizing the advantages of integrating data into one application, the company
decided to migrate to Hadoop-based systems, integrating data within the same domain,
leveraging the use of analytics algorithms to get proper insights over its clients. As Hadoop
uses commodity hardware, the cost per terabyte of storage was cheaper than a traditional
RDBMS (Cloudera, 2012). Since Cloudera Distributed Hadoop (CDH) bundles the most
popular open source projects in the Apache Hadoop stack into a single, integrated package,
with stable and reliable releases, it embodies a great opportunity for implementing Hadoop
infrastructures and transferring IT and technical concerns onto the vendor's specialized
teams. Nokia regarded Big Data as a Service (BDaaS) as an advantage and trusted Cloudera
to deploy a Hadoop environment that copes with its requirements in a short time frame.
Hadoop, and in particular CDH, strongly helped Nokia to fulfil its needs (Cloudera, 2012).
3.2 RedBus
RedBus is the largest company in India specialized in online bus ticket and hotel booking. This company wanted to implement a powerful data analysis tool to gain insights into its bus booking service (Kumar, 2006). Its datasets could easily stretch up to 2 terabytes in size. The application would have to be able to analyse booking and inventory data across hundreds of bus operators serving more than 10,000 routes. Furthermore, the company
needed to avoid setting up and maintaining a complex in-house infrastructure. At first,
RedBus considered implementing in-house clusters of Hadoop servers to process data. However, they soon realized it would take too much time to set up such a solution and that it would require specialized IT teams to maintain such an infrastructure. The company then
regarded Google BigQuery as the perfect match for its needs, allowing it to: know how many times consumers tried to find an available seat but were unable to do so due to bus overload; examine decreases in bookings; and quickly identify server problems by analysing data related to server activity. Moving towards big data brought RedBus business advantages. Google BigQuery armed RedBus with real-time data analysis capabilities at 20% of the cost of maintaining a complex Hadoop infrastructure (Kumar, 2006). As supported by the Nokia and
RedBus examples, switching towards big data enables organizations to gain competitive
advantage. Additionally, BDaaS provided by big data vendors allows companies to leave the
technical details for big data vendors and focus on their core business needs.
The rise of cloud computing and cloud data stores has been a precursor to, and a facilitator of, the emergence of big data. Cloud computing is the commodification of computing time and
data storage by means of standardized technologies.
This leads to a dilemma for decision makers in charge of big data projects: which cloud computing offering is the optimal choice for their computing needs, especially for a big data project? These projects regularly exhibit unpredictable, bursting, or immense computing
power and storage needs. At the same time business stakeholders expect swift, inexpensive,
and dependable products and project outcomes. This article introduces cloud computing and
cloud storage, the core cloud architectures, and discusses what to look for and how to get
started with cloud computing.
CLOUD PROVIDERS
A decade ago an IT project or start-up that needed reliable and Internet connected computing
resources had to rent or place physical hardware in one or several data centers. Today,
anyone can rent computing time and storage of any size. The range starts with virtual
machines barely powerful enough to serve web pages to the equivalent of a small
supercomputer. Cloud services are mostly pay-as-you-go, which means for a few hundred
dollars anyone can enjoy a few hours of supercomputer power. At the same time cloud
services and resources are globally distributed. This setup ensures a high availability and durability unattainable by all but the largest organizations.
The cloud computing space has been dominated by Amazon Web Services until recently.
Increasingly serious alternatives are emerging like Google Cloud Platform, Microsoft Azure,
Rackspace, or Qubole to name only a few. Importantly for customers a struggle on platform
standards is underway. The two front-running solutions are Amazon Web Services
compatible solutions, i.e. Amazon's own offering or companies with application
programming interface compatible offerings, and OpenStack, an open source project with a
wide industry backing. Consequently, the choice of a cloud platform standard has
implications on which tools are available and which alternative providers with the same
technology are available.
CLOUD STORAGE
Professional cloud storage needs to be highly available, highly durable, and has to scale from a few bytes to petabytes. Amazon's S3 cloud storage and Microsoft Azure Blob Storage are the most prominent solutions in the space. They promise in the range of 99.9% monthly
availability and 99.999999999% durability per year. This is less than an hour outage per
month. The durability can be illustrated with an example: if a customer stores 10,000 objects, he can expect to lose one object every 10,000,000 years on average (an annual loss probability of 10^-11 per object times 10,000 objects gives 10^-7 expected losses per year). Providers typically achieve this by storing data in multiple facilities, with error checking and self-healing processes to detect and repair errors and device failures. This is completely transparent to the user and requires no actions or knowledge.
A company could build and achieve a similarly reliable storage solution but it would require
tremendous capital expenditures and operational challenges. Global data centered companies
like Google or Facebook have the expertise and scale to do this economically. Big data
projects and start-ups, however, benefit from using a cloud storage service. They can trade
capital expenditure for an operational one, which is excellent since it requires no capital
outlay or risk. From the first byte, it provides reliable and scalable storage solutions of a quality otherwise unachievable.
This enables new products and projects with a viable option to start on a small scale with low costs. When a product proves successful, these storage solutions scale virtually indefinitely. Cloud storage is effectively a boundless data sink. Importantly for computing performance, many solutions also scale horizontally, i.e. when data is copied in parallel by cluster or parallel computing processes, the throughput scales linearly with the number of nodes reading or writing.
CLOUD COMPUTING
This standardization makes it an elastic and highly available option for computing needs. The
availability is not obtained by spending resources to guarantee reliability of a single instance
but by their interchangeability and a limitless pool of replacements. This impacts design
decisions and requires designs that deal with instance failure gracefully.
The implications for an IT project or company using cloud computing are significant and
change the traditional approach to planning and utilization of resources. Firstly, resource
planning becomes less important. It is required for costing scenarios to establish the viability
of a project or product. However, to be successful, the focus needs to be on deploying and removing resources automatically based on demand. Vertical and horizontal scaling become viable once a resource becomes easily deployable.
Vertical scaling refers to the ability to replace a single small computing resource with a bigger one to account for increased demand. Cloud computing supports this by making various resource types available and allowing switching between them. This also works in the opposite direction, i.e. to switch to a smaller and cheaper instance type when demand decreases. Since
cloud resources are commonly paid on a usage basis no sunk cost or capital expenditures are
blocking fast decision making and adaptation. Demand is difficult to anticipate despite
planning efforts, and in most traditional projects this naturally results in over- or under-provisioned resources. Therefore, traditional projects tend to waste money or provide poor outcomes.
Horizontal scaling achieves elasticity by adding additional instances, with each of them serving a part of the demand. Software like Hadoop is specifically designed as a distributed system to take advantage of horizontal scaling. It processes small independent tasks at massive parallel
scale. Distributed systems can also serve as data stores like NoSQL databases, e.g. Cassandra
or HBase, or filesystems like Hadoop's HDFS. Alternatives like Storm provide coordinated
stream data processes in near real-time through a cluster of machines with complex
workflows.
The interchangeability of the resources, together with distributed software design, absorbs failures and, equally, the scaling of virtual computing instances, without disruption. Spiking or bursting demands can be accommodated just as well as seasonalities or continued growth.
Renting practically unlimited resources for short periods allows one-off or periodical projects
at a modest expense. Data mining and web crawling are great examples. It is conceivable to
crawl huge web sites with millions of pages in days or hours for a few hundred dollars or
less. Inexpensive tiny virtual instances with minimal CPU resources are ideal for this purpose
since the majority of crawling the web is spent waiting for IO resources. Instantiating
thousands of these machines to achieve millions of requests per day is easy and often costs
less than a fraction of a cent per instance hour.
Of course, such mining operations should be mindful of the resources of the web sites or
application interfaces they mine, respect their terms, and not impede their service. A poorly
planned data mining operation is equivalent to a denial of service attack. Lastly, cloud computing is naturally a good fit for storing and processing the big data accumulated from such operations.
Cloud Architecture
Three main cloud architecture models have developed over time: private, public and hybrid cloud. They all share the idea of resource commodification and to that end usually virtualize
computing and abstract storage layers.
PRIVATE CLOUD
Private clouds are dedicated to one organization and do not share physical resources. The
resource can be provided in-house or externally. A typical underlying requirement of private
cloud deployments is security requirements and regulations that need a strict separation of an organization's data storage and processing from accidental or malicious access through shared resources. Private cloud setups are challenging, since the economic advantages of scale are usually not achievable within most projects and organizations despite the utilization of industry standards. The return on investment compared to public cloud offerings is rarely obtained, and the operational overhead and risk of failure are significant.
Additionally, cloud providers have captured the trend for increased security and provide
special environments, i.e. dedicated hardware to rent, encrypted virtual private networks, as well as encrypted storage, to address most security concerns. Cloud providers may also offer
data storage, transfer, and processing restricted to specific geographic regions to ensure
compliance with local privacy laws and regulations.
Another reason for private cloud deployments is legacy systems with special hardware needs or exceptional resource demands, e.g. extreme memory or computing instances which are not available in public clouds. These are valid concerns; however, if these demands are extraordinary, the question of whether a cloud architecture is the correct solution at all has to be raised. One reason can be to establish a private cloud for a transitional period, to run legacy and demanding systems in parallel while their services are ported to a cloud environment, culminating in a switch to a cheaper public or hybrid cloud.
PUBLIC CLOUD
Public clouds share physical resources for data transfers, storage, and processing. However, customers have private virtualized computing environments and isolated storage. Security concerns, which entice a few to adopt private clouds or custom deployments, are irrelevant for the vast majority of customers and projects. Virtualization makes access to other customers' data extremely difficult.
Real-world problems around public cloud computing are more mundane, like data lock-in and fluctuating performance of individual instances. The data lock-in is a soft measure and works by making data inflow to the cloud provider free or very cheap. The copying of data out to local systems or other providers is often more expensive. This is not an insurmountable problem, and in practice it encourages customers to utilize more services from a cloud provider instead of moving data in and out for different services or processes. Usually this is not sensible anyway due to network speed and the complexities of dealing with multiple platforms.
The varying performance of instances stems typically from the dependency on what kind of
load other customers generate on the shared physical infrastructure. Secondly, over time the
physical infrastructure providing the virtual resources changes and is updated. The available
resources for each customer on a physical machine are usually throttled to ensure that each
customer receives a guaranteed level of performance. Larger resources generally deliver very predictable performance since they are much more closely aligned with the performance of the physical instances. Horizontally scaling projects with small instances should not rely on the exact performance of each instance, but should be adaptive, focus on the average performance required, and scale according to need.
HYBRID CLOUD
The hybrid cloud architecture merges private and public cloud deployments. This is often an
attempt to achieve security and elasticity, or provide cheaper base load and burst capabilities.
Some organizations experience short periods of extremely high loads, e.g. as a result of
seasonality like Black Friday for retail, or marketing events like sponsoring a popular TV
event. These events can have huge economic impact to organizations if they are serviced
poorly.
The hybrid cloud provides the opportunity to serve the base load with in-house services and
rent for a short period a multiple of the resources to service the extreme demand. This
requires a great deal of operational ability in the organization to seamlessly scale between the
private and public cloud. Tools for hybrid or private cloud deployments exist like Eucalyptus
for Amazon Web Services. In the long term, the additional expense of the hybrid approach often is not justifiable, since cloud providers offer major discounts for multi-year commitments. This makes moving base-load services to the public cloud attractive, since it is accompanied by a simpler deployment strategy.
Typical cloud big data projects focus on scaling or adopting Hadoop for data processing.
MapReduce has become a de facto standard for large-scale data processing. Tools like Hive and Pig have emerged on top of Hadoop which make it feasible to process huge data sets easily. Hive, for example, transforms SQL-like queries into MapReduce jobs. It unlocks data sets of all sizes for data and business analysts for reporting and greenfield analytics projects.
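As a hedged illustration of that point, a SQL-like query can be submitted to Hive over its standard HiveServer2 JDBC interface and is compiled by Hive into MapReduce jobs behind the scenes; the connection URL, table and column names below are placeholders.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuerySketch {
    public static void main(String[] args) throws Exception {
        // Hive JDBC driver; the jar must be on the classpath.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // Placeholder HiveServer2 address and database.
        String url = "jdbc:hive2://localhost:10000/default";

        try (Connection conn = DriverManager.getConnection(url, "", "");
             Statement stmt = conn.createStatement();
             // Hive compiles this SQL-like query into one or more MapReduce jobs.
             ResultSet rs = stmt.executeQuery(
                     "SELECT page, COUNT(*) AS hits FROM access_logs GROUP BY page")) {
            while (rs.next()) {
                System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
            }
        }
    }
}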
Data can be either transferred to or collected in a cloud data sink like Amazon's S3 or Microsoft Blob Storage, e.g. to collect log files or export text-formatted data. Alternatively, database adapters can be utilized to access data from databases directly with Hadoop, Hive,
and Pig. Qubole is a leading provider of cloud based services in this space. They provide
unique database adapters that can unlock data instantly, which otherwise would be
inaccessible or require significant development resources. One great example is their MongoDB adapter. It gives Hive table-like access to MongoDB collections. Qubole scales Hadoop jobs to extract data as quickly as possible without overpowering the MongoDB instance.
Ideally a cloud service provider offers Hadoop clusters that scale automatically with the
demand of the customer. This provides maximum performance for large jobs and optimal
savings when little or no processing is going on. Amazon Web Services' Elastic MapReduce and Azure HDInsight, for example, allow scaling of Hadoop clusters. However, the scaling does not happen automatically with demand and requires user action. The scaling itself is not optimal since it does not utilize HDFS well and squanders Hadoop's strong point, data locality. This means that an Elastic MapReduce cluster wastes resources when scaling and has diminishing returns with more instances. Furthermore, Amazon's Elastic MapReduce and HDInsight require a customer to explicitly request a cluster every time it is needed and remove it
when it is not required anymore. There is also no user friendly interface for interaction with
or exploration of the data. This results in operational burden and excludes all but the most
proficient users.
Qubole scales and handles Hadoop clusters very differently. The clusters are managed
transparently without any action required by the user. When no activity is taking place
clusters are stopped and no further expenses accumulate. The Qubole system detects demand,
e.g. when a user queries Hive, and starts a new cluster if needed. It does this even faster than
Amazon raises its clusters on explicit user requests. The clusters that Qubole manages for the
user have a user defined minimum and maximum size and scale as needed to provide the user
with the optimal performance and minimal expense.
Importantly users, developers, data engineers and business analysts alike, require an easy to
use graphical interface for ad hoc data analysis access, and to design jobs and workflows.
Qubole provides a powerful web interface including workflow management and querying
capabilities. Data is accessed from permanent data stores like S3 or Azure Blob Storage, and through database connectors, using transient clusters. The pay-as-you-go billing of cloud computing
makes it easy to compare and try out systems.
Big data and the Internet of Things: Two sides of the same coin
Read each statement below and determine if it's referring to big data or the Internet of
Things:
1. Every minute, we send 204 million emails, generate 1.8 million Facebook likes, send 278
thousand tweets, and upload 200 thousand photos to Facebook. Is this statement about big
data or the Internet of Things?
2. 12 million RFID tags (used to capture data and track movement of objects in the physical
world) were sold in 2011. By 2021, it's estimated this number will increase to 209 billion as
[big data or the Internet of Things?] takes off.
3. The boom of [big data or the Internet of Things?] will mean that the amount of devices that
connect to the internet will rise from about 13 billion today to 50 billion by 2020.
4. The [big data or the Internet of Things?] industry is expected to grow from US$10.2 billion
in 2013 to about US$54.3 billion by 2017.
Here are the answers: 1: big data; 2: Internet of Things; 3: Internet of Things; and 4: big data.
Big Data
Be transparent about what data is collected, how data is processed, for what purposes
data will be used, and whether data will be distributed to third parties.
Define the purpose of collection at the time of collection and, at all times, limit use
of the data to the defined purpose.
Obtain consent.
Collect and store only the amount of data necessary for the intended lawful purpose.
Allow individuals access to data maintained about them, information on the source
of the data, key inputs into their profile, and any algorithms used to develop their profile.
Conduct regular reviews to verify if results from profiling are responsible, fair and
ethical and compatible with and proportionate to the purpose for which the profiles are being
used.
Internet of Things
Self-determination is an inalienable right for all human beings.
Data obtained from connected devices is high in quantity, quality and sensitivity
and, as such, should be regarded and treated as personal data.
Those offering connected devices should be clear about what data they collect, for
what purposes and how long this data is retained.
Data should be processed locally, on the connected device itself. Where it is not
possible to process data locally, companies should ensure end-to-end encryption.
Data protection and privacy authorities should seek appropriate enforcement action
when the law has been breached.
All actors in the internet of things ecosystem should engage in a strong, active and
constructive debate on the implications of the internet of things and the choices to be made.
There is clearly a relationship between big data and IoT. Big data is a subset of the IoT.
Big data is about data, plain and simple. Yes, you can add all sorts of adjectives when talking about big data, but at the end of the day, it's all data.
IoT is about data, devices, and connectivity. Data, big and small, is front and center in the IoT world of connected devices.
Replication. A big part of cloud computing and storage is the process of data distribution and
replication. New storage systems must be capable of not only managing data at the
primary site but they must also be able to replicate that information efficiently to
other locations. Why? There is a direct need to manage branch office, remote sites,
other data centres, and of course disaster recovery. Setting the right replication
infrastructure will mean managing bandwidth, scheduling and what data is actually
pushed out. Storage can be a powerful tool for both cloud computing and business
continuity. The key is understanding the value of your data and identifying where that
data fits in with your organization.
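As a rough sketch of the idea of pushing out only the data that has actually changed (an illustration only; the directory paths and the layout are assumptions, not any specific product's mechanism), the Python snippet below replicates from a primary directory to a replica by copying just the files whose content hash differs from the previous run:

# Minimal incremental-replication sketch (assumed paths; illustrative only).
# Only files whose content hash changed since the last run are copied to the replica,
# which keeps replication bandwidth proportional to what actually changed.
import hashlib
import os
import shutil

PRIMARY = "/data/primary"      # assumed primary storage path
REPLICA = "/data/replica"      # assumed remote/replica mount point
state = {}                     # last known content hash per relative file path

def file_hash(path):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def replicate_changed_files():
    for root, _, files in os.walk(PRIMARY):
        for name in files:
            src = os.path.join(root, name)
            rel = os.path.relpath(src, PRIMARY)
            digest = file_hash(src)
            if state.get(rel) != digest:          # new or changed since last run
                dst = os.path.join(REPLICA, rel)
                os.makedirs(os.path.dirname(dst), exist_ok=True)
                shutil.copy2(src, dst)            # push only the changed data
                state[rel] = digest

replicate_changed_files()

In a real deployment the hash state would be persisted and the copy step scheduled and bandwidth-limited, but the core decision of what to push out is the same.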
Data deduplication. Control over the actual data within the storage environment has always
been a big task as well. Storage resources aren't only finite, they're expensive. So data
deduplication can help manage the data that sits on the storage array as well as information
being used by other systems. For example, instead of storing 100 copies of a 20 MB
attachment, the storage array would be intelligent enough to store only one file and create 99
pointers to it. If a change were made to the file, the system would be smart enough to log
those changes and create secondary pointers to a new file.
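To illustrate how pointer-based deduplication works in principle (a simplified sketch, not any vendor's implementation), the store below keeps a single physical copy of each unique piece of content and hands out pointers, in the form of content hashes, for every duplicate:

# Simplified content-addressed deduplication store (illustrative sketch only).
import hashlib

class DedupStore:
    def __init__(self):
        self.blocks = {}    # content hash -> single stored copy of the bytes
        self.files = {}     # file name -> pointer (hash) to its content

    def put(self, name, data: bytes):
        digest = hashlib.sha256(data).hexdigest()
        if digest not in self.blocks:   # store the content only once
            self.blocks[digest] = data
        self.files[name] = digest       # every duplicate just gets a pointer

    def get(self, name) -> bytes:
        return self.blocks[self.files[name]]

store = DedupStore()
attachment = b"x" * (20 * 1024 * 1024)        # one 20 MB attachment
for i in range(100):
    store.put(f"mail_{i}.bin", attachment)     # 100 logical files...
print(len(store.blocks))                       # ...but only 1 physical copy

Running it with 100 identical 20 MB attachments leaves exactly one stored block, mirroring the one-file-plus-99-pointers example above.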
Because cloud computing will only continue to advance, new demands will be placed on
storage. Even now, conversations around big data and storage are heating up. Whether it's big
data, a distributed file system, cloud computing or just the user environment, the storage
infrastructure will always play an important role. The idea will always revolve around ease of
management and control over the data. In developing a solid storage platform, always make
sure to plan for the future, since data growth will be an inevitable part of today's cloud
environment.
The internal data of enterprises mainly consists of online trading data and online analysis
data, most of which is historical, static data managed by RDBMSs in a structured manner. In
addition, production data, inventory data, sales data, and financial data also constitute
enterprise internal data, which captures the information-based and data-driven activities of an
enterprise so as to record all of its activities in the form of internal data.
Over the past decades, IT and digital data have contributed a lot to improving the profitability
of business departments. It is estimated that the business data volume of all companies in the
world may double every 1.2 years [1], and that the daily business turnover over the Internet,
both business-to-business and business-to-consumer, will reach USD 450 billion [2]. The
continuously increasing business data volume requires more effective real-time analysis so as
to fully harvest its potential. For example, Amazon processes millions of terminal operations
and more than 500,000 queries from third-party sellers per day [3]. Walmart processes one
million customer transactions per hour, and such transaction data is imported into a database
with a capacity of over 2.5 PB [4]. Akamai analyzes 75 million events per day for its targeted
advertisements [5].
1.4 Bio-medical Data
As a series of high-throughput bio-measurement technologies were innovatively developed at
the beginning of the twenty-first century, frontier research in the bio-medicine field also
entered the era of big data. By constructing smart, efficient, and accurate analytical models
and theoretical systems for bio-medicine applications, the essential governing mechanisms
behind complex biological phenomena may be revealed. Not only can the future development
of bio-medicine be shaped, but leading roles can also be assumed in the development of a
series of important strategic industries related to the national economy, people's livelihood
and national security, with important applications such as medical care, new drug R&D, and
grain production (e.g., transgenic crops). It is predictable that, with the development of bio-
medicine technologies, gene sequencing will become faster and more convenient, thus making
big data in bio-medicine grow continuously beyond all doubt. Apart from small and
medium-sized enterprises, well-known IT companies such as Google, Microsoft, and IBM
have also invested extensively in research on and computational analysis of methods related
to high-throughput biological big data.
Log Files: As one widely used data collection method, log files are record files automatically
generated by the data source system to record activities in designated file formats for
subsequent analysis. Log files are used in nearly all digital devices. For example, web servers
record in log files the number of clicks, click rates, visits, and other properties of web users.
To capture the activities of users at web sites, web servers mainly use the following three log
file formats: the common log file format (NCSA), the extended log format (W3C), and the IIS
log format (Microsoft). All three types of log files are in ASCII text format. Databases, rather
than text files, may sometimes be used to store log information to improve the query
efficiency of massive log stores. There are also other forms of log-based data collection,
including stock indicators in financial applications and the recording of operating states in
network monitoring and traffic management.
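As a small, self-contained illustration (the sample line is invented for the example), a log entry in the NCSA common log format mentioned above can be parsed with a regular expression to extract per-request fields for later analysis:

# Parse a web-server access log line in NCSA common log format (illustrative sketch).
import re

# host ident authuser [date] "request" status bytes
NCSA_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\d+|-)'
)

sample = '203.0.113.7 - frank [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.0" 200 2326'

match = NCSA_PATTERN.match(sample)
if match:
    fields = match.groupdict()
    print(fields["host"], fields["status"], fields["size"])   # 203.0.113.7 200 2326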
Sensors: Sensors are common in daily life for measuring physical quantities and transforming
them into readable digital signals for subsequent processing (and storage). Sensory data may
be classified as sound wave, voice, vibration, automobile, chemical, current, weather,
pressure, temperature, etc. Sensed information is transferred to a data collection point through
wired or wireless networks. For applications that may be easily deployed and managed, e.g., a
video surveillance system [10], a wired sensor network is a convenient solution for acquiring
the related information. Sometimes, however, the accurate position of a specific phenomenon
is unknown, or the monitored environment does not have the energy or communication
infrastructure. In such cases, wireless communication must be used to enable data
transmission among sensor nodes under limited energy and communication capability. In
recent years, wireless sensor networks (WSNs) have received considerable interest and have
been applied to many applications, such as environmental research, water quality monitoring,
civil engineering and wildlife habitat monitoring. A WSN generally consists of a large number
of geographically distributed sensor nodes, each being a micro device powered by a battery.
Such sensors are deployed at designated positions as required by the application to collect
remote sensing data. Once the sensors are deployed, the base station sends control information
for network configuration/management or data collection to the sensor nodes. Based on such
control information, the sensory data is assembled at the different sensor nodes and sent back
to the base station for further processing. Interested readers are referred to the literature for
more detailed discussions.
Intra-DCN transmissions are the data communication flows within data centers. Intra-DCN
transmissions depend on the communication mechanism within the data center (i.e., on
physical connection plates, chips, internal memories of data servers, network architectures of
data centers, and communication protocols).
A data center consists of multiple integrated server racks interconnected by its internal
connection networks. Nowadays, the internal connection networks of most data centers are
fat-tree, two-layer or three-layer structures based on multi-commodity network flows. In the
two-layer topological structure, the racks are connected by 1 Gbps top-of-rack (ToR) switches,
and these ToR switches are in turn connected to 10 Gbps aggregation switches. The three-layer
topological structure adds one more layer on top of the two-layer structure, constituted by 10
or 100 Gbps core switches that connect the aggregation switches. There are also other
topological structures which aim to improve data center networks.
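To make the bandwidth figures above a little more concrete, the short sketch below computes the oversubscription ratio of a two-layer topology, i.e. how the total 1 Gbps server links feeding a top-of-rack switch compare with its 10 Gbps uplinks to the aggregation layer. The numbers of servers per rack and uplinks are illustrative assumptions, not figures from the text:

# Oversubscription of a two-layer data-center topology (illustrative sketch, assumed numbers).
SERVERS_PER_RACK = 40          # assumption
SERVER_LINK_GBPS = 1           # 1 Gbps links from servers to the ToR switch
UPLINKS_PER_TOR = 4            # assumption: 4 uplinks from each ToR switch
UPLINK_GBPS = 10               # 10 Gbps links to the aggregation switches

downlink_capacity = SERVERS_PER_RACK * SERVER_LINK_GBPS   # 40 Gbps entering the ToR switch
uplink_capacity = UPLINKS_PER_TOR * UPLINK_GBPS           # 40 Gbps leaving the ToR switch

oversubscription = downlink_capacity / uplink_capacity
print(f"Oversubscription ratio: {oversubscription}:1")     # 1.0:1 with these assumed numbers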
Because of the inadequacy of electronic packet switches, it is difficult to increase
communication bandwidths while keeping energy consumption low. Over the years, owing to
the huge success achieved by optical technologies, optical interconnection among the
networks in data centers has drawn great interest. Optical interconnection is a high-throughput,
low-delay, and low-energy-consumption solution. At present, optical technologies are only
used for point-to-point links in data centers. Such optical links provide connections for the
switches using low-cost multi-mode fiber (MMF) with a 10 Gbps data rate. Optical
interconnection (switching in the optical domain) of networks in data centers is a feasible
solution, which can provide Tbps-level transmission bandwidth with low energy consumption.
Recently, many optical interconnection plans have been proposed for data center networks.
Some plans add optical paths to upgrade the existing networks, while other plans completely
replace the current switches. As a strengthening technology, Zhou et al. adopt wireless links in
the 60 GHz frequency band to strengthen wired links. Network virtualization should also be
considered to improve the efficiency and utilization of data center networks.
2.3 Data Pre-processing
Because of the wide variety of data sources, the collected datasets vary with respect to noise,
redundancy and consistency, and it is undoubtedly a waste to store meaningless data. In
addition, some analytical methods have stringent requirements on data quality. Therefore,
under many circumstances data should be pre-processed to integrate data from different
sources, so as to enable effective data analysis. Pre-processing data not only reduces storage
expense but also improves analysis accuracy. Some relevant data pre-processing techniques
are discussed in the following.
2.3.1 Integration
Data integration is the cornerstone of modern commercial informatics, which involves the
combination of data from different sources and provides users with a uniform view of the
data. This is a mature research field for traditional databases. Historically, two methods have
been widely recognized: data warehousing and data federation.
Data warehousing includes a process named ETL (Extract, Transform and Load). Extraction
involves connecting to source systems and selecting, collecting, analyzing, and processing the
necessary data. Transformation is the execution of a series of rules to transform the extracted
data into standard formats. Loading means importing the extracted and transformed data into
the target storage infrastructure. Loading is the most complex of the three procedures and
includes operations such as transformation, copying, clearing, standardization, screening, and
data organization.
A virtual database can be built to query and aggregate data from different data sources, but
such a database does not contain the data itself; on the contrary, it contains information, or
metadata, about the actual data and its location. These two "storage-reading" approaches do
not satisfy the high performance requirements of data flows or of search programs and
applications. Compared with queries, data in these two approaches is more dynamic and must
be processed during data transmission. Generally, data integration methods are accompanied
by flow processing engines and search engines.
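The ETL steps described above can be sketched very compactly. The example below is purely illustrative: the CSV file name, the column names and the in-memory SQLite database standing in for the target storage are all assumptions, not part of any particular warehouse product:

# Minimal ETL sketch: extract from CSV, transform to a standard format, load into SQLite.
import csv
import sqlite3

def extract(path):
    # Connect to the source (a CSV file here) and collect the necessary rows.
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    # Apply simple rules that bring the extracted data into a standard format.
    cleaned = []
    for row in rows:
        cleaned.append({
            "customer": row["customer"].strip().title(),   # standardize names
            "amount": round(float(row["amount"]), 2),       # standardize numeric format
        })
    return cleaned

def load(rows, conn):
    # Import the transformed data into the target storage infrastructure.
    conn.execute("CREATE TABLE IF NOT EXISTS sales (customer TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO sales (customer, amount) VALUES (:customer, :amount)", rows
    )
    conn.commit()

conn = sqlite3.connect(":memory:")                 # stand-in for the target warehouse
load(transform(extract("sales.csv")), conn)        # assumed input file "sales.csv"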
2.3.2 Cleaning
Data cleaning is a process to identify inaccurate, incomplete, or unreasonable data, and then
modify or delete such data to improve data quality. Generally, data cleaning includes five
complementary procedures: defining and determining error types, searching for and
identifying errors, correcting errors, documenting error examples and error types, and
modifying data entry procedures to reduce future errors. During cleaning, data formats,
completeness, rationality, and restrictions shall be inspected. Data cleaning is of vital
importance for keeping data consistent, and it is widely applied in many fields, such as
banking, insurance, the retail industry, telecommunications, and traffic control.
In e-commerce, most data is collected electronically and may have serious data quality
problems. Classic data quality problems mainly come from software defects, customization
errors, or system mis-configuration. Data cleaning in e-commerce has been carried out by
crawlers and by regularly re-copying customer and account information. The problem of
cleaning RFID data has also been examined. RFID is widely used in many applications, e.g.,
inventory management and target tracking. However, raw RFID data is of low quality and
includes a lot of abnormal readings, limited by the physical design and affected by
environmental noise. A probability model was developed to cope with data loss in mobile
environments. Khoussainova et al. proposed a system to automatically correct errors in input
data by defining global integrity constraints.
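As a tiny illustration of the error-identification and correction steps listed above (an illustrative sketch; the field names and validity rules are assumptions rather than rules from any specific system), the code below flags records that violate simple format and range constraints, keeps the clean ones, and documents the rejected examples with their error types:

# Minimal data-cleaning sketch: detect records that violate simple constraints.
import re

records = [
    {"email": "alice@example.com", "age": "34"},
    {"email": "bob[at]example.com", "age": "29"},    # malformed email
    {"email": "carol@example.com", "age": "-5"},     # out-of-range age
]

EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def clean(rows):
    kept, rejected = [], []
    for row in rows:
        errors = []
        if not EMAIL_RE.match(row["email"]):
            errors.append("bad email format")
        age = int(row["age"])
        if not 0 <= age <= 120:
            errors.append("age out of range")
        if errors:
            rejected.append((row, errors))    # document error examples and error types
        else:
            kept.append({"email": row["email"], "age": age})
    return kept, rejected

kept, rejected = clean(records)
print(len(kept), "clean records;", len(rejected), "rejected")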
Retail traders, big banks, hedge funds and other so-called big boys in the financial markets use
big data for trade analytics in high-frequency trading, pre-trade decision-support analytics,
sentiment measurement, predictive analytics, etc.
This industry also relies heavily on big data for risk analytics, including anti-money
laundering, demand enterprise risk management, "Know Your Customer", and fraud
mitigation.
Big data providers specific to this industry include: 1010data, Panopticon Software,
Streambase Systems, Nice Actimize and Quartet FS.
Spotify, an on-demand music service, uses Hadoop big data analytics to collect data from its
millions of users worldwide and then uses the analyzed data to give informed music
recommendations to individual users.
Amazon Prime, which is driven to provide a great customer experience by offering video,
music and Kindle books in a one-stop shop, also heavily utilizes big data.
Big data providers in this industry include: Infochimps, Splunk, Pervasive Software, and
Visible Measures.
3. Healthcare sector
Some hospitals, like Beth Israel, are using data collected from a cell phone app from millions
of patients to allow doctors to practice evidence-based medicine, as opposed to administering
several medical/lab tests to every patient who goes to the hospital. A battery of tests can be
efficient, but it can also be expensive and is often ineffective.
Free public health data and Google Maps have been used by the University of Florida to
create visual data that allows for faster identification and efficient analysis of healthcare
information, used in tracking the spread of chronic disease.
4. Education
Big data is used quite significantly in higher education. For example, the University of
Tasmania, an Australian university with over 26,000 students, has deployed a Learning and
Management System that tracks, among other things, when a student logs onto the system,
how much time is spent on different pages in the system, as well as the overall progress of a
student over time.
In a different use case of big data in education, it is also used to measure teachers'
effectiveness to ensure a good experience for both students and teachers. Teachers'
performance can be fine-tuned and measured against student numbers, subject matter, student
demographics, student aspirations, behavioral classification and several other variables.
In the natural resources industry, big data allows for predictive modeling to support decision
making, and it has been utilized to ingest and integrate large amounts of geospatial, graphical,
textual and temporal data. Areas of interest where this has been used include seismic
interpretation and reservoir characterization.
Big data has also been used to solve today's manufacturing challenges and to gain competitive
advantage, among other benefits.
A study by Deloitte shows the supply chain capabilities built on big data that are currently in
use and their expected use in the future.
6. In Government
In public services, big data has a very wide range of applications, including energy
exploration, financial market analysis, fraud detection, health-related research and
environmental protection.
Big data is being used in the analysis of the large volume of social disability claims made to
the Social Security Administration (SSA), which arrive in the form of unstructured data. The
analytics are used to process medical information rapidly and efficiently for faster decision
making and to detect suspicious or fraudulent claims.
The Food and Drug Administration (FDA) is using big data to detect and study patterns of
food-related illnesses and diseases. This allows for a faster response, which has led to faster
treatment and fewer deaths.
The Department of Homeland Security uses big data for several different use cases. Big data
is analyzed from different government agencies and is used to protect the country.
Big data has been used in the insurance industry to provide customer insights for transparent
and simpler products, by analyzing and predicting customer behavior through data derived
from social media, GPS-enabled devices and CCTV footage. Big data also allows for better
customer retention by insurance companies.
When it comes to claims management, predictive analytics from big data has been used to
offer faster service, since massive amounts of data can be analyzed, especially in the
underwriting stage. Fraud detection has also been enhanced.
Through massive data from digital channels and social media, real-time monitoring of claims
throughout the claims cycle has been used to provide insights.
Big data from customer loyalty programs, point-of-sale (POS) systems, store inventory and
local demographics continues to be gathered by retail and wholesale stores.
At New York's Big Show retail trade conference in 2014, companies like Microsoft, Cisco and
IBM pitched the need for the retail industry to utilize big data for analytics and for other uses,
including:
Optimized staffing through data from shopping patterns, local events, and so on
Reduced fraud
Some applications of big data by governments, private organizations and individuals include:
Governments' use of big data: traffic control, route planning, intelligent transport
systems, congestion management (by predicting traffic conditions)
Individuals' use of big data: route planning to save on fuel and time, travel
arrangements in tourism, etc.
In utility companies, the use of big data also allows for better asset and workforce
management, which is useful for recognizing errors and correcting them as soon as possible,
before complete failure is experienced.