BDCC 03 00032 v2 PDF
cognitive computing
Review
Big Data and Business Analytics: Trends, Platforms,
Success Factors and Applications
Ifeyinwa Angela Ajah * and Henry Friday Nweke
Department of Computer Science, Ebonyi State University, P.M.B 053, Abakaliki 480214, Nigeria;
[email protected]
* Correspondence: [email protected]
Received: 4 April 2019; Accepted: 5 June 2019; Published: 10 June 2019
Abstract: Big data and business analytics are trends that are positively impacting the business world.
Past research shows that the data generated in the modern world are huge and growing exponentially.
These include structured and unstructured data that flood organizations daily. Unstructured data
constitute the majority of the world’s digital data and these include text files, web, and social media
posts, emails, images, audio, movies, etc. The unstructured data cannot be managed in the traditional
relational database management system (RDBMS). Therefore, data proliferation requires a rethinking
of techniques for capturing, storing, and processing the data. This is the role big data has come to
play. This paper, therefore, aims to draw the attention of organizations and researchers to
various applications and benefits of big data technology. The paper reviews and discusses the recent
trends, opportunities and pitfalls of big data and how it has enabled organizations to create successful
business strategies and remain competitive, based on available literature. Furthermore, the review
presents the various applications of big data and business analytics, data sources generated in these
applications and their key characteristics. Finally, the review not only outlines the challenges for
successful implementation of big data projects but also highlights the current open research directions
of big data analytics that require further consideration. The reviewed areas of big data suggest that
good management and manipulation of the large data sets using the techniques and tools of big data
can deliver actionable insights that create business values.
Keywords: big data; business analytics; business intelligence; Hadoop ecosystem; big data tools;
review and business value
1. Introduction
In the late 1980s, data warehouse technology, which is generally categorized as online analytical
processing (OLAP), was introduced by relational database management system (RDBMS) companies
to support business decisions and business intelligence. It was originally designed to archive large
amounts of data out of production databases and so keep those databases lean for good performance.
In data warehousing, multiple copies of data are located on multiple database servers referred to as
data marts. A data mart can be independent or an enterprise data mart. From there, data is then
extracted and loaded into two analytical data marts. Here, the data analysts create their algorithms
to run their jobs. One of the data marts links to a statistical analyst and the other to a business user.
While data warehousing has succeeded in creating business value through detailed reporting based on
complex statistical modeling [1,2], continuously moving data over the network is challenging and
it takes a long time to yield results [3]. Furthermore, there are limits on the data volume that can be
stored on the system. In addition, data are now generated continuously, which makes big data
difficult to process. Big data has recently garnered much attention in government, industry,
science, engineering, healthcare and medicine, finance and, most prominently, business [4].
The data generated in these areas are characterized by high volume, cannot be categorized within a
relational database management system, and are generated, captured and processed rapidly [1].
Therefore, the major challenge facing various organizations, industries, and other business sectors
is how to design appropriate techniques to handle and process this large volume of data to ensure
effective and efficient decision-making.
Recently, big data and business analytics approaches have been developed and implemented to
analyze a large volume of data generated by different business organizations. Consequently, every
business needs faster insight into growing volumes of transactional data. Analyzing data in real time
helps organizations view the past and foresee the future. This is the appeal of streaming analytics, which
rests on knowing what occurred (descriptive), understanding why it happened (diagnostic),
looking ahead to what might take place (predictive) and, ultimately, determining how to influence
future occurrences (prescriptive). These four analytics flavors, which are explained in Section 3 of this
article, have huge business benefits but are progressively more difficult to implement and use. The
big data opportunity is not limited to achieving high efficiency in business operations. There are also
important opportunities for economic growth and for improving society's standard of living.
Big data analytics can improve the outputs of business organizations and industries in various ways.
These include improved health care delivery, standards of education, national security,
and good governance [5,6]. In addition, it has the potential to help policy-makers gain insight
into policies that will provide a safe environment for investors, to help waste managers find the type
of waste most generated in a particular locality, and to provide insight for sharing waste
collection material. Moreover, education monitoring agencies can deploy big data and business analytics
approaches to evaluate the performance of teachers and improve work attitudes. Furthermore, mobile
network location data can be used for traffic management to prevent traffic jams in big cities or better
plan the public transport system.
The goal of this study is to conduct a comprehensive investigation into big data and business
analytics methods for improved business decision making, technological approaches, applications,
and open research challenges. Furthermore, the study attempts to draw attention to the tremendous
benefits big data has brought to companies in developed countries and how these can be replicated by
indigenous business organizations. Moreover, the study discusses various challenges facing big data
analytics with a focus on data security, management, characteristics, regulation, and compliance.
Research on big data analytics and its implementation has been conducted by
various researchers and industries for over a decade. This is due to the vital applications of big data
in various areas such as the healthcare system, business decision-making, educational development,
network optimization, travel estimation, and financial services. Therefore, quite a number of studies
and reviews have been published on big data analytics, implementations and related technologies in
recent times. Sing et al. [7] reviewed hardware and software parameters for effective big data analytics
developments. Additionally, Hashem et al. [4] presented taxonomy and intersection of cloud computing
and big data analytics. However, these studies focused on big data in cloud computing, software and
hardware parameters such as data availability, scalability, and data size for implementation of big
data analytics. The studies did not discuss important big data analytics tools, their strengths, and
weaknesses. Recently, reviews on big data analytics, open-source tools for big data implementation
and iterative clustering algorithms for big data analysis were presented by [8–10]. Tsai et al. [8] outlined
big data analytics approaches in terms of data mining and knowledge discovery. The authors primarily
discussed data mining algorithms that can be extended for big data analytics. Nonetheless, challenges,
applications, current tools and data sources for big data analytics were not comprehensively discussed.
Lanset et al. [9] presented open-source tools for big data analytics, their advantages and drawbacks.
However, the review is limited to tools, while other criteria for effective big data
implementation were not sufficiently covered. A closely related survey was presented recently
by Mohammedi et al. [11], who discussed big data technologies, applications and open-source tools for
big data analytics. However, our study differs from their review in many ways. First, the present
Big Data Cogn. Comput. 2019, 3, 32 3 of 30
review provides a broader view by focusing on the recent trends in big data and business analytics
development. Second, we discuss platforms and open-source tools, along with their strengths and weaknesses.
Third, this study presents big data success factors for analytic teams, their major functions, and
challenges for the implementation of analytics in organizations. Fourth, the current study presents
recent data sources and applications for big data and business analytics. Finally, the current review
outlines and discusses open research directions in big data and analytics. The review is a timely
exploration of big data and business analytics. The major differences between recent reviews and the
current study are presented in Table 1 below:
The remainder of this paper is organized as follows: Section 2 discusses the recent developments
in big data technologies. Section 3 presents big data analytics platforms while Section 4 explores the
success factors and challenges of big data implementation. Section 5 outlines the main applications
and data sources for big data and business analytics. Section 6 summarizes the study and explores
open research directions. Figure 1 outlines the structure of the paper.
Figure 1. Review structure.
healthcare systems. Moreover, recent developments in in-memory databases have increased
database performance and have made achievable data collection through the Internet of Things (IoT) and cloud
computing facilities that provide persistent large-scale data storage and transformation.
The surge in data volume is driven by a number of technologies, which include:
i. Distributed computing: Large-scale distributed computing systems based on open-source
technology provide direct access to, and long-term storage for, petabytes of
data while delivering extreme performance.
ii. Flash memory: Flash memory in solid-state drives allows computers to become universal. It delivers
random-access times of less than 0.1 milliseconds, unlike disk access times of 3 to 12 milliseconds.
It is highly likely that future big data solutions will use a lot of flash memory to
improve access time to data [17].
iii. Mobile devices: These represent computers everywhere; they create much of the big data and equally
receive outputs from big data solutions.
iv. Cloud computing: This created an entirely new economy of computing by moving storage,
databases and services into the cloud, and it offers great access for rapidly deploying big data solutions.
v. Data analytics: This is a multistage approach that includes the collection, preparation,
processing, analysis and visualization of large-scale data to produce actionable insight for
business intelligence.
vi. In-memory applications: These are significantly increasing database performance [12].
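The access times quoted for flash memory and disk in item ii can be turned into a rough comparison. The sketch below is only a back-of-the-envelope illustration: the workload of one million random reads is a hypothetical figure chosen for the example, and queuing, caching and parallelism are ignored.

```python
# Access times quoted in item ii, expressed in integer microseconds to avoid
# floating-point rounding. The flash figure is an upper bound ("less than 0.1 ms").
FLASH_ACCESS_US = 100                    # 0.1 ms
DISK_ACCESS_US_RANGE = (3_000, 12_000)   # 3 ms to 12 ms


def min_speedup(disk_us, flash_us=FLASH_ACCESS_US):
    """Lower bound on how many times faster one flash access is than one disk access."""
    return disk_us / flash_us


def total_seconds(accesses, access_us):
    """Time for `accesses` random reads issued one after another (no queuing, no cache)."""
    return accesses * access_us / 1_000_000


# Hypothetical workload, chosen only for illustration.
READS = 1_000_000

speedups = [min_speedup(d) for d in DISK_ACCESS_US_RANGE]
flash_time = total_seconds(READS, FLASH_ACCESS_US)
disk_times = [total_seconds(READS, d) for d in DISK_ACCESS_US_RANGE]

print("flash is at least", speedups, "times faster per access")
print("flash:", flash_time, "s; disk:", disk_times, "s")
```

Under these assumptions, a single flash access is at least 30 to 120 times faster than a disk access, which is the gap that motivates the expected use of flash memory in big data solutions.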
A huge percentage of these data for big data analytics is unstructured data derived from various
data sources and applications such as text files, weblogs, social media posts, emails, photo images,
audio, and movies. Big data technologies are meant to handle and manage unstructured data using key-value pairs.
The concept of big data has been defined by Will Dailey and by Gartner [17,18]. Dailey [17] defined big data as
“a supercomputing environment engineered to parallel process compute jobs across massive amounts
of distributed data for the purpose of analysis.” He viewed big data as Global Data Fabric in action and
the Centerpiece for the entire biosphere of modern computing. The Global Data Fabric idea shows how
big data creates strong connections among institutions and enables them to work as a team. On the
other hand, Gartner [18] defined big data as high-volume, high-velocity and high-variety information
assets that demand cost-effective, innovative forms of information processing for enhanced insight and
decision-making. There are various areas in which big data analytics has been actively implemented to
support effective business decision-making.
For example, a solution can be developed to tie customers’ and merchants’ bank verification numbers
(BVN) and subscriber identification module (SIM) registration details to a unique digital identity.
The solution will use the unique digital identification number (ID) and stream mobile payment
transaction data through a mobile device into a big data repository. The collected data are continuously
monitored, and standard machine learning techniques can be applied to discover whether a customer has
sent a fraudulent or false payment alert to a merchant. Such an occurrence would trigger a
warning alert that could be shared with the mobile operators and the merchant’s bank, possibly even
before the merchant releases the product. At the mobile operator’s end, the SIM registration record and
Global Positioning System (GPS) technology can be used to create the customer’s crime chart and alert
the police for the offender’s arrest. At the back end, the intelligent agent model running in the bank
application would trigger a warning alert to the merchant to ignore such a transaction request.
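The monitoring step in this scenario can be sketched in a few lines. The example below is a minimal illustration under stated assumptions, not the proposed solution itself: it replaces the machine learning techniques mentioned above with a simple rule (an alert is suspicious when no bank-confirmed transfer matches it), and the record layout, identifiers and amounts are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PaymentAlert:
    """One streamed payment alert; all field names are hypothetical."""
    alert_id: str
    customer_digital_id: str   # stands in for the unique ID tying BVN and SIM details
    merchant_digital_id: str
    amount_minor_units: int    # amount in the smallest currency unit, to avoid float issues


def flag_false_alerts(alerts, confirmed_transfers):
    """Yield every alert that has no matching bank-confirmed transfer.

    `confirmed_transfers` is a set of (customer_id, merchant_id, amount) tuples
    standing in for the bank's records; a real system would query the big data
    repository and would likely use a learned model rather than this exact-match rule.
    """
    for alert in alerts:
        key = (alert.customer_digital_id, alert.merchant_digital_id, alert.amount_minor_units)
        if key not in confirmed_transfers:
            yield alert


# Hypothetical data: one genuine transfer and one alert with no transfer behind it.
confirmed = {("C-001", "M-100", 250_000)}
incoming = [
    PaymentAlert("A1", "C-001", "M-100", 250_000),
    PaymentAlert("A2", "C-002", "M-100", 900_000),
]

suspicious = list(flag_false_alerts(incoming, confirmed))
for alert in suspicious:
    print(f"WARNING: unconfirmed alert {alert.alert_id} sent to merchant {alert.merchant_digital_id}")
```

Because the function is a generator, it can consume alerts one at a time as they arrive, which mirrors the continuous monitoring described in the text.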
At the big data repository, all of these data can then be mapped to other data, such as network
failure logs, failed payment transactions, technology awareness data and wrong debit records. These can
undergo further analysis to understand users’ experience and ascertain the root cause of the low acceptance of
mobile money by merchants across the country. The information could then be used to develop an intelligent
business model and enable policy that will build merchants’ and customers’ trust in mobile money payment.
This, in general, will rapidly help actualize the government initiative of a cashless society [18].
Big data are characterized by various vectors as outlined by Gartner and shown in Figure 2 below.
Figure 2. The Gartner’s Vector model. (Figure labels: volume, terabytes to zettabytes; structured data, structured and unstructured data; batch, streaming data.)
These vectors include volume, variety, velocity, veracity, and value. The big data volume focuses
on the size of data set generated through various applications and sources and are growing at the
rate of megabytes to petabytes. Variety aims at the heterogeneous nature of data that constitute big
data. These include textual data, social media data, traffic information, health-related data, and other
multimodal data. Velocity refers to the speed and dynamic nature of the data collection process and
how to generate these data in real-time. Furthermore, veracity depicts the reliability of data sources
and if the sources of data generation can be trusted. Finally, the value of big data shows the insight
and hidden values that can be discovered from a large amount of dataset [4].
These vectors made it challenging for traditional data warehouse technology to handle huge data
volumes of hundreds of terabytes [5,13]. Furthermore, big data is not quantifiable, not the same for all
companies, and does not depict better data. There is no quantifiable amount of data that determines
whether your data has met some artificial threshold. The size of big data varies from organization to
organization. Bigger data is not necessarily better data, but having data is usually better than having no
data [19–21]. Accordingly, big data analytics provides a host of new tools, including business
analytics, for visualizing and manipulating data insights. This makes it easy to visualize data as
charts, graphs, models, and 3D. Therefore, big data analytics is a collection of tools and techniques
aimed at handling a large volume of unstructured data that is beyond the capability of the traditional
database system. Big data analytics solutions help organizations see changes in their business
and innovate in real time [22]. Different companies have different use cases and obviously
different data. A solution that works for one company may be ineffective or completely wrong for another. While it is
valuable to benchmark others, it is necessary to understand the motivations that drive their technology
choices and the analytics they use to capture the true sensitivity of their businesses. Replication of a
solution is, therefore, warranted where it makes sense, but most importantly, understand your business
drivers for the application of big data.
Recent analyses show that big data giants like Google, Facebook and Twitter have used big data
analytics effectively. Google indexes the entire internet for rapid searches and was said to
process 24 petabytes of data per day in 2009. It offers cloud storage (Google Drive) and a big data solution,
Google BigQuery. Moreover, Google performs machine learning and analytics on massive data
sets (for example, reverse image search and voice recognition). With its rapid growth, it continues to be
the world’s leading search engine. On the other hand, Facebook and Twitter each store information on
over a billion users. Hundreds of millions of shares, likes, tweets, image posts, etc., must be
tracked each day. They use machine learning tools and algorithms to recommend friends and display
trending topics. Estimated revenue for 2014 was $12.5 billion for Facebook and
$1.4 billion for Twitter.
Other businesses that have successfully implemented a big data analytics framework are Wal-Mart
and American Express. Wal-Mart uses big data and machine learning to improve product searches
and recommendations. Following this adoption, its purchase completion rate increased by 10-15 percent.
American Express analyzes its big data to predict customer churn and has identified 24% of Australian
accounts that will close within four months. Macy’s adjusts product pricing in real time for millions of
items [23,24]. Banca Carige implemented the IBM® DB2® Analytics Accelerator on a new IBM Enterprise®
EC12, which enabled rapid query response times and helps over 1000 business users get fast access to
vital insights. The positive results that various business organizations have derived from big data analytics
have led to the development of various tools to aid organizational big data analysis. In this paper, these
tools are discussed in Section 4, with their strengths and weaknesses outlined to aid organizations’
choice of tools for their data analysis.
Analytics involves the use of statistical techniques (measures of central tendency, graphs, and so on),
information system software (data mining, sorting routines), and operations research methodologies
(linear programming) to explore, visualize, discover and communicate patterns or trends in data [25].
For example, weather measurements collected from meteorological agencies can be analyzed and used to
predict weather patterns. Furthermore, the analysis of business data holds the key to the development of
successful new products. The analytics process in a big data world reveals how to tap into the powerful
tool of data analytics to create a strategic advantage and identify new business opportunities. It has
wide applications which include credit risk assessment, marketing, and fraud detection. There are
many types of analytics approaches, and these can be categorized as:
i. Descriptive analytics: This is a simple statistical technique (such as a graph) that describes what is
contained in a data set or database. Descriptive statistics include measures of central
tendency (mean, median, mode), measures of dispersion (standard deviation), charts, graphs,
sorting methods, frequency distributions, probability distributions, and sampling methods. The
result of this process can be used to find possible business-related opportunities. For example,
the smartphone ownership bar chart can be deployed to show the number of users that own
smartphones for an IT firm that wants to determine the market for their mobile payment app
based on phone ownership level.
ii. Predictive analytics is an application of advanced statistical, information software, or operations
research methods to identify predictive variables and build predictive models into a descriptive
analysis. The results predict opportunities of which the firm can take advantage to
improve its products and services. For instance, multiple regression can be used to show
the relationship (or lack of relationship) between ease of use, cost, and security on merchants’
acceptance of mobile money payment. Knowing that relationships exist helps explain why one
set of independent variables influences dependent variables such as business performance.
iii. Diagnostic analytics uses the analysis of past data to ascertain the cause of certain events.
Therefore, diagnostic analytics augments descriptive analytics by asking why certain events
occurred using the patterns in the collected data. The diagnostic analytics process is effectively
utilized in machine health monitoring and prognosis, fault detection and maintenance.
iv. Prescriptive analytics deploys the power of decision science, management science, and
operations research methodologies (applied mathematical techniques) to make the best use of
allocated resources. Resources are allocated to take advantage of the predicted opportunities.
For example, a department store that has a limited advertising budget to target customers can
use linear programming models and decision theory to optimally allocate the budget to various
advertising media. Linear programming (a constrained optimization methodology) has been
used to maximize the profit in the design of supply chains [26].
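The advertising budget example in item iv can be illustrated with a small sketch. It assumes a deliberately simplified linear program: maximize total reach subject to one budget constraint and a spending cap per medium. For this special case, allocating money to media in decreasing order of reach per unit of spend gives the optimal solution, so no solver is needed; a general linear program with more constraints would require a proper LP solver. The media names, reach figures and caps are hypothetical.

```python
def allocate_budget(budget, media):
    """Allocate an advertising budget across media to maximize total reach.

    `media` is a list of (name, reach_per_dollar, max_spend) tuples. With a single
    budget constraint and per-medium caps, the linear program reduces to a
    fractional knapsack problem, for which this greedy rule is optimal.
    """
    allocation = {name: 0 for name, _, _ in media}
    remaining = budget
    # Spend first on the medium that reaches the most customers per dollar.
    for name, reach_per_dollar, max_spend in sorted(media, key=lambda m: m[1], reverse=True):
        if remaining <= 0 or reach_per_dollar <= 0:
            break
        spend = min(max_spend, remaining)
        allocation[name] = spend
        remaining -= spend
    total_reach = sum(allocation[name] * rpd for name, rpd, _ in media)
    return allocation, total_reach


# Hypothetical department store data: (medium, customers reached per dollar, spending cap).
media_options = [
    ("radio", 2.0, 400),
    ("online", 5.0, 300),
    ("print", 1.0, 1000),
]

allocation, total_reach = allocate_budget(1000, media_options)
print(allocation, total_reach)
```

With these illustrative figures, the store spends up to the cap on online and radio advertising and puts the remaining budget into print.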
a class and intervene in time to help them complete the course successfully. The process works by
aggregating basic student data, such as GPAs, SAT scores, students’ addresses, and other demographic
data, and then combining this information with course-specific data, such as how often students submit
assignments and engage with instructors through online forums. The information is analyzed through
predictive modeling and data mining, and the outcome presents an accurate picture of who is likely
to drop a particular class. To assist the students at risk, prescriptive modeling is applied to
give insight on how the instructors may prevent the foreseen occurrences and use an approach that
will engage the class as a whole [27].
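The aggregation and scoring steps in this example can be sketched as follows. This is a minimal illustration under stated assumptions: the records, field names, threshold and coefficients are all hypothetical, and the hand-set logistic score merely stands in for a predictive model that would, in practice, be fitted on historical student data.

```python
import math

# Hypothetical basic student data (only GPA is used here for brevity).
students = {
    "s1": {"gpa": 3.6},
    "s2": {"gpa": 2.1},
}

# Hypothetical course-specific data: share of assignments submitted and forum activity.
course_activity = {
    "s1": {"submission_rate": 0.95, "forum_posts": 12},
    "s2": {"submission_rate": 0.40, "forum_posts": 1},
}

# Made-up coefficients standing in for a model fitted on historical data.
WEIGHTS = {"bias": 4.0, "gpa": -1.0, "submission_rate": -3.0, "forum_posts": -0.1}
RISK_THRESHOLD = 0.5  # hypothetical cut-off for flagging a student


def combine(student_id):
    """Merge basic student data with course-specific data for one student."""
    return {**students[student_id], **course_activity[student_id]}


def dropout_risk(features):
    """Logistic score in (0, 1); higher means more likely to drop the class."""
    z = WEIGHTS["bias"] + sum(WEIGHTS[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))


risks = {sid: dropout_risk(combine(sid)) for sid in students}
at_risk = [sid for sid, risk in risks.items() if risk >= RISK_THRESHOLD]
print(at_risk)
```

The list of flagged students is the input to the prescriptive step, in which instructors decide how to intervene.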
Analytics, business analytics (BA), and business intelligence (BI) are often used interchangeably
in business literature and they convert data into useful information. However, they differ in purpose
and methodologies used for each of the descriptive, predictive, diagnostic and prescriptive analytics.
Analytics can involve any one of the four types of analytics processes. For clarity, definitions of these
terms are presented below:
• Business analytics (BA): According to a recent paper [28], business analytics is beyond plain
analytics. It sequentially applies a combination of descriptive (what is happening), predictive
(why something is happening, what new trends may exist, what will happen next), diagnostic (why
did it happen) and prescriptive analytics (what is the best course for the future) to generate new,
unique and valuable information that creates an improvement in measurable business performance,
as shown in Figure 3. Analyzed data can be sourced from business reports, databases, and business
data stored in the cloud. Business analytics processes include reporting results about business
intelligence and, in addition, seek to explain why the results occur based on the analysis.
• Business intelligence (BI): This focuses on querying and reporting and can include reported
information from a business analytics (BA) approach. Moreover, business intelligence seeks to
answer questions such as what is happening now and where, and also what business actions are
needed based on prior experience.
In the past, business analytics and business intelligence were used for structured DBMS-based
content to report and understand what happened in the past [28]. With the growth of big data, they
can be used alongside big data analytics techniques to provide opportunities for extracting actionable
insight from data by using analytical processes and tools. Their implementation is seen in structured
data analytics, text analytics, web analytics, network analytics, and mobile analytics [29,30]. Moreover,
the volume and velocity of big data present an opportunity to use big data and analytical tools to
predict the future and make new discoveries [31].
Business demand for business analytics and business intelligence has been demonstrated by a
number of studies as shown in recent studies [32,33]. Moreover, successful business intelligence and
analytics applications have also been reported in a broad range of industries, from health care and
airlines to major IT and telecommunication firms [34].
Most successes recorded by organizations that deploy big data analytics have been observed in
developed countries. This is perhaps why huge successes have not been seen for businesses in
developing countries. In 2011, International Data Corporation (IDC) showed that business analytics
was the second-ranked Information Technology (IT) priority for large enterprises that year [35]. An online
survey of 930 businesses across the globe in various industries, conducted by [29],
provides insight into the current state of business analytics in today’s organizations. The research
findings highlighted the fact that most organizations still rely on traditional technology and depend on
spreadsheets for business analytics. There is moderate growth in the use of business analytics within
companies. Nonetheless, it is narrowly used within departments or business units, and not integrated
across the organization. For some organizations, analytics are used as part of the decision process at
varying levels. In addition, organizations are in search of analytics that will primarily help in reducing
costs, improving the bottom line, and managing risks. Meanwhile, concerns about data accuracy, consistency,
and even access are a challenge in the adoption or use of business analytics. Many organizations lack
the skills to implement analytics, and some businesses that have attempted it lack the knowledge to apply the
results. Companies that have built an “analytics culture” are reaping the benefits of their analytics
investments. Therefore, bridging the knowledge gap so that organizations can apply big data and
business analytics is vital for effective decision making and business success.
To help bridge this knowledge gap, this paper also discusses the various teams for a big data analytics
framework in Section 5. These teams include business experts, big data analysts, big data architects,
and Hadoop operators and engineers.
The Hadoop Distributed File System (HDFS) is the storage layer that is responsible for creating a
distributed repository, while Yet Another Resource Negotiator (YARN) is the data refinery layer, a
processing level for scheduling parallel compute jobs. This structure abstracts away the complexities
of distributed computing.
YARN provides resource management and job scheduling in the Hadoop distributed processing
platform. Moreover, Hadoop YARN extends the capability of Hadoop to support varieties of
applications thereby reducing the limitation of Hadoop to only run MapReduce applications.
Consequently, YARN enables Apache Hadoop to provide services such as interactive querying,
data streaming, and real-time analytics applications [4].
With Hadoop, writing a MapReduce job is easy for programmers, as they do not have to
determine the data location or size, or the number of parallel compute jobs. The primary components of
the Hadoop cluster include the master server, the switches, the racks, and the data servers, as shown in
Figure 5. The data servers are commonly called data workers, data nodes, or just nodes. The master
server is responsible for managing and coordinating the entire Hadoop cluster (data nodes). Its duties
include performing health checks and taking corrective action when required, mapping the location of
all the data and directing all data movement, scheduling and rescheduling all compute jobs, and error
handling (including loss of a data node and rescheduling of failed compute jobs). The data server is
responsible for data storage and processing, and further provides resources such as CPU and memory.
In addition, the data server reports health-check status and ongoing job progress during data processing.
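To illustrate why the programmer does not need to manage data location or parallelism, the following minimal sketch mimics the map, shuffle and reduce phases of a word count in plain Python on a single machine. It is not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) and the in-memory input splits are assumptions made purely for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def map_phase(split: str) -> List[Tuple[str, int]]:
    """Emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in split.split()]


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group emitted values by key, as a framework would between map and reduce."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Sum all counts collected for one word."""
    return key, sum(values)


def run_job(splits: List[str]) -> Dict[str, int]:
    """Run map over every split, then shuffle, then reduce each key."""
    mapped = [pair for split in splits for pair in map_phase(split)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Two hypothetical input splits, standing in for blocks held on different data nodes.
splits = ["big data big analytics", "data nodes store data"]
word_counts = run_job(splits)
```

The programmer only supplies the map and reduce logic; in a real cluster the splitting, placement and rescheduling described above would be handled by the platform rather than by `run_job`.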
Another important component of the Hadoop ecosystem is Common. Common is made up
of utilities and tools that perform various operations such as codec compression, error detection,
input/output utilities and authorization of proxy users. Furthermore, Common is responsible for data
and user authentication, service-level authorization and configuration of rack awareness [9].
Generally, Hadoop is configured on rack-based servers. On top of each rack, a network switch
is configured for intra-rack communication. Furthermore, another network switch is configured for
handling communication between the rack switches and the client that runs the Hadoop client-related
software. Hadoop uses HDFS for holding files. HDFS is responsible for breaking large files into smaller
chunks (128 MB by default, configurable), placing them on different slave nodes and replicating them to
provide high availability.
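As a worked example of the chunking and replication just described, the sketch below splits a file into 128 MB blocks and assigns each block to several nodes. The 1000 MB file size, the node names, the replication factor of 3 and the round-robin placement are all assumptions for illustration; the placement rule is deliberately simplified and is not the rack-aware policy that HDFS itself applies.

```python
from typing import Dict, List

BLOCK_SIZE_MB = 128  # block size quoted in the text (configurable)


def split_into_blocks(file_size_mb: int, block_size_mb: int = BLOCK_SIZE_MB) -> List[int]:
    """Return the size in MB of each block a file would be broken into."""
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    sizes = [block_size_mb] * full_blocks
    if remainder:
        sizes.append(remainder)  # the last block holds whatever is left over
    return sizes


def place_replicas(num_blocks: int, nodes: List[str], replication: int = 3) -> Dict[int, List[str]]:
    """Assign each block to `replication` distinct nodes in round-robin order."""
    if replication > len(nodes):
        raise ValueError("replication factor exceeds the number of data nodes")
    return {
        block_id: [nodes[(block_id + r) % len(nodes)] for r in range(replication)]
        for block_id in range(num_blocks)
    }


# Hypothetical 1000 MB file stored on a four-node cluster.
block_sizes = split_into_blocks(1000)
placement = place_replicas(len(block_sizes), ["node1", "node2", "node3", "node4"])
```

With these assumed numbers the file occupies seven full 128 MB blocks plus one 104 MB block, and every block lives on three different nodes, so the loss of a single node leaves two copies of each block available.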
Figure 5. The primary components of a Hadoop cluster.
3.2. Common Big Data Analytics Tools
i. Clustering and segmentation: Divides a large collection of entities into smaller groups that
show some similarities. An example is analyzing a collection of customers to differentiate
smaller segments for targeted marketing.
ii. Classification is a process of organizing data into predefined classes based on attributes
that are either pre-selected by an analyst or identified as a result of a clustering model. An
example is using the segmentation model to determine into which segment a new customer
would be categorized.
iii. Regression is used to discover relationships among a dependent variable and one or more
independent variables and helps determine how the dependent variable’s values change in
relation to the independent variable values. An example is using mobile money subscription
data, usage level, transaction type, transaction amount and geographic location to predict the
future penetration of mobile money payment.
iv. Association and itemset mining looks for statistically relevant relationships among variables
in a large data set. For example, this could help direct digital banking representatives to offer
specific incentives to mobile money app users based on the usage level, transaction amount
and transaction volume.
v. Similarity and correlation are used to inform undirected clustering algorithms.
Similarity-scoring algorithms can be used to determine the similarity of entities placed in a
candidate cluster.
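The association and itemset mining item above can be made concrete with the two standard measures used to rank rules, support and confidence. The sketch below is a minimal illustration in plain Python; the attribute labels and the four user records are hypothetical data invented for this example, not results from the reviewed studies.

```python
from typing import FrozenSet, List

# Hypothetical attribute sets observed per mobile money user (illustrative data only).
transactions: List[FrozenSet[str]] = [
    frozenset({"high_usage", "bill_payment", "airtime"}),
    frozenset({"high_usage", "bill_payment"}),
    frozenset({"low_usage", "airtime"}),
    frozenset({"high_usage", "airtime"}),
]


def support(itemset: FrozenSet[str], data: List[FrozenSet[str]]) -> float:
    """Fraction of records that contain every item in `itemset`."""
    return sum(itemset <= record for record in data) / len(data)


def confidence(antecedent: FrozenSet[str], consequent: FrozenSet[str],
               data: List[FrozenSet[str]]) -> float:
    """Estimate P(consequent | antecedent) as support(A and C) / support(A)."""
    return support(antecedent | consequent, data) / support(antecedent, data)


# Rule under test: users with high usage also make bill payments.
rule_support = support(frozenset({"high_usage", "bill_payment"}), transactions)
rule_confidence = confidence(frozenset({"high_usage"}), frozenset({"bill_payment"}), transactions)
```

In this toy data the rule holds for half of all users and for two out of three high-usage users; a representative deciding which incentive to offer would typically act only on rules whose support and confidence exceed chosen thresholds.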
However, the huge volume of big data has rendered the traditional data analysis approach
ineffective for processing huge amounts of generated data in the current cyber-physical and mobile
connected world. Therefore, various big data tools have been proposed and implemented recently
for efficient data generation, transmission, processing, storage and analysis of big data [10]. Big data
analytics tools and approaches are shown in Figure 6.
[Figure 6 groups the tools into four categories: big data storage, big data management, big data
processing, and statistical analysis and programming. The tools shown are NoSQL, MapReduce, MLlib,
YARN, HBase, HDFS, Python, Pig, Hive, Apache Tez, R, Cassandra, Mahout, Scala, Sqoop, MATLAB,
Oozie, Drill, Zookeeper, Flume, Kafka, Storm, Flink, Chukwa, and Avro.]
Figure 6. Overview of big data and business analytics in Hadoop.
These tools are continuously updated and many new tools are being introduced on a regular
basis. There is always some meaning attached to the names given to the software projects, but there
are no criteria defined for the namespace. Therefore, there is no connection between their names. For
example, Flume is named after the water race used in some sawmills to bring logs to the mill, and Pig
was named after the fact that pigs eat anything. A few of the most common tools used in a Hadoop
cluster are depicted in Figure 7. Other popular tools can be found from distro companies such as
Cloudera, Hortonworks, IBM, or MapR.
Figure 7. Common tools used in a Hadoop cluster.
vi. Apache Spark [41]: Spark is an open source computing framework that can process data on disk
and in memory. Spark is built to run on top of HDFS and is able to use YARN. It is designed
to combine SQL, streaming, and complex analytics. It has high-level libraries that enable
programmers to rapidly write jobs for streaming, machine learning, graph processing, and the
R statistical programming language. The fast processing power of Apache Spark has made it
more popular than other existing solutions such as Apache Mahout and MapReduce. In machine
learning, Spark is reported to run compute jobs ten times faster than Apache Mahout. In large-scale
statistical analysis, it has been benchmarked to run a hundred times faster in memory than the
same job running in MapReduce. Spark is robust and versatile. It has successfully combined a
number of different functions into a single software solution. Spark applications can be written in
Java, Scala, and Python, which makes it easy for programmers to write in the language they know
best. It can read any existing Hadoop data file. It also reads from HBase, Cassandra, and many other
data sources. Spark is scalable to 2000 nodes and it will continue to expand its ability to scale
compute jobs.
vii. Oozie [42]: Oozie is a workflow and coordination tool used in a Hadoop cluster. It runs across
a supercomputing platform. It allows jobs to run in parallel while waiting for input from
other jobs. One of the interesting advantages of Oozie is that it comes with a sophisticated
scheduling tool. This allows for coordination of jobs waiting for other dependencies within the
supercomputing platform.
viii. HBase [43]: This is a popular NoSQL columnar database deployed on top of Hadoop. HBase is
an Apache project based on Google’s Bigtable model of data storage. It has no fixed schema and
provides a column-oriented view of data.
ix. Mahout [44]: Mahout is a scalable, simple and extensible machine learning library supported
by Java, Scala, and Python for building distributed learning algorithms in Hadoop. The
current version of Mahout, called Samsara, focuses on a math environment for tasks such
as linear algebra, statistical operations and data structures using an R-like syntax. Some of
the common distributed machine-learning algorithms in the Mahout library are singular
value decomposition, principal component analysis, collaborative filtering, clustering, and
classification. Mahout-Samsara allows the user to build a distributed machine learning algorithm
instead of depending on the pre-made algorithms. Mahout provides comprehensive
algorithm suites for MapReduce and Apache Spark.
x. MLlib [45]: MLlib is an open source machine learning library native to Apache Spark. It
has a Spark API that allows the user to develop distributed machine learning algorithms in
Java, Scala, Python, and R. The main features of MLlib include easy deployment and faster
execution than Mahout on the MapReduce framework, owing to in-memory computation
and Spark's resilient distributed datasets. Moreover, MLlib contains a number
of machine learning algorithms for large-scale learning. These include classification, clustering,
topic modeling, model evaluation, distributed linear algebra, and feature transformation.
xi. Apache Tez [46]: Apache Tez is an open source platform built on top of YARN for the analysis of
directed acyclic graph (DAG) tasks. It provides a simplified API in Java and Python for iterative
shell tasks. Moreover, the Apache Tez platform has higher performance than MapReduce and
allows Hive and Pig to run complex DAG tasks.
xii. Flink [47]: Flink is another distributed platform for stream and batch processing. It provides
machine learning, Table, and DataSet APIs for creating applications in Java and Scala. It combines
the flexibility, scalability, speed, and reliability of distributed MapReduce to efficiently analyze big
data, and can be deployed on a single-node cluster or in a cloud enterprise system.
xiii. Storm [48]: Storm provides a platform for stream and real-time processing. The basic components
of Storm are the Spout, a source of streams such as the Twitter streaming API, and the Bolt, which
holds the computational logic and data processing. It provides online machine learning and
real-time data analytics, and is deployed by many organizations such as Twitter, Yahoo, Spotify
Japan, and Yelp for processing large amounts of real-time data within seconds. Storm runs
heterogeneous topologies for different tasks and can be integrated with HBase, HDFS, and Kafka
for large-scale data processing and storage. Storm, being an open source Apache project, provides
a distributed real-time computation system using programming APIs such as Java and Scala. In
addition, the platform is built on top of Hadoop for data integration, end-to-end authentication
and data transfer between Hadoop and relational databases.
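The dependency handling described for Oozie and the DAG tasks described for Apache Tez both rest on ordering jobs so that each runs only after the jobs it depends on. The sketch below illustrates that idea with `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later). It does not use the Oozie or Tez APIs, and the job names are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each hypothetical job maps to the set of jobs it must wait for.
workflow = {
    "ingest_logs": set(),
    "ingest_sales": set(),
    "clean": {"ingest_logs", "ingest_sales"},
    "train_model": {"clean"},
    "report": {"clean"},
}

sorter = TopologicalSorter(workflow)
sorter.prepare()  # validates the graph and raises CycleError if it is not acyclic

# Jobs in the same wave have no unmet dependencies, so a scheduler could run them in parallel.
waves = []
while sorter.is_active():
    ready = sorted(sorter.get_ready())
    waves.append(ready)
    sorter.done(*ready)  # marking jobs as finished unlocks their dependants
```

Here the two ingestion jobs form the first wave, the cleaning job the second, and model training and reporting the third. A production scheduler would additionally handle retries, timing and resource limits, which this sketch leaves out.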
Apart from the above-listed tools, other tools for big data storage, processing, and management
include Apache Cassandra [49], NoSQL, Zookeeper [50], Avro [51], Chukwa [52], and the Python [53],
R [54] and Scala [55] programming languages. The key functions, features, strengths and weaknesses
of these tools are summarized in Table 2 below.
• Infrastructure layer. This is the first layer and includes all the hardware, network, and software
used to support and operate a Hadoop cluster. Software ranges from operating system versions
to all the commonly used tools for monitoring and reporting on the Hadoop cluster.
• Data Repository layer. This is the second layer and deals with the movement of data within a
distributed computing environment. The primary repository here is the Hadoop distributed file
system. Data transport tools such as Sqoop and Flume are also included. All the NoSQL databases,
such as Accumulo and HBase, are forms of data repository.
• The Data Refinery layer. This is the third layer and provides a platform for manipulation and
processing of data using the parallel processing framework. The primary technologies in the data
refinery layer are YARN and MapReduce.
• The Data Factory layer. Software under this class is called data workers. These tools are designed
to interface with Hadoop and enable easier access to the full power of Hadoop. Many of them
allow users to create compute jobs in an easily understood language, such as SQL,
and then translate these inputs into MapReduce jobs. The tools here include Hive, Pig, Spark,
and Oozie.
• The Data Fusion layer. This layer is the application layer and the true business end of the big
data solution. Data analytics and data visualization tools fall under this layer and are used
to create applications, implement algorithms, and visualize data. The tools include machine
learning technology, such as Mahout, and data visualization tools, such as Datameer, Pentaho
and Tableau.
• The Business Value layer. This is the final layer, which utilizes a wide range of tools to express
the requirements, the service levels, the requests for changes, and the costs and expenditures for
the business.
Table 2. Key features of big data analytics tools, strengths, and weaknesses.
i. Recognize elements of Gartner’s vector model by identifying the characteristics of big data.
ii. Consider solutions from a number of major vendors like Cloudera, Hortonworks, IBM, and
MapR and choose the solution that will support the environment to meet business objectives.
The culture of big data giants like Amazon, Google, and Facebook should be considered as well.
iii. Identify the risks of open source software. Evaluate competing solutions based on any or all
of the criteria for their development, deployment and response times. Knowing which NoSQL
database works best with which data type is therefore essential.
iv. Recognize Hadoop cluster elements and their functions.
v. Create a secure analytics platform to deliver data-driven insights to business users across
the group.
vi. Develop a set of core requirements for the analytics platform. For example, the core requirements
could focus on function, cost and time.
vii. Run a proof of concept to reduce risk in the implementation process. This exercise confirms
whether the performance and scalability of the chosen solution will meet the targets set
at the beginning of the big data project. It is done jointly by the solution’s experts
and the organization’s IT team.
Moreover, big data and business analytics teams play a vital role in the success of a big data
analytics project. The size and complexity of big data technology require highly motivated team
members, who are smart and determined. In addition, a successful implementation of the big data
solution requires a good team with the right mission statement. As a result of the complexity
of the supercomputing platform, all levels of the team must have an in-depth mastery of the big data
ecosystem [57]. Organizations must invest the time and money in developing their own expertise in
the big data analytics team and computing infrastructure. The big data team members and their roles
are briefly described below:
(a) The business team: It comprises:
i. Executives/Stakeholders: Executives are leaders in their business and industry. They generate
business strategy and goals, find opportunity in crises, and lead through incidents. They also
have the flexibility to pilot and have a strong overview of the big data ecosystem.
ii. Product Managers/Data Stewards: They provide leadership to achieve business goals and
understand data, its value and limitations. Furthermore, they identify and define risk, are open
to new opportunities, and maintain a working knowledge of the big data ecosystem.
(b) The Big Data analysts team
i. Data Scientist: The data scientist should be an academic scientist, a subject-matter expert in
their area of business, and possess advanced skills in mathematics and statistical modeling.
Moreover, a data scientist should be focused on research, analytic approaches and should be
skilled in statistical programming languages.
ii. Business Analysts/Data Analysts: They should be aligned with business goals and directions.
They produce a detailed analysis for business, report on data quality, and are skilled in a wide
range of data modeling and data analytics tools. They have a working knowledge of the big
data ecosystem.
(c) The Big Data architects team
i. Global architect/platform engineers: They are subject matter experts in supercomputing platforms
and are skilled in data architecting. They are specialists in applicable use cases, outstanding in
root cause analysis and are exceptional in performance tuning. A global architect has a broader
knowledge of the big data ecosystem, while a platform engineer has a deeper understanding of
the software running the supercomputing platform. Both need a good understanding of the
data being ingested and digested by the distributed computing environment.
ii. Data architect/data wranglers: They possess industry knowledge, strong skills in mathematics
and statistics, and are specialists in applicable use cases. They are also subject matter experts in
data analytics, data visualization, NoSQL, and ETL.
(d) The Big Data Hadoop operators team
The real frontline troops in managing and operating a Hadoop cluster are:
i. Hadoop engineers: They are subject-matter experts in supercomputing platforms and experts
in Java and Python. They can write and deploy Hadoop jobs, are knowledgeable about Hadoop
cluster performance and implementation, and are proficient in debugging and troubleshooting.
ii. Hadoop operators: They are subject-matter experts in the Hadoop cluster, Linux systems, and
networking. They are also skilled in Kerberos, experts in troubleshooting, proficient in
performance tuning, and knowledgeable in DC hardware.
In addition, the big data Hadoop operator team must have in-depth knowledge and experience
working with the supercomputing platform.
In addition to the aforementioned considerations, the organization should also consider switching to
a global architecture, in which a supercomputing platform is operated as a single entity and components
are tightly coupled together, rather than the usual enterprise architecture, in which each layer of the
stack is a separate component with clearly defined boundaries. Furthermore, a single team should have
the responsibility for both the development and the operation of the supercomputing platform. The
management, the platform
engineers, the software developers, and the operators should work in a single team, frequently in a
single location. The proximity of the team helps the team build working knowledge. This working
attitude is the secret of the big data giants such as Yahoo, Google, and Facebook. Moreover, mastering
the big data ecosystem by the team is another key criterion. Big data is built on the principles of
supercomputing. The complexity of this platform mandates sophisticated knowledge of the big data
ecosystem at all levels of the team. The operators, the engineers, the architects, and the business
managers must be well versed in the big data ecosystem. Everyone involved in the project must be a
generalist with a solid understanding of how a supercomputing platform works, and this includes the
stakeholders and executive management. Organizations must invest the time and money in developing
their own expertise in their supercomputing platform [1,12].
• Security and privacy: The digital world has experienced threats from criminal attacks. Because
big data encourages the collection and analysis of everything, people’s privacy and civil liberties
are at high risk. Big data technology is now being used by organized crime to run cyber-scams.
The criminals use the platform to identify victims, normally the elderly, and their relationships to
relatives who are traveling to foreign countries. They then make a call and impersonate foreign
officials, asking for immediate payment to post bail or to pay for urgent medical care. The
collection of big data gives them enough data to make the scam work and to intrude upon one’s
privacy. The collection of vast amounts of data can also be used to attack the economy,
infrastructure, and personnel of the opposition. Today, there are real threats of using cyber
blackmail to bend an enemy to one’s will. For example, in Nigeria, political parties use this
platform to attack one another and win members to their group. In the worst cases, world leaders
have been insulted via this means. This seems to create a society void of respect and dignity.
Privacy and civil liberty issues around big data are extremely controversial, particularly the
question of whether big data is the end of privacy. There is no doubt that in today’s world people
leave an ever more detailed and complete digital footprint, and a number of companies make
revenue by tracking every click and every second spent on the Internet. The number of companies,
government agencies, and research organizations that track and use the telephony data from
mobile phones is growing rapidly. They track every movement of a switched-on mobile phone
and store all of this in a big data solution. Perhaps creating a new version of the community
through broad public education and discussion to determine the right standards, policies,
regulations, and laws might resolve the perception of big data as an end to privacy.
• Regulatory and compliance: Deeper knowledge of big data technology has led to an increase
in regulatory requirements. Europe is taking the lead in setting rules around the capture and use
of various sources of data such as e-mails, instant messages, web forms, mobile records, and
mobile data. The tools and the practices for ensuring compliance with these new regulations
are immature, or they do not exist. Managing compliance will require continuous attention to
detail and to new tool offerings. This concern calls for enhancements in the Hadoop cluster,
where encrypting data remains a problem. Currently, Kerberos is one
of the most common security technologies deployed with a Hadoop cluster to ensure security.
Kerberos is an open source project that originated at MIT. It is fundamentally a network protocol,
designed as a client-server model, that uses strong cryptography to ensure mutual
authentication for both the user and the server.
• Healthcare: Improved health is important for economic growth and for good physical and mental
well-being. The healthcare industry generates a huge amount of data that can be used to enhance
decision making by doctors and other health practitioners. In addition, the use of big data in
healthcare can help to develop real-time analysis of disease, thereby improving the quality of life
of the public. There is a large body of research in this regard, ranging from fault-tolerant systems
that support data generation, integration and analysis to continuous monitoring for early
detection of environmental conditions that may trigger an asthma attack [59,60]. Moreover, public
health care data require big data analytics techniques because of their large scale, in order to track,
monitor, store and analyze individual moving objects and their level of exposure to harmful
environmental factors, and so ascertain the relationship between the data and environmental risk.
Furthermore, big data analytics has played a vital role in predicting the outbreak of diseases such
as the Ebola virus, using call detail records and sensor data to provide a feedback mechanism that
improves the quality of the healthcare delivery system [61].
• Network Optimization: Big data and business analytics approaches can be used to design a
mobile network that provides efficient services. The areas of interest are content-centric analysis,
traffic analysis, and network signaling to ensure effective service delivery and quality of service.
Network operators can incorporate a framework to collect, store and analyze user or
core network data for efficient signaling, prediction of traffic variation and network overload,
intelligent network optimization, automatic self-configuration of the network and intelligent
transportation development [62,63].
• Travel Estimation: The high volume of data generated by mobile users during calls, often referred
to as call data records (CDRs), has enabled researchers to aggregate, store, process and analyze
travel estimates, particularly for route recommendation, location tracking, trip generation,
commuter origin and destination information and transportation management planning in
developing economies [64–66]. Mobile big data can also aid route recommendation in a complex
environment by deploying a smart multimodal platform that utilizes personal information and
global constraints. The algorithms monitor the state of the cities in real time and identify congested
routes in order to make alternative recommendations. This mechanism is not new, as it has been
applied in drone routing, infectious disease and hotspot identification, and emergency situations
[64]. To ensure security, the datasets are usually anonymized using computer-generated unique
identifiers to replace the phone numbers of subscribers. Research in mobile big data for travel
estimation has proven to be important for improving transportation planning.
• User behavior modeling: User behavior modeling helps to understand navigation patterns in order
to develop user-centric applications. These applications are important in anomalies, fraud and
spam detection in social media and enable social behavior changes for target marketing [67].
• Human mobility modeling: Human beings maintain a regular pattern of movement over a period of
time. Consequently, such repeating patterns enable efficient prediction of global movement, which
can be applied in disease containment, transportation planning, emergency situations and the
prevention of disease outbreaks by leveraging social network platforms, GPS data, call data
records and geo-tagged data through big data analytics methods [68–70].
• Service recommendation: Big data and business analytics approaches have played a vital role in
services recommendation, target advertisement using user location information, product review,
time and product buying behavior. For instance, a recent study by Salehan and Kim [70] deployed
Hadoop and MapReduce to analyze customer review to understand the strengths and weaknesses
of the product. This approach helps to determine the predictors of review readership and how to
improve sales.
• Energy consumption analysis: Identification of the amount of energy in the household is a sure way
to promote green energy efficiency and conservation. The analysis using big data techniques
provides the usage patterns to promote green energy by fitting the electricity supplies with
sensors, communication network and analytics engine to digitalize, store and analyze the
consumption rate [4]. Moreover, this will help to improve energy sales and return on investment
for energy companies.
• Crowdsourcing and Sensing: Crowdsourcing implemented through opportunistic sensing is an
essential source of data for data-driven decision making in a business environment. Many
companies employ these techniques to enlist people to perform a specific task for solving complex
problems by leveraging smartphone with embedded sensors. Smartphones can be used to source
a huge amount of opinion data from the public and then analyze decision-making in an urban
emergency, location-based search and similarity services using mobile phone data [71].
• Educational development: The educational sector provides rich sources of data for big data
analytics processes. These data help to predict learner performance and achievement. Moreover,
big data analytics in education plays an important role in course content management,
personalized recommendation modules, and the development of smart education by leveraging
areas such as natural language processing and text summarization. In addition, data generated
through massive online courses (MOOCs) help to identify difficult areas of the subjects and
provide support to students in order to enhance teaching and learning [72–74].
• Financial Industries: The adoption of social media and internet-based approaches in the financial
industries has resulted in the generation of a high volume of data. Therefore, analyzing these
data for effective decision-making requires big data techniques. Moreover, analysis of financial
statements and data would support the detection and management of money laundering, financial
statement fraud, financial spamming, impersonation, identity theft, and other financial fraud
related incidents [57].
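Two steps mentioned under travel estimation above, replacing subscriber phone numbers with generated identifiers and deriving origin and destination information from call data records, can be sketched as follows. The record layout (phone number, cell tower, hour), the sample values and the function names are all hypothetical, and the sequential identifiers are a simplification: a real deployment would probably need stronger anonymization guarantees than this.

```python
from collections import Counter
from itertools import count
from typing import Dict, List, Tuple

# Hypothetical call data records: (phone number, cell tower, hour of day).
cdrs: List[Tuple[str, str, int]] = [
    ("08030000001", "tower_A", 7),
    ("08030000002", "tower_B", 8),
    ("08030000001", "tower_C", 9),
    ("08030000002", "tower_C", 18),
    ("08030000001", "tower_A", 19),
]


def anonymize(records: List[Tuple[str, str, int]]) -> List[Tuple[str, str, int]]:
    """Replace each phone number with a generated identifier, consistently per subscriber."""
    ids: Dict[str, str] = {}
    counter = count(1)
    result = []
    for phone, tower, hour in records:
        if phone not in ids:
            ids[phone] = f"user_{next(counter):04d}"
        result.append((ids[phone], tower, hour))
    return result


def origin_destination(records: List[Tuple[str, str, int]]) -> Counter:
    """Count trips between consecutive distinct towers seen for the same user."""
    trips: Counter = Counter()
    last_tower: Dict[str, str] = {}
    for user, tower, _hour in sorted(records, key=lambda r: (r[0], r[2])):
        previous = last_tower.get(user)
        if previous is not None and previous != tower:
            trips[(previous, tower)] += 1
        last_tower[user] = tower
    return trips


anonymized = anonymize(cdrs)
od_matrix = origin_destination(anonymized)
```

Because the same subscriber always maps to the same identifier, trips can still be reconstructed after the phone numbers are removed, which is what makes the anonymized data usable for transportation planning.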
These applications, alongside their key data sources and features, are summarized in Table 3 below.
Table 3. Applications and key data sources for big data and business analytics.
• Data privacy and security: One of the major challenges in developing effective big data and business
analytics in our opinion is how to develop a security mechanism that ensures user security. With
such an approach, business owners will be confident in sharing their user data to develop the next
generation of big data analytics protocol that takes into cognizance the security challenges. This
may involve providing a dynamic security mechanism that takes care of the changing nature of
big data especially mobile big data or big data algorithms for data privacy during data extraction,
filtering techniques that reduce scarce bandwidth consumption in the mobile network through
computational offloading. In addition, areas such as ways to generate the right metadata to be
analyzed using scalable data mining [75] also require further research.
• Effective techniques for heterogeneous data analysis: Developing techniques and framework for the
analysis of heterogeneous big data for various economic enhancement and applications such
Author Contributions: I.A.A. and H.F.N. conducted the research, drafted and proof-read the article. All authors
were involved in the review of the manuscript.
Funding: The authors received no external funding for this research.
Acknowledgments: We are grateful to the anonymous reviewers for their insightful comments and suggestions.
Conflicts of Interest: The authors declare no conflict of interest.
Abbreviations
Abbreviations Full meaning
API Application Programming Interface
BA Business Analytics
BI Business Intelligence
BVN Bank verification Numbers
CDR Call data record
DAG Directed acyclic graph
DBMS Database management system
ECG Electrocardiography
EMG Electromyography
GPA Grade point average
GPS Global positioning system
HDFS Hadoop distributed file systems
IDC International data corporation
MLlib Machine learning library
MOOCs Massive open online courses
OLAP Online analytical processing
RDBMS Relational database management systems
SIM Subscriber identification module
SQL Structured query language
YARN Yet another resource negotiator
References
1. Davenport, T.H. Big Data at Work: Dispelling the Myths, Uncovering the Opportunities; Harvard Business School
Publishing: Boston, MA, USA, 2014.
2. Davenport, T.H.; Harris, J.G. Competing on Analytics: The New Science of Winning; Harvard Business School
Publishing: Boston, MA, USA, 2014.
3. Davenport, T.H.; Barth, P.; Bean, R. How Big Data is Different. MIT Sloan Manag. Rev. 2012, 54, 21–24.
4. Hashem, I.A.T.; Yaqoob, I.; Anuar, N.B.; Mokhtar, S.; Gani, A.; Khan, S.U. The rise of “big data” on cloud
computing: Review and open research issues. Inf. Syst. 2015, 47, 98–115. [CrossRef]
5. Grover, V.; Chiang, R.H.L.; Liang, T.; Zhang, D. Creating Strategic Business Value from Big Analytics: A
Research Framework. J. Manag. Inf. Syst. 2018, 35, 388–423. [CrossRef]
6. Chahal, H.; Jyoti, J.; Wirtz, J. Business Analytics: Concepts and Applications. In Understanding the Role of
Business Analytics; Springer: London, UK, 2019; pp. 1–8.
7. Singh, D.; Reddy, C.K. A survey of Platforms for Big Data Analytics. J. Big Data 2015, 2, 8. [CrossRef]
8. Tsai, C.-W.; Lai, C.-F.; Chao, H.-C.; Vasilakos, A.V. Big Data Analytics: A survey. J. Big Data 2015, 2, 21.
[CrossRef]
9. Landset, S.; Khoshgoftaar, T.M.; Richter, A.N.; Hasanin, T. A survey of Open Source tools for machine
learning with big data in the Hadoop ecosystem. J. Big Data 2015, 2, 24. [CrossRef]
10. Mohebi, A.; Aghabozorgi, S.; Wah, T.Y.; Herawan, T.; Yahyapour, R. Iterative big data clustering algorithms:
A review. Softw. Pract. Exp. 2016, 46, 107–129. [CrossRef]
11. Mohamed, A.; Najafabadi, M.K.; Wah, Y.B.; Zaman, E.A.K.; Maskat, R. The state of the art and taxonomy of
big data analytics: View from the new big data framework. Artif. Intell. Rev. 2019, 1–49. [CrossRef]
12. Brynjolfsson, E.; Hitt, L.M.; Kim, H.H. Strength in Numbers: How Does Data-Driven Decision Making Affect
Firm Performance? 2011. Available online: http://ssrn.com/abstract=1819486 (accessed on 2 January 2019).
13. Manyika, J.; Chui, M.; Brown, B.; Bughin, J.; Dobbs, R.; Roxburgh, C.; Byers, A.H. Big Data: The Next
Frontier for Innovation, Competition, and Productivity. 2011. Available online: http://www.mckinsey.com/
insights/mgi/research/technology_and_innovationbig_data_th_next_frontier_for_innovation (accessed on
6 October 2018).
14. SAS, Big data meets Big Data Analytics. Available online: www.sas.com/content/dam/SAS/en.../big-data-
meets-big-data-analytics-105777.pdf (accessed on 10 February 2019).
15. McAfee, A.; Brynjolfsson, E. Big data: The management revolution. Harv. Bus. Rev. 2012, 90, 60–69.
[PubMed]
16. International Data Corporation (IDC). The Digital Universe of Opportunities: Rich Data and the Increasing
Value of the Internet of Things, 2014. Available online: http://www.emc.com/leadership/digital-universe/
2014iview/executive-summary.htm (accessed on 4 May 2018).
17. Dailey, W. The Big Data Technology Wave. Available online: https://www.skillsoft.com/courses/5372828-the-
big-data-technology-wave/ (accessed on 18 March 2019).
18. Sicular, S. Gartner’s Big Data Definition Consists of Three Parts, Not to Be Confused with Three “V”s. Available
online: http://www.forbes.com/sites/gartnergroup/2013/03/27/gartners-big-data-definition-consists-of-three-
parts-not-to-be-confused-with-three-vs/#95a45853bf622013 (accessed on 4 May 2018).
19. Davenport, T.H.; Dyché, J. Big Data in Big Companies. Available online: https://www.sas.com/resources/
asset/Big-Data-in-Big-Companies.pdf (accessed on 10 October 2018).
20. Jones, M.; Silberzahn, P. Three Reasons Why Big Data Doesn’t Make You Smarter—Lessons from the World
of Intelligence. Available online: http://www.forbes.com/sites/silberzahnjones/2013/07/02/three-reasons-
why-big-data-doesnt-make-you-smarter-lessons-from-the-world-of-intelligence/#2cbc03266562 (accessed
on 22 May 2018).
21. Noyes, K. Why Big Data Isn’t Always the Answer. Available online: http://www.computerworld.com/
article/2973436/big-data/why-big-data-isn’t-always-the-answer.html2015-08 (accessed on
22 May 2018).
22. Davenport, T. Three Big Benefits of Big Data Analytics. Available online: https://www.sas.com/en_ca/news/
sascom/2014q3/Big-data-davenport.html (accessed on 25 June 2018).
23. Frizzo-Barker, J.; Chow-White, P.A.; Mozafari, M.; Ha, D. An empirical study of the rise of big data in
business scholarship. Int. J. Inf. Manag. 2016, 36, 403–413. [CrossRef]
24. Marr, B. Big Data Facts: How Many Companies Are Really Making Money from Their Data? 2016. Available
online: http://www.forbes.com/sites/bernardmarr/2016/01/13/big-data-60-of-companies-are-making-money-
from-it-are-ou/#3bbdb7143877 (accessed on 25 May 2018).
25. Schniederjans, M.J.; Schniederjans, D.G.; Starkey, C.M. Business Analytics Principles, Concepts, and Applications;
Pearson Education, Inc.: London, UK, 2014.
26. Paksoy, T.; Özceylan, E.; Weber, G.W. Profit-Oriented Supply Chain Network Optimization. Central Eur. J.
Oper. Res. 2012, 21, 455–478. [CrossRef]
27. Burns, E. Education Analytics Project Helps Marist, Students Make the Grade. Available
online: http://searchbusinessanalyticss.techtarget.com/feature/Education-analyticss-project-helps-Marist-
students-make-the-gradeon (accessed on 3 July 2018).
28. Stubbs, E. The Value of Business Analytics; John Wiley & Sons: Hoboken, NJ, USA, 2011.
29. Bloomberg Businessweek Research Services, The Current State of Business Analytics: Where Do We Go from
Here? Available online: https://www.sas.comsources/asset/busanalyticssstudy_wp_08232011.pdf (accessed
on 19 February 2019).
30. Lim, E.P.; Chen, H.; Chen, G. Business intelligence and analytics: Research directions. ACM Trans. Manag.
Inf. Syst. 2013, 3, 17.1–17.10. [CrossRef]
31. Provost, F.; Fawcett, T. Data Science for Business; O’Reilly Media: Sebastopol, CA, USA, 2013.
32. Lavalle, S.; Lesser, E.; Shockley, R.; Hopkins, M.S.; Kruschwitz, N. Analytics: The new path to value: How
the smartest organizations are embedding analytics to transform insights into action. MIT Sloan Manag. Rev.
2010, 12, 1–28.
33. Lavalle, S.; Lesser, E.; Shockley, R.; Hopkins, M.S.; Kruschwitz, N. Big data, analytics and the path from
insights to value. MIT Sloan Manag. Rev. 2011, 52, 1–31.
34. Watson, H.J. Tutorial: Business intelligence—Past, present, and future. Commun. Assoc. Inf. Syst. 2009, 25,
487–510. [CrossRef]
35. IDC. Big Data Big Opportunities. Available online: http://www.emc.com/microsites/cio/articles/big-data-big-
opportunities/LCIA-BigDataOpportunities-Value.pdf (accessed on 25 July 2018).
36. White, T. Hadoop: The Definitive Guide, 3rd ed.; O’Reilly Media, Inc.: Sebastopol, CA, USA, 2012.
37. Apache Hive. Available online: http://hive.apache.org/ (accessed on 10 January 2019).
38. Apache Pig. Available online: http://pig.apache.org/ (accessed on 10 January 2019).
39. Apache Flume. Available online: https://flume.apache.org/ (accessed on 10 January 2019).
40. Apache Sqoop. Available online: http://sqoop.apache.org/ (accessed on 10 January 2019).
41. Spark. Available online: https://spark.apache.org/ (accessed on 10 January 2019).
42. Apache Oozie Workflow Scheduler for Hadoop. Available online: http://oozie.apache.org/ (accessed on
5 February 2019).
43. Apache HBase. Available online: http://hbase.apache.org/ (accessed on 5 February 2019).
44. Mahout. Available online: http://mahout.apache.org/ (accessed on 5 February 2019).
45. MLLib. Available online: https://spark.apache.org/mllib/ (accessed on 5 February 2019).
46. Apache Tez. Available online: http://tez.apache.org/ (accessed on 5 February 2019).
47. Apache Flink. Available online: https://flink.apache.org/ (accessed on 5 February 2019).
48. Apache Storm. Available online: https://storm.apache.org/ (accessed on 5 February 2019).
49. Apache Cassandra. Available online: http://cassandra.apache.org/ (accessed on 12 December 2019).
50. Apache Zookeeper. Available online: https://zookeeper.apache.org/ (accessed on 7 November 2019).
51. Apache Avro. Available online: https://avro.apache.org/ (accessed on 19 February 2019).
52. Apache Chukwa. Available online: https://chukwa.apache.org/ (accessed on 19 February 2019).
53. Python Programming. Available online: https://www.python.org/ (accessed on 8 March 2019).
54. The R Project for Statistical Computing. Available online: http://www.r-project.org/ (accessed on
19 February 2019).
55. Scala programming. Available online: https://scala-lang.org/ (accessed on 6 March 2019).
56. Wang, K.; Shao, Y.; Shu, L.; Zhu, C.S.; Zhang, Y. Mobile Big Data Fault-Tolerant Processing for eHealth
Networks. IEEE Netw. 2016, 30, 36–42. [CrossRef]
57. Bhadani, A.; Jothimani, D. Big data: Challenges, Opportunities, and Realities. In Effective Big Data Management
and Opportunities for Implementation; Singh, K.M., Kumar, D.G., Eds.; Information Science Reference: Hershey,
PA, USA, 2016.
58. He, Y.; Yu, F.R.; Zhao, N.; Yin, H.; Yao, H.; Qiu, R.C. Big data analytics in mobile cellular networks. IEEE
Access 2016, 4, 1985–1996. [CrossRef]
59. Zhang, G.L.; Sun, J.; Chitkushev, L.; Brusic, V. Big Data Analytics in Immunology: A Knowledge-Based
Approach. Biomed. Res. Int. 2014, 2014, 437987. [CrossRef]
60. Wyber, R.; Vaillancourt, S.; Perry, W.; Mannava, P.; Folaranmi, T.; Celli, L.A. Big data in global health:
Improving health in low and middle-income countries. Bull. World Health Organ. 2015, 93, 2013–2018.
[CrossRef]
61. Khatib, E.J.; Barco, R.; Muñoz, P.; De La Bandera, I.; Serrano, I. Self-Healing in Mobile Networks with Big
Data. IEEE Commun. Mag. 2016, 54, 114–120. [CrossRef]
62. Dobre, C.; Xhafa, F. Intelligent service for Big Data Science. Futur. Gener. Comput. Syst. 2014, 37, 267–281.
[CrossRef]
63. De Domenico, M.; Lima, A.; Gonzalez, M.C.; Arenas, A. Personalized routing for multitudes in smart cities.
EPJ Data Sci. 2015, 4, 1–11. [CrossRef]
64. Dong, H.H.; Wu, M.C.; Ding, X.Q.; Chu, L.Y.; Jia, L.M.; Qin, Y.; Zhou, X.S. Traffic Zone division based on big
data from mobile phone-based stations. Transp. Res. Part C Emerg. Technol. 2015, 58, 278–291. [CrossRef]
65. Lokanathan, S.; Kreindler, G.E.; De Silva, N.N.; Miyauchi, Y.; Dhananjaya, D.; Samarajiva, R. The potential of
Mobile Network Big Data as a Tool in Colombo’s Transportation and Urban Planning. Inf. Technol. Int. Dev.
2016, 12, 63–73.
66. Douglas, R.W.; Meyer, D.A.; Ram, M.; Rideout, D.; Song, D.J. High-resolution population estimation from
telecommunication data. EPJ Data Sci. 2015, 4, 1–13. [CrossRef]
67. Lima, A. Digital Traces of Human Mobility and Interaction: Models and Applications. Ph.D. Thesis,
University of Birmingham, Birmingham, UK, 2016.
68. Finger, F.; Genolet, T.; Mari, L.; De Magny, G.C.; Manga, N.M.; Rinaldo, A.; Bertuzzo, E. Mobile phone data
highlights the role of mass gatherings in the spread of cholera outbreaks. Proc. Natl. Acad. Sci. USA 2016,
113, 6421–6426. [CrossRef] [PubMed]
69. Zhan, X.; Ukkusuri, S.V.; Zhu, F. Inferring Urban Land Use Using Large-Scale Social Media Check-in Data.
Netw. Spat. Econ. 2014, 14, 647–667. [CrossRef]
70. Salehan, M.; Kim, D.J. Predicting the performance of online consumer reviews: A sentiment mining approach
to big data analytics. Decis. Support Syst. 2016, 81, 30–40. [CrossRef]
71. Chatzimilioudis, G.; Konstantinidis, A.; Laoudias, C.; Zeinalipour-Yazti, D. Crowdsourcing with smartphones.
IEEE Internet Comput. 2012, 16, 36–44. [CrossRef]
72. Yang, Y.Y.; Brinton, C.G.; Joe-wong, C.; Chiang, M. Behavior-based grade prediction for MOOCs via time
series Neural Networks. IEEE J. Sel. Top. Sign. Process. 2017, 11, 716–728. [CrossRef]
73. Mohammadi, M.; Al-Fuqaha, A.; Sorour, S.; Guizani, M. Deep Learning for IoT big data and Streaming
Analytics: A Survey. IEEE Commun. Surv. Tutor. 2018, 20, 2923–2960. [CrossRef]
74. Ochoa, S.F.; Fortino, G.; Di Fatta, G. Cyber-Physical systems, Internet of things and big data. Futur. Gener.
Comput. Syst. 2017, 75, 82–84. [CrossRef]
75. Xu, F.; Li, Y.; Chen, M.; Chen, S. Mobile cellular big data: Linking cyberspace and the physical world with
social ecology. IEEE Netw. 2016, 30, 6–12. [CrossRef]
76. Xu, Z.; Liu, Y.; Yen, N.; Mei, L.; Lou, X.; Wei, X.; Hu, C. Crowdsourcing-based description of the urban
emergency event using social media big data. IEEE Trans. Cloud Comput. 2016, 99, 1–11. [CrossRef]
77. Nweke, H.F.; Wah, T.Y.; Al-garadi, M.A.; Alo, U.R. Deep Learning Algorithms for human activity recognition
using mobile and wearable sensor networks: State of the art and research challenges. Expert Syst. Appl. 2018,
105, 233–261. [CrossRef]
78. Nweke, H.F.; Wah, T.Y.; Mujtaba, G.; Al-garadi, M.A. Data fusion and multiple classifier systems for human
activity detection and monitoring: Review and Open Research Directions. Inf. Fus. 2019, 46, 147–170.
[CrossRef]
© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access
article distributed under the terms and conditions of the Creative Commons Attribution
(CC BY) license (http://creativecommons.org/licenses/by/4.0/).