Big Data Analytics For Healthcare Industry: Impact, Applications, and Tools
ISSN 2096-0654 05/06 pp48–57
Volume 2, Number 1, March 2019
DOI: 10.26599/BDMA.2018.9020031
Abstract: In recent years, huge amounts of structured, unstructured, and semi-structured data have been generated
by various institutions around the world and, collectively, this heterogeneous data is referred to as big data. The
health industry sector has been confronted by the need to manage the big data being produced by various sources,
which are well known for producing high volumes of heterogeneous data. Various big-data analytics tools and
techniques have been developed for handling these massive amounts of data in the healthcare sector. In this
paper, we discuss the impact of big data in healthcare, and various tools available in the Hadoop ecosystem for
handling it. We also explore the conceptual architecture of big data analytics for healthcare which involves the data
gathering history of different branches, the genome database, electronic health records, text/imagery, and clinical
decision support systems.
variety as a result of the linking of a diverse range of biomedical data sources including, for example, sensor data, imagery, gene arrays, laboratory tests, free text, and demographics[5]. Most data in the healthcare system (e.g., doctor’s notes, lab test results, and clinical data) is unstructured and is not stored electronically, i.e., it exists only in hard copies and its volume is increasing very rapidly. Currently, there is a major focus on the digitization of these vast stores of hard copy data. The rapid growth in data size is actually creating a problem in achieving this goal[6]. The various terminologies and models that have been developed to resolve the problems associated with big data focus on solving four issues known as the four Vs, namely: volume, variety, velocity, and veracity. The various classes of data in healthcare applications include Electronic Health Records (EHR), machine generated/sensor data, health information exchanges, patient registries, portals, genetic databases, and public records. Public records are major sources of big data in the healthcare industry and require efficient data analytics to resolve their associated healthcare problems. According to a survey conducted in 2012, healthcare data totaled nearly 550 petabytes and will reach nearly 26 000 petabytes in 2020[5]. In light of the heterogeneous data formats, huge volume, and related uncertainties in the big-data sources, the task of realizing the transformation of raw data into actionable information is daunting. Being so complex, the identification of health features in medical data and the selection of class attributes for health analytics demands highly sophisticated and architecturally specific techniques and tools.
2 Big Data Analytics in Health Informatics
The main difference between traditional health analysis and big-data health analytics is the execution of computer programming. In the traditional system, the healthcare industry depended on other industries for big data analysis. Many healthcare stakeholders trust information technology because of its meaningful outcomes: their operating systems are functional and they can process the data into standardized forms. Today, the healthcare industry is faced with the challenge of handling rapidly developing big healthcare data. The field of big data analytics is growing and has the potential to provide useful insights for the healthcare system. As noted above, most of the massive amounts of data generated by this system is saved in hard copies, which must then be digitized[7]. Big data can improve healthcare delivery and reduce its cost, while supporting advanced patient care, improving patient outcomes, and avoiding unnecessary costs[8]. Big data analytics is currently used to predict the outcomes of decisions made by physicians, such as the outcome of a heart operation for a condition based on the patient’s age, current condition, and health status. Essentially, we can say that the role of big data in the health sector is to manage data sets related to healthcare, which are complex and difficult to manage using current hardware, software, and management tools. In addition to the burgeoning volume of healthcare data, reimbursement methods are also changing[9]. Therefore, purposeful use and pay based on performance have emerged as important factors in the healthcare sector. In 2011, organizations working in the field of healthcare had produced more than 150 exabytes of data[10], all of which must be efficiently analyzed to be at all useful to the healthcare system[11]. The storage of healthcare related data in EHRs occurs in a variety of forms. A sudden increase in data related to healthcare informatics has also been observed in the field of bioinformatics, where many terabytes of data are generated by genomic sequencing[11]. There are a variety of analytical techniques available for interpreting medical data, which can then be used for patient care[12]. The diverse origins and forms of big data are challenging the healthcare informatics community to develop methods for data processing. There is a big demand for techniques that combine dissimilar data sources[13].
A number of conceptual approaches can be employed to recognize irregularities in vast amounts of data from different datasets. The frameworks available for the analysis of healthcare data are as follows:
Predictive Analytics in Healthcare: For the past two years, predictive analysis has been recognized as one of the major business intelligence approaches, but its real world applications extend far beyond the business context. Big data analytics includes various methods, including text analytics and multimedia analytics[14]. However, one of the most crucial categories is predictive analytics, which includes statistical methods like data mining and machine learning that examine current and historical facts to predict the future. Predictive methods are being used today in the hospital context to determine if a patient may be at risk for readmission[15]. This data can help doctors to make important patient care decisions. Predictive analysis requires an understanding and use of machine learning, which is widely applied in this
50 Big Data Mining and Analytics, March 2019, 2(1): 48-57
must simply be collected, stored, and processed by a particular device. Structured data comprises just 5% to 10% of healthcare data. Unstructured or semi-structured data includes e-mails, photos, videos, audios, and other health related data such as hospital medical reports, physician’s notes, paper prescriptions, and radiograph films[13].
Veracity: The veracity of data is the degree of assurance that the meaning of data is consistent. Different data sources vary in their levels of data credibility and reliability[9]. The outcomes of big-data analytics must be credible and error-free, but in healthcare, unsupervised machine learning algorithms make decisions that are used by automated machines based on data that may be worthless or misleading[4]. Healthcare analytics are tasked with extracting useful insights from this data to treat patients and make the best possible decisions.
4 Impact of Big Data on the Healthcare System
The potential of big data is that it could revolutionize outcomes regarding the most suitable or accurate patient diagnosis and the accuracy of information used in the health informatics system[15]. As such, the investigation of huge amounts of information will have a powerful effect on the medicinal services framework in five respects, or “pathways” (shown in Fig. 2). Improving outcomes for patients with respect to these pathways, as described below, will be the focus of the healthcare system and will directly impact the patient.
Right Living: Right living refers to the patient living a better and healthier life[15]. By right living, patients could manage themselves by making the best decisions for themselves, based on the utilization of information mining, making better choices and enhancing their wellbeing. By choosing the right path for their daily health, regarding their diet, preventive care, exercise, and other activities of daily living, patients can play an active role in realizing a healthy life[16].
Right Care: This pathway ensures that patients receive the most appropriate treatment available and that all providers obtain the same data and have the same objectives to avoid redundancy of planning and effort[17]. This aspect has become more viable in the era of big data.
Right Provider: Healthcare providers in this pathway can obtain an overall view of their patients by combining data from various sources such as medical equipment, public health statistics, and socioeconomic data[15]. The accessibility of this information enables human service providers to conduct targeted investigations and develop the skills and abilities to identify and provide better treatment options to patients[18].
Right Innovation: This pathway recognizes that new disease conditions, new treatments, and new medical will continue to evolve[15]. Likewise, advancements in the provision of patient services, for
example, upgrading medications and the efficiency of research and development efforts, will enable new ways to promote wellbeing and patient health via the national social insurance system[17]. The availability of early trial data is important for stakeholders. This data can be used to explore high-potential targets and identify techniques for improving traditional clinical treatment methods.
Right Value: To improve the quality and value of health-related services, providers must pay careful and ongoing attention to their patients. Patients must obtain the most beneficial results identified by their social insurance system[18]. Measures that could be taken to ensure the intelligent use of data include, for example, identifying and eliminating data misrepresentation, manipulations, and waste, and improving resources[19].
5 Hadoop-Based Applications for Health Industry
In light of the fact that healthcare data exists primarily in printed form, there is a need for the active digitization of print form data. The majority of this data is also unstructured, so it is a major challenge for this industry to extract meaningful information regarding patient care, clinical operations, and research. The collection of software utilities known as the Hadoop ecosystem can help the healthcare sector to manage this vast amount of data. The various applications of the Hadoop ecosystem in the healthcare sector are as follows:
Treatment of Cancer and Genomics: We know that human DNA contains three billion base pairs. To fight cancer, it is vital that large amounts of data are efficiently organized. The patterns of cancer mutations and their reactions vary based on individual genetics, which explains the non-curability of some cancers. Oncologists have determined that in recognizing the patterns of cancer, it is important to provide specific treatment for specific cancers, based on the patient’s genetic makeup. The Hadoop technology MapReduce facilitates the mapping of three billion DNA base pairs to determine the appropriate cancer treatment for each particular patient. Arizona State University is working on a project to develop a healthcare model that takes individual genomic data and selects a treatment based on identification of the patient’s cancer gene. This model provides a basis for treatment through big data analysis to improve the chances of saving patients’ lives.
Monitoring of Patient Vitals: Hospital staff throughout the world connect their work output using big-data technology. Various hospitals around the globe use Hadoop-based components in the Hadoop Distributed File System (HDFS), including the Impala, HBase, Hive, Spark, and Flume frameworks, to convert the huge amount of unstructured data generated by sensors that take patient vital signs, heartbeats per minute, blood pressure, blood sugar level, and respiratory rate. Without Hadoop, these healthcare staff could not analyze this unstructured data being generated by patient healthcare systems. In Atlanta, Georgia, there are 6200 Intensive Care Units (ICUs) for pediatric healthcare, where children can stay for more than one month depending on their problem. These ICUs are equipped with a sensor technology that tracks the child’s health status with respect to heartbeat, blood pressure, and other vital signs. If any problem occurs, an alert is automatically generated to medical staff to ensure the child’s safety.
Hospital Network: Several hospitals use the Hadoop ecosystem’s NoSQL database to collect and manage their huge amounts of real-time data from diverse sources related to patient care, finances, and payroll, which helps them identify high-risk patients while also reducing day-to-day expenditures.
Healthcare Intelligence: Hadoop technology also supports the healthcare intelligence applications used by hospitals and insurance companies. The Hadoop ecosystem’s Pig, Hive, and MapReduce technologies process large datasets related to medicines, diseases, symptoms, opinions, geographic regions, and other factors to extract meaningful information (e.g., desired age) for insurance companies.
Prevention and Detection of Frauds: In the early phases of big data analytics, health-based insurance groups utilize multiple paths to identify fraud activity and establish methods to prevent medical fraud. With Hadoop, companies use applications based on a prediction model to identify those committing fraud via data regarding their previous health claims, voice recordings, wages, and demographics. Hadoop’s NoSQL database is also helpful in preventing fraud related to medical claims at an early stage by the use of real-time Hadoop based health applications, authentic medical claim bills, weather forecasting data, voice data recordings, and other data sources.
6 Big Data Analytics Architecture for Health Informatics
Currently, the main focus in big-data analytics is
Sunil Kumar et al.: Big Data Analytics for Healthcare Industry: Impact, Applications, and Tools 53
to gain an in-depth insight and understanding of big data rather than to collect it[20]. Data analytics involves the development and application of algorithms for analyzing various complex data sets to extract meaningful knowledge, patterns, and information. In recent years, researchers have begun to consider the appropriate architectural framework for healthcare systems that utilize big-data analytics, one of which uses a four-layer architecture that comprises a transformation layer, data-source layer, big data platform layer, and analytical layer[14]. In this layered system, data originates from different sources and has various formats and storage systems. Each layer has a specific data-processing functionality for performing specific tasks on the HDFS, using the MapReduce processing model. The other layers perform other tasks, i.e., report generation, query passing, data mining processing, and online analytical processing.
The main requirement in big-data analytical processing is to bundle the data at high speed to minimize the bundling time. The next priority in big-data analytical processing is to efficiently update and transform queries at a constant time[21]. The third requirement in big-data analytical processing is to utilize and efficiently manage the storage area space. The last specification of big-data analytics is to efficiently become familiar with the rapidly progressing workload notations. Big-data analytics frameworks differ from traditional healthcare processing systems with respect to how they process big data[22]. In the current health care system, data is processed using traditional tools installed in a single stand-alone system like a desktop computer. In contrast, big data is processed by clustering and scans multiple nodes of clusters in the network[23]. This processing is based on the concept of parallelism to handle large medical data sets[24]. Freely available frameworks, such as Hadoop, MapReduce, Pig, Sqoop, Hive, HBase, and Avro, all have the ability to process the health related data sets for healthcare systems.
Big-data technologies broadly refer to scientific innovations that mimic those used for large datasets[25]. The first component is the requirement for big data sources for processing. In the second component, clusters with a centralized big-data processing infrastructure are at the peak of high performance[24]. It has been observed that the tools mainly available for big-data analytics processing provide data security, scalability, and manageability with the help of the MapReduce paradigm. In the third component, big data analytics applications have a storage domain to integrate accessed databases that use different applications[26]. The fourth component comprises the most popular big-data analytics applications in healthcare systems, which include reports, Online Analytical Processing (OLAP), queries, and data mining.
As shown in Fig. 3, healthcare data come from a range of sources including EHRs, genome databases, genome data files, text and imagery (unstructured data sources), clinical decision support systems, government related sources, medical test labs and pharmacies, and health insurance companies. These data are frequently available in different schema tables, and are in ASCII/text and stored at various locations.
In the next section, we describe the various big-data Hadoop-based processing tools that support the development of health-based applications for the health industry.
7 Hadoop’s Tools and Techniques for Big Data
To manage unstructured big data that does not fit into any database, special tools are needed. To examine this type of big dataset, the IT sector uses the Hadoop platform for a wide variety of methods that have been developed to record, organize, and analyze this type of data[27, 28]. More efficient tools are needed to extract meaningful output from big data. Most of the tools are implemented in the Apache Hadoop architecture including MapReduce, Mahout, Hive, and others[29]. Below, we discuss the various tools used in processing healthcare big datasets.
Apache Hadoop: The name Hadoop has evolved to mean many different things[23]. In 2002, it was established as a single software project to support a web search engine. Since that time, it has grown into an ecosystem of tools and applications that are used to analyze large amounts and types of data[30]. Hadoop can no longer be considered to be a monolithic single project, but rather an approach to data processing that radically differs from the traditional relational database model[23]. A more practical definition of the Hadoop ecosystem and framework is the following: open source tools, libraries, and methodologies for “big data” analysis in which a number of data sets are collected from different sources, i.e., Internet images, audios, videos, and sensor records as both structured and unstructured data to be processed[22]. Figure 4
is complete[26]. The MapReduce programming phase also has two stages: a mapping stage that accepts input in key-value pairs and generates output in key-value pairs, and a second reducing stage; each stage takes key-value pairs as input and produces key-value pairs as output[12]. There is a fixed size data segment division step in Hadoop, the results of which are called input splits[20]. The Map function generates the key and value pairs, which are stored in the mapper. Any keys that are the same are merged. A simplified view of MapReduce is shown in Fig. 5.
Apache Hive: Hive is a data warehousing layer on top of Hadoop, in which analyses and queries can be performed using an SQL-like query language[32]. Apache Hive can be used to perform ad-hoc queries, summarization, and data analysis. Hive is considered to be a de facto standard for SQL based queries over petabytes of data using Hadoop and offers the features of easy data extraction, transformation, and access to the HDFS comprising data files or other HBase storage system[33].
Apache Pig: Apache Pig is one of the available open-source platforms being used to better analyze big data. Pig is an alternative to the MapReduce programming tool[34]. First developed by the Yahoo web service provider as a research project, Pig allows users to develop their own user-defined functions and supports many traditional data operations such as join, sort, filter, etc.
Apache HBase: HBase is a column-oriented NoSQL database used in Hadoop[35], in which users can store large numbers of rows and columns. HBase has the functionality of random read/write operations. It also supports record level updates, which is not possible using HDFS[36]. HBase provides parallel data storage via the underlying distributed file systems across commodity servers. The file system of choice is typically HDFS, due to the tight integration of HBase and HDFS[33]. If there is a need for a structured low-latency view of the high-scale data stored via Hadoop, then HBase is the correct choice. Its open-source code scales linearly to handle petabytes of data on thousands of nodes.
Apache Oozie: To run a complex system or tight system design, or if there are a number of interconnected stations with data dependencies between them, there is a need for a sophisticated technique called Apache Oozie. Apache Oozie can handle and run multiple jobs related to Hadoop. Oozie has two portions: a workflow engine that stores and executes workflow collections of Hadoop-based jobs, and a coordinator engine that processes workflow jobs based on how they are designed in the process schedule. Oozie is designed to construct and manage Hadoop jobs as a workflow in which the output of one job serves as the input for a subsequent job[37]. Oozie is not a substitute for the Yarn scheduler. Oozie workflow jobs are represented as Directed Acyclic Graphs (DAGs) of actions[28]. Oozie plays the role of a service in the cluster and clients submit their jobs for proactive or reactive execution.
Apache Avro: Avro is a serialization format that makes it possible for data to be exchanged between programs written in any language[38]. It is often used to connect Flume data flows. The Avro system is schema-based, where the role of a schema is to govern the read and write operations independently of the programming language. Avro serializes the data that have a built-in schema[33]. It is a framework for the serialization of persistent data and remote procedure calls between Hadoop nodes and between client programs and Hadoop services.
Apache Zookeeper: Zookeeper is a centralized system used by applications to maintain a healthcare system and provide organizing and other elements
on and between nodes[39]. It maintains the common objects needed in large cluster environments, including configuration information and the hierarchical naming space. These services can be used by different applications to coordinate the distributed processing of Hadoop clusters. Zookeeper also ensures application reliability[40]. If an application master dies, Zookeeper generates a new application master to resume the tasks.
Apache Yarn: Hadoop Yarn includes a distributed shell application, which is an example of a Hadoop non-MapReduce application built on top of Yarn[41]. Yarn has two components: a Resource Manager (RM) that handles all the resources within a cluster that are required for the tasks, and a Node Manager (NM), which is located on every host in a cluster and handles the available resources on the independent host. Both components handle the scheduling of jobs and manage the containers, memory management, CPU throughput, and I/O system which run the dedicated application code.
Apache Sqoop: Apache Sqoop is a powerful tool that performs the functionality of extracting the data from a Relational Database Management System (RDBMS) and inputting it into the Hadoop architecture for query processing. To do so, this process uses the MapReduce paradigm or other standard level tools, e.g., Hive[42]. Once placed in HDFS, the data can be used by Hadoop applications.
Apache Flume: Apache Flume is a highly reliable service for accurately collecting data and moving large volumes of data from independent machines to HDFS[43]. Often data transport involves a number of Flume agents that may traverse a series of machines and locations. Flume is often used for log files, data generated by social media, and email messages.
8 Conclusion
In this paper, we have provided an in-depth description and a brief overview of big data in general and in the healthcare system, which plays a significant role in healthcare informatics and greatly influences the healthcare system and the big data four Vs in healthcare. We also proposed the use of a conceptual architecture for solving healthcare problems in big data using Hadoop-based terminologies, which involves the utilization of the big data generated by different levels of medical data and the development of methods for analyzing this data to obtain answers to medical questions. The combination of big data and healthcare analytics can lead to treatments that are effective for specific patients by providing the ability to prescribe appropriate medications for each individual, rather than those that work for most people. As we know, big data analytics is in the early stage of development and current tools and methods cannot solve the problems associated with big data. Big data may be viewed as big systems, which present huge challenges. Therefore, a great deal of research in this field will be required to solve the issues faced by the healthcare system.
References
[1] A. Gandomi and M. Haider, Beyond the hype: Big data concepts, methods and analytics, International Journal of Information Management, vol. 35, no. 2, pp. 137–144, 2015.
[2] A. O’Driscoll, J. Daugelaite, and R. D. Sleator, “Big Data”, Hadoop and cloud computing in genomics, Journal of Biomedical Informatics, vol. 46, no. 5, pp. 774–781, 2013.
[3] C. L. P. Chen and C. Y. Zhang, Data-intensive applications, challenges, techniques and technologies: A survey on big data, Information Sciences, vol. 275, pp. 314–347, 2014.
[4] M. Herland, T. M. Khoshgoftaar, and R. Wald, A review of data mining using big data in health informatics, Journal of Big Data, vol. 1, no. 1, p. 2, 2014.
[5] D. H. Shin and M. J. Choi, Ecological views of big data: Perspective and issues, Telematics and Informatics, vol. 32, no. 2, pp. 311–320, 2015.
[6] B. Saraladevi, N. Pazhaniraja, P. V. Paul, M. S. Basha, and P. Dhavachelvan, Big data and Hadoop-A study in security perspective, Procedia Computer Science, vol. 50, pp. 596–601, 2015.
[7] X. Wu, X. Zhu, G. Q. Wu, and W. Ding, Data mining with big data, IEEE Transactions on Knowledge and Data Engineering, vol. 26, no. 1, pp. 97–107, 2014.
[8] S. Sharma and V. Mangat, Technology and trends to handle big data: Survey, in Proc. 5th International Conference on Advanced Computing & Communication Technologies, 2015, pp. 266–271.
[9] R. Mehmood and G. Graham, Big data logistics: A health-care transport capacity sharing model, Procedia Computer Science, vol. 64, pp. 1107–1114, 2015.
[10] D. P. Augustine, Leveraging big data analytics and Hadoop in developing India healthcare services, International Journal of Computer Applications, vol. 89, no. 16, pp. 44–50, 2014.
[11] J. A. Patel and P. Sharma, Big data for better health planning, in Proc. International Conference on Advances in Engineering and Technology Research, 2014, pp. 1–5.
[12] A. E. Youssef, A framework for secure healthcare systems based on big data analytics in mobile cloud computing environments, International Journal of Ambient Systems and Applications, vol. 2, no. 2, pp. 1–11, 2014.
[13] MAPR, Healthcare and life science use cases, https://mapr.com/solutions/industry/healthcare-and-lifescience-use-cases/, 2018.
[14] W. Raghupathi and V. Raghupathi, Big data analytics in healthcare: Promise and potential, Health Information Science and Systems, vol. 2, no. 1, p. 3, 2014.
[15] J. Sun and C. K. Reddy, Big data analytics for healthcare, in Proc. 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2013, pp. 1525–1525.
[16] C. Mike, W. Hoover, T. Strome, and S. Kanwal, Transforming health care through big data strategies for leveraging big data in the health care industry, http://ihealthtran.com/iHT2 BigData 2013.pdf, 2013.
[17] J. Anuradha, A brief introduction on big data 5Vs characteristics and Hadoop technology, Procedia Computer Science, vol. 48, pp. 319–324, 2015.
[18] M. Viceconti, P. J. Hunter, and R. D. Hose, Big data, big knowledge: Big data for personalized healthcare, IEEE Journal of Biomedical and Health Informatics, vol. 19, no. 4, pp. 1209–1215, 2015.
[19] Y. Sun, H. Song, A. J. Jara, and R. Bie, Internet of things and big data analytics for smart and connected communities, IEEE Access, vol. 4, pp. 766–773, 2016.
[20] A. Jain and V. Bhatnagar, Crime data analysis using Pig with Hadoop, Procedia Computer Science, vol. 78, pp. 571–578, 2016.
[21] T. Jach, E. Magiera, and W. Froelich, Application of Hadoop to store and process big data gathered from an urban water distribution system, Procedia Engineering, vol. 119, pp. 1375–1380, 2015.
[22] C. Uzunkaya, T. Ensari, and Y. Kavurucu, Hadoop ecosystem and its analysis on tweets, Procedia-Social and Behavioral Sciences, vol. 195, pp. 1890–1897, 2015.
[23] S. G. Manikandan and S. Ravi, Big data analysis using Apache Hadoop, in Proc. International Conference on IT Convergence and Security, 2014, pp. 1–4.
[24] V. Ubarhande, A. M. Popescu, and H. Gonzalez-Velez, Novel data-distribution technique for Hadoop in heterogeneous cloud environment, in Proc. 9th International Conference on Complex, Intelligent, and Software Intensive Systems, 2015, pp. 217–224.
[25] S. Maitrey and C. K. Jha, Handling big data efficiently by using map reduce technique, in Proc. International Conference on Computational Intelligence & Communication Technology, 2015, pp. 703–708.
[26] J. Dean and S. Ghemawat, MapReduce: Simplified data processing on large clusters, Communications of the ACM, vol. 51, no. 1, pp. 107–113, 2008.
[27] Cloudera, Whole genome research drives healthcare to Hadoop, https://www.cloudera.com/content/dam/www/marketing/resources/solution-briefs/whole-genome-research-inhealthcare.pdf.landing.html, 2018.
[28] R. Misra, B. Panda, and M. Tiwary, Big data and ICT applications: A study, in Proc. 2nd International Conference on Information and Communication Technology for Competitive Strategies, 2016, p. 41.
[29] A. G. Picciano, The evolution of big data and learning analytics in American higher education, Journal of Asynchronous Learning Networks, vol. 16, no. 3, pp. 9–20, 2012.
[30] Apache Hadoop, http://hadoop.apache.org/, 2018.
[31] A. Katal, M. Wazid, R. H. Goudar, and T. Noel, Big data: Issues, challenges, tools and good practices, in Proc. 6th International Conference on Contemporary Computing, 2013, pp. 404–409.
[32] Apache Hive, https://hive.apache.org/, 2018.
[33] K. K. Y. Lee, W. C. Tang, and K. S. Choi, Alternatives to relational database: Comparison of NoSQL and XML approaches for clinical data storage, Computer Methods and Programs in Biomedicine, vol. 110, no. 1, pp. 99–109, 2013.
[34] Apache Pig, https://pig.apache.org/, 2018.
[35] E. Dede, B. Sendir, P. Kuzlu, J. Weachock, M. Govindaraju, and L. Ramakrishnan, Processing Cassandra datasets with Hadoop-streaming based approaches, IEEE Transactions on Services Computing, vol. 9, no. 1, pp. 46–58, 2016.
[36] Apache HBase, http://hbase.apache.org/, 2018.
[37] Apache Oozie, https://oozie.apache.org/, 2018.
[38] Apache Avro, https://avro.apache.org/, 2018.
[39] Apache Zookeeper, https://zookeeper.apache.org/, 2018.
[40] Apache Zookeeper, https://www.ibm.com/analytics/hadoop/zookeeper, 2018.
[41] Apache Yarn, https://yarn.apache.org/, 2018.
[42] Apache Sqoop, https://sqoop.apache.org/, 2018.
[43] Apache Flume, https://flume.apache.org/, 2018.