BDA Unit I
What is Data?
Data refers to quantities, characters, or symbols on which operations are performed by a computer, and which may be stored and transmitted in the form of electrical signals and recorded on magnetic, optical, or mechanical recording media.
While many organizations boast of having good data or improving the quality of their data, the
real challenge is defining what those qualities represent. What some consider good quality others
might view as poor. Judging the quality of data requires an examination of its characteristics and
then weighing those characteristics according to what is most important to the organization and
the application(s) for which they are being used.
The seven characteristics that define data quality are:
1. Accuracy and Precision
2. Legitimacy and Validity
3. Reliability and Consistency
4. Timeliness and Relevance
5. Completeness and Comprehensiveness
6. Availability and Accessibility
7. Granularity and Uniqueness
Accuracy and Precision: This characteristic refers to the exactness of the data. The data cannot have any erroneous elements and must convey the correct message without being misleading. Accuracy and precision also have a component that relates to the data's intended use. Without understanding how the data will be consumed, ensuring accuracy and precision could be off-target or more costly than necessary. For example, accuracy in healthcare might be more important than in another industry (which is to say, inaccurate data in healthcare could have more serious consequences) and is therefore justifiably worth higher levels of investment.
Legitimacy and Validity: Requirements governing data set the boundaries of this characteristic.
For example, on surveys, items such as gender, ethnicity, and nationality are typically limited to
a set of options and open answers are not permitted. Any answers other than these would not be
considered valid or legitimate based on the survey’s requirement. This is the case for most data
and must be carefully considered when determining its quality. The people in
each department in an organization understand what data is valid or not to them, so the
requirements must be leveraged when evaluating data quality.
Reliability and Consistency: Many systems in today’s environments use and/or collect the
same source data. Regardless of what source collected the data or where it resides, it cannot
contradict a value residing in a different source or collected by a different system. There must be
a stable and steady mechanism that collects and stores the data without contradiction or
unwarranted variance.
Timeliness and Relevance: There must be a valid reason to collect the data to justify the effort
required, which also means it has to be collected at the right moment in time. Data collected too
soon or too late could misrepresent a situation and drive inaccurate decisions.
Completeness and Comprehensiveness: Incomplete data is as dangerous as inaccurate data.
Gaps in data collection lead to a partial view of the overall picture to be displayed. Without a
complete picture of how operations are running, uninformed actions will occur. It’s important to
understand the complete set of requirements that constitute a comprehensive set of data to
determine whether or not the requirements are being fulfilled.
Availability and Accessibility: This characteristic can be tricky at times due to legal and
regulatory constraints. Regardless of the challenge, though, individuals need the right level of
access to the data in order to perform their jobs. This presumes that the data exists and is
available for access to be granted.
Granularity and Uniqueness: The level of detail at which data is collected is important, because confusion and inaccurate decisions can otherwise occur. Aggregated, summarized, and manipulated collections of data could convey a different meaning than the data implied at a lower level. An appropriate level of granularity must be defined so that sufficient uniqueness and distinctive properties become visible. This is a requirement for operations to function effectively.
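To make a few of these characteristics concrete, here is a minimal Python sketch (not part of any particular product) that runs simple completeness, validity, and uniqueness checks over a handful of records; the field names and allowed values are invented for illustration.

# Simple data-quality checks illustrating completeness, validity, and uniqueness.
# Field names and allowed values are illustrative only.
records = [
    {"id": 1, "name": "Asha", "gender": "Female", "age": 29},
    {"id": 2, "name": "Ravi", "gender": "M", "age": None},      # invalid gender, missing age
    {"id": 2, "name": "Ravi", "gender": "Male", "age": 34},     # duplicate id
]

VALID_GENDERS = {"Male", "Female", "Other"}   # the allowed options (validity requirement)

complete = [r for r in records if all(v is not None for v in r.values())]
valid = [r for r in records if r["gender"] in VALID_GENDERS]
unique_ids = len({r["id"] for r in records}) == len(records)

print(f"complete records: {len(complete)}/{len(records)}")
print(f"valid records:    {len(valid)}/{len(records)}")
print(f"ids unique:       {unique_ids}")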
Characteristics of Big Data
Big Data is commonly described by four characteristics (the four V's), namely:
Volume: the amount of data that businesses can collect is really enormous and hence the volume
of the data becomes a critical factor in Big Data analytics.
Velocity: the rate at which new data is generated, driven by our dependence on the internet, sensors, and machine-to-machine communication, is also important; Big Data must be parsed in a timely manner.
Variety: the data that is generated is completely heterogeneous in the sense that it could be in
various formats like video, text, database, numeric, sensor data and so on and hence
understanding the type of Big Data is a key factor to unlocking its value.
Veracity: refers to the inconsistency or uncertainty of data, i.e., knowing whether the available data comes from a credible source is of utmost importance before deciphering and implementing Big Data for business needs.
Unstructured
Any data with an unknown form or structure is classified as unstructured data. In addition to its huge size, unstructured data poses multiple challenges in terms of processing it to derive value. A typical example of unstructured data is a heterogeneous data source containing a combination of simple text files, images, videos, etc. Nowadays, organizations have a wealth of data available to them but unfortunately do not know how to derive value from it, since this data is in its raw, unstructured form.
Examples of Un-structured Data
The output returned by 'Google Search' and the data being used by Twitter, Facebook and other
social media in the form of posts
Semi-structured
Semi-structured data can contain elements of both structured and unstructured data. Semi-structured data appears structured in form, but it is not actually defined by, for example, a table definition in a relational DBMS. A typical example of semi-structured data is data represented in an XML file.
Examples of Semi-structured Data
Personal data stored in an XML file-
<rec><name>Prashant Rao</name><sex>Male</sex><age>35</age></rec>
<rec><name>Seema R.</name><sex>Female</sex><age>41</age></rec>
<rec><name>Satish Mane</name><sex>Male</sex><age>29</age></rec>
<rec><name>Subrato Roy</name><sex>Male</sex><age>26</age></rec>
<rec><name>Jeremiah J.</name><sex>Male</sex><age>35</age></rec>
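As an illustration of how such semi-structured records can be processed, the following minimal Python sketch parses a few of the records above with the standard library; a wrapping <records> root element is added only so the fragment is well-formed XML.

# Parse the semi-structured XML records above using Python's standard library.
import xml.etree.ElementTree as ET

xml_data = """<records>
<rec><name>Prashant Rao</name><sex>Male</sex><age>35</age></rec>
<rec><name>Seema R.</name><sex>Female</sex><age>41</age></rec>
<rec><name>Satish Mane</name><sex>Male</sex><age>29</age></rec>
</records>"""

root = ET.fromstring(xml_data)
for rec in root.findall("rec"):
    # The structure is carried by the tags themselves, not by a table definition.
    print(rec.find("name").text, rec.find("sex").text, rec.find("age").text)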
Difference between traditional data and big data:
• Volume: traditional data typically ranges from gigabytes to terabytes, while Big Data runs to petabytes and beyond.
• Variety: traditional data is mostly structured, while Big Data includes structured, semi-structured, and unstructured data.
• Architecture: traditional data is managed in a centralized relational database, while Big Data is stored and processed in a distributed fashion across clusters of machines.
• Schema: traditional systems apply a fixed schema before loading (schema on write), while Big Data systems often apply the schema when the data is read (schema on read).
Advantages of Big Data:
• Understand customer needs better: Through effective analysis of big data, a company can plan better for customer satisfaction and make the alterations needed to ensure loyalty and customer trust. A better customer experience definitely impacts growth. Complaint resolution, 24×7 customer service, interactive websites, and consistent gathering of customer feedback are some of the new measures that have made big data analytics very popular and helpful to companies.
• Work on bettering company reputation: Big Data tools can analyze sentiment, both negative and positive. This analysis can help correct false rumors, better serve customer needs, and maintain the company's image through its online presence, which eventually helps the company's reputation.
• Promotes cost-saving measures: Though the initial costs of deploying Big Data analytics are high, the returns and gainful insights more than pay for themselves. Big Data analytics also enables constant monitoring and better risk management, and frees up IT infrastructure personnel, which translates into fewer personnel required. Besides this, Big Data tools can be used to store data more effectively. Thus the costs are outweighed by the savings.
• Makes data available: Modern Big Data tools can present the required portions of data in real time, in a structured and easily readable format.
EVOLUTION OF BIG DATA
The Foundations of Big Data
Data became a problem for the U.S. Census Bureau in 1880. They estimated it would take eight
years to handle and process the data collected during the 1880 census, and predicted the data
from the 1890 census would take more than 10 years to process. Fortunately, in 1881, a young
man working for the bureau, named Herman Hollerith, created the Hollerith Tabulating Machine.
His invention was based on the punch cards designed for controlling the patterns woven by
mechanical looms. His tabulating machine reduced ten years of labor into three months of labor.
In 1927, Fritz Pfleumer, an Austrian-German engineer, developed a means of storing information magnetically on tape. Pfleumer had devised a method for adhering metal stripes to cigarette papers (to keep a smoker's lips from being stained by the rolling papers available at the
time), and decided he could use this technique to create a magnetic strip, which could then be
used to replace wire recording technology. After experiments with a variety of materials, he
settled on a very thin paper, striped with iron oxide powder and coated with lacquer, for his
patent in 1928.
During World War II (more specifically 1943), the British, desperate to crack Nazi codes,
invented a machine that scanned for patterns in messages intercepted from the Germans. The
machine was called Colossus, and scanned 5,000 characters a second, reducing the workload
from weeks to merely hours. Colossus was the first data processor. Two years later, in 1945, John
Von Neumann published a paper on the Electronic Discrete Variable Automatic
Computer (EDVAC), the first “documented” discussion on program storage, and laid the
foundation of computer architecture today.
It is said these combined events prompted the “formal” creation of the United States’ NSA
(National Security Agency), by President Truman, in 1952. Staff at the NSA were assigned the
task of decrypting messages intercepted during the Cold War. Computers of this time had
evolved to the point where they could collect and process data, operating independently and
automatically.
The Internet Effect and Personal Computers
ARPANET began on Oct 29, 1969, when a message was sent from UCLA’s host computer to
Stanford’s host computer. It received funding from the Advanced Research Projects Agency
(ARPA), a subdivision of the Department of Defense. Generally speaking, the public was not
aware of ARPANET. In 1973, it connected with a transatlantic satellite, linking it to the
Norwegian Seismic Array. However, by 1989, the infrastructure of ARPANET had started to
age. The system wasn’t as efficient or as fast as newer networks. Organizations using ARPANET
started moving to other networks, such as NSFNET, to improve basic efficiency and speed. In
1990, the ARPANET project was shut down, due to a combination of age and obsolescence. The
creation of ARPANET led directly to the Internet.
In 1965, the U.S. government built the first data center, with the intention of storing millions of
fingerprint sets and tax returns. Each record was transferred to magnetic tape and was to be stored in a central location. Conspiracy theorists expressed their fears, and the project
was closed. However, in spite of its closure, this initiative is generally considered the first effort
at large scale data storage.
Personal computers came on the market in 1977, when microcomputers were introduced, and
became a major stepping stone in the evolution of the internet, and subsequently, Big Data. A
personal computer could be used by a single individual, as opposed to mainframe computers,
which required an operating staff, or some kind of time-sharing system, with one large processor
being shared by multiple individuals. After the introduction of the microprocessor, prices for personal computers dropped significantly, and they came to be described as "an affordable consumer good." Many of the early personal computers were sold as electronic kits, designed to be built by
hobbyists and technicians. Eventually, personal computers would provide people worldwide with
access to the internet.
In 1989, a British Computer Scientist named Tim Berners-Lee came up with the concept of the
World Wide Web. The Web is a place/information-space where web resources are recognized
using URLs, interlinked by hypertext links, and is accessible via the Internet. His system also
allowed for the transfer of audio, video, and pictures. His goal was to share information on the
Internet using a hypertext system. By the fall of 1990, Tim Berners-Lee, working for CERN, had
written the three basic technologies that are the foundation of today's web:
• HTML: HyperText Markup Language. The formatting language of the web.
• URL: Uniform Resource Locator. A unique "address" used to identify each resource on the web. It is also called a URI (Uniform Resource Identifier).
• HTTP: Hypertext Transfer Protocol. Used for retrieving linked resources from all across the web.
In 1993, CERN announced the World Wide Web would be free for everyone to develop and use.
The free part was a key factor in the effect the Web would have on the people of the world. (It’s
the companies providing the “internet connection” that charge us a fee).
The Internet of Things (IoT)
The concept of Internet of Things was assigned its official name in 1999. By 2013, the IoT had
evolved to include multiple technologies, using the Internet, wireless communications, micro
electromechanical systems (MEMS), and embedded systems. All of these transmit data about the
person using them. Automation (including buildings and homes), GPS, and others, support the
IoT.
The Internet of Things, unfortunately, can make computer systems vulnerable to hacking. In
October of 2016, hackers crippled major portions of the Internet using the IoT. The early
response has been to develop Machine Learning and Artificial Intelligence focused on security
issues.
Computing Power and Internet Growth
There was an incredible amount of internet growth in the 1990s, and personal computers became
steadily more powerful and more flexible. Internet growth was driven by Tim Berners-Lee's efforts, CERN's decision to provide free access, and access to individual personal computers.
In 2005, Big Data, which had been used without a name, was labeled by Roger Mougalas. He
was referring to a large set of data that, at the time, was almost impossible to manage and process
using the traditional business intelligence tools available. Additionally, Hadoop, which could
handle Big Data, was created in 2005. Hadoop grew out of an open-source software framework called Nutch and incorporated Google's MapReduce programming model. Hadoop is an open-source
software framework, and can process structured and unstructured data, from almost all digital
sources. Because of this flexibility, Hadoop (and its sibling frameworks) can process Big Data.
Challenges of Big Data
• Capturing data
• Curation (process of selecting, organizing, and looking after the items in a collection)
• Storage
• Searching
• Sharing
• Transfer
• Analysis/Processing
• Presentation
Data Analytics
Firms commonly apply analytics to business data to describe, predict, and improve business performance. Areas within analytics include predictive analytics, enterprise decision management, etc. Since analytics can require extensive computation (because of big data), the algorithms and software used for analytics harness the most current methods in computer science. In a nutshell, analytics is the scientific process of transforming data into insight for making better decisions. The goal of Data Analytics is to get actionable insights resulting in smarter decisions and better business outcomes.
It is critical to design and build a data warehouse or Business Intelligence (BI) architecture that provides a flexible, multi-faceted analytical ecosystem, optimized for efficient ingestion and analysis of large and diverse data sets. Data analytics is commonly divided into four types:
• Predictive (forecasting)
• Descriptive (business intelligence and data mining)
• Prescriptive (optimization and simulation)
• Diagnostic analytics
Predictive Analytics: Predictive analytics turns data into valuable, actionable information. Predictive analytics uses data to determine the probable outcome of an event or the likelihood of a situation occurring.
Predictive analytics draws on a variety of statistical techniques from modeling, machine learning, data mining, and game theory that analyze current and historical facts to make predictions about future events (a minimal example follows the list below).
There are three basic cornerstones of predictive analytics-
• Predictive modeling
• Decision Analysis and optimization
• Transaction profiling
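As a minimal illustration of predictive modeling (the first cornerstone above), the following Python sketch fits a model on historical observations and predicts a future outcome. It assumes scikit-learn is installed; the figures are invented for the example.

# Minimal predictive-modeling sketch: fit on historical data, predict a future outcome.
from sklearn.linear_model import LinearRegression

# Historical facts: monthly ad spend (in $1000s) and resulting sales (in $1000s).
ad_spend = [[10], [15], [20], [25], [30]]   # feature matrix: one feature per row
sales = [105, 150, 192, 248, 295]           # observed outcomes

model = LinearRegression()
model.fit(ad_spend, sales)                  # learn the relationship from history

# Predict the probable outcome for a planned spend of $35k.
predicted_sales = model.predict([[35]])
print(f"Predicted sales for $35k ad spend: {predicted_sales[0]:.1f}k")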
Descriptive Analytics: Descriptive analytics looks at data and analyzes past events for insight into how to approach future events. It looks at past performance and understands it by mining historical data to find the causes of success or failure in the past. Almost all management reporting, such as sales, marketing, operations, and finance, uses this type of analysis.
A descriptive model quantifies relationships in data in a way that is often used to classify customers or prospects into groups. Unlike a predictive model that focuses on predicting the behavior of a single customer, descriptive analytics identifies many different relationships between customers and products.
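A minimal descriptive-analytics sketch in Python, assuming pandas is installed; the records are invented for the example. It mines a small set of historical records to summarize past performance by group.

# Minimal descriptive-analytics sketch: summarize past performance from historical data.
import pandas as pd

history = pd.DataFrame({
    "region":  ["North", "North", "South", "South", "West"],
    "quarter": ["Q1", "Q2", "Q1", "Q2", "Q1"],
    "sales":   [120, 135, 90, 80, 150],
})

# Mine the historical data: total and average sales per region.
summary = history.groupby("region")["sales"].agg(["sum", "mean"])
print(summary)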
Prescriptive Analytics: Prescriptive analytics goes beyond predicting what is likely to happen and uses optimization and simulation to recommend the actions that should be taken. For example, prescriptive analytics can benefit healthcare strategic planning by using analytics to leverage operational and usage data combined with data on external factors such as economic data, population demographics, etc.
Diagnostic Analytics: In this analysis, we generally use historical data over other data to answer a question or to find the cause of a problem. We try to find dependencies and patterns in the historical data of the particular problem.
For example, companies go for this analysis because it gives great insight into a problem, and they keep detailed information at their disposal; otherwise, data collection would have to be repeated for every individual problem, which would be very time-consuming.
2. Spark
Apache Spark is part of the Hadoop ecosystem, but its use has become so widespread that it
deserves a category of its own. It is an engine for processing big data within Hadoop, and it's up
to one hundred times faster than the standard Hadoop engine, MapReduce. In the AtScale 2016
Big Data Maturity Survey, 25 percent of respondents said that they had already deployed Spark
in production, and 33 percent more had Spark projects in development. Clearly, interest in the
technology is sizable and growing, and many vendors with Hadoop offerings also offer
Spark-based products.
It has the following features (a usage sketch follows the list):
• Speed
• Supports multiple languages
• Supports SQL queries, streaming data, machine learning, and graph algorithms, in addition to MapReduce-style processing
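For illustration, here is a minimal PySpark sketch of a classic word count. It assumes pyspark is installed and that a local text file named input.txt exists; both the application name and the file name are placeholders.

# Minimal PySpark sketch: word count on a text file.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("WordCount").getOrCreate()

lines = spark.read.text("input.txt").rdd.map(lambda row: row[0])
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

for word, count in counts.take(10):   # show the first ten word counts
    print(word, count)

spark.stop()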
3. R
R, another open source project, is a programming language and software environment designed
for working with statistics. It is managed by the R Foundation and available under the GPL 2
license. Many popular integrated development environments (IDEs), including Eclipse and
Visual Studio, support the language.
R is also a software environment that supports data manipulation and displaying data graphically.
The R environment includes:
• An effective facility for data storage and handling
• A large collection of tools for data analysis
• A well-developed, simple, and effective programming language
Several organizations that rank the popularity of various programming languages say that R has
become one of the most popular languages in the world. For example, the IEEE says that R is the
fifth most popular programming language, and both Tiobe and RedMonk rank it 14th. This is
significant because the programming languages near the top of these charts are usually general
purpose languages that can be used for many different kinds of work. For a language that is used
almost exclusively for big data projects to be so near the top demonstrates the significance of big
data and the importance of this language in its field.
4. Data Lakes
To make it easier to access their vast stores of data, many enterprises are setting up data lakes.
These are huge data repositories that collect data from many different sources and store it in its
natural state. This is different than a data warehouse, which also collects data from disparate
sources, but processes it and structures it for storage. In this case, the lake and warehouse
metaphors are fairly accurate. If data is like water, a data lake is natural and unfiltered like a
body of water, while a data warehouse is more like a collection of water bottles stored on
shelves.
Data lakes are particularly attractive when enterprises want to store data but aren't yet sure how
they might use it. A lot of Internet of Things (IoT) data might fit into that category, and the IoT
trend is playing into the growth of data lakes.
Market research firms predict that data lake revenue will grow from $2.53 billion in 2016 to $8.81 billion by 2021.
A Data Lake is a lake of unstructured and structured data.
The data lake supports the following capabilities:
• To capture and store raw data at scale for a low cost
• To store many types of data in the same repository
• To perform transformations on the data
• To define the structure of the data at the time it is used, referred to as schema on read (a minimal sketch of this idea follows)
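A minimal Python sketch of the schema-on-read idea: raw records are written to the lake as-is, and a structure is imposed only when a consumer reads them. The file name and fields are invented for illustration.

# Schema on read: store raw JSON lines as-is, apply a structure only when reading.
import json

# Write raw, heterogeneous records to the "lake" without enforcing a schema.
raw_records = [
    '{"sensor": "t1", "temp": 21.5, "ts": "2021-01-01T10:00:00"}',
    '{"sensor": "t2", "humidity": 40, "ts": "2021-01-01T10:00:05"}',
]
with open("lake_raw.jsonl", "w") as f:
    f.write("\n".join(raw_records))

# Later, a consumer defines the structure it needs at read time.
with open("lake_raw.jsonl") as f:
    readings = [json.loads(line) for line in f]
temps = [(r["sensor"], r["temp"]) for r in readings if "temp" in r]
print(temps)   # [('t1', 21.5)]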
5. NoSQL Databases
Traditional relational database management systems (RDBMS) store information in structured,
defined columns and rows. Developers and database administrators query, manipulate and
manage the data in those RDBMSes using a special language known as SQL.
NoSQL databases specialize in storing unstructured data and providing fast performance,
although they don't provide the same level of consistency as RDBMSes. Popular NoSQL
databases include MongoDB, Redis, Cassandra, Couchbase and many others; even the leading
RDBMS vendors like Oracle and IBM now also offer NoSQL databases.
• A DBMS where MapReduce-style processing is used for queries instead of manual programming to iterate over entire data sets, e.g., Hadoop, MongoDB.
• A MapReduce engine with a small query language on top that is not complete SQL, e.g., Hive on top of Hadoop provides HiveQL.
• A DBMS with a new query language for new applications, e.g., Virtuoso, Neo4j, Amos II.
• Other non-relational databases, including object stores.
MongoDB is one of the best-known NoSQL databases.
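For illustration, a minimal pymongo sketch that stores and queries schema-less documents. It assumes the pymongo driver is installed and a MongoDB server is running on localhost:27017; the database and collection names are placeholders.

# Minimal MongoDB sketch using pymongo: store and query schema-less documents.
from pymongo import MongoClient

client = MongoClient("localhost", 27017)
db = client["demo_db"]            # database and collection names are illustrative
people = db["people"]

# Documents need no predefined schema; fields can vary per record.
people.insert_one({"name": "Prashant Rao", "sex": "Male", "age": 35})
people.insert_one({"name": "Seema R.", "sex": "Female", "age": 41, "city": "Pune"})

# Query by field value.
for doc in people.find({"sex": "Male"}):
    print(doc["name"], doc["age"])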
6. Hive
This is a distributed data management layer for Hadoop. It supports an SQL-like query language, HiveQL (HQL), to access big data. It is primarily used for data mining purposes and runs on top of Hadoop.
7. Sqoop
This is a tool that connects Hadoop with various relational databases to transfer data. This can be
effectively used to transfer structured data to Hadoop or Hive.
8. Presto
Facebook has developed and recently open-sourced its Query engine (SQL-on-Hadoop) named
Presto which is built to handle petabytes of data. Unlike Hive, Presto does not depend on
MapReduce technique and can quickly retrieve data.
9. Apache Pig
Pig is a high-level scripting platform commonly used with Apache Hadoop to analyze large data sets. The Pig platform offers a special scripting language known as Pig Latin to developers who are already familiar with other scripting languages and with query languages like SQL.
The major benefit of Pig is that it works with data obtained from various sources and stores the results in HDFS (Hadoop Distributed File System). Programmers write scripts in the Pig Latin language, which are then converted into Map and Reduce tasks by the Pig Engine component.
In-memory databases keep data in a system's main memory (RAM) rather than on disk, which dramatically speeds up processing. Many of the leading enterprise software vendors, including SAP, Oracle, Microsoft, and IBM, now offer in-memory database technology. In addition, several smaller companies like Teradata, Tableau, VoltDB, and DataStax offer in-memory database solutions. Research from MarketsandMarkets estimates that total sales of in-memory technology were $2.72 billion in 2016 and may grow to $6.58 billion by 2021.
12. Big Data Security Solutions
Because big data repositories present an attractive target to hackers and advanced persistent
threats, big data security is a large and growing concern for enterprises. In the AtScale survey,
security was the second fastest-growing area of concern related to big data.
According to the IDG report, the most popular types of big data security solutions include
identity and access controls (used by 59 percent of respondents), data encryption (52 percent)
and data segregation (42 percent). Dozens of vendors offer big data security solutions, and
Apache Ranger, an open source project from the Hadoop ecosystem, is also attracting growing
attention.
13. Big Data Governance Solutions
Closely related to the idea of security is the concept of governance. Data governance is a broad
topic that encompasses all the processes related to the availability, usability and integrity of data.
It provides the basis for making sure that the data used for big data analytics is accurate and
appropriate, as well as providing an audit trail so that business analysts or executives can see
where data originated.
In the NewVantage Partners survey, 91.8 percent of the Fortune 1000 executives surveyed said
that governance was either critically important (52.5 percent) or important (39.3 percent) to their
big data initiatives. Vendors offering big data governance tools include Collibra, IBM, SAS,
Informatica, Adaptive and SAP.
The standard definition of machine learning is that it is technology that gives "computers the
ability to learn without being explicitly programmed." In big data analytics, machine learning
technology allows systems to look at historical data, recognize patterns, build models and predict
future outcomes. It is also closely associated with predictive analytics.
Deep learning is a type of machine learning technology that relies on artificial neural networks
and uses multiple layers of algorithms to analyze data. As a field, it holds a lot of promise for
allowing analytics tools to recognize the content in images and videos and then process it
accordingly.
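As a rough, small-scale illustration of the multi-layer idea (a toy sketch, not a full deep learning framework), the following Python code trains a neural network with two hidden layers using scikit-learn; the tiny data set is invented for the example.

# Toy multi-layer neural network sketch (scikit-learn assumed installed).
from sklearn.neural_network import MLPClassifier

# Invented historical data: [hours_online, purchases_last_month] -> churned (1) or not (0).
X = [[1, 0], [2, 1], [8, 5], [9, 6], [1, 1], [10, 7]]
y = [1, 1, 0, 0, 1, 0]

# Two hidden layers, echoing the "multiple layers" described above.
model = MLPClassifier(hidden_layer_sizes=(16, 8), max_iter=2000, random_state=0)
model.fit(X, y)

print(model.predict([[3, 1], [7, 5]]))   # predict outcomes for new customers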
Experts say this area of big data tools seems poised for a dramatic takeoff. IDC has predicted,
"By 2018, 75 percent of enterprise and ISV development will include cognitive/AI or machine
learning functionality in at least one application, including all business analytics tools."
Leading AI vendors with tools related to big data include Google, IBM, Microsoft and Amazon
Web Services, and dozens of small startups are developing AI technology (and getting acquired
by the larger technology vendors).
Streaming analytics tools analyze data in motion, as it is generated, rather than after it has been stored. Several vendors offer products that promise streaming analytics capabilities. They include IBM, Software AG, SAP, TIBCO, Oracle, DataTorrent, SQLstream, Cisco, Informatica, and others. MarketsandMarkets believes that streaming analytics solutions brought in $3.08 billion in revenue in 2016, which could increase to $13.70 billion by 2021.
Edge computing systems analyze data close to where it is created, at the "edge" of the network, instead of sending everything to a central data center. The advantage of an edge computing system is that it reduces the amount of information that must be transmitted over the network, thus reducing network traffic and related costs. It also decreases demands on data centers or cloud computing facilities, freeing up capacity for other workloads and eliminating a potential single point of failure.
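A minimal sketch of that idea in Python: an edge device aggregates raw sensor readings locally and transmits only a compact summary. The readings and the send_to_cloud function are invented placeholders.

# Edge-computing sketch: aggregate raw readings locally, transmit only the summary.
from statistics import mean

def send_to_cloud(payload):
    # Stand-in for a real network call; printing keeps the sketch self-contained.
    print("sending to data center:", payload)

raw_readings = [21.4, 21.6, 21.5, 35.2, 21.7, 21.5]   # one window of temperature samples

# Instead of transmitting every sample, send one compact summary per window.
summary = {
    "avg": round(mean(raw_readings), 2),
    "max": max(raw_readings),
    "count": len(raw_readings),
}
send_to_cloud(summary)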
While the market for edge computing, and more specifically for edge computing analytics, is still
developing, some analysts and venture capitalists have begun calling the technology the "next
big thing."
17. Blockchain
Also a favorite with forward-looking analysts and venture capitalists, blockchain is the
distributed database technology that underlies Bitcoin digital currency. The unique feature of a
blockchain database is that once data has been written, it cannot be deleted or changed after the
fact. In addition, it is highly secure, which makes it an excellent choice for big data applications
in sensitive industries like banking, insurance, health care, retail and others.
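As a rough illustration of the write-once property described above (a toy sketch, not any vendor's implementation), the following Python code chains records with hashes so that altering an earlier entry invalidates everything that follows.

# Toy hash chain illustrating why written blocks cannot be quietly changed.
import hashlib, json

def make_block(data, prev_hash):
    block = {"data": data, "prev_hash": prev_hash}
    block["hash"] = hashlib.sha256(json.dumps(block, sort_keys=True).encode()).hexdigest()
    return block

chain = [make_block("genesis", "0" * 64)]
chain.append(make_block("payment: A -> B, 100", chain[-1]["hash"]))
chain.append(make_block("payment: B -> C, 40", chain[-1]["hash"]))

# Verification: recompute each block's hash and check the links.
def is_valid(chain):
    for i, block in enumerate(chain):
        body = {"data": block["data"], "prev_hash": block["prev_hash"]}
        expected = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
        if block["hash"] != expected or (i > 0 and block["prev_hash"] != chain[i - 1]["hash"]):
            return False
    return True

print(is_valid(chain))                        # True
chain[1]["data"] = "payment: A -> B, 1000"    # tamper with history
print(is_valid(chain))                        # False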
Blockchain technology is still in its infancy and use cases are still developing. However, several
vendors, including IBM, AWS, Microsoft and multiple startups, have rolled out experimental or
introductory solutions built on blockchain technology. Blockchain is distributed ledger
technology that offers great potential for data analytics.