BDA Unit I
What is Data?
Data refers to quantities, characters, or symbols on which operations are performed by a computer, and which may be stored and transmitted in the form of electrical signals and recorded on magnetic, optical, or mechanical recording media.
While many organizations boast of having good data or improving the quality of their data, the
real challenge is defining what those qualities represent. What some consider good quality others
might view as poor. Judging the quality of data requires an examination of its characteristics and
then weighing those characteristics according to what is most important to the organization and
the application(s) for which they are being used.
The seven characteristics that define data quality are:
1. Accuracy and Precision
2. Legitimacy and Validity
3. Reliability and Consistency
4. Timeliness and Relevance
5. Completeness and Comprehensiveness
6. Availability and Accessibility
7. Granularity and Uniqueness
Accuracy and Precision: This characteristic refers to the exactness of the data. The data cannot have any erroneous elements and must convey the correct message without being misleading. Accuracy and precision also have a component that relates to the data's intended use. Without understanding how the data will be consumed, ensuring accuracy and precision could be off-target or more costly than necessary. For example, accuracy in healthcare might be more important than in another industry (which is to say, inaccurate data in healthcare could have more serious consequences) and is therefore justifiably worth higher levels of investment.
Legitimacy and Validity: Requirements governing data set the boundaries of this characteristic.
For example, on surveys, items such as gender, ethnicity, and nationality are typically limited to
a set of options and open answers are not permitted. Any answers other than these would not be
considered valid or legitimate based on the survey’s requirement. This is the case for most data
and must be carefully considered when determining its quality. The people in
each department in an organization understand what data is valid or not to them, so the
requirements must be leveraged when evaluating data quality.
Reliability and Consistency: Many systems in today’s environments use and/or collect the
same source data. Regardless of what source collected the data or where it resides, it cannot
contradict a value residing in a different source or collected by a different system. There must be
a stable and steady mechanism that collects and stores the data without contradiction or
unwarranted variance.
Timeliness and Relevance: There must be a valid reason to collect the data to justify the effort
required, which also means it has to be collected at the right moment in time. Data collected too
soon or too late could misrepresent a situation and drive inaccurate decisions.
Completeness and Comprehensiveness: Incomplete data is as dangerous as inaccurate data.
Gaps in data collection lead to a partial view of the overall picture to be displayed. Without a
complete picture of how operations are running, uninformed actions will occur. It’s important to
understand the complete set of requirements that constitute a comprehensive set of data to
determine whether or not the requirements are being fulfilled.
Availability and Accessibility: This characteristic can be tricky at times due to legal and
regulatory constraints. Regardless of the challenge, though, individuals need the right level of
access to the data in order to perform their jobs. This presumes that the data exists and is
available for access to be granted.
Granularity and Uniqueness: The level of detail at which data is collected is important, because confusion and inaccurate decisions can otherwise occur. Aggregated, summarized, and manipulated collections of data could convey a different meaning than the data implied at a lower level. An appropriate level of granularity must be defined so that sufficient uniqueness and distinctive properties become visible. This is a requirement for operations to function effectively.
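To make a few of these characteristics concrete, here is a minimal Python sketch (not part of any particular product) that runs simple completeness, validity, and uniqueness checks over a handful of records; the field names and allowed values are invented for illustration.

# Simple data-quality checks illustrating completeness, validity, and uniqueness.
# Field names and allowed values are illustrative only.
records = [
    {"id": 1, "name": "Asha", "gender": "Female", "age": 29},
    {"id": 2, "name": "Ravi", "gender": "M", "age": None},      # invalid gender, missing age
    {"id": 2, "name": "Ravi", "gender": "Male", "age": 34},     # duplicate id
]

VALID_GENDERS = {"Male", "Female", "Other"}   # the allowed options (validity requirement)

complete = [r for r in records if all(v is not None for v in r.values())]
valid = [r for r in records if r["gender"] in VALID_GENDERS]
unique_ids = len({r["id"] for r in records}) == len(records)

print(f"complete records: {len(complete)}/{len(records)}")
print(f"valid records:    {len(valid)}/{len(records)}")
print(f"ids unique:       {unique_ids}")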
Characteristics of Big Data
Big Data is commonly described by four characteristics (the four V's), namely:
Volume: the amount of data that businesses can collect is really enormous and hence the volume
of the data becomes a critical factor in Big Data analytics.
Velocity: the rate at which new data is generated, driven by our dependence on the internet, sensors, and machine-to-machine communication, is also important; Big Data must be parsed in a timely manner.
Variety: the data that is generated is completely heterogeneous in the sense that it could be in
various formats like video, text, database, numeric, sensor data and so on and hence
understanding the type of Big Data is a key factor to unlocking its value.
Veracity: refers to the inconsistency or uncertainty of data, i.e., knowing whether the available data comes from a credible source is of utmost importance before deciphering and implementing Big Data for business needs.
Unstructured
Any data with an unknown form or structure is classified as unstructured data. In addition to its huge size, unstructured data poses multiple challenges in terms of processing it to derive value. A typical example of unstructured data is a heterogeneous data source containing a combination of simple text files, images, videos, etc. Nowadays, organizations have a wealth of data available to them but unfortunately do not know how to derive value from it, since this data is in its raw, unstructured form.
Examples of Un-structured Data
The output returned by 'Google Search' and the data being used by Twitter, Facebook and other
social media in the form of posts
Semi-structured
Semi-structured data can contain elements of both structured and unstructured data. Semi-structured data appears structured in form, but it is not actually defined by, for example, a table definition in a relational DBMS. A typical example of semi-structured data is data represented in an XML file.
Examples of Semi-structured Data
Personal data stored in an XML file-
<rec><name>Prashant Rao</name><sex>Male</sex><age>35</age></rec>
<rec><name>Seema R.</name><sex>Female</sex><age>41</age></rec>
<rec><name>Satish Mane</name><sex>Male</sex><age>29</age></rec>
<rec><name>Subrato Roy</name><sex>Male</sex><age>26</age></rec>
<rec><name>Jeremiah J.</name><sex>Male</sex><age>35</age></rec>
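As an illustration of how such semi-structured records can be processed, the following minimal Python sketch parses a few of the records above with the standard library; a wrapping <records> root element is added only so the fragment is well-formed XML.

# Parse the semi-structured XML records above using Python's standard library.
import xml.etree.ElementTree as ET

xml_data = """<records>
<rec><name>Prashant Rao</name><sex>Male</sex><age>35</age></rec>
<rec><name>Seema R.</name><sex>Female</sex><age>41</age></rec>
<rec><name>Satish Mane</name><sex>Male</sex><age>29</age></rec>
</records>"""

root = ET.fromstring(xml_data)
for rec in root.findall("rec"):
    # The structure is carried by the tags themselves, not by a table definition.
    print(rec.find("name").text, rec.find("sex").text, rec.find("age").text)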
Difference between traditional data and big data:
• Volume: traditional data typically ranges from gigabytes to terabytes, while Big Data runs to petabytes and beyond.
• Variety: traditional data is mostly structured, while Big Data includes structured, semi-structured, and unstructured data.
• Architecture: traditional data is managed in a centralized relational database, while Big Data is stored and processed in a distributed fashion across clusters of machines.
• Schema: traditional systems apply a fixed schema before loading (schema on write), while Big Data systems often apply the schema when the data is read (schema on read).
Advantages of Big Data:
• Understand customer needs better: Through effective analysis of big data, a company can plan better for customer satisfaction and make the alterations needed to ensure loyalty and customer trust. A better customer experience definitely impacts growth. Complaint resolution, 24×7 customer service, interactive websites, and consistent gathering of customer feedback are some of the new measures that have made big data analytics very popular and helpful to companies.
• Work on bettering company reputation: Big Data tools can analyze sentiment, both negative and positive. This analysis can help correct false rumors, better serve customer needs, and maintain the company's image through its online presence, which eventually helps the company's reputation.
• Promotes cost-saving measures: Though the initial costs of deploying Big Data analytics are high, the returns and gainful insights more than pay for themselves. Big Data analytics also enables constant monitoring and better risk management, and frees up IT infrastructure personnel, which translates into fewer personnel required. Besides this, Big Data tools can be used to store data more effectively. Thus the costs are outweighed by the savings.
• Makes data available: Modern Big Data tools can present the required portions of data in real time, in a structured and easily readable format.
EVOLUTION OF BIG DATA
The Foundations of Big Data
Data became a problem for the U.S. Census Bureau in 1880. They estimated it would take eight
years to handle and process the data collected during the 1880 census, and predicted the data
from the 1890 census would take more than 10 years to process. Fortunately, in 1881, a young
man working for the bureau, named Herman Hollerith, created the Hollerith Tabulating Machine.
His invention was based on the punch cards designed for controlling the patterns woven by
mechanical looms. His tabulating machine reduced ten years of labor into three months of labor.
In 1927, Fritz Pfleumer, an Austrian-German engineer, developed a means of storing information magnetically on tape. Pfleumer had devised a method for adhering metal stripes to cigarette papers (to keep a smoker's lips from being stained by the rolling papers available at the
time), and decided he could use this technique to create a magnetic strip, which could then be
used to replace wire recording technology. After experiments with a variety of materials, he
settled on a very thin paper, striped with iron oxide powder and coated with lacquer, for his
patent in 1928.
During World War II (more specifically 1943), the British, desperate to crack Nazi codes,
invented a machine that scanned for patterns in messages intercepted from the Germans. The
machine was called Colossus, and scanned 5,000 characters a second, reducing the workload
from weeks to merely hours. Colossus was the first data processor. Two years later, in 1945, John
Von Neumann published a paper on the Electronic Discrete Variable Automatic
Computer (EDVAC), the first “documented” discussion on program storage, and laid the
foundation of computer architecture today.
It is said these combined events prompted the “formal” creation of the United States’ NSA
(National Security Agency), by President Truman, in 1952. Staff at the NSA were assigned the
task of decrypting messages intercepted during the Cold War. Computers of this time had
evolved to the point where they could collect and process data, operating independently and
automatically.
The Internet Effect and Personal Computers
ARPANET began on Oct 29, 1969, when a message was sent from UCLA’s host computer to
Stanford’s host computer. It received funding from the Advanced Research Projects Agency
(ARPA), a subdivision of the Department of Defense. Generally speaking, the public was not
aware of ARPANET. In 1973, it connected with a transatlantic satellite, linking it to the
Norwegian Seismic Array. However, by 1989, the infrastructure of ARPANET had started to
age. The system wasn’t as efficient or as fast as newer networks. Organizations using ARPANET
started moving to other networks, such as NSFNET, to improve basic efficiency and speed. In
1990, the ARPANET project was shut down, due to a combination of age and obsolescence. The
creation of ARPANET led directly to the Internet.
In 1965, the U.S. government built the first data center, with the intention of storing millions of
fingerprint sets and tax returns. Each record was transferred to magnetic tape and was to be stored in a central location. Conspiracy theorists expressed their fears, and the project
was closed. However, in spite of its closure, this initiative is generally considered the first effort
at large scale data storage.
Personal computers came on the market in 1977, when microcomputers were introduced, and
became a major stepping stone in the evolution of the internet, and subsequently, Big Data. A
personal computer could be used by a single individual, as opposed to mainframe computers,
which required an operating staff, or some kind of time-sharing system, with one large processor
being shared by multiple individuals. After the introduction of the microprocessor, prices for personal computers dropped significantly, and they came to be described as "an affordable consumer good." Many of the early personal computers were sold as electronic kits, designed to be built by
hobbyists and technicians. Eventually, personal computers would provide people worldwide with
access to the internet.
In 1989, a British Computer Scientist named Tim Berners-Lee came up with the concept of the
World Wide Web. The Web is a place/information-space where web resources are recognized
using URLs, interlinked by hypertext links, and is accessible via the Internet. His system also
allowed for the transfer of audio, video, and pictures. His goal was to share information on the
Internet using a hypertext system. By the fall of 1990, Tim Berners-Lee, working for CERN, had
written the three basic technologies that are the foundation of today's web:
• HTML: HyperText Markup Language. The formatting language of the web.
• URL: Uniform Resource Locator. A unique "address" used to identify each resource on the web. It is also called a URI (Uniform Resource Identifier).
• HTTP: Hypertext Transfer Protocol. Used for retrieving linked resources from all across the web.
In 1993, CERN announced the World Wide Web would be free for everyone to develop and use.
The free part was a key factor in the effect the Web would have on the people of the world. (It’s
the companies providing the “internet connection” that charge us a fee).
The Internet of Things (IoT)
The concept of Internet of Things was assigned its official name in 1999. By 2013, the IoT had
evolved to include multiple technologies, using the Internet, wireless communications, micro
electromechanical systems (MEMS), and embedded systems. All of these transmit data about the
person using them. Automation (including buildings and homes), GPS, and others, support the
IoT.
The Internet of Things, unfortunately, can make computer systems vulnerable to hacking. In
October of 2016, hackers crippled major portions of the Internet using the IoT. The early
response has been to develop Machine Learning and Artificial Intelligence focused on security
issues.
Computing Power and Internet Growth
There was an incredible amount of internet growth in the 1990s, and personal computers became
steadily more powerful and more flexible. Internet growth was driven by Tim Berners-Lee's efforts, CERN's decision to provide free access, and access to individual personal computers.
In 2005, Big Data, which had been used without a name, was labeled by Roger Mougalas. He
was referring to a large set of data that, at the time, was almost impossible to manage and process
using the traditional business intelligence tools available. Additionally, Hadoop, which could
handle Big Data, was created in 2005. Hadoop grew out of an open-source software framework called Nutch and incorporated Google's MapReduce programming model. Hadoop is an open-source
software framework, and can process structured and unstructured data, from almost all digital
sources. Because of this flexibility, Hadoop (and its sibling frameworks) can process Big Data.
Challenges of Big Data
• Capturing data
• Curation (process of selecting, organizing, and looking after the items in a collection)
• Storage
• Searching
• Sharing
• Transfer
• Analysis/Processing
• Presentation
Data Analytics
Firms commonly apply analytics to business data to describe, predict, and improve business performance. Areas within analytics include predictive analytics, enterprise decision management, etc. Since analytics can require extensive computation (because of big data), the algorithms and software used for analytics harness the most current methods in computer science. In a nutshell, analytics is the scientific process of transforming data into insight for making better decisions. The goal of Data Analytics is to get actionable insights resulting in smarter decisions and better business outcomes.
It is critical to design and build a data warehouse or Business Intelligence (BI) architecture that provides a flexible, multi-faceted analytical ecosystem, optimized for efficient ingestion and analysis of large and diverse data sets. Data analytics is commonly divided into four types:
• Predictive (forecasting)
• Descriptive (business intelligence and data mining)
• Prescriptive (optimization and simulation)
• Diagnostic analytics
Predictive Analytics: Predictive analytics turns data into valuable, actionable information. Predictive analytics uses data to determine the probable outcome of an event or the likelihood of a situation occurring.
Predictive analytics draws on a variety of statistical techniques from modeling, machine learning, data mining, and game theory that analyze current and historical facts to make predictions about future events (a minimal example follows the list below).
There are three basic cornerstones of predictive analytics-
• Predictive modeling
• Decision Analysis and optimization
• Transaction profiling
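As a minimal illustration of predictive modeling (the first cornerstone above), the following Python sketch fits a model on historical observations and predicts a future outcome. It assumes scikit-learn is installed; the figures are invented for the example.

# Minimal predictive-modeling sketch: fit on historical data, predict a future outcome.
from sklearn.linear_model import LinearRegression

# Historical facts: monthly ad spend (in $1000s) and resulting sales (in $1000s).
ad_spend = [[10], [15], [20], [25], [30]]   # feature matrix: one feature per row
sales = [105, 150, 192, 248, 295]           # observed outcomes

model = LinearRegression()
model.fit(ad_spend, sales)                  # learn the relationship from history

# Predict the probable outcome for a planned spend of $35k.
predicted_sales = model.predict([[35]])
print(f"Predicted sales for $35k ad spend: {predicted_sales[0]:.1f}k")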
Descriptive Analytics: Descriptive analytics looks at data and analyzes past events for insight into how to approach future events. It looks at past performance and understands it by mining historical data to find the causes of success or failure in the past. Almost all management reporting, such as sales, marketing, operations, and finance, uses this type of analysis.
A descriptive model quantifies relationships in data in a way that is often used to classify customers or prospects into groups. Unlike a predictive model that focuses on predicting the behavior of a single customer, descriptive analytics identifies many different relationships between customers and products.
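A minimal descriptive-analytics sketch in Python, assuming pandas is installed; the records are invented for the example. It mines a small set of historical records to summarize past performance by group.

# Minimal descriptive-analytics sketch: summarize past performance from historical data.
import pandas as pd

history = pd.DataFrame({
    "region":  ["North", "North", "South", "South", "West"],
    "quarter": ["Q1", "Q2", "Q1", "Q2", "Q1"],
    "sales":   [120, 135, 90, 80, 150],
})

# Mine the historical data: total and average sales per region.
summary = history.groupby("region")["sales"].agg(["sum", "mean"])
print(summary)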
Prescriptive Analytics: Prescriptive analytics goes beyond predicting what is likely to happen and uses optimization and simulation to recommend the actions that should be taken. For example, prescriptive analytics can benefit healthcare strategic planning by using analytics to leverage operational and usage data combined with data on external factors such as economic data, population demographics, etc.
Diagnostic Analytics: In this analysis, we generally use historical data over other data to answer a question or to find the cause of a problem. We try to find dependencies and patterns in the historical data of the particular problem.
For example, companies go for this analysis because it gives great insight into a problem, and they keep detailed information at their disposal; otherwise, data collection would have to be repeated for every individual problem, which would be very time-consuming.
2. Spark
Apache Spark is part of the Hadoop ecosystem, but its use has become so widespread that it
deserves a category of its own. It is an engine for processing big data within Hadoop, and it's up
to one hundred times faster than the standard Hadoop engine, MapReduce. In the AtScale 2016
Big Data Maturity Survey, 25 percent of respondents said that they had already deployed Spark
in production, and 33 percent more had Spark projects in development. Clearly, interest in the
technology is sizable and growing, and many vendors with Hadoop offerings also offer
Spark-based products.
It has the following features (a usage sketch follows the list):
• Speed
• Supports multiple languages
• Supports SQL queries, streaming data, machine learning, and graph algorithms, in addition to MapReduce-style processing
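For illustration, here is a minimal PySpark sketch of a classic word count. It assumes pyspark is installed and that a local text file named input.txt exists; both the application name and the file name are placeholders.

# Minimal PySpark sketch: word count on a text file.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("WordCount").getOrCreate()

lines = spark.read.text("input.txt").rdd.map(lambda row: row[0])
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

for word, count in counts.take(10):   # show the first ten word counts
    print(word, count)

spark.stop()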
3. R
R, another open source project, is a programming language and software environment designed
for working with statistics. It is managed by the R Foundation and available under the GPL 2
license. Many popular integrated development environments (IDEs), including Eclipse and
Visual Studio, support the language.
R is also a software environment that supports data manipulation and displaying data graphically.
The R environment includes:
• An effective facility for data storage and handling
• A large collection of tools for data analysis
• A well-developed, simple, and effective programming language
Several organizations that rank the popularity of various programming languages say that R has
become one of the most popular languages in the world. For example, the IEEE says that R is the
fifth most popular programming language, and both Tiobe and RedMonk rank it 14th. This is
significant because the programming languages near the top of these charts are usually general
purpose languages that can be used for many different kinds of work. For a language that is used
almost exclusively for big data projects to be so near the top demonstrates the significance of big
data and the importance of this language in its field.
4. Data Lakes
To make it easier to access their vast stores of data, many enterprises are setting up data lakes.
These are huge data repositories that collect data from many different sources and store it in its
natural state. This is different than a data warehouse, which also collects data from disparate
sources, but processes it and structures it for storage. In this case, the lake and warehouse
metaphors are fairly accurate. If data is like water, a data lake is natural and unfiltered like a
body of water, while a data warehouse is more like a collection of water bottles stored on
shelves.
Data lakes are particularly attractive when enterprises want to store data but aren't yet sure how
they might use it. A lot of Internet of Things (IoT) data might fit into that category, and the IoT
trend is playing into the growth of data lakes.
Market research firms predict that data lake revenue will grow from $2.53 billion in 2016 to $8.81 billion by 2021.
A Data Lake is a lake of unstructured and structured data.
The data lake supports the following capabilities:
• To capture and store raw data at scale for a low cost
• To store many types of data in the same repository
• To perform transformations on the data
• To define the structure of the data at the time it is used, referred to as schema on read (a minimal sketch of this idea follows)
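A minimal Python sketch of the schema-on-read idea: raw records are written to the lake as-is, and a structure is imposed only when a consumer reads them. The file name and fields are invented for illustration.

# Schema on read: store raw JSON lines as-is, apply a structure only when reading.
import json

# Write raw, heterogeneous records to the "lake" without enforcing a schema.
raw_records = [
    '{"sensor": "t1", "temp": 21.5, "ts": "2021-01-01T10:00:00"}',
    '{"sensor": "t2", "humidity": 40, "ts": "2021-01-01T10:00:05"}',
]
with open("lake_raw.jsonl", "w") as f:
    f.write("\n".join(raw_records))

# Later, a consumer defines the structure it needs at read time.
with open("lake_raw.jsonl") as f:
    readings = [json.loads(line) for line in f]
temps = [(r["sensor"], r["temp"]) for r in readings if "temp" in r]
print(temps)   # [('t1', 21.5)]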
5. NoSQL Databases
Traditional relational database management systems (RDBMS) store information in structured,
defined columns and rows. Developers and database administrators query, manipulate and
manage the data in those RDBMSes using a special language known as SQL.
NoSQL databases specialize in storing unstructured data and providing fast performance,
although they don't provide the same level of consistency as RDBMSes. Popular NoSQL
databases include MongoDB, Redis, Cassandra, Couchbase and many others; even the leading
RDBMS vendors like Oracle and IBM now also offer NoSQL databases.
• A DBMS where MapReduce-style processing is used for queries instead of manual programming to iterate over entire data sets, e.g., Hadoop, MongoDB.
• A MapReduce engine with a small query language on top that is not complete SQL, e.g., Hive on top of Hadoop provides HiveQL.
• A DBMS with a new query language for new applications, e.g., Virtuoso, Neo4j, Amos II.
• Other non-relational databases, including object stores.
MongoDB is one of the best-known NoSQL databases.
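For illustration, a minimal pymongo sketch that stores and queries schema-less documents. It assumes the pymongo driver is installed and a MongoDB server is running on localhost:27017; the database and collection names are placeholders.

# Minimal MongoDB sketch using pymongo: store and query schema-less documents.
from pymongo import MongoClient

client = MongoClient("localhost", 27017)
db = client["demo_db"]            # database and collection names are illustrative
people = db["people"]

# Documents need no predefined schema; fields can vary per record.
people.insert_one({"name": "Prashant Rao", "sex": "Male", "age": 35})
people.insert_one({"name": "Seema R.", "sex": "Female", "age": 41, "city": "Pune"})

# Query by field value.
for doc in people.find({"sex": "Male"}):
    print(doc["name"], doc["age"])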
6. Hive
This is a distributed data management layer for Hadoop. It supports an SQL-like query language, HiveQL (HQL), to access big data. It is primarily used for data mining purposes and runs on top of Hadoop.
7. Sqoop
This is a tool that connects Hadoop with various relational databases to transfer data. This can be
effectively used to transfer structured data to Hadoop or Hive.
8. Presto
Facebook has developed and recently open-sourced its Query engine (SQL-on-Hadoop) named
Presto which is built to handle petabytes of data. Unlike Hive, Presto does not depend on
MapReduce technique and can quickly retrieve data.
9. Apache Pig
Pig is a high-level scripting platform commonly used with Apache Hadoop to analyze large data sets. The Pig platform offers a special scripting language known as Pig Latin to developers who are already familiar with other scripting languages and with query languages like SQL.
The major benefit of Pig is that it works with data obtained from various sources and stores the results in HDFS (Hadoop Distributed File System). Programmers write scripts in the Pig Latin language, which are then converted into Map and Reduce tasks by the Pig Engine component.
In-memory databases keep data in a system's main memory (RAM) rather than on disk, which dramatically speeds up processing. Many of the leading enterprise software vendors, including SAP, Oracle, Microsoft, and IBM, now offer in-memory database technology. In addition, several smaller companies like Teradata, Tableau, VoltDB, and DataStax offer in-memory database solutions. Research from MarketsandMarkets estimates that total sales of in-memory technology were $2.72 billion in 2016 and may grow to $6.58 billion by 2021.
12. Big Data Security Solutions
Because big data repositories present an attractive target to hackers and advanced persistent
threats, big data security is a large and growing concern for enterprises. In the AtScale survey,
security was the second fastest-growing area of concern related to big data.
According to the IDG report, the most popular types of big data security solutions include
identity and access controls (used by 59 percent of respondents), data encryption (52 percent)
and data segregation (42 percent). Dozens of vendors offer big data security solutions, and
Apache Ranger, an open source project from the Hadoop ecosystem, is also attracting growing
attention.
13. Big Data Governance Solutions
Closely related to the idea of security is the concept of governance. Data governance is a broad
topic that encompasses all the processes related to the availability, usability and integrity of data.
It provides the basis for making sure that the data used for big data analytics is accurate and
appropriate, as well as providing an audit trail so that business analysts or executives can see
where data originated.
In the NewVantage Partners survey, 91.8 percent of the Fortune 1000 executives surveyed said
that governance was either critically important (52.5 percent) or important (39.3 percent) to their
big data initiatives. Vendors offering big data governance tools include Collibra, IBM, SAS,
Informatica, Adaptive and SAP.
The standard definition of machine learning is that it is technology that gives "computers the
ability to learn without being explicitly programmed." In big data analytics, machine learning
technology allows systems to look at historical data, recognize patterns, build models and predict
future outcomes. It is also closely associated with predictive analytics.
Deep learning is a type of machine learning technology that relies on artificial neural networks
and uses multiple layers of algorithms to analyze data. As a field, it holds a lot of promise for
allowing analytics tools to recognize the content in images and videos and then process it
accordingly.
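As a rough, small-scale illustration of the multi-layer idea (a toy sketch, not a full deep learning framework), the following Python code trains a neural network with two hidden layers using scikit-learn; the tiny data set is invented for the example.

# Toy multi-layer neural network sketch (scikit-learn assumed installed).
from sklearn.neural_network import MLPClassifier

# Invented historical data: [hours_online, purchases_last_month] -> churned (1) or not (0).
X = [[1, 0], [2, 1], [8, 5], [9, 6], [1, 1], [10, 7]]
y = [1, 1, 0, 0, 1, 0]

# Two hidden layers, echoing the "multiple layers" described above.
model = MLPClassifier(hidden_layer_sizes=(16, 8), max_iter=2000, random_state=0)
model.fit(X, y)

print(model.predict([[3, 1], [7, 5]]))   # predict outcomes for new customers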
Experts say this area of big data tools seems poised for a dramatic takeoff. IDC has predicted,
"By 2018, 75 percent of enterprise and ISV development will include cognitive/AI or machine
learning functionality in at least one application, including all business analytics tools."
Leading AI vendors with tools related to big data include Google, IBM, Microsoft and Amazon
Web Services, and dozens of small startups are developing AI technology (and getting acquired
by the larger technology vendors).
Streaming analytics tools analyze data in motion, as it is generated, rather than after it has been stored. Several vendors offer products that promise streaming analytics capabilities. They include IBM, Software AG, SAP, TIBCO, Oracle, DataTorrent, SQLstream, Cisco, Informatica, and others. MarketsandMarkets believes that streaming analytics solutions brought in $3.08 billion in revenue in 2016, which could increase to $13.70 billion by 2021.
Edge computing systems analyze data close to where it is created, at the "edge" of the network, instead of sending everything to a central data center. The advantage of an edge computing system is that it reduces the amount of information that must be transmitted over the network, thus reducing network traffic and related costs. It also decreases demands on data centers or cloud computing facilities, freeing up capacity for other workloads and eliminating a potential single point of failure.
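A minimal sketch of that idea in Python: an edge device aggregates raw sensor readings locally and transmits only a compact summary. The readings and the send_to_cloud function are invented placeholders.

# Edge-computing sketch: aggregate raw readings locally, transmit only the summary.
from statistics import mean

def send_to_cloud(payload):
    # Stand-in for a real network call; printing keeps the sketch self-contained.
    print("sending to data center:", payload)

raw_readings = [21.4, 21.6, 21.5, 35.2, 21.7, 21.5]   # one window of temperature samples

# Instead of transmitting every sample, send one compact summary per window.
summary = {
    "avg": round(mean(raw_readings), 2),
    "max": max(raw_readings),
    "count": len(raw_readings),
}
send_to_cloud(summary)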
While the market for edge computing, and more specifically for edge computing analytics, is still
developing, some analysts and venture capitalists have begun calling the technology the "next
big thing."
17. Blockchain
Also a favorite with forward-looking analysts and venture capitalists, blockchain is the
distributed database technology that underlies Bitcoin digital currency. The unique feature of a
blockchain database is that once data has been written, it cannot be deleted or changed after the
fact. In addition, it is highly secure, which makes it an excellent choice for big data applications
in sensitive industries like banking, insurance, health care, retail and others.
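As a rough illustration of the write-once property described above (a toy sketch, not any vendor's implementation), the following Python code chains records with hashes so that altering an earlier entry invalidates everything that follows.

# Toy hash chain illustrating why written blocks cannot be quietly changed.
import hashlib, json

def make_block(data, prev_hash):
    block = {"data": data, "prev_hash": prev_hash}
    block["hash"] = hashlib.sha256(json.dumps(block, sort_keys=True).encode()).hexdigest()
    return block

chain = [make_block("genesis", "0" * 64)]
chain.append(make_block("payment: A -> B, 100", chain[-1]["hash"]))
chain.append(make_block("payment: B -> C, 40", chain[-1]["hash"]))

# Verification: recompute each block's hash and check the links.
def is_valid(chain):
    for i, block in enumerate(chain):
        body = {"data": block["data"], "prev_hash": block["prev_hash"]}
        expected = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
        if block["hash"] != expected or (i > 0 and block["prev_hash"] != chain[i - 1]["hash"]):
            return False
    return True

print(is_valid(chain))                        # True
chain[1]["data"] = "payment: A -> B, 1000"    # tamper with history
print(is_valid(chain))                        # False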
Blockchain technology is still in its infancy and use cases are still developing. However, several
vendors, including IBM, AWS, Microsoft and multiple startups, have rolled out experimental or
introductory solutions built on blockchain technology. Blockchain is distributed ledger
technology that offers great potential for data analytics.