
INTRODUCTION TO BIG DATA

Data is created constantly, and at an ever-increasing rate. Mobile phones,
social media, imaging technologies used to determine a medical diagnosis:
all these and more create new data that must be stored somewhere for some
purpose.

Several industries have led the way in developing their ability to gather
and exploit data:
● Credit card companies monitor every purchase their customers make
and can identify fraudulent purchases with a high degree of accuracy
using rules derived by processing billions of transactions.
● Mobile phone companies analyze subscribers’ calling patterns to
determine, for example, whether a caller’s frequent contacts are on a rival
network. If that rival network is offering an attractive promotion that
might cause the subscriber to defect, the mobile phone company can
proactively offer the subscriber an incentive to remain in her contract.
● For companies such as LinkedIn and Facebook, data itself is their
primary product. The valuations of these companies are derived heavily
from the data they gather and host, which gains more and more
intrinsic value as it grows.

Three attributes stand out as defining Big Data characteristics:


● Huge volume of data: Rather than thousands or millions of rows, Big
Data can be billions of rows and millions of columns.
● Complexity of data types and structures: Big Data reflects the variety
of new data sources, formats, and structures, including digital traces
being left on the web and other digital repositories for subsequent
analysis.
● Speed of new data creation and growth: Big Data can describe
high-velocity data, with rapid data ingestion and near-real-time analysis.
● Although the volume of Big Data tends to attract the most
attention, generally the variety and velocity of the data provide
a more relevant definition of Big Data. (Big Data is sometimes
described as having 3 Vs: volume, variety, and velocity.)
● Due to its size or structure, Big Data cannot be efficiently
analyzed using only traditional databases or methods.
● Big Data problems require new tools and technologies to store,
manage, and realize the business benefit.
● These new tools and technologies enable creation, manipulation,
and management of large datasets and the storage environments
that house them.
Among the many sources of the Big Data deluge, social media and
genetic sequencing are among the fastest-growing, and are examples
of nontraditional sources of data being used for analysis.
Classification of Digital Data
Digital data can be broadly classified into structured, semi-structured,
and unstructured data.
Digital Data: Structured Data | Semi-Structured Data | Unstructured Data

1. Unstructured data: This is data which does not conform to a
data model or is not in a form which can be used easily by a
computer program. About 80-90% of an organization's data is in
this format. For example: memos, chat rooms, PowerPoint
presentations, images, videos, letters, research papers, the body
of an email, etc.
2. Semi-structured data: This is data which does not conform to a
data model but has some structure. However, it is not in a form
which can be used easily by a computer program. For example:
emails, XML, and markup languages like HTML.

3. Structured data: This is data which is in an organized form
(e.g., in rows and columns) and can be easily used by a computer
program. Relationships exist between entities of data, such as
classes and their objects. Data stored in databases is an example
of structured data. (A small sketch contrasting the three forms
follows.)
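
As a minimal, purely illustrative sketch (the values and names below are
made up), the following Python snippet contrasts the three forms of
digital data:

    import json

    # Structured: a fixed schema of rows and columns (here, a list of tuples).
    rows = [
        ("C001", "Alice", 2500.00),   # (customer_id, name, balance)
        ("C002", "Bob", 1320.50),
    ]

    # Semi-structured: no rigid schema, but keys/tags give it some structure.
    order = json.loads('{"id": 17, "items": ["pen", "book"], "note": "gift wrap"}')

    # Unstructured: free text; a program cannot rely on any inherent structure.
    email_body = "Hi team, attached are the slides from yesterday's review."

    print(rows[0][1])               # direct access by position: the schema is known
    print(order["items"][0])        # access by key: the structure is self-describing
    print(len(email_body.split()))  # without text mining, only crude operations apply

A program can query the structured rows directly and can navigate the
semi-structured order by its keys, but it must fall back on the
text-analysis techniques discussed below to interpret the email body.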
How to deal with unstructured data?
Today, unstructured data constitutes approximately 80% of the data that is
being generated in any enterprise, and the balance is clearly shifting
further in favour of unstructured data. It is too big a percentage to be
ignored. So the next question is: "How does one deal with unstructured
data?"

A few ways of dealing with unstructured data are listed below. These
techniques are used to find patterns in, or to interpret, unstructured data.
● Data mining: Data mining deals with large data sets and uses
methods at the intersection of AI, machine learning, statistics, and
database systems to unearth consistent patterns in those data sets.
It is the analysis step of the "Knowledge Discovery in Databases"
(KDD) process. A few popular data mining algorithms are as follows:
● Association rule mining: Also called "market basket analysis" or
"affinity analysis", it is used to determine "what goes with what?":
when a customer buys a product, what other product are they likely
to purchase with it? (A rough sketch follows this list.)
● Regression analysis: It helps to model the relationship between
variables. The variable whose value is to be predicted is called
the dependent variable, and the variables used to predict it are
referred to as the independent variables.
● Collaborative filtering: It is about predicting a user's preference or
preferences based on the preferences of a group of similar users.
● Text analytics or text mining: It is the process of gleaning
high-quality and meaningful information from text. It includes tasks
such as text categorization, text clustering, sentiment analysis, etc.
● Natural language processing: It is related to the area of
human-computer interaction. It is about enabling computers to
understand human or natural language input.
● Noisy text analytics: It is the process of extracting structured or
semi-structured information from noisy unstructured data such as
chats, blogs, wikis, emails, text messages, etc.
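
As a rough, self-contained sketch of the idea behind association rule
mining (the transactions below are invented, and a real miner such as
Apriori would also handle itemsets larger than pairs), the following
Python code computes support and confidence for item pairs:

    from collections import Counter
    from itertools import combinations

    # Hypothetical market-basket transactions.
    baskets = [
        {"bread", "milk"},
        {"bread", "butter", "milk"},
        {"bread", "butter"},
        {"milk", "butter"},
        {"bread", "milk", "butter"},
    ]

    n = len(baskets)
    item_counts = Counter()
    pair_counts = Counter()

    for basket in baskets:
        item_counts.update(basket)                        # count single items
        pair_counts.update(combinations(sorted(basket), 2))  # count item pairs

    # For a rule A -> B: support = P(A and B), confidence = P(B given A).
    for (a, b), count in pair_counts.items():
        print(f"{a} -> {b}: support = {count / n:.2f}, "
              f"confidence = {count / item_counts[a]:.2f}")

A high-confidence pair such as bread -> milk reads as "customers who buy
bread are also likely to buy milk", which is exactly the "what goes with
what?" question above.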
Characteristics of DATA
Data has three key characteristics:
1. Composition: The composition of data deals with the structure
of data, that is, the sources of data, the granularity, the types,
and the nature of data as to whether it is static or real-time
streaming.
2. Condition: The condition of data deals with the state of data,
that is, “Can one use this data as is for analysis?” or “Does it
require cleansing for further enhancement and enrichment?”
3. Context: The context of data deals with "Where has this data
been generated?", "Why was this data generated?", "How sensitive
is this data?", "What are the events associated with this data?",
and so on.
Evolution of BIG DATA
● The 1970s and before was the era of mainframes. The data was
essentially primitive and structured.
● Relational databases evolved in the 1980s and 1990s. This was the
era of data-intensive applications.
● The World Wide Web (WWW) and the Internet of Things (IoT)
have led to an onslaught of structured, unstructured, and
multimedia data.
Definition of BIG DATA

Big Data is data whose scale, distribution, diversity, and/or timeliness
require the use of new technical architectures and analytics to enable
insights that unlock new sources of business value.

In other words, Big Data is high-volume, high-velocity, and high-variety
information assets that demand cost-effective, innovative forms of
information processing for enhanced insight and decision making.
For the sake of easy comprehension, we will look at the definition in three parts:

High-volume, high-velocity, high-variety data → Cost-effective, innovative forms of information processing → Enhanced insight & decision making

Data → Information → Actionable intelligence → Better decisions → Enhanced business value


Challenges with BIG DATA
Following are a few challenges with BIG DATA:
● Data today is growing at an exponential rate. Most of the data
that we have today was generated in the last 2-3 years, and this
high tide of data will continue to rise incessantly. The key
questions here are: "Will all this data be useful for analysis?",
"Do we work with all this data or a subset of it?", "How will
we separate the knowledge from the noise?", etc.
● Cloud computing and virtualization are here to stay. Cloud
computing is the answer to managing infrastructure for big
data as far as cost efficiency, elasticity, and easy
upgrading/downgrading are concerned. However, it complicates
the decision of whether to host big data solutions outside the
enterprise.
● The other challenge is to decide on the period of retention of big
data: just how long should one retain this data?
● Then, of course, there are other challenges with respect to
capture, storage, preparation, search, analysis, transfer, security,
and visualization of big data. Big data refers to data sets whose size
is typically beyond the storage capacity of traditional database
software tools. There is no explicit definition of how big a data
set should be for it to be considered "big data"; here we deal
with data that is simply too big.
● Data visualization is becoming popular as a separate discipline,
and we are short of business visualization experts by quite a number.
What is BIG DATA?
Big data is data that is big in volume, velocity and variety.
Volume: We have seen it grow from bits to bytes to petabytes and
exabytes.
Bits→Bytes→Kilobytes→Megabytes→Gigabytes→Terabytes
→Petabytes→Exabytes→Zettabytes→Yottabytes
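
As a quick illustration of this scale (using the binary convention of
1024 per step; the decimal convention uses 1000), a short Python loop
can print each unit's size in bytes:

    # Each unit in the chain is 1024 times the previous one (binary convention).
    units = ["byte", "kilobyte", "megabyte", "gigabyte", "terabyte",
             "petabyte", "exabyte", "zettabyte", "yottabyte"]

    for power, unit in enumerate(units):
        print(f"1 {unit} = 1024^{power} bytes = {1024 ** power:,} bytes")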
Velocity: We have moved from the days of batch processing to real
time processing.
Batch→Periodic→Near real time→Real-time processing
Variety: Variety deals with a wide range of data types and sources of
data. We will study this under three categories: Structured data,
semi-structured data and unstructured data.
Why BIG DATA?

More data → More accurate analysis → Greater confidence in decision making → Greater operational efficiencies, cost reduction, time reduction, new product development, optimized offerings, etc.
Business Intelligence (BI) Versus Big Data
● In a traditional BI environment, all of the enterprise's data is
housed in a central server, whereas in a big data environment
data resides in a distributed file system.
● In traditional BI, data is generally analyzed in an offline mode,
whereas big data is analyzed in both real-time and offline modes.
● Traditional BI is about structured data, and data is taken to the
processing functions (move data to code), whereas big data is about
variety: structured, semi-structured, and unstructured data, and
here the processing functions are taken to the data (move code to
data), as the sketch below illustrates.
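
As a toy sketch of the "move code to data" idea (everything below runs in
one process; in a real system such as Hadoop MapReduce the partitions
would live on separate nodes, and the data here is made up):

    # The data set is split into partitions; imagine each list on a different node.
    partitions = [
        ["big data is data", "data grows fast"],         # "node 1"
        ["variety velocity volume", "data has volume"],  # "node 2"
    ]

    def local_word_count(lines):
        """Runs where the data lives; only the small result travels back."""
        counts = {}
        for line in lines:
            for word in line.split():
                counts[word] = counts.get(word, 0) + 1
        return counts

    # The function (code) is shipped to each partition; the raw data never moves.
    partial = [local_word_count(p) for p in partitions]

    # A small reduce step merges the per-node results.
    total = {}
    for counts in partial:
        for word, c in counts.items():
            total[word] = total.get(word, 0) + c

    print(total["data"])  # 4

Moving a few kilobytes of code to terabytes of data is far cheaper than
moving the data to a central server, which is why distributed frameworks
adopt this pattern.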
Thank You
