WINSEM2022-23 MAT6015 ETH VL2022230506274 ReferenceMaterialI WedFeb1500 00 00IST2023 IntroductiontoBigData
Several industries have led the way in developing their ability to gather
and exploit data:
● Credit card companies monitor every purchase their customers make
and can identify fraudulent purchases with a high degree of accuracy
using rules derived by processing billions of transactions.
● Mobile phone companies analyze subscribers’ calling patterns to
determine, for example, whether a caller’s frequent contacts are on a rival
network. If that rival network is offering an attractive promotion that
might cause the subscriber to defect, the mobile phone company can
proactively offer the subscriber an incentive to remain in her contract.
● For companies such as LinkedIn and Facebook, data itself is their
primary product. The valuations of these companies are heavily derived
from the data they gather and host, which contains more and more
intrinsic value as the data grows.
Unstructured Data
The figure above lists a few ways of dealing with unstructured data. The
techniques below are used to find patterns in, or to interpret, unstructured data.
Data mining: Data mining deals with large data sets, applying methods
at the intersection of artificial intelligence, machine learning, statistics,
and database systems to unearth consistent patterns in those data sets. It is
the analysis step of the "knowledge discovery in databases" (KDD) process.
A few popular data mining techniques are as follows:
Association rule mining: Also called "market basket analysis" or
"affinity analysis", it is used to determine "what goes with what?":
when a customer buys one product, which other product are they
likely to purchase along with it?
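The support/confidence idea behind association rules can be sketched in a few lines. This is a minimal illustration, assuming a hypothetical toy set of transactions (the item names and data are invented for the example, not taken from the text):

```python
# Hypothetical toy transactions; each set is one "market basket".
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"bread", "milk", "butter"},
    {"bread", "milk"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in the itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """How often the rule holds among transactions where it applies."""
    joint = support(set(antecedent) | set(consequent), transactions)
    return joint / support(antecedent, transactions)

# Rule "bread -> milk": customers who buy bread also buy milk.
print(support({"bread", "milk"}, transactions))       # 3/5 = 0.6
print(confidence({"bread"}, {"milk"}, transactions))  # ~0.75
```

Real implementations (e.g. the Apriori algorithm) additionally prune the search over itemsets, since enumerating all combinations is infeasible for large catalogs.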
Regression analysis: It models the relationship between variables in
order to predict one from the others. The variable whose value is to
be predicted is called the dependent variable, and the variables used
to predict it are referred to as the independent variables.
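For a single independent variable, ordinary least squares reduces to two closed-form expressions. A minimal sketch, using hypothetical data (the spend-vs-sales framing is invented for illustration):

```python
def fit_line(xs, ys):
    """Ordinary least squares for one independent variable.

    Returns (slope, intercept) minimizing the sum of squared errors.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: advertising spend (independent) vs. sales (dependent).
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]          # lies exactly on y = 2x + 1
slope, intercept = fit_line(xs, ys)
print(slope, intercept)        # 2.0 1.0
```

With more than one independent variable the same idea generalizes to multiple regression, usually solved with linear-algebra routines rather than by hand.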
Collaborative filtering: It is about predicting a user's preferences
based on the preferences of a group of similar users.
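A user-based variant of this idea can be sketched as: find users similar to the target user, then take a similarity-weighted average of their ratings for the unseen item. The ratings, user names, and film identifiers below are all hypothetical:

```python
import math

# Hypothetical user-item ratings on a 1-5 scale.
ratings = {
    "alice": {"film_a": 5, "film_b": 3, "film_c": 4},
    "bob":   {"film_a": 4, "film_b": 3, "film_c": 5, "film_d": 5},
    "carol": {"film_a": 1, "film_b": 5, "film_c": 2, "film_d": 1},
}

def cosine_similarity(u, v):
    """Cosine similarity over the items both users have rated."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    norm_u = math.sqrt(sum(u[i] ** 2 for i in common))
    norm_v = math.sqrt(sum(v[i] ** 2 for i in common))
    return dot / (norm_u * norm_v)

def predict(user, item):
    """Similarity-weighted average of other users' ratings for the item."""
    sims = [(cosine_similarity(ratings[user], ratings[other]),
             ratings[other][item])
            for other in ratings
            if other != user and item in ratings[other]]
    total = sum(s for s, _ in sims)
    return sum(s * r for s, r in sims) / total if total else None

# Alice has not seen film_d; the prediction lands between carol's 1 and
# bob's 5, weighted toward bob because alice's tastes resemble his.
print(predict("alice", "film_d"))
```

Production recommenders refine this with rating normalization, item-based similarity, or matrix factorization, but the weighted-neighbour intuition is the same.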
Text analytics or text mining: It is the process of gleaning
high-quality, meaningful information from text. It includes tasks such
as text categorization, text clustering, sentiment analysis, etc.
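Sentiment analysis, one of the tasks listed above, can be illustrated with a lexicon-based sketch: count positive versus negative words and label the text by the sign of the score. The word lists here are hypothetical and far smaller than any real sentiment lexicon:

```python
# Tiny illustrative lexicons; real systems use thousands of weighted terms.
POSITIVE = {"good", "great", "excellent", "love", "happy"}
NEGATIVE = {"bad", "terrible", "awful", "hate", "sad"}

def sentiment(text):
    """Label text by counting positive vs. negative lexicon hits."""
    words = text.lower().split()
    score = (sum(w in POSITIVE for w in words)
             - sum(w in NEGATIVE for w in words))
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(sentiment("I love this great product"))  # positive
print(sentiment("the box arrived"))            # neutral
```

Modern sentiment systems replace the fixed lexicon with supervised classifiers, but the input/output contract (text in, polarity label out) is the same.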
Natural language processing: It is related to the area of
human-computer interaction. It is about enabling computers to
understand human (natural) language input.
Noisy text analytics: It is the process of extracting structured or
semi-structured information from noisy unstructured data such as
chats, blogs, wikis, emails, text messages, etc.
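Extracting structured fields from such noisy text is often done with pattern matching. A minimal sketch, assuming a hypothetical chat snippet and deliberately simple regular expressions (not production-grade extractors):

```python
import re

# Hypothetical noisy chat message with an email address and a phone number.
chat = "hey!! ping me at jane.doe@example.com or call 555-0142 thx :)"

# Simplified patterns for illustration only; real email/phone grammars
# are considerably more involved.
email_re = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
phone_re = re.compile(r"\b\d{3}-\d{4}\b")

emails = email_re.findall(chat)
phones = phone_re.findall(chat)
print(emails)  # ['jane.doe@example.com']
print(phones)  # ['555-0142']
```

The extracted fields can then be stored in a structured form (rows, columns) and analyzed with the conventional techniques described earlier.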
Characteristics of DATA
Data has three key characteristics:
1. Composition: The composition of data deals with the structure
of data, that is, the sources of data, the granularity, the types,
and the nature of data as to whether it is static or real-time
streaming.
2. Condition: The condition of data deals with the state of data,
that is, “Can one use this data as is for analysis?” or “Does it
require cleansing for further enhancement and enrichment?”
3. Context: The context of data deals with "Where has this data
been generated?", "Why was this data generated?", "How
sensitive is this data?", "What are the events associated with
this data?", and so on.
Evolution of BIG DATA
The 1970s and before were the era of mainframes. The data was
essentially primitive and structured.
Relational databases evolved in the 1980s and 1990s. This was the
era of data-intensive applications.
The World Wide Web (WWW) and the Internet of Things (IoT)
have since led to an onslaught of structured, unstructured, and
multimedia data.
Definition of BIG DATA
Big data is high-volume, high-velocity, and high-variety information
assets that demand cost-effective, innovative forms of information
processing for enhanced insight and decision making.