Data Science Vs Big Data


Fundamentals of Data Science (PCC-AI 405)


What is Big Data?
• Big data is defined as a complex and voluminous set of information
comprising structured, unstructured, and semi-structured datasets,
which is challenging to manage using traditional data processing
tools.
• It requires additional infrastructure to govern, analyze, and convert
into insights.
Types of Big Data:
Data sets are typically categorized into three types, based on their structure and how straightforward (or not) they are
to index.
1. Structured data: This kind of data is the simplest to organize and search. Examples include financial data, machine logs,
and demographic details. An Excel spreadsheet, with its layout of pre-defined columns and rows, is a
good way to envision structured data. Its components are easily categorized, allowing database designers
and administrators to define simple algorithms for search and analysis.
2. Unstructured data: This category of data can include things like social media posts, audio files, images,
and open-ended customer comments. This kind of data cannot be easily captured in standard row-column
relational databases. Traditionally, companies that wanted to search, manage, or analyze large
amounts of unstructured data had to use laborious manual processes. Instead of spreadsheets or
relational databases, unstructured data is usually stored in data lakes and NoSQL databases.
3. Semi-structured data: Semi-structured data is a hybrid of structured and unstructured data. E-mails are
a good example, as they include unstructured data in the body of the message as well as more
organized properties such as sender, recipient, subject, and date. Devices that use geo-tagging,
time stamps, or semantic tags can also deliver structured data alongside unstructured content. An
unlabeled smartphone image, for instance, can still tell you that it is a selfie, and the time and place
where it was taken.
Types of Big Data:
4. Geospatial data
• Geospatial data is information about objects, events, or other features located on or near the earth’s surface.
Geospatial data typically combines location information (usually coordinates on the earth), attribute information
(the characteristics of the object, event, or phenomenon in question), and temporal information (the time or life span
at which the location and attributes exist). The location reported may be static (such as the position of a piece of
equipment, an earthquake event, or children living in poverty) or dynamic (for instance, a moving car or pedestrian,
or the spread of an infectious disease).
5. Machine or operational logging data
• Machine data is information produced by a computer process or application activity without the involvement of a
human being. Humans seldom alter machine data, although it may be gathered and studied. This implies that
data entered manually by an end user does not count as machine-generated data. Such data is increasingly
generated, either directly by machines or as an unintended by-product of human activity, and it affects every
industry that uses computers in its everyday operations. Examples of machine data include call detail records and
application log files.
6. Open-source data
• Open-source databases keep crucial data in software that remains under the organization’s control. Users of an
open-source database can build a system to suit their own needs and professional requirements. The software is free
and open to sharing, and it can be adapted to user requirements by changing the source code. Open-source databases
meet the need for more affordable data analysis from a growing number of innovative applications. Social media and
the Internet of Things (IoT) have ushered in an era in which big data is available to be gathered and evaluated.
Google Public Data Explorer is an example of this big data type.
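As a rough illustration of the types above, the following Python sketch handles one tiny sample each of structured, semi-structured, geospatial, and machine-generated data using only the standard library. All sample values, field names, and the log line layout are made up for illustration; real sources will look different.

```python
import csv
import io
import re
from email import message_from_string

# Structured data: every row follows the same pre-defined columns.
CSV_TEXT = "customer_id,age,balance\n101,34,2500.50\n102,45,310.00\n"
structured_rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))

# Semi-structured data: an e-mail has organized headers plus a free-text body.
RAW_EMAIL = (
    "From: alice@example.com\n"
    "To: bob@example.com\n"
    "Subject: Order delay\n"
    "Date: Mon, 01 Jan 2024 10:00:00 +0000\n"
    "\n"
    "Hi Bob, my parcel has still not arrived.\n"
)
message = message_from_string(RAW_EMAIL)
email_headers = {name: message[name] for name in ("From", "To", "Subject", "Date")}
email_body = message.get_payload().strip()

# Geospatial data: location + attributes + time (a hypothetical record).
geo_record = {
    "location": {"lat": 28.6139, "lon": 77.2090},
    "attributes": {"kind": "delivery_van", "moving": True},
    "observed_at": "2024-01-01T10:00:05Z",
}

# Machine data: an application log line written without human involvement.
# The layout "<timestamp> <level> <service> key=value ..." is an assumption.
LOG_LINE = "2024-01-01T10:00:05Z INFO payment-service request_id=42 status=200"
LOG_PATTERN = re.compile(
    r"(?P<timestamp>\S+) (?P<level>\S+) (?P<service>\S+) (?P<fields>.*)"
)


def parse_log_line(line):
    """Split one log line into its fixed parts and a dict of key=value fields."""
    match = LOG_PATTERN.match(line)
    if match is None:
        raise ValueError(f"unrecognized log line: {line!r}")
    parsed = match.groupdict()
    parsed["fields"] = dict(pair.split("=", 1) for pair in parsed["fields"].split())
    return parsed


parsed_log = parse_log_line(LOG_LINE)
print(structured_rows[0]["balance"], email_headers["Subject"], parsed_log["fields"])
```

The structured rows can be searched by column name straight away, whereas the e-mail body would need further text processing before it could be analyzed.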
Fundamental characteristics of big data:
1. Volume
• The volume of your data is how much of it there is, measured in units ranging from gigabytes up to zettabytes (ZB)
and yottabytes (YB). Industry trends predict a significant increase in data volume over the next few years. Sensors,
social media platforms, and application logs all continuously generate enormous volumes of data. Data gathered
from all these sources is stored and organized using distributed systems like Hadoop.

2. Velocity
• Velocity describes how quickly data is generated and processed. Any significant data operation has to run at a high
rate. Velocity covers the rate at which incoming data sets are linked, bursts of activity, and the pace of change.

3. Variety
• Variety refers to the many types of big data, that is, the wide range of information you collect from numerous
sources. Because it affects performance, it is one of the main problems the big data sector is now dealing with.
It is crucial to organize your data so that you can manage its diversity effectively.
4. Veracity
• Veracity refers to the correctness of your data. Poor veracity can severely harm the accuracy
of your findings, making it one of the most crucial big data qualities. It specifies
the level of data reliability. Because most of the data you encounter is unstructured, it is vital to
remove the information that is not essential and use the remaining data for processing.
5. Value
• Value is the advantage that the data provides to your company. Does it reflect the objectives of
your company? Does it aid in the growth of your company? It is one of the most crucial
fundamentals of big data. Data scientists first clean the raw data and extract the most useful
subset from it. Analysis and pattern recognition are then performed on this data set.
• The results of this analysis may be used to determine the value of the data.
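The first four characteristics can be made concrete with a toy profile of a small batch of records. The sketch below is only an assumption-laden illustration: the record fields (`ts`, `format`, `payload`) are hypothetical, and veracity is approximated very crudely as the share of records with a non-empty payload. Value is left out because it depends on business objectives rather than on anything that can be computed from the batch itself.

```python
from datetime import datetime

# Hypothetical batch of incoming records from different sources.
records = [
    {"ts": "2024-01-01T10:00:00", "format": "csv", "payload": "101,34,2500.50"},
    {"ts": "2024-01-01T10:00:01", "format": "json", "payload": '{"user": 7}'},
    {"ts": "2024-01-01T10:00:02", "format": "text", "payload": "great product"},
    {"ts": "2024-01-01T10:00:04", "format": "csv", "payload": ""},
]


def profile_batch(batch):
    """Summarize volume, velocity, variety, and a crude veracity proxy."""
    payload_sizes = [len(r["payload"].encode("utf-8")) for r in batch]
    timestamps = [datetime.fromisoformat(r["ts"]) for r in batch]
    span_seconds = (max(timestamps) - min(timestamps)).total_seconds()
    usable = [r for r in batch if r["payload"].strip()]
    return {
        # Volume: how much data there is.
        "volume_bytes": sum(payload_sizes),
        # Velocity: how fast records arrive over the observed time span.
        "velocity_per_second": len(batch) / span_seconds if span_seconds else float("nan"),
        # Variety: how many distinct formats appear.
        "variety_formats": len({r["format"] for r in batch}),
        # Veracity (proxy): share of records that are not empty.
        "veracity_ratio": len(usable) / len(batch),
    }


profile = profile_batch(records)
print(profile)
```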
Applications of Big Data
Big Data for Financial Services
• Credit card companies, retail banks, private wealth management
advisories, insurance firms, venture funds, and institutional
investment banks all use big data for their financial services. The
common problem among them all is the massive amount of multi-structured
data living in multiple disparate systems, which big data
technologies can help address. As such, big data is used in several ways, including:
• Customer analytics
• Compliance analytics
• Fraud analytics
• Operational analytics
Big Data in Communications
• Gaining new subscribers, retaining customers, and expanding within
current subscriber bases are top priorities for telecommunication
service providers. The solutions to these challenges lie in the ability to
combine and analyze the masses of customer-generated and
machine-generated data that are created every day.
Big Data for Retail
• Understanding the customer better: This requires the ability to
analyze all disparate data sources that companies deal with every day,
including the weblogs, customer transaction data, social media, store-
branded credit card data, and loyalty program data.
Importance of Big Data
1. Saving costs
2. Driving efficiency
3. Analysing the market
4. Improving customer experiences
5. Supporting innovation
6. Detecting fraud
7. Improving productivity
8. Enabling agility
Reference: https://www.spiceworks.com/tech/big-data/articles/what-is-big-data/
Difference Between Big Data and Data Science
• While Big Data and Data Science both deal with data, they deal with it in different ways.
1. Big Data is concerned with handling and managing huge amounts of data. Prior to Big Data,
industries did not possess the tools and resources required to manage such large volumes
of data. The emergence of MapReduce and Hadoop made it easier for them to
handle data at this scale. Data Science, on the other hand, is the scientific analysis of data.
It is more quantitative in nature and uses various statistical approaches to find insights
within the data.
2. While Big Data is about storing data, Data Science is about analyzing it. Keep in mind,
however, that Data Science is an ocean of data operations, one that also includes Big
Data. A Data Scientist often analyzes data that is large enough to require a big data platform.
Therefore, an ideal data scientist must also possess knowledge of big data tools.
3. Big Data was originally limited to the storage and management of data. More recently,
components such as Pig and Hive have been added to the Hadoop ecosystem in
order to facilitate the analysis of big data, and newer frameworks like Spark have
analytical features built in.
4. The roles of Data Scientists and Big Data Specialists also differ. A Data Scientist is required
to analyze the data, draw insights from it, visualize it, and communicate the results
through robust storytelling. A Big Data Specialist, on the other hand, develops, maintains,
and administers Big Data clusters that hold voluminous amounts of data.
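To give a feel for the MapReduce programming model mentioned in point 1, here is a minimal single-machine sketch of a word count in plain Python. It only imitates the map, shuffle, and reduce phases in memory; it does not use Hadoop or any distributed runtime, and the function names are chosen for illustration.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Group all emitted values by key, as the framework would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Combine the values of each key into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(documents):
    """Run the three phases in sequence over a list of documents."""
    pairs = []
    for document in documents:
        pairs.extend(map_phase(document))
    return reduce_phase(shuffle_phase(pairs))


counts = word_count(["Big Data", "data science uses big data tools"])
print(counts)
```

In a real cluster the map and reduce calls would run in parallel on different machines, each working on its own slice of the data; the logic per key stays the same.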
Big Data | Data Science
Big Data deals with handling and managing huge amounts of data. Prior to Big Data, industries did not possess the tools and resources required to manage such large volumes of data. The emergence of MapReduce and Hadoop made it easier for them to handle data at this scale. | Data Science is the scientific analysis of data. It is more quantitative in nature and uses various statistical approaches to find insights within the data.
Big Data is limited to the storage and management of data. Components such as Pig and Hive have been added to the Hadoop framework in order to facilitate the analysis of big data, and Spark has analytical features. | Data Science is about analyzing data. Data Science is an ocean of data operations, one that also includes Big Data. A Data Scientist analyzes data that is quite large and requires a big data platform. Therefore, an ideal data scientist must also possess knowledge of big data tools.
A Big Data Specialist develops, maintains, and administers Big Data clusters that hold voluminous amounts of data. | A Data Scientist is required to analyze the data, draw insights from it, visualize it, and communicate the results through robust storytelling.
Similarities Between Big Data & Data Science
• Data Science is the ocean of data operations, and these operations
include Big Data.
• Data Science is like a larger set that contains Big Data as a subset,
along with other important data operations. Both fields
deal with data.
• Furthermore, a data scientist is required to handle big data, which is
frequently unstructured in nature.
What is Data Warehousing?
• Data warehousing can be defined as the process of collecting and storing data
from various sources and managing it to provide valuable business insights.
• It can also be referred to as electronic storage, where businesses store a large
amount of data and information.
• It is a critical component of a business intelligence system that involves
techniques for data analysis.
• Data warehousing is a mixture of technology and components that enable a
strategic usage of data.
• It is the electronic collection of a significant volume of information by an
organization intended for query and analysis rather than for the processing of
transactions.
• Data warehousing is a method of translating data into information and making
it available to users in time to make a difference.
Steps in Data Warehousing
The following steps are involved in the process of data warehousing:
• Extraction of data – A large amount of data is gathered from various
sources.
• Cleaning of data – Once the data is compiled, it goes through a
cleaning process. The data is scanned for errors, and any error found is
either corrected or excluded.
• Conversion of data – After cleaning, the data is converted from its
source database format to the warehouse format.
• Storing in a warehouse – Once converted to the warehouse format,
the data stored in a warehouse goes through processes such as
consolidation and summarization to make it easier and more
coordinated to use. As sources get updated over time, more data is
added to the warehouse.
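The four steps above can be sketched as a very small pipeline. This is only an illustration under stated assumptions: the two source systems, the `sales_fact` table, and its columns are hypothetical, and an in-memory SQLite database from Python's standard library stands in for a real warehouse.

```python
import sqlite3


def extract():
    """Extraction: gather raw rows from two hypothetical source systems."""
    store_system = [
        {"region": " North ", "amount": "100.5"},
        {"region": "South", "amount": ""},  # missing amount, an error to handle
    ]
    web_system = [
        {"region": "north", "amount": "50"},
        {"region": "South", "amount": "25"},
    ]
    return store_system + web_system


def clean(rows):
    """Cleaning: exclude rows with a missing amount and normalize region names."""
    cleaned = []
    for row in rows:
        if not row["amount"].strip():
            continue
        cleaned.append(
            {"region": row["region"].strip().title(), "amount": row["amount"].strip()}
        )
    return cleaned


def convert(rows):
    """Conversion: turn source records into typed tuples for the warehouse table."""
    return [(row["region"], float(row["amount"])) for row in rows]


def store(connection, rows):
    """Storing: load the converted rows into the warehouse table."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS sales_fact (region TEXT, amount REAL)"
    )
    connection.executemany("INSERT INTO sales_fact VALUES (?, ?)", rows)
    connection.commit()


def summarize(connection):
    """Consolidate and summarize the stored data per region."""
    cursor = connection.execute(
        "SELECT region, SUM(amount) FROM sales_fact GROUP BY region ORDER BY region"
    )
    return cursor.fetchall()


connection = sqlite3.connect(":memory:")
store(connection, convert(clean(extract())))
summary = summarize(connection)
print(summary)
```

Calling `store` again with rows from a later extraction appends them to the same table, which mirrors how more data is added to the warehouse as sources are updated.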
S.No. | Big Data | Data Warehouse
1. | Big data is data of enormous volume to which technologies can be applied. | A data warehouse is the collection of historical data from different operations in an enterprise.
2. | Big data is a technology to store and manage large amounts of data. | A data warehouse is an architecture used to organize the data.
3. | It takes structured, unstructured, or semi-structured data as input. | It only takes structured data as input.
4. | Big data processing uses a distributed file system. | A data warehouse doesn’t use a distributed file system for processing.
5. | Big data is not necessarily fetched with SQL queries. | In a data warehouse, SQL queries are used to fetch data from relational databases.
6. | Apache Hadoop can be used to handle enormous amounts of data. | A traditional data warehouse is not designed to handle data at that scale.
7. | When new data is added, the changes in data are stored in the form of a file which is represented by a table. | When new data is added, the changes in data do not directly impact the data warehouse.
8. | Big data doesn’t require efficient management techniques to the same degree as a data warehouse. | A data warehouse requires more efficient management techniques, as the data is collected from different departments of the enterprise.
What is data mining?
• Data mining is the process of sorting through large data sets to identify patterns and
relationships that can help solve business problems through data analysis.
• Data mining techniques and tools enable enterprises to predict future trends and make
more-informed business decisions.
• Put differently, data mining is the process of identifying interesting patterns and information in huge
quantities of data.
• It draws on various data sources, such as data warehouses, databases, the
internet, other information repositories, and data streams that are fed into the system
continuously.
• Data mining is one step in the Knowledge Discovery in Databases (KDD) process,
which extracts useful knowledge from large volumes of data.
• Data mining professionals work with databases to evaluate information and discard any
information that is not useful or reliable. This requires knowledge of big data, computing
and information analysis, and the ability to handle different types of software.
• Data mining helps derive insights through careful extraction,
reviewing, and processing of raw data to discover patterns and
correlations that can be valuable for businesses.
• Data mining processes include different types of services such as:
• Web mining
• Text mining
• Audio mining
• Video mining
• Social network data mining
• Pictorial data mining
Applications Of Data Mining In Real Life
• Mobile Service Providers
• Mobile service providers use data mining to design their marketing campaigns
and to keep customers from moving to other vendors.
• From large amounts of data such as billing information, email, text messages,
web data transmissions, and customer service records, data mining tools can
predict “churn”, that is, identify the customers who are likely to change
vendors.
• Each customer is given a probability score based on these results. Mobile service
providers can then offer incentives to customers who are at higher risk
of churning. This kind of mining is often used by major service providers such
as broadband, phone, and gas providers.
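A churn probability score of this kind could, for example, come from a logistic model. The sketch below is a toy: the feature names, weights, and bias are hand-picked for illustration and are not taken from any trained model or from the source. In practice they would be learned from historical customer data.

```python
import math

# Hypothetical, hand-picked weights; a real model would learn these from data.
WEIGHTS = {"support_calls": 0.6, "months_as_customer": -0.05, "late_payments": 0.8}
BIAS = -2.0


def churn_probability(customer):
    """Map a customer's features to a score between 0 and 1 (logistic function)."""
    z = BIAS + sum(weight * customer[name] for name, weight in WEIGHTS.items())
    return 1.0 / (1.0 + math.exp(-z))


def customers_to_target(customers, threshold=0.5):
    """Return the ids of customers whose churn score reaches the threshold."""
    return [c["id"] for c in customers if churn_probability(c) >= threshold]


customers = [
    {"id": "A", "support_calls": 5, "months_as_customer": 6, "late_payments": 2},
    {"id": "B", "support_calls": 0, "months_as_customer": 48, "late_payments": 0},
]

for customer in customers:
    print(customer["id"], round(churn_probability(customer), 3))
print(customers_to_target(customers))
```

Customers returned by `customers_to_target` are the ones a provider might choose to approach with incentives or offers.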
Applications Of Data Mining In Real Life
• Retail Sector
• Data mining helps supermarket and retail sector owners understand the
choices of their customers.
• By looking at the purchase history of customers, data mining tools reveal
their buying preferences.
• With the help of these results, supermarkets design the placement of
products on shelves and bring out offers such as coupons on
matching products and special discounts on some products.
• Data mining can also be used for product recommendation and cross-referencing
of items.
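One simple way to find "matching products" in purchase histories is to count how often pairs of items appear in the same basket. The following sketch computes the support of each item pair, meaning the fraction of baskets containing both items. The baskets are invented sample data, and this is only the first step of fuller association-rule methods.

```python
from collections import Counter
from itertools import combinations


def pair_support(baskets):
    """Return the fraction of baskets that contain each unordered item pair."""
    counts = Counter()
    for basket in baskets:
        # sorted(set(...)) removes duplicates and gives each pair a stable order.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return {pair: count / len(baskets) for pair, count in counts.items()}


# Invented purchase histories, one list per shopping basket.
baskets = [
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["milk", "tea"],
    ["bread", "butter", "tea"],
]

support = pair_support(baskets)
best_pair = max(support, key=support.get)
print(best_pair, support[best_pair])
```

A pair with high support, such as bread and butter here, is a candidate for adjacent shelf placement or a joint coupon.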
Applications Of Data Mining In Real Life
• Artificial Intelligence
• A system is made artificially intelligent by feeding it relevant patterns.
These patterns come from data mining outputs. The outputs of artificially
intelligent systems are in turn analyzed for relevance using data mining
techniques.
• Recommender systems use data mining techniques to make personalized
recommendations while the customer is interacting with the system.
Artificial intelligence is applied to mined data, for example when Amazon gives
product recommendations based on the customer's past purchasing history.
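A minimal sketch of recommending products from past purchasing history is shown below. It is an assumption for illustration, not a description of how Amazon or any other site works: customers are compared by the Jaccard similarity of their purchase sets, and items bought by similar customers are suggested. All names and histories are invented.

```python
def jaccard(a, b):
    """Similarity of two sets: size of the overlap divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(target, histories, top_n=2):
    """Suggest items the target has not bought, weighted by customer similarity."""
    mine = histories[target]
    scores = {}
    for other, items in histories.items():
        if other == target:
            continue
        similarity = jaccard(mine, items)
        if similarity == 0:
            continue  # ignore customers with nothing in common
        for item in items - mine:
            scores[item] = scores.get(item, 0.0) + similarity
    # Highest score first; ties are broken alphabetically for a stable result.
    ranked = sorted(scores, key=lambda item: (-scores[item], item))
    return ranked[:top_n]


# Invented purchase histories, one set of items per customer.
histories = {
    "asha": {"laptop", "mouse"},
    "ravi": {"laptop", "mouse", "keyboard"},
    "meena": {"novel", "bookmark"},
    "john": {"laptop", "monitor"},
}

suggestions = recommend("asha", histories)
print(suggestions)
```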
Applications Of Data Mining In Real Life
• Transportation
• Data mining helps in scheduling the movement of vehicles from warehouses to
outlets and in analyzing product loading patterns.
• Science And Engineering
• Data mining in computer science helps to monitor system status, improve
performance, find software bugs, detect plagiarism, and locate faults.
Data mining also helps in analyzing user feedback on products and articles
to deduce the opinions and sentiments expressed.
• Ecommerce
• Shopping sites such as Amazon and Flipkart show “People also viewed” and
“Frequently bought together” suggestions to customers who are interacting with the
site.
Stages of Data Mining:
Data Mining | Data Science
Data mining is a process of extracting useful information, patterns, and trends from huge databases. | Data science refers to the process of obtaining valuable insights from structured and unstructured data by using various tools and methods.
Data mining is a technique. | Data science is a field.
Primarily used for business purposes. | Primarily used for scientific purposes.
It is concerned with the process. | It emphasizes the science of the data.
Data mining aims to make data more usable by extracting only the useful information. | The objective of data science is to build data-driven products.
Data mining is a technique that is part of the KDD (Knowledge Discovery in Databases) process. | It is a field of study, like mechanical engineering or cloud architecture.
It primarily deals with structured data. | It deals with any kind of data: structured, semi-structured, and unstructured.
Reference:
• https://data-flair.training/blogs/big-data-vs-data-science/
• Data Mining
• https://www.techtarget.com/searchbusinessanalytics/definition/data-mining
• https://intellipaat.com/blog/data-mining-vs-data-science/
