Big Data Analytics - Unit 1
TYPES OF BIG DATA
•Structured Data
•Semi-Structured Data
•Unstructured Data
•Structured Data has a dedicated data model and a well-defined structure: it follows a consistent order and is designed so that it can be easily accessed and used by a person or a computer. Structured data is usually stored in well-defined columns and in databases.
Example: Database Management Systems (DBMS)
•Semi-Structured Data can be considered another form of structured data. It inherits a few properties of structured data, but the major part of this kind of data lacks a definite structure and does not conform to the formal structure of data models such as an RDBMS.
Example: Comma-Separated Values (CSV) files
•Unstructured Data is a completely different type that neither has a structure nor follows the formal structural rules of data models. It does not even have a consistent format, and it varies all the time; at most it may carry limited information, such as a date and time.
Example: Audio files, images, etc.
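To make the distinction concrete, here is a minimal Python sketch (not from the slides; the values and names are invented) showing how each type typically appears to a program:

    import csv
    import io
    import json

    # Structured: a fixed, well-defined schema, as in a DBMS table row.
    row = {"id": 1, "name": "Asha", "age": 30}

    # Semi-structured: self-describing, but with no enforced schema (e.g., CSV or JSON).
    csv_text = "id,name\n1,Asha\n2,Ravi"
    records = list(csv.DictReader(io.StringIO(csv_text)))
    doc = json.loads('{"id": 2, "tags": ["new", "vip"]}')

    # Unstructured: raw bytes with no data model (e.g., audio or image content).
    audio_bytes = b"ID3\x03\x00..."  # stand-in for the contents of an MP3 file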
THE CHARACTERISTICS OF BIG DATA
Volume
Volume refers to the unimaginable amount of information generated every second from social media, cell phones, cars, credit cards, M2M sensors, images, video, and more. We currently use distributed systems to store data in several locations, brought together by a software framework like Hadoop. Facebook alone generates about a billion messages, records 4.5 billion clicks of the “like” button, and receives over 350 million new posts each day. Such a huge amount of data can only be handled by Big Data technologies.
Variety
As discussed before, Big Data is generated in multiple varieties. Compared to traditional data like phone numbers and addresses, the latest trend of data is in the form of photos, videos, audio, and many more, making about 80% of the data completely unstructured.
Veracity
Veracity means the degree of reliability the data has to offer. Since a major part of the data is unstructured and irrelevant, Big Data needs to find alternate ways to filter or translate it, as the data is crucial to business development.
Value
Value is the major issue that we need to concentrate on. It is not just the amount of data that we store or process; it is the amount of valuable, reliable, and trustworthy data that needs to be stored, processed, and analyzed to find insights.
Velocity
Last but not least, Velocity plays a major role compared to the others: there is no point in investing so much only to end up waiting for the data. A major aspect of Big Data is therefore to provide data on demand and at a faster pace.
APPLICATIONS OF BIG DATA
Retail
Leading online retail platforms are wholeheartedly deploying big data throughout a
customer’s purchase journey to predict trends, forecast demand, optimize pricing, and
identify customer behavioral patterns.
Big data is helping retailers implement clear strategies that minimize risk and maximize
profit.
Healthcare
Big data is revolutionizing the healthcare industry, changing the way medical professionals diagnose and treat diseases.
In recent times, effective analysis and processing of big data by machine learning algorithms have provided significant advantages for the evaluation and assimilation of complex clinical data, preventing deaths and improving quality of life by enabling healthcare workers to detect early warning signs and symptoms.
Financial Services and Insurance
The increased ability to analyze and process big data is dramatically impacting the financial
services, banking, and insurance landscape.
In addition to using big data for swift detection of fraudulent transactions, lowering risk, and supercharging marketing efforts, some companies are taking these applications to the next level.
Manufacturing
With advancements in robotics and automation technologies, modern-day manufacturers are becoming more and more data-focused, heavily investing in automated factories that exploit big data to streamline production and lower operational costs.
Top global manufacturers are also integrating sensors into their products, capturing big data to provide valuable insights on product performance and usage.
Energy
To combat the rising costs of oil extraction and exploration difficulties because of economic
and political turmoil, the energy industry is turning toward data-driven solutions to increase
profitability.
Big data is optimizing every process, from drilling and exploring new reserves to production and distribution, while cutting down energy waste.
Sentiment analysis uses natural language processing, computational linguistics, and so on, to understand the feelings expressed in the data. Unlike methods that analyze quantitative data (data that can be measured), sentiment analysis seeks to interpret and classify qualitative data by organizing it into themes; it is often used to understand how customers feel about a product or service.
HOW BIG DATA ANALYTICS WORKS
Big data analytics helps organizations make data-informed decisions: better and faster decisions using data that was previously inaccessible or unusable.
1. Collect Data
• Data collection looks different for every organization. With today’s technology, organizations can gather both structured and unstructured data from a variety of sources, from cloud storage to mobile applications to in-store IoT sensors.
• Some data will be stored in data warehouses where business intelligence tools and solutions can access it easily.
• Raw or unstructured data that is too diverse or complex for a warehouse may be assigned metadata and stored in a
data lake.
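As a hedged illustration of the two destinations (no specific products are named in the slides; sqlite3 stands in for a warehouse, and a plain Python list stands in for a lake), structured rows go to a well-defined table while raw payloads are tagged with metadata:

    import sqlite3

    # Warehouse side: a well-defined table that BI tools can query easily.
    warehouse = sqlite3.connect(":memory:")
    warehouse.execute("CREATE TABLE sales (id INTEGER, region TEXT, amount REAL)")
    warehouse.execute("INSERT INTO sales VALUES (1, 'south', 19.99)")

    # Lake side: raw payloads kept as-is, with metadata assigned at write time.
    data_lake = []
    data_lake.append({
        "metadata": {"source": "mobile-app", "format": "json"},
        "payload": b'{"event": "tap", "screen": "checkout"}',
    })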
2. Process Data
• Once data is collected and stored, it must be organized properly to get accurate results on analytical queries,
especially when it’s large and unstructured.
• Available data is growing exponentially, making data processing a challenge for organizations.
• One processing option is batch processing, which looks at large data blocks over time.
• Batch processing is useful when there is a longer turnaround time between collecting and analyzing data.
• Stream processing looks at small batches of data at once, shortening the delay time between collection and
analysis for quicker decision-making.
• Stream processing is more complex and often more expensive.
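The contrast can be sketched in a few lines of Python (a toy illustration; the record format, the analyze() function, and the window size are invented for this example):

    def analyze(records):
        """Stand-in for an analytical query over a block of records."""
        return sum(r["amount"] for r in records)

    events = [{"amount": i} for i in range(1000)]  # pretend these arrive over time

    # Batch processing: collect one large block, analyze it after the fact.
    print("batch result:", analyze(events))

    # Stream processing: analyze small groups as they arrive, shortening the
    # delay between collection and analysis.
    def stream(source, window=100):
        buffer = []
        for event in source:
            buffer.append(event)
            if len(buffer) == window:
                yield analyze(buffer)  # each result is available almost immediately
                buffer = []

    for partial_result in stream(iter(events)):
        print("stream result:", partial_result)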
3. Clean Data
• Data big or small requires scrubbing to improve data quality and get stronger results; all data must be
formatted correctly, and any duplicative or irrelevant data must be eliminated or accounted for.
• Dirty data can obscure and mislead, creating flawed insights.
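A minimal cleaning sketch, assuming pandas (the slides do not name a tool, and the sample data is invented):

    import pandas as pd

    df = pd.DataFrame({
        "customer": ["Alice", "alice ", "Bob", None],
        "amount": ["10.5", "10.5", "7", "oops"],
    })

    df["customer"] = df["customer"].str.strip().str.title()     # consistent formatting
    df["amount"] = pd.to_numeric(df["amount"], errors="coerce") # flag bad values as NaN
    df = df.drop_duplicates().dropna()                          # drop duplicative/irrelevant rows
    print(df)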
4. Analyze Data
Getting big data into a usable state takes time. Once it’s ready, advanced analytics processes can turn big data
into big insights. Some of these big data analysis methods include:
•Data mining sorts through large datasets to identify patterns and relationships, by identifying anomalies and creating data clusters.
•Predictive analytics uses an organization’s historical data to make predictions about the future, identifying upcoming risks and opportunities.
•Deep learning imitates human learning patterns by using artificial intelligence and machine learning to
layer algorithms and find patterns in the most complex and abstract data.
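As a small, hedged example of the predictive-analytics idea (scikit-learn is an assumption, not something the slides prescribe, and the historical figures are invented):

    import numpy as np
    from sklearn.linear_model import LinearRegression

    # Hypothetical historical data: monthly ad spend vs. resulting sales.
    ad_spend = np.array([[10.0], [20.0], [30.0], [40.0]])
    sales = np.array([25.0, 45.0, 62.0, 85.0])

    # Fit on the past, then predict an unseen future value.
    model = LinearRegression().fit(ad_spend, sales)
    print("forecast for spend=50:", model.predict(np.array([[50.0]]))[0])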
WHAT IS DATA LAKE?
• A Data Lake is a storage repository that can store large amounts of structured, semi-structured, and unstructured data.
• Research analysts can focus on finding meaningful patterns in the data, not on the data itself.
• Unlike a hierarchical data warehouse, where data is stored in files and folders, a Data Lake has a flat architecture. Every data element in a Data Lake is given a unique identifier and tagged with a set of metadata information.
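The flat, identifier-plus-metadata idea can be sketched as follows (a toy in-memory stand-in, not any specific data lake product; ingest() and its tags are invented):

    import uuid
    from datetime import datetime, timezone

    lake = {}  # flat store: element_id -> (metadata, raw payload)

    def ingest(payload: bytes, **tags) -> str:
        element_id = str(uuid.uuid4())  # the unique identifier
        metadata = {"ingested_at": datetime.now(timezone.utc).isoformat(), **tags}
        lake[element_id] = (metadata, payload)
        return element_id

    ingest(b'{"page": "/home"}', source="web", format="json")
    ingest(b"\x00\x01...", source="sensor", format="binary")

    # Analysts locate data by metadata, not by a folder hierarchy.
    web_data = [i for i, (meta, _) in lake.items() if meta["source"] == "web"]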
WHY DATA LAKE?
• The main objective of building a Data Lake is to offer an unrefined view of data to data scientists.
• With the onset of storage engines like Hadoop, storing disparate information has become easy. There is no need to model data into an enterprise-wide schema with a Data Lake.
• With the increase in data volume, data quality, and metadata, the quality of analyses also increases.
• A Data Lake offers business agility.
• Machine Learning and Artificial Intelligence can be used to make profitable predictions.
• It offers a competitive advantage to the implementing organization.
• There is no data silo structure: a Data Lake gives a 360-degree view of customers and makes analysis more robust.
DATA LAKE ARCHITECTURE
In the architecture of a Business Data Lake, the lower levels represent data that is mostly at rest, while the upper levels show real-time transactional data. This data flows through the system with little or no latency.
The following are the important tiers in the Data Lake architecture:
1. Ingestion Tier: The tiers on the left depict the data sources. The data can be loaded into the data lake in batches or in real time.
2. Insights Tier: The tiers on the right represent the research side, where insights from the system are used. SQL or NoSQL queries, or even Excel, can be used for data analysis.
3. HDFS is a cost-effective solution for both structured and unstructured data. It is a landing zone for all data that is at rest in the system.
4. The Distillation tier takes data from the storage tier and converts it to structured data for easier analysis.
5. The Processing tier runs analytical algorithms and user queries, in varying modes (real-time, interactive, batch), to generate structured data for easier analysis.
6. The Unified Operations tier governs system management and monitoring. It includes auditing and proficiency management, data management, and workflow management.
KEY DATA LAKE CONCEPTS
The following are key Data Lake concepts that one needs to understand in order to fully grasp the Data Lake architecture.
• Data Ingestion
Data ingestion allows connectors to get data from different data sources and load it into the Data Lake. It supports many types of data sources, such as databases, web servers, emails, IoT devices, and FTP. A minimal ingestion sketch appears below.
• Data Storage
Data storage should be scalable, offer cost-effective storage, and allow fast access for data exploration. It should support various data formats.
• Data Governance
Data governance is the process of managing the availability, usability, security, and integrity of the data used in an organization.
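Here is the ingestion sketch referred to above: connectors pull records from different source types and land them in one place (the connector functions and landing_zone are invented stand-ins, not a real connector API):

    def from_database():
        yield from [b"row-1", b"row-2"]      # e.g., rows from a SQL query

    def from_web_server():
        yield from [b"GET /index.html 200"]  # e.g., lines from an access log

    def from_ftp():
        yield from [b"report.csv contents"]  # e.g., a file fetched over FTP

    landing_zone = []  # stand-in for the lake's raw storage (e.g., HDFS)

    connectors = {"db": from_database, "web": from_web_server, "ftp": from_ftp}
    for source, connector in connectors.items():
        for payload in connector():
            landing_zone.append({"source": source, "payload": payload})

    print(len(landing_zone), "records ingested")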
• Security
Security needs to be implemented in every layer of the Data Lake, starting with storage, through unearthing, to consumption. The basic need is to stop access by unauthorized users. The Data Lake should support different tools to access data, with an easy-to-navigate GUI and dashboards.
Authentication, accounting, authorization, and data protection are some important features of Data Lake security. A toy authorization check appears after this list.
• Data Quality:
Data quality is an essential component of Data Lake architecture. Data is used to extract business value, and extracting insights from poor-quality data leads to poor-quality insights.
• Data Discovery
Data discovery is another important stage before data preparation or analysis can begin. In this stage, a tagging technique is used to express an understanding of the data, by organizing and interpreting the data ingested into the Data Lake.
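And the toy authorization check mentioned under Security (invented roles and permissions; a real deployment would use the platform's own security framework):

    PERMISSIONS = {"analyst": {"read"}, "engineer": {"read", "write"}}

    def authorize(role: str, action: str) -> None:
        """Stop access by unauthorized users before touching the lake."""
        if action not in PERMISSIONS.get(role, set()):
            raise PermissionError(f"role {role!r} may not {action!r}")

    authorize("engineer", "write")      # allowed
    try:
        authorize("analyst", "write")   # denied
    except PermissionError as err:
        print(err)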
• Data Auditing
Two major data auditing tasks are tracking changes to the key dataset and capturing how, when, and by whom those changes were made.
• Data Lineage
This component deals with the data's origins: where it moves over time and what happens to it. It eases error correction in a data analytics process, from origin to destination.
• Data Exploration
It is the beginning stage of data analysis, and identifying the right dataset is vital before starting data exploration.
All of the given components need to work together to play an important part in building a Data Lake that can easily evolve and explore the environment.
MATURITY STAGES OF DATA LAKE
Stage 1: Handle and ingest data at scale
This first stage of data maturity involves improving the ability to transform and analyze data. Here, business owners need to find the tools matching their skill set for obtaining more data and building analytical applications.
• The design of a Data Lake should be driven by what is available instead of what is required. The schema and data requirements are not defined until the data is queried.
• Data discovery, ingestion, storage, administration, quality, transformation, and visualization should be
managed independently.
• The Data Lake architecture should be tailored to a specific industry. It should ensure that capabilities
necessary for that domain are an inherent part of the design
• The Data Lake should support existing enterprise data management techniques and methods
END OF UNIT 1