UNIT 4: INTRODUCTION TO BIG DATA HADOOP
What is Big Data?
• Big data refers to extremely large and diverse collections of structured,
unstructured, and semi-structured data that continue to grow exponentially
over time. These datasets are so large and complex in volume, velocity, and
variety that traditional data management systems cannot store, process, and
analyze them.
• The amount and availability of data are growing rapidly, spurred on by digital
technology advancements, such as connectivity, mobility, the Internet of
Things (IoT), and artificial intelligence (AI). As data continues to expand
and proliferate, new big data tools are emerging to help companies collect,
process, and analyze data at the speed needed to gain the most value from it.
• Big data describes large and diverse datasets that are huge in volume and
also rapidly grow in size over time. Big data is used in machine learning,
predictive modeling, and other advanced analytics to solve business
problems and make informed decisions.
Big data examples
• Here are some big data examples that are helping transform organizations
across every industry:
• Tracking consumer behavior and shopping habits to deliver hyper-
personalized retail product recommendations tailored to individual
customers.
• Monitoring payment patterns and analyzing them against historical
customer activity to detect fraud in real time.
• Combining data and information from every stage of an order’s shipment
journey with hyperlocal traffic insights to help fleet operators optimize
last-mile delivery.
• Using AI-powered technologies like natural language processing to
analyze unstructured medical data (such as research reports, clinical
notes, and lab results) to gain new insights for improved treatment
development and enhanced patient care.
• Using image data from cameras and sensors, as well as GPS data,
to detect potholes and improve road maintenance in cities.
• Analyzing public datasets of satellite imagery and geospatial datasets to
visualize, monitor, measure, and predict the social and environmental
impacts of supply chain operations.

The Vs of big data
• Definitions of big data vary slightly, but big data is almost always
described in terms of volume, velocity, and variety. These
characteristics are often referred to as the “3 Vs of big data” and
were first defined in 2001 by analyst Doug Laney at META Group (now
part of Gartner).
Volume
• As its name suggests, the most common characteristic associated
with big data is its high volume. This describes the enormous
amount of data that is available for collection and produced from a
variety of sources and devices on a continuous basis.
Velocity
• Big data velocity refers to the speed at which data is generated. Today, data is
often produced in real time or near real time, and therefore, it must also be
processed, accessed, and analyzed at the same rate to have any meaningful
impact.
Variety
• Data is heterogeneous, meaning it can come from many different sources and
can be structured, unstructured, or semi-structured. More traditional structured
data (such as data in spreadsheets or relational databases) is now supplemented
by unstructured text, images, audio, video files, or semi-structured formats like
sensor data that can’t be organized in a fixed data schema.

In addition to these three original Vs, three others are often mentioned in
relation to harnessing the power of big data: veracity, variability, and value.
Veracity: Big data can be messy, noisy, and error-prone, which makes it
difficult to control the quality and accuracy of the data. Large datasets can be
unwieldy and confusing, while smaller datasets could present an incomplete
picture. The higher the veracity of the data, the more trustworthy it is.
Variability: The meaning of collected data is constantly changing, which can
lead to inconsistency over time. These shifts include not only changes in
context and interpretation but also data collection methods based on the
information that companies want to capture and analyze.
Value: It’s essential to determine the business value of the data you collect.
Big data must contain the right data and then be effectively analyzed in order
to deliver that value.
How does big data work?
• The central concept of big data is that the more visibility you have
into anything, the more effectively you can gain insights to make better
decisions, uncover growth opportunities, and improve your business model.
• Making big data work requires three main actions:

• Integration: Big data systems collect terabytes, and sometimes even petabytes, of raw
data from many sources. This data must be received, processed, and transformed into the
format that business users and analysts need before they can start analyzing it.
• Management: Big data needs big storage, whether in the cloud,
on-premises, or both. Data must be stored in whatever form is required,
and it also needs to be processed and made available in real time.
Increasingly, companies are turning to cloud solutions to take
advantage of elastic compute and scalability.
• Analysis: The final step is analyzing and acting on big data;
otherwise, the investment won’t be worth it. Beyond exploring the data
itself, it’s also critical to communicate and share insights across the
business in a way that everyone can understand. This includes using
tools to create data visualizations like charts, graphs, and dashboards.
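The three actions above can be sketched as a toy pipeline. This is only an illustration under stated assumptions: the two inline sources, the `orders` table, and the function names `integrate`, `manage`, and `analyze` are hypothetical, and an in-memory SQLite database stands in for real big data storage.

```python
import csv
import io
import json
import sqlite3

# Hypothetical raw inputs from two sources in different formats.
CSV_SOURCE = "order_id,amount\n1,19.99\n2,5.00\n"
JSON_SOURCE = '[{"order_id": 3, "amount": 42.5}]'


def integrate(csv_text, json_text):
    """Integration: read both sources and convert them to one common row format."""
    rows = [(int(r["order_id"]), float(r["amount"]))
            for r in csv.DictReader(io.StringIO(csv_text))]
    rows += [(int(r["order_id"]), float(r["amount"]))
             for r in json.loads(json_text)]
    return rows


def manage(rows):
    """Management: store the unified rows (in-memory SQLite stands in for big storage)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, amount REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)
    conn.commit()
    return conn


def analyze(conn):
    """Analysis: compute a summary that could feed a chart or dashboard."""
    count, total = conn.execute("SELECT COUNT(*), SUM(amount) FROM orders").fetchone()
    return {"orders": count, "revenue": round(total, 2)}


summary = analyze(manage(integrate(CSV_SOURCE, JSON_SOURCE)))
print(summary)
```

At real scale each stage would typically be handled by a dedicated system, but the order of the stages stays the same.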
Challenges of implementing big data analytics
• While big data has many advantages, it does present some challenges
that organizations must be ready to tackle when collecting,
managing, and taking action on such an enormous amount of data.
• The most commonly reported big data challenges include:
• Lack of data talent and skills. Data scientists, data analysts, and
data engineers are in short supply—and are some of the most highly
sought after (and highly paid) professionals in the IT industry. Lack
of big data skills and experience with advanced data tools is one of
the primary barriers to realizing value from big data environments.

• Speed of data growth. Big data, by nature, is always rapidly changing and increasing.
Without a solid infrastructure in place that can handle your processing, storage, network,
and security needs, it can become extremely difficult to manage.
• Problems with data quality. Data quality directly impacts the quality of decision-making,
data analytics, and planning strategies. Raw data is messy and can be difficult to curate.
Having big data doesn’t guarantee results unless the data is accurate, relevant, and properly
organized for analysis. Cleaning the data can slow down reporting, but if quality problems are
not addressed, you can end up with misleading results and worthless insights.
• Security concerns. Big data contains valuable business and customer information, making
big data stores high-value targets for attackers. Since these datasets are varied and complex,
it can be harder to implement comprehensive strategies and policies to protect them.

• Compliance violations. Big data contains a lot of sensitive data and
information, making it a tricky task to continuously ensure data
processing and storage meet data privacy and regulatory requirements,
such as data localization and data residency laws.
• Integration complexity. Most companies work with data siloed across
various systems and applications across the organization. Integrating
disparate data sources and making data accessible for business users is
complex, but vital, if you hope to realize any value from your big data.
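The data quality challenge above can be made concrete with a small check. The following is a minimal sketch, assuming hypothetical customer records and three example rules (missing email, duplicate id, age outside 0 to 120); real pipelines would define their own rules.

```python
# Hypothetical customer records; None and "" stand in for missing values.
records = [
    {"id": 1, "email": "a@example.com", "age": 34},
    {"id": 2, "email": "", "age": 29},
    {"id": 2, "email": "b@example.com", "age": 29},   # duplicate id
    {"id": 3, "email": "c@example.com", "age": -5},   # out-of-range age
    {"id": 4, "email": None, "age": 41},
]


def quality_report(rows):
    """Count three common problems: missing emails, duplicate ids, invalid ages."""
    seen_ids = set()
    report = {"missing_email": 0, "duplicate_id": 0, "invalid_age": 0}
    for row in rows:
        if not row["email"]:
            report["missing_email"] += 1
        if row["id"] in seen_ids:
            report["duplicate_id"] += 1
        seen_ids.add(row["id"])
        if not 0 <= row["age"] <= 120:
            report["invalid_age"] += 1
    return report


report = quality_report(records)
print(report)
```

Running a report like this before analysis shows how much of the raw data would need to be fixed or excluded.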
Types of Big Data
Structured Data
• Any data that can be stored, accessed, and processed in a fixed format is
called structured data. In Big Data, structured data is the easiest type to
work with because its fields and their permitted values are defined in advance.
Overview:
• Highly organized and easily searchable in databases.
• Follows a predefined schema (e.g., rows and columns in a table).
• Typically stored in relational databases (SQL).
Examples:
• Customer information databases (names, addresses, phone numbers).
• Financial data (transactions, account balances).
• Inventory management systems.
• Metadata (data about data).
Merits:
• Easy to analyze and query.
• High consistency and accuracy.
• Efficient storage and retrieval.
• Strong data integrity and validation.
Limitations:
• Limited flexibility (must adhere to a strict schema).
• Scalability issues with very large datasets.
• Less suitable for complex big data types.
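As a minimal sketch of the points above, the example below uses Python's built-in `sqlite3` module. The `customers` table and its rows are hypothetical. It shows both a merit (easy querying over named columns) and a limitation (a row that violates the schema is rejected).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The schema is fixed up front: every row must supply these typed columns.
conn.execute(
    "CREATE TABLE customers ("
    " id INTEGER PRIMARY KEY,"
    " name TEXT NOT NULL,"
    " balance REAL NOT NULL)"
)
conn.executemany(
    "INSERT INTO customers (id, name, balance) VALUES (?, ?, ?)",
    [(1, "Asha", 120.0), (2, "Ben", 75.5), (3, "Chen", 310.25)],
)

# Easy to query: a declarative filter over named columns.
rich = conn.execute(
    "SELECT name FROM customers WHERE balance > ? ORDER BY name", (100,)
).fetchall()
print(rich)

# Limited flexibility: a row that violates the schema is rejected.
try:
    conn.execute("INSERT INTO customers (id, name, balance) VALUES (4, NULL, 10.0)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
print(rejected)
```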

Semi-structured Data
• In Big Data, semi-structured data combines features of both structured and unstructured
data. It has some of the organizing properties of structured data, such as tags and named
fields, but its content does not adhere to a formal data model or a relational database
schema. Examples of semi-structured formats include XML and JSON.
Overview:
• Contains both structured and unstructured elements.
• Lacks a fixed schema but includes tags and markers to separate data elements.
• Often stored in formats like XML, JSON, or NoSQL databases.
Examples:
• JSON files for web APIs.
• XML documents for data interchange.
• Email messages (headers are structured, body can be unstructured).

Merits:
• More flexible than structured data.
• Easier to parse and analyze than unstructured data.
• Can handle a wide variety of data types.
• Better suited for hierarchical data.
Limitations:
• More complex to manage than structured data.
• Parsing can be resource-intensive.
• Inconsistent data quality.
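A short sketch of semi-structured data, assuming a hypothetical JSON payload from a web API: both records carry tags (keys), but they do not share the same set of fields, so the reading code has to tolerate missing ones.

```python
import json

# Hypothetical API payload: both records are tagged, but their fields differ.
payload = """
[
  {"id": 1, "name": "Asha", "contact": {"email": "asha@example.com"}},
  {"id": 2, "name": "Ben", "tags": ["vip", "newsletter"]}
]
"""

records = json.loads(payload)

# No fixed schema: collect the keys each record actually carries.
keys_per_record = [sorted(r.keys()) for r in records]
print(keys_per_record)

# Nested (hierarchical) fields are read with defaults because they may be absent.
emails = [r.get("contact", {}).get("email") for r in records]
print(emails)
```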

Quasi-Structured Data
Overview:
• Loosely structured data that does not fit neatly into traditional database
schemas.
• Contains some organizational properties but lacks a fixed structure.
• Often encountered in large-scale data systems and logs.
Examples:
• Log files (system logs, application logs).
• Clickstream data from web analytics.
• Sensor data streams.
Merits:
• Can provide valuable insights with proper analysis.
• Flexible data format suitable for big data systems.
• Facilitates real-time data processing.
• Capable of capturing a wide range of data types.
Limitations:
• Data extraction and transformation can be challenging.
• Higher storage and processing costs.
• Requires specialized tools for analysis.
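Extracting structure from log files, as described above, can be sketched with a regular expression. The log format and the pattern below are hypothetical; real logs vary, which is why lines that do not match are collected rather than dropped.

```python
import re

# Hypothetical web-server log lines; the last one does not follow the pattern.
LOG_LINES = [
    '192.0.2.10 [2024-01-05 10:15:00] "GET /home" 200',
    '192.0.2.11 [2024-01-05 10:15:02] "GET /cart" 500',
    "server restarted unexpectedly",
]

LINE_PATTERN = re.compile(
    r'(?P<ip>\S+) \[(?P<time>[^\]]+)\] "(?P<method>\S+) (?P<path>\S+)" (?P<status>\d{3})'
)


def parse_logs(lines):
    """Split lines into structured records and leftovers that did not match."""
    parsed, unparsed = [], []
    for line in lines:
        match = LINE_PATTERN.fullmatch(line)
        if match:
            record = match.groupdict()
            record["status"] = int(record["status"])
            parsed.append(record)
        else:
            unparsed.append(line)
    return parsed, unparsed


parsed, unparsed = parse_logs(LOG_LINES)
print([r["path"] for r in parsed if r["status"] >= 500])
print(unparsed)
```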

Unstructured Data
• Unstructured data in Big Data is data whose format consists of a
multitude of unstructured files (images, audio, logs, and video).
This form of data is considered complex because it lacks a familiar
structure and is relatively large in size. A typical example of
unstructured data is the output returned by a web search engine
such as Google Search or Yahoo Search.
Overview:
• Data that does not conform to a predefined schema.
• Includes text, multimedia, and other non-tabular data types.
• Stored in data lakes, NoSQL databases, and other flexible storage solutions.
Examples:
• Text documents (Word files, PDFs).
• Multimedia files (images, videos, audio).
• Social media posts.
• Web pages.
Merits:
• Capable of storing vast amounts of diverse data.
• High flexibility in data storage.
• Suitable for complex data types like multimedia.
• Facilitates advanced analytics and machine learning applications.
Limitations:
• Difficult to search and analyze without preprocessing.
• Requires large storage capacities.
• Inconsistent data quality and reliability.
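The first limitation above, that unstructured data is difficult to analyze without preprocessing, can be illustrated with free text. This is a minimal sketch: the posts and the stop-word list are hypothetical, and the tokenizer only handles plain English letters.

```python
import re
from collections import Counter

# Hypothetical free-text posts with no schema at all.
posts = [
    "Delivery was late, but support was great!",
    "Great product. Late delivery though.",
]

STOP_WORDS = {"was", "but", "though"}


def tokenize(text):
    """Lowercase the text and keep alphabetic words that are not stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]


word_counts = Counter(word for post in posts for word in tokenize(post))
print(word_counts.most_common(3))
```

Only after a step like this does the text become countable and comparable, which is what later analytics or machine learning would build on.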

Subtypes of Data
Overview:
• Different categories within the main types of big data.
• Each subtype has unique characteristics and use cases.
• Important for selecting appropriate data management and analysis tools.
Examples:
• Time-series data (financial market data).
• Spatial data (geographic information systems).
• Graph data (social networks).
• Machine-generated data (IoT sensor data).
Merits:
• Tailored analysis techniques for each subtype.
• Enhanced insights and decision-making.
• Optimized storage and processing solutions.
• Improved data relevance and context.
Limitations:
• Requires specialized tools and expertise.
• Can be resource-intensive to manage.
• Integration of multiple subtypes can be complex.
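As one example of a technique tailored to a subtype, time-series data is often smoothed with a moving average. The sketch below assumes a hypothetical list of daily prices and a window of three points; the function name `moving_average` is illustrative only.

```python
from collections import deque

# Hypothetical daily closing prices, ordered by time.
prices = [100.0, 102.0, 101.0, 105.0, 107.0]


def moving_average(values, window):
    """Return the mean of each full sliding window over a time-ordered series."""
    recent = deque(maxlen=window)
    averages = []
    for value in values:
        recent.append(value)
        if len(recent) == window:
            averages.append(sum(recent) / window)
    return averages


print(moving_average(prices, 3))
```

The technique depends on the ordering of the data, which is why it suits time-series data but would be meaningless for, say, graph data.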

Structured vs Unstructured vs Semi-Structured Data
