UNIT 4
INTRODUCTION TO BIG
DATA HADOOP
What is Big Data?
• Big data refers to extremely large and diverse collections of structured,
semi-structured, and unstructured data that continue to grow over time.
These datasets are so huge and complex in volume, velocity, and variety that
traditional data management systems cannot store, process, and analyze them.
• As data continues to grow and proliferate, new big data tools are emerging
to help companies collect, process, and analyze data at the speed needed to
gain the most value from it.
• Big data describes large and diverse datasets that are huge in volume and
also rapidly grow in size over time. Big data is used in machine learning,
predictive modeling, and other advanced analytics to solve business
problems and make informed decisions.
Big data examples
• Here are some big data examples that are helping transform organizations:
• [...] customers.
• [...] last-mile delivery.
• Using AI-powered technologies like natural language processing to [...]
notes, and lab results) to gain new insights for improved treatment [...]
• Using image data from cameras and sensors, as well as GPS data, [...]
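Volume
• Big data volume refers to the sheer amount of data that is generated and
must be stored and processed.
Velocity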
• Big data velocity refers to the speed at which data is generated. Today, data is
often produced in real time or near real time, and therefore, it must also be
processed, accessed, and analyzed at the same rate to have any meaningful
impact.
Variety
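• Data is heterogeneous, meaning it can come from many different sources and
can be structured, semi-structured, or unstructured.
• Beyond volume, velocity, and variety, three more Vs are often cited in
relation to harnessing the power of big data: veracity, variability, and value.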
Veracity: Big data can be messy, noisy, and error-prone, which makes it
difficult to control the quality and accuracy of the data. Large datasets can be
[...] picture. The higher the veracity of the data, the more trustworthy it is.
Variability: The meaning of collected data can change, which can
lead to inconsistency over time. These shifts include not only changes in
context and interpretation but also changes in data collection methods.
Value: It’s essential to determine the business value of the data you collect.
Big data must contain the right data and then be effectively analyzed in order
to deliver that value.
How does big data work?
• The central concept of big data is that the more visibility you have
into anything, the more effectively you can gain insights to make better
decisions, uncover growth opportunities, and improve your business model.
• While big data has many advantages, it does present some challenges:
• Lack of data talent and skills. Data scientists, data analysts, and
data engineers are in short supply and are among the most highly
sought-after professionals. A lack of big data skills and experience with
advanced data tools is one of the main obstacles to using big data effectively.
• [...] Without a solid infrastructure in place that can handle your processing, storage, network, [...]
• Problems with data quality. Data quality directly impacts the quality of decision-making,
data analytics, and planning strategies. Raw data is messy and can be difficult to curate.
Having big data doesn’t guarantee results unless the data is accurate, relevant, and properly
organized for analysis. Cleaning the data can slow down reporting, but if quality problems
are not addressed, you can end up with unreliable results.
• Security concerns. Big data contains valuable business and customer information, making
big data stores high-value targets for attackers. Since these datasets are varied and complex,
they are also harder to secure.
• Integrating disparate data sources and making data accessible for business
users is complex, but vital, if you hope to realize any value from your big data.
Types of Big Data
Structured Data.
Any data that can be processed, is easily accessible, and can be stored in
a fixed format is called structured data. In Big Data, structured data is the
easiest to work with because it has highly coordinated measurements that
are defined by setting parameters.
Structured types of Big Data are:
Overview:
• Highly organized and easily searchable in databases.
• Follows a predefined schema (e.g., rows and columns in a table).
• Typically stored in relational databases (SQL).
Examples:
• Customer information databases (names, addresses, phone numbers).
• Financial data (transactions, account balances).
• Inventory management systems.
• Metadata (data about data).
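The idea of a predefined schema can be illustrated with a small sketch. It uses Python's built-in sqlite3 module as a stand-in for a relational database; the table name, columns, and sample rows are made up for illustration and are not from the slides.

```python
import sqlite3


def build_customer_table():
    """Create an in-memory table with a fixed schema and load sample rows."""
    conn = sqlite3.connect(":memory:")
    # The schema is declared up front: every row must have these columns.
    conn.execute(
        "CREATE TABLE customers ("
        "id INTEGER PRIMARY KEY, "
        "name TEXT NOT NULL, "
        "city TEXT NOT NULL, "
        "balance REAL NOT NULL)"
    )
    rows = [
        (1, "Asha", "Pune", 1200.50),   # hypothetical sample data
        (2, "Ravi", "Delhi", 310.00),
        (3, "Meera", "Pune", 95.25),
    ]
    conn.executemany("INSERT INTO customers VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return conn


def customers_in_city(conn, city):
    """Return customer names for one city, sorted alphabetically."""
    cur = conn.execute(
        "SELECT name FROM customers WHERE city = ? ORDER BY name", (city,)
    )
    return [row[0] for row in cur.fetchall()]


conn = build_customer_table()
print(customers_in_city(conn, "Pune"))  # ['Asha', 'Meera']
```

Because every row follows the same columns, a single SQL query can filter and sort the data without any preprocessing, which is why structured data is described as the easiest type to work with.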
Merits:
Limitations:
Semi-Structured Data
Overview:
• Lacks a fixed schema but includes tags and markers to separate data elements.
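A short sketch may help show what "no fixed schema, but tags and markers" looks like in practice. It assumes JSON as the format, and the records and field names are invented for illustration. Each record carries its own keys, so two records need not share the same fields.

```python
import json

# Hypothetical records: the keys act as tags, but each record has a different shape.
RAW_RECORDS = """
[
  {"id": 1, "name": "Asha", "email": "asha@example.com"},
  {"id": 2, "name": "Ravi", "phones": ["555-0101", "555-0102"]},
  {"id": 3, "name": "Meera", "address": {"city": "Pune"}}
]
"""


def field_names(records):
    """Return the sorted set of keys seen across all records."""
    keys = set()
    for record in records:
        keys.update(record.keys())
    return sorted(keys)


def get_city(record):
    """Read a nested, optional field; return None when it is absent."""
    return record.get("address", {}).get("city")


records = json.loads(RAW_RECORDS)
print(field_names(records))             # ['address', 'email', 'id', 'name', 'phones']
print([get_city(r) for r in records])   # [None, None, 'Pune']
```

The code has to handle missing fields explicitly, which is the trade-off for the flexibility that semi-structured formats provide.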
Examples:
Limitations:
Examples:
• Log files (system logs, application logs).
• Clickstream data from web analytics.
• Sensor data streams.
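Log lines like those listed above usually follow a loose pattern rather than a strict schema. The sketch below assumes a made-up log format and uses a regular expression to pull out fields, skipping lines that do not match. The pattern and sample lines are illustrative only, not a real server's log format.

```python
import re
from collections import Counter

# Assumed (hypothetical) line layout: ip [timestamp] "METHOD path" status
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \[(?P<time>[^\]]+)\] "(?P<method>[A-Z]+) (?P<path>\S+)" (?P<status>\d{3})'
)

SAMPLE_LINES = [
    '10.0.0.1 [2024-01-05T10:00:00] "GET /home" 200',
    '10.0.0.2 [2024-01-05T10:00:01] "GET /cart" 200',
    'malformed line without the expected fields',
    '10.0.0.1 [2024-01-05T10:00:02] "POST /checkout" 500',
]


def parse_line(line):
    """Turn one log line into a dict, or return None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    return record


# Keep only the lines that could be parsed, then count responses per status code.
parsed = [rec for rec in map(parse_line, SAMPLE_LINES) if rec is not None]
status_counts = Counter(rec["status"] for rec in parsed)
print(len(parsed), dict(status_counts))
```

Once the fields are extracted, the records can be loaded into a table or aggregated, as the status count shows.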
Merits:
Limitations:
Unstructured Data
• Stored in data lakes, NoSQL databases, and other flexible storage solutions.
Examples:
• Web pages.
Merits:
• Capable of storing vast amounts of diverse data.
Limitations:
• Difficult to search and analyze without preprocessing.
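One common form of preprocessing is to break free text into tokens and build an inverted index, so documents can be searched by keyword. The following is a minimal sketch under that assumption; the document ids and texts are invented, and real systems would add steps such as stemming and stop-word removal.

```python
import re
from collections import defaultdict

# Hypothetical unstructured documents keyed by an id.
DOCUMENTS = {
    "email-1": "The shipment arrived late and the box was damaged.",
    "review-7": "Great phone, but the battery drains fast.",
    "note-3": "Customer asked about a late refund for a damaged phone.",
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_index(documents):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index, *terms):
    """Return the sorted ids of documents that contain every term."""
    matches = [index.get(term.lower(), set()) for term in terms]
    if not matches:
        return []
    return sorted(set.intersection(*matches))


index = build_index(DOCUMENTS)
print(search(index, "late", "damaged"))  # ['email-1', 'note-3']
```

Without the index, each query would have to scan every document in full, which does not scale to the volumes discussed in this unit.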
Examples:
• Time-series data (financial market data).
• Spatial data (geographic information systems).
• Graph data (social networks).
• Machine-generated data (IoT sensor data).
Merits:
Limitations: