BIG DATA Notes
Volume
Big data involves massive amounts of data, often well beyond what traditional
systems can effectively manage. This data can be generated from various sources
such as social media, connected devices (IoT), commercial transactions, etc.
Variety
Big data can be structured, semi-structured, or unstructured. It can take the form
of text, images, videos, sounds, sensor data, and so on. This variety is one of the
key challenges of big data, because the data often has to be processed and
analyzed in different formats.
Velocity
Big data is often generated at a fast and constant rate, requiring real-time or
near-real-time analysis to produce useful reports. Examples include live streaming
data and real-time GPS tracking data.
Veracity
Veracity refers to the quality and reliability of the data collected. The data used
for analysis and decision-making must be accurate and reliable, particularly in
terms of data consistency and integrity.
Value
Value refers to the importance of this data to companies, and of its processing,
especially with regard to the strategic decisions it can inform. The ability to
transform data into the right business decisions is therefore essential.
b. Challenges
- Storage and management of massive data.
- Security and confidentiality of data.
- Integration and processing of data from different sources and formats.
- Analysis and extraction of value from unstructured or semi-structured data.
- Need for specialized skills in data analysis and big data technologies.
a. Structured Data
Structured data is organized in tabular form with predefined columns. It is
usually stored in relational databases and is easy to query using SQL.
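To illustrate, here is a minimal sketch of querying structured, tabular data with SQL, using Python's built-in sqlite3 module (the table and column names are hypothetical examples, not from the course):

```python
import sqlite3

# In-memory relational database with a predefined, tabular schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO customers (name, city) VALUES (?, ?)",
    [("Alice", "Paris"), ("Bob", "Lyon"), ("Carol", "Paris")],
)

# Because the structure is known in advance, aggregation is a one-line query.
rows = conn.execute(
    "SELECT city, COUNT(*) FROM customers GROUP BY city ORDER BY city"
).fetchall()
print(rows)  # [('Lyon', 1), ('Paris', 2)]
```

The predefined schema is precisely what makes such queries simple; the unstructured data described below has no such schema to query against.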
c. Unstructured Data
This data has no predefined structure and can take various forms such as free
text, images, videos, audio, files, etc. It is usually stored in distributed file
systems or object storage systems.
d. Real-time Data
This data is generated and processed in real time, requiring instant analysis to
make timely decisions. It often comes from IoT sensors, social media feeds, and
online financial transactions.
[Figure: architecture with batch layer, serving layer, and speed layer]
The data received will be collected in its raw state in the master dataset, i.e., in
the batch layer. The master dataset is thus the first component responsible for
storage. It is imperative to keep the raw data from which aggregations will be
made. The data stored in the master dataset is considered perpetually correct.
The second component of the batch layer is the one responsible for analyzing the
massive data contained in the master dataset.
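The defining property of the master dataset is that raw records are appended and never modified. A minimal sketch of such an append-only store, in plain Python (class and field names are illustrative, not a real framework API):

```python
import time

class MasterDataset:
    """Append-only store for raw events: a simplified sketch of the
    batch layer's master dataset. Records are added, never updated,
    which is what keeps the stored data 'perpetually correct'."""

    def __init__(self):
        self._records = []

    def append(self, event: dict) -> None:
        # Raw data is stored as-is with an ingestion timestamp;
        # aggregations are computed later from these untouched records.
        self._records.append({"ingested_at": time.time(), "data": event})

    def scan(self):
        # Batch analyses read the full dataset to recompute aggregates.
        return iter(self._records)

store = MasterDataset()
store.append({"user": "u1", "action": "click"})
store.append({"user": "u2", "action": "view"})
print(sum(1 for _ in store.scan()))  # 2
```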
Once we have carried out distributed analyses on our data, we must store the
results of these analyses and make them available to users. This is the role of the
serving layer.
The serving layer is the component that will be in charge of storing and exposing
to users the results of batch analyses carried out in the batch layer. It acts as a
database in which information will be stored and this information will be able to
be read by users when they make queries.
With the combination of the batch layer and the serving layer, we might be
tempted to think we already have a complete big data architecture. Indeed, in
situations where it is not necessary to analyze the most recent data, these two
components are sufficient. But in the general case, this two-component
architecture has a flaw: it does not allow analysis of the data collected while
batch analyses are running in the batch layer. To address this constraint, we must
turn to the speed layer.
The role of the speed layer will be to aggregate real-time data and expose a view
so that users can make queries on the freshest data.
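The interaction between the three layers can be sketched with simple counters: the batch layer periodically recomputes a batch view over the master dataset, the speed layer maintains a real-time view over the data received since the last batch run, and queries combine both. This is an illustrative toy, assuming count-style aggregates, not a real implementation:

```python
from collections import Counter

# Batch view: aggregates precomputed over the master dataset
# (recomputed periodically by the batch layer).
batch_view = Counter({"page_a": 100, "page_b": 40})

# Real-time view: aggregates over data received since the last batch run,
# maintained incrementally by the speed layer.
realtime_view = Counter({"page_a": 3, "page_c": 1})

def query(page: str) -> int:
    # Queries merge both views, so results cover old and fresh data alike.
    return batch_view[page] + realtime_view[page]

print(query("page_a"))  # 103
print(query("page_c"))  # 1
```

When the next batch run completes, its results absorb what the speed layer had accumulated, and the real-time view is reset.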
Part 3: BIG DATA TOOLS AND TECHNOLOGIES
1. NoSQL Databases
NoSQL (Not only SQL) databases are designed to meet the storage and processing
needs of unstructured or semi-structured data on a large scale. Some commonly
used types of NoSQL databases will be developed in a later part of the course.
b. Apache Spark
Apache Spark is a fast, general-purpose data processing framework that supports
in-memory processing and caching of intermediate data, making it much faster
than MapReduce for certain types of tasks. Spark also offers APIs for real-time
processing, machine learning, and graph processing.
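To illustrate the map-and-reduce style of processing that MapReduce and Spark distribute across a cluster, here is a minimal single-machine word count in plain Python. The real frameworks run the same three phases (map, shuffle, reduce) in parallel on many nodes; this sketch only shows the logic:

```python
from collections import defaultdict
from itertools import chain

documents = ["big data tools", "big data analysis", "data value"]

# Map phase: emit a (word, 1) pair for every word in every document.
pairs = chain.from_iterable(
    ((word, 1) for word in doc.split()) for doc in documents
)

# Shuffle phase: group the emitted values by key (the word).
groups = defaultdict(list)
for word, count in pairs:
    groups[word].append(count)

# Reduce phase: sum the counts for each word.
word_counts = {word: sum(counts) for word, counts in groups.items()}
print(word_counts)
# {'big': 2, 'data': 3, 'tools': 1, 'analysis': 1, 'value': 1}
```

Spark's speed advantage comes from keeping intermediate results like `groups` in memory across chained operations, where classic MapReduce would write them to disk between jobs.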
Tools such as Power BI and D3.js make it possible to create interactive
visualizations from the data, which facilitates the understanding of trends and
patterns. In conclusion, big data tools and technologies offer companies the
opportunity to store, process, and analyze massive volumes of data to gain
valuable insights. By understanding the different types of NoSQL databases,
distributed processing frameworks, and visualization tools, big data professionals
can design solutions adapted to the specific needs of their organization.
Part 4: INTRODUCTION TO NOSQL DATABASES
AND MODELING
1. Differences Between NoSQL and SQL Databases
- Flexible schema: NoSQL databases allow a flexible schema where data of
the same type can have different structures, making it easier to add new data
types without modifying the overall schema.
- Horizontal scalability: NoSQL databases are designed to run on a cluster
of machines allowing horizontal scalability by simply adding new nodes to
the cluster.
- Data model: NoSQL databases use different data models such as key-value,
column, document, or graph, while relational databases use the relational
model based on tables.
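The flexible-schema difference is easy to see with documents: records in the same collection need not share the same fields. A small sketch in plain Python, serializing documents as JSON the way a document store would (the field names are hypothetical examples):

```python
import json

# Two records of the same "users" collection with different structures:
# no global schema has to be altered to add the extra fields.
users = [
    {"name": "Alice", "email": "alice@example.com"},
    {"name": "Bob", "phone": "+33-000", "tags": ["admin"]},
]

# Each document is stored and retrieved as-is.
stored = [json.dumps(u) for u in users]
loaded = [json.loads(s) for s in stored]
print(loaded[1].get("tags"))  # ['admin']
print(loaded[0].get("tags"))  # None (field simply absent)
```

In a relational table, adding `phone` and `tags` would require an `ALTER TABLE` affecting every row; here the second document just carries them.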
b. Column databases
They store data in columns rather than in rows, making them ideal for analytical
queries and aggregations. Examples: Apache Cassandra, HBase, Amazon
Redshift.
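Why the columnar layout helps aggregation can be sketched in a few lines: with row-oriented storage, summing one field means reading every record; with column-oriented storage, the aggregation reads only the one column it needs. A toy illustration (data values are hypothetical):

```python
# Row-oriented layout: one record per row, as in most relational stores.
rows = [
    {"product": "A", "price": 10, "qty": 2},
    {"product": "B", "price": 20, "qty": 1},
    {"product": "C", "price": 15, "qty": 4},
]

# Column-oriented layout: each column is stored contiguously.
columns = {
    "product": ["A", "B", "C"],
    "price": [10, 20, 15],
    "qty": [2, 1, 4],
}

# Summing prices from rows touches every full record...
total_from_rows = sum(r["price"] for r in rows)
# ...while the columnar layout only scans the "price" column.
total_from_columns = sum(columns["price"])
print(total_from_rows, total_from_columns)  # 45 45
```

On disk, scanning one contiguous column instead of whole records is what makes column databases fast for analytics.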
c. Document databases
They store data in the form of JSON, XML, or other similar document formats,
providing flexibility for semi-structured data. Examples: MongoDB, Couchbase,
Elasticsearch, RavenDB, Amazon DocumentDB.
d. Graph databases
They are designed to store and query data represented as graphs, making them
ideal for social networks, recommendations, and network analysis. Examples:
Neo4j, Amazon Neptune, TigerGraph, ArangoDB, JanusGraph.
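The kind of traversal these databases optimize for can be sketched with adjacency lists: a friend-of-friend recommendation, the classic social-network query. A toy version in plain Python (the graph and names are invented for illustration):

```python
# A tiny social graph as adjacency lists.
friends = {
    "ada": {"bob", "carl"},
    "bob": {"ada", "dan"},
    "carl": {"ada", "dan"},
    "dan": {"bob", "carl"},
}

def recommend(person: str) -> set:
    # Friends-of-friends who are not already direct friends:
    # a one-hop traversal that graph databases answer natively,
    # where SQL would need self-joins.
    direct = friends[person]
    candidates = set().union(*(friends[f] for f in direct))
    return candidates - direct - {person}

print(recommend("ada"))  # {'dan'}
```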
- Windows utility
- Hadoop 2.7
- JDK 8
- Session Management: Efficiently store and retrieve user session data for
web applications.
- Distributed Caching: Accelerate data access by caching frequently used
results.
- Real-time Analytics: Quickly recover and analyze real-time data for
business insights.
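The session-management and caching use cases above both rest on the same primitive: a key-value store whose entries expire. A minimal in-process sketch of that idea in plain Python; real distributed caches such as Redis add networking, replication, and eviction policies on top (the class and key names are illustrative):

```python
import time

class TTLCache:
    """A minimal in-process cache with per-entry expiry, sketching the
    core idea behind distributed caches and session stores."""

    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (value, expiry_timestamp)

    def set(self, key, value):
        self._store[key] = (value, time.monotonic() + self.ttl)

    def get(self, key, default=None):
        entry = self._store.get(key)
        if entry is None:
            return default
        value, expires = entry
        if time.monotonic() >= expires:
            del self._store[key]  # lazy expiry on read
            return default
        return value

cache = TTLCache(ttl_seconds=0.05)
cache.set("session:42", {"user": "ada"})
print(cache.get("session:42"))  # {'user': 'ada'}
time.sleep(0.06)
print(cache.get("session:42"))  # None: the session has expired
```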