BIG DATA Notes


Part 1: INTRODUCTION TO BIG DATA

1. What exactly is big data or data science?


Big data, also called data science or mega data, refers to data sets that are so
large and complex that they exceed the processing capability of traditional
database management tools. These data sets are generally characterized by three
main dimensions: volume, variety and velocity, but we can speak globally of five
by adding veracity and value.

Volume

Big data involves massive amounts of data, often well beyond what traditional
systems can effectively manage. This data can be generated from various sources
such as social media, connected devices (IoT), commercial transactions, etc.

Variety

Big data can be structured, semi-structured or unstructured. It can take the form
of text, images, videos, sounds, sensor data and so on. This variety is one of
the key challenges of big data, because the data often has to be processed and
analyzed in different formats.

Velocity

Big data is often generated at a fast and constant rate, requiring real-time or
near-real-time analysis to produce useful reports. Examples include live
streaming data and real-time GPS tracking data.

Veracity

Veracity refers to the quality and reliability of the data collected. The data
used for analysis and decision-making must be accurate and reliable, in terms of
both consistency and integrity.

Value

Value refers to the importance of this data for companies, especially with
regard to the strategic decisions its processing can inform. The ability to
transform data into sound business decisions is therefore essential.

2. Characteristics and challenges of big data


a. Characteristics
Among the characteristics, we have:

- Scalability: Big data systems must be able to adapt to constantly increasing
data volumes.
- Heterogeneity: Big data can come from different sources and take various
formats.
- Speed: The ability to process and analyze data quickly is essential to obtain
timely insights.
- Variability: Big data may be subject to frequent and unpredictable changes.

b. Challenges
- Storage and management of massive data.
- Security and confidentiality of data.
- Integration and processing of data from different sources and formats.
- Analysis and extraction of value from unstructured or semi-structured data.
- Need for specialized skills in data analysis and big data technologies.

3. Big data applications in different sectors


Big data has diverse and varied applications in many areas, including:

- Health: Analysis of medical data for disease research, epidemic surveillance,
personalization of treatments and so on.
- Finance: Fraud detection, risk analysis, algorithmic trading, customer
behavior analysis and so on.
- Marketing: Advertising targeting, analysis of customers for market
segmentation, customization of offers and so on.
- Transport: Route optimization, traffic management and so on.
- Energy: Analysis of smart meter data, consumption forecasting, predictive
maintenance of equipment and so on.
Part 2: FOUNDATIONS OF BIG DATA
1. Different Data Sources
Big data draws on a variety of sources, each with its own characteristics and
challenges.

a. Structured Data
Structured data is organized in tabular form with predefined columns. It is
usually stored in relational databases and is easy to query using SQL.
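
As a minimal illustration (using Python's built-in sqlite3 module; the table and
values are invented), structured data lives in a table with fixed columns and is
queried with SQL:

    import sqlite3

    # In-memory relational database: every row follows the same predefined columns.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
    conn.execute("INSERT INTO customers (name, city) VALUES (?, ?)", ("Alice", "Paris"))
    conn.execute("INSERT INTO customers (name, city) VALUES (?, ?)", ("Bob", "Lyon"))

    # A standard SQL query is enough, precisely because the schema is fixed.
    for row in conn.execute("SELECT name, city FROM customers WHERE city = ?", ("Paris",)):
        print(row)  # ('Alice', 'Paris')
    conn.close()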

b. Semi-Structured Data

Although it has some defined structure, semi-structured data does not follow a
rigid schema like structured data. It is often stored in formats such as JSON,
XML and so on.
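
For example, the following Python sketch (standard-library json; the records are
invented) shows two records that share a general shape but not a rigid schema:

    import json

    # Two semi-structured records: same "kind" of data, different fields.
    records = [
        '{"id": 1, "name": "Alice", "email": "alice@example.com"}',
        '{"id": 2, "name": "Bob", "phones": ["555-0100", "555-0101"], "vip": true}',
    ]

    for raw in records:
        doc = json.loads(raw)  # parse the JSON text into a Python dict
        # .get() tolerates missing fields, something a fixed relational schema would reject.
        print(doc["name"], "| email:", doc.get("email", "n/a"), "| vip:", doc.get("vip", False))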

c. Unstructured Data
This data has no predefined structure and can take various forms such as free
text, images, videos, audio files, etc. It is usually stored in distributed file
systems or object storage systems.

d. Real-time Data
This data is generated and processed in real time, requiring instant analysis to
make timely decisions. It often comes from IoT sensors, social media feeds,
online financial transactions and so on.

2. Massive Data Storage and Processing Technologies


The main technologies are:

a. Distributed File Systems: These systems distribute data over a cluster of
machines to offer horizontal scalability. Examples: HDFS (Hadoop Distributed
File System), Amazon S3, Google Cloud Storage.
b. Distributed Databases: These databases distribute the data over several
nodes to allow parallel processing and queries. Examples: MongoDB,
Apache Cassandra.
c. Distributed Processing Technologies: These technologies make it possible to
process large amounts of data in parallel over several nodes. Examples: Apache
Hadoop, Apache Spark and Apache Flink.
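
As a hedged sketch of object storage (the boto3 library for Amazon S3, which is
mentioned above; it assumes boto3 is installed, AWS credentials are configured,
and the bucket name "my-bigdata-bucket" is hypothetical):

    import boto3

    s3 = boto3.client("s3")

    # Store a raw data file as an object, then read it back by key.
    s3.put_object(Bucket="my-bigdata-bucket", Key="raw/events.json",
                  Body=b'{"event": "click"}')
    obj = s3.get_object(Bucket="my-bigdata-bucket", Key="raw/events.json")
    print(obj["Body"].read())  # b'{"event": "click"}'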

3. Distributed Data Architecture


Distributed data architectures are designed to manage the challenges posed by
big data, such as scalability, fault tolerance and workload distribution. Here
are some commonly used architectures:

- Lambda architecture: This architecture combines batch processing (e.g.,
Hadoop MapReduce) and real-time processing (e.g., Apache Storm or Apache Flink)
to offer a unified view of the data.
- Kappa architecture: Unlike lambda, this architecture uses only a real-time
processing system to process both real-time and historical data.
- Microservices architecture: This approach divides applications into small
independent services, which facilitates the scalability and management of
big data applications.

4. Lambda Architecture Realization


The design of the lambda architecture is guided by the following constraints:

- Scaling: The proposed architecture must be able to scale horizontally, i.e.,
by adding servers. This growth must happen while guaranteeing the robustness and
fault tolerance of the different systems.
- Ease of maintenance: The technical choices made must not freeze the structure
of the architecture. It must be easy to debug and modify applications that use
it. Finally, maintaining the systems should require few manual interventions.
- Ease of data exploitation: The purpose of such an architecture is not only to
store data but also to make it available to other applications, so that they can
exploit it and extract value from it.

[Figure: lambda architecture, showing the batch layer, serving layer and speed layer]

The data received will be collected in the raw state in the master dataset, i.e., in
the batch layer. So, the master dataset is the first component responsible for
storage. It is imperative to keep the raw data from which we will make
aggregations. The data stored in the master dataset is considered perpetually
correct. The second component of the batch layer is the one responsible for
analyzing the massive data contained in the master dataset.

Once we have carried out distributed analyses on our data, we must store the
results of these analyses and make them available to users. This is the role of
the serving layer.

The serving layer is the component in charge of storing the results of the batch
analyses carried out in the batch layer and exposing them to users. It acts as a
database in which this information is stored and from which users can read it
through queries.

With the combination of the batch layer and the serving layer, we might be
tempted to think that we already have a complete big data architecture. Indeed,
in situations where it is not necessary to analyze the most recent data, these
two components are self-sufficient. But in the general case, this two-component
architecture has a flaw: it does not allow you to analyze the data collected
while batch analyses are running in the batch layer. To respond to this
constraint, we must turn to the speed layer.

The role of the speed layer will be to aggregate real-time data and expose a view
so that users can make queries on the freshest data.
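
The following plain-Python sketch (with invented page-view counts) illustrates
how the three layers fit together: a query merges the precomputed batch view
with the small real-time view maintained by the speed layer:

    # Batch view: precomputed by the batch layer from the master dataset.
    batch_view = {"page_a": 10500, "page_b": 7200}
    # Real-time view: the freshest, not-yet-batched counts from the speed layer.
    realtime_view = {"page_a": 42, "page_c": 3}

    def query(page: str) -> int:
        """Unified view: historical total plus the recent increments."""
        return batch_view.get(page, 0) + realtime_view.get(page, 0)

    print(query("page_a"))  # 10542: batch result plus fresh events
    print(query("page_c"))  # 3: seen only by the speed layer so far
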
Part 3: BIG DATA TOOLS AND TECHNOLOGIES
1. NoSQL Databases
NoSQL (Not only SQL) databases are designed to meet the storage and processing
needs of unstructured or semi-structured data on a large scale. Some commonly
used types of NoSQL databases will be developed in a later part of the course.

2. Distributed Processing Frameworks


a. Apache Hadoop
Apache Hadoop is an open-source framework designed for the storage and
distributed processing of massive data. It consists of several modules,
including HDFS for storage, YARN (Yet Another Resource Negotiator) for resource
management and MapReduce for batch processing.
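
A hedged word-count sketch in the Hadoop Streaming style (Hadoop pipes input
lines to the mapper on stdin, then feeds the reducer key-sorted mapper output;
mapper and reducer are normally two separate scripts, combined here only for
readability):

    import sys
    from itertools import groupby

    def mapper(lines):
        for line in lines:
            for word in line.split():
                print(f"{word}\t1")  # emit (word, 1) pairs

    def reducer(lines):
        # Input arrives sorted by key, so lines for the same word are consecutive.
        pairs = (line.rstrip("\n").split("\t") for line in lines)
        for word, group in groupby(pairs, key=lambda kv: kv[0]):
            print(f"{word}\t{sum(int(count) for _, count in group)}")

    if __name__ == "__main__":
        stage = sys.argv[1] if len(sys.argv) > 1 else "map"
        (mapper if stage == "map" else reducer)(sys.stdin)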

b. Apache Spark
Apache Spark is a fast, general-purpose data processing framework that supports
in-memory processing and caching of intermediate data, making it much faster
than MapReduce for certain types of tasks. Spark also offers APIs for real-time
processing, machine learning and graph processing.
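
A hedged PySpark sketch (it assumes pyspark is installed and that an
"events.json" file with the fields shown exists) of the in-memory caching that
gives Spark its speed advantage on iterative work:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("demo").getOrCreate()

    events = spark.read.json("events.json")                 # semi-structured input
    clicks = events.filter(events.type == "click").cache()  # keep results in memory

    print(clicks.count())                  # first action materializes the cache
    clicks.groupBy("page").count().show()  # reuses cached data, no re-read from disk
    spark.stop()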

3. Data Analysis and Visualization Tools


Tools such as Apache Pig, Apache Hive and Apache Impala make it possible to
process and analyze data stored in distributed systems using query languages
similar to SQL.
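
As a hedged example of this SQL-like access from Python (using the PyHive
library; the host, port and "logs" table are hypothetical, and a reachable
HiveServer2 is assumed):

    from pyhive import hive

    conn = hive.connect(host="hive.example.com", port=10000)
    cursor = conn.cursor()
    # HiveQL is deliberately close to SQL, even though the data is distributed.
    cursor.execute("SELECT page, COUNT(*) AS hits FROM logs GROUP BY page LIMIT 10")
    for page, hits in cursor.fetchall():
        print(page, hits)
    conn.close()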

Tools such as Power BI and D3.js make it possible to create interactive
visualizations from the data, which facilitates the understanding of trends and
patterns.

In conclusion, big data tools and technologies offer companies the opportunity
to store, process and analyze massive volumes of data to gain valuable insights.
By understanding the different types of NoSQL databases, distributed processing
frameworks and visualization tools, big data professionals can design solutions
adapted to the specific needs of their organization.
Part 4: INTRODUCTION TO NOSQL DATABASES
AND MODELING
1. Differences Between NoSQL and SQL Databases
- Flexible schema: NoSQL databases allow a flexible schema where data of the
same type can have different structures, making it easier to add new data types
without modifying the overall schema (see the sketch after this list).
- Horizontal scalability: NoSQL databases are designed to run on a cluster
of machines allowing horizontal scalability by simply adding new nodes to
the cluster.
- Data model: NoSQL databases use different data models such as key-value,
column, document or graph, while relational databases use the relational model
based on tables.
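
A hedged pymongo sketch of the flexible schema (it assumes a MongoDB server on
localhost; the database, collection and fields are invented): two documents of
the same type can carry different fields, with no schema migration:

    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")
    users = client.shop.users

    users.insert_one({"name": "Alice", "email": "alice@example.com"})
    users.insert_one({"name": "Bob", "loyalty_points": 120, "tags": ["vip"]})  # new fields, no ALTER TABLE

    for doc in users.find({}, {"_id": 0}):  # projection hides MongoDB's internal _id
        print(doc)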

2. Types of NoSQL Databases


a. Key-value databases
They store data as key-value pairs, associating each key with a value, and are
optimized for fast retrieval of data by key. Examples of key-value DBMS: Redis,
Amazon DynamoDB, Riak.
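
A hedged sketch with the redis-py client (it assumes a Redis server on
localhost; the key and value are invented): every lookup goes directly through
the key:

    import redis

    r = redis.Redis(host="localhost", port=6379, decode_responses=True)
    r.set("session:42", "user=alice;cart=3", ex=3600)  # store with a 1-hour expiry
    print(r.get("session:42"))                         # fast retrieval by key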

b. Column databases
They store data in columns rather than in rows, making them ideal for analytical
queries and aggregations. Examples: Apache Cassandra, HBase, Amazon Redshift.
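
A hedged sketch with the cassandra-driver package (it assumes a local Cassandra
cluster; the "metrics" keyspace and "readings" table are hypothetical). Queries
use CQL, which looks like SQL:

    from datetime import datetime
    from cassandra.cluster import Cluster

    cluster = Cluster(["127.0.0.1"])
    session = cluster.connect("metrics")  # hypothetical keyspace

    session.execute(
        "INSERT INTO readings (sensor_id, ts, value) VALUES (%s, %s, %s)",
        ("s1", datetime(2024, 1, 1), 21.5),
    )
    # Reads are efficient along the chosen columns (here, per sensor).
    for row in session.execute("SELECT ts, value FROM readings WHERE sensor_id = %s", ("s1",)):
        print(row.ts, row.value)
    cluster.shutdown()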

c. Document databases
They store data in the form of JSON or XML documents or other similar formats,
providing flexibility for semi-structured data. Examples: MongoDB, Couchbase,
Elasticsearch, RavenDB, Amazon DocumentDB.
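
Continuing the earlier pymongo sketch (same local-server assumption, invented
data): documents can nest structures, and queries reach into them with dot
notation:

    from pymongo import MongoClient

    products = MongoClient("mongodb://localhost:27017").shop.products
    products.insert_one({"name": "Laptop", "specs": {"ram_gb": 16, "ssd_gb": 512}})

    # Match on a nested field directly; no joins or schema changes needed.
    print(products.find_one({"specs.ram_gb": {"$gte": 8}}, {"_id": 0}))
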
d. Graph databases
They are designed to store and query data represented in the form of graphs,
making them ideal for social networks, recommendations and network analysis.
Examples: Neo4j, Amazon Neptune, TigerGraph, ArangoDB, JanusGraph.
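
A hedged sketch with the official neo4j Python driver (it assumes a local Neo4j
server; the URI, credentials and node names are invented). Relationships are
first-class, which is what makes connection queries natural in Cypher:

    from neo4j import GraphDatabase

    driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
    with driver.session() as session:
        session.run("CREATE (:Person {name: 'Alice'})-[:FRIEND]->(:Person {name: 'Bob'})")
        result = session.run(
            "MATCH (p:Person)-[:FRIEND]->(f) RETURN p.name AS person, f.name AS friend"
        )
        for record in result:
            print(record["person"], "->", record["friend"])
    driver.close()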

NoSQL databases offer an alternative to traditional relational databases,
providing increased flexibility and scalability to manage unstructured or
semi-structured data on a large scale.

Assignment: Give three applications or websites using a NoSQL database for each
type.


KEY VALUE DATABASES

- Session Management: Efficiently store and retrieve user session data for
web applications.
- Distributed Caching: Accelerate data access by caching frequently used
results.
- Real-time Analytics: Quickly recover and analyze real-time data for
business insights.

COLUMN DATABASES

- Time-Series Data Storage: Handle vast amounts of time-stamped sensor data for
monitoring and analysis.
- Internet of Things (IoT) Data Management: Efficiently manage data from
IoT devices with varying data structures.
- Product Catalogs and Inventories: Store and retrieve extensive product data
efficiently.
DOCUMENT DATABASES

- Content Management Systems (CMS): Manage diverse content elements like text,
images, and videos within a single system.
- E-commerce Product Catalogs: Handle various product attributes and
variations efficiently.
- User Profiles: Store user information with varying data fields based on user
preferences.

GRAPH DATABASES

- Social Networks: Identify and suggest connections, enhancing user engagement.
- Recommendation Engines: Power personalized content and product
recommendations.
- Fraud Detection: Analyze complex relationships to detect fraudulent
activities.
Part 5: USAGE AND QUERIES IN NOSQL DATABASES

Part 6: USE CASES OF BIG DATA WITH NOSQL DATABASES

Part 7: SECURITY AND ETHICS IN THE FIELD OF BIG DATA

Part 8: CASE STUDIES AND PRACTICAL PROJECTS
