Introduction To Data Science
Contents
1 Handling large data on a single computer
2.1.1 Hadoop: a framework for storing and processing large data sets
What do you do when you are confronted with a deluge of data that seems to surpass your capabilities and your conventional techniques no longer suffice? Do you
surrender or adapt? Luckily, you’ve chosen the path of adaptation, as evidenced by your continued reading. In this
chapter, we will introduce you to techniques and tools that enable you to handle larger data sets, even when restricted
to a single computer, provided you adopt the right strategies. This unit aims to equip you with the necessary tools to
perform classifications and regressions on data sets that are too large to fit into your computer’s RAM (random access
memory). The chapter covers the following topics:
• Working with large data sets on a single computer: We will delve into the challenges associated with handling
substantial data sets and explore various approaches to address them effectively.
• Python libraries suitable for larger data sets: Discover and leverage powerful Python libraries specifically designed for working with larger data sets.
• The importance of choosing correct algorithms and data structures: Understand the significance of selecting appro-
priate algorithms and data structures that can effectively manage and process large data volumes.
• Adapting algorithms to work inside databases: Gain insights into adapting algorithms to operate seamlessly within databases.
• Applying general best practices: Learn from the experiences of data scientists and apply their general best practices when handling large data sets.
• Case studies: To provide practical context, we will present two case studies. The first case demonstrates how
to detect malicious URLs using the techniques and tools discussed. The second case illustrates how to build a
recommender engine inside a database, leveraging the concepts covered in this chapter.
Working with more data than your machine can comfortably handle brings a number of challenges:
• Overloaded memory and algorithms: Dealing with a large volume of data presents new challenges, such as exceeding
the computer’s available memory and algorithms that are not optimized for large data sets. This requires adapting both your algorithms and your way of working.
• I/O and CPU starvation: When analyzing large data sets, it’s essential to consider input/output (I/O) operations
and CPU utilization. These factors can cause speed issues during data processing, and careful management is needed to keep them from becoming bottlenecks.
• Memory limitations: Computers have a finite amount of RAM, and attempting to load more data into memory than
it can hold forces the operating system to swap memory blocks to disk, which is far slower than keeping all the
data in memory. Moreover, most algorithms are designed to load the entire data set into memory, resulting
in out-of-memory errors when the data is too large.
• Time constraints: Time is another crucial resource to consider when working with large data sets. Certain algorithms
do not account for time constraints and can run indefinitely. On the other hand, some algorithms struggle to
complete within a reasonable time frame, even when processing only a few megabytes of data.
• Bottlenecks in computer components: Dealing with large data sets can expose bottlenecks in different computer
components. While one system may be overwhelmed, others remain idle. This imbalance incurs a significant cost
in terms of both time and computing resources. For example, programs may experience CPU starvation due to
slow data retrieval from the hard drive, which is typically one of the slowest components in a computer.
• Introduction of solid-state drives (SSD): To address the slow data retrieval from traditional hard disk drives (HDD),
solid-state drives (SSD) were introduced. SSDs offer faster performance but are still more expensive than HDDs.
In short, the problems you face when handling large data sets fall into three categories:
1. Never-ending algorithms
2. Out-of-memory errors
3. Speed issues
Solutions: The solutions for handling large data sets can be categorized into three main areas: using the correct
algorithms, choosing the right data structures, and utilizing the appropriate tools. It’s important to note that these
solutions often address both memory limitations and computational performance, and there is no direct one-to-one
mapping between problems and solutions.
1. Use the correct algorithms:
• Select algorithms specifically designed to handle large data sets, as they are optimized for efficient memory usage.
• Consider algorithms that operate on smaller subsets of the data at a time, rather than loading the entire data
set into memory at once (a minimal sketch of this chunked approach follows the tips below).
• Look for algorithms that provide incremental or streaming processing capabilities, allowing data to be
processed piece by piece as it arrives.
2. Choose the right data structures:
• Opt for data structures that can efficiently store and manipulate large data sets. Use compressed data representations where appropriate.
• Consider data structures that support parallel processing or distributed computing, enabling efficient utilization
of computational resources.
3. Use the right tools:
• Leverage specialized software libraries and frameworks that are designed for working with large data sets.
• Utilize Python together with high-performance libraries such as NumPy, pandas, or Apache Spark (via PySpark).
• Explore database systems that can handle large-scale data and offer optimized querying capabilities.
• Consider utilizing hardware technologies like solid-state drives (SSDs) or distributed computing frameworks
to improve I/O performance and enable parallel processing.
4. General Tips:
• Keep in mind the trade-offs between memory usage and computational performance. Compression techniques
may reduce memory requirements but can impact processing speed. Consider the nature of your data and the
specific requirements of your analysis when selecting algorithms, data structures, and tools.
• Explore techniques for parallel processing and distributed computing to leverage the power of multiple machines.
• Continuously monitor and optimize the performance of your algorithms and data structures to achieve the best balance between speed and memory usage.
• By applying these techniques and considerations, you can effectively handle the challenges posed by large data sets, even on a single computer.
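To make the “work on subsets” idea concrete, here is a minimal sketch of chunked processing with pandas. The file name transactions.csv and its amount column are hypothetical; the point is simply that read_csv with a chunksize argument yields the data piece by piece instead of loading it all into RAM.

import pandas as pd

# Hypothetical input file; any CSV too large to load at once works the same way.
CSV_PATH = "transactions.csv"

total = 0.0
row_count = 0

# read_csv with chunksize returns an iterator of DataFrames instead of
# materializing the whole file in memory.
for chunk in pd.read_csv(CSV_PATH, chunksize=100_000):
    # Each chunk holds at most 100,000 rows, so memory use stays bounded.
    total += chunk["amount"].sum()
    row_count += len(chunk)

print(f"rows: {row_count}, mean amount: {total / row_count:.2f}")

The same pattern (read a chunk, update a running aggregate, discard the chunk) underlies most out-of-core and streaming algorithms.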
When a single machine is no longer enough, the storage and processing of data can be distributed over multiple
computers. Hadoop can scale up to thousands of computers, creating a cluster with petabytes of storage. This enables
the storage and analysis of data sets far larger than any single machine could ever hold.
2.1.1 Hadoop: a framework for storing and processing large data sets
Apache Hadoop is a framework that simplifies working with a cluster of computers. It aims to be all of the following:
• Reliable—By automatically creating multiple copies of the data and redeploying processing logic in case of failure.
• Scalable—Data and its processing are distributed over clusters of computers (horizontal scaling).
The core framework is composed of a distributed file system, a resource manager, and a system to run distributed
programs. In practice it allows you to work with the distributed file system almost as easily as with the local file system
of your home computer. But in the background, the data can be scattered among thousands of servers.
• Hadoop Distributed File System (HDFS): HDFS is the primary storage system in Hadoop. It is designed to store
and manage large volumes of data across multiple machines in a distributed manner. HDFS provides fault tolerance,
high throughput, and scalability, making it suitable for big data applications.
• MapReduce: MapReduce is a programming model and processing framework used for distributed processing of large
datasets in Hadoop. It divides data processing tasks into two stages: Map and Reduce. The Map stage processes
data in parallel across multiple nodes, and the Reduce stage aggregates the results. MapReduce allows for efficient distributed processing of very large data sets.
• Yet Another Resource Negotiator (YARN): YARN is a resource management framework in Hadoop. It enables
efficient allocation of cluster resources and manages the execution of MapReduce tasks or other data processing
frameworks. YARN provides a unified platform for running various data processing workloads, making Hadoop suitable for more than just MapReduce-style batch jobs.
In addition to these core components, Hadoop has an ecosystem of applications and frameworks built on top of it (see Figure 1). Some of the most notable are:
• Hive: Hive is a data warehousing infrastructure built on Hadoop. It provides a SQL-like language called HiveQL
to query and analyze data stored in Hadoop. Hive translates HiveQL queries into MapReduce jobs, allowing users to analyze data with familiar SQL syntax instead of writing MapReduce code by hand.
• HBase: HBase is a distributed NoSQL database that runs on top of Hadoop. It provides real-time read/write access
to large datasets and is designed to handle massive amounts of structured and semi-structured data. HBase is often used when fast, random read/write access to big data is required.
• Mahout: Mahout is a machine learning library for Hadoop. It offers a set of scalable algorithms and tools for
data mining, recommendation systems, clustering, and classification. Mahout allows users to perform large-scale
machine learning tasks on big data using the distributed processing capabilities of Hadoop.
These additional components and applications extend the functionality of Hadoop, enabling various data processing,
analysis, and machine learning tasks to be performed on large-scale datasets stored in Hadoop’s distributed file system.
Figure 1: A sample from the ecosystem of applications that arose around the Hadoop Core Framework
MapReduce: How Hadoop Achieves Parallelism MapReduce is a programming model and processing framework that
plays a key role in achieving parallelism in Hadoop. It enables the distributed processing of large datasets across a cluster of machines through the following steps:
• Data Partitioning: Before processing, the input data is divided into smaller chunks called input splits. Each input
split represents a portion of the dataset. Hadoop ensures that these input splits are stored and processed in parallel across different nodes of the cluster.
• Map Phase: In the Map phase, the processing tasks are executed in parallel on different nodes of the cluster.
Each node processes its assigned input split independently. The Map function takes the input data and produces
intermediate key-value pairs as output. These intermediate results are generated in parallel for each input split.
• Shuffle and Sort: The intermediate key-value pairs produced by the Map phase are then shuffled and sorted. The
keys are grouped and sent to the Reducer tasks based on their values. This step ensures that all the values with
the same key are processed by the same Reducer, enabling the aggregation and analysis of related data.
• Reduce Phase: In the Reduce phase, the Reducer tasks process the intermediate key-value pairs received from the
Map phase. Each Reducer processes its assigned key-value pairs independently, performing aggregation, summa-
rization, or any custom logic required for the data analysis. Reducers work in parallel, processing different keys
simultaneously.
• Output Generation: Finally, the output of the Reducer tasks is collected and combined to produce the final output
of the MapReduce job. This output can be stored in Hadoop Distributed File System (HDFS) or used for further
analysis or visualization.
By dividing the data into smaller input splits, processing them in parallel using the Map phase, and then aggregating
the results with the Reduce phase, Hadoop achieves parallelism and distributes the workload across multiple machines.
This parallel processing capability allows for efficient handling of large datasets and speeds up data processing tasks in
Hadoop clusters.
As the name suggests, the process roughly boils down to two big phases: Mapping phase—The
documents are split up into key-value pairs. Until we reduce, we can have many duplicates. Reduce phase—It’s not
unlike a SQL “group by.” The different unique occurrences are grouped together, and depending on the reducing function,
a different result can be created. Here we wanted a count per color, so that’s what the reduce function returns.
The whole process is described in the following six steps:
1. Reading the input files.
2. Passing each line of the input to a mapper job.
3. The mapper job parses the colors (keys) out of the file and outputs a file for each color with the number of times it has
been encountered (value). Or, more technically said, it maps a key (the color) to a value (the number of occurrences).
4. The keys get shuffled and sorted to facilitate the aggregation.
5. The reduce phase sums the number of occurrences per color and outputs one
file per key with the total number of occurrences for each color.
6. The keys are collected in an output file.
Figure 2: An example of a MapReduce flow for counting the colors in input texts
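To make these six steps concrete, here is a minimal, single-machine sketch of the same flow in plain Python. It only imitates the map, shuffle-and-sort, and reduce stages; a real Hadoop job would distribute exactly this logic across many nodes. The toy documents and colors are invented for illustration.

from collections import defaultdict

# Toy input: each "document" is a line of space-separated colors.
documents = [
    "green red blue red",
    "blue blue green",
    "red green",
]

# Map phase: emit a (color, 1) pair for every occurrence (duplicates allowed).
mapped = []
for line in documents:
    for color in line.split():
        mapped.append((color, 1))

# Shuffle-and-sort phase: group all values that belong to the same key.
grouped = defaultdict(list)
for color, count in sorted(mapped):
    grouped[color].append(count)

# Reduce phase: sum the occurrences per color, much like a SQL GROUP BY.
totals = {color: sum(counts) for color, counts in grouped.items()}
print(totals)  # {'blue': 3, 'green': 3, 'red': 3}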
Data scientists often do interactive analysis and rely on algorithms that are inherently iterative; it can take a while
until an algorithm converges to a solution. As this is a weak point of the MapReduce framework, we’ll introduce the
Spark framework to overcome it. Spark improves the performance on such tasks by an order of magnitude.
WHAT IS SPARK?
Apache Spark is an open-source, distributed data processing and analytics framework that provides a fast and general-
purpose computation engine for big data processing. It is designed to handle large-scale data processing tasks efficiently
and supports a wide range of data processing scenarios, including batch processing, real-time streaming, machine learning, and graph processing. Its key features and components include:
• In-Memory Processing: Spark leverages in-memory computing to store intermediate data in memory, which signifi-
cantly speeds up data processing compared to traditional disk-based processing systems. It minimizes the need for costly disk reads and writes between processing steps.
• Resilient Distributed Datasets (RDDs): RDDs are the fundamental data structure in Spark. They are immutable,
fault-tolerant, distributed collections of data that can be processed in parallel across a cluster. RDDs provide a
high-level API for performing transformations (e.g., map, filter, reduce) and actions (e.g., count, collect, save) on
distributed data.
• Spark SQL: Spark SQL is a module in Spark that provides a programming interface for working with structured
and semi-structured data using SQL-like queries. It allows users to query and manipulate data using SQL syntax
and supports integration with various data sources, including Hive, Avro, Parquet, and JDBC.
• Spark Streaming: Spark Streaming enables real-time processing of streaming data. It ingests data in real-time
from various sources such as Kafka, Flume, or TCP sockets, and processes it in micro-batches. This allows for near real-time analytics on continuous data streams.
• Machine Learning Library (MLlib): MLlib is a scalable machine learning library in Spark. It provides a wide
range of algorithms and tools for common machine learning tasks such as classification, regression, clustering, and
recommendation. MLlib leverages Spark’s distributed computing capabilities to perform machine learning tasks on
large datasets.
• Graph Processing (GraphX): GraphX is a graph processing library in Spark that enables graph computations and
analysis. It provides an API for building and manipulating graph structures and supports a variety of graph
algorithms, making it suitable for tasks such as social network analysis, page ranking, and graph-based recommen-
dations.
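To make the components above a little more concrete, here is a minimal Spark SQL sketch. It assumes pyspark is installed and runs locally; the table of page views, its column names, and the numbers are invented purely for illustration.

from pyspark.sql import SparkSession

# Start (or reuse) a local Spark session.
spark = SparkSession.builder.appName("spark-sql-demo").getOrCreate()

# A tiny, made-up data set of page views per country.
rows = [("BE", 120), ("NL", 340), ("BE", 80), ("FR", 200)]
df = spark.createDataFrame(rows, schema=["country", "views"])

# Register the DataFrame as a temporary view so it can be queried with SQL.
df.createOrReplaceTempView("pageviews")

# Spark SQL query: total views per country, ordered by the total.
result = spark.sql(
    "SELECT country, SUM(views) AS total_views "
    "FROM pageviews GROUP BY country ORDER BY total_views DESC"
)
result.show()

spark.stop()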
While we oversimplify things a bit for the sake of clarity, Spark creates a kind of shared RAM memory between the
computers of your cluster. This allows the different workers to share variables (and their state) and thus eliminates
the need to write the intermediate results to disk. More technically and more correctly, if you’re into that: Spark uses
Resilient Distributed Datasets (RDDs), which are a distributed memory abstraction that lets programmers perform in-
memory computations on large clusters in a fault-tolerant way. Because it’s an in-memory system, it avoids costly disk
operations.
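As a small illustration of the RDD API just described, the following sketch re-runs the earlier color-count example as a Spark job. The input lines are inlined for simplicity; in a real job the RDD would typically be created from a file on HDFS with sc.textFile.

from pyspark import SparkContext

# Local SparkContext using all available cores.
sc = SparkContext("local[*]", "rdd-color-count")

lines = sc.parallelize([
    "green red blue red",
    "blue blue green",
    "red green",
])

counts = (
    lines.flatMap(lambda line: line.split())   # map: one record per color
         .map(lambda color: (color, 1))        # emit (color, 1) pairs
         .reduceByKey(lambda a, b: a + b)      # reduce: sum per color
)

print(counts.collect())  # e.g. [('green', 3), ('red', 3), ('blue', 3)]
sc.stop()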
Apache Spark addresses some of the limitations of the traditional MapReduce framework and provides several im-
provements and optimizations that help overcome its drawbacks. Here are some ways Spark solves the problems of
MapReduce:
• In-Memory Computation: Spark leverages in-memory computing to store intermediate data and computations
in memory, rather than writing them to disk after each MapReduce stage. This significantly reduces disk I/O
operations and improves overall processing speed by allowing data to be accessed quickly from memory.
• DAG (Directed Acyclic Graph) Execution Model: Spark uses a DAG execution model, where it optimizes and
schedules a series of computational steps as a DAG of stages. This allows for efficient pipelining and data reuse
across multiple operations, reducing the overhead of multiple MapReduce jobs and enhancing performance.
• Resilient Distributed Datasets (RDDs): RDDs in Spark provide an efficient and fault-tolerant data abstraction that
allows for in-memory data processing across distributed nodes. RDDs enable iterative computations and caching
of intermediate data, leading to faster and more interactive data analysis compared to the disk-based nature of
MapReduce.
• Data Sharing across Multiple Workloads: Spark allows data to be shared across different workloads, such as batch
processing, interactive queries, and real-time streaming. This eliminates the need to reload data from external
storage systems for different processing tasks, resulting in improved performance and reduced latency.
• Wide Range of Libraries and APIs: Spark provides a rich ecosystem of libraries and APIs, including Spark SQL,
MLlib, GraphX, and Spark Streaming. These libraries offer high-level abstractions and optimized implementations
for common data processing tasks, such as SQL queries, machine learning, graph processing, and real-time streaming
analytics.
• Integration with External Data Sources: Spark integrates with various data sources and formats, including Hadoop
Distributed File System (HDFS), Apache Cassandra, Apache HBase, JDBC, Parquet, Avro, and more. This enables
seamless data integration and processing across diverse data systems, making it easier to work with existing data
infrastructures.
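To see why this matters for the iterative algorithms mentioned earlier, the sketch below caches an RDD and reuses it across ten passes. With cache() the data stays in memory between iterations instead of being re-read on every pass, which is exactly the per-job overhead a chain of MapReduce jobs would pay. The numeric data and the toy update rule are invented for illustration.

from pyspark import SparkContext

sc = SparkContext("local[*]", "iterative-demo")

# A made-up numeric data set; cache() keeps it in memory across iterations.
data = sc.parallelize(range(1, 100_001)).cache()

estimate = 0.0
for i in range(10):
    # Each pass reads the cached RDD from memory rather than rebuilding it,
    # avoiding the per-iteration cost of a MapReduce-style pipeline.
    mean = data.map(lambda x: float(x)).mean()
    estimate = 0.5 * (estimate + mean)  # toy iterative update

print(f"estimate after 10 iterations: {estimate:.2f}")
sc.stop()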
The Different Components of the Spark Ecosystem Spark Core provides a NoSQL environment well suited for
interactive, exploratory analysis. Spark can be run in batch and interactive mode and supports Python. Spark has four large components on top of Spark Core: Spark SQL, Spark Streaming, MLlib, and GraphX.
Database Management Systems A database management system (DBMS) is a platform for storing, organizing, retrieving, and manipulating data.
What Is a Database? A database is a structured collection of data that is organized, managed, and stored in a computer
system. It is designed to efficiently store, retrieve, manipulate, and analyze large amounts of data. A database acts as a
repository for storing different types of data, such as text, numbers, images, audio, and video.
In a database, data is organized into tables, which consist of rows and columns. Each row represents a record or a set
of related data, while each column represents a specific attribute or characteristic of the data. The tables in a database
are interconnected through relationships, allowing data to be linked and accessed in a meaningful way.
Databases provide a way to manage and control data effectively. They offer features such as data integrity (ensuring
the accuracy and consistency of data), data security (protecting data from unauthorized access), and data concurrency
(managing simultaneous access to data by multiple users). Databases also support query languages, such as SQL (Struc-
tured Query Language), which allow users to retrieve, update, and manipulate data using specific commands.
Databases are widely used in various industries and applications, including businesses, organizations, scientific re-
search, healthcare, finance, and more. They play a crucial role in storing and organizing large volumes of data, enabling
efficient access, analysis, and decision making. A DBMS typically provides the following functions:
• Data Definition: DBMS allows the creation, modification, and removal of data definitions that define the organization of the data in the database.
• Data Retrieval: Users can retrieve data from the database for various purposes using query and retrieval commands.
• User Administration: DBMS facilitates user registration, monitoring, data integrity enforcement, security management, and concurrency control.
Characteristics of DBMS:
• Digital Repository: DBMS utilizes a digital repository on a server to store and manage data.
• Logical View: It provides a clear and logical view of data manipulation processes.
• Backup and Recovery: DBMS includes automatic backup and recovery procedures to protect data from hardware
or software failures.
• ACID Properties: It maintains data integrity and consistency by following ACID (Atomicity, Consistency, Isolation,
Durability) properties.
• Data Reduction: DBMS reduces the complexity of data relationships by managing data in an organized manner.
• Data Security: DBMS provides mechanisms for data security, including user authentication, authorization, and
access control.
• Multiple Viewpoints: It allows users to view the database from different perspectives based on their requirements.
Advantages of DBMS:
• Data Redundancy Control: DBMS helps in controlling data redundancy by storing data in a centralized database file.
• Data Sharing: Authorized users within an organization can easily share data through the DBMS.
• Easy Maintenance: DBMS is designed for easy maintenance due to its centralized nature.
• Time Efficiency: It reduces development time and maintenance requirements.
• Backup and Recovery: DBMS provides backup and recovery subsystems that protect data against loss.
• Multiple User Interfaces: DBMS offers different user interfaces, including graphical interfaces and application program interfaces.
Disadvantages of DBMS:
• Cost of Hardware and Software: Running a DBMS may require high-speed processors and large memory sizes, which can be costly.
• Space Requirements: DBMS occupies significant disk space and memory for efficient operation.
• Complexity: Implementing and managing a database system introduces additional complexity and requirements.
• Impact of Failures: Database failures can have significant consequences, as data is often stored in a single database.
ACID Properties: ACID is an acronym that stands for Atomicity, Consistency, Isolation, and Durability. These
properties are fundamental principles that ensure the reliability and integrity of transactions in a database management system:
• Atomicity: Atomicity guarantees that a transaction is treated as a single, indivisible unit of work. It ensures that
either all the operations within a transaction are successfully completed, or none of them are. If any part of a
transaction fails, the entire transaction is rolled back, and the database returns to its previous state.
• Consistency: Consistency ensures that a transaction brings the database from one valid state to another. It enforces
integrity constraints, business rules, and data validation rules to maintain the overall correctness and validity of
the data. A transaction should not violate any defined rules or leave the database in an inconsistent state.
• Isolation: Isolation ensures that concurrent transactions do not interfere with each other. It guarantees that each
transaction is executed in isolation, as if it were the only transaction being processed. Isolation prevents issues like
data inconsistencies, lost updates, and conflicts that may arise when multiple transactions access and modify the same data at the same time.
• Durability: Durability guarantees that once a transaction is committed and its changes are saved, they are perma-
nent and will survive any subsequent failures, such as power outages or system crashes. The changes made by a
committed transaction are stored in a durable medium, usually disk storage, to ensure their persistence even in the
event of a failure.
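A small sketch with Python’s built-in sqlite3 module illustrates atomicity and durability in practice: the two account updates below are committed together or not at all. The accounts table and the amounts are made up for illustration.

import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance REAL)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("alice", 100.0), ("bob", 50.0)])
conn.commit()

try:
    # One logical transaction: move 30 from alice to bob.
    conn.execute("UPDATE accounts SET balance = balance - 30 WHERE name = 'alice'")
    conn.execute("UPDATE accounts SET balance = balance + 30 WHERE name = 'bob'")
    conn.commit()          # both updates become durable together
except sqlite3.Error:
    conn.rollback()        # on any failure, neither update is applied

print(conn.execute("SELECT * FROM accounts ORDER BY name").fetchall())
# [('alice', 70.0), ('bob', 80.0)]
conn.close()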
Relational Database Management System RDBMS stands for Relational Database Management System. It is a
type of database management system that is based on the relational model of data. In an RDBMS, data is organized
and stored in the form of tables, which consist of rows and columns. The relationships between these tables are defined through keys, such as primary and foreign keys.
Figure 4: A table in DBMS
Tables: Tables are the fundamental building blocks of a relational database. They are used to store and organize data in
a structured manner. A table consists of rows (also known as records or tuples) and columns (also known as attributes).
Each column represents a specific data attribute or field, while each row represents an individual data record or instance.
Tables are designed to hold related data entities, and the relationships between tables are established through keys.
Structured Query Language (SQL): SQL is the standard language used to interact with an RDBMS. It provides
a set of commands for creating, modifying, and querying databases. SQL allows you to define the structure of tables,
insert, update, and delete records, and retrieve data based on various conditions using queries.
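To tie these ideas together, the sketch below uses Python’s built-in sqlite3 module to define a table, insert records, and retrieve data with a conditional query. The students table and its contents are invented for illustration, and the same SQL would run, with minor dialect differences, on any relational database.

import sqlite3

conn = sqlite3.connect(":memory:")

# Define the structure of a table: columns are the attributes of each record.
conn.execute("""
    CREATE TABLE students (
        id    INTEGER PRIMARY KEY,
        name  TEXT NOT NULL,
        grade REAL
    )
""")

# Insert a few rows (records).
conn.executemany(
    "INSERT INTO students (id, name, grade) VALUES (?, ?, ?)",
    [(1, "Ann", 8.5), (2, "Ben", 6.0), (3, "Cato", 7.5)],
)
conn.commit()

# Retrieve data based on a condition.
query = "SELECT name, grade FROM students WHERE grade >= 7 ORDER BY grade DESC"
for row in conn.execute(query):
    print(row)  # ('Ann', 8.5) then ('Cato', 7.5)

conn.close()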