
Principles of Data Science

Unit 4: Handling large data


June 20, 2023

Contents
1 Handling large data on a single computer

1.1 The problems you face when handling large data

1.2 General techniques for handling large volumes of data

1.3 Case study 1: Predicting malicious URLs

1.4 Case study 2: Building a recommender system inside a database

2 First Steps in Big Data

2.1 Distributing data storage and processing with frameworks

2.1.1 Hadoop: a framework for storing and processing large data sets

2.1.2 Spark: replacing MapReduce for better performance

2.2 Case study: Assessing risk when loaning money

3 Database Management System and NoSQL

This unit covers


• Working with large data sets on a single computer

• Working with Python libraries suitable for larger data sets

• Understanding the importance of choosing correct algorithms and data structures

• Understanding how you can adapt algorithms to work inside databases


1 Handling large data on a single computer
In the era of big data, the abundance of information can sometimes overwhelm us. What if you find yourself facing

a deluge of data that seems to surpass your capabilities and your conventional techniques no longer suffice? Do you

surrender or adapt? Luckily, you’ve chosen the path of adaptation, as evidenced by your continued reading. In this

unit, we will introduce you to techniques and tools that enable you to handle larger data sets, even when restricted

to a single computer, provided you adopt the right strategies. This unit aims to equip you with the necessary tools to

perform classifications and regressions on data sets that are too large to fit into your computer's RAM (random access

memory).

Key Topics Covered:

• Working with large data sets on a single computer: We will delve into the challenges associated with handling

substantial data sets and explore various approaches to address them effectively.

• Python libraries suitable for larger data sets: Discover and leverage powerful Python libraries specifically designed

to handle extensive data sets efficiently.

• The importance of choosing correct algorithms and data structures: Understand the significance of selecting appro-

priate algorithms and data structures that can effectively manage and process large data volumes.

• Adapting algorithms to work inside databases: Gain insights into adapting algorithms to operate seamlessly within

databases, leveraging their inherent capabilities to handle large-scale data processing.

• Applying general best practices: Learn from the experiences of data scientists and apply their general best practices

to tackle the challenges posed by large data volumes.

• Case studies: To provide practical context, we will present two case studies. The first case demonstrates how

to detect malicious URLs using the techniques and tools discussed. The second case illustrates how to build a

recommender engine inside a database, leveraging the concepts covered in this chapter.

1.1 The problems you face when handling large data


Challenges of Handling Large Data Sets:

• Overloaded memory and algorithms: Dealing with a large volume of data presents new challenges, such as exceeding

the computer’s available memory and algorithms that are not optimized for large data sets. This requires adapting

and expanding your techniques to overcome these issues.

• I/O and CPU starvation: When analyzing large data sets, it’s essential to consider input/output (I/O) operations

and CPU utilization. These factors can cause speed issues during data processing, and careful management is

required to optimize performance.

• Memory limitations: Computers have a finite amount of RAM, and attempting to load more data into memory than

it can handle forces the operating system to swap memory blocks to disk. Swapping is far slower than working with

data held entirely in memory. Because most algorithms are designed to load the entire data set into memory, the

result is often an out-of-memory error (a short sketch for estimating memory needs follows this list).

• Time constraints: Time is another crucial resource to consider when working with large data sets. Some algorithms

take no account of running time and can effectively run forever, while others fail to complete within a reasonable

time frame even when processing only a few megabytes of data.

• Bottlenecks in computer components: Dealing with large data sets can expose bottlenecks in different computer

components. While one component is overwhelmed, others remain idle. This imbalance incurs a significant cost

in terms of both time and computing resources. For example, programs may experience CPU starvation due to

slow data retrieval from the hard drive, which is typically one of the slowest components in a computer.

• Introduction of solid-state drives (SSD): To address the slow data retrieval from traditional hard disk drives (HDD),

solid-state drives (SSDs) were introduced. SSDs offer much faster data access but are still more expensive than

HDDs, which remain in widespread use.
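
Before reaching for heavier tools, it helps to estimate whether a data set will actually fit in RAM. The following is a minimal sketch using pandas, assuming a hypothetical CSV file called data.csv; the file name and sample size are illustrative, not taken from the text above.

import os
import pandas as pd

PATH = "data.csv"        # hypothetical input file
SAMPLE_ROWS = 10_000     # size of the probe sample

# Load only a small sample and measure how much RAM it occupies.
sample = pd.read_csv(PATH, nrows=SAMPLE_ROWS)
bytes_per_row = sample.memory_usage(deep=True).sum() / len(sample)

# Count the data rows without loading them into memory.
with open(PATH) as f:
    total_rows = sum(1 for _ in f) - 1   # minus the header line

estimated_ram = bytes_per_row * total_rows
print(f"Estimated in-memory size: {estimated_ram / 1e9:.2f} GB")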

1.2 General techniques for handling large volumes of data


Problems:

1. Never-ending algorithms

2. Out-of-memory errors

3. Speed issues

Solutions: The solutions for handling large data sets can be categorized into three main areas: using the correct

algorithms, choosing the right data structure, and utilizing the appropriate tools. It’s important to note that these

solutions often address both memory limitations and computational performance, and there is no direct one-to-one

mapping between specific problems and solutions.

1. Choose the right algorithms:

• Select algorithms specifically designed to handle large data sets, as they are optimized for efficient memory

usage and processing.

• Consider algorithms that operate on smaller subsets of the data at a time, rather than loading the entire data

set into memory.

• Look for algorithms that can provide incremental or streaming processing capabilities, allowing data to be

processed in smaller chunks or on the fly (a minimal code sketch follows this list).

2. Choose the right data structures:

• Opt for data structures that can efficiently store and manipulate large data sets, such as sparse or compressed

representations that reduce memory requirements.

• Consider data structures that support parallel processing or distributed computing, enabling efficient utilization

of computational resources.

3. Use the right tools:

• Leverage specialized software libraries and frameworks that are designed for working with large data sets.

• Use Python together with high-performance libraries such as NumPy and pandas, or distributed engines such as

Apache Spark, for efficient data processing.

• Explore data stores that can handle large-scale data and offer optimized querying capabilities, such as MongoDB

or Apache Cassandra, as well as distributed frameworks such as Apache Hadoop.

• Consider utilizing hardware technologies like solid-state drives (SSDs) or distributed computing frameworks
to improve I/O performance and enable parallel processing.

4. General Tips:

• Keep in mind the trade-offs between memory usage and computational performance. Compression techniques

may reduce memory requirements but can impact processing speed. Consider the nature of your data and the

specific requirements of your analysis when selecting algorithms, data structures, and tools.

• Explore techniques for parallel processing and distributed computing to leverage the power of multiple machines

or cores for faster data processing.

• Continuously monitor and optimize the performance of your algorithms and data structures to achieve the

best balance between memory usage and computational efficiency.

By applying these techniques and considerations, you can effectively handle the challenges posed by large data

sets and improve the performance of your data analysis tasks.
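
To make these recommendations concrete, here is a minimal sketch that combines chunked reading with an incremental (out-of-core) learner: pandas streams the file in chunks and scikit-learn's SGDClassifier is updated with partial_fit, so the full data set never has to sit in RAM at once. The file name, column names, chunk size, and class labels are illustrative assumptions.

import pandas as pd
from sklearn.linear_model import SGDClassifier

CHUNK_SIZE = 100_000        # rows per chunk; tune to your available RAM
classes = [0, 1]            # assumed binary target labels
model = SGDClassifier()     # a linear classifier that supports online learning

# read_csv with chunksize returns an iterator of DataFrames instead of
# loading the whole file at once.
for chunk in pd.read_csv("large_file.csv", chunksize=CHUNK_SIZE):
    X = chunk.drop(columns=["target"])   # assumed numeric feature columns
    y = chunk["target"]                  # assumed target column
    # partial_fit updates the model with one chunk at a time.
    model.partial_fit(X, y, classes=classes)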

1.3 Case study 1: Predicting malicious URLs


To be done in Class separately

1.4 Case study 2: Building a recommender system inside a database


To be done in Class separately

2 First Steps in Big Data


2.1 Distributing data storage and processing with frameworks
New big data technologies such as Hadoop and Spark make it much easier to work with and control a cluster of

computers. Hadoop can scale up to thousands of computers, creating a cluster with petabytes of storage. This enables

businesses to grasp the value of the massive amount of data available.

2.1.1 Hadoop: a framework for storing and processing large data sets

Apache Hadoop is a framework that simplifies working with a cluster of computers. It aims to be all of the following

things and more:

• Reliable—By automatically creating multiple copies of the data and redeploying processing logic in case of failure.

• Fault tolerant—It detects faults and applies automatic recovery.

• Scalable—Data and its processing are distributed over clusters of computers (horizontal scaling).

• Portable—Installable on all kinds of hardware and operating systems.

The core framework is composed of a distributed file system, a resource manager, and a system to run distributed

programs. In practice it allows you to work with the distributed file system almost as easily as with the local file system

of your home computer. But in the background, the data can be scattered among thousands of servers.

The different components of Hadoop can be summarized as follows:

Figure 1: A sample from the ecosystem of applications that arose around the Hadoop Core Framework

• Hadoop Distributed File System (HDFS): HDFS is the primary storage system in Hadoop. It is designed to store

and manage large volumes of data across multiple machines in a distributed manner. HDFS provides fault tolerance,

high throughput, and scalability, making it suitable for big data applications.

• MapReduce: MapReduce is a programming model and processing framework used for distributed processing of large

datasets in Hadoop. It divides data processing tasks into two stages: Map and Reduce. The Map stage processes

data in parallel across multiple nodes, and the Reduce stage aggregates the results. MapReduce allows for efficient

processing of data across a cluster of computers.

• Yet Another Resource Negotiator (YARN): YARN is a resource management framework in Hadoop. It enables

efficient allocation of cluster resources and manages the execution of MapReduce tasks or other data processing

frameworks. YARN provides a unified platform for running various data processing workloads, making Hadoop

more versatile and flexible.

In addition to these core components, Hadoop has an ecosystem of applications and frameworks built on top of it. Some

notable examples are:

• Hive: Hive is a data warehousing infrastructure built on Hadoop. It provides a SQL-like language called HiveQL

to query and analyze data stored in Hadoop. Hive translates HiveQL queries into MapReduce jobs, allowing users

familiar with SQL to interact with big data stored in Hadoop.

• HBase: HBase is a distributed NoSQL database that runs on top of Hadoop. It provides real-time read/write access

to large datasets and is designed to handle massive amounts of structured and semi-structured data. HBase is often

used for random and low-latency read/write operations.

• Mahout: Mahout is a machine learning library for Hadoop. It offers a set of scalable algorithms and tools for

data mining, recommendation systems, clustering, and classification. Mahout allows users to perform large-scale

machine learning tasks on big data using the distributed processing capabilities of Hadoop.

These additional components and applications extend the functionality of Hadoop, enabling various data processing,

analysis, and machine learning tasks to be performed on large-scale datasets stored in Hadoop’s distributed file system.

MapReduce: How Hadoop Achieves Parallelism

MapReduce is a programming model and processing framework that plays a key role in achieving parallelism in Hadoop. It

enables the distributed processing of large datasets across a cluster of machines. Here's how Hadoop achieves parallelism

through MapReduce:

• Data Partitioning: Before processing, the input data is divided into smaller chunks called input splits. Each input

split represents a portion of the dataset. Hadoop ensures that these input splits are stored and processed in parallel

across multiple machines in the cluster.

• Map Phase: In the Map phase, the processing tasks are executed in parallel on different nodes of the cluster.

Each node processes its assigned input split independently. The Map function takes the input data and produces

intermediate key-value pairs as output. These intermediate results are generated in parallel for each input split.

• Shuffle and Sort: The intermediate key-value pairs produced by the Map phase are then shuffled and sorted. The

keys are grouped and sent to the Reducer tasks based on their values. This step ensures that all the values with

the same key are processed by the same Reducer, enabling the aggregation and analysis of related data.

• Reduce Phase: In the Reduce phase, the Reducer tasks process the intermediate key-value pairs received from the

Map phase. Each Reducer processes its assigned key-value pairs independently, performing aggregation, summa-

rization, or any custom logic required for the data analysis. Reducers work in parallel, processing different keys

simultaneously.

• Output Generation: Finally, the output of the Reducer tasks is collected and combined to produce the final output

of the MapReduce job. This output can be stored in Hadoop Distributed File System (HDFS) or used for further

analysis or visualization.

By dividing the data into smaller input splits, processing them in parallel using the Map phase, and then aggregating

the results with the Reduce phase, Hadoop achieves parallelism and distributes the workload across multiple machines.

This parallel processing capability allows for efficient handling of large datasets and speeds up data processing tasks in

Hadoop clusters.

As the name suggests, the process roughly boils down to two big phases:

• Mapping phase: The documents are split up into key-value pairs. Until we reduce, we can have many duplicates.

• Reduce phase: It's not unlike a SQL "group by". The different unique occurrences are grouped together, and depending

on the reducing function, a different result can be created. Here we wanted a count per color, so that's what the reduce

function returns.

The whole process is described in the following six steps:

1 Reading the input files.

2 Passing each line to a mapper job.

3 The mapper job parses the colors (keys) out of the file and outputs a file for each color with the number of times it has

been encountered (value). Or more technically said, it maps a key (the color) to a value (the number of occurrences).

4 The keys get shuffled and sorted to facilitate the aggregation.

5 The reduce phase sums the number of occurrences per color and outputs one

file per key with the total number of occurrences for each color.

6 The keys are collected in an output file.
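
The six steps can be imitated on a single machine with plain Python, which makes the flow easy to follow. The sketch below is not Hadoop code; it simply replays the map, shuffle-and-sort, and reduce phases on a few illustrative input lines.

from itertools import groupby
from operator import itemgetter

lines = ["green red blue", "red red green", "blue green green"]

# Map phase: emit a (key, value) pair for every color occurrence.
mapped = [(color, 1) for line in lines for color in line.split()]

# Shuffle and sort: bring all pairs with the same key together.
mapped.sort(key=itemgetter(0))

# Reduce phase: sum the values for each key.
reduced = {color: sum(count for _, count in pairs)
           for color, pairs in groupby(mapped, key=itemgetter(0))}

print(reduced)   # {'blue': 2, 'green': 4, 'red': 3}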

2.1.2 Spark: replacing MapReduce for better performance

Data scientists often do interactive analysis and rely on algorithms that are inherently iterative; it can take a while

until an algorithm converges to a solution. As this is a weak point of the MapReduce framework, we'll introduce the

Spark framework to overcome it. Spark improves the performance on such tasks by an order of magnitude.

Figure 2: An example of a MapReduce flow for counting the colors in input texts

WHAT IS SPARK?

Apache Spark is an open-source, distributed data processing and analytics framework that provides a fast and general-

purpose computation engine for big data processing. It is designed to handle large-scale data processing tasks efficiently

and supports a wide range of data processing scenarios, including batch processing, real-time streaming, machine learning,

and graph processing.

Spark Key Features

• In-Memory Processing: Spark leverages in-memory computing to store intermediate data in memory, which signifi-

cantly speeds up data processing compared to traditional disk-based processing systems. It minimizes the need for

data movement between disk and memory, resulting in faster computations.

• Resilient Distributed Datasets (RDDs): RDDs are the fundamental data structure in Spark. They are immutable,

fault-tolerant, distributed collections of data that can be processed in parallel across a cluster. RDDs provide a

high-level API for performing transformations (e.g., map, filter, reduce) and actions (e.g., count, collect, save) on

distributed data.

• Spark SQL: Spark SQL is a module in Spark that provides a programming interface for working with structured

and semi-structured data using SQL-like queries. It allows users to query and manipulate data using SQL syntax

and supports integration with various data sources, including Hive, Avro, Parquet, and JDBC.

• Spark Streaming: Spark Streaming enables real-time processing of streaming data. It ingests data in real-time

from various sources such as Kafka, Flume, or TCP sockets, and processes it in micro-batches. This allows for

near-real-time analytics and processing of continuous data streams.

• Machine Learning Library (MLlib): MLlib is a scalable machine learning library in Spark. It provides a wide

range of algorithms and tools for common machine learning tasks such as classification, regression, clustering, and

recommendation. MLlib leverages Spark’s distributed computing capabilities to perform machine learning tasks on

large datasets.

• Graph Processing (GraphX): GraphX is a graph processing library in Spark that enables graph computations and
analysis. It provides an API for building and manipulating graph structures and supports a variety of graph

algorithms, making it suitable for tasks such as social network analysis, page ranking, and graph-based recommen-

dations.
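
As a small illustration of RDD transformations and actions, the PySpark sketch below counts colors in a text file and caches the result in memory. It assumes a local Spark installation (for example via pip install pyspark); the file name colors.txt is an illustrative assumption.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-demo").master("local[*]").getOrCreate()
sc = spark.sparkContext

lines = sc.textFile("colors.txt")                   # lazily defines the RDD
counts = (lines.flatMap(lambda line: line.split())  # transformation
               .map(lambda color: (color, 1))       # transformation
               .reduceByKey(lambda a, b: a + b)     # transformation
               .cache())                            # keep the result in memory

print(counts.count())     # action: number of distinct colors
print(counts.collect())   # action: bring the (color, count) pairs to the driver

spark.stop()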

HOW DOES SPARK SOLVE THE PROBLEMS OF MAPREDUCE?

To oversimplify a bit for the sake of clarity: Spark creates a kind of shared memory between the

computers of your cluster. This allows the different workers to share variables (and their state) and thus eliminates

the need to write intermediate results to disk. More technically and more correctly: Spark uses

Resilient Distributed Datasets (RDDs), which are a distributed memory abstraction that lets programmers perform in-

memory computations on large clusters in a fault-tolerant way. Because it's an in-memory system, it avoids costly disk

operations.

Apache Spark addresses some of the limitations of the traditional MapReduce framework and provides several im-

provements and optimizations that help overcome its drawbacks. Here are some ways Spark solves the problems of

MapReduce:

• In-Memory Computation: Spark leverages in-memory computing to store intermediate data and computations

in memory, rather than writing them to disk after each MapReduce stage. This significantly reduces disk I/O

operations and improves overall processing speed by allowing data to be accessed quickly from memory.

• DAG (Directed Acyclic Graph) Execution Model: Spark uses a DAG execution model, where it optimizes and

schedules a series of computational steps as a DAG of stages. This allows for efficient pipelining and data reuse

across multiple operations, reducing the overhead of multiple MapReduce jobs and enhancing performance.

• Resilient Distributed Datasets (RDDs): RDDs in Spark provide an efficient and fault-tolerant data abstraction that

allows for in-memory data processing across distributed nodes. RDDs enable iterative computations and caching

of intermediate data, leading to faster and more interactive data analysis compared to the disk-based nature of

MapReduce.

• Data Sharing across Multiple Workloads: Spark allows data to be shared across different workloads, such as batch

processing, interactive queries, and real-time streaming. This eliminates the need to reload data from external

storage systems for different processing tasks, resulting in improved performance and reduced latency.

• Wide Range of Libraries and APIs: Spark provides a rich ecosystem of libraries and APIs, including Spark SQL,

MLlib, GraphX, and Spark Streaming. These libraries offer high-level abstractions and optimized implementations

for common data processing tasks, such as SQL queries, machine learning, graph processing, and real-time streaming

analytics.

• Integration with External Data Sources: Spark integrates with various data sources and formats, including Hadoop

Distributed File System (HDFS), Apache Cassandra, Apache HBase, JDBC, Parquet, Avro, and more. This enables

seamless data integration and processing across diverse data systems, making it easier to work with existing data

infrastructures.
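
The sketch below hints at why in-memory caching matters for iterative work: the same RDD is touched on every pass of the loop, and cache() keeps it in cluster memory rather than recomputing it from the source each time. The data and the toy update rule are purely illustrative.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iterative-demo").master("local[*]").getOrCreate()
sc = spark.sparkContext

# The RDD is reused on every iteration, so caching it avoids repeated recomputation.
data = sc.parallelize(range(1, 1_000_001)).map(float).cache()

guess = 0.0
for _ in range(10):                   # an iterative algorithm touches the same
    mean = data.mean()                # data set over and over again
    guess = 0.5 * guess + 0.5 * mean  # toy update rule

print(guess)
spark.stop()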

The Different Components of the Spark Ecosystem

Spark Core provides a NoSQL environment well suited for interactive,

exploratory analysis. Spark can be run in batch and interactive mode and supports Python. Spark has four other large

components, as listed below.

1 Spark Streaming is a tool for real-time analysis.



2 Spark SQL provides a SQL interface to work with Spark (a short sketch follows this list).

3 MLlib is a tool for machine learning inside the Spark framework.

4 GraphX is a graph-processing engine for Spark.
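
The following short Spark SQL sketch registers a small DataFrame as a temporary view and queries it with ordinary SQL; the column names and rows are illustrative assumptions.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-sql-demo").master("local[*]").getOrCreate()

# A tiny DataFrame standing in for a real table of data.
df = spark.createDataFrame(
    [("alice", 34), ("bob", 45), ("carol", 29)],
    ["name", "age"],
)
df.createOrReplaceTempView("people")

# Query the view with plain SQL through the Spark SQL interface.
adults = spark.sql("SELECT name, age FROM people WHERE age >= 30 ORDER BY age")
adults.show()

spark.stop()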

2.2 Case study: Assessing risk when loaning money


Will be discussed in Class

3 Database Management System and NoSQL


A Database Management System (DBMS) is software that allows the efficient management of databases. It provides

a platform for storing, organizing, retrieving, and manipulating data. This section covers the basic concepts of database

management systems, their key properties, and the relational model.

What is a Database? A database is a structured collection of data that is organized, managed, and stored in a computer

system. It is designed to efficiently store, retrieve, manipulate, and analyze large amounts of data. A database acts as a

repository for storing different types of data, such as text, numbers, images, audio, and video.

In a database, data is organized into tables, which consist of rows and columns. Each row represents a record or a set

of related data, while each column represents a specific attribute or characteristic of the data. The tables in a database

are interconnected through relationships, allowing data to be linked and accessed in a meaningful way.

Databases provide a way to manage and control data effectively. They offer features such as data integrity (ensuring

the accuracy and consistency of data), data security (protecting data from unauthorized access), and data concurrency

(managing simultaneous access to data by multiple users). Databases also support query languages, such as SQL (Struc-

tured Query Language), which allow users to retrieve, update, and manipulate data using specific commands.

Databases are widely used in various industries and applications, including businesses, organizations, scientific re-

search, healthcare, finance, and more. They play a crucial role in storing and organizing large volumes of data, enabling

efficient data management, decision-making, and analysis.

Key Features of DBMS:

• Data Definition: DBMS allows the creation, modification, and removal of data definitions that define the organiza-

tion of data within the database.


• Data Manipulation: It enables insertion, modification, and deletion of actual data within the database.

• Data Retrieval: Users can retrieve data from the database for various purposes using query and retrieval commands.

• User Administration: DBMS facilitates user registration, monitoring, data integrity enforcement, security manage-

ment, concurrency control, performance monitoring, and recovery from failures.

Characteristics of DBMS:

• Digital Repository: DBMS utilizes a digital repository on a server to store and manage data.

• Logical View: It provides a clear and logical view of data manipulation processes.

• Backup and Recovery: DBMS includes automatic backup and recovery procedures to protect data from hardware

or software failures.

• ACID Properties: It maintains data integrity and consistency by following ACID (Atomicity, Consistency, Isolation,

Durability) properties.

• Data Reduction: DBMS reduces the complexity of data relationships by managing data in an organized manner.

• Data Manipulation Support: It supports efficient data manipulation and processing.

• Data Security: DBMS provides mechanisms for data security, including user authentication, authorization, and

access control.

• Multiple Viewpoints: It allows users to view the database from different perspectives based on their requirements.

Advantages of DBMS:

• Data Redundancy Control: DBMS helps in controlling data redundancy by storing data in a centralized database

file.

• Data Sharing: Authorized users within an organization can easily share data through the DBMS.

• Easy Maintenance: DBMS is designed for easy maintenance due to its centralized nature.

• Time Efficiency: It reduces development time and maintenance requirements.

• Backup and Recovery: DBMS provides backup and recovery subsystems to safeguard data from failures.

• Multiple User Interfaces: DBMS offers different user interfaces, including graphical interfaces and application

program interfaces (APIs).

Disadvantages of DBMS:

• Cost of Hardware and Software: Running a DBMS may require high-speed processors and large memory sizes, which

can be costly.

• Space Requirements: DBMS occupies significant disk space and memory for efficient operation.

• Complexity: Implementing and managing a database system introduces additional complexity and requirements.

• Impact of Failures: Database failures can have significant consequences, as data is often stored in a single database.

Power outages or database corruption can result in permanent data loss.

ACID Properties: ACID is an acronym that stands for Atomicity, Consistency, Isolation, and Durability. These

properties are fundamental principles that ensure the reliability and integrity of transactions in a database management

system (DBMS). Let’s explore each of these properties:

• Atomicity: Atomicity guarantees that a transaction is treated as a single, indivisible unit of work. It ensures that

either all the operations within a transaction are successfully completed, or none of them are. If any part of a

transaction fails, the entire transaction is rolled back, and the database returns to its previous state.

• Consistency: Consistency ensures that a transaction brings the database from one valid state to another. It enforces

integrity constraints, business rules, and data validation rules to maintain the overall correctness and validity of

the data. A transaction should not violate any defined rules or leave the database in an inconsistent state.

• Isolation: Isolation ensures that concurrent transactions do not interfere with each other. It guarantees that each

transaction is executed in isolation, as if it were the only transaction being processed. Isolation prevents issues like

data inconsistencies, lost updates, and conflicts that may arise when multiple transactions access and modify the

same data simultaneously.

• Durability: Durability guarantees that once a transaction is committed and its changes are saved, they are perma-

nent and will survive any subsequent failures, such as power outages or system crashes. The changes made by a

committed transaction are stored in a durable medium, usually disk storage, to ensure their persistence even in the

event of a failure.
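
Atomicity is easy to demonstrate with Python's built-in sqlite3 module: either both legs of a transfer are committed, or the whole transaction is rolled back. The table layout, the account data, and the simulated failure are illustrative assumptions.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance REAL)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("alice", 100.0), ("bob", 50.0)])
conn.commit()

def transfer(src, dst, amount, fail=False):
    """Move money between two accounts as one atomic transaction."""
    try:
        conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?",
                     (amount, src))
        if fail:                                   # simulate a crash mid-transfer
            raise RuntimeError("crash before the second update")
        conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?",
                     (amount, dst))
        conn.commit()                              # both updates become durable
    except Exception:
        conn.rollback()                            # neither update is applied

transfer("alice", "bob", 70.0, fail=True)
print(conn.execute("SELECT * FROM accounts ORDER BY name").fetchall())
# [('alice', 100.0), ('bob', 50.0)] -- the partial transfer was undone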

Relational Database Management System: RDBMS stands for Relational Database Management System. It is a

type of database management system that is based on the relational model of data. In an RDBMS, data is organized

and stored in the form of tables, which consist of rows and columns. The relationships between these tables are defined

through keys, allowing for the establishment of associations between data.

Tables: Tables are the fundamental building blocks of a relational database. They are used to store and organize data in

a structured manner. A table consists of rows (also known as records or tuples) and columns (also known as attributes).

Each column represents a specific data attribute or field, while each row represents an individual data record or instance.

Tables are designed to hold related data entities, and the relationships between tables are established through keys.

Structured Query Language (SQL): SQL is the standard language used to interact with an RDBMS. It provides

a set of commands for creating, modifying, and querying databases. SQL allows you to define the structure of tables,

insert, update, and delete records, and retrieve data based on various conditions using queries.
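
To tie these SQL operations to something runnable, here is a compact sketch using Python's built-in sqlite3 module; the table layout and sample rows are illustrative assumptions.

import sqlite3

conn = sqlite3.connect(":memory:")

# Data definition: create a table with columns (attributes).
conn.execute("""
    CREATE TABLE students (
        id    INTEGER PRIMARY KEY,
        name  TEXT NOT NULL,
        marks REAL
    )
""")

# Data manipulation: insert, update, and delete records (rows).
conn.executemany("INSERT INTO students (name, marks) VALUES (?, ?)",
                 [("Asha", 82.5), ("Ravi", 67.0), ("Meena", 91.0)])
conn.execute("UPDATE students SET marks = 70.0 WHERE name = 'Ravi'")
conn.execute("DELETE FROM students WHERE marks < 75.0")
conn.commit()

# Data retrieval: query the rows that satisfy a condition.
for row in conn.execute("SELECT name, marks FROM students ORDER BY marks DESC"):
    print(row)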
