Introduction To Data Science
Contents
1 Handling large data on a single computer
2.1.1 Hadoop: a framework for storing and processing large data sets
What do you do when you are confronted with a deluge of data that seems to surpass your capabilities and your conventional techniques no longer suffice? Do you
surrender or adapt? Luckily, you’ve chosen the path of adaptation, as evidenced by your continued reading. In this
chapter, we will introduce you to techniques and tools that enable you to handle larger data sets, even when restricted
to a single computer, provided you adopt the right strategies. This unit aims to equip you with the necessary tools to
perform classifications and regressions on data sets that are too large to fit into your computer’s RAM (random access
memory). The chapter covers the following topics:
• Working with large data sets on a single computer: We will delve into the challenges associated with handling
substantial data sets and explore various approaches to address them effectively.
• Python libraries suitable for larger data sets: Discover and leverage powerful Python libraries specifically designed for working with larger data sets.
• The importance of choosing correct algorithms and data structures: Understand the significance of selecting appro-
priate algorithms and data structures that can effectively manage and process large data volumes.
• Adapting algorithms to work inside databases: Gain insights into adapting algorithms to operate seamlessly within databases.
• Applying general best practices: Learn from the experiences of data scientists and apply their general best practices when handling large data sets.
• Case studies: To provide practical context, we will present two case studies. The first case demonstrates how
to detect malicious URLs using the techniques and tools discussed. The second case illustrates how to build a
recommender engine inside a database, leveraging the concepts covered in this chapter.
Working with more data than your machine can comfortably handle brings a number of challenges:
• Overloaded memory and algorithms: Dealing with a large volume of data presents new challenges, such as exceeding
the computer’s available memory and algorithms that are not optimized for large data sets. This requires adapting both your algorithms and your way of working.
• I/O and CPU starvation: When analyzing large data sets, it’s essential to consider input/output (I/O) operations
and CPU utilization. These factors can cause speed issues during data processing, and careful management is needed to keep them from becoming bottlenecks.
• Memory limitations: Computers have a finite amount of RAM, and attempting to load more data into memory than
it can hold forces the operating system to swap memory blocks to disk, which is far slower than keeping all the
data in memory. Moreover, most algorithms are designed to load the entire data set into memory, resulting
in out-of-memory errors when the data is too large.
• Time constraints: Time is another crucial resource to consider when working with large data sets. Certain algorithms
do not account for time constraints and can run indefinitely. On the other hand, some algorithms struggle to
complete within a reasonable time frame, even when processing only a few megabytes of data.
• Bottlenecks in computer components: Dealing with large data sets can expose bottlenecks in different computer
components. While one system may be overwhelmed, others remain idle. This imbalance incurs a significant cost
in terms of both time and computing resources. For example, programs may experience CPU starvation due to
slow data retrieval from the hard drive, which is typically one of the slowest components in a computer.
• Introduction of solid-state drives (SSD): To address the slow data retrieval from traditional hard disk drives (HDD),
solid-state drives (SSD) were introduced. SSDs offer faster performance but are still more expensive than HDDs.
In short, the problems you face when handling large data sets fall into three categories:
1. Never-ending algorithms
2. Out-of-memory errors
3. Speed issues
Solutions: The solutions for handling large data sets can be categorized into three main areas: using the correct
algorithms, choosing the right data structures, and utilizing the appropriate tools. It’s important to note that these
solutions often address both memory limitations and computational performance, and there is no direct one-to-one
mapping between problems and solutions.
1. Use the correct algorithms:
• Select algorithms specifically designed to handle large data sets, as they are optimized for efficient memory usage.
• Consider algorithms that operate on smaller subsets of the data at a time, rather than loading the entire data
set into memory at once (a minimal sketch of this chunked approach follows the tips below).
• Look for algorithms that provide incremental or streaming processing capabilities, allowing data to be
processed piece by piece as it arrives.
2. Choose the right data structures:
• Opt for data structures that can efficiently store and manipulate large data sets. Use compressed data representations where appropriate.
• Consider data structures that support parallel processing or distributed computing, enabling efficient utilization
of computational resources.
3. Use the right tools:
• Leverage specialized software libraries and frameworks that are designed for working with large data sets.
• Utilize Python together with high-performance libraries such as NumPy, pandas, or Apache Spark (via PySpark).
• Explore database systems that can handle large-scale data and offer optimized querying capabilities.
• Consider utilizing hardware technologies like solid-state drives (SSDs) or distributed computing frameworks
to improve I/O performance and enable parallel processing.
4. General Tips:
• Keep in mind the trade-offs between memory usage and computational performance. Compression techniques
may reduce memory requirements but can impact processing speed. Consider the nature of your data and the
specific requirements of your analysis when selecting algorithms, data structures, and tools.
• Explore techniques for parallel processing and distributed computing to leverage the power of multiple machines.
• Continuously monitor and optimize the performance of your algorithms and data structures to achieve the best balance between speed and memory usage.
• By applying these techniques and considerations, you can effectively handle the challenges posed by large data sets, even on a single computer.
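To make the “work on subsets” idea concrete, here is a minimal sketch of chunked processing with pandas. The file name transactions.csv and its amount column are hypothetical; the point is simply that read_csv with a chunksize argument yields the data piece by piece instead of loading it all into RAM.

import pandas as pd

# Hypothetical input file; any CSV too large to load at once works the same way.
CSV_PATH = "transactions.csv"

total = 0.0
row_count = 0

# read_csv with chunksize returns an iterator of DataFrames instead of
# materializing the whole file in memory.
for chunk in pd.read_csv(CSV_PATH, chunksize=100_000):
    # Each chunk holds at most 100,000 rows, so memory use stays bounded.
    total += chunk["amount"].sum()
    row_count += len(chunk)

print(f"rows: {row_count}, mean amount: {total / row_count:.2f}")

The same pattern (read a chunk, update a running aggregate, discard the chunk) underlies most out-of-core and streaming algorithms.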
When a single machine is no longer enough, the storage and processing of data can be distributed over multiple
computers. Hadoop can scale up to thousands of computers, creating a cluster with petabytes of storage. This enables
the storage and analysis of data sets far larger than any single machine could ever hold.
2.1.1 Hadoop: a framework for storing and processing large data sets
Apache Hadoop is a framework that simplifies working with a cluster of computers. It aims to be all of the following:
• Reliable—By automatically creating multiple copies of the data and redeploying processing logic in case of failure.
• Scalable—Data and its processing are distributed over clusters of computers (horizontal scaling).
The core framework is composed of a distributed file system, a resource manager, and a system to run distributed
programs. In practice it allows you to work with the distributed file system almost as easily as with the local file system
of your home computer. But in the background, the data can be scattered among thousands of servers.
• Hadoop Distributed File System (HDFS): HDFS is the primary storage system in Hadoop. It is designed to store
and manage large volumes of data across multiple machines in a distributed manner. HDFS provides fault tolerance,
high throughput, and scalability, making it suitable for big data applications.
• MapReduce: MapReduce is a programming model and processing framework used for distributed processing of large
datasets in Hadoop. It divides data processing tasks into two stages: Map and Reduce. The Map stage processes
data in parallel across multiple nodes, and the Reduce stage aggregates the results. MapReduce allows for efficient distributed processing of very large data sets.
• Yet Another Resource Negotiator (YARN): YARN is a resource management framework in Hadoop. It enables
efficient allocation of cluster resources and manages the execution of MapReduce tasks or other data processing
frameworks. YARN provides a unified platform for running various data processing workloads, making Hadoop suitable for more than just MapReduce-style batch jobs.
In addition to these core components, Hadoop has an ecosystem of applications and frameworks built on top of it (see Figure 1). Some of the most notable are:
• Hive: Hive is a data warehousing infrastructure built on Hadoop. It provides a SQL-like language called HiveQL
to query and analyze data stored in Hadoop. Hive translates HiveQL queries into MapReduce jobs, allowing users to analyze data with familiar SQL syntax instead of writing MapReduce code by hand.
• HBase: HBase is a distributed NoSQL database that runs on top of Hadoop. It provides real-time read/write access
to large datasets and is designed to handle massive amounts of structured and semi-structured data. HBase is often used when fast, random read/write access to big data is required.
• Mahout: Mahout is a machine learning library for Hadoop. It offers a set of scalable algorithms and tools for
data mining, recommendation systems, clustering, and classification. Mahout allows users to perform large-scale
machine learning tasks on big data using the distributed processing capabilities of Hadoop.
These additional components and applications extend the functionality of Hadoop, enabling various data processing,
analysis, and machine learning tasks to be performed on large-scale datasets stored in Hadoop’s distributed file system.
Figure 1: A sample from the ecosystem of applications that arose around the Hadoop Core Framework
MapReduce: How Hadoop Achieves Parallelism MapReduce is a programming model and processing framework that
plays a key role in achieving parallelism in Hadoop. It enables the distributed processing of large datasets across a cluster of machines through the following steps:
• Data Partitioning: Before processing, the input data is divided into smaller chunks called input splits. Each input
split represents a portion of the dataset. Hadoop ensures that these input splits are stored and processed in parallel across different nodes of the cluster.
• Map Phase: In the Map phase, the processing tasks are executed in parallel on different nodes of the cluster.
Each node processes its assigned input split independently. The Map function takes the input data and produces
intermediate key-value pairs as output. These intermediate results are generated in parallel for each input split.
• Shuffle and Sort: The intermediate key-value pairs produced by the Map phase are then shuffled and sorted. The
keys are grouped and sent to the Reducer tasks based on their values. This step ensures that all the values with
the same key are processed by the same Reducer, enabling the aggregation and analysis of related data.
• Reduce Phase: In the Reduce phase, the Reducer tasks process the intermediate key-value pairs received from the
Map phase. Each Reducer processes its assigned key-value pairs independently, performing aggregation, summa-
rization, or any custom logic required for the data analysis. Reducers work in parallel, processing different keys
simultaneously.
• Output Generation: Finally, the output of the Reducer tasks is collected and combined to produce the final output
of the MapReduce job. This output can be stored in Hadoop Distributed File System (HDFS) or used for further
analysis or visualization.
By dividing the data into smaller input splits, processing them in parallel using the Map phase, and then aggregating
the results with the Reduce phase, Hadoop achieves parallelism and distributes the workload across multiple machines.
This parallel processing capability allows for efficient handling of large datasets and speeds up data processing tasks in
Hadoop clusters.
As the name suggests, the process roughly boils down to two big phases: Mapping phase—The
documents are split up into key-value pairs. Until we reduce, we can have many duplicates. Reduce phase—It’s not
unlike a SQL “group by.” The different unique occurrences are grouped together, and depending on the reducing function,
a different result can be created. Here we wanted a count per color, so that’s what the reduce function returns.
The whole process is described in the following six steps:
1. Reading the input files.
2. Passing each line of the input to a mapper job.
3. The mapper job parses the colors (keys) out of the file and outputs a file for each color with the number of times it has
been encountered (value). Or, more technically said, it maps a key (the color) to a value (the number of occurrences).
4. The keys get shuffled and sorted to facilitate the aggregation.
5. The reduce phase sums the number of occurrences per color and outputs one
file per key with the total number of occurrences for each color.
6. The keys are collected in an output file.
Figure 2: An example of a MapReduce flow for counting the colors in input texts
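To make these six steps concrete, here is a minimal, single-machine sketch of the same flow in plain Python. It only imitates the map, shuffle-and-sort, and reduce stages; a real Hadoop job would distribute exactly this logic across many nodes. The toy documents and colors are invented for illustration.

from collections import defaultdict

# Toy input: each "document" is a line of space-separated colors.
documents = [
    "green red blue red",
    "blue blue green",
    "red green",
]

# Map phase: emit a (color, 1) pair for every occurrence (duplicates allowed).
mapped = []
for line in documents:
    for color in line.split():
        mapped.append((color, 1))

# Shuffle-and-sort phase: group all values that belong to the same key.
grouped = defaultdict(list)
for color, count in sorted(mapped):
    grouped[color].append(count)

# Reduce phase: sum the occurrences per color, much like a SQL GROUP BY.
totals = {color: sum(counts) for color, counts in grouped.items()}
print(totals)  # {'blue': 3, 'green': 3, 'red': 3}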
Data scientists often do interactive analysis and rely on algorithms that are inherently iterative; it can take a while
until an algorithm converges to a solution. As this is a weak point of the MapReduce framework, we’ll introduce the
Spark framework to overcome it. Spark improves the performance on such tasks by an order of magnitude.
WHAT IS SPARK?
Apache Spark is an open-source, distributed data processing and analytics framework that provides a fast and general-
purpose computation engine for big data processing. It is designed to handle large-scale data processing tasks efficiently
and supports a wide range of data processing scenarios, including batch processing, real-time streaming, machine learning, and graph processing. Its key features and components include:
• In-Memory Processing: Spark leverages in-memory computing to store intermediate data in memory, which signifi-
cantly speeds up data processing compared to traditional disk-based processing systems. It minimizes the need for costly disk reads and writes between processing steps.
• Resilient Distributed Datasets (RDDs): RDDs are the fundamental data structure in Spark. They are immutable,
fault-tolerant, distributed collections of data that can be processed in parallel across a cluster. RDDs provide a
high-level API for performing transformations (e.g., map, filter, reduce) and actions (e.g., count, collect, save) on
distributed data.
• Spark SQL: Spark SQL is a module in Spark that provides a programming interface for working with structured
and semi-structured data using SQL-like queries. It allows users to query and manipulate data using SQL syntax
and supports integration with various data sources, including Hive, Avro, Parquet, and JDBC.
• Spark Streaming: Spark Streaming enables real-time processing of streaming data. It ingests data in real-time
from various sources such as Kafka, Flume, or TCP sockets, and processes it in micro-batches. This allows for near real-time analytics on continuous data streams.
• Machine Learning Library (MLlib): MLlib is a scalable machine learning library in Spark. It provides a wide
range of algorithms and tools for common machine learning tasks such as classification, regression, clustering, and
recommendation. MLlib leverages Spark’s distributed computing capabilities to perform machine learning tasks on
large datasets.
• Graph Processing (GraphX): GraphX is a graph processing library in Spark that enables graph computations and
analysis. It provides an API for building and manipulating graph structures and supports a variety of graph
algorithms, making it suitable for tasks such as social network analysis, page ranking, and graph-based recommen-
dations.
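To make the components above a little more concrete, here is a minimal Spark SQL sketch. It assumes pyspark is installed and runs locally; the table of page views, its column names, and the numbers are invented purely for illustration.

from pyspark.sql import SparkSession

# Start (or reuse) a local Spark session.
spark = SparkSession.builder.appName("spark-sql-demo").getOrCreate()

# A tiny, made-up data set of page views per country.
rows = [("BE", 120), ("NL", 340), ("BE", 80), ("FR", 200)]
df = spark.createDataFrame(rows, schema=["country", "views"])

# Register the DataFrame as a temporary view so it can be queried with SQL.
df.createOrReplaceTempView("pageviews")

# Spark SQL query: total views per country, ordered by the total.
result = spark.sql(
    "SELECT country, SUM(views) AS total_views "
    "FROM pageviews GROUP BY country ORDER BY total_views DESC"
)
result.show()

spark.stop()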
While we oversimplify things a bit for the sake of clarity, Spark creates a kind of shared RAM memory between the
computers of your cluster. This allows the different workers to share variables (and their state) and thus eliminates
the need to write the intermediate results to disk. More technically and more correctly, if you’re into that: Spark uses
Resilient Distributed Datasets (RDDs), which are a distributed memory abstraction that lets programmers perform in-
memory computations on large clusters in a fault-tolerant way. Because it’s an in-memory system, it avoids costly disk
operations.
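As a small illustration of the RDD API just described, the following sketch re-runs the earlier color-count example as a Spark job. The input lines are inlined for simplicity; in a real job the RDD would typically be created from a file on HDFS with sc.textFile.

from pyspark import SparkContext

# Local SparkContext using all available cores.
sc = SparkContext("local[*]", "rdd-color-count")

lines = sc.parallelize([
    "green red blue red",
    "blue blue green",
    "red green",
])

counts = (
    lines.flatMap(lambda line: line.split())   # map: one record per color
         .map(lambda color: (color, 1))        # emit (color, 1) pairs
         .reduceByKey(lambda a, b: a + b)      # reduce: sum per color
)

print(counts.collect())  # e.g. [('green', 3), ('red', 3), ('blue', 3)]
sc.stop()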
Apache Spark addresses some of the limitations of the traditional MapReduce framework and provides several im-
provements and optimizations that help overcome its drawbacks. Here are some ways Spark solves the problems of
MapReduce:
• In-Memory Computation: Spark leverages in-memory computing to store intermediate data and computations
in memory, rather than writing them to disk after each MapReduce stage. This significantly reduces disk I/O
operations and improves overall processing speed by allowing data to be accessed quickly from memory.
• DAG (Directed Acyclic Graph) Execution Model: Spark uses a DAG execution model, where it optimizes and
schedules a series of computational steps as a DAG of stages. This allows for efficient pipelining and data reuse
across multiple operations, reducing the overhead of multiple MapReduce jobs and enhancing performance.
• Resilient Distributed Datasets (RDDs): RDDs in Spark provide an efficient and fault-tolerant data abstraction that
allows for in-memory data processing across distributed nodes. RDDs enable iterative computations and caching
of intermediate data, leading to faster and more interactive data analysis compared to the disk-based nature of
MapReduce.
• Data Sharing across Multiple Workloads: Spark allows data to be shared across different workloads, such as batch
processing, interactive queries, and real-time streaming. This eliminates the need to reload data from external
storage systems for different processing tasks, resulting in improved performance and reduced latency.
• Wide Range of Libraries and APIs: Spark provides a rich ecosystem of libraries and APIs, including Spark SQL,
MLlib, GraphX, and Spark Streaming. These libraries offer high-level abstractions and optimized implementations
for common data processing tasks, such as SQL queries, machine learning, graph processing, and real-time streaming
analytics.
• Integration with External Data Sources: Spark integrates with various data sources and formats, including Hadoop
Distributed File System (HDFS), Apache Cassandra, Apache HBase, JDBC, Parquet, Avro, and more. This enables
seamless data integration and processing across diverse data systems, making it easier to work with existing data
infrastructures.
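To see why this matters for the iterative algorithms mentioned earlier, the sketch below caches an RDD and reuses it across ten passes. With cache() the data stays in memory between iterations instead of being re-read on every pass, which is exactly the per-job overhead a chain of MapReduce jobs would pay. The numeric data and the toy update rule are invented for illustration.

from pyspark import SparkContext

sc = SparkContext("local[*]", "iterative-demo")

# A made-up numeric data set; cache() keeps it in memory across iterations.
data = sc.parallelize(range(1, 100_001)).cache()

estimate = 0.0
for i in range(10):
    # Each pass reads the cached RDD from memory rather than rebuilding it,
    # avoiding the per-iteration cost of a MapReduce-style pipeline.
    mean = data.map(lambda x: float(x)).mean()
    estimate = 0.5 * (estimate + mean)  # toy iterative update

print(f"estimate after 10 iterations: {estimate:.2f}")
sc.stop()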
The Different Components of the Spark Ecosystem Spark Core provides a NoSQL environment well suited for
interactive, exploratory analysis. Spark can be run in batch and interactive mode and supports Python. Spark has four large components on top of Spark Core: Spark SQL, Spark Streaming, MLlib, and GraphX.
Database Management Systems A database management system (DBMS) is a platform for storing, organizing, retrieving, and manipulating data.
What Is a Database? A database is a structured collection of data that is organized, managed, and stored in a computer
system. It is designed to efficiently store, retrieve, manipulate, and analyze large amounts of data. A database acts as a
repository for storing different types of data, such as text, numbers, images, audio, and video.
In a database, data is organized into tables, which consist of rows and columns. Each row represents a record or a set
of related data, while each column represents a specific attribute or characteristic of the data. The tables in a database
are interconnected through relationships, allowing data to be linked and accessed in a meaningful way.
Databases provide a way to manage and control data effectively. They offer features such as data integrity (ensuring
the accuracy and consistency of data), data security (protecting data from unauthorized access), and data concurrency
(managing simultaneous access to data by multiple users). Databases also support query languages, such as SQL (Struc-
tured Query Language), which allow users to retrieve, update, and manipulate data using specific commands.
Databases are widely used in various industries and applications, including businesses, organizations, scientific re-
search, healthcare, finance, and more. They play a crucial role in storing and organizing large volumes of data, enabling
efficient access, analysis, and decision making. A DBMS typically provides the following functions:
• Data Definition: DBMS allows the creation, modification, and removal of data definitions that define the organization of the data in the database.
• Data Retrieval: Users can retrieve data from the database for various purposes using query and retrieval commands.
• User Administration: DBMS facilitates user registration, monitoring, data integrity enforcement, security management, and concurrency control.
Characteristics of DBMS:
• Digital Repository: DBMS utilizes a digital repository on a server to store and manage data.
• Logical View: It provides a clear and logical view of data manipulation processes.
• Backup and Recovery: DBMS includes automatic backup and recovery procedures to protect data from hardware
or software failures.
• ACID Properties: It maintains data integrity and consistency by following ACID (Atomicity, Consistency, Isolation,
Durability) properties.
• Data Reduction: DBMS reduces the complexity of data relationships by managing data in an organized manner.
• Data Security: DBMS provides mechanisms for data security, including user authentication, authorization, and
access control.
• Multiple Viewpoints: It allows users to view the database from different perspectives based on their requirements.
Advantages of DBMS:
• Data Redundancy Control: DBMS helps in controlling data redundancy by storing data in a centralized database file.
• Data Sharing: Authorized users within an organization can easily share data through the DBMS.
• Easy Maintenance: DBMS is designed for easy maintenance due to its centralized nature.
• Time Efficiency: It reduces development time and maintenance requirements.
• Backup and Recovery: DBMS provides backup and recovery subsystems that protect data against loss.
• Multiple User Interfaces: DBMS offers different user interfaces, including graphical interfaces and application program interfaces.
Disadvantages of DBMS:
• Cost of Hardware and Software: Running a DBMS may require high-speed processors and large memory sizes, which can be costly.
• Space Requirements: DBMS occupies significant disk space and memory for efficient operation.
• Complexity: Implementing and managing a database system introduces additional complexity and requirements.
• Impact of Failures: Database failures can have significant consequences, as data is often stored in a single database.
ACID Properties: ACID is an acronym that stands for Atomicity, Consistency, Isolation, and Durability. These
properties are fundamental principles that ensure the reliability and integrity of transactions in a database management system:
• Atomicity: Atomicity guarantees that a transaction is treated as a single, indivisible unit of work. It ensures that
either all the operations within a transaction are successfully completed, or none of them are. If any part of a
transaction fails, the entire transaction is rolled back, and the database returns to its previous state.
• Consistency: Consistency ensures that a transaction brings the database from one valid state to another. It enforces
integrity constraints, business rules, and data validation rules to maintain the overall correctness and validity of
the data. A transaction should not violate any defined rules or leave the database in an inconsistent state.
• Isolation: Isolation ensures that concurrent transactions do not interfere with each other. It guarantees that each
transaction is executed in isolation, as if it were the only transaction being processed. Isolation prevents issues like
data inconsistencies, lost updates, and conflicts that may arise when multiple transactions access and modify the same data at the same time.
• Durability: Durability guarantees that once a transaction is committed and its changes are saved, they are perma-
nent and will survive any subsequent failures, such as power outages or system crashes. The changes made by a
committed transaction are stored in a durable medium, usually disk storage, to ensure their persistence even in the
event of a failure.
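A small sketch with Python’s built-in sqlite3 module illustrates atomicity and durability in practice: the two account updates below are committed together or not at all. The accounts table and the amounts are made up for illustration.

import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance REAL)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("alice", 100.0), ("bob", 50.0)])
conn.commit()

try:
    # One logical transaction: move 30 from alice to bob.
    conn.execute("UPDATE accounts SET balance = balance - 30 WHERE name = 'alice'")
    conn.execute("UPDATE accounts SET balance = balance + 30 WHERE name = 'bob'")
    conn.commit()          # both updates become durable together
except sqlite3.Error:
    conn.rollback()        # on any failure, neither update is applied

print(conn.execute("SELECT * FROM accounts ORDER BY name").fetchall())
# [('alice', 70.0), ('bob', 80.0)]
conn.close()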
Relational Database Management System RDBMS stands for Relational Database Management System. It is a
type of database management system that is based on the relational model of data. In an RDBMS, data is organized
and stored in the form of tables, which consist of rows and columns. The relationships between these tables are defined through keys, such as primary and foreign keys.
Figure 4: A table in DBMS
Tables: Tables are the fundamental building blocks of a relational database. They are used to store and organize data in
a structured manner. A table consists of rows (also known as records or tuples) and columns (also known as attributes).
Each column represents a specific data attribute or field, while each row represents an individual data record or instance.
Tables are designed to hold related data entities, and the relationships between tables are established through keys.
Structured Query Language (SQL): SQL is the standard language used to interact with an RDBMS. It provides
a set of commands for creating, modifying, and querying databases. SQL allows you to define the structure of tables,
insert, update, and delete records, and retrieve data based on various conditions using queries.
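To tie these ideas together, the sketch below uses Python’s built-in sqlite3 module to define a table, insert records, and retrieve data with a conditional query. The students table and its contents are invented for illustration, and the same SQL would run, with minor dialect differences, on any relational database.

import sqlite3

conn = sqlite3.connect(":memory:")

# Define the structure of a table: columns are the attributes of each record.
conn.execute("""
    CREATE TABLE students (
        id    INTEGER PRIMARY KEY,
        name  TEXT NOT NULL,
        grade REAL
    )
""")

# Insert a few rows (records).
conn.executemany(
    "INSERT INTO students (id, name, grade) VALUES (?, ?, ?)",
    [(1, "Ann", 8.5), (2, "Ben", 6.0), (3, "Cato", 7.5)],
)
conn.commit()

# Retrieve data based on a condition.
query = "SELECT name, grade FROM students WHERE grade >= 7 ORDER BY grade DESC"
for row in conn.execute(query):
    print(row)  # ('Ann', 8.5) then ('Cato', 7.5)

conn.close()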