Distributed Databases: Daniel Marcous


What? A distributed database is a database in which storage devices
are not all attached to a common processing unit such as the CPU,
controlled by a distributed database management system.
Introduction
Definitions
Understanding the vocabulary
● RDBMS - Relational Database Management System

● DDB - Distributed Database

● Node - a unit in a distributed system (mainly a single server)

● DDBMS - Distributed Database Management System

○ In charge of managing the different DDB nodes as one integrated system

● Centralized System - data is stored in one place

● Homogeneous system - built of parts (nodes) that all act the same way / consist of
the same hardware (opposite of heterogeneous).
Basic Concepts
Distributed Database Concepts
● Number of processing elements (database nodes)

● Connection between nodes over a computer network

● Logical interrelation between different database nodes

● Absence of node homogeneity


Types of Distributed Databases
Multiprocessing Systems
● Parallel Systems

○ Shared Memory (tightly coupled) - multiple processors share the same main memory

○ Shared Disk (loosely coupled) - multiple processors share the same secondary disk storage

● Truly Distributed Systems

○ Shared Nothing - each processor with its own memory and disk,
interrelations are only through network (no SPOF)
Classification of Distributed Systems
● Distribution - Data and software distributed over multiple nodes

● Autonomy - Provision DBMS as one whole VS multiple standalone DBMSs

● Heterogeneity - use of different software / hardware on different nodes


Why? Reasons for choosing a distributed
database over a “plain” centralized
database.
The power of distribution
Advantages
Performance
● More computing power

○ CPU

○ Memory

○ Storage

○ Network bandwidth

● Parallelism

○ Inter-query

○ Intra-query
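The two kinds of parallelism listed above can be sketched roughly as follows. This is a minimal illustration, not how any particular DDBMS schedules work: the `PARTITIONS` data, the function names, and the use of a thread pool in place of real nodes are all assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data: one table split into three partitions, one per node.
PARTITIONS = [[3, 8, 1], [7, 2], [9, 4, 6]]


def partial_sum(partition):
    """Work done locally on one node for its own partition."""
    return sum(partition)


def intra_query_sum(pool):
    """Intra-query parallelism: ONE query is split into per-partition tasks."""
    return sum(pool.map(partial_sum, PARTITIONS))


def count_rows():
    """An independent query: total number of rows."""
    return sum(len(p) for p in PARTITIONS)


def max_value():
    """Another independent query: largest value in the table."""
    return max(max(p) for p in PARTITIONS)


with ThreadPoolExecutor(max_workers=4) as pool:
    total = intra_query_sum(pool)
    # Inter-query parallelism: independent queries are submitted together
    # and may run at the same time.
    f_count = pool.submit(count_rows)
    f_max = pool.submit(max_value)
    row_count, largest = f_count.result(), f_max.result()
```

Intra-query parallelism shortens a single heavy query, while inter-query parallelism raises the number of queries served at once.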
Ease of use / development
● Transparency

● Geographically distributed sites

● Backups

● Elasticity

○ Growing

○ Shrinking
Challenges
Management Challenges
● Transparency - One software (Ring) to rule them all

○ Management - one command

○ Data - one query

● Autonomy - Degree of Independence

○ Different settings / configurations / Cache size

○ “Master” node / Master Election

● Keeping track of data distribution

○ which server has the table / partition I need?
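Keeping track of data distribution usually comes down to some form of catalog lookup. The sketch below assumes a simple in-memory mapping; the `CATALOG` contents, node names, and function names are hypothetical and only illustrate the question "which server has the table / partition I need?".

```python
# Hypothetical catalog: (table, partition) -> node that stores it.
CATALOG = {
    ("rides", "2016-01"): "node-a",
    ("rides", "2016-02"): "node-b",
    ("users", "all"): "node-c",
}


def locate(table, partition):
    """Return the node holding the requested partition, or raise KeyError."""
    try:
        return CATALOG[(table, partition)]
    except KeyError:
        raise KeyError(f"no node registered for {table}/{partition}") from None


def nodes_for_table(table):
    """All nodes a full-table query would have to touch."""
    return sorted({node for (t, _), node in CATALOG.items() if t == table})
```

A query on a single partition touches one node, while a full scan of `rides` has to fan out to every node returned by `nodes_for_table("rides")`.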


Complex Features Implementation
● Reliability - Probability of failures

○ Does one server failure affect the whole system? (“Freeze”)

● Availability - Percent of time when a data source is available

○ If a node goes down, does its data get lost? Is it unavailable until the node is up again?

● Recovery

○ What is a single point of time?

○ Node clock synchronisation (NTP)

● Transaction Management - Server X must assure that the data is “safe” and no
Scaling

● Synchronisation Overhead
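The availability definition above (percent of time a data source is available) can be turned into a small calculation. The second function is an added assumption, not something stated on the slides: it treats replica failures as independent, which real clusters only approximate.

```python
def availability_percent(uptime_hours, total_hours):
    """Availability as the percent of time the data source was reachable."""
    return 100.0 * uptime_hours / total_hours


def replicated_availability(node_availability, replicas):
    """Probability that at least one replica is up.

    Assumes every replica fails independently with the same probability,
    which is a simplification (shared racks, networks, and bugs correlate).
    """
    return 1.0 - (1.0 - node_availability) ** replicas
```

Under that independence assumption, three copies on nodes that are each up 99% of the time leave the data unreachable only when all three are down at once.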
CAP Theorem
● Eric Brewer (Berkeley->Yahoo->Google)

○ C - a read sees all previously completed writes

○ A - reads and writes always succeed

○ P - read and write while network is down

● Choose 2! (2000)

● Sorry, actually only C or A… (2012)
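The "only C or A" remark can be illustrated with a toy replica that is cut off from its peers. This is a deliberately simplified sketch: the `Replica` class, its `mode` flag, and the `partitioned` switch are hypothetical and do not model any real database.

```python
class Replica:
    """A single replica that may lose contact with the rest of the cluster."""

    def __init__(self, mode):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.value = "v1"         # last value this replica knows about
        self.partitioned = False  # True while cut off from its peers

    def read(self):
        if self.partitioned and self.mode == "CP":
            # Consistency first: refuse rather than risk a stale answer.
            raise RuntimeError("unavailable: cannot confirm the latest write")
        # Availability first: answer, even though the value may be stale.
        return self.value

    def write(self, value):
        if self.partitioned and self.mode == "CP":
            raise RuntimeError("unavailable: cannot replicate the write")
        self.value = value
```

While the network is healthy both modes behave the same; the choice between C and A only shows up once `partitioned` is true.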


How? How does a distributed database work?
Internals
● Advanced Concepts

● Architectures
Advanced Concepts
Replication
● Assumptions

○ Nodes will fail

○ Commodity Hardware - prone to failure

● Settings

○ Replication Factor

○ Data / Actions / Apply logs

○ Synchronous / Asynchronous

○ Delay
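The replication settings above (replication factor, synchronous versus asynchronous, delay) can be sketched with a toy in-memory cluster. Everything here is an assumption for illustration: the `Cluster` class, the byte-sum placement rule, and the `pending` queue standing in for replication delay.

```python
class Cluster:
    """Toy cluster that stores each key on `replication_factor` nodes."""

    def __init__(self, node_names, replication_factor):
        if not 1 <= replication_factor <= len(node_names):
            raise ValueError("replication factor must fit the node count")
        self.nodes = {name: {} for name in node_names}
        self.rf = replication_factor
        self.pending = []  # replica writes not yet applied (async delay)

    def replicas_for(self, key):
        """Pick `rf` consecutive nodes, starting from a deterministic offset."""
        names = sorted(self.nodes)
        start = sum(key.encode("utf-8")) % len(names)
        return [names[(start + i) % len(names)] for i in range(self.rf)]

    def write(self, key, value, synchronous=True):
        primary, *others = self.replicas_for(key)
        self.nodes[primary][key] = value
        for name in others:
            if synchronous:
                # Synchronous: every replica has the value before we return.
                self.nodes[name][key] = value
            else:
                # Asynchronous: return now, replicate later.
                self.pending.append((name, key, value))

    def flush(self):
        """Apply queued replica writes, as if the delay had elapsed."""
        for name, key, value in self.pending:
            self.nodes[name][key] = value
        self.pending.clear()

    def holders(self, key):
        """Names of the nodes that currently store the key."""
        return sorted(n for n, data in self.nodes.items() if key in data)
```

With asynchronous replication a node failure before `flush()` would lose the only copy of a fresh write, which is the price paid for the lower write latency.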
Fragmentation
● Dividing a single Data Object (Table/ File) into multiple parts

● Types

○ Horizontal - row wise

○ Vertical - column wise (Vertica/ Parquet)

○ Hybrid - both

● Advantages

○ Reports on part of the data - horizontal

○ Increased parallelism - multiple physical files
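Horizontal and vertical fragmentation can be shown on a tiny table held as a list of dictionaries. The `ROWS` data, column names, and function names below are made up for the example.

```python
# Hypothetical table: one dict per row.
ROWS = [
    {"id": 1, "city": "TLV", "fare": 30},
    {"id": 2, "city": "NYC", "fare": 55},
    {"id": 3, "city": "TLV", "fare": 18},
]


def horizontal(rows, column):
    """Row-wise split: each fragment holds whole rows sharing a column value."""
    fragments = {}
    for row in rows:
        fragments.setdefault(row[column], []).append(row)
    return fragments


def vertical(rows, key, column_groups):
    """Column-wise split: each fragment holds the key plus a subset of columns."""
    return [
        [{c: row[c] for c in (key, *group)} for row in rows]
        for group in column_groups
    ]
```

A report on TLV rides only needs the `"TLV"` horizontal fragment, and a fare-only aggregate only needs the vertical fragment that carries `fare`. Repeating the key in every vertical fragment is what lets the original rows be joined back together.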


Distributed Processing
● Access by key Only!

○ Using Hash Tables

■ keys are hashed and spread (=sharded) across nodes

■ result of hash tells you which node to access

■ Hash maps exist on every node / client

● Batch Processing

○ MapReduce

■ Map - partition by key
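Key-based access and the map step both rest on the same idea: hash the key, and let the result pick the node. The sketch below is one simple way to do that, assuming a fixed list of hypothetical node names and plain modulo placement. It uses `hashlib` rather than Python's built-in `hash()`, because the latter is randomised per process for strings and every node and client must agree on the result.

```python
import hashlib

NODES = ["node-0", "node-1", "node-2"]  # hypothetical node names


def node_for(key, nodes=NODES):
    """Hash the key and map the result onto one of the nodes."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return nodes[int.from_bytes(digest[:8], "big") % len(nodes)]


def map_partition(records, nodes=NODES):
    """Map step: group (key, value) records by the node that owns each key."""
    buckets = {name: [] for name in nodes}
    for key, value in records:
        buckets[node_for(key, nodes)].append((key, value))
    return buckets
```

Because `node_for` is a pure function of the key and the node list, any client holding the same list can route a request without asking a coordinator. Plain modulo placement moves most keys when the node list changes, so this is a sketch of the routing idea rather than of a production sharding scheme.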


Data Locality
● Local storage (VS centralised storage controller)

○ Bring the processing to the data

○ Free bandwidth

● Smart Load Balancing

○ Route users to the “closest” node with the data (replication duh..)

● Data sorted by Key /Hash Key

○ Same / Close enough key = Same node

○ “Process” all the rides in the TLV area
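The "same / close enough key = same node" idea can be sketched with range partitioning over sorted keys. The key format (`"<area>:<ride id>"`), the split points, and the node names are assumptions chosen to match the TLV rides example.

```python
import bisect

# Hypothetical split points over sorted keys of the form "<area>:<ride id>".
# Keys below "HFA" go to node-0, keys from "HFA" up to "NYC" to node-1, etc.
SPLIT_POINTS = ["HFA", "NYC", "TLV"]
NODES = ["node-0", "node-1", "node-2", "node-3"]


def node_for(key):
    """Find the key's range with a binary search and return its node."""
    return NODES[bisect.bisect_right(SPLIT_POINTS, key)]


def rides_in_area(area, keys):
    """Group an area's ride keys by node; nearby keys land on the same node."""
    by_node = {}
    for key in keys:
        if key.startswith(area + ":"):
            by_node.setdefault(node_for(key), []).append(key)
    return by_node
```

Since every key starting with `TLV:` falls in the same range, processing all the rides in the TLV area can run on a single node, next to the data, without moving rows across the network.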


ACID
● Atomicity

○ Transactions

● Consistency

○ Locked until done

● Isolation

○ No interference

● Durability

○ Completed = Persistent

BASE
● Basic Availability

○ Response to every request

● Soft State

○ States change, results are not deterministic

● Eventual Consistency

○ Consistent state may take time but is promised

○ (CAS - Compare & Swap operations exist)
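A compare-and-swap (CAS) operation updates a value only if it still equals what the caller last read, which gives safe updates without holding a long lock. The sketch below simulates CAS on a single in-process register; the `Register` class and the `increment` helper are hypothetical, and a real store would expose its own CAS call.

```python
import threading


class Register:
    """A single value with an atomic compare-and-swap operation."""

    def __init__(self, value):
        self._value = value
        self._lock = threading.Lock()  # stands in for the store's atomicity

    def get(self):
        with self._lock:
            return self._value

    def compare_and_swap(self, expected, new):
        """Set the value to `new` only if it still equals `expected`."""
        with self._lock:
            if self._value != expected:
                return False  # someone else changed it first
            self._value = new
            return True


def increment(register):
    """Optimistic update: read, compute, and retry until the CAS succeeds."""
    while True:
        current = register.get()
        if register.compare_and_swap(current, current + 1):
            return current + 1
```

The retry loop is the usual pattern: a failed CAS means another writer got there first, so the caller re-reads and tries again instead of overwriting the newer value.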
Architectures
Plain Old Centralized Database
● Oracle

● SQLServer (MS)

● DB2 (IBM)

● MySQL

● PostgreSQL
Relational (ACID) “Distributed” Database
● Oracle RAC (Real Application Clusters)

● DB2 Data Sharing

● Postgres-XL
Federated Database System
● IBM IIDR
Data Warehouse
● Oracle Exadata

● Teradata

● SQL Data Warehouse (MS)

● Vertica (HP)

● Greenplum (EMC)
Interactive Massively Parallel Processing (MPP)
● Dremel (Big Query, Google)

● Redshift (Amazon)

● Presto (Facebook)

● Impala (Cloudera)
NoSQL (BASE) Shared Nothing Database
● MongoDB

● CouchBase

● Cassandra

● HBase
When?/Where? Where did the ideas come from and
what is available for use
nowadays?
History and Present
The Founding Fathers
Articles
● Old School

○ Fundamentals of Database Systems (1989)

○ Principles of Distributed Database Systems (1991)

● Distributed File System

○ The Google file system (2003)

● Distributed Processing

○ MapReduce: simplified data processing on large clusters (2004)

● Interactive Querying on large scale


Adopters
NoSQL – Database Types
● Document DB (Mostly JSON)

○ MongoDB

○ CouchBase

● Key-Value DB

○ Cassandra

○ HBase

● Graph DB

○ Neo4J
Known Users
Big Guys
● Google - Inside tools

○ MapReduce

○ Dremel -> Big Query

○ Flume -> DataFlow

● Facebook - Inside tools open-sourced and modified

○ Cassandra -> HBase

○ Presto

● Yahoo - Hadoop / HBase


Israel
● IDF

● Waze

● Viber - Couchbase

● Liveperson - MongoDB, CouchBase

● SimilarWeb - HBase
Distribution is
awesome, but
requires complex
skills to do right.
Don’t overkill it.
