Distributed Databases: Daniel Marcous

What?
A distributed database is a database in which storage devices are not all attached to a common processing unit such as the CPU, controlled by a distributed database management system.
Introduction
Definitions
Understanding the vocabulary
● RDBMS - Relational Database Management System
● DDB - Distributed Database
● Node - a unit in a distributed system (mainly a single server)
● DDBMS - Distributed Database Management System
○ In charge of managing the different DDB nodes as one integrated system
● Centralized System - data is stored in one place
● Homogeneous system - built of parts (nodes) that all act the same way / consist of
the same hardware (opposite of heterogeneous).
Basic Concepts
Distributed Database Concepts
● Number of processing elements (database nodes)
● Connection between nodes over a computer network
● Logical interrelation between different database nodes
● No homogeneity requirement - nodes need not be identical in data, hardware or software

Types of Distributed Databases
Multiprocessing Systems
● Parallel Systems
○ Shared Memory (tightly coupled) - multiple processors share the same main memory
○ Shared Disk (loosely coupled) - multiple processors share the same secondary disk storage
● Truly Distributed Systems
○ Shared Nothing - each processor with its own memory and disk,
interrelations are only through network (no SPOF)
Classification of Distributed Systems
● Distribution - Data and software distributed over multiple nodes
● Autonomy - Provision DBMS as one whole VS multiple standalone DBMSs
● Heterogeneity - use of different software / hardware on different nodes

Why?
Reasons for choosing a distributed database over a “plain” centralized database.
The power of distribution
Advantages
Performance
● More computing power
○ CPU
○ Memory
○ Storage
○ Network bandwidth
● Parallelism
○ Inter-query
○ Intra-query
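The two kinds of parallelism listed above can be sketched in a few lines. This is only an illustration under stated assumptions: the `PARTITIONS` data, the function names and the use of threads in place of real database nodes are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data: one list of order amounts per node (partition).
PARTITIONS = [
    [10, 20, 30],
    [5, 15],
    [40, 50, 60, 70],
]

def partial_sum(rows):
    """Work done locally on a single node."""
    return sum(rows)

def intra_query_total(partitions):
    """Intra-query parallelism: one query split into per-partition tasks."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        return sum(pool.map(partial_sum, partitions))

def count_rows(partitions):
    return sum(len(p) for p in partitions)

def max_amount(partitions):
    return max(max(p) for p in partitions)

def inter_query(partitions, queries):
    """Inter-query parallelism: independent queries run at the same time."""
    with ThreadPoolExecutor(max_workers=len(queries)) as pool:
        futures = [pool.submit(query, partitions) for query in queries]
        return [future.result() for future in futures]
```

Calling `intra_query_total(PARTITIONS)` runs one aggregate over all partitions at once, while `inter_query(PARTITIONS, [count_rows, max_amount])` runs two unrelated queries side by side.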
Ease of use / development
● Transparency
● Geographically distributed sites
● Backups
● Elasticity
○ Growing
○ Shrinking
Challenges
Management Challenges
● Transparency - One software (Ring) to rule them all
○ Management - one command
○ Data - one query
● Autonomy - Degree of Independence
○ Different settings / configurations / Cache size
○ “Master” node / Master Election
● Keeping track of data distribution
○ which server has the table / partition I need?

Complex Features Implementation
● Reliability - Probability of failures
○ Does one server failure affect the whole system? (“Freeze”)
● Availability - Percent of time when a data source is available
○ If a node goes down, is its data lost, or just unavailable until it is up again?
● Recovery
○ What counts as a single point in time across nodes?
○ Node clock synchronisation (NTP)
● Transaction Management - Server X must assure that the data is “safe” and no
Scaling
● Synchronisation Overhead
CAP Theorem
● Eric Brewer (Berkeley->Yahoo->Google)
○ C - a read sees all previously completed writes
○ A - reads and writes always succeed
○ P - reads and writes keep working during a network partition
● Choose 2! (2000)
● Sorry, actually only C or A… (2012)
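The "only C or A" point can be made concrete with a toy model. This is a sketch under assumptions, not how any real database implements it: the `Cluster` class, the two-replica layout and the `"CP"` / `"AP"` mode labels are hypothetical names chosen for illustration.

```python
class Replica:
    def __init__(self):
        self.data = {}

class Cluster:
    """Two replicas; mode is 'CP' or 'AP' (labels assumed for illustration)."""

    def __init__(self, mode):
        self.mode = mode
        self.a, self.b = Replica(), Replica()
        self.partitioned = False  # True while the replicas cannot talk

    def write(self, replica, key, value):
        other = self.b if replica is self.a else self.a
        if self.partitioned:
            if self.mode == "CP":
                # Consistency first: refuse the write rather than diverge.
                raise RuntimeError("unavailable during partition")
            # Availability first: accept locally, replicas may now disagree.
            replica.data[key] = value
            return
        # Healthy network: both replicas receive the write.
        replica.data[key] = value
        other.data[key] = value

    def read(self, replica, key):
        return replica.data.get(key)
```

During a partition the CP cluster stays consistent by rejecting writes, while the AP cluster keeps answering but lets the two replicas return different values for the same key.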

How?
How does a distributed database work?
Internals
● Advanced Concepts
● Architectures
Advanced Concepts
Replication
● Assumptions
○ Nodes will fail
○ Commodity Hardware - prone to failure
● Settings
○ Replication Factor
○ Data / Actions / Apply logs
○ Synchronous / Asynchronous
○ Delay
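The replication settings above (replication factor, synchronous versus asynchronous, delay) can be sketched as follows. The `ReplicatedStore` class and its primary/follower layout are assumptions made for this example, with in-memory dicts standing in for nodes.

```python
from collections import deque

class ReplicatedStore:
    """Toy primary/follower store; replica 0 acts as the primary."""

    def __init__(self, replication_factor, synchronous):
        self.replicas = [{} for _ in range(replication_factor)]
        self.synchronous = synchronous
        self.pending = deque()  # apply log not yet shipped to followers

    def write(self, key, value):
        self.replicas[0][key] = value
        if self.synchronous:
            # Synchronous: the write returns only after every copy is updated.
            for follower in self.replicas[1:]:
                follower[key] = value
        else:
            # Asynchronous: acknowledge now, replicate later (the "delay").
            self.pending.append((key, value))

    def ship_log(self):
        """Apply queued writes to the followers, in order."""
        while self.pending:
            key, value = self.pending.popleft()
            for follower in self.replicas[1:]:
                follower[key] = value
```

With `synchronous=False`, a read from a follower between `write` and `ship_log` returns stale data, which is the trade-off behind the Delay setting.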
Fragmentation
● Dividing a single Data Object (Table/ File) into multiple parts
● Types
○ Horizontal - row wise
○ Vertical - column wise (Vertica/ Parquet)
○ Hybrid - both
● Advantages
○ Reports on part of the data - horizontal
○ Increased parallelism - multiple physical files
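A minimal sketch of the fragmentation types above, assuming a small in-memory table. The `RIDES` rows, the column names and both helper functions are hypothetical and only illustrate the row-wise versus column-wise split.

```python
from collections import defaultdict

# Hypothetical table of rides; "id" plays the role of the primary key.
RIDES = [
    {"id": 1, "city": "TLV", "fare": 30},
    {"id": 2, "city": "NYC", "fare": 55},
    {"id": 3, "city": "TLV", "fare": 12},
]

def horizontal_fragments(rows, column):
    """Row-wise split: each fragment holds whole rows sharing one column value."""
    fragments = defaultdict(list)
    for row in rows:
        fragments[row[column]].append(row)
    return dict(fragments)

def vertical_fragments(rows, column_groups, key="id"):
    """Column-wise split: each fragment keeps the key plus one group of columns."""
    return [
        [{key: row[key], **{c: row[c] for c in group}} for row in rows]
        for group in column_groups
    ]
```

A report on TLV rides only has to read `horizontal_fragments(RIDES, "city")["TLV"]`, and a hybrid scheme would apply `vertical_fragments` to each horizontal fragment.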

Distributed Processing
● Access by key Only!
○ Using Hash Tables
■ keys are hashed and spread (=sharded) across nodes
■ result of hash tells you which node to access
■ Hash maps exist on every node / client
● Batch Processing
○ MapReduce
■ Map - partition by key
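The key-hashing bullets above can be sketched as a tiny sharded store. The node names and the `ShardedStore` class are assumptions for illustration; a stable hash from `hashlib` is used because Python's built-in `hash()` for strings varies between processes, so nodes and clients would not agree on it.

```python
import hashlib

NODES = ["node-0", "node-1", "node-2"]  # hypothetical node names

def node_for(key, nodes=NODES):
    """Hash the key; the result tells you which node to access."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return nodes[int.from_bytes(digest[:8], "big") % len(nodes)]

class ShardedStore:
    """Every client can hold this mapping and route requests by itself."""

    def __init__(self, nodes=NODES):
        self.nodes = list(nodes)
        self.storage = {node: {} for node in self.nodes}  # one dict per node

    def put(self, key, value):
        self.storage[node_for(key, self.nodes)][key] = value

    def get(self, key):
        return self.storage[node_for(key, self.nodes)].get(key)
```

Plain modulo placement like this moves most keys when the node count changes; that limitation is a property of this sketch, not a claim about any specific product.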

Data Locality
● Local storage (VS centralised storage controller)
○ Bring the processing to the data
○ Free bandwidth
● Smart Load Balancing
○ Route users to the “closest” node with the data (replication duh..)
● Data sorted by Key /Hash Key
○ Same / Close enough key = Same node
○ “Process” all the rides in the TLV area
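The "same / close enough key = same node" idea can be sketched with range partitioning over sorted keys. The split points, node names and the `area:ride:n` key format are hypothetical choices for this example.

```python
import bisect
from collections import defaultdict

# Hypothetical split points over the sorted key space:
# key < "h" -> node-0, "h" <= key < "p" -> node-1, otherwise node-2.
SPLIT_POINTS = ["h", "p"]
NODES = ["node-0", "node-1", "node-2"]

def node_for(key):
    """Range partitioning: keys that sort close together land on the same node."""
    return NODES[bisect.bisect_right(SPLIT_POINTS, key)]

def group_by_node(keys):
    """Show which node would store, and therefore process, each key."""
    placement = defaultdict(list)
    for key in sorted(keys):
        placement[node_for(key)].append(key)
    return dict(placement)
```

Because every key starting with `tlv:` falls in the same range, processing all rides in the TLV area touches a single node and needs no network transfer between nodes.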

ACID
● Atomicity
○ Transactions
● Consistency
○ Locked until done
● Isolation
○ No interference
● Durability
○ Completed = Persistent
BASE
● Basic Availability
○ Response to every request
● Soft State
○ States change, results are not deterministic
● Eventual Consistency
○ Consistent state may take time but is promised
○ (CAS - Compare & Swap operations exist)
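A compare-and-swap operation, as mentioned for BASE systems, can be sketched like this. The `Cell` class and the `increment` helper are hypothetical, and a local lock stands in for whatever mechanism a real store uses to make the check-and-set step atomic.

```python
import threading

class Cell:
    """A single value supporting an atomic compare-and-swap."""

    def __init__(self, value):
        self._value = value
        self._lock = threading.Lock()

    def read(self):
        return self._value

    def compare_and_swap(self, expected, new):
        """Set the value to `new` only if it still equals `expected`."""
        with self._lock:
            if self._value != expected:
                return False  # someone else changed it first
            self._value = new
            return True

def increment(cell):
    """Optimistic update: retry until our CAS wins."""
    while True:
        current = cell.read()
        if cell.compare_and_swap(current, current + 1):
            return current + 1
```

The caller never holds a lock across the read and the write; a lost race just means `compare_and_swap` returns `False` and the update is retried on the fresh value.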
Architectures
Plain Old Centralized Database
● Oracle
● SQL Server (MS)
● DB2 (IBM)
● MySQL
● PostgreSQL
Relational (ACID) “Distributed” Database
● Oracle RAC (Real Application Clusters)
● DB2 Data Sharing
● Postgres-XL
Federated Database System
● IBM IIDR
Data Warehouse
● Oracle Exadata
● Teradata
● SQL Data Warehouse (MS)
● Vertica (HP)
● Greenplum (EMC)
Interactive Massively Parallel Processing (MPP)
● Dremel (BigQuery, Google)
● Redshift (Amazon)
● Presto (Facebook)
● Impala (Cloudera)
NoSQL (BASE) Shared Nothing Database
● MongoDB
● Couchbase
● Cassandra
● HBase
When? / Where?
Where did the ideas come from, and what is available for use today?
History and Present
The Founding Fathers
Articles
● Old School
○ Fundamentals of Database Systems (1989)
○ Principles of Distributed Database Systems (1991)
● Distributed File System
○ The Google file system (2003)
● Distributed Processing
○ MapReduce: simplified data processing on large clusters (2004)
● Interactive Querying on large scale

Adopters
NoSQL – Database Types
● Document DB (Mostly JSON)
○ MongoDB
○ Couchbase
● Key-Value DB
○ Cassandra
○ HBase
● Graph DB
○ Neo4j
Known Users
Big Guys
● Google - Inside tools
○ MapReduce
○ Dremel -> BigQuery
○ Flume -> Dataflow
● Facebook - Inside tools open-sourced and modified
○ Cassandra -> HBase
○ Presto
● Yahoo - Hadoop / HBase

Israel
● IDF
● Waze
● Viber - Couchbase
● LivePerson - MongoDB, Couchbase
● SimilarWeb - HBase
Distribution is awesome, but requires complex skills to do right.
Don’t overkill it.
