Distributed Databases: Daniel Marcous
What?
A distributed database is a database in which storage devices are not all
attached to a common processing unit such as the CPU, controlled by a
distributed database management system.
Introduction
Definitions
Understanding the vocabulary
● RDBMS - Relational Database Management System
● Homogeneous system - built of parts (nodes) that all act the same way / consist of
the same hardware (opposite of heterogeneous).
Basic Concepts
Distributed Database Concepts
● Number of processing elements (database nodes)
○ Shared Memory (tightly coupled) - multiple processors share the same main memory
○ Shared Disk (loosely coupled) - multiple processors share the same secondary disk storage
○ Shared Nothing - each processor has its own memory and disk;
nodes communicate only over the network (no single point of failure, SPOF)
Classification of Distributed Systems
● Distribution - Data and software distributed over multiple nodes
○ CPU
○ Memory
○ Storage
○ Network bandwidth
● Parallelism
○ Inter-query
○ Intra-query
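To make the two kinds of parallelism above concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not how any particular database does it: the `PARTITIONS` list and all function names are hypothetical, and threads are used only to show the structure of the work, not to demonstrate a speedup.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data: each inner list stands in for the rows held by one node.
PARTITIONS = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

def partial_sum(rows):
    """Work done locally on one node for a SUM query."""
    return sum(rows)

def intra_query_sum(pool):
    """Intra-query: ONE query is split into per-partition tasks, then merged."""
    return sum(pool.map(partial_sum, PARTITIONS))

def count_rows():
    """An independent query: total row count."""
    return sum(len(rows) for rows in PARTITIONS)

def max_value():
    """Another independent query: largest value."""
    return max(max(rows) for rows in PARTITIONS)

def inter_query(pool):
    """Inter-query: SEVERAL independent queries run concurrently."""
    futures = [pool.submit(count_rows), pool.submit(max_value)]
    return [f.result() for f in futures]

with ThreadPoolExecutor(max_workers=4) as pool:
    total = intra_query_sum(pool)
    count, largest = inter_query(pool)

print(total, count, largest)
```

Intra-query parallelism shortens a single heavy query; inter-query parallelism raises the number of queries the system can serve at once.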
Ease of use / development
● Transparency
● Backups
● Elasticity
○ Growing
○ Shrinking
Challenges
Management Challenges
● Transparency - One software (Ring) to rule them all
○ If a node goes down, is its data lost, or just unavailable until it's up again?
● Recovery
● Transaction Management - Server X must assure that the data is “safe” and no
Scaling
● Synchronisation Overhead
CAP Theorem
● Eric Brewer (Berkeley->Yahoo->Google)
● Choose 2! (2000)
How? How does it work?
● Advanced Concepts
Internals
● Architectures
Advanced Concepts
Replication
● Assumptions
● Settings
○ Replication Factor
○ Synchronous / Asynchronous
○ Delay
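The replication settings above can be sketched as a toy in-memory store. This is a simplified illustration under assumptions, not a real replication protocol: `Node`, `ReplicatedStore`, and the replica-placement rule are all hypothetical, and failures are not modeled. The replication factor decides how many nodes receive each write; the synchronous flag decides whether the write waits for all copies.

```python
import threading

class Node:
    """Hypothetical replica: a dict standing in for one node's storage."""
    def __init__(self, name):
        self.name = name
        self.data = {}

    def apply(self, key, value):
        self.data[key] = value

class ReplicatedStore:
    def __init__(self, nodes, replication_factor, synchronous=True):
        if replication_factor > len(nodes):
            raise ValueError("replication factor exceeds node count")
        self.nodes = nodes
        self.rf = replication_factor
        self.synchronous = synchronous
        self._pending = []  # background threads for asynchronous copies

    def replicas_for(self, key):
        # Assumed placement rule: `rf` consecutive nodes, starting at a
        # position derived deterministically from the key's bytes.
        start = sum(key.encode()) % len(self.nodes)
        return [self.nodes[(start + i) % len(self.nodes)] for i in range(self.rf)]

    def write(self, key, value):
        primary, *followers = self.replicas_for(key)
        primary.apply(key, value)
        if self.synchronous:
            # Synchronous: return only after every replica has the value.
            for node in followers:
                node.apply(key, value)
        else:
            # Asynchronous: return at once; followers catch up in the background.
            for node in followers:
                t = threading.Thread(target=node.apply, args=(key, value))
                t.start()
                self._pending.append(t)

    def flush(self):
        """Wait until all background copies have been applied."""
        for t in self._pending:
            t.join()
        self._pending.clear()

nodes = [Node(f"n{i}") for i in range(4)]

sync_store = ReplicatedStore(nodes, replication_factor=3, synchronous=True)
sync_store.write("user:1", "alice")
sync_copies = sum(1 for n in nodes if n.data.get("user:1") == "alice")

async_store = ReplicatedStore(nodes, replication_factor=2, synchronous=False)
async_store.write("user:2", "bob")
async_store.flush()
async_copies = sum(1 for n in nodes if n.data.get("user:2") == "bob")
```

In the asynchronous case, the gap between `write` returning and `flush` completing is the replication delay: a reader that hits a follower in that window may see stale data.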
Fragmentation
● Dividing a single Data Object (Table/ File) into multiple parts
● Types
○ Hybrid - both
● Advantages
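As a rough sketch of how a single table can be divided, the example below assumes the two standard fragmentation types that "hybrid" combines: horizontal (split by rows) and vertical (split by columns). The slide itself names only the hybrid type, so the function names, the sample `ROWS` table, and the modulo placement rule are illustrative assumptions.

```python
# Hypothetical table: a list of row dicts with "id" as the key column.
ROWS = [
    {"id": 1, "name": "Ann", "city": "Paris", "balance": 10},
    {"id": 2, "name": "Bob", "city": "Rome", "balance": 20},
    {"id": 3, "name": "Cid", "city": "Paris", "balance": 30},
]

def horizontal(rows, key, n_fragments):
    """Split by rows: each fragment holds a subset of complete rows."""
    fragments = [[] for _ in range(n_fragments)]
    for row in rows:
        fragments[row[key] % n_fragments].append(row)
    return fragments

def vertical(rows, key, column_groups):
    """Split by columns: each fragment keeps the key plus one group of columns."""
    return [
        [{key: row[key], **{c: row[c] for c in cols}} for row in rows]
        for cols in column_groups
    ]

def hybrid(rows, key, n_fragments, column_groups):
    """Both: split rows first, then split each row fragment by columns."""
    return [
        vertical(fragment, key, column_groups)
        for fragment in horizontal(rows, key, n_fragments)
    ]

groups = [["name", "city"], ["balance"]]
h = horizontal(ROWS, "id", 2)
v = vertical(ROWS, "id", groups)
hy = hybrid(ROWS, "id", 2, groups)
```

Each vertical fragment repeats the key column so the original rows can be rebuilt by joining the fragments on `id`.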
● Batch Processing
○ MapReduce
○ Free bandwidth
○ Route users to the “closest” node with the data (replication duh..)
● SQL Server (MS)
● DB2 (IBM)
● MySQL
● PostgreSQL
Relational (ACID) “Distributed” Database
● Oracle RAC (Real Application Clusters)
● Postgres-XL
Federated Database System
● IBM IIDR
Data Warehouse
● Oracle Exadata
● Teradata
● Vertica (HP)
● Greenplum (EMC)
Interactive Massively Parallel Processing (MPP)
● Dremel (BigQuery, Google)
● Redshift (Amazon)
● Presto (Facebook)
● Impala (Cloudera)
NoSQL (BASE) Shared Nothing Database
● MongoDB
● Couchbase
● Cassandra
● HBase
When?/Where?
Where did the ideas come from, and what is available for use today?
History and Present
The Founding Fathers
Articles
● Old School
● Distributed Processing
○ MongoDB
○ Couchbase
● Key-Value DB
○ Cassandra
○ HBase
● Graph DB
○ Neo4j
Known Users
Big Guys
● Google - Inside tools
○ MapReduce
○ Presto
● Waze
● Viber - Couchbase
● SimilarWeb - HBase
Distribution is awesome, but requires complex skills to do right.
Don’t overdo it.