Virtual Machine Block Storage with the Ceph Distributed Storage System


virtual machine block storage with the ceph distributed storage system

sage weil
xensummit – august 28, 2012
outline
● why you should care
● what is it, what it does
● how it works, how you can use it
● architecture
● objects, recovery
● rados block device
● integration
● path forward
● who we are, why we do this
why should you care about another storage system?

requirements, time, cost
requirements
● diverse storage needs
● object storage
● block devices (for VMs) with snapshots, cloning
● shared file system with POSIX, coherent caches
● structured data... files, block devices, or objects?
● scale
● terabytes, petabytes, exabytes
● heterogeneous hardware
● reliability and fault tolerance
time
● ease of administration
● no manual data migration, load balancing
● painless scaling
● expansion and contraction
● seamless migration
cost
● low cost per gigabyte
● no vendor lock-in
● software solution
● run on commodity hardware
● open source
what is ceph?
● LIBRADOS (used by APPs) – a library allowing apps to directly access RADOS, with support for C, C++, Java, Python, Ruby, and PHP
● RADOSGW (used by APPs) – a bucket-based REST gateway, compatible with S3 and Swift
● RBD (used by HOSTs/VMs) – a reliable and fully-distributed block device, with a Linux kernel client and a QEMU/KVM driver
● CEPH FS (used by CLIENTs) – a POSIX-compliant distributed file system, with a Linux kernel client and support for FUSE
● RADOS (underlying all of the above) – a reliable, autonomous, distributed object store comprised of self-healing, self-managing, intelligent storage nodes
open source
● LGPLv2
● copyleft
● free to link to proprietary code
● no copyright assignment
● no dual licensing
● no “enterprise-only” feature set
● active community
● commercial support available
distributed storage system
● data center (not geo) scale
● 10s to 10,000s of machines
● terabytes to exabytes
● fault tolerant
● no SPoF
● commodity hardware
– ethernet, SATA/SAS, HDD/SSD
– RAID, SAN probably a waste of time, power, and money
object storage model
● pools
● 1s to 100s
● independent namespaces or object collections
● replication level, placement policy
● objects
● trillions
● blob of data (bytes to gigabytes)
● attributes (e.g., “version=12”; bytes to kilobytes)
● key/value bundle (bytes to gigabytes)
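
To make this model concrete, here is a minimal sketch using the librados Python bindings mentioned above. It is illustrative only: the config path, the pool name 'mypool', and the object name are assumptions, and it presumes a reachable cluster with that pool already created.

# a minimal sketch of the RADOS object model via the Python librados bindings
# (assumes /etc/ceph/ceph.conf and an existing pool named 'mypool')
import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
try:
    ioctx = cluster.open_ioctx('mypool')        # a pool: an independent object namespace
    try:
        ioctx.write_full('greeting', b'hello RADOS')     # an object: a named blob of bytes
        ioctx.set_xattr('greeting', 'version', b'12')    # an attribute: small named metadata
        print(ioctx.read('greeting'))                    # b'hello RADOS'
        print(ioctx.get_xattr('greeting', 'version'))    # b'12'
    finally:
        ioctx.close()
finally:
    cluster.shutdown()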
object storage cluster
● conventional client/server model doesn't scale
● server(s) become bottlenecks; proxies are inefficient
● if storage devices don't coordinate, clients must
● ceph-osds are intelligent storage daemons
● coordinate with peers
● sensible, cluster-aware protocols
● sit on local file system
– btrfs, xfs, ext4, etc.
– leveldb
[diagram: each ceph-osd (OSD) sits on a local file system (btrfs, xfs, or ext4) on its own disk; a small set of monitors (M) runs alongside the OSDs]
Monitors:
● maintain cluster state
● provide consensus for distributed decision-making
● small, odd number
● do not serve stored objects to clients

OSDs:
● one per disk or RAID group
● at least three in a cluster
● serve stored objects to clients
● intelligently peer to perform replication tasks
data distribution
● all objects are replicated N times
● objects are automatically placed, balanced, migrated
in a dynamic cluster
● must consider physical infrastructure
● ceph-osds on hosts in racks in rows in data centers

● three approaches
● pick a spot; remember where you put it
● pick a spot; write down where you put it
● calculate where to put it, where to find it
CRUSH
• Pseudo-random placement algorithm
• Fast calculation, no lookup
• Ensures even distribution
• Repeatable, deterministic
• Rule-based configuration
• specifiable replication
• infrastructure topology aware
• allows weighting
• Stable mapping
• Limited data migration
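
The core idea – every client computes placement from a hash instead of asking a lookup service – can be sketched in a few lines of Python. This is a deliberately simplified toy, not the actual CRUSH algorithm: it has no hierarchy, rules, or weighting, and the object and OSD names are made up.

# toy hash-based placement: any client can compute, with no lookup table,
# which OSDs should hold an object (NOT the real CRUSH algorithm)
import hashlib

def place(object_name, osds, replicas=3):
    """Deterministically pick `replicas` distinct OSDs for an object."""
    chosen, attempt = [], 0
    while len(chosen) < replicas and attempt < 10 * replicas:
        digest = hashlib.sha1(('%s/%d' % (object_name, attempt)).encode()).hexdigest()
        osd = osds[int(digest, 16) % len(osds)]
        if osd not in chosen:          # replicas must land on distinct OSDs
            chosen.append(osd)
        attempt += 1
    return chosen

osds = ['osd.%d' % i for i in range(12)]
print(place('some-object', osds))      # same inputs -> same answer, on every client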
distributed object storage
● CRUSH tells us where data should go
● small “osd map” records cluster state at point in time
● ceph-osd node status (up/down, weight, IP)
● CRUSH function specifying desired data distribution
● object storage daemons (RADOS)
● store it there
● migrate it as the cluster changes
● decentralized, distributed approach allows
● massive scales (10,000s of servers or more)
● efficient data access
● the illusion of a single copy with consistent behavior
large clusters aren't static
● dynamic cluster
● nodes are added, removed; nodes reboot, fail, recover
● recovery is the norm
● osd maps are versioned
● shared via gossip
● any map update potentially triggers data migration
● ceph-osds monitor peers for failure
● new nodes register with monitor
● administrators adjust weights, mark out old hardware, etc.
what does this mean for my cloud?
● virtual disks
● reliable
● accessible from many hosts
● appliances
● great for small clouds
● not viable for public or (large) private clouds
● avoid single server bottlenecks
● efficient management
[deployment diagrams: (1) a VM runs in a virtualization container that accesses its image through LIBRBD on LIBRADOS; (2) two virtualization containers, each with LIBRBD/LIBRADOS, can open the same image, so a VM can move between them; (3) a HOST maps the image directly via KRBD, the kernel module]
RBD: RADOS Block Device
● Replicated, reliable, high-performance virtual disk
● Allows decoupling of VMs and containers
  • live migration!
● Images are striped across the cluster (see the striping sketch below)
● Snapshots!
● Native support in the Linux kernel
  • /dev/rbd1
● librbd allows easy integration
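
The striping bullet comes down to simple arithmetic: with a fixed object size (4 MB by default), a byte offset in the virtual disk maps directly to one RADOS object. The object-name prefix below is an illustrative assumption, not the exact naming scheme RBD uses on disk.

# sketch: mapping a block-device offset to the RADOS object that stores it
# (assumes the default 4 MB object size; the name prefix is illustrative)
OBJECT_SIZE = 4 * 1024 * 1024

def object_for_offset(image_prefix, offset):
    obj_no = offset // OBJECT_SIZE          # which stripe object
    obj_off = offset % OBJECT_SIZE          # offset within that object
    return '%s.%016x' % (image_prefix, obj_no), obj_off

print(object_for_offset('rbd_data.1234', 0))              # start of the image
print(object_for_offset('rbd_data.1234', 10 * 1024**3))   # 10 GiB into the image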
HOW DO YOU SPIN UP THOUSANDS OF VMs INSTANTLY AND EFFICIENTLY?
instant copy

[figure: copy-on-write clone. The clone initially shares all 144 objects of its parent image, so the "copy" is instant (0 + 144 = 144 objects). When the CLIENT writes, only the affected objects are created in the clone (144 + 4 = 148). When the CLIENT reads, new data comes from the clone's own objects and untouched data from the parent (144 + 4 = 148).]
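
The snapshot-and-clone workflow behind this picture can be driven from the rbd Python bindings. A hedged sketch, assuming a reachable cluster, an existing pool 'mypool', and illustrative image/snapshot names; layered cloning requires an image format that supports it.

# sketch: one "golden" image, one protected snapshot, many instant clones
import rados
import rbd

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('mypool')
try:
    r = rbd.RBD()
    # parent image; layering (format 2) is required for cloning
    r.create(ioctx, 'golden', 10 * 1024**3,
             old_format=False, features=rbd.RBD_FEATURE_LAYERING)
    parent = rbd.Image(ioctx, 'golden')
    try:
        parent.create_snap('base')        # point-in-time snapshot
        parent.protect_snap('base')       # clones must come from a protected snapshot
    finally:
        parent.close()
    # each VM gets an instant, copy-on-write clone of that snapshot
    for i in range(3):
        r.clone(ioctx, 'golden', 'base', ioctx, 'vm-%d' % i,
                features=rbd.RBD_FEATURE_LAYERING)
finally:
    ioctx.close()
    cluster.shutdown()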
current RBD integration
● native Linux kernel support
● /dev/rbd0, /dev/rbd/<poolname>/<imagename>
● librbd
● user-level library
● Qemu/KVM
● links to librbd user-level library
● libvirt
● librbd-based storage pool
● understands RBD images
● can only start KVM VMs... :-(
● CloudStack, OpenStack
what about Xen?
● Linux kernel driver (i.e. /dev/rbd0)
● easy fit into existing stacks
● works today
● need recent Linux kernel for dom0
● blktap
● generic kernel driver, userland process
● easy integration with librbd
● more featureful (cloning, caching), maybe faster
● doesn't exist yet!
● rbd-fuse
● coming soon!
libvirt
● CloudStack, OpenStack
● libvirt understands rbd images, storage pools
● xml specifies cluster, pool, image name, auth
● currently only usable with KVM
● could configure /dev/rbd devices for VMs
librbd
● management
● create, destroy, list, describe images
● resize, snapshot, clone
● I/O
● open, read, write, discard, close
● C, C++, Python bindings
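
A matching sketch of the I/O half of that API, again via the Python bindings; the pool and image names are assumptions and error handling is kept minimal.

# minimal I/O against an existing RBD image via librbd's Python bindings
import rados
import rbd

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('mypool')
image = rbd.Image(ioctx, 'vm-0')          # open
try:
    print(image.size())                   # current size in bytes
    image.write(b'\x00' * 4096, 0)        # write 4 KB at offset 0
    data = image.read(0, 4096)            # read it back
    image.discard(0, 4096)                # discard (trim) the range
finally:
    image.close()                         # close
    ioctx.close()
    cluster.shutdown()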
RBD roadmap
● locking
● fence failed VM hosts
● clone performance
● KSM (kernel same-page merging) hints
● caching
● improved librbd caching
● kernel RBD + bcache to local SSD/disk
why
● limited options for scalable open source storage
● proprietary solutions
● marry hardware and software
● expensive
● don't scale (out)
● industry needs to change
who we are
● Ceph created at UC Santa Cruz (2007)
● supported by DreamHost (2008-2011)
● Inktank (2012)
● growing user and developer community
● we are hiring
● C/C++/Python developers
● sysadmins, testing engineers
● Los Angeles, San Francisco, Sunnyvale, remote

http://ceph.com/