Cassandra - An Introduction

Mikio L. Braun, Leo Jugel
TU Berlin, twimpact
LinuxTag Berlin, 13 May 2011
LinuxTag Berlin, 13. 5. 2011
(c) 2011 by Mikio L. Braun
@mikiobraun, blog.mikiobraun.de
What is NoSQL
For many web applications, classical databases are not the right choice:

  The database is just used for storing objects.
  Consistency is not essential.
  There is a lot of concurrent access.
NoSQL in comparison
Classical Databases
  Powerful query language
  Scales by using larger servers (scaling up)
  Changes of database schema very costly
  ACID: Atomicity, Consistency, Isolation, Durability
  Transactions, locking, etc.
NoSQL
  Very simple query language
  Scales through clustering (scaling out)
  No fixed database schema
  Typically only eventually consistent
  Typically no support for transactions etc.
Brewer's CAP Theorem
CAP: Consistency, Availability, Partition Tolerance

  Consistency: You never get old data.
  Availability: Read/write operations are always possible.
  Partition Tolerance: The other guarantees hold even if the network between the servers breaks.
You can only have two of these!
Gilbert, Lynch: Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services, ACM SIGACT News, Volume 33, Issue 2, June 2002
Homepage: http://cassandra.apache.org
Language: Java
History:
  Developed at Facebook for inbox search, released as Open Source in July 2008
  Apache Incubator since March 2009
  Apache Top-Level since February 2010
Main Properties:
  structured key value store
  eventually consistent
  fully equivalent nodes
  cluster can be modified without restarting
Support: DataStax (http://datastax.com)
Licence: Apache 2.0
Version 0.6.x and 0.7.x
Most important changes in 0.7.x

  Config file format changed from XML to YAML
  Schema modification (ColumnFamilies) without restart
  Beginning support for secondary indices
However, 0.7.x initially also had stability problems.
Inspirations for Cassandra
Amazon Dynamo
  Clustering without dedicated master node
  Peer-to-peer discovery of nodes, Hinted Handoff, etc.

Google BigTable
  Data model
  Requires central master node
  Provides much more fine-grained control:
    which data should be stored together
    on-the-fly compression, etc.
Installation
  Download tar.gz from http://cassandra.apache.org/download/
  Unpack
  ./conf contains config files
  ./bin/cassandra -f to start Cassandra, Ctrl-C to stop
Configuration
Database
  Version 0.6.x: conf/storage-conf.xml
  Version 0.7.x: conf/cassandra.yaml
JVM Parameters
  Version 0.6.x: bin/cassandra.in.sh
  Version 0.7.x: conf/cassandra-env.sh

Cassandra's Data Model

Keyspace (= database)
  Column Family (= table)
    Row: key -> {name1: value1, name2: value2, name3: value3, ...}
      Keys: strings, sorted according to partitioner
      Columns (name: value): byte arrays, sorted by name!
  Super Column Family
    Row: key -> key -> {name1: value1, ...}
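The data model can be pictured as nested maps. The following is a minimal sketch in plain Python, not Cassandra's actual implementation; the class `ColumnFamily` and its methods `insert` and `get_slice` are hypothetical names chosen for illustration, loosely mirroring the API operations of the same name.

```python
from bisect import insort


class ColumnFamily:
    """Toy model (assumption): row key -> columns kept sorted by column name."""

    def __init__(self):
        self.rows = {}  # key (bytes) -> sorted list of [name, value] pairs

    def insert(self, key, name, value):
        columns = self.rows.setdefault(key, [])
        for col in columns:
            if col[0] == name:          # overwrite an existing column
                col[1] = value
                return
        insort(columns, [name, value])  # keep columns ordered by name

    def get_slice(self, key, start, end):
        """Return the columns of one row with start <= name <= end."""
        return [(n, v) for n, v in self.rows.get(key, []) if start <= n <= end]


# A keyspace is then just a mapping from column family names to column families.
keyspace = {"Person": ColumnFamily()}
person = keyspace["Person"]
person.insert(b"1", b"name", b"Mikio Braun")
person.insert(b"1", b"affiliation", b"TU Berlin")
print(person.get_slice(b"1", b"a", b"z"))
```

Because the columns are kept sorted by name, a range query over column names is a simple scan of a contiguous region, which is what the index example relies on.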
Example: Simple Object Store

class Person {
  long id;
  String name;
  String affiliation;
}

Convert fields to byte arrays:

Keyspace MyDatabase:
  ColumnFamily Person:
    1: {id: 1, name: Mikio Braun, affiliation: TU Berlin}
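Converting the fields to byte arrays could look roughly as follows. This is a sketch under assumptions: the encoding (8-byte big-endian for the long, UTF-8 for strings) and the helper names `to_columns`, `from_columns` and `row_key` are illustrative choices, not something prescribed by Cassandra.

```python
import struct
from dataclasses import dataclass


@dataclass
class Person:
    id: int
    name: str
    affiliation: str


def row_key(p):
    """Row key: the id as an 8-byte big-endian integer (assumed encoding)."""
    return struct.pack(">q", p.id)


def to_columns(p):
    """Encode every field as a byte array, one column per field."""
    return {
        b"id": struct.pack(">q", p.id),
        b"name": p.name.encode("utf-8"),
        b"affiliation": p.affiliation.encode("utf-8"),
    }


def from_columns(cols):
    """Inverse of to_columns."""
    return Person(
        id=struct.unpack(">q", cols[b"id"])[0],
        name=cols[b"name"].decode("utf-8"),
        affiliation=cols[b"affiliation"].decode("utf-8"),
    )


p = Person(1, "Mikio Braun", "TU Berlin")
assert from_columns(to_columns(p)) == p
```

The resulting dictionary of columns is what would be written under `row_key(p)` into the Person column family with a client library of your choice.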
Example: Index
class Page {
  long id;
  List<Link> links;
}

class Link {
  long id;
  ...
  int numberOfHits;
}

Object data fields

Keyspace MyDatabase
  ColumnFamily Pages
    3: {id: 3, ...}
    4: {id: 4, ...}
  ColumnFamily Links
    1: {id: 1, url: ...}
    17: {id: 17, url: ...}
  ColumnFamily LinksPerPageByNumberOfHits
    3: {00000132:00000001: t, 000025:00000017: ...}
    4: {00000044:00000024: t, ...}

LinksPerPageByNumberOfHits is used for both linking and indexing!
Here we exploit that columns are sorted by their names.
Of course, everything is encoded in byte arrays, not ASCII.
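The index trick can be sketched as follows: the column name is the pair (numberOfHits, link id) in a fixed-width encoding, so that sorting by name equals sorting by number of hits. The helper names and the choice of unsigned 32-bit big-endian integers are assumptions made for this illustration.

```python
import struct


def index_column_name(number_of_hits, link_id):
    """Fixed-width big-endian encoding: byte order equals numeric order.

    Assumes both values fit into an unsigned 32-bit integer.
    """
    return struct.pack(">II", number_of_hits, link_id)


def decode_column_name(name):
    """Return the (number_of_hits, link_id) pair stored in a column name."""
    return struct.unpack(">II", name)


def links_by_hits(row):
    """Return (number_of_hits, link_id) pairs in column order.

    Cassandra would return the columns already sorted by name;
    the plain dict used here has to be sorted explicitly.
    """
    return [decode_column_name(name) for name in sorted(row)]


# Row for page 3: the column value carries no information, the name is the index.
row = {
    index_column_name(132, 1): b"t",
    index_column_name(25, 17): b"t",
}
print(links_by_hits(row))
```

A variable-width or decimal-string encoding without padding would break this, since for example "9" sorts after "10" when compared bytewise.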

Are SuperColumnFamilies necessary?
Usually, you can replace a SuperColumnFamily by several ColumnFamilies. Since SuperColumnFamilies make the implementation and the protocol more complex, some people advocate removing SuperCFs altogether.
Cassandra's Architecture
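[Diagram: a Write Operation goes to the Commit Log (disk) and to the MemTable (memory); a Flush writes the MemTable to an SSTable on disk; a Read Operation consults the MemTable and the SSTables; Compaction! merges the SSTables.]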
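The interplay of commit log, MemTable, SSTables and compaction can be illustrated with a toy store. This is a heavily simplified sketch, not Cassandra's implementation: the class `ToyStore`, the flush threshold counted in number of keys, and the handling of the commit log are all assumptions for illustration.

```python
class ToyStore:
    """Toy write path: commit log + memtable, flushed to immutable sorted tables."""

    def __init__(self, memtable_limit=2):
        self.commit_log = []   # stands in for the on-disk commit log
        self.memtable = {}     # in memory, holds the most recent writes
        self.sstables = []     # sorted (key, value) lists, oldest first
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        self.commit_log.append((key, value))  # log first, for durability
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        """Write the memtable out as a new sorted table and start a fresh one."""
        if self.memtable:
            self.sstables.append(sorted(self.memtable.items()))
            self.memtable = {}
            self.commit_log = []  # simplification: flushed data needs no log

    def read(self, key):
        """Check the memtable first, then the tables from newest to oldest."""
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):
            for k, v in table:
                if k == key:
                    return v
        return None

    def compact(self):
        """Merge all tables into one; newer values overwrite older ones."""
        merged = {}
        for table in self.sstables:
            merged.update(table)
        self.sstables = [sorted(merged.items())] if merged else []


store = ToyStore(memtable_limit=2)
for key, value in [("a", 1), ("b", 2), ("a", 3), ("c", 4)]:
    store.write(key, value)
print(len(store.sstables), store.read("a"))
store.compact()
print(store.sstables)
```

In this toy model a write only appends and updates memory, while a read may have to look into several tables until compaction has merged them.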
Cassandra's API
THRIFT-based API

Read operations
  get                  single column
  get_slice            range of columns
  multiget_slice       range of columns in several rows
  get_count            column count
  get_range_slice      several columns from range of rows
  get_indexed_slices   range of columns from index

Write operations
  insert               single column
  batch_mutate         several columns in several rows
  remove               single column
  truncate             whole ColumnFamily

Other
  login, describe_*, add/drop column family/keyspace (since 0.7.x)
Cassandra Clustering

Fully equivalent nodes, no master node. Bootstrapping requires seed node.

[Diagram: Query -> Storage Proxy -> Nodes; reads/writes according to consistency level]

Consistency Level and Replication Factor

Replication factor: On how many nodes is a piece of data stored?

Consistency level:
  ANY            A node has received the operation, even a HintedHandoff node.
  ONE            One node has completed the request.
  QUORUM         Operation has completed on majority of nodes / newest result is returned.
  LOCAL_QUORUM   QUORUM in local data center
  GLOBAL_QUORUM  QUORUM in global data center
  ALL            Wait till all nodes have completed the request
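How replication factor and consistency level interact can be made concrete with a little arithmetic. The sketch below covers only ONE, QUORUM and ALL; the function names are hypothetical. It uses the standard rule that a read is guaranteed to see the latest write if the number of replicas read (R) plus the number of replicas written (W) exceeds the replication factor (N).

```python
def quorum(replication_factor):
    """Majority of the replicas."""
    return replication_factor // 2 + 1


def required_acks(level, replication_factor):
    """Number of replicas that must answer for ONE, QUORUM and ALL."""
    return {
        "ONE": 1,
        "QUORUM": quorum(replication_factor),
        "ALL": replication_factor,
    }[level]


def reads_see_latest_write(read_level, write_level, replication_factor):
    """True if every read set overlaps every write set (R + W > N)."""
    r = required_acks(read_level, replication_factor)
    w = required_acks(write_level, replication_factor)
    return r + w > replication_factor


for read_level, write_level in [("ONE", "ONE"), ("QUORUM", "QUORUM"), ("ONE", "ALL")]:
    print(read_level, write_level, reads_see_latest_write(read_level, write_level, 3))
```

With a replication factor of 3, reading and writing at QUORUM touches 2 + 2 replicas, so at least one replica is in both sets; reading and writing at ONE gives no such guarantee.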

How to deal with failure
As long as the requirements of the consistency level can be met, everything is fine.

Hinted Handoff:
  A write operation for a faulty node is stored on another node and pushed to the faulty node once it is available again.
  The hinted data is not readable until it has been delivered!

Read Repair:
  After a read operation has completed, the data is compared and updated on all nodes in the background.
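The idea behind read repair can be sketched in a few lines. This is a simplified illustration, not Cassandra's code: replicas are modelled as plain dictionaries, each value carries a timestamp and the newest timestamp wins, and the repair is done inline rather than in the background. The function name `read_with_repair` is hypothetical.

```python
def read_with_repair(replicas, key):
    """Read `key` from all replicas and bring stale replicas up to date.

    replicas: list of dicts mapping key -> (timestamp, value).
    Returns the value with the newest timestamp, or None if no replica has it.
    """
    versions = [replica[key] for replica in replicas if key in replica]
    if not versions:
        return None
    newest = max(versions, key=lambda version: version[0])
    for replica in replicas:
        if replica.get(key) != newest:  # stale or missing: repair it
            replica[key] = newest
    return newest[1]


replicas = [
    {"k": (1, "old")},   # missed the latest write
    {"k": (2, "new")},
    {},                  # was down during both writes
]
print(read_with_repair(replicas, "k"))
print(replicas)
```

After the call, all three replicas hold the version with timestamp 2, so a later read at consistency level ONE returns the new value regardless of which replica answers.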
Libraries
Python
  Pycassa: http://github.com/pycassa/pycass
  Telephus: http://github.com/driftx/Telephus
Java
  Datanucleus JDO: http://github.com/tnine/Datanucleus-Cassandra-Plugin
  Hector: http://github.com/rantav/hector
  Kundera: http://code.google.com/p/kundera/
  Pelops: http://github.com/s7/scale7-pelops
Grails
  grails-cassandra: https://github.com/wolpert/grails-cassandra
.NET
  Aquiles: http://aquiles.codeplex.com/
  FluentCassandra: http://github.com/managedfusion/fluentcassandra
Ruby
  Cassandra: http://github.com/fauna/cassandra
PHP
  phpcassa: http://github.com/thobbs/phpcassa
  SimpleCassie: http://code.google.com/p/simpletools-php/wiki/SimpleCassie

Or roll your own based on THRIFT http://thrift.apache.org/ :)
TWIMPACT: An Application

  Real-time analysis of Twitter
  Trend analysis based on retweets
  Very high data rate (several million tweets per day, about 50 per second)
TWIMPACT: twimpact.jp
TWIMPACT: twimpact.com
Application Profile

  Information about tweets, users, and retweets
  Text matching for non-API-retweets
  Retweet frequency and user impact

Operation profile:

  Operation                 Fraction  Duration
  get_slice (all)           50.1%     1.1ms
  get                        6.0%     1.7ms
  get_slice (range)          0.1%     0.8ms
  batch_mutate (one row)    14.9%     0.9ms
  insert                    21.5%     1.1ms
  batch_mutate               6.8%     0.8ms
  remove                     0.8%     1.2ms
Practical Experiences with Cassandra
  Very stable
  Read operations relatively expensive
  Multithreading leads to a huge performance increase
  Requires quite extensive tuning
  Clustering doesn't automatically lead to better performance
  Compaction leads to performance decrease of up to 50%
Performance through Multithreading

  Multithreading leads to much higher throughput
  How to achieve multithreading without locking support?

[Plot: throughput for 2, 4, 8, 16, 32 and 64 threads; Core i7, 4 cores (2 + 2 HT)]
Cassandra Tuning
Tuning opportunities:
  Size of memtables, thresholds for flushes
  Size of JVM Heap
  Frequency and depth of compaction

Where?
  MemTableThresholds etc. in conf/cassandra.yaml
  JVM Parameters in conf/cassandra-env.sh

Overview of JVM GC
Young Generation (Eden, Survivors): up to a few hundred MB
Old Generation: dozens of GBs
CMSInitiatingOccupancyFraction: occupancy of the old generation at which a CMS collection is started
Additional memory usage while GC is running
Cassandra's Memory Usage
[Plot of memory usage, annotated with: Memtables, indexes, etc.; Flush; Compaction]
Size of Memtable: 128M, JVM Heap: 3G, #CF: 12
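A rough back-of-the-envelope check for the configuration quoted above (memtable size 128M, 3G heap, 12 column families). It assumes, as a simplification, that each column family holds at most one full memtable and ignores any per-object overhead; the function name is hypothetical.

```python
def memtable_budget_mb(memtable_mb, column_families):
    """Upper bound if every column family holds one full memtable (assumption)."""
    return memtable_mb * column_families


total_mb = memtable_budget_mb(128, 12)  # 12 * 128 MB = 1536 MB
heap_mb = 3 * 1024                      # 3G heap
print(f"{total_mb} MB of {heap_mb} MB heap ({total_mb / heap_mb:.0%})")
```

Under these assumptions the memtables alone can claim about half of the heap, which leaves limited headroom for indexes, flushes, compaction and the garbage collector.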
Cassandra's Memory Usage
Memtables may survive for a very long time (up to several hours)
  They are placed in the old generation
  GC has to process several dozen GBs
  Heap too small, GC triggered too late: GC storm

Trade-off: I/O load vs. memory usage

Do not neglect compaction!

The Effects of GC and Compactions
[Plot annotations: Compaction, Large GC]
Cluster vs Single Node
Our set-up:
  Single node: six-core CPU and RAID 5 with 6 hard disks
  4-node cluster: six-core CPU and RAID 0 with 2 hard disks each

Single node consistently performs 1.5-3 times better. Possible causes:
  Overhead through network communication/consistency levels, etc.
  Hard disk performance significant
  Cluster still too small

Effectively available disk space:
  Single node: 6 * 500 GB = 3 TB, with RAID 5 = 2.5 TB (83%)
  4-node cluster: 4 * 1 TB = 4 TB, with replication factor 2 = 2 TB (50%)
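The disk space figures for the two set-ups follow from simple arithmetic, sketched here with hypothetical helper names. RAID 5 spends the capacity of one disk on parity, while a replication factor of 2 stores every piece of data twice.

```python
def raid5_usable(disks, disk_size_tb):
    """RAID 5 uses the equivalent of one disk for parity."""
    return (disks - 1) * disk_size_tb


def replicated_usable(total_tb, replication_factor):
    """Each piece of data is stored replication_factor times."""
    return total_tb / replication_factor


single = raid5_usable(6, 0.5)            # six 500 GB disks, 3 TB raw
cluster = replicated_usable(4 * 1.0, 2)  # four nodes with 1 TB each, 4 TB raw
print(f"single node: {single} TB ({single / 3.0:.0%})")
print(f"cluster:     {cluster} TB ({cluster / 4.0:.0%})")
```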

Alternatives
  MongoDB, CouchDB, redis, even memcached...
  Persistence: Disk or RAM?
  Replication: Master/Slave or Peer-to-Peer?
  Sharding?
  Upcoming trend towards more complex query languages (Javascript), map-reduce operations, etc.
Summary: Cassandra
  Platform which scales well
  Active user and developer community
  Read operations quite expensive
  For optimal performance, extensive tuning is necessary
  Depending on your application, eventual consistency and the lack of transactions/locking might be problematic.
Links

  Apache Cassandra: http://cassandra.apache.org
  Apache Cassandra Wiki: http://wiki.apache.org/cassandra/FrontPage
  DataStax documentation for Cassandra: http://www.datastax.com/docs/0.7/index
  My Blog: http://blog.mikiobraun.de
  Twimpact: http://beta.twimpact.com
