Cassandra - An Introduction
Mikio L. Braun, Leo Jugel
TU Berlin, twimpact
LinuxTag Berlin, 13 May 2011
@mikiobraun, blog.mikiobraun.de
What is NoSQL?
For many web applications, classical databases are not the right choice:
The database is just used for storing objects.
Consistency is not essential.
There is a lot of concurrent access.
NoSQL in comparison
Classical databases:
Powerful query language
Scale by using larger servers (scaling up)
Changes of database schema very costly
ACID: Atomicity, Consistency, Isolation, Durability
Transactions, locking, etc.
NoSQL:
Very simple query language
Scales through clustering (scaling out)
No fixed database schema
Typically only eventually consistent
Typically no support for transactions etc.
The CAP theorem: of the following three guarantees, a distributed store can provide at most two at the same time.
Consistency: You never get old data.
Availability: Read/write operations are always possible.
Partition Tolerance: The other guarantees hold even if the network between the servers breaks.
Gilbert, Lynch, Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services, ACM SIGACT News, Volume 33, Issue 2, June 2002
LinuxTag Berlin, 13. 5. 2011 (c) 2011 by Mikio L. Braun @mikiobraun, blog.mikiobraun.de
http://cassandra.apache.org
Written in Java
Developed at Facebook for inbox search, released as Open Source in July 2008
Apache Incubator since March 2009
Apache Top-Level project since February 2010
Main Properties:
Structured key value store
Eventually consistent
Fully equivalent nodes
Cluster can be modified without restarting
Support: DataStax (http://datastax.com)
Licence: Apache 2.0
New in version 0.7:
Config file format changed from XML to YAML
Schema modification (ColumnFamilies) without restart
Beginning support for secondary indices
Amazon Dynamo:
Clustering without dedicated master node
Peer-to-peer discovery of nodes, Hinted Handoff, etc.
Google BigTable:
Data model
Requires central master node
Provides much more fine grained control:
Installation
Download tar.gz from http://cassandra.apache.org/download/
Unpack
./conf contains config files
./bin/cassandra -f to start Cassandra, Ctrl-C to stop
Configuration
Database:
Version 0.6.x: conf/storage-conf.xml
Version 0.7.x: conf/cassandra.yaml
JVM Parameters:
Version 0.6.x: bin/cassandra.in.sh
Version 0.7.x: conf/cassandra-env.sh
[Figure: a row with its columns, stored as byte arrays; columns are sorted by name!]
Keyspace MyDatabase:
  ColumnFamily Person:
    1: {id: 1, name: Mikio Braun, affiliation: TU Berlin}
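The layout above can be mimicked with nested dictionaries. This is only a rough sketch: plain Python dicts stand in for the keyspace, the ColumnFamily, and the row, and the helper `columns_in_order` is a made-up name, not part of any Cassandra client. Real Cassandra stores byte arrays and sorts with a configurable comparator; here strings and plain string order are assumed.

```python
# Toy model of the layout: keyspace -> ColumnFamily -> row key -> columns.
# Plain dicts stand in for Cassandra's structures; this is not a client API.
keyspace = {
    "Person": {                      # ColumnFamily
        "1": {                       # row key
            "name": "Mikio Braun",
            "id": "1",
            "affiliation": "TU Berlin",
        },
    },
}


def columns_in_order(row):
    """Return (name, value) pairs sorted by column name."""
    return sorted(row.items())


person = keyspace["Person"]["1"]
for name, value in columns_in_order(person):
    print(name, "=", value)
```

Whatever order the columns were written in, they come back ordered by name, which is the property the index example relies on.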
Example: Index
Object data fields:
  class Page { long id; List<Link> links; }
  class Link { long id; ... int numberOfHits; }
Keyspace MyDatabase:
  ColumnFamily Pages:
    3: {id: 3, ...}
    4: {id: 4, ...}
  ColumnFamily Links:
    1: {id: 1, url: ...}
    17: {id: 17, url: ...}
  ColumnFamily LinksPerPageByNumberOfHits:
    3: {00000132:00000001: t, 000025:00000017: ...}
    4: {00000044:00000024: t, ...}
Here we exploit that columns are sorted by their names.
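The column names in LinksPerPageByNumberOfHits appear to combine a zero-padded hit count with a zero-padded link id. A small sketch of why that works, assuming exactly this naming scheme: the helpers `column_name` and `links_by_hits`, the padding width of 8, and the hit counts are illustrative choices, not taken from the slides.

```python
def column_name(number_of_hits, link_id, width=8):
    """Build a 'hits:link' column name; zero padding makes string order match numeric order."""
    return f"{number_of_hits:0{width}d}:{link_id:0{width}d}"


def links_by_hits(row, descending=True):
    """Return (hits, link_id) pairs from an index row, ordered via the column names alone."""
    names = sorted(row, reverse=descending)
    return [tuple(int(part) for part in name.split(":")) for name in names]


# Index row for one page; the hit counts are made-up example values.
index_row = {
    column_name(132, 1): "t",
    column_name(25, 17): "t",
    column_name(7, 42): "t",
}

print(links_by_hits(index_row))
```

Without the padding, the names would sort as strings, and "25" would come after "132". The column value ("t") carries no information; the name itself is the index entry.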
Usually, you can replace a SuperColumnFamily by several ColumnFamilies. Since SuperColumnFamilies make the implementation and the protocol more complex, some people also advocate removing SuperCFs.
Cassandra's Architecture
[Figure: write operations are appended to the commit log on disk and applied to the MemTable in memory; a flush writes the MemTable to disk as an SSTable; compaction merges SSTables; read operations consult the MemTable and the SSTables.]
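The interplay of commit log, MemTable, SSTables, and compaction can be sketched in a few lines. This is a toy model under simplifying assumptions: the class `ToyStore` and its `memtable_limit` threshold are hypothetical, everything lives in Python lists and dicts, and timestamps, deletes, and commit log cleanup are left out.

```python
class ToyStore:
    """Toy write/read path: commit log and memtable, plus SSTables standing in for disk."""

    def __init__(self, memtable_limit=2):
        self.commit_log = []      # append-only record of every write (never trimmed here)
        self.memtable = {}        # newest values, held in memory
        self.sstables = []        # immutable flushed tables, oldest first
        self.memtable_limit = memtable_limit  # hypothetical flush threshold

    def write(self, key, value):
        self.commit_log.append((key, value))  # log first, for durability
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        """Turn the memtable into a new sorted SSTable and start a fresh memtable."""
        if self.memtable:
            self.sstables.append(dict(sorted(self.memtable.items())))
            self.memtable = {}

    def read(self, key):
        """Check the memtable first, then the SSTables from newest to oldest."""
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):
            if key in table:
                return table[key]
        return None

    def compact(self):
        """Merge all SSTables into one; newer values overwrite older ones."""
        merged = {}
        for table in self.sstables:   # oldest first, so later tables win
            merged.update(table)
        self.sstables = [dict(sorted(merged.items()))] if merged else []


store = ToyStore()
for key, value in [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("d", 5)]:
    store.write(key, value)
```

A write only appends and updates memory, while a read may have to look through several SSTables until compaction merges them.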
Cassandra's API
Thrift-based API
Read operations:
single column
range of columns
range of columns in several rows
column count
several columns from range of rows
get_indexed_slices: range of columns from index
Write operations:
insert: single column
batch_mutate: several columns in several rows
remove: single column
truncate: whole ColumnFamily
Other:
login, describe_*, add/drop column family/keyspace (since 0.7.x)
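To make the granularity of these operations concrete, here is an in-memory stand-in for a single ColumnFamily. It is not the Thrift client: the class `ToyColumnFamily` is hypothetical, and the method signatures are heavily simplified (the real calls take column paths, slice predicates, and a consistency level). Only the operation names insert, batch_mutate, remove, truncate, get, and get_slice are borrowed from this deck.

```python
from bisect import bisect_left, bisect_right


class ToyColumnFamily:
    """In-memory stand-in for one ColumnFamily; signatures are simplified assumptions."""

    def __init__(self):
        self.rows = {}  # row key -> {column name: value}

    def insert(self, key, column, value):
        """Write a single column."""
        self.rows.setdefault(key, {})[column] = value

    def batch_mutate(self, mutations):
        """Write several columns in several rows: {key: {column: value}}."""
        for key, columns in mutations.items():
            for column, value in columns.items():
                self.insert(key, column, value)

    def get(self, key, column):
        """Read a single column."""
        return self.rows[key][column]

    def get_slice(self, key, start, finish):
        """Read the columns with start <= name <= finish, in name order."""
        names = sorted(self.rows.get(key, {}))
        lo, hi = bisect_left(names, start), bisect_right(names, finish)
        return [(name, self.rows[key][name]) for name in names[lo:hi]]

    def remove(self, key, column):
        """Delete a single column."""
        self.rows.get(key, {}).pop(column, None)

    def truncate(self):
        """Drop every row of the ColumnFamily."""
        self.rows.clear()


cf = ToyColumnFamily()
cf.insert("1", "name", "Mikio Braun")
cf.batch_mutate({"1": {"affiliation": "TU Berlin", "id": "1"}, "2": {"id": "2"}})
cf.remove("2", "id")
print(cf.get_slice("1", "a", "j"))
```

The slice covers only the column names between "a" and "j", so it returns affiliation and id but not name.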
Cassandra Clustering
[Figure: cluster of nodes]
Consistency level:
A node has received the operation, even a Hinted Handoff node.
One node has completed the request.
Operation has completed on a majority of nodes / newest result is returned.
QUORUM in local data center
QUORUM in global data center
Wait till all nodes have completed the request.
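How many replicas must answer at each level is simple arithmetic. The sketch below assumes a single data center and uses the level names ONE, QUORUM, and ALL as Cassandra calls them; the function names are made up for illustration. It also checks the usual rule of thumb that a read is guaranteed to see the latest write when the read and write replica counts add up to more than the replication factor.

```python
def quorum(replication_factor):
    """Majority of replicas: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


def required_acks(level, replication_factor):
    """Replicas that must answer before the operation returns (single data center)."""
    acks = {
        "ONE": 1,
        "QUORUM": quorum(replication_factor),
        "ALL": replication_factor,
    }
    return acks[level]


def read_sees_latest_write(write_level, read_level, replication_factor):
    """True if the read and write replica sets must overlap: R + W > RF."""
    w = required_acks(write_level, replication_factor)
    r = required_acks(read_level, replication_factor)
    return r + w > replication_factor


for write_level, read_level in [("ONE", "ONE"), ("QUORUM", "QUORUM"), ("ONE", "ALL")]:
    print(write_level, read_level, read_sees_latest_write(write_level, read_level, 3))
```

With replication factor 3, writing and reading at QUORUM touches 2 + 2 replicas, so at least one replica has seen the write; ONE plus ONE gives no such overlap.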
As long as the requirements of the consistency level can be met, everything is fine.
Hinted Handoff:
A write operation for a faulty node is stored on another node and pushed to the faulty node once it is available again. Until then, the hinted data is not readable!
Read Repair:
After a read operation has completed, data will be compared and updated on all nodes in the background.
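Both repair mechanisms can be illustrated with a small simulation. This is a sketch under strong assumptions: replicas are plain Python objects, every key is replicated to all of them, hints are always kept on the first live replica, and read repair runs inline rather than in the background. All names (`Replica`, `apply`, `write`, `deliver_hints`, `read_with_repair`) are hypothetical.

```python
class Replica:
    """One node's copy of the data, plus hints it holds for other nodes."""

    def __init__(self, name):
        self.name = name
        self.up = True
        self.data = {}   # key -> (timestamp, value)
        self.hints = []  # (target replica, key, value, timestamp)


def apply(replica, key, value, timestamp):
    """Store the value unless the replica already holds a newer one."""
    current = replica.data.get(key)
    if current is None or current[0] < timestamp:
        replica.data[key] = (timestamp, value)


def write(replicas, key, value, timestamp):
    """Write to all live replicas; leave a hint on a live one for each dead one."""
    live = [r for r in replicas if r.up]
    for replica in live:
        apply(replica, key, value, timestamp)
    for replica in replicas:
        if not replica.up and live:
            live[0].hints.append((replica, key, value, timestamp))


def deliver_hints(replicas):
    """Hinted handoff: push stored writes to targets that are available again."""
    for holder in replicas:
        still_waiting = []
        for target, key, value, timestamp in holder.hints:
            if target.up:
                apply(target, key, value, timestamp)
            else:
                still_waiting.append((target, key, value, timestamp))
        holder.hints = still_waiting


def read_with_repair(replicas, key):
    """Return the newest value among live replicas and update stale ones."""
    live = [r for r in replicas if r.up]
    answers = [r.data[key] for r in live if key in r.data]
    if not answers:
        return None
    timestamp, value = max(answers, key=lambda answer: answer[0])
    for replica in live:
        apply(replica, key, value, timestamp)  # inline here, in the background in Cassandra
    return value


a, b, c = Replica("a"), Replica("b"), Replica("c")
nodes = [a, b, c]

c.up = False
write(nodes, "k", "v1", timestamp=1)   # c misses the write; a keeps a hint
c.up = True
deliver_hints(nodes)                   # c catches up through hinted handoff

b.up = False
write(nodes, "k", "v2", timestamp=2)   # b misses this one
b.up = True
latest = read_with_repair(nodes, "k")  # no hint delivery; read repair fixes b
```

In the second half, b comes back before its hint is delivered, and the read alone is enough to bring it up to date.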
Libraries
Python:
Pycassa: http://github.com/pycassa/pycass
Telephus: http://github.com/driftx/Telephus
Java:
Datanucleus JDO: http://github.com/tnine/Datanucleus-Cassandra-Plugin
Hector: http://github.com/rantav/hector
Kundera: http://code.google.com/p/kundera/
Pelops: http://github.com/s7/scale7-pelops
Grails:
grails-cassandra: https://github.com/wolpert/grails-cassandra
.NET:
Aquiles: http://aquiles.codeplex.com/
FluentCassandra: http://github.com/managedfusion/fluentcassandra
Ruby:
Cassandra: http://github.com/fauna/cassandra
PHP:
phpcassa: http://github.com/thobbs/phpcassa
SimpleCassie: http://code.google.com/p/simpletools-php/wiki/SimpleCassie
TWIMPACT: An Application
Real-time analysis of Twitter
Trend analysis based on retweets
Very high data rate (several million tweets per day, about 50 per second)
TWIMPACT: twimpact.jp
TWIMPACT: twimpact.com
Application Profile
Information about tweets, users, and retweets
Text matching for non-API-retweets
Retweet frequency and user impact
Operation profile:
Operation                Fraction  Duration
get_slice (all)          50.1%     1.1ms
get                       6.0%     1.7ms
get_slice (range)         0.1%     0.8ms
batch_mutate (one row)   14.9%     0.9ms
insert                   21.5%     1.1ms
batch_mutate              6.8%     0.8ms
remove                    0.8%     1.2ms
Very stable
Read operations relatively expensive
Multithreading leads to a huge performance increase
Requires quite extensive tuning
Clustering doesn't automatically lead to better performance
Compaction leads to performance decrease of up to 50%
Multithreading leads to much higher throughput
How to achieve multithreading without locking support?
[Figure: throughput plot with series labeled 2, 4, 8, 16, 32, 64]
Cassandra Tuning
Tuning opportunities:
Size of memtables, thresholds for flushes
Size of JVM heap
Frequency and depth of compaction
Where?
MemTableThresholds etc. in conf/cassandra.yaml
JVM parameters in conf/cassandra-env.sh
Overview of JVM GC
[Figure: JVM heap with Young Generation (including Eden, up to a few hundred MB) and Old Generation (CMSInitiatingOccupancyFraction)]
[Figure: heap usage of memtables, indexes, etc., with flush and compaction marked]
Memtables may survive for a very long time (up to several hours) and are therefore placed in the old generation.
GC has to process several dozen GBs.
Heap too small, or GC triggered too late: GC storm
Trade-off: I/O load vs. memory usage
[Figure: compaction and large GC]
Our set-up:
1 node with six-core CPU and RAID 5 with 6 hard disks
Cluster of 4 nodes, each with six-core CPU and RAID 0 with 2 hard disks
Overhead through network communication, consistency levels, etc.
Hard disk performance is significant
Cluster still too small
1 node: 6 * 500 GB = 3 TB, with RAID 5 = 2.5 TB (83%)
4 nodes: 4 * 1 TB = 4 TB, with replication factor 2 = 2 TB (50%)
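The capacity figures follow from two simple rules: RAID 5 gives up one disk's worth of space for parity, and a replication factor of n stores everything n times. A short sketch of that arithmetic, with hypothetical helper names and the disk sizes from the set-up above (500 GB disks, two per node in the cluster):

```python
def raid5_usable(disks, disk_size_tb):
    """RAID 5 spends one disk's worth of space on parity."""
    return (disks - 1) * disk_size_tb


def replicated_usable(nodes, node_size_tb, replication_factor):
    """Every piece of data is stored replication_factor times across the cluster."""
    return nodes * node_size_tb / replication_factor


single = raid5_usable(disks=6, disk_size_tb=0.5)
cluster = replicated_usable(nodes=4, node_size_tb=1.0, replication_factor=2)
print(f"single node: {single} TB usable ({single / 3.0:.0%} of 3 TB raw)")
print(f"4 nodes:     {cluster} TB usable ({cluster / 4.0:.0%} of 4 TB raw)")
```

The four-node cluster has more raw disk space but less usable space than the single RAID 5 machine, which is part of why it counts as still too small.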
Alternatives
MongoDB, CouchDB, redis, even memcached...
Persistence: Disk or RAM?
Replication: Master/Slave or Peer-to-Peer?
Sharding?
Upcoming trend towards more complex query languages (Javascript), map-reduce operations, etc.
Summary: Cassandra
Platform which scales well
Active user and developer community
Read operations quite expensive
For optimal performance, extensive tuning is necessary
Depending on your application, eventual consistency and the lack of transactions/locking might be problematic.
Links
Apache Cassandra: http://cassandra.apache.org
Apache Cassandra Wiki: http://wiki.apache.org/cassandra/FrontPage
DataStax documentation for Cassandra: http://www.datastax.com/docs/0.7/index
My blog: http://blog.mikiobraun.de
Twimpact: http://beta.twimpact.com